Compare commits

257 Commits

Author SHA1 Message Date
hadaringonyama
5408a74bcb merge fix 2024-08-07 18:45:14 +03:00
Yuval Shekel
eb9326a5ae fix wrong cuda-backend branch pull 2024-08-07 16:59:16 +03:00
hadaringonyama
2835d8bd6f name change 2024-08-07 16:59:13 +03:00
Yuval Shekel
0797cb2d44 add error enumeration for license error 2024-08-06 16:41:54 +03:00
Yuval Shekel
908a6c536f move backend-specific-config to open part to avoid installing it 2024-08-01 16:56:34 +03:00
Yuval Shekel
73dd80b63b update ntt api to accept config by const 2024-08-01 16:31:13 +03:00
Yuval Shekel
7b833afdd9 add rust ecntt benchmark 2024-08-01 15:27:42 +03:00
Yuval Shekel
7734676853 add rust ntt benchmark for fields too 2024-08-01 15:27:15 +03:00
Yuval Shekel
0bd5854d0e add rust benchmark for msm 2024-08-01 15:06:09 +03:00
Yuval Shekel
389ce1ce98 fix rust bw761 curve missing ICICLE backend install path and fix examples 2024-08-01 11:14:10 +03:00
Yuval Shekel
e95b5621e4 simplify C++ examples by loading backend from default install dir 2024-07-31 19:50:09 +03:00
Yuval Shekel
cccd6c1679 add ntt benchmark for rust 2024-07-31 19:35:40 +03:00
Yuval Shekel
d08088a4bb add runtime api to load backend from default installdir and use it everywhere instead of specifying the install dir in the code 2024-07-31 19:14:48 +03:00
Yuval Shekel
5332f4c8f8 minor update to rust poly example 2024-07-30 17:28:55 +03:00
Yuval Shekel
a45746fc3b fix rust examples calling load_backend() with removed param 2024-07-30 15:36:58 +03:00
Yuval Shekel
4a4b25445d rename template files to not format them 2024-07-30 12:11:27 +03:00
Yuval Shekel
dc1b2fce6f fix: bug where wrong polynomial factory is used to construct polynomial from wrong field 2024-07-30 12:03:57 +03:00
Yuval Shekel
65ca51ca5e update Rust examples to support installed backend 2024-07-28 22:10:00 +03:00
Yuval Shekel
f8661ac0ef update rust examples msm and ntt to support installed backend dir 2024-07-28 20:14:04 +03:00
Yuval Shekel
5d821d3db1 refactor C++ examples to work with install-dir, or build from source and choose device 2024-07-28 19:32:01 +03:00
Yuval Shekel
131e22d3c6 fix example C++ msm, missing G2 flag 2024-07-28 18:32:33 +03:00
Yuval Shekel
10963fbe0c refactor rust example msm 2024-07-28 17:08:12 +03:00
Yuval Shekel
2a3dcd776a fix format 2024-07-28 15:36:30 +03:00
Yuval Shekel
8d03fb6bc8 trigger examples build in ci 2024-07-28 15:33:00 +03:00
Yuval Shekel
445eec88af refactor C++ example risc0 2024-07-28 15:30:48 +03:00
Yuval Shekel
a5a8bf8e3d refactor C++ example poly-api 2024-07-28 15:20:23 +03:00
Yuval Shekel
68802bc14a refactor C++ example polynomial-multiplication for V3 2024-07-28 14:49:02 +03:00
Yuval Shekel
63dcd4ef5d remove multiply c++ example and fixed some readme 2024-07-28 13:01:01 +03:00
Yuval Shekel
c302aebeca refactor pedersen-commitment example to V3 2024-07-28 12:55:12 +03:00
Yuval Shekel
b07d02096f update ntt example for V3 2024-07-28 12:38:50 +03:00
Yuval Shekel
454fe5b922 update ntt-best-practice and msm examples to V3 2024-07-25 18:32:31 +03:00
Yuval Shekel
99e58532ad update readme of C++ msm example 2024-07-25 15:41:58 +03:00
Yuval Shekel
9618e6db74 split backend registration part to backend headers to avoid dep on concrete fields/curves 2024-07-24 20:34:05 +03:00
Yuval Shekel
fc2ac41710 refactor msm example 2024-07-24 20:27:26 +03:00
Yuval Shekel
49beb324a8 add script for generating C apis 2024-07-24 20:12:42 +03:00
Yuval Shekel
b52191f9dd field generation update from V2 2024-07-24 17:08:37 +03:00
Yuval Shekel
05d918fcda let PR choose the cuda-backend branch to use 2024-07-21 17:37:44 +03:00
Yuval Shekel
bdf3cd5672 add info logs to NTT test to find the issue 2024-07-21 11:36:44 +03:00
Yuval Shekel
58780f0115 fix compilation issues for mac 2024-07-18 18:40:01 +03:00
Yuval Shekel
4f47c9d4f3 remove redundant link to stdc++ in build.rs scripts 2024-07-18 18:11:11 +03:00
Yuval Shekel
1baabd93d6 add feature to rust crates to choose local/remote cuda backend 2024-07-18 18:03:42 +03:00
Yuval Shekel
a8e794fdfd use ssh-key in ci workflows to pull cuda backend 2024-07-18 17:27:49 +03:00
Yuval Shekel
934f934976 pull cuda backend when building in ci 2024-07-18 14:25:51 +03:00
Yuval Shekel
cd3353362a move cuda backend to private repo and clone it in ci 2024-07-18 12:41:27 +03:00
Yuval Shekel
afe48317a6 skip fast twiddles for domains smaller than 16 2024-07-18 12:02:22 +03:00
Yuval Shekel
7de94fff84 fix bug in cuda MSM when scalars are in montgomery form 2024-07-17 19:11:47 +03:00
Yuval Shekel
d3b27a7b68 randomize device for C++ polynomial tests 2024-07-17 19:11:47 +03:00
Yuval Shekel
0409c6f5d7 replace msm config bases_size with boolean flag indicating bases are shared for batch elements 2024-07-17 19:11:47 +03:00
Yuval Shekel
9c35aaa5c3 add doxygen comments to frontend APIs 2024-07-17 19:11:47 +03:00
Yuval Shekel
db19c3846d split default poly backend from cuda and support CPU too 2024-07-17 19:11:47 +03:00
Yuval Shekel
06243e5493 add polynomial evaluation api and cuda backend. Use it in poly backend 2024-07-17 19:11:47 +03:00
Yuval Shekel
8ade7ef393 use vector_add in polynomial backend to add monomial inplace 2024-07-17 19:11:47 +03:00
Yuval Shekel
749d3f9b6e add vec_op api for find highest non zero idx of vector, and CUDA backend 2024-07-17 19:11:47 +03:00
Yuval Shekel
0885195619 add vec_ops apis for scalar+-vector and use in polynomial backend 2024-07-17 19:11:47 +03:00
Yuval Shekel
6f8c480fd1 use vector_div in polynomial backend 2024-07-17 19:11:47 +03:00
Yuval Shekel
f13f24cff2 add vec_div api 2024-07-17 19:11:47 +03:00
Yuval Shekel
028f59cef0 use vec_mul in polynomial api instead of another kernel 2024-07-17 19:11:47 +03:00
Yuval Shekel
61af965b41 add api for mul scalar and use in poly backend 2024-07-17 19:11:47 +03:00
Yuval Shekel
091a19354e add slice api to vec ops 2024-07-17 19:11:46 +03:00
Yuval Shekel
0c8a849d7c move default polynomial context from cuda backend to icicle FE 2024-07-17 19:11:46 +03:00
Yuval Shekel
87a25afee7 use icicle runtime apis in polynomial implementation instead of cuda apis 2024-07-17 19:11:46 +03:00
Yuval Shekel
4d0b1a03a3 removed device_id from polynomial view 2024-07-17 19:11:46 +03:00
Yuval Shekel
de88ad144e minor: CPU_REF reuse device_api class like CPU 2024-07-17 19:11:46 +03:00
Yuval Shekel
9b46b8489f api for memset and tests for copy() and memset() 2024-07-17 19:11:46 +03:00
Yuval Shekel
caf3b4fab2 add generic copy and copy_async with auto device inference 2024-07-17 19:11:46 +03:00
Yuval Shekel
e49540de20 polynomial refactor to be multi-device 2024-07-17 19:11:46 +03:00
Yuval Shekel
74f7a3cbcd move tracking logic to runtime.cpp rather than device API 2024-07-17 19:11:46 +03:00
Shanie Winitz
36c879912a NTT simple cpu backend (#533) 2024-07-17 19:11:46 +03:00
Yuval Shekel
ca7fb84cf0 switch device when releasing memory of inactive device 2024-07-17 19:11:46 +03:00
Yuval Shekel
76cf23d02a ecntt bug fix from V2 2024-07-17 19:11:46 +03:00
Yuval Shekel
5a7c0ccd22 print domain size when ntt fails on that. same domain size for ntt and ecntt to avoid issues 2024-07-17 19:11:46 +03:00
Yuval Shekel
ec596c3372 rust polynomials 2024-07-17 19:11:46 +03:00
Yuval Shekel
be7cbbac89 polynomial tests fix regarding msm 2024-07-17 19:11:46 +03:00
Yuval Shekel
ef85fccd08 ecntt bug fix from v2 2024-07-17 19:11:46 +03:00
Yuval Shekel
40693a0e3e polynomial API V3 2024-07-17 19:11:46 +03:00
Yuval Shekel
6b095e6e29 track memory allocations via MemoryTracker and query pointer to device runtime APIs 2024-07-17 19:11:46 +03:00
Yuval Shekel
b8e9f90e87 fix field tests compilation and find nvcc via cmake 2024-07-17 19:11:46 +03:00
Yuval Shekel
8a6cf0258b msm test multi-device bug fix: allocate stream on correct device 2024-07-17 19:11:46 +03:00
Yuval Shekel
3e50854b61 field and curve host arithmetic tests 2024-07-17 19:11:46 +03:00
Yuval Shekel
fe28f8f160 reintroduced ntt tests against risc0 and lambdaworks 2024-07-17 19:11:46 +03:00
Yuval Shekel
1146f5620a rust ci merge build and test for faster ci 2024-07-17 19:11:46 +03:00
Yuval Shekel
401b5b8b33 bit reverse in rust 2024-07-17 19:11:46 +03:00
Yuval Shekel
48269e3a8e bit reverse api and test, including cpu and cuda backends 2024-07-17 19:11:46 +03:00
Yuval Shekel
f23051967d missing MSM cuda test 2024-07-17 19:11:46 +03:00
Yuval Shekel
7ed3cc71ce matrix ops consolidated to vec ops, some TODO fixed 2024-07-17 19:11:46 +03:00
Yuval Shekel
763b736c46 msm for cuda backend 2024-07-17 19:11:46 +03:00
Yuval Shekel
f2236562f2 rust msm 2024-07-17 19:11:45 +03:00
Yuval Shekel
45c0e3e4b9 rust bls12-377 build.rs fix 2024-07-17 19:11:45 +03:00
Yuval Shekel
2da454aa84 rust grumpkin curve 2024-07-17 19:11:45 +03:00
Yuval Shekel
0cdfb84a65 bw6-761 rust crate 2024-07-17 19:11:45 +03:00
Yuval Shekel
6578125a88 rust bls curves 2024-07-17 19:11:45 +03:00
Yuval Shekel
e14be6fc5b rust ecntt 2024-07-17 19:11:45 +03:00
Yuval Shekel
d9248b7110 bn254 curve crate 2024-07-17 19:11:45 +03:00
Yuval Shekel
86b757a6ce stark252 rust crate 2024-07-17 19:11:45 +03:00
Yuval Shekel
6627b7b5f5 rust missing ntt tests 2024-07-17 19:11:45 +03:00
Yuval Shekel
3072f6ce21 test utilities to facilitate loading backends and choosing devices atomically for all test suites 2024-07-17 19:11:45 +03:00
Yuval Shekel
29c83c7453 replaced runtime_errors with icicle macro for throwing exceptions 2024-07-17 19:11:45 +03:00
Yuval Shekel
de650b8784 rust v3 ntt 2024-07-17 19:11:45 +03:00
Yuval Shekel
e67ac8608f rust matrix transpose 2024-07-17 19:11:45 +03:00
Yuval Shekel
df290c07a4 rust v3 vec ops, api to list registered devices and config-extension clone 2024-07-17 19:11:45 +03:00
Yuval Shekel
ffc68c25e0 wrap() errors and install icicle into the deps cargo build dir 2024-07-17 19:11:45 +03:00
Yuval Shekel
59ff2d9a11 refactored rust vec_ops config to avoid taking ownership of stream but work with ffi 2024-07-17 19:11:45 +03:00
Yuval Shekel
ca4281dafc refactored device class to own the type string 2024-07-17 19:11:45 +03:00
Yuval Shekel
a1c0c8eed8 rust icicle-core vec_ops config and default device API 2024-07-17 19:11:45 +03:00
Yuval Shekel
d8115c0404 ConfigExtension rust wrapper 2024-07-17 19:11:45 +03:00
Yuval Shekel
754f7bc2ae ConfigExtension is now pointer in config for easier bindings 2024-07-17 19:11:45 +03:00
Yuval Shekel
5c85599ac2 rust v3 babybear crate 2024-07-17 19:11:45 +03:00
Yuval Shekel
15abeb93c2 rust v3 icicle-core crate 2024-07-17 19:11:45 +03:00
Yuval Shekel
a86ebb47ec simple v3 rust example 2024-07-17 19:11:45 +03:00
Yuval Shekel
25d53e86f4 small cmake refactor, and install target 2024-07-17 19:11:45 +03:00
Yuval Shekel
c5b75f7868 minor rust ci fix 2024-07-17 19:11:45 +03:00
Yuval Shekel
42f774abbd fix identification of changed files 2024-07-17 19:11:45 +03:00
Yuval Shekel
ff6306c7a7 rust ci 2024-07-17 19:11:45 +03:00
Yuval Shekel
acb49eb598 rust runtime crate async ops 2024-07-17 19:11:45 +03:00
Yuval Shekel
9ae8d44cf1 rust runtime crate streams 2024-07-17 19:11:44 +03:00
Yuval Shekel
83cdf4145c rust runtime crate copy to/from device test 2024-07-17 19:11:44 +03:00
Yuval Shekel
3b92f60372 rust runtime crate progress 2024-07-17 19:11:44 +03:00
Yuval Shekel
da007d112e mac compilation fix and api to query device availability 2024-07-17 19:11:44 +03:00
Yuval Shekel
3309044a8c rust runtime crate 2024-07-17 19:11:44 +03:00
Yuval Shekel
baae668edf fixed cuda ntt headers 2024-07-17 19:11:44 +03:00
Yuval Shekel
59843ee2b1 no ECNTT for grumpkin 2024-07-17 19:11:44 +03:00
Yuval Shekel
77eb89fc2e get_device_properties() added to device api 2024-07-17 19:11:44 +03:00
Yuval Shekel
6fe6916050 CUDA ecntt 2024-07-17 19:11:44 +03:00
Yuval Shekel
091c9ba616 montgomery conversion for cuda 2024-07-17 19:11:44 +03:00
Yuval Shekel
41962b6d23 icicle curve links to icicle_field 2024-07-17 19:11:44 +03:00
Yuval Shekel
42490afdad fix order of loaded libs 2024-07-17 19:11:44 +03:00
Yuval Shekel
aea2dba1f0 CUDA ntt 2024-07-17 19:11:44 +03:00
Yuval Shekel
0e4af8f3ca montgomery conversion CUDA 2024-07-17 19:11:44 +03:00
Yuval Shekel
039065082a transpose and montgomery CUDA apis 2024-07-17 19:11:44 +03:00
Yuval Shekel
c94bcad4f6 CUDA vector ops 2024-07-17 19:11:44 +03:00
Yuval Shekel
6671c3b391 define options in main cmake 2024-07-17 19:11:44 +03:00
Yuval Shekel
4e7b8869da ecntt api 2024-07-17 19:11:44 +03:00
Yuval Shekel
5b5fbfb9ad avoid template specialization for curves where g2_affine is same type as affine 2024-07-17 19:11:44 +03:00
Yuval Shekel
c04a3ab182 github workflow bug fix 2024-07-17 19:11:44 +03:00
Yuval Shekel
e2237ec34d montgomery conversion G2 2024-07-17 19:11:44 +03:00
Yuval Shekel
4318aa1072 MSM G2 2024-07-17 19:11:44 +03:00
Yuval Shekel
729d1c0fd9 montgomery conversion G1 2024-07-17 19:11:44 +03:00
Yuval Shekel
632fb91a4b missing G2 ifdef 2024-07-17 19:11:44 +03:00
Yuval Shekel
bd436f2c00 missing matrix transpose for ext field and moved generate scalars api 2024-07-17 19:11:44 +03:00
Yuval Shekel
954b0e1891 ffi extern curve methods 2024-07-17 19:11:44 +03:00
Yuval Shekel
212068c196 icicle_device links to dl 2024-07-17 19:11:44 +03:00
Yuval Shekel
3c028bf44a msm precompute bases API 2024-07-17 19:11:44 +03:00
Yuval Shekel
e524db39d8 introduce log mechanism 2024-07-17 19:11:43 +03:00
Yuval Shekel
d3d296699d missing ext in msm config 2024-07-17 19:11:43 +03:00
Yuval Shekel
b43518138c fix symbol redefinition when registering an API for multiple backends in same unit 2024-07-17 19:11:43 +03:00
Yuval Shekel
53d73d3e69 disable NTT for unsupported fields 2024-07-17 19:11:43 +03:00
Yuval Shekel
d55dc491d0 curve tests should only be built for curves, not field 2024-07-17 19:11:43 +03:00
Yuval Shekel
87f725da9f msm API now returns projective type 2024-07-17 19:11:43 +03:00
Yuval Shekel
4faace4d44 small refactor to curve headers to avoid linkage errors with external libs 2024-07-17 19:11:43 +03:00
Yuval Shekel
aab5d0d6b5 macro for unique identifiers 2024-07-17 19:11:43 +03:00
Yuval Shekel
4f8fe8387f format files 2024-07-17 19:11:43 +03:00
Yuval Shekel
e5b12bbd50 fix mac linkage issue due to missing linkage to icicle_device 2024-07-17 19:11:43 +03:00
Yuval Shekel
0e0470d20a api to load backend libraries 2024-07-17 19:11:43 +03:00
Yuval Shekel
87323a68d5 msm api with issue for result type 2024-07-17 19:11:43 +03:00
Yuval Shekel
874ebf569b icicle and backends are now shared libs 2024-07-17 19:11:43 +03:00
Yuval Shekel
6d0b326f0b curve lib and test 2024-07-17 19:11:43 +03:00
Yuval Shekel
fe12616c5e extension field vector ops and ntt APIs 2024-07-17 19:11:43 +03:00
Yuval Shekel
eb5a6f95d8 removed field specific APIs from shared headers. Will move the field specific headers 2024-07-17 19:11:43 +03:00
Yuval Shekel
2b4a4c22e7 vec ops size i32->u64 2024-07-17 19:11:43 +03:00
Yuval Shekel
c2267714ad scalar_convert_montgomery() API 2024-07-17 19:11:43 +03:00
Yuval Shekel
c1e2a55919 generate_scalars() API 2024-07-17 19:11:43 +03:00
Yuval Shekel
c4c7744648 ntt init/release domain for cpu backend and test 2024-07-17 19:11:43 +03:00
Yuval Shekel
ba2a03fbc1 ntt release domain API 2024-07-17 19:11:43 +03:00
Yuval Shekel
c07eb08587 ntt init domain API 2024-07-17 19:11:43 +03:00
Yuval Shekel
eda61b2e28 matrix transpose 2024-07-17 19:11:43 +03:00
Yuval Shekel
e40c661dee removed extern C from backend APIs to avoid linkage issues 2024-07-17 19:11:43 +03:00
Yuval Shekel
42b659b2c6 removed redundant includes 2024-07-17 19:11:43 +03:00
Yuval Shekel
b1abf09896 refactored apis to be template with specialization 2024-07-17 19:11:43 +03:00
Yuval Shekel
0216068892 removed include/icicle from incdir to force include with icicle prefix 2024-07-17 19:11:43 +03:00
Yuval Shekel
7068723fad config extension 2024-07-17 19:11:43 +03:00
Yuval Shekel
4fcd5eb35c libs are built with names based on the field/curve 2024-07-17 19:11:43 +03:00
Yuval Shekel
608880db96 generic dispatcher class 2024-07-17 19:11:43 +03:00
Yuval Shekel
3ae2799cde add vec-sub and vec-mul apis 2024-07-17 19:11:42 +03:00
Yuval Shekel
0a48a5d49a refactored to snake case 2024-07-17 19:11:42 +03:00
Yuval Shekel
37e4fafad9 cmake update to force load on Clang and AppleClang too 2024-07-17 19:11:42 +03:00
Yuval Shekel
ade9651ba0 ntt and empty CPU backend 2024-07-17 19:11:42 +03:00
Yuval Shekel
0dcece404e CPU vec ops is template 2024-07-17 19:11:42 +03:00
Yuval Shekel
8b8d778294 use CONCAT_EXPAND() macro to generate symbols per field/curve 2024-07-17 19:11:42 +03:00
Yuval Shekel
ff1017e38b renamed icicle_fe to icicle_v3 2024-07-17 19:11:42 +03:00
Yuval Shekel
513b45f5cb ci compilation fix 2024-07-17 19:11:42 +03:00
Yuval Shekel
b50d6aa3cb spelling 2024-07-17 19:11:42 +03:00
Yuval Shekel
e4e1b3f8d7 refactored runtime APIs to use thread local device 2024-07-17 19:11:42 +03:00
Yuval Shekel
8d9c2b1014 cuda backend for vector add + test CPU vs CUDA 2024-07-17 19:11:42 +03:00
Yuval Shekel
d15edcb60e add config struct to vec ops 2024-07-17 19:11:42 +03:00
Yuval Shekel
fdec847061 vectorAdd operates on Field type 2024-07-17 19:11:42 +03:00
Yuval Shekel
7ade66b977 vectorAdd api for int type 2024-07-17 19:11:42 +03:00
Yuval Shekel
0af022ad49 verify valid device 2024-07-17 19:11:42 +03:00
Yuval Shekel
d5aac95ff9 CUDA backend device 2024-07-17 19:11:42 +03:00
Yuval Shekel
5bbdb0f501 device API tests: async, error, invalid device 2024-07-17 19:11:42 +03:00
Yuval Shekel
755a44d7e8 CPU backend implemented with basic test 2024-07-17 19:11:42 +03:00
Yuval Shekel
b3fad0fff3 mock cpu backend for device API 2024-07-17 19:11:42 +03:00
Yuval Shekel
d7f86e6942 errors and device_api interface 2024-07-17 19:11:42 +03:00
Yuval Shekel
5410d072ee basic cmake scripts and test 2024-07-17 19:11:42 +03:00
Yuval Shekel
4fa154dceb enable CI for V3 2024-07-17 19:11:42 +03:00
release-bot
aacec3f72f Bump rust crates' version
icicle-babybear@2.8.0
icicle-bls12-377@2.8.0
icicle-bls12-381@2.8.0
icicle-bn254@2.8.0
icicle-bw6-761@2.8.0
icicle-core@2.8.0
icicle-cuda-runtime@2.8.0
icicle-grumpkin@2.8.0
icicle-hash@2.8.0
icicle-m31@2.8.0
icicle-stark252@2.8.0

Generated by cargo-workspaces
2024-07-16 13:57:56 +00:00
ChickenLover
a8fa05d0e3 Feat/roman/hash docs (#556)
## Describe the changes

This PR...

## Linked Issues

Resolves #

---------

Co-authored-by: Jeremy Felder <jeremy.felder1@gmail.com>
2024-07-16 16:39:35 +03:00
ChickenLover
ea71faf1fa add keccak tree builder (#555) 2024-07-15 15:31:12 +07:00
ChickenLover
7fd9ed1b49 Feat/roman/tree builder (#525)
# Updates:

## Hashing

 - Added SpongeHasher class
 - Can be used to accept any hash function as an argument
 - Absorb and squeeze are now separated
- Memory management is now mostly done by SpongeHasher class, each hash
function only describes permutation kernels

## Tree builder

 - Tree builder is now hash-agnostic. 
 - Tree builder now supports 2D input (matrices)
- Tree builder can now use two different hash functions for layer 0 and
compression layers

## Poseidon1

 - Interface changed to classes
 - Now allows for any alpha
 - Now allows passing constants not in a single vector
 - Now allows for any domain tag
 - Constants are now released upon going out of scope
 - Rust wrappers changed to Poseidon struct
 
## Poseidon2
 
 - Interface changed to classes
 - Constants are now released upon going out of scope
 - Rust wrappers changed to Poseidon2 struct
 
## Keccak

 - Added Keccak class which inherits SpongeHasher
 - Now doesn't use gpu registers for storing states
 
 To do:
- [x] Update poseidon1 golang bindings
- [x] Update poseidon1 examples
- [x] Fix poseidon2 cuda test
- [x] Fix poseidon2 merkle tree builder test
- [x] Update keccak class with new design
- [x] Update keccak test
- [x] Check keccak correctness
- [x] Update tree builder rust wrappers
- [x] Leave doc comments

Future work:  
- [ ] Add keccak merkle tree builder externs
- [ ] Add keccak rust tree builder wrappers
- [ ] Write docs
- [ ] Add example
- [ ] Fix device output for tree builder

---------

Co-authored-by: Jeremy Felder <jeremy.felder1@gmail.com>
Co-authored-by: nonam3e <71525212+nonam3e@users.noreply.github.com>
2024-07-11 13:46:25 +07:00
DmytroTym
2d4059c61f Field creation automated through macros (#551)
Params files for fields now only require modulus specified by the user
(also twiddle generator and/or non-residue in case either or both are
needed). Everything else gets generated by a macro.
2024-07-08 10:39:50 +03:00
release-bot
73cd4c0a99 Bump rust crates' version
icicle-babybear@2.7.1
icicle-bls12-377@2.7.1
icicle-bls12-381@2.7.1
icicle-bn254@2.7.1
icicle-bw6-761@2.7.1
icicle-core@2.7.1
icicle-cuda-runtime@2.7.1
icicle-grumpkin@2.7.1
icicle-hash@2.7.1
icicle-m31@2.7.1
icicle-stark252@2.7.1

Generated by cargo-workspaces
2024-07-04 12:34:26 +00:00
yshekel
5516320ad7 fix large (>512 elements) ecntt issue (#553)
This PR solves an issue for large ecntt where cuda blocks are too large
and cannot be assigned to SMs. The fix is to reduce thread count per
block and increase block count in that case.
2024-07-04 15:33:49 +03:00
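
The launch-configuration change described in this commit can be sketched roughly as follows. This is a minimal illustration only: the function name, the `max_block_size` cap of 256, and the way the cap is applied are assumptions, not the repository's actual kernel launch code.

```python
def launch_config(num_items, requested_block_size, max_block_size=256):
    """Return (num_blocks, threads_per_block) covering num_items work items.

    Hypothetical helper: max_block_size stands in for whatever per-block
    limit lets a block be scheduled on an SM; the real limit depends on
    the kernel's register and shared-memory usage.
    """
    if num_items <= 0:
        raise ValueError("num_items must be positive")
    # Reduce the thread count per block when the request is too large.
    threads = min(requested_block_size, max_block_size, num_items)
    # Increase the block count so every item is still covered (ceil division).
    blocks = (num_items + threads - 1) // threads
    return blocks, threads
```

With these assumed values, a request for one block of 1024 threads becomes four blocks of 256 threads, so the same amount of work is covered by smaller blocks.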
Vlad
a4b1eb3de9 Fix affine to projective zero point bug (#552)
## Describe the changes

This PR fixes the affine-to-projective functions in the bindings by adding a
condition: if the point in affine form is zero, return the projective zero.

---------

Co-authored-by: Jeremy Felder <jeremy.felder1@gmail.com>
2024-07-04 09:31:59 +03:00
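
A minimal sketch of the zero-point condition this commit describes. It assumes, purely for illustration, that the affine zero is encoded as `(0, 0)` and the projective zero as `(0, 1, 0)`; the function name and the tuple representation are hypothetical and do not come from the bindings.

```python
def affine_to_projective(point):
    """Convert an affine point (x, y) to projective coordinates (x, y, z)."""
    x, y = point
    # Assumed encoding: (0, 0) is the affine zero point. Without this
    # check the result would be (0, 0, 1), which is not the projective zero.
    if x == 0 and y == 0:
        return (0, 1, 0)
    # Any other affine point embeds with z = 1.
    return (x, y, 1)
```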
release-bot
31083463be Bump rust crates' version
icicle-babybear@2.7.0
icicle-bls12-377@2.7.0
icicle-bls12-381@2.7.0
icicle-bn254@2.7.0
icicle-bw6-761@2.7.0
icicle-core@2.7.0
icicle-cuda-runtime@2.7.0
icicle-grumpkin@2.7.0
icicle-hash@2.7.0
icicle-m31@2.7.0
icicle-stark252@2.7.0

Generated by cargo-workspaces
2024-07-03 19:06:35 +00:00
nonam3e
b908053c0c Feat/m31 (#547)
This PR adds support for the m31 field

---------

Co-authored-by: Jeremy Felder <jeremy.felder1@gmail.com>
2024-07-03 20:48:28 +07:00
Stas
29da36d7be RISC0 example using Polynomial API (#548)
## New Example

This new C++ example shows the basics of the RISC0 protocol using our
Polynomial API
2024-07-02 08:00:03 -06:00
HadarIngonyama
4fef542346 MSM - fixed bug in reduction phase (#549)
This PR fixes a bug in the iterative reduction algorithm.
There were unsynchronized threads reading and writing to the same
addresses that caused MSM to fail a small percentage of the time - this is fixed now.
2024-06-30 12:05:55 +03:00
release-bot
f812f071fa Bump rust crates' version
icicle-babybear@2.6.0
icicle-bls12-377@2.6.0
icicle-bls12-381@2.6.0
icicle-bn254@2.6.0
icicle-bw6-761@2.6.0
icicle-core@2.6.0
icicle-cuda-runtime@2.6.0
icicle-grumpkin@2.6.0
icicle-hash@2.6.0
icicle-stark252@2.6.0

Generated by cargo-workspaces
2024-06-24 11:56:28 +00:00
Jeremy Felder
2b07513310 [FEAT]: Golang Bindings for pinned host memory (#519)
## Describe the changes

This PR adds the capability to pin host memory in golang bindings
allowing data transfers to be quicker. Memory can be pinned once for
multiple devices by passing the flag
`cuda_runtime.CudaHostRegisterPortable` or
`cuda_runtime.CudaHostAllocPortable` depending on how pinned memory is
called
2024-06-24 14:03:44 +03:00
HadarIngonyama
7831f7bd0f Msm/update docs (#545)
Updates MSM documentation

---------

Co-authored-by: Jeremy Felder <jeremy.felder1@gmail.com>
Co-authored-by: Leon Hibnik <107353745+LeonHibnik@users.noreply.github.com>
2024-06-19 11:38:24 +03:00
Otsar
de25b6e203 Added v2 paper (#544) 2024-06-18 15:19:49 +03:00
Otsar
69383e6c73 Update docusaurus.config.js
bold, added emoji
2024-06-18 15:04:26 +03:00
Otsar
c305aade5d Update overview.md 2024-06-18 15:00:24 +03:00
Otsar
87bdf04a19 Update docusaurus.config.js 2024-06-18 13:05:14 +03:00
Otsar
e152977843 Update overview.md
Added v2 paper
2024-06-18 12:23:03 +03:00
release-bot
3d01c09c82 Bump rust crates' version
icicle-babybear@2.5.0
icicle-bls12-377@2.5.0
icicle-bls12-381@2.5.0
icicle-bn254@2.5.0
icicle-bw6-761@2.5.0
icicle-core@2.5.0
icicle-cuda-runtime@2.5.0
icicle-grumpkin@2.5.0
icicle-hash@2.5.0
icicle-stark252@2.5.0

Generated by cargo-workspaces
2024-06-17 13:17:24 +00:00
HadarIngonyama
8936d9c800 MSM - supporting all window sizes (#534)
This PR enables using MSM with any value of c.

Note: the default c isn't necessarily optimal; the user is expected to
choose c and the precomputation factor that give the best results for
the relevant case.

---------

Co-authored-by: Jeremy Felder <jeremy.felder1@gmail.com>
2024-06-17 15:57:24 +03:00
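
For context on what choosing c affects, here is a sketch based on the textbook bucket (Pippenger) method, where c is the window width in bits. It is not taken from the MSM implementation in this repository, and the function name is hypothetical.

```python
def msm_window_params(scalar_bits, c):
    """Return (num_windows, buckets_per_window) for window width c.

    Textbook bucket-method arithmetic, assumed here for illustration.
    """
    if scalar_bits <= 0 or c <= 0:
        raise ValueError("scalar_bits and c must be positive")
    # Each scalar is split into ceil(scalar_bits / c) windows of c bits.
    num_windows = -(-scalar_bits // c)
    # Each window needs one bucket per non-zero c-bit digit.
    buckets_per_window = (1 << c) - 1
    return num_windows, buckets_per_window
```

A larger c means fewer windows but exponentially more buckets per window, which is why the best value depends on the input size and the precomputation factor.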
Jeremy Felder
af9ec76506 Fix link and correct path for running test deploy workflow (#542)
## Describe the changes

Fixes a link issue in docs preventing deployment
2024-06-17 15:44:15 +03:00
Otsar
cdd99d2a46 recreated images for poseidon.md (#541)
Fixed 3 images that were shown in low quality. I have recreated the 3 images;
please check that I have not made a mistake.
2024-06-17 12:16:26 +03:00
Jeremy Felder
3e551762c0 Updated alt text for images and fixed broken link 2024-06-16 18:35:42 +03:00
Otsar
37c22e81e7 Update poseidon.md
fixed - added arrows
2024-06-16 15:01:12 +03:00
Otsar
69e73ffa3e Update poseidon.md
Fixed image quality
2024-06-16 11:42:46 +03:00
cangqiaoyuzhuo
512e1ca372 chore: remove repeat word (#540)
## Describe the changes

remove repeat word

## Linked Issues

Resolves #

Signed-off-by: cangqiaoyuzhuo <850072022@qq.com>
2024-06-13 11:53:22 +03:00
VitaliiH
e19a869691 accumulate stwo (#535)
adds in-place vector addition and api as accumulate
2024-06-10 12:24:58 +02:00
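
In-place vector addition, as described in this commit, can be sketched as below. The function signature and the explicit `modulus` parameter are assumptions made for illustration; the actual API operates on field elements and may take a configuration object.

```python
def accumulate(acc, other, modulus):
    """Add other into acc element-wise, in place, modulo modulus."""
    if len(acc) != len(other):
        raise ValueError("vectors must have the same length")
    for i, value in enumerate(other):
        # The first operand doubles as the output, so no extra buffer
        # is allocated for the result.
        acc[i] = (acc[i] + value) % modulus
```

The caller's first vector is overwritten with the sum, which is what distinguishes this from an out-of-place vector add.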
yshekel
9c55d888ae workflow curve fix (#536) 2024-06-09 11:18:23 +03:00
release-bot
18f51de56c Bump rust crates' version
icicle-babybear@2.4.0
icicle-bls12-377@2.4.0
icicle-bls12-381@2.4.0
icicle-bn254@2.4.0
icicle-bw6-761@2.4.0
icicle-core@2.4.0
icicle-cuda-runtime@2.4.0
icicle-grumpkin@2.4.0
icicle-hash@2.4.0
icicle-stark252@2.4.0

Generated by cargo-workspaces
2024-06-06 14:42:36 +00:00
yshekel
33b1f3c794 perf: projective scalar multiplication use dbl() rather than + (#530) 2024-06-05 20:35:21 +03:00
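
A generic double-and-add sketch of the idea in this commit title: the doubling step calls a dedicated `dbl` operation instead of the general addition `add(p, p)`. The callable-based interface is hypothetical and is used only so that the example runs without a curve implementation.

```python
def scalar_mul(k, point, add, dbl, zero):
    """Compute k * point with double-and-add, using dbl for doubling."""
    if k < 0:
        raise ValueError("k must be non-negative")
    result = zero
    base = point
    while k > 0:
        if k & 1:
            # Accumulate the current power-of-two multiple of the point.
            result = add(result, base)
        # Dedicated doubling rather than add(base, base).
        base = dbl(base)
        k >>= 1
    return result
```

For example, with plain integers standing in for group elements, `scalar_mul(13, 5, lambda a, b: a + b, lambda a: 2 * a, 0)` evaluates to 65.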
Karthik Inbasekar
3a276ef23c added example cpp: example_commit_with_device_memory_view() (#532)
## Describe the changes

This PR adds an example of a simple commit that makes use of polynomial views.
Output attached:

```
Example: a) commit with Polynomial views [(f1+f2)^2 + (f1-f2)^2 ]_1 = [4 (f1^2+ f_2^2)]_1
Example: b) commit with Polynomial views [(f1+f2)^2 - (f1-f2)^2 ]_1 = [4 f1 *f_2]_1
Setup: Generating mock SRS
Setup: SRS of length 1025 generated and loaded to device. Took: 19557 milliseconds
Setup: Generating polys (on device) f1,f2 of log degree 10
Setup: Gen poly done. Took: 7 milliseconds
Computing constraints..start 
Computing constraints..done. Took: 0 milliseconds
Computing Commitments with poly view
Commitments done. Took: 29 milliseconds
commitment [(f1+f2)^2 + (f1-f2)^2]_1:
[x: 0x1e35d81da10e5026dacdd907d6ed0dde673de449ff8c0137ec6acbfd6b1dfe1b, y: 0x21fc051415af35a781f84ebcf999313d489ae38ebefa561c9de2fb0b11091502]
commitment [[2 (f_1^2+f_2^2]_1:
[x: 0x1e35d81da10e5026dacdd907d6ed0dde673de449ff8c0137ec6acbfd6b1dfe1b, y: 0x21fc051415af35a781f84ebcf999313d489ae38ebefa561c9de2fb0b11091502]
commitment [(f1+f2)^2 - (f1-f2)^2]_1:
[x: 0x21e9dc012aef8d95107fbfe63f455d4345b9b21e37bcb0a49043b1066e211ffa, y: 0x2d6a3b2f1be1042a17c58ff595134b9cceb71d1af4f1c67a5696859cd4bafae3]
commitment [4 f_1*f_2]_1:
[x: 0x21e9dc012aef8d95107fbfe63f455d4345b9b21e37bcb0a49043b1066e211ffa, y: 0x2d6a3b2f1be1042a17c58ff595134b9cceb71d1af4f1c67a5696859cd4bafae3]
```

## Linked Issues

Resolves #
2024-06-05 18:25:12 +03:00
nonam3e
8e62bde16d bit reverse (#528)
This PR adds bit reverse operation support to icicle
2024-06-02 16:37:58 +07:00
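
For reference, a bit-reverse permutation on a power-of-two-length vector can be sketched as follows. This is the standard definition of the operation, with hypothetical function names; it is not the library's implementation or API.

```python
def bit_reverse(index, num_bits):
    """Reverse the lowest num_bits bits of index."""
    result = 0
    for _ in range(num_bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reverse_permutation(values):
    """Return values reordered so element i comes from bit_reverse(i)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    num_bits = n.bit_length() - 1
    return [values[bit_reverse(i, num_bits)] for i in range(n)]
```

For an 8-element input `[0, 1, 2, 3, 4, 5, 6, 7]` this yields `[0, 4, 2, 6, 1, 5, 3, 7]`.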
Jeremy Felder
417ca77f61 precompute bug fix (#529)
This PR fixes 2 things:

1. Removes the assertion regarding the precompute factor needing to be a
power of 2. There is no such requirement and it works just fine for
other values too.
2. Fixes the average bucket size for the large buckets threshold - it
depends on the precompute factor.
2024-05-29 13:59:48 +03:00
hadaringonyama
8911a32135 precompute bug fix 2024-05-28 12:48:48 +03:00
release-bot
c6f6e61d60 Bump rust crates' version
icicle-babybear@2.3.1
icicle-bls12-377@2.3.1
icicle-bls12-381@2.3.1
icicle-bn254@2.3.1
icicle-bw6-761@2.3.1
icicle-core@2.3.1
icicle-cuda-runtime@2.3.1
icicle-grumpkin@2.3.1
icicle-hash@2.3.1
icicle-stark252@2.3.1

Generated by cargo-workspaces
2024-05-20 13:43:32 +00:00
yshekel
4e3aa63d2f fix: ntt mixed-radix bug for large ntts (>4G elements) (#523)
in some cases 32b values would wrap around and cause invalid accesses to
wrong elements and memory addresses
2024-05-20 16:42:44 +03:00
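
The class of bug described in this commit can be illustrated with a small sketch that emulates 32-bit and 64-bit index arithmetic. The function names and the row/stride indexing scheme are hypothetical and are not taken from the NTT kernels.

```python
MASK_32 = 0xFFFFFFFF
MASK_64 = 0xFFFFFFFFFFFFFFFF


def flat_index_32bit(row, row_stride, col):
    """Emulate an index computed in unsigned 32-bit arithmetic."""
    return (row * row_stride + col) & MASK_32


def flat_index_64bit(row, row_stride, col):
    """Emulate the same index computed in unsigned 64-bit arithmetic."""
    return (row * row_stride + col) & MASK_64
```

With `row = 70000` and `row_stride = 70000`, the true index is 4,900,000,000, which exceeds 2^32; the 32-bit version wraps to 605,032,704 and would address the wrong element, while the 64-bit version is correct.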
Leon Hibnik
db298aefc1 [HOTFIX] rust msm benchmarks (#521)
## Describe the changes

removes unused host to device copy, adds minimum limit to run MSM
benchmarks
2024-05-20 13:51:53 +03:00
yshekel
19a9b76d64 fix: cmake set_gpu_env() and windows build (#520) 2024-05-20 13:05:45 +03:00
Jeremy Felder
1e343f17a3 Allow overriding compiler's chosen GPU arch via cmake (#518)
## Describe the changes

This PR modifies icicle/cmake/Common.cmake to set
CMAKE_CUDA_ARCHITECTURES to ${CUDA_ARCH} if the user defines the arch,
and otherwise to set CMAKE_CUDA_ARCHITECTURES to native if the cmake
version is greater than or equal to 3.24.0. This change has been
successfully tested with cmake 3.22.0 and 3.25.2.

## Linked Issues

Resolves #167.
2024-05-19 16:03:15 +03:00
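
The selection logic in this description can be sketched as below, reading it as "user-defined arch first, otherwise `native` on cmake 3.24.0 or newer". The function is a hypothetical model written for illustration, not the CMake code itself, and the `None` return for older cmake versions is an assumption standing in for whatever fallback the build scripts use.

```python
def choose_cuda_architectures(user_arch, cmake_version):
    """Model of the arch selection; cmake_version is a (major, minor, patch) tuple."""
    if user_arch is not None:
        # A user-supplied CUDA_ARCH always wins.
        return user_arch
    if cmake_version >= (3, 24, 0):
        # "native" is only understood by cmake 3.24 and later.
        return "native"
    # Assumption: older cmake versions fall back to some other detection.
    return None
```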
liuhao230
cfea6ebb3b Merge branch 'ingonyama-zk:main' into main 2024-05-17 14:24:02 +08:00
release-bot
76a82bf88e Bump rust crates' version
icicle-babybear@2.3.0
icicle-bls12-377@2.3.0
icicle-bls12-381@2.3.0
icicle-bn254@2.3.0
icicle-bw6-761@2.3.0
icicle-core@2.3.0
icicle-cuda-runtime@2.3.0
icicle-grumpkin@2.3.0
icicle-hash@2.3.0
icicle-stark252@2.3.0

Generated by cargo-workspaces
2024-05-17 04:42:17 +00:00
Vlad
b8310d577e Feat/vlad/poseidon go binding (#513) 2024-05-17 07:20:15 +03:00
liu
49c7fa4b28 fix: add the PARENT_SCOPE
Signed-off-by: liu <liuhao2206@buaa.edu.cn>
2024-05-17 10:45:09 +08:00
Stas
02059fcfaa Stas/best-practice-ntt (#517)
## Describe the changes

Icicle examples:  Concurrent Data Transfer and NTT Computation

This PR introduces a Best Practice series of examples in c++.
Specifically, the example shows how to concurrently transfer data
to/from device and execute NTT

## Linked Issues

Resolves #
2024-05-16 23:51:49 +03:00
nonam3e
4496520a10 golang examples init (#516)
## Describe the changes

This PR adds golang examples

---------

Co-authored-by: Leon Hibnik <107353745+LeonHibnik@users.noreply.github.com>
Co-authored-by: Jeremy Felder <jeremy.felder1@gmail.com>
2024-05-16 19:40:13 +03:00
liu
88a6966a4b Allow overriding compiler's chosen GPU arch via cmake 2024-05-15 22:40:51 +08:00
yshekel
9c1afe8a44 Polynomial API views replaced by evaluation on rou domain (#514)
- removed poly API to access view of evaluations. This is a problematic API since it cannot handle small domains and for large domains requires the polynomial to use more memory than it needs to.
- added evaluate_on_rou_domain() API instead that supports any domain size (powers of two size).
- the new API can compute to HOST or DEVICE memory
- Rust wrapper for evaluate_on_rou_domain()
- updated documentation: overview and Rust wrappers
- faster division by vanishing poly for common case where numerator is 2N and vanishing poly is of degree N.
- allow division a/b where deg(a)<deg(b) instead of throwing an error.
2024-05-15 14:06:23 +03:00
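
To illustrate what evaluating on a roots-of-unity domain means, here is a naive sketch over a small prime field. The Python function borrows the name `evaluate_on_rou_domain` for readability only; its signature, the explicit `omega` and `modulus` parameters, and the Horner-based evaluation are assumptions and do not reflect the actual API or its NTT-based implementation.

```python
def evaluate_on_rou_domain(coeffs, domain_size, omega, modulus):
    """Evaluate a polynomial at omega^0, omega^1, ..., omega^(domain_size - 1).

    coeffs are given lowest degree first; omega must satisfy
    omega^domain_size == 1 (mod modulus).
    """
    if domain_size <= 0 or domain_size & (domain_size - 1):
        raise ValueError("domain_size must be a power of two")
    if pow(omega, domain_size, modulus) != 1:
        raise ValueError("omega is not a root of unity of the given order")
    evaluations = []
    x = 1
    for _ in range(domain_size):
        # Horner's rule, starting from the highest-degree coefficient.
        acc = 0
        for coeff in reversed(coeffs):
            acc = (acc * x + coeff) % modulus
        evaluations.append(acc)
        x = (x * omega) % modulus
    return evaluations
```

In the field of integers modulo 17, 4 is a 4th root of unity, so `evaluate_on_rou_domain([1, 2, 3], 4, 4, 17)` evaluates 1 + 2x + 3x^2 at 1, 4, 16 and 13. The domain size is independent of the number of coefficients, which matches the commit's point about supporting any power-of-two domain size.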
Jeremy Felder
972b924bc0 Update CI to run on some non-code changes (#515)
## Describe the changes

This PR:
- Updates the CI to run on CI workflow file changes
- Updates examples CI to run on examples file changes
2024-05-15 13:17:13 +03:00
sukrucildirr
230a1da512 Fix broken link (#512)
## Describe the changes

There was a broken link attached to the word ZKContainer.

## Linked Issues

Resolves #
2024-05-14 08:36:39 +07:00
release-bot
940b283c47 Bump rust crates' version
icicle-babybear@2.2.0
icicle-bls12-377@2.2.0
icicle-bls12-381@2.2.0
icicle-bn254@2.2.0
icicle-bw6-761@2.2.0
icicle-core@2.2.0
icicle-cuda-runtime@2.2.0
icicle-grumpkin@2.2.0
icicle-hash@2.2.0
icicle-stark252@2.2.0

Generated by cargo-workspaces
2024-05-09 12:27:17 +00:00
Leon Hibnik
e0412183fd syntax highlight (#511)
## Describe the changes
adds syntax highlighting to `rust` and `go`
2024-05-09 15:23:20 +03:00
ChickenLover
9da52bc09f Feat/roman/poseidon2 (#510)
# This PR

1. Adds C++ API
2. Renames a lot of API functions
3. Adds inplace poseidon2
4. Makes input const at all poseidon functions
5. Adds benchmark for poseidon2
2024-05-09 19:19:55 +07:00
VitaliiH
49079d0d2a rust ecntt hotfix (#509)
## Describe the changes

This PR fixes Rust ECNTT benches and tests


---------

Co-authored-by: VitaliiH <Vitaliy@ingo>
2024-05-09 11:21:21 +03:00
ChickenLover
094683d291 Feat/roman/poseidon2 (#507)
This PR adds support for poseidon2 permutation function as described in
https://eprint.iacr.org/2023/323.pdf

Reference implementations used (and compared against):
https://github.com/HorizenLabs/poseidon2/tree/main
https://github.com/Plonky3/Plonky3/tree/main

Tasks:

- [x] Remove commented code and prints
- [ ] Add doc-comments to functions and structs
- [x] Fix possible issue with Plonky3 imports
- [x] Update NTT/Plonky3 test
- [x] Add Plonky3-bn254 test (impossible)
2024-05-09 15:13:43 +07:00
nonam3e
c30e333819 keccak docs (#508)
This PR adds keccak docs

---------

Co-authored-by: Leon Hibnik <107353745+LeonHibnik@users.noreply.github.com>
2024-05-08 23:18:59 +03:00
yshekel
2905d2a469 fix: bug regarding polynomial evaluations view in CUDA backend (#506)
fixing:
(1) not building polynomials and tests for grumpkin curve (no NTT)
(2) polynomial API C++ example compilation and (when compilation is
fixed) memory corruption
(3) bug fix in poly CUDA backend regarding transformation to evaluations
in some cases
2024-05-08 21:02:18 +03:00
Jeremy Felder
732ee51552 [CI]: Update Cpp CI to include build args (#503)
## Describe the changes

This PR adds build args to the Cpp CI and adds grumpkin curve and
stark252 field
2024-05-08 14:35:02 +03:00
Jeremy Felder
14997566ff [FIX]: Fix releasing device set on host thread during multigpu call (#501)
## Describe the changes

This PR fixes an issue when `RunOnDevice` is called for multi-gpu while
other goroutines calling device operations are run outside of
`RunOnDevice`. The issue comes from setting a device other than the
default device (device 0) on a host thread within `RunOnDevice` and not
unsetting that host thread's device when `RunOnDevice` finishes.

When `RunOnDevice` locks a host thread to ensure that all other calls in
the go routine are on the same device, it never unsets that thread’s
device. Once the thread is unlocked, other go routines can get scheduled
to it but it still has the device set to whatever it was before while it
was locked, so it's possible that the following sequence happens:

1. NTT domain is initialized on thread 2 via a goroutine on device 0
2. MSM multiGPU test runs and is locked on thread 3 setting its device
to 1
3. Other tests run concurrently on threads other than 3 (since it is
locked)
4. MSM multiGPU test finishes and releases thread 3 back to the pool but
its device is still 1
5. NTT test runs and is assigned to thread 3 --> this will fail because
the thread’s device wasn’t released back

We really only want to set a thread's device while the thread is locked.
But once we unlock a thread, its device should return to whatever it
was set to originally. In theory, it should always be 0 if `SetDevice`
is never used outside of `RunOnDevice`, which it shouldn’t be in most
situations.
2024-05-08 14:07:29 +03:00
Otsar
a56435d2e8 Updated hall of fame (#505)
## Describe the changes

Adds Patrick to Hall of fame
2024-05-07 14:41:38 +03:00
Stas
41294b12e0 Stas/example poly (#434)
## Describe the changes

Added examples for Poly API

---------

Co-authored-by: Yuval Shekel <yshekel@gmail.com>
2024-05-07 11:52:13 +03:00
Jeremy Felder
6134cfe177 [DOCS]: Tidy up docs (#502)
## Describe the changes

This PR tidies up docs and updates golang build instructions
2024-05-06 15:35:19 +03:00
629 changed files with 40134 additions and 6124 deletions

View File

@@ -3,13 +3,22 @@ golang:
- wrappers/golang/**/*.h
- wrappers/golang/**/*.tmpl
- go.mod
- .github/workflows/golang.yml
rust:
- wrappers/rust/**/*
- wrappers/rust_v3/**/*
- '!wrappers/rust/README.md'
- .github/workflows/v3_rust.yml
cpp:
- icicle/**/*.cu
- icicle/**/*.cuh
- icicle/**/*.cpp
- icicle/**/*.hpp
- icicle/**/*.c
- icicle/**/*.h
- icicle/CMakeLists.txt
- icicle_v3/**/*.cu
- icicle_v3/**/*.cuh
- icicle_v3/**/*.cpp
- icicle_v3/**/*.hpp
- icicle_v3/**/*.c
- icicle_v3/**/*.h
- icicle_v3/CMakeLists.txt
- .github/workflows/v3.yml
- icicle_v3/cmake/curve.cmake
- icicle_v3/cmake/field.cmake
examples:
- examples/**/*
- .github/workflows/examples.yml

View File

@@ -12,6 +12,9 @@ on:
cpp_cuda:
description: "Flag for if C++/CUDA files changed"
value: ${{ jobs.check-changed-files.outputs.cpp_cuda }}
examples:
description: "Flag for if example files changed"
value: ${{ jobs.check-changed-files.outputs.examples }}
jobs:
check-changed-files:
@@ -21,6 +24,7 @@ jobs:
golang: ${{ steps.changed_files.outputs.golang }}
rust: ${{ steps.changed_files.outputs.rust }}
cpp_cuda: ${{ steps.changed_files.outputs.cpp_cuda }}
examples: ${{ steps.changed_files.outputs.examples }}
steps:
- name: Checkout Repo
uses: actions/checkout@v4
@@ -37,3 +41,4 @@ jobs:
echo "golang=${{ steps.changed-files-yaml.outputs.golang_any_modified }}" >> "$GITHUB_OUTPUT"
echo "rust=${{ steps.changed-files-yaml.outputs.rust_any_modified }}" >> "$GITHUB_OUTPUT"
echo "cpp_cuda=${{ steps.changed-files-yaml.outputs.cpp_any_modified }}" >> "$GITHUB_OUTPUT"
echo "examples=${{ steps.changed-files-yaml.outputs.examples_any_modified }}" >> "$GITHUB_OUTPUT"

View File

@@ -4,7 +4,8 @@ on:
pull_request:
branches:
- main
- V2
- V3
- yshekel/V3
jobs:
spelling-checker:

View File

@@ -3,12 +3,10 @@ name: C++/CUDA
on:
pull_request:
branches:
- main
- V2
- main
push:
branches:
- main
- V2
- main
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
@@ -35,7 +33,18 @@ jobs:
needs: [check-changed-files, check-format]
strategy:
matrix:
curve: [bn254, bls12_381, bls12_377, bw6_761]
curve:
- name: bn254
build_args: -DG2=ON -DECNTT=ON
- name: bls12_381
build_args: -DG2=ON -DECNTT=ON
- name: bls12_377
build_args: -DG2=ON -DECNTT=ON
- name: bw6_761
build_args: -DG2=ON -DECNTT=ON
- name: grumpkin
build_args:
steps:
- name: Checkout Repo
uses: actions/checkout@v4
@@ -44,7 +53,7 @@ jobs:
if: needs.check-changed-files.outputs.cpp_cuda == 'true'
run: |
mkdir -p build && rm -rf build/*
cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTS=ON -DCURVE=${{ matrix.curve }} -DG2=ON -S . -B build
cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTS=ON -DCURVE=${{ matrix.curve.name }} ${{ matrix.curve.build_args }} -S . -B build
cmake --build build -j
- name: Run C++ curve Tests
working-directory: ./icicle/build/tests
@@ -57,7 +66,13 @@ jobs:
needs: [check-changed-files, check-format]
strategy:
matrix:
field: [babybear]
field:
- name: babybear
build_args: -DEXT_FIELD=ON
- name: stark252
build_args: -DEXT_FIELD=OFF
- name: m31
build_args: -DEXT_FIELD=ON
steps:
- name: Checkout Repo
uses: actions/checkout@v4
@@ -66,7 +81,7 @@ jobs:
if: needs.check-changed-files.outputs.cpp_cuda == 'true'
run: |
mkdir -p build && rm -rf build/*
cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTS=ON -DFIELD=${{ matrix.field }} -DEXT_FIELD=ON -S . -B build
cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTS=ON -DFIELD=${{ matrix.field.name }} ${{ matrix.field.build_args }} -S . -B build
cmake --build build -j
- name: Run C++ field Tests
working-directory: ./icicle/build/tests
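
The matrix change above pairs each curve with its own `build_args`, so `grumpkin` can be configured without G2 or ECNTT. The following is a minimal dry-run sketch of how those entries expand into configure commands. `build_args_for` and `configure_command` are hypothetical helper names used only for illustration; the flag values are copied from the matrix.

```sh
#!/usr/bin/env bash
# Dry-run sketch: print the configure command a matrix entry would produce.
# build_args_for is a hypothetical helper; its values mirror the matrix above.
build_args_for() {
  case "$1" in
    bn254|bls12_381|bls12_377|bw6_761) echo "-DG2=ON -DECNTT=ON" ;;
    grumpkin) echo "" ;;   # no optional features enabled
    *) echo "unknown curve: $1" >&2; return 1 ;;
  esac
}

configure_command() {
  local curve="$1" args
  args=$(build_args_for "$curve") || return 1
  # ${args:+ $args} adds the leading space only when args is non-empty.
  echo "cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTS=ON -DCURVE=${curve}${args:+ $args} -S . -B build"
}

for curve in bn254 grumpkin; do
  configure_command "$curve"
done
```

The sketch only prints the commands, so it can be run without CUDA or CMake installed.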

View File

@@ -1,8 +1,7 @@
# This workflow is a demo of how to run all examples in the Icicle repository.
# For each language directory (c++, Rust, etc.) the workflow
# (1) loops over all examples (msm, ntt, etc.) and
# (2) runs ./compile.sh and ./run.sh in each directory.
# The script ./compile.sh should compile the example and ./run.sh should run it.
# For each language directory (c++, Rust, etc.) the workflow
# (1) loops over all examples (msm, ntt, etc.) and
# (2) runs ./run.sh in each directory.
# Each script should return 0 for success and 1 otherwise.
name: Examples
@@ -10,12 +9,11 @@ name: Examples
on:
pull_request:
branches:
- main
- V2
- V3
- yshekel/V3 # TODO remove when merged to V3
push:
branches:
- main
- V2
- V3
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
@@ -25,36 +23,65 @@ jobs:
check-changed-files:
uses: ./.github/workflows/check-changed-files.yml
extract-cuda-backend-branch:
name: Extract cuda branch name
runs-on: ubuntu-22.04
outputs:
cuda-backend-branch: ${{ steps.extract.outputs.cuda-backend-branch }}
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Extract Private Branch from PR Description
id: extract
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
DESCRIPTION=$(gh pr view ${{ github.event.pull_request.number }} --json body -q '.body')
echo "PR Description: $DESCRIPTION"
CUDA_BE_BRANCH=$(echo "$DESCRIPTION" | grep -oP 'cuda-backend-branch:\s*\K[^\s]+') || true
if [ -z "$CUDA_BE_BRANCH" ]; then
CUDA_BE_BRANCH="main" # Default branch if not specified
fi
echo "Extracted CUDA Backend Branch: $CUDA_BE_BRANCH"
echo "::set-output name=cuda-backend-branch::$CUDA_BE_BRANCH"
run-examples:
runs-on: [self-hosted, Linux, X64, icicle, examples]
needs: check-changed-files
needs: [check-changed-files, extract-cuda-backend-branch]
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Checkout CUDA Backend
uses: actions/checkout@v4
with:
repository: ingonyama-zk/icicle-cuda-backend
path: ./icicle_v3/backend/cuda
token: ${{ secrets.GITHUB_TOKEN }}
ssh-key: ${{ secrets.CUDA_PULL_KEY }}
ref: ${{ needs.extract-cuda-backend-branch.outputs.cuda-backend-branch }}
- name: c++ examples
working-directory: ./examples/c++
if: needs.check-changed-files.outputs.cpp_cuda == 'true'
if: needs.check-changed-files.outputs.cpp_cuda == 'true' || needs.check-changed-files.outputs.examples == 'true'
run: |
# loop over all directories in the current directory
for dir in $(find . -mindepth 1 -maxdepth 1 -type d); do
if [ -d "$dir" ]; then
echo "Running command in $dir"
cd $dir
./compile.sh
./run.sh
./run.sh -d CUDA
cd -
fi
done
done
- name: Rust examples
working-directory: ./examples/rust
if: needs.check-changed-files.outputs.rust == 'true'
if: needs.check-changed-files.outputs.rust == 'true' || needs.check-changed-files.outputs.examples == 'true'
run: |
# loop over all directories in the current directory
for dir in $(find . -mindepth 1 -maxdepth 1 -type d); do
if [ -d "$dir" ]; then
echo "Running command in $dir"
cd $dir
cargo run --release
./run.sh -d CUDA
cd -
fi
done
done
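
The example steps above loop over every example directory and call `./run.sh -d CUDA` in each. One possible variation, sketched below under the assumption that every example ships an executable `run.sh`, runs each script in a subshell and counts failures instead of stopping at the first one. `run_all_examples` is a hypothetical helper name and is not part of the workflow.

```sh
#!/usr/bin/env bash
# run_all_examples is a hypothetical helper: it runs ./run.sh in every
# immediate subdirectory of $1, forwarding the remaining arguments.
run_all_examples() {
  local root="$1" failed=0 dir
  shift
  for dir in "$root"/*/; do
    [ -x "${dir}run.sh" ] || continue   # skip directories without a runnable script
    echo "Running command in $dir"
    # A subshell keeps the cd local, so no "cd -" is needed afterwards.
    if ! (cd "$dir" && ./run.sh "$@"); then
      echo "FAILED: $dir" >&2
      failed=$((failed + 1))
    fi
  done
  # Return 1 if any example failed, 0 otherwise.
  return "$((failed > 0 ? 1 : 0))"
}
```

A call such as `run_all_examples ./examples/c++ -d CUDA` would correspond to the C++ step; the quoted glob also tolerates directory names containing spaces, which the `$(find ...)` form does not.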

View File

@@ -3,12 +3,10 @@ name: GoLang
on:
pull_request:
branches:
- main
- V2
- main
push:
branches:
- main
- V2
- main
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
@@ -34,7 +32,7 @@ jobs:
run: if [[ $(go list ./... | xargs go fmt) ]]; then echo "Please run go fmt"; exit 1; fi
build-curves-linux:
name: Build curves on Linux
name: Build and test curves on Linux
runs-on: [self-hosted, Linux, X64, icicle]
needs: [check-changed-files, check-format]
strategy:
@@ -60,19 +58,18 @@ jobs:
- name: Build
working-directory: ./wrappers/golang
if: needs.check-changed-files.outputs.golang == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
run: ./build.sh -curve=${{ matrix.curve.name }} ${{ matrix.curve.build_args }} # builds a single curve with G2 and ECNTT enabled
- name: Upload ICICLE lib artifacts
uses: actions/upload-artifact@v4
# builds a single curve with the curve's specified build args
run: ./build.sh -curve=${{ matrix.curve.name }} ${{ matrix.curve.build_args }}
- name: Test
working-directory: ./wrappers/golang/curves
if: needs.check-changed-files.outputs.golang == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
with:
name: icicle-builds-${{ matrix.curve.name }}-${{ github.workflow }}-${{ github.sha }}
path: |
icicle/build/lib/libingo_curve_${{ matrix.curve.name }}.a
icicle/build/lib/libingo_field_${{ matrix.curve.name }}.a
retention-days: 1
run: |
CURVE=$(echo ${{ matrix.curve.name }} | sed -e 's/_//g')
export CPATH=$CPATH:/usr/local/cuda/include
go test ./$CURVE/tests -count=1 -failfast -p 2 -timeout 60m -v
build-fields-linux:
name: Build fields on Linux
name: Build and test fields on Linux
runs-on: [self-hosted, Linux, X64, icicle]
needs: [check-changed-files, check-format]
strategy:
@@ -90,18 +87,18 @@ jobs:
- name: Build
working-directory: ./wrappers/golang
if: needs.check-changed-files.outputs.golang == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
run: ./build.sh -field=${{ matrix.field.name }} ${{ matrix.field.build_args }} # builds a single field with field-ext enabled
- name: Upload ICICLE lib artifacts
uses: actions/upload-artifact@v4
# builds a single field with the field's specified build args
run: ./build.sh -field=${{ matrix.field.name }} ${{ matrix.field.build_args }}
- name: Test
working-directory: ./wrappers/golang/fields
if: needs.check-changed-files.outputs.golang == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
with:
name: icicle-builds-${{ matrix.field.name }}-${{ github.workflow }}-${{ github.sha }}
path: |
icicle/build/lib/libingo_field_${{ matrix.field.name }}.a
retention-days: 1
run: |
FIELD=$(echo ${{ matrix.field.name }} | sed -e 's/_//g')
export CPATH=$CPATH:/usr/local/cuda/include
go test ./$FIELD/tests -count=1 -failfast -p 2 -timeout 60m -v
build-hashes-linux:
name: Build hashes on Linux
name: Build and test hashes on Linux
runs-on: [self-hosted, Linux, X64, icicle]
needs: [check-changed-files, check-format]
strategy:
@@ -119,41 +116,15 @@ jobs:
- name: Build
working-directory: ./wrappers/golang
if: needs.check-changed-files.outputs.golang == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
run: ./build.sh -hash=${{ matrix.hash.name }} ${{ matrix.hash.build_args }} # builds a single hash algorithm
- name: Upload ICICLE lib artifacts
uses: actions/upload-artifact@v4
# builds a single hash algorithm with the hash's specified build args
run: ./build.sh -hash=${{ matrix.hash.name }} ${{ matrix.hash.build_args }}
- name: Test
working-directory: ./wrappers/golang/hash
if: needs.check-changed-files.outputs.golang == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
with:
name: icicle-builds-${{ matrix.hash.name }}-${{ github.workflow }}-${{ github.sha }}
path: |
icicle/build/lib/libingo_hash.a
retention-days: 1
test-linux:
name: Test on Linux
runs-on: [self-hosted, Linux, X64, icicle]
needs: [check-changed-files, build-curves-linux, build-fields-linux, build-hashes-linux]
steps:
- name: Checkout Repo
uses: actions/checkout@v4
- name: Setup go
uses: actions/setup-go@v5
with:
go-version: '1.20.0'
- name: Download ICICLE lib artifacts
uses: actions/download-artifact@v4
if: needs.check-changed-files.outputs.golang == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
with:
path: ./icicle/build/lib
merge-multiple: true
- name: Run Tests
working-directory: ./wrappers/golang
if: needs.check-changed-files.outputs.golang == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
# -count ensures the test results are not cached
# -p controls the number of programs that can be run in parallel
run: |
HASH=$(echo ${{ matrix.hash.name }} | sed -e 's/_//g')
export CPATH=$CPATH:/usr/local/cuda/include
go test ./... -count=1 -failfast -p 2 -timeout 60m
go test ./$HASH/tests -count=1 -failfast -p 2 -timeout 60m -v
# TODO: bw6 on windows requires more memory than the standard runner has
# Add a large runner and then enable this job
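
The new `Test` steps above strip underscores from the matrix name before using it as a directory, which suggests the Go package directories are named without them. A minimal sketch of that mapping follows; `go_test_path` is a hypothetical helper name used only for illustration.

```sh
#!/usr/bin/env bash
# go_test_path is a hypothetical helper mirroring the "Test" steps above.
go_test_path() {
  local name
  # Strip every underscore: bls12_381 -> bls12381, bw6_761 -> bw6761.
  name=$(echo "$1" | sed -e 's/_//g')
  echo "./${name}/tests"
}

go_test_path bls12_381   # prints ./bls12381/tests
go_test_path babybear    # prints ./babybear/tests
```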

View File

@@ -3,12 +3,10 @@ name: Rust
on:
pull_request:
branches:
- main
- V2
- main
push:
branches:
- main
- V2
- main
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
@@ -62,8 +60,8 @@ jobs:
# We need to limit the number of threads to avoid running out of memory on weaker machines
# ignored tests are polynomial tests. Since they conflict with NTT tests, they are executed separately
run: |
cargo test --workspace --exclude icicle-babybear --exclude icicle-stark252 --release --verbose --features=g2 -- --test-threads=2 --ignored
cargo test --workspace --exclude icicle-babybear --exclude icicle-stark252 --release --verbose --features=g2 -- --test-threads=2
cargo test --workspace --exclude icicle-babybear --exclude icicle-stark252 --exclude icicle-m31 --release --verbose --features=g2 -- --test-threads=2 --ignored
cargo test --workspace --exclude icicle-babybear --exclude icicle-stark252 --exclude icicle-m31 --release --verbose --features=g2 -- --test-threads=2
- name: Run baby bear tests
working-directory: ./wrappers/rust/icicle-fields/icicle-babybear
@@ -79,26 +77,34 @@ jobs:
cargo test --release --verbose -- --ignored
cargo test --release --verbose
build-windows:
name: Build on Windows
runs-on: windows-2022
needs: check-changed-files
steps:
- name: Checkout Repo
uses: actions/checkout@v4
- name: Download and Install Cuda
- name: Run m31 tests
working-directory: ./wrappers/rust/icicle-fields/icicle-m31
if: needs.check-changed-files.outputs.rust == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
id: cuda-toolkit
uses: Jimver/cuda-toolkit@v0.2.11
with:
cuda: '12.0.0'
method: 'network'
# https://docs.nvidia.com/cuda/archive/12.0.0/cuda-installation-guide-microsoft-windows/index.html
sub-packages: '["cudart", "nvcc", "thrust", "visual_studio_integration"]'
- name: Build targets
working-directory: ./wrappers/rust
if: needs.check-changed-files.outputs.rust == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
env:
CUDA_PATH: ${{ steps.cuda-toolkit.outputs.CUDA_PATH }}
# Building from the root workspace will build all members of the workspace by default
run: cargo build --release --verbose
run: |
cargo test --release --verbose -- --ignored
cargo test --release --verbose
# build-windows:
# name: Build on Windows
# runs-on: windows-2022
# needs: check-changed-files
# steps:
# - name: Checkout Repo
# uses: actions/checkout@v4
# - name: Download and Install Cuda
# if: needs.check-changed-files.outputs.rust == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
# id: cuda-toolkit
# uses: Jimver/cuda-toolkit@v0.2.11
# with:
# cuda: '12.0.0'
# method: 'network'
# # https://docs.nvidia.com/cuda/archive/12.0.0/cuda-installation-guide-microsoft-windows/index.html
# sub-packages: '["cudart", "nvcc", "thrust", "visual_studio_integration"]'
# - name: Build targets
# working-directory: ./wrappers/rust
# if: needs.check-changed-files.outputs.rust == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
# env:
# CUDA_PATH: ${{ steps.cuda-toolkit.outputs.CUDA_PATH }}
# CUDA_ARCH: 50 # Using CUDA_ARCH=50 env variable since the CI machines have no GPUs
# # Building from the root workspace will build all members of the workspace by default
# run: cargo build --release --verbose

View File

@@ -5,7 +5,7 @@ on:
branches:
- main
paths:
- 'docs/*'
- 'docs/**'
jobs:
test-deploy:

126
.github/workflows/v3.yml vendored Normal file
View File

@@ -0,0 +1,126 @@
name: C++/CUDA
on:
pull_request:
branches:
- V3
- yshekel/V3 # TODO remove when merged to V3
push:
branches:
- V3
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
check-changed-files:
uses: ./.github/workflows/check-changed-files.yml
check-format:
name: Check Code Format
runs-on: ubuntu-22.04
needs: check-changed-files
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Check clang-format
if: needs.check-changed-files.outputs.cpp_cuda == 'true'
run: if [[ $(find ./ \( -path ./icicle/build -prune -o -path ./**/target -prune -o -path ./examples -prune \) -iname *.h -or -iname *.cuh -or -iname *.cu -or -iname *.c -or -iname *.cpp | xargs clang-format --dry-run -ferror-limit=1 -style=file 2>&1) ]]; then echo "Please run clang-format"; exit 1; fi
extract-cuda-backend-branch:
name: Extract cuda branch name
runs-on: ubuntu-22.04
outputs:
cuda-backend-branch: ${{ steps.extract.outputs.cuda-backend-branch }}
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Extract Private Branch from PR Description
id: extract
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
DESCRIPTION=$(gh pr view ${{ github.event.pull_request.number }} --json body -q '.body')
echo "PR Description: $DESCRIPTION"
CUDA_BE_BRANCH=$(echo "$DESCRIPTION" | grep -oP 'cuda-backend-branch:\s*\K[^\s]+') || true
if [ -z "$CUDA_BE_BRANCH" ]; then
CUDA_BE_BRANCH="main" # Default branch if not specified
fi
echo "Extracted CUDA Backend Branch: $CUDA_BE_BRANCH"
echo "::set-output name=cuda-backend-branch::$CUDA_BE_BRANCH"
test-linux-curve:
name: Test on Linux
runs-on: [self-hosted, Linux, X64, icicle]
needs: [check-changed-files, check-format, extract-cuda-backend-branch]
strategy:
matrix:
curve:
- name: bn254
build_args: -DG2=ON -DECNTT=ON
- name: bls12_381
build_args: -DG2=ON -DECNTT=ON
- name: bls12_377
build_args: -DG2=ON -DECNTT=ON
- name: bw6_761
build_args: -DG2=ON -DECNTT=ON
- name: grumpkin
build_args:
steps:
- name: Checkout Repo
uses: actions/checkout@v4
- name: Checkout CUDA Backend
uses: actions/checkout@v4
with:
repository: ingonyama-zk/icicle-cuda-backend
path: ./icicle_v3/backend/cuda
token: ${{ secrets.GITHUB_TOKEN }}
ssh-key: ${{ secrets.CUDA_PULL_KEY }}
ref: ${{ needs.extract-cuda-backend-branch.outputs.cuda-backend-branch }}
- name: Build curve
working-directory: ./icicle_v3
if: needs.check-changed-files.outputs.cpp_cuda == 'true'
run: |
mkdir -p build && rm -rf build/*
cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTS=ON -DCURVE=${{ matrix.curve.name }} ${{ matrix.curve.build_args }} -DCUDA_BACKEND=local -S . -B build
cmake --build build -j
- name: Run C++ curve Tests
working-directory: ./icicle_v3/build/tests
if: needs.check-changed-files.outputs.cpp_cuda == 'true'
run: ctest --verbose
test-linux-field:
name: Test on Linux
runs-on: [self-hosted, Linux, X64, icicle]
needs: [check-changed-files, check-format, extract-cuda-backend-branch]
strategy:
matrix:
field:
- name: babybear
build_args: -DEXT_FIELD=ON
- name: stark252
build_args: -DEXT_FIELD=OFF
steps:
- name: Checkout Repo
uses: actions/checkout@v4
- name: Checkout CUDA Backend
uses: actions/checkout@v4
with:
repository: ingonyama-zk/icicle-cuda-backend
path: ./icicle_v3/backend/cuda
token: ${{ secrets.GITHUB_TOKEN }}
ssh-key: ${{ secrets.CUDA_PULL_KEY }}
ref: ${{ needs.extract-cuda-backend-branch.outputs.cuda-backend-branch }}
- name: Build field
working-directory: ./icicle_v3
if: needs.check-changed-files.outputs.cpp_cuda == 'true'
run: |
mkdir -p build && rm -rf build/*
cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTS=ON -DFIELD=${{ matrix.field.name }} ${{ matrix.field.build_args }} -DCUDA_BACKEND=local -S . -B build
cmake --build build -j
- name: Run C++ field Tests
working-directory: ./icicle_v3/build/tests
if: needs.check-changed-files.outputs.cpp_cuda == 'true'
run: ctest --verbose
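
The `extract-cuda-backend-branch` job above reads a `cuda-backend-branch: <name>` line from the PR description and falls back to `main` when none is present. The sketch below isolates that logic so it can be tried locally without `gh`. `extract_cuda_backend_branch` is a hypothetical helper name, and `grep -P` assumes GNU grep, as on the workflow's Ubuntu runner.

```sh
#!/usr/bin/env bash
# extract_cuda_backend_branch is a hypothetical helper mirroring the workflow step.
extract_cuda_backend_branch() {
  local description="$1" branch
  # -o prints only the match; \K drops the "cuda-backend-branch:" prefix from it.
  branch=$(printf '%s\n' "$description" \
    | grep -oP 'cuda-backend-branch:\s*\K[^\s]+' | head -n 1) || true
  # Default branch if the description does not name one.
  printf '%s\n' "${branch:-main}"
}

extract_cuda_backend_branch "cuda-backend-branch: feature/poly-fix"   # prints feature/poly-fix
extract_cuda_backend_branch "No backend branch mentioned here"        # prints main
```

The `head -n 1` is an addition in this sketch: it keeps the result to a single branch name if the description happens to contain the marker more than once.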

82
.github/workflows/v3_rust.yml vendored Normal file
View File

@@ -0,0 +1,82 @@
name: Rust
on:
pull_request:
branches:
- V3
- yshekel/V3 # TODO remove when merged to V3
push:
branches:
- V3
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
check-changed-files:
uses: ./.github/workflows/check-changed-files.yml
check-format:
name: Check Code Format
runs-on: ubuntu-22.04
needs: check-changed-files
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Check rustfmt
if: needs.check-changed-files.outputs.rust == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
working-directory: ./wrappers/rust
# "-name target -prune" removes searching in any directory named "target"
# Formatting by single file is necessary due to generated files not being present
# before building the project.
# e.g. icicle-cuda-runtime/src/bindings.rs is generated and icicle-cuda-runtime/src/lib.rs includes that module
# causing rustfmt to fail.
run: if [[ $(find . -path ./icicle-curves/icicle-curve-template -prune -o -name target -prune -o -iname *.rs -print | xargs cargo fmt --check --) ]]; then echo "Please run cargo fmt"; exit 1; fi
extract-cuda-backend-branch:
name: Extract cuda branch name
runs-on: ubuntu-22.04
outputs:
cuda-backend-branch: ${{ steps.extract.outputs.cuda-backend-branch }}
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Extract Private Branch from PR Description
id: extract
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
DESCRIPTION=$(gh pr view ${{ github.event.pull_request.number }} --json body -q '.body')
echo "PR Description: $DESCRIPTION"
CUDA_BE_BRANCH=$(echo "$DESCRIPTION" | grep -oP 'cuda-backend-branch:\s*\K[^\s]+') || true
if [ -z "$CUDA_BE_BRANCH" ]; then
CUDA_BE_BRANCH="main" # Default branch if not specified
fi
echo "Extracted CUDA Backend Branch: $CUDA_BE_BRANCH"
echo "::set-output name=cuda-backend-branch::$CUDA_BE_BRANCH"
test-linux:
name: Test on Linux
runs-on: [self-hosted, Linux, X64, icicle]
needs: [check-changed-files, check-format, extract-cuda-backend-branch]
steps:
- name: Checkout Repo
uses: actions/checkout@v4
- name: Checkout CUDA Backend
uses: actions/checkout@v4
with:
repository: ingonyama-zk/icicle-cuda-backend
path: ./icicle_v3/backend/cuda
token: ${{ secrets.GITHUB_TOKEN }}
ssh-key: ${{ secrets.CUDA_PULL_KEY }}
ref: ${{ needs.extract-cuda-backend-branch.outputs.cuda-backend-branch }}
- name: Run tests
working-directory: ./wrappers/rust_v3
if: needs.check-changed-files.outputs.rust == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
# tests are split to phases since NTT domain is global but tests have conflicting requirements
run: |
cargo build --release --verbose --features=g2,ec_ntt
cargo test --workspace --release --verbose --features=g2,ec_ntt,cuda_backend -- --skip phase
cargo test phase2 --workspace --release --verbose --features=g2,ec_ntt,cuda_backend
cargo test phase3 --workspace --release --verbose --features=g2,ec_ntt,cuda_backend
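
The extraction step above publishes its result with the `::set-output` workflow command, which GitHub has deprecated in favor of appending `name=value` lines to the file named by `$GITHUB_OUTPUT` (the form `check-changed-files.yml` already uses). A minimal sketch of that form follows; `set_step_output` is a hypothetical helper name, and the scratch-file fallback exists only so the snippet can run outside of Actions.

```sh
#!/usr/bin/env bash
# set_step_output is a hypothetical helper; GITHUB_OUTPUT is the file path
# that GitHub Actions provides to each step.
set_step_output() {
  local name="$1" value="$2"
  echo "${name}=${value}" >> "$GITHUB_OUTPUT"
}

# Outside of Actions, point GITHUB_OUTPUT at a scratch file to try it locally.
GITHUB_OUTPUT="${GITHUB_OUTPUT:-$(mktemp)}"
set_step_output cuda-backend-branch main
cat "$GITHUB_OUTPUT"
```

The job-level `outputs:` mapping would stay the same, since it reads `steps.extract.outputs.cuda-backend-branch` in either case.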

5
.gitignore vendored
View File

@@ -8,6 +8,7 @@
*.so
*.nsys-rep
*.ncu-rep
*.sage.py
**/target
**/.vscode
**/.*lock*csv#
@@ -16,6 +17,4 @@
**/Cargo.lock
**/icicle/build/
**/wrappers/rust/icicle-cuda-runtime/src/bindings.rs
**/build*
**/icicle/appUtils/large_ntt/work
icicle/appUtils/large_ntt/work/test_ntt
**/build/*

View File

@@ -119,6 +119,7 @@ This will ensure our custom hooks are run and will make it easier to follow our
- [nonam3e](https://github.com/nonam3e), for adding Grumpkin curve support into ICICLE
- [alxiong](https://github.com/alxiong), for adding warmup for CudaStream
- [cyl19970726](https://github.com/cyl19970726), for updating go install source in Dockerfile
- [PatStiles](https://github.com/PatStiles), for adding Stark252 field
## Help & Support

View File

@@ -2,34 +2,54 @@
ICICLE Core is a library written in C++/CUDA. All the ICICLE primitives are implemented within ICICLE Core.
The Core is split into logical modules that can be compiled into static libraries using different [strategies](#compilation-strategies). You can then [link](#linking) these libraries with your C++ project or write your own [bindings](#writing-new-bindings-for-icicle) for other programming languages. If you want to use ICICLE with existing bindings please refer to [Rust](/icicle/rust-bindings) / [Golang](/icicle/golang-bindings).
The Core is split into logical modules that can be compiled into static libraries using different [strategies](#compilation-strategies). You can then [link](#linking) these libraries with your C++ project or write your own [bindings](#writing-new-bindings-for-icicle) for other programming languages. If you want to use ICICLE with existing bindings please refer to the [Rust](/icicle/rust-bindings) or [Golang](/icicle/golang-bindings) bindings documentation.
## Supported curves, fields and operations
### Supported curves and operations
| Operation\Curve | [bn254](https://neuromancer.sk/std/bn/bn254) | [bls12-377](https://neuromancer.sk/std/bls/BLS12-377) | [bls12-381](https://neuromancer.sk/std/bls/BLS12-381) | [bw6-761](https://eprint.iacr.org/2020/351) | grumpkin |
| --- | :---: | :---: | :---: | :---: | :---: |
| [MSM][MSM_DOCS] | ✅ | ✅ | ✅ | ✅ | ✅ |
| G2 | ✅ | ✅ | ✅ | ✅ | ❌ |
| [NTT][NTT_DOCS] | ✅ | ✅ | ✅ | ✅ | ❌ |
| ECNTT | ✅ | ✅ | ✅ | ✅ | ❌ |
| [VecOps][VECOPS_CODE] | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Polynomials][POLY_DOCS] | ✅ | ✅ | ✅ | ✅ | ❌ |
| [Poseidon](primitives/poseidon) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Merkle Tree](primitives/poseidon#the-tree-builder) | ✅ | ✅ | ✅ | ✅ | ✅ |
### Supported fields and operations
| Operation\Field | [babybear](https://eprint.iacr.org/2023/824.pdf) | [Stark252](https://docs.starknet.io/documentation/architecture_and_concepts/Cryptography/p-value/) |
| --- | :---: | :---: |
| [VecOps][VECOPS_CODE] | ✅ | ✅ |
| [Polynomials][POLY_DOCS] | ✅ | ✅ |
| [NTT][NTT_DOCS] | ✅ | ✅ |
| Extension Field | ✅ | ❌ |
### Supported hashes
| Hash | Sizes |
| --- | :---: |
| Keccak | 256, 512 |
## Compilation strategies
Most of the codebase is curve/field agnostic, which means it can be compiled for different curves and fields. When you build ICICLE Core you choose a single curve or field. If you need multiple curves or fields - you just compile ICICLE into multiple static libraries. It's that simple. Currently, the following choices are supported:
Most of the codebase is curve/field agnostic, which means it can be compiled for different curves and fields. When you build ICICLE Core you choose a single curve or field. If you need multiple curves or fields, you compile ICICLE once per curve or field that is needed. It's that simple. Currently, the following choices are supported:
- [Field mode](#compiling-for-a-field) - used for STARK fields like BabyBear / Mersenne / Goldilocks. Includes field arithmetic, NTT, Poseidon, Extension fields and other primitives.
- [Curve mode](#compiling-for-a-curve) - used for SNARK curves like BN254/ BLS curves / Grumpkin / etc. Curve mode is built upon field mode, so it includes everything that field does. It also includes curve operations / MSM / ECNTT / G2 and other curve-related primitives.
- [Field mode][COMPILE_FIELD_MODE] - used for STARK fields like BabyBear / Mersenne / Goldilocks. Includes field arithmetic, NTT, Poseidon, Extension fields and other primitives.
- [Curve mode][COMPILE_CURVE_MODE] - used for SNARK curves like BN254 / BLS curves / Grumpkin / etc. Curve mode is built upon field mode, so it includes everything that field mode does. It also includes curve operations / MSM / ECNTT / G2 and other curve-related primitives.
:::info
If you only want to use curve's scalar/base field, you still need to go with a curve mode. You can disable MSM with [options](#compilation-options)
If you only want to use a curve's scalar or base field, you still need to use curve mode. You can disable MSM with [options](#compilation-options)
:::
### Compiling for a field
ICICLE supports the following STARK fields:
- [BabyBear](https://eprint.iacr.org/2023/824.pdf)
Field mode includes:
- [Field arithmetic](https://github.com/ingonyama-zk/icicle/blob/main/icicle/include/fields/field.cuh) - field multiplication, addition, subtraction
- [NTT](icicle/primitives/ntt) - FFT / iFFT
- [Poseidon Hash](icicle/primitives/poseidon)
- [Vector operations](https://github.com/ingonyama-zk/icicle/blob/main/icicle/include/vec_ops/vec_ops.cuh)
- [Polynomial](#) - structs and methods to work with polynomials
You can compile ICICLE for a STARK field using this command:
You can compile ICICLE for a field using this command:
```sh
cd icicle
@@ -38,24 +58,10 @@ cmake -DFIELD=<FIELD> -S . -B build
cmake --build build -j
```
Icicle Supports the following `<FIELD>` FIELDS:
- `babybear`
This command will output `libingo_field_<FIELD>.a` into `build/lib`.
### Compiling for a curve
ICICLE supports the following SNARK curves:
- [BN254](https://neuromancer.sk/std/bn/bn254)
- [BLS12-377](https://neuromancer.sk/std/bls/BLS12-377)
- [BLS12-381](https://neuromancer.sk/std/bls/BLS12-381)
- [BW6-761](https://eprint.iacr.org/2020/351)
- Grumpkin
Curve mode includes everything you can find in field mode with addition of:
- [MSM](icicle/primitives/msm) - MSM / Batched MSM
- [ECNTT](#)
:::note
Field related primitives will be compiled for the scalar field of the curve
@@ -81,7 +87,7 @@ There exist multiple options that allow you to customize your build or enable ad
#### EXT_FIELD
Used only in a [field mode](#compiling-for-a-field) to add Extension field into a build. Adds NTT for the extension field.
Used only in [field mode][COMPILE_FIELD_MODE] to add an Extension field. Adds all supported field operations for the extension field.
Default: `OFF`
@@ -89,7 +95,7 @@ Usage: `-DEXT_FIELD=ON`
#### G2
Used only in a [curve mode](#compiling-for-a-curve) to add G2 definitions into a build. Also adds G2 MSM.
Used only in [curve mode][COMPILE_CURVE_MODE] to add G2 definitions. Also adds G2 MSM.
Default: `OFF`
@@ -97,7 +103,7 @@ Usage: `-DG2=ON`
#### ECNTT
Used only in a [curve mode](#compiling-for-a-curve) to add ECNTT function into a build.
Used only in [curve mode][COMPILE_CURVE_MODE] to add the ECNTT function.
Default: `OFF`
@@ -105,7 +111,7 @@ Usage: `-DECNTT=ON`
#### MSM
Used only in a [curve mode](#compiling-for-a-curve) to add MSM function into a build. As MSM takes a lot of time to build, you can disable it with this option to reduce compilation time.
Used only in [curve mode][COMPILE_CURVE_MODE] to add the MSM function. As MSM takes a long time to build, you can disable it with this option to reduce compilation time.
Default: `ON`
@@ -149,14 +155,13 @@ To link ICICLE with your project you first need to compile ICICLE with options o
Refer to our [c++ examples](https://github.com/ingonyama-zk/icicle/tree/main/examples/c%2B%2B) for more info. Take a look at this [CMakeLists.txt](https://github.com/ingonyama-zk/icicle/blob/main/examples/c%2B%2B/msm/CMakeLists.txt#L22)
## Writing new bindings for ICICLE
Since ICICLE Core is written in CUDA / C++, it's really simple to generate static libraries. These static libraries can be installed on any system and called by higher level languages such as Golang.
Static libraries can be loaded into memory once and used by multiple programs, reducing memory usage and potentially improving performance. They also allow you to separate functionality into distinct modules so your static library may need to compile only specific features that you want to use.
Let's review the [Golang bindings](golang-bindings.md) since it's a pretty verbose example (compared to Rust, which hides it pretty well) of using static libraries. Golang has a library named `CGO` which can be used to link static libraries. Here's a basic example of how you can use cgo to link these libraries:
Let's review the [Golang bindings][GOLANG_BINDINGS] since it's a pretty verbose example (compared to Rust, which hides it pretty well) of using static libraries. Golang has a library named `CGO` which can be used to link static libraries. Here's a basic example of how you can use cgo to link these libraries:
```go
/*
@@ -178,4 +183,14 @@ func main() {
The comments on the first line tell `CGO` which libraries to import as well as which header files to include. You can then call methods which are part of the static library and defined in the header file, `C.projective_from_affine_bn254` is an example.
If you wish to create your own bindings for a language of your choice we suggest you start by investigating how you can call static libraries.
If you wish to create your own bindings for a language of your choice we suggest you start by investigating how you can call static libraries.
<!-- Begin Links -->
[GOLANG_BINDINGS]: golang-bindings.md
[COMPILE_CURVE_MODE]: #compiling-for-a-curve
[COMPILE_FIELD_MODE]: #compiling-for-a-field
[NTT_DOCS]: primitives/ntt
[MSM_DOCS]: primitives/msm
[POLY_DOCS]: polynomials/overview
[VECOPS_CODE]: https://github.com/ingonyama-zk/icicle/blob/main/icicle/include/vec_ops/vec_ops.cuh
<!-- End Links -->

View File

@@ -1,7 +1,7 @@
# Golang bindings
Golang bindings allow you to use ICICLE as a golang library.
The source code for all Golang libraries can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang).
The source code for all Golang packages can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang).
The Golang bindings are comprised of multiple packages.
@@ -9,7 +9,7 @@ The Golang bindings are comprised of multiple packages.
[`cuda-runtime`](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang/cuda_runtime) which defines abstractions for CUDA methods for allocating memory, initializing and managing streams, and `DeviceContext` which enables users to define and keep track of devices.
Each curve has its own package which you can find [here](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang/curves). If your project uses BN254 you only need to install that single package named [`bn254`](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang/curves/bn254).
Each supported curve, field, and hash has its own package which you can find in the respective directories [here](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang). If your project uses BN254 you only need to import that single package named [`bn254`](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang/curves/bn254).
## Using ICICLE Golang bindings in your project
@@ -31,22 +31,30 @@ For a specific commit
go get github.com/ingonyama-zk/icicle@<commit_id>
```
To build the shared libraries you can run this script:
To build the shared libraries you can run [this](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang/build.sh) script:
```bash
./build.sh [-curve=<curve> | -field=<field>] [-cuda_version=<version>] [-g2] [-ecntt] [-devmode]
```sh
./build.sh [-curve=<curve>] [-field=<field>] [-hash=<hash>] [-cuda_version=<version>] [-g2] [-ecntt] [-devmode]
curve - The name of the curve to build or "all" to build all supported curves
field - The name of the field to build or "all" to build all supported fields
hash - The name of the hash to build or "all" to build all supported hashes
-g2 - Optional - build with G2 enabled
-ecntt - Optional - build with ECNTT enabled
-devmode - Optional - build in devmode
-help - Optional - Displays usage information
```
- **`curve`** - The name of the curve to build or "all" to build all curves
- **`field`** - The name of the field to build or "all" to build all fields
- **`g2`** - Optional - build with G2 enabled
- **`ecntt`** - Optional - build with ECNTT enabled
- **`devmode`** - Optional - build in devmode
- Usage can be displayed with the flag `-help`
:::note
If more than one curve or more than one field or more than one hash is supplied, the last one supplied will be built
:::
To build ICICLE libraries for all supported curves with G2 and ECNTT enabled:
```bash
./build.sh all -g2 -ecntt
./build.sh -curve=all -g2 -ecntt
```
If you wish to build for a specific curve, for example bn254, without G2 or ECNTT enabled:
@@ -73,11 +81,9 @@ import (
To run all tests, for all curves:
```bash
go test --tags=g2 ./... -count=1
go test ./... -count=1
```
If you dont want to include g2 tests then drop `--tags=g2`.
If you wish to run test for a specific curve:
```bash
@@ -106,3 +112,25 @@ func main() {
```
Replace `/path/to/shared/libs` with the actual path where the shared libraries are located on your system.
## Supported curves, fields and operations
### Supported curves and operations
| Operation\Curve | bn254 | bls12_377 | bls12_381 | bw6-761 | grumpkin |
| --- | :---: | :---: | :---: | :---: | :---: |
| MSM | ✅ | ✅ | ✅ | ✅ | ✅ |
| G2 | ✅ | ✅ | ✅ | ✅ | ❌ |
| NTT | ✅ | ✅ | ✅ | ✅ | ❌ |
| ECNTT | ✅ | ✅ | ✅ | ✅ | ❌ |
| VecOps | ✅ | ✅ | ✅ | ✅ | ✅ |
| Polynomials | ✅ | ✅ | ✅ | ✅ | ❌ |
### Supported fields and operations
| Operation\Field | babybear |
| --- | :---: |
| VecOps | ✅ |
| Polynomials | ✅ |
| NTT | ✅ |
| Extension Field | ✅ |

View File

@@ -1,9 +1,5 @@
# ECNTT
### Supported curves
`bls12-377`, `bls12-381`, `bn254`
## ECNTT Method
The `ECNtt[T any]()` function performs the Elliptic Curve Number Theoretic Transform (EC-NTT) on the input points slice, using the provided dir (direction), cfg (configuration), and stores the results in the results slice.
@@ -12,14 +8,13 @@ The `ECNtt[T any]()` function performs the Elliptic Curve Number Theoretic Trans
func ECNtt[T any](points core.HostOrDeviceSlice, dir core.NTTDir, cfg *core.NTTConfig[T], results core.HostOrDeviceSlice) core.IcicleError
```
### Parameters:
### Parameters
- **`points`**: A slice of elliptic curve points (in projective coordinates) that will be transformed. The slice can be stored on the host or the device, as indicated by the `core.HostOrDeviceSlice` type.
- **`dir`**: The direction of the EC-NTT transform, either `core.KForward` or `core.KInverse`.
- **`cfg`**: A pointer to an `NTTConfig` object, containing configuration options for the NTT operation.
- **`results`**: A slice that will store the transformed elliptic curve points (in projective coordinates). The slice can be stored on the host or the device, as indicated by the `core.HostOrDeviceSlice` type.
### Return Value
- **`CudaError`**: A `core.IcicleError` value, which will be `core.IcicleErrorCode(0)` if the EC-NTT operation was successful, or an error if something went wrong.
@@ -94,4 +89,4 @@ func Main() {
panic("ECNTT operation failed")
}
}
```
```

View File

@@ -0,0 +1,94 @@
# Keccak
## Keccak Example
```go
package main
import (
"encoding/hex"
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/hash/keccak"
)
func createHostSliceFromHexString(hexString string) core.HostSlice[uint8] {
byteArray, err := hex.DecodeString(hexString)
if err != nil {
panic("Not a hex string")
}
return core.HostSliceFromElements([]uint8(byteArray))
}
func main() {
input := createHostSliceFromHexString("1725b6")
outHost256 := make(core.HostSlice[uint8], 32)
cfg := keccak.GetDefaultHashConfig()
e := keccak.Keccak256(input, int32(input.Len()), 1, outHost256, &cfg)
if e.CudaErrorCode != cr.CudaSuccess {
panic("Keccak256 hashing failed")
}
outHost512 := make(core.HostSlice[uint8], 64)
e = keccak.Keccak512(input, int32(input.Len()), 1, outHost512, &cfg)
if e.CudaErrorCode != cr.CudaSuccess {
panic("Keccak512 hashing failed")
}
numberOfBlocks := 3
outHostBatch256 := make(core.HostSlice[uint8], 32*numberOfBlocks)
e = keccak.Keccak256(input, int32(input.Len()/numberOfBlocks), int32(numberOfBlocks), outHostBatch256, &cfg)
if e.CudaErrorCode != cr.CudaSuccess {
panic("Keccak256 batch hashing failed")
}
}
```
## Keccak Methods
```go
func Keccak256(input core.HostOrDeviceSlice, inputBlockSize, numberOfBlocks int32, output core.HostOrDeviceSlice, config *HashConfig) core.IcicleError
func Keccak512(input core.HostOrDeviceSlice, inputBlockSize, numberOfBlocks int32, output core.HostOrDeviceSlice, config *HashConfig) core.IcicleError
```
### Parameters
- **`input`**: A slice containing the input data for the Keccak256 hash function. It can reside in either host memory or device memory.
- **`inputBlockSize`**: An integer specifying the size of the input data for a single hash.
- **`numberOfBlocks`**: An integer specifying the number of results in the hash batch.
- **`output`**: A slice where the resulting hash will be stored. This slice can be in host or device memory.
- **`config`**: A pointer to a `HashConfig` object, which contains various configuration options for the Keccak256 operation.
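
To make the relationship between `inputBlockSize`, `numberOfBlocks` and the output size concrete, here is a CPU-only sketch of the buffer layout. It assumes the batch is laid out as contiguous blocks, as the example above suggests. `hashBatch` is a hypothetical helper and not part of ICICLE, and SHA-256 from the Go standard library stands in for Keccak256 only because it also produces 32-byte digests.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// hashBatch is a hypothetical helper, not part of ICICLE. It splits a flat
// input into numberOfBlocks contiguous chunks of inputBlockSize bytes and
// writes one 32-byte digest per chunk into a flat output slice.
// SHA-256 is only a stand-in for Keccak256 to illustrate the layout.
func hashBatch(input []byte, inputBlockSize, numberOfBlocks int) ([]byte, error) {
	if inputBlockSize*numberOfBlocks != len(input) {
		return nil, fmt.Errorf("input has %d bytes, want %d", len(input), inputBlockSize*numberOfBlocks)
	}
	const digestSize = sha256.Size // 32 bytes, the same length as a Keccak256 digest
	output := make([]byte, digestSize*numberOfBlocks)
	for i := 0; i < numberOfBlocks; i++ {
		block := input[i*inputBlockSize : (i+1)*inputBlockSize]
		digest := sha256.Sum256(block)
		copy(output[i*digestSize:], digest[:])
	}
	return output, nil
}

func main() {
	input := []byte{0x17, 0x25, 0xb6} // same bytes as the "1725b6" example
	out, err := hashBatch(input, 1, 3)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(out)) // 96 = 32 bytes * 3 blocks
}
```

The output slice therefore needs `32*numberOfBlocks` bytes for the 256-bit variant and `64*numberOfBlocks` bytes for the 512-bit variant.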
### Return Value
- **`CudaError`**: Returns a CUDA error code indicating the success or failure of the Keccak256/Keccak512 operation.
## HashConfig
The `HashConfig` structure holds configuration parameters for the Keccak256/Keccak512 operation, allowing customization of its behavior to optimize performance based on the specifics of the operation or the underlying hardware.
```go
type HashConfig struct {
Ctx cr.DeviceContext
areInputsOnDevice bool
areOutputsOnDevice bool
IsAsync bool
}
```
### Fields
- **`Ctx`**: Device context containing details like device id and stream.
- **`areInputsOnDevice`**: Indicates if input data is located on the device.
- **`areOutputsOnDevice`**: Indicates if output hash is stored on the device.
- **`IsAsync`**: If true, runs the Keccak256/Keccak512 operation asynchronously.
### Default Configuration
Use `GetDefaultHashConfig` to obtain a default configuration, which can then be customized as needed.
```go
func GetDefaultHashConfig() HashConfig
```

View File

@@ -2,15 +2,11 @@
To understand the theory behind MSM pre computation technique refer to Niall Emmart's [talk](https://youtu.be/KAWlySN7Hm8?feature=shared&t=1734).
### Supported curves
`bls12-377`, `bls12-381`, `bn254`, `bw6-761`, `grumpkin`
## Core package
## MSM `PrecomputeBases`
### MSM PrecomputePoints
`PrecomputeBases` and `G2PrecomputeBases` exists for all supported curves.
`PrecomputePoints` and `G2PrecomputePoints` exist for all supported curves.
#### Description
@@ -18,21 +14,20 @@ This function extends each provided base point $(P)$ with its multiples $(2^lP,
The precomputation process is crucial for optimizing MSM operations, especially when dealing with large sets of points and scalars. By precomputing and storing multiples of the base points, the MSM function can more efficiently compute the scalar-point multiplications.
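
The idea can be sketched on the CPU with a toy group. In the sketch below, integers under addition modulo a small prime stand in for curve points, and all names (`precompute`, `mulWithTable`, `modulus`) are illustrative assumptions rather than ICICLE code. It shows how storing $P, 2^lP, 2^{2l}P, \dots$ lets a scalar be split into $l$-bit chunks that are each multiplied against a ready-made multiple.

```go
package main

import "fmt"

const modulus = 1_000_003 // toy group: integers under addition modulo a prime

// add stands in for elliptic-curve point addition.
func add(a, b uint64) uint64 { return (a + b) % modulus }

// mulSmall computes k*p using only additions (double-and-add).
func mulSmall(k, p uint64) uint64 {
	var acc uint64
	for ; k > 0; k >>= 1 {
		if k&1 == 1 {
			acc = add(acc, p)
		}
		p = add(p, p)
	}
	return acc
}

// precompute returns [P, 2^l P, 2^(2l) P, ...] with `factor` entries.
func precompute(p uint64, l, factor int) []uint64 {
	out := make([]uint64, factor)
	out[0] = p % modulus
	for j := 1; j < factor; j++ {
		q := out[j-1]
		for i := 0; i < l; i++ {
			q = add(q, q) // l doublings give 2^l times the previous entry
		}
		out[j] = q
	}
	return out
}

// mulWithTable computes s*P from the table by splitting s into l-bit chunks.
// It assumes s fits in l*len(table) bits.
func mulWithTable(s uint64, table []uint64, l int) uint64 {
	mask := uint64(1)<<l - 1
	var acc uint64
	for j, q := range table {
		chunk := (s >> (uint(j) * uint(l))) & mask
		acc = add(acc, mulSmall(chunk, q))
	}
	return acc
}

func main() {
	const p, s = 12345, 0xBEEF // 16-bit scalar, split into four 4-bit chunks
	table := precompute(p, 4, 4)
	fmt.Println(mulWithTable(s, table, 4) == mulSmall(s, p)) // true
}
```

Each chunk multiplication only involves an `l`-bit scalar, which is the work saved at the cost of storing `factor` times as many points.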
#### `PrecomputeBases`
#### `PrecomputePoints`
Precomputes bases for MSM by extending each base point with its multiples.
Precomputes points for MSM by extending each base point with its multiples.
```go
func PrecomputeBases(points core.HostOrDeviceSlice, precomputeFactor int32, c int32, ctx *cr.DeviceContext, outputBases core.DeviceSlice) cr.CudaError
func PrecomputePoints(points core.HostOrDeviceSlice, msmSize int, cfg *core.MSMConfig, outputBases core.DeviceSlice) cr.CudaError
```
##### Parameters
- **`points`**: A slice of the original affine points to be extended with their multiples.
- **`precomputeFactor`**: Determines the total number of points to precompute for each base point.
- **`c`**: Currently unused; reserved for future compatibility.
- **`ctx`**: CUDA device context specifying the execution environment.
- **`outputBases`**: The device slice allocated for storing the extended bases.
- **`msmSize`**: The size of a single MSM, used to determine optimal parameters.
- **`cfg`**: The MSM configuration parameters.
- **`outputBases`**: The device slice allocated for storing the extended points.
##### Example
@@ -54,28 +49,27 @@ func main() {
var precomputeOut core.DeviceSlice
precomputeOut.Malloc(points[0].Size()*points.Len()*int(precomputeFactor), points[0].Size())
err := bn254.PrecomputeBases(points, precomputeFactor, 0, &cfg.Ctx, precomputeOut)
err := bn254.PrecomputePoints(points, 1024, &cfg, precomputeOut)
if err != cr.CudaSuccess {
log.Fatalf("PrecomputePoints failed: %v", err)
}
}
```
#### `G2PrecomputeBases`
#### `G2PrecomputePoints`
This method is the same as `PrecomputeBases` but for G2 points. Extends each G2 curve base point with its multiples for optimized MSM computations.
This method is the same as `PrecomputePoints` but for G2 points. Extends each G2 curve base point with its multiples for optimized MSM computations.
```go
func G2PrecomputeBases(points core.HostOrDeviceSlice, precomputeFactor int32, c int32, ctx *cr.DeviceContext, outputBases core.DeviceSlice) cr.CudaError
func G2PrecomputePoints(points core.HostOrDeviceSlice, msmSize int, cfg *core.MSMConfig, outputBases core.DeviceSlice) cr.CudaError
```
##### Parameters
- **`points`**: A slice of G2 curve points to be extended.
- **`precomputeFactor`**: The total number of points to precompute for each base.
- **`c`**: Reserved for future use to ensure compatibility with MSM operations.
- **`ctx`**: Specifies the CUDA device context for execution.
- **`outputBases`**: Allocated device slice for the extended bases.
- **`points`**: A slice of the original affine points to be extended with their multiples.
- **`msmSize`**: The size of a single MSM, used to determine optimal parameters.
- **`cfg`**: The MSM configuration parameters.
- **`outputBases`**: The device slice allocated for storing the extended points.
##### Example
@@ -97,20 +91,9 @@ func main() {
var precomputeOut core.DeviceSlice
precomputeOut.Malloc(points[0].Size()*points.Len()*int(precomputeFactor), points[0].Size())
err := g2.G2PrecomputeBases(points, precomputeFactor, 0, &cfg.Ctx, precomputeOut)
err := g2.G2PrecomputePoints(points, 1024, &cfg, precomputeOut)
if err != cr.CudaSuccess {
log.Fatalf("G2PrecomputePoints failed: %v", err)
}
}
```
### Benchmarks
Benchmarks were performed on an NVIDIA RTX 3090 Ti.
| Pre-computation factor | bn254 size `2^20` MSM, ms. | bn254 size `2^12` MSM, size `2^10` batch, ms. | bls12-381 size `2^20` MSM, ms. | bls12-381 size `2^12` MSM, size `2^10` batch, ms. |
| ------------- | ------------- | ------------- | ------------- | ------------- |
| 1 | 14.1 | 82.8 | 25.5 | 136.7 |
| 2 | 11.8 | 76.6 | 20.3 | 123.8 |
| 4 | 10.9 | 73.8 | 18.1 | 117.8 |
| 8 | 10.6 | 73.7 | 17.2 | 116.0 |

View File

@@ -1,10 +1,5 @@
# MSM
### Supported curves
`bls12-377`, `bls12-381`, `bn254`, `bw6-761`, `grumpkin`
## MSM Example
```go
@@ -13,12 +8,13 @@ package main
import (
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
bn254 "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254"
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254"
bn254_msm "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254/msm"
)
func main() {
// Obtain the default MSM configuration.
cfg := bn254.GetDefaultMSMConfig()
cfg := core.GetDefaultMSMConfig()
// Define the size of the problem, here 2^18.
size := 1 << 18
@@ -44,7 +40,7 @@ func main() {
cfg.IsAsync = true
// Perform the MSM operation.
e = bn254.Msm(scalars, points, &cfg, out)
e = bn254_msm.Msm(scalars, points, &cfg, out)
if e != cr.CudaSuccess {
panic(e)
@@ -124,10 +120,9 @@ Use `GetDefaultMSMConfig` to obtain a default configuration, which can then be c
func GetDefaultMSMConfig() MSMConfig
```
## How do I toggle between the supported algorithms?
When creating your MSM Config you may state which algorithm you wish to use. `cfg.Ctx.IsBigTriangle = true` will activate Large triangle accumulation and `cfg.Ctx.IsBigTriangle = false` will activate Bucket accumulation.
When creating your MSM Config you may state which algorithm you wish to use. `cfg.Ctx.IsBigTriangle = true` will activate Large triangle reduction and `cfg.Ctx.IsBigTriangle = false` will activate iterative reduction.
```go
...
@@ -157,17 +152,19 @@ out.Malloc(batchSize*p.Size(), p.Size())
...
```
## Parameters for optimal performance
Please refer to the [primitive description](../primitives/msm#choosing-optimal-parameters)
## Support for G2 group
To activate G2 support first you must make sure you are building the static libraries with G2 feature enabled as described in the [Golang building instructions](../golang-bindings.md#using-icicle-golang-bindings-in-your-project).
Now you may import `g2` package of the specified curve.
```go
import (
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bls254/g2"
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254/g2"
)
```
@@ -183,7 +180,7 @@ import (
)
func main() {
cfg := bn254.GetDefaultMSMConfig()
cfg := core.GetDefaultMSMConfig()
size := 1 << 12
batchSize := 3
totalSize := size * batchSize

View File

@@ -2,8 +2,7 @@
To learn more about the theory of Multi GPU programming refer to [this part](../multi-gpu.md) of documentation.
Here we will cover the core multi GPU apis and a [example](#a-multi-gpu-example)
Here we will cover the core multi GPU APIs and an [example](#a-multi-gpu-example).
## A Multi GPU example
@@ -13,7 +12,6 @@ In this example we will display how you can
2. For every GPU launch a thread and set an active device per thread.
3. Execute a MSM on each GPU
```go
package main
@@ -79,13 +77,13 @@ To streamline device management we offer as part of `cuda_runtime` package metho
Runs a given function on a specific GPU device, ensuring that all CUDA calls within the function are executed on the selected device.
In Go, most concurrency can be done via Goroutines. However, there is no guarantee that a goroutine stays on a specific host thread.
In Go, most concurrency can be done via Goroutines. However, there is no guarantee that a goroutine stays on a specific host thread.
`RunOnDevice` was designed to solve this caveat and insure that the goroutine will stay on a specific host thread.
`RunOnDevice` was designed to solve this caveat and ensure that the goroutine will stay on a specific host thread.
`RunOnDevice` will lock a goroutine into a specific host thread, sets a current GPU device, runs a provided function, and unlocks the goroutine from the host thread after the provided function finishes.
`RunOnDevice` locks a goroutine into a specific host thread, sets a current GPU device, runs a provided function, and unlocks the goroutine from the host thread after the provided function finishes.
While the goroutine is locked to the host thread, the Go runtime will not assign other goroutine's to that host thread.
While the goroutine is locked to the host thread, the Go runtime will not assign other goroutines to that host thread.
**Parameters:**
@@ -96,7 +94,10 @@ While the goroutine is locked to the host thread, the Go runtime will not assign
**Behavior:**
- The function `funcToRun` is executed in a new goroutine that is locked to a specific OS thread to ensure that all CUDA calls within the function target the specified device.
- It's important to note that any goroutines launched within `funcToRun` are not automatically bound to the same GPU device. If necessary, `RunOnDevice` should be called again within such goroutines with the same `deviceId`.
:::note
Any goroutines launched within `funcToRun` are not automatically bound to the same GPU device. If necessary, `RunOnDevice` should be called again within such goroutines with the same `deviceId`.
:::
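
The thread-pinning pattern described above can be sketched with the Go standard library alone. `runPinned` below is a hypothetical stand-in for `RunOnDevice`, not its actual implementation: it locks a new goroutine to its OS thread, runs the function, and unlocks afterwards. The point where a real implementation would select the GPU is marked with a comment.

```go
package main

import (
	"fmt"
	"runtime"
)

// runPinned is a hypothetical stand-in for RunOnDevice. It starts a goroutine,
// locks it to its OS thread, runs fn with the device ID, and unlocks the
// thread when fn returns. The returned channel is closed once fn has finished.
func runPinned(deviceID int, fn func(deviceID int)) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		// A real implementation would select the GPU here, since the active
		// device is a property of the current OS thread.
		fn(deviceID)
	}()
	return done
}

func main() {
	results := make([]int, 4)
	var pending []<-chan struct{}
	for id := range results {
		pending = append(pending, runPinned(id, func(dev int) {
			results[dev] = dev * dev // placeholder for per-device GPU work
		}))
	}
	for _, done := range pending {
		<-done
	}
	fmt.Println(results) // [0 1 4 9]
}
```

Because the lock applies only to the goroutine that called `runtime.LockOSThread`, goroutines started inside `fn` run on other threads, which is why they need their own pinning.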
**Example:**
@@ -111,6 +112,10 @@ RunOnDevice(0, func(args ...any) {
Sets the active device for the current host thread. All subsequent CUDA calls made from this thread will target the specified device.
:::warning
This function should not be used directly in conjunction with goroutines. If you want to run multi-GPU scenarios with goroutines, you should use [RunOnDevice](#runondevice).
:::
**Parameters:**
- **`device int`**: The ID of the device to set as the current device.
@@ -147,4 +152,4 @@ Retrieves the device associated with a given pointer.
- **`int`**: The device ID associated with the memory pointed to by `ptr`.
This documentation should provide a clear understanding of how to effectively manage multiple GPUs in Go applications using CUDA, with a particular emphasis on the `RunOnDevice` function for executing tasks on specific GPUs.
This documentation should provide a clear understanding of how to effectively manage multiple GPUs in Go applications using CUDA, with a particular emphasis on the `RunOnDevice` function for executing tasks on specific GPUs.

View File

@@ -1,58 +1,54 @@
# NTT
### Supported curves
`bls12-377`, `bls12-381`, `bn254`, `bw6-761`
## NTT Example
```go
package main
import (
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
bn254 "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254"
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
bn254 "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254"
"github.com/consensys/gnark-crypto/ecc/bn254/fr/fft"
"github.com/consensys/gnark-crypto/ecc/bn254/fr/fft"
)
func init() {
cfg := bn254.GetDefaultNttConfig()
initDomain(18, cfg)
cfg := bn254.GetDefaultNttConfig()
initDomain(18, cfg)
}
func initDomain[T any](largestTestSize int, cfg core.NTTConfig[T]) core.IcicleError {
rouMont, _ := fft.Generator(uint64(1 << largestTestSize))
rou := rouMont.Bits()
rouIcicle := bn254.ScalarField{}
rouMont, _ := fft.Generator(uint64(1 << largestTestSize))
rou := rouMont.Bits()
rouIcicle := bn254.ScalarField{}
rouIcicle.FromLimbs(rou[:])
e := bn254.InitDomain(rouIcicle, cfg.Ctx, false)
return e
rouIcicle.FromLimbs(rou[:])
e := bn254.InitDomain(rouIcicle, cfg.Ctx, false)
return e
}
func main() {
// Obtain the default NTT configuration with a predefined coset generator.
cfg := bn254.GetDefaultNttConfig()
// Obtain the default NTT configuration with a predefined coset generator.
cfg := bn254.GetDefaultNttConfig()
// Define the size of the input scalars.
size := 1 << 18
// Define the size of the input scalars.
size := 1 << 18
// Generate scalars for the NTT operation.
scalars := bn254.GenerateScalars(size)
// Generate scalars for the NTT operation.
scalars := bn254.GenerateScalars(size)
// Set the direction of the NTT (forward or inverse).
dir := core.KForward
// Set the direction of the NTT (forward or inverse).
dir := core.KForward
// Allocate memory for the results of the NTT operation.
results := make(core.HostSlice[bn254.ScalarField], size)
// Allocate memory for the results of the NTT operation.
results := make(core.HostSlice[bn254.ScalarField], size)
// Perform the NTT operation.
err := bn254.Ntt(scalars, dir, &cfg, results)
if err.CudaErrorCode != cr.CudaSuccess {
panic("NTT operation failed")
}
// Perform the NTT operation.
err := bn254.Ntt(scalars, dir, &cfg, results)
if err.CudaErrorCode != cr.CudaSuccess {
panic("NTT operation failed")
}
}
```
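
For intuition about what the forward and inverse directions compute, here is a naive $O(n^2)$ NTT over a toy prime field. This is a CPU-only sketch for illustration. The field (integers modulo 17), the root of unity (4) and the function names are assumptions chosen for readability, and the code is unrelated to how ICICLE implements NTT on the GPU.

```go
package main

import "fmt"

const p = 17 // toy prime; 4 is a primitive 4th root of unity modulo 17

// powMod computes base^exp mod p by square-and-multiply.
func powMod(base, exp uint64) uint64 {
	result := uint64(1)
	base %= p
	for ; exp > 0; exp >>= 1 {
		if exp&1 == 1 {
			result = result * base % p
		}
		base = base * base % p
	}
	return result
}

// naiveNTT evaluates out[k] = sum_j in[j] * root^(j*k) mod p.
func naiveNTT(in []uint64, root uint64) []uint64 {
	n := uint64(len(in))
	out := make([]uint64, n)
	for k := uint64(0); k < n; k++ {
		var sum uint64
		for j := uint64(0); j < n; j++ {
			sum = (sum + in[j]%p*powMod(root, j*k)) % p
		}
		out[k] = sum
	}
	return out
}

// naiveINTT inverts naiveNTT: transform with the inverse root, then scale by n^-1.
func naiveINTT(in []uint64, root uint64) []uint64 {
	invRoot := powMod(root, p-2) // inverse via Fermat's little theorem
	invN := powMod(uint64(len(in)), p-2)
	out := naiveNTT(in, invRoot)
	for i := range out {
		out[i] = out[i] * invN % p
	}
	return out
}

func main() {
	coeffs := []uint64{1, 2, 3, 4}
	evals := naiveNTT(coeffs, 4)
	back := naiveINTT(evals, 4)
	fmt.Println(evals, back) // [10 7 15 6] [1 2 3 4]
}
```

The root of unity plays the same role as the value passed to `InitDomain` in the example above, just for a much smaller field and domain.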
@@ -146,10 +142,10 @@ import (
)
func example() {
cfg := GetDefaultNttConfig()
err := ReleaseDomain(cfg.Ctx)
if err != nil {
// Handle the error
}
cfg := GetDefaultNttConfig()
err := ReleaseDomain(cfg.Ctx)
if err != nil {
// Handle the error
}
}
```

View File

@@ -1,12 +1,14 @@
# Vector Operations
## Overview
Icicle is exposing a number of vector operations which a user can control:
Icicle exposes a number of vector operations which a user can use:
* The VecOps API provides efficient vector operations such as addition, subtraction, and multiplication.
* MatrixTranspose API allows a user to perform a transpose on a vector representation of a matrix
## VecOps API Documentation
### Example
#### Vector addition
@@ -183,4 +185,4 @@ if err.IcicleErrorCode != core.IcicleErrorCode(0) {
// ...
```
In this example, the `TransposeMatrix` function is used to transpose a 5x4 matrix stored in a 1D slice. The input and output slices are stored on the host (CPU), and the operation is executed synchronously.
In this example, the `TransposeMatrix` function is used to transpose a 5x4 matrix stored in a 1D slice. The input and output slices are stored on the host (CPU), and the operation is executed synchronously.
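
As a reference for the index mapping, the following CPU-only sketch transposes a row-major 5x4 matrix held in a 1D slice. `transposeRowMajor` is an illustrative helper, not the ICICLE API, and it assumes row-major storage for both input and output.

```go
package main

import "fmt"

// transposeRowMajor is a CPU-only illustration, not the ICICLE API.
// Element (r, c) of the rows x cols input moves to (c, r) of the
// cols x rows output; both matrices are stored row-major in 1D slices.
func transposeRowMajor(in []int, rows, cols int) []int {
	out := make([]int, len(in))
	for r := 0; r < rows; r++ {
		for c := 0; c < cols; c++ {
			out[c*rows+r] = in[r*cols+c]
		}
	}
	return out
}

func main() {
	in := make([]int, 5*4)
	for i := range in {
		in[i] = i
	}
	out := transposeRowMajor(in, 5, 4)
	fmt.Println(out[:5]) // first row of the 4x5 result: [0 4 8 12 16]
}
```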

View File

@@ -165,7 +165,36 @@ cargo bench
#### ICICLE Golang
Golang is WIP in v1, coming soon. Please checkout a previous [release v0.1.0](https://github.com/ingonyama-zk/icicle/releases/tag/v0.1.0) for golang bindings.
The Golang bindings require compiling ICICLE Core first. We supply a [build script](https://github.com/ingonyama-zk/icicle/blob/main/wrappers/golang/build.sh) to help build what you need.
Script usage:
```sh
./build.sh [-curve=<curve>] [-field=<field>] [-hash=<hash>] [-cuda_version=<version>] [-g2] [-ecntt] [-devmode]
curve - The name of the curve to build or "all" to build all supported curves
field - The name of the field to build or "all" to build all supported fields
hash - The name of the hash to build or "all" to build all supported hashes
-g2 - Optional - build with G2 enabled
-ecntt - Optional - build with ECNTT enabled
-devmode - Optional - build in devmode
```
:::note
If more than one curve or more than one field or more than one hash is supplied, the last one supplied will be built
:::
Once the library has been built, you can use and test the Golang bindings.
To test a specific curve, field or hash, change to its directory and then run:
```sh
go test ./tests -count=1 -failfast -timeout 60m -p 2 -v
```
You will be able to see each test that runs, how long it takes, and whether it passed or failed.
### Running ICICLE examples
@@ -185,8 +214,8 @@ Read through the compile.sh and CMakeLists.txt to understand how to link your ow
:::
#### Running with Docker
In each example directory, ZK-container files are located in a subdirectory `.devcontainer`.
```sh
@@ -215,4 +244,4 @@ Inside the container you can run the same commands:
./run.sh
```
You can now experiment with our other examples, perhaps try to run a rust or golang example next.
You can now experiment with our other examples, perhaps try to run a rust or golang example next.

View File

@@ -2,7 +2,7 @@
:::info
If you are looking for the Multi GPU API documentation refer here for [Rust](./rust-bindings/multi-gpu.md).
If you are looking for the Multi GPU API documentation refer [here](./rust-bindings/multi-gpu.md) for Rust and [here](./golang-bindings/multi-gpu.md) for Golang.
:::
@@ -10,12 +10,11 @@ One common challenge with Zero-Knowledge computation is managing the large input
Multi-GPU programming involves developing software to operate across multiple GPU devices. Let's first explore different approaches to Multi-GPU programming, then we will cover how ICICLE allows you to easily develop your ZK computations to run across many GPUs.
## Approaches to Multi GPU programming
There are many [different strategies](https://github.com/NVIDIA/multi-gpu-programming-models) available for implementing multi GPU, however, it can be split into two categories.
### GPU Server approach
### GPU Server approach
This approach usually involves a single or multiple CPUs opening threads to read / write from multiple GPUs. You can think about it as a scaled up HOST - Device model.
@@ -23,8 +22,7 @@ This approach usually involves a single or multiple CPUs opening threads to read
This approach won't let us tackle larger computation sizes but it will allow us to compute multiple computations which we wouldn't be able to load onto a single GPU.
For example let's say that you had to compute two MSMs of size 2^26 on a 16GB VRAM GPU you would normally have to perform them asynchronously. However, if you double the number of GPUs in your system you can now run them in parallel.
For example let's say that you had to compute two MSMs of size 2^26 on a 16GB VRAM GPU you would normally have to perform them asynchronously. However, if you double the number of GPUs in your system you can now run them in parallel.
### Inter GPU approach
@@ -32,18 +30,17 @@ This approach involves a more sophisticated approach to multi GPU computation. U
This approach requires redesigning the algorithm at the software level to be compatible with splitting amongst devices. In some cases, to lower latency to a minimum, special inter GPU connections would be installed on a server to allow direct communication between multiple GPUs.
# Writing ICICLE Code for Multi GPUs
## Writing ICICLE Code for Multi GPUs
The approach we have taken for the moment is a GPU Server approach; we assume you have a machine with multiple GPUs and you wish to run some computation on each GPU.
To dive deeper and learn about the API check out the docs for our different ICICLE API
- [Rust Multi GPU APIs](./rust-bindings/multi-gpu.md)
- [Golang Multi GPU APIs](./golang-bindings/multi-gpu.md)
- C++ Multi GPU APIs
## Best practices
## Best practices
- Never hardcode device IDs. If you want your software to take advantage of all GPUs on a machine, use methods such as `get_device_count` to support an arbitrary number of GPUs.
@@ -57,7 +54,7 @@ Multi GPU support should work with ZK-Containers by simply defining which device
docker run -it --gpus '"device=0,2"' zk-container-image
```
If you wish to expose all GPUs
If you wish to expose all GPUs
```sh
docker run --gpus all zk-container-image

View File

@@ -2,10 +2,6 @@
[![GitHub Release](https://img.shields.io/github/v/release/ingonyama-zk/icicle)](https://github.com/ingonyama-zk/icicle/releases)
[ICICLE](https://github.com/ingonyama-zk/icicle) is a cryptography library for ZK using GPUs. ICICLE implements blazing fast cryptographic primitives such as EC operations, MSM, NTT, Poseidon hash and more on GPU.
ICICLE allows developers with minimal GPU experience to effortlessly accelerate their ZK application; from our experiments, even the most naive implementation may yield 10X improvement in proving times.
@@ -17,28 +13,26 @@ ICICLE has been used by many leading ZK companies such as [Celer Network](https:
We understand that not all developers have access to a GPU and we don't want this to limit anyone from developing with ICICLE.
Here are some ways we can help you gain access to GPUs:
:::note
If none of the following options suit your needs, contact us on [telegram](https://t.me/RealElan) for assistance. We're committed to ensuring that a lack of a GPU doesn't become a bottleneck for you. If you need help with setup or any other issues, we're here to help you.
:::
### Grants
At Ingonyama we are interested in accelerating the progress of ZK and cryptography. If you are an engineer, developer or an academic researcher, we invite you to check out [our grant program](https://www.ingonyama.com/blog/icicle-for-researchers-grants-challenges). We will give you access to GPUs and even pay you to do your dream research!
### Google Colab
This is a great way to get started with ICICLE instantly. Google Colab offers free GPU access to a NVIDIA T4 instance, it's acquired with 16 GB of memory which should be enough for experimenting and even prototyping with ICICLE.
This is a great way to get started with ICICLE instantly. Google Colab offers free GPU access to an NVIDIA T4 instance with 16 GB of memory, which should be enough for experimenting and even prototyping with ICICLE.
For an extensive guide on how to set up Google Colab with ICICLE, refer to [this article](./colab-instructions.md).
If none of these options are appropriate for you, reach out to us on [telegram](https://t.me/RealElan) and we will do our best to help you.
### Vast.ai
[Vast.ai](https://vast.ai/) is a global GPU marketplace where you can rent many different types of GPUs by the hour for [competitive pricing](https://vast.ai/pricing). They provide on-demand and interruptible rentals depending on your need or use case; you can learn more about their rental types [here](https://vast.ai/faq#rental-types).
:::note
If none of these options suit your needs, contact us on [telegram](https://t.me/RealElan) for assistance. We're committed to ensuring that a lack of a GPU doesn't become a bottleneck for you. If you need help with setup or any other issues, we're here to do our best to help you.
:::
## What can you do with ICICLE?
[ICICLE](https://github.com/ingonyama-zk/icicle) can be used in the same way you would use any other cryptography library. While developing and integrating ICICLE into many proof systems, we found some use case categories:

View File

@@ -1,5 +1,9 @@
# Polynomial API Overview
:::note
Read our paper on the Polynomials API in ICICLE v2 by clicking [here](https://eprint.iacr.org/2024/973).
:::
## Introduction
The Polynomial API offers a robust framework for polynomial operations within a computational environment. It's designed for flexibility and efficiency, supporting a broad range of operations like arithmetic, evaluation, and manipulation, all while abstracting from the computation and storage specifics. This enables adaptability to various backend technologies, employing modern C++ practices.
@@ -7,6 +11,7 @@ The Polynomial API offers a robust framework for polynomial operations within a
## Key Features
### Backend Agnostic Architecture
Our API is structured to be independent of any specific computational backend. While a CUDA backend is currently implemented, the architecture facilitates easy integration of additional backends. This capability allows users to perform polynomial operations without the need to tailor their code to specific hardware, enhancing code portability and scalability.
### Templating in the Polynomial API
@@ -27,15 +32,19 @@ In this template:
- **`Image`**: Defines the type of the output values of the polynomial. This is typically the same as the coefficients.
#### Default instantiation
```cpp
extern template class Polynomial<scalar_t>;
```
#### Extended use cases
The templated nature of the Polynomial API also supports more complex scenarios. For example, coefficients and images could be points on an elliptic curve (EC points), which are useful in cryptographic applications and advanced algebraic structures. This approach allows the API to be extended easily to support new algebraic constructions without modifying the core implementation.
### Supported Operations
The Polynomial class encapsulates a polynomial, providing a variety of operations:
- **Construction**: Create polynomials from coefficients or evaluations on roots-of-unity domains.
- **Arithmetic Operations**: Perform addition, subtraction, multiplication, and division.
- **Evaluation**: Directly evaluate polynomials at specific points or across a domain.
@@ -47,6 +56,7 @@ The Polynomial class encapsulates a polynomial, providing a variety of operation
This section outlines how to use the Polynomial API in C++. Bindings for Rust and Go are detailed under the Bindings sections.
### Backend Initialization
Initialization with an appropriate factory is required to configure the computational context and backend.
```cpp
@@ -57,10 +67,12 @@ Initialization with an appropriate factory is required to configure the computat
Polynomial::initialize(std::make_shared<CUDAPolynomialFactory>());
```
:::note Icicle is built to a library per field/curve. Initialization must be done per library. That is, applications linking to multiple curves/fields should do it per curve/field.
:::note
Initialization of a factory must be done per linked curve or field.
:::
### Construction
Polynomials can be constructed from coefficients, from evaluations on roots-of-unity domains, or by cloning existing polynomials.
```cpp
@@ -80,10 +92,11 @@ auto p_cloned = p.clone(); // p_cloned and p do not share memory
```
:::note
The coefficients or evaluations may be allocated either on host or device memory. In both cases the memory is copied to backend device.
The coefficients or evaluations may be allocated either on host or device memory. In both cases the memory is copied to the backend device.
:::
### Arithmetic
Constructed polynomials can be used for various arithmetic operations:
```cpp
@@ -105,7 +118,8 @@ Polynomial operator%(const Polynomial& rhs) const; // returns remainder R(x)
Polynomial divide_by_vanishing_polynomial(uint64_t degree) const; // division by the vanishing polynomial V(x)=X^N-1
```
#### Example:
#### Example
Given polynomials A(x), B(x), C(x) and the vanishing polynomial V(x):
$$
@@ -117,12 +131,14 @@ auto H = (A*B-C).divide_by_vanishing_polynomial(N);
```
### Evaluation
Evaluate polynomials at arbitrary domain points or across a domain.
Evaluate polynomials at arbitrary domain points, across a domain or on a roots-of-unity domain.
```cpp
Image operator()(const Domain& x) const; // evaluate f(x)
void evaluate(const Domain* x, Image* evals /*OUT*/) const;
void evaluate_on_domain(Domain* domain, uint64_t size, Image* evals /*OUT*/) const; // caller allocates memory
void evaluate_on_rou_domain(uint64_t domain_log_size, Image* evals /*OUT*/) const; // caller allocates memory
```
Example:
@@ -136,21 +152,19 @@ uint64_t domain_size = ...;
auto domain = /*build domain*/; // host or device memory
auto evaluations = std::make_unique<scalar_t[]>(domain_size); // can be device memory too
f.evaluate_on_domain(domain, domain_size, evaluations.get()); // pass the raw pointer held by the unique_ptr
// evaluate f(x) on roots of unity domain
uint64_t domain_log_size = ...;
auto evaluations_rou_domain = std::make_unique<scalar_t[]>(1 << domain_log_size); // can be device memory too
f.evaluate_on_rou_domain(domain_log_size, evaluations_rou_domain.get()); // pass the raw pointer held by the unique_ptr
```
:::note For special domains such as roots of unity this method is not the most efficient for two reasons:
- Need to build the domain of size N.
- The implementation is not trying to identify this special domain.
Therefore the computation is typically $O(n^2)$ rather than $O(n \log n)$.
See the 'device views' section for more details.
:::
### Manipulations
Beyond arithmetic, the API supports efficient polynomial manipulations:
#### Monomials
```cpp
// Monomial operations
Polynomial& add_monomial_inplace(Coeff monomial_coeff, uint64_t monomial = 0);
@@ -160,31 +174,35 @@ Polynomial& sub_monomial_inplace(Coeff monomial_coeff, uint64_t monomial = 0);
The ability to add or subtract monomials directly and in-place is an efficient way to manipulate polynomials.
Example:
```cpp
f.add_monomial_inplace(scalar_t::from(5)); // f(x) += 5
f.sub_monomial_inplace(scalar_t::from(3), 8); // f(x) -= 3x^8
```
#### Computing the degree of a Polynomial
```cpp
// Degree computation
int64_t degree();
```
The degree of a polynomial is a fundamental characteristic that describes the highest power of the variable in the polynomial expression with a non-zero coefficient.
The `degree()` function in the API returns the degree of the polynomial, corresponding to the highest exponent with a non-zero coefficient.
- For the polynomial $f(x) = x^5 + 2x^3 + 4$, the degree is 5 because the highest power of $x$ with a non-zero coefficient is 5.
- For a scalar value such as a constant term (e.g., $f(x) = 7$), the degree is considered 0, as it corresponds to $x^0$.
- The degree of the zero polynomial, $f(x) = 0$, where there are no non-zero coefficients, is defined as -1. This special case often represents an "empty" or undefined state in many mathematical contexts.
Example:
```cpp
auto f = /*some expression*/;
auto degree_of_f = f.degree();
```
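The degree convention described above can be sketched on a plain coefficient vector. This is a minimal illustration, not ICICLE code: `degree_of` and `degree_examples` are hypothetical names, and `uint64_t` values stand in for field elements.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of the ICICLE API): degree of a polynomial
// stored as dense coefficients, where coeffs[i] is the coefficient of x^i.
int64_t degree_of(const std::vector<uint64_t>& coeffs)
{
  for (int64_t i = static_cast<int64_t>(coeffs.size()) - 1; i >= 0; --i) {
    if (coeffs[i] != 0) return i; // highest index holding a non-zero coefficient
  }
  return -1; // zero polynomial (or empty vector): no non-zero coefficient exists
}

// The three cases listed in the text.
void degree_examples()
{
  assert(degree_of({4, 0, 0, 2, 0, 1}) == 5); // x^5 + 2x^3 + 4
  assert(degree_of({7}) == 0);                // constant polynomial
  assert(degree_of({0, 0, 0}) == -1);         // zero polynomial
}
```

Returning a signed type is what makes the -1 sentinel for the zero polynomial possible, which matches the `int64_t` return type of `degree()`.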
#### Slicing
```cpp
// Slicing and selecting even or odd components.
Polynomial slice(uint64_t offset, uint64_t stride, uint64_t size = 0 /*0 means take all elements*/);
@@ -195,6 +213,7 @@ Polynomial odd();
The Polynomial API provides methods for slicing polynomials and selecting specific components, such as even or odd indexed terms. Slicing allows extracting specific sections of a polynomial based on an offset, stride, and size.
The following examples demonstrate folding a polynomial's even and odd parts and arbitrary slicing:
```cpp
// folding a polynomial's even and odd parts with randomness
auto x = rand();
@@ -207,13 +226,15 @@ auto first_quarter = f.slice(0 /*offset*/, 1 /*stride*/, f.degree()/4 /*size*/);
```
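To make the offset/stride/size semantics concrete, here is a minimal host-side sketch on a plain coefficient vector. It assumes that `size == 0` means "take every element reachable from `offset`", as the comment on `slice` suggests. The names `slice_coeffs`, `even_coeffs` and `odd_coeffs` are hypothetical and do not reflect how ICICLE implements slicing on the device.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of slice(offset, stride, size):
// picks coeffs[offset], coeffs[offset + stride], ... up to `size` elements.
// size == 0 is treated as "no limit".
std::vector<uint64_t> slice_coeffs(
  const std::vector<uint64_t>& coeffs, uint64_t offset, uint64_t stride, uint64_t size = 0)
{
  assert(stride > 0); // a zero stride would never advance
  std::vector<uint64_t> out;
  for (uint64_t i = offset; i < coeffs.size(); i += stride) {
    if (size != 0 && out.size() == size) break; // requested number of elements reached
    out.push_back(coeffs[i]);
  }
  return out;
}

// even() and odd() are the stride-2 slices starting at offsets 0 and 1.
std::vector<uint64_t> even_coeffs(const std::vector<uint64_t>& c) { return slice_coeffs(c, 0, 2); }
std::vector<uint64_t> odd_coeffs(const std::vector<uint64_t>& c) { return slice_coeffs(c, 1, 2); }
```

With coefficients `{1, 2, 3, 4, 5, 6, 7}`, the even part is `{1, 3, 5, 7}` and the odd part is `{2, 4, 6}`, so $f(x) = f_{even}(x^2) + x \cdot f_{odd}(x^2)$.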
### Memory access (copy/view)
Access to the polynomial's internal state can be vital for operations like commitment schemes or when more efficient custom operations are necessary. This can be done in one of two ways:
- **Copy** the coefficients or evaluations to user allocated memory or
- **View** into the device memory without copying.
#### Copy
Copy the polynomial coefficients to either host or device allocated memory.
:::note copying to host memory is backend agnostic while copying to device memory requires the memory to be allocated on the corresponding backend.
Access to the polynomial's internal state can be vital for operations like commitment schemes or when more efficient custom operations are necessary. This can be done either by copying or by viewing the polynomial.
#### Copying
Copies the polynomial coefficients to either host or device allocated memory.
:::note
Copying to host memory is backend agnostic while copying to device memory requires the memory to be allocated on the corresponding backend.
:::
```cpp
@@ -222,6 +243,7 @@ uint64_t copy_coeffs(Coeff* coeffs, uint64_t start_idx, uint64_t end_idx) const;
```
Example:
```cpp
auto coeffs_device = /*allocate CUDA or host memory*/;
f.copy_coeffs(coeffs_device, 0/*start*/, f.degree());
@@ -232,7 +254,8 @@ auto rv = msm::MSM(coeffs_device, points, msm_size, cfg, results);
```
#### Views
The Polynomial API supports efficient data handling through the use of memory views. These views provide direct access to the polynomial's internal state, such as coefficients or evaluations, without the need to copy data. This feature is particularly useful for operations that require direct access to device memory, enhancing both performance and memory efficiency.
The Polynomial API supports efficient data handling through the use of memory views. These views provide direct access to the polynomial's internal state without the need to copy data. This feature is particularly useful for operations that require direct access to device memory, enhancing both performance and memory efficiency.
##### What is a Memory View?
@@ -242,7 +265,7 @@ A memory view is essentially a pointer to data stored in device memory. By provi
Memory views are extremely versatile and can be employed in various computational contexts such as:
- **Commitments**: Views can be used to commit polynomial states in cryptographic schemes, such as Multi-Scalar Multiplications (MSM), or for constructing Merkle trees without duplicating the underlying data.
- **Commitments**: Views can be used to commit polynomial states in cryptographic schemes, such as Multi-Scalar Multiplications (MSM).
- **External Computations**: They allow external functions or algorithms to utilize the polynomial's data directly, facilitating operations outside the core polynomial API. This is useful for custom operations that are not covered by the API.
##### Obtaining and Using Views
@@ -252,9 +275,6 @@ To create and use views within the Polynomial API, functions are provided to obt
```cpp
// Obtain a view of the polynomial's coefficients
std::tuple<IntegrityPointer<Coeff>, uint64_t /*size*/, uint64_t /*device_id*/> get_coefficients_view();
// obtain a view of the evaluations. Can specify the domain size and whether to compute reversed evaluations.
std::tuple<IntegrityPointer<Image>, uint64_t /*size*/, uint64_t /*device_id*/>
get_rou_evaluations_view(uint64_t nof_evaluations = 0, bool is_reversed = false);
```
Example usage:
@@ -268,6 +288,7 @@ gpu_accelerated_function(coeffs_view.get(),...);
```
##### Integrity-Pointer: Managing Memory Views
Within the Polynomial API, memory views are managed through a specialized tool called the Integrity-Pointer. This pointer type is designed to safeguard operations by monitoring the validity of the memory it points to. It can detect if the memory has been modified or released, thereby preventing unsafe access to stale or non-existent data.
The Integrity-Pointer not only acts as a regular pointer but also provides additional functionality to ensure the integrity of the data it references. Here are its key features:
@@ -304,20 +325,7 @@ if (coeff_view.isValid()) {
}
```
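One way such a validity check could work is with a shared version counter: the owner bumps the counter on every mutation, and a view remembers the value it saw when it was created. The sketch below is an assumption-laden illustration of that idea. `CheckedView` and `Buffer` are hypothetical names and this is not presented as ICICLE's `IntegrityPointer` implementation.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical integrity-checked view: valid only while the owner is alive
// and its version counter still equals the value captured at creation.
template <typename T>
class CheckedView
{
public:
  CheckedView(const T* ptr, std::weak_ptr<int> version, int expected)
      : ptr_(ptr), version_(std::move(version)), expected_(expected)
  {
  }

  bool isValid() const
  {
    auto v = version_.lock(); // null if the owner (and its counter) was destroyed
    return v && *v == expected_;
  }

  // Raw pointer while valid, nullptr once the view has gone stale.
  const T* get() const { return isValid() ? ptr_ : nullptr; }

private:
  const T* ptr_;
  std::weak_ptr<int> version_;
  int expected_;
};

// Toy owner: every mutation bumps the version, invalidating older views.
class Buffer
{
public:
  explicit Buffer(int value) : value_(value), version_(std::make_shared<int>(0)) {}
  CheckedView<int> view() const { return CheckedView<int>(&value_, version_, *version_); }
  void set(int value)
  {
    value_ = value;
    ++*version_;
  }

private:
  int value_;
  std::shared_ptr<int> version_;
};
```

A view taken before `set()` reports `isValid() == false` afterwards, and a view that outlives its `Buffer` is also invalid because the `weak_ptr` has expired. The design choice here is that the view never owns the data, so it cannot keep stale memory alive.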
#### Evaluations View: Accessing Polynomial Evaluations Efficiently
The Polynomial API offers a specialized method, `get_rou_evaluations_view(...)`, which facilitates direct access to the evaluations of a polynomial. This method is particularly useful for scenarios where polynomial evaluations need to be accessed frequently or manipulated externally without the overhead of copying data.
This method provides a memory view into the device memory where polynomial evaluations are stored. It allows for efficient interpolation on larger domains, leveraging the raw evaluations directly from memory.
:::warning
Invalid request: requesting evaluations on a domain smaller than the degree of the polynomial is not supported and is considered invalid.
:::
```cpp
// Assume a polynomial `p` of degree N
auto [evals_view, size, device_id] = p.get_rou_evaluations_view(4*N); // expanding the evaluation domain
// Use the evaluations view to perform further computations or visualizations
process_polynomial_evaluations(evals_view.get(), size, device_id);
```
## Multi-GPU Support with CUDA Backend
@@ -334,7 +342,9 @@ cudaSetDevice(int deviceID);
This function sets the active CUDA device. All subsequent operations that allocate or deal with polynomial data will be performed on this device.
### Allocation Consistency
Polynomials are always allocated on the current CUDA device at the time of their creation. It is crucial to ensure that the device context is correctly set before initiating any operation that involves memory allocation:
```cpp
// Set the device before creating polynomials
cudaSetDevice(0);
@@ -345,6 +355,7 @@ Polynomial p2 = Polynomial::from_coefficients(coeffs, size);
```
### Matching Devices for Operations
When performing operations that result in the creation of new polynomials (such as addition or multiplication), it is imperative that both operands are on the same CUDA device. If the operands reside on different devices, an exception is thrown:
```cpp
@@ -354,7 +365,9 @@ auto p3 = p1 + p2; // Throws an exception if p1 and p2 are not on the same devic
```
### Device-Agnostic Operations
Operations that do not involve the creation of new polynomials, such as computing the degree of a polynomial or performing in-place modifications, can be executed regardless of the current device setting:
```cpp
// 'degree' and in-place operations do not require device matching
int deg = p1.degree();
@@ -362,9 +375,11 @@ p1 += p2; // Valid if p1 and p2 are on the same device, throws otherwise
```
### Error Handling
The API is designed to throw exceptions if operations are attempted across polynomials that are not located on the same GPU. This ensures that all polynomial operations are performed consistently and without data integrity issues due to device mismatches.
### Best Practices
To maximize the performance and avoid runtime errors in a multi-GPU setup, always ensure that:
- The CUDA device is set correctly before polynomial allocation.

View File

@@ -0,0 +1,75 @@
# Keccak
[Keccak](https://keccak.team/files/Keccak-implementation-3.2.pdf) is a cryptographic hash function designed by Guido Bertoni, Joan Daemen, Michaël Peeters, and Gilles Van Assche. It was selected as the winner of the NIST hash function competition, becoming the basis for the [SHA-3 standard](https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.202.pdf).
Keccak operates on a message input of any length and produces a fixed-size hash output. The hash function is built upon the sponge construction, which involves absorbing the input data followed by squeezing out the hash value.
At its core, Keccak consists of a permutation function operating on a state array. The permutation function employs a round function that operates iteratively on the state array. Each round consists of five main steps:
- **Theta:** This step introduces diffusion by performing a bitwise XOR operation between the state and a linear combination of its neighboring columns.
- **Rho:** This step performs bit rotation operations on each lane of the state array.
- **Pi:** This step rearranges the positions of the lanes in the state array.
- **Chi:** This step applies a nonlinear mixing operation along the rows of the state array.
- **Iota:** This step introduces a round constant to the state array.
## Using Keccak
ICICLE Keccak supports batch hashing, which can be used for constructing a Merkle tree or running multiple hashes in parallel.
### Supported Bindings
- [Golang](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang/hash/keccak)
- [Rust](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/rust/icicle-hash)
### Example usage
This is an example of running 1024 Keccak-256 hashes in parallel, where input strings are of size 136 bytes:
```rust
use icicle_core::hash::HashConfig;
use icicle_cuda_runtime::memory::HostSlice;
use icicle_hash::keccak::keccak256;
let config = HashConfig::default();
let input_block_len = 136;
let number_of_hashes = 1024;
let preimages = vec![1u8; number_of_hashes * input_block_len];
let mut digests = vec![0u8; number_of_hashes * 64];
let preimages_slice = HostSlice::from_slice(&preimages);
let digests_slice = HostSlice::from_mut_slice(&mut digests);
keccak256(
preimages_slice,
input_block_len as u32,
number_of_hashes as u32,
digests_slice,
&config,
)
.unwrap();
```
### Merkle Tree
You can build a keccak merkle tree using the corresponding functions:
```rust
use icicle_core::tree::{merkle_tree_digests_len, TreeBuilderConfig};
use icicle_cuda_runtime::memory::HostSlice;
use icicle_hash::keccak::build_keccak256_merkle_tree;
let mut config = TreeBuilderConfig::default();
config.arity = 2;
let height = 22;
let input_block_len = 136;
let leaves = vec![1u8; (1 << height) * input_block_len];
let mut digests = vec![0u64; merkle_tree_digests_len((height + 1) as u32, 2, 1)];
let leaves_slice = HostSlice::from_slice(&leaves);
let digests_slice = HostSlice::from_mut_slice(&mut digests);
build_keccak256_merkle_tree(leaves_slice, digests_slice, height, input_block_len, &config).unwrap();
```
In the example above, a binary tree of height 22 is built. Each leaf is treated as a 136-byte array. The leaves and digests are laid out in flat arrays. Keccak-512 is also available through the `build_keccak512_merkle_tree` function.

View File

@@ -49,48 +49,147 @@ Accelerating MSM is crucial to a ZK protocol's performance due to the [large per
You can learn more about how MSMs work from this [video](https://www.youtube.com/watch?v=Bl5mQA7UL2I) and from our resource list on [Ingopedia](https://www.ingonyama.com/ingopedia/msm).
## Supported curves
MSM supports the following curves:
`bls12-377`, `bls12-381`, `bn254`, `bw6-761`, `grumpkin`
## Supported Bindings
- [Golang](../golang-bindings/msm.md)
- [Rust](../rust-bindings/msm.md)
## Supported algorithms
## Algorithm description
Our MSM implementation supports two algorithms `Bucket accumulation` and `Large triangle accumulation`.
We follow the bucket method algorithm. The GPU implementation consists of four phases:
### Bucket accumulation
1. Preparation phase - The scalars are split into smaller scalars of `c` bits each. These are the bucket indices. The points are grouped according to their corresponding bucket index and the buckets are sorted by size.
2. Accumulation phase - Each bucket accumulates all of its points using a single thread. More than one thread is assigned to large buckets, in proportion to their size. A bucket is considered large if its size is above the large bucket threshold that is determined by the `large_bucket_factor` parameter. The large bucket threshold is the expected average bucket size times the `large_bucket_factor` parameter.
3. Buckets Reduction phase - bucket results are multiplied by their corresponding bucket number and each bucket module is reduced to a small number of final results. By default, this is done by an iterative algorithm which is highly parallel. Setting `is_big_triangle` to `true` will switch this phase to the running sum algorithm described in the above YouTube talk which is much less parallel.
4. Final accumulation phase - The final results from the last phase are accumulated using the double-and-add algorithm.
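The scalar split in the preparation phase can be sketched as follows. This is a simplified host-side illustration in which a `uint64_t` stands in for a field scalar; `bucket_indices` is a hypothetical name, and the actual GPU kernels are not shown here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: split a scalar into c-bit windows, least significant
// window first. Window j is the bucket index of the point in bucket module j.
std::vector<uint32_t> bucket_indices(uint64_t scalar, unsigned bitsize, unsigned c)
{
  assert(c > 0 && c < 32 && bitsize <= 64);
  const unsigned nof_windows = (bitsize + c - 1) / c; // ceil(bitsize / c)
  const uint64_t mask = (uint64_t{1} << c) - 1;       // keeps the low c bits
  std::vector<uint32_t> indices(nof_windows);
  for (unsigned j = 0; j < nof_windows; ++j) {
    indices[j] = static_cast<uint32_t>((scalar >> (j * c)) & mask);
  }
  return indices;
}
```

For the scalar `0xB6` with `bitsize = 8` and `c = 4`, the windows are `{6, 11}`, since $0xB6 = 6 + 11 \cdot 2^4$. When `c` does not divide `bitsize`, the top window holds fewer than `c` bits, so its points fall into fewer buckets.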
The Bucket Accumulation algorithm is a method of dividing the overall MSM task into smaller, more manageable sub-tasks. It involves partitioning scalars and their corresponding points into different "buckets" based on the scalar values.
## Batched MSM
Bucket Accumulation can be more parallel-friendly because it involves dividing the computation into smaller, independent tasks, distributing scalar-point pairs into buckets and summing points within each bucket. This division makes it well suited for parallel processing on GPUs.
The MSM supports batch mode - running multiple MSMs in parallel. It's always better to use the batch mode instead of running single MSMs serially, as long as there is enough memory available. We support running a batch of MSMs that share the same points as well as a batch of MSMs that use different points.
#### When should I use Bucket accumulation?
## MSM configuration
In scenarios involving large MSM computations with many scalar-point pairs, the ability to parallelize operations makes Bucket Accumulation more efficient. The larger the MSM task, the more significant the potential gains from parallelization.
```cpp
/**
* @struct MSMConfig
* Struct that encodes MSM parameters to be passed into the [MSM](@ref MSM) function. The intended use of this struct
* is to create it using [default_msm_config](@ref default_msm_config) function and then you'll hopefully only need to
* change a small number of default values for each of your MSMs.
*/
struct MSMConfig {
device_context::DeviceContext ctx; /**< Details related to the device such as its id and stream id. */
int points_size; /**< Number of points in the MSM. If a batch of MSMs needs to be computed, this should be
* a number of different points. So, if each MSM re-uses the same set of points, this
* variable is set equal to the MSM size. And if every MSM uses a distinct set of
* points, it should be set to the product of MSM size and [batch_size](@ref
* batch_size). Default value: 0 (meaning it's equal to the MSM size). */
int precompute_factor; /**< The number of extra points to pre-compute for each point. See the
* [precompute_msm_points](@ref precompute_msm_points) function, `precompute_factor` passed
* there needs to be equal to the one used here. Larger values decrease the
* number of computations to make, on-line memory footprint, but increase the static
* memory footprint. Default value: 1 (i.e. don't pre-compute). */
int c; /**< \f$ c \f$ value, or "window bitsize" which is the main parameter of the "bucket
* method" that we use to solve the MSM problem. As a rule of thumb, larger value
* means more on-line memory footprint but also more parallelism and less computational
* complexity (up to a certain point). Currently pre-computation is independent of
* \f$ c \f$, however in the future value of \f$ c \f$ here and the one passed into the
* [precompute_msm_points](@ref precompute_msm_points) function will need to be identical.
* Default value: 0 (the optimal value of \f$ c \f$ is chosen automatically). */
int bitsize; /**< Number of bits of the largest scalar. Typically equals the bitsize of scalar field,
* but if a different (better) upper bound is known, it should be reflected in this
* variable. Default value: 0 (set to the bitsize of scalar field). */
int large_bucket_factor; /**< Variable that controls how sensitive the algorithm is to the buckets that occur
* very frequently. Useful for efficient treatment of non-uniform distributions of
* scalars and "top windows" with few bits. Can be set to 0 to disable separate
* treatment of large buckets altogether. Default value: 10. */
int batch_size; /**< The number of MSMs to compute. Default value: 1. */
bool are_scalars_on_device; /**< True if scalars are on device and false if they're on host. Default value:
* false. */
bool are_scalars_montgomery_form; /**< True if scalars are in Montgomery form and false otherwise. Default value:
* true. */
bool are_points_on_device; /**< True if points are on device and false if they're on host. Default value: false. */
bool are_points_montgomery_form; /**< True if coordinates of points are in Montgomery form and false otherwise.
* Default value: true. */
bool are_results_on_device; /**< True if the results should be on device and false if they should be on host. If set
* to false, `is_async` won't take effect because a synchronization is needed to
* transfer results to the host. Default value: false. */
bool is_big_triangle; /**< Whether to do "bucket accumulation" serially. Decreases computational complexity
* but also greatly decreases parallelism, so only suitable for large batches of MSMs.
* Default value: false. */
bool is_async; /**< Whether to run the MSM asynchronously. If set to true, the MSM function will be
* non-blocking and you'd need to synchronize it explicitly by running
* `cudaStreamSynchronize` or `cudaDeviceSynchronize`. If set to false, the MSM
* function will block the current CPU thread. */
};
```
### Large triangle accumulation
## Choosing optimal parameters
Large Triangle Accumulation is a method for optimizing MSM which focuses on reducing the number of point doublings in the computation. This algorithm is based on the observation that the number of point doublings can be minimized by structuring the computation in a specific manner.
`is_big_triangle` should be `false` in almost all cases. It might provide better results only for very small MSMs (smaller than $2^8$) with a large batch (larger than 100) but this should be tested per scenario.
Large buckets exist in two cases:
1. When the scalar distribution isn't uniform.
2. When `c` does not divide the scalar bit-size.
#### When should I use Large triangle accumulation?
`large_bucket_factor` that is equal to 10 yields good results for most cases, but it's best to fine tune this parameter per `c` and per scalar distribution.
The two most important parameters for performance are `c` and `precompute_factor`. They affect the number of EC additions as well as the memory size. When the points are not known in advance, precomputation cannot be used. In this case the best `c` value is usually around $\log_2(msmSize) - 4$. However, in most protocols the points are known in advance and precomputation can be used unless limited by memory. Usually it's best to use maximum precomputation (such that we end up with only a single bucket module) combined with a `c` value around $\log_2(msmSize) - 1$.
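The rule of thumb above can be turned into a starting point for tuning. The helper below is a hypothetical sketch (`suggest_params` and `MsmParams` are not ICICLE names). It assumes that "maximum precomputation" means `precompute_factor = ceil(bitsize / c)`, which is the value that leaves a single bucket module, and it ignores memory limits entirely.

```cpp
#include <cassert>

// Hypothetical container for the two parameters discussed in the text.
struct MsmParams {
  int c;
  int precompute_factor;
};

// Starting point only; the best values should still be measured per GPU and curve.
MsmParams suggest_params(int log_msm_size, int bitsize, bool points_known_in_advance)
{
  assert(log_msm_size > 4 && bitsize > 0);
  if (!points_known_in_advance) {
    return {log_msm_size - 4, 1}; // no precomputation possible
  }
  const int c = log_msm_size - 1;
  const int precompute_factor = (bitsize + c - 1) / c; // ceil(bitsize / c): one bucket module
  return {c, precompute_factor};
}
```

For an MSM of size $2^{22}$ and an assumed 253-bit scalar field, this gives `c = 21` and `precompute_factor = 13`. Treat the output as a first guess to refine by benchmarking, since memory limits may force a smaller `precompute_factor`.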
The Large Triangle Accumulation algorithm is more sequential in nature, as it builds upon each step sequentially (accumulating sums and then performing doubling). This structure can make it less suitable for parallelization but potentially more efficient for a <b>large batch of smaller MSM computations</b>.
## Memory usage estimation
## MSM Modes
The main memory requirements of the MSM are the following:
ICICLE MSM also supports two different modes `Batch MSM` and `Single MSM`
- Scalars - `sizeof(scalar_t) * msm_size * batch_size`
- Scalar indices - `~6 * sizeof(unsigned) * nof_bucket_modules * msm_size * batch_size`
- Points - `sizeof(affine_t) * msm_size * precompute_factor * batch_size`
- Buckets - `sizeof(projective_t) * nof_bucket_modules * 2^c * batch_size`
Batch MSM allows you to run many MSMs with a single API call, Single MSM will launch a single MSM computation.
where `nof_bucket_modules = ceil(ceil(bitsize / c) / precompute_factor)`
### Which mode should I use?
During the MSM computation first the memory for scalars and scalar indices is allocated, then the indices are freed and points and buckets are allocated. This is why a good estimation for the required memory is the following formula:
This decision is highly dependent on your use case and design. However, if your design allows for it, using batch mode can significantly improve efficiency. Batch processing allows you to perform multiple MSMs leveraging the parallel processing capabilities of GPUs.
$\max(scalars + scalarIndices, scalars + points + buckets)$
Single MSM mode should be used when batching isn't possible or when you have to run a single MSM.
This gives a good approximation within 10% of the actual required memory for most cases.
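The estimate can be written as a small host-side function. This is a sketch under stated assumptions: `estimate_msm_memory_bytes` and `MsmSizes` are hypothetical names, the element sizes are passed in by the caller, and `sizeof(unsigned)` is whatever the host compiler uses (4 bytes on common platforms).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical container for the element sizes used in the formulas.
struct MsmSizes {
  uint64_t scalar_bytes;     // sizeof(scalar_t)
  uint64_t affine_bytes;     // sizeof(affine_t)
  uint64_t projective_bytes; // sizeof(projective_t)
};

// Memory estimate in bytes, following the four terms and the max() formula above.
uint64_t estimate_msm_memory_bytes(
  const MsmSizes& s, uint64_t msm_size, uint64_t batch_size, uint64_t precompute_factor, uint64_t c, uint64_t bitsize)
{
  assert(c > 0 && c < 64 && precompute_factor > 0);
  const uint64_t windows = (bitsize + c - 1) / c; // ceil(bitsize / c)
  const uint64_t nof_bucket_modules = (windows + precompute_factor - 1) / precompute_factor;

  const uint64_t scalars = s.scalar_bytes * msm_size * batch_size;
  const uint64_t indices = 6 * sizeof(unsigned) * nof_bucket_modules * msm_size * batch_size;
  const uint64_t points = s.affine_bytes * msm_size * precompute_factor * batch_size;
  const uint64_t buckets = s.projective_bytes * nof_bucket_modules * (uint64_t{1} << c) * batch_size;

  // Indices are freed before points and buckets are allocated, hence the max.
  return std::max(scalars + indices, scalars + points + buckets);
}
```

Assuming 32-byte scalars, 96-byte affine points, 144-byte projective points and a 253-bit scalar field, an MSM of size $2^{10}$ with `c = 9` and no precomputation comes out at 2,269,184 bytes, roughly 0.00227 GB.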
## Example parameters
Here is a useful table showing optimal parameters for different MSMs. They are optimal for BLS12-377 curve when running on NVIDIA GeForce RTX 3090 Ti. This is the configuration used:
```cpp
msm::MSMConfig config = {
ctx, // DeviceContext
N, // points_size
precomp_factor, // precompute_factor
user_c, // c
0, // bitsize
10, // large_bucket_factor
batch_size, // batch_size
false, // are_scalars_on_device
false, // are_scalars_montgomery_form
true, // are_points_on_device
false, // are_points_montgomery_form
true, // are_results_on_device
false, // is_big_triangle
true // is_async
};
```
Here are the parameters and the results for the different cases:
| MSM size (log2) | Batch size | Precompute factor | c | Memory estimation (GB) | Actual memory (GB) | Single MSM time (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| 10 | 1 | 1 | 9 | 0.00227 | 0.00277 | 9.2 |
| 10 | 1 | 23 | 11 | 0.00259 | 0.00272 | 1.76 |
| 10 | 1000 | 1 | 7 | 0.94 | 1.09 | 0.051 |
| 10 | 1000 | 23 | 11 | 2.59 | 2.74 | 0.025 |
| 15 | 1 | 1 | 11 | 0.011 | 0.019 | 9.9 |
| 15 | 1 | 16 | 16 | 0.061 | 0.065 | 2.4 |
| 15 | 100 | 1 | 11 | 1.91 | 1.92 | 0.84 |
| 15 | 100 | 19 | 14 | 6.32 | 6.61 | 0.56 |
| 18 | 1 | 1 | 14 | 0.128 | 0.128 | 14.4 |
| 18 | 1 | 15 | 17 | 0.40 | 0.42 | 5.9 |
| 22 | 1 | 1 | 17 | 1.64 | 1.65 | 68 |
| 22 | 1 | 13 | 21 | 5.67 | 5.94 | 54 |
| 24 | 1 | 1 | 18 | 6.58 | 6.61 | 232 |
| 24 | 1 | 7 | 21 | 12.4 | 13.4 | 199 |
The optimal values can vary per GPU and per curve. It is best to try a few combinations until you get the best results for your specific case.
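The memory estimate can be sketched as a small helper. This is an illustration only and not part of the ICICLE API: the names `ElementSizes` and `estimate_msm_memory` are hypothetical, element sizes are supplied by the caller, and the scalar, scalar-index and point terms are assumptions that mirror the bucket term (only the bucket term and the `nof_bucket_modules` formula are spelled out in this section).

```rust
/// Element sizes in bytes for one curve (supplied by the caller).
struct ElementSizes {
    scalar: u64,
    affine: u64,
    projective: u64,
}

/// Rough MSM device-memory estimate in bytes:
/// max(scalars + scalar_indices, scalars + points + buckets).
fn estimate_msm_memory(
    sizes: &ElementSizes,
    log_msm_size: u32,
    batch_size: u64,
    precompute_factor: u64,
    c: u64,
    bitsize: u64,
) -> u64 {
    let msm_size = 1u64 << log_msm_size;
    // nof_bucket_modules = ceil(ceil(bitsize / c) / precompute_factor)
    let windows = (bitsize + c - 1) / c;
    let nof_bucket_modules = (windows + precompute_factor - 1) / precompute_factor;

    let scalars = sizes.scalar * msm_size * batch_size;
    // Assumption: about six 4-byte indices per bucket module per scalar.
    let scalar_indices = 6 * 4 * nof_bucket_modules * msm_size * batch_size;
    // Assumption: precomputed points are stored for every MSM in the batch.
    let points = sizes.affine * msm_size * precompute_factor * batch_size;
    let buckets = sizes.projective * nof_bucket_modules * (1u64 << c) * batch_size;

    (scalars + scalar_indices).max(scalars + points + buckets)
}
```

Assuming BLS12-377 sizes of 32, 96 and 144 bytes and 253-bit scalars, the last two rows of the table (MSM size 24) come out at roughly 6.58 GB and 12.4 GB, in line with the "Memory estimation" column.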


@@ -11,24 +11,19 @@ A_k = \sum_{n=0}^{N-1} a_n \cdot \omega^{nk} \mod p
$$
where:
- $N$ is the size of the input sequence and is a power of 2,
- $p$ is a prime number such that $p = kN + 1$ for some integer $k$, ensuring that $p$ supports the existence of $N$th roots of unity,
- $\omega$ is a primitive $N$th root of unity modulo $p$, meaning $\omega^N \equiv 1 \mod p$ and no smaller positive power of $\omega$ is congruent to 1 modulo $p$,
- $k$ ranges from 0 to $N-1$, and it indexes the output sequence.
The NTT is particularly useful because it enables efficient polynomial multiplication under modulo arithmetic, crucial for algorithms in cryptographic protocols, and other areas requiring fast modular arithmetic operations.
NTT is particularly useful because it enables efficient polynomial multiplication under modulo arithmetic, crucial for algorithms in cryptographic protocols and other areas requiring fast modular arithmetic operations.
There is also the INTT, the inverse operation of the NTT. The INTT takes the output sequence of an NTT as input and reconstructs the original sequence.
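The definition can be checked directly with a naive $O(N^2)$ implementation. This sketch is not how ICICLE computes the NTT; the function names are hypothetical and the toy parameters $p = 17$, $N = 4$, $\omega = 4$ mentioned below are chosen only because they are small enough to verify by hand.

```rust
/// Modular exponentiation by squaring.
fn pow_mod(mut base: u64, mut exp: u64, p: u64) -> u64 {
    let mut acc = 1 % p;
    base %= p;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % p;
        }
        base = base * base % p;
        exp >>= 1;
    }
    acc
}

/// A_k = sum_{n=0}^{N-1} a_n * omega^{nk} mod p, evaluated term by term.
fn naive_ntt(a: &[u64], omega: u64, p: u64) -> Vec<u64> {
    let n = a.len() as u64;
    (0..n)
        .map(|k| {
            a.iter().enumerate().fold(0u64, |acc, (i, &x)| {
                (acc + (x % p) * pow_mod(omega, i as u64 * k, p)) % p
            })
        })
        .collect()
}

/// Inverse transform: run the NTT with omega^{-1}, then scale by N^{-1}.
/// Inverses are computed with Fermat's little theorem, so p must be prime.
fn naive_intt(values: &[u64], omega: u64, p: u64) -> Vec<u64> {
    let n = values.len() as u64;
    let omega_inv = pow_mod(omega, p - 2, p);
    let n_inv = pow_mod(n % p, p - 2, p);
    naive_ntt(values, omega_inv, p)
        .into_iter()
        .map(|x| x * n_inv % p)
        .collect()
}
```

With $p = 17$ and $\omega = 4$ (a primitive 4th root of unity, since $4^2 \equiv -1$), the input `[1, 2, 3, 4]` transforms to `[10, 7, 15, 6]`, and `naive_intt` maps it back.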
# Using NTT
## Using NTT
### Supported curves
NTT supports the following curves:
`bls12-377`, `bls12-381`, `bn-254`, `bw6-761`
## Supported Bindings
### Supported Bindings
- [Golang](../golang-bindings/ntt.md)
- [Rust](../rust-bindings/ntt.md)
@@ -61,19 +56,17 @@ Choosing an algorithm is heavily dependent on your use case. For example Cooley-
NTT also supports two different modes: `Batch NTT` and `Single NTT`.
Batch NTT allows you to run many NTTs with a single API call, Single MSM will launch a single MSM computation.
Deciding whether to use `batch NTT` or `single NTT` is highly dependent on your application and use case.
**Single NTT Mode**
#### Single NTT
- Choose this mode when your application requires processing individual NTT operations in isolation.
Single NTT will launch a single NTT computation.
**Batch NTT Mode**
Choose this mode when your application requires processing individual NTT operations in isolation.
- Batch NTT mode can significantly reduce read/write as well as computation overhead by executing multiple NTT operations in parallel.
#### Batch NTT Mode
- Batch mode may also offer better utilization of computational resources (memory and compute).
Batch NTT allows you to run many NTTs with a single API call. Batch NTT mode can significantly reduce read/write times as well as computation overhead by executing multiple NTT operations in parallel. Batch mode may also offer better utilization of computational resources (memory and compute).
## Supported algorithms
@@ -90,8 +83,8 @@ At its core, the Radix-2 NTT algorithm divides the problem into smaller sub-prob
The algorithm recursively divides the input sequence into smaller sequences. At each step, it separates the sequence into even-indexed and odd-indexed elements, forming two subsequences that are then processed independently.
3. **Butterfly Operations:**
The core computational element of the Radix-2 NTT is the "butterfly" operation, which combines pairs of elements from the sequences obtained in the decomposition step.
The core computational element of the Radix-2 NTT is the "butterfly" operation, which combines pairs of elements from the sequences obtained in the decomposition step.
Each butterfly operation involves multiplication by a "twiddle factor," which is a root of unity in the finite field, and addition or subtraction of the results, all performed modulo the prime modulus.
$$
@@ -108,7 +101,6 @@ At its core, the Radix-2 NTT algorithm divides the problem into smaller sub-prob
$k$ - The index of the current operation within the butterfly or the transform stage
The twiddle factors are precomputed to save runtime and improve performance.
4. **Bit-Reversal Permutation:**
@@ -116,7 +108,7 @@ At its core, the Radix-2 NTT algorithm divides the problem into smaller sub-prob
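The Radix-2 steps described above (bit-reversal, precomputed twiddle factors, butterflies) can be sketched for a toy field as follows. This is a textbook iterative Cooley-Tukey variant and not ICICLE's GPU kernel; the function names are hypothetical and inputs are assumed to be already reduced modulo `p`.

```rust
/// Reorders the slice so that element i moves to the bit-reversed index of i.
fn bit_reverse_permute(a: &mut [u64]) {
    let n = a.len();
    if n <= 1 {
        return;
    }
    let bits = n.trailing_zeros();
    for i in 0..n {
        let j = i.reverse_bits() >> (usize::BITS - bits);
        if i < j {
            a.swap(i, j);
        }
    }
}

/// In-place radix-2 NTT. `omega` must be a primitive N-th root of unity mod p.
fn radix2_ntt(a: &mut [u64], omega: u64, p: u64) {
    let n = a.len();
    assert!(n.is_power_of_two());
    if n <= 1 {
        return;
    }
    // Precompute the twiddle factors omega^0 .. omega^(N/2 - 1) once.
    let mut twiddles = Vec::with_capacity(n / 2);
    let mut w = 1u64;
    for _ in 0..n / 2 {
        twiddles.push(w);
        w = w * omega % p;
    }

    bit_reverse_permute(a);
    let mut len = 2;
    while len <= n {
        let half = len / 2;
        let stride = n / len; // step through the twiddle table for this stage
        for start in (0..n).step_by(len) {
            for j in 0..half {
                // Butterfly: combine one pair with a single twiddle multiplication.
                let u = a[start + j];
                let v = a[start + j + half] * twiddles[j * stride] % p;
                a[start + j] = (u + v) % p;
                a[start + j + half] = (u + p - v) % p;
            }
        }
        len *= 2;
    }
}
```

For the toy parameters $p = 17$, $\omega = 4$, the input `[1, 2, 3, 4]` becomes `[10, 7, 15, 6]`, matching a direct evaluation of the NTT definition.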
### Mixed Radix
The Mixed Radix NTT algorithm extends the concepts of the Radix-2 algorithm by allowing the decomposition of the input sequence based on various factors of its length. Specifically ICICLEs implementation splits the input into blocks of sizes 16,32,64 compared to radix2 which is always splitting such that we end with NTT of size 2. This approach offers enhanced flexibility and efficiency, especially for input sizes that are composite numbers, by leveraging the "divide and conquer" strategy across multiple radixes.
The Mixed Radix NTT algorithm extends the concepts of the Radix-2 algorithm by allowing the decomposition of the input sequence based on various factors of its length. Specifically, ICICLE's implementation splits the input into blocks of size 16, 32, or 64, whereas Radix-2 always splits until it reaches NTTs of size 2. This approach offers enhanced flexibility and efficiency, especially for input sizes that are composite numbers, by leveraging the "divide and conquer" strategy across multiple radices.
The NTT blocks in Mixed Radix are implemented more efficiently, based on the Winograd NTT, and their memory and register usage is also better optimized than in Radix-2.
@@ -126,11 +118,11 @@ Mixed Radix can reduce the number of stages required to compute for large inputs
The input to the Mixed Radix NTT is a sequence of integers $a_0, a_1, \ldots, a_{N-1}$, where $N$ is not strictly required to be a power of two. Instead, $N$ can be any composite number, ideally factorized into primes or powers of primes.
2. **Factorization and Decomposition:**
Unlike the Radix-2 algorithm, which strictly divides the computational problem into halves, the Mixed Radix NTT algorithm implements a flexible decomposition approach which isn't limited to prime factorization.
Unlike the Radix-2 algorithm, which strictly divides the computational problem into halves, the Mixed Radix NTT algorithm implements a flexible decomposition approach which isn't limited to prime factorization.
For example, an NTT of size 256 can be decomposed into two stages of $16 \times \text{NTT}_{16}$, leveraging a composite factorization strategy rather than decomposing into eight stages of $\text{NTT}_{2}$. This exemplifies the use of composite factors (in this case, $256 = 16 \times 16$) to apply smaller NTT transforms, optimizing computational efficiency by adapting the decomposition strategy to the specific structure of $N$.
3. **Butterfly Operations with Multiple Radixes:**
3. **Butterfly Operations with Multiple Radices:**
The Mixed Radix algorithm utilizes butterfly operations for various radix sizes. Each sub-transform involves specific butterfly operations characterized by multiplication with twiddle factors appropriate for the radix in question.
The generalized butterfly operation for a radix-$r$ element can be expressed as:
@@ -139,7 +131,15 @@ Mixed Radix can reduce the number of stages required to compute for large inputs
X_{k,r} = \sum_{j=0}^{r-1} (A_{j,k} \cdot W^{jk}) \mod p
$$
where $X_{k,r}$ is the output of the $radix-r$ butterfly operation for the $k-th$ set of inputs, $A_{j,k}$ represents the $j-th$ input element for the $k-th$ operation, $W$ is the twiddle factor, and $p$ is the prime modulus.
where:
$X_{k,r}$ - is the output of the radix-$r$ butterfly operation for the $k$-th set of inputs
$A_{j,k}$ - represents the $j$-th input element for the $k$-th operation
$W$ - is the twiddle factor
$p$ - is the prime modulus
4. **Recombination and Reordering:**
After applying the appropriate butterfly operations across all decomposition levels, the Mixed Radix algorithm recombines the results into a single output sequence. Due to the varied sizes of the sub-transforms, a more complex reordering process may be required compared to Radix-2. This involves digit-reversal permutations to ensure that the final output sequence is correctly ordered.
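The decomposition and recombination steps can be illustrated with a two-stage transform for $N = N_1 \times N_2$. This is a generic Cooley-Tukey sketch over a toy field, not ICICLE's implementation (which uses blocks of size 16, 32 or 64); all names are hypothetical, and the small sub-transforms are evaluated naively to keep the code short.

```rust
/// Modular exponentiation by squaring.
fn pow_mod(mut base: u64, mut exp: u64, p: u64) -> u64 {
    let mut acc = 1 % p;
    base %= p;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % p;
        }
        base = base * base % p;
        exp >>= 1;
    }
    acc
}

/// Radix-r butterfly written as a naive size-r NTT: X_k = sum_j A_j * W^{jk} mod p.
fn small_ntt(input: &[u64], root: u64, p: u64) -> Vec<u64> {
    (0..input.len())
        .map(|k| {
            input.iter().enumerate().fold(0u64, |acc, (j, &x)| {
                (acc + (x % p) * pow_mod(root, (j * k) as u64, p)) % p
            })
        })
        .collect()
}

/// NTT of size n1 * n2 computed as n2 transforms of size n1, a twiddle
/// multiplication, and n1 transforms of size n2.
fn two_stage_ntt(x: &[u64], n1: usize, n2: usize, omega: u64, p: u64) -> Vec<u64> {
    let n = n1 * n2;
    assert_eq!(x.len(), n);
    let root_n1 = pow_mod(omega, n2 as u64, p); // primitive n1-th root
    let root_n2 = pow_mod(omega, n1 as u64, p); // primitive n2-th root

    // Stage 1: size-n1 transform over each strided column, then twiddles.
    let mut mid = vec![0u64; n];
    for c in 0..n2 {
        let column: Vec<u64> = (0..n1).map(|r| x[n2 * r + c]).collect();
        let y = small_ntt(&column, root_n1, p);
        for k1 in 0..n1 {
            mid[c * n1 + k1] = y[k1] * pow_mod(omega, (c * k1) as u64, p) % p;
        }
    }

    // Stage 2: size-n2 transform for each k1, written out in reordered positions.
    let mut out = vec![0u64; n];
    for k1 in 0..n1 {
        let row: Vec<u64> = (0..n2).map(|c| mid[c * n1 + k1]).collect();
        let z = small_ntt(&row, root_n2, p);
        for k2 in 0..n2 {
            out[k1 + n1 * k2] = z[k2];
        }
    }
    out
}
```

For example, with $p = 17$ and $\omega = 3$ (a primitive 16th root of unity), `two_stage_ntt(&x, 4, 4, 3, 17)` gives the same result as a single naive transform of size 16, while using only two stages of size-4 butterflies.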
@@ -154,6 +154,6 @@ Mixed radix on the other hand works better for larger NTTs with larger input siz
Performance really depends on logn size, batch size, ordering, inverse, coset, coeff-field and which GPU you are using.
For this reason we implemented our [heuristic auto-selection](https://github.com/ingonyama-zk/icicle/blob/774250926c00ffe84548bc7dd97aea5227afed7e/icicle/appUtils/ntt/ntt.cu#L474) which should choose the most efficient algorithm in most cases.
For this reason we implemented our [heuristic auto-selection](https://github.com/ingonyama-zk/icicle/blob/main/icicle/src/ntt/ntt.cu#L573) which should choose the most efficient algorithm in most cases.
We still recommend you benchmark for your specific use case if you think a different configuration would yield better results.


@@ -8,4 +8,5 @@ This section of the documentation is dedicated to the ICICLE primitives, we will
- [MSM](./msm.md)
- [NTT](./ntt.md)
- [Keccak Hash](./keccak.md)
- [Poseidon Hash](./poseidon.md)


@@ -8,43 +8,42 @@ Poseidon has been used in many popular ZK protocols such as Filecoin and [Plonk]
Our implementation of Poseidon is implemented in accordance with the optimized [Filecoin version](https://spec.filecoin.io/algorithms/crypto/poseidon/).
Let understand how Poseidon works.
Let's understand how Poseidon works.
### Initialization
## Initialization
Poseidon starts with the initialization of its internal state, which is composed of the input elements and some pregenerated constants. An initial round constant is added to each element of the internal state. Adding The round constants ensure the state is properly mixed from the outset.
Poseidon starts with the initialization of its internal state, which is composed of the input elements and some pre-generated constants. An initial round constant is added to each element of the internal state. Adding the round constants ensures the state is properly mixed from the beginning.
This is done to prevent collisions and to prevent certain cryptographic attacks by ensuring that the internal state is sufficiently mixed and unpredictable.
![Alt text](image.png)
![Poseidon initialization of internal state added with pre-generated round constants](https://github.com/ingonyama-zk/icicle/assets/122266060/52257f5d-6097-47c4-8f17-7b6449b9d162)
### Applying full and partial rounds
## Applying full and partial rounds
To generate a secure hash output, the algorithm goes through a series of "full rounds" and "partial rounds" as well as transformations between these sets of rounds.
To generate a secure hash output, the algorithm goes through a series of "full rounds" and "partial rounds" as well as transformations between these sets of rounds in the following order:
First full rounds => apply SBox and Round constants => partial rounds => Last full rounds => Apply SBox
```First full rounds -> apply S-box and Round constants -> partial rounds -> Last full rounds -> Apply S-box```
#### Full rounds
### Full rounds
![Alt text](image-1.png)
![Full round iterations consisting of S box operations, adding round constants, and a Full MDS matrix multiplication](https://github.com/ingonyama-zk/icicle/assets/122266060/e4ce0e98-b90b-4261-b83e-3cd8cce069cb)
**Uniform Application of S-Box:** In full rounds, the S-box (a non-linear transformation) is applied uniformly to every element of the hash function's internal state. This ensures a high degree of mixing and diffusion, contributing to the hash function's security. The functions S-box involves raising each element of the state to a certain power denoted by `α` a member of the finite field defined by the prime `p`, `α` can be different depending on the the implementation and user configuration.
**Uniform Application of S-box:** In full rounds, the S-box (a non-linear transformation) is applied uniformly to every element of the hash function's internal state. This ensures a high degree of mixing and diffusion, contributing to the hash function's security. The S-box function raises each element of the state to a certain power denoted by `α`, a member of the finite field defined by the prime `p`; `α` can differ depending on the implementation and user configuration.
**Linear Transformation:** After applying the S-box, a linear transformation is performed on the state. This involves multiplying the state by an MDS (Maximum Distance Separable) matrix, which further diffuses the transformations applied by the S-box across the entire state.
**Addition of Round Constants:** Each element of the state is then modified by adding a unique round constant. These constants are different for each round and are precomputed as part of the hash function's initialization. The addition of round constants ensures that even minor changes to the input produce significant differences in the output.
#### Partial Rounds
### Partial Rounds
![Partial round iterations consisting of selective S box operation, adding a round constant and performing an MDS multiplication with a sparse matrix](https://github.com/ingonyama-zk/icicle/assets/122266060/e8c198b4-7aa4-4b4d-9ec4-604e39e07692)
**Selective Application of S-Box:** Partial rounds apply the S-box transformation to only one element of the internal state per round, rather than to all elements. This selective application significantly reduces the computational complexity of the hash function without compromising its security. The choice of which element to apply the S-box to can follow a specific pattern or be fixed, depending on the design of the hash function.
**Linear Transformation and Round Constants:** A linear transformation is performed and round constants are added. The linear transformation in partial rounds can be designed to be less computationally intensive (this is done by using a sparse matrix) than in full rounds, further optimizing the function's efficiency.
The user of Poseidon can often choose how many partial or full rounds to apply; more full rounds increase security but degrade performance. The right balance is highly dependent on the use case.
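The difference between a full and a partial round can be sketched on a toy state of three elements. Everything here is illustrative: the prime, the matrix and the round constants are made up for the example, they are not ICICLE's constants and offer no security, and the partial round reuses the dense matrix rather than a sparse one to keep the sketch short. Applying the S-box to element 0 in the partial round is likewise an assumption.

```rust
// Toy parameters for illustration only; NOT ICICLE's constants and not secure.
const P: u64 = 2_147_483_647; // 2^31 - 1, so gcd(5, P - 1) = 1 and x^5 is a permutation
const ALPHA: u32 = 5;
const MATRIX: [[u64; 3]; 3] = [[2, 1, 1], [1, 2, 1], [1, 1, 2]];

/// S-box: raise one state element to the power ALPHA modulo P.
fn sbox(x: u64) -> u64 {
    let mut acc = 1u64;
    for _ in 0..ALPHA {
        acc = acc * (x % P) % P;
    }
    acc
}

/// Linear layer: multiply the state vector by MATRIX modulo P.
fn linear_layer(state: [u64; 3]) -> [u64; 3] {
    let mut out = [0u64; 3];
    for (row, out_elem) in MATRIX.iter().zip(out.iter_mut()) {
        for (m, s) in row.iter().zip(state.iter()) {
            *out_elem = (*out_elem + m * s % P) % P;
        }
    }
    out
}

/// Adds one round constant to each state element.
fn add_round_constants(state: [u64; 3], rc: [u64; 3]) -> [u64; 3] {
    [
        (state[0] + rc[0]) % P,
        (state[1] + rc[1]) % P,
        (state[2] + rc[2]) % P,
    ]
}

/// Full round: S-box on every element, linear layer, round constants.
fn full_round(state: [u64; 3], rc: [u64; 3]) -> [u64; 3] {
    add_round_constants(linear_layer(state.map(sbox)), rc)
}

/// Partial round: S-box on a single element only (index 0 here, by assumption).
fn partial_round(state: [u64; 3], rc: [u64; 3]) -> [u64; 3] {
    let mut s = state;
    s[0] = sbox(s[0]);
    add_round_constants(linear_layer(s), rc)
}
```

On the state `[2, 1, 3]` with constants `[1, 2, 3]`, the full round performs three S-box evaluations and yields `[309, 279, 522]`, while the partial round performs one and yields `[69, 39, 42]`.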
![Alt text](image-2.png)
## Using Poseidon
ICICLE Poseidon is implemented for GPU and parallelization is performed for each element of the state rather than for each state.
@@ -52,25 +51,21 @@ What that means is we calculate multiple hash-sums over multiple pre-images in p
So for Poseidon of arity 2 and input of size 1024 * 2, we would expect 1024 elements of output. Each block would be of size 2, which results in 1024 Poseidon hashes being performed.
### Supported API
### Supported Bindings
[`Rust`](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/rust/icicle-core/src/poseidon), [`C++`](https://github.com/ingonyama-zk/icicle/tree/main/icicle/appUtils/poseidon)
### Supported curves
Poseidon supports the following curves:
`bls12-377`, `bls12-381`, `bn-254`, `bw6-761`
[`Go`](https://github.com/ingonyama-zk/icicle/blob/main/wrappers/golang/curves/bn254/poseidon/poseidon.go)
[`Rust`](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/rust/icicle-core/src/poseidon)
### Constants
Poseidon is extremely customizable and using different constants will produce different hashes, security levels and performance results.
We support pre-calculated and optimized constants for each of the [supported curves](#supported-curves).The constants can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/icicle/appUtils/poseidon/constants) and are labeled clearly per curve `<curve_name>_poseidon.h`.
We support pre-calculated and optimized constants for each of the [supported curves](../core#supported-curves-and-operations). The constants can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/icicle/include/poseidon/constants) and are labeled clearly per curve `<curve_name>_poseidon.h`.
If you wish to generate your own constants you can use our python script which can be found [here](https://github.com/ingonyama-zk/icicle/blob/b6dded89cdef18348a5d4e2748b71ce4211c63ad/icicle/appUtils/poseidon/constants/generate_parameters.py#L1).
If you wish to generate your own constants you can use our python script which can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/icicle/include/poseidon/constants/generate_parameters.py).
Prerequisites:
- Install python 3
- `pip install poseidon-hash`
- `pip install galois==0.3.7`
@@ -97,8 +92,6 @@ primitive_element = 7 # bls12-381
# primitive_element = 15 # bw6-761
```
We only support `alpha = 5` so if you want to use another alpha for SBox please reach out on discord or open a github issue.
### Rust API
This is the most basic way to use the Poseidon API.
@@ -107,97 +100,95 @@ This is the most basic way to use the Poseidon API.
let test_size = 1 << 10;
let arity = 2u32;
let ctx = get_default_device_context();
let constants = load_optimized_poseidon_constants::<F>(arity, &ctx).unwrap();
let config = PoseidonConfig::default();
let poseidon = Poseidon::load(arity, &ctx).unwrap();
let config = HashConfig::default();
let inputs = vec![F::one(); test_size * arity as usize];
let outputs = vec![F::zero(); test_size];
let mut input_slice = HostOrDeviceSlice::on_host(inputs);
let mut output_slice = HostOrDeviceSlice::on_host(outputs);
poseidon_hash_many::<F>(
poseidon.hash_many::<F>(
&mut input_slice,
&mut output_slice,
test_size as u32,
arity as u32,
&constants,
1, // Output length
&config,
)
.unwrap();
```
The `PoseidonConfig::default()` can be modified, by default the inputs and outputs are set to be on `Host` for example.
The `HashConfig` can be modified. By default, for example, the inputs and outputs are set to be on `Host`.
```
impl<'a> Default for PoseidonConfig<'a> {
```rust
impl<'a> Default for HashConfig<'a> {
fn default() -> Self {
let ctx = get_default_device_context();
Self {
ctx,
are_inputs_on_device: false,
are_outputs_on_device: false,
input_is_a_state: false,
aligned: false,
loop_state: false,
is_async: false,
}
}
}
```
In the example above `load_optimized_poseidon_constants::<F>(arity, &ctx).unwrap();` is used which will load the correct constants based on arity and curve. Its possible to [generate](#constants) your own constants and load them.
In the example above, `Poseidon::load(arity, &ctx).unwrap();` is used, which loads the correct constants based on arity and curve. It's also possible to [generate](#constants) your own constants and load them.
```rust
let ctx = get_default_device_context();
let cargo_manifest_dir = env!("CARGO_MANIFEST_DIR");
let constants_file = PathBuf::from(cargo_manifest_dir)
.join("tests")
.join(format!("{}_constants.bin", field_prefix));
let mut constants_buf = vec![];
File::open(constants_file)
.unwrap()
.read_to_end(&mut constants_buf)
.unwrap();
let mut custom_constants = vec![];
for chunk in constants_buf.chunks(field_bytes) {
custom_constants.push(F::from_bytes_le(chunk));
}
let custom_constants = create_optimized_poseidon_constants::<F>(
arity as u32,
&ctx,
full_rounds_half,
partial_rounds,
&mut custom_constants,
)
.unwrap();
let custom_poseidon = Poseidon::new(
arity, // The arity of poseidon hash. The width will be equal to arity + 1
alpha, // The S-box power
full_rounds_half,
partial_rounds,
round_constants,
mds_matrix,
non_sparse_matrix,
sparse_matrices,
domain_tag,
ctx,
)
.unwrap();
```
For more examples using different configurations refer here.
## The Tree Builder
The tree builder allows you to build Merkle trees using Poseidon.
The tree builder allows you to build Merkle trees using Poseidon.
You can define both the tree's `height` and its `arity`. The tree `height` determines the number of layers in the tree, including the root and the leaf layer. The `arity` determines how many children each internal node can have.
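How `height` and `arity` determine the shape of the tree can be sketched with a small helper. It assumes a full tree and counts `height` as described above (root and leaf layer included); the function name is hypothetical and this is not the ICICLE `merkle_tree_digests_len` function, which may count layers differently.

```rust
/// Number of nodes in each layer, from the root (layer 0) down to the leaves.
/// A full tree is assumed, so layer `i` holds `arity^i` nodes.
fn layer_sizes(height: u32, arity: u64) -> Vec<u64> {
    (0..height).map(|level| arity.pow(level)).collect()
}
```

For `height = 3` and `arity = 2` this gives `[1, 2, 4]`: one root, two internal nodes and four leaves, seven nodes in total.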
```rust
let height = 20;
let arity = 2;
let leaves = vec![F::one(); 1 << (height - 1)];
let mut digests = vec![F::zero(); merkle_tree_digests_len(height, arity)];
let mut leaves_slice = HostOrDeviceSlice::on_host(leaves);
let ctx = get_default_device_context();
let constants = load_optimized_poseidon_constants::<F>(arity, &ctx).unwrap()
use icicle_bn254::tree::Bn254TreeBuilder;
use icicle_bn254::poseidon::Poseidon;
let mut config = TreeBuilderConfig::default();
config.keep_rows = 1;
build_poseidon_merkle_tree::<F>(&mut leaves_slice, &mut digests, height, arity, &constants, &config).unwrap();
let arity = 2;
config.arity = arity as u32;
let input_block_len = arity;
let leaves = vec![F::one(); (1 << height) * arity];
let mut digests = vec![F::zero(); merkle_tree_digests_len((height + 1) as u32, arity as u32, 1)];
println!("Root: {:?}", digests[0..1][0]);
let leaves_slice = HostSlice::from_slice(&leaves);
let digests_slice = HostSlice::from_mut_slice(&mut digests);
let ctx = device_context::DeviceContext::default();
let hash = Poseidon::load(2, &ctx).unwrap();
let mut config = TreeBuilderConfig::default();
config.keep_rows = 5;
Bn254TreeBuilder::build_merkle_tree(
leaves_slice,
digests_slice,
height,
input_block_len,
&hash,
&hash,
&config,
)
.unwrap();
```
Similar to Poseidon, you can also configure the Tree Builder `TreeBuilderConfig::default()`
@@ -206,9 +197,9 @@ Similar to Poseidon, you can also configure the Tree Builder `TreeBuilderConfig:
- `are_inputs_on_device`: Have the inputs been loaded to device memory?
- `is_async`: Should the TreeBuilder run asynchronously? `False` will block the current CPU thread. `True` will require you call `cudaStreamSynchronize` or `cudaDeviceSynchronize` to retrieve the result.
### Benchmarks
### Benchmarks
We ran the Poseidon tree builder on:
We ran the Poseidon tree builder on:
**CPU**: 12th Gen Intel(R) Core(TM) i9-12900K/
@@ -218,9 +209,8 @@ We ran the Poseidon tree builder on:
The benchmarks include copying data from and to the device.
| Rows to keep parameter | Run time, Icicle | Supranational PC2
| ----------- | ----------- | ----------- |
| ----------- | ----------- | -----------
| 10 | 9.4 seconds | 13.6 seconds
| 20 | 9.5 seconds | 13.6 seconds
| 29 | 13.7 seconds | 13.6 seconds


@@ -0,0 +1,88 @@
# Poseidon2
[Poseidon2](https://eprint.iacr.org/2023/323) is a recently released optimized version of Poseidon1. The two versions differ in two crucial points. First, Poseidon is a sponge hash function, while Poseidon2 can be either a sponge or a compression function depending on the use case. Secondly, Poseidon2 is instantiated by new and more efficient linear layers with respect to Poseidon. These changes decrease the number of multiplications in the linear layer by up to 90% and the number of constraints in Plonk circuits by up to 70%. This makes Poseidon2 currently the fastest arithmetization-oriented hash function without lookups.
## Using Poseidon2
ICICLE Poseidon2 is implemented for GPU and parallelization is performed for each state.
We calculate multiple hash-sums over multiple pre-images in parallel, rather than going block by block over the input vector.
For example, for Poseidon2 of width 16, input rate 8, output elements 8 and input of size 1024 * 8, we would expect 1024 * 8 elements of output. Each input block would be of size 8, resulting in 1024 Poseidon2 hashes being performed.
### Supported Bindings
[`Rust`](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/rust/icicle-core/src/poseidon2)
### Constants
Poseidon2 is also extremely customizable and using different constants will produce different hashes, security levels and performance results.
We support pre-calculated constants for each of the [supported curves](../core#supported-curves-and-operations). The constants can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/icicle/include/poseidon2/constants) and are labeled clearly per curve `<curve_name>_poseidon2.h`.
You can also use your own set of constants as shown [here](https://github.com/ingonyama-zk/icicle/blob/main/wrappers/rust/icicle-fields/icicle-babybear/src/poseidon2/mod.rs#L290).
### Rust API
This is the most basic way to use the Poseidon2 API.
```rust
let test_size = 1 << 10;
let width = 16;
let rate = 8;
let ctx = get_default_device_context();
let poseidon = Poseidon2::load(width, rate, MdsType::Default, DiffusionStrategy::Default, &ctx).unwrap();
let config = HashConfig::default();
let inputs = vec![F::one(); test_size * rate as usize];
let outputs = vec![F::zero(); test_size * 8]; // 8 output elements per hash
let mut input_slice = HostOrDeviceSlice::on_host(inputs);
let mut output_slice = HostOrDeviceSlice::on_host(outputs);
poseidon.hash_many::<F>(
&mut input_slice,
&mut output_slice,
test_size as u32,
rate as u32,
8, // Output length
&config,
)
.unwrap();
```
In the example above `Poseidon2::load(width, rate, MdsType::Default, DiffusionStrategy::Default, &ctx).unwrap();` is used to load the correct constants based on width and curve. Here, the default MDS matrices and diffusion are used. If you want to get a Plonky3 compliant version, set them to `MdsType::Plonky` and `DiffusionStrategy::Montgomery` respectively.
## The Tree Builder
Similar to Poseidon1, you can use Poseidon2 in a tree builder.
```rust
use icicle_bn254::tree::Bn254TreeBuilder;
use icicle_bn254::poseidon2::Poseidon2;
let mut config = TreeBuilderConfig::default();
let arity = 2;
config.arity = arity as u32;
let input_block_len = arity;
let leaves = vec![F::one(); (1 << height) * arity];
let mut digests = vec![F::zero(); merkle_tree_digests_len((height + 1) as u32, arity as u32, 1)];
let leaves_slice = HostSlice::from_slice(&leaves);
let digests_slice = HostSlice::from_mut_slice(&mut digests);
let ctx = device_context::DeviceContext::default();
let hash = Poseidon2::load(arity, arity, MdsType::Default, DiffusionStrategy::Default, &ctx).unwrap();
config.keep_rows = 5; // reuse the config created above so that `config.arity` is kept
Bn254TreeBuilder::build_merkle_tree(
leaves_slice,
digests_slice,
height,
input_block_len,
&hash,
&hash,
&config,
)
.unwrap();
```


@@ -12,7 +12,7 @@ Rust bindings allow you to use ICICLE as a rust library.
Simply add the following to your `Cargo.toml`.
```
```toml
# GPU Icicle integration
icicle-cuda-runtime = { git = "https://github.com/ingonyama-zk/icicle.git" }
icicle-core = { git = "https://github.com/ingonyama-zk/icicle.git" }
@@ -25,7 +25,7 @@ If you wish to point to a specific ICICLE branch add `branch = "<name_of_branch>
When you build your project ICICLE will be built as part of the build command.
# How do the rust bindings work?
## How do the rust bindings work?
The rust bindings are just rust wrappers for ICICLE Core static libraries which can be compiled. We integrate the compilation of the static libraries into rusts toolchain to make usage seamless and easy. This is achieved by [extending rusts build command](https://github.com/ingonyama-zk/icicle/blob/main/wrappers/rust/icicle-curves/icicle-bn254/build.rs).
@@ -55,3 +55,33 @@ fn main() {
println!("cargo:rustc-link-lib=cudart");
}
```
## Supported curves, fields and operations
### Supported curves and operations
| Operation\Curve | bn254 | bls12_377 | bls12_381 | bw6-761 | grumpkin |
| --- | :---: | :---: | :---: | :---: | :---: |
| MSM | ✅ | ✅ | ✅ | ✅ | ✅ |
| G2 | ✅ | ✅ | ✅ | ✅ | ❌ |
| NTT | ✅ | ✅ | ✅ | ✅ | ❌ |
| ECNTT | ✅ | ✅ | ✅ | ✅ | ❌ |
| VecOps | ✅ | ✅ | ✅ | ✅ | ✅ |
| Polynomials | ✅ | ✅ | ✅ | ✅ | ❌ |
| Poseidon | ✅ | ✅ | ✅ | ✅ | ✅ |
| Merkle Tree | ✅ | ✅ | ✅ | ✅ | ✅ |
### Supported fields and operations
| Operation\Field | babybear | stark252 |
| --- | :---: | :---: |
| VecOps | ✅ | ✅ |
| Polynomials | ✅ | ✅ |
| NTT | ✅ | ✅ |
| Extension Field | ✅ | ❌ |
### Supported hashes
| Hash | Sizes |
| --- | :---: |
| Keccak | 256, 512 |


@@ -1,9 +1,5 @@
# ECNTT
### Supported curves
`bls12-377`, `bls12-381`, `bn254`
## ECNTT Method
The `ecntt` function computes the Elliptic Curve Number Theoretic Transform (EC-NTT) or its inverse on a batch of points of a curve.
@@ -25,7 +21,7 @@ where
## Parameters
- **`input`**: The input data as a slice of `Projective<C>`. This represents points on a specific elliptic curve `C`.
- **`input`**: The input data as a slice of `Projective<C>`. This represents points on a specific elliptic curve `C`.
- **`dir`**: The direction of the NTT. It can be `NTTDir::kForward` for forward NTT or `NTTDir::kInverse` for inverse NTT.
- **`cfg`**: The NTT configuration object of type `NTTConfig<C::ScalarField>`. This object specifies parameters for the NTT computation, such as the batch size and algorithm to use.
- **`output`**: The output buffer to write the results into. This should be a slice of `Projective<C>` with the same size as the input.


@@ -0,0 +1,96 @@
# Keccak
## Keccak Example
```rust
use icicle_cuda_runtime::memory::{DeviceVec, HostSlice};
use icicle_hash::keccak::{keccak256, HashConfig};
use rand::{self, Rng};
fn main() {
let mut rng = rand::thread_rng();
let initial_data: Vec<u8> = (0..120).map(|_| rng.gen::<u8>()).collect();
println!("initial data: {}", hex::encode(&initial_data));
let input = HostSlice::<u8>::from_slice(initial_data.as_slice());
let mut output = DeviceVec::<u8>::cuda_malloc(32).unwrap();
let mut config = HashConfig::default();
keccak256(input, initial_data.len() as i32, 1, &mut output[..], &mut config).expect("Failed to execute keccak256 hashing");
let mut output_host = vec![0_u8; 32];
output.copy_to_host(HostSlice::from_mut_slice(&mut output_host[..])).unwrap();
println!("keccak256 result: {}", hex::encode(&output_host));
}
```
## Keccak Methods
```rust
pub fn keccak256(
input: &(impl HostOrDeviceSlice<u8> + ?Sized),
input_block_size: i32,
number_of_blocks: i32,
output: &mut (impl HostOrDeviceSlice<u8> + ?Sized),
config: &mut HashConfig,
) -> IcicleResult<()>
pub fn keccak512(
input: &(impl HostOrDeviceSlice<u8> + ?Sized),
input_block_size: i32,
number_of_blocks: i32,
output: &mut (impl HostOrDeviceSlice<u8> + ?Sized),
config: &mut HashConfig,
) -> IcicleResult<()>
```
### Parameters
- **`input`**: A slice containing the input data for the Keccak256 hash function. It can reside in either host memory or device memory.
- **`input_block_size`**: An integer specifying the size of the input data for a single hash.
- **`number_of_blocks`**: An integer specifying the number of results in the hash batch.
- **`output`**: A slice where the resulting hash will be stored. This slice can be in host or device memory.
- **`config`**: A mutable reference to a `HashConfig` object, which contains various configuration options for the Keccak256/Keccak512 operation.
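The relationship between `input_block_size`, `number_of_blocks` and the buffer lengths can be sketched as below. The helper is hypothetical and not part of the ICICLE API; it assumes the inputs of a batch are packed back to back and that each result occupies `digest_bits / 8` bytes.

```rust
/// Returns (input_len, output_len) in bytes for a batch of Keccak hashes.
/// `digest_bits` is 256 for keccak256 and 512 for keccak512.
fn keccak_buffer_lens(
    input_block_size: usize,
    number_of_blocks: usize,
    digest_bits: usize,
) -> (usize, usize) {
    assert!(digest_bits == 256 || digest_bits == 512);
    (
        input_block_size * number_of_blocks,
        digest_bits / 8 * number_of_blocks,
    )
}
```

For the example at the top of this page (one 120-byte input hashed with `keccak256`), this gives 120 input bytes and a 32-byte output buffer.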
### Return Value
- **`IcicleResult`**: Returns a CUDA error code indicating the success or failure of the Keccak256/Keccak512 operation.
## HashConfig
The `HashConfig` structure holds configuration parameters for the Keccak256/Keccak512 operation, allowing customization of its behavior to optimize performance based on the specifics of the operation or the underlying hardware.
```rust
pub struct HashConfig<'a> {
pub ctx: DeviceContext<'a>,
pub are_inputs_on_device: bool,
pub are_outputs_on_device: bool,
pub is_async: bool,
}
```
### Fields
- **`ctx`**: Device context containing details like device id and stream.
- **`are_inputs_on_device`**: Indicates if input data is located on the device.
- **`are_outputs_on_device`**: Indicates if output hash is stored on the device.
- **`is_async`**: If true, runs the Keccak256/Keccak512 operation asynchronously.
### Usage
Example initialization with default settings:
```rust
let default_config = HashConfig::default();
```
Customizing the configuration:
```rust
let custom_config = HashConfig {
ctx: custom_device_context,
are_inputs_on_device: true,
are_outputs_on_device: true,
is_async: false,
};
```


@@ -2,31 +2,24 @@
To understand the theory behind the MSM precomputation technique, refer to Niall Emmart's [talk](https://youtu.be/KAWlySN7Hm8?feature=shared&t=1734).
### Supported curves
`bls12-377`, `bls12-381`, `bn254`, `bw6-761`, `Grumpkin`
### `precompute_bases`
## `precompute_points`
Precomputes bases for the multi-scalar multiplication (MSM) by extending each base point with its multiples, facilitating more efficient MSM calculations.
```rust
pub fn precompute_bases<C: Curve + MSM<C>>(
points: &HostOrDeviceSlice<Affine<C>>,
precompute_factor: i32,
_c: i32,
ctx: &DeviceContext,
output_bases: &mut HostOrDeviceSlice<Affine<C>>,
pub fn precompute_points<C: Curve + MSM<C>>(
points: &(impl HostOrDeviceSlice<Affine<C>> + ?Sized),
msm_size: i32,
cfg: &MSMConfig,
output_bases: &mut DeviceSlice<Affine<C>>,
) -> IcicleResult<()>
```
#### Parameters
### Parameters
- **`points`**: The original set of affine points (\(P_1, P_2, ..., P_n\)) to be used in the MSM. For batch MSM operations, this should include all unique points concatenated together.
- **`precompute_factor`**: Specifies the total number of points to precompute for each base, including the base point itself. This parameter directly influences the memory requirements and the potential speedup of the MSM operation.
- **`_c`**: Currently unused. Intended for future use to align with the `c` parameter in `MSMConfig`, ensuring the precomputation is compatible with the bucket method's window size used in MSM.
- **`ctx`**: The device context specifying the device ID and stream for execution. This context determines where the precomputation is performed (e.g., on a specific GPU).
- **`msm_size`**: The size of a single MSM, used to determine optimal precomputation parameters.
- **`cfg`**: The MSM configuration parameters.
- **`output_bases`**: The output buffer for the extended bases. Its size must be `points.len() * precompute_factor`. This buffer should be allocated on the device for GPU computations.
#### Returns
@@ -42,22 +35,11 @@ The precomputation process is crucial for optimizing MSM operations, especially
#### Example Usage
```rust
let device_context = DeviceContext::default_for_device(0); // Use the default device
let cfg = MSMConfig::default();
let precompute_factor = 4; // Number of points to precompute
let mut extended_bases = HostOrDeviceSlice::cuda_malloc(expected_size).expect("Failed to allocate memory for extended bases");
// Precompute the bases using the specified factor
precompute_bases(&points, precompute_factor, 0, &device_context, &mut extended_bases)
precompute_points(&points, msm_size, &cfg, &mut extended_bases)
.expect("Failed to precompute bases");
```
### Benchmarks
Benchmarks were performed on an NVIDIA RTX 3090Ti.
| Pre-computation factor | bn254 size `2^20` MSM, ms. | bn254 size `2^12` MSM, size `2^10` batch, ms. | bls12-381 size `2^20` MSM, ms. | bls12-381 size `2^12` MSM, size `2^10` batch, ms. |
| ------------- | ------------- | ------------- | ------------- | ------------- |
| 1 | 14.1 | 82.8 | 25.5 | 136.7 |
| 2 | 11.8 | 76.6 | 20.3 | 123.8 |
| 4 | 10.9 | 73.8 | 18.1 | 117.8 |
| 8 | 10.6 | 73.7 | 17.2 | 116.0 |

View File

@@ -1,9 +1,5 @@
# MSM
### Supported curves
`bls12-377`, `bls12-381`, `bn-254`, `bw6-761`, `grumpkin`
## Example
```rust
@@ -84,7 +80,7 @@ pub struct MSMConfig<'a> {
```
- **`ctx: DeviceContext`**: Specifies the device context, device id and the CUDA stream for asynchronous execution.
- **`point_size: i32`**:
- **`point_size: i32`**:
- **`precompute_factor: i32`**: Determines the number of extra points to pre-compute for each point, affecting memory footprint and performance.
- **`c: i32`**: The "window bitsize," a parameter controlling the computational complexity and memory footprint of the MSM operation.
- **`bitsize: i32`**: The number of bits of the largest scalar, typically equal to the bit size of the scalar field.
@@ -104,7 +100,7 @@ When performing MSM operations, it's crucial to match the size of the `scalars`
## How do I toggle between the supported algorithms?
When creating your MSM Config you may state which algorithm you wish to use. `is_big_triangle=true` will activate Large triangle accumulation and `is_big_triangle=false` will activate Bucket accumulation.
When creating your MSM Config you may state which algorithm you wish to use. `is_big_triangle=true` will activate Large triangle reduction and `is_big_triangle=false` will activate iterative reduction.
```rust
...
@@ -120,7 +116,6 @@ msm::msm(&scalars, &points, &cfg, &mut msm_results).unwrap();
You may reference the rust code [here](https://github.com/ingonyama-zk/icicle/blob/77a7613aa21961030e4e12bf1c9a78a2dadb2518/wrappers/rust/icicle-core/src/msm/mod.rs#L54).
## How do I toggle between MSM modes?
Toggling between MSM modes occurs automatically based on the number of results you are expecting from the `msm::msm` function. If you are expecting an array of `msm_results`, ICICLE will automatically split `scalars` and `points` into equal parts and run them as multiple MSMs in parallel.
@@ -136,7 +131,6 @@ msm::msm(&scalars, &points, &cfg, &mut msm_result).unwrap();
In the example above we allocate a single expected result which the MSM method will interpret as `batch_size=1` and run a single MSM.
In the next example, we are expecting 10 results which sets `batch_size=10` and runs 10 MSMs in batch mode.
```rust
@@ -150,9 +144,13 @@ msm::msm(&scalars, &points, &cfg, &mut msm_results).unwrap();
Here is a [reference](https://github.com/ingonyama-zk/icicle/blob/77a7613aa21961030e4e12bf1c9a78a2dadb2518/wrappers/rust/icicle-core/src/msm/mod.rs#L108) to the code which automatically sets the batch size. For more MSM examples have a look [here](https://github.com/ingonyama-zk/icicle/blob/77a7613aa21961030e4e12bf1c9a78a2dadb2518/examples/rust/msm/src/main.rs#L1).
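The batch-size rule described above can be sketched in plain Rust. This is an illustration of the rule only, not the code ICICLE runs: `infer_batch` is a hypothetical name, and it assumes the batch size equals the number of expected results and that the scalars must split evenly across them.

```rust
/// Illustrative only: mirrors the rule described in the text, not ICICLE's implementation.
/// Returns `(batch_size, msm_size)` when `scalars_len` splits evenly across the results.
pub fn infer_batch(scalars_len: usize, results_len: usize) -> Option<(usize, usize)> {
    if results_len == 0 || scalars_len % results_len != 0 {
        return None; // the scalars cannot be split into equal parts
    }
    // One result means a single MSM; more results mean that many MSMs run as a batch.
    Some((results_len, scalars_len / results_len))
}
```

With 10 result slots and `10 * 1024` scalars, this yields a batch of 10 MSMs of size 1024 each; a single result slot yields one MSM over all scalars.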
## Parameters for optimal performance
Please refer to the [primitive description](../primitives/msm#choosing-optimal-parameters)
## Support for G2 group
MSM also supports G2 group.
MSM also supports G2 group.
Using MSM in G2 requires a G2 config, and of course your Points should also be G2 Points.

View File

@@ -1,10 +1,6 @@
# NTT
### Supported curves
`bls12-377`, `bls12-381`, `bn-254`, `bw6-761`
## Example
## Example
```rust
use icicle_bn254::curve::{ScalarCfg, ScalarField};
@@ -61,14 +57,13 @@ pub fn ntt<F>(
`ntt:ntt` expects:
- **`input`** - buffer to read the inputs of the NTT from. <br/>
- **`dir`** - whether to compute forward or inverse NTT. <br/>
- **`cfg`** - config used to specify extra arguments of the NTT. <br/>
- **`input`** - buffer to read the inputs of the NTT from.
- **`dir`** - whether to compute forward or inverse NTT.
- **`cfg`** - config used to specify extra arguments of the NTT.
- **`output`** - buffer to write the NTT outputs into. Must be of the same size as input.
The `input` and `output` buffers can be on device or on host. Being on host means that they will be transferred to device during runtime.
### NTT Config
```rust
@@ -107,8 +102,7 @@ The `NTTConfig` struct is a configuration object used to specify parameters for
- **`ntt_algorithm: NttAlgorithm`**: Can be one of `Auto`, `Radix2`, `MixedRadix`.
`Auto` will select `Radix 2` or `Mixed Radix` algorithm based on heuristics.
`Radix2` and `MixedRadix` will force the use of an algorithm regardless of the input size or other considerations. You should use one of these options when you know for sure that you want to
`Radix2` and `MixedRadix` will force the use of an algorithm regardless of the input size or other considerations. You should use one of these options when you know for sure that you want to
#### Usage
@@ -134,7 +128,6 @@ let custom_config = NTTConfig {
};
```
### Modes
NTT supports two different modes `Batch NTT` and `Single NTT`
@@ -187,7 +180,7 @@ where
- **`IcicleResult<()>`**: Will return an error if the operation fails.
### Releaseing the domain
### Releasing the domain
The `release_domain` function is responsible for releasing the resources associated with a specific domain in the CUDA device context.
@@ -205,4 +198,3 @@ where
#### Returns
The function returns an `IcicleResult<()>`, which represents the result of the operation. If the operation is successful, the function returns `Ok(())`, otherwise it returns an error.

View File

@@ -1,14 +1,16 @@
:::note Please refer to the Polynomials overview page for a deep overview. This section is a brief description of the Rust FFI bindings.
# Rust FFI Bindings for Univariate Polynomial
:::note
Please refer to the Polynomials overview page for a deep overview. This section is a brief description of the Rust FFI bindings.
:::
# Rust FFI Bindings for Univariate Polynomial
This documentation is designed to provide developers with a clear understanding of how to utilize the Rust bindings for polynomial operations efficiently and effectively, leveraging the robust capabilities of both Rust and C++ in their applications.
## Introduction
The Rust FFI bindings for the Univariate Polynomial serve as a "shallow wrapper" around the underlying C++ implementation. These bindings provide a straightforward Rust interface that directly calls functions from a C++ library, effectively bridging Rust and C++ operations. The Rust layer handles simple interface translations without delving into complex logic or data structures, which are managed on the C++ side. This design ensures efficient data handling, memory management, and execution of polynomial operations directly via C++.
Currently, these bindings are tailored specifically for polynomials where the coefficients, domain, and images are represented as scalar fields.
## Initialization Requirements
Before utilizing any functions from the polynomial API, it is mandatory to initialize the appropriate polynomial backend (e.g., CUDA). Additionally, the NTT (Number Theoretic Transform) domain must also be initialized, as the CUDA backend relies on this for certain operations. Failing to properly initialize these components can result in errors.
@@ -19,12 +21,12 @@ Before utilizing any functions from the polynomial API, it is mandatory to initi
The ICICLE library is structured such that each field or curve has its dedicated library implementation. As a result, initialization must be performed individually for each field or curve to ensure the correct setup and functionality of the library.
:::
## Core Trait: `UnivariatePolynomial`
The `UnivariatePolynomial` trait encapsulates the essential functionalities required for managing univariate polynomials in the Rust ecosystem. This trait standardizes the operations that can be performed on polynomials, regardless of the underlying implementation details. It allows for a unified approach to polynomial manipulation, providing a suite of methods that are fundamental to polynomial arithmetic.
### Trait Definition
```rust
pub trait UnivariatePolynomial
where
@@ -65,6 +67,9 @@ where
evals: &mut E,
);
// Method to evaluate the polynomial over the roots-of-unity domain for power-of-two sized domain
fn eval_on_rou_domain<E: HostOrDeviceSlice<Self::Field> + ?Sized>(&self, domain_log_size: u64, evals: &mut E);
// Method to retrieve a coefficient at a specific index.
fn get_coeff(&self, idx: u64) -> Self::Field;
@@ -77,6 +82,7 @@ where
```
## `DensePolynomial` Struct
The DensePolynomial struct represents a dense univariate polynomial in Rust, leveraging a handle to manage its underlying memory within the CUDA device context. This struct acts as a high-level abstraction over complex C++ memory management practices, facilitating the integration of high-performance polynomial operations through Rust's Foreign Function Interface (FFI) bindings.
```rust
@@ -88,15 +94,19 @@ pub struct DensePolynomial {
### Traits implementation and methods
#### `Drop`
Ensures proper resource management by releasing the CUDA memory when a DensePolynomial instance goes out of scope. This prevents memory leaks and ensures that resources are cleaned up correctly, adhering to Rust's RAII (Resource Acquisition Is Initialization) principles.
#### `Clone`
Provides a way to create a new instance of a DensePolynomial with its own unique handle, thus duplicating the polynomial data in the CUDA context. Cloning is essential since the DensePolynomial manages external resources, which cannot be safely shared across instances without explicit duplication.
#### Operator Overloading: `Add`, `Sub`, `Mul`, `Rem`, `Div`
These traits are implemented for references to DensePolynomial (i.e., &DensePolynomial), enabling natural mathematical operations such as addition (+), subtraction (-), multiplication (*), division (/), and remainder (%). This syntactic convenience allows users to compose complex polynomial expressions in a way that is both readable and expressive.
#### Key Methods
In addition to the traits, the following methods are implemented:
```rust
@@ -107,16 +117,16 @@ impl DensePolynomial {
}
```
:::note Might be consolidated with `UnivariatePolynomial` trait
:::
## Flexible Memory Handling With `HostOrDeviceSlice`
The DensePolynomial API is designed to accommodate a wide range of computational environments by supporting both host and device memory through the `HostOrDeviceSlice` trait. This approach ensures that polynomial operations can be seamlessly executed regardless of where the data resides, making the API highly adaptable and efficient for various hardware configurations.
### Overview of `HostOrDeviceSlice`
The HostOrDeviceSlice is a Rust trait that abstracts over slices of memory that can either be on the host (CPU) or the device (GPU), as managed by CUDA. This abstraction is crucial for high-performance computing scenarios where data might need to be moved between different memory spaces depending on the operations being performed and the specific hardware capabilities available.
### Usage in API Functions
Functions within the DensePolynomial API that deal with polynomial coefficients or evaluations use the HostOrDeviceSlice trait to accept inputs. This design allows the functions to be agnostic of the actual memory location of the data, whether it's in standard system RAM accessible by the CPU or in GPU memory accessible by CUDA cores.
```rust
@@ -132,10 +142,13 @@ let p_from_evals = PolynomialBabyBear::from_rou_evals(&evals, evals.len());
```
## Usage
This section outlines practical examples demonstrating how to utilize the `DensePolynomial` Rust API. The API is flexible, supporting multiple scalar fields. Below are examples showing how to use polynomials defined over different fields and perform a variety of operations.
### Initialization and Basic Operations
First, choose the appropriate field implementation for your polynomial operations, initializing the CUDA backend if necessary:
```rust
use icicle_babybear::polynomials::DensePolynomial as PolynomialBabyBear;
@@ -151,10 +164,10 @@ use icicle_bn254::polynomials::DensePolynomial as PolynomialBn254;
```
### Creation
Polynomials can be created from coefficients or evaluations:
```rust
// Assume F is the field type (e.g. icicle_bn254::curve::ScalarField or a type parameter)
let coeffs = ...;
let p_from_coeffs = PolynomialBabyBear::from_coeffs(HostSlice::from_slice(&coeffs), size);
@@ -164,6 +177,7 @@ let p_from_evals = PolynomialBabyBear::from_rou_evals(HostSlice::from_slice(&eva
```
### Arithmetic Operations
Utilize overloaded operators for intuitive mathematical expressions:
```rust
@@ -174,6 +188,7 @@ let mul_scalar = &f * &scalar; // Scalar multiplication
```
### Division and Remainder
Compute quotient and remainder or perform division by a vanishing polynomial:
```rust
@@ -186,6 +201,7 @@ let h = f.div_by_vanishing(N); // Division by V(x) = X^N - 1
```
### Monomial Operations
Add or subtract monomials in-place for efficient polynomial manipulation:
```rust
@@ -194,6 +210,7 @@ f.sub_monomial_inplace(&one, 0 /*monmoial*/); // Subtracts 1 from f
```
### Slicing
Extract specific components:
```rust
@@ -203,6 +220,7 @@ let arbitrary_slice = f.slice(offset, stride, size);
```
### Evaluate
Evaluate the polynomial:
```rust
@@ -213,9 +231,15 @@ let f_x = f.eval(&x); // Evaluate f at x
let domain = [one, two, three];
let mut host_evals = vec![ScalarField::zero(); domain.len()];
f.eval_on_domain(HostSlice::from_slice(&domain), HostSlice::from_mut_slice(&mut host_evals));
// Evaluate on roots-of-unity-domain
let domain_log_size = 4;
let mut device_evals = DeviceVec::<ScalarField>::cuda_malloc(1 << domain_log_size).unwrap();
f.eval_on_rou_domain(domain_log_size, &mut device_evals[..]);
```
### Read coefficients
Read or copy polynomial coefficients for further processing:
```rust
@@ -227,6 +251,7 @@ f.copy_coeffs(0, &mut device_mem[..]);
```
### Polynomial Degree
Determine the highest power of the variable with a non-zero coefficient:
```rust
@@ -234,6 +259,7 @@ let deg = f.degree(); // Degree of the polynomial
```
### Memory Management: Views (rust slices)
Rust enforces correct usage of views at compile time, eliminating the need for runtime checks:
```rust

View File

@@ -1,13 +1,6 @@
# Vector Operations API
Our vector operations API which is part of `icicle-cuda-runtime` package, includes fundamental methods for addition, subtraction, and multiplication of vectors, with support for both host and device memory.
## Supported curves
Vector operations are supported on the following curves:
`bls12-377`, `bls12-381`, `bn-254`, `bw6-761`, `grumpkin`
Our vector operations API which is part of `icicle-cuda-runtime` package, includes fundamental methods for addition, subtraction, and multiplication of vectors, with support for both host and device memory.
## Examples
@@ -59,7 +52,6 @@ let cfg = VecOpsConfig::default();
mul_scalars(&a, &ones, &mut result, &cfg).unwrap();
```
## Vector Operations Configuration
The `VecOpsConfig` struct encapsulates the settings for vector operations, including device context and operation modes.
@@ -90,7 +82,7 @@ pub struct VecOpsConfig<'a> {
`VecOpsConfig` can be initialized with default settings tailored for a specific device:
```
```rust
let cfg = VecOpsConfig::default();
```
@@ -118,7 +110,7 @@ impl<'a> VecOpsConfig<'a> {
## Vector Operations
Vector operations are implemented through the `VecOps` trait, these traits are implemented for all [supported curves](#supported-curves) providing methods for addition, subtraction, and multiplication of vectors.
Vector operations are implemented through the `VecOps` trait, providing methods for addition, subtraction, and multiplication of vectors.
### `VecOps` Trait
@@ -155,7 +147,6 @@ All operations are element-wise operations, and the results placed into the `res
- **`sub`**: Computes the element-wise difference between two vectors.
- **`mul`**: Performs element-wise multiplication of two vectors.
## MatrixTranspose API Documentation
This section describes the functionality of the `TransposeMatrix` function used for matrix transposition.
@@ -186,8 +177,8 @@ where
- **`column_size`**: The number of columns in the input matrix.
- **`output`**: A mutable slice to store the transposed matrix. The slice can be stored on either the host or the device.
- **`ctx`**: A reference to the `DeviceContext`, which provides information about the device where the operation will be performed.
- **`on_device`**: A boolean flag indicating whether the inputs and outputs are on the device.
- **`is_async`**: A boolean flag indicating whether the operation should be performed asynchronously.
- **`on_device`**: A boolean flag indicating whether the inputs and outputs are on the device.
- **`is_async`**: A boolean flag indicating whether the operation should be performed asynchronously.
### Return Value
@@ -209,9 +200,8 @@ transpose_matrix(&input, 5, 4, &mut output, &ctx, true, false)
.expect("Failed to transpose matrix");
```
The function takes a matrix represented as a 1D slice, transposes it, and stores the result in another 1D slice. The input and output slices can be stored on either the host or the device, and the operation can be performed synchronously or asynchronously.
The function is generic and can work with any type `F` that implements the `FieldImpl` trait. The `<F as FieldImpl>::Config` type must also implement the `VecOps<F>` trait, which provides the `transpose` method used to perform the actual transposition.
The function returns an `IcicleResult<()>`, indicating whether the operation was successful or not.
The function returns an `IcicleResult<()>`, indicating whether the operation was successful or not.
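For reference, the index mapping behind transposing a matrix stored as a 1D slice can be sketched on the CPU as below. This is a plain Rust illustration under the assumption of row-major storage; `transpose_row_major` is a hypothetical helper and does not reflect how the device kernel is implemented.

```rust
/// Plain-CPU sketch of transposing a row-major matrix stored as a 1D slice.
/// A `rows` x `cols` input becomes a `cols` x `rows` output.
pub fn transpose_row_major<T: Copy>(input: &[T], rows: usize, cols: usize, output: &mut [T]) {
    assert_eq!(input.len(), rows * cols, "input length must be rows * cols");
    assert_eq!(output.len(), rows * cols, "output length must match input length");
    for r in 0..rows {
        for c in 0..cols {
            // Element (r, c) of the input lands at (c, r) of the output.
            output[c * rows + r] = input[r * cols + c];
        }
    }
}
```

A small reference like this can be handy for checking device results on tiny inputs.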

View File

@@ -11,7 +11,7 @@ Ingonyama is a next-generation semiconductor company, focusing on Zero-Knowledge
Currently our flagship products are:
- **ICICLE**:
[ICICLE](https://github.com/ingonyama-zk/icicle) is a fully featured GPU accelerated cryptography library for building ZK provers. ICICLE allows you to accelerate your ZK existing protocols in a matter of hours or implement your protocol from scratch on GPU.
[ICICLE](https://github.com/ingonyama-zk/icicle) is a fully featured GPU accelerated cryptography library for building ZK provers. ICICLE allows you to accelerate your existing ZK protocols in a matter of hours or implement your protocol from scratch on GPU.
---
@@ -39,7 +39,7 @@ Learn more about ICICLE and GPUs [here][ICICLE-OVERVIEW].
## Get in Touch
If you have any questions, ideas, or are thinking of building something in this space join the discussion on [Discord]. You can explore our code on [github](https://github.com/ingonyama-zk) or read some of [our research papers](https://github.com/ingonyama-zk/papers).
If you have any questions, ideas, or are thinking of building something in this space, join the discussion on [Discord]. You can explore our code on [github](https://github.com/ingonyama-zk) or read some of [our research papers](https://github.com/ingonyama-zk/papers).
Follow us on [Twitter](https://x.com/Ingo_zk) and [YouTube](https://www.youtube.com/@ingo_ZK) and sign up for our [mailing list](https://wkf.ms/3LKCbdj) to get our latest announcements.

View File

@@ -163,8 +163,17 @@ const config = {
prism: {
theme: lightCodeTheme,
darkTheme: darkCodeTheme,
additionalLanguages: ['rust', 'go'],
},
image: 'img/logo.png',
announcementBar: {
id: 'announcement', // Any value that will identify this message.
content:
'<strong>🎉 Read our paper on the Polynomials API in ICICLE v2 by clicking <a target="_blank" rel="noopener noreferrer" href="https://eprint.iacr.org/2024/973">here</a>! 🎉</strong>',
backgroundColor: '#ADD8E6', // Light blue background color.
textColor: '#000000', // Black text color.
isCloseable: true, // Defaults to `true`.
},
}),
};

View File

@@ -24,6 +24,52 @@ module.exports = {
label: "ICICLE Core",
id: "icicle/core",
},
{
type: "category",
label: "Primitives",
link: {
type: `doc`,
id: 'icicle/primitives/overview',
},
collapsed: true,
items: [
{
type: "doc",
label: "MSM",
id: "icicle/primitives/msm",
},
{
type: "doc",
label: "NTT",
id: "icicle/primitives/ntt",
},
{
type: "doc",
label: "Keccak Hash",
id: "icicle/primitives/keccak",
},
{
type: "doc",
label: "Poseidon Hash",
id: "icicle/primitives/poseidon",
},
{
type: "doc",
label: "Poseidon2 Hash",
id: "icicle/primitives/poseidon2",
},
],
},
{
type: "doc",
label: "Polynomials",
id: "icicle/polynomials/overview",
},
{
type: "doc",
label: "Multi GPU Support",
id: "icicle/multi-gpu",
},
{
type: "category",
label: "Golang bindings",
@@ -64,6 +110,11 @@ module.exports = {
label: "Vector operations",
id: "icicle/golang-bindings/vec-ops",
},
{
type: "doc",
label: "Keccak Hash",
id: "icicle/golang-bindings/keccak",
},
{
type: "doc",
label: "Multi GPU Support",
@@ -111,6 +162,11 @@ module.exports = {
label: "Vector operations",
id: "icicle/rust-bindings/vec-ops",
},
{
type: "doc",
label: "Keccak Hash",
id: "icicle/rust-bindings/keccak",
},
{
type: "doc",
label: "Multi GPU Support",
@@ -123,42 +179,6 @@ module.exports = {
},
],
},
{
type: "category",
label: "Primitives",
link: {
type: `doc`,
id: 'icicle/primitives/overview',
},
collapsed: true,
items: [
{
type: "doc",
label: "MSM",
id: "icicle/primitives/msm",
},
{
type: "doc",
label: "NTT",
id: "icicle/primitives/ntt",
},
{
type: "doc",
label: "Poseidon Hash",
id: "icicle/primitives/poseidon",
},
],
},
{
type: "doc",
label: "Polynomials",
id: "icicle/polynomials/overview",
},
{
type: "doc",
label: "Multi GPU Support",
id: "icicle/multi-gpu",
},
{
type: "doc",
label: "Google Colab Instructions",
@@ -190,6 +210,7 @@ module.exports = {
type: "category",
label: "Additional Resources",
collapsed: false,
collapsible: false,
items: [
{
type: "link",

View File

@@ -1,6 +1,6 @@
# ZKContainer
We recommend using [ZKContainer](https://ingonyama.com/blog/Immanuel-ZKDC), where we have already preinstalled all the required dependencies, to run Icicle examples.
We recommend using [ZKContainer](https://www.ingonyama.com/blog/product-announcement-zk-containers), where we have already preinstalled all the required dependencies, to run Icicle examples.
To use our containers you will need [Docker](https://www.docker.com/) and [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html).
In each example directory, ZKContainer files are located in a subdirectory `.devcontainer`.

View File

@@ -0,0 +1,16 @@
cmake_minimum_required(VERSION 3.18)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED TRUE)
project(example)
add_executable(example example.cpp)
target_include_directories(example PRIVATE "../../../icicle_v3/include" "..")
target_link_directories(example PRIVATE "${CMAKE_SOURCE_DIR}/build/icicle")
message("${CMAKE_BINARY_DIR}/icicle")
target_link_libraries(example PRIVATE icicle_curve_bn254 icicle_field_bn254 icicle_device)
if(BACKEND_DIR)
add_compile_definitions(BACKEND_DIR="${BACKEND_DIR}")
endif()

View File

@@ -0,0 +1,35 @@
# ICICLE best practices: Concurrent Data Transfer and NTT Computation
The [Number Theoretic Transform (NTT)](https://dev.ingonyama.com/icicle/primitives/ntt) is an integral component of many cryptographic algorithms, such as polynomial multiplication in Zero Knowledge Proofs. The performance bottleneck of NTT on GPUs is the data transfer between the host (CPU) and the device (GPU). In a typical NVIDIA GPU this transfer dominates the total NTT execution time.
## Key-Takeaway
When you have to run several NTTs, consider Concurrent Data Download, Upload, and Computation to improve data bus (PCIe) and GPU utilization, and get better total execution time.
Typically, you concurrently
1. Download the output of the previous NTT back to the host
2. Upload the input for the next NTT to the device
3. Run the current NTT
> [!NOTE]
> This approach requires two on-device memory vectors, decreasing the maximum size of NTT by 2x.
## Best-Practices
1. Use three separate streams for Download from device, Upload to device, and Compute operations
2. Future: Use pinned (page-locked) memory on host to speed data bus transfers.
3. Compute in-place NTT.
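
The overlap of downloads and uploads on the transfer buffer can be pictured as a per-step schedule. The sketch below is plain Rust with no ICICLE calls and uses hypothetical names; it assumes the buffer is split into `nof_blocks` blocks and that the upload of block `i - 1` is issued alongside the download of block `i`, so a block is overwritten only after it has been read back.

```rust
/// One step of the transfer schedule: which block to download (device to host)
/// and which block to upload (host to device). Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
pub struct TransferStep {
    pub download: Option<usize>,
    pub upload: Option<usize>,
}

/// Builds the schedule for one buffer split into `nof_blocks` blocks.
pub fn transfer_schedule(nof_blocks: usize) -> Vec<TransferStep> {
    (0..=nof_blocks)
        .map(|i| TransferStep {
            // Every block is downloaded once, in order.
            download: if i < nof_blocks { Some(i) } else { None },
            // The upload trails the download by one block (write after read).
            upload: if i > 0 { Some(i - 1) } else { None },
        })
        .collect()
}
```

The schedule has `nof_blocks + 1` steps: the first step only downloads and the last step only uploads.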
## Running the example
To change the default curve BN254, edit `run.sh` and `CMakeLists.txt`
```sh
# for CPU
./run.sh -d CPU
# for CUDA
./run.sh -d CUDA -b /path/to/cuda/backend/install/dir
```
To compare with ICICLE baseline (i.e. non-concurrent) NTT, you can run [this example](../ntt/README.md).

View File

@@ -0,0 +1,126 @@
#include <stdio.h>
#include <iostream>
#include <string>
#include <chrono>
#include "icicle/runtime.h"
#include "icicle/api/bn254.h"
using namespace bn254;
#include "examples_utils.h"
#include "icicle/backend/ntt_config.h"
void initialize_input(const unsigned ntt_size, const unsigned batch_size, scalar_t* elements)
{
for (unsigned i = 0; i < ntt_size * batch_size; i++) {
elements[i] = scalar_t::from(i + 1);
}
}
int main(int argc, char* argv[])
{
try_load_and_set_backend_device(argc, argv);
// set these parameters to match the desired NTT size and batch size
const unsigned log_ntt_size = 20;
const unsigned batch_size = 16;
scalar_t basic_root = scalar_t::omega(log_ntt_size);
const unsigned ntt_size = 1 << log_ntt_size;
std::cout << "log NTT size: " << log_ntt_size << std::endl;
std::cout << "Batch size: " << batch_size << std::endl;
// Create separate streams for overlapping data transfers and kernel execution.
icicleStreamHandle stream_compute, stream_h2d, stream_d2h;
ICICLE_CHECK(icicle_create_stream(&stream_compute));
ICICLE_CHECK(icicle_create_stream(&stream_h2d));
ICICLE_CHECK(icicle_create_stream(&stream_d2h));
// Initialize NTT domain
std::cout << "Init NTT domain" << std::endl;
auto ntt_init_domain_cfg = default_ntt_init_domain_config();
// set CUDA backend specific flag for init_domain
ConfigExtension backend_cfg_ext;
backend_cfg_ext.set(CudaBackendConfig::CUDA_NTT_FAST_TWIDDLES_MODE, true);
ntt_init_domain_cfg.ext = &backend_cfg_ext;
ICICLE_CHECK(bn254_ntt_init_domain(&basic_root, ntt_init_domain_cfg));
std::cout << "Concurrent Download, Upload, and Compute In-place NTT" << std::endl;
int nof_blocks = 32;
int block_size = ntt_size * batch_size / nof_blocks;
std::cout << "Number of blocks: " << nof_blocks << ", block size: " << block_size << " elements" << std::endl;
// on-host data (allocated with new[], so pageable rather than pinned)
scalar_t* h_inp[2];
scalar_t* h_out[2];
for (int i = 0; i < 2; i++) {
h_inp[i] = new scalar_t[ntt_size * batch_size];
h_out[i] = new scalar_t[ntt_size * batch_size];
}
// on-device in-place data
// we need two on-device vectors to overlap data transfers with NTT kernel execution
scalar_t* d_vec[2];
for (int i = 0; i < 2; i++) {
ICICLE_CHECK(icicle_malloc((void**)&d_vec[i], sizeof(scalar_t) * ntt_size * batch_size));
}
// initialize input data
initialize_input(ntt_size, batch_size, h_inp[0]);
initialize_input(ntt_size, batch_size, h_inp[1]);
// ntt configuration
NTTConfig<scalar_t> config_compute = default_ntt_config<scalar_t>();
config_compute.batch_size = batch_size;
config_compute.are_inputs_on_device = true;
config_compute.are_outputs_on_device = true;
config_compute.is_async = true;
config_compute.stream = stream_compute;
// backend specific config extension
ConfigExtension ntt_cfg_ext;
ntt_cfg_ext.set(CudaBackendConfig::CUDA_NTT_ALGORITHM, CudaBackendConfig::NttAlgorithm::MixedRadix);
config_compute.ext = &ntt_cfg_ext;
for (int run = 0; run < 10; run++) {
int vec_compute = run % 2;
int vec_transfer = (run + 1) % 2;
std::cout << "Run: " << run << std::endl;
std::cout << "Compute Vector: " << vec_compute << std::endl;
std::cout << "Transfer Vector: " << vec_transfer << std::endl;
START_TIMER(inplace);
ICICLE_CHECK(bn254_ntt(d_vec[vec_compute], ntt_size, NTTDir::kForward, config_compute, d_vec[vec_compute]));
// we have to delay upload to device relative to download from device by one block: preserve write after read
for (int i = 0; i <= nof_blocks; i++) {
if (i < nof_blocks) {
// copy result back from device to host
ICICLE_CHECK(icicle_copy_async(
&h_out[vec_transfer][i * block_size], &d_vec[vec_transfer][i * block_size], sizeof(scalar_t) * block_size,
stream_d2h));
}
if (i > 0) {
// copy next input from host to device to alternate buffer
ICICLE_CHECK(icicle_copy_async(
&d_vec[vec_transfer][(i - 1) * block_size], &h_inp[vec_transfer][(i - 1) * block_size],
sizeof(scalar_t) * block_size, stream_h2d));
}
// synchronize upload and download at the end of the block to ensure data integrity
ICICLE_CHECK(icicle_stream_synchronize(stream_d2h));
ICICLE_CHECK(icicle_stream_synchronize(stream_h2d));
}
// synchronize compute stream with the end of the computation
ICICLE_CHECK(icicle_stream_synchronize(stream_compute));
END_TIMER(inplace, "Concurrent In-Place NTT");
}
// Clean-up
for (int i = 0; i < 2; i++) {
ICICLE_CHECK(icicle_free(d_vec[i]));
delete[] (h_inp[i]);
delete[] (h_out[i]);
}
ICICLE_CHECK(icicle_destroy_stream(stream_compute));
ICICLE_CHECK(icicle_destroy_stream(stream_d2h));
ICICLE_CHECK(icicle_destroy_stream(stream_h2d));
return 0;
}

View File

@@ -0,0 +1,65 @@
#!/bin/bash
# Exit immediately if a command exits with a non-zero status
set -e
# Function to display usage information
show_help() {
echo "Usage: $0 [-d DEVICE_TYPE] [-b ICICLE_BACKEND_INSTALL_DIR]"
echo
echo "Options:"
echo " -d DEVICE_TYPE Specify the device type (default: CPU)"
echo " -b ICICLE_BACKEND_INSTALL_DIR Specify the backend installation directory (default: empty)"
echo " -h Show this help message"
exit 0
}
# Parse command line options
while getopts ":d:b:h" opt; do
case ${opt} in
d )
DEVICE_TYPE=$OPTARG
;;
b )
ICICLE_BACKEND_INSTALL_DIR="$(realpath "${OPTARG}")"
;;
h )
show_help
;;
\? )
echo "Invalid option: -$OPTARG" 1>&2
show_help
;;
: )
echo "Invalid option: -$OPTARG requires an argument" 1>&2
show_help
;;
esac
done
# Set default values if not provided
: "${DEVICE_TYPE:=CPU}"
: "${ICICLE_BACKEND_INSTALL_DIR:=}"
# Create necessary directories
mkdir -p build/example
mkdir -p build/icicle
ICILE_DIR=$(realpath "../../../icicle_v3/")
ICICLE_CUDA_SOURCE_DIR="${ICILE_DIR}/backend/cuda"
# Build Icicle and the example app that links to it
if [ "$DEVICE_TYPE" == "CUDA" ] && [ ! -d "${ICICLE_BACKEND_INSTALL_DIR}" ] && [ -d "${ICICLE_CUDA_SOURCE_DIR}" ]; then
echo "Building icicle with CUDA backend"
cmake -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254 -DMSM=OFF -DCUDA_BACKEND=local -S "${ICILE_DIR}" -B build/icicle
export ICICLE_BACKEND_INSTALL_DIR=$(realpath "build/icicle/backend")
else
echo "Building icicle without CUDA backend, ICICLE_BACKEND_INSTALL_DIR=${ICICLE_BACKEND_INSTALL_DIR}"
cmake -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254 -DMSM=OFF -S "${ICILE_DIR}" -B build/icicle
fi
cmake -DCMAKE_BUILD_TYPE=Release -S . -B build/example
cmake --build build/icicle -j
cmake --build build/example -j
./build/example/example "$DEVICE_TYPE"


@@ -0,0 +1,34 @@
#pragma once
#include <chrono>
#include "icicle/runtime.h"
// Timer
using FpMilliseconds = std::chrono::duration<float, std::chrono::milliseconds::period>;
#define START_TIMER(timer) auto timer##_start = std::chrono::high_resolution_clock::now();
#define END_TIMER(timer, msg) \
printf("%s: %.0f ms\n", msg, FpMilliseconds(std::chrono::high_resolution_clock::now() - timer##_start).count());
// Load and choose backend
void try_load_and_set_backend_device(int argc = 0, char** argv = nullptr)
{
icicle_load_backend_from_env_or_default();
const char* selected_device = argc > 1 ? argv[1] : nullptr;
if (selected_device) {
ICICLE_LOG_INFO << "selecting " << selected_device << " device";
ICICLE_CHECK(icicle_set_device(selected_device));
return;
}
// trying to choose CUDA if available, or fallback to CPU otherwise (default device)
const bool is_cuda_device_available = (eIcicleError::SUCCESS == icicle_is_device_avialable("CUDA"));
if (is_cuda_device_available) {
Device device = {"CUDA", 0}; // GPU-0
ICICLE_LOG_INFO << "setting " << device;
ICICLE_CHECK(icicle_set_device(device));
return;
}
ICICLE_LOG_INFO << "CUDA device not available, falling back to CPU";
}


@@ -1,23 +1,16 @@
cmake_minimum_required(VERSION 3.18)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_STANDARD_REQUIRED TRUE)
set(CMAKE_CXX_STANDARD_REQUIRED TRUE)
if (${CMAKE_VERSION} VERSION_LESS "3.24.0")
set(CMAKE_CUDA_ARCHITECTURES ${CUDA_ARCH})
else()
set(CMAKE_CUDA_ARCHITECTURES native) # on 3.24+, on earlier it is ignored, and the target is not passed
endif ()
project(example LANGUAGES CUDA CXX)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --expt-relaxed-constexpr")
set(CMAKE_CUDA_FLAGS_RELEASE "")
set(CMAKE_CUDA_FLAGS_DEBUG "${CMAKE_CUDA_FLAGS_DEBUG} -g -G -O0")
project(example)
add_executable(example example.cpp)
target_include_directories(example PRIVATE "../../../icicle_v3/include" "..")
target_link_directories(example PRIVATE "${CMAKE_SOURCE_DIR}/build/icicle")
message("${CMAKE_BINARY_DIR}/icicle")
target_link_libraries(example PRIVATE icicle_curve_bn254 icicle_field_bn254 icicle_device)
if(BACKEND_DIR)
add_compile_definitions(BACKEND_DIR="${BACKEND_DIR}")
endif()
add_executable(
example
example.cu
)
target_include_directories(example PRIVATE "../../../icicle/include")
target_link_libraries(example ${CMAKE_SOURCE_DIR}/build/icicle/lib/libingo_curve_bn254.a)
set_target_properties(example PROPERTIES CUDA_SEPARABLE_COMPILATION ON)


@@ -1,46 +1,43 @@
# Icicle example: Multi-Scalar Multiplication (MSM)
## Best-Practices
We recommend running our examples in [ZK-containers](../../ZK-containers.md) to save time and effort.
## Key-Takeaway
`Icicle` provides CUDA C++ template function `MSM` to accelerate [Multi-Scalar Multiplication](https://github.com/ingonyama-zk/ingopedia/blob/master/src/msm.md).
## Concise Usage Explanation
1. Select the curve
2. Include an MSM template
3. Configure MSM
4. Call the template
1. Include the curve api
2. Configure MSM
3. Call msm api
```c++
#define CURVE_ID 1
#include "icicle/appUtils/msm/msm.cu"
#include "icicle/api/bn254.h"
...
msm::MSMConfig config = {...};
MSMConfig config = default_msm_config();
...
msm::MSM<scalar_t, affine_t, projective_t>(scalars, points, size, config, &result);
bn254_msm(scalars, points, size, config, &result);
```
In this example we use `BN254` curve (`CURVE_ID=1`). The function computes $result = \sum_{i=0}^{size-1} scalars[i] \cdot points[i]$, where input `points[]` use affine coordinates, and `result` uses projective coordinates.
In this example we use `BN254` curve. The function computes $result = \sum_{i=0}^{size-1} scalars[i] \cdot points[i]$, where input `points[]` use affine coordinates, and `result` uses projective coordinates.
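
To make the sum above concrete, here is a minimal sketch that evaluates the same formula over a toy group. It assumes the additive group of integers modulo a small prime as a stand-in for curve points, so it is not an elliptic-curve implementation and not how `bn254_msm` works internally; the names `toy_scalar_mul` and `toy_msm` are hypothetical.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy group: integers modulo P under addition. This stands in for curve points
// purely for illustration; it is NOT an elliptic-curve group.
constexpr uint64_t P = 1000003;

// scalar * point by double-and-add, the same pattern used for curve points.
uint64_t toy_scalar_mul(uint64_t scalar, uint64_t point)
{
  uint64_t acc = 0; // group identity
  uint64_t base = point % P;
  while (scalar > 0) {
    if (scalar & 1) acc = (acc + base) % P; // "add"
    base = (base + base) % P;               // "double"
    scalar >>= 1;
  }
  return acc;
}

// result = sum_i scalars[i] * points[i]
uint64_t toy_msm(const std::vector<uint64_t>& scalars, const std::vector<uint64_t>& points)
{
  assert(scalars.size() == points.size());
  uint64_t result = 0;
  for (std::size_t i = 0; i < scalars.size(); ++i)
    result = (result + toy_scalar_mul(scalars[i], points[i])) % P;
  return result;
}
```

For instance, `toy_msm({2, 3}, {5, 7})` evaluates $2 \cdot 5 + 3 \cdot 7 = 31$. A real MSM replaces the modular additions with curve point additions and uses bucket methods rather than this naive loop.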
**Parameters:**
The configuration is passed to the kernel as a structure of type `msm::MSMConfig`. Some of the most important fields are listed below:
The configuration is passed to the kernel as a structure of type `MSMConfig`. Some of the most important fields are listed below:
- `are_scalars_on_device`, `are_points_on_device`, `are_results_on_device`: location of the data
- `is_async`: blocking vs. non-blocking kernel call
- `large_bucket_factor`: distinguishes between large bucket and normal bucket sizes. If there is a scalar distribution that is skewed heavily to a few values we can operate on those separately from the rest of the values. The ideal value here can vary by circuit (based on the distribution of scalars) but start with 10 and adjust to see if it improves performance.
- In addition, you can pass backend-specific params via `config.extConfig`. For example, the CUDA backend accepts a `large_bucket_factor` param.
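
The idea behind `large_bucket_factor` can be sketched on the host. The following assumes one plausible reading of the description above: a bucket counts as "large" when it holds more than `large_bucket_factor` times the average number of entries. The helper name `find_large_buckets` is hypothetical and the actual backend logic may differ.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the indices of buckets whose occupancy exceeds
// large_bucket_factor times the average occupancy.
std::vector<std::size_t> find_large_buckets(
  const std::vector<uint32_t>& bucket_ids, std::size_t nof_buckets, unsigned large_bucket_factor)
{
  assert(nof_buckets > 0);
  // Count how many scalars fall into each bucket
  std::vector<std::size_t> counts(nof_buckets, 0);
  for (uint32_t id : bucket_ids)
    counts[id % nof_buckets]++;

  const double average = static_cast<double>(bucket_ids.size()) / nof_buckets;
  const double threshold = large_bucket_factor * average;

  std::vector<std::size_t> large;
  for (std::size_t b = 0; b < nof_buckets; ++b)
    if (counts[b] > threshold) large.push_back(b);
  return large;
}
```

With a heavily skewed input, a small factor flags the dominant bucket for separate treatment, while a larger factor leaves every bucket on the normal path.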
## Running the example
- `cd` to your example directory
- compile with `./compile.sh`
- run with `./run.sh`
```sh
# for CPU
./run.sh -d CPU
# for CUDA
./run.sh -d CUDA -b /path/to/cuda/backend/install/dir
```
## What's in the example
@@ -49,4 +46,4 @@ The configuration is passed to the kernel as a structure of type `msm::MSMConfig
3. Configure and execute MSM using on-host data
4. Copy inputs on-device
5. Configure and execute MSM using on-device data
6. Repeat the above steps for G2 points
6. Repeat step 3 for G2 MSM points


@@ -1,15 +0,0 @@
#!/bin/bash
# Exit immediately on error
set -e
mkdir -p build/example
mkdir -p build/icicle
# Configure and build Icicle
cmake -S ../../../icicle/ -B build/icicle -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254 -DG2=ON
cmake --build build/icicle
# Configure and build the example application
cmake -S . -B build/example
cmake --build build/example


@@ -0,0 +1,101 @@
#include <fstream>
#include <iostream>
#include <iomanip>
#include "icicle/runtime.h"
#include "icicle/api/bn254.h"
using namespace bn254;
#include "examples_utils.h"
int main(int argc, char* argv[])
{
try_load_and_set_backend_device(argc, argv);
std::cout << "\nIcicle example: Multi-Scalar Multiplication (MSM)" << std::endl;
std::cout << "Example parameters" << std::endl;
int batch_size = 1;
unsigned msm_size = 1 << 10;
int N = batch_size * msm_size;
std::cout << "Batch size: " << batch_size << std::endl;
std::cout << "MSM size: " << msm_size << std::endl;
std::cout << "\nPart I: use G1 points" << std::endl;
std::cout << "Generating random inputs on-host" << std::endl;
auto scalars = std::make_unique<scalar_t[]>(N);
auto points = std::make_unique<affine_t[]>(N);
projective_t result;
scalar_t::rand_host_many(scalars.get(), N);
projective_t::rand_host_many(points.get(), N);
std::cout << "Using default MSM configuration with on-host inputs" << std::endl;
auto config = default_msm_config();
config.batch_size = batch_size;
std::cout << "\nRunning MSM kernel with on-host inputs" << std::endl;
// Execute the MSM kernel
START_TIMER(MSM_host_mem);
ICICLE_CHECK(bn254_msm(scalars.get(), points.get(), msm_size, config, &result));
END_TIMER(MSM_host_mem, "MSM from host-memory took");
std::cout << projective_t::to_affine(result) << std::endl;
DeviceProperties device_props;
ICICLE_CHECK(icicle_get_device_properties(device_props));
// If device does not share memory with host, copy inputs explicitly and execute msm with device pointers
if (!device_props.using_host_memory) {
std::cout << "\nReconfiguring MSM to use on-device inputs" << std::endl;
config.are_results_on_device = true;
config.are_scalars_on_device = true;
config.are_points_on_device = true;
std::cout << "Copying inputs to-device" << std::endl;
scalar_t* scalars_d;
affine_t* points_d;
projective_t* result_d;
ICICLE_CHECK(icicle_malloc((void**)&scalars_d, sizeof(scalar_t) * N));
ICICLE_CHECK(icicle_malloc((void**)&points_d, sizeof(affine_t) * N));
ICICLE_CHECK(icicle_malloc((void**)&result_d, sizeof(projective_t)));
ICICLE_CHECK(icicle_copy(scalars_d, scalars.get(), sizeof(scalar_t) * N));
ICICLE_CHECK(icicle_copy(points_d, points.get(), sizeof(affine_t) * N));
std::cout << "Running MSM kernel with on-device inputs" << std::endl;
// Execute the MSM kernel
START_TIMER(MSM_device_mem);
ICICLE_CHECK(msm(scalars_d, points_d, msm_size, config, result_d));
END_TIMER(MSM_device_mem, "MSM from device-memory took");
// Copy the result back to the host
ICICLE_CHECK(icicle_copy(&result, result_d, sizeof(projective_t)));
// Print the result
std::cout << projective_t::to_affine(result) << std::endl;
// Free the device memory
icicle_free(scalars_d);
icicle_free(points_d);
icicle_free(result_d);
}
std::cout << "\nPart II: use G2 points" << std::endl;
std::cout << "Generating random inputs on-host" << std::endl;
// use the same scalars
auto g2_points = std::make_unique<g2_affine_t[]>(N);
g2_projective_t::rand_host_many(g2_points.get(), N);
std::cout << "Reconfiguring MSM to use on-host inputs" << std::endl;
config.are_results_on_device = false;
config.are_scalars_on_device = false;
config.are_points_on_device = false;
g2_projective_t g2_result;
START_TIMER(MSM_g2);
ICICLE_CHECK(bn254_g2_msm(scalars.get(), g2_points.get(), msm_size, config, &g2_result));
END_TIMER(MSM_g2, "MSM G2 from host-memory took");
std::cout << g2_projective_t::to_affine(g2_result) << std::endl;
// Similar to G1 MSM, can explicitly copy to device and execute the G2 MSM using device pointers
return 0;
}


@@ -1,124 +0,0 @@
#include <fstream>
#include <iostream>
#include <iomanip>
#include "api/bn254.h"
using namespace bn254;
int main(int argc, char* argv[])
{
std::cout << "Icicle example: Multi-Scalar Multiplication (MSM)" << std::endl;
std::cout << "Example parameters" << std::endl;
int batch_size = 1;
std::cout << "Batch size: " << batch_size << std::endl;
unsigned msm_size = 1048576;
std::cout << "MSM size: " << msm_size << std::endl;
int N = batch_size * msm_size;
std::cout << "Part I: use G1 points" << std::endl;
std::cout << "Generating random inputs on-host" << std::endl;
scalar_t* scalars = new scalar_t[N];
affine_t* points = new affine_t[N];
projective_t result;
scalar_t::rand_host_many(scalars, N);
projective_t::rand_host_many_affine(points, N);
std::cout << "Using default MSM configuration with on-host inputs" << std::endl;
device_context::DeviceContext ctx = device_context::get_default_device_context();
msm::MSMConfig config = {
ctx, // ctx
0, // points_size
1, // precompute_factor
0, // c
0, // bitsize
10, // large_bucket_factor
1, // batch_size
false, // are_scalars_on_device
false, // are_scalars_montgomery_form
false, // are_points_on_device
false, // are_points_montgomery_form
false, // are_results_on_device
false, // is_big_triangle
false, // is_async
};
config.batch_size = batch_size;
std::cout << "Running MSM kernel with on-host inputs" << std::endl;
cudaStream_t stream = config.ctx.stream;
// Execute the MSM kernel
bn254_msm_cuda(scalars, points, msm_size, config, &result);
std::cout << projective_t::to_affine(result) << std::endl;
std::cout << "Copying inputs on-device" << std::endl;
scalar_t* scalars_d;
affine_t* points_d;
projective_t* result_d;
cudaMalloc(&scalars_d, sizeof(scalar_t) * N);
cudaMalloc(&points_d, sizeof(affine_t) * N);
cudaMalloc(&result_d, sizeof(projective_t));
cudaMemcpy(scalars_d, scalars, sizeof(scalar_t) * N, cudaMemcpyHostToDevice);
cudaMemcpy(points_d, points, sizeof(affine_t) * N, cudaMemcpyHostToDevice);
std::cout << "Reconfiguring MSM to use on-device inputs" << std::endl;
config.are_results_on_device = true;
config.are_scalars_on_device = true;
config.are_points_on_device = true;
std::cout << "Running MSM kernel with on-device inputs" << std::endl;
// Execute the MSM kernel
bn254_msm_cuda(scalars_d, points_d, msm_size, config, result_d);
// Copy the result back to the host
cudaMemcpy(&result, result_d, sizeof(projective_t), cudaMemcpyDeviceToHost);
// Print the result
std::cout << projective_t::to_affine(result) << std::endl;
// Free the device memory
cudaFree(scalars_d);
cudaFree(points_d);
cudaFree(result_d);
// Free the host memory, keep scalars for G2 example
delete[] points;
std::cout << "Part II: use G2 points" << std::endl;
std::cout << "Generating random inputs on-host" << std::endl;
// use the same scalars
g2_affine_t* g2_points = new g2_affine_t[N];
g2_projective_t::rand_host_many_affine(g2_points, N);
std::cout << "Reconfiguring MSM to use on-host inputs" << std::endl;
config.are_results_on_device = false;
config.are_scalars_on_device = false;
config.are_points_on_device = false;
g2_projective_t g2_result;
bn254_g2_msm_cuda(scalars, g2_points, msm_size, config, &g2_result);
std::cout << g2_projective_t::to_affine(g2_result) << std::endl;
std::cout << "Copying inputs on-device" << std::endl;
g2_affine_t* g2_points_d;
g2_projective_t* g2_result_d;
cudaMalloc(&scalars_d, sizeof(scalar_t) * N);
cudaMalloc(&g2_points_d, sizeof(g2_affine_t) * N);
cudaMalloc(&g2_result_d, sizeof(g2_projective_t));
cudaMemcpy(scalars_d, scalars, sizeof(scalar_t) * N, cudaMemcpyHostToDevice);
cudaMemcpy(g2_points_d, g2_points, sizeof(g2_affine_t) * N, cudaMemcpyHostToDevice);
std::cout << "Reconfiguring MSM to use on-device inputs" << std::endl;
config.are_results_on_device = true;
config.are_scalars_on_device = true;
config.are_points_on_device = true;
std::cout << "Running MSM kernel with on-device inputs" << std::endl;
bn254_g2_msm_cuda(scalars_d, g2_points_d, msm_size, config, g2_result_d);
cudaMemcpy(&g2_result, g2_result_d, sizeof(g2_projective_t), cudaMemcpyDeviceToHost);
std::cout << g2_projective_t::to_affine(g2_result) << std::endl;
cudaFree(scalars_d);
cudaFree(g2_points_d);
cudaFree(g2_result_d);
delete[] g2_points;
delete[] scalars;
cudaStreamDestroy(stream);
return 0;
}


@@ -1,2 +1,65 @@
#!/bin/bash
./build/example/example
# Exit immediately if a command exits with a non-zero status
set -e
# Function to display usage information
show_help() {
echo "Usage: $0 [-d DEVICE_TYPE] [-b ICICLE_BACKEND_INSTALL_DIR]"
echo
echo "Options:"
echo " -d DEVICE_TYPE Specify the device type (default: CPU)"
echo " -b ICICLE_BACKEND_INSTALL_DIR Specify the backend installation directory (default: empty)"
echo " -h Show this help message"
exit 0
}
# Parse command line options
while getopts ":d:b:h" opt; do
case ${opt} in
d )
DEVICE_TYPE=$OPTARG
;;
b )
ICICLE_BACKEND_INSTALL_DIR="$(realpath "${OPTARG}")"
;;
h )
show_help
;;
\? )
echo "Invalid option: -$OPTARG" 1>&2
show_help
;;
: )
echo "Invalid option: -$OPTARG requires an argument" 1>&2
show_help
;;
esac
done
# Set default values if not provided
: "${DEVICE_TYPE:=CPU}"
: "${ICICLE_BACKEND_INSTALL_DIR:=}"
# Create necessary directories
mkdir -p build/example
mkdir -p build/icicle
ICILE_DIR=$(realpath "../../../icicle_v3/")
ICICLE_CUDA_SOURCE_DIR="${ICILE_DIR}/backend/cuda"
# Build Icicle and the example app that links to it
if [ "$DEVICE_TYPE" == "CUDA" ] && [ ! -d "${ICICLE_BACKEND_INSTALL_DIR}" ] && [ -d "${ICICLE_CUDA_SOURCE_DIR}" ]; then
echo "Building icicle with CUDA backend"
cmake -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254 -DG2=ON -DCUDA_BACKEND=local -S "${ICILE_DIR}" -B build/icicle
export ICICLE_BACKEND_INSTALL_DIR=$(realpath "build/icicle/backend")
else
echo "Building icicle without CUDA backend, ICICLE_BACKEND_INSTALL_DIR=${ICICLE_BACKEND_INSTALL_DIR}"
cmake -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254 -DG2=ON -S "${ICILE_DIR}" -B build/icicle
fi
cmake -DCMAKE_BUILD_TYPE=Release -S . -B build/example
cmake --build build/icicle -j
cmake --build build/example -j
./build/example/example "$DEVICE_TYPE"


@@ -1,15 +1,18 @@
#!/bin/bash
# Exit immediately on error
set -e
# TODO update for V3
mkdir -p build/example
mkdir -p build/icicle
# # Exit immediately on error
# set -e
# Configure and build Icicle
cmake -S ../../../icicle/ -B build/icicle -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254
cmake --build build/icicle
# mkdir -p build/example
# mkdir -p build/icicle
# Configure and build the example application
cmake -S . -B build/example
cmake --build build/example
# # Configure and build Icicle
# cmake -S ../../../icicle/ -B build/icicle -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254
# cmake --build build/icicle
# # Configure and build the example application
# cmake -S . -B build/example
# cmake --build build/example


@@ -6,140 +6,148 @@
#include "api/bn254.h"
#include "gpu-utils/error_handler.cuh"
#include "poseidon/poseidon.cuh"
#include "hash/hash.cuh"
using namespace poseidon;
using namespace bn254;
void checkCudaError(cudaError_t error) {
if (error != cudaSuccess) {
std::cerr << "CUDA error: " << cudaGetErrorString(error) << std::endl;
// Handle the error, e.g., exit the program or throw an exception.
}
void checkCudaError(cudaError_t error)
{
if (error != cudaSuccess) {
std::cerr << "CUDA error: " << cudaGetErrorString(error) << std::endl;
// Handle the error, e.g., exit the program or throw an exception.
}
}
// these global constants go into template calls
const int size_col = 11;
// this function executes the Poseidon thread
void threadPoseidon(device_context::DeviceContext ctx, unsigned size_partition, scalar_t * layers, scalar_t * column_hashes, PoseidonConstants<scalar_t> * constants) {
cudaError_t err_result = CHK_STICKY(cudaSetDevice(ctx.device_id));
if (err_result != cudaSuccess) {
std::cerr << "CUDA error: " << cudaGetErrorString(err_result) << std::endl;
return;
}
// CHK_IF_RETURN(); I can't use it in a standard thread function
PoseidonConfig column_config = {
ctx, // ctx
false, // are_inputes_on_device
false, // are_outputs_on_device
false, // input_is_a_state
false, // aligned
false, // loop_state
false, // is_async
};
cudaError_t err = bn254_poseidon_hash_cuda(layers, column_hashes, (size_t) size_partition, size_col, *constants, column_config);
checkCudaError(err);
void threadPoseidon(
device_context::DeviceContext ctx,
unsigned size_partition,
scalar_t* layers,
scalar_t* column_hashes,
Poseidon<scalar_t>* poseidon)
{
cudaError_t err_result = CHK_STICKY(cudaSetDevice(ctx.device_id));
if (err_result != cudaSuccess) {
std::cerr << "CUDA error: " << cudaGetErrorString(err_result) << std::endl;
return;
}
HashConfig column_config = default_hash_config(ctx);
cudaError_t err = poseidon->hash_many(layers, column_hashes, (size_t)size_partition, size_col, 1, column_config);
checkCudaError(err);
}
using FpMilliseconds = std::chrono::duration<float, std::chrono::milliseconds::period>;
#define START_TIMER(timer) auto timer##_start = std::chrono::high_resolution_clock::now();
#define END_TIMER(timer, msg) printf("%s: %.0f ms\n", msg, FpMilliseconds(std::chrono::high_resolution_clock::now() - timer##_start).count());
#define END_TIMER(timer, msg) \
printf("%s: %.0f ms\n", msg, FpMilliseconds(std::chrono::high_resolution_clock::now() - timer##_start).count());
#define CHECK_ALLOC(ptr) \
if ((ptr) == nullptr) { \
std::cerr << "Memory allocation for '" #ptr "' failed." << std::endl; \
exit(EXIT_FAILURE); \
}
#define CHECK_ALLOC(ptr) if ((ptr) == nullptr) { \
std::cerr << "Memory allocation for '" #ptr "' failed." << std::endl; \
exit(EXIT_FAILURE); \
}
int main() {
const unsigned size_row = (1<<30);
const unsigned nof_partitions = 64;
const unsigned size_partition = size_row / nof_partitions;
// layers is allocated only for one partition, need to reuse for different partitions
const uint32_t size_layers = size_col * size_partition;
nvmlInit();
unsigned int deviceCount;
nvmlDeviceGetCount(&deviceCount);
std::cout << "Available GPUs: " << deviceCount << std::endl;
for (unsigned int i = 0; i < deviceCount; ++i) {
nvmlDevice_t device;
nvmlMemory_t memory;
char name[NVML_DEVICE_NAME_BUFFER_SIZE];
nvmlDeviceGetHandleByIndex(i, &device);
nvmlDeviceGetName(device, name, NVML_DEVICE_NAME_BUFFER_SIZE);
nvmlDeviceGetMemoryInfo(device, &memory);
std::cout << "Device ID: " << i << ", Type: " << name << ", Memory Total/Free (MiB) " << memory.total/1024/1024 << "/" << memory.free/1024/1024 << std::endl;
}
const unsigned memory_partition = sizeof(scalar_t)*(size_col+1)*size_partition/1024/1024;
std::cout << "Required Memory (MiB) " << memory_partition << std::endl;
//===============================================================================
// Key: multiple devices are supported by device context
//===============================================================================
device_context::DeviceContext ctx0 = device_context::get_default_device_context();
ctx0.device_id=0;
device_context::DeviceContext ctx1 = device_context::get_default_device_context();
ctx1.device_id=1;
std::cout << "Allocate and initialize the memory for layers and hashes" << std::endl;
scalar_t* layers0 = static_cast<scalar_t*>(malloc(size_layers * sizeof(scalar_t)));
CHECK_ALLOC(layers0);
scalar_t s = scalar_t::zero();
for (unsigned i = 0; i < size_col*size_partition ; i++) {
layers0[i] = s;
s = s + scalar_t::one();
}
scalar_t* layers1 = static_cast<scalar_t*>(malloc(size_layers * sizeof(scalar_t)));
CHECK_ALLOC(layers1);
s = scalar_t::zero() + scalar_t::one();
for (unsigned i = 0; i < size_col*size_partition ; i++) {
layers1[i] = s;
s = s + scalar_t::one();
}
scalar_t* column_hash0 = static_cast<scalar_t*>(malloc(size_partition * sizeof(scalar_t)));
CHECK_ALLOC(column_hash0);
scalar_t* column_hash1 = static_cast<scalar_t*>(malloc(size_partition * sizeof(scalar_t)));
CHECK_ALLOC(column_hash1);
PoseidonConstants<scalar_t> column_constants0, column_constants1;
bn254_init_optimized_poseidon_constants_cuda(size_col, ctx0, &column_constants0);
cudaError_t err_result = CHK_STICKY(cudaSetDevice(ctx1.device_id));
if (err_result != cudaSuccess) {
std::cerr << "CUDA error: " << cudaGetErrorString(err_result) << std::endl;
return;
}
bn254_init_optimized_poseidon_constants_cuda(size_col, ctx1, &column_constants1);
std::cout << "Parallel execution of Poseidon threads" << std::endl;
START_TIMER(parallel);
std::thread thread0(threadPoseidon, ctx0, size_partition, layers0, column_hash0, &column_constants0);
std::thread thread1(threadPoseidon, ctx1, size_partition, layers1, column_hash1, &column_constants1);
// Wait for the threads to finish
thread0.join();
thread1.join();
END_TIMER(parallel,"2 GPUs");
std::cout << "Output Data from Thread 0: ";
std::cout << column_hash0[0] << std::endl;
std::cout << "Output Data from Thread 1: ";
std::cout << column_hash1[0] << std::endl;
std::cout << "Sequential execution of Poseidon threads" << std::endl;
START_TIMER(sequential);
std::thread thread2(threadPoseidon, ctx0, size_partition, layers0, column_hash0, &column_constants0);
thread2.join();
std::thread thread3(threadPoseidon, ctx0, size_partition, layers1, column_hash1, &column_constants0);
thread3.join();
END_TIMER(sequential,"1 GPU");
std::cout << "Output Data from Thread 2: ";
std::cout << column_hash0[0] << std::endl;
std::cout << "Output Data from Thread 3: ";
std::cout << column_hash1[0] << std::endl;
nvmlShutdown();
return 0;
#define CHECK_ALLOC(ptr) \
if ((ptr) == nullptr) { \
std::cerr << "Memory allocation for '" #ptr "' failed." << std::endl; \
exit(EXIT_FAILURE); \
}
int main()
{
const unsigned size_row = (1 << 30);
const unsigned nof_partitions = 64;
const unsigned size_partition = size_row / nof_partitions;
// layers is allocated only for one partition, need to reuse for different partitions
const uint32_t size_layers = size_col * size_partition;
nvmlInit();
unsigned int deviceCount;
nvmlDeviceGetCount(&deviceCount);
std::cout << "Available GPUs: " << deviceCount << std::endl;
for (unsigned int i = 0; i < deviceCount; ++i) {
nvmlDevice_t device;
nvmlMemory_t memory;
char name[NVML_DEVICE_NAME_BUFFER_SIZE];
nvmlDeviceGetHandleByIndex(i, &device);
nvmlDeviceGetName(device, name, NVML_DEVICE_NAME_BUFFER_SIZE);
nvmlDeviceGetMemoryInfo(device, &memory);
std::cout << "Device ID: " << i << ", Type: " << name << ", Memory Total/Free (MiB) " << memory.total / 1024 / 1024
<< "/" << memory.free / 1024 / 1024 << std::endl;
}
const unsigned memory_partition = sizeof(scalar_t) * (size_col + 1) * size_partition / 1024 / 1024;
std::cout << "Required Memory (MiB) " << memory_partition << std::endl;
//===============================================================================
// Key: multiple devices are supported by device context
//===============================================================================
device_context::DeviceContext ctx0 = device_context::get_default_device_context();
ctx0.device_id = 0;
device_context::DeviceContext ctx1 = device_context::get_default_device_context();
ctx1.device_id = 1;
std::cout << "Allocate and initialize the memory for layers and hashes" << std::endl;
scalar_t* layers0 = static_cast<scalar_t*>(malloc(size_layers * sizeof(scalar_t)));
CHECK_ALLOC(layers0);
scalar_t s = scalar_t::zero();
for (unsigned i = 0; i < size_col * size_partition; i++) {
layers0[i] = s;
s = s + scalar_t::one();
}
scalar_t* layers1 = static_cast<scalar_t*>(malloc(size_layers * sizeof(scalar_t)));
CHECK_ALLOC(layers1);
s = scalar_t::zero() + scalar_t::one();
for (unsigned i = 0; i < size_col * size_partition; i++) {
layers1[i] = s;
s = s + scalar_t::one();
}
scalar_t* column_hash0 = static_cast<scalar_t*>(malloc(size_partition * sizeof(scalar_t)));
CHECK_ALLOC(column_hash0);
scalar_t* column_hash1 = static_cast<scalar_t*>(malloc(size_partition * sizeof(scalar_t)));
CHECK_ALLOC(column_hash1);
Poseidon<scalar_t> column_poseidon0(size_col, ctx0);
cudaError_t err_result = CHK_STICKY(cudaSetDevice(ctx1.device_id));
if (err_result != cudaSuccess) {
std::cerr << "CUDA error: " << cudaGetErrorString(err_result) << std::endl;
return 1; // main returns int, so report failure with a non-zero exit code
}
Poseidon<scalar_t> column_poseidon1(size_col, ctx1);
std::cout << "Parallel execution of Poseidon threads" << std::endl;
START_TIMER(parallel);
std::thread thread0(threadPoseidon, ctx0, size_partition, layers0, column_hash0, &column_poseidon0);
std::thread thread1(threadPoseidon, ctx1, size_partition, layers1, column_hash1, &column_poseidon1);
// Wait for the threads to finish
thread0.join();
thread1.join();
END_TIMER(parallel, "2 GPUs");
std::cout << "Output Data from Thread 0: ";
std::cout << column_hash0[0] << std::endl;
std::cout << "Output Data from Thread 1: ";
std::cout << column_hash1[0] << std::endl;
std::cout << "Sequential execution of Poseidon threads" << std::endl;
START_TIMER(sequential);
std::thread thread2(threadPoseidon, ctx0, size_partition, layers0, column_hash0, &column_poseidon0);
thread2.join();
std::thread thread3(threadPoseidon, ctx0, size_partition, layers1, column_hash1, &column_poseidon0);
thread3.join();
END_TIMER(sequential, "1 GPU");
std::cout << "Output Data from Thread 2: ";
std::cout << column_hash0[0] << std::endl;
std::cout << "Output Data from Thread 3: ";
std::cout << column_hash1[0] << std::endl;
nvmlShutdown();
return 0;
}


@@ -1,2 +1,4 @@
#!/bin/bash
./build/example/example
# TODO update for V3
# ./build/example/example


@@ -1,23 +0,0 @@
# Make sure NVIDIA Container Toolkit is installed on your host
# Use NVIDIA base image
FROM nvidia/cuda:12.2.0-devel-ubuntu22.04
# Update and install dependencies
RUN apt-get update && apt-get install -y \
nsight-systems-12.2 \
cmake \
protobuf-compiler \
curl \
build-essential \
git \
&& rm -rf /var/lib/apt/lists/*
# Clone Icicle from a GitHub repository
RUN git clone https://github.com/ingonyama-zk/icicle.git /icicle
# Set the working directory in the container
WORKDIR /icicle-example
# Specify the default command for the container
CMD ["/bin/bash"]


@@ -1,24 +0,0 @@
{
"name": "Icicle Examples - Multiply",
"build": {
"dockerfile": "Dockerfile"
},
"workspaceMount": "source=${localWorkspaceFolder}/.,target=/icicle-example,type=bind",
"workspaceFolder": "/icicle-example",
"runArgs": [
"--gpus",
"all"
],
"postCreateCommand": [
"nvidia-smi"
],
"customizations": {
"vscode": {
"extensions": [
"ms-vscode.cmake-tools",
"ms-azuretools.vscode-docker",
"ms-vscode.cpptools-extension-pack"
]
}
}
}


@@ -1,25 +0,0 @@
cmake_minimum_required(VERSION 3.18)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_STANDARD_REQUIRED TRUE)
set(CMAKE_CXX_STANDARD_REQUIRED TRUE)
if (${CMAKE_VERSION} VERSION_LESS "3.24.0")
set(CMAKE_CUDA_ARCHITECTURES ${CUDA_ARCH})
else()
set(CMAKE_CUDA_ARCHITECTURES native) # on 3.24+, on earlier it is ignored, and the target is not passed
endif ()
project(example LANGUAGES CUDA CXX)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --expt-relaxed-constexpr")
set(CMAKE_CUDA_FLAGS_RELEASE "")
set(CMAKE_CUDA_FLAGS_DEBUG "${CMAKE_CUDA_FLAGS_DEBUG} -g -G -O0")
add_executable(
example
example.cu
)
target_include_directories(example PRIVATE "../../../icicle/include")
target_link_libraries(example ${CMAKE_SOURCE_DIR}/build/icicle/lib/libingo_field_bn254.a)
find_library(NVML_LIBRARY nvidia-ml PATHS /usr/local/cuda/targets/x86_64-linux/lib/stubs/ )
target_link_libraries(example ${NVML_LIBRARY})
set_target_properties(example PROPERTIES CUDA_SEPARABLE_COMPILATION ON)


@@ -1,41 +0,0 @@
# Icicle example: Multiplication
## Best-Practices
We recommend running our examples in [ZK-containers](../../ZK-containers.md) to save time and effort.
## Key-Takeaway
`Icicle` accelerates the multiplication operation `*` using the [Karatsuba algorithm](https://en.wikipedia.org/wiki/Karatsuba_algorithm).
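
As an illustration of the Karatsuba idea only, the sketch below multiplies two 32-bit machine words using three 16-bit-limb products instead of four. It is not ICICLE's field multiplier, which operates on multi-limb field elements with modular reduction; the function name `karatsuba_mul32` is hypothetical.

```c++
#include <cstdint>

// Multiply two 32-bit values with three limb multiplications (Karatsuba)
// instead of the four used by schoolbook multiplication.
uint64_t karatsuba_mul32(uint32_t a, uint32_t b)
{
  // Split each operand into low and high 16-bit limbs
  const uint64_t a0 = a & 0xffffu, a1 = a >> 16;
  const uint64_t b0 = b & 0xffffu, b1 = b >> 16;

  const uint64_t z0 = a0 * b0;                         // low product
  const uint64_t z2 = a1 * b1;                         // high product
  const uint64_t z1 = (a0 + a1) * (b0 + b1) - z0 - z2; // equals a0*b1 + a1*b0

  // Recombine: a*b = z2 * 2^32 + z1 * 2^16 + z0
  return (z2 << 32) + (z1 << 16) + z0;
}
```

The saving comes from computing the cross term `z1` with a single multiplication; applied recursively to wide operands, this reduces the number of limb multiplications compared with the schoolbook method.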
## Concise Usage Explanation
Define a `CURVE_ID` and include curve configuration header:
```c++
#define CURVE_ID 1
#include "curves/curve_config.cuh"
```
The values of `CURVE_ID` for different curves are in the above header. Multiplication is accelerated both for field scalars and point fields.
```c++
using namespace curve_config;
scalar_t a;
point_field_t b;
```
## Running the example
- `cd` to your example directory
- compile with `./compile.sh`
- run with `./run.sh`
## What's in the example
1. Define the parameters for the example such as vector size
2. Generate random vectors on-host
3. Copy them on-device
4. Execute element-wise vector multiplication on-device
5. Copy results on-host


@@ -1,15 +0,0 @@
#!/bin/bash
# Exit immediately on error
set -e
mkdir -p build/example
mkdir -p build/icicle
# Configure and build Icicle
cmake -S ../../../icicle/ -B build/icicle -DMSM=OFF -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254
cmake --build build/icicle
# Configure and build the example application
cmake -S . -B build/example
cmake --build build/example


@@ -1,162 +0,0 @@
#include <iostream>
#include <iomanip>
#include <chrono>
#include <nvml.h>
#include "api/bn254.h"
#include "vec_ops/vec_ops.cuh"
using namespace vec_ops;
using namespace bn254;
typedef scalar_t T;
int vector_mult(T* vec_b, T* vec_a, T* vec_result, size_t n_elments, device_context::DeviceContext ctx)
{
vec_ops::VecOpsConfig config = vec_ops::DefaultVecOpsConfig();
config.is_a_on_device = true;
config.is_b_on_device = true;
config.is_result_on_device = true;
cudaError_t err = bn254_mul_cuda(vec_a, vec_b, n_elments, config, vec_result);
if (err != cudaSuccess) {
std::cerr << "Failed to multiply vectors - " << cudaGetErrorString(err) << std::endl;
return 0;
}
return 0;
}
int main(int argc, char** argv)
{
const unsigned vector_size = 1 << 15;
const unsigned repetitions = 1 << 15;
cudaError_t err;
nvmlInit();
nvmlDevice_t device;
nvmlDeviceGetHandleByIndex(0, &device); // for GPU 0
std::cout << "Icicle-Examples: vector multiplications" << std::endl;
char name[NVML_DEVICE_NAME_BUFFER_SIZE];
if (nvmlDeviceGetName(device, name, NVML_DEVICE_NAME_BUFFER_SIZE) == NVML_SUCCESS) {
std::cout << "GPU Model: " << name << std::endl;
} else {
std::cerr << "Failed to get GPU model name." << std::endl;
}
unsigned power_limit;
nvmlDeviceGetPowerManagementLimit(device, &power_limit);
std::cout << "Vector size: " << vector_size << std::endl;
std::cout << "Repetitions: " << repetitions << std::endl;
std::cout << "Power limit: " << std::fixed << std::setprecision(3) << 1.0e-3 * power_limit << " W" << std::endl;
unsigned int baseline_power;
nvmlDeviceGetPowerUsage(device, &baseline_power);
std::cout << "Baseline power: " << std::fixed << std::setprecision(3) << 1.0e-3 * baseline_power << " W" << std::endl;
unsigned baseline_temperature;
if (nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU, &baseline_temperature) == NVML_SUCCESS) {
std::cout << "Baseline GPU Temperature: " << baseline_temperature << " C" << std::endl;
} else {
std::cerr << "Failed to get GPU temperature." << std::endl;
}
// host data
T* host_in1 = (T*)malloc(vector_size * sizeof(T));
T* host_in2 = (T*)malloc(vector_size * sizeof(T));
std::cout << "Initializing vectors with random data" << std::endl;
T::rand_host_many(host_in1, vector_size);
T::rand_host_many(host_in2, vector_size);
// device data
device_context::DeviceContext ctx = device_context::get_default_device_context();
T* device_in1;
T* device_in2;
T* device_out;
err = cudaMalloc((void**)&device_in1, vector_size * sizeof(T));
if (err != cudaSuccess) {
std::cerr << "Failed to allocate device memory - " << cudaGetErrorString(err) << std::endl;
return 0;
}
err = cudaMalloc((void**)&device_in2, vector_size * sizeof(T));
if (err != cudaSuccess) {
std::cerr << "Failed to allocate device memory - " << cudaGetErrorString(err) << std::endl;
return 0;
}
err = cudaMalloc((void**)&device_out, vector_size * sizeof(T));
if (err != cudaSuccess) {
std::cerr << "Failed to allocate device memory - " << cudaGetErrorString(err) << std::endl;
return 0;
}
// copy from host to device
err = cudaMemcpy(device_in1, host_in1, vector_size * sizeof(T), cudaMemcpyHostToDevice);
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host to device - " << cudaGetErrorString(err) << std::endl;
return 0;
}
err = cudaMemcpy(device_in2, host_in2, vector_size * sizeof(T), cudaMemcpyHostToDevice);
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host to device - " << cudaGetErrorString(err) << std::endl;
return 0;
}
std::cout << "Starting warm-up" << std::endl;
// Warm-up loop
for (int i = 0; i < repetitions; i++) {
vector_mult(device_in1, device_in2, device_out, vector_size, ctx);
}
std::cout << "Starting benchmarking" << std::endl;
unsigned power_before;
nvmlDeviceGetPowerUsage(device, &power_before);
std::cout << "Power before: " << std::fixed << std::setprecision(3) << 1.0e-3 * power_before << " W" << std::endl;
std::cout << "Power utilization: " << std::fixed << std::setprecision(1) << (float)100.0 * power_before / power_limit
<< " %" << std::endl;
unsigned temperature_before;
if (nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU, &temperature_before) == NVML_SUCCESS) {
std::cout << "GPU Temperature before: " << temperature_before << " C" << std::endl;
} else {
std::cerr << "Failed to get GPU temperature." << std::endl;
}
auto start_time = std::chrono::high_resolution_clock::now();
// Benchmark loop
for (int i = 0; i < repetitions; i++) {
vector_mult(device_in1, device_in2, device_out, vector_size, ctx);
}
auto end_time = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);
std::cout << "Elapsed time: " << duration.count() << " microseconds" << std::endl;
unsigned power_after;
nvmlDeviceGetPowerUsage(device, &power_after);
std::cout << "Power after: " << std::fixed << std::setprecision(3) << 1.0e-3 * power_after << " W" << std::endl;
std::cout << "Power utilization: " << std::fixed << std::setprecision(1) << (float)100.0 * power_after / power_limit
<< " %" << std::endl;
unsigned temperature_after;
if (nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU, &temperature_after) == NVML_SUCCESS) {
std::cout << "GPU Temperature after: " << temperature_after << " C" << std::endl;
} else {
std::cerr << "Failed to get GPU temperature." << std::endl;
}
// Report performance in GMPS: Giga Multiplications Per Second
double GMPS = 1.0e-9 * repetitions * vector_size / (1.0e-6 * duration.count());
std::cout << "Performance: " << GMPS << " Giga Multiplications Per Second" << std::endl;
// Optional: validate multiplication
T* host_out = (T*)malloc(vector_size * sizeof(T));
cudaMemcpy(host_out, device_out, vector_size * sizeof(T), cudaMemcpyDeviceToHost);
// validate multiplication here...
// clean up and exit
free(host_in1);
free(host_in2);
free(host_out);
cudaFree(device_in1);
cudaFree(device_in2);
cudaFree(device_out);
nvmlShutdown();
return 0;
}

View File

@@ -1,2 +0,0 @@
#!/bin/bash
./build/example/example

View File

@@ -1,23 +1,16 @@
cmake_minimum_required(VERSION 3.18)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_STANDARD_REQUIRED TRUE)
set(CMAKE_CXX_STANDARD_REQUIRED TRUE)
if (${CMAKE_VERSION} VERSION_LESS "3.24.0")
set(CMAKE_CUDA_ARCHITECTURES ${CUDA_ARCH})
else()
set(CMAKE_CUDA_ARCHITECTURES native) # on 3.24+, on earlier it is ignored, and the target is not passed
endif ()
project(example LANGUAGES CUDA CXX)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --expt-relaxed-constexpr")
set(CMAKE_CUDA_FLAGS_RELEASE "")
set(CMAKE_CUDA_FLAGS_DEBUG "${CMAKE_CUDA_FLAGS_DEBUG} -g -G -O0")
project(example)
add_executable(example example.cpp)
target_include_directories(example PRIVATE "../../../icicle_v3/include" "..")
target_link_directories(example PRIVATE "${CMAKE_SOURCE_DIR}/build/icicle")
message("${CMAKE_BINARY_DIR}/icicle")
target_link_libraries(example PRIVATE icicle_curve_bn254 icicle_field_bn254 icicle_device)
if(BACKEND_DIR)
add_compile_definitions(BACKEND_DIR="${BACKEND_DIR}")
endif()
add_executable(
example
example.cu
)
target_include_directories(example PRIVATE "../../../icicle/include")
target_link_libraries(example ${CMAKE_SOURCE_DIR}/build/icicle/lib/libingo_field_bn254.a)
set_target_properties(example PROPERTIES CUDA_SEPARABLE_COMPILATION ON)

View File

@@ -1,33 +1,35 @@
# Icicle example: Number-Theoretical Transform (NTT)
## Best-Practices
We recommend running our examples in [ZK-containers](../../ZK-containers.md) to save your time and mental energy.
## Key-Takeaway
`Icicle` provides the CUDA C++ template function `NTT` for the [Number Theoretical Transform](https://github.com/ingonyama-zk/ingopedia/blob/master/src/fft.md), the finite-field analogue of the Discrete Fourier Transform.
## Concise Usage Explanation
1. Include the curve api
2. Init NTT domain
3. Call ntt api
```c++
// Select the curve
#define CURVE_ID 1
// Include NTT template
#include "appUtils/ntt/ntt.cu"
using namespace curve_config;
using namespace ntt;
// Configure NTT
NTTConfig<S> config=DefaultNTTConfig<S>();
// Call NTT
NTT<S, E>(input, ntt_size, NTTDir::kForward, config, output);
#include "icicle/api/bn254.h"
...
auto ntt_init_domain_cfg = default_ntt_init_domain_config();
...
bn254_ntt_init_domain(&basic_root, ntt_init_domain_cfg);
NTTConfig<scalar_t> config = default_ntt_config<scalar_t>();
...
bn254_ntt(input.get(), ntt_size, NTTDir::kForward, config, output.get());
```
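To make the transform itself concrete, the following is a toy, library-independent sketch of a forward NTT computed naively in O(n²). The modulus 17, the root of unity 2 (which has order 8 modulo 17), and the helper names `pow_mod` and `naive_ntt` are assumptions chosen for illustration; none of them come from `Icicle`. It mirrors the check done in the example source: a constant input concentrates at index 0, and an alternating `1, -1` input concentrates at index `n / 2`, both with amplitude `n`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Modular exponentiation by repeated squaring (hypothetical helper).
std::uint32_t pow_mod(std::uint32_t base, std::uint32_t exp, std::uint32_t p)
{
  std::uint64_t result = 1, b = base % p;
  while (exp > 0) {
    if (exp & 1u) result = result * b % p;
    b = b * b % p;
    exp >>= 1;
  }
  return static_cast<std::uint32_t>(result);
}

// Naive forward NTT: X[k] = sum_j x[j] * omega^(j*k) mod p.
// omega must be a primitive n-th root of unity modulo p, with n = x.size().
std::vector<std::uint32_t> naive_ntt(const std::vector<std::uint32_t>& x, std::uint32_t omega, std::uint32_t p)
{
  const std::size_t n = x.size();
  std::vector<std::uint32_t> out(n, 0);
  for (std::size_t k = 0; k < n; ++k) {
    std::uint64_t acc = 0;
    for (std::size_t j = 0; j < n; ++j) {
      const std::uint32_t twiddle = pow_mod(omega, static_cast<std::uint32_t>((j * k) % n), p);
      acc = (acc + static_cast<std::uint64_t>(x[j]) * twiddle) % p;
    }
    out[k] = static_cast<std::uint32_t>(acc);
  }
  return out;
}
```

A production NTT replaces the double loop with butterfly stages (radix-2 or mixed-radix) to reach O(n log n); the naive form is only meant to show what is being computed.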
## Running the example
- `cd` to your example directory
- compile with `./compile.sh`
- run with `./run.sh`
```sh
# for CPU
./run.sh -d CPU
# for CUDA
./run.sh -d CUDA -b /path/to/cuda/backend/install/dir
```
## What's in the example

View File

@@ -1,15 +0,0 @@
#!/bin/bash
# Exit immediately on error
set -e
mkdir -p build/example
mkdir -p build/icicle
# Configure and build Icicle
cmake -S ../../../icicle/ -B build/icicle -DMSM=OFF -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254
cmake --build build/icicle
# Configure and build the example application
cmake -S . -B build/example
cmake --build build/example

View File

@@ -0,0 +1,106 @@
#include <iostream>
#include "icicle/runtime.h"
#include "icicle/api/bn254.h"
using namespace bn254;
#include "examples_utils.h"
#include "icicle/backend/ntt_config.h"
void initialize_input(const unsigned ntt_size, const unsigned nof_ntts, scalar_t* elements);
int validate_output(const unsigned ntt_size, const unsigned nof_ntts, scalar_t* elements);
int main(int argc, char* argv[])
{
try_load_and_set_backend_device(argc, argv);
std::cout << "\nIcicle Examples: Number Theoretical Transform (NTT)" << std::endl;
const unsigned log_ntt_size = 20;
const unsigned ntt_size = 1 << log_ntt_size;
const unsigned batch_size = 2;
std::cout << "Example parameters:" << std::endl;
std::cout << "NTT size: " << ntt_size << std::endl;
std::cout << "batch size: " << batch_size << std::endl;
std::cout << "\nGenerating input data for lowest and highest harmonics" << std::endl;
auto input = std::make_unique<scalar_t[]>(batch_size * ntt_size);
auto output = std::make_unique<scalar_t[]>(batch_size * ntt_size);
initialize_input(ntt_size, batch_size, input.get());
// Initialize NTT domain
std::cout << "\nInit NTT domain" << std::endl;
scalar_t basic_root = scalar_t::omega(log_ntt_size /*NTT_LOG_SIZE*/);
auto ntt_init_domain_cfg = default_ntt_init_domain_config();
ConfigExtension backend_cfg_ext;
backend_cfg_ext.set(
CudaBackendConfig::CUDA_NTT_FAST_TWIDDLES_MODE, true); // optionally construct fast_twiddles for CUDA backend
ntt_init_domain_cfg.ext = &backend_cfg_ext;
ICICLE_CHECK(bn254_ntt_init_domain(&basic_root, ntt_init_domain_cfg));
// ntt configuration
NTTConfig<scalar_t> config = default_ntt_config<scalar_t>();
ConfigExtension ntt_cfg_ext;
config.ext = &ntt_cfg_ext;
config.batch_size = batch_size;
// warmup
ICICLE_CHECK(bn254_ntt(input.get(), ntt_size, NTTDir::kForward, config, output.get()));
// NTT radix-2 alg
std::cout << "\nRunning NTT radix-2 alg with on-host data" << std::endl;
ntt_cfg_ext.set(CudaBackendConfig::CUDA_NTT_ALGORITHM, CudaBackendConfig::NttAlgorithm::Radix2);
START_TIMER(Radix2);
ICICLE_CHECK(bn254_ntt(input.get(), ntt_size, NTTDir::kForward, config, output.get()));
END_TIMER(Radix2, "Radix2 NTT");
std::cout << "Validating output" << std::endl;
validate_output(ntt_size, batch_size, output.get());
// NTT mixed-radix alg
std::cout << "\nRunning NTT mixed-radix alg with on-host data" << std::endl;
ntt_cfg_ext.set(CudaBackendConfig::CUDA_NTT_ALGORITHM, CudaBackendConfig::NttAlgorithm::MixedRadix);
START_TIMER(MixedRadix);
ICICLE_CHECK(bn254_ntt(input.get(), ntt_size, NTTDir::kForward, config, output.get()));
END_TIMER(MixedRadix, "MixedRadix NTT");
std::cout << "Validating output" << std::endl;
validate_output(ntt_size, batch_size, output.get());
return 0;
}
void initialize_input(const unsigned ntt_size, const unsigned nof_ntts, scalar_t* elements)
{
// Lowest Harmonics
for (unsigned i = 0; i < ntt_size; i = i + 1) {
elements[i] = scalar_t::one();
}
// Highest Harmonics
for (unsigned i = 1 * ntt_size; i < 2 * ntt_size; i = i + 2) {
elements[i] = scalar_t::one();
elements[i + 1] = scalar_t::neg(scalar_t::one());
}
}
int validate_output(const unsigned ntt_size, const unsigned nof_ntts, scalar_t* elements)
{
int nof_errors = 0;
scalar_t amplitude = scalar_t::from((uint32_t)ntt_size);
// Lowest Harmonics
if (elements[0] != amplitude) {
++nof_errors;
std::cout << "Error in lowest harmonics 0! " << std::endl;
} else {
std::cout << "Validated lowest harmonics" << std::endl;
}
// Highest Harmonics
if (elements[1 * ntt_size + ntt_size / 2] != amplitude) {
++nof_errors;
std::cout << "Error in highest harmonics! " << std::endl;
} else {
std::cout << "Validated highest harmonics" << std::endl;
}
return nof_errors;
}

View File

@@ -1,113 +0,0 @@
#include <chrono>
#include <iostream>
// include NTT template
#include "curves/params/bn254.cuh"
#include "api/bn254.h"
using namespace bn254;
using namespace ntt;
// Operate on scalars
typedef scalar_t S;
typedef scalar_t E;
void print_elements(const unsigned n, E* elements)
{
for (unsigned i = 0; i < n; i++) {
std::cout << i << ": " << elements[i] << std::endl;
}
}
void initialize_input(const unsigned ntt_size, const unsigned nof_ntts, E* elements)
{
// Lowest Harmonics
for (unsigned i = 0; i < ntt_size; i = i + 1) {
elements[i] = E::one();
}
// print_elements(ntt_size, elements );
// Highest Harmonics
for (unsigned i = 1 * ntt_size; i < 2 * ntt_size; i = i + 2) {
elements[i] = E::one();
elements[i + 1] = E::neg(scalar_t::one());
}
// print_elements(ntt_size, &elements[1*ntt_size] );
}
int validate_output(const unsigned ntt_size, const unsigned nof_ntts, E* elements)
{
int nof_errors = 0;
E amplitude = E::from((uint32_t)ntt_size);
// std::cout << "Amplitude: " << amplitude << std::endl;
// Lowest Harmonics
if (elements[0] != amplitude) {
++nof_errors;
std::cout << "Error in lowest harmonics 0! " << std::endl;
// print_elements(ntt_size, elements );
} else {
std::cout << "Validated lowest harmonics" << std::endl;
}
// Highest Harmonics
if (elements[1 * ntt_size + ntt_size / 2] != amplitude) {
++nof_errors;
std::cout << "Error in highest harmonics! " << std::endl;
// print_elements(ntt_size, &elements[1*ntt_size] );
} else {
std::cout << "Validated highest harmonics" << std::endl;
}
return nof_errors;
}
using FpMilliseconds = std::chrono::duration<float, std::chrono::milliseconds::period>;
#define START_TIMER(timer) auto timer##_start = std::chrono::high_resolution_clock::now();
#define END_TIMER(timer, msg) printf("%s: %.0f ms\n", msg, FpMilliseconds(std::chrono::high_resolution_clock::now() - timer##_start).count());
int main(int argc, char* argv[])
{
std::cout << "Icicle Examples: Number Theoretical Transform (NTT)" << std::endl;
std::cout << "Example parameters" << std::endl;
const unsigned log_ntt_size = 20;
std::cout << "Log2(NTT size): " << log_ntt_size << std::endl;
const unsigned ntt_size = 1 << log_ntt_size;
std::cout << "NTT size: " << ntt_size << std::endl;
const unsigned nof_ntts = 2;
std::cout << "Number of NTTs: " << nof_ntts << std::endl;
const unsigned batch_size = nof_ntts * ntt_size;
std::cout << "Generating input data for lowest and highest harmonics" << std::endl;
E* input;
input = (E*)malloc(sizeof(E) * batch_size);
initialize_input(ntt_size, nof_ntts, input);
E* output;
output = (E*)malloc(sizeof(E) * batch_size);
std::cout << "Running NTT with on-host data" << std::endl;
// Create a device context
auto ctx = device_context::get_default_device_context();
S basic_root = S::omega(log_ntt_size /*NTT_LOG_SIZE*/);
bn254_initialize_domain(&basic_root, ctx, true);
// Create an NTTConfig instance
NTTConfig<S> config = default_ntt_config<S>();
config.ntt_algorithm = NttAlgorithm::MixedRadix;
config.batch_size = nof_ntts;
START_TIMER(MixedRadix);
cudaError_t err = bn254_ntt_cuda(input, ntt_size, NTTDir::kForward, config, output);
END_TIMER(MixedRadix, "MixedRadix NTT");
std::cout << "Validating output" << std::endl;
validate_output(ntt_size, nof_ntts, output);
config.ntt_algorithm = NttAlgorithm::Radix2;
START_TIMER(Radix2);
err = bn254_ntt_cuda(input, ntt_size, NTTDir::kForward, config, output);
END_TIMER(Radix2, "Radix2 NTT");
std::cout << "Validating output" << std::endl;
validate_output(ntt_size, nof_ntts, output);
std::cout << "Cleaning-up memory" << std::endl;
free(input);
free(output);
return 0;
}

View File

@@ -1,2 +1,65 @@
#!/bin/bash
./build/example/example
# Exit immediately if a command exits with a non-zero status
set -e
# Function to display usage information
show_help() {
echo "Usage: $0 [-d DEVICE_TYPE] [-b BACKEND_INSTALL_DIR]"
echo
echo "Options:"
echo " -d DEVICE_TYPE Specify the device type (default: CPU)"
echo " -b BACKEND_INSTALL_DIR Specify the backend installation directory (default: empty)"
echo " -h Show this help message"
exit 0
}
# Parse command line options
while getopts ":d:b:h" opt; do
case ${opt} in
d )
DEVICE_TYPE=$OPTARG
;;
b )
ICICLE_BACKEND_INSTALL_DIR="$(realpath ${OPTARG})"
;;
h )
show_help
;;
\? )
echo "Invalid option: -$OPTARG" 1>&2
show_help
;;
: )
echo "Invalid option: -$OPTARG requires an argument" 1>&2
show_help
;;
esac
done
# Set default values if not provided
: "${DEVICE_TYPE:=CPU}"
: "${ICICLE_BACKEND_INSTALL_DIR:=}"
# Create necessary directories
mkdir -p build/example
mkdir -p build/icicle
ICILE_DIR=$(realpath "../../../icicle_v3/")
ICICLE_CUDA_SOURCE_DIR="${ICILE_DIR}/backend/cuda"
# Build Icicle and the example app that links to it
if [ "$DEVICE_TYPE" == "CUDA" ] && [ ! -d "${ICICLE_BACKEND_INSTALL_DIR}" ] && [ -d "${ICICLE_CUDA_SOURCE_DIR}" ]; then
echo "Building icicle with CUDA backend"
cmake -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254 -DMSM=OFF -DCUDA_BACKEND=local -S "${ICILE_DIR}" -B build/icicle
export ICICLE_BACKEND_INSTALL_DIR=$(realpath "build/icicle/backend")
else
echo "Building icicle without CUDA backend, ICICLE_BACKEND_INSTALL_DIR=${ICICLE_BACKEND_INSTALL_DIR}"
cmake -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254 -DMSM=OFF -S "${ICILE_DIR}" -B build/icicle
fi
cmake -DCMAKE_BUILD_TYPE=Release -S . -B build/example
cmake --build build/icicle -j
cmake --build build/example -j
./build/example/example "$DEVICE_TYPE"

View File

@@ -1,26 +1,16 @@
cmake_minimum_required(VERSION 3.18)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_STANDARD_REQUIRED TRUE)
set(CMAKE_CXX_STANDARD_REQUIRED TRUE)
if (${CMAKE_VERSION} VERSION_LESS "3.24.0")
set(CMAKE_CUDA_ARCHITECTURES ${CUDA_ARCH})
else()
set(CMAKE_CUDA_ARCHITECTURES native) # on 3.24+, on earlier it is ignored, and the target is not passed
endif ()
project(example LANGUAGES CUDA CXX)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --expt-relaxed-constexpr")
set(CMAKE_CUDA_FLAGS_RELEASE "")
set(CMAKE_CUDA_FLAGS_DEBUG "${CMAKE_CUDA_FLAGS_DEBUG} -g -G -O0")
add_executable(
example
example.cu
)
project(example)
add_executable(example example.cpp)
target_include_directories(example PRIVATE "../../../icicle_v3/include" "..")
target_link_directories(example PRIVATE "${CMAKE_SOURCE_DIR}/build/icicle")
message("${CMAKE_BINARY_DIR}/icicle")
target_link_libraries(example PRIVATE icicle_curve_bn254 icicle_field_bn254 icicle_device)
if(BACKEND_DIR)
add_compile_definitions(BACKEND_DIR="${BACKEND_DIR}")
endif()
target_include_directories(example PRIVATE "../../../icicle/include")
target_link_libraries(example ${CMAKE_SOURCE_DIR}/build/icicle/lib/libingo_curve_bn254.a)
target_link_libraries(example ${CMAKE_SOURCE_DIR}/build/icicle/lib/libingo_field_bn254.a)
find_library(NVML_LIBRARY nvidia-ml PATHS /usr/local/cuda/targets/x86_64-linux/lib/stubs/ )
target_link_libraries(example ${NVML_LIBRARY})
set_target_properties(example PROPERTIES CUDA_SEPARABLE_COMPILATION ON)

View File

@@ -1,9 +1,5 @@
# ICICLE example: Pedersen Commitment
## Best-Practices
We recommend running our examples in [ZK-containers](../../ZK-containers.md) to save your time and mental energy.
## Key-Takeaway
A Pedersen Commitment is a cryptographic primitive to commit to a value or a vector of values while keeping it hidden, yet enabling the committer to reveal the value later. It provides both hiding (the commitment does not reveal any information about the value) and binding properties (once a value is committed, it cannot be changed without detection).
@@ -14,10 +10,12 @@ An example of MSM is [here](../msm/README.md).
## Running the example
- `cd` to your example directory
- compile with `./compile.sh`
- run with `./run.sh`
```sh
# for CPU
./run.sh -d CPU
# for CUDA
./run.sh -d CUDA -b /path/to/cuda/backend/install/dir
```
## Concise Explanation
We recommend this simple [explanation](https://www.rareskills.io/post/pedersen-commitment).
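For intuition only, here is a toy Pedersen commitment in a small multiplicative group instead of the elliptic-curve MSM used by the example. Everything in it is an assumption made for illustration: the prime `P = 23`, the subgroup order `Q = 11`, the generators `G = 4` and `H = 9`, and the function names. With such tiny parameters the discrete log of `H` to base `G` is trivial to find, so this sketch has no real binding security.

```cpp
#include <cassert>
#include <cstdint>

// Toy parameters (assumed for illustration): G and H both have order Q modulo P.
constexpr std::uint64_t P = 23;
constexpr std::uint64_t Q = 11;
constexpr std::uint64_t G = 4;
constexpr std::uint64_t H = 9;

// Modular exponentiation by repeated squaring.
std::uint64_t pow_mod(std::uint64_t base, std::uint64_t exp, std::uint64_t mod)
{
  std::uint64_t result = 1;
  base %= mod;
  while (exp > 0) {
    if (exp & 1u) result = result * base % mod;
    base = base * base % mod;
    exp >>= 1;
  }
  return result;
}

// Commit to message m with blinding factor (salt) r: C = G^m * H^r mod P.
std::uint64_t commit(std::uint64_t m, std::uint64_t r)
{
  return pow_mod(G, m % Q, P) * pow_mod(H, r % Q, P) % P;
}

// Opening check: the verifier recomputes the commitment from the revealed (m, r).
bool verify_opening(std::uint64_t c, std::uint64_t m, std::uint64_t r)
{
  return commit(m, r) == c;
}
```

The vector commitment in the example follows the same pattern with one generator per committed value plus one for the salt, which is why it reduces to a single MSM of size `N + 1`.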

View File

@@ -1,15 +0,0 @@
#!/bin/bash
# Exit immediately on error
set -e
mkdir -p build/example
mkdir -p build/icicle
# Configure and build Icicle
cmake -S ../../../icicle/ -B build/icicle -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254
cmake --build build/icicle
# Configure and build the example application
cmake -S . -B build/example
cmake --build build/example

View File

@@ -1,59 +1,62 @@
#include <iostream>
#include <iomanip>
#include <chrono>
#include <cassert>
#include <nvml.h>
#include "api/bn254.h"
#include "msm/msm.cuh"
#include "icicle/runtime.h"
#include "icicle/api/bn254.h"
#include "icicle/curves/params/bn254.h"
using namespace bn254;
typedef point_field_t T;
#include "examples_utils.h"
// modular power
T modPow(T base, T exp) {
template <typename T>
T modPow(T base, T exp)
{
T r = T::one();
T b = base;
T e = exp;
while (e != T::zero()) {
// If exp is odd, multiply the base with result
if (T::is_odd(e)) {
r = r * b;
}
// Now exp must be even, divide it by 2
e =T::div2(e);
b = b * b;
// If exp is odd, multiply the base with result
if (T::is_odd(e)) { r = r * b; }
// Now exp must be even, divide it by 2
e = T::div2(e);
b = b * b;
}
return r;
}
// Check if y2 is a quadratic residue using Euler's Criterion
bool quadratic_residue(T y2) {
template <typename T>
bool quadratic_residue(T y2)
{
return modPow(y2, T::div2(T::zero() - T::one())) == T::one();
}
// modular square root adapted from:
// https://github.com/ShahjalalShohag/code-library/blob/main/Number%20Theory/Tonelli%20Shanks%20Algorithm.cpp
bool mySQRT(T a, T *result) {
template <typename T>
bool mySQRT(T a, T* result)
{
if (a == T::zero()) {
*result = T::zero();
return true;
}
if (modPow(a, T::div2(T::zero() - T::one())) != T::one() ) {
if (modPow(a, T::div2(T::zero() - T::one())) != T::one()) {
return false; // solution does not exist
}
// TODO: consider special cases
// if (p % 4 == 3) return power(a, (p + 1) / 4, p);
T s = T::zero() - T::one(); // p - 1,
T n = T::one() + T::one(); //2;
T r = T::zero();
// if (p % 4 == 3) return power(a, (p + 1) / 4, p);
T s = T::zero() - T::one(); // p - 1,
T n = T::one() + T::one(); // 2;
T r = T::zero();
T m;
while (T::is_even(s)) {
r = r + T::one();
s = T::div2(s); //s /= 2;
s = T::div2(s); // s /= 2;
}
// find a non-square mod p
while (modPow(n, T::div2((T::zero() - T::one())) ) != T::zero() - T::one()) {
while (modPow(n, T::div2((T::zero() - T::one()))) != T::zero() - T::one()) {
n = n + T::one();
}
T x = modPow(a, T::div2(s + T::one()));
@@ -61,92 +64,93 @@ bool mySQRT(T a, T *result) {
T g = modPow(n, s);
for (;; r = m) {
T t = b;
for (m = T::zero(); T::lt(m,r) /* m < r*/ && t != T::one(); m = m + T::one()) t = t * t;
if (m == T::zero() ) {
for (m = T::zero(); T::lt(m, r) /* m < r*/ && t != T::one(); m = m + T::one())
t = t * t;
if (m == T::zero()) {
*result = x;
return true;
}
T gs = modPow(g, modPow(T::one() + T::one(), r - m - T::one()) );
g = gs * gs ;
x = x * gs ;
b = b * g ;
T gs = modPow(g, modPow(T::one() + T::one(), r - m - T::one()));
g = gs * gs;
x = x * gs;
b = b * g;
}
}
void point_near_x(T x, affine_t *point) {
const T wb = T { weierstrass_b };
template <typename T>
void point_near_x(T x, affine_t* point)
{
const T wb = T{G1::weierstrass_b};
T y2;
while (y2 = x*x*x + wb, quadratic_residue(y2) == false)
{
while (y2 = x * x * x + wb, quadratic_residue(y2) == false) {
x = x + T::one();
};
T y;
bool found = mySQRT(y2, &y);
assert(y*y == y2);
assert(y * y == y2);
point->x = x;
point->y = y;
}
static int seed = 0;
static HOST_INLINE T rand_host_seed()
{
std::mt19937_64 generator(seed++);
std::uniform_int_distribution<unsigned> distribution;
T value;
for (unsigned i = 0; i < T::TLC-1 ; i++)
// TODO: use the full range of limbs: for (unsigned i = 0; i < T::TLC ; i++)
value.limbs_storage.limbs[i] = distribution(generator);
// while (lt(Field{get_modulus()}, value))
// value = value - Field{get_modulus()};
return value;
}
template <typename T>
static T rand_host_seed()
{
std::mt19937_64 generator(seed++);
std::uniform_int_distribution<unsigned> distribution;
using FpMilliseconds = std::chrono::duration<float, std::chrono::milliseconds::period>;
#define START_TIMER(timer) auto timer##_start = std::chrono::high_resolution_clock::now();
#define END_TIMER(timer, msg) printf("%s: %.0f ms\n", msg, FpMilliseconds(std::chrono::high_resolution_clock::now() - timer##_start).count());
T value;
for (unsigned i = 0; i < T::TLC - 1; i++)
// TODO: use the full range of limbs: for (unsigned i = 0; i < T::TLC ; i++)
value.limbs_storage.limbs[i] = distribution(generator);
// while (lt(Field{get_modulus()}, value))
// value = value - Field{get_modulus()};
return value;
}
int main(int argc, char** argv)
{
try_load_and_set_backend_device(argc, argv);
const unsigned N = pow(2, 10);
std::cout << "Commitment vector size: " << N << "+1 for salt (a.k.a blinding factor)" << std::endl;
T* xs = new T[N+1];
point_field_t* xs = new point_field_t[N + 1];
std::cout << "Generating random points transparently using publicly chosen seed" << std::endl;
std::cout << "Public seed prevents committer from knowing the discrete logs of points used in the commitment" << std::endl;
std::cout << "Public seed prevents committer from knowing the discrete logs of points used in the commitment"
<< std::endl;
seed = 1234;
std::cout << "Using seed: " << seed << std::endl;
std::cout << "Generating random field values" << std::endl;
START_TIMER(gen);
for (unsigned i = 0; i < N; i++) {
xs[i] = rand_host_seed();
xs[i] = rand_host_seed<point_field_t>();
}
END_TIMER(gen, "Time to generate field values");
std::cout << "xs[0]: " << xs[0] << std::endl;
std::cout << "xs[1]: " << xs[1] << std::endl;
std::cout << "xs[0]: " << xs[0] << std::endl;
std::cout << "xs[1]: " << xs[1] << std::endl;
// affine_t points[N];
affine_t* points = new affine_t[N+1];
affine_t* points = new affine_t[N + 1];
std::cout << "Generating point about random field values" << std::endl;
START_TIMER(points);
for (unsigned i = 0; i < N+1; i++) {
for (unsigned i = 0; i < N + 1; i++) {
point_near_x(xs[i], &points[i]);
}
END_TIMER(points, "Time to generate points");
std::cout << "Generating commitment vector" << std::endl;
projective_t result;
scalar_t* scalars = new scalar_t[N+1];
scalar_t* scalars = new scalar_t[N + 1];
scalar_t::rand_host_many(scalars, N);
std::cout << "Generating salt" << std::endl;
scalars[N] = scalar_t::rand_host();
std::cout << "Executing MSM" << std::endl;
auto config = msm::default_msm_config();
auto config = default_msm_config();
START_TIMER(msm);
bn254_msm_cuda(scalars, points, N+1, config, &result);
bn254_msm(scalars, points, N + 1, config, &result);
END_TIMER(msm, "Time to execute MSM");
std::cout << "Computed commitment: " << result << std::endl;

View File

@@ -1,2 +1,65 @@
#!/bin/bash
./build/example/example
# Exit immediately if a command exits with a non-zero status
set -e
# Function to display usage information
show_help() {
echo "Usage: $0 [-d DEVICE_TYPE] [-b ICICLE_BACKEND_INSTALL_DIR]"
echo
echo "Options:"
echo " -d DEVICE_TYPE Specify the device type (default: CPU)"
echo " -b ICICLE_BACKEND_INSTALL_DIR Specify the backend installation directory (default: empty)"
echo " -h Show this help message"
exit 0
}
# Parse command line options
while getopts ":d:b:h" opt; do
case ${opt} in
d )
DEVICE_TYPE=$OPTARG
;;
b )
ICICLE_BACKEND_INSTALL_DIR="$(realpath ${OPTARG})"
;;
h )
show_help
;;
\? )
echo "Invalid option: -$OPTARG" 1>&2
show_help
;;
: )
echo "Invalid option: -$OPTARG requires an argument" 1>&2
show_help
;;
esac
done
# Set default values if not provided
: "${DEVICE_TYPE:=CPU}"
: "${ICICLE_BACKEND_INSTALL_DIR:=}"
# Create necessary directories
mkdir -p build/example
mkdir -p build/icicle
ICILE_DIR=$(realpath "../../../icicle_v3/")
ICICLE_CUDA_SOURCE_DIR="${ICILE_DIR}/backend/cuda"
# Build Icicle and the example app that links to it
if [ "$DEVICE_TYPE" == "CUDA" ] && [ ! -d "${ICICLE_BACKEND_INSTALL_DIR}" ] && [ -d "${ICICLE_CUDA_SOURCE_DIR}" ]; then
echo "Building icicle with CUDA backend"
cmake -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254 -DCUDA_BACKEND=local -S "${ICILE_DIR}" -B build/icicle
export ICICLE_BACKEND_INSTALL_DIR=$(realpath "build/icicle/backend")
else
echo "Building icicle without CUDA backend, ICICLE_BACKEND_INSTALL_DIR=${ICICLE_BACKEND_INSTALL_DIR}"
cmake -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254 -S "${ICILE_DIR}" -B build/icicle
fi
cmake -DCMAKE_BUILD_TYPE=Release -S . -B build/example
cmake --build build/icicle -j
cmake --build build/example -j
./build/example/example "$DEVICE_TYPE"

View File

@@ -0,0 +1,16 @@
cmake_minimum_required(VERSION 3.18)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED TRUE)
project(example)
add_executable(example example.cpp)
target_include_directories(example PRIVATE "../../../icicle_v3/include" "..")
target_link_directories(example PRIVATE "${CMAKE_SOURCE_DIR}/build/icicle")
message("${CMAKE_BINARY_DIR}/icicle")
target_link_libraries(example PRIVATE icicle_curve_bn254 icicle_field_bn254 icicle_device)
if(BACKEND_DIR)
add_compile_definitions(BACKEND_DIR="${BACKEND_DIR}")
endif()

View File

@@ -0,0 +1,46 @@
# ICICLE examples: computations with polynomials
## Key-Takeaway
Polynomials are crucial for Zero-Knowledge Proofs (ZKPs): they enable efficient representation and verification of computational statements, facilitate privacy-preserving protocols, and support complex mathematical operations essential for constructing and verifying proofs without revealing underlying data. The Polynomial API is documented [here](https://dev.ingonyama.com/icicle/polynomials/overview).
## Running the example
To run the example from the project root directory:
```sh
# for CPU
./run.sh -d CPU
# for CUDA
./run.sh -d CUDA -b /path/to/cuda/backend/install/dir
```
To change the scalar field, modify `compile.h` to build the corresponding lib and `CMakeLists.txt` to link to that lib and set `FIELD_ID` correspondingly.
## What's in the examples
- `example_evaluate`: Make a polynomial from coefficients and evaluate it at a random point.
- `example_clone`: Make a separate copy of a polynomial.
- `example_from_rou`: Reconstruct polynomial from values at the roots of unity. This operation is a cornerstone in the efficient implementation of zero-knowledge proofs, particularly in the areas of proof construction, verification, and polynomial arithmetic. By leveraging the algebraic structure and computational properties of roots of unity, ZKP protocols can achieve the scalability, efficiency, and privacy necessary for practical applications in blockchain, secure computation, and beyond.
- `example_addition`, `example_addition_inplace`: Different flavors of polynomial addition.
- `example_multiplication`: A product of two polynomials.
- `example_multiplicationScalar`: A product of scalar and a polynomial.
- `example_monomials`: Add/subtract a monomial to a polynomial. A monomial is a single term, which is the product of a constant coefficient and a variable raised to a non-negative integer power.
- `example_ReadCoeffsToHost`: Download coefficients of a polynomial to the host. `ICICLE` keeps all polynomials on the GPU, so on-host processing requires this explicit copy.
- `example_divisionSmall`, `example_divisionLarge`: Different flavors of division.
- `example_divideByVanishingPolynomial`: A vanishing polynomial over a set S is a polynomial that evaluates to zero for every element in S. For a simple case, consider the set S={a}, a single element. The polynomial f(x)=x-a vanishes over S because f(a)=0. Mathematically, dividing a polynomial P(x) by a vanishing polynomial V(x) typically involves finding another polynomial Q(x) and possibly a remainder R(x) such that P(x)=Q(x)V(x)+R(x), where R(x) has a lower degree than V(x). In many cryptographic applications, the focus is on ensuring that P(x) is exactly divisible by V(x), meaning R(x)=0.
- `example_EvenOdd`: even (odd) methods keep even (odd) coefficients of the original polynomial. For $f(x) = 1+2x+3x^2+4x^3$, even polynomial is $1+3x$, odd polynomial is $2+4x$.
- `example_Slice`: extends the even/odd methods and keeps coefficients for a given offset and stride. For $f(x) = 1+2x+3x^2+4x^3$, the slice with offset 0 and stride 3 gives $1+4x$.
- `example_DeviceMemoryView`: device-memory views of polynomials allow passing polynomials to other GPU functions. In this example the coefficients of a polynomial are committed to a Merkle tree, bypassing the host.

@@ -0,0 +1,448 @@
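#include <iostream>
#include <cassert>
#include <chrono>  // std::chrono timers
#include <cmath>   // ceil, log2
#include <cstring> // memcmp
#include <memory>  // std::unique_ptr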
#include "icicle/api/bn254.h"
#include "icicle/polynomials/polynomials.h"
#include "examples_utils.h"
using namespace icicle;
using namespace bn254; // typedef scalar_t as bn254-scalar type
// define the polynomial type
typedef Polynomial<scalar_t> Polynomial_t;
// we'll use the following constants in the examples
const auto zero = scalar_t::zero();
const auto one = scalar_t::one();
const auto two = scalar_t::from(2);
const auto three = scalar_t::from(3);
const auto four = scalar_t::from(4);
const auto five = scalar_t::from(5);
const auto minus_one = zero - one;
static std::unique_ptr<scalar_t[]> generate_pows(scalar_t tau, uint32_t size)
{
auto vec = std::make_unique<scalar_t[]>(size);
vec[0] = scalar_t::one();
for (size_t i = 1; i < size; ++i) {
vec[i] = vec[i - 1] * tau;
}
return vec;
}
static std::unique_ptr<affine_t[]> generate_SRS(uint32_t size)
{
auto secret_scalar = scalar_t::rand_host();
auto gen = projective_t::generator();
auto pows_of_tau = generate_pows(secret_scalar, size);
auto SRS = std::make_unique<affine_t[]>(size);
for (size_t i = 0; i < size; ++i) {
SRS[i] = projective_t::to_affine(pows_of_tau[i] * gen);
}
return SRS;
}
void example_evaluate()
{
std::cout << std::endl << "Example: Polynomial evaluation on random value" << std::endl;
const scalar_t coeffs[3] = {one, two, three};
auto f = Polynomial_t::from_coefficients(coeffs, 3);
std::cout << "f = " << f << std::endl;
scalar_t x = scalar_t::rand_host();
std::cout << "x = " << x << std::endl;
auto fx = f(x);
std::cout << "f(x) = " << fx << std::endl;
}
void example_from_rou(const int size)
{
std::cout << std::endl << "Example: Reconstruct polynomial from values at roots of unity" << std::endl;
const int log_size = (int)ceil(log2(size));
const int nof_evals = 1 << log_size;
auto coeff = std::make_unique<scalar_t[]>(size);
for (int i = 0; i < size; i++)
coeff[i] = scalar_t::rand_host();
auto f = Polynomial_t::from_coefficients(coeff.get(), size);
// rou: root of unity
auto omega = scalar_t::omega(log_size);
// nof_evals is only known at runtime, so allocate on the heap instead of using a variable-length array
auto evals = std::make_unique<scalar_t[]>(nof_evals);
auto x = scalar_t::one();
for (int i = 0; i < nof_evals; ++i) {
evals[i] = f(x);
x = x * omega;
}
// reconstruct f from evaluations
auto fr = Polynomial_t::from_rou_evaluations(evals.get(), nof_evals);
// check for equality f-fr==0
auto h = f - fr;
std::cout << "degree of f - fr = " << h.degree() << std::endl;
}
static Polynomial_t randomize_polynomial(uint32_t size)
{
auto coeff = std::make_unique<scalar_t[]>(size);
for (int i = 0; i < size; i++)
coeff[i] = scalar_t::rand_host();
return Polynomial_t::from_coefficients(coeff.get(), size);
}
static Polynomial_t incremental_values(uint32_t size)
{
auto coeff = std::make_unique<scalar_t[]>(size);
for (int i = 0; i < size; i++) {
coeff[i] = i ? coeff[i - 1] + scalar_t::one() : scalar_t::one();
}
return Polynomial_t::from_coefficients(coeff.get(), size);
}
static bool is_equal(Polynomial_t& lhs, Polynomial_t& rhs)
{
const int deg_lhs = lhs.degree();
const int deg_rhs = rhs.degree();
if (deg_lhs != deg_rhs) { return false; }
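if (deg_lhs < 0) { return true; } // nothing to compare
// a polynomial of degree d has d+1 coefficients; copy_coeffs takes an inclusive index range
const int nof_coeffs = deg_lhs + 1;
auto lhs_coeffs = std::make_unique<scalar_t[]>(nof_coeffs);
auto rhs_coeffs = std::make_unique<scalar_t[]>(nof_coeffs);
lhs.copy_coeffs(lhs_coeffs.get(), 0, deg_lhs);
rhs.copy_coeffs(rhs_coeffs.get(), 0, deg_rhs);
return memcmp(lhs_coeffs.get(), rhs_coeffs.get(), nof_coeffs * sizeof(scalar_t)) == 0;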
}
void example_addition(const int size0, const int size1)
{
std::cout << std::endl << "Example: Polynomial addition" << std::endl;
auto f = randomize_polynomial(size0);
auto g = randomize_polynomial(size1);
auto x = scalar_t::rand_host();
auto f_x = f(x);
auto g_x = g(x);
auto fx_plus_gx = f_x + g_x;
auto h = f + g;
auto h_x = h(x);
std::cout << "evaluate and add: " << fx_plus_gx << std::endl;
std::cout << "add and evaluate: " << h_x << std::endl;
}
void example_addition_inplace(const int size0, const int size1)
{
std::cout << std::endl << "Example: Polynomial inplace addition" << std::endl;
auto f = randomize_polynomial(size0);
auto g = randomize_polynomial(size1);
auto x = scalar_t::rand_host();
auto f_x = f(x);
auto g_x = g(x);
auto fx_plus_gx = f_x + g_x;
f += g;
auto s_x = f(x);
std::cout << "evaluate and add: " << fx_plus_gx << std::endl;
std::cout << "add and evaluate: " << s_x << std::endl;
}
void example_multiplication(const int log0, const int log1)
{
std::cout << std::endl << "Example: Polynomial multiplication" << std::endl;
const int size0 = 1 << log0, size1 = 1 << log1;
auto f = randomize_polynomial(size0);
auto g = randomize_polynomial(size1);
scalar_t x = scalar_t::rand_host();
auto fx = f(x);
auto gx = g(x);
auto fx_mul_gx = fx * gx;
auto m = f * g;
auto mx = m(x);
std::cout << "evaluate and multiply: " << fx_mul_gx << std::endl;
std::cout << "multiply and evaluate: " << mx << std::endl;
}
void example_multiplication_scalar(const int log0)
{
std::cout << std::endl << "Example: Scalar by Polynomial multiplication" << std::endl;
const int size = 1 << log0;
auto f = randomize_polynomial(size);
auto s = scalar_t::from(2);
auto g = s * f;
auto x = scalar_t::rand_host();
auto fx = f(x);
auto fx2 = s * fx;
auto gx = g(x);
std::cout << "Compare (2*f)(x) and 2*f(x): " << std::endl;
std::cout << gx << std::endl;
std::cout << fx2 << std::endl;
}
void example_monomials()
{
std::cout << std::endl << "Example: Monomials" << std::endl;
const scalar_t coeffs[3] = {one, zero, two}; // 1+2x^2
auto f = Polynomial_t::from_coefficients(coeffs, 3);
const auto x = three;
auto fx = f(x);
f.add_monomial_inplace(three, 1); // add 3x
const auto expected_addmonom_f_x = fx + three * x;
const auto addmonom_f_x = f(x);
std::cout << "Computed f'(x) = " << addmonom_f_x << std::endl;
std::cout << "Expected f'(x) = " << expected_addmonom_f_x << std::endl;
}
void example_read_coeffs_to_host()
{
std::cout << std::endl << "Example: Read coefficients to host" << std::endl;
const scalar_t coeffs_f[3] = {zero, one, two}; // 0+1x+2x^2
auto f = Polynomial_t::from_coefficients(coeffs_f, 3);
const scalar_t coeffs_g[3] = {one, one, one}; // 1+x+x^2
auto g = Polynomial_t::from_coefficients(coeffs_g, 3);
auto h = f + g; // 1+2x+3x^2
std::cout << "Get one coefficient of h() at a time: " << std::endl;
const auto h0 = h.get_coeff(0);
const auto h1 = h.get_coeff(1);
const auto h2 = h.get_coeff(2);
std::cout << "Coefficients of h: " << std::endl;
std::cout << "0:" << h0 << " expected: " << one << std::endl;
std::cout << "1:" << h1 << " expected: " << two << std::endl;
std::cout << "2:" << h2 << " expected: " << three << std::endl;
std::cout << "Get all coefficients of h() at a time: " << std::endl;
scalar_t h_coeffs[3] = {0};
// fetch the coefficients for a given range
auto nof_coeffs = h.copy_coeffs(h_coeffs, 0, 2);
scalar_t expected_h_coeffs[3] = {one, two, three};
for (int i = 0; i < nof_coeffs; ++i) {
std::cout << i << ":" << h_coeffs[i] << " expected: " << expected_h_coeffs[i] << std::endl;
}
}
void example_division_small()
{
std::cout << std::endl << "Example: Polynomial division (small)" << std::endl;
const scalar_t coeffs_a[4] = {five, zero, four, three}; // 3x^3+4x^2+5
const scalar_t coeffs_b[3] = {minus_one, zero, one}; // x^2-1
auto a = Polynomial_t::from_coefficients(coeffs_a, 4);
auto b = Polynomial_t::from_coefficients(coeffs_b, 3);
auto [q, r] = a.divide(b);
scalar_t q_coeffs[2] = {0}; // 3x+4
scalar_t r_coeffs[2] = {0}; // 3x+9
const auto q_nof_coeffs = q.copy_coeffs(q_coeffs, 0, 1);
const auto r_nof_coeffs = r.copy_coeffs(r_coeffs, 0, 1);
std::cout << "Quotient: 0:" << q_coeffs[0] << " expected: " << scalar_t::from(4) << std::endl;
std::cout << "Quotient: 1:" << q_coeffs[1] << " expected: " << scalar_t::from(3) << std::endl;
std::cout << "Remainder: 0:" << r_coeffs[0] << " expected: " << scalar_t::from(9) << std::endl;
std::cout << "Remainder: 1:" << r_coeffs[1] << " expected: " << scalar_t::from(3) << std::endl;
}
void example_division_large(const int log0, const int log1)
{
std::cout << std::endl << "Example: Polynomial division (large)" << std::endl;
const int size0 = 1 << log0, size1 = 1 << log1;
auto a = randomize_polynomial(size0);
auto b = randomize_polynomial(size1);
auto [q, r] = a.divide(b);
scalar_t x = scalar_t::rand_host();
auto ax = a(x);
auto bx = b(x);
auto qx = q(x);
auto rx = r(x);
// check if a(x) == b(x)*q(x)+r(x)
std::cout << "a(x) == b(x)*q(x)+r(x)" << std::endl;
std::cout << "lhs = " << ax << std::endl;
std::cout << "rhs = " << bx * qx + rx << std::endl;
}
void example_divide_by_vanishing_polynomial()
{
std::cout << std::endl << "Example: Polynomial division by vanishing polynomial" << std::endl;
const scalar_t coeffs_v[5] = {minus_one, zero, zero, zero, one}; // x^4-1 vanishes on 4th roots of unity
auto v = Polynomial_t::from_coefficients(coeffs_v, 5);
auto h = incremental_values(1 << 11);
auto hv = h * v;
auto [h_div, R] = hv.divide(v);
std::cout << "h_div == h: " << is_equal(h_div, h) << std::endl;
auto h_div_by_vanishing = hv.divide_by_vanishing_polynomial(4);
std::cout << "h_div_by_vanishing == h: " << is_equal(h_div_by_vanishing, h) << std::endl;
}
void example_clone(const int log0)
{
std::cout << std::endl << "Example: clone polynomial" << std::endl;
const int size = 1 << log0;
auto f = randomize_polynomial(size);
const auto x = scalar_t::rand_host();
const auto fx = f(x);
Polynomial_t g;
g = f.clone();
g += f;
auto h = g.clone();
std::cout << "g(x) = " << g(x) << " expected: " << two * fx << std::endl;
std::cout << "h(x) = " << h(x) << " expected: " << g(x) << std::endl;
}
void example_even_odd()
{
std::cout << std::endl << "Example: Split into even and odd powers " << std::endl;
const scalar_t coeffs[4] = {one, two, three, four}; // 1+2x+3x^2+4x^3
auto f = Polynomial_t::from_coefficients(coeffs, 4);
auto f_even = f.even();
auto f_odd = f.odd();
scalar_t even_coeffs[2] = {0};
scalar_t odd_coeffs[2] = {0};
const auto even_nof_coeffs = f_even.copy_coeffs(even_coeffs, 0, 1);
const auto odd_nof_coeffs = f_odd.copy_coeffs(odd_coeffs, 0, 1);
std::cout << "Even: 0:" << even_coeffs[0] << " expected: " << one << std::endl;
std::cout << "Even: 1:" << even_coeffs[1] << " expected: " << three << std::endl;
std::cout << "Odd: 0:" << odd_coeffs[0] << " expected: " << two << std::endl;
std::cout << "Odd: 1:" << odd_coeffs[1] << " expected: " << four << std::endl;
}
void example_slice()
{
std::cout << std::endl << "Example: Slice polynomial " << std::endl;
const scalar_t coeffs[4] = {one, two, three, four}; // 1+2x+3x^2+4x^3
auto f = Polynomial_t::from_coefficients(coeffs, 4);
auto f_slice = f.slice(0 /*=offset*/, 3 /*= stride*/, 2 /*/= size*/); // 1+4x
scalar_t slice_coeffs[2] = {0};
const auto slice_nof_coeffs = f_slice.copy_coeffs(slice_coeffs, 0, 1);
std::cout << "Slice: 0:" << slice_coeffs[0] << " expected: " << one << std::endl;
std::cout << "Slice: 1:" << slice_coeffs[1] << " expected: " << four << std::endl;
}
void example_device_memory_view()
{
const int log_size = 6;
const int size = 1 << log_size;
auto f = randomize_polynomial(size);
auto [d_coeffs, N] = f.get_coefficients_view();
// compute coset evaluations
auto coset_evals = std::make_unique<scalar_t[]>(size);
auto ntt_config = default_ntt_config<scalar_t>();
ntt_config.are_inputs_on_device = true; // using the device data directly as a view
ntt_config.coset_gen = get_root_of_unity<scalar_t>(size * 2);
ntt(d_coeffs.get(), size, NTTDir::kForward, ntt_config, coset_evals.get());
}
void example_commit_with_device_memory_view()
{
// declare time vars
std::chrono::time_point<std::chrono::high_resolution_clock> start, end;
std::chrono::milliseconds duration;
std::cout << std::endl
<< "Example: a) commit with Polynomial views [(f1+f2)^2 + (f1-f2)^2 ]_1 = [2 (f1^2+ f_2^2)]_1" << std::endl;
std::cout << "Example: b) commit with Polynomial views [(f1+f2)^2 - (f1-f2)^2 ]_1 = [4 f1 *f_2]_1" << std::endl;
int N = 1025;
// generate group elements string of length N: (1, beta,beta^2....,beta^{N-1}). g
std::cout << "Setup: Generating mock SRS" << std::endl;
start = std::chrono::high_resolution_clock::now();
auto SRS = generate_SRS(2 * N);
// Allocate memory on device (points)
affine_t* points_d;
ICICLE_CHECK(icicle_malloc((void**)&points_d, sizeof(affine_t) * 2 * N));
// copy SRS to device (could have generated on device, but gives an indicator)
ICICLE_CHECK(icicle_copy(points_d, SRS.get(), sizeof(affine_t) * 2 * N));
end = std::chrono::high_resolution_clock::now();
duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << "Setup: SRS of length " << 2 * N << " generated and loaded to device. Took: " << duration.count()
<< " milliseconds" << std::endl;
// goal:
// test commitment equality [(f1+f2)^2 + (f1-f2)^2 ]_1 = [2 (f1^2+ f_2^2)]_1
// test commitment equality [(f1+f2)^2 - (f1-f2)^2 ]_1 = [4 f1 *f_2]_1
// note: using polyapi to gen scalars: already on device.
std::cout << "Setup: Generating polys (on device) f1,f2 of log degree " << log2(N - 1) << std::endl;
start = std::chrono::high_resolution_clock::now();
auto f1 = randomize_polynomial(N);
auto f2 = randomize_polynomial(N);
end = std::chrono::high_resolution_clock::now();
duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << "Setup: Gen poly done. Took: " << duration.count() << " milliseconds" << std::endl;
// deg 2N constraints (f1+f2)^2 + (f1-f2)^2 = 2 (f1^2+ f_2^2)
std::cout << "Computing constraints..start " << std::endl;
start = std::chrono::high_resolution_clock::now();
auto L1 = (f1 + f2) * (f1 + f2) + (f1 - f2) * (f1 - f2);
auto R1 = scalar_t::from(2) * (f1 * f1 + f2 * f2);
// deg 2N constraints (f1+f2)^2 - (f1-f2)^2 = 4 f1 *f_2
auto L2 = (f1 + f2) * (f1 + f2) - (f1 - f2) * (f1 - f2);
auto R2 = scalar_t::from(4) * f1 * f2;
end = std::chrono::high_resolution_clock::now();
duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << "Computing constraints..done. Took: " << duration.count() << " milliseconds" << std::endl;
// extract coeff using coeff view
auto [viewL1, sizeL1] = L1.get_coefficients_view();
auto [viewL2, sizeL2] = L2.get_coefficients_view();
auto [viewR1, sizeR1] = R1.get_coefficients_view();
auto [viewR2, sizeR2] = R2.get_coefficients_view();
std::cout << "Computing Commitments with poly view" << std::endl;
start = std::chrono::high_resolution_clock::now();
MSMConfig config = default_msm_config();
config.are_points_on_device = true;
config.are_scalars_on_device = true;
// host vars (for result)
projective_t hL1{}, hL2{}, hR1{}, hR2{};
// straightforward msm bn254 api: no batching
msm(viewL1.get(), points_d, N, config, &hL1);
msm(viewL2.get(), points_d, N, config, &hL2);
msm(viewR1.get(), points_d, N, config, &hR1);
msm(viewR2.get(), points_d, N, config, &hR2);
end = std::chrono::high_resolution_clock::now();
duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << "Commitments done. Took: " << duration.count() << " milliseconds" << std::endl;
// sanity checks
auto affL1 = projective_t::to_affine(hL1);
auto affR1 = projective_t::to_affine(hR1);
auto affL2 = projective_t::to_affine(hL2);
auto affR2 = projective_t::to_affine(hR2);
// test commitment equality [(f1+f2)^2 + (f1-f2)^2]_1 = [2 (f_1^2+f_2^2)]_1
assert(affL1.x == affR1.x && affL1.y == affR1.y);
std::cout << "commitment [(f1+f2)^2 + (f1-f2)^2]_1:" << std::endl;
std::cout << "[x: " << affL1.x << ", y: " << affL1.y << "]" << std::endl;
std::cout << "commitment [2 (f_1^2+f_2^2)]_1:" << std::endl;
std::cout << "[x: " << affR1.x << ", y: " << affR1.y << "]" << std::endl;
assert(affL2.x == affR2.x && affL2.y == affR2.y);
std::cout << "commitment [(f1+f2)^2 - (f1-f2)^2]_1:" << std::endl;
std::cout << "[x: " << affL2.x << ", y: " << affL2.y << "]" << std::endl;
std::cout << "commitment [4 f_1*f_2]_1:" << std::endl;
std::cout << "[x: " << affR2.x << ", y: " << affR2.y << "]" << std::endl;
}
int main(int argc, char** argv)
{
try_load_and_set_backend_device(argc, argv);
static const int MAX_NTT_LOG_SIZE = 24;
const scalar_t basic_root = scalar_t::omega(MAX_NTT_LOG_SIZE);
ntt_init_domain(basic_root, default_ntt_init_domain_config());
START_TIMER(polyapi);
example_evaluate();
example_clone(10);
example_from_rou(100);
example_addition(12, 17);
example_addition_inplace(2, 2);
example_multiplication(15, 12);
example_multiplication_scalar(15);
example_monomials();
example_read_coeffs_to_host();
example_division_small();
example_division_large(12, 2);
example_divide_by_vanishing_polynomial();
example_even_odd();
example_slice();
example_device_memory_view();
example_commit_with_device_memory_view();
END_TIMER(polyapi, "polyapi example took");
return 0;
}

@@ -0,0 +1,65 @@
#!/bin/bash
# Exit immediately if a command exits with a non-zero status
set -e
# Function to display usage information
show_help() {
echo "Usage: $0 [-d DEVICE_TYPE] [-b ICICLE_BACKEND_INSTALL_DIR]"
echo
echo "Options:"
echo " -d DEVICE_TYPE Specify the device type (default: CPU)"
echo " -b ICICLE_BACKEND_INSTALL_DIR Specify the backend installation directory (default: empty)"
echo " -h Show this help message"
exit 0
}
# Parse command line options
while getopts ":d:b:h" opt; do
case ${opt} in
d )
DEVICE_TYPE=$OPTARG
;;
b )
ICICLE_BACKEND_INSTALL_DIR="$(realpath "${OPTARG}")"
;;
h )
show_help
;;
\? )
echo "Invalid option: -$OPTARG" 1>&2
show_help
;;
: )
echo "Invalid option: -$OPTARG requires an argument" 1>&2
show_help
;;
esac
done
# Set default values if not provided
: "${DEVICE_TYPE:=CPU}"
: "${ICICLE_BACKEND_INSTALL_DIR:=}"
# Create necessary directories
mkdir -p build/example
mkdir -p build/icicle
ICICLE_DIR=$(realpath "../../../icicle_v3/")
ICICLE_CUDA_SOURCE_DIR="${ICICLE_DIR}/backend/cuda"
# Build Icicle and the example app that links to it
if [ "$DEVICE_TYPE" == "CUDA" ] && [ ! -d "${ICICLE_BACKEND_INSTALL_DIR}" ] && [ -d "${ICICLE_CUDA_SOURCE_DIR}" ]; then
echo "Building icicle with CUDA backend"
cmake -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254 -DCUDA_BACKEND=local -S "${ICICLE_DIR}" -B build/icicle
export ICICLE_BACKEND_INSTALL_DIR=$(realpath "build/icicle/backend")
else
echo "Building icicle without CUDA backend, ICICLE_BACKEND_INSTALL_DIR=${ICICLE_BACKEND_INSTALL_DIR}"
cmake -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254 -S "${ICICLE_DIR}" -B build/icicle
fi
cmake -DCMAKE_BUILD_TYPE=Release -S . -B build/example
cmake --build build/icicle -j
cmake --build build/example -j
./build/example/example "$DEVICE_TYPE"

@@ -1,27 +1,16 @@
cmake_minimum_required(VERSION 3.18)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_STANDARD_REQUIRED TRUE)
set(CMAKE_CXX_STANDARD_REQUIRED TRUE)
if (${CMAKE_VERSION} VERSION_LESS "3.24.0")
set(CMAKE_CUDA_ARCHITECTURES ${CUDA_ARCH})
else()
set(CMAKE_CUDA_ARCHITECTURES native) # on 3.24+, on earlier it is ignored, and the target is not passed
endif ()
project(example LANGUAGES CUDA CXX)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --expt-relaxed-constexpr")
set(CMAKE_CUDA_FLAGS_RELEASE "")
set(CMAKE_CUDA_FLAGS_DEBUG "${CMAKE_CUDA_FLAGS_DEBUG} -g -G -O0")
# change the path to your Icicle location
include_directories("../../../icicle")
add_executable(
example
example.cu
)
project(example)
add_executable(example example.cpp)
target_include_directories(example PRIVATE "../../../icicle_v3/include" "..")
target_link_directories(example PRIVATE "${CMAKE_SOURCE_DIR}/build/icicle")
message("${CMAKE_BINARY_DIR}/icicle")
target_link_libraries(example PRIVATE icicle_curve_bn254 icicle_field_bn254 icicle_device)
if(BACKEND_DIR)
add_compile_definitions(BACKEND_DIR="${BACKEND_DIR}")
endif()
target_include_directories(example PRIVATE "../../../icicle/include")
target_link_libraries(example ${CMAKE_SOURCE_DIR}/build/icicle/lib/libingo_field_bn254.a)
find_library(NVML_LIBRARY nvidia-ml PATHS /usr/local/cuda/targets/x86_64-linux/lib/stubs/ )
target_link_libraries(example ${NVML_LIBRARY})
set_target_properties(example PROPERTIES CUDA_SEPARABLE_COMPILATION ON)

@@ -0,0 +1,30 @@
# Icicle Example: Polynomial Multiplication with NTT
## Key-Takeaway
Icicle provides polynomial multiplication using the Number Theoretic Transform (NTT), including forward and inverse transforms.
## Concise Usage Explanation
1. Include the necessary headers.
2. Initialize the NTT domain.
3. Prepare and transform the polynomials from host to device memory.
4. Perform pointwise multiplication.
5. Apply the inverse NTT.
## Running the example
```sh
# for CPU
./run.sh -d CPU
# for CUDA
./run.sh -d CUDA -b /path/to/cuda/backend/install/dir
```
## What's in the example
1. Define the size of the example.
2. Initialize input polynomials.
3. Perform Radix-2 or Mixed-Radix NTT.
4. Perform pointwise polynomial multiplication.
5. Apply the inverse NTT.

@@ -1,15 +0,0 @@
#!/bin/bash
# Exit immediately on error
set -e
mkdir -p build/example
mkdir -p build/icicle
# Configure and build Icicle
cmake -S ../../../icicle/ -B build/icicle -DMSM=OFF -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254
cmake --build build/icicle
# Configure and build the example application
cmake -S . -B build/example
cmake --build build/example

@@ -0,0 +1,102 @@
#include <iostream>
#include <vector>
#include <memory>
#include "icicle/runtime.h"
#include "icicle/api/bn254.h"
using namespace bn254;
#include "examples_utils.h"
void random_samples(scalar_t* res, uint32_t count)
{
for (int i = 0; i < count; i++)
res[i] = i < 1000 ? scalar_t::rand_host() : res[i - 1000];
}
void incremental_values(scalar_t* res, uint32_t count)
{
for (int i = 0; i < count; i++) {
res[i] = i ? res[i - 1] + scalar_t::one() * scalar_t::omega(4) : scalar_t::zero();
}
}
// calculating polynomial multiplication A*B via NTT, pointwise multiplication and INTT
// (1) allocate A,B on HOST. Randomize first half, zero second half
// (2) allocate A,B,Res on device
// (3) calc NTT for A and for B from host to device
// (4) multiply d_polyRes = NttAGpu * NttBGpu (pointwise)
// (5) INTT d_polyRes inplace
int main(int argc, char** argv)
{
try_load_and_set_backend_device(argc, argv);
int NTT_LOG_SIZE = 20;
int NTT_SIZE = 1 << NTT_LOG_SIZE;
// init domain
scalar_t basic_root = scalar_t::omega(NTT_LOG_SIZE);
bn254_ntt_init_domain(&basic_root, default_ntt_init_domain_config());
// (1) cpu allocation
auto polyA = std::make_unique<scalar_t[]>(NTT_SIZE);
auto polyB = std::make_unique<scalar_t[]>(NTT_SIZE);
random_samples(polyA.get(), NTT_SIZE >> 1); // second half zeros
random_samples(polyB.get(), NTT_SIZE >> 1); // second half zeros
scalar_t *d_polyA, *d_polyB, *d_polyRes;
DeviceProperties device_props;
ICICLE_CHECK(icicle_get_device_properties(device_props));
auto benchmark = [&](bool print) {
// (2) device input allocation. If device does not share memory with host, copy inputs explicitly and
ICICLE_CHECK(icicle_malloc((void**)&d_polyA, sizeof(scalar_t) * NTT_SIZE));
ICICLE_CHECK(icicle_malloc((void**)&d_polyB, sizeof(scalar_t) * NTT_SIZE));
ICICLE_CHECK(icicle_malloc((void**)&d_polyRes, sizeof(scalar_t) * NTT_SIZE));
// start recording
START_TIMER(poly_multiply)
// (3) NTT for A,B from host memory to device-memory
auto ntt_config = default_ntt_config<scalar_t>();
ntt_config.are_inputs_on_device = false;
ntt_config.are_outputs_on_device = true;
ntt_config.ordering = Ordering::kNM;
ICICLE_CHECK(bn254_ntt(polyA.get(), NTT_SIZE, NTTDir::kForward, ntt_config, d_polyA));
ICICLE_CHECK(bn254_ntt(polyB.get(), NTT_SIZE, NTTDir::kForward, ntt_config, d_polyB));
// (4) multiply A,B
VecOpsConfig config{
nullptr,
true, // is_a_on_device
true, // is_b_on_device
true, // is_result_on_device
false, // is_async
nullptr // ext
};
ICICLE_CHECK(bn254_vector_mul(d_polyA, d_polyB, NTT_SIZE, config, d_polyRes));
// (5) INTT (in place)
ntt_config.are_inputs_on_device = true;
ntt_config.are_outputs_on_device = true;
ntt_config.ordering = Ordering::kMN;
ICICLE_CHECK(bn254_ntt(d_polyRes, NTT_SIZE, NTTDir::kInverse, ntt_config, d_polyRes));
if (print) { END_TIMER(poly_multiply, "polynomial multiplication took"); }
ICICLE_CHECK(icicle_free(d_polyA));
ICICLE_CHECK(icicle_free(d_polyB));
ICICLE_CHECK(icicle_free(d_polyRes));
return eIcicleError::SUCCESS;
};
benchmark(false); // warmup
benchmark(true);
ICICLE_CHECK(bn254_ntt_release_domain());
return 0;
}

@@ -1,119 +0,0 @@
#include <chrono>
#include <iostream>
#include <vector>
#include <memory>
#include "api/bn254.h"
#include "gpu-utils/error_handler.cuh"
using namespace bn254;
typedef scalar_t test_scalar;
typedef scalar_t test_data;
void random_samples(test_data* res, uint32_t count)
{
for (int i = 0; i < count; i++)
res[i] = i < 1000 ? test_data::rand_host() : res[i - 1000];
}
void incremental_values(test_scalar* res, uint32_t count)
{
for (int i = 0; i < count; i++) {
res[i] = i ? res[i - 1] + test_scalar::one() * test_scalar::omega(4) : test_scalar::zero();
}
}
// calculating polynomial multiplication A*B via NTT, pointwise multiplication and INTT
// (1) allocate A,B on CPU. Randomize first half, zero second half
// (2) allocate NttAGpu, NttBGpu on GPU
// (3) calc NTT for A and for B from cpu to GPU
// (4) multiply MulGpu = NttAGpu * NttBGpu (pointwise)
// (5) INTT MulGpu inplace
int main(int argc, char** argv)
{
cudaEvent_t start, stop;
float measured_time;
int NTT_LOG_SIZE = 23;
int NTT_SIZE = 1 << NTT_LOG_SIZE;
CHK_IF_RETURN(cudaFree(nullptr)); // init GPU context
// init domain
auto ntt_config = ntt::default_ntt_config<test_scalar>();
const bool is_radix2_alg = (argc > 1) ? atoi(argv[1]) : false;
ntt_config.ntt_algorithm = is_radix2_alg ? ntt::NttAlgorithm::Radix2 : ntt::NttAlgorithm::MixedRadix;
const char* ntt_alg_str = is_radix2_alg ? "Radix-2" : "Mixed-Radix";
std::cout << "Polynomial multiplication with " << ntt_alg_str << " NTT: ";
CHK_IF_RETURN(cudaEventCreate(&start));
CHK_IF_RETURN(cudaEventCreate(&stop));
test_scalar basic_root = test_scalar::omega(NTT_LOG_SIZE);
bn254_initialize_domain(&basic_root, ntt_config.ctx, true /*=fast_twiddles_mode*/);
// (1) cpu allocation
auto CpuA = std::make_unique<test_data[]>(NTT_SIZE);
auto CpuB = std::make_unique<test_data[]>(NTT_SIZE);
random_samples(CpuA.get(), NTT_SIZE >> 1); // second half zeros
random_samples(CpuB.get(), NTT_SIZE >> 1); // second half zeros
test_data *GpuA, *GpuB, *MulGpu;
auto benchmark = [&](bool print, int iterations = 1) {
// start recording
CHK_IF_RETURN(cudaEventRecord(start, ntt_config.ctx.stream));
for (int iter = 0; iter < iterations; ++iter) {
// (2) gpu input allocation
CHK_IF_RETURN(cudaMallocAsync(&GpuA, sizeof(test_data) * NTT_SIZE, ntt_config.ctx.stream));
CHK_IF_RETURN(cudaMallocAsync(&GpuB, sizeof(test_data) * NTT_SIZE, ntt_config.ctx.stream));
// (3) NTT for A,B from cpu to gpu
ntt_config.are_inputs_on_device = false;
ntt_config.are_outputs_on_device = true;
ntt_config.ordering = ntt::Ordering::kNM;
CHK_IF_RETURN(bn254_ntt_cuda(CpuA.get(), NTT_SIZE, ntt::NTTDir::kForward, ntt_config, GpuA));
CHK_IF_RETURN(bn254_ntt_cuda(CpuB.get(), NTT_SIZE, ntt::NTTDir::kForward, ntt_config, GpuB));
// (4) multiply A,B
CHK_IF_RETURN(cudaMallocAsync(&MulGpu, sizeof(test_data) * NTT_SIZE, ntt_config.ctx.stream));
vec_ops::VecOpsConfig config{
ntt_config.ctx,
true, // is_a_on_device
true, // is_b_on_device
true, // is_result_on_device
false // is_async
};
CHK_IF_RETURN(bn254_mul_cuda(GpuA, GpuB, NTT_SIZE, config, MulGpu));
// (5) INTT (in place)
ntt_config.are_inputs_on_device = true;
ntt_config.are_outputs_on_device = true;
ntt_config.ordering = ntt::Ordering::kMN;
CHK_IF_RETURN(bn254_ntt_cuda(MulGpu, NTT_SIZE, ntt::NTTDir::kInverse, ntt_config, MulGpu));
CHK_IF_RETURN(cudaFreeAsync(GpuA, ntt_config.ctx.stream));
CHK_IF_RETURN(cudaFreeAsync(GpuB, ntt_config.ctx.stream));
CHK_IF_RETURN(cudaFreeAsync(MulGpu, ntt_config.ctx.stream));
}
CHK_IF_RETURN(cudaEventRecord(stop, ntt_config.ctx.stream));
CHK_IF_RETURN(cudaStreamSynchronize(ntt_config.ctx.stream));
CHK_IF_RETURN(cudaEventElapsedTime(&measured_time, start, stop));
if (print) { std::cout << measured_time / iterations << " MS" << std::endl; }
return CHK_LAST();
};
benchmark(false); // warmup
benchmark(true, 20);
bn254_release_domain(ntt_config.ctx);
CHK_IF_RETURN(cudaStreamSynchronize(ntt_config.ctx.stream));
return 0;
}

@@ -1,3 +1,65 @@
#!/bin/bash
./build/example/example 1 # radix2
./build/example/example 0 # mixed-radix
# Exit immediately if a command exits with a non-zero status
set -e
# Function to display usage information
show_help() {
echo "Usage: $0 [-d DEVICE_TYPE] [-b ICICLE_BACKEND_INSTALL_DIR]"
echo
echo "Options:"
echo " -d DEVICE_TYPE Specify the device type (default: CPU)"
echo " -b ICICLE_BACKEND_INSTALL_DIR Specify the backend installation directory (default: empty)"
echo " -h Show this help message"
exit 0
}
# Parse command line options
while getopts ":d:b:h" opt; do
case ${opt} in
d )
DEVICE_TYPE=$OPTARG
;;
b )
ICICLE_BACKEND_INSTALL_DIR="$(realpath "${OPTARG}")"
;;
h )
show_help
;;
\? )
echo "Invalid option: -$OPTARG" 1>&2
show_help
;;
: )
echo "Invalid option: -$OPTARG requires an argument" 1>&2
show_help
;;
esac
done
# Set default values if not provided
: "${DEVICE_TYPE:=CPU}"
: "${ICICLE_BACKEND_INSTALL_DIR:=}"
# Create necessary directories
mkdir -p build/example
mkdir -p build/icicle
ICICLE_DIR=$(realpath "../../../icicle_v3/")
ICICLE_CUDA_SOURCE_DIR="${ICICLE_DIR}/backend/cuda"
# Build Icicle and the example app that links to it
if [ "$DEVICE_TYPE" == "CUDA" ] && [ ! -d "${ICICLE_BACKEND_INSTALL_DIR}" ] && [ -d "${ICICLE_CUDA_SOURCE_DIR}" ]; then
echo "Building icicle with CUDA backend"
cmake -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254 -DMSM=OFF -DCUDA_BACKEND=local -S "${ICICLE_DIR}" -B build/icicle
export ICICLE_BACKEND_INSTALL_DIR=$(realpath "build/icicle/backend")
else
echo "Building icicle without CUDA backend, ICICLE_BACKEND_INSTALL_DIR=${ICICLE_BACKEND_INSTALL_DIR}"
cmake -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254 -DMSM=OFF -S "${ICICLE_DIR}" -B build/icicle
fi
cmake -DCMAKE_BUILD_TYPE=Release -S . -B build/example
cmake --build build/icicle -j
cmake --build build/example -j
./build/example/example "$DEVICE_TYPE"

@@ -1,15 +1,17 @@
#!/bin/bash
# #!/bin/bash
# Exit immediately on error
set -e
# TODO update for V3
mkdir -p build/example
mkdir -p build/icicle
# # Exit immediately on error
# set -e
# Configure and build Icicle
cmake -S ../../../icicle/ -B build/icicle -DMSM=OFF -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254
cmake --build build/icicle
# mkdir -p build/example
# mkdir -p build/icicle
# Configure and build the example application
cmake -S . -B build/example
cmake --build build/example
# # Configure and build Icicle
# cmake -S ../../../icicle/ -B build/icicle -DMSM=OFF -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254
# cmake --build build/icicle
# # Configure and build the example application
# cmake -S . -B build/example
# cmake --build build/example

@@ -4,6 +4,8 @@
#include "api/bn254.h"
#include "curves/params/bn254.cuh"
#include "poseidon/poseidon.cuh"
#include "hash/hash.cuh"
using namespace poseidon;
using namespace bn254;
@@ -14,12 +16,12 @@ inline uint32_t tree_index(uint32_t level, uint32_t offset) { return (1 << level
// We assume the tree has leaves already set, compute all other levels
void build_tree(
const uint32_t tree_height, scalar_t* tree, PoseidonConstants<scalar_t> * constants, PoseidonConfig config)
const uint32_t tree_height, scalar_t* tree, Poseidon<scalar_t> &poseidon, HashConfig &config)
{
for (uint32_t level = tree_height - 1; level > 0; level--) {
const uint32_t next_level = level - 1;
const uint32_t next_level_width = 1 << next_level;
bn254_poseidon_hash_cuda(&tree[tree_index(level, 0)], &tree[tree_index(next_level, 0)], next_level_width, 2, *constants, config);
poseidon.hash_many(&tree[tree_index(level, 0)], &tree[tree_index(next_level, 0)], next_level_width, 2, 1, config);
}
}
@@ -37,11 +39,7 @@ uint32_t query_membership(scalar_t query, scalar_t* tree, const uint32_t tree_he
}
void generate_proof(
uint32_t position,
scalar_t* tree,
const uint32_t tree_height,
uint32_t* proof_lr,
scalar_t* proof_hash)
uint32_t position, scalar_t* tree, const uint32_t tree_height, uint32_t* proof_lr, scalar_t* proof_hash)
{
uint32_t level_index = position;
for (uint32_t level = tree_height - 1; level > 0; level--) {
@@ -68,8 +66,8 @@ uint32_t validate_proof(
const uint32_t tree_height,
const uint32_t* proof_lr,
const scalar_t* proof_hash,
PoseidonConstants<scalar_t> * constants,
PoseidonConfig config)
Poseidon<scalar_t> &poseidon,
HashConfig &config)
{
scalar_t hashes_in[2], hash_out[1], level_hash;
level_hash = hash;
@@ -82,7 +80,7 @@ uint32_t validate_proof(
hashes_in[1] = level_hash;
}
// next level hash
bn254_poseidon_hash_cuda(hashes_in, hash_out, 1, 2, *constants, config);
poseidon.hash_many(hashes_in, hash_out, 1, 2, 1, config);
level_hash = hash_out[0];
}
return proof_hash[0] == level_hash;
@@ -112,16 +110,15 @@ int main(int argc, char* argv[])
d = d + scalar_t::one();
}
std::cout << "Hashing blocks into tree leaves..." << std::endl;
PoseidonConstants<scalar_t> constants;
bn254_init_optimized_poseidon_constants_cuda(data_arity, ctx, &constants);
PoseidonConfig config = default_poseidon_config(data_arity+1);
bn254_poseidon_hash_cuda(data, &tree[tree_index(leaf_level, 0)], tree_width, 4, constants, config);
Poseidon<scalar_t> poseidon(data_arity, ctx);
HashConfig config = default_hash_config(ctx);
poseidon.hash_many(data, &tree[tree_index(leaf_level, 0)], tree_width, data_arity, 1, config);
std::cout << "3. Building Merkle tree" << std::endl;
PoseidonConstants<scalar_t> tree_constants;
bn254_init_optimized_poseidon_constants_cuda(tree_arity, ctx, &tree_constants);
PoseidonConfig tree_config = default_poseidon_config(tree_arity+1);
build_tree(tree_height, tree, &tree_constants, tree_config);
Poseidon<scalar_t> tree_poseidon(tree_arity, ctx);
HashConfig tree_config = default_hash_config(ctx);
build_tree(tree_height, tree, tree_poseidon, tree_config);
std::cout << "4. Generate membership proof" << std::endl;
uint32_t position = tree_width - 1;
@@ -136,12 +133,12 @@ int main(int argc, char* argv[])
std::cout << "5. Validate the hash membership" << std::endl;
uint32_t validated;
const scalar_t hash = tree[tree_index(leaf_level, query_position)];
validated = validate_proof(hash, tree_height, proof_lr, proof_hash, &tree_constants, tree_config);
validated = validate_proof(hash, tree_height, proof_lr, proof_hash, tree_poseidon, tree_config);
std::cout << "Validated: " << validated << std::endl;
std::cout << "6. Tamper the hash" << std::endl;
const scalar_t tampered_hash = hash + scalar_t::one();
validated = validate_proof(tampered_hash, tree_height, proof_lr, proof_hash, &tree_constants, tree_config);
validated = validate_proof(tampered_hash, tree_height, proof_lr, proof_hash, tree_poseidon, tree_config);
std::cout << "7. Invalidate tamper hash membership" << std::endl;
std::cout << "Validated: " << validated << std::endl;

View File

@@ -1,2 +1,4 @@
#!/bin/bash
./build/example/example
# #!/bin/bash
# TODO update for V3
# ./build/example/example

View File

@@ -0,0 +1,16 @@
cmake_minimum_required(VERSION 3.18)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED TRUE)
project(example)
add_executable(example example.cpp)
target_include_directories(example PRIVATE "../../../icicle_v3/include" "..")
target_link_directories(example PRIVATE "${CMAKE_SOURCE_DIR}/build/icicle")
message("${CMAKE_BINARY_DIR}/icicle")
target_link_libraries(example PRIVATE icicle_field_babybear icicle_device)
if(BACKEND_DIR)
add_compile_definitions(BACKEND_DIR="${BACKEND_DIR}")
endif()

View File

@@ -0,0 +1,45 @@
# ICICLE example: RISC0's Fibonacci sequence proof using Polynomial API
## Why RISC0?
[RISC0 Protocol](https://www.risczero.com/) creates computational integrity proofs (a.k.a. Zero Knowledge Proofs) for programs executing on RISC-V architecture.
The proofs are created for sequences of values in RISC-V registers, called execution traces.
This approach is transparent to developers and enables the use of general purpose languages.
## Best-Practices
This example builds on the [ICICLE Polynomial API](../polynomial-api/README.md), so we recommend running that example first.
## Key-Takeaway
RISC0 encodes execution traces into very large polynomials and commits them using Merkle trees.
FRI speeds up validation of such commitments by recursively generating smaller polynomials (and trees) from larger ones.
The key enabler for *recursion* is the *redundancy* of polynomial commitments, hence the use of Reed-Solomon codes.
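
The recursive step can be illustrated with a minimal sketch. It is written in Go over a toy prime field (modulus 97) rather than with the ICICLE C++ API, and every name in it (`fold`, `foldFromEvals`, the chosen coefficients and challenge) is an assumption made for illustration. It halves the number of coefficients by combining the even and odd parts of a polynomial with a challenge `r`, then checks that the folded polynomial at `x*x` can be recomputed from the values of the previous polynomial at `x` and `-x` alone.

```go
package main

import "fmt"

const p = 97 // toy prime modulus, chosen only for illustration

func add(a, b uint64) uint64 { return (a + b) % p }
func mul(a, b uint64) uint64 { return (a * b) % p }
func neg(a uint64) uint64    { return (p - a%p) % p }

// pow computes a^e mod p by square-and-multiply.
func pow(a, e uint64) uint64 {
	r := uint64(1)
	a %= p
	for ; e > 0; e >>= 1 {
		if e&1 == 1 {
			r = mul(r, a)
		}
		a = mul(a, a)
	}
	return r
}

// inv computes the inverse of a nonzero element via Fermat's little theorem.
func inv(a uint64) uint64 { return pow(a, p-2) }

// eval evaluates f (coefficients, lowest degree first) at x with Horner's rule.
func eval(f []uint64, x uint64) uint64 {
	acc := uint64(0)
	for i := len(f) - 1; i >= 0; i-- {
		acc = add(mul(acc, x), f[i])
	}
	return acc
}

// fold returns even(f) + r*odd(f), which has about half as many coefficients as f.
func fold(f []uint64, r uint64) []uint64 {
	out := make([]uint64, (len(f)+1)/2)
	for i, c := range f {
		if i%2 == 0 {
			out[i/2] = add(out[i/2], c)
		} else {
			out[i/2] = add(out[i/2], mul(r, c))
		}
	}
	return out
}

// foldFromEvals recomputes fold(f, r)(x*x) from f(x) and f(-x) only; x must be nonzero.
func foldFromEvals(fx, fmx, x, r uint64) uint64 {
	xm := neg(x)
	left := mul(mul(add(r, x), fx), inv(mul(2, x)))
	right := mul(mul(add(r, xm), fmx), inv(mul(2, xm)))
	return add(left, right)
}

func main() {
	f := []uint64{3, 1, 4, 1, 5, 9, 2, 6}
	r, x := uint64(11), uint64(5)
	for round := 0; len(f) > 1; round++ {
		next := fold(f, r)
		lhs := eval(next, mul(x, x))
		rhs := foldFromEvals(eval(f, x), eval(f, neg(x)), x, r)
		fmt.Printf("round %d: %d -> %d coefficients, consistent: %v\n", round, len(f), len(next), lhs == rhs)
		f = next
	}
}
```

In a real protocol the challenge is chosen by the verifier in every round and the evaluations are opened from Merkle commitments; this sketch fixes both for readability.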
## Running the example
To run the example, from this example's directory:
```sh
# for CPU
./run.sh -d CPU
# for CUDA
./run.sh -d CUDA -b /path/to/cuda/backend/install/dir
```
## What's in the example
The example follows [STARK by Hand](https://dev.risczero.com/proof-system/stark-by-hand), structured in the following Lessons:
1. The Execution Trace
2. Rule checks to validate a computation
3. Padding the Trace
4. Constructing Trace Polynomials
5. ZK Commitments of the Trace Data
6. Constraint Polynomials
7. Mixing Constraint Polynomials
8. The Core of the RISC Zero STARK
9. The DEEP Technique
10. Mixing (Batching) for FRI
11. FRI Protocol (Commit Phase)
12. FRI Protocol (Query Phase)

View File

@@ -0,0 +1,293 @@
#include <iostream>
#include <memory>
#include <vector>
#include <list>
#include "examples_utils.h"
#include "icicle/polynomials/polynomials.h"
#include "icicle/api/babybear.h"
using namespace babybear;
// define the polynomial type
typedef Polynomial<scalar_t> Polynomial_t;
// RISC-V register type
typedef int64_t rv_t;
// Convert RISC-V registers to Finite Fields
void to_ff(rv_t* rv, scalar_t* s, size_t n)
{
for (int i = 0; i < n; ++i) {
s[i] = scalar_t::from(rv[i]);
}
}
void p_print(Polynomial_t* p, int logn, scalar_t shift, std::string header = "Print Vector")
{
std::cout << header << std::endl;
auto n = 1 << logn;
auto omega = scalar_t::omega(logn);
auto x = shift;
for (int i = 0; i < n; ++i) {
std::cout << i << ": " << (*p)(x) << std::endl;
x = x * omega;
}
}
// value to polynomial
Polynomial_t p_value(scalar_t value)
{
auto p_value = Polynomial_t::from_coefficients(&value, 1);
return p_value;
}
Polynomial_t p_rotate(Polynomial_t* p, int logn)
{
// rotate the polynomial's evaluations on the roots-of-unity domain right by one position
auto n = 1 << logn;
auto evaluations_rou_domain = std::make_unique<scalar_t[]>(n);
p->evaluate_on_rou_domain(logn, evaluations_rou_domain.get());
scalar_t tmp = evaluations_rou_domain[n - 1];
for (int i = n - 1; i > 0; --i) {
evaluations_rou_domain[i] = evaluations_rou_domain[i - 1];
}
evaluations_rou_domain[0] = tmp;
return Polynomial_t::from_rou_evaluations(evaluations_rou_domain.get(), n);
}
// mix polynomials (c.f. mix polynomial evaluations)
Polynomial_t p_mix(Polynomial_t* in[], size_t nmix, scalar_t mix_parameter)
{
scalar_t factor = mix_parameter;
Polynomial_t out = in[0]->clone();
for (int i = 1; i < nmix; ++i) {
out += factor * (*in[i]);
factor = factor * mix_parameter;
}
return out;
}
void solve_linear(scalar_t xa, scalar_t ya, scalar_t xb, scalar_t yb, scalar_t* coeffs)
{
coeffs[1] = (ya - yb) * scalar_t::inverse(xa - xb);
coeffs[0] = ya - coeffs[1] * xa;
}
std::unique_ptr<scalar_t[]> InterpolateOnLargerDomain(Polynomial_t* p, int n, scalar_t shift = scalar_t::one())
{
const int deg = p->degree();
auto input = std::make_unique<scalar_t[]>(n);
// TBD: check if scalar_t constructor initializes to zero
for (int i = 0; i < n; ++i) {
input[i] = scalar_t::zero();
}
p->copy_coeffs(input.get(), 0 /*start*/, deg);
auto ntt_config = default_ntt_config<scalar_t>();
ntt_config.coset_gen = shift;
auto evals_h = std::make_unique<scalar_t[]>(n);
ICICLE_CHECK(ntt(input.get(), n, NTTDir::kForward, ntt_config, evals_h.get()));
return evals_h;
}
int main(int argc, char** argv)
{
try_load_and_set_backend_device(argc, argv);
START_TIMER(risc0_example);
std::cout << "This is an ICICLE C++ implementation of the STARK by Hand Explainer." << std::endl;
std::cout << "https://dev.risczero.com/proof-system/stark-by-hand" << std::endl;
const int logn = 3;
const int n = 1 << logn;
std::cout << "Initializing NTT" << std::endl;
static const int MAX_NTT_LOG_SIZE = 24;
auto ntt_config = default_ntt_config<scalar_t>();
const scalar_t basic_root = scalar_t::omega(MAX_NTT_LOG_SIZE);
ntt_init_domain(basic_root, default_ntt_init_domain_config());
std::cout << std::endl << "Lesson 1: The Execution Trace" << std::endl;
// Trace: Data Columns
rv_t rv_d1_trace[] = {24, 30, 54, 84, 78, 15, 29, 50};
rv_t rv_d2_trace[] = {30, 54, 84, 138, 2, 77, 21, 36};
rv_t rv_d3_trace[] = {54, 84, 138, 222, 71, 17, 92, 33};
auto d1_trace = std::make_unique<scalar_t[]>(n);
auto d2_trace = std::make_unique<scalar_t[]>(n);
auto d3_trace = std::make_unique<scalar_t[]>(n);
to_ff(rv_d1_trace, d1_trace.get(), n);
to_ff(rv_d2_trace, d2_trace.get(), n);
to_ff(rv_d3_trace, d3_trace.get(), n);
// Trace: Control Columns
// Init steps are flagged in c1_trace
// Computation steps are flagged in c2_trace
// Termination step is flagged in c3_trace
// 0s at the end of each control column correspond to the padding of the trace
rv_t rv_c1_trace[] = {1, 0, 0, 0, 0, 0, 0, 0};
rv_t rv_c2_trace[] = {0, 1, 1, 1, 0, 0, 0, 0};
rv_t rv_c3_trace[] = {0, 0, 0, 1, 0, 0, 0, 0};
auto c1_trace = std::make_unique<scalar_t[]>(n);
auto c2_trace = std::make_unique<scalar_t[]>(n);
auto c3_trace = std::make_unique<scalar_t[]>(n);
to_ff(rv_c1_trace, c1_trace.get(), n);
to_ff(rv_c2_trace, c2_trace.get(), n);
to_ff(rv_c3_trace, c3_trace.get(), n);
std::cout << "Lesson 2: Rule checks to validate a computation" << std::endl;
std::cout << "We use rule-checking polynomials." << std::endl;
std::cout << "Lesson 3: Padding the Trace" << std::endl;
// The trace is padded to a power of 2 size to allow for efficient NTT operations.
// we already did this in the initialization of the trace data
// We will construct a zero-knowledge proof that:
// this trace represents a program that satisfies these 6 rules:
// 1) d1_trace[i] + d2_trace[i] == d3_trace[i] wherever a control flag is set (Fibonacci constraint)
// 2) d1_trace[0] == 24 (init 1 constraint)
// 3) d2_trace[0] == 30 (init 2 constraint)
// 4) d3_trace[3] == 222 (termination constraint)
// 5) if c2_trace[i] == 1, then d1_trace[i] == d2_trace[i-1]
// 6) if c2_trace[i] == 1, then d2_trace[i] == d3_trace[i-1]
std::cout << "Lesson 4: Constructing Trace Polynomials" << std::endl;
auto p_d1 = Polynomial_t::from_rou_evaluations(d1_trace.get(), n);
auto p_d2 = Polynomial_t::from_rou_evaluations(d2_trace.get(), n);
auto p_d3 = Polynomial_t::from_rou_evaluations(d3_trace.get(), n);
auto p_c1 = Polynomial_t::from_rou_evaluations(c1_trace.get(), n);
auto p_c2 = Polynomial_t::from_rou_evaluations(c2_trace.get(), n);
auto p_c3 = Polynomial_t::from_rou_evaluations(c3_trace.get(), n);
std::cout << "Lesson 5: ZK Commitments of the Trace Data" << std::endl;
std::cout << "To maintain a zk protocol, the trace polynomials are evaluated over a zk commitment domain"
<< std::endl;
std::cout << "zk commitment domain is a coset of Reed Solomon domain shifted by a basic root of unity" << std::endl;
scalar_t xzk = basic_root;
p_print(&p_d1, logn, xzk, "ZK commitment for d1 polynomial");
std::cout << "Build Merkle Tree for ZK commitments (outside the scope of this example)" << std::endl;
std::cout << "Lesson 6: Constraint Polynomials" << std::endl;
std::cout << "The constraints are used to check the correctness of the trace. In this example, we check 6 rules to "
"establish the validity of the trace."
<< std::endl;
auto p_fib_constraint = (p_d3 - p_d2 - p_d1) * (p_c1 + p_c2 + p_c3);
auto fib_constraint_zkcommitment = InterpolateOnLargerDomain(&p_fib_constraint, 4 * n, xzk);
auto p_init1_constraint = (p_d1 - p_value(scalar_t::from(24))) * p_c1;
// sanity checks printing
p_print(
&p_init1_constraint, logn + 2, scalar_t::one(), "Reed-Solomon constraint polynomial gives 0s in every 4th row");
p_print(&p_init1_constraint, logn + 2, xzk, "ZK Commitment constraint polynomial gives no 0s");
auto p_init2_constraint = (p_d2 - p_value(scalar_t::from(30))) * p_c1;
auto p_termination_constraint = (p_d3 - p_value(scalar_t::from(222))) * p_c3;
auto p_recursion_constraint1 = (p_d1 - p_rotate(&p_d2, logn)) * p_c2;
auto p_recursion_constraint2 = (p_d2 - p_rotate(&p_d3, logn)) * p_c2;
std::cout << std::endl << "Lesson 7: Mixing Constraint Polynomials" << std::endl;
Polynomial_t* p_all_constraints[] = {&p_fib_constraint, &p_init1_constraint, &p_init2_constraint,
&p_termination_constraint, &p_recursion_constraint1, &p_recursion_constraint2};
const size_t nmix = sizeof(p_all_constraints) / sizeof(p_all_constraints[0]);
auto p_mixed_constraints = p_mix(p_all_constraints, nmix, scalar_t::from(5));
std::cout << "All constraint polynomials are low-degree:" << std::endl;
for (int i = 0; i < nmix; ++i) {
std::cout << i << ": " << p_all_constraints[i]->degree() << std::endl;
}
std::cout << "Lesson 8: The Core of the RISC Zero STARK" << std::endl;
std::cout << "Degree of the mixed constraints polynomial: " << p_mixed_constraints.degree() << std::endl;
auto p_validity = p_mixed_constraints.divide_by_vanishing_polynomial(n);
std::cout << "Degree of the validity polynomial: " << p_validity.degree() << std::endl;
std::cout << "The Verifier should provide the Merkle commitment for the above" << std::endl;
std::cout << "Lesson 9: The DEEP Technique" << std::endl;
std::cout
<< "The DEEP technique improves the security of a single query by sampling outside of the commitment domain."
<< std::endl;
// In the original STARK protocol, the Verifier tests validity polynomial at a number of test points;
// the soundness of the protocol depends on the number of tests.
// The DEEP-ALI technique allows us to achieve a high degree of soundness with a single test.
// The details of DEEP are described in the following lesson.
auto DEEP_point = scalar_t::from(93);
std::cout << "The prover convinces the verifier that V=C/Z at the DEEP_test_point, " << DEEP_point << std::endl;
const scalar_t coeffs1[2] = {scalar_t::zero() - DEEP_point, scalar_t::one()};
auto denom_DEEP1 = Polynomial_t::from_coefficients(coeffs1, 2);
auto [p_d1_DEEP, r] = (p_d1 - p_value(p_d1(DEEP_point))).divide(denom_DEEP1);
std::cout << "The DEEP d1 degree is: " << p_d1_DEEP.degree() << std::endl;
// d2, d3 use recursion constraints and need the point corresponding to the previous state (clock cycle)
auto omega = scalar_t::omega(logn);
auto DEEP_prev_point = DEEP_point * scalar_t::inverse(omega);
auto coeffs2 = std::make_unique<scalar_t[]>(2);
coeffs2[0] = scalar_t::zero() - DEEP_prev_point;
coeffs2[1] = scalar_t::one();
auto denom_DEEP2 = Polynomial_t::from_coefficients(coeffs2.get(), 2);
auto coeffs_d2bar = std::make_unique<scalar_t[]>(2);
solve_linear(DEEP_point, p_d2(DEEP_point), DEEP_prev_point, p_d2(DEEP_prev_point), coeffs_d2bar.get());
auto d2bar = Polynomial_t::from_coefficients(coeffs_d2bar.get(), 2);
auto [p_d2_DEEP, r2] = (p_d2 - d2bar).divide(denom_DEEP1 * denom_DEEP2);
std::cout << "The DEEP d2 degree is: " << p_d2_DEEP.degree() << std::endl;
auto coeffs_d3bar = std::make_unique<scalar_t[]>(2);
solve_linear(DEEP_point, p_d3(DEEP_point), DEEP_prev_point, p_d3(DEEP_prev_point), coeffs_d3bar.get());
auto d3bar = Polynomial_t::from_coefficients(coeffs_d3bar.get(), 2);
auto [p_d3_DEEP, r3] = (p_d3 - d3bar).divide(denom_DEEP1 * denom_DEEP2);
std::cout << "The DEEP d3 degree is: " << p_d3_DEEP.degree() << std::endl;
// DEEP c{1,2,3} polynomials
const scalar_t coeffs_c1bar[1] = {p_c1(DEEP_point)};
auto c1bar = Polynomial_t::from_coefficients(coeffs_c1bar, 1);
auto [p_c1_DEEP, r_c1] = (p_c1 - c1bar).divide(denom_DEEP1);
std::cout << "The DEEP c1 degree is: " << p_c1_DEEP.degree() << std::endl;
const scalar_t coeffs_c2bar[1] = {p_c2(DEEP_point)};
auto c2bar = Polynomial_t::from_coefficients(coeffs_c2bar, 1);
auto [p_c2_DEEP, r_c2] = (p_c2 - c2bar).divide(denom_DEEP1);
std::cout << "The DEEP c2 degree is: " << p_c2_DEEP.degree() << std::endl;
const scalar_t coeffs_c3bar[1] = {p_c3(DEEP_point)};
auto c3bar = Polynomial_t::from_coefficients(coeffs_c3bar, 1);
auto [p_c3_DEEP, r_c3] = (p_c3 - c3bar).divide(denom_DEEP1);
std::cout << "The DEEP c3 degree is: " << p_c3_DEEP.degree() << std::endl;
// DEEP validity polynomial
const scalar_t coeffs_vbar[1] = {p_validity(DEEP_point)};
auto vbar = Polynomial_t::from_coefficients(coeffs_vbar, 1);
auto [v_DEEP, r_v] = (p_validity - vbar).divide(denom_DEEP1);
std::cout << "The DEEP validity polynomial degree is: " << v_DEEP.degree() << std::endl;
std::cout << "The Prover sends DEEP polynomials to the Verifier" << std::endl;
std::cout << "Lesson 10: Mixing (Batching) for FRI" << std::endl;
std::cout << "The initial FRI polynomial is the mix of the 7 DEEP polynomials." << std::endl;
Polynomial_t* all_DEEP[] = {&p_d1_DEEP, &p_d2_DEEP, &p_d3_DEEP, &p_c1_DEEP, &p_c2_DEEP, &p_c3_DEEP, &v_DEEP};
Polynomial_t fri_input = p_mix(all_DEEP, 7, scalar_t::from(99));
std::cout << "The degree of the mixed DEEP polynomial is: " << fri_input.degree() << std::endl;
std::cout << "Lesson 11: FRI Protocol (Commit Phase)" << std::endl;
std::cout << "The prover provides information to convince the verifier that the DEEP polynomials are low-degree."
<< std::endl;
const int nof_rounds = 3; // const so the array bounds below are constant expressions (no VLAs)
Polynomial_t feven[nof_rounds], fodd[nof_rounds], fri[nof_rounds + 1];
scalar_t rfri[nof_rounds];
fri[0] = fri_input.clone();
for (int i = 0; i < nof_rounds; ++i) {
feven[i] = fri[i].even();
fodd[i] = fri[i].odd();
rfri[i] = scalar_t::rand_host();
fri[i + 1] = feven[i] + rfri[i] * fodd[i];
std::cout << "The degree of the Round " << i << " polynomial is: " << fri[i + 1].degree() << std::endl;
}
std::cout << "Lesson 12: FRI Protocol (Query Phase)" << std::endl;
// We use Polynomial API to evaluate the FRI polynomials
// In practice, verifier will use Merkle commitments
auto xp = scalar_t::rand_host();
auto xm = scalar_t::zero() - xp;
scalar_t lhs[nof_rounds], rhs[nof_rounds];
for (int i = 0; i < nof_rounds; ++i) {
rhs[i] = (rfri[i] + xp) * fri[i](xp) * scalar_t::inverse(scalar_t::from(2) * xp) +
(rfri[i] + xm) * fri[i](xm) * scalar_t::inverse(scalar_t::from(2) * xm);
lhs[i] = fri[i + 1](xp * xp);
std::cout << "Round " << i << std::endl << "rhs: " << rhs[i] << std::endl << "lhs: " << lhs[i] << std::endl;
}
END_TIMER(risc0_example, "risc0 example");
return 0;
}

65
examples/c++/risc0/run.sh Executable file
View File

@@ -0,0 +1,65 @@
#!/bin/bash
# Exit immediately if a command exits with a non-zero status
set -e
# Function to display usage information
show_help() {
echo "Usage: $0 [-d DEVICE_TYPE] [-b ICICLE_BACKEND_INSTALL_DIR]"
echo
echo "Options:"
echo " -d DEVICE_TYPE Specify the device type (default: CPU)"
echo " -b ICICLE_BACKEND_INSTALL_DIR Specify the backend installation directory (default: empty)"
echo " -h Show this help message"
exit 0
}
# Parse command line options
while getopts ":d:b:h" opt; do
case ${opt} in
d )
DEVICE_TYPE=$OPTARG
;;
b )
ICICLE_BACKEND_INSTALL_DIR="$(realpath ${OPTARG})"
;;
h )
show_help
;;
\? )
echo "Invalid option: -$OPTARG" 1>&2
show_help
;;
: )
echo "Invalid option: -$OPTARG requires an argument" 1>&2
show_help
;;
esac
done
# Set default values if not provided
: "${DEVICE_TYPE:=CPU}"
: "${ICICLE_BACKEND_INSTALL_DIR:=}"
# Create necessary directories
mkdir -p build/example
mkdir -p build/icicle
ICILE_DIR=$(realpath "../../../icicle_v3/")
ICICLE_CUDA_SOURCE_DIR="${ICILE_DIR}/backend/cuda"
# Build Icicle and the example app that links to it
if [ "$DEVICE_TYPE" == "CUDA" ] && [ ! -d "${ICICLE_BACKEND_INSTALL_DIR}" ] && [ -d "${ICICLE_CUDA_SOURCE_DIR}" ]; then
echo "Building icicle with CUDA backend"
cmake -DCMAKE_BUILD_TYPE=Release -DFIELD=babybear -DCUDA_BACKEND=local -S "${ICILE_DIR}" -B build/icicle
export ICICLE_BACKEND_INSTALL_DIR=$(realpath "build/icicle/backend")
else
echo "Building icicle without CUDA backend, ICICLE_BACKEND_INSTALL_DIR=${ICICLE_BACKEND_INSTALL_DIR}"
cmake -DCMAKE_BUILD_TYPE=Release -DFIELD=babybear -S "${ICILE_DIR}" -B build/icicle
fi
cmake -DCMAKE_BUILD_TYPE=Release -S . -B build/example
cmake --build build/icicle -j
cmake --build build/example -j
./build/example/example "$DEVICE_TYPE"

View File

@@ -0,0 +1,34 @@
# ICICLE example: Multi-Scalar Multiplication (MSM) in Golang
`ICICLE` provides Golang bindings to a CUDA-accelerated C++ implementation of [Multi-Scalar Multiplication](https://github.com/ingonyama-zk/ingopedia/blob/master/src/msm.md).
## Usage
```go
err := Msm(
/* Scalars input vector */ scalars,
/* Points input vector */ points,
/* MSMConfig reference */ &cfg,
/* Projective point result */ results)
```
In this example we use `BN254` and `BLS12377` curves. The function computes $result = \sum_{i=0}^{size-1} scalars[i] \cdot points[i]$, where input `points[]` uses affine coordinates, and `result` uses projective coordinates.
## What's in the example
1. Define the size of the MSM.
2. Generate random inputs on-device.
3. Configure the MSM.
4. Execute the MSM on-device.
5. Move the result to the host.
Running the example:
```sh
go run main.go
```
> [!NOTE]
> The default sizes are 2^17 - 2^22. You can change this by passing the `-l <size> -u <size>` options. To change the size range to 2^21 - 2^24, run the example like this:
> ```sh
> go run main.go -l=21 -u=24
> ```

209
examples/golang/msm/main.go Normal file
View File

@@ -0,0 +1,209 @@
package main
import (
"flag"
"fmt"
"time"
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bls12377"
bls12377G2 "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bls12377/g2"
bls12377Msm "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bls12377/msm"
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254"
bn254G2 "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254/g2"
bn254Msm "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254/msm"
)
func main() {
var logSizeMin int
var logSizeMax int
flag.IntVar(&logSizeMin, "l", 17, "Minimum log size")
flag.IntVar(&logSizeMax, "u", 22, "Maximum log size")
flag.Parse()
sizeMax := 1 << logSizeMax
print("Generating BN254 scalars ... ")
startTime := time.Now()
scalarsBn254Max := bn254.GenerateScalars(sizeMax)
println(time.Since(startTime).String())
print("Generating BN254 points ... ")
startTime = time.Now()
pointsBn254Max := bn254.GenerateAffinePoints(sizeMax)
println(time.Since(startTime).String())
print("Generating BN254 G2 points ... ")
startTime = time.Now()
pointsBn254G2Max := bn254G2.G2GenerateAffinePoints(sizeMax)
println(time.Since(startTime).String())
print("Generating BLS12_377 scalars ... ")
startTime = time.Now()
scalarsBls12377Max := bls12377.GenerateScalars(sizeMax)
println(time.Since(startTime).String())
print("Generating BLS12_377 points ... ")
startTime = time.Now()
pointsBls12377Max := bls12377.GenerateAffinePoints(sizeMax)
println(time.Since(startTime).String())
print("Generating BLS12_377 G2 points ... ")
startTime = time.Now()
pointsBls12377G2Max := bls12377G2.G2GenerateAffinePoints(sizeMax)
println(time.Since(startTime).String())
for logSize := logSizeMin; logSize <= logSizeMax; logSize++ {
// Define the size of the problem: 2^logSize.
size := 1 << logSize
fmt.Printf("---------------------- MSM size 2^%d=%d ------------------------\n", logSize, size)
// println(scalarsBls12377, pointsBls12377, pointsBn254G2)
// println(scalarsBn254, pointsBn254, pointsBls12377G2)
print("Configuring bn254 MSM ... ")
startTime = time.Now()
scalarsBn254 := scalarsBn254Max[:size]
pointsBn254 := pointsBn254Max[:size]
pointsBn254G2 := pointsBn254G2Max[:size]
cfgBn254 := core.GetDefaultMSMConfig()
cfgBn254G2 := core.GetDefaultMSMConfig()
cfgBn254.IsAsync = true
cfgBn254G2.IsAsync = true
streamBn254, _ := cr.CreateStream()
streamBn254G2, _ := cr.CreateStream()
cfgBn254.Ctx.Stream = &streamBn254
cfgBn254G2.Ctx.Stream = &streamBn254G2
var projectiveBn254 bn254.Projective
var projectiveBn254G2 bn254G2.G2Projective
var msmResultBn254 core.DeviceSlice
var msmResultBn254G2 core.DeviceSlice
_, e := msmResultBn254.MallocAsync(projectiveBn254.Size(), projectiveBn254.Size(), streamBn254)
if e != cr.CudaSuccess {
errorString := fmt.Sprint(
"Bn254 Malloc failed: ", e)
panic(errorString)
}
_, e = msmResultBn254G2.MallocAsync(projectiveBn254G2.Size(), projectiveBn254G2.Size(), streamBn254G2)
if e != cr.CudaSuccess {
errorString := fmt.Sprint(
"Bn254 Malloc G2 failed: ", e)
panic(errorString)
}
println(time.Since(startTime).String())
print("Configuring Bls12377 MSM ... ")
startTime = time.Now()
scalarsBls12377 := scalarsBls12377Max[:size]
pointsBls12377 := pointsBls12377Max[:size]
pointsBls12377G2 := pointsBls12377G2Max[:size]
cfgBls12377 := core.GetDefaultMSMConfig()
cfgBls12377G2 := core.GetDefaultMSMConfig()
cfgBls12377.IsAsync = true
cfgBls12377G2.IsAsync = true
streamBls12377, _ := cr.CreateStream()
streamBls12377G2, _ := cr.CreateStream()
cfgBls12377.Ctx.Stream = &streamBls12377
cfgBls12377G2.Ctx.Stream = &streamBls12377G2
var projectiveBls12377 bls12377.Projective
var projectiveBls12377G2 bls12377G2.G2Projective
var msmResultBls12377 core.DeviceSlice
var msmResultBls12377G2 core.DeviceSlice
_, e = msmResultBls12377.MallocAsync(projectiveBls12377.Size(), projectiveBls12377.Size(), streamBls12377)
if e != cr.CudaSuccess {
errorString := fmt.Sprint(
"Bls12_377 Malloc failed: ", e)
panic(errorString)
}
_, e = msmResultBls12377G2.MallocAsync(projectiveBls12377G2.Size(), projectiveBls12377G2.Size(), streamBls12377G2)
if e != cr.CudaSuccess {
errorString := fmt.Sprint(
"Bls12_377 Malloc G2 failed: ", e)
panic(errorString)
}
println(time.Since(startTime).String())
print("Executing bn254 MSM on device ... ")
startTime = time.Now()
e = bn254Msm.Msm(scalarsBn254, pointsBn254, &cfgBn254, msmResultBn254)
if e != cr.CudaSuccess {
errorString := fmt.Sprint(
"bn254 Msm failed: ", e)
panic(errorString)
}
e = bn254G2.G2Msm(scalarsBn254, pointsBn254G2, &cfgBn254G2, msmResultBn254G2)
if e != cr.CudaSuccess {
errorString := fmt.Sprint(
"bn254 Msm G2 failed: ", e)
panic(errorString)
}
msmResultBn254Host := make(core.HostSlice[bn254.Projective], 1)
msmResultBn254G2Host := make(core.HostSlice[bn254G2.G2Projective], 1)
msmResultBn254Host.CopyFromDeviceAsync(&msmResultBn254, streamBn254)
msmResultBn254G2Host.CopyFromDeviceAsync(&msmResultBn254G2, streamBn254G2)
msmResultBn254.FreeAsync(streamBn254)
msmResultBn254G2.FreeAsync(streamBn254G2)
cr.SynchronizeStream(&streamBn254)
cr.SynchronizeStream(&streamBn254G2)
println(time.Since(startTime).String())
print("Executing Bls12377 MSM on device ... ")
startTime = time.Now()
e = bls12377Msm.Msm(scalarsBls12377, pointsBls12377, &cfgBls12377, msmResultBls12377)
if e != cr.CudaSuccess {
errorString := fmt.Sprint(
"bls12_377 Msm failed: ", e)
panic(errorString)
}
e = bls12377G2.G2Msm(scalarsBls12377, pointsBls12377G2, &cfgBls12377G2, msmResultBls12377G2)
if e != cr.CudaSuccess {
errorString := fmt.Sprint(
"bls12_377 Msm G2 failed: ", e)
panic(errorString)
}
msmResultBls12377Host := make(core.HostSlice[bls12377.Projective], 1)
msmResultBls12377G2Host := make(core.HostSlice[bls12377G2.G2Projective], 1)
msmResultBls12377Host.CopyFromDeviceAsync(&msmResultBls12377, streamBls12377)
msmResultBls12377G2Host.CopyFromDeviceAsync(&msmResultBls12377G2, streamBls12377G2)
msmResultBls12377.FreeAsync(streamBls12377)
msmResultBls12377G2.FreeAsync(streamBls12377G2)
cr.SynchronizeStream(&streamBls12377)
cr.SynchronizeStream(&streamBls12377G2)
println(time.Since(startTime).String())
}
}

View File

@@ -0,0 +1,39 @@
# ICICLE example: Number Theoretic Transform (NTT) in Golang
## Key-Takeaway
`ICICLE` provides Golang bindings to a CUDA-accelerated C++ implementation of [Number Theoretic Transform](https://github.com/ingonyama-zk/ingopedia/blob/master/src/fft.md).
## Usage
```go
err := Ntt(
/* input slice */ scalars,
/* NTT Direction */ core.KForward,
/* NTT Configuration */ &cfg,
/* output slice */ result)
```
In this example we use the `BN254` and `BLS12377` fields.
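
As a reference for what the transform computes, the following is a minimal quadratic-time sketch over a toy field. It does not call ICICLE: the modulus 17, the root of unity 2 (which has order 8 modulo 17) and the function names `ntt` and `intt` are assumptions chosen for illustration. The forward transform evaluates the input, read as polynomial coefficients, at the powers of the root of unity; the inverse uses the inverse root and divides by the size.

```go
package main

import "fmt"

const p = 17 // toy prime modulus; 2 has multiplicative order 8 modulo 17

// pow computes a^e mod p by square-and-multiply.
func pow(a, e uint64) uint64 {
	r := uint64(1)
	a %= p
	for ; e > 0; e >>= 1 {
		if e&1 == 1 {
			r = r * a % p
		}
		a = a * a % p
	}
	return r
}

// inv computes the inverse of a nonzero element via Fermat's little theorem.
func inv(a uint64) uint64 { return pow(a, p-2) }

// ntt computes out[k] = sum_j in[j] * omega^(j*k) mod p in O(n^2) time.
// omega must be a primitive len(in)-th root of unity modulo p.
func ntt(in []uint64, omega uint64) []uint64 {
	n := len(in)
	out := make([]uint64, n)
	for k := 0; k < n; k++ {
		wk := pow(omega, uint64(k)) // omega^k
		w := uint64(1)              // runs through omega^(j*k)
		for j := 0; j < n; j++ {
			out[k] = (out[k] + in[j]*w) % p
			w = w * wk % p
		}
	}
	return out
}

// intt inverts ntt: transform with omega^-1, then scale by n^-1.
func intt(in []uint64, omega uint64) []uint64 {
	out := ntt(in, inv(omega))
	nInv := inv(uint64(len(in)))
	for i := range out {
		out[i] = out[i] * nInv % p
	}
	return out
}

func main() {
	omega := uint64(2)
	x := []uint64{1, 2, 3, 4, 5, 6, 7, 8}
	X := ntt(x, omega)
	fmt.Println("forward:", X)
	fmt.Println("round trip:", intt(X, omega))
}
```

Production implementations use the O(n log n) butterfly structure instead of this double loop, but the input/output relationship is the same.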
## What's in this example
1. Define the size of the NTT.
2. Generate random inputs.
3. Set up the domain.
4. Configure the NTT.
5. Execute the NTT on-device.
6. Move the result to the host.
Running the example:
```sh
go run main.go
```
> [!NOTE]
> The default size is 2^20. You can change this by passing the `-s <size>` option. To change the size to 2^23, run the example like this:
> ```sh
> go run main.go -s=23
> ```

131
examples/golang/ntt/main.go Normal file
View File

@@ -0,0 +1,131 @@
package main
import (
"flag"
"fmt"
"time"
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bls12377"
bls12377Ntt "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bls12377/ntt"
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254"
bn254Ntt "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254/ntt"
bls12377Fft "github.com/consensys/gnark-crypto/ecc/bls12-377/fr/fft"
bn254Fft "github.com/consensys/gnark-crypto/ecc/bn254/fr/fft"
)
func main() {
var logSize int
flag.IntVar(&logSize, "s", 20, "Log size")
flag.Parse()
size := 1 << logSize
fmt.Printf("---------------------- NTT size 2^%d=%d ------------------------\n", logSize, size)
print("Generating BN254 scalars ... ")
startTime := time.Now()
scalarsBn254 := bn254.GenerateScalars(size)
println(time.Since(startTime).String())
cfgBn254 := bn254Ntt.GetDefaultNttConfig()
cfgBn254.IsAsync = true
print("Generating BLS12_377 scalars ... ")
startTime = time.Now()
scalarsBls12377 := bls12377.GenerateScalars(size)
println(time.Since(startTime).String())
cfgBls12377 := bls12377Ntt.GetDefaultNttConfig()
cfgBls12377.IsAsync = true
rouMontBn254, _ := bn254Fft.Generator(uint64(size))
rouBn254 := rouMontBn254.Bits()
rouIcicleBn254 := bn254.ScalarField{}
limbsBn254 := core.ConvertUint64ArrToUint32Arr(rouBn254[:])
rouIcicleBn254.FromLimbs(limbsBn254)
bn254Ntt.InitDomain(rouIcicleBn254, cfgBn254.Ctx, false)
rouMontBls12377, _ := bls12377Fft.Generator(uint64(size))
rouBls12377 := rouMontBls12377.Bits()
rouIcicleBls12377 := bls12377.ScalarField{}
limbsBls12377 := core.ConvertUint64ArrToUint32Arr(rouBls12377[:])
rouIcicleBls12377.FromLimbs(limbsBls12377)
bls12377Ntt.InitDomain(rouIcicleBls12377, cfgBls12377.Ctx, false)
print("Configuring bn254 NTT ... ")
startTime = time.Now()
streamBn254, _ := cr.CreateStream()
cfgBn254.Ctx.Stream = &streamBn254
var nttResultBn254 core.DeviceSlice
_, e := nttResultBn254.MallocAsync(size*scalarsBn254.SizeOfElement(), scalarsBn254.SizeOfElement(), streamBn254)
if e != cr.CudaSuccess {
errorString := fmt.Sprint(
"Bn254 Malloc failed: ", e)
panic(errorString)
}
println(time.Since(startTime).String())
print("Configuring Bls12377 NTT ... ")
startTime = time.Now()
streamBls12377, _ := cr.CreateStream()
cfgBls12377.Ctx.Stream = &streamBls12377
var nttResultBls12377 core.DeviceSlice
_, e = nttResultBls12377.MallocAsync(size*scalarsBls12377.SizeOfElement(), scalarsBls12377.SizeOfElement(), streamBls12377)
if e != cr.CudaSuccess {
errorString := fmt.Sprint(
"Bls12_377 Malloc failed: ", e)
panic(errorString)
}
println(time.Since(startTime).String())
print("Executing bn254 NTT on device ... ")
startTime = time.Now()
err := bn254Ntt.Ntt(scalarsBn254, core.KForward, &cfgBn254, nttResultBn254)
if err.CudaErrorCode != cr.CudaSuccess {
errorString := fmt.Sprint(
"bn254 Ntt failed: ", err)
panic(errorString)
}
nttResultBn254Host := make(core.HostSlice[bn254.ScalarField], size)
nttResultBn254Host.CopyFromDeviceAsync(&nttResultBn254, streamBn254)
nttResultBn254.FreeAsync(streamBn254)
cr.SynchronizeStream(&streamBn254)
println(time.Since(startTime).String())
print("Executing Bls12377 NTT on device ... ")
startTime = time.Now()
err = bls12377Ntt.Ntt(scalarsBls12377, core.KForward, &cfgBls12377, nttResultBls12377)
if err.CudaErrorCode != cr.CudaSuccess {
errorString := fmt.Sprint(
"bls12_377 Ntt failed: ", err)
panic(errorString)
}
nttResultBls12377Host := make(core.HostSlice[bls12377.ScalarField], size)
nttResultBls12377Host.CopyFromDeviceAsync(&nttResultBls12377, streamBls12377)
nttResultBls12377.FreeAsync(streamBls12377)
cr.SynchronizeStream(&streamBls12377)
println(time.Since(startTime).String())
}


@@ -0,0 +1,49 @@
# ICICLE example: Polynomials in Golang
`ICICLE` provides Golang bindings to a CUDA-accelerated C++ implementation of [Polynomials](https://dev.ingonyama.com/icicle/polynomials/overview).
## Usage
### Backend Initialization
```go
InitPolyBackend()
```
### Construction
```go
poly1 := CreateFromCoeffecitients(/* Coefficients of polynomial */ coeffs)
poly2 := CreateFromROUEvaluations(/* evaluations */ evals)
poly3 := Clone(/* polynomial to clone */ poly1)
```
### Arithmetic
```go
polyAdd := poly1.Add(&poly2)
polySub := poly1.Subtract(&poly2)
polyMul := poly1.Multiply(&poly2)
polyMulScalar := poly1.MultiplyByScalar(scalar)
quotient, remainder := poly1.Divide(&poly2)
```
### Evaluation
```go
ev := poly1.Eval(scalar)
poly1.EvalOnDomain(domain, evals) // writes poly1's values on domain into evals
```
In this example we use the `BN254` and `Babybear` fields. The example demonstrates arithmetic operations and evaluations.
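To make the arithmetic and evaluation calls above more concrete, here is a minimal CPU-only sketch of what they compute for dense polynomials over the Babybear prime field. It does not use ICICLE at all: the helper names `polyAdd`, `polyMul` and `polyEval` are hypothetical, coefficients are assumed to be stored lowest degree first and already reduced modulo the prime, and the multiplication is plain schoolbook rather than whatever the library does internally.

```go
package main

import "fmt"

// p is the Babybear prime 2^31 - 2^27 + 1.
const p uint64 = 2013265921

// polyAdd returns a(x) + b(x) mod p. Inputs are assumed reduced mod p.
func polyAdd(a, b []uint64) []uint64 {
	if len(a) < len(b) {
		a, b = b, a
	}
	out := make([]uint64, len(a))
	copy(out, a)
	for i, c := range b {
		out[i] = (out[i] + c) % p
	}
	return out
}

// polyMul returns a(x) * b(x) mod p using schoolbook multiplication.
// Each product is below 2^62, so the sum fits in a uint64.
func polyMul(a, b []uint64) []uint64 {
	if len(a) == 0 || len(b) == 0 {
		return nil
	}
	out := make([]uint64, len(a)+len(b)-1)
	for i, x := range a {
		for j, y := range b {
			out[i+j] = (out[i+j] + x*y) % p
		}
	}
	return out
}

// polyEval evaluates a(x) at a single point using Horner's rule.
func polyEval(a []uint64, x uint64) uint64 {
	var acc uint64
	for i := len(a) - 1; i >= 0; i-- {
		acc = (acc*x + a[i]) % p
	}
	return acc
}

func main() {
	f := []uint64{1, 2}    // 1 + 2x
	g := []uint64{3, 0, 1} // 3 + x^2

	sum := polyAdd(f, g)
	prod := polyMul(f, g)

	fmt.Println("f + g =", sum)
	fmt.Println("f * g =", prod)
	// Evaluating the product agrees with multiplying the evaluations.
	fmt.Println(polyEval(prod, 5) == polyEval(f, 5)*polyEval(g, 5)%p)
}
```

The last line checks the identity $(f \cdot g)(x) = f(x) \cdot g(x)$ at a single point, which is a cheap sanity check for any polynomial backend.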
## What's in the example
1. Define the size of polynomials.
2. Initialize backends.
3. Generate random polynomials.
4. Execute arithmetic operations.
5. Execute evaluations.
6. Execute slicing.
Running the example:
```sh
go run main.go
```
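The slicing step can be illustrated without a GPU. The sketch below assumes that slicing means splitting a polynomial into its even-indexed and odd-indexed coefficients, and then combines the two halves as `even(x) + s*odd(x)`. The helpers `split`, `fold` and `polyEval` are hypothetical names for this illustration, not ICICLE functions, and the Babybear prime is used only to keep the arithmetic in 64 bits.

```go
package main

import "fmt"

// p is the Babybear prime 2^31 - 2^27 + 1.
const p uint64 = 2013265921

// split returns the even-indexed and odd-indexed coefficients of a,
// so that a(x) = even(x^2) + x*odd(x^2).
func split(a []uint64) (even, odd []uint64) {
	for i, c := range a {
		if i%2 == 0 {
			even = append(even, c)
		} else {
			odd = append(odd, c)
		}
	}
	return even, odd
}

// fold returns even(x) + s*odd(x) mod p. Inputs are assumed reduced mod p,
// and odd is never longer than even when both come from split.
func fold(even, odd []uint64, s uint64) []uint64 {
	out := make([]uint64, len(even))
	copy(out, even)
	for i, c := range odd {
		out[i] = (out[i] + s*c) % p
	}
	return out
}

// polyEval evaluates a(x) at a single point using Horner's rule.
func polyEval(a []uint64, x uint64) uint64 {
	var acc uint64
	for i := len(a) - 1; i >= 0; i-- {
		acc = (acc*x + a[i]) % p
	}
	return acc
}

func main() {
	a := []uint64{1, 2, 3, 4, 5} // 1 + 2x + 3x^2 + 4x^3 + 5x^4
	s := uint64(7)

	even, odd := split(a)
	folded := fold(even, odd, s)

	fmt.Println("even:", even, "odd:", odd, "folded:", folded)
	// Folding with s and evaluating at s^2 reproduces a(s).
	fmt.Println(polyEval(folded, s*s%p) == polyEval(a, s))
}
```

The final check relies on the identity $a(s) = e(s^2) + s \cdot o(s^2)$, which is why this kind of even/odd fold halves the polynomial size while preserving one evaluation.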


@@ -0,0 +1,114 @@
package main
import (
"flag"
"fmt"
bn254Fft "github.com/consensys/gnark-crypto/ecc/bn254/fr/fft"
cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254"
bn254Ntt "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254/ntt"
bn254Polynomial "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254/polynomial"
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
babybear "github.com/ingonyama-zk/icicle/v2/wrappers/golang/fields/babybear"
babybearNtt "github.com/ingonyama-zk/icicle/v2/wrappers/golang/fields/babybear/ntt"
babybearPolynomial "github.com/ingonyama-zk/icicle/v2/wrappers/golang/fields/babybear/polynomial"
)
var maxNttLogSize uint
var polyLogSize uint
func initBn254Domain() core.IcicleError {
deviceCfg, _ := cr.GetDefaultDeviceContext()
rouMontBn254, _ := bn254Fft.Generator(uint64(1 << maxNttLogSize))
rouBn254 := rouMontBn254.Bits()
rouIcicleBn254 := bn254.ScalarField{}
limbsBn254 := core.ConvertUint64ArrToUint32Arr(rouBn254[:])
rouIcicleBn254.FromLimbs(limbsBn254)
return bn254Ntt.InitDomain(rouIcicleBn254, deviceCfg, false)
}
func initBabybearDomain() core.IcicleError {
deviceCfg, _ := cr.GetDefaultDeviceContext()
rouIcicle := babybear.ScalarField{}
rouIcicle.FromUint32(1461624142)
return babybearNtt.InitDomain(rouIcicle, deviceCfg, false)
}
func init() {
flag.UintVar(&maxNttLogSize, "maxNttLogSize", 20, "")
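flag.UintVar(&polyLogSize, "polyLogSize", 15, "")
// Parse here so command-line values take effect before the domains are initialized below.
flag.Parse()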
e := initBn254Domain()
if e.IcicleErrorCode != core.IcicleSuccess {
errorString := fmt.Sprint(
"Bn254 Domain initialization failed: ", e)
panic(errorString)
}
e = initBabybearDomain()
if e.IcicleErrorCode != core.IcicleSuccess {
errorString := fmt.Sprint(
"Babybear Domain initialization failed: ", e)
panic(errorString)
}
bn254Polynomial.InitPolyBackend()
babybearPolynomial.InitPolyBackend()
}
func main() {
polySize := 1 << polyLogSize
// randomize three polynomials over bn254 scalar field
var fBn254 bn254Polynomial.DensePolynomial
var gBn254 bn254Polynomial.DensePolynomial
var hBn254 bn254Polynomial.DensePolynomial
fBn254.CreateFromCoeffecitients(bn254.GenerateScalars(polySize))
gBn254.CreateFromCoeffecitients(bn254.GenerateScalars(polySize / 2))
hBn254.CreateFromROUEvaluations(bn254.GenerateScalars(polySize / 4))
// randomize two polynomials over babybear field
var fBabybear babybearPolynomial.DensePolynomial
var gBabybear babybearPolynomial.DensePolynomial
fBabybear.CreateFromCoeffecitients(babybear.GenerateScalars(polySize))
gBabybear.CreateFromCoeffecitients(babybear.GenerateScalars(polySize / 2))
// Arithmetic
t0 := fBn254.Add(&gBn254)
t1 := fBn254.Multiply(&hBn254)
q, r := t1.Divide(&t0)
rBabybear := fBabybear.Add(&gBabybear)
rDegree := r.Degree()
_ = rBabybear
_ = rDegree
// evaluate in single domain point
var five bn254.ScalarField
five.FromUint32(5)
qAtFive := q.Eval(five)
var thirty bn254.ScalarField
thirty.FromUint32(30)
// evaluate on domain. Note: domain and image can each be either a Host or a Device slice.
// in this example the domain is on the host and the evaluations are on the device.
hostDomain := core.HostSliceFromElements([]bn254.ScalarField{five, thirty})
var deviceImage core.DeviceSlice
_, err := deviceImage.Malloc(five.Size()*hostDomain.Len(), five.Size())
if err != cr.CudaSuccess {
errorString := fmt.Sprint(
"deviceImage allocation failed: ", err)
panic(errorString)
}
t1.EvalOnDomain(hostDomain, deviceImage)
// slicing
o := hBn254.Odd()
e := hBn254.Even()
oddMult := o.MultiplyByScalar(qAtFive)
fold := e.Add(&oddMult) // e(x) + o(x)*scalar
coeff := fold.GetCoeff(2) // coeff of x^2
_ = coeff
}


@@ -4,15 +4,11 @@ version = "1.2.0"
edition = "2018"
[dependencies]
icicle-cuda-runtime = { path = "../../../wrappers/rust/icicle-cuda-runtime" }
icicle-core = { path = "../../../wrappers/rust/icicle-core" }
icicle-bn254 = { path = "../../../wrappers/rust/icicle-curves/icicle-bn254", features = ["g2"] }
icicle-bls12-377 = { path = "../../../wrappers/rust/icicle-curves/icicle-bls12-377" }
ark-bn254 = { version = "0.4.0", optional = true }
ark-bls12-377 = { version = "0.4.0", optional = true }
ark-ec = { version = "0.4.0", optional = true }
icicle-runtime = { path = "../../../wrappers/rust_v3/icicle-runtime" }
icicle-core = { path = "../../../wrappers/rust_v3/icicle-core" }
icicle-bn254 = { path = "../../../wrappers/rust_v3/icicle-curves/icicle-bn254", features = ["g2"] }
icicle-bls12-377 = { path = "../../../wrappers/rust_v3/icicle-curves/icicle-bls12-377" }
clap = { version = "<=4.4.12", features = ["derive"] }
[features]
arkworks = ["ark-bn254", "ark-bls12-377", "ark-ec", "icicle-core/arkworks", "icicle-bn254/arkworks", "icicle-bls12-377/arkworks"]
profile = []
cuda = ["icicle-runtime/cuda_backend", "icicle-bn254/cuda_backend", "icicle-bls12-377/cuda_backend"]


@@ -2,10 +2,6 @@
`ICICLE` provides Rust bindings to CUDA-accelerated C++ implementation of [Multi-Scalar Multiplication](https://github.com/ingonyama-zk/ingopedia/blob/master/src/msm.md).
## Best Practices
In order to save time and setting up prerequisites manually, we recommend running this example in our [ZKContainer](../../ZKContainer.md).
## Usage
```rust
@@ -28,29 +24,9 @@ In this example we use `BN254` curve. The function computes $result = \sum_{i=0}
Running the example:
```sh
cargo run --release
# for CPU
./run.sh -d CPU
# for CUDA
./run.sh -d CUDA -b /path/to/cuda/backend/install/dir
```
You can add the `--feature arkworks,profile` flag to measure times of both ICICLE and arkworks.
> [!NOTE]
> The default sizes are 2^19 - 2^23. You can change this by passing the `--lower_bound_log_size <size> --upper_bound_log_size <size>` options. To change the size range to 2^21 - 2^24, run the example like this:
> ```sh
> cargo run --release -- -l 21 -u 24
> ```
## Benchmarks
These benchmarks were run on a 16 core 24 thread i9-12900k CPU and an RTX 3090 Ti GPU
### Single BN254 MSM
| Library\Size | 2^19 | 2^20 | 2^21 | 2^22 | 2^23 |
|--------------|------|------|------|------|------|
| ICICLE | 10 ms | 11 ms | 21 ms | 39 ms | 77 ms |
| Arkworks | 284 ms | 540 ms | 1,152 ms | 2,320 ms | 4,491 ms |
### Single BLS12377 MSM
| Library\Size | 2^19 | 2^20 | 2^21 | 2^22 | 2^23 |
|--------------|------|------|------|------|------|
| ICICLE | 9 ms | 14 ms | 25 ms | 48 ms | 93 ms |
| Arkworks | 490 ms | 918 ms | 1,861 ms | 3,624 ms | 7,191 ms |

Some files were not shown because too many files have changed in this diff.