Compare commits

...

12 Commits

Author SHA1 Message Date
Leon Hibnik
080071870f Update wrappers/rust/icicle-core/src/vec_ops/mod.rs 2024-08-06 13:52:55 +03:00
Leon Hibnik
1fc640c8b3 Apply suggestions from code review 2024-08-06 13:51:00 +03:00
danny-shterman
b10ee73b5e Remove ## Best-Practices section 2024-08-04 06:19:10 +00:00
danny-shterman
117ee4c4a4 Fix formatting. 2024-07-31 14:24:25 +00:00
danny-shterman
993652ae8d Formatting for rust and golang. 2024-07-31 13:47:45 +00:00
danny-shterman
9e452c47d0 Add montgomery config in rust and golang files. 2024-07-31 09:35:53 +00:00
danny-shterman
132abbb2d4 Fixed style by clang-format 2024-07-30 05:39:32 +00:00
danny-shterman
118c82f829 Fixed style by clang-format 2024-07-30 05:38:30 +00:00
danny-shterman
7e2b42f756 Add montgomery to vec_ops and example of that 2024-07-29 06:39:49 +00:00
omahs
53f34aade5 Fix typos (#558)
Fix typos
2024-07-18 11:58:04 +03:00
release-bot
aacec3f72f Bump rust crates' version
icicle-babybear@2.8.0
icicle-bls12-377@2.8.0
icicle-bls12-381@2.8.0
icicle-bn254@2.8.0
icicle-bw6-761@2.8.0
icicle-core@2.8.0
icicle-cuda-runtime@2.8.0
icicle-grumpkin@2.8.0
icicle-hash@2.8.0
icicle-m31@2.8.0
icicle-stark252@2.8.0

Generated by cargo-workspaces
2024-07-16 13:57:56 +00:00
ChickenLover
a8fa05d0e3 Feat/roman/hash docs (#556)
## Describe the changes

This PR...

## Linked Issues

Resolves #

---------

Co-authored-by: Jeremy Felder <jeremy.felder1@gmail.com>
2024-07-16 16:39:35 +03:00
28 changed files with 1145 additions and 96 deletions

View File

@@ -1,6 +1,6 @@
# Contributor's Guide
We welcome all contributions with open arms. At Ingonyama we take a village approach, believing it takes many hands and minds to build a ecosystem.
We welcome all contributions with open arms. At Ingonyama we take a village approach, believing it takes many hands and minds to build an ecosystem.
## Contributing to ICICLE
@@ -14,9 +14,9 @@ We welcome all contributions with open arms. At Ingonyama we take a village appr
When opening a [pull request](https://github.com/ingonyama-zk/icicle/pulls) please keep the following in mind.
- `Clear Purpose` - The pull request should solve a single issue and be clean of any unrelated changes.
- `Clear description` - If the pull request is for a new feature describe what you built, why you added it and how its best that we test it. For bug fixes please describe the issue and the solution.
- `Clear description` - If the pull request is for a new feature describe what you built, why you added it and how it's best that we test it. For bug fixes please describe the issue and the solution.
- `Consistent style` - Rust and Golang code should be linted by the official linters (golang fmt and rust fmt) and maintain a proper style. For CUDA and C++ code we use [`clang-format`](https://github.com/ingonyama-zk/icicle/blob/main/.clang-format), [here](https://github.com/ingonyama-zk/icicle/blob/605c25f9d22135c54ac49683b710fe2ce06e2300/.github/workflows/main-format.yml#L46) you can see how we run it.
- `Minimal Tests` - please add test which cover basic usage of your changes .
- `Minimal Tests` - please add test which cover basic usage of your changes.
## Questions?

View File

@@ -47,7 +47,7 @@ type NTTConfig[T any] struct {
- **`areInputsOnDevice`**: Indicates if input scalars are located on the device.
- **`areOutputsOnDevice`**: Indicates if results are stored on the device.
- **`IsAsync`**: Controls whether the NTT operation runs asynchronously.
- **`NttAlgorithm`**: Explicitly select the NTT algorithm. ECNTT supports running on `Radix2` algoruithm.
- **`NttAlgorithm`**: Explicitly select the NTT algorithm. ECNTT supports running on `Radix2` algorithm.
### Default Configuration

View File

@@ -139,7 +139,7 @@ cfg.Ctx.IsBigTriangle = true
Toggling between MSM modes occurs automatically based on the number of results you are expecting from the `MSM` function.
The number of results is interpreted from the size of `var out core.DeviceSlice`. Thus its important when allocating memory for `var out core.DeviceSlice` to make sure that you are allocating `<number of results> X <size of a single point>`.
The number of results is interpreted from the size of `var out core.DeviceSlice`. Thus it's important when allocating memory for `var out core.DeviceSlice` to make sure that you are allocating `<number of results> X <size of a single point>`.
```go
...
@@ -168,7 +168,7 @@ import (
)
```
This package include `G2Projective` and `G2Affine` points as well as a `G2Msm` method.
This package includes `G2Projective` and `G2Affine` points as well as a `G2Msm` method.
```go
package main

View File

@@ -171,7 +171,7 @@ Polynomial& add_monomial_inplace(Coeff monomial_coeff, uint64_t monomial = 0);
Polynomial& sub_monomial_inplace(Coeff monomial_coeff, uint64_t monomial = 0);
```
The ability to add or subtract monomials directly and in-place is an efficient way to manipualte polynomials.
The ability to add or subtract monomials directly and in-place is an efficient way to manipulate polynomials.
Example:

View File

@@ -14,9 +14,62 @@ At its core, Keccak consists of a permutation function operating on a state arra
## Using Keccak
ICICLE Keccak supports batch hashing, which can be utilized for constructing a merkle tree.
ICICLE Keccak supports batch hashing, which can be utilized for constructing a merkle tree or running multiple hashes in parallel.
### Supported Bindings
- [Golang](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang/hash/keccak)
- [Rust](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/rust/icicle-hash)
- [Rust](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/rust/icicle-hash)
### Example usage
This is an example of running 1024 Keccak-256 hashes in parallel, where input strings are of size 136 bytes:
```rust
use icicle_core::hash::HashConfig;
use icicle_cuda_runtime::memory::HostSlice;
use icicle_hash::keccak::keccak256;
let config = HashConfig::default();
let input_block_len = 136;
let number_of_hashes = 1024;
let preimages = vec![1u8; number_of_hashes * input_block_len];
let mut digests = vec![0u8; number_of_hashes * 64];
let preimages_slice = HostSlice::from_slice(&preimages);
let digests_slice = HostSlice::from_mut_slice(&mut digests);
keccak256(
preimages_slice,
input_block_len as u32,
number_of_hashes as u32,
digests_slice,
&config,
)
.unwrap();
```
### Merkle Tree
You can build a keccak merkle tree using the corresponding functions:
```rust
use icicle_core::tree::{merkle_tree_digests_len, TreeBuilderConfig};
use icicle_cuda_runtime::memory::HostSlice;
use icicle_hash::keccak::build_keccak256_merkle_tree;
let mut config = TreeBuilderConfig::default();
config.arity = 2;
let height = 22;
let input_block_len = 136;
let leaves = vec![1u8; (1 << height) * input_block_len];
let mut digests = vec![0u64; merkle_tree_digests_len((height + 1) as u32, 2, 1)];
let leaves_slice = HostSlice::from_slice(&leaves);
let digests_slice = HostSlice::from_mut_slice(&mut digests);
build_keccak256_merkle_tree(leaves_slice, digests_slice, height, input_block_len, &config).unwrap();
```
In the example above, a binary tree of height 22 is built. Each leaf is a 136-byte array. The leaves and digests are laid out in flat arrays. You can also use Keccak-512 via the `build_keccak512_merkle_tree` function.
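The size of the flat digest array can be reasoned about with a little arithmetic. The following is a hedged sketch of that bookkeeping for a full tree that keeps every row — a hypothetical helper, not the library's `merkle_tree_digests_len` implementation (which may differ, e.g. when `keep_rows` trims rows):

```rust
// Hypothetical helper mirroring the digest-count bookkeeping described above:
// row r of a full tree holds arity^r digests, each of `digest_elements` elements.
fn full_tree_digests_len(rows: u32, arity: u64, digest_elements: u64) -> u64 {
    (0..rows).map(|r| arity.pow(r) * digest_elements).sum()
}

fn main() {
    // height 22, arity 2 => 23 rows (leaf digests through the root),
    // i.e. 2^23 - 1 digests in total
    assert_eq!(full_tree_digests_len(23, 2, 1), (1u64 << 23) - 1);
}
```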

View File

@@ -1,6 +1,6 @@
# MSM - Multi scalar multiplication
MSM stands for Multi scalar multiplication, its defined as:
MSM stands for Multi scalar multiplication, it's defined as:
$$MSM(a, G) = \sum_{j=0}^{n} a_j \cdot G_j$$
@@ -43,7 +43,7 @@ $a_0, \ldots, a_n$ - Scalars
$MSM(a, G) \in G$ - a single EC (elliptic curve) point
In words, MSM is the sum of scalar and EC point multiplications. We can see from this definition that the core operations occurring are Modular Multiplication and Elliptic curve point addition. Its obvious that multiplication can be computed in parallel and then the products summed, making MSM inherently parallelizable.
In words, MSM is the sum of scalar and EC point multiplications. We can see from this definition that the core operations occurring are Modular Multiplication and Elliptic curve point addition. It's obvious that multiplication can be computed in parallel and then the products summed, making MSM inherently parallelizable.
Accelerating MSM is crucial to a ZK protocol's performance due to the [large percent of run time](https://hackmd.io/@0xMonia/SkQ6-oRz3#Hardware-acceleration-in-action) they take when generating proofs.
@@ -131,7 +131,7 @@ Large buckets exist in two cases:
2. When `c` does not divide the scalar bit-size.
`large_bucket_factor` that is equal to 10 yields good results for most cases, but it's best to fine tune this parameter per `c` and per scalar distribution.
The two most important parameters for performance are `c` and the `precompute_factor`. They affect the number of EC additions as well as the memory size. When the points are not known in advance we cannot use precomputation. In this case the best `c` value is usually around $log_2(msmSize) - 4$. However, in most protocols the points are known in advanced and precomputation can be used unless limited by memory. Usually it's best to use maximum precomputation (such that we end up with only a single bucket module) combined we a `c` value around $log_2(msmSize) - 1$.
The two most important parameters for performance are `c` and the `precompute_factor`. They affect the number of EC additions as well as the memory size. When the points are not known in advance we cannot use precomputation. In this case the best `c` value is usually around $log_2(msmSize) - 4$. However, in most protocols the points are known in advance and precomputation can be used unless limited by memory. Usually it's best to use maximum precomputation (such that we end up with only a single bucket module) combined with a `c` value around $log_2(msmSize) - 1$.
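The `c` heuristics above can be captured in a small sketch. `recommended_c` is a hypothetical helper written for illustration, not part of the ICICLE API:

```rust
// Hypothetical helper (not an ICICLE API) capturing the heuristic above;
// assumes msm_size > 0.
fn recommended_c(msm_size: u64, max_precompute: bool) -> u32 {
    let log2 = 63 - msm_size.leading_zeros(); // floor(log2(msm_size))
    // points known in advance + maximum precomputation: c ~ log2(size) - 1
    // no precomputation:                                c ~ log2(size) - 4
    let delta = if max_precompute { 1 } else { 4 };
    log2.saturating_sub(delta).max(1)
}

fn main() {
    // a 2^20-point MSM
    assert_eq!(recommended_c(1 << 20, false), 16);
    assert_eq!(recommended_c(1 << 20, true), 19);
}
```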
## Memory usage estimation

View File

@@ -56,7 +56,7 @@ Choosing an algorithm is heavily dependent on your use case. For example Cooley-
NTT also supports two different modes `Batch NTT` and `Single NTT`
Deciding weather to use `batch NTT` vs `single NTT` is highly dependent on your application and use case.
Deciding whether to use `batch NTT` vs `single NTT` is highly dependent on your application and use case.
#### Single NTT

View File

@@ -1,6 +1,6 @@
# Poseidon
[Poseidon](https://eprint.iacr.org/2019/458.pdf) is a popular hash in the ZK ecosystem primarily because its optimized to work over large prime fields, a common setting for ZK proofs, thereby minimizing the number of multiplicative operations required.
[Poseidon](https://eprint.iacr.org/2019/458.pdf) is a popular hash in the ZK ecosystem primarily because it's optimized to work over large prime fields, a common setting for ZK proofs, thereby minimizing the number of multiplicative operations required.
Poseidon has also been specifically designed to be efficient when implemented within ZK circuits, Poseidon uses far less constraints compared to other hash functions like Keccak or SHA-256 in the context of ZK circuits.
@@ -42,7 +42,7 @@ To generate a secure hash output, the algorithm goes through a series of "full r
**Linear Transformation and Round Constants:** A linear transformation is performed and round constants are added. The linear transformation in partial rounds can be designed to be less computationally intensive (this is done by using a sparse matrix) than in full rounds, further optimizing the function's efficiency.
The user of Poseidon can often choose how many partial or full rounds he wishes to apply; more full rounds will increase security but degrade performance. The choice and balance is highly dependent on the use case.
The user of Poseidon can often choose how many partial or full rounds he wishes to apply; more full rounds will increase security but degrade performance. The choice and balance are highly dependent on the use case.
## Using Poseidon
@@ -53,13 +53,14 @@ So for Poseidon of arity 2 and input of size 1024 * 2, we would expect 1024 elem
### Supported Bindings
[`Go`](https://github.com/ingonyama-zk/icicle/blob/main/wrappers/golang/curves/bn254/poseidon/poseidon.go)
[`Rust`](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/rust/icicle-core/src/poseidon)
### Constants
Poseidon is extremely customizable and using different constants will produce different hashes, security levels and performance results.
We support pre-calculated and optimized constants for each of the [supported curves](../core#supported-curves-and-operations).The constants can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/icicle/include/poseidon/constants) and are labeled clearly per curve `<curve_name>_poseidon.h`.
We support pre-calculated and optimized constants for each of the [supported curves](../core#supported-curves-and-operations). The constants can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/icicle/include/poseidon/constants) and are labeled clearly per curve `<curve_name>_poseidon.h`.
If you wish to generate your own constants you can use our python script which can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/icicle/include/poseidon/constants/generate_parameters.py).
@@ -91,8 +92,6 @@ primitive_element = 7 # bls12-381
# primitive_element = 15 # bw6-761
```
We only support `alpha = 5` so if you want to use another alpha for S-box please reach out on discord or open a github issue.
### Rust API
This is the most basic way to use the Poseidon API.
@@ -101,71 +100,58 @@ This is the most basic way to use the Poseidon API.
let test_size = 1 << 10;
let arity = 2u32;
let ctx = get_default_device_context();
let constants = load_optimized_poseidon_constants::<F>(arity, &ctx).unwrap();
let config = PoseidonConfig::default();
let poseidon = Poseidon::load(arity, &ctx).unwrap();
let config = HashConfig::default();
let inputs = vec![F::one(); test_size * arity as usize];
let outputs = vec![F::zero(); test_size];
let mut input_slice = HostOrDeviceSlice::on_host(inputs);
let mut output_slice = HostOrDeviceSlice::on_host(outputs);
poseidon_hash_many::<F>(
poseidon.hash_many::<F>(
&mut input_slice,
&mut output_slice,
test_size as u32,
arity as u32,
&constants,
1, // Output length
&config,
)
.unwrap();
```
The `PoseidonConfig::default()` can be modified, by default the inputs and outputs are set to be on `Host` for example.
The `HashConfig` can be modified, by default the inputs and outputs are set to be on `Host` for example.
```rust
impl<'a> Default for PoseidonConfig<'a> {
impl<'a> Default for HashConfig<'a> {
fn default() -> Self {
let ctx = get_default_device_context();
Self {
ctx,
are_inputs_on_device: false,
are_outputs_on_device: false,
input_is_a_state: false,
aligned: false,
loop_state: false,
is_async: false,
}
}
}
```
In the example above `load_optimized_poseidon_constants::<F>(arity, &ctx).unwrap();` is used which will load the correct constants based on arity and curve. Its possible to [generate](#constants) your own constants and load them.
In the example above `Poseidon::load(arity, &ctx).unwrap();` is used which will load the correct constants based on arity and curve. It's possible to [generate](#constants) your own constants and load them.
```rust
let ctx = get_default_device_context();
let cargo_manifest_dir = env!("CARGO_MANIFEST_DIR");
let constants_file = PathBuf::from(cargo_manifest_dir)
.join("tests")
.join(format!("{}_constants.bin", field_prefix));
let mut constants_buf = vec![];
File::open(constants_file)
.unwrap()
.read_to_end(&mut constants_buf)
.unwrap();
let mut custom_constants = vec![];
for chunk in constants_buf.chunks(field_bytes) {
custom_constants.push(F::from_bytes_le(chunk));
}
let custom_constants = create_optimized_poseidon_constants::<F>(
arity as u32,
&ctx,
full_rounds_half,
partial_rounds,
&mut custom_constants,
)
.unwrap();
let custom_poseidon = Poseidon::new(
arity, // The arity of poseidon hash. The width will be equal to arity + 1
alpha, // The S-box power
full_rounds_half,
partial_rounds,
round_constants,
mds_matrix,
non_sparse_matrix,
sparse_matrices,
domain_tag,
ctx,
)
.unwrap();
```
## The Tree Builder
@@ -175,21 +161,34 @@ The tree builder allows you to build Merkle trees using Poseidon.
You can define both the tree's `height` and its `arity`. The tree `height` determines the number of layers in the tree, including the root and the leaf layer. The `arity` determines how many children each internal node can have.
```rust
let height = 20;
let arity = 2;
let leaves = vec![F::one(); 1 << (height - 1)];
let mut digests = vec![F::zero(); merkle_tree_digests_len(height, arity)];
let mut leaves_slice = HostOrDeviceSlice::on_host(leaves);
let ctx = get_default_device_context();
let constants = load_optimized_poseidon_constants::<F>(arity, &ctx).unwrap()
use icicle_bn254::tree::Bn254TreeBuilder;
use icicle_bn254::poseidon::Poseidon;
let mut config = TreeBuilderConfig::default();
config.keep_rows = 1;
build_poseidon_merkle_tree::<F>(&mut leaves_slice, &mut digests, height, arity, &constants, &config).unwrap();
let arity = 2;
config.arity = arity as u32;
let input_block_len = arity;
let leaves = vec![F::one(); (1 << height) * arity];
let mut digests = vec![F::zero(); merkle_tree_digests_len((height + 1) as u32, arity as u32, 1)];
println!("Root: {:?}", digests[0..1][0]);
let leaves_slice = HostSlice::from_slice(&leaves);
let digests_slice = HostSlice::from_mut_slice(&mut digests);
let ctx = device_context::DeviceContext::default();
let hash = Poseidon::load(2, &ctx).unwrap();
let mut config = TreeBuilderConfig::default();
config.keep_rows = 5;
Bn254TreeBuilder::build_merkle_tree(
leaves_slice,
digests_slice,
height,
input_block_len,
&hash,
&hash,
&config,
)
.unwrap();
```
Similar to Poseidon, you can also configure the Tree Builder via `TreeBuilderConfig::default()`.

View File

@@ -0,0 +1,88 @@
# Poseidon2
[Poseidon2](https://eprint.iacr.org/2023/323) is a recently released optimized version of Poseidon1. The two versions differ in two crucial points. First, Poseidon is a sponge hash function, while Poseidon2 can be either a sponge or a compression function depending on the use case. Secondly, Poseidon2 is instantiated by new and more efficient linear layers with respect to Poseidon. These changes decrease the number of multiplications in the linear layer by up to 90% and the number of constraints in Plonk circuits by up to 70%. This makes Poseidon2 currently the fastest arithmetization-oriented hash function without lookups.
## Using Poseidon2
ICICLE Poseidon2 is implemented for GPU and parallelization is performed for each state.
We calculate multiple hash-sums over multiple pre-images in parallel, rather than going block by block over the input vector.
For example, for Poseidon2 with width 16, rate 8, and 8 output elements, an input of size 1024 * 8 yields 1024 * 8 output elements, which means each input block is of size 8, resulting in 1024 Poseidon2 hashes being performed.
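The block/output bookkeeping described above can be sketched as plain arithmetic. `poseidon2_io_sizes` is an illustrative function written for this text, not an ICICLE call:

```rust
// Each pre-image occupies `rate` input elements; each hash emits `output_len`
// elements. These are the numbers from the text, not a library function.
fn poseidon2_io_sizes(input_len: usize, rate: usize, output_len: usize) -> (usize, usize) {
    let num_hashes = input_len / rate;
    (num_hashes, num_hashes * output_len)
}

fn main() {
    // width 16, rate 8, 8 output elements, input of 1024 * 8 field elements
    let (hashes, out) = poseidon2_io_sizes(1024 * 8, 8, 8);
    assert_eq!(hashes, 1024);
    assert_eq!(out, 1024 * 8);
}
```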
### Supported Bindings
[`Rust`](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/rust/icicle-core/src/poseidon2)
### Constants
Poseidon2 is also extremely customizable and using different constants will produce different hashes, security levels and performance results.
We support pre-calculated constants for each of the [supported curves](../core#supported-curves-and-operations). The constants can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/icicle/include/poseidon2/constants) and are labeled clearly per curve `<curve_name>_poseidon2.h`.
You can also use your own set of constants as shown [here](https://github.com/ingonyama-zk/icicle/blob/main/wrappers/rust/icicle-fields/icicle-babybear/src/poseidon2/mod.rs#L290)
### Rust API
This is the most basic way to use the Poseidon2 API.
```rust
let test_size = 1 << 10;
let width = 16;
let rate = 8;
let ctx = get_default_device_context();
let poseidon = Poseidon2::load(width, rate, MdsType::Default, DiffusionStrategy::Default, &ctx).unwrap();
let config = HashConfig::default();
let inputs = vec![F::one(); test_size * rate as usize];
let outputs = vec![F::zero(); test_size];
let mut input_slice = HostOrDeviceSlice::on_host(inputs);
let mut output_slice = HostOrDeviceSlice::on_host(outputs);
poseidon.hash_many::<F>(
&mut input_slice,
&mut output_slice,
test_size as u32,
rate as u32,
8, // Output length
&config,
)
.unwrap();
```
In the example above `Poseidon2::load(width, rate, MdsType::Default, DiffusionStrategy::Default, &ctx).unwrap();` is used to load the correct constants based on width and curve. Here, the default MDS matrices and diffusion are used. If you want to get a Plonky3 compliant version, set them to `MdsType::Plonky` and `DiffusionStrategy::Montgomery` respectively.
## The Tree Builder
Similar to Poseidon1, you can use Poseidon2 in a tree builder.
```rust
use icicle_bn254::tree::Bn254TreeBuilder;
use icicle_bn254::poseidon2::Poseidon2;
let mut config = TreeBuilderConfig::default();
let arity = 2;
config.arity = arity as u32;
let input_block_len = arity;
let leaves = vec![F::one(); (1 << height) * arity];
let mut digests = vec![F::zero(); merkle_tree_digests_len((height + 1) as u32, arity as u32, 1)];
let leaves_slice = HostSlice::from_slice(&leaves);
let digests_slice = HostSlice::from_mut_slice(&mut digests);
let ctx = device_context::DeviceContext::default();
let hash = Poseidon2::load(arity, arity, MdsType::Default, DiffusionStrategy::Default, &ctx).unwrap();
let mut config = TreeBuilderConfig::default();
config.keep_rows = 5;
Bn254TreeBuilder::build_merkle_tree(
leaves_slice,
digests_slice,
height,
input_block_len,
&hash,
&hash,
&config,
)
.unwrap();
```

docs/package-lock.json generated
View File

@@ -3680,6 +3680,8 @@
"version": "8.12.0",
"resolved": "https://registry.npmjs.org/ajv/-/ajv-8.12.0.tgz",
"integrity": "sha512-sRu1kpcO9yLtYxBKvqfTeh9KzZEwO3STyX1HT+4CaDzC6HpTGYhIhPIzj9XuKU7KYDwnaeh5hcOwjy1QuJzBPA==",
"optional": true,
"peer": true,
"dependencies": {
"fast-deep-equal": "^3.1.1",
"json-schema-traverse": "^1.0.0",
@@ -3694,7 +3696,9 @@
"node_modules/ajv-formats/node_modules/json-schema-traverse": {
"version": "1.0.0",
"resolved": "https://registry.npmjs.org/json-schema-traverse/-/json-schema-traverse-1.0.0.tgz",
"integrity": "sha512-NM8/P9n3XjXhIZn1lLhkFaACTOURQXjWhV4BA/RnOv8xvgqtqpAX9IO4mRQxSx1Rlo4tqzeqb0sOlruaOy3dug=="
"integrity": "sha512-NM8/P9n3XjXhIZn1lLhkFaACTOURQXjWhV4BA/RnOv8xvgqtqpAX9IO4mRQxSx1Rlo4tqzeqb0sOlruaOy3dug==",
"optional": true,
"peer": true
},
"node_modules/ajv-keywords": {
"version": "3.5.2",
@@ -16340,14 +16344,13 @@
"version": "2.1.1",
"resolved": "https://registry.npmjs.org/ajv-formats/-/ajv-formats-2.1.1.tgz",
"integrity": "sha512-Wx0Kx52hxE7C18hkMEggYlEifqWZtYaRgouJor+WMdPnQyEK13vgEWyVNup7SoeeoLMsr4kf5h6dOW11I15MUA==",
"requires": {
"ajv": "^8.0.0"
},
"requires": {},
"dependencies": {
"ajv": {
"version": "8.12.0",
"resolved": "https://registry.npmjs.org/ajv/-/ajv-8.12.0.tgz",
"version": "https://registry.npmjs.org/ajv/-/ajv-8.12.0.tgz",
"integrity": "sha512-sRu1kpcO9yLtYxBKvqfTeh9KzZEwO3STyX1HT+4CaDzC6HpTGYhIhPIzj9XuKU7KYDwnaeh5hcOwjy1QuJzBPA==",
"optional": true,
"peer": true,
"requires": {
"fast-deep-equal": "^3.1.1",
"json-schema-traverse": "^1.0.0",
@@ -16358,7 +16361,9 @@
"json-schema-traverse": {
"version": "1.0.0",
"resolved": "https://registry.npmjs.org/json-schema-traverse/-/json-schema-traverse-1.0.0.tgz",
"integrity": "sha512-NM8/P9n3XjXhIZn1lLhkFaACTOURQXjWhV4BA/RnOv8xvgqtqpAX9IO4mRQxSx1Rlo4tqzeqb0sOlruaOy3dug=="
"integrity": "sha512-NM8/P9n3XjXhIZn1lLhkFaACTOURQXjWhV4BA/RnOv8xvgqtqpAX9IO4mRQxSx1Rlo4tqzeqb0sOlruaOy3dug==",
"optional": true,
"peer": true
}
}
},

View File

@@ -53,6 +53,11 @@ module.exports = {
label: "Poseidon Hash",
id: "icicle/primitives/poseidon",
},
{
type: "doc",
label: "Poseidon2 Hash",
id: "icicle/primitives/poseidon2",
},
],
},
{

View File

@@ -0,0 +1,24 @@
cmake_minimum_required(VERSION 3.18)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_STANDARD_REQUIRED TRUE)
set(CMAKE_CXX_STANDARD_REQUIRED TRUE)
if (${CMAKE_VERSION} VERSION_LESS "3.24.0")
set(CMAKE_CUDA_ARCHITECTURES ${CUDA_ARCH})
else()
set(CMAKE_CUDA_ARCHITECTURES native) # on 3.24+, on earlier it is ignored, and the target is not passed
endif ()
project(example LANGUAGES CUDA CXX)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --expt-relaxed-constexpr")
set(CMAKE_CUDA_FLAGS_RELEASE "")
set(CMAKE_CUDA_FLAGS_DEBUG "${CMAKE_CUDA_FLAGS_DEBUG} -g -G -O0")
add_executable(
example
example.cu
)
target_include_directories(example PRIVATE "../../../icicle/include")
target_link_libraries(example ${CMAKE_SOURCE_DIR}/build/icicle/lib/libingo_field_bn254.a)
find_library(NVML_LIBRARY nvidia-ml PATHS /usr/local/cuda/targets/x86_64-linux/lib/stubs/ )
target_link_libraries(example ${NVML_LIBRARY})
set_target_properties(example PROPERTIES CUDA_SEPARABLE_COMPILATION ON)

View File

@@ -0,0 +1,38 @@
# Icicle example: Montgomery vector operations (mul, add, sub) for all possible options

- is_a_on_device
- is_b_on_device
- is_result_on_device
- is_in_montgomery_form

(is_async isn't checked)
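For intuition on the `is_in_montgomery_form` option, here is a minimal, self-contained sketch of what Montgomery form means. This is textbook REDC over the 31-bit BabyBear prime (one of ICICLE's supported fields), chosen only so the arithmetic fits in native integers — it is illustrative, not ICICLE's bn254 implementation:

```rust
// Illustrative Montgomery arithmetic with R = 2^32 (not ICICLE code).
const P: u32 = 2013265921; // BabyBear prime

// n' = -p^{-1} mod 2^32, via Newton iteration (5 doublings: 2 -> 32 bits)
fn neg_inv(p: u32) -> u32 {
    let mut inv: u32 = 1;
    for _ in 0..5 {
        inv = inv.wrapping_mul(2u32.wrapping_sub(p.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

// REDC: given t < p * 2^32, return t * 2^{-32} mod p
fn redc(t: u64, n_prime: u32) -> u32 {
    let m = (t as u32).wrapping_mul(n_prime);
    let u = ((t as u128 + (m as u128) * (P as u128)) >> 32) as u32;
    if u >= P { u - P } else { u }
}

// Convert into Montgomery form: a * 2^32 mod p
fn to_mont(a: u32) -> u32 {
    (((a as u64) << 32) % (P as u64)) as u32
}

// Multiply two Montgomery-form values; the result stays in Montgomery form
fn mont_mul(a: u32, b: u32, n_prime: u32) -> u32 {
    redc((a as u64) * (b as u64), n_prime)
}

fn main() {
    let np = neg_inv(P);
    let (a, b) = (123456789u32 % P, 987654321u32 % P);
    let expected = ((a as u64 * b as u64) % P as u64) as u32;
    // multiply in Montgomery form, then convert back with one extra REDC
    let got = redc(mont_mul(to_mont(a), to_mont(b), np) as u64, np);
    assert_eq!(got, expected);
    println!("ok: {} * {} mod p = {}", a, b, got);
}
```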
## Key-Takeaway
`Icicle` accelerates the multiplication operation `*` using the [Karatsuba algorithm](https://en.wikipedia.org/wiki/Karatsuba_algorithm)
## Concise Usage Explanation
Define the field to be used, e.g.:
```c++
#include "api/bn254.h"
```
```c++
using namespace bn254;
typedef scalar_t T;
```
## Running the example
- `cd` to your example directory
- compile with `./compile.sh`
- run with `./run.sh`
## What's in the example
1. Define the parameters for the example such as vector size
2. Generate random vectors on-host
3. Copy them on-device
4. Execute element-wise vector multiplication on-device
5. Copy results on-host

View File

@@ -0,0 +1,15 @@
#!/bin/bash
# Exit immediately on error
set -e
mkdir -p build/example
mkdir -p build/icicle
# Configure and build Icicle
cmake -S ../../../icicle/ -B build/icicle -DMSM=OFF -DCMAKE_BUILD_TYPE=Debug -DCURVE=bn254
cmake --build build/icicle -j
# Configure and build the example application
cmake -DCMAKE_BUILD_TYPE=Debug -S. -B build/example
cmake --build build/example

View File

@@ -0,0 +1,15 @@
#!/bin/bash
# Exit immediately on error
set -e
mkdir -p build/example
mkdir -p build/icicle
# Configure and build Icicle
cmake -S ../../../icicle/ -B build/icicle -DMSM=OFF -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254
cmake --build build/icicle
# Configure and build the example application
cmake -S . -B build/example
cmake --build build/example

View File

@@ -0,0 +1,734 @@
#include <iostream>
#include <iomanip>
#include <chrono>
#include <nvml.h>
#include "api/bn254.h"
#include "vec_ops/vec_ops.cuh"
#include <vec_ops/../../include/utils/mont.cuh>
using namespace vec_ops;
using namespace bn254;
typedef scalar_t T;
enum Op { MUL, ADD, SUB, LAST };
// bn254 p = 21888242871839275222246405745257275088548364400416034343698204186575808495617
int vector_op(
T* vec_a,
T* vec_b,
T* vec_result,
size_t n_elements,
device_context::DeviceContext ctx,
vec_ops::VecOpsConfig config,
Op op)
{
cudaError_t err;
switch (op) {
case MUL:
err = bn254_mul_cuda(vec_a, vec_b, n_elements, config, vec_result);
break;
case ADD:
err = bn254_add_cuda(vec_a, vec_b, n_elements, config, vec_result);
break;
case SUB:
err = bn254_sub_cuda(vec_a, vec_b, n_elements, config, vec_result);
break;
}
if (err != cudaSuccess) {
std::cerr << "Failed to execute vector operation - " << cudaGetErrorString(err) << std::endl;
return 0;
}
return 0;
}
int vector_mul(
T* vec_a, T* vec_b, T* vec_result, size_t n_elements, device_context::DeviceContext ctx, vec_ops::VecOpsConfig config)
{
cudaError_t err = bn254_mul_cuda(vec_a, vec_b, n_elements, config, vec_result);
if (err != cudaSuccess) {
std::cerr << "Failed to multiply vectors - " << cudaGetErrorString(err) << std::endl;
return 0;
}
return 0;
}
int vector_add(
T* vec_a, T* vec_b, T* vec_result, size_t n_elements, device_context::DeviceContext ctx, vec_ops::VecOpsConfig config)
{
cudaError_t err = bn254_add_cuda(vec_a, vec_b, n_elements, config, vec_result);
if (err != cudaSuccess) {
std::cerr << "Failed to add vectors - " << cudaGetErrorString(err) << std::endl;
return 0;
}
return 0;
}
int vector_sub(
T* vec_a, T* vec_b, T* vec_result, size_t n_elements, device_context::DeviceContext ctx, vec_ops::VecOpsConfig config)
{
cudaError_t err = bn254_sub_cuda(vec_a, vec_b, n_elements, config, vec_result);
if (err != cudaSuccess) {
std::cerr << "Failed to subtract vectors - " << cudaGetErrorString(err) << std::endl;
return 0;
}
return 0;
}
int main(int argc, char** argv)
{
const unsigned vector_size = 1 << 0;
const unsigned repetitions = 1 << 0;
cudaError_t err;
nvmlInit();
nvmlDevice_t device;
nvmlDeviceGetHandleByIndex(0, &device); // for GPU 0
std::cout << "Icicle-Examples: vector mul / add / sub operations." << std::endl;
char name[NVML_DEVICE_NAME_BUFFER_SIZE];
if (nvmlDeviceGetName(device, name, NVML_DEVICE_NAME_BUFFER_SIZE) == NVML_SUCCESS) {
std::cout << "GPU Model: " << name << std::endl;
} else {
std::cerr << "Failed to get GPU model name." << std::endl;
}
unsigned power_limit;
nvmlDeviceGetPowerManagementLimit(device, &power_limit);
std::cout << "Vector size: " << vector_size << std::endl;
std::cout << "Repetitions: " << repetitions << std::endl;
std::cout << "Power limit: " << std::fixed << std::setprecision(3) << 1.0e-3 * power_limit << " W" << std::endl;
unsigned int baseline_power;
nvmlDeviceGetPowerUsage(device, &baseline_power);
std::cout << "Baseline power: " << std::fixed << std::setprecision(3) << 1.0e-3 * baseline_power << " W" << std::endl;
unsigned baseline_temperature;
if (nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU, &baseline_temperature) == NVML_SUCCESS) {
std::cout << "Baseline GPU Temperature: " << baseline_temperature << " C" << std::endl;
} else {
std::cerr << "Failed to get GPU temperature." << std::endl;
}
// host data
std::cout << "Allocate memory for the input vectors (both normal and Montgomery presentation)" << std::endl;
T* host_in1_init = (T*)malloc(vector_size * sizeof(T));
T* host_in2_init = (T*)malloc(vector_size * sizeof(T));
std::cout << "Initializing vectors with normal presentation random data" << std::endl;
T::rand_host_many(host_in1_init, vector_size);
T::rand_host_many(host_in2_init, vector_size);
std::cout << "Allocate memory for the output vectors" << std::endl;
T* host_out = (T*)malloc(vector_size * sizeof(T)); // This memory will be used for the test output.
T* host_out_ref_mul = (T*)malloc(
vector_size *
sizeof(T)); // This memory will be used as a reference result for mul (will be compared to host_out content).
T* host_out_ref_add = (T*)malloc(
vector_size *
sizeof(T)); // This memory will be used as a reference result for add (will be compared to host_out content).
T* host_out_ref_sub = (T*)malloc(
vector_size *
sizeof(T)); // This memory will be used as a reference result for sub (will be compared to host_out content).
std::cout << "Initializing output vectors with random data" << std::endl;
T::rand_host_many(host_out, vector_size);
T::rand_host_many(host_out_ref_mul, vector_size);
T::rand_host_many(host_out_ref_add, vector_size);
T::rand_host_many(host_out_ref_sub, vector_size);
// device data
device_context::DeviceContext ctx = device_context::get_default_device_context();
T* device_in1;
T* device_in2;
T* device_out;
err = cudaMalloc((void**)&device_in1, vector_size * sizeof(T));
if (err != cudaSuccess) {
std::cerr << "Failed to allocate device memory - " << cudaGetErrorString(err) << std::endl;
return 0;
}
err = cudaMalloc((void**)&device_in2, vector_size * sizeof(T));
if (err != cudaSuccess) {
std::cerr << "Failed to allocate device memory - " << cudaGetErrorString(err) << std::endl;
return 0;
}
err = cudaMalloc((void**)&device_out, vector_size * sizeof(T));
if (err != cudaSuccess) {
std::cerr << "Failed to allocate device memory - " << cudaGetErrorString(err) << std::endl;
return 0;
}
vec_ops::VecOpsConfig config = vec_ops::DefaultVecOpsConfig();
//****************************************
// Test warm-up and reference output config. Reference output to be used to check if test passed or not.
//****************************************
// copy from host to device
err = cudaMemcpy(device_in1, host_in1_init, vector_size * sizeof(T), cudaMemcpyHostToDevice);
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host to device - " << cudaGetErrorString(err) << std::endl;
return 0;
}
err = cudaMemcpy(device_in2, host_in2_init, vector_size * sizeof(T), cudaMemcpyHostToDevice);
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host to device - " << cudaGetErrorString(err) << std::endl;
return 0;
}
std::cout << "Starting warm-up run" << std::endl;
// Warm-up loop
for (int op = MUL; op != LAST; op++) {
for (int i = 0; i < repetitions; i++) {
// vector_mul(device_in1, device_in2, device_out, vector_size, ctx, config);
vector_op(device_in1, device_in2, device_out, vector_size, ctx, config, (Op)op);
}
switch (op) {
case MUL:
err = cudaMemcpy(host_out_ref_mul, device_out, vector_size * sizeof(T), cudaMemcpyDeviceToHost);
break;
case ADD:
err = cudaMemcpy(host_out_ref_add, device_out, vector_size * sizeof(T), cudaMemcpyDeviceToHost);
break;
case SUB:
err = cudaMemcpy(host_out_ref_sub, device_out, vector_size * sizeof(T), cudaMemcpyDeviceToHost);
break;
}
}
// Check that the last device-to-host copy of the reference results succeeded.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_out to host - " << cudaGetErrorString(err) << std::endl;
return 0;
}
//****************************************
// End of test warm-up and reference output config.
//****************************************
std::cout << "Starting benchmarking" << std::endl;
unsigned power_before;
nvmlDeviceGetPowerUsage(device, &power_before);
std::cout << "Power before: " << std::fixed << std::setprecision(3) << 1.0e-3 * power_before << " W" << std::endl;
std::cout << "Power utilization: " << std::fixed << std::setprecision(1) << (float)100.0 * power_before / power_limit
<< " %" << std::endl;
unsigned temperature_before;
if (nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU, &temperature_before) == NVML_SUCCESS) {
std::cout << "GPU Temperature before: " << temperature_before << " C" << std::endl;
} else {
std::cerr << "Failed to get GPU temperature." << std::endl;
}
//*******************************************************
// Benchmark test:
// Loop for (mul, add, sub):
// Loop (is_a_on_device, is_b_on_device, is_result_on_device, is_in_montgomery_form):
//*******************************************************
T* host_in1 =
(T*)malloc(vector_size * sizeof(T)); // This buffer is used to load the data from host_in1_init for the benchmark.
T* host_in2 =
(T*)malloc(vector_size * sizeof(T)); // This buffer is used to load the data from host_in2_init for the benchmark.
// Test when the result is not in-place
for (int op = MUL; op != LAST; op++) {
for (int config_idx = 0; config_idx < 16; config_idx++) {
std::cout << "Start benchmark loop for config_idx " << config_idx << std::endl;
for (int i = 0; i < vector_size; i++) {
host_in1[i] = host_in1_init[i];
host_in2[i] = host_in2_init[i];
}
config.is_a_on_device = (config_idx >> 3) & 0x1;
config.is_b_on_device = (config_idx >> 2) & 0x1;
config.is_result_on_device = (config_idx >> 1) & 0x1;
config.is_in_montgomery_form = (config_idx >> 0) & 0x1;
// Copy from host to device (copy again so the data can be reused later in the loop;
// device_inX was already overwritten by the warm-up).
if (config.is_a_on_device) {
if (config.is_in_montgomery_form) {
err =
cudaMemcpy(device_in1, host_in1, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in1 to device_in1 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(
mont::to_montgomery(device_in1, vector_size, config.ctx.stream, device_in1)); // Convert in-place.
} else { // Normal representation.
err =
cudaMemcpy(device_in1, host_in1, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in1 to device_in1 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
} else {
if (config.is_in_montgomery_form) { // Copy to device, convert to Montgomery and copy back to host.
err =
cudaMemcpy(device_in1, host_in1, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in1 to device_in1 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(mont::to_montgomery(device_in1, vector_size, config.ctx.stream, device_in1));
err = cudaMemcpy(host_in1, device_in1, vector_size * sizeof(T), cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_in1 to host_in1 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
}
if (config.is_b_on_device) {
if (config.is_in_montgomery_form) {
err =
cudaMemcpy(device_in2, host_in2, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in2 to device_in2 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(
mont::to_montgomery(device_in2, vector_size, config.ctx.stream, device_in2)); // Convert in-place.
} else {
// Normal representation.
err =
cudaMemcpy(device_in2, host_in2, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in2 to device_in2 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
} else {
if (config.is_in_montgomery_form) { // Copy to device, convert to Montgomery and copy back to host.
err =
cudaMemcpy(device_in2, host_in2, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in2 to device_in2 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(mont::to_montgomery(device_in2, vector_size, config.ctx.stream, device_in2));
err = cudaMemcpy(host_in2, device_in2, vector_size * sizeof(T), cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_in2 to host_in2 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
}
CHK_IF_RETURN(cudaPeekAtLastError());
auto start_time = std::chrono::high_resolution_clock::now();
// Benchmark loop
for (int i = 0; i < repetitions; i++) {
switch (config_idx >> 1) { // {is_a_on_device, is_b_on_device, is_result_on_device}
case 0b000:
vector_op(host_in1, host_in2, host_out, vector_size, ctx, config, (Op)op);
break;
case 0b001:
vector_op(host_in1, host_in2, device_out, vector_size, ctx, config, (Op)op);
break;
case 0b010:
vector_op(host_in1, device_in2, host_out, vector_size, ctx, config, (Op)op);
break;
case 0b011:
vector_op(host_in1, device_in2, device_out, vector_size, ctx, config, (Op)op);
break;
case 0b100:
vector_op(device_in1, host_in2, host_out, vector_size, ctx, config, (Op)op);
break;
case 0b101:
vector_op(device_in1, host_in2, device_out, vector_size, ctx, config, (Op)op);
break;
case 0b110:
vector_op(device_in1, device_in2, host_out, vector_size, ctx, config, (Op)op);
break;
case 0b111:
vector_op(device_in1, device_in2, device_out, vector_size, ctx, config, (Op)op);
break;
}
CHK_IF_RETURN(cudaPeekAtLastError());
}
auto end_time = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);
switch (op) {
case MUL:
std::cout << "Elapsed time: " << duration.count() << " microseconds, operation MUL for config_idx "
<< config_idx << " and result not in-place" << std::endl;
break;
case ADD:
std::cout << "Elapsed time: " << duration.count() << " microseconds, operation ADD for config_idx "
<< config_idx << " and result not in-place" << std::endl;
break;
case SUB:
std::cout << "Elapsed time: " << duration.count() << " microseconds, operation SUB for config_idx "
<< config_idx << " and result not in-place" << std::endl;
break;
}
if (config.is_result_on_device) { // Copy the data to host_out in order to compare it vs. host_out_ref_mul value.
if (config.is_in_montgomery_form) { // Convert to normal from montgomery if needed.
CHK_IF_RETURN(mont::from_montgomery(
device_out, vector_size, config.ctx.stream,
device_out)); // Convert to normal in order to check vs. host_out_ref_mul.
}
err = cudaMemcpy(
host_out, device_out, vector_size * sizeof(T),
cudaMemcpyDeviceToHost); // Copy to host_out in order to check vs. host_out_ref_mul.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_out to host - " << cudaGetErrorString(err) << std::endl;
return 0;
}
} else { // Data is not on device but it is in host_out.
if (config.is_in_montgomery_form) { // host_out should be written to device, converted from Montgomery form and
// written back to host. Then compared vs. host_out_ref_mul.
err = cudaMemcpy(
device_out, host_out, vector_size * sizeof(T),
cudaMemcpyHostToDevice); // Copy to host_out in order to check vs. host_out_ref_mul.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_out to device_out - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(mont::from_montgomery(
device_out, vector_size, config.ctx.stream,
device_out)); // Convert to normal in order to check vs. host_out_ref_mul.
err = cudaMemcpy(
host_out, device_out, vector_size * sizeof(T),
cudaMemcpyDeviceToHost); // Copy to host_out in order to check vs. host_out_ref_mul.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_out to host_out - " << cudaGetErrorString(err) << std::endl;
return 0;
}
} else { // host_out could be compared vs. host_out_ref_mul as is.
}
}
//****************************************
// End of benchmark test.
//****************************************
//***********************************************
// Test result check
// The check compares the benchmark result with a reference computed in
// normal representation (stored in host_out_ref_mul/add/sub).
//***********************************************
int test_failed = 0;
switch (op) {
case MUL:
for (int i = 0; i < vector_size; i++) {
if (host_out_ref_mul[i] != host_out[i]) {
std::cout << "===>>> ERROR!!! MUL: Test failed for vector index " << i
<< ", config is printed below:" << std::endl;
test_failed = 1;
}
}
break;
case ADD:
for (int i = 0; i < vector_size; i++) {
if (host_out_ref_add[i] != host_out[i]) {
std::cout << "===>>> ERROR!!! ADD: Test failed for vector index " << i
<< ", config is printed below:" << std::endl;
test_failed = 1;
}
}
break;
case SUB:
for (int i = 0; i < vector_size; i++) {
if (host_out_ref_sub[i] != host_out[i]) {
std::cout << "===>>> ERROR!!! SUB: Test failed for vector index " << i
<< ", config is printed below:" << std::endl;
test_failed = 1;
}
}
break;
}
if (test_failed) {
std::cout << "===>>> result is not in-place: " << std::endl;
std::cout << "===>>> is_a_on_device: " << config.is_a_on_device << std::endl;
std::cout << "===>>> is_b_on_device: " << config.is_b_on_device << std::endl;
std::cout << "===>>> is_result_on_device: " << config.is_result_on_device << std::endl;
std::cout << "===>>> is_in_montgomery_form: " << config.is_in_montgomery_form << std::endl;
exit(2);
}
unsigned power_after;
nvmlDeviceGetPowerUsage(device, &power_after);
std::cout << "Power after: " << std::fixed << std::setprecision(3) << 1.0e-3 * power_after << " W" << std::endl;
std::cout << "Power utilization: " << std::fixed << std::setprecision(1)
<< (float)100.0 * power_after / power_limit << " %" << std::endl;
unsigned temperature_after;
if (nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU, &temperature_after) == NVML_SUCCESS) {
std::cout << "GPU Temperature after: " << temperature_after << " C" << std::endl;
} else {
std::cerr << "Failed to get GPU temperature." << std::endl;
}
// Report performance in GOPS: Giga Operations Per Second
double GOPS = 1.0e-9 * repetitions * vector_size / (1.0e-6 * duration.count());
std::cout << "Performance: " << GOPS << " Giga Operations Per Second" << std::endl;
}
}
// Test when the result is in-place
for (int op = MUL; op != LAST; op++) {
for (int config_idx = 0; config_idx < 16; config_idx++) {
for (int i = 0; i < vector_size; i++) {
host_in1[i] = host_in1_init[i];
host_in2[i] = host_in2_init[i];
}
config.is_a_on_device = (config_idx >> 3) & 0x1;
config.is_b_on_device = (config_idx >> 2) & 0x1;
config.is_result_on_device = (config_idx >> 1) & 0x1;
config.is_in_montgomery_form = (config_idx >> 0) & 0x1;
// In-place runs write the result into operand a, so skip configs where a and the result
// live in different memory spaces. Note the parentheses: `a ^ b == 1` parses as `a ^ (b == 1)`.
if ((config.is_a_on_device ^ config.is_result_on_device) == 1) { continue; }
// Copy from host to device (copy again so the data can be reused later in the loop;
// device_inX was already overwritten by the warm-up).
if (config.is_a_on_device) {
if (config.is_in_montgomery_form) {
err =
cudaMemcpy(device_in1, host_in1, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in1 to device_in1 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(
mont::to_montgomery(device_in1, vector_size, config.ctx.stream, device_in1)); // Convert in-place.
} else { // Normal representation.
err =
cudaMemcpy(device_in1, host_in1, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in1 to device_in1 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
} else {
if (config.is_in_montgomery_form) { // Copy to device, convert to Montgomery and copy back to host.
err =
cudaMemcpy(device_in1, host_in1, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in1 to device_in1 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(mont::to_montgomery(device_in1, vector_size, config.ctx.stream, device_in1));
err = cudaMemcpy(host_in1, device_in1, vector_size * sizeof(T), cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_in1 to host_in1 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
}
if (config.is_b_on_device) {
if (config.is_in_montgomery_form) {
err =
cudaMemcpy(device_in2, host_in2, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in2 to device_in2 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(
mont::to_montgomery(device_in2, vector_size, config.ctx.stream, device_in2)); // Convert in-place.
} else {
// Normal representation.
err =
cudaMemcpy(device_in2, host_in2, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in2 to device_in2 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
} else {
if (config.is_in_montgomery_form) { // Copy to device, convert to Montgomery and copy back to host.
err =
cudaMemcpy(device_in2, host_in2, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in2 to device_in2 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(mont::to_montgomery(device_in2, vector_size, config.ctx.stream, device_in2));
err = cudaMemcpy(host_in2, device_in2, vector_size * sizeof(T), cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_in2 to host_in2 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
}
CHK_IF_RETURN(cudaPeekAtLastError());
auto start_time = std::chrono::high_resolution_clock::now();
// Benchmark loop
for (int i = 0; i < repetitions; i++) {
switch (config_idx >> 1) { // {is_a_on_device, is_b_on_device, is_result_on_device}
case 0b000:
vector_op(host_in1, host_in2, host_in1, vector_size, ctx, config, (Op)op);
break;
case 0b001:
break;
case 0b010:
vector_op(host_in1, device_in2, host_in1, vector_size, ctx, config, (Op)op);
break;
case 0b011:
break;
case 0b100:
break;
case 0b101:
vector_op(device_in1, host_in2, device_in1, vector_size, ctx, config, (Op)op);
break;
case 0b110:
break;
case 0b111:
vector_op(device_in1, device_in2, device_in1, vector_size, ctx, config, (Op)op);
break;
}
CHK_IF_RETURN(cudaPeekAtLastError());
}
auto end_time = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);
switch (op) {
case MUL:
std::cout << "Elapsed time: " << duration.count() << " microseconds, operation MUL for config_idx "
<< config_idx << " and result in-place" << std::endl;
break;
case ADD:
std::cout << "Elapsed time: " << duration.count() << " microseconds, operation ADD for config_idx "
<< config_idx << " and result in-place" << std::endl;
break;
case SUB:
std::cout << "Elapsed time: " << duration.count() << " microseconds, operation SUB for config_idx "
<< config_idx << " and result in-place" << std::endl;
break;
}
if (config.is_result_on_device) { // Copy the data to host_out in order to compare it vs. host_out_ref_mul value.
if (config.is_in_montgomery_form) { // Convert to normal from montgomery if needed.
CHK_IF_RETURN(mont::from_montgomery(
device_in1, vector_size, config.ctx.stream,
device_in1)); // Convert to normal in order to check vs. host_out_ref_mul.
}
err = cudaMemcpy(
host_out, device_in1, vector_size * sizeof(T),
cudaMemcpyDeviceToHost); // Copy to host_out in order to check vs. host_out_ref_mul.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_in1 to host_out - " << cudaGetErrorString(err) << std::endl;
return 0;
}
} else { // Data is not on device but it is in host_in1. It should be moved to host_out for test pass/fail check.
if (config.is_in_montgomery_form) { // host_out should be written to device, converted from Montgomery form and
// written back to host. Then compared vs. host_out_ref_mul.
err = cudaMemcpy(
device_out, host_in1, vector_size * sizeof(T),
cudaMemcpyHostToDevice); // Copy to host_out in order to check vs. host_out_ref_mul.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in1 to device_out - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(mont::from_montgomery(
device_out, vector_size, config.ctx.stream,
device_out)); // Convert to normal in order to check vs. host_out_ref_mul.
err = cudaMemcpy(
host_out, device_out, vector_size * sizeof(T),
cudaMemcpyDeviceToHost); // Copy to host_out in order to check vs. host_out_ref_mul.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_out to host_out - " << cudaGetErrorString(err) << std::endl;
return 0;
}
} else { // host_out could be compared vs. host_out_ref_mul as is.
err = cudaMemcpy(
device_out, host_in1, vector_size * sizeof(T),
cudaMemcpyHostToDevice); // Copy to host_out in order to check vs. host_out_ref_mul.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in1 to device_out - " << cudaGetErrorString(err) << std::endl;
return 0;
}
err = cudaMemcpy(
host_out, device_out, vector_size * sizeof(T),
cudaMemcpyDeviceToHost); // Copy to host_out in order to check vs. host_out_ref_mul.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_out to host_out - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
}
//****************************************
// End of benchmark test.
//****************************************
//***********************************************
// Test result check
// The check compares the benchmark result with a reference computed in
// normal representation (stored in host_out_ref_mul/add/sub).
//***********************************************
int test_failed = 0;
switch (op) {
case MUL:
for (int i = 0; i < vector_size; i++) {
if (host_out_ref_mul[i] != host_out[i]) {
std::cout << "===>>> ERROR!!! MUL: Test failed for vector index " << i
<< ", config is printed below:" << std::endl;
std::cout << "host_out_ref_mul[0] = " << host_out_ref_mul[0] << std::endl;
test_failed = 1;
}
}
break;
case ADD:
for (int i = 0; i < vector_size; i++) {
if (host_out_ref_add[i] != host_out[i]) {
std::cout << "===>>> ERROR!!! ADD: Test failed for vector index " << i
<< ", config is printed below:" << std::endl;
std::cout << "host_out_ref_add[0] = " << host_out_ref_add[0] << std::endl;
test_failed = 1;
}
}
break;
case SUB:
for (int i = 0; i < vector_size; i++) {
if (host_out_ref_sub[i] != host_out[i]) {
std::cout << "===>>> ERROR!!! SUB: Test failed for vector index " << i
<< ", config is printed below:" << std::endl;
std::cout << "host_out_ref_sub[0] = " << host_out_ref_sub[0] << std::endl;
test_failed = 1;
}
}
break;
}
if (test_failed) {
std::cout << "===>>> result is in-place: " << std::endl;
std::cout << "===>>> is_a_on_device: " << config.is_a_on_device << std::endl;
std::cout << "===>>> is_b_on_device: " << config.is_b_on_device << std::endl;
std::cout << "===>>> is_result_on_device: " << config.is_result_on_device << std::endl;
std::cout << "===>>> is_in_montgomery_form: " << config.is_in_montgomery_form << std::endl;
std::cout << "host_out[0] = " << host_out[0] << std::endl;
exit(2);
}
unsigned power_after;
nvmlDeviceGetPowerUsage(device, &power_after);
std::cout << "Power after: " << std::fixed << std::setprecision(3) << 1.0e-3 * power_after << " W" << std::endl;
std::cout << "Power utilization: " << std::fixed << std::setprecision(1)
<< (float)100.0 * power_after / power_limit << " %" << std::endl;
unsigned temperature_after;
if (nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU, &temperature_after) == NVML_SUCCESS) {
std::cout << "GPU Temperature after: " << temperature_after << " C" << std::endl;
} else {
std::cerr << "Failed to get GPU temperature." << std::endl;
}
// Report performance in GOPS: Giga Operations Per Second
double GOPS = 1.0e-9 * repetitions * vector_size / (1.0e-6 * duration.count());
std::cout << "Performance: " << GOPS << " Giga Operations Per Second" << std::endl;
}
}
// clean up and exit
free(host_in1_init);
free(host_in2_init);
free(host_in1);
free(host_in2);
free(host_out);
free(host_out_ref_mul);
free(host_out_ref_add);
free(host_out_ref_sub);
cudaFree(device_in1);
cudaFree(device_in2);
cudaFree(device_out);
nvmlShutdown();
return 0;
}

View File

@@ -0,0 +1,2 @@
#! /bin/bash
./build/example/example

View File

@@ -11,6 +11,17 @@
using namespace hash;
namespace keccak {
// Input rate in bytes
const int KECCAK_256_RATE = 136;
const int KECCAK_512_RATE = 72;
// Digest size in u64
const int KECCAK_256_DIGEST = 4;
const int KECCAK_512_DIGEST = 8;
// Number of state elements in u64
const int KECCAK_STATE_SIZE = 25;
class Keccak : public Hasher<uint8_t, uint64_t>
{
public:
@@ -22,7 +33,7 @@ namespace keccak {
unsigned int output_len,
const device_context::DeviceContext& ctx) const override;
Keccak(unsigned int rate) : Hasher<uint8_t, uint64_t>(25, 25, rate, 0) {}
Keccak(unsigned int rate) : Hasher<uint8_t, uint64_t>(KECCAK_STATE_SIZE, KECCAK_STATE_SIZE, rate, 0) {}
};
} // namespace keccak

View File

@@ -27,6 +27,8 @@ namespace vec_ops {
* non-blocking and you'd need to synchronize it explicitly by running
* `cudaStreamSynchronize` or `cudaDeviceSynchronize`. If set to false, the
* function will block the current CPU thread. */
bool is_in_montgomery_form; /**< If true then vec_a, vec_b and result are in Montgomery form.
* Default value: false. */
};
/**
@@ -42,6 +44,7 @@ namespace vec_ops {
false, // is_b_on_device
false, // is_result_on_device
false, // is_async
false, // is_in_montgomery_form
};
return config;
}

View File

@@ -11,13 +11,15 @@ namespace keccak {
extern "C" cudaError_t
keccak256_cuda(uint8_t* input, int input_block_size, int number_of_blocks, uint8_t* output, HashConfig& config)
{
return Keccak(136).hash_many(input, (uint64_t*)output, number_of_blocks, input_block_size, 4, config);
return Keccak(KECCAK_256_RATE)
.hash_many(input, (uint64_t*)output, number_of_blocks, input_block_size, KECCAK_256_DIGEST, config);
}
extern "C" cudaError_t
keccak512_cuda(uint8_t* input, int input_block_size, int number_of_blocks, uint8_t* output, HashConfig& config)
{
return Keccak(72).hash_many(input, (uint64_t*)output, number_of_blocks, input_block_size, 8, config);
return Keccak(KECCAK_512_RATE)
.hash_many(input, (uint64_t*)output, number_of_blocks, input_block_size, KECCAK_512_DIGEST, config);
}
extern "C" cudaError_t build_keccak256_merkle_tree_cuda(
@@ -27,7 +29,7 @@ namespace keccak {
unsigned int input_block_len,
const merkle_tree::TreeBuilderConfig& tree_config)
{
Keccak keccak(136);
Keccak keccak(KECCAK_256_RATE);
return merkle_tree::build_merkle_tree<uint8_t, uint64_t>(
leaves, digests, height, input_block_len, keccak, keccak, tree_config);
}
@@ -39,7 +41,7 @@ namespace keccak {
unsigned int input_block_len,
const merkle_tree::TreeBuilderConfig& tree_config)
{
Keccak keccak(72);
Keccak keccak(KECCAK_512_RATE);
return merkle_tree::build_merkle_tree<uint8_t, uint64_t>(
leaves, digests, height, input_block_len, keccak, keccak, tree_config);
}

View File

@@ -161,7 +161,7 @@ namespace keccak {
0x8000000000008002, 0x8000000000000080, 0x000000000000800a, 0x800000008000000a,
0x8000000080008081, 0x8000000000008080, 0x0000000080000001, 0x8000000080008008};
__device__ void keccakf(u64 s[25])
__device__ void keccakf(u64 s[KECCAK_STATE_SIZE])
{
u64 t0, t1, t2, t3, t4;
@@ -188,7 +188,7 @@ namespace keccak {
const uint8_t* b_input = input + sid * input_block_size;
uint64_t* b_output = output + sid * output_len;
uint64_t state[25] = {}; // Initialize with zeroes
uint64_t state[KECCAK_STATE_SIZE] = {}; // Initialize with zeroes
int input_len = input_block_size;
@@ -238,12 +238,12 @@ namespace keccak {
int number_of_gpu_blocks = (number_of_states - 1) / number_of_threads + 1;
switch (rate) {
case 136:
keccak_hash_blocks<136><<<number_of_gpu_blocks, number_of_threads, 0, ctx.stream>>>(
case KECCAK_256_RATE:
keccak_hash_blocks<KECCAK_256_RATE><<<number_of_gpu_blocks, number_of_threads, 0, ctx.stream>>>(
input, input_len, output_len, number_of_states, output);
break;
case 72:
keccak_hash_blocks<72><<<number_of_gpu_blocks, number_of_threads, 0, ctx.stream>>>(
case KECCAK_512_RATE:
keccak_hash_blocks<KECCAK_512_RATE><<<number_of_gpu_blocks, number_of_threads, 0, ctx.stream>>>(
input, input_len, output_len, number_of_states, output);
break;
default:

View File

@@ -95,25 +95,64 @@ namespace vec_ops {
E *d_result, *d_alloc_vec_a, *d_alloc_vec_b;
E* d_vec_a;
const E* d_vec_b;
int is_d_alloc_vec_a_allocated = 0;
if (!config.is_a_on_device) {
CHK_IF_RETURN(cudaMallocAsync(&d_alloc_vec_a, n * sizeof(E), config.ctx.stream));
CHK_IF_RETURN(cudaMemcpyAsync(d_alloc_vec_a, vec_a, n * sizeof(E), cudaMemcpyHostToDevice, config.ctx.stream));
d_vec_a = d_alloc_vec_a;
if (config.is_in_montgomery_form) {
CHK_IF_RETURN(cudaMallocAsync(&d_alloc_vec_a, n * sizeof(E), config.ctx.stream));
CHK_IF_RETURN(cudaMemcpyAsync(d_alloc_vec_a, vec_a, n * sizeof(E), cudaMemcpyHostToDevice, config.ctx.stream));
CHK_IF_RETURN(mont::from_montgomery(d_alloc_vec_a, n, config.ctx.stream, d_alloc_vec_a));
is_d_alloc_vec_a_allocated = 1;
d_vec_a = d_alloc_vec_a;
} else {
CHK_IF_RETURN(cudaMallocAsync(&d_alloc_vec_a, n * sizeof(E), config.ctx.stream));
CHK_IF_RETURN(cudaMemcpyAsync(d_alloc_vec_a, vec_a, n * sizeof(E), cudaMemcpyHostToDevice, config.ctx.stream));
is_d_alloc_vec_a_allocated = 1;
d_vec_a = d_alloc_vec_a;
}
} else {
d_vec_a = vec_a;
if (config.is_in_montgomery_form) {
CHK_IF_RETURN(cudaMallocAsync(
&d_alloc_vec_a, n * sizeof(E), config.ctx.stream)); // Allocate in order not to change the input.
CHK_IF_RETURN(mont::from_montgomery(vec_a, n, config.ctx.stream, d_alloc_vec_a));
is_d_alloc_vec_a_allocated = 1;
d_vec_a = d_alloc_vec_a;
} else {
d_vec_a = vec_a;
}
}
int is_d_alloc_vec_b_allocated = 0;
if (!config.is_b_on_device) {
CHK_IF_RETURN(cudaMallocAsync(&d_alloc_vec_b, n * sizeof(E), config.ctx.stream));
CHK_IF_RETURN(cudaMemcpyAsync(d_alloc_vec_b, vec_b, n * sizeof(E), cudaMemcpyHostToDevice, config.ctx.stream));
d_vec_b = d_alloc_vec_b;
if (config.is_in_montgomery_form) {
CHK_IF_RETURN(cudaMallocAsync(&d_alloc_vec_b, n * sizeof(E), config.ctx.stream));
CHK_IF_RETURN(cudaMemcpyAsync(d_alloc_vec_b, vec_b, n * sizeof(E), cudaMemcpyHostToDevice, config.ctx.stream));
CHK_IF_RETURN(mont::from_montgomery(d_alloc_vec_b, n, config.ctx.stream, d_alloc_vec_b));
is_d_alloc_vec_b_allocated = 1;
d_vec_b = d_alloc_vec_b;
} else {
CHK_IF_RETURN(cudaMallocAsync(&d_alloc_vec_b, n * sizeof(E), config.ctx.stream));
CHK_IF_RETURN(cudaMemcpyAsync(d_alloc_vec_b, vec_b, n * sizeof(E), cudaMemcpyHostToDevice, config.ctx.stream));
is_d_alloc_vec_b_allocated = 1;
d_vec_b = d_alloc_vec_b;
}
} else {
d_vec_b = vec_b;
if (config.is_in_montgomery_form) {
CHK_IF_RETURN(cudaMallocAsync(
&d_alloc_vec_b, n * sizeof(E), config.ctx.stream)); // Allocate in order not to change the input.
CHK_IF_RETURN(mont::from_montgomery(vec_b, n, config.ctx.stream, d_alloc_vec_b));
is_d_alloc_vec_b_allocated = 1;
d_vec_b = d_alloc_vec_b;
} else {
d_vec_b = vec_b;
}
}
int is_d_result_allocated = 0;
if (!config.is_result_on_device) {
if (!is_in_place) {
CHK_IF_RETURN(cudaMallocAsync(&d_result, n * sizeof(E), config.ctx.stream));
is_d_result_allocated = 1;
} else {
d_result = d_vec_a;
}
@@ -129,12 +168,21 @@ namespace vec_ops {
Kernel<<<num_blocks, num_threads, 0, config.ctx.stream>>>(d_vec_a, d_vec_b, n, d_result);
if (!config.is_result_on_device) {
CHK_IF_RETURN(cudaMemcpyAsync(result, d_result, n * sizeof(E), cudaMemcpyDeviceToHost, config.ctx.stream));
CHK_IF_RETURN(cudaFreeAsync(d_result, config.ctx.stream));
if (config.is_in_montgomery_form) {
CHK_IF_RETURN(mont::to_montgomery(d_result, n, config.ctx.stream, d_result)); // Convert in-place.
CHK_IF_RETURN(cudaMemcpyAsync(result, d_result, n * sizeof(E), cudaMemcpyDeviceToHost, config.ctx.stream));
} else {
CHK_IF_RETURN(cudaMemcpyAsync(result, d_result, n * sizeof(E), cudaMemcpyDeviceToHost, config.ctx.stream));
}
} else {
if (config.is_in_montgomery_form) {
CHK_IF_RETURN(mont::to_montgomery(d_result, n, config.ctx.stream, d_result)); // Convert in-place.
}
}
if (!config.is_a_on_device && !is_in_place) { CHK_IF_RETURN(cudaFreeAsync(d_alloc_vec_a, config.ctx.stream)); }
if (!config.is_b_on_device) { CHK_IF_RETURN(cudaFreeAsync(d_alloc_vec_b, config.ctx.stream)); }
if (is_d_alloc_vec_a_allocated) { CHK_IF_RETURN(cudaFreeAsync(d_alloc_vec_a, config.ctx.stream)); }
if (is_d_alloc_vec_b_allocated) { CHK_IF_RETURN(cudaFreeAsync(d_alloc_vec_b, config.ctx.stream)); }
if (is_d_result_allocated) { CHK_IF_RETURN(cudaFreeAsync(d_result, config.ctx.stream)); }
if (!config.is_async) return CHK_STICKY(cudaStreamSynchronize(config.ctx.stream));

View File

@@ -28,6 +28,8 @@ type VecOpsConfig struct {
* non-blocking and you'll need to synchronize it explicitly by calling
* `SynchronizeStream`. If set to false, the function will block the current CPU thread. */
IsAsync bool
/* If true then vec_a, vec_b and result are in Montgomery form. Default value: false. */
IsInMontgomeryForm bool
}
/**
@@ -42,6 +44,7 @@ func DefaultVecOpsConfig() VecOpsConfig {
false, // isBOnDevice
false, // isResultOnDevice
false, // IsAsync
false, // IsInMontgomeryForm
}
return config

View File

@@ -15,6 +15,7 @@ func TestVecOpsDefaultConfig(t *testing.T) {
false, // isBOnDevice
false, // isResultOnDevice
false, // IsAsync
false, // IsInMontgomeryForm
}
actual := DefaultVecOpsConfig()

View File

@@ -18,7 +18,7 @@ exclude = [
]
[workspace.package]
version = "2.7.1"
version = "2.8.0"
edition = "2021"
authors = [ "Ingonyama" ]
homepage = "https://www.ingonyama.com"

View File

@@ -20,6 +20,8 @@ pub struct VecOpsConfig<'a> {
/// Whether to run the vector operations asynchronously. If set to `true`, the functions will be non-blocking and you'd need to synchronize
/// it explicitly by running `stream.synchronize()`. If set to false, the functions will block the current CPU thread.
pub is_async: bool,
/// If true then vec_a, vec_b and result are in Montgomery form. Default value: false.
pub is_in_montgomery_form: bool,
}
impl<'a> Default for VecOpsConfig<'a> {
@@ -36,6 +38,7 @@ impl<'a> VecOpsConfig<'a> {
is_b_on_device: false,
is_result_on_device: false,
is_async: false,
is_in_montgomery_form: false,
}
}
}

View File

@@ -24,6 +24,6 @@ pub(crate) mod tests {
let ctx = device_context::DeviceContext::default();
let sponge = Poseidon::load(2, &ctx).unwrap();
check_build_field_merkle_tree::<_, _, Bls12_377TreeBuilder>(25, 2, &sponge, &sponge, ScalarField::zero());
check_build_field_merkle_tree::<_, _, Bls12_377TreeBuilder>(18, 2, &sponge, &sponge, ScalarField::zero());
}
}

View File

@@ -24,6 +24,6 @@ pub(crate) mod tests {
let ctx = device_context::DeviceContext::default();
let sponge = Poseidon::load(2, &ctx).unwrap();
check_build_field_merkle_tree::<_, _, Bls12_381TreeBuilder>(25, 2, &sponge, &sponge, ScalarField::zero());
check_build_field_merkle_tree::<_, _, Bls12_381TreeBuilder>(18, 2, &sponge, &sponge, ScalarField::zero());
}
}