Compare commits

..

1 Commit

Author SHA1 Message Date
LeonHibnik
1ae9c560b5 two curve examples 2024-07-17 09:19:37 +03:00
34 changed files with 325 additions and 1195 deletions

View File

@@ -1,6 +1,6 @@
# Contributor's Guide
We welcome all contributions with open arms. At Ingonyama we take a village approach, believing it takes many hands and minds to build an ecosystem.
## Contributing to ICICLE
@@ -14,9 +14,9 @@ We welcome all contributions with open arms. At Ingonyama we take a village appr
When opening a [pull request](https://github.com/ingonyama-zk/icicle/pulls) please keep the following in mind.
- `Clear Purpose` - The pull request should solve a single issue and be clean of any unrelated changes.
- `Clear description` - If the pull request is for a new feature describe what you built, why you added it and how it's best that we test it. For bug fixes please describe the issue and the solution.
- `Consistent style` - Rust and Golang code should be linted by the official linters (golang fmt and rust fmt) and maintain a proper style. For CUDA and C++ code we use [`clang-format`](https://github.com/ingonyama-zk/icicle/blob/main/.clang-format), [here](https://github.com/ingonyama-zk/icicle/blob/605c25f9d22135c54ac49683b710fe2ce06e2300/.github/workflows/main-format.yml#L46) you can see how we run it.
- `Minimal Tests` - please add tests which cover the basic usage of your changes.
## Questions?

View File

@@ -47,7 +47,7 @@ type NTTConfig[T any] struct {
- **`areInputsOnDevice`**: Indicates if input scalars are located on the device.
- **`areOutputsOnDevice`**: Indicates if results are stored on the device.
- **`IsAsync`**: Controls whether the NTT operation runs asynchronously.
- **`NttAlgorithm`**: Explicitly select the NTT algorithm. ECNTT supports running the `Radix2` algorithm (see the sketch below).
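For example, you can toggle this field on a default configuration before launching the transform. A minimal sketch; `GetDefaultNttConfig` and `core.Radix2` are assumed names, so consult the golang wrapper for the exact helpers and enum values:

```go
// Sketch: force the Radix2 algorithm for an ECNTT run.
// GetDefaultNttConfig and core.Radix2 are assumptions based on the
// fields described above; check the wrapper for the exact names.
cfg := ntt.GetDefaultNttConfig()
cfg.NttAlgorithm = core.Radix2 // ECNTT supports the Radix2 algorithm
// ...pass cfg to the ECNTT call as usual.
```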
### Default Configuration

View File

@@ -139,7 +139,7 @@ cfg.Ctx.IsBigTriangle = true
Toggling between MSM modes occurs automatically based on the number of results you are expecting from the `MSM` function.
The number of results is interpreted from the size of `var out core.DeviceSlice`. Thus it's important when allocating memory for `var out core.DeviceSlice` to make sure that you are allocating `<number of results> X <size of a single point>`.
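For instance, to receive a batch of 10 results you would size `out` accordingly before the call. A rough sketch; the `Malloc` signature and the point size used here are assumptions, so check the wrapper for the exact allocation call:

```go
// Sketch: allocate `out` for a batched MSM yielding batchSize results.
var out core.DeviceSlice
batchSize := 10 // number of MSM results expected
pointSize := 96 // illustrative byte size of one projective point (curve-dependent)
// Total allocation = <number of results> x <size of a single point>.
out.Malloc(batchSize*pointSize, pointSize) // assumed signature: (totalBytes, elementSize)
```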
```go
...
@@ -168,7 +168,7 @@ import (
)
```
This package includes `G2Projective` and `G2Affine` points as well as a `G2Msm` method.
```go
package main

View File

@@ -171,7 +171,7 @@ Polynomial& add_monomial_inplace(Coeff monomial_coeff, uint64_t monomial = 0);
Polynomial& sub_monomial_inplace(Coeff monomial_coeff, uint64_t monomial = 0);
```
The ability to add or subtract monomials directly and in-place is an efficient way to manipulate polynomials.
Example:
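Purely to illustrate the semantics (a Go sketch of the same in-place update, not the library's C++ API):

```go
package main

import "fmt"

// addMonomialInplace adds coeff * x^monomial to a polynomial stored as a
// coefficient slice (index = power of x), mirroring add_monomial_inplace.
func addMonomialInplace(coeffs []int64, coeff int64, monomial int) []int64 {
	for len(coeffs) <= monomial {
		coeffs = append(coeffs, 0) // grow with zero coefficients if needed
	}
	coeffs[monomial] += coeff // touch a single coefficient; no full-vector op
	return coeffs
}

func main() {
	p := []int64{3, 0, 2}                    // 3 + 2x^2
	fmt.Println(addMonomialInplace(p, 5, 4)) // [3 0 2 0 5] = 3 + 2x^2 + 5x^4
}
```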

View File

@@ -14,62 +14,9 @@ At its core, Keccak consists of a permutation function operating on a state arra
## Using Keccak
ICICLE Keccak supports batch hashing, which can be utilized for constructing a merkle tree or running multiple hashes in parallel.
ICICLE Keccak supports batch hashing, which can be utilized for constructing a merkle tree.
### Supported Bindings
- [Golang](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang/hash/keccak)
- [Rust](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/rust/icicle-hash)
### Example usage
This is an example of running 1024 Keccak-256 hashes in parallel, where input strings are of size 136 bytes:
```rust
use icicle_core::hash::HashConfig;
use icicle_cuda_runtime::memory::HostSlice;
use icicle_hash::keccak::keccak256;
let config = HashConfig::default();
let input_block_len = 136;
let number_of_hashes = 1024;
let preimages = vec![1u8; number_of_hashes * input_block_len];
let mut digests = vec![0u8; number_of_hashes * 64];
let preimages_slice = HostSlice::from_slice(&preimages);
let digests_slice = HostSlice::from_mut_slice(&mut digests);
keccak256(
preimages_slice,
input_block_len as u32,
number_of_hashes as u32,
digests_slice,
&config,
)
.unwrap();
```
### Merkle Tree
You can build a keccak merkle tree using the corresponding functions:
```rust
use icicle_core::tree::{merkle_tree_digests_len, TreeBuilderConfig};
use icicle_cuda_runtime::memory::HostSlice;
use icicle_hash::keccak::build_keccak256_merkle_tree;
let mut config = TreeBuilderConfig::default();
config.arity = 2;
let height = 22;
let input_block_len = 136;
let leaves = vec![1u8; (1 << height) * input_block_len];
let mut digests = vec![0u64; merkle_tree_digests_len((height + 1) as u32, 2, 1)];
let leaves_slice = HostSlice::from_slice(&leaves);
let digests_slice = HostSlice::from_mut_slice(&mut digests);
build_keccak256_merkle_tree(leaves_slice, digests_slice, height, input_block_len, &config).unwrap();
```
In the example above, a binary tree of height 22 is being built. Each leaf is considered to be a 136-byte array. The leaves and digests are aligned in a flat array. You can also use Keccak-512 via the `build_keccak512_merkle_tree` function.

View File

@@ -1,6 +1,6 @@
# MSM - Multi scalar multiplication
MSM stands for Multi scalar multiplication; it's defined as:
$$MSM(a, G) = \sum_{j=0}^{n} a_j \cdot G_j$$
@@ -43,7 +43,7 @@ $a_0, \ldots, a_n$ - Scalars
$MSM(a, G) \in G$ - a single EC (elliptic curve) point
In words, MSM is the sum of scalar and EC point multiplications. We can see from this definition that the core operations occurring are Modular Multiplication and Elliptic curve point addition. It's obvious that multiplication can be computed in parallel and then the products summed, making MSM inherently parallelizable.
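To fix ideas, the naive sequential computation over plain integers (a toy sketch; real MSM uses field scalars and EC points):

```go
package main

import "fmt"

func main() {
	a := []int{3, 5, 7}    // scalars
	G := []int{10, 20, 30} // stand-ins for EC points
	// Each product a[i]*G[i] is independent, so the products can be
	// computed in parallel; only the final sum combines them.
	sum := 0
	for i := range a {
		sum += a[i] * G[i]
	}
	fmt.Println(sum) // 3*10 + 5*20 + 7*30 = 340
}
```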
Accelerating MSM is crucial to a ZK protocol's performance due to the [large percent of run time](https://hackmd.io/@0xMonia/SkQ6-oRz3#Hardware-acceleration-in-action) they take when generating proofs.
@@ -131,7 +131,7 @@ Large buckets exist in two cases:
2. When `c` does not divide the scalar bit-size.
A `large_bucket_factor` equal to 10 yields good results for most cases, but it's best to fine-tune this parameter per `c` and per scalar distribution.
The two most important parameters for performance are `c` and the `precompute_factor`. They affect the number of EC additions as well as the memory size. When the points are not known in advance we cannot use precomputation. In this case the best `c` value is usually around $log_2(msmSize) - 4$. However, in most protocols the points are known in advance and precomputation can be used unless limited by memory. Usually it's best to use maximum precomputation (such that we end up with only a single bucket module) combined with a `c` value around $log_2(msmSize) - 1$.
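A hypothetical helper (not part of the ICICLE API) that encodes this rule of thumb:

```go
package main

import (
	"fmt"
	"math/bits"
)

// recommendedC returns the heuristic starting point for `c`:
// log2(msmSize) - 4 without precomputation, log2(msmSize) - 1 with
// maximum precomputation. Fine-tuning per scalar distribution is still advised.
func recommendedC(msmSize uint64, maxPrecompute bool) int {
	log2 := bits.Len64(msmSize) - 1 // floor(log2(msmSize))
	if maxPrecompute {
		return log2 - 1
	}
	return log2 - 4
}

func main() {
	fmt.Println(recommendedC(1<<20, false)) // 16
	fmt.Println(recommendedC(1<<20, true))  // 19
}
```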
## Memory usage estimation

View File

@@ -56,7 +56,7 @@ Choosing an algorithm is heavily dependent on your use case. For example Cooley-
NTT also supports two different modes `Batch NTT` and `Single NTT`
Deciding whether to use `batch NTT` vs `single NTT` is highly dependent on your application and use case.
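In practice the difference is just a batch count on the config plus a concatenated input buffer. A sketch; the `BatchSize` field and the default-config helper are assumptions based on the Go `NTTConfig` described earlier:

```go
// Sketch: one batched call transforming 16 vectors of 1024 scalars each.
cfg := ntt.GetDefaultNttConfig() // assumed default-config helper
cfg.BatchSize = 16               // assumed field: number of NTTs in the batch
// The input holds 16 * 1024 scalars laid out back to back; a single
// NTT call then transforms all 16 vectors in one kernel launch.
```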
#### Single NTT

View File

@@ -1,6 +1,6 @@
# Poseidon
[Poseidon](https://eprint.iacr.org/2019/458.pdf) is a popular hash in the ZK ecosystem primarily because it's optimized to work over large prime fields, a common setting for ZK proofs, thereby minimizing the number of multiplicative operations required.
Poseidon has also been specifically designed to be efficient when implemented within ZK circuits; it uses far fewer constraints than other hash functions like Keccak or SHA-256 in the context of ZK circuits.
@@ -42,7 +42,7 @@ To generate a secure hash output, the algorithm goes through a series of "full r
**Linear Transformation and Round Constants:** A linear transformation is performed and round constants are added. The linear transformation in partial rounds can be designed to be less computationally intensive (this is done by using a sparse matrix) than in full rounds, further optimizing the function's efficiency.
The user of Poseidon can often choose how many partial or full rounds to apply; more full rounds will increase security but degrade performance. The choice and balance are highly dependent on the use case.
## Using Poseidon
@@ -53,14 +53,13 @@ So for Poseidon of arity 2 and input of size 1024 * 2, we would expect 1024 elem
### Supported Bindings
[`Go`](https://github.com/ingonyama-zk/icicle/blob/main/wrappers/golang/curves/bn254/poseidon/poseidon.go)
[`Rust`](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/rust/icicle-core/src/poseidon)
### Constants
Poseidon is extremely customizable and using different constants will produce different hashes, security levels and performance results.
We support pre-calculated and optimized constants for each of the [supported curves](../core#supported-curves-and-operations). The constants can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/icicle/include/poseidon/constants) and are labeled clearly per curve `<curve_name>_poseidon.h`.
If you wish to generate your own constants you can use our python script which can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/icicle/include/poseidon/constants/generate_parameters.py).
@@ -92,6 +91,8 @@ primitive_element = 7 # bls12-381
# primitive_element = 15 # bw6-761
```
We only support `alpha = 5`, so if you want to use another alpha for the S-box, please reach out on Discord or open a GitHub issue.
### Rust API
This is the most basic way to use the Poseidon API.
@@ -100,58 +101,71 @@ This is the most basic way to use the Poseidon API.
let test_size = 1 << 10;
let arity = 2u32;
let ctx = get_default_device_context();
let poseidon = Poseidon::load(arity, &ctx).unwrap();
let config = HashConfig::default();
let constants = load_optimized_poseidon_constants::<F>(arity, &ctx).unwrap();
let config = PoseidonConfig::default();
let inputs = vec![F::one(); test_size * arity as usize];
let outputs = vec![F::zero(); test_size];
let mut input_slice = HostOrDeviceSlice::on_host(inputs);
let mut output_slice = HostOrDeviceSlice::on_host(outputs);
poseidon.hash_many::<F>(
poseidon_hash_many::<F>(
&mut input_slice,
&mut output_slice,
test_size as u32,
arity as u32,
1, // Output length
&constants,
&config,
)
.unwrap();
```
The `HashConfig` can be modified; by default, the inputs and outputs are set to be on the `Host`, for example.
The `PoseidonConfig::default()` can be modified; by default, the inputs and outputs are set to be on the `Host`, for example.
```rust
impl<'a> Default for HashConfig<'a> {
impl<'a> Default for PoseidonConfig<'a> {
fn default() -> Self {
let ctx = get_default_device_context();
Self {
ctx,
are_inputs_on_device: false,
are_outputs_on_device: false,
input_is_a_state: false,
aligned: false,
loop_state: false,
is_async: false,
}
}
}
```
In the example above `Poseidon::load(arity, &ctx).unwrap();` is used which will load the correct constants based on arity and curve. It's possible to [generate](#constants) your own constants and load them.
In the example above `load_optimized_poseidon_constants::<F>(arity, &ctx).unwrap();` is used which will load the correct constants based on arity and curve. It's possible to [generate](#constants) your own constants and load them.
```rust
let ctx = get_default_device_context();
let custom_poseidon = Poseidon::new(
arity, // The arity of poseidon hash. The width will be equal to arity + 1
alpha, // The S-box power
full_rounds_half,
partial_rounds,
round_constants,
mds_matrix,
non_sparse_matrix,
sparse_matrices,
domain_tag,
ctx,
)
.unwrap();
let cargo_manifest_dir = env!("CARGO_MANIFEST_DIR");
let constants_file = PathBuf::from(cargo_manifest_dir)
.join("tests")
.join(format!("{}_constants.bin", field_prefix));
let mut constants_buf = vec![];
File::open(constants_file)
.unwrap()
.read_to_end(&mut constants_buf)
.unwrap();
let mut custom_constants = vec![];
for chunk in constants_buf.chunks(field_bytes) {
custom_constants.push(F::from_bytes_le(chunk));
}
let custom_constants = create_optimized_poseidon_constants::<F>(
arity as u32,
&ctx,
full_rounds_half,
partial_rounds,
&mut custom_constants,
)
.unwrap();
```
## The Tree Builder
@@ -161,34 +175,21 @@ The tree builder allows you to build Merkle trees using Poseidon.
You can define both the tree's `height` and its `arity`. The tree `height` determines the number of layers in the tree, including the root and the leaf layer. The `arity` determines how many children each internal node can have.
```rust
use icicle_bn254::tree::Bn254TreeBuilder;
use icicle_bn254::poseidon::Poseidon;
let mut config = TreeBuilderConfig::default();
let height = 20;
let arity = 2;
config.arity = arity as u32;
let input_block_len = arity;
let leaves = vec![F::one(); (1 << height) * arity];
let mut digests = vec![F::zero(); merkle_tree_digests_len((height + 1) as u32, arity as u32, 1)];
let leaves = vec![F::one(); 1 << (height - 1)];
let mut digests = vec![F::zero(); merkle_tree_digests_len(height, arity)];
let leaves_slice = HostSlice::from_slice(&leaves);
let digests_slice = HostSlice::from_mut_slice(&mut digests);
let mut leaves_slice = HostOrDeviceSlice::on_host(leaves);
let ctx = device_context::DeviceContext::default();
let hash = Poseidon::load(2, &ctx).unwrap();
let ctx = get_default_device_context();
let constants = load_optimized_poseidon_constants::<F>(arity, &ctx).unwrap()
let mut config = TreeBuilderConfig::default();
config.keep_rows = 5;
Bn254TreeBuilder::build_merkle_tree(
leaves_slice,
digests_slice,
height,
input_block_len,
&hash,
&hash,
&config,
)
.unwrap();
config.keep_rows = 1;
build_poseidon_merkle_tree::<F>(&mut leaves_slice, &mut digests, height, arity, &constants, &config).unwrap();
println!("Root: {:?}", digests[0..1][0]);
```
Similar to Poseidon, you can also configure the Tree Builder via `TreeBuilderConfig::default()`.

View File

@@ -1,88 +0,0 @@
# Poseidon2
[Poseidon2](https://eprint.iacr.org/2023/323) is a recently released optimized version of Poseidon1. The two versions differ in two crucial points. First, Poseidon is a sponge hash function, while Poseidon2 can be either a sponge or a compression function depending on the use case. Secondly, Poseidon2 is instantiated by new and more efficient linear layers with respect to Poseidon. These changes decrease the number of multiplications in the linear layer by up to 90% and the number of constraints in Plonk circuits by up to 70%. This makes Poseidon2 currently the fastest arithmetization-oriented hash function without lookups.
## Using Poseidon2
ICICLE Poseidon2 is implemented for GPU and parallelization is performed for each state.
We calculate multiple hash-sums over multiple pre-images in parallel, rather than going block by block over the input vector.
For example, for Poseidon2 of width 16, input rate 8, output elements 8, and input of size 1024 * 8, we would expect 1024 * 8 elements of output, which means each input block would be of size 8, resulting in 1024 Poseidon2 hashes being performed.
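The bookkeeping from that example, spelled out (a trivial sketch, Go used only for the arithmetic):

```go
package main

import "fmt"

func main() {
	rate := 8                    // input elements absorbed per hash
	outputLen := 8               // output elements produced per hash
	inputLen := 1024 * 8         // total input elements
	numHashes := inputLen / rate // 1024 independent Poseidon2 hashes
	fmt.Println(numHashes, numHashes*outputLen) // 1024 hashes, 8192 output elements
}
```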
### Supported Bindings
[`Rust`](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/rust/icicle-core/src/poseidon2)
### Constants
Poseidon2 is also extremely customizable and using different constants will produce different hashes, security levels and performance results.
We support pre-calculated constants for each of the [supported curves](../core#supported-curves-and-operations). The constants can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/icicle/include/poseidon2/constants) and are labeled clearly per curve `<curve_name>_poseidon2.h`.
You can also use your own set of constants as shown [here](https://github.com/ingonyama-zk/icicle/blob/main/wrappers/rust/icicle-fields/icicle-babybear/src/poseidon2/mod.rs#L290).
### Rust API
This is the most basic way to use the Poseidon2 API.
```rust
let test_size = 1 << 10;
let width = 16;
let rate = 8;
let ctx = get_default_device_context();
let poseidon = Poseidon2::load(width, rate, MdsType::Default, DiffusionStrategy::Default, &ctx).unwrap();
let config = HashConfig::default();
let inputs = vec![F::one(); test_size * rate as usize];
let outputs = vec![F::zero(); test_size];
let mut input_slice = HostOrDeviceSlice::on_host(inputs);
let mut output_slice = HostOrDeviceSlice::on_host(outputs);
poseidon.hash_many::<F>(
&mut input_slice,
&mut output_slice,
test_size as u32,
rate as u32,
8, // Output length
&config,
)
.unwrap();
```
In the example above `Poseidon2::load(width, rate, MdsType::Default, DiffusionStrategy::Default, &ctx).unwrap();` is used to load the correct constants based on width and curve. Here, the default MDS matrices and diffusion are used. If you want to get a Plonky3 compliant version, set them to `MdsType::Plonky` and `DiffusionStrategy::Montgomery` respectively.
## The Tree Builder
Similar to Poseidon1, you can use Poseidon2 in a tree builder.
```rust
use icicle_bn254::tree::Bn254TreeBuilder;
use icicle_bn254::poseidon2::Poseidon2;
let mut config = TreeBuilderConfig::default();
let arity = 2;
config.arity = arity as u32;
let input_block_len = arity;
let leaves = vec![F::one(); (1 << height) * arity];
let mut digests = vec![F::zero(); merkle_tree_digests_len((height + 1) as u32, arity as u32, 1)];
let leaves_slice = HostSlice::from_slice(&leaves);
let digests_slice = HostSlice::from_mut_slice(&mut digests);
let ctx = device_context::DeviceContext::default();
let hash = Poseidon2::load(arity, arity, MdsType::Default, DiffusionStrategy::Default, &ctx).unwrap();
let mut config = TreeBuilderConfig::default();
config.keep_rows = 5;
Bn254TreeBuilder::build_merkle_tree(
leaves_slice,
digests_slice,
height,
input_block_len,
&hash,
&hash,
&config,
)
.unwrap();
```

docs/package-lock.json generated
View File

@@ -3680,8 +3680,6 @@
"version": "8.12.0",
"resolved": "https://registry.npmjs.org/ajv/-/ajv-8.12.0.tgz",
"integrity": "sha512-sRu1kpcO9yLtYxBKvqfTeh9KzZEwO3STyX1HT+4CaDzC6HpTGYhIhPIzj9XuKU7KYDwnaeh5hcOwjy1QuJzBPA==",
"optional": true,
"peer": true,
"dependencies": {
"fast-deep-equal": "^3.1.1",
"json-schema-traverse": "^1.0.0",
@@ -3696,9 +3694,7 @@
"node_modules/ajv-formats/node_modules/json-schema-traverse": {
"version": "1.0.0",
"resolved": "https://registry.npmjs.org/json-schema-traverse/-/json-schema-traverse-1.0.0.tgz",
"integrity": "sha512-NM8/P9n3XjXhIZn1lLhkFaACTOURQXjWhV4BA/RnOv8xvgqtqpAX9IO4mRQxSx1Rlo4tqzeqb0sOlruaOy3dug==",
"optional": true,
"peer": true
"integrity": "sha512-NM8/P9n3XjXhIZn1lLhkFaACTOURQXjWhV4BA/RnOv8xvgqtqpAX9IO4mRQxSx1Rlo4tqzeqb0sOlruaOy3dug=="
},
"node_modules/ajv-keywords": {
"version": "3.5.2",
@@ -16344,13 +16340,14 @@
"version": "2.1.1",
"resolved": "https://registry.npmjs.org/ajv-formats/-/ajv-formats-2.1.1.tgz",
"integrity": "sha512-Wx0Kx52hxE7C18hkMEggYlEifqWZtYaRgouJor+WMdPnQyEK13vgEWyVNup7SoeeoLMsr4kf5h6dOW11I15MUA==",
"requires": {},
"requires": {
"ajv": "^8.0.0"
},
"dependencies": {
"ajv": {
"version": "https://registry.npmjs.org/ajv/-/ajv-8.12.0.tgz",
"version": "8.12.0",
"resolved": "https://registry.npmjs.org/ajv/-/ajv-8.12.0.tgz",
"integrity": "sha512-sRu1kpcO9yLtYxBKvqfTeh9KzZEwO3STyX1HT+4CaDzC6HpTGYhIhPIzj9XuKU7KYDwnaeh5hcOwjy1QuJzBPA==",
"optional": true,
"peer": true,
"requires": {
"fast-deep-equal": "^3.1.1",
"json-schema-traverse": "^1.0.0",
@@ -16361,9 +16358,7 @@
"json-schema-traverse": {
"version": "1.0.0",
"resolved": "https://registry.npmjs.org/json-schema-traverse/-/json-schema-traverse-1.0.0.tgz",
"integrity": "sha512-NM8/P9n3XjXhIZn1lLhkFaACTOURQXjWhV4BA/RnOv8xvgqtqpAX9IO4mRQxSx1Rlo4tqzeqb0sOlruaOy3dug==",
"optional": true,
"peer": true
"integrity": "sha512-NM8/P9n3XjXhIZn1lLhkFaACTOURQXjWhV4BA/RnOv8xvgqtqpAX9IO4mRQxSx1Rlo4tqzeqb0sOlruaOy3dug=="
}
}
},

View File

@@ -53,11 +53,6 @@ module.exports = {
label: "Poseidon Hash",
id: "icicle/primitives/poseidon",
},
{
type: "doc",
label: "Poseidon2 Hash",
id: "icicle/primitives/poseidon2",
},
],
},
{

View File

@@ -1,24 +0,0 @@
cmake_minimum_required(VERSION 3.18)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_STANDARD_REQUIRED TRUE)
set(CMAKE_CXX_STANDARD_REQUIRED TRUE)
if (${CMAKE_VERSION} VERSION_LESS "3.24.0")
set(CMAKE_CUDA_ARCHITECTURES ${CUDA_ARCH})
else()
set(CMAKE_CUDA_ARCHITECTURES native) # native requires CMake 3.24+; on earlier versions it is ignored and no architecture is passed
endif ()
project(example LANGUAGES CUDA CXX)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --expt-relaxed-constexpr")
set(CMAKE_CUDA_FLAGS_RELEASE "")
set(CMAKE_CUDA_FLAGS_DEBUG "${CMAKE_CUDA_FLAGS_DEBUG} -g -G -O0")
add_executable(
example
example.cu
)
target_include_directories(example PRIVATE "../../../icicle/include")
target_link_libraries(example ${CMAKE_SOURCE_DIR}/build/icicle/lib/libingo_field_bn254.a)
find_library(NVML_LIBRARY nvidia-ml PATHS /usr/local/cuda/targets/x86_64-linux/lib/stubs/ )
target_link_libraries(example ${NVML_LIBRARY})
set_target_properties(example PROPERTIES CUDA_SEPARABLE_COMPILATION ON)

View File

@@ -1,38 +0,0 @@
# Icicle example: Montgomery vector operations (mul, add, sub) for all possible options:
- is_a_on_device
- is_b_on_device
- is_result_on_device
- is_in_montgomery_form
- (is_async isn't checked)
## Key-Takeaway
`Icicle` accelerates the multiplication operation `*` using the [Karatsuba algorithm](https://en.wikipedia.org/wiki/Karatsuba_algorithm).
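For reference, the Karatsuba recursion on plain integers (a toy Go sketch; the library applies the same idea to multi-limb field elements):

```go
package main

import "fmt"

func numDigits(n uint64) int {
	d := 0
	for ; n > 0; n /= 10 {
		d++
	}
	return d
}

func pow10(k int) uint64 {
	p := uint64(1)
	for i := 0; i < k; i++ {
		p *= 10
	}
	return p
}

// karatsuba computes x*y with three recursive multiplications instead of
// four: split x = hi1*B + lo1 and y = hi2*B + lo2, then
// x*y = z2*B^2 + z1*B + z0 with z1 = (hi1+lo1)*(hi2+lo2) - z2 - z0.
// Requires Go 1.21+ for the min builtin.
func karatsuba(x, y uint64) uint64 {
	if x < 10 || y < 10 {
		return x * y
	}
	m := numDigits(min(x, y)) / 2
	B := pow10(m)
	hi1, lo1 := x/B, x%B
	hi2, lo2 := y/B, y%B
	z0 := karatsuba(lo1, lo2)
	z2 := karatsuba(hi1, hi2)
	z1 := karatsuba(hi1+lo1, hi2+lo2) - z2 - z0
	return z2*B*B + z1*B + z0
}

func main() {
	fmt.Println(karatsuba(1234, 5678)) // 7006652
}
```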
## Concise Usage Explanation
Define the field to be used, e.g.:
```c++
#include "api/bn254.h"
```
```c++
using namespace bn254;
typedef scalar_t T;
```
## Running the example
- `cd` to your example directory
- compile with `./compile.sh`
- run with `./run.sh`
## What's in the example
1. Define the parameters for the example such as vector size
2. Generate random vectors on-host
3. Copy them on-device
4. Execute element-wise vector multiplication on-device
5. Copy results on-host

View File

@@ -1,15 +0,0 @@
#!/bin/bash
# Exit immediately on error
set -e
mkdir -p build/example
mkdir -p build/icicle
# Configure and build Icicle
cmake -S ../../../icicle/ -B build/icicle -DMSM=OFF -DCMAKE_BUILD_TYPE=Debug -DCURVE=bn254
cmake --build build/icicle -j
# Configure and build the example application
cmake -DCMAKE_BUILD_TYPE=Debug -S. -B build/example
cmake --build build/example

View File

@@ -1,15 +0,0 @@
#!/bin/bash
# Exit immediately on error
set -e
mkdir -p build/example
mkdir -p build/icicle
# Configure and build Icicle
cmake -S ../../../icicle/ -B build/icicle -DMSM=OFF -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254
cmake --build build/icicle
# Configure and build the example application
cmake -S . -B build/example
cmake --build build/example

View File

@@ -1,734 +0,0 @@
#include <iostream>
#include <iomanip>
#include <chrono>
#include <nvml.h>
#include "api/bn254.h"
#include "vec_ops/vec_ops.cuh"
#include <vec_ops/../../include/utils/mont.cuh>
using namespace vec_ops;
using namespace bn254;
typedef scalar_t T;
enum Op { MUL, ADD, SUB, LAST };
// bn254 p = 21888242871839275222246405745257275088548364400416034343698204186575808495617
int vector_op(
T* vec_a,
T* vec_b,
T* vec_result,
size_t n_elements,
device_context::DeviceContext ctx,
vec_ops::VecOpsConfig config,
Op op)
{
cudaError_t err;
switch (op) {
case MUL:
err = bn254_mul_cuda(vec_a, vec_b, n_elements, config, vec_result);
break;
case ADD:
err = bn254_add_cuda(vec_a, vec_b, n_elements, config, vec_result);
break;
case SUB:
err = bn254_sub_cuda(vec_a, vec_b, n_elements, config, vec_result);
break;
}
// cudaError_t err = bn254_mul_cuda(vec_a, vec_b, n_elements, config, vec_result);
if (err != cudaSuccess) {
std::cerr << "Failed to multiply vectors - " << cudaGetErrorString(err) << std::endl;
return 0;
}
return 0;
}
int vector_mul(
T* vec_a, T* vec_b, T* vec_result, size_t n_elements, device_context::DeviceContext ctx, vec_ops::VecOpsConfig config)
{
cudaError_t err = bn254_mul_cuda(vec_a, vec_b, n_elements, config, vec_result);
if (err != cudaSuccess) {
std::cerr << "Failed to multiply vectors - " << cudaGetErrorString(err) << std::endl;
return 0;
}
return 0;
}
int vector_add(
T* vec_a, T* vec_b, T* vec_result, size_t n_elements, device_context::DeviceContext ctx, vec_ops::VecOpsConfig config)
{
cudaError_t err = bn254_add_cuda(vec_a, vec_b, n_elements, config, vec_result);
if (err != cudaSuccess) {
std::cerr << "Failed to multiply vectors - " << cudaGetErrorString(err) << std::endl;
return 0;
}
return 0;
}
int vector_sub(
T* vec_a, T* vec_b, T* vec_result, size_t n_elements, device_context::DeviceContext ctx, vec_ops::VecOpsConfig config)
{
cudaError_t err = bn254_sub_cuda(vec_a, vec_b, n_elements, config, vec_result);
if (err != cudaSuccess) {
std::cerr << "Failed to multiply vectors - " << cudaGetErrorString(err) << std::endl;
return 0;
}
return 0;
}
int main(int argc, char** argv)
{
const unsigned vector_size = 1 << 0;
const unsigned repetitions = 1 << 0;
cudaError_t err;
nvmlInit();
nvmlDevice_t device;
nvmlDeviceGetHandleByIndex(0, &device); // for GPU 0
std::cout << "Icicle-Examples: vector mul / add / sub operations." << std::endl;
char name[NVML_DEVICE_NAME_BUFFER_SIZE];
if (nvmlDeviceGetName(device, name, NVML_DEVICE_NAME_BUFFER_SIZE) == NVML_SUCCESS) {
std::cout << "GPU Model: " << name << std::endl;
} else {
std::cerr << "Failed to get GPU model name." << std::endl;
}
unsigned power_limit;
nvmlDeviceGetPowerManagementLimit(device, &power_limit);
std::cout << "Vector size: " << vector_size << std::endl;
std::cout << "Repetitions: " << repetitions << std::endl;
std::cout << "Power limit: " << std::fixed << std::setprecision(3) << 1.0e-3 * power_limit << " W" << std::endl;
unsigned int baseline_power;
nvmlDeviceGetPowerUsage(device, &baseline_power);
std::cout << "Baseline power: " << std::fixed << std::setprecision(3) << 1.0e-3 * baseline_power << " W" << std::endl;
unsigned baseline_temperature;
if (nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU, &baseline_temperature) == NVML_SUCCESS) {
std::cout << "Baseline GPU Temperature: " << baseline_temperature << " C" << std::endl;
} else {
std::cerr << "Failed to get GPU temperature." << std::endl;
}
// host data
std::cout << "Allocate memory for the input vectors (both normal and Montgomery presentation)" << std::endl;
T* host_in1_init = (T*)malloc(vector_size * sizeof(T));
T* host_in2_init = (T*)malloc(vector_size * sizeof(T));
std::cout << "Initializing vectors with normal presentation random data" << std::endl;
T::rand_host_many(host_in1_init, vector_size);
T::rand_host_many(host_in2_init, vector_size);
std::cout << "Allocate memory for the output vectors" << std::endl;
T* host_out = (T*)malloc(vector_size * sizeof(T)); // This memory will be used for the test output.
T* host_out_ref_mul = (T*)malloc(
vector_size *
sizeof(T)); // This memory will be used as a reference result for mul (will be compared to host_out content).
T* host_out_ref_add = (T*)malloc(
vector_size *
sizeof(T)); // This memory will be used as a reference result for add (will be compared to host_out content).
T* host_out_ref_sub = (T*)malloc(
vector_size *
sizeof(T)); // This memory will be used as a reference result for sub (will be compared to host_out content).
std::cout << "Initializing output vectors with random data" << std::endl;
T::rand_host_many(host_out, vector_size);
T::rand_host_many(host_out_ref_mul, vector_size);
T::rand_host_many(host_out_ref_add, vector_size);
T::rand_host_many(host_out_ref_sub, vector_size);
// device data
device_context::DeviceContext ctx = device_context::get_default_device_context();
T* device_in1;
T* device_in2;
T* device_out;
err = cudaMalloc((void**)&device_in1, vector_size * sizeof(T));
if (err != cudaSuccess) {
std::cerr << "Failed to allocate device memory - " << cudaGetErrorString(err) << std::endl;
return 0;
}
err = cudaMalloc((void**)&device_in2, vector_size * sizeof(T));
if (err != cudaSuccess) {
std::cerr << "Failed to allocate device memory - " << cudaGetErrorString(err) << std::endl;
return 0;
}
err = cudaMalloc((void**)&device_out, vector_size * sizeof(T));
if (err != cudaSuccess) {
std::cerr << "Failed to allocate device memory - " << cudaGetErrorString(err) << std::endl;
return 0;
}
vec_ops::VecOpsConfig config = vec_ops::DefaultVecOpsConfig();
//****************************************
// Test warm-up and reference output config. Reference output to be used to check if test passed or not.
//****************************************
// copy from host to device
err = cudaMemcpy(device_in1, host_in1_init, vector_size * sizeof(T), cudaMemcpyHostToDevice);
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host to device - " << cudaGetErrorString(err) << std::endl;
return 0;
}
err = cudaMemcpy(device_in2, host_in2_init, vector_size * sizeof(T), cudaMemcpyHostToDevice);
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host to device - " << cudaGetErrorString(err) << std::endl;
return 0;
}
std::cout << "Starting warm-up run" << std::endl;
// Warm-up loop
for (int op = MUL; op != LAST; op++) {
for (int i = 0; i < repetitions; i++) {
// vector_mul(device_in1, device_in2, device_out, vector_size, ctx, config);
vector_op(device_in1, device_in2, device_out, vector_size, ctx, config, (Op)op);
}
switch (op) {
case MUL:
err = cudaMemcpy(host_out_ref_mul, device_out, vector_size * sizeof(T), cudaMemcpyDeviceToHost);
break;
case ADD:
err = cudaMemcpy(host_out_ref_add, device_out, vector_size * sizeof(T), cudaMemcpyDeviceToHost);
break;
case SUB:
err = cudaMemcpy(host_out_ref_sub, device_out, vector_size * sizeof(T), cudaMemcpyDeviceToHost);
break;
}
}
// copy the result from device to host_out_ref_mul to keep it for later comparisons.
// err = cudaMemcpy(host_out_ref_mul, device_out, vector_size * sizeof(T), cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_out to host - " << cudaGetErrorString(err) << std::endl;
return 0;
}
//****************************************
// End of test warm-up and reference output config.
//****************************************
std::cout << "Starting benchmarking" << std::endl;
unsigned power_before;
nvmlDeviceGetPowerUsage(device, &power_before);
std::cout << "Power before: " << std::fixed << std::setprecision(3) << 1.0e-3 * power_before << " W" << std::endl;
std::cout << "Power utilization: " << std::fixed << std::setprecision(1) << (float)100.0 * power_before / power_limit
<< " %" << std::endl;
unsigned temperature_before;
if (nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU, &temperature_before) == NVML_SUCCESS) {
std::cout << "GPU Temperature before: " << temperature_before << " C" << std::endl;
} else {
std::cerr << "Failed to get GPU temperature." << std::endl;
}
//*******************************************************
// Benchmark test:
// Loop for (mul, add, sub):
// Loop (is_a_on_device, is_b_on_device, is_result_on_device, is_in_montgomery_form):
//*******************************************************
T* host_in1 =
(T*)malloc(vector_size * sizeof(T)); // This buffer is used to load the data from host_in1_init for the benchmark.
T* host_in2 =
(T*)malloc(vector_size * sizeof(T)); // This buffer is used to load the data from host_in2_init for the benchmark.
// Test when the result is not in-place
for (int op = MUL; op != LAST; op++) {
// for (int config_idx = 0; config_idx < 0; config_idx++) {
for (int config_idx = 0; config_idx < 16; config_idx++) {
std::cout << "Start benchmark loop for config_idx " << config_idx << std::endl;
for (int i = 0; i < vector_size; i++) {
host_in1[i] = host_in1_init[i];
host_in2[i] = host_in2_init[i];
}
config.is_a_on_device = (config_idx >> 3) & 0x1;
config.is_b_on_device = (config_idx >> 2) & 0x1;
config.is_result_on_device = (config_idx >> 1) & 0x1;
config.is_in_montgomery_form = (config_idx >> 0) & 0x1;
// Copy from host to device (copy again in order to be used later in the loop; device_inX was already
// overwritten by the warm-up).
if (config.is_a_on_device) {
if (config.is_in_montgomery_form) {
err =
cudaMemcpy(device_in1, host_in1, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in1 to device_in1 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(
mont::to_montgomery(device_in1, vector_size, config.ctx.stream, device_in1)); // Convert in-place.
} else { // Normal representation.
err =
cudaMemcpy(device_in1, host_in1, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in1 to device_in1 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
} else {
if (config.is_in_montgomery_form) { // Copy to device, convert to Montgomery and copy back to host.
err =
cudaMemcpy(device_in1, host_in1, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in1 to device_in1 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(mont::to_montgomery(device_in1, vector_size, config.ctx.stream, device_in1));
err = cudaMemcpy(host_in1, device_in1, vector_size * sizeof(T), cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_in1 to host_in1 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
}
if (config.is_b_on_device) {
if (config.is_in_montgomery_form) {
err =
cudaMemcpy(device_in2, host_in2, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in2 to device_in1 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(
mont::to_montgomery(device_in2, vector_size, config.ctx.stream, device_in2)); // Convert in-place.
} else {
// Normal representation.
err =
cudaMemcpy(device_in2, host_in2, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in2 to device_in2 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
} else {
if (config.is_in_montgomery_form) { // Copy to device, convert to Montgomery and copy back to host.
err =
cudaMemcpy(device_in2, host_in2, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in2 to device_in2 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(mont::to_montgomery(device_in2, vector_size, config.ctx.stream, device_in2));
err = cudaMemcpy(host_in2, device_in2, vector_size * sizeof(T), cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_in2 to host_in2 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
}
CHK_IF_RETURN(cudaPeekAtLastError());
auto start_time = std::chrono::high_resolution_clock::now();
// Benchmark loop
for (int i = 0; i < repetitions; i++) {
switch (config_idx >> 1) { // {is_a_on_device, is_b_on_device, is_result_on_device}
case 0b000:
vector_op(host_in1, host_in2, host_out, vector_size, ctx, config, (Op)op);
break;
case 0b001:
vector_op(host_in1, host_in2, device_out, vector_size, ctx, config, (Op)op);
break;
case 0b010:
vector_op(host_in1, device_in2, host_out, vector_size, ctx, config, (Op)op);
break;
case 0b011:
vector_op(host_in1, device_in2, device_out, vector_size, ctx, config, (Op)op);
break;
case 0b100:
vector_op(device_in1, host_in2, host_out, vector_size, ctx, config, (Op)op);
break;
case 0b101:
vector_op(device_in1, host_in2, device_out, vector_size, ctx, config, (Op)op);
break;
case 0b110:
vector_op(device_in1, device_in2, host_out, vector_size, ctx, config, (Op)op);
break;
case 0b111:
vector_op(device_in1, device_in2, device_out, vector_size, ctx, config, (Op)op);
break;
}
CHK_IF_RETURN(cudaPeekAtLastError());
}
auto end_time = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);
switch (op) {
case MUL:
std::cout << "Elapsed time: " << duration.count() << " microseconds, operation MUL for config_idx "
<< config_idx << " and result not in-place" << std::endl;
break;
case ADD:
std::cout << "Elapsed time: " << duration.count() << " microseconds, operation ADD for config_idx "
<< config_idx << " and result not in-place" << std::endl;
break;
case SUB:
std::cout << "Elapsed time: " << duration.count() << " microseconds, operation SUB for config_idx "
<< config_idx << " and result not in-place" << std::endl;
break;
}
if (config.is_result_on_device) { // Copy the data to host_out in order to compare it vs. host_out_ref_mul value.
if (config.is_in_montgomery_form) { // Convert to normal from montgomery if needed.
CHK_IF_RETURN(mont::from_montgomery(
device_out, vector_size, config.ctx.stream,
device_out)); // Convert to normal in order to check vs. host_out_ref_mul.
}
err = cudaMemcpy(
host_out, device_out, vector_size * sizeof(T),
cudaMemcpyDeviceToHost); // Copy to host_out in order to check vs. host_out_ref_mul.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_out to host - " << cudaGetErrorString(err) << std::endl;
return 0;
}
} else { // Data is not on device but it is in host_out.
if (config.is_in_montgomery_form) { // host_out should be written to device, converted from Montgomery and
// written back to host. Then compared vs. host_out_ref_mul.
err = cudaMemcpy(
device_out, host_out, vector_size * sizeof(T),
cudaMemcpyHostToDevice); // Copy to host_out in order to check vs. host_out_ref_mul.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_out to device_out - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(mont::from_montgomery(
device_out, vector_size, config.ctx.stream,
device_out)); // Convert to normal in order to check vs. host_out_ref_mul.
err = cudaMemcpy(
host_out, device_out, vector_size * sizeof(T),
cudaMemcpyDeviceToHost); // Copy to host_out in order to check vs. host_out_ref_mul.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_out to host_out - " << cudaGetErrorString(err) << std::endl;
return 0;
}
} else { // host_out could be compared vs. host_out_ref_mul as is.
}
}
//****************************************
// End of benchmark test.
//****************************************
//***********************************************
// Test result check
// Check is performed by executing the operation in the normal representation
// (located in host_out_ref_mul) and comparing it with the
// benchmark test result.
//***********************************************
int test_failed = 0;
// std::cout << "===>>> host_out_ref_mul[" << i << "]: " << host_out_ref_mul[i] << std::endl;
// std::cout << "===>>> host_out[" << i << "] after test run: " << host_out[i] << std::endl;
switch (op) {
case MUL:
for (int i = 0; i < vector_size; i++) {
if (host_out_ref_mul[i] != host_out[i]) {
std::cout << "===>>> ERROR!!! MUL: Test failed for vector index " << i
<< ", config is printed below:" << std::endl;
test_failed = 1;
}
}
break;
case ADD:
for (int i = 0; i < vector_size; i++) {
if (host_out_ref_add[i] != host_out[i]) {
std::cout << "===>>> ERROR!!! ADD: Test failed for vector index " << i
<< ", config is printed below:" << std::endl;
test_failed = 1;
}
}
break;
case SUB:
for (int i = 0; i < vector_size; i++) {
if (host_out_ref_sub[i] != host_out[i]) {
std::cout << "===>>> ERROR!!! SUB: Test failed for vector index " << i
<< ", config is printed below:" << std::endl;
test_failed = 1;
}
}
break;
}
if (test_failed) {
// std::cout << "===>>> ERROR!!! Test failed for vector index " << i << ", config is printed below:" <<
// std::endl;
std::cout << "===>>> result is not in-place: " << std::endl;
std::cout << "===>>> is_a_on_device: " << config.is_a_on_device << std::endl;
std::cout << "===>>> is_b_on_device: " << config.is_b_on_device << std::endl;
std::cout << "===>>> is_result_on_device: " << config.is_result_on_device << std::endl;
std::cout << "===>>> is_in_montgomery_form: " << config.is_in_montgomery_form << std::endl;
exit(2);
}
unsigned power_after;
nvmlDeviceGetPowerUsage(device, &power_after);
std::cout << "Power after: " << std::fixed << std::setprecision(3) << 1.0e-3 * power_after << " W" << std::endl;
std::cout << "Power utilization: " << std::fixed << std::setprecision(1)
<< (float)100.0 * power_after / power_limit << " %" << std::endl;
unsigned temperature_after;
if (nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU, &temperature_after) == NVML_SUCCESS) {
std::cout << "GPU Temperature after: " << temperature_after << " C" << std::endl;
} else {
std::cerr << "Failed to get GPU temperature." << std::endl;
}
// Report performance in GMPS: Giga Multiplications Per Second
double GMPS = 1.0e-9 * repetitions * vector_size / (1.0e-6 * duration.count());
std::cout << "Performance: " << GMPS << " Giga Multiplications Per Second" << std::endl;
}
}
// Test when the result is in-place
for (int op = MUL; op != LAST; op++) {
for (int config_idx = 0; config_idx < 16; config_idx++) {
for (int i = 0; i < vector_size; i++) {
host_in1[i] = host_in1_init[i];
host_in2[i] = host_in2_init[i];
}
config.is_a_on_device = (config_idx >> 4) & 0x1;
config.is_b_on_device = (config_idx >> 3) & 0x1;
config.is_result_on_device = (config_idx >> 2) & 0x1;
config.is_in_montgomery_form = (config_idx >> 1) & 0x1;
if ((config.is_a_on_device ^ config.is_result_on_device) == 1) { continue; }
// Copy from host to device (copy again in order to be used later in the loop; device_inX was already
// overwritten by the warm-up).
if (config.is_a_on_device) {
if (config.is_in_montgomery_form) {
err =
cudaMemcpy(device_in1, host_in1, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in1 to device_in1 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(
mont::to_montgomery(device_in1, vector_size, config.ctx.stream, device_in1)); // Convert in-place.
} else { // Normal representation.
err =
cudaMemcpy(device_in1, host_in1, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in1 to device_in1 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
} else {
if (config.is_in_montgomery_form) { // Copy to device, convert to Montgomery and copy back to host.
err =
cudaMemcpy(device_in1, host_in1, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in1 to device_in1 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(mont::to_montgomery(device_in1, vector_size, config.ctx.stream, device_in1));
err = cudaMemcpy(host_in1, device_in1, vector_size * sizeof(T), cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_in1 to host_in1 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
}
if (config.is_b_on_device) {
if (config.is_in_montgomery_form) {
err =
cudaMemcpy(device_in2, host_in2, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in2 to device_in1 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(
mont::to_montgomery(device_in2, vector_size, config.ctx.stream, device_in2)); // Convert in-place.
} else {
// Normal representation.
err =
cudaMemcpy(device_in2, host_in2, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in2 to device_in2 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
} else {
if (config.is_in_montgomery_form) { // Copy to device, convert to Montgomery and copy back to host.
err =
cudaMemcpy(device_in2, host_in2, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in2 to device_in2 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(mont::to_montgomery(device_in2, vector_size, config.ctx.stream, device_in2));
err = cudaMemcpy(host_in2, device_in2, vector_size * sizeof(T), cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_in2 to host_in2 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
}
CHK_IF_RETURN(cudaPeekAtLastError());
auto start_time = std::chrono::high_resolution_clock::now();
// Benchmark loop
for (int i = 0; i < repetitions; i++) {
switch (config_idx >> 2) { // {is_a_on_device, is_b_on_device, is_result_on_device}
case 0b000:
vector_op(host_in1, host_in2, host_in1, vector_size, ctx, config, (Op)op);
break;
case 0b001:
break;
case 0b010:
vector_op(host_in1, device_in2, host_in1, vector_size, ctx, config, (Op)op);
break;
case 0b011:
break;
case 0b100:
break;
case 0b101:
vector_op(device_in1, host_in2, device_in1, vector_size, ctx, config, (Op)op);
break;
case 0b110:
break;
case 0b111:
vector_op(device_in1, device_in2, device_in1, vector_size, ctx, config, (Op)op);
break;
}
CHK_IF_RETURN(cudaPeekAtLastError());
}
auto end_time = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);
switch (op) {
case MUL:
std::cout << "Elapsed time: " << duration.count() << " microseconds, operation MUL for config_idx "
<< config_idx << " and result in-place" << std::endl;
break;
case ADD:
std::cout << "Elapsed time: " << duration.count() << " microseconds, operation ADD for config_idx "
<< config_idx << " and result in-place" << std::endl;
break;
case SUB:
std::cout << "Elapsed time: " << duration.count() << " microseconds, operation SUB for config_idx "
<< config_idx << " and result in-place" << std::endl;
break;
}
if (config.is_result_on_device) { // Copy the data to host_out in order to compare it vs. host_out_ref_mul value.
if (config.is_in_montgomery_form) { // Convert to normal from montgomery if needed.
CHK_IF_RETURN(mont::from_montgomery(
device_in1, vector_size, config.ctx.stream,
device_in1)); // Convert to normal in order to check vs. host_out_ref_mul.
}
err = cudaMemcpy(
host_out, device_in1, vector_size * sizeof(T),
cudaMemcpyDeviceToHost); // Copy to host_out in order to check vs. host_out_ref_mul.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_in1 to host_out - " << cudaGetErrorString(err) << std::endl;
return 0;
}
} else { // Data is not on device but it is in host_in1. It should be moved to host_out for test pass/fail check.
if (config.is_in_montgomery_form) { // host_out should be written to device, converted from Montgomery and
// written back to host. Then compared vs. host_out_ref_mul.
err = cudaMemcpy(
device_out, host_in1, vector_size * sizeof(T),
cudaMemcpyHostToDevice); // Copy to host_out in order to check vs. host_out_ref_mul.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in1 to device_out - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(mont::from_montgomery(
device_out, vector_size, config.ctx.stream,
device_out)); // Convert to normal in order to check vs. host_out_ref_mul.
err = cudaMemcpy(
host_out, device_out, vector_size * sizeof(T),
cudaMemcpyDeviceToHost); // Copy to host_out in order to check vs. host_out_ref_mul.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_out to host_out - " << cudaGetErrorString(err) << std::endl;
return 0;
}
} else { // host_out could be compared vs. host_out_ref_mul as is.
err = cudaMemcpy(
device_out, host_in1, vector_size * sizeof(T),
cudaMemcpyHostToDevice); // Copy to host_out in order to check vs. host_out_ref_mul.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in1 to device_out - " << cudaGetErrorString(err) << std::endl;
return 0;
}
err = cudaMemcpy(
host_out, device_out, vector_size * sizeof(T),
cudaMemcpyDeviceToHost); // Copy to host_out in order to check vs. host_out_ref_mul.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_out to host_out - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
}
//****************************************
// End of benchmark test.
//****************************************
//***********************************************
// Test result check
// Check is performed by executing the operation in the normal representation
// (located in host_out_ref_mul) and comparing it with the
// benchmark test result.
//***********************************************
int test_failed = 0;
// std::cout << "===>>> host_out_ref_mul[" << i << "]: " << host_out_ref_mul[i] << std::endl;
// std::cout << "===>>> host_out[" << i << "] after test run: " << host_out[i] << std::endl;
switch (op) {
case MUL:
for (int i = 0; i < vector_size; i++) {
if (host_out_ref_mul[i] != host_out[i]) {
std::cout << "===>>> ERROR!!! MUL: Test failed for vector index " << i
<< ", config is printed below:" << std::endl;
std::cout << "host_out_ref_mul[0] = " << host_out_ref_mul[0] << std::endl;
test_failed = 1;
}
}
break;
case ADD:
for (int i = 0; i < vector_size; i++) {
if (host_out_ref_add[i] != host_out[i]) {
std::cout << "===>>> ERROR!!! ADD: Test failed for vector index " << i
<< ", config is printed below:" << std::endl;
std::cout << "host_out_ref_add[0] = " << host_out_ref_add[0] << std::endl;
test_failed = 1;
}
}
break;
case SUB:
for (int i = 0; i < vector_size; i++) {
if (host_out_ref_sub[i] != host_out[i]) {
std::cout << "===>>> ERROR!!! SUB: Test failed for vector index " << i
<< ", config is printed below:" << std::endl;
std::cout << "host_out_ref_sub[0] = " << host_out_ref_sub[0] << std::endl;
test_failed = 1;
}
}
break;
}
if (test_failed) {
// std::cout << "===>>> ERROR!!! Test failed for vector index " << i << ", config is printed below:" <<
// std::endl;
std::cout << "===>>> result is in-place: " << std::endl;
std::cout << "===>>> is_a_on_device: " << config.is_a_on_device << std::endl;
std::cout << "===>>> is_b_on_device: " << config.is_b_on_device << std::endl;
std::cout << "===>>> is_result_on_device: " << config.is_result_on_device << std::endl;
std::cout << "===>>> is_in_montgomery_form: " << config.is_in_montgomery_form << std::endl;
std::cout << "host_out[0] = " << host_out[0] << std::endl;
exit(2);
}
unsigned power_after;
nvmlDeviceGetPowerUsage(device, &power_after);
std::cout << "Power after: " << std::fixed << std::setprecision(3) << 1.0e-3 * power_after << " W" << std::endl;
std::cout << "Power utilization: " << std::fixed << std::setprecision(1)
<< (float)100.0 * power_after / power_limit << " %" << std::endl;
unsigned temperature_after;
if (nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU, &temperature_after) == NVML_SUCCESS) {
std::cout << "GPU Temperature after: " << temperature_after << " C" << std::endl;
} else {
std::cerr << "Failed to get GPU temperature." << std::endl;
}
// Report performance in GMPS: Giga Multiplications Per Second
double GMPS = 1.0e-9 * repetitions * vector_size / (1.0e-6 * duration.count());
std::cout << "Performance: " << GMPS << " Giga Multiplications Per Second" << std::endl;
}
}
// clean up and exit
free(host_in1_init);
free(host_in2_init);
free(host_in1);
free(host_in2);
free(host_out);
free(host_out_ref_mul);
free(host_out_ref_add);
free(host_out_ref_sub);
cudaFree(device_in1);
cudaFree(device_in2);
cudaFree(device_out);
nvmlShutdown();
return 0;
}

View File

@@ -1,2 +0,0 @@
#! /bin/bash
./build/example/example

View File

@@ -19,5 +19,5 @@ add_executable(
example.cu
)
target_include_directories(example PRIVATE "../../../icicle/include")
target_link_libraries(example ${CMAKE_SOURCE_DIR}/build/icicle/lib/libingo_curve_bn254.a)
target_link_libraries(example ${CMAKE_SOURCE_DIR}/build/icicle/lib/libingo_curve_bn254.a ${CMAKE_SOURCE_DIR}/build/icicle/lib/libingo_curve_bls12_377.a)
set_target_properties(example PROPERTIES CUDA_SEPARABLE_COMPILATION ON)

View File

@@ -10,6 +10,9 @@ mkdir -p build/icicle
cmake -S ../../../icicle/ -B build/icicle -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254 -DG2=ON
cmake --build build/icicle
cmake -S ../../../icicle/ -B build/icicle -DCMAKE_BUILD_TYPE=Release -DCURVE=bls12_377 -DG2=ON
cmake --build build/icicle
# Configure and build the example application
cmake -S . -B build/example
cmake --build build/example

View File

@@ -3,9 +3,23 @@
#include <iomanip>
#include "api/bn254.h"
using namespace bn254;
#include "api/bls12_377.h"
int main(int argc, char* argv[])
// using namespace bn254;
typedef bn254::scalar_t scalar_bn254;
typedef bn254::affine_t affine_bn254;
typedef bn254::g2_affine_t g2_affine_bn254;
typedef bn254::projective_t projective_bn254;
typedef bn254::g2_projective_t g2_projective_bn254;
typedef bls12_377::scalar_t scalar_bls12377;
typedef bls12_377::affine_t affine_bls12377;
typedef bls12_377::g2_affine_t g2_affine_bls12377;
typedef bls12_377::projective_t projective_bls12377;
typedef bls12_377::g2_projective_t g2_projective_bls12377;
int msm_bn254(int argc, char* argv[])
{
std::cout << "Icicle example: Muli-Scalar Multiplication (MSM)" << std::endl;
std::cout << "Example parameters" << std::endl;
@@ -18,11 +32,11 @@ int main(int argc, char* argv[])
std::cout << "Part I: use G1 points" << std::endl;
std::cout << "Generating random inputs on-host" << std::endl;
scalar_t* scalars = new scalar_t[N];
affine_t* points = new affine_t[N];
projective_t result;
scalar_t::rand_host_many(scalars, N);
projective_t::rand_host_many_affine(points, N);
scalar_bn254* scalars = new scalar_bn254[N];
affine_bn254* points = new affine_bn254[N];
projective_bn254 result;
scalar_bn254::rand_host_many(scalars, N);
projective_bn254::rand_host_many_affine(points, N);
std::cout << "Using default MSM configuration with on-host inputs" << std::endl;
device_context::DeviceContext ctx = device_context::get_default_device_context();
@@ -48,17 +62,17 @@ int main(int argc, char* argv[])
cudaStream_t stream = config.ctx.stream;
// Execute the MSM kernel
bn254_msm_cuda(scalars, points, msm_size, config, &result);
std::cout << projective_t::to_affine(result) << std::endl;
std::cout << projective_bn254::to_affine(result) << std::endl;
std::cout << "Copying inputs on-device" << std::endl;
scalar_t* scalars_d;
affine_t* points_d;
projective_t* result_d;
cudaMalloc(&scalars_d, sizeof(scalar_t) * N);
cudaMalloc(&points_d, sizeof(affine_t) * N);
cudaMalloc(&result_d, sizeof(projective_t));
cudaMemcpy(scalars_d, scalars, sizeof(scalar_t) * N, cudaMemcpyHostToDevice);
cudaMemcpy(points_d, points, sizeof(affine_t) * N, cudaMemcpyHostToDevice);
scalar_bn254* scalars_d;
affine_bn254* points_d;
projective_bn254* result_d;
cudaMalloc(&scalars_d, sizeof(scalar_bn254) * N);
cudaMalloc(&points_d, sizeof(affine_bn254) * N);
cudaMalloc(&result_d, sizeof(projective_bn254));
cudaMemcpy(scalars_d, scalars, sizeof(scalar_bn254) * N, cudaMemcpyHostToDevice);
cudaMemcpy(points_d, points, sizeof(affine_bn254) * N, cudaMemcpyHostToDevice);
std::cout << "Reconfiguring MSM to use on-device inputs" << std::endl;
config.are_results_on_device = true;
@@ -70,9 +84,9 @@ int main(int argc, char* argv[])
bn254_msm_cuda(scalars_d, points_d, msm_size, config, result_d);
// Copy the result back to the host
cudaMemcpy(&result, result_d, sizeof(projective_t), cudaMemcpyDeviceToHost);
cudaMemcpy(&result, result_d, sizeof(projective_bn254), cudaMemcpyDeviceToHost);
// Print the result
std::cout << projective_t::to_affine(result) << std::endl;
std::cout << projective_bn254::to_affine(result) << std::endl;
// Free the device memory
cudaFree(scalars_d);
cudaFree(points_d);
@@ -84,25 +98,25 @@ int main(int argc, char* argv[])
std::cout << "Generating random inputs on-host" << std::endl;
// use the same scalars
g2_affine_t* g2_points = new g2_affine_t[N];
g2_projective_t::rand_host_many_affine(g2_points, N);
g2_affine_bn254* g2_points = new g2_affine_bn254[N];
g2_projective_bn254::rand_host_many_affine(g2_points, N);
std::cout << "Reconfiguring MSM to use on-host inputs" << std::endl;
config.are_results_on_device = false;
config.are_scalars_on_device = false;
config.are_points_on_device = false;
g2_projective_t g2_result;
g2_projective_bn254 g2_result;
bn254_g2_msm_cuda(scalars, g2_points, msm_size, config, &g2_result);
std::cout << g2_projective_t::to_affine(g2_result) << std::endl;
std::cout << g2_projective_bn254::to_affine(g2_result) << std::endl;
std::cout << "Copying inputs on-device" << std::endl;
g2_affine_t* g2_points_d;
g2_projective_t* g2_result_d;
cudaMalloc(&scalars_d, sizeof(scalar_t) * N);
cudaMalloc(&g2_points_d, sizeof(g2_affine_t) * N);
cudaMalloc(&g2_result_d, sizeof(g2_projective_t));
cudaMemcpy(scalars_d, scalars, sizeof(scalar_t) * N, cudaMemcpyHostToDevice);
cudaMemcpy(g2_points_d, g2_points, sizeof(g2_affine_t) * N, cudaMemcpyHostToDevice);
g2_affine_bn254* g2_points_d;
g2_projective_bn254* g2_result_d;
cudaMalloc(&scalars_d, sizeof(scalar_bn254) * N);
cudaMalloc(&g2_points_d, sizeof(g2_affine_bn254) * N);
cudaMalloc(&g2_result_d, sizeof(g2_projective_bn254));
cudaMemcpy(scalars_d, scalars, sizeof(scalar_bn254) * N, cudaMemcpyHostToDevice);
cudaMemcpy(g2_points_d, g2_points, sizeof(g2_affine_bn254) * N, cudaMemcpyHostToDevice);
std::cout << "Reconfiguring MSM to use on-device inputs" << std::endl;
config.are_results_on_device = true;
@@ -111,14 +125,140 @@ int main(int argc, char* argv[])
std::cout << "Running MSM kernel with on-device inputs" << std::endl;
bn254_g2_msm_cuda(scalars_d, g2_points_d, msm_size, config, g2_result_d);
cudaMemcpy(&g2_result, g2_result_d, sizeof(g2_projective_t), cudaMemcpyDeviceToHost);
std::cout << g2_projective_t::to_affine(g2_result) << std::endl;
cudaMemcpy(&g2_result, g2_result_d, sizeof(g2_projective_bn254), cudaMemcpyDeviceToHost);
std::cout << g2_projective_bn254::to_affine(g2_result) << std::endl;
cudaFree(scalars_d);
cudaFree(g2_points_d);
cudaFree(g2_result_d);
delete[] g2_points;
delete[] scalars;
cudaStreamDestroy(stream);
return 0;
}
int msm_bls12_377(int argc, char* argv[])
{
std::cout << "Icicle example: Muli-Scalar Multiplication (MSM)" << std::endl;
std::cout << "Example parameters" << std::endl;
int batch_size = 1;
std::cout << "Batch size: " << batch_size << std::endl;
unsigned msm_size = 1048576;
std::cout << "MSM size: " << msm_size << std::endl;
int N = batch_size * msm_size;
std::cout << "Part I: use G1 points" << std::endl;
std::cout << "Generating random inputs on-host" << std::endl;
scalar_bls12377* scalars = new scalar_bls12377[N];
affine_bls12377* points = new affine_bls12377[N];
projective_bls12377 result;
scalar_bls12377::rand_host_many(scalars, N);
projective_bls12377::rand_host_many_affine(points, N);
std::cout << "Using default MSM configuration with on-host inputs" << std::endl;
device_context::DeviceContext ctx = device_context::get_default_device_context();
msm::MSMConfig config = {
ctx, // ctx
0, // points_size
1, // precompute_factor
0, // c
0, // bitsize
10, // large_bucket_factor
1, // batch_size
false, // are_scalars_on_device
false, // are_scalars_montgomery_form
false, // are_points_on_device
false, // are_points_montgomery_form
false, // are_results_on_device
false, // is_big_triangle
false, // is_async
};
config.batch_size = batch_size;
std::cout << "Running MSM kernel with on-host inputs" << std::endl;
cudaStream_t stream = config.ctx.stream;
// Execute the MSM kernel
bls12_377_msm_cuda(scalars, points, msm_size, config, &result);
std::cout << projective_bls12377::to_affine(result) << std::endl;
std::cout << "Copying inputs on-device" << std::endl;
scalar_bls12377* scalars_d_bls;
affine_bls12377* points_d_bls;
projective_bls12377* result_d_bls;
cudaMalloc(&scalars_d_bls, sizeof(scalar_bls12377) * N);
cudaMalloc(&points_d_bls, sizeof(affine_bls12377) * N);
cudaMalloc(&result_d_bls, sizeof(projective_bls12377));
cudaMemcpy(scalars_d_bls, scalars, sizeof(scalar_bls12377) * N, cudaMemcpyHostToDevice);
cudaMemcpy(points_d_bls, points, sizeof(affine_bls12377) * N, cudaMemcpyHostToDevice);
std::cout << "Reconfiguring MSM to use on-device inputs" << std::endl;
config.are_results_on_device = true;
config.are_scalars_on_device = true;
config.are_points_on_device = true;
std::cout << "Running MSM kernel with on-device inputs" << std::endl;
// Execute the MSM kernel
bls12_377_msm_cuda(scalars_d_bls, points_d_bls, msm_size, config, result_d_bls);
// Copy the result back to the host
cudaMemcpy(&result, result_d_bls, sizeof(projective_bls12377), cudaMemcpyDeviceToHost);
// Print the result
std::cout << projective_bls12377::to_affine(result) << std::endl;
// Free the device memory
cudaFree(scalars_d_bls);
cudaFree(points_d_bls);
cudaFree(result_d_bls);
// Free the host memory, keep scalars for G2 example
delete[] points;
std::cout << "Part II: use G2 points" << std::endl;
std::cout << "Generating random inputs on-host" << std::endl;
// use the same scalars
g2_affine_bls12377* g2_points = new g2_affine_bls12377[N];
g2_projective_bls12377::rand_host_many_affine(g2_points, N);
std::cout << "Reconfiguring MSM to use on-host inputs" << std::endl;
config.are_results_on_device = false;
config.are_scalars_on_device = false;
config.are_points_on_device = false;
g2_projective_bls12377 g2_result;
bls12_377_g2_msm_cuda(scalars, g2_points, msm_size, config, &g2_result);
std::cout << g2_projective_bls12377::to_affine(g2_result) << std::endl;
std::cout << "Copying inputs on-device" << std::endl;
g2_affine_bls12377* g2_points_d;
g2_projective_bls12377* g2_result_d;
cudaMalloc(&scalars_d_bls, sizeof(scalar_bls12377) * N);
cudaMalloc(&g2_points_d, sizeof(g2_affine_bls12377) * N);
cudaMalloc(&g2_result_d, sizeof(g2_projective_bls12377));
cudaMemcpy(scalars_d_bls, scalars, sizeof(scalar_bls12377) * N, cudaMemcpyHostToDevice);
cudaMemcpy(g2_points_d, g2_points, sizeof(g2_affine_bls12377) * N, cudaMemcpyHostToDevice);
std::cout << "Reconfiguring MSM to use on-device inputs" << std::endl;
config.are_results_on_device = true;
config.are_scalars_on_device = true;
config.are_points_on_device = true;
std::cout << "Running MSM kernel with on-device inputs" << std::endl;
bls12_377_g2_msm_cuda(scalars_d_bls, g2_points_d, msm_size, config, g2_result_d);
cudaMemcpy(&g2_result, g2_result_d, sizeof(g2_projective_bls12377), cudaMemcpyDeviceToHost);
std::cout << g2_projective_bls12377::to_affine(g2_result) << std::endl;
cudaFree(scalars_d_bls);
cudaFree(g2_points_d);
cudaFree(g2_result_d);
delete[] g2_points;
delete[] scalars;
return 0;
}
int main(int argc, char* argv[])
{
std::cout << "Starting BN254 MSM" << std::endl;
msm_bn254(argc, argv);
std::cout << "Starting BLS12-377 MSM" << std::endl;
msm_bls12_377(argc, argv);
return 0;
}
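Since each curve lives in its own namespace and exposes its own entry point, the two functions above differ only in types and in which `*_msm_cuda` they call. A minimal sketch of folding that duplication into an overload set (hypothetical wrappers; argument types inferred from the calls above rather than from the headers):

// Hypothetical overloads (not part of the example); one msm() per curve.
cudaError_t msm(scalar_bn254* scalars, affine_bn254* points, unsigned size,
                msm::MSMConfig& config, projective_bn254* result)
{
  return bn254_msm_cuda(scalars, points, size, config, result);
}
cudaError_t msm(scalar_bls12377* scalars, affine_bls12377* points, unsigned size,
                msm::MSMConfig& config, projective_bls12377* result)
{
  return bls12_377_msm_cuda(scalars, points, size, config, result);
}
// Call sites then read identically for both curves:
// msm(scalars, points, msm_size, config, &result);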

View File

@@ -18,7 +18,7 @@ add_executable(
example.cu
)
target_include_directories(example PRIVATE "../../../icicle/include")
target_link_libraries(example ${CMAKE_SOURCE_DIR}/build/icicle/lib/libingo_field_bn254.a)
target_link_libraries(example ${CMAKE_SOURCE_DIR}/build/icicle/lib/libingo_field_bn254.a ${CMAKE_SOURCE_DIR}/build/icicle/lib/libingo_field_bls12_377.a)
find_library(NVML_LIBRARY nvidia-ml PATHS /usr/local/cuda/targets/x86_64-linux/lib/stubs/ )
target_link_libraries(example ${NVML_LIBRARY})
set_target_properties(example PROPERTIES CUDA_SEPARABLE_COMPILATION ON)

View File

@@ -7,9 +7,13 @@ mkdir -p build/example
mkdir -p build/icicle
# Configure and build Icicle
cmake -S ../../../icicle/ -B build/icicle -DMSM=OFF -DCMAKE_BUILD_TYPE=Release -DCURVE=bls12_377
cmake --build build/icicle
rm build/icicle/CMakeCache.txt
cmake -S ../../../icicle/ -B build/icicle -DMSM=OFF -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254
cmake --build build/icicle
# Configure and build the example application
cmake -S . -B build/example
cmake --build build/example

View File

@@ -4,14 +4,17 @@
#include <nvml.h>
#include "api/bn254.h"
#include "api/bls12_377.h"
#include "vec_ops/vec_ops.cuh"
using namespace vec_ops;
using namespace bn254;
// using namespace bn254;
typedef bn254::scalar_t T;
typedef scalar_t T;
typedef bls12_377::scalar_t T_bls;
int vector_mult(T* vec_b, T* vec_a, T* vec_result, size_t n_elments, device_context::DeviceContext ctx)
int vector_mult_bn254(T* vec_b, T* vec_a, T* vec_result, size_t n_elments, device_context::DeviceContext ctx)
{
vec_ops::VecOpsConfig config = vec_ops::DefaultVecOpsConfig();
config.is_a_on_device = true;
@@ -25,10 +28,24 @@ int vector_mult(T* vec_b, T* vec_a, T* vec_result, size_t n_elments, device_cont
return 0;
}
int vector_mult_bls12377(T_bls* vec_b, T_bls* vec_a, T_bls* vec_result, size_t n_elments, device_context::DeviceContext ctx)
{
vec_ops::VecOpsConfig config = vec_ops::DefaultVecOpsConfig();
config.is_a_on_device = true;
config.is_b_on_device = true;
config.is_result_on_device = true;
cudaError_t err = bls12_377_mul_cuda(vec_a, vec_b, n_elments, config, vec_result);
if (err != cudaSuccess) {
std::cerr << "Failed to multiply vectors - " << cudaGetErrorString(err) << std::endl;
return 0;
}
return 0;
}
int main(int argc, char** argv)
{
const unsigned vector_size = 1 << 15;
const unsigned repetitions = 1 << 15;
const unsigned repetitions = 1;
cudaError_t err;
nvmlInit();
@@ -61,41 +78,53 @@ int main(int argc, char** argv)
// host data
T* host_in1 = (T*)malloc(vector_size * sizeof(T));
T* host_in2 = (T*)malloc(vector_size * sizeof(T));
T_bls* host_in1_bls12377 = (T_bls*)malloc(vector_size * sizeof(T_bls));
T_bls* host_in2_bls12377 = (T_bls*)malloc(vector_size * sizeof(T_bls));
std::cout << "Initializing vectors with random data" << std::endl;
T::rand_host_many(host_in1, vector_size);
T::rand_host_many(host_in2, vector_size);
T_bls::rand_host_many(host_in1_bls12377, vector_size);
T_bls::rand_host_many(host_in2_bls12377, vector_size);
// device data
device_context::DeviceContext ctx = device_context::get_default_device_context();
T* device_in1;
T* device_in2;
T* device_out;
T* device_in1_bn254;
T* device_in2_bn254;
T* device_out_bn254;
T_bls* device_in1_bls12377;
T_bls* device_in2_bls12377;
T_bls* device_out_bls12377;
err = cudaMalloc((void**)&device_in1, vector_size * sizeof(T));
err = cudaMalloc((void**)&device_in1_bn254, vector_size * sizeof(T));
err = cudaMalloc((void**)&device_in1_bls12377, vector_size * sizeof(T_bls));
if (err != cudaSuccess) {
std::cerr << "Failed to allocate device memory - " << cudaGetErrorString(err) << std::endl;
return 0;
}
err = cudaMalloc((void**)&device_in2, vector_size * sizeof(T));
err = cudaMalloc((void**)&device_in2_bn254, vector_size * sizeof(T));
err = cudaMalloc((void**)&device_in2_bls12377, vector_size * sizeof(T_bls));
if (err != cudaSuccess) {
std::cerr << "Failed to allocate device memory - " << cudaGetErrorString(err) << std::endl;
return 0;
}
err = cudaMalloc((void**)&device_out, vector_size * sizeof(T));
err = cudaMalloc((void**)&device_out_bn254, vector_size * sizeof(T));
err = cudaMalloc((void**)&device_out_bls12377, vector_size * sizeof(T_bls));
if (err != cudaSuccess) {
std::cerr << "Failed to allocate device memory - " << cudaGetErrorString(err) << std::endl;
return 0;
}
// copy from host to device
err = cudaMemcpy(device_in1, host_in1, vector_size * sizeof(T), cudaMemcpyHostToDevice);
err = cudaMemcpy(device_in1_bn254, host_in1, vector_size * sizeof(T), cudaMemcpyHostToDevice);
err = cudaMemcpy(device_in1_bls12377, host_in1_bls12377, vector_size * sizeof(T_bls), cudaMemcpyHostToDevice);
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host to device - " << cudaGetErrorString(err) << std::endl;
return 0;
}
err = cudaMemcpy(device_in2, host_in2, vector_size * sizeof(T), cudaMemcpyHostToDevice);
err = cudaMemcpy(device_in2_bn254, host_in2, vector_size * sizeof(T), cudaMemcpyHostToDevice);
err = cudaMemcpy(device_in2_bls12377, host_in2_bls12377, vector_size * sizeof(T_bls), cudaMemcpyHostToDevice);
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host to device - " << cudaGetErrorString(err) << std::endl;
return 0;
@@ -104,7 +133,10 @@ int main(int argc, char** argv)
std::cout << "Starting warm-up" << std::endl;
// Warm-up loop
for (int i = 0; i < repetitions; i++) {
vector_mult(device_in1, device_in2, device_out, vector_size, ctx);
std::cout << "bn254 mult" << std::endl;
vector_mult_bn254(device_in1_bn254, device_in2_bn254, device_out_bn254, vector_size, ctx);
std::cout << "bls12-377 mult" << std::endl;
vector_mult_bls12377(device_in1_bls12377, device_in2_bls12377, device_out_bls12377, vector_size, ctx);
}
std::cout << "Starting benchmarking" << std::endl;
@@ -122,7 +154,7 @@ int main(int argc, char** argv)
auto start_time = std::chrono::high_resolution_clock::now();
// Benchmark loop
for (int i = 0; i < repetitions; i++) {
vector_mult(device_in1, device_in2, device_out, vector_size, ctx);
vector_mult_bn254(device_in1_bn254, device_in2_bn254, device_out_bn254, vector_size, ctx);
}
auto end_time = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);
@@ -146,7 +178,7 @@ int main(int argc, char** argv)
// Optional: validate multiplication
T* host_out = (T*)malloc(vector_size * sizeof(T));
cudaMemcpy(host_out, device_out, vector_size * sizeof(T), cudaMemcpyDeviceToHost);
cudaMemcpy(host_out, device_out_bn254, vector_size * sizeof(T), cudaMemcpyDeviceToHost);
// validate multiplication here...
@@ -154,9 +186,9 @@ int main(int argc, char** argv)
free(host_in1);
free(host_in2);
free(host_out);
cudaFree(device_in1);
cudaFree(device_in2);
cudaFree(device_out);
cudaFree(device_in1_bn254);
cudaFree(device_in2_bn254);
cudaFree(device_out_bn254);
nvmlShutdown();
return 0;
}
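The `// validate multiplication here...` placeholder above can be filled with a host-side reference check. A minimal sketch, assuming `T` supports multiplication and inequality on the host; the helper name is ours, not the example's:

// Hypothetical check: recompute each product on the host and compare
// against the values copied back from the device.
bool validate_mul(const T* a, const T* b, const T* out, unsigned n)
{
  for (unsigned i = 0; i < n; i++) {
    if (a[i] * b[i] != out[i]) return false;
  }
  return true;
}
// e.g.: validate_mul(host_in1, host_in2, host_out, vector_size)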

View File

@@ -11,17 +11,6 @@
using namespace hash;
namespace keccak {
// Input rate in bytes
const int KECCAK_256_RATE = 136;
const int KECCAK_512_RATE = 72;
// Digest size in u64
const int KECCAK_256_DIGEST = 4;
const int KECCAK_512_DIGEST = 8;
// Number of state elements in u64
const int KECCAK_STATE_SIZE = 25;
class Keccak : public Hasher<uint8_t, uint64_t>
{
public:
@@ -33,7 +22,7 @@ namespace keccak {
unsigned int output_len,
const device_context::DeviceContext& ctx) const override;
Keccak(unsigned int rate) : Hasher<uint8_t, uint64_t>(KECCAK_STATE_SIZE, KECCAK_STATE_SIZE, rate, 0) {}
Keccak(unsigned int rate) : Hasher<uint8_t, uint64_t>(25, 25, rate, 0) {}
};
} // namespace keccak

View File

@@ -27,8 +27,6 @@ namespace vec_ops {
* non-blocking and you'd need to synchronize it explicitly by running
* `cudaStreamSynchronize` or `cudaDeviceSynchronize`. If set to false, the
* function will block the current CPU thread. */
bool is_in_montgomery_form; /**< If true then vec_a, vec_b and result are in montgomery form.
* Default value: false. */
};
/**
@@ -44,7 +42,6 @@ namespace vec_ops {
false, // is_b_on_device
false, // is_result_on_device
false, // is_async
false, // is_in_montgomery_form
};
return config;
}
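With `is_in_montgomery_form` removed from the config, a caller holding Montgomery-form device data would convert explicitly around the vector op. A sketch of that caller-side pattern, reusing the `mont::from_montgomery` / `mont::to_montgomery` helpers that the removed branches called (device pointers and a live stream assumed; argument order copied from the removed code):

// Sketch only: convert inputs out of Montgomery form in-place, multiply,
// then convert the result back, all on the caller's side.
mont::from_montgomery(d_vec_a, n * sizeof(E), stream, d_vec_a);
mont::from_montgomery(d_vec_b, n * sizeof(E), stream, d_vec_b);
bn254_mul_cuda(d_vec_a, d_vec_b, n, config, d_result);
mont::to_montgomery(d_result, n * sizeof(E), stream, d_result);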

View File

@@ -11,15 +11,13 @@ namespace keccak {
extern "C" cudaError_t
keccak256_cuda(uint8_t* input, int input_block_size, int number_of_blocks, uint8_t* output, HashConfig& config)
{
return Keccak(KECCAK_256_RATE)
.hash_many(input, (uint64_t*)output, number_of_blocks, input_block_size, KECCAK_256_DIGEST, config);
return Keccak(136).hash_many(input, (uint64_t*)output, number_of_blocks, input_block_size, 4, config);
}
extern "C" cudaError_t
keccak512_cuda(uint8_t* input, int input_block_size, int number_of_blocks, uint8_t* output, HashConfig& config)
{
return Keccak(KECCAK_512_RATE)
.hash_many(input, (uint64_t*)output, number_of_blocks, input_block_size, KECCAK_512_DIGEST, config);
return Keccak(72).hash_many(input, (uint64_t*)output, number_of_blocks, input_block_size, 8, config);
}
extern "C" cudaError_t build_keccak256_merkle_tree_cuda(
@@ -29,7 +27,7 @@ namespace keccak {
unsigned int input_block_len,
const merkle_tree::TreeBuilderConfig& tree_config)
{
Keccak keccak(KECCAK_256_RATE);
Keccak keccak(136);
return merkle_tree::build_merkle_tree<uint8_t, uint64_t>(
leaves, digests, height, input_block_len, keccak, keccak, tree_config);
}
@@ -41,7 +39,7 @@ namespace keccak {
unsigned int input_block_len,
const merkle_tree::TreeBuilderConfig& tree_config)
{
Keccak keccak(KECCAK_512_RATE);
Keccak keccak(72);
return merkle_tree::build_merkle_tree<uint8_t, uint64_t>(
leaves, digests, height, input_block_len, keccak, keccak, tree_config);
}

View File

@@ -161,7 +161,7 @@ namespace keccak {
0x8000000000008002, 0x8000000000000080, 0x000000000000800a, 0x800000008000000a,
0x8000000080008081, 0x8000000000008080, 0x0000000080000001, 0x8000000080008008};
__device__ void keccakf(u64 s[KECCAK_STATE_SIZE])
__device__ void keccakf(u64 s[25])
{
u64 t0, t1, t2, t3, t4;
@@ -188,7 +188,7 @@ namespace keccak {
const uint8_t* b_input = input + sid * input_block_size;
uint64_t* b_output = output + sid * output_len;
uint64_t state[KECCAK_STATE_SIZE] = {}; // Initialize with zeroes
uint64_t state[25] = {}; // Initialize with zeroes
int input_len = input_block_size;
@@ -238,12 +238,12 @@ namespace keccak {
int number_of_gpu_blocks = (number_of_states - 1) / number_of_threads + 1;
switch (rate) {
case KECCAK_256_RATE:
keccak_hash_blocks<KECCAK_256_RATE><<<number_of_gpu_blocks, number_of_threads, 0, ctx.stream>>>(
case 136:
keccak_hash_blocks<136><<<number_of_gpu_blocks, number_of_threads, 0, ctx.stream>>>(
input, input_len, output_len, number_of_states, output);
break;
case KECCAK_512_RATE:
keccak_hash_blocks<KECCAK_512_RATE><<<number_of_gpu_blocks, number_of_threads, 0, ctx.stream>>>(
case 72:
keccak_hash_blocks<72><<<number_of_gpu_blocks, number_of_threads, 0, ctx.stream>>>(
input, input_len, output_len, number_of_states, output);
break;
default:
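The literals 136, 72, 4, and 8 that replace the named constants follow from the sponge parameters: a 25-word (200-byte) state, capacity equal to twice the digest size, rate equal to state minus capacity, and digests counted in u64 words. A compile-time check of those derivations:

// Keccak with a 200-byte state (25 x u64):
static_assert(200 - 2 * (256 / 8) == 136, "Keccak-256 rate in bytes");
static_assert(200 - 2 * (512 / 8) == 72, "Keccak-512 rate in bytes");
static_assert(256 / 64 == 4, "Keccak-256 digest in u64 words");
static_assert(512 / 64 == 8, "Keccak-512 digest in u64 words");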

View File

@@ -95,64 +95,25 @@ namespace vec_ops {
E *d_result, *d_alloc_vec_a, *d_alloc_vec_b;
E* d_vec_a;
const E* d_vec_b;
int is_d_alloc_vec_a_allocated = 0;
if (!config.is_a_on_device) {
if (config.is_in_montgomery_form) {
CHK_IF_RETURN(cudaMallocAsync(&d_alloc_vec_a, n * sizeof(E), config.ctx.stream));
CHK_IF_RETURN(cudaMemcpyAsync(d_alloc_vec_a, vec_a, n * sizeof(E), cudaMemcpyHostToDevice, config.ctx.stream));
CHK_IF_RETURN(mont::from_montgomery(d_alloc_vec_a, n * sizeof(E), config.ctx.stream, d_alloc_vec_a));
is_d_alloc_vec_a_allocated = 1;
d_vec_a = d_alloc_vec_a;
} else {
CHK_IF_RETURN(cudaMallocAsync(&d_alloc_vec_a, n * sizeof(E), config.ctx.stream));
CHK_IF_RETURN(cudaMemcpyAsync(d_alloc_vec_a, vec_a, n * sizeof(E), cudaMemcpyHostToDevice, config.ctx.stream));
is_d_alloc_vec_a_allocated = 1;
d_vec_a = d_alloc_vec_a;
}
CHK_IF_RETURN(cudaMallocAsync(&d_alloc_vec_a, n * sizeof(E), config.ctx.stream));
CHK_IF_RETURN(cudaMemcpyAsync(d_alloc_vec_a, vec_a, n * sizeof(E), cudaMemcpyHostToDevice, config.ctx.stream));
d_vec_a = d_alloc_vec_a;
} else {
if (config.is_in_montgomery_form) {
CHK_IF_RETURN(cudaMallocAsync(
&d_alloc_vec_a, n * sizeof(E), config.ctx.stream)); // Allocate in order not to change the input.
CHK_IF_RETURN(mont::from_montgomery(vec_a, n * sizeof(E), config.ctx.stream, d_alloc_vec_a));
is_d_alloc_vec_a_allocated = 1;
d_vec_a = d_alloc_vec_a;
} else {
d_vec_a = vec_a;
}
d_vec_a = vec_a;
}
int is_d_alloc_vec_b_allocated = 0;
if (!config.is_b_on_device) {
if (config.is_in_montgomery_form) {
CHK_IF_RETURN(cudaMallocAsync(&d_alloc_vec_b, n * sizeof(E), config.ctx.stream));
CHK_IF_RETURN(cudaMemcpyAsync(d_alloc_vec_b, vec_b, n * sizeof(E), cudaMemcpyHostToDevice, config.ctx.stream));
CHK_IF_RETURN(mont::from_montgomery(d_alloc_vec_b, n * sizeof(E), config.ctx.stream, d_alloc_vec_b));
is_d_alloc_vec_b_allocated = 1;
d_vec_b = d_alloc_vec_b;
} else {
CHK_IF_RETURN(cudaMallocAsync(&d_alloc_vec_b, n * sizeof(E), config.ctx.stream));
CHK_IF_RETURN(cudaMemcpyAsync(d_alloc_vec_b, vec_b, n * sizeof(E), cudaMemcpyHostToDevice, config.ctx.stream));
is_d_alloc_vec_b_allocated = 1;
d_vec_b = d_alloc_vec_b;
}
CHK_IF_RETURN(cudaMallocAsync(&d_alloc_vec_b, n * sizeof(E), config.ctx.stream));
CHK_IF_RETURN(cudaMemcpyAsync(d_alloc_vec_b, vec_b, n * sizeof(E), cudaMemcpyHostToDevice, config.ctx.stream));
d_vec_b = d_alloc_vec_b;
} else {
if (config.is_in_montgomery_form) {
CHK_IF_RETURN(cudaMallocAsync(
&d_alloc_vec_b, n * sizeof(E), config.ctx.stream)); // Allocate in order not to change the input.
CHK_IF_RETURN(mont::from_montgomery(vec_b, n * sizeof(E), config.ctx.stream, d_alloc_vec_b));
is_d_alloc_vec_b_allocated = 1;
d_vec_b = d_alloc_vec_b;
} else {
d_vec_b = vec_b;
}
d_vec_b = vec_b;
}
int is_d_result_allocated = 0;
if (!config.is_result_on_device) {
if (!is_in_place) {
CHK_IF_RETURN(cudaMallocAsync(&d_result, n * sizeof(E), config.ctx.stream));
is_d_result_allocated = 1;
} else {
d_result = d_vec_a;
}
@@ -168,21 +129,12 @@ namespace vec_ops {
Kernel<<<num_blocks, num_threads, 0, config.ctx.stream>>>(d_vec_a, d_vec_b, n, d_result);
if (!config.is_result_on_device) {
if (config.is_in_montgomery_form) {
CHK_IF_RETURN(mont::to_montgomery(d_result, n * sizeof(E), config.ctx.stream, d_result)); // Convert in-place.
CHK_IF_RETURN(cudaMemcpyAsync(result, d_result, n * sizeof(E), cudaMemcpyDeviceToHost, config.ctx.stream));
} else {
CHK_IF_RETURN(cudaMemcpyAsync(result, d_result, n * sizeof(E), cudaMemcpyDeviceToHost, config.ctx.stream));
}
} else {
if (config.is_in_montgomery_form) {
CHK_IF_RETURN(mont::to_montgomery(d_result, n * sizeof(E), config.ctx.stream, d_result)); // Convert in-place.
}
CHK_IF_RETURN(cudaMemcpyAsync(result, d_result, n * sizeof(E), cudaMemcpyDeviceToHost, config.ctx.stream));
CHK_IF_RETURN(cudaFreeAsync(d_result, config.ctx.stream));
}
if (is_d_alloc_vec_a_allocated) { CHK_IF_RETURN(cudaFreeAsync(d_alloc_vec_a, config.ctx.stream)); }
if (is_d_alloc_vec_b_allocated) { CHK_IF_RETURN(cudaFreeAsync(d_alloc_vec_b, config.ctx.stream)); }
if (is_d_result_allocated) { CHK_IF_RETURN(cudaFreeAsync(d_result, config.ctx.stream)); }
if (!config.is_a_on_device && !is_in_place) { CHK_IF_RETURN(cudaFreeAsync(d_alloc_vec_a, config.ctx.stream)); }
if (!config.is_b_on_device) { CHK_IF_RETURN(cudaFreeAsync(d_alloc_vec_b, config.ctx.stream)); }
if (!config.is_async) return CHK_STICKY(cudaStreamSynchronize(config.ctx.stream));

View File

@@ -28,8 +28,6 @@ type VecOpsConfig struct {
* non-blocking and you'll need to synchronize it explicitly by calling
* `SynchronizeStream`. If set to false, the function will block the current CPU thread. */
IsAsync bool
/* If true then vec_a, vec_b and result are in montgomery form. Default value: false. */
IsInMontgomeryForm bool
}
/**
@@ -44,7 +42,6 @@ func DefaultVecOpsConfig() VecOpsConfig {
false, // isBOnDevice
false, // isResultOnDevice
false, // IsAsync
false, // IsInMontgomeryForm
}
return config

View File

@@ -15,7 +15,6 @@ func TestVecOpsDefaultConfig(t *testing.T) {
false, // isBOnDevice
false, // isResultOnDevice
false, // IsAsync
false, // IsInMontgomeryForm
}
actual := DefaultVecOpsConfig()

View File

@@ -18,7 +18,7 @@ exclude = [
]
[workspace.package]
version = "2.8.0"
version = "2.7.1"
edition = "2021"
authors = [ "Ingonyama" ]
homepage = "https://www.ingonyama.com"

View File

@@ -20,8 +20,6 @@ pub struct VecOpsConfig<'a> {
/// Whether to run the vector operations asynchronously. If set to `true`, the functions will be non-blocking and you'd need to synchronize
/// it explicitly by running `stream.synchronize()`. If set to false, the functions will block the current CPU thread.
pub is_async: bool,
/// If true then vec_a, vec_b and result are in montgomery form. Default value: false.
pub is_in_montgomery_form: bool,
}
impl<'a> Default for VecOpsConfig<'a> {
@@ -38,7 +36,6 @@ impl<'a> VecOpsConfig<'a> {
is_b_on_device: false,
is_result_on_device: false,
is_async: false,
is_in_montgomery_form: false,
}
}
}

View File

@@ -24,6 +24,6 @@ pub(crate) mod tests {
let ctx = device_context::DeviceContext::default();
let sponge = Poseidon::load(2, &ctx).unwrap();
check_build_field_merkle_tree::<_, _, Bls12_377TreeBuilder>(18, 2, &sponge, &sponge, ScalarField::zero());
check_build_field_merkle_tree::<_, _, Bls12_377TreeBuilder>(25, 2, &sponge, &sponge, ScalarField::zero());
}
}

View File

@@ -24,6 +24,6 @@ pub(crate) mod tests {
let ctx = device_context::DeviceContext::default();
let sponge = Poseidon::load(2, &ctx).unwrap();
check_build_field_merkle_tree::<_, _, Bls12_381TreeBuilder>(18, 2, &sponge, &sponge, ScalarField::zero());
check_build_field_merkle_tree::<_, _, Bls12_381TreeBuilder>(25, 2, &sponge, &sponge, ScalarField::zero());
}
}