Compare commits

...

12 Commits

Author SHA1 Message Date
Leon Hibnik
080071870f Update wrappers/rust/icicle-core/src/vec_ops/mod.rs 2024-08-06 13:52:55 +03:00
Leon Hibnik
1fc640c8b3 Apply suggestions from code review 2024-08-06 13:51:00 +03:00
danny-shterman
b10ee73b5e Remove ## Best-Practices section 2024-08-04 06:19:10 +00:00
danny-shterman
117ee4c4a4 Fix formatting. 2024-07-31 14:24:25 +00:00
danny-shterman
993652ae8d Formatting for rust and golang. 2024-07-31 13:47:45 +00:00
danny-shterman
9e452c47d0 Add montgomery config in rust and golang files. 2024-07-31 09:35:53 +00:00
danny-shterman
132abbb2d4 Fixed style by clang-format 2024-07-30 05:39:32 +00:00
danny-shterman
118c82f829 Fixed style by clang-format 2024-07-30 05:38:30 +00:00
danny-shterman
7e2b42f756 Add montgomery to vec_ops and example of that 2024-07-29 06:39:49 +00:00
omahs
53f34aade5 Fix typos (#558)
Fix typos
2024-07-18 11:58:04 +03:00
release-bot
aacec3f72f Bump rust crates' version
icicle-babybear@2.8.0
icicle-bls12-377@2.8.0
icicle-bls12-381@2.8.0
icicle-bn254@2.8.0
icicle-bw6-761@2.8.0
icicle-core@2.8.0
icicle-cuda-runtime@2.8.0
icicle-grumpkin@2.8.0
icicle-hash@2.8.0
icicle-m31@2.8.0
icicle-stark252@2.8.0

Generated by cargo-workspaces
2024-07-16 13:57:56 +00:00
ChickenLover
a8fa05d0e3 Feat/roman/hash docs (#556)
## Describe the changes

This PR...

## Linked Issues

Resolves #

---------

Co-authored-by: Jeremy Felder <jeremy.felder1@gmail.com>
2024-07-16 16:39:35 +03:00
28 changed files with 1145 additions and 96 deletions

View File

@@ -1,6 +1,6 @@
# Contributor's Guide
We welcome all contributions with open arms. At Ingonyama we take a village approach, believing it takes many hands and minds to build a ecosystem.
We welcome all contributions with open arms. At Ingonyama we take a village approach, believing it takes many hands and minds to build an ecosystem.
## Contributing to ICICLE
@@ -14,9 +14,9 @@ We welcome all contributions with open arms. At Ingonyama we take a village appr
When opening a [pull request](https://github.com/ingonyama-zk/icicle/pulls) please keep the following in mind.
- `Clear Purpose` - The pull request should solve a single issue and be clean of any unrelated changes.
- `Clear description` - If the pull request is for a new feature describe what you built, why you added it and how its best that we test it. For bug fixes please describe the issue and the solution.
- `Clear description` - If the pull request is for a new feature describe what you built, why you added it and how it's best that we test it. For bug fixes please describe the issue and the solution.
- `Consistent style` - Rust and Golang code should be linted by the official linters (golang fmt and rust fmt) and maintain a proper style. For CUDA and C++ code we use [`clang-format`](https://github.com/ingonyama-zk/icicle/blob/main/.clang-format), [here](https://github.com/ingonyama-zk/icicle/blob/605c25f9d22135c54ac49683b710fe2ce06e2300/.github/workflows/main-format.yml#L46) you can see how we run it.
- `Minimal Tests` - please add test which cover basic usage of your changes .
- `Minimal Tests` - please add test which cover basic usage of your changes.
## Questions?

View File

@@ -47,7 +47,7 @@ type NTTConfig[T any] struct {
- **`areInputsOnDevice`**: Indicates if input scalars are located on the device.
- **`areOutputsOnDevice`**: Indicates if results are stored on the device.
- **`IsAsync`**: Controls whether the NTT operation runs asynchronously.
- **`NttAlgorithm`**: Explicitly select the NTT algorithm. ECNTT supports running on `Radix2` algoruithm.
- **`NttAlgorithm`**: Explicitly select the NTT algorithm. ECNTT supports running on `Radix2` algorithm.
### Default Configuration

View File

@@ -139,7 +139,7 @@ cfg.Ctx.IsBigTriangle = true
Toggling between MSM modes occurs automatically based on the number of results you are expecting from the `MSM` function.
The number of results is interpreted from the size of `var out core.DeviceSlice`. Thus its important when allocating memory for `var out core.DeviceSlice` to make sure that you are allocating `<number of results> X <size of a single point>`.
The number of results is interpreted from the size of `var out core.DeviceSlice`. Thus it's important when allocating memory for `var out core.DeviceSlice` to make sure that you are allocating `<number of results> X <size of a single point>`.
```go
...
@@ -168,7 +168,7 @@ import (
)
```
This package include `G2Projective` and `G2Affine` points as well as a `G2Msm` method.
This package includes `G2Projective` and `G2Affine` points as well as a `G2Msm` method.
```go
package main

View File

@@ -171,7 +171,7 @@ Polynomial& add_monomial_inplace(Coeff monomial_coeff, uint64_t monomial = 0);
Polynomial& sub_monomial_inplace(Coeff monomial_coeff, uint64_t monomial = 0);
```
The ability to add or subtract monomials directly and in-place is an efficient way to manipualte polynomials.
The ability to add or subtract monomials directly and in-place is an efficient way to manipulate polynomials.
Example:

View File

@@ -14,9 +14,62 @@ At its core, Keccak consists of a permutation function operating on a state arra
## Using Keccak
ICICLE Keccak supports batch hashing, which can be utilized for constructing a merkle tree.
ICICLE Keccak supports batch hashing, which can be utilized for constructing a merkle tree or running multiple hashes in parallel.
### Supported Bindings
- [Golang](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang/hash/keccak)
- [Rust](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/rust/icicle-hash)
- [Rust](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/rust/icicle-hash)
### Example usage
This is an example of running 1024 Keccak-256 hashes in parallel, where input strings are of size 136 bytes:
```rust
use icicle_core::hash::HashConfig;
use icicle_cuda_runtime::memory::HostSlice;
use icicle_hash::keccak::keccak256;
let config = HashConfig::default();
let input_block_len = 136;
let number_of_hashes = 1024;
let preimages = vec![1u8; number_of_hashes * input_block_len];
let mut digests = vec![0u8; number_of_hashes * 64];
let preimages_slice = HostSlice::from_slice(&preimages);
let digests_slice = HostSlice::from_mut_slice(&mut digests);
keccak256(
preimages_slice,
input_block_len as u32,
number_of_hashes as u32,
digests_slice,
&config,
)
.unwrap();
```
### Merkle Tree
You can build a keccak merkle tree using the corresponding functions:
```rust
use icicle_core::tree::{merkle_tree_digests_len, TreeBuilderConfig};
use icicle_cuda_runtime::memory::HostSlice;
use icicle_hash::keccak::build_keccak256_merkle_tree;
let mut config = TreeBuilderConfig::default();
config.arity = 2;
let height = 22;
let input_block_len = 136;
let leaves = vec![1u8; (1 << height) * input_block_len];
let mut digests = vec![0u64; merkle_tree_digests_len((height + 1) as u32, 2, 1)];
let leaves_slice = HostSlice::from_slice(&leaves);
let digests_slice = HostSlice::from_mut_slice(&mut digests);
build_keccak256_merkle_tree(leaves_slice, digests_slice, height, input_block_len, &config).unwrap();
```
In the example above, a binary tree of height 22 is built. Each leaf is a 136-byte array. The leaves and digests are laid out in flat arrays. You can also use Keccak-512 via the `build_keccak512_merkle_tree` function.
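The size of the flat digest array can be reasoned about with a little arithmetic. The following is a hedged sketch of that bookkeeping for a full tree that keeps every row — a hypothetical helper, not the library's `merkle_tree_digests_len` implementation (which may differ, e.g. when `keep_rows` trims rows):

```rust
// Hypothetical helper mirroring the digest-count bookkeeping described above:
// row r of a full tree holds arity^r digests, each of `digest_elements` elements.
fn full_tree_digests_len(rows: u32, arity: u64, digest_elements: u64) -> u64 {
    (0..rows).map(|r| arity.pow(r) * digest_elements).sum()
}

fn main() {
    // height 22, arity 2 => 23 rows (leaf digests through the root),
    // i.e. 2^23 - 1 digests in total
    assert_eq!(full_tree_digests_len(23, 2, 1), (1u64 << 23) - 1);
}
```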

View File

@@ -1,6 +1,6 @@
# MSM - Multi scalar multiplication
MSM stands for Multi scalar multiplication, its defined as:
MSM stands for Multi scalar multiplication, it's defined as:
$$MSM(a, G) = \sum_{j=0}^{n} a_j \cdot G_j$$
@@ -43,7 +43,7 @@ $a_0, \ldots, a_n$ - Scalars
$MSM(a, G) \in G$ - a single EC (elliptic curve) point
In words, MSM is the sum of scalar and EC point multiplications. We can see from this definition that the core operations occurring are Modular Multiplication and Elliptic curve point addition. Its obvious that multiplication can be computed in parallel and then the products summed, making MSM inherently parallelizable.
In words, MSM is the sum of scalar and EC point multiplications. We can see from this definition that the core operations occurring are Modular Multiplication and Elliptic curve point addition. It's obvious that multiplication can be computed in parallel and then the products summed, making MSM inherently parallelizable.
Accelerating MSM is crucial to a ZK protocol's performance due to the [large percent of run time](https://hackmd.io/@0xMonia/SkQ6-oRz3#Hardware-acceleration-in-action) they take when generating proofs.
@@ -131,7 +131,7 @@ Large buckets exist in two cases:
2. When `c` does not divide the scalar bit-size.
`large_bucket_factor` that is equal to 10 yields good results for most cases, but it's best to fine tune this parameter per `c` and per scalar distribution.
The two most important parameters for performance are `c` and the `precompute_factor`. They affect the number of EC additions as well as the memory size. When the points are not known in advance we cannot use precomputation. In this case the best `c` value is usually around $log_2(msmSize) - 4$. However, in most protocols the points are known in advanced and precomputation can be used unless limited by memory. Usually it's best to use maximum precomputation (such that we end up with only a single bucket module) combined we a `c` value around $log_2(msmSize) - 1$.
The two most important parameters for performance are `c` and the `precompute_factor`. They affect the number of EC additions as well as the memory size. When the points are not known in advance we cannot use precomputation. In this case the best `c` value is usually around $log_2(msmSize) - 4$. However, in most protocols the points are known in advance and precomputation can be used unless limited by memory. Usually it's best to use maximum precomputation (such that we end up with only a single bucket module) combined with a `c` value around $log_2(msmSize) - 1$.
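The `c` heuristics above can be captured in a small sketch. `recommended_c` is a hypothetical helper written for illustration, not part of the ICICLE API:

```rust
// Hypothetical helper (not an ICICLE API) capturing the heuristic above;
// assumes msm_size > 0.
fn recommended_c(msm_size: u64, max_precompute: bool) -> u32 {
    let log2 = 63 - msm_size.leading_zeros(); // floor(log2(msm_size))
    // points known in advance + maximum precomputation: c ~ log2(size) - 1
    // no precomputation:                                c ~ log2(size) - 4
    let delta = if max_precompute { 1 } else { 4 };
    log2.saturating_sub(delta).max(1)
}

fn main() {
    // a 2^20-point MSM
    assert_eq!(recommended_c(1 << 20, false), 16);
    assert_eq!(recommended_c(1 << 20, true), 19);
}
```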
## Memory usage estimation

View File

@@ -56,7 +56,7 @@ Choosing an algorithm is heavily dependent on your use case. For example Cooley-
NTT also supports two different modes `Batch NTT` and `Single NTT`
Deciding weather to use `batch NTT` vs `single NTT` is highly dependent on your application and use case.
Deciding whether to use `batch NTT` vs `single NTT` is highly dependent on your application and use case.
#### Single NTT

View File

@@ -1,6 +1,6 @@
# Poseidon
[Poseidon](https://eprint.iacr.org/2019/458.pdf) is a popular hash in the ZK ecosystem primarily because its optimized to work over large prime fields, a common setting for ZK proofs, thereby minimizing the number of multiplicative operations required.
[Poseidon](https://eprint.iacr.org/2019/458.pdf) is a popular hash in the ZK ecosystem primarily because it's optimized to work over large prime fields, a common setting for ZK proofs, thereby minimizing the number of multiplicative operations required.
Poseidon has also been specifically designed to be efficient when implemented within ZK circuits, Poseidon uses far less constraints compared to other hash functions like Keccak or SHA-256 in the context of ZK circuits.
@@ -42,7 +42,7 @@ To generate a secure hash output, the algorithm goes through a series of "full r
**Linear Transformation and Round Constants:** A linear transformation is performed and round constants are added. The linear transformation in partial rounds can be designed to be less computationally intensive (this is done by using a sparse matrix) than in full rounds, further optimizing the function's efficiency.
The user of Poseidon can often choose how many partial or full rounds he wishes to apply; more full rounds will increase security but degrade performance. The choice and balance is highly dependent on the use case.
The user of Poseidon can often choose how many partial or full rounds he wishes to apply; more full rounds will increase security but degrade performance. The choice and balance are highly dependent on the use case.
## Using Poseidon
@@ -53,13 +53,14 @@ So for Poseidon of arity 2 and input of size 1024 * 2, we would expect 1024 elem
### Supported Bindings
[`Go`](https://github.com/ingonyama-zk/icicle/blob/main/wrappers/golang/curves/bn254/poseidon/poseidon.go)
[`Rust`](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/rust/icicle-core/src/poseidon)
### Constants
Poseidon is extremely customizable and using different constants will produce different hashes, security levels and performance results.
We support pre-calculated and optimized constants for each of the [supported curves](../core#supported-curves-and-operations).The constants can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/icicle/include/poseidon/constants) and are labeled clearly per curve `<curve_name>_poseidon.h`.
We support pre-calculated and optimized constants for each of the [supported curves](../core#supported-curves-and-operations). The constants can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/icicle/include/poseidon/constants) and are labeled clearly per curve `<curve_name>_poseidon.h`.
If you wish to generate your own constants you can use our python script which can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/icicle/include/poseidon/constants/generate_parameters.py).
@@ -91,8 +92,6 @@ primitive_element = 7 # bls12-381
# primitive_element = 15 # bw6-761
```
We only support `alpha = 5` so if you want to use another alpha for S-box please reach out on discord or open a github issue.
### Rust API
This is the most basic way to use the Poseidon API.
@@ -101,71 +100,58 @@ This is the most basic way to use the Poseidon API.
let test_size = 1 << 10;
let arity = 2u32;
let ctx = get_default_device_context();
let constants = load_optimized_poseidon_constants::<F>(arity, &ctx).unwrap();
let config = PoseidonConfig::default();
let poseidon = Poseidon::load(arity, &ctx).unwrap();
let config = HashConfig::default();
let inputs = vec![F::one(); test_size * arity as usize];
let outputs = vec![F::zero(); test_size];
let mut input_slice = HostOrDeviceSlice::on_host(inputs);
let mut output_slice = HostOrDeviceSlice::on_host(outputs);
poseidon_hash_many::<F>(
poseidon.hash_many::<F>(
&mut input_slice,
&mut output_slice,
test_size as u32,
arity as u32,
&constants,
1, // Output length
&config,
)
.unwrap();
```
The `PoseidonConfig::default()` can be modified, by default the inputs and outputs are set to be on `Host` for example.
The `HashConfig` can be modified, by default the inputs and outputs are set to be on `Host` for example.
```rust
impl<'a> Default for PoseidonConfig<'a> {
impl<'a> Default for HashConfig<'a> {
fn default() -> Self {
let ctx = get_default_device_context();
Self {
ctx,
are_inputs_on_device: false,
are_outputs_on_device: false,
input_is_a_state: false,
aligned: false,
loop_state: false,
is_async: false,
}
}
}
```
In the example above `load_optimized_poseidon_constants::<F>(arity, &ctx).unwrap();` is used which will load the correct constants based on arity and curve. Its possible to [generate](#constants) your own constants and load them.
In the example above `Poseidon::load(arity, &ctx).unwrap();` is used which will load the correct constants based on arity and curve. It's possible to [generate](#constants) your own constants and load them.
```rust
let ctx = get_default_device_context();
let cargo_manifest_dir = env!("CARGO_MANIFEST_DIR");
let constants_file = PathBuf::from(cargo_manifest_dir)
.join("tests")
.join(format!("{}_constants.bin", field_prefix));
let mut constants_buf = vec![];
File::open(constants_file)
.unwrap()
.read_to_end(&mut constants_buf)
.unwrap();
let mut custom_constants = vec![];
for chunk in constants_buf.chunks(field_bytes) {
custom_constants.push(F::from_bytes_le(chunk));
}
let custom_constants = create_optimized_poseidon_constants::<F>(
arity as u32,
&ctx,
full_rounds_half,
partial_rounds,
&mut custom_constants,
)
.unwrap();
let custom_poseidon = Poseidon::new(
arity, // The arity of poseidon hash. The width will be equal to arity + 1
alpha, // The S-box power
full_rounds_half,
partial_rounds,
round_constants,
mds_matrix,
non_sparse_matrix,
sparse_matrices,
domain_tag,
ctx,
)
.unwrap();
```
## The Tree Builder
@@ -175,21 +161,34 @@ The tree builder allows you to build Merkle trees using Poseidon.
You can define both the tree's `height` and its `arity`. The tree `height` determines the number of layers in the tree, including the root and the leaf layer. The `arity` determines how many children each internal node can have.
```rust
let height = 20;
let arity = 2;
let leaves = vec![F::one(); 1 << (height - 1)];
let mut digests = vec![F::zero(); merkle_tree_digests_len(height, arity)];
let mut leaves_slice = HostOrDeviceSlice::on_host(leaves);
let ctx = get_default_device_context();
let constants = load_optimized_poseidon_constants::<F>(arity, &ctx).unwrap()
use icicle_bn254::tree::Bn254TreeBuilder;
use icicle_bn254::poseidon::Poseidon;
let mut config = TreeBuilderConfig::default();
config.keep_rows = 1;
build_poseidon_merkle_tree::<F>(&mut leaves_slice, &mut digests, height, arity, &constants, &config).unwrap();
let arity = 2;
config.arity = arity as u32;
let input_block_len = arity;
let leaves = vec![F::one(); (1 << height) * arity];
let mut digests = vec![F::zero(); merkle_tree_digests_len((height + 1) as u32, arity as u32, 1)];
println!("Root: {:?}", digests[0..1][0]);
let leaves_slice = HostSlice::from_slice(&leaves);
let digests_slice = HostSlice::from_mut_slice(&mut digests);
let ctx = device_context::DeviceContext::default();
let hash = Poseidon::load(2, &ctx).unwrap();
let mut config = TreeBuilderConfig::default();
config.keep_rows = 5;
Bn254TreeBuilder::build_merkle_tree(
leaves_slice,
digests_slice,
height,
input_block_len,
&hash,
&hash,
&config,
)
.unwrap();
```
Similar to Poseidon, you can also configure the Tree Builder via `TreeBuilderConfig::default()`.

View File

@@ -0,0 +1,88 @@
# Poseidon2
[Poseidon2](https://eprint.iacr.org/2023/323) is a recently released optimized version of Poseidon1. The two versions differ in two crucial points. First, Poseidon is a sponge hash function, while Poseidon2 can be either a sponge or a compression function depending on the use case. Secondly, Poseidon2 is instantiated by new and more efficient linear layers with respect to Poseidon. These changes decrease the number of multiplications in the linear layer by up to 90% and the number of constraints in Plonk circuits by up to 70%. This makes Poseidon2 currently the fastest arithmetization-oriented hash function without lookups.
## Using Poseidon2
ICICLE Poseidon2 is implemented for GPU and parallelization is performed for each state.
We calculate multiple hash-sums over multiple pre-images in parallel, rather than going block by block over the input vector.
For example, for Poseidon2 with width 16, rate 8, and 8 output elements, an input of size 1024 * 8 yields 1024 * 8 output elements, which means each input block is of size 8, resulting in 1024 Poseidon2 hashes being performed.
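The block/output bookkeeping described above can be sketched as plain arithmetic. `poseidon2_io_sizes` is an illustrative function written for this text, not an ICICLE call:

```rust
// Each pre-image occupies `rate` input elements; each hash emits `output_len`
// elements. These are the numbers from the text, not a library function.
fn poseidon2_io_sizes(input_len: usize, rate: usize, output_len: usize) -> (usize, usize) {
    let num_hashes = input_len / rate;
    (num_hashes, num_hashes * output_len)
}

fn main() {
    // width 16, rate 8, 8 output elements, input of 1024 * 8 field elements
    let (hashes, out) = poseidon2_io_sizes(1024 * 8, 8, 8);
    assert_eq!(hashes, 1024);
    assert_eq!(out, 1024 * 8);
}
```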
### Supported Bindings
[`Rust`](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/rust/icicle-core/src/poseidon2)
### Constants
Poseidon2 is also extremely customizable and using different constants will produce different hashes, security levels and performance results.
We support pre-calculated constants for each of the [supported curves](../core#supported-curves-and-operations). The constants can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/icicle/include/poseidon2/constants) and are labeled clearly per curve `<curve_name>_poseidon2.h`.
You can also use your own set of constants as shown [here](https://github.com/ingonyama-zk/icicle/blob/main/wrappers/rust/icicle-fields/icicle-babybear/src/poseidon2/mod.rs#L290)
### Rust API
This is the most basic way to use the Poseidon2 API.
```rust
let test_size = 1 << 10;
let width = 16;
let rate = 8;
let ctx = get_default_device_context();
let poseidon = Poseidon2::load(width, rate, MdsType::Default, DiffusionStrategy::Default, &ctx).unwrap();
let config = HashConfig::default();
let inputs = vec![F::one(); test_size * rate as usize];
let outputs = vec![F::zero(); test_size];
let mut input_slice = HostOrDeviceSlice::on_host(inputs);
let mut output_slice = HostOrDeviceSlice::on_host(outputs);
poseidon.hash_many::<F>(
&mut input_slice,
&mut output_slice,
test_size as u32,
rate as u32,
8, // Output length
&config,
)
.unwrap();
```
In the example above `Poseidon2::load(width, rate, MdsType::Default, DiffusionStrategy::Default, &ctx).unwrap();` is used to load the correct constants based on width and curve. Here, the default MDS matrices and diffusion are used. If you want to get a Plonky3 compliant version, set them to `MdsType::Plonky` and `DiffusionStrategy::Montgomery` respectively.
## The Tree Builder
Similar to Poseidon1, you can use Poseidon2 in a tree builder.
```rust
use icicle_bn254::tree::Bn254TreeBuilder;
use icicle_bn254::poseidon2::Poseidon2;
let mut config = TreeBuilderConfig::default();
let arity = 2;
config.arity = arity as u32;
let input_block_len = arity;
let leaves = vec![F::one(); (1 << height) * arity];
let mut digests = vec![F::zero(); merkle_tree_digests_len((height + 1) as u32, arity as u32, 1)];
let leaves_slice = HostSlice::from_slice(&leaves);
let digests_slice = HostSlice::from_mut_slice(&mut digests);
let ctx = device_context::DeviceContext::default();
let hash = Poseidon2::load(arity, arity, MdsType::Default, DiffusionStrategy::Default, &ctx).unwrap();
let mut config = TreeBuilderConfig::default();
config.keep_rows = 5;
Bn254TreeBuilder::build_merkle_tree(
leaves_slice,
digests_slice,
height,
input_block_len,
&hash,
&hash,
&config,
)
.unwrap();
```

docs/package-lock.json generated
View File

@@ -3680,6 +3680,8 @@
"version": "8.12.0",
"resolved": "https://registry.npmjs.org/ajv/-/ajv-8.12.0.tgz",
"integrity": "sha512-sRu1kpcO9yLtYxBKvqfTeh9KzZEwO3STyX1HT+4CaDzC6HpTGYhIhPIzj9XuKU7KYDwnaeh5hcOwjy1QuJzBPA==",
"optional": true,
"peer": true,
"dependencies": {
"fast-deep-equal": "^3.1.1",
"json-schema-traverse": "^1.0.0",
@@ -3694,7 +3696,9 @@
"node_modules/ajv-formats/node_modules/json-schema-traverse": {
"version": "1.0.0",
"resolved": "https://registry.npmjs.org/json-schema-traverse/-/json-schema-traverse-1.0.0.tgz",
"integrity": "sha512-NM8/P9n3XjXhIZn1lLhkFaACTOURQXjWhV4BA/RnOv8xvgqtqpAX9IO4mRQxSx1Rlo4tqzeqb0sOlruaOy3dug=="
"integrity": "sha512-NM8/P9n3XjXhIZn1lLhkFaACTOURQXjWhV4BA/RnOv8xvgqtqpAX9IO4mRQxSx1Rlo4tqzeqb0sOlruaOy3dug==",
"optional": true,
"peer": true
},
"node_modules/ajv-keywords": {
"version": "3.5.2",
@@ -16340,14 +16344,13 @@
"version": "2.1.1",
"resolved": "https://registry.npmjs.org/ajv-formats/-/ajv-formats-2.1.1.tgz",
"integrity": "sha512-Wx0Kx52hxE7C18hkMEggYlEifqWZtYaRgouJor+WMdPnQyEK13vgEWyVNup7SoeeoLMsr4kf5h6dOW11I15MUA==",
"requires": {
"ajv": "^8.0.0"
},
"requires": {},
"dependencies": {
"ajv": {
"version": "8.12.0",
"resolved": "https://registry.npmjs.org/ajv/-/ajv-8.12.0.tgz",
"version": "https://registry.npmjs.org/ajv/-/ajv-8.12.0.tgz",
"integrity": "sha512-sRu1kpcO9yLtYxBKvqfTeh9KzZEwO3STyX1HT+4CaDzC6HpTGYhIhPIzj9XuKU7KYDwnaeh5hcOwjy1QuJzBPA==",
"optional": true,
"peer": true,
"requires": {
"fast-deep-equal": "^3.1.1",
"json-schema-traverse": "^1.0.0",
@@ -16358,7 +16361,9 @@
"json-schema-traverse": {
"version": "1.0.0",
"resolved": "https://registry.npmjs.org/json-schema-traverse/-/json-schema-traverse-1.0.0.tgz",
"integrity": "sha512-NM8/P9n3XjXhIZn1lLhkFaACTOURQXjWhV4BA/RnOv8xvgqtqpAX9IO4mRQxSx1Rlo4tqzeqb0sOlruaOy3dug=="
"integrity": "sha512-NM8/P9n3XjXhIZn1lLhkFaACTOURQXjWhV4BA/RnOv8xvgqtqpAX9IO4mRQxSx1Rlo4tqzeqb0sOlruaOy3dug==",
"optional": true,
"peer": true
}
}
},

View File

@@ -53,6 +53,11 @@ module.exports = {
label: "Poseidon Hash",
id: "icicle/primitives/poseidon",
},
{
type: "doc",
label: "Poseidon2 Hash",
id: "icicle/primitives/poseidon2",
},
],
},
{

View File

@@ -0,0 +1,24 @@
cmake_minimum_required(VERSION 3.18)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_STANDARD_REQUIRED TRUE)
set(CMAKE_CXX_STANDARD_REQUIRED TRUE)
if (${CMAKE_VERSION} VERSION_LESS "3.24.0")
set(CMAKE_CUDA_ARCHITECTURES ${CUDA_ARCH})
else()
set(CMAKE_CUDA_ARCHITECTURES native) # on 3.24+, on earlier it is ignored, and the target is not passed
endif ()
project(example LANGUAGES CUDA CXX)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --expt-relaxed-constexpr")
set(CMAKE_CUDA_FLAGS_RELEASE "")
set(CMAKE_CUDA_FLAGS_DEBUG "${CMAKE_CUDA_FLAGS_DEBUG} -g -G -O0")
add_executable(
example
example.cu
)
target_include_directories(example PRIVATE "../../../icicle/include")
target_link_libraries(example ${CMAKE_SOURCE_DIR}/build/icicle/lib/libingo_field_bn254.a)
find_library(NVML_LIBRARY nvidia-ml PATHS /usr/local/cuda/targets/x86_64-linux/lib/stubs/ )
target_link_libraries(example ${NVML_LIBRARY})
set_target_properties(example PROPERTIES CUDA_SEPARABLE_COMPILATION ON)

View File

@@ -0,0 +1,38 @@
# Icicle example: Montgomery vector operations (mul, add, sub) for all possible options

- is_a_on_device
- is_b_on_device
- is_result_on_device
- is_in_montgomery_form

(is_async isn't checked)
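For intuition on the `is_in_montgomery_form` option, here is a minimal, self-contained sketch of what Montgomery form means. This is textbook REDC over the 31-bit BabyBear prime (one of ICICLE's supported fields), chosen only so the arithmetic fits in native integers — it is illustrative, not ICICLE's bn254 implementation:

```rust
// Illustrative Montgomery arithmetic with R = 2^32 (not ICICLE code).
const P: u32 = 2013265921; // BabyBear prime

// n' = -p^{-1} mod 2^32, via Newton iteration (5 doublings: 2 -> 32 bits)
fn neg_inv(p: u32) -> u32 {
    let mut inv: u32 = 1;
    for _ in 0..5 {
        inv = inv.wrapping_mul(2u32.wrapping_sub(p.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

// REDC: given t < p * 2^32, return t * 2^{-32} mod p
fn redc(t: u64, n_prime: u32) -> u32 {
    let m = (t as u32).wrapping_mul(n_prime);
    let u = ((t as u128 + (m as u128) * (P as u128)) >> 32) as u32;
    if u >= P { u - P } else { u }
}

// Convert into Montgomery form: a * 2^32 mod p
fn to_mont(a: u32) -> u32 {
    (((a as u64) << 32) % (P as u64)) as u32
}

// Multiply two Montgomery-form values; the result stays in Montgomery form
fn mont_mul(a: u32, b: u32, n_prime: u32) -> u32 {
    redc((a as u64) * (b as u64), n_prime)
}

fn main() {
    let np = neg_inv(P);
    let (a, b) = (123456789u32 % P, 987654321u32 % P);
    let expected = ((a as u64 * b as u64) % P as u64) as u32;
    // multiply in Montgomery form, then convert back with one extra REDC
    let got = redc(mont_mul(to_mont(a), to_mont(b), np) as u64, np);
    assert_eq!(got, expected);
    println!("ok: {} * {} mod p = {}", a, b, got);
}
```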
## Key-Takeaway
`Icicle` accelerates the multiplication operation `*` using the [Karatsuba algorithm](https://en.wikipedia.org/wiki/Karatsuba_algorithm)
## Concise Usage Explanation
Define the field to be used, e.g.:
```c++
#include "api/bn254.h"
```
```c++
using namespace bn254;
typedef scalar_t T;
```
## Running the example
- `cd` to your example directory
- compile with `./compile.sh`
- run with `./run.sh`
## What's in the example
1. Define the parameters for the example such as vector size
2. Generate random vectors on-host
3. Copy them on-device
4. Execute element-wise vector multiplication on-device
5. Copy results on-host

View File

@@ -0,0 +1,15 @@
#!/bin/bash
# Exit immediately on error
set -e
mkdir -p build/example
mkdir -p build/icicle
# Configure and build Icicle
cmake -S ../../../icicle/ -B build/icicle -DMSM=OFF -DCMAKE_BUILD_TYPE=Debug -DCURVE=bn254
cmake --build build/icicle -j
# Configure and build the example application
cmake -DCMAKE_BUILD_TYPE=Debug -S. -B build/example
cmake --build build/example

View File

@@ -0,0 +1,15 @@
#!/bin/bash
# Exit immediately on error
set -e
mkdir -p build/example
mkdir -p build/icicle
# Configure and build Icicle
cmake -S ../../../icicle/ -B build/icicle -DMSM=OFF -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254
cmake --build build/icicle
# Configure and build the example application
cmake -S . -B build/example
cmake --build build/example

View File

@@ -0,0 +1,734 @@
#include <iostream>
#include <iomanip>
#include <chrono>
#include <nvml.h>
#include "api/bn254.h"
#include "vec_ops/vec_ops.cuh"
#include <vec_ops/../../include/utils/mont.cuh>
using namespace vec_ops;
using namespace bn254;
typedef scalar_t T;
enum Op { MUL, ADD, SUB, LAST };
// bn254 p = 21888242871839275222246405745257275088548364400416034343698204186575808495617
int vector_op(
T* vec_a,
T* vec_b,
T* vec_result,
size_t n_elements,
device_context::DeviceContext ctx,
vec_ops::VecOpsConfig config,
Op op)
{
cudaError_t err;
switch (op) {
case MUL:
err = bn254_mul_cuda(vec_a, vec_b, n_elements, config, vec_result);
break;
case ADD:
err = bn254_add_cuda(vec_a, vec_b, n_elements, config, vec_result);
break;
case SUB:
err = bn254_sub_cuda(vec_a, vec_b, n_elements, config, vec_result);
break;
}
if (err != cudaSuccess) {
std::cerr << "Failed to execute vector operation - " << cudaGetErrorString(err) << std::endl;
return 0;
}
return 0;
}
int vector_mul(
T* vec_a, T* vec_b, T* vec_result, size_t n_elements, device_context::DeviceContext ctx, vec_ops::VecOpsConfig config)
{
cudaError_t err = bn254_mul_cuda(vec_a, vec_b, n_elements, config, vec_result);
if (err != cudaSuccess) {
std::cerr << "Failed to multiply vectors - " << cudaGetErrorString(err) << std::endl;
return 0;
}
return 0;
}
int vector_add(
T* vec_a, T* vec_b, T* vec_result, size_t n_elements, device_context::DeviceContext ctx, vec_ops::VecOpsConfig config)
{
cudaError_t err = bn254_add_cuda(vec_a, vec_b, n_elements, config, vec_result);
if (err != cudaSuccess) {
std::cerr << "Failed to add vectors - " << cudaGetErrorString(err) << std::endl;
return 0;
}
return 0;
}
int vector_sub(
T* vec_a, T* vec_b, T* vec_result, size_t n_elements, device_context::DeviceContext ctx, vec_ops::VecOpsConfig config)
{
cudaError_t err = bn254_sub_cuda(vec_a, vec_b, n_elements, config, vec_result);
if (err != cudaSuccess) {
std::cerr << "Failed to subtract vectors - " << cudaGetErrorString(err) << std::endl;
return 0;
}
return 0;
}
int main(int argc, char** argv)
{
const unsigned vector_size = 1 << 0;
const unsigned repetitions = 1 << 0;
cudaError_t err;
nvmlInit();
nvmlDevice_t device;
nvmlDeviceGetHandleByIndex(0, &device); // for GPU 0
std::cout << "Icicle-Examples: vector mul / add / sub operations." << std::endl;
char name[NVML_DEVICE_NAME_BUFFER_SIZE];
if (nvmlDeviceGetName(device, name, NVML_DEVICE_NAME_BUFFER_SIZE) == NVML_SUCCESS) {
std::cout << "GPU Model: " << name << std::endl;
} else {
std::cerr << "Failed to get GPU model name." << std::endl;
}
unsigned power_limit;
nvmlDeviceGetPowerManagementLimit(device, &power_limit);
std::cout << "Vector size: " << vector_size << std::endl;
std::cout << "Repetitions: " << repetitions << std::endl;
std::cout << "Power limit: " << std::fixed << std::setprecision(3) << 1.0e-3 * power_limit << " W" << std::endl;
unsigned int baseline_power;
nvmlDeviceGetPowerUsage(device, &baseline_power);
std::cout << "Baseline power: " << std::fixed << std::setprecision(3) << 1.0e-3 * baseline_power << " W" << std::endl;
unsigned baseline_temperature;
if (nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU, &baseline_temperature) == NVML_SUCCESS) {
std::cout << "Baseline GPU Temperature: " << baseline_temperature << " C" << std::endl;
} else {
std::cerr << "Failed to get GPU temperature." << std::endl;
}
// host data
std::cout << "Allocate memory for the input vectors (both normal and Montgomery presentation)" << std::endl;
T* host_in1_init = (T*)malloc(vector_size * sizeof(T));
T* host_in2_init = (T*)malloc(vector_size * sizeof(T));
std::cout << "Initializing vectors with normal presentation random data" << std::endl;
T::rand_host_many(host_in1_init, vector_size);
T::rand_host_many(host_in2_init, vector_size);
std::cout << "Allocate memory for the output vectors" << std::endl;
T* host_out = (T*)malloc(vector_size * sizeof(T)); // This memory will be used for the test output.
T* host_out_ref_mul = (T*)malloc(
vector_size *
sizeof(T)); // This memory will be used as a reference result for mul (will be compared to host_out content).
T* host_out_ref_add = (T*)malloc(
vector_size *
sizeof(T)); // This memory will be used as a reference result for add (will be compared to host_out content).
T* host_out_ref_sub = (T*)malloc(
vector_size *
sizeof(T)); // This memory will be used as a reference result for sub (will be compared to host_out content).
std::cout << "Initializing output vectors with random data" << std::endl;
T::rand_host_many(host_out, vector_size);
T::rand_host_many(host_out_ref_mul, vector_size);
T::rand_host_many(host_out_ref_add, vector_size);
T::rand_host_many(host_out_ref_sub, vector_size);
// device data
device_context::DeviceContext ctx = device_context::get_default_device_context();
T* device_in1;
T* device_in2;
T* device_out;
err = cudaMalloc((void**)&device_in1, vector_size * sizeof(T));
if (err != cudaSuccess) {
std::cerr << "Failed to allocate device memory - " << cudaGetErrorString(err) << std::endl;
return 0;
}
err = cudaMalloc((void**)&device_in2, vector_size * sizeof(T));
if (err != cudaSuccess) {
std::cerr << "Failed to allocate device memory - " << cudaGetErrorString(err) << std::endl;
return 0;
}
err = cudaMalloc((void**)&device_out, vector_size * sizeof(T));
if (err != cudaSuccess) {
std::cerr << "Failed to allocate device memory - " << cudaGetErrorString(err) << std::endl;
return 0;
}
vec_ops::VecOpsConfig config = vec_ops::DefaultVecOpsConfig();
//****************************************
// Test warm-up and reference output config. Reference output to be used to check if test passed or not.
//****************************************
// copy from host to device
err = cudaMemcpy(device_in1, host_in1_init, vector_size * sizeof(T), cudaMemcpyHostToDevice);
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host to device - " << cudaGetErrorString(err) << std::endl;
return 0;
}
err = cudaMemcpy(device_in2, host_in2_init, vector_size * sizeof(T), cudaMemcpyHostToDevice);
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host to device - " << cudaGetErrorString(err) << std::endl;
return 0;
}
std::cout << "Starting warm-up run" << std::endl;
// Warm-up loop
for (int op = MUL; op != LAST; op++) {
for (int i = 0; i < repetitions; i++) {
// vector_mul(device_in1, device_in2, device_out, vector_size, ctx, config);
vector_op(device_in1, device_in2, device_out, vector_size, ctx, config, (Op)op);
}
switch (op) {
case MUL:
err = cudaMemcpy(host_out_ref_mul, device_out, vector_size * sizeof(T), cudaMemcpyDeviceToHost);
break;
case ADD:
err = cudaMemcpy(host_out_ref_add, device_out, vector_size * sizeof(T), cudaMemcpyDeviceToHost);
break;
case SUB:
err = cudaMemcpy(host_out_ref_sub, device_out, vector_size * sizeof(T), cudaMemcpyDeviceToHost);
break;
}
}
// Check that the last device-to-host copy of the reference results succeeded.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_out to host - " << cudaGetErrorString(err) << std::endl;
return 0;
}
//****************************************
// End of test warm-up and reference output config.
//****************************************
std::cout << "Starting benchmarking" << std::endl;
unsigned power_before;
nvmlDeviceGetPowerUsage(device, &power_before);
std::cout << "Power before: " << std::fixed << std::setprecision(3) << 1.0e-3 * power_before << " W" << std::endl;
std::cout << "Power utilization: " << std::fixed << std::setprecision(1) << (float)100.0 * power_before / power_limit
<< " %" << std::endl;
unsigned temperature_before;
if (nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU, &temperature_before) == NVML_SUCCESS) {
std::cout << "GPU Temperature before: " << temperature_before << " C" << std::endl;
} else {
std::cerr << "Failed to get GPU temperature." << std::endl;
}
//*******************************************************
// Benchmark test:
// Loop for (mul, add, sub):
// Loop (is_a_on_device, is_b_on_device, is_result_on_device, is_in_montgomery_form):
//*******************************************************
T* host_in1 =
(T*)malloc(vector_size * sizeof(T)); // This buffer is used to load the data from host_in1_init for the benchmark.
T* host_in2 =
(T*)malloc(vector_size * sizeof(T)); // This buffer is used to load the data from host_in2_init for the benchmark.
// Test when the result is not in-place
for (int op = MUL; op != LAST; op++) {
for (int config_idx = 0; config_idx < 16; config_idx++) {
std::cout << "Start benchmark loop for config_idx " << config_idx << std::endl;
for (int i = 0; i < vector_size; i++) {
host_in1[i] = host_in1_init[i];
host_in2[i] = host_in2_init[i];
}
config.is_a_on_device = (config_idx >> 3) & 0x1;
config.is_b_on_device = (config_idx >> 2) & 0x1;
config.is_result_on_device = (config_idx >> 1) & 0x1;
config.is_in_montgomery_form = (config_idx >> 0) & 0x1;
// Copy from host to device (copy again so the data can be reused later in the loop;
// device_inX was already overwritten by the warm-up).
if (config.is_a_on_device) {
if (config.is_in_montgomery_form) {
err =
cudaMemcpy(device_in1, host_in1, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in1 to device_in1 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(
mont::to_montgomery(device_in1, vector_size, config.ctx.stream, device_in1)); // Convert in-place.
} else { // Normal representation.
err =
cudaMemcpy(device_in1, host_in1, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in1 to device_in1 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
} else {
if (config.is_in_montgomery_form) { // Copy to device, convert to Montgomery and copy back to host.
err =
cudaMemcpy(device_in1, host_in1, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in1 to device_in1 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(mont::to_montgomery(device_in1, vector_size, config.ctx.stream, device_in1));
err = cudaMemcpy(host_in1, device_in1, vector_size * sizeof(T), cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_in1 to host_in1 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
}
if (config.is_b_on_device) {
if (config.is_in_montgomery_form) {
err =
cudaMemcpy(device_in2, host_in2, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in2 to device_in2 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(
mont::to_montgomery(device_in2, vector_size, config.ctx.stream, device_in2)); // Convert in-place.
} else {
// Normal representation.
err =
cudaMemcpy(device_in2, host_in2, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in2 to device_in2 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
} else {
if (config.is_in_montgomery_form) { // Copy to device, convert to Montgomery and copy back to host.
err =
cudaMemcpy(device_in2, host_in2, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in2 to device_in2 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(mont::to_montgomery(device_in2, vector_size, config.ctx.stream, device_in2));
err = cudaMemcpy(host_in2, device_in2, vector_size * sizeof(T), cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_in2 to host_in2 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
}
CHK_IF_RETURN(cudaPeekAtLastError());
auto start_time = std::chrono::high_resolution_clock::now();
// Benchmark loop
for (int i = 0; i < repetitions; i++) {
switch (config_idx >> 1) { // {is_a_on_device, is_b_on_device, is_result_on_device}
case 0b000:
vector_op(host_in1, host_in2, host_out, vector_size, ctx, config, (Op)op);
break;
case 0b001:
vector_op(host_in1, host_in2, device_out, vector_size, ctx, config, (Op)op);
break;
case 0b010:
vector_op(host_in1, device_in2, host_out, vector_size, ctx, config, (Op)op);
break;
case 0b011:
vector_op(host_in1, device_in2, device_out, vector_size, ctx, config, (Op)op);
break;
case 0b100:
vector_op(device_in1, host_in2, host_out, vector_size, ctx, config, (Op)op);
break;
case 0b101:
vector_op(device_in1, host_in2, device_out, vector_size, ctx, config, (Op)op);
break;
case 0b110:
vector_op(device_in1, device_in2, host_out, vector_size, ctx, config, (Op)op);
break;
case 0b111:
vector_op(device_in1, device_in2, device_out, vector_size, ctx, config, (Op)op);
break;
}
CHK_IF_RETURN(cudaPeekAtLastError());
}
auto end_time = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);
switch (op) {
case MUL:
std::cout << "Elapsed time: " << duration.count() << " microseconds, operation MUL for config_idx "
<< config_idx << " and result not in-place" << std::endl;
break;
case ADD:
std::cout << "Elapsed time: " << duration.count() << " microseconds, operation ADD for config_idx "
<< config_idx << " and result not in-place" << std::endl;
break;
case SUB:
std::cout << "Elapsed time: " << duration.count() << " microseconds, operation SUB for config_idx "
<< config_idx << " and result not in-place" << std::endl;
break;
}
if (config.is_result_on_device) { // Copy the data to host_out in order to compare it vs. host_out_ref_mul value.
if (config.is_in_montgomery_form) { // Convert to normal from montgomery if needed.
CHK_IF_RETURN(mont::from_montgomery(
device_out, vector_size, config.ctx.stream,
device_out)); // Convert to normal in order to check vs. host_out_ref_mul.
}
err = cudaMemcpy(
host_out, device_out, vector_size * sizeof(T),
cudaMemcpyDeviceToHost); // Copy to host_out in order to check vs. host_out_ref_mul.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_out to host - " << cudaGetErrorString(err) << std::endl;
return 0;
}
} else { // Data is not on device but it is in host_out.
if (config.is_in_montgomery_form) { // host_out should be written to device, converted from Montgomery form and
// written back to host. Then compared vs. host_out_ref_mul.
err = cudaMemcpy(
device_out, host_out, vector_size * sizeof(T),
cudaMemcpyHostToDevice); // Copy to host_out in order to check vs. host_out_ref_mul.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_out to device_out - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(mont::from_montgomery(
device_out, vector_size, config.ctx.stream,
device_out)); // Convert to normal in order to check vs. host_out_ref_mul.
err = cudaMemcpy(
host_out, device_out, vector_size * sizeof(T),
cudaMemcpyDeviceToHost); // Copy to host_out in order to check vs. host_out_ref_mul.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_out to host_out - " << cudaGetErrorString(err) << std::endl;
return 0;
}
} else { // host_out could be compared vs. host_out_ref_mul as is.
}
}
//****************************************
// End of benchmark test.
//****************************************
//***********************************************
// Test result check
// The check compares the benchmark result with a reference computed in
// normal representation (stored in host_out_ref_mul/add/sub).
//***********************************************
int test_failed = 0;
switch (op) {
case MUL:
for (int i = 0; i < vector_size; i++) {
if (host_out_ref_mul[i] != host_out[i]) {
std::cout << "===>>> ERROR!!! MUL: Test failed for vector index " << i
<< ", config is printed below:" << std::endl;
test_failed = 1;
}
}
break;
case ADD:
for (int i = 0; i < vector_size; i++) {
if (host_out_ref_add[i] != host_out[i]) {
std::cout << "===>>> ERROR!!! ADD: Test failed for vector index " << i
<< ", config is printed below:" << std::endl;
test_failed = 1;
}
}
break;
case SUB:
for (int i = 0; i < vector_size; i++) {
if (host_out_ref_sub[i] != host_out[i]) {
std::cout << "===>>> ERROR!!! SUB: Test failed for vector index " << i
<< ", config is printed below:" << std::endl;
test_failed = 1;
}
}
break;
}
if (test_failed) {
std::cout << "===>>> result is not in-place: " << std::endl;
std::cout << "===>>> is_a_on_device: " << config.is_a_on_device << std::endl;
std::cout << "===>>> is_b_on_device: " << config.is_b_on_device << std::endl;
std::cout << "===>>> is_result_on_device: " << config.is_result_on_device << std::endl;
std::cout << "===>>> is_in_montgomery_form: " << config.is_in_montgomery_form << std::endl;
exit(2);
}
unsigned power_after;
nvmlDeviceGetPowerUsage(device, &power_after);
std::cout << "Power after: " << std::fixed << std::setprecision(3) << 1.0e-3 * power_after << " W" << std::endl;
std::cout << "Power utilization: " << std::fixed << std::setprecision(1)
<< (float)100.0 * power_after / power_limit << " %" << std::endl;
unsigned temperature_after;
if (nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU, &temperature_after) == NVML_SUCCESS) {
std::cout << "GPU Temperature after: " << temperature_after << " C" << std::endl;
} else {
std::cerr << "Failed to get GPU temperature." << std::endl;
}
// Report performance in GOPS: Giga Operations Per Second
double GOPS = 1.0e-9 * repetitions * vector_size / (1.0e-6 * duration.count());
std::cout << "Performance: " << GOPS << " Giga Operations Per Second" << std::endl;
}
}
// Test when the result is in-place
for (int op = MUL; op != LAST; op++) {
for (int config_idx = 0; config_idx < 16; config_idx++) {
for (int i = 0; i < vector_size; i++) {
host_in1[i] = host_in1_init[i];
host_in2[i] = host_in2_init[i];
}
config.is_a_on_device = (config_idx >> 3) & 0x1;
config.is_b_on_device = (config_idx >> 2) & 0x1;
config.is_result_on_device = (config_idx >> 1) & 0x1;
config.is_in_montgomery_form = (config_idx >> 0) & 0x1;
// In-place runs write the result into operand a, so skip configs where a and the result
// live in different memory spaces. Note the parentheses: `a ^ b == 1` parses as `a ^ (b == 1)`.
if ((config.is_a_on_device ^ config.is_result_on_device) == 1) { continue; }
// Copy from host to device (copy again so the data can be reused later in the loop;
// device_inX was already overwritten by the warm-up).
if (config.is_a_on_device) {
if (config.is_in_montgomery_form) {
err =
cudaMemcpy(device_in1, host_in1, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in1 to device_in1 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(
mont::to_montgomery(device_in1, vector_size, config.ctx.stream, device_in1)); // Convert in-place.
} else { // Normal representation.
err =
cudaMemcpy(device_in1, host_in1, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in1 to device_in1 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
} else {
if (config.is_in_montgomery_form) { // Copy to device, convert to Montgomery and copy back to host.
err =
cudaMemcpy(device_in1, host_in1, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in1 to device_in1 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(mont::to_montgomery(device_in1, vector_size, config.ctx.stream, device_in1));
err = cudaMemcpy(host_in1, device_in1, vector_size * sizeof(T), cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_in1 to host_in1 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
}
if (config.is_b_on_device) {
if (config.is_in_montgomery_form) {
err =
cudaMemcpy(device_in2, host_in2, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in2 to device_in2 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(
mont::to_montgomery(device_in2, vector_size, config.ctx.stream, device_in2)); // Convert in-place.
} else {
// Normal representation.
err =
cudaMemcpy(device_in2, host_in2, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in2 to device_in2 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
} else {
if (config.is_in_montgomery_form) { // Copy to device, convert to Montgomery and copy back to host.
err =
cudaMemcpy(device_in2, host_in2, vector_size * sizeof(T), cudaMemcpyHostToDevice); // Copy data to device.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in2 to device_in2 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(mont::to_montgomery(device_in2, vector_size, config.ctx.stream, device_in2));
err = cudaMemcpy(host_in2, device_in2, vector_size * sizeof(T), cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_in2 to host_in2 - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
}
CHK_IF_RETURN(cudaPeekAtLastError());
auto start_time = std::chrono::high_resolution_clock::now();
// Benchmark loop
for (int i = 0; i < repetitions; i++) {
switch (config_idx >> 1) { // {is_a_on_device, is_b_on_device, is_result_on_device}
case 0b000:
vector_op(host_in1, host_in2, host_in1, vector_size, ctx, config, (Op)op);
break;
case 0b001:
break;
case 0b010:
vector_op(host_in1, device_in2, host_in1, vector_size, ctx, config, (Op)op);
break;
case 0b011:
break;
case 0b100:
break;
case 0b101:
vector_op(device_in1, host_in2, device_in1, vector_size, ctx, config, (Op)op);
break;
case 0b110:
break;
case 0b111:
vector_op(device_in1, device_in2, device_in1, vector_size, ctx, config, (Op)op);
break;
}
CHK_IF_RETURN(cudaPeekAtLastError());
}
auto end_time = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);
switch (op) {
case MUL:
std::cout << "Elapsed time: " << duration.count() << " microseconds, operation MUL for config_idx "
<< config_idx << " and result in-place" << std::endl;
break;
case ADD:
std::cout << "Elapsed time: " << duration.count() << " microseconds, operation ADD for config_idx "
<< config_idx << " and result in-place" << std::endl;
break;
case SUB:
std::cout << "Elapsed time: " << duration.count() << " microseconds, operation SUB for config_idx "
<< config_idx << " and result in-place" << std::endl;
break;
}
if (config.is_result_on_device) { // Copy the data to host_out in order to compare it vs. host_out_ref_mul value.
if (config.is_in_montgomery_form) { // Convert to normal from montgomery if needed.
CHK_IF_RETURN(mont::from_montgomery(
device_in1, vector_size, config.ctx.stream,
device_in1)); // Convert to normal in order to check vs. host_out_ref_mul.
}
err = cudaMemcpy(
host_out, device_in1, vector_size * sizeof(T),
cudaMemcpyDeviceToHost); // Copy to host_out in order to check vs. host_out_ref_mul.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_in1 to host_out - " << cudaGetErrorString(err) << std::endl;
return 0;
}
} else { // Data is not on device but it is in host_in1. It should be moved to host_out for test pass/fail check.
if (config.is_in_montgomery_form) { // host_out should be written to device, converted from Montgomery form and
// written back to host. Then compared vs. host_out_ref_mul.
err = cudaMemcpy(
device_out, host_in1, vector_size * sizeof(T),
cudaMemcpyHostToDevice); // Copy to host_out in order to check vs. host_out_ref_mul.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in1 to device_out - " << cudaGetErrorString(err) << std::endl;
return 0;
}
CHK_IF_RETURN(mont::from_montgomery(
device_out, vector_size, config.ctx.stream,
device_out)); // Convert to normal in order to check vs. host_out_ref_mul.
err = cudaMemcpy(
host_out, device_out, vector_size * sizeof(T),
cudaMemcpyDeviceToHost); // Copy to host_out in order to check vs. host_out_ref_mul.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_out to host_out - " << cudaGetErrorString(err) << std::endl;
return 0;
}
} else { // host_out could be compared vs. host_out_ref_mul as is.
err = cudaMemcpy(
device_out, host_in1, vector_size * sizeof(T),
cudaMemcpyHostToDevice); // Copy to host_out in order to check vs. host_out_ref_mul.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from host_in1 to device_out - " << cudaGetErrorString(err) << std::endl;
return 0;
}
err = cudaMemcpy(
host_out, device_out, vector_size * sizeof(T),
cudaMemcpyDeviceToHost); // Copy to host_out in order to check vs. host_out_ref_mul.
if (err != cudaSuccess) {
std::cerr << "Failed to copy data from device_out to host_out - " << cudaGetErrorString(err) << std::endl;
return 0;
}
}
}
//****************************************
// End of benchmark test.
//****************************************
//***********************************************
// Test result check
// The check compares the benchmark result with a reference computed in
// normal representation (stored in host_out_ref_mul/add/sub).
//***********************************************
int test_failed = 0;
switch (op) {
case MUL:
for (int i = 0; i < vector_size; i++) {
if (host_out_ref_mul[i] != host_out[i]) {
std::cout << "===>>> ERROR!!! MUL: Test failed for vector index " << i
<< ", config is printed below:" << std::endl;
std::cout << "host_out_ref_mul[0] = " << host_out_ref_mul[0] << std::endl;
test_failed = 1;
}
}
break;
case ADD:
for (int i = 0; i < vector_size; i++) {
if (host_out_ref_add[i] != host_out[i]) {
std::cout << "===>>> ERROR!!! ADD: Test failed for vector index " << i
<< ", config is printed below:" << std::endl;
std::cout << "host_out_ref_add[0] = " << host_out_ref_add[0] << std::endl;
test_failed = 1;
}
}
break;
case SUB:
for (int i = 0; i < vector_size; i++) {
if (host_out_ref_sub[i] != host_out[i]) {
std::cout << "===>>> ERROR!!! SUB: Test failed for vector index " << i
<< ", config is printed below:" << std::endl;
std::cout << "host_out_ref_sub[0] = " << host_out_ref_sub[0] << std::endl;
test_failed = 1;
}
}
break;
}
if (test_failed) {
std::cout << "===>>> result is in-place: " << std::endl;
std::cout << "===>>> is_a_on_device: " << config.is_a_on_device << std::endl;
std::cout << "===>>> is_b_on_device: " << config.is_b_on_device << std::endl;
std::cout << "===>>> is_result_on_device: " << config.is_result_on_device << std::endl;
std::cout << "===>>> is_in_montgomery_form: " << config.is_in_montgomery_form << std::endl;
std::cout << "host_out[0] = " << host_out[0] << std::endl;
exit(2);
}
unsigned power_after;
nvmlDeviceGetPowerUsage(device, &power_after);
std::cout << "Power after: " << std::fixed << std::setprecision(3) << 1.0e-3 * power_after << " W" << std::endl;
std::cout << "Power utilization: " << std::fixed << std::setprecision(1)
<< (float)100.0 * power_after / power_limit << " %" << std::endl;
unsigned temperature_after;
if (nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU, &temperature_after) == NVML_SUCCESS) {
std::cout << "GPU Temperature after: " << temperature_after << " C" << std::endl;
} else {
std::cerr << "Failed to get GPU temperature." << std::endl;
}
// Report performance in GOPS: Giga Operations Per Second
double GOPS = 1.0e-9 * repetitions * vector_size / (1.0e-6 * duration.count());
std::cout << "Performance: " << GOPS << " Giga Operations Per Second" << std::endl;
}
}
// clean up and exit
free(host_in1_init);
free(host_in2_init);
free(host_in1);
free(host_in2);
free(host_out);
free(host_out_ref_mul);
free(host_out_ref_add);
free(host_out_ref_sub);
cudaFree(device_in1);
cudaFree(device_in2);
cudaFree(device_out);
nvmlShutdown();
return 0;
}

View File

@@ -0,0 +1,2 @@
#! /bin/bash
./build/example/example

View File

@@ -11,6 +11,17 @@
using namespace hash;
namespace keccak {
// Input rate in bytes
const int KECCAK_256_RATE = 136;
const int KECCAK_512_RATE = 72;
// Digest size in u64
const int KECCAK_256_DIGEST = 4;
const int KECCAK_512_DIGEST = 8;
// Number of state elements in u64
const int KECCAK_STATE_SIZE = 25;
class Keccak : public Hasher<uint8_t, uint64_t>
{
public:
@@ -22,7 +33,7 @@ namespace keccak {
unsigned int output_len,
const device_context::DeviceContext& ctx) const override;
Keccak(unsigned int rate) : Hasher<uint8_t, uint64_t>(25, 25, rate, 0) {}
Keccak(unsigned int rate) : Hasher<uint8_t, uint64_t>(KECCAK_STATE_SIZE, KECCAK_STATE_SIZE, rate, 0) {}
};
} // namespace keccak

View File

@@ -27,6 +27,8 @@ namespace vec_ops {
* non-blocking and you'd need to synchronize it explicitly by running
* `cudaStreamSynchronize` or `cudaDeviceSynchronize`. If set to false, the
* function will block the current CPU thread. */
bool is_in_montgomery_form; /**< If true then vec_a, vec_b and result are in Montgomery form.
* Default value: false. */
};
/**
@@ -42,6 +44,7 @@ namespace vec_ops {
false, // is_b_on_device
false, // is_result_on_device
false, // is_async
false, // is_in_montgomery_form
};
return config;
}

View File

@@ -11,13 +11,15 @@ namespace keccak {
extern "C" cudaError_t
keccak256_cuda(uint8_t* input, int input_block_size, int number_of_blocks, uint8_t* output, HashConfig& config)
{
return Keccak(136).hash_many(input, (uint64_t*)output, number_of_blocks, input_block_size, 4, config);
return Keccak(KECCAK_256_RATE)
.hash_many(input, (uint64_t*)output, number_of_blocks, input_block_size, KECCAK_256_DIGEST, config);
}
extern "C" cudaError_t
keccak512_cuda(uint8_t* input, int input_block_size, int number_of_blocks, uint8_t* output, HashConfig& config)
{
return Keccak(72).hash_many(input, (uint64_t*)output, number_of_blocks, input_block_size, 8, config);
return Keccak(KECCAK_512_RATE)
.hash_many(input, (uint64_t*)output, number_of_blocks, input_block_size, KECCAK_512_DIGEST, config);
}
extern "C" cudaError_t build_keccak256_merkle_tree_cuda(
@@ -27,7 +29,7 @@ namespace keccak {
unsigned int input_block_len,
const merkle_tree::TreeBuilderConfig& tree_config)
{
Keccak keccak(136);
Keccak keccak(KECCAK_256_RATE);
return merkle_tree::build_merkle_tree<uint8_t, uint64_t>(
leaves, digests, height, input_block_len, keccak, keccak, tree_config);
}
@@ -39,7 +41,7 @@ namespace keccak {
unsigned int input_block_len,
const merkle_tree::TreeBuilderConfig& tree_config)
{
Keccak keccak(72);
Keccak keccak(KECCAK_512_RATE);
return merkle_tree::build_merkle_tree<uint8_t, uint64_t>(
leaves, digests, height, input_block_len, keccak, keccak, tree_config);
}

View File

@@ -161,7 +161,7 @@ namespace keccak {
0x8000000000008002, 0x8000000000000080, 0x000000000000800a, 0x800000008000000a,
0x8000000080008081, 0x8000000000008080, 0x0000000080000001, 0x8000000080008008};
__device__ void keccakf(u64 s[25])
__device__ void keccakf(u64 s[KECCAK_STATE_SIZE])
{
u64 t0, t1, t2, t3, t4;
@@ -188,7 +188,7 @@ namespace keccak {
const uint8_t* b_input = input + sid * input_block_size;
uint64_t* b_output = output + sid * output_len;
uint64_t state[25] = {}; // Initialize with zeroes
uint64_t state[KECCAK_STATE_SIZE] = {}; // Initialize with zeroes
int input_len = input_block_size;
@@ -238,12 +238,12 @@ namespace keccak {
int number_of_gpu_blocks = (number_of_states - 1) / number_of_threads + 1;
switch (rate) {
case 136:
keccak_hash_blocks<136><<<number_of_gpu_blocks, number_of_threads, 0, ctx.stream>>>(
case KECCAK_256_RATE:
keccak_hash_blocks<KECCAK_256_RATE><<<number_of_gpu_blocks, number_of_threads, 0, ctx.stream>>>(
input, input_len, output_len, number_of_states, output);
break;
case 72:
keccak_hash_blocks<72><<<number_of_gpu_blocks, number_of_threads, 0, ctx.stream>>>(
case KECCAK_512_RATE:
keccak_hash_blocks<KECCAK_512_RATE><<<number_of_gpu_blocks, number_of_threads, 0, ctx.stream>>>(
input, input_len, output_len, number_of_states, output);
break;
default:

View File

@@ -95,25 +95,64 @@ namespace vec_ops {
E *d_result, *d_alloc_vec_a, *d_alloc_vec_b;
E* d_vec_a;
const E* d_vec_b;
int is_d_alloc_vec_a_allocated = 0;
if (!config.is_a_on_device) {
CHK_IF_RETURN(cudaMallocAsync(&d_alloc_vec_a, n * sizeof(E), config.ctx.stream));
CHK_IF_RETURN(cudaMemcpyAsync(d_alloc_vec_a, vec_a, n * sizeof(E), cudaMemcpyHostToDevice, config.ctx.stream));
d_vec_a = d_alloc_vec_a;
if (config.is_in_montgomery_form) {
CHK_IF_RETURN(cudaMallocAsync(&d_alloc_vec_a, n * sizeof(E), config.ctx.stream));
CHK_IF_RETURN(cudaMemcpyAsync(d_alloc_vec_a, vec_a, n * sizeof(E), cudaMemcpyHostToDevice, config.ctx.stream));
CHK_IF_RETURN(mont::from_montgomery(d_alloc_vec_a, n, config.ctx.stream, d_alloc_vec_a));
is_d_alloc_vec_a_allocated = 1;
d_vec_a = d_alloc_vec_a;
} else {
CHK_IF_RETURN(cudaMallocAsync(&d_alloc_vec_a, n * sizeof(E), config.ctx.stream));
CHK_IF_RETURN(cudaMemcpyAsync(d_alloc_vec_a, vec_a, n * sizeof(E), cudaMemcpyHostToDevice, config.ctx.stream));
is_d_alloc_vec_a_allocated = 1;
d_vec_a = d_alloc_vec_a;
}
} else {
d_vec_a = vec_a;
if (config.is_in_montgomery_form) {
CHK_IF_RETURN(cudaMallocAsync(
&d_alloc_vec_a, n * sizeof(E), config.ctx.stream)); // Allocate in order not to change the input.
CHK_IF_RETURN(mont::from_montgomery(vec_a, n, config.ctx.stream, d_alloc_vec_a));
is_d_alloc_vec_a_allocated = 1;
d_vec_a = d_alloc_vec_a;
} else {
d_vec_a = vec_a;
}
}
int is_d_alloc_vec_b_allocated = 0;
if (!config.is_b_on_device) {
CHK_IF_RETURN(cudaMallocAsync(&d_alloc_vec_b, n * sizeof(E), config.ctx.stream));
CHK_IF_RETURN(cudaMemcpyAsync(d_alloc_vec_b, vec_b, n * sizeof(E), cudaMemcpyHostToDevice, config.ctx.stream));
d_vec_b = d_alloc_vec_b;
if (config.is_in_montgomery_form) {
CHK_IF_RETURN(cudaMallocAsync(&d_alloc_vec_b, n * sizeof(E), config.ctx.stream));
CHK_IF_RETURN(cudaMemcpyAsync(d_alloc_vec_b, vec_b, n * sizeof(E), cudaMemcpyHostToDevice, config.ctx.stream));
CHK_IF_RETURN(mont::from_montgomery(d_alloc_vec_b, n, config.ctx.stream, d_alloc_vec_b));
is_d_alloc_vec_b_allocated = 1;
d_vec_b = d_alloc_vec_b;
} else {
CHK_IF_RETURN(cudaMallocAsync(&d_alloc_vec_b, n * sizeof(E), config.ctx.stream));
CHK_IF_RETURN(cudaMemcpyAsync(d_alloc_vec_b, vec_b, n * sizeof(E), cudaMemcpyHostToDevice, config.ctx.stream));
is_d_alloc_vec_b_allocated = 1;
d_vec_b = d_alloc_vec_b;
}
} else {
d_vec_b = vec_b;
if (config.is_in_montgomery_form) {
CHK_IF_RETURN(cudaMallocAsync(
&d_alloc_vec_b, n * sizeof(E), config.ctx.stream)); // Allocate in order not to change the input.
CHK_IF_RETURN(mont::from_montgomery(vec_b, n, config.ctx.stream, d_alloc_vec_b));
is_d_alloc_vec_b_allocated = 1;
d_vec_b = d_alloc_vec_b;
} else {
d_vec_b = vec_b;
}
}
int is_d_result_allocated = 0;
if (!config.is_result_on_device) {
if (!is_in_place) {
CHK_IF_RETURN(cudaMallocAsync(&d_result, n * sizeof(E), config.ctx.stream));
is_d_result_allocated = 1;
} else {
d_result = d_vec_a;
}
@@ -129,12 +168,21 @@ namespace vec_ops {
Kernel<<<num_blocks, num_threads, 0, config.ctx.stream>>>(d_vec_a, d_vec_b, n, d_result);
if (!config.is_result_on_device) {
CHK_IF_RETURN(cudaMemcpyAsync(result, d_result, n * sizeof(E), cudaMemcpyDeviceToHost, config.ctx.stream));
CHK_IF_RETURN(cudaFreeAsync(d_result, config.ctx.stream));
if (config.is_in_montgomery_form) {
CHK_IF_RETURN(mont::to_montgomery(d_result, n, config.ctx.stream, d_result)); // Convert in-place.
CHK_IF_RETURN(cudaMemcpyAsync(result, d_result, n * sizeof(E), cudaMemcpyDeviceToHost, config.ctx.stream));
} else {
CHK_IF_RETURN(cudaMemcpyAsync(result, d_result, n * sizeof(E), cudaMemcpyDeviceToHost, config.ctx.stream));
}
} else {
if (config.is_in_montgomery_form) {
CHK_IF_RETURN(mont::to_montgomery(d_result, n, config.ctx.stream, d_result)); // Convert in-place.
}
}
if (!config.is_a_on_device && !is_in_place) { CHK_IF_RETURN(cudaFreeAsync(d_alloc_vec_a, config.ctx.stream)); }
if (!config.is_b_on_device) { CHK_IF_RETURN(cudaFreeAsync(d_alloc_vec_b, config.ctx.stream)); }
if (is_d_alloc_vec_a_allocated) { CHK_IF_RETURN(cudaFreeAsync(d_alloc_vec_a, config.ctx.stream)); }
if (is_d_alloc_vec_b_allocated) { CHK_IF_RETURN(cudaFreeAsync(d_alloc_vec_b, config.ctx.stream)); }
if (is_d_result_allocated) { CHK_IF_RETURN(cudaFreeAsync(d_result, config.ctx.stream)); }
if (!config.is_async) return CHK_STICKY(cudaStreamSynchronize(config.ctx.stream));

View File

@@ -28,6 +28,8 @@ type VecOpsConfig struct {
* non-blocking and you'll need to synchronize it explicitly by calling
* `SynchronizeStream`. If set to false, the function will block the current CPU thread. */
IsAsync bool
/* If true then vec_a, vec_b and result are in Montgomery form. Default value: false. */
IsInMontgomeryForm bool
}
/**
@@ -42,6 +44,7 @@ func DefaultVecOpsConfig() VecOpsConfig {
false, // isBOnDevice
false, // isResultOnDevice
false, // IsAsync
false, // IsInMontgomeryForm
}
return config

View File

@@ -15,6 +15,7 @@ func TestVecOpsDefaultConfig(t *testing.T) {
false, // isBOnDevice
false, // isResultOnDevice
false, // IsAsync
false, // IsInMontgomeryForm
}
actual := DefaultVecOpsConfig()

View File

@@ -18,7 +18,7 @@ exclude = [
]
[workspace.package]
version = "2.7.1"
version = "2.8.0"
edition = "2021"
authors = [ "Ingonyama" ]
homepage = "https://www.ingonyama.com"

View File

@@ -20,6 +20,8 @@ pub struct VecOpsConfig<'a> {
/// Whether to run the vector operations asynchronously. If set to `true`, the functions will be non-blocking and you'd need to synchronize
/// it explicitly by running `stream.synchronize()`. If set to false, the functions will block the current CPU thread.
pub is_async: bool,
/// If true then vec_a, vec_b and result are in Montgomery form. Default value: false.
pub is_in_montgomery_form: bool,
}
impl<'a> Default for VecOpsConfig<'a> {
@@ -36,6 +38,7 @@ impl<'a> VecOpsConfig<'a> {
is_b_on_device: false,
is_result_on_device: false,
is_async: false,
is_in_montgomery_form: false,
}
}
}

View File

@@ -24,6 +24,6 @@ pub(crate) mod tests {
let ctx = device_context::DeviceContext::default();
let sponge = Poseidon::load(2, &ctx).unwrap();
check_build_field_merkle_tree::<_, _, Bls12_377TreeBuilder>(25, 2, &sponge, &sponge, ScalarField::zero());
check_build_field_merkle_tree::<_, _, Bls12_377TreeBuilder>(18, 2, &sponge, &sponge, ScalarField::zero());
}
}

View File

@@ -24,6 +24,6 @@ pub(crate) mod tests {
let ctx = device_context::DeviceContext::default();
let sponge = Poseidon::load(2, &ctx).unwrap();
check_build_field_merkle_tree::<_, _, Bls12_381TreeBuilder>(25, 2, &sponge, &sponge, ScalarField::zero());
check_build_field_merkle_tree::<_, _, Bls12_381TreeBuilder>(18, 2, &sponge, &sponge, ScalarField::zero());
}
}