Mirror of https://github.com/pseXperiments/icicle.git (synced 2026-01-13 09:27:58 -05:00)

Compare commits: 20 commits, V2...backend_mo
| SHA1 |
| --- |
| 42cffb1c88 |
| d3274a9eaa |
| d31a7019fe |
| 84a0d3c348 |
| eb87970325 |
| a9081aabbf |
| b564c6670d |
| 1f9f3f13ea |
| 41294b12e0 |
| 6134cfe177 |
| 34f0212c0d |
| f6758f3447 |
| e2ad621f97 |
| bdc3da98d6 |
| 36e288c1fa |
| f8d15e2613 |
| 14b39b57cc |
| 999167afe1 |
| ff374fcac7 |
| 7265d18d48 |
`.github/workflows/golang.yml` (vendored): 31 changed lines
@@ -99,11 +99,40 @@ jobs:

          path: |
            icicle/build/lib/libingo_field_${{ matrix.field.name }}.a
          retention-days: 1

  build-hashes-linux:
    name: Build hashes on Linux
    runs-on: [self-hosted, Linux, X64, icicle]
    needs: [check-changed-files, check-format]
    strategy:
      matrix:
        hash:
          - name: keccak
            build_args:
    steps:
      - name: Checkout Repo
        uses: actions/checkout@v4
      - name: Setup go
        uses: actions/setup-go@v5
        with:
          go-version: '1.20.0'
      - name: Build
        working-directory: ./wrappers/golang
        if: needs.check-changed-files.outputs.golang == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
        run: ./build.sh -hash=${{ matrix.hash.name }} ${{ matrix.hash.build_args }} # builds a single hash algorithm
      - name: Upload ICICLE lib artifacts
        uses: actions/upload-artifact@v4
        if: needs.check-changed-files.outputs.golang == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
        with:
          name: icicle-builds-${{ matrix.hash.name }}-${{ github.workflow }}-${{ github.sha }}
          path: |
            icicle/build/lib/libingo_hash.a
          retention-days: 1

  test-linux:
    name: Test on Linux
    runs-on: [self-hosted, Linux, X64, icicle]
    needs: [check-changed-files, build-curves-linux, build-fields-linux]
    needs: [check-changed-files, build-curves-linux, build-fields-linux, build-hashes-linux]
    steps:
      - name: Checkout Repo
        uses: actions/checkout@v4
`.github/workflows/rust.yml` (vendored): 13 changed lines
@@ -60,10 +60,10 @@ jobs:

        if: needs.check-changed-files.outputs.rust == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
        # Running tests from the root workspace will run all workspace members' tests by default
        # We need to limit the number of threads to avoid running out of memory on weaker machines
        # ignored tests are polynomial tests. Since they conflict with NTT tests, they are executed sperately
        # ignored tests are polynomial tests. Since they conflict with NTT tests, they are executed separately
        run: |
          cargo test --workspace --exclude icicle-babybear --release --verbose --features=g2 -- --test-threads=2 --ignored
          cargo test --workspace --exclude icicle-babybear --release --verbose --features=g2 -- --test-threads=2
          cargo test --workspace --exclude icicle-babybear --exclude icicle-stark252 --release --verbose --features=g2 -- --test-threads=2 --ignored
          cargo test --workspace --exclude icicle-babybear --exclude icicle-stark252 --release --verbose --features=g2 -- --test-threads=2

      - name: Run baby bear tests
        working-directory: ./wrappers/rust/icicle-fields/icicle-babybear

@@ -72,6 +72,13 @@ jobs:

          cargo test --release --verbose -- --ignored
          cargo test --release --verbose

      - name: Run stark252 tests
        working-directory: ./wrappers/rust/icicle-fields/icicle-stark252
        if: needs.check-changed-files.outputs.rust == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
        run: |
          cargo test --release --verbose -- --ignored
          cargo test --release --verbose

  build-windows:
    name: Build on Windows
    runs-on: windows-2022
@@ -119,6 +119,7 @@ This will ensure our custom hooks are run and will make it easier to follow our

- [nonam3e](https://github.com/nonam3e), for adding Grumpkin curve support into ICICLE
- [alxiong](https://github.com/alxiong), for adding warmup for CudaStream
- [cyl19970726](https://github.com/cyl19970726), for updating go install source in Dockerfile
- [PatStiles](https://github.com/PatStiles), for adding Stark252 field

## Help & Support
@@ -2,34 +2,54 @@

ICICLE Core is a library written in C++/CUDA. All the ICICLE primitives are implemented within ICICLE Core.

The Core is split into logical modules that can be compiled into static libraries using different [strategies](#compilation-strategies). You can then [link](#linking) these libraries with your C++ project or write your own [bindings](#writing-new-bindings-for-icicle) for other programming languages. If you want to use ICICLE with existing bindings please refer to [Rust](/icicle/rust-bindings) / [Golang](/icicle/golang-bindings).
The Core is split into logical modules that can be compiled into static libraries using different [strategies](#compilation-strategies). You can then [link](#linking) these libraries with your C++ project or write your own [bindings](#writing-new-bindings-for-icicle) for other programming languages. If you want to use ICICLE with existing bindings please refer to the [Rust](/icicle/rust-bindings) or [Golang](/icicle/golang-bindings) bindings documentation.

## Supported curves, fields and operations

### Supported curves and operations

| Operation\Curve | [bn254](https://neuromancer.sk/std/bn/bn254) | [bls12-377](https://neuromancer.sk/std/bls/BLS12-377) | [bls12-381](https://neuromancer.sk/std/bls/BLS12-381) | [bw6-761](https://eprint.iacr.org/2020/351) | grumpkin |
| --- | :---: | :---: | :---: | :---: | :---: |
| [MSM][MSM_DOCS] | ✅ | ✅ | ✅ | ✅ | ✅ |
| G2 | ✅ | ✅ | ✅ | ✅ | ❌ |
| [NTT][NTT_DOCS] | ✅ | ✅ | ✅ | ✅ | ❌ |
| ECNTT | ✅ | ✅ | ✅ | ✅ | ❌ |
| [VecOps][VECOPS_CODE] | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Polynomials][POLY_DOCS] | ✅ | ✅ | ✅ | ✅ | ❌ |
| [Poseidon](primitives/poseidon) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Merkle Tree](primitives/poseidon#the-tree-builder) | ✅ | ✅ | ✅ | ✅ | ✅ |

### Supported fields and operations

| Operation\Field | [babybear](https://eprint.iacr.org/2023/824.pdf) | [Stark252](https://docs.starknet.io/documentation/architecture_and_concepts/Cryptography/p-value/) |
| --- | :---: | :---: |
| [VecOps][VECOPS_CODE] | ✅ | ✅ |
| [Polynomials][POLY_DOCS] | ✅ | ✅ |
| [NTT][NTT_DOCS] | ✅ | ✅ |
| Extension Field | ✅ | ❌ |

### Supported hashes

| Hash | Sizes |
| --- | :---: |
| Keccak | 256, 512 |

## Compilation strategies

Most of the codebase is curve/field agnostic, which means it can be compiled for different curves and fields. When you build ICICLE Core you choose a single curve or field. If you need multiple curves or fields - you just compile ICICLE into multiple static libraries. It's that simple. Currently, the following choices are supported:
Most of the codebase is curve/field agnostic, which means it can be compiled for different curves and fields. When you build ICICLE Core you choose a single curve or field. If you need multiple curves or fields, you compile ICICLE once per curve or field that is needed. It's that simple. Currently, the following choices are supported:

- [Field mode](#compiling-for-a-field) - used for STARK fields like BabyBear / Mersenne / Goldilocks. Includes field arithmetic, NTT, Poseidon, Extension fields and other primitives.
- [Curve mode](#compiling-for-a-curve) - used for SNARK curves like BN254/ BLS curves / Grumpkin / etc. Curve mode is built upon field mode, so it includes everything that field does. It also includes curve operations / MSM / ECNTT / G2 and other curve-related primitives.
- [Field mode][COMPILE_FIELD_MODE] - used for STARK fields like BabyBear / Mersenne / Goldilocks. Includes field arithmetic, NTT, Poseidon, Extension fields and other primitives.
- [Curve mode][COMPILE_CURVE_MODE] - used for SNARK curves like BN254 / BLS curves / Grumpkin / etc. Curve mode is built upon field mode, so it includes everything that field does. It also includes curve operations / MSM / ECNTT / G2 and other curve-related primitives.

:::info

If you only want to use curve's scalar/base field, you still need to go with a curve mode. You can disable MSM with [options](#compilation-options)
If you only want to use a curve's scalar or base field, you still need to use curve mode. You can disable MSM with [options](#compilation-options).

:::

### Compiling for a field

ICICLE supports the following STARK fields:

- [BabyBear](https://eprint.iacr.org/2023/824.pdf)

Field mode includes:

- [Field arithmetic](https://github.com/ingonyama-zk/icicle/blob/main/icicle/include/fields/field.cuh) - field multiplication, addition, subtraction
- [NTT](icicle/primitives/ntt) - FFT / iFFT
- [Poseidon Hash](icicle/primitives/poseidon)
- [Vector operations](https://github.com/ingonyama-zk/icicle/blob/main/icicle/include/vec_ops/vec_ops.cuh)
- [Polynomial](#) - structs and methods to work with polynomials

You can compile ICICLE for a STARK field using this command:
You can compile ICICLE for a field using this command:

```sh
cd icicle
@@ -38,24 +58,10 @@ cmake -DFIELD=<FIELD> -S . -B build
cmake --build build -j
```
ICICLE supports the following `<FIELD>` fields:

- `babybear`

This command will output `libingo_field_<FIELD>.a` into `build/lib`.

### Compiling for a curve

ICICLE supports the following SNARK curves:

- [BN254](https://neuromancer.sk/std/bn/bn254)
- [BLS12-377](https://neuromancer.sk/std/bls/BLS12-377)
- [BLS12-381](https://neuromancer.sk/std/bls/BLS12-381)
- [BW6-761](https://eprint.iacr.org/2020/351)
- Grumpkin

Curve mode includes everything you can find in field mode with the addition of:

- [MSM](icicle/primitives/msm) - MSM / Batched MSM
- [ECNTT](#)

:::note

Field-related primitives will be compiled for the scalar field of the curve.

@@ -81,7 +87,7 @@ There exist multiple options that allow you to customize your build or enable ad

#### EXT_FIELD

Used only in a [field mode](#compiling-for-a-field) to add Extension field into a build. Adds NTT for the extension field.
Used only in [field mode][COMPILE_FIELD_MODE] to add an Extension field. Adds all supported field operations for the extension field.

Default: `OFF`

@@ -89,7 +95,7 @@ Usage: `-DEXT_FIELD=ON`

#### G2

Used only in a [curve mode](#compiling-for-a-curve) to add G2 definitions into a build. Also adds G2 MSM.
Used only in [curve mode][COMPILE_CURVE_MODE] to add G2 definitions. Also adds G2 MSM.

Default: `OFF`

@@ -97,7 +103,7 @@ Usage: `-DG2=ON`

#### ECNTT

Used only in a [curve mode](#compiling-for-a-curve) to add ECNTT function into a build.
Used only in [curve mode][COMPILE_CURVE_MODE] to add the ECNTT function.

Default: `OFF`

@@ -105,7 +111,7 @@ Usage: `-DECNTT=ON`

#### MSM

Used only in a [curve mode](#compiling-for-a-curve) to add MSM function into a build. As MSM takes a lot of time to build, you can disable it with this option to reduce compilation time.
Used only in [curve mode][COMPILE_CURVE_MODE] to add the MSM function. As MSM takes a lot of time to build, you can disable it with this option to reduce compilation time.

Default: `ON`

@@ -149,14 +155,13 @@ To link ICICLE with your project you first need to compile ICICLE with options o

Refer to our [c++ examples](https://github.com/ingonyama-zk/icicle/tree/main/examples/c%2B%2B) for more info. Take a look at this [CMakeLists.txt](https://github.com/ingonyama-zk/icicle/blob/main/examples/c%2B%2B/msm/CMakeLists.txt#L22)
## Writing new bindings for ICICLE

Since ICICLE Core is written in CUDA / C++ it's really simple to generate static libraries. These static libraries can be installed on any system and called by higher level languages such as Golang.

Static libraries can be loaded into memory once and used by multiple programs, reducing memory usage and potentially improving performance. They also allow you to separate functionality into distinct modules so your static library may need to compile only specific features that you want to use.

Let's review the [Golang bindings](golang-bindings.md) since its a pretty verbose example (compared to rust which hides it pretty well) of using static libraries. Golang has a library named `CGO` which can be used to link static libraries. Here's a basic example on how you can use cgo to link these libraries:
Let's review the [Golang bindings][GOLANG_BINDINGS] since it's a pretty verbose example (compared to rust, which hides it pretty well) of using static libraries. Golang has a library named `CGO` which can be used to link static libraries. Here's a basic example of how you can use cgo to link these libraries:

```go
/*
```

@@ -178,4 +183,14 @@ func main() {

The comments on the first line tell `CGO` which libraries to import as well as which header files to include. You can then call methods which are part of the static library and defined in the header file; `C.projective_from_affine_bn254` is an example.
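The original cgo snippet is truncated in this excerpt, so here is a minimal, self-contained sketch of the linking flow it describes. The library name (`libingo_curve_bn254.a`), the header name (`projective.h`), the C type names and the argument order of `projective_from_affine_bn254` are all assumptions made for illustration; only the function name itself is taken from the text above.

```go
package main

/*
#cgo LDFLAGS: -L${SRCDIR}/icicle/build/lib -lingo_curve_bn254 -lstdc++ -lm
#include "projective.h"
*/
import "C"

import "fmt"

func main() {
	// C-side structs for an affine input point and a projective output point.
	// The struct layouts come from the (assumed) projective.h header of the static library.
	var in C.affine_t
	var out C.projective_t

	// Call the symbol exported by the ICICLE static library through cgo
	// (argument order shown here is illustrative, not taken from this page).
	C.projective_from_affine_bn254(&out, &in)

	fmt.Println("called projective_from_affine_bn254 through cgo")
}
```

The `#cgo LDFLAGS` comment is what tells the Go toolchain where the static library lives and which libraries to link, which is the mechanism the paragraph above refers to.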
If you wish to create your own bindings for a language of your choice we suggest you start by investigating how you can call static libraries.
<!-- Begin Links -->

[GOLANG_BINDINGS]: golang-bindings.md
[COMPILE_CURVE_MODE]: #compiling-for-a-curve
[COMPILE_FIELD_MODE]: #compiling-for-a-field
[NTT_DOCS]: primitives/ntt
[MSM_DOCS]: primitives/msm
[POLY_DOCS]: polynomials/overview
[VECOPS_CODE]: https://github.com/ingonyama-zk/icicle/blob/main/icicle/include/vec_ops/vec_ops.cuh

<!-- End Links -->
@@ -1,7 +1,7 @@

# Golang bindings

Golang bindings allow you to use ICICLE as a golang library.
The source code for all Golang libraries can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang).
The source code for all Golang packages can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang).

The Golang bindings are comprised of multiple packages.

@@ -9,7 +9,7 @@ The Golang bindings are comprised of multiple packages.

[`cuda-runtime`](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang/cuda_runtime) which defines abstractions for CUDA methods for allocating memory, initializing and managing streams, and `DeviceContext` which enables users to define and keep track of devices.

Each curve has its own package which you can find [here](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang/curves). If your project uses BN254 you only need to install that single package named [`bn254`](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang/curves/bn254).
Each supported curve, field, and hash has its own package which you can find in the respective directories [here](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang). If your project uses BN254 you only need to import that single package named [`bn254`](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang/curves/bn254).

## Using ICICLE Golang bindings in your project

@@ -31,22 +31,30 @@ For a specific commit

```sh
go get github.com/ingonyama-zk/icicle@<commit_id>
```

To build the shared libraries you can run this script:
To build the shared libraries you can run [this](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang/build.sh) script:

```bash
./build.sh [-curve=<curve> | -field=<field>] [-cuda_version=<version>] [-g2] [-ecntt] [-devmode]
```

```sh
./build.sh [-curve=<curve>] [-field=<field>] [-hash=<hash>] [-cuda_version=<version>] [-g2] [-ecntt] [-devmode]

curve - The name of the curve to build or "all" to build all supported curves
field - The name of the field to build or "all" to build all supported fields
hash - The name of the hash to build or "all" to build all supported hashes
-g2 - Optional - build with G2 enabled
-ecntt - Optional - build with ECNTT enabled
-devmode - Optional - build in devmode
-help - Optional - Displays usage information
```

- **`curve`** - The name of the curve to build or "all" to build all curves
- **`field`** - The name of the field to build or "all" to build all fields
- **`g2`** - Optional - build with G2 enabled
- **`ecntt`** - Optional - build with ECNTT enabled
- **`devmode`** - Optional - build in devmode
- Usage can be displayed with the flag `-help`

:::note

If more than one curve or more than one field or more than one hash is supplied, the last one supplied will be built.

:::

To build ICICLE libraries for all supported curves with G2 and ECNTT enabled:

```bash
./build.sh all -g2 -ecntt
./build.sh -curve=all -g2 -ecntt
```

If you wish to build for a specific curve, for example bn254, without G2 or ECNTT enabled.

@@ -62,8 +70,8 @@ import (

```go
	"github.com/stretchr/testify/assert"
	"testing"

	"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
	cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
)
...
```

@@ -73,11 +81,9 @@ import (

To run all tests, for all curves:

```bash
go test --tags=g2 ./... -count=1
go test ./... -count=1
```

If you don't want to include g2 tests then drop `--tags=g2`.

If you wish to run tests for a specific curve:

```bash
@@ -106,3 +112,25 @@ func main() {
```

Replace `/path/to/shared/libs` with the actual path where the shared libraries are located on your system.

## Supported curves, fields and operations

### Supported curves and operations

| Operation\Curve | bn254 | bls12_377 | bls12_381 | bw6-761 | grumpkin |
| --- | :---: | :---: | :---: | :---: | :---: |
| MSM | ✅ | ✅ | ✅ | ✅ | ✅ |
| G2 | ✅ | ✅ | ✅ | ✅ | ❌ |
| NTT | ✅ | ✅ | ✅ | ✅ | ❌ |
| ECNTT | ✅ | ✅ | ✅ | ✅ | ❌ |
| VecOps | ✅ | ✅ | ✅ | ✅ | ✅ |
| Polynomials | ✅ | ✅ | ✅ | ✅ | ❌ |

### Supported fields and operations

| Operation\Field | babybear |
| --- | :---: |
| VecOps | ✅ |
| Polynomials | ✅ |
| NTT | ✅ |
| Extension Field | ✅ |
@@ -1,9 +1,5 @@

# ECNTT

### Supported curves

`bls12-377`, `bls12-381`, `bn254`

## ECNTT Method

The `ECNtt[T any]()` function performs the Elliptic Curve Number Theoretic Transform (EC-NTT) on the input points slice, using the provided dir (direction) and cfg (configuration), and stores the results in the results slice.

@@ -12,14 +8,13 @@ The `ECNtt[T any]()` function performs the Elliptic Curve Number Theoretic Trans

```go
func ECNtt[T any](points core.HostOrDeviceSlice, dir core.NTTDir, cfg *core.NTTConfig[T], results core.HostOrDeviceSlice) core.IcicleError
```

### Parameters:
### Parameters

- **`points`**: A slice of elliptic curve points (in projective coordinates) that will be transformed. The slice can be stored on the host or the device, as indicated by the `core.HostOrDeviceSlice` type.
- **`dir`**: The direction of the EC-NTT transform, either `core.KForward` or `core.KInverse`.
- **`cfg`**: A pointer to an `NTTConfig` object, containing configuration options for the NTT operation.
- **`results`**: A slice that will store the transformed elliptic curve points (in projective coordinates). The slice can be stored on the host or the device, as indicated by the `core.HostOrDeviceSlice` type.

### Return Value

- **`CudaError`**: A `core.IcicleError` value, which will be `core.IcicleErrorCode(0)` if the EC-NTT operation was successful, or an error if something went wrong.
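To show how these parameters fit together, here is a minimal, hedged sketch of a forward EC-NTT on host-resident data. The `GenerateProjectivePoints` helper and the exact error-checking style are assumptions for illustration and are not taken from this page; the `ECNtt` signature and `core.KForward` are quoted from the section above.

```go
package main

import (
	"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
	bn254 "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254"
)

func main() {
	size := 1 << 10

	// NTT and ECNTT share the NTTConfig type; the NTT domain must already be initialized.
	cfg := bn254.GetDefaultNttConfig()

	// Input points in projective coordinates (GenerateProjectivePoints is an assumed helper).
	points := bn254.GenerateProjectivePoints(size)

	// Host-resident output slice for the transformed points.
	results := make(core.HostSlice[bn254.Projective], size)

	// Forward EC-NTT; core.KInverse would run the inverse transform.
	err := bn254.ECNtt(points, core.KForward, &cfg, results)
	if err.IcicleErrorCode != core.IcicleErrorCode(0) {
		panic("ECNTT operation failed")
	}
}
```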
@@ -68,8 +63,8 @@ func GetDefaultNTTConfig[T any](cosetGen T) NTTConfig[T]

```go
package main

import (
	"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
	cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
)

func Main() {
@@ -94,4 +89,4 @@ func Main() {
		panic("ECNTT operation failed")
	}
}
```
@@ -2,15 +2,11 @@

To understand the theory behind the MSM pre-computation technique refer to Niall Emmart's [talk](https://youtu.be/KAWlySN7Hm8?feature=shared&t=1734).

### Supported curves

`bls12-377`, `bls12-381`, `bn254`, `bw6-761`, `grumpkin`

## Core package

## MSM `PrecomputeBases`
### MSM PrecomputeBases

`PrecomputeBases` and `G2PrecomputeBases` exist for all supported curves.

#### Description

@@ -42,9 +38,9 @@ package main

import (
	"log"

	"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
	cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
	bn254 "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254"
)

func main() {

@@ -85,9 +81,9 @@ package main

import (
	"log"

	"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
	cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
	g2 "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254/g2"
)

func main() {
@@ -1,62 +1,57 @@

# MSM

### Supported curves

`bls12-377`, `bls12-381`, `bn254`, `bw6-761`, `grumpkin`

## MSM Example

```go
package main

import (
	"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
	cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
	bn254 "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254"
)

func main() {
	// Obtain the default MSM configuration.
	cfg := bn254.GetDefaultMSMConfig()

	// Define the size of the problem, here 2^18.
	size := 1 << 18

	// Generate scalars and points for the MSM operation.
	scalars := bn254.GenerateScalars(size)
	points := bn254.GenerateAffinePoints(size)

	// Create a CUDA stream for asynchronous operations.
	stream, _ := cr.CreateStream()
	var p bn254.Projective

	// Allocate memory on the device for the result of the MSM operation.
	var out core.DeviceSlice
	_, e := out.MallocAsync(p.Size(), p.Size(), stream)

	if e != cr.CudaSuccess {
		panic(e)
	}

	// Set the CUDA stream in the MSM configuration.
	cfg.Ctx.Stream = &stream
	cfg.IsAsync = true

	// Perform the MSM operation.
	e = bn254.Msm(scalars, points, &cfg, out)

	if e != cr.CudaSuccess {
		panic(e)
	}

	// Allocate host memory for the results and copy the results from the device.
	outHost := make(core.HostSlice[bn254.Projective], 1)
	cr.SynchronizeStream(&stream)
	outHost.CopyFromDevice(&out)

	// Free the device memory allocated for the results.
	out.Free()
}
```

@@ -124,7 +119,6 @@ Use `GetDefaultMSMConfig` to obtain a default configuration, which can then be c

```go
func GetDefaultMSMConfig() MSMConfig
```
## How do I toggle between the supported algorithms?

When creating your MSM Config you may state which algorithm you wish to use. `cfg.Ctx.IsBigTriangle = true` will activate Large triangle accumulation and `cfg.Ctx.IsBigTriangle = false` will activate Bucket accumulation.
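A minimal sketch of that toggle in practice, reusing `scalars`, `points` and `out` exactly as they are prepared in the MSM example above (nothing here beyond the flag itself is new):

```go
// Start from the default configuration, then choose the accumulation algorithm.
cfg := bn254.GetDefaultMSMConfig()

// Large triangle accumulation:
cfg.Ctx.IsBigTriangle = true

// ...or Bucket accumulation:
cfg.Ctx.IsBigTriangle = false

// Run the MSM with the chosen configuration (scalars, points and out as in the example above).
e := bn254.Msm(scalars, points, &cfg, out)
if e != cr.CudaSuccess {
	panic(e)
}
```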
@@ -161,13 +155,11 @@ out.Malloc(batchSize*p.Size(), p.Size())

To activate G2 support first you must make sure you are building the static libraries with the G2 feature enabled as described in the [Golang building instructions](../golang-bindings.md#using-icicle-golang-bindings-in-your-project).

Now you may import the `g2` package of the specified curve.

```go
import (
	"github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254/g2"
)
```

@@ -177,23 +169,23 @@ This package include `G2Projective` and `G2Affine` points as well as a `G2Msm` m

```go
package main

import (
	"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
	bn254 "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254"
	g2 "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254/g2"
)

func main() {
	cfg := bn254.GetDefaultMSMConfig()
	size := 1 << 12
	batchSize := 3
	totalSize := size * batchSize
	scalars := bn254.GenerateScalars(totalSize)
	points := g2.G2GenerateAffinePoints(totalSize)

	var p g2.G2Projective
	var out core.DeviceSlice
	out.Malloc(batchSize*p.Size(), p.Size())
	g2.G2Msm(scalars, points, &cfg, out)
}
```
@@ -2,8 +2,7 @@

To learn more about the theory of Multi GPU programming refer to [this part](../multi-gpu.md) of the documentation.

Here we will cover the core multi GPU apis and a [example](#a-multi-gpu-example)
Here we will cover the core multi GPU apis and an [example](#a-multi-gpu-example)

## A Multi GPU example

@@ -13,7 +12,6 @@ In this example we will display how you can

2. For every GPU launch a thread and set an active device per thread.
3. Execute an MSM on each GPU

```go
package main

@@ -21,9 +19,9 @@ import (
	"fmt"
	"sync"

	"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
	cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
	bn254 "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254"
)

func main() {
```

@@ -79,13 +77,13 @@ To streamline device management we offer as part of `cuda_runtime` package metho

Runs a given function on a specific GPU device, ensuring that all CUDA calls within the function are executed on the selected device.

In Go, most concurrency can be done via Goroutines. However, there is no guarantee that a goroutine stays on a specific host thread.

`RunOnDevice` was designed to solve this caveat and insure that the goroutine will stay on a specific host thread.
`RunOnDevice` was designed to solve this caveat and ensure that the goroutine will stay on a specific host thread.

`RunOnDevice` will lock a goroutine into a specific host thread, sets a current GPU device, runs a provided function, and unlocks the goroutine from the host thread after the provided function finishes.
`RunOnDevice` locks a goroutine into a specific host thread, sets a current GPU device, runs a provided function, and unlocks the goroutine from the host thread after the provided function finishes.

While the goroutine is locked to the host thread, the Go runtime will not assign other goroutine's to that host thread.
While the goroutine is locked to the host thread, the Go runtime will not assign other goroutines to that host thread.

**Parameters:**

@@ -96,7 +94,10 @@ While the goroutine is locked to the host thread, the Go runtime will not assign

**Behavior:**

- The function `funcToRun` is executed in a new goroutine that is locked to a specific OS thread to ensure that all CUDA calls within the function target the specified device.
- It's important to note that any goroutines launched within `funcToRun` are not automatically bound to the same GPU device. If necessary, `RunOnDevice` should be called again within such goroutines with the same `deviceId`.

:::note
Any goroutines launched within `funcToRun` are not automatically bound to the same GPU device. If necessary, `RunOnDevice` should be called again within such goroutines with the same `deviceId`.
:::

**Example:**

@@ -111,6 +112,10 @@ RunOnDevice(0, func(args ...any) {
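The original example snippet is truncated in this excerpt. A minimal sketch of the intended usage, based on the `RunOnDevice(0, func(args ...any) { ... })` call shown above, could look like the following; forwarding extra arguments after the callback is an assumption about the signature rather than something stated on this page.

```go
// Pin the goroutine to a host thread, make device 0 current, and run the work there.
cr.RunOnDevice(0, func(args ...any) {
	// Everything in this callback targets GPU 0, including stream creation.
	payload := args[0].(int)
	stream, _ := cr.CreateStream()

	fmt.Println("running on device 0 with payload", payload)

	cr.SynchronizeStream(&stream)
}, 42)
```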
Sets the active device for the current host thread. All subsequent CUDA calls made from this thread will target the specified device.

:::warning
This function should not be used directly in conjunction with goroutines. If you want to run multi-gpu scenarios with goroutines you should use [RunOnDevice](#runondevice)
:::

**Parameters:**

- **`device int`**: The ID of the device to set as the current device.
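A hedged sketch of how this is typically combined with thread locking; the section heading is not visible in this excerpt, so the method name `cr.SetDevice` and the `"runtime"` import are assumptions for illustration.

```go
// Lock the goroutine to its host thread before changing the active device; otherwise
// the Go scheduler may move it to a thread whose active device is different.
runtime.LockOSThread()
defer runtime.UnlockOSThread()

// Make device 1 current for this host thread (method name is an assumption).
cr.SetDevice(1)

// Subsequent CUDA calls from this thread target device 1.
stream, _ := cr.CreateStream()
cr.SynchronizeStream(&stream)
```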
@@ -147,4 +152,4 @@ Retrieves the device associated with a given pointer.

- **`int`**: The device ID associated with the memory pointed to by `ptr`.

This documentation should provide a clear understanding of how to effectively manage multiple GPUs in Go applications using CUDA, with a particular emphasis on the `RunOnDevice` function for executing tasks on specific GPUs.
@@ -1,58 +1,54 @@

# NTT

### Supported curves

`bls12-377`, `bls12-381`, `bn254`, `bw6-761`

## NTT Example

```go
package main

import (
	"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
	cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
	bn254 "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254"

	"github.com/consensys/gnark-crypto/ecc/bn254/fr/fft"
)

func init() {
	cfg := bn254.GetDefaultNttConfig()
	initDomain(18, cfg)
}

func initDomain[T any](largestTestSize int, cfg core.NTTConfig[T]) core.IcicleError {
	rouMont, _ := fft.Generator(uint64(1 << largestTestSize))
	rou := rouMont.Bits()
	rouIcicle := bn254.ScalarField{}

	rouIcicle.FromLimbs(rou[:])
	e := bn254.InitDomain(rouIcicle, cfg.Ctx, false)
	return e
}

func main() {
	// Obtain the default NTT configuration with a predefined coset generator.
	cfg := bn254.GetDefaultNttConfig()

	// Define the size of the input scalars.
	size := 1 << 18

	// Generate scalars for the NTT operation.
	scalars := bn254.GenerateScalars(size)

	// Set the direction of the NTT (forward or inverse).
	dir := core.KForward

	// Allocate memory for the results of the NTT operation.
	results := make(core.HostSlice[bn254.ScalarField], size)

	// Perform the NTT operation.
	err := bn254.Ntt(scalars, dir, &cfg, results)
	if err.CudaErrorCode != cr.CudaSuccess {
		panic("NTT operation failed")
	}
}
```

@@ -146,10 +142,10 @@ import (

```go
)

func example() {
	cfg := GetDefaultNttConfig()
	err := ReleaseDomain(cfg.Ctx)
	if err != nil {
		// Handle the error
	}
}
```
@@ -1,12 +1,14 @@

# Vector Operations

## Overview

Icicle is exposing a number of vector operations which a user can control:
Icicle exposes a number of vector operations which a user can use:

* The VecOps API provides efficient vector operations such as addition, subtraction, and multiplication (see the sketch below).
* MatrixTranspose API allows a user to perform a transpose on a vector representation of a matrix
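Because the vector-addition example bodies are truncated in this excerpt, here is a hedged sketch of what a VecOps call can look like. The `VecOp` entry point, its return type, and the `core.DefaultVecOpsConfig()` / `core.Add` names are assumptions for illustration rather than quotations from this page.

```go
package main

import (
	"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
	bn254 "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254"
)

func main() {
	size := 1 << 12

	// Two host-resident input vectors and one output vector of scalar field elements.
	a := bn254.GenerateScalars(size)
	b := bn254.GenerateScalars(size)
	out := make(core.HostSlice[bn254.ScalarField], size)

	// Element-wise addition; core.Sub and core.Mul would select the other operations
	// (names assumed for illustration).
	cfg := core.DefaultVecOpsConfig()
	err := bn254.VecOp(a, b, out, cfg, core.Add)
	if err.IcicleErrorCode != core.IcicleErrorCode(0) {
		panic("vector addition failed")
	}
}
```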
## VecOps API Documentation

### Example

#### Vector addition

@@ -15,9 +17,9 @@ Icicle is exposing a number of vector operations which a user can control:

package main

import (
	"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
	cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
	bn254 "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254"
)

func main() {

@@ -41,9 +43,9 @@ func main() {

package main

import (
	"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
	cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
	bn254 "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254"
)

func main() {

@@ -67,9 +69,9 @@ func main() {

package main

import (
	"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
	cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
	bn254 "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254"
)

func main() {

@@ -183,4 +185,4 @@ if err.IcicleErrorCode != core.IcicleErrorCode(0) {

```go
// ...
```

In this example, the `TransposeMatrix` function is used to transpose a 5x4 matrix stored in a 1D slice. The input and output slices are stored on the host (CPU), and the operation is executed synchronously.
@@ -165,7 +165,36 @@ cargo bench

#### ICICLE Golang

Golang is WIP in v1, coming soon. Please checkout a previous [release v0.1.0](https://github.com/ingonyama-zk/icicle/releases/tag/v0.1.0) for golang bindings.
The Golang bindings require compiling ICICLE Core first. We supply a [build script](https://github.com/ingonyama-zk/icicle/blob/main/wrappers/golang/build.sh) to help build what you need.

Script usage:

```sh
./build.sh [-curve=<curve>] [-field=<field>] [-hash=<hash>] [-cuda_version=<version>] [-g2] [-ecntt] [-devmode]

curve - The name of the curve to build or "all" to build all supported curves
field - The name of the field to build or "all" to build all supported fields
hash - The name of the hash to build or "all" to build all supported hashes
-g2 - Optional - build with G2 enabled
-ecntt - Optional - build with ECNTT enabled
-devmode - Optional - build in devmode
```

:::note

If more than one curve or more than one field or more than one hash is supplied, the last one supplied will be built.

:::

Once the library has been built, you can use and test the Golang bindings.

To test a specific curve, field or hash, change to its directory and then run:

```sh
go test ./tests -count=1 -failfast -timeout 60m -p 2 -v
```

You will be able to see each test that runs, how long it takes and whether it passed or failed.

### Running ICICLE examples

@@ -185,8 +214,8 @@ Read through the compile.sh and CMakeLists.txt to understand how to link your ow

:::

#### Running with Docker

In each example directory, ZK-container files are located in a subdirectory `.devcontainer`.

```sh
@@ -215,4 +244,4 @@ Inside the container you can run the same commands:
./run.sh
```

You can now experiment with our other examples, perhaps try to run a rust or golang example next.
@@ -2,7 +2,7 @@

:::info

If you are looking for the Multi GPU API documentation refer here for [Rust](./rust-bindings/multi-gpu.md).
If you are looking for the Multi GPU API documentation refer [here](./rust-bindings/multi-gpu.md) for Rust and [here](./golang-bindings/multi-gpu.md) for Golang.

:::

@@ -10,12 +10,11 @@ One common challenge with Zero-Knowledge computation is managing the large input

Multi-GPU programming involves developing software to operate across multiple GPU devices. Let's first explore different approaches to Multi-GPU programming, then we will cover how ICICLE allows you to easily develop your ZK computations to run across many GPUs.

## Approaches to Multi GPU programming

There are many [different strategies](https://github.com/NVIDIA/multi-gpu-programming-models) available for implementing multi GPU; however, they can be split into two categories.

### GPU Server approach

This approach usually involves a single or multiple CPUs opening threads to read / write from multiple GPUs. You can think about it as a scaled up HOST - Device model.

@@ -23,8 +22,7 @@ This approach usually involves a single or multiple CPUs opening threads to read

This approach won't let us tackle larger computation sizes but it will allow us to compute multiple computations which we wouldn't be able to load onto a single GPU.

For example let's say that you had to compute two MSMs of size 2^26 on a 16GB VRAM GPU, you would normally have to perform them asynchronously. However, if you double the number of GPUs in your system you can now run them in parallel.

### Inter GPU approach

@@ -32,18 +30,17 @@ This approach involves a more sophisticated approach to multi GPU computation. U

This approach requires redesigning the algorithm at the software level to be compatible with splitting amongst devices. In some cases, to lower latency to a minimum, special inter GPU connections would be installed on a server to allow direct communication between multiple GPUs.

# Writing ICICLE Code for Multi GPUs
## Writing ICICLE Code for Multi GPUs

The approach we have taken for the moment is a GPU Server approach; we assume you have a machine with multiple GPUs and you wish to run some computation on each GPU.

To dive deeper and learn about the API check out the docs for our different ICICLE APIs:

- [Rust Multi GPU APIs](./rust-bindings/multi-gpu.md)
- [Golang Multi GPU APIs](./golang-bindings/multi-gpu.md)
- C++ Multi GPU APIs

## Best practices

- Never hardcode device IDs; if you want your software to take advantage of all GPUs on a machine, use methods such as `get_device_count` to support an arbitrary number of GPUs (see the sketch below).
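A minimal sketch of that practice in Go, assuming the `cuda_runtime` package exposes a device-count query (shown here as `cr.GetDeviceCount()`; the exact name and signature are assumptions, while `RunOnDevice` comes from the Golang Multi GPU API):

```go
// Query how many GPUs are visible instead of hardcoding device IDs.
count, err := cr.GetDeviceCount()
if err != cr.CudaSuccess {
	panic("could not query device count")
}

// Launch one worker per available device.
for deviceId := 0; deviceId < count; deviceId++ {
	cr.RunOnDevice(deviceId, func(args ...any) {
		// Per-device work goes here, as in the multi GPU example.
	})
}
```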
@@ -57,7 +54,7 @@ Multi GPU support should work with ZK-Containers by simply defining which device

```sh
docker run -it --gpus '"device=0,2"' zk-container-image
```

If you wish to expose all GPUs:

```sh
docker run --gpus all zk-container-image
```
@@ -2,10 +2,6 @@

[](https://github.com/ingonyama-zk/icicle/releases)

[ICICLE](https://github.com/ingonyama-zk/icicle) is a cryptography library for ZK using GPUs. ICICLE implements blazing fast cryptographic primitives such as EC operations, MSM, NTT, Poseidon hash and more on GPU.

ICICLE allows developers with minimal GPU experience to effortlessly accelerate their ZK application; from our experiments, even the most naive implementation may yield 10X improvement in proving times.

@@ -17,28 +13,26 @@ ICICLE has been used by many leading ZK companies such as [Celer Network](https:

We understand that not all developers have access to a GPU and we don't want this to limit anyone from developing with ICICLE.
Here are some ways we can help you gain access to GPUs:

:::note

If none of the following options suit your needs, contact us on [telegram](https://t.me/RealElan) for assistance. We're committed to ensuring that a lack of a GPU doesn't become a bottleneck for you. If you need help with setup or any other issues, we're here to help you.

:::

### Grants

At Ingonyama we are interested in accelerating the progress of ZK and cryptography. If you are an engineer, developer or an academic researcher we invite you to check out [our grant program](https://www.ingonyama.com/blog/icicle-for-researchers-grants-challenges). We will give you access to GPUs and even pay you to do your dream research!

### Google Colab

This is a great way to get started with ICICLE instantly. Google Colab offers free GPU access to a NVIDIA T4 instance, it's acquired with 16 GB of memory which should be enough for experimenting and even prototyping with ICICLE.
This is a great way to get started with ICICLE instantly. Google Colab offers free GPU access to an NVIDIA T4 instance with 16 GB of memory, which should be enough for experimenting and even prototyping with ICICLE.

For an extensive guide on how to set up Google Colab with ICICLE refer to [this article](./colab-instructions.md).

If none of these options are appropriate for you reach out to us on [telegram](https://t.me/RealElan) we will do our best to help you.

### Vast.ai

[Vast.ai](https://vast.ai/) is a global GPU marketplace where you can rent many different types of GPUs by the hour for [competitive pricing](https://vast.ai/pricing). They provide on-demand and interruptible rentals depending on your need or use case; you can learn more about their rental types [here](https://vast.ai/faq#rental-types).

:::note

If none of these options suit your needs, contact us on [telegram](https://t.me/RealElan) for assistance. We're committed to ensuring that a lack of a GPU doesn't become a bottleneck for you. If you need help with setup or any other issues, we're here to do our best to help you.

:::

## What can you do with ICICLE?

[ICICLE](https://github.com/ingonyama-zk/icicle) can be used in the same way you would use any other cryptography library. While developing and integrating ICICLE into many proof systems, we found some use case categories:
@@ -7,6 +7,7 @@ The Polynomial API offers a robust framework for polynomial operations within a

## Key Features

### Backend Agnostic Architecture

Our API is structured to be independent of any specific computational backend. While a CUDA backend is currently implemented, the architecture facilitates easy integration of additional backends. This capability allows users to perform polynomial operations without the need to tailor their code to specific hardware, enhancing code portability and scalability.

### Templating in the Polynomial API

@@ -27,15 +28,19 @@ In this template:

- **`Image`**: Defines the type of the output values of the polynomial. This is typically the same as the coefficients.

#### Default instantiation

```cpp
extern template class Polynomial<scalar_t>;
```

#### Extended use cases

The templated nature of the Polynomial API also supports more complex scenarios. For example, coefficients and images could be points on an elliptic curve (EC points), which are useful in cryptographic applications and advanced algebraic structures. This approach allows the API to be extended easily to support new algebraic constructions without modifying the core implementation.

### Supported Operations

The Polynomial class encapsulates a polynomial, providing a variety of operations:

- **Construction**: Create polynomials from coefficients or evaluations on roots-of-unity domains.
- **Arithmetic Operations**: Perform addition, subtraction, multiplication, and division.
- **Evaluation**: Directly evaluate polynomials at specific points or across a domain.

@@ -47,6 +52,7 @@ The Polynomial class encapsulates a polynomial, providing a variety of operation

This section outlines how to use the Polynomial API in C++. Bindings for Rust and Go are detailed under the Bindings sections.

### Backend Initialization

Initialization with an appropriate factory is required to configure the computational context and backend.

```cpp
@@ -57,10 +63,12 @@ Initialization with an appropriate factory is required to configure the computat
Polynomial::initialize(std::make_shared<CUDAPolynomialFactory>());
```

:::note Icicle is built to a library per field/curve. Initialization must be done per library. That is, applications linking to multiple curves/fields should do it per curve/field.
:::note
Initialization of a factory must be done per linked curve or field.
:::

### Construction

Polynomials can be constructed from coefficients, from evaluations on roots-of-unity domains, or by cloning existing polynomials.

```cpp
@@ -80,10 +88,11 @@ auto p_cloned = p.clone(); // p_cloned and p do not share memory
```

:::note
The coefficients or evaluations may be allocated either on host or device memory. In both cases the memory is copied to backend device.
The coefficients or evaluations may be allocated either on host or device memory. In both cases the memory is copied to the backend device.
:::

### Arithmetic

Constructed polynomials can be used for various arithmetic operations:

```cpp
@@ -105,7 +114,8 @@ Polynomial operator%(const Polynomial& rhs) const; // returns remainder R(x)
Polynomial divide_by_vanishing_polynomial(uint64_t degree) const; // division by the vanishing polynomial V(x)=X^N-1
```

#### Example:
#### Example

Given polynomials $A(x)$, $B(x)$, $C(x)$ and $V(x)$ the vanishing polynomial.

@@ -117,6 +127,7 @@ auto H = (A*B-C).divide_by_vanishing_polynomial(N);

### Evaluation

Evaluate polynomials at arbitrary domain points or across a domain.

```cpp
@@ -138,7 +149,9 @@ auto evaluations = std::make_unique<scalar_t[]>(domain_size); // can be device m
f.evaluate_on_domain(domain, domain_size, evaluations);
```

:::note For special domains such as roots of unity this method is not the most efficient for two reasons:
:::note
For special domains such as roots of unity, this method is not the most efficient for two reasons:

- Need to build the domain of size N.
- The implementation is not trying to identify this special domain.

@@ -146,11 +159,12 @@ Therefore the computation is typically $O(n^2)$ rather than $O(nlogn)$.

See the 'device views' section for more details.
:::
### Manipulations
|
||||
|
||||
Beyond arithmetic, the API supports efficient polynomial manipulations:
|
||||
|
||||
#### Monomials
|
||||
|
||||
```cpp
|
||||
// Monomial operations
|
||||
Polynomial& add_monomial_inplace(Coeff monomial_coeff, uint64_t monomial = 0);
|
||||
@@ -160,31 +174,35 @@ Polynomial& sub_monomial_inplace(Coeff monomial_coeff, uint64_t monomial = 0);
|
||||
The ability to add or subtract monomials directly and in-place is an efficient way to manipualte polynomials.
|
||||
|
||||
Example:
|
||||
|
||||
```cpp
|
||||
f.add_monomial_in_place(scalar_t::from(5)); // f(x) += 5
|
||||
f.sub_monomial_in_place(scalar_t::from(3), 8); // f(x) -= 3x^8
|
||||
```
|
||||
|
||||
#### Computing the degree of a Polynomial
|
||||
|
||||
```cpp
|
||||
// Degree computation
|
||||
int64_t degree();
|
||||
```
|
||||
|
||||
The degree of a polynomial is a fundamental characteristic that describes the highest power of the variable in the polynomial expression with a non-zero coefficient.
|
||||
The `degree()` function in the API returns the degree of the polynomial, corresponding to the highest exponent with a non-zero coefficient.
|
||||
The `degree()` function in the API returns the degree of the polynomial, corresponding to the highest exponent with a non-zero coefficient.
|
||||
|
||||
- For the polynomial $f(x) = x^5 + 2x^3 + 4$, the degree is 5 because the highest power of $x$ with a non-zero coefficient is 5.
|
||||
- For a scalar value such as a constant term (e.g., $f(x) = 7$, the degree is considered 0, as it corresponds to $x^0$.
|
||||
- The degree of the zero polynomial, $f(x) = 0$, where there are no non-zero coefficients, is defined as -1. This special case often represents an "empty" or undefined state in many mathematical contexts.
|
||||
|
||||
Example:
|
||||
|
||||
```cpp
|
||||
auto f = /*some expression*/;
|
||||
auto degree_of_f = f.degree();
|
||||
```
|
||||
|
||||
#### Slicing
|
||||
|
||||
```cpp
|
||||
// Slicing and selecting even or odd components.
|
||||
Polynomial slice(uint64_t offset, uint64_t stride, uint64_t size = 0 /*0 means take all elements*/);
|
||||
@@ -195,6 +213,7 @@ Polynomial odd();
|
||||
The Polynomial API provides methods for slicing polynomials and selecting specific components, such as even or odd indexed terms. Slicing allows extracting specific sections of a polynomial based on an offset, stride, and size.
|
||||
|
||||
The following examples demonstrate folding a polynomial's even and odd parts and arbitrary slicing:
|
||||
|
||||
```cpp
|
||||
// folding a polynomial's even and odd parts with randomness
|
||||
auto x = rand();
|
||||
@@ -207,13 +226,15 @@ auto first_quarter = f.slice(0 /*offset*/, 1 /*stride*/, f.degree()/4 /*size*/);
|
||||
```
|
||||
|
||||
### Memory access (copy/view)
|
||||
Access to the polynomial's internal state can be vital for operations like commitment schemes or when more efficient custom operations are necessary. This can be done in one of two ways:
|
||||
- **Copy** the coefficients or evaluations to user allocated memory or
|
||||
- **View** into the device memory without copying.
|
||||
|
||||
#### Copy
|
||||
Copy the polynomial coefficients to either host or device allocated memory.
|
||||
:::note copying to host memory is backend agnostic while copying to device memory requires the memory to be allocated on the corresponding backend.
|
||||
Access to the polynomial's internal state can be vital for operations like commitment schemes or when more efficient custom operations are necessary. This can be done either by copying or viewing the polynomial.
|
||||
|
||||
#### Copying
|
||||
|
||||
Copies the polynomial coefficients to either host or device allocated memory.
|
||||
|
||||
:::note
|
||||
Copying to host memory is backend agnostic while copying to device memory requires the memory to be allocated on the corresponding backend.
|
||||
:::
|
||||
|
||||
```cpp
|
||||
@@ -222,6 +243,7 @@ uint64_t copy_coeffs(Coeff* coeffs, uint64_t start_idx, uint64_t end_idx) const;
|
||||
```
|
||||
|
||||
Example:
|
||||
|
||||
```cpp
|
||||
auto coeffs_device = /*allocate CUDA or host memory*/
|
||||
f.copy_coeffs(coeffs_device, 0/*start*/, f.degree());
|
||||
@@ -232,7 +254,8 @@ auto rv = msm::MSM(coeffs_device, points, msm_size, cfg, results);
|
||||
```
|
||||
|
||||
#### Views
|
||||
The Polynomial API supports efficient data handling through the use of memory views. These views provide direct access to the polynomial's internal state, such as coefficients or evaluations, without the need to copy data. This feature is particularly useful for operations that require direct access to device memory, enhancing both performance and memory efficiency.
|
||||
|
||||
The Polynomial API supports efficient data handling through the use of memory views. These views provide direct access to the polynomial's internal state, such as coefficients or evaluations without the need to copy data. This feature is particularly useful for operations that require direct access to device memory, enhancing both performance and memory efficiency.
|
||||
|
||||
##### What is a Memory View?
|
||||
|
||||
@@ -268,6 +291,7 @@ gpu_accelerated_function(coeffs_view.get(),...);
|
||||
```
|
||||
|
||||
##### Integrity-Pointer: Managing Memory Views
|
||||
|
||||
Within the Polynomial API, memory views are managed through a specialized tool called the Integrity-Pointer. This pointer type is designed to safeguard operations by monitoring the validity of the memory it points to. It can detect if the memory has been modified or released, thereby preventing unsafe access to stale or non-existent data.
|
||||
The Integrity-Pointer not only acts as a regular pointer but also provides additional functionality to ensure the integrity of the data it references. Here are its key features:
|
||||
|
||||
@@ -305,8 +329,10 @@ if (coeff_view.isValid()) {
|
||||
```
|
||||
|
||||
#### Evaluations View: Accessing Polynomial Evaluations Efficiently
|
||||
|
||||
The Polynomial API offers a specialized method, `get_rou_evaluations_view(...)`, which facilitates direct access to the evaluations of a polynomial. This method is particularly useful for scenarios where polynomial evaluations need to be accessed frequently or manipulated externally without the overhead of copying data.
|
||||
This method provides a memory view into the device memory where polynomial evaluations are stored. It allows for efficient interpolation on larger domains, leveraging the raw evaluations directly from memory.
|
||||
|
||||
:::warning
|
||||
Requesting evaluations on a domain smaller than the degree of the polynomial is not supported and is considered an invalid request.
|
||||
:::
|
||||
@@ -334,7 +360,9 @@ cudaSetDevice(int deviceID);
|
||||
This function sets the active CUDA device. All subsequent operations that allocate or deal with polynomial data will be performed on this device.
|
||||
|
||||
### Allocation Consistency
|
||||
|
||||
Polynomials are always allocated on the current CUDA device at the time of their creation. It is crucial to ensure that the device context is correctly set before initiating any operation that involves memory allocation:
|
||||
|
||||
```cpp
|
||||
// Set the device before creating polynomials
|
||||
cudaSetDevice(0);
|
||||
@@ -345,6 +373,7 @@ Polynomial p2 = Polynomial::from_coefficients(coeffs, size);
|
||||
```
|
||||
|
||||
### Matching Devices for Operations
|
||||
|
||||
When performing operations that result in the creation of new polynomials (such as addition or multiplication), it is imperative that both operands are on the same CUDA device. If the operands reside on different devices, an exception is thrown:
|
||||
|
||||
```cpp
|
||||
@@ -354,7 +383,9 @@ auto p3 = p1 + p2; // Throws an exception if p1 and p2 are not on the same devic
|
||||
```
|
||||
|
||||
### Device-Agnostic Operations
|
||||
|
||||
Operations that do not involve the creation of new polynomials, such as computing the degree of a polynomial or performing in-place modifications, can be executed regardless of the current device setting:
|
||||
|
||||
```cpp
|
||||
// 'degree' and in-place operations do not require device matching
|
||||
int deg = p1.degree();
|
||||
@@ -362,9 +393,11 @@ p1 += p2; // Valid if p1 and p2 are on the same device, throws otherwise
|
||||
```
|
||||
|
||||
### Error Handling
|
||||
|
||||
The API is designed to throw exceptions if operations are attempted across polynomials that are not located on the same GPU. This ensures that all polynomial operations are performed consistently and without data integrity issues due to device mismatches.
|
||||
|
||||
### Best Practices
|
||||
|
||||
To maximize performance and avoid runtime errors in a multi-GPU setup, always ensure that:
|
||||
|
||||
- The CUDA device is set correctly before polynomial allocation.
|
||||
|
||||
@@ -49,13 +49,6 @@ Accelerating MSM is crucial to a ZK protocol's performance due to the [large per
|
||||
|
||||
You can learn more about how MSMs work from this [video](https://www.youtube.com/watch?v=Bl5mQA7UL2I) and from our resource list on [Ingopedia](https://www.ingonyama.com/ingopedia/msm).
|
||||
|
||||
## Supported curves
|
||||
|
||||
MSM supports the following curves:
|
||||
|
||||
`bls12-377`, `bls12-381`, `bn254`, `bw6-761`, `grumpkin`
|
||||
|
||||
|
||||
## Supported Bindings
|
||||
|
||||
- [Golang](../golang-bindings/msm.md)
|
||||
@@ -81,16 +74,16 @@ Large Triangle Accumulation is a method for optimizing MSM which focuses on redu
|
||||
|
||||
#### When should I use Large triangle accumulation?
|
||||
|
||||
The Large Triangle Accumulation algorithm is more sequential in nature, as it builds upon each step sequentially (accumulating sums and then performing doubling). This structure can make it less suitable for parallelization but potentially more efficient for a <b>large batch of smaller MSM computations</b>.
|
||||
The Large Triangle Accumulation algorithm is more sequential in nature, as it builds upon each step sequentially (accumulating sums and then performing doubling). This structure can make it less suitable for parallelization but potentially more efficient for a **large batch of smaller MSM computations**.
|
||||
|
||||
## MSM Modes
|
||||
|
||||
ICICLE MSM also supports two different modes: `Batch MSM` and `Single MSM`.
|
||||
|
||||
Batch MSM allows you to run many MSMs with a single API call, Single MSM will launch a single MSM computation.
|
||||
Batch MSM allows you to run many MSMs with a single API call while single MSM will launch a single MSM computation.
|
||||
|
||||
### Which mode should I use?
|
||||
|
||||
This decision is highly dependent on your use case and design. However, if your design allows for it, using batch mode can significantly improve efficiency. Batch processing allows you to perform multiple MSMs leveraging the parallel processing capabilities of GPUs.
|
||||
This decision is highly dependent on your use case and design. However, if your design allows for it, using batch mode can significantly improve efficiency. Batch processing allows you to perform multiple MSMs simultaneously, leveraging the parallel processing capabilities of GPUs.
|
||||
|
||||
Single MSM mode should be used when batching isn't possible or when you have to run a single MSM.
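
To make the difference concrete, here is a minimal Rust sketch in the spirit of the bindings documented later in this guide. Treat the module paths, the random-generation helpers and the `DeviceVec`/`HostSlice` buffer handling as assumptions to verify against your ICICLE version; the point is only that the length of the results buffer selects the mode.

```rust
use icicle_bn254::curve::{CurveCfg, G1Projective, ScalarCfg};
use icicle_core::{curve::Curve, msm, traits::GenerateRandom};
use icicle_cuda_runtime::memory::{DeviceVec, HostSlice};

fn main() {
    // 10 MSMs of 2^10 points each, laid out back to back in a single buffer
    let batch = 10;
    let msm_size = 1 << 10;
    let scalars = ScalarCfg::generate_random(batch * msm_size);
    let points = CurveCfg::generate_random_affine_points(batch * msm_size);
    let cfg = msm::MSMConfig::default();

    // Batch mode: a results buffer of length 10 makes the call split the inputs
    // into 10 equal MSMs and run them in parallel
    let mut batch_results = DeviceVec::<G1Projective>::cuda_malloc(batch).unwrap();
    msm::msm(
        HostSlice::from_slice(&scalars),
        HostSlice::from_slice(&points),
        &cfg,
        &mut batch_results[..],
    )
    .unwrap();

    // Single mode: a results buffer of length 1 runs one MSM over the given inputs
    let mut single_result = DeviceVec::<G1Projective>::cuda_malloc(1).unwrap();
    msm::msm(
        HostSlice::from_slice(&scalars[..msm_size]),
        HostSlice::from_slice(&points[..msm_size]),
        &cfg,
        &mut single_result[..],
    )
    .unwrap();
}
```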
|
||||
|
||||
@@ -11,24 +11,19 @@ A_k = \sum_{n=0}^{N-1} a_n \cdot \omega^{nk} \mod p
|
||||
$$
|
||||
|
||||
where:
|
||||
|
||||
- $N$ is the size of the input sequence and is a power of 2,
|
||||
- $p$ is a prime number such that $p = kN + 1$ for some integer $k$, ensuring that $p$ supports the existence of $N$th roots of unity,
|
||||
- $\omega$ is a primitive $N$th root of unity modulo $p$, meaning $\omega^N \equiv 1 \mod p$ and no smaller positive power of $\omega$ is congruent to 1 modulo $p$,
|
||||
- $k$ ranges from 0 to $N-1$, and it indexes the output sequence.
|
||||
|
||||
The NTT is particularly useful because it enables efficient polynomial multiplication under modulo arithmetic, crucial for algorithms in cryptographic protocols, and other areas requiring fast modular arithmetic operations.
|
||||
NTT is particularly useful because it enables efficient polynomial multiplication under modulo arithmetic, crucial for algorithms in cryptographic protocols and other areas requiring fast modular arithmetic operations.
|
||||
|
||||
There is also an inverse operation, INTT, which takes the output sequence of an NTT and reconstructs the original sequence.
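
For intuition, the following toy sketch computes the transform directly from the definition above over a small prime field (illustration only; it is not how ICICLE computes NTTs, which use the $O(N \log N)$ algorithms described below):

```rust
// Toy, by-the-definition NTT over a small prime field, for intuition only.
const P: u64 = 17;    // prime with 17 = 4*4 + 1, so 4-th roots of unity exist
const OMEGA: u64 = 4; // 4 is a primitive 4-th root of unity mod 17 (4^2 = 16, 4^4 = 1)

fn pow_mod(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % P;
        }
        base = base * base % P;
        exp >>= 1;
    }
    acc
}

fn ntt_naive(a: &[u64]) -> Vec<u64> {
    let n = a.len() as u64;
    (0..n)
        .map(|k| {
            // A_k = sum_i a_i * omega^(i*k) mod p
            (0..n)
                .map(|i| a[i as usize] * pow_mod(OMEGA, i * k) % P)
                .fold(0, |acc, v| (acc + v) % P)
        })
        .collect()
}

fn main() {
    let a = [1u64, 2, 3, 4];
    println!("{:?}", ntt_naive(&a));
}
```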
|
||||
|
||||
# Using NTT
|
||||
## Using NTT
|
||||
|
||||
### Supported curves
|
||||
|
||||
NTT supports the following curves:
|
||||
|
||||
`bls12-377`, `bls12-381`, `bn-254`, `bw6-761`
|
||||
|
||||
## Supported Bindings
|
||||
### Supported Bindings
|
||||
|
||||
- [Golang](../golang-bindings/ntt.md)
|
||||
- [Rust](../rust-bindings/ntt.md)
|
||||
@@ -61,19 +56,17 @@ Choosing an algorithm is heavily dependent on your use case. For example Cooley-
|
||||
|
||||
NTT also supports two different modes: `Batch NTT` and `Single NTT`.
|
||||
|
||||
Batch NTT allows you to run many NTTs with a single API call, Single MSM will launch a single MSM computation.
|
||||
|
||||
Deciding whether to use `batch NTT` vs `single NTT` is highly dependent on your application and use case.
|
||||
|
||||
**Single NTT Mode**
|
||||
#### Single NTT
|
||||
|
||||
- Choose this mode when your application requires processing individual NTT operations in isolation.
|
||||
Single NTT will launch a single NTT computation.
|
||||
|
||||
**Batch NTT Mode**
|
||||
Choose this mode when your application requires processing individual NTT operations in isolation.
|
||||
|
||||
- Batch NTT mode can significantly reduce read/write as well as computation overhead by executing multiple NTT operations in parallel.
|
||||
#### Batch NTT Mode
|
||||
|
||||
- Batch mode may also offer better utilization of computational resources (memory and compute).
|
||||
Batch NTT allows you to run many NTTs with a single API call. Batch NTT mode can significantly reduce read/write times as well as computation overhead by executing multiple NTT operations in parallel. Batch mode may also offer better utilization of computational resources (memory and compute).
|
||||
|
||||
## Supported algorithms
|
||||
|
||||
@@ -90,8 +83,8 @@ At its core, the Radix-2 NTT algorithm divides the problem into smaller sub-prob
|
||||
The algorithm recursively divides the input sequence into smaller sequences. At each step, it separates the sequence into even-indexed and odd-indexed elements, forming two subsequences that are then processed independently.
|
||||
|
||||
3. **Butterfly Operations:**
|
||||
The core computational element of the Radix-2 NTT is the "butterfly" operation, which combines pairs of elements from the sequences obtained in the decomposition step.
|
||||
|
||||
The core computational element of the Radix-2 NTT is the "butterfly" operation, which combines pairs of elements from the sequences obtained in the decomposition step.
|
||||
|
||||
Each butterfly operation involves multiplication by a "twiddle factor," which is a root of unity in the finite field, and addition or subtraction of the results, all performed modulo the prime modulus.
|
||||
|
||||
$$
|
||||
@@ -108,7 +101,6 @@ At its core, the Radix-2 NTT algorithm divides the problem into smaller sub-prob
|
||||
|
||||
$k$ - The index of the current operation within the butterfly or the transform stage
|
||||
|
||||
|
||||
The twiddle factors are precomputed to save runtime and improve performance; a toy butterfly sketch is given at the end of this section.
|
||||
|
||||
4. **Bit-Reversal Permutation:**
|
||||
@@ -116,7 +108,7 @@ At its core, the Radix-2 NTT algorithm divides the problem into smaller sub-prob
|
||||
|
||||
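Here is that toy sketch of the butterfly described above, in the standard form $a' = a + w\cdot b$, $b' = a - w\cdot b$ modulo $p$ (an illustration of the textbook operation, not ICICLE's kernels):

```rust
// Toy radix-2 butterfly over a small prime field (illustration only).
const P: u64 = 17;

/// Combines the pair (a, b) with twiddle factor w:
/// returns (a + w*b mod p, a - w*b mod p).
fn butterfly(a: u64, b: u64, w: u64) -> (u64, u64) {
    let t = w * b % P;
    ((a + t) % P, (a + P - t) % P)
}

fn main() {
    println!("{:?}", butterfly(3, 5, 4)); // -> (6, 0): 3 + 20 = 23 ≡ 6, 3 - 20 ≡ 0 (mod 17)
}
```
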
### Mixed Radix
|
||||
|
||||
The Mixed Radix NTT algorithm extends the concepts of the Radix-2 algorithm by allowing the decomposition of the input sequence based on various factors of its length. Specifically ICICLEs implementation splits the input into blocks of sizes 16,32,64 compared to radix2 which is always splitting such that we end with NTT of size 2. This approach offers enhanced flexibility and efficiency, especially for input sizes that are composite numbers, by leveraging the "divide and conquer" strategy across multiple radixes.
|
||||
The Mixed Radix NTT algorithm extends the concepts of the Radix-2 algorithm by allowing the decomposition of the input sequence based on various factors of its length. Specifically, ICICLE's implementation splits the input into blocks of sizes 16, 32, or 64, compared to Radix-2, which always splits the input until it ends up with NTTs of size 2. This approach offers enhanced flexibility and efficiency, especially for input sizes that are composite numbers, by leveraging the "divide and conquer" strategy across multiple radices.
|
||||
|
||||
The NTT blocks in Mixed Radix are implemented more efficiently, based on the Winograd NTT, and memory and register usage is also better compared to Radix-2.
|
||||
|
||||
@@ -126,11 +118,11 @@ Mixed Radix can reduce the number of stages required to compute for large inputs
|
||||
The input to the Mixed Radix NTT is a sequence of integers $a_0, a_1, \ldots, a_{N-1}$, where $N$ is not strictly required to be a power of two. Instead, $N$ can be any composite number, ideally factorized into primes or powers of primes.
|
||||
|
||||
2. **Factorization and Decomposition:**
|
||||
Unlike the Radix-2 algorithm, which strictly divides the computational problem into halves, the Mixed Radix NTT algorithm implements a flexible decomposition approach which isn't limited to prime factorization.
|
||||
|
||||
Unlike the Radix-2 algorithm, which strictly divides the computational problem into halves, the Mixed Radix NTT algorithm implements a flexible decomposition approach which isn't limited to prime factorization.
|
||||
|
||||
For example, an NTT of size 256 can be decomposed into two stages of $16 \times \text{NTT}_{16}$, leveraging a composite factorization strategy rather than decomposing into eight stages of $\text{NTT}_{2}$. This exemplifies the use of composite factors (in this case, $256 = 16 \times 16$) to apply smaller NTT transforms, optimizing computational efficiency by adapting the decomposition strategy to the specific structure of $N$.
|
||||
|
||||
3. **Butterfly Operations with Multiple Radixes:**
|
||||
3. **Butterfly Operations with Multiple Radices:**
|
||||
The Mixed Radix algorithm utilizes butterfly operations for various radix sizes. Each sub-transform involves specific butterfly operations characterized by multiplication with twiddle factors appropriate for the radix in question.
|
||||
|
||||
The generalized butterfly operation for a radix-$r$ element can be expressed as:
|
||||
@@ -139,7 +131,15 @@ Mixed Radix can reduce the number of stages required to compute for large inputs
|
||||
X_{k,r} = \sum_{j=0}^{r-1} (A_{j,k} \cdot W^{jk}) \mod p
|
||||
$$
|
||||
|
||||
where $X_{k,r}$ is the output of the $radix-r$ butterfly operation for the $k-th$ set of inputs, $A_{j,k}$ represents the $j-th$ input element for the $k-th$ operation, $W$ is the twiddle factor, and $p$ is the prime modulus.
|
||||
where:
|
||||
|
||||
$X_{k,r}$ - is the output of the $radix-r$ butterfly operation for the $k-th$ set of inputs
|
||||
|
||||
$A_{j,k}$ - represents the $j-th$ input element for the $k-th$ operation
|
||||
|
||||
$W$ - is the twiddle factor
|
||||
|
||||
$p$ - is the prime modulus
|
||||
|
||||
4. **Recombination and Reordering:**
|
||||
After applying the appropriate butterfly operations across all decomposition levels, the Mixed Radix algorithm recombines the results into a single output sequence. Due to the varied sizes of the sub-transforms, a more complex reordering process may be required compared to Radix-2. This involves digit-reversal permutations to ensure that the final output sequence is correctly ordered.
|
||||
@@ -154,6 +154,6 @@ Mixed radix on the other hand works better for larger NTTs with larger input siz
|
||||
|
||||
Performance really depends on logn size, batch size, ordering, inverse, coset, coeff-field and which GPU you are using.
|
||||
|
||||
For this reason we implemented our [heuristic auto-selection](https://github.com/ingonyama-zk/icicle/blob/774250926c00ffe84548bc7dd97aea5227afed7e/icicle/appUtils/ntt/ntt.cu#L474) which should choose the most efficient algorithm in most cases.
|
||||
For this reason we implemented our [heuristic auto-selection](https://github.com/ingonyama-zk/icicle/blob/main/icicle/src/ntt/ntt.cu#L573) which should choose the most efficient algorithm in most cases.
|
||||
|
||||
We still recommend you benchmark for your specific use case if you think a different configuration would yield better results.
|
||||
|
||||
@@ -8,39 +8,38 @@ Poseidon has been used in many popular ZK protocols such as Filecoin and [Plonk]
|
||||
|
||||
Our Poseidon implementation follows the optimized [Filecoin version](https://spec.filecoin.io/algorithms/crypto/poseidon/).
|
||||
|
||||
Let understand how Poseidon works.
|
||||
Let's understand how Poseidon works.
|
||||
|
||||
### Initialization
|
||||
## Initialization
|
||||
|
||||
Poseidon starts with the initialization of its internal state, which is composed of the input elements and some pregenerated constants. An initial round constant is added to each element of the internal state. Adding The round constants ensure the state is properly mixed from the outset.
|
||||
Poseidon starts with the initialization of its internal state, which is composed of the input elements and some pre-generated constants. An initial round constant is added to each element of the internal state. Adding the round constants ensures the state is properly mixed from the beginning.
|
||||
|
||||
This is done to prevent collisions and certain cryptographic attacks by ensuring that the internal state is sufficiently mixed and unpredictable.
|
||||
|
||||

|
||||
|
||||
### Applying full and partial rounds
|
||||
## Applying full and partial rounds
|
||||
|
||||
To generate a secure hash output, the algorithm goes through a series of "full rounds" and "partial rounds" as well as transformations between these sets of rounds.
|
||||
To generate a secure hash output, the algorithm goes through a series of "full rounds" and "partial rounds" as well as transformations between these sets of rounds in the following order:
|
||||
|
||||
First full rounds => apply SBox and Round constants => partial rounds => Last full rounds => Apply SBox
|
||||
```First full rounds -> apply S-box and round constants -> partial rounds -> last full rounds -> apply S-box```
|
||||
|
||||
#### Full rounds
|
||||
### Full rounds
|
||||
|
||||

|
||||
|
||||
**Uniform Application of S-Box:** In full rounds, the S-box (a non-linear transformation) is applied uniformly to every element of the hash function's internal state. This ensures a high degree of mixing and diffusion, contributing to the hash function's security. The functions S-box involves raising each element of the state to a certain power denoted by `α` a member of the finite field defined by the prime `p`, `α` can be different depending on the the implementation and user configuration.
|
||||
**Uniform Application of S-box:** In full rounds, the S-box (a non-linear transformation) is applied uniformly to every element of the hash function's internal state. This ensures a high degree of mixing and diffusion, contributing to the hash function's security. The S-box involves raising each element of the state to a certain power denoted by `α`, a member of the finite field defined by the prime `p`; `α` can differ depending on the implementation and user configuration.
|
||||
|
||||
**Linear Transformation:** After applying the S-box, a linear transformation is performed on the state. This involves multiplying the state by an MDS (Maximum Distance Separable) matrix, which further diffuses the transformations applied by the S-box across the entire state.
|
||||
|
||||
**Addition of Round Constants:** Each element of the state is then modified by adding a unique round constant. These constants are different for each round and are precomputed as part of the hash function's initialization. The addition of round constants ensures that even minor changes to the input produce significant differences in the output.
|
||||
|
||||
#### Partial Rounds
|
||||
### Partial Rounds
|
||||
|
||||
**Selective Application of S-Box:** Partial rounds apply the S-box transformation to only one element of the internal state per round, rather than to all elements. This selective application significantly reduces the computational complexity of the hash function without compromising its security. The choice of which element to apply the S-box to can follow a specific pattern or be fixed, depending on the design of the hash function.
|
||||
|
||||
**Linear Transformation and Round Constants:** A linear transformation is performed and round constants are added. The linear transformation in partial rounds can be designed to be less computationally intensive (this is done by using a sparse matrix) than in full rounds, further optimizing the function's efficiency.
|
||||
|
||||
|
||||
The user of Poseidon can often choose how many partial or full rounds they wish to apply; more full rounds will increase security but degrade performance. The right balance is highly dependent on the use case.
|
||||
|
||||

|
||||
@@ -52,25 +51,20 @@ What that means is we calculate multiple hash-sums over multiple pre-images in p
|
||||
|
||||
So for Poseidon of arity 2 and an input of size 1024 * 2, we would expect 1024 elements of output. This means each block would be of size 2, resulting in 1024 Poseidon hashes being performed.
|
||||
|
||||
### Supported API
|
||||
### Supported Bindings
|
||||
|
||||
[`Rust`](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/rust/icicle-core/src/poseidon), [`C++`](https://github.com/ingonyama-zk/icicle/tree/main/icicle/appUtils/poseidon)
|
||||
|
||||
### Supported curves
|
||||
|
||||
Poseidon supports the following curves:
|
||||
|
||||
`bls12-377`, `bls12-381`, `bn-254`, `bw6-761`
|
||||
[`Rust`](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/rust/icicle-core/src/poseidon)
|
||||
|
||||
### Constants
|
||||
|
||||
Poseidon is extremely customizable and using different constants will produce different hashes, security levels and performance results.
|
||||
|
||||
We support pre-calculated and optimized constants for each of the [supported curves](#supported-curves).The constants can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/icicle/appUtils/poseidon/constants) and are labeled clearly per curve `<curve_name>_poseidon.h`.
|
||||
We support pre-calculated and optimized constants for each of the [supported curves](#supported-curves). The constants can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/icicle/include/poseidon/constants) and are labeled clearly per curve `<curve_name>_poseidon.h`.
|
||||
|
||||
If you wish to generate your own constants you can use our python script which can be found [here](https://github.com/ingonyama-zk/icicle/blob/b6dded89cdef18348a5d4e2748b71ce4211c63ad/icicle/appUtils/poseidon/constants/generate_parameters.py#L1).
|
||||
If you wish to generate your own constants you can use our python script which can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/icicle/include/poseidon/constants/generate_parameters.py).
|
||||
|
||||
Prerequisites:
|
||||
|
||||
- Install python 3
|
||||
- `pip install poseidon-hash`
|
||||
- `pip install galois==0.3.7`
|
||||
@@ -97,7 +91,7 @@ primitive_element = 7 # bls12-381
|
||||
# primitive_element = 15 # bw6-761
|
||||
```
|
||||
|
||||
We only support `alpha = 5` so if you want to use another alpha for SBox please reach out on discord or open a github issue.
|
||||
We only support `alpha = 5` so if you want to use another alpha for S-box please reach out on discord or open a github issue.
|
||||
|
||||
### Rust API
|
||||
|
||||
@@ -128,8 +122,7 @@ poseidon_hash_many::<F>(
|
||||
|
||||
The `PoseidonConfig::default()` can be modified; by default, for example, the inputs and outputs are set to be on `Host`.
|
||||
|
||||
|
||||
```
|
||||
```rust
|
||||
impl<'a> Default for PoseidonConfig<'a> {
|
||||
fn default() -> Self {
|
||||
let ctx = get_default_device_context();
|
||||
@@ -174,11 +167,10 @@ let ctx = get_default_device_context();
|
||||
)
|
||||
.unwrap();
|
||||
```
|
||||
For more examples using different configurations, refer here.
|
||||
|
||||
## The Tree Builder
|
||||
|
||||
The tree builder allows you to build Merkle trees using Poseidon.
|
||||
The tree builder allows you to build Merkle trees using Poseidon.
|
||||
|
||||
You can define both the tree's `height` and its `arity`. The tree `height` determines the number of layers in the tree, including the root and the leaf layer. The `arity` determines how many children each internal node can have.
|
||||
|
||||
@@ -206,9 +198,9 @@ Similar to Poseidon, you can also configure the Tree Builder `TreeBuilderConfig:
|
||||
- `are_inputs_on_device`: Have the inputs been loaded to device memory?
|
||||
- `is_async`: Should the TreeBuilder run asynchronously? `False` will block the current CPU thread. `True` will require you call `cudaStreamSynchronize` or `cudaDeviceSynchronize` to retrieve the result.
|
||||
|
||||
### Benchmarks
|
||||
### Benchmarks
|
||||
|
||||
We ran the Poseidon tree builder on:
|
||||
We ran the Poseidon tree builder on:
|
||||
|
||||
**CPU**: 12th Gen Intel(R) Core(TM) i9-12900K
|
||||
|
||||
@@ -218,9 +210,8 @@ We ran the Poseidon tree builder on:
|
||||
|
||||
The benchmarks include copying data from and to the device.
|
||||
|
||||
|
||||
| Rows to keep parameter | Run time, Icicle | Supranational PC2
|
||||
| ----------- | ----------- | ----------- |
|
||||
| ----------- | ----------- | -----------
|
||||
| 10 | 9.4 seconds | 13.6 seconds
|
||||
| 20 | 9.5 seconds | 13.6 seconds
|
||||
| 29 | 13.7 seconds | 13.6 seconds
|
||||
|
||||
@@ -12,7 +12,7 @@ Rust bindings allow you to use ICICLE as a rust library.
|
||||
|
||||
Simply add the following to your `Cargo.toml`.
|
||||
|
||||
```
|
||||
```toml
|
||||
# GPU Icicle integration
|
||||
icicle-cuda-runtime = { git = "https://github.com/ingonyama-zk/icicle.git" }
|
||||
icicle-core = { git = "https://github.com/ingonyama-zk/icicle.git" }
|
||||
@@ -25,7 +25,7 @@ If you wish to point to a specific ICICLE branch add `branch = "<name_of_branch>
|
||||
|
||||
When you build your project ICICLE will be built as part of the build command.
|
||||
|
||||
# How do the rust bindings work?
|
||||
## How do the rust bindings work?
|
||||
|
||||
The Rust bindings are just Rust wrappers for ICICLE Core static libraries which can be compiled. We integrate the compilation of the static libraries into Rust's toolchain to make usage seamless and easy. This is achieved by [extending Rust's build command](https://github.com/ingonyama-zk/icicle/blob/main/wrappers/rust/icicle-curves/icicle-bn254/build.rs).
|
||||
|
||||
@@ -55,3 +55,33 @@ fn main() {
|
||||
println!("cargo:rustc-link-lib=cudart");
|
||||
}
|
||||
```
|
||||
|
||||
## Supported curves, fields and operations
|
||||
|
||||
### Supported curves and operations
|
||||
|
||||
| Operation\Curve | bn254 | bls12_377 | bls12_381 | bw6-761 | grumpkin |
|
||||
| --- | :---: | :---: | :---: | :---: | :---: |
|
||||
| MSM | ✅ | ✅ | ✅ | ✅ | ✅ |
|
||||
| G2 | ✅ | ✅ | ✅ | ✅ | ❌ |
|
||||
| NTT | ✅ | ✅ | ✅ | ✅ | ❌ |
|
||||
| ECNTT | ✅ | ✅ | ✅ | ✅ | ❌ |
|
||||
| VecOps | ✅ | ✅ | ✅ | ✅ | ✅ |
|
||||
| Polynomials | ✅ | ✅ | ✅ | ✅ | ❌ |
|
||||
| Poseidon | ✅ | ✅ | ✅ | ✅ | ✅ |
|
||||
| Merkle Tree | ✅ | ✅ | ✅ | ✅ | ✅ |
|
||||
|
||||
### Supported fields and operations
|
||||
|
||||
| Operation\Field | babybear | stark252 |
|
||||
| --- | :---: | :---: |
|
||||
| VecOps | ✅ | ✅ |
|
||||
| Polynomials | ✅ | ✅ |
|
||||
| NTT | ✅ | ✅ |
|
||||
| Extension Field | ✅ | ❌ |
|
||||
|
||||
### Supported hashes
|
||||
|
||||
| Hash | Sizes |
|
||||
| --- | :---: |
|
||||
| Keccak | 256, 512 |
|
||||
|
||||
@@ -1,9 +1,5 @@
|
||||
# ECNTT
|
||||
|
||||
### Supported curves
|
||||
|
||||
`bls12-377`, `bls12-381`, `bn254`
|
||||
|
||||
## ECNTT Method
|
||||
|
||||
The `ecntt` function computes the Elliptic Curve Number Theoretic Transform (EC-NTT) or its inverse on a batch of points of a curve.
|
||||
@@ -25,7 +21,7 @@ where
|
||||
|
||||
## Parameters
|
||||
|
||||
- **`input`**: The input data as a slice of `Projective<C>`. This represents points on a specific elliptic curve `C`.
|
||||
- **`input`**: The input data as a slice of `Projective<C>`. This represents points on a specific elliptic curve `C`.
|
||||
- **`dir`**: The direction of the NTT. It can be `NTTDir::kForward` for forward NTT or `NTTDir::kInverse` for inverse NTT.
|
||||
- **`cfg`**: The NTT configuration object of type `NTTConfig<C::ScalarField>`. This object specifies parameters for the NTT computation, such as the batch size and algorithm to use.
|
||||
- **`output`**: The output buffer to write the results into. This should be a slice of `Projective<C>` with the same size as the input.
|
||||
|
||||
@@ -2,11 +2,7 @@
|
||||
|
||||
To understand the theory behind MSM pre computation technique refer to Niall Emmart's [talk](https://youtu.be/KAWlySN7Hm8?feature=shared&t=1734).
|
||||
|
||||
### Supported curves
|
||||
|
||||
`bls12-377`, `bls12-381`, `bn254`, `bw6-761`, `Grumpkin`
|
||||
|
||||
### `precompute_bases`
|
||||
## `precompute_bases`
|
||||
|
||||
Precomputes bases for the multi-scalar multiplication (MSM) by extending each base point with its multiples, facilitating more efficient MSM calculations.
|
||||
|
||||
@@ -20,8 +16,7 @@ pub fn precompute_bases<C: Curve + MSM<C>>(
|
||||
) -> IcicleResult<()>
|
||||
```
|
||||
|
||||
|
||||
#### Parameters
|
||||
### Parameters
|
||||
|
||||
- **`points`**: The original set of affine points (\(P_1, P_2, ..., P_n\)) to be used in the MSM. For batch MSM operations, this should include all unique points concatenated together.
|
||||
- **`precompute_factor`**: Specifies the total number of points to precompute for each base, including the base point itself. This parameter directly influences the memory requirements and the potential speedup of the MSM operation.
|
||||
|
||||
@@ -1,9 +1,5 @@
|
||||
# MSM
|
||||
|
||||
### Supported curves
|
||||
|
||||
`bls12-377`, `bls12-381`, `bn-254`, `bw6-761`, `grumpkin`
|
||||
|
||||
## Example
|
||||
|
||||
```rust
|
||||
@@ -84,7 +80,7 @@ pub struct MSMConfig<'a> {
|
||||
```
|
||||
|
||||
- **`ctx: DeviceContext`**: Specifies the device context, device id and the CUDA stream for asynchronous execution.
|
||||
- **`point_size: i32`**:
|
||||
- **`point_size: i32`**:
|
||||
- **`precompute_factor: i32`**: Determines the number of extra points to pre-compute for each point, affecting memory footprint and performance.
|
||||
- **`c: i32`**: The "window bitsize," a parameter controlling the computational complexity and memory footprint of the MSM operation.
|
||||
- **`bitsize: i32`**: The number of bits of the largest scalar, typically equal to the bit size of the scalar field.
|
||||
@@ -120,7 +116,6 @@ msm::msm(&scalars, &points, &cfg, &mut msm_results).unwrap();
|
||||
|
||||
You may reference the rust code [here](https://github.com/ingonyama-zk/icicle/blob/77a7613aa21961030e4e12bf1c9a78a2dadb2518/wrappers/rust/icicle-core/src/msm/mod.rs#L54).
|
||||
|
||||
|
||||
## How do I toggle between MSM modes?
|
||||
|
||||
Toggling between MSM modes occurs automatically based on the number of results you are expecting from the `msm::msm` function. If you are expecting an array of `msm_results`, ICICLE will automatically split `scalars` and `points` into equal parts and run them as multiple MSMs in parallel.
|
||||
@@ -136,7 +131,6 @@ msm::msm(&scalars, &points, &cfg, &mut msm_result).unwrap();
|
||||
|
||||
In the example above we allocate a single expected result which the MSM method will interpret as `batch_size=1` and run a single MSM.
|
||||
|
||||
|
||||
In the next example, we are expecting 10 results which sets `batch_size=10` and runs 10 MSMs in batch mode.
|
||||
|
||||
```rust
|
||||
@@ -152,7 +146,7 @@ Here is a [reference](https://github.com/ingonyama-zk/icicle/blob/77a7613aa21961
|
||||
|
||||
## Support for G2 group
|
||||
|
||||
MSM also supports G2 group.
|
||||
MSM also supports G2 group.
|
||||
|
||||
Using MSM in G2 requires a G2 config, and of course your points should also be G2 points.
|
||||
|
||||
|
||||
@@ -1,10 +1,6 @@
|
||||
# NTT
|
||||
|
||||
### Supported curves
|
||||
|
||||
`bls12-377`, `bls12-381`, `bn-254`, `bw6-761`
|
||||
|
||||
## Example
|
||||
## Example
|
||||
|
||||
```rust
|
||||
use icicle_bn254::curve::{ScalarCfg, ScalarField};
|
||||
@@ -61,14 +57,13 @@ pub fn ntt<F>(
|
||||
|
||||
`ntt::ntt` expects:
|
||||
|
||||
- **`input`** - buffer to read the inputs of the NTT from. <br/>
|
||||
- **`dir`** - whether to compute forward or inverse NTT. <br/>
|
||||
- **`cfg`** - config used to specify extra arguments of the NTT. <br/>
|
||||
- **`input`** - buffer to read the inputs of the NTT from.
|
||||
- **`dir`** - whether to compute forward or inverse NTT.
|
||||
- **`cfg`** - config used to specify extra arguments of the NTT.
|
||||
- **`output`** - buffer to write the NTT outputs into. Must be of the same size as input.
|
||||
|
||||
The `input` and `output` buffers can be on device or on host. Being on host means that they will be transferred to device during runtime.
|
||||
|
||||
|
||||
### NTT Config
|
||||
|
||||
```rust
|
||||
@@ -107,8 +102,7 @@ The `NTTConfig` struct is a configuration object used to specify parameters for
|
||||
|
||||
- **`ntt_algorithm: NttAlgorithm`**: Can be one of `Auto`, `Radix2`, `MixedRadix`.
|
||||
`Auto` will select `Radix 2` or `Mixed Radix` algorithm based on heuristics.
|
||||
`Radix2` and `MixedRadix` will force the use of an algorithm regardless of the input size or other considerations. You should use one of these options when you know for sure that you want to
|
||||
|
||||
`Radix2` and `MixedRadix` will force the use of an algorithm regardless of the input size or other considerations. You should use one of these options when you know for sure that you want to use a specific algorithm.
|
||||
|
||||
#### Usage
|
||||
|
||||
@@ -134,7 +128,6 @@ let custom_config = NTTConfig {
|
||||
};
|
||||
```
|
||||
|
||||
|
||||
### Modes
|
||||
|
||||
NTT supports two different modes: `Batch NTT` and `Single NTT`.
|
||||
@@ -205,4 +198,3 @@ where
|
||||
#### Returns
|
||||
|
||||
The function returns an `IcicleResult<()>`, which represents the result of the operation. If the operation is successful, the function returns `Ok(())`, otherwise it returns an error.
|
||||
|
||||
|
||||
@@ -1,14 +1,16 @@
|
||||
:::note Please refer to the Polynomials overview page for a deep overview. This section is a brief description of the Rust FFI bindings.
|
||||
# Rust FFI Bindings for Univariate Polynomial
|
||||
|
||||
:::note
|
||||
Please refer to the Polynomials overview page for a deep overview. This section is a brief description of the Rust FFI bindings.
|
||||
:::
|
||||
|
||||
# Rust FFI Bindings for Univariate Polynomial
|
||||
This documentation is designed to provide developers with a clear understanding of how to utilize the Rust bindings for polynomial operations efficiently and effectively, leveraging the robust capabilities of both Rust and C++ in their applications.
|
||||
|
||||
## Introduction
|
||||
|
||||
The Rust FFI bindings for the Univariate Polynomial serve as a "shallow wrapper" around the underlying C++ implementation. These bindings provide a straightforward Rust interface that directly calls functions from a C++ library, effectively bridging Rust and C++ operations. The Rust layer handles simple interface translations without delving into complex logic or data structures, which are managed on the C++ side. This design ensures efficient data handling, memory management, and execution of polynomial operations directly via C++.
|
||||
Currently, these bindings are tailored specifically for polynomials where the coefficients, domain, and images are represented as scalar fields.
|
||||
|
||||
|
||||
## Initialization Requirements
|
||||
|
||||
Before utilizing any functions from the polynomial API, it is mandatory to initialize the appropriate polynomial backend (e.g., CUDA). Additionally, the NTT (Number Theoretic Transform) domain must also be initialized, as the CUDA backend relies on this for certain operations. Failing to properly initialize these components can result in errors.
|
||||
@@ -19,12 +21,12 @@ Before utilizing any functions from the polynomial API, it is mandatory to initi
|
||||
The ICICLE library is structured such that each field or curve has its dedicated library implementation. As a result, initialization must be performed individually for each field or curve to ensure the correct setup and functionality of the library.
|
||||
:::
|
||||
|
||||
|
||||
## Core Trait: `UnivariatePolynomial`
|
||||
|
||||
The `UnivariatePolynomial` trait encapsulates the essential functionalities required for managing univariate polynomials in the Rust ecosystem. This trait standardizes the operations that can be performed on polynomials, regardless of the underlying implementation details. It allows for a unified approach to polynomial manipulation, providing a suite of methods that are fundamental to polynomial arithmetic.
|
||||
|
||||
### Trait Definition
|
||||
|
||||
```rust
|
||||
pub trait UnivariatePolynomial
|
||||
where
|
||||
@@ -77,6 +79,7 @@ where
|
||||
```
|
||||
|
||||
## `DensePolynomial` Struct
|
||||
|
||||
The DensePolynomial struct represents a dense univariate polynomial in Rust, leveraging a handle to manage its underlying memory within the CUDA device context. This struct acts as a high-level abstraction over complex C++ memory management practices, facilitating the integration of high-performance polynomial operations through Rust's Foreign Function Interface (FFI) bindings.
|
||||
|
||||
```rust
|
||||
@@ -88,15 +91,19 @@ pub struct DensePolynomial {
|
||||
### Traits implementation and methods
|
||||
|
||||
#### `Drop`
|
||||
|
||||
Ensures proper resource management by releasing the CUDA memory when a DensePolynomial instance goes out of scope. This prevents memory leaks and ensures that resources are cleaned up correctly, adhering to Rust's RAII (Resource Acquisition Is Initialization) principles.
|
||||
|
||||
#### `Clone`
|
||||
|
||||
Provides a way to create a new instance of a DensePolynomial with its own unique handle, thus duplicating the polynomial data in the CUDA context. Cloning is essential since the DensePolynomial manages external resources, which cannot be safely shared across instances without explicit duplication.
|
||||
|
||||
#### Operator Overloading: `Add`, `Sub`, `Mul`, `Rem`, `Div`
|
||||
|
||||
These traits are implemented for references to DensePolynomial (i.e., &DensePolynomial), enabling natural mathematical operations such as addition (+), subtraction (-), multiplication (*), division (/), and remainder (%). This syntactic convenience allows users to compose complex polynomial expressions in a way that is both readable and expressive.
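
As a brief sketch of how these traits combine in practice (the `PolynomialBabyBear` alias is borrowed from the usage section further below, and the snippet assumes the backend and NTT domain were already initialized as described under "Initialization Requirements"):

```rust
use icicle_babybear::polynomials::DensePolynomial as PolynomialBabyBear;

// Clone an existing polynomial and combine it with the original via the
// overloaded reference operators; intermediate device memory is released
// by `Drop` when the values go out of scope.
fn clone_and_combine(f: &PolynomialBabyBear) -> PolynomialBabyBear {
    let g = f.clone(); // duplicates the underlying device data behind a new handle
    let sum = f + &g;  // `Add` on references returns a new polynomial
    &sum * &g          // so does `Mul`; `sum` and `g` are dropped on return
}
```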
|
||||
|
||||
#### Key Methods
|
||||
|
||||
In addition to the traits, the following methods are implemented:
|
||||
|
||||
```rust
|
||||
@@ -107,16 +114,16 @@ impl DensePolynomial {
|
||||
}
|
||||
```
|
||||
|
||||
:::note Might be consolidated with `UnivariatePolynomial` trait
|
||||
:::
|
||||
|
||||
## Flexible Memory Handling With `HostOrDeviceSlice`
|
||||
|
||||
The DensePolynomial API is designed to accommodate a wide range of computational environments by supporting both host and device memory through the `HostOrDeviceSlice` trait. This approach ensures that polynomial operations can be seamlessly executed regardless of where the data resides, making the API highly adaptable and efficient for various hardware configurations.
|
||||
|
||||
### Overview of `HostOrDeviceSlice`
|
||||
|
||||
The HostOrDeviceSlice is a Rust trait that abstracts over slices of memory that can either be on the host (CPU) or the device (GPU), as managed by CUDA. This abstraction is crucial for high-performance computing scenarios where data might need to be moved between different memory spaces depending on the operations being performed and the specific hardware capabilities available.
|
||||
|
||||
### Usage in API Functions
|
||||
|
||||
Functions within the DensePolynomial API that deal with polynomial coefficients or evaluations use the HostOrDeviceSlice trait to accept inputs. This design allows the functions to be agnostic of the actual memory location of the data, whether it's in standard system RAM accessible by the CPU or in GPU memory accessible by CUDA cores.
|
||||
|
||||
```rust
|
||||
@@ -132,10 +139,13 @@ let p_from_evals = PolynomialBabyBear::from_rou_evals(&evals, evals.len());
|
||||
```
|
||||
|
||||
## Usage
|
||||
|
||||
This section outlines practical examples demonstrating how to utilize the `DensePolynomial` Rust API. The API is flexible, supporting multiple scalar fields. Below are examples showing how to use polynomials defined over different fields and perform a variety of operations.
|
||||
|
||||
### Initialization and Basic Operations
|
||||
|
||||
First, choose the appropriate field implementation for your polynomial operations, initializing the CUDA backend if necessary:
|
||||
|
||||
```rust
|
||||
use icicle_babybear::polynomials::DensePolynomial as PolynomialBabyBear;
|
||||
|
||||
@@ -151,10 +161,10 @@ use icicle_bn254::polynomials::DensePolynomial as PolynomialBn254;
|
||||
```
|
||||
|
||||
### Creation
|
||||
|
||||
Polynomials can be created from coefficients or evaluations:
|
||||
|
||||
```rust
|
||||
// Assume F is the field type (e.g. icicle_bn254::curve::ScalarField or a type parameter)
|
||||
let coeffs = ...;
|
||||
let p_from_coeffs = PolynomialBabyBear::from_coeffs(HostSlice::from_slice(&coeffs), size);
|
||||
|
||||
@@ -164,6 +174,7 @@ let p_from_evals = PolynomialBabyBear::from_rou_evals(HostSlice::from_slice(&eva
|
||||
```
|
||||
|
||||
### Arithmetic Operations
|
||||
|
||||
Utilize overloaded operators for intuitive mathematical expressions:
|
||||
|
||||
```rust
|
||||
@@ -174,6 +185,7 @@ let mul_scalar = &f * &scalar; // Scalar multiplication
|
||||
```
|
||||
|
||||
### Division and Remainder
|
||||
|
||||
Compute quotient and remainder or perform division by a vanishing polynomial:
|
||||
|
||||
```rust
|
||||
@@ -186,6 +198,7 @@ let h = f.div_by_vanishing(N); // Division by V(x) = X^N - 1
|
||||
```
|
||||
|
||||
### Monomial Operations
|
||||
|
||||
Add or subtract monomials in-place for efficient polynomial manipulation:
|
||||
|
||||
```rust
|
||||
@@ -194,6 +207,7 @@ f.sub_monomial_inplace(&one, 0 /*monmoial*/); // Subtracts 1 from f
|
||||
```
|
||||
|
||||
### Slicing
|
||||
|
||||
Extract specific components:
|
||||
|
||||
```rust
|
||||
@@ -203,6 +217,7 @@ let arbitrary_slice = f.slice(offset, stride, size);
|
||||
```
|
||||
|
||||
### Evaluate
|
||||
|
||||
Evaluate the polynomial:
|
||||
|
||||
```rust
|
||||
@@ -216,6 +231,7 @@ f.eval_on_domain(HostSlice::from_slice(&domain), HostSlice::from_mut_slice(&mut
|
||||
```
|
||||
|
||||
### Read coefficients
|
||||
|
||||
Read or copy polynomial coefficients for further processing:
|
||||
|
||||
```rust
|
||||
@@ -227,6 +243,7 @@ f.copy_coeffs(0, &mut device_mem[..]);
|
||||
```
|
||||
|
||||
### Polynomial Degree
|
||||
|
||||
Determine the highest power of the variable with a non-zero coefficient:
|
||||
|
||||
```rust
|
||||
@@ -234,6 +251,7 @@ let deg = f.degree(); // Degree of the polynomial
|
||||
```
|
||||
|
||||
### Memory Management: Views (rust slices)
|
||||
|
||||
Rust enforces correct usage of views at compile time, eliminating the need for runtime checks:
|
||||
|
||||
```rust
|
||||
|
||||
@@ -1,13 +1,6 @@
|
||||
# Vector Operations API
|
||||
|
||||
Our vector operations API which is part of `icicle-cuda-runtime` package, includes fundamental methods for addition, subtraction, and multiplication of vectors, with support for both host and device memory.
|
||||
|
||||
|
||||
## Supported curves
|
||||
|
||||
Vector operations are supported on the following curves:
|
||||
|
||||
`bls12-377`, `bls12-381`, `bn-254`, `bw6-761`, `grumpkin`
|
||||
Our vector operations API, which is part of the `icicle-cuda-runtime` package, includes fundamental methods for addition, subtraction, and multiplication of vectors, with support for both host and device memory.
|
||||
|
||||
## Examples
|
||||
|
||||
@@ -59,7 +52,6 @@ let cfg = VecOpsConfig::default();
|
||||
mul_scalars(&a, &ones, &mut result, &cfg).unwrap();
|
||||
```
|
||||
|
||||
|
||||
## Vector Operations Configuration
|
||||
|
||||
The `VecOpsConfig` struct encapsulates the settings for vector operations, including device context and operation modes.
|
||||
@@ -90,7 +82,7 @@ pub struct VecOpsConfig<'a> {
|
||||
|
||||
`VecOpsConfig` can be initialized with default settings tailored for a specific device:
|
||||
|
||||
```
|
||||
```rust
|
||||
let cfg = VecOpsConfig::default();
|
||||
```
|
||||
|
||||
@@ -118,7 +110,7 @@ impl<'a> VecOpsConfig<'a> {
|
||||
|
||||
## Vector Operations
|
||||
|
||||
Vector operations are implemented through the `VecOps` trait, these traits are implemented for all [supported curves](#supported-curves) providing methods for addition, subtraction, and multiplication of vectors.
|
||||
Vector operations are implemented through the `VecOps` trait, providing methods for addition, subtraction, and multiplication of vectors.
|
||||
|
||||
### `VecOps` Trait
|
||||
|
||||
@@ -155,7 +147,6 @@ All operations are element-wise operations, and the results placed into the `res
|
||||
- **`sub`**: Computes the element-wise difference between two vectors.
|
||||
- **`mul`**: Performs element-wise multiplication of two vectors.
|
||||
|
||||
|
||||
## MatrixTranspose API Documentation
|
||||
|
||||
This section describes the functionality of the `TransposeMatrix` function used for matrix transposition.
|
||||
@@ -186,8 +177,8 @@ where
|
||||
- **`column_size`**: The number of columns in the input matrix.
|
||||
- **`output`**: A mutable slice to store the transposed matrix. The slice can be stored on either the host or the device.
|
||||
- **`ctx`**: A reference to the `DeviceContext`, which provides information about the device where the operation will be performed.
|
||||
- **`on_device`**: A boolean flag indicating whether the inputs and outputs are on the device.
|
||||
- **`is_async`**: A boolean flag indicating whether the operation should be performed asynchronously.
|
||||
- **`on_device`**: A boolean flag indicating whether the inputs and outputs are on the device.
|
||||
- **`is_async`**: A boolean flag indicating whether the operation should be performed asynchronously.
|
||||
|
||||
### Return Value
|
||||
|
||||
@@ -209,9 +200,8 @@ transpose_matrix(&input, 5, 4, &mut output, &ctx, true, false)
|
||||
.expect("Failed to transpose matrix");
|
||||
```
|
||||
|
||||
|
||||
The function takes a matrix represented as a 1D slice, transposes it, and stores the result in another 1D slice. The input and output slices can be stored on either the host or the device, and the operation can be performed synchronously or asynchronously.
|
||||
|
||||
The function is generic and can work with any type `F` that implements the `FieldImpl` trait. The `<F as FieldImpl>::Config` type must also implement the `VecOps<F>` trait, which provides the `transpose` method used to perform the actual transposition.
|
||||
|
||||
The function returns an `IcicleResult<()>`, indicating whether the operation was successful or not.
|
||||
The function returns an `IcicleResult<()>`, indicating whether the operation was successful or not.
|
||||
|
||||
@@ -11,7 +11,7 @@ Ingonyama is a next-generation semiconductor company, focusing on Zero-Knowledge
|
||||
Currently our flagship products are:
|
||||
|
||||
- **ICICLE**:
|
||||
[ICICLE](https://github.com/ingonyama-zk/icicle) is a fully featured GPU accelerated cryptography library for building ZK provers. ICICLE allows you to accelerate your ZK existing protocols in a matter of hours or implement your protocol from scratch on GPU.
|
||||
[ICICLE](https://github.com/ingonyama-zk/icicle) is a fully featured GPU accelerated cryptography library for building ZK provers. ICICLE allows you to accelerate your existing ZK protocols in a matter of hours or implement your protocol from scratch on GPU.
|
||||
|
||||
---
|
||||
|
||||
@@ -39,7 +39,7 @@ Learn more about ICICLE and GPUs [here][ICICLE-OVERVIEW].
|
||||
|
||||
## Get in Touch
|
||||
|
||||
If you have any questions, ideas, or are thinking of building something in this space join the discussion on [Discord]. You can explore our code on [github](https://github.com/ingonyama-zk) or read some of [our research papers](https://github.com/ingonyama-zk/papers).
|
||||
If you have any questions, ideas, or are thinking of building something in this space, join the discussion on [Discord]. You can explore our code on [github](https://github.com/ingonyama-zk) or read some of [our research papers](https://github.com/ingonyama-zk/papers).
|
||||
|
||||
Follow us on [Twitter](https://x.com/Ingo_zk) and [YouTube](https://www.youtube.com/@ingo_ZK) and sign up for our [mailing list](https://wkf.ms/3LKCbdj) to get our latest announcements.
|
||||
|
||||
|
||||
19
docs/package-lock.json
generated
@@ -3680,8 +3680,6 @@
|
||||
"version": "8.12.0",
|
||||
"resolved": "https://registry.npmjs.org/ajv/-/ajv-8.12.0.tgz",
|
||||
"integrity": "sha512-sRu1kpcO9yLtYxBKvqfTeh9KzZEwO3STyX1HT+4CaDzC6HpTGYhIhPIzj9XuKU7KYDwnaeh5hcOwjy1QuJzBPA==",
|
||||
"optional": true,
|
||||
"peer": true,
|
||||
"dependencies": {
|
||||
"fast-deep-equal": "^3.1.1",
|
||||
"json-schema-traverse": "^1.0.0",
|
||||
@@ -3696,9 +3694,7 @@
|
||||
"node_modules/ajv-formats/node_modules/json-schema-traverse": {
|
||||
"version": "1.0.0",
|
||||
"resolved": "https://registry.npmjs.org/json-schema-traverse/-/json-schema-traverse-1.0.0.tgz",
|
||||
"integrity": "sha512-NM8/P9n3XjXhIZn1lLhkFaACTOURQXjWhV4BA/RnOv8xvgqtqpAX9IO4mRQxSx1Rlo4tqzeqb0sOlruaOy3dug==",
|
||||
"optional": true,
|
||||
"peer": true
|
||||
"integrity": "sha512-NM8/P9n3XjXhIZn1lLhkFaACTOURQXjWhV4BA/RnOv8xvgqtqpAX9IO4mRQxSx1Rlo4tqzeqb0sOlruaOy3dug=="
|
||||
},
|
||||
"node_modules/ajv-keywords": {
|
||||
"version": "3.5.2",
|
||||
@@ -16344,13 +16340,14 @@
|
||||
"version": "2.1.1",
|
||||
"resolved": "https://registry.npmjs.org/ajv-formats/-/ajv-formats-2.1.1.tgz",
|
||||
"integrity": "sha512-Wx0Kx52hxE7C18hkMEggYlEifqWZtYaRgouJor+WMdPnQyEK13vgEWyVNup7SoeeoLMsr4kf5h6dOW11I15MUA==",
|
||||
"requires": {},
|
||||
"requires": {
|
||||
"ajv": "^8.0.0"
|
||||
},
|
||||
"dependencies": {
|
||||
"ajv": {
|
||||
"version": "https://registry.npmjs.org/ajv/-/ajv-8.12.0.tgz",
|
||||
"version": "8.12.0",
|
||||
"resolved": "https://registry.npmjs.org/ajv/-/ajv-8.12.0.tgz",
|
||||
"integrity": "sha512-sRu1kpcO9yLtYxBKvqfTeh9KzZEwO3STyX1HT+4CaDzC6HpTGYhIhPIzj9XuKU7KYDwnaeh5hcOwjy1QuJzBPA==",
|
||||
"optional": true,
|
||||
"peer": true,
|
||||
"requires": {
|
||||
"fast-deep-equal": "^3.1.1",
|
||||
"json-schema-traverse": "^1.0.0",
|
||||
@@ -16361,9 +16358,7 @@
|
||||
"json-schema-traverse": {
|
||||
"version": "1.0.0",
|
||||
"resolved": "https://registry.npmjs.org/json-schema-traverse/-/json-schema-traverse-1.0.0.tgz",
|
||||
"integrity": "sha512-NM8/P9n3XjXhIZn1lLhkFaACTOURQXjWhV4BA/RnOv8xvgqtqpAX9IO4mRQxSx1Rlo4tqzeqb0sOlruaOy3dug==",
|
||||
"optional": true,
|
||||
"peer": true
|
||||
"integrity": "sha512-NM8/P9n3XjXhIZn1lLhkFaACTOURQXjWhV4BA/RnOv8xvgqtqpAX9IO4mRQxSx1Rlo4tqzeqb0sOlruaOy3dug=="
|
||||
}
|
||||
}
|
||||
},
|
||||
|
||||
@@ -24,6 +24,42 @@ module.exports = {
|
||||
label: "ICICLE Core",
|
||||
id: "icicle/core",
|
||||
},
|
||||
{
|
||||
type: "category",
|
||||
label: "Primitives",
|
||||
link: {
|
||||
type: `doc`,
|
||||
id: 'icicle/primitives/overview',
|
||||
},
|
||||
collapsed: true,
|
||||
items: [
|
||||
{
|
||||
type: "doc",
|
||||
label: "MSM",
|
||||
id: "icicle/primitives/msm",
|
||||
},
|
||||
{
|
||||
type: "doc",
|
||||
label: "NTT",
|
||||
id: "icicle/primitives/ntt",
|
||||
},
|
||||
{
|
||||
type: "doc",
|
||||
label: "Poseidon Hash",
|
||||
id: "icicle/primitives/poseidon",
|
||||
},
|
||||
],
|
||||
},
|
||||
{
|
||||
type: "doc",
|
||||
label: "Polynomials",
|
||||
id: "icicle/polynomials/overview",
|
||||
},
|
||||
{
|
||||
type: "doc",
|
||||
label: "Multi GPU Support",
|
||||
id: "icicle/multi-gpu",
|
||||
},
|
||||
{
|
||||
type: "category",
|
||||
label: "Golang bindings",
|
||||
@@ -123,42 +159,6 @@ module.exports = {
|
||||
},
|
||||
],
|
||||
},
|
||||
{
|
||||
type: "category",
|
||||
label: "Primitives",
|
||||
link: {
|
||||
type: `doc`,
|
||||
id: 'icicle/primitives/overview',
|
||||
},
|
||||
collapsed: true,
|
||||
items: [
|
||||
{
|
||||
type: "doc",
|
||||
label: "MSM",
|
||||
id: "icicle/primitives/msm",
|
||||
},
|
||||
{
|
||||
type: "doc",
|
||||
label: "NTT",
|
||||
id: "icicle/primitives/ntt",
|
||||
},
|
||||
{
|
||||
type: "doc",
|
||||
label: "Poseidon Hash",
|
||||
id: "icicle/primitives/poseidon",
|
||||
},
|
||||
],
|
||||
},
|
||||
{
|
||||
type: "doc",
|
||||
label: "Polynomials",
|
||||
id: "icicle/polynomials/overview",
|
||||
},
|
||||
{
|
||||
type: "doc",
|
||||
label: "Multi GPU Support",
|
||||
id: "icicle/multi-gpu",
|
||||
},
|
||||
{
|
||||
type: "doc",
|
||||
label: "Google Colab Instructions",
|
||||
@@ -190,6 +190,7 @@ module.exports = {
|
||||
type: "category",
|
||||
label: "Additional Resources",
|
||||
collapsed: false,
|
||||
collapsible: false,
|
||||
items: [
|
||||
{
|
||||
type: "link",
|
||||
|
||||
@@ -88,7 +88,7 @@ void point_near_x(T x, affine_t *point) {
|
||||
}
|
||||
|
||||
static int seed = 0;
|
||||
static HOST_INLINE T rand_host_seed()
|
||||
static T rand_host_seed()
|
||||
{
|
||||
std::mt19937_64 generator(seed++);
|
||||
std::uniform_int_distribution<unsigned> distribution;
|
||||
|
||||
27
examples/c++/polynomial-api/CMakeLists.txt
Normal file
@@ -0,0 +1,27 @@
|
||||
cmake_minimum_required(VERSION 3.18)
|
||||
set(CMAKE_CXX_STANDARD 17)
|
||||
set(CMAKE_CUDA_STANDARD 17)
|
||||
set(CMAKE_CUDA_STANDARD_REQUIRED TRUE)
|
||||
set(CMAKE_CXX_STANDARD_REQUIRED TRUE)
|
||||
if (${CMAKE_VERSION} VERSION_LESS "3.24.0")
|
||||
set(CMAKE_CUDA_ARCHITECTURES ${CUDA_ARCH})
|
||||
else()
|
||||
set(CMAKE_CUDA_ARCHITECTURES native) # on 3.24+, on earlier it is ignored, and the target is not passed
|
||||
endif ()
|
||||
project(example LANGUAGES CUDA CXX)
|
||||
|
||||
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --expt-relaxed-constexpr -DCURVE_ID=BN254")
|
||||
set(CMAKE_CUDA_FLAGS_RELEASE "")
|
||||
set(CMAKE_CUDA_FLAGS_DEBUG "${CMAKE_CUDA_FLAGS_DEBUG} -g -G -O0")
|
||||
|
||||
add_executable(
|
||||
example
|
||||
example.cu
|
||||
)
|
||||
|
||||
set_target_properties(example PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
|
||||
target_include_directories(example PRIVATE "../../../icicle/include")
|
||||
|
||||
# To link against another curve/field, change the library below and set FIELD_ID accordingly
|
||||
target_link_libraries(example ${CMAKE_SOURCE_DIR}/build/icicle/lib/libingo_field_bn254.a)
|
||||
target_compile_definitions(example PUBLIC FIELD_ID BN254)
|
||||
49
examples/c++/polynomial-api/README.md
Normal file
@@ -0,0 +1,49 @@
|
||||
# ICICLE examples: computations with polynomials
|
||||
|
||||
## Best-Practices
|
||||
|
||||
We recommend running our examples in [ZK-containers](../../ZK-containers.md) to save time and effort.
|
||||
|
||||
## Key-Takeaway
|
||||
|
||||
Polynomials are crucial for Zero-Knowledge Proofs (ZKPs): they enable efficient representation and verification of computational statements, facilitate privacy-preserving protocols, and support complex mathematical operations essential for constructing and verifying proofs without revealing underlying data. The Polynomial API is documented [here](https://dev.ingonyama.com/icicle/polynomials/overview).
|
||||
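For orientation, here is a condensed sketch of the API in use, reduced to a single evaluation. It is a distilled version of the full `example.cu` shown later in this diff and assumes the bn254 build produced by `compile.sh` with the link setup from `CMakeLists.txt`; see `example.cu` for the complete setup and the remaining examples.

```c++
#include <iostream>
#include <memory>

#include "polynomials/polynomials.h"
#include "polynomials/cuda_backend/polynomial_cuda_backend.cuh"
#include "ntt/ntt.cuh"

using namespace polynomials;
typedef Polynomial<scalar_t> Polynomial_t;

int main()
{
  // One-time setup: NTT domain and the CUDA polynomial backend (as in example.cu).
  auto ntt_config = ntt::default_ntt_config<scalar_t>();
  ntt::init_domain(scalar_t::omega(24), ntt_config.ctx);
  Polynomial_t::initialize(std::make_unique<CUDAPolynomialFactory<>>());

  // f(x) = 1 + 2x + 3x^2, evaluated at a random point.
  const scalar_t coeffs[3] = {scalar_t::one(), scalar_t::from(2), scalar_t::from(3)};
  auto f = Polynomial_t::from_coefficients(coeffs, 3);
  std::cout << "f(r) = " << f(scalar_t::rand_host()) << std::endl;
  return 0;
}
```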
|
||||
## Running the example
|
||||
|
||||
To run the example from the project root directory:
|
||||
|
||||
```sh
|
||||
cd examples/c++/polynomial-api
|
||||
./compile.sh
|
||||
./run.sh
|
||||
```
|
||||
|
||||
To change the scalar field, modify `compile.sh` to build the corresponding library, and update `CMakeLists.txt` to link against that library and to set `FIELD_ID` accordingly.
|
||||
|
||||
## What's in the examples
|
||||
|
||||
- `example_evaluate`: Build a polynomial from coefficients and evaluate it at a random point.
|
||||
|
||||
- `example_clone`: Make a separate copy of a polynomial.
|
||||
|
||||
- `example_from_rou`: Reconstruct a polynomial from its values at the roots of unity. This operation is a cornerstone of efficient zero-knowledge proof implementations, particularly in proof construction, verification, and polynomial arithmetic. By leveraging the algebraic structure and computational properties of roots of unity, ZKP protocols can achieve the scalability, efficiency, and privacy necessary for practical applications in blockchain, secure computation, and beyond.
|
||||
|
||||
- `example_addition`, `example_addition_inplace`: Different flavors of polynomial addition.
|
||||
|
||||
- `example_multiplication`: A product of two polynomials.
|
||||
|
||||
- `example_multiplicationScalar`: A product of a scalar and a polynomial.
|
||||
|
||||
- `example_monomials`: Add/subtract a monomial to a polynomial. A monomial is a single term: the product of a constant coefficient and a variable raised to a non-negative integer power.
|
||||
|
||||
- `example_ReadCoeffsToHost`: Download the coefficients of a polynomial to the host. `ICICLE` keeps all polynomials on the GPU, so host-side processing requires such a transfer.
|
||||
|
||||
- `example_divisionSmall`, `example_divisionLarge`: Different flavors of division.
|
||||
|
||||
- `example_divideByVanishingPolynomial`: A vanishing polynomial over a set $S$ is a polynomial that evaluates to zero for every element in $S$. For a simple case, consider the single-element set $S=\{a\}$. The polynomial $f(x)=x-a$ vanishes over $S$ because $f(a)=0$. Mathematically, dividing a polynomial $P(x)$ by a vanishing polynomial $V(x)$ typically involves finding another polynomial $Q(x)$ and possibly a remainder $R(x)$ such that $P(x)=Q(x)V(x)+R(x)$, where $R(x)$ has lower degree than $V(x)$. In many cryptographic applications, the focus is on ensuring that $P(x)$ is exactly divisible by $V(x)$, meaning $R(x)=0$. A condensed code sketch of this check follows this list.
|
||||
|
||||
- `example_EvenOdd`: the even (odd) method keeps the even (odd) coefficients of the original polynomial. For $f(x) = 1+2x+3x^2+4x^3$, the even polynomial is $1+3x$ and the odd polynomial is $2+4x$.
|
||||
|
||||
- `example_Slice`: generalizes the even/odd methods and keeps coefficients for a given offset and stride. For $f(x) = 1+2x+3x^2+4x^3$, a slice with offset 0 and stride 3 gives $1+4x$.
|
||||
|
||||
- `example_DeviceMemoryView`: device-memory views allow passing polynomials to other GPU functions without copying through the host. In this example the coefficients of a polynomial are committed to a Merkle tree, bypassing the host.
|
||||
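As referenced in the `example_divideByVanishingPolynomial` item above, here is a condensed sketch of the exact-division check. It reuses `Polynomial_t` and the NTT/backend setup from `example.cu` (shown later in this diff), which must run first; the specific field values and sizes below are illustrative only.

```c++
// Minimal sketch; assumes the NTT domain and the CUDA polynomial backend are
// already initialized as in the main() of example.cu.
void sketch_divide_by_vanishing()
{
  const auto zero = scalar_t::zero(), one = scalar_t::one();
  const scalar_t coeffs_v[5] = {zero - one, zero, zero, zero, one}; // V(x) = x^4 - 1
  auto v = Polynomial_t::from_coefficients(coeffs_v, 5);

  const scalar_t coeffs_h[3] = {one, one, one};                     // h(x) = 1 + x + x^2
  auto h = Polynomial_t::from_coefficients(coeffs_h, 3);

  auto p = h * v;                                // P(x) = h(x) * V(x), so R(x) = 0
  auto [q, r] = p.divide(v);                     // generic division: P = q*V + r
  auto q2 = p.divide_by_vanishing_polynomial(4); // specialized exact division by x^4 - 1
  // q and q2 should both reconstruct h, and r should be the zero polynomial.
  std::cout << "deg(q) = " << q.degree() << ", deg(r) = " << r.degree() << std::endl;
}
```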
15
examples/c++/polynomial-api/compile.sh
Executable file
@@ -0,0 +1,15 @@
|
||||
#!/bin/bash
|
||||
|
||||
# Exit immediately on error
|
||||
set -e
|
||||
|
||||
mkdir -p build/example
|
||||
mkdir -p build/icicle
|
||||
|
||||
# Configure and build Icicle
|
||||
cmake -S ../../../icicle/ -B build/icicle -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254 -DG2=OFF
|
||||
cmake --build build/icicle
|
||||
|
||||
# Configure and build the example application
|
||||
cmake -S . -B build/example
|
||||
cmake --build build/example
|
||||
333
examples/c++/polynomial-api/example.cu
Normal file
@@ -0,0 +1,333 @@
|
||||
#include <iostream>
|
||||
|
||||
#include "polynomials/polynomials.h"
|
||||
#include "polynomials/cuda_backend/polynomial_cuda_backend.cuh"
|
||||
#include "ntt/ntt.cuh"
|
||||
#include "poseidon/tree/merkle.cuh"
|
||||
|
||||
// using namespace field_config;
|
||||
using namespace polynomials;
|
||||
using namespace merkle;
|
||||
|
||||
// define the polynomial type
|
||||
typedef Polynomial<scalar_t> Polynomial_t;
|
||||
|
||||
// we'll use the following constants in the examples
|
||||
const auto zero = scalar_t::zero();
|
||||
const auto one = scalar_t::one();
|
||||
const auto two = scalar_t::from(2);
|
||||
const auto three = scalar_t::from(3);
|
||||
const auto four = scalar_t::from(4);
|
||||
const auto five = scalar_t::from(5);
|
||||
const auto minus_one = zero - one;
|
||||
|
||||
void example_evaluate()
|
||||
{
|
||||
std::cout << std::endl << "Example: Polynomial evaluation on random value" << std::endl;
|
||||
const scalar_t coeffs[3] = {one, two, three};
|
||||
auto f = Polynomial_t::from_coefficients(coeffs, 3);
|
||||
std::cout << "f = " << f << std::endl;
|
||||
scalar_t x = scalar_t::rand_host();
|
||||
std::cout << "x = " << x << std::endl;
|
||||
auto fx = f(x);
|
||||
std::cout << "f(x) = " << fx << std::endl;
|
||||
}
|
||||
|
||||
void example_from_rou(const int size)
|
||||
{
|
||||
std::cout << std::endl << "Example: Reconstruct polynomial from values at roots of unity" << std::endl;
|
||||
const int log_size = (int)ceil(log2(size));
|
||||
const int nof_evals = 1 << log_size;
|
||||
auto coeff = std::make_unique<scalar_t[]>(size);
|
||||
for (int i = 0; i < size; i++)
|
||||
coeff[i] = scalar_t::rand_host();
|
||||
auto f = Polynomial_t::from_coefficients(coeff.get(), size);
|
||||
// rou: root of unity
|
||||
auto omega = scalar_t::omega(log_size);
|
||||
scalar_t evals[nof_evals] = {scalar_t::zero()};
|
||||
auto x = scalar_t::one();
|
||||
for (int i = 0; i < nof_evals; ++i) {
|
||||
evals[i] = f(x);
|
||||
x = x * omega;
|
||||
}
|
||||
// reconstruct f from evaluations
|
||||
auto fr = Polynomial_t::from_rou_evaluations(evals, nof_evals);
|
||||
// check for equality f-fr==0
|
||||
auto h = f - fr;
|
||||
std::cout << "degree of f - fr = " << h.degree() << std::endl;
|
||||
}
|
||||
|
||||
static Polynomial_t randomize_polynomial(uint32_t size)
|
||||
{
|
||||
auto coeff = std::make_unique<scalar_t[]>(size);
|
||||
for (int i = 0; i < size; i++)
|
||||
coeff[i] = scalar_t::rand_host();
|
||||
return Polynomial_t::from_coefficients(coeff.get(), size);
|
||||
}
|
||||
|
||||
static Polynomial_t incremental_values(uint32_t size)
|
||||
{
|
||||
auto coeff = std::make_unique<scalar_t[]>(size);
|
||||
for (int i = 0; i < size; i++) {
|
||||
coeff[i] = i ? coeff[i - 1] + scalar_t::one() : scalar_t::one();
|
||||
}
|
||||
return Polynomial_t::from_coefficients(coeff.get(), size);
|
||||
}
|
||||
|
||||
static bool is_equal(Polynomial_t& lhs, Polynomial_t& rhs)
|
||||
{
|
||||
const int deg_lhs = lhs.degree();
|
||||
const int deg_rhs = rhs.degree();
|
||||
if (deg_lhs != deg_rhs) { return false; }
|
||||
auto lhs_coeffs = std::make_unique<scalar_t[]>(deg_lhs);
|
||||
auto rhs_coeffs = std::make_unique<scalar_t[]>(deg_rhs);
|
||||
lhs.copy_coeffs(lhs_coeffs.get(), 1, deg_lhs - 1);
|
||||
rhs.copy_coeffs(rhs_coeffs.get(), 1, deg_rhs - 1);
|
||||
return memcmp(lhs_coeffs.get(), rhs_coeffs.get(), deg_lhs * sizeof(scalar_t)) == 0;
|
||||
}
|
||||
|
||||
void example_addition(const int size0, const int size1)
|
||||
{
|
||||
std::cout << std::endl << "Example: Polynomial addition" << std::endl;
|
||||
auto f = randomize_polynomial(size0);
|
||||
auto g = randomize_polynomial(size1);
|
||||
auto x = scalar_t::rand_host();
|
||||
auto f_x = f(x);
|
||||
auto g_x = g(x);
|
||||
auto fx_plus_gx = f_x + g_x;
|
||||
auto h = f + g;
|
||||
auto h_x = h(x);
|
||||
std::cout << "evaluate and add: " << fx_plus_gx << std::endl;
|
||||
std::cout << "add and evaluate: " << h_x << std::endl;
|
||||
}
|
||||
|
||||
void example_addition_inplace(const int size0, const int size1)
|
||||
{
|
||||
std::cout << std::endl << "Example: Polynomial inplace addition" << std::endl;
|
||||
auto f = randomize_polynomial(size0);
|
||||
auto g = randomize_polynomial(size1);
|
||||
|
||||
auto x = scalar_t::rand_host();
|
||||
auto f_x = f(x);
|
||||
auto g_x = g(x);
|
||||
auto fx_plus_gx = f_x + g_x;
|
||||
f += g;
|
||||
auto s_x = f(x);
|
||||
std::cout << "evaluate and add: " << fx_plus_gx << std::endl;
|
||||
std::cout << "add and evaluate: " << s_x << std::endl;
|
||||
}
|
||||
|
||||
void example_multiplication(const int log0, const int log1)
|
||||
{
|
||||
std::cout << std::endl << "Example: Polynomial multiplication" << std::endl;
|
||||
const int size0 = 1 << log0, size1 = 1 << log1;
|
||||
auto f = randomize_polynomial(size0);
|
||||
auto g = randomize_polynomial(size1);
|
||||
scalar_t x = scalar_t::rand_host();
|
||||
auto fx = f(x);
|
||||
auto gx = g(x);
|
||||
auto fx_mul_gx = fx * gx;
|
||||
auto m = f * g;
|
||||
auto mx = m(x);
|
||||
std::cout << "evaluate and multiply: " << fx_mul_gx << std::endl;
|
||||
std::cout << "multiply and evaluate: " << mx << std::endl;
|
||||
}
|
||||
|
||||
void example_multiplicationScalar(const int log0)
|
||||
{
|
||||
std::cout << std::endl << "Example: Scalar by Polynomial multiplication" << std::endl;
|
||||
const int size = 1 << log0;
|
||||
auto f = randomize_polynomial(size);
|
||||
auto s = scalar_t::from(2);
|
||||
auto g = s * f;
|
||||
auto x = scalar_t::rand_host();
|
||||
auto fx = f(x);
|
||||
auto fx2 = s * fx;
|
||||
auto gx = g(x);
|
||||
std::cout << "Compare (2*f)(x) and 2*f(x): " << std::endl;
|
||||
std::cout << gx << std::endl;
|
||||
std::cout << fx2 << std::endl;
|
||||
}
|
||||
|
||||
void example_monomials()
|
||||
{
|
||||
std::cout << std::endl << "Example: Monomials" << std::endl;
|
||||
const scalar_t coeffs[3] = {one, zero, two}; // 1+2x^2
|
||||
auto f = Polynomial_t::from_coefficients(coeffs, 3);
|
||||
const auto x = three;
|
||||
auto fx = f(x);
|
||||
f.add_monomial_inplace(three, 1); // add 3x
|
||||
const auto expected_addmonmon_f_x = fx + three * x;
|
||||
const auto addmonom_f_x = f(x);
|
||||
std::cout << "Computed f'(x) = " << addmonom_f_x << std::endl;
|
||||
std::cout << "Expected f'(x) = " << expected_addmonmon_f_x << std::endl;
|
||||
}
|
||||
|
||||
void example_ReadCoeffsToHost()
|
||||
{
|
||||
std::cout << std::endl << "Example: Read coefficients to host" << std::endl;
|
||||
const scalar_t coeffs_f[3] = {zero, one, two}; // 0+1x+2x^2
|
||||
auto f = Polynomial_t::from_coefficients(coeffs_f, 3);
|
||||
const scalar_t coeffs_g[3] = {one, one, one}; // 1+x+x^2
|
||||
auto g = Polynomial_t::from_coefficients(coeffs_g, 3);
|
||||
auto h = f + g; // 1+2x+3x^2
|
||||
std::cout << "Get one coefficient of h() at a time: " << std::endl;
|
||||
const auto h0 = h.get_coeff(0);
|
||||
const auto h1 = h.get_coeff(1);
|
||||
const auto h2 = h.get_coeff(2);
|
||||
std::cout << "Coefficients of h: " << std::endl;
|
||||
std::cout << "0:" << h0 << " expected: " << one << std::endl;
|
||||
std::cout << "1:" << h1 << " expected: " << two << std::endl;
|
||||
std::cout << "2:" << h2 << " expected: " << three << std::endl;
|
||||
std::cout << "Get all coefficients of h() at a time: " << std::endl;
|
||||
|
||||
scalar_t h_coeffs[3] = {0};
|
||||
// fetch the coefficients for a given range
|
||||
auto nof_coeffs = h.copy_coeffs(h_coeffs, 0, 2);
|
||||
scalar_t expected_h_coeffs[nof_coeffs] = {one, two, three};
|
||||
for (int i = 0; i < nof_coeffs; ++i) {
|
||||
std::cout << i << ":" << h_coeffs[i] << " expected: " << expected_h_coeffs[i] << std::endl;
|
||||
}
|
||||
}
|
||||
|
||||
void example_divisionSmall()
|
||||
{
|
||||
std::cout << std::endl << "Example: Polynomial division (small)" << std::endl;
|
||||
const scalar_t coeffs_a[4] = {five, zero, four, three}; // 3x^3+4x^2+5
|
||||
const scalar_t coeffs_b[3] = {minus_one, zero, one}; // x^2-1
|
||||
auto a = Polynomial_t::from_coefficients(coeffs_a, 4);
|
||||
auto b = Polynomial_t::from_coefficients(coeffs_b, 3);
|
||||
auto [q, r] = a.divide(b);
|
||||
scalar_t q_coeffs[2] = {0}; // 3x+4
|
||||
scalar_t r_coeffs[2] = {0}; // 3x+9
|
||||
const auto q_nof_coeffs = q.copy_coeffs(q_coeffs, 0, 1);
|
||||
const auto r_nof_coeffs = r.copy_coeffs(r_coeffs, 0, 1);
|
||||
std::cout << "Quotient: 0:" << q_coeffs[0] << " expected: " << scalar_t::from(4) << std::endl;
|
||||
std::cout << "Quotient: 1:" << q_coeffs[1] << " expected: " << scalar_t::from(3) << std::endl;
|
||||
std::cout << "Reminder: 0:" << r_coeffs[0] << " expected: " << scalar_t::from(9) << std::endl;
|
||||
std::cout << "Reminder: 1:" << r_coeffs[1] << " expected: " << scalar_t::from(3) << std::endl;
|
||||
}
|
||||
|
||||
void example_divisionLarge(const int log0, const int log1)
|
||||
{
|
||||
std::cout << std::endl << "Example: Polynomial division (large)" << std::endl;
|
||||
const int size0 = 1 << log0, size1 = 1 << log1;
|
||||
auto a = randomize_polynomial(size0);
|
||||
auto b = randomize_polynomial(size1);
|
||||
auto [q, r] = a.divide(b);
|
||||
scalar_t x = scalar_t::rand_host();
|
||||
auto ax = a(x);
|
||||
auto bx = b(x);
|
||||
auto qx = q(x);
|
||||
auto rx = r(x);
|
||||
// check if a(x) == b(x)*q(x)+r(x)
|
||||
std::cout << "a(x) == b(x)*q(x)+r(x)" << std::endl;
|
||||
std::cout << "lhs = " << ax << std::endl;
|
||||
std::cout << "rhs = " << bx * qx + rx << std::endl;
|
||||
}
|
||||
|
||||
void example_divideByVanishingPolynomial()
|
||||
{
|
||||
std::cout << std::endl << "Example: Polynomial division by vanishing polynomial" << std::endl;
|
||||
const scalar_t coeffs_v[5] = {minus_one, zero, zero, zero, one}; // x^4-1 vanishes on 4th roots of unity
|
||||
auto v = Polynomial_t::from_coefficients(coeffs_v, 5);
|
||||
auto h = incremental_values(1 << 11);
|
||||
auto hv = h * v;
|
||||
auto [h_div, R] = hv.divide(v);
|
||||
std::cout << "h_div == h: " << is_equal(h_div, h) << std::endl;
|
||||
auto h_div_by_vanishing = hv.divide_by_vanishing_polynomial(4);
|
||||
std::cout << "h_div_by_vanishing == h: " << is_equal(h_div_by_vanishing, h) << std::endl;
|
||||
}
|
||||
|
||||
void example_clone(const int log0)
|
||||
{
|
||||
std::cout << std::endl << "Example: clone polynomial" << std::endl;
|
||||
const int size = 1 << log0;
|
||||
auto f = randomize_polynomial(size);
|
||||
const auto x = scalar_t::rand_host();
|
||||
const auto fx = f(x);
|
||||
Polynomial_t g;
|
||||
g = f.clone();
|
||||
g += f;
|
||||
auto h = g.clone();
|
||||
std::cout << "g(x) = " << g(x) << " expected: " << two * fx << std::endl;
|
||||
std::cout << "h(x) = " << h(x) << " expected: " << g(x) << std::endl;
|
||||
}
|
||||
|
||||
void example_EvenOdd() {
|
||||
std::cout << std::endl << "Example: Split into even and odd powers " << std::endl;
|
||||
const scalar_t coeffs[4] = {one, two, three, four}; // 1+2x+3x^2+4x^3
|
||||
auto f = Polynomial_t::from_coefficients(coeffs, 4);
|
||||
auto f_even = f.even();
|
||||
auto f_odd = f.odd();
|
||||
scalar_t even_coeffs[2] = {0};
|
||||
scalar_t odd_coeffs[2] = {0};
|
||||
const auto even_nof_coeffs = f_even.copy_coeffs(even_coeffs, 0, 1);
|
||||
const auto odd_nof_coeffs = f_odd.copy_coeffs(odd_coeffs, 0, 1);
|
||||
std::cout << "Even: 0:" << even_coeffs[0] << " expected: " << one << std::endl;
|
||||
std::cout << "Even: 1:" << even_coeffs[1] << " expected: " << three << std::endl;
|
||||
std::cout << "Odd: 0:" << odd_coeffs[0] << " expected: " << two << std::endl;
|
||||
std::cout << "Odd: 1:" << odd_coeffs[1] << " expected: " << four << std::endl;
|
||||
}
|
||||
|
||||
void example_Slice() {
|
||||
std::cout << std::endl << "Example: Slice polynomial " << std::endl;
|
||||
const scalar_t coeffs[4] = {one, two, three, four}; // 1+2x+3x^2+4x^3
|
||||
auto f = Polynomial_t::from_coefficients(coeffs, 4);
|
||||
auto f_slice = f.slice(0 /*=offset*/, 3 /*=stride*/, 2 /*=size*/); // 1+4x
|
||||
scalar_t slice_coeffs[2] = {0};
|
||||
const auto slice_nof_coeffs = f_slice.copy_coeffs(slice_coeffs, 0, 1);
|
||||
std::cout << "Slice: 0:" << slice_coeffs[0] << " expected: " << one << std::endl;
|
||||
std::cout << "Slice: 1:" << slice_coeffs[1] << " expected: " << four << std::endl;
|
||||
}
|
||||
|
||||
void example_DeviceMemoryView() {
|
||||
const int log_size = 6;
|
||||
const int size = 1 << log_size;
|
||||
auto f = randomize_polynomial(size);
|
||||
auto [d_coeff, N, device_id] = f.get_coefficients_view();
|
||||
// commit coefficients to Merkle tree
|
||||
device_context::DeviceContext ctx = device_context::get_default_device_context();
|
||||
PoseidonConstants<scalar_t> constants;
|
||||
init_optimized_poseidon_constants<scalar_t>(2, ctx, &constants);
|
||||
uint32_t tree_height = log_size + 1;
|
||||
int keep_rows = 0; // keep all rows
|
||||
size_t digests_len = log_size - 1;
|
||||
scalar_t* digests = static_cast<scalar_t*>(malloc(sizeof(scalar_t) * digests_len));
|
||||
TreeBuilderConfig config = default_merkle_config();
|
||||
config.keep_rows = keep_rows;
|
||||
config.are_inputs_on_device = true;
|
||||
build_merkle_tree<scalar_t, (2+1)>(d_coeff.get(), digests, tree_height, constants, config);
|
||||
std::cout << "Merkle tree root: " << digests[0] << std::endl;
|
||||
free(digests);
|
||||
}
|
||||
|
||||
int main(int argc, char** argv)
|
||||
{
|
||||
// Initialize NTT. TODO: can we hide this in the library?
|
||||
static const int MAX_NTT_LOG_SIZE = 24;
|
||||
auto ntt_config = ntt::default_ntt_config<scalar_t>();
|
||||
const scalar_t basic_root = scalar_t::omega(MAX_NTT_LOG_SIZE);
|
||||
ntt::init_domain(basic_root, ntt_config.ctx);
|
||||
|
||||
// Virtual factory design pattern: initialize the polynomials factory for the CUDA backend
|
||||
Polynomial_t::initialize(std::make_unique<CUDAPolynomialFactory<>>());
|
||||
|
||||
example_evaluate();
|
||||
example_clone(10);
|
||||
example_from_rou(100);
|
||||
example_addition(12, 17);
|
||||
example_addition_inplace(2, 2);
|
||||
example_multiplication(15, 12);
|
||||
example_multiplicationScalar(15);
|
||||
example_monomials();
|
||||
example_ReadCoeffsToHost();
|
||||
example_divisionSmall();
|
||||
example_divisionLarge(12, 2);
|
||||
example_divideByVanishingPolynomial();
|
||||
example_EvenOdd();
|
||||
example_Slice();
|
||||
example_DeviceMemoryView();
|
||||
|
||||
return 0;
|
||||
}
|
||||
2
examples/c++/polynomial-api/run.sh
Executable file
@@ -0,0 +1,2 @@
|
||||
#!/bin/bash
|
||||
./build/example/example
|
||||
2
go.mod
@@ -1,4 +1,4 @@
|
||||
module github.com/ingonyama-zk/icicle
|
||||
module github.com/ingonyama-zk/icicle/v2
|
||||
|
||||
go 1.20
|
||||
|
||||
|
||||
@@ -1,5 +1,5 @@
|
||||
function(check_field)
|
||||
set(SUPPORTED_FIELDS babybear)
|
||||
set(SUPPORTED_FIELDS babybear;stark252)
|
||||
|
||||
set(IS_FIELD_SUPPORTED FALSE)
|
||||
set(I 1000)
|
||||
@@ -14,4 +14,4 @@ function(check_field)
|
||||
if (NOT IS_FIELD_SUPPORTED)
|
||||
message( FATAL_ERROR "The value of FIELD variable: ${FIELD} is not one of the supported fields: ${SUPPORTED_FIELDS}" )
|
||||
endif ()
|
||||
endfunction()
|
||||
endfunction()
|
||||
@@ -8,9 +8,9 @@
|
||||
#include "hash/keccak/keccak.cuh"
|
||||
|
||||
extern "C" cudaError_t
|
||||
keccak256_cuda(uint8_t* input, int input_block_size, int number_of_blocks, uint8_t* output, KeccakConfig config);
|
||||
keccak256_cuda(uint8_t* input, int input_block_size, int number_of_blocks, uint8_t* output, keccak::KeccakConfig& config);
|
||||
|
||||
extern "C" cudaError_t
|
||||
keccak512_cuda(uint8_t* input, int input_block_size, int number_of_blocks, uint8_t* output, KeccakConfig config);
|
||||
keccak512_cuda(uint8_t* input, int input_block_size, int number_of_blocks, uint8_t* output, keccak::KeccakConfig& config);
|
||||
|
||||
#endif
|
||||
47
icicle/include/api/stark252.h
Normal file
@@ -0,0 +1,47 @@
|
||||
// WARNING: This file is auto-generated by a script.
|
||||
// Any changes made to this file may be overwritten.
|
||||
// Please modify the code generation script instead.
|
||||
// Path to the code generation script: scripts/gen_c_api.py
|
||||
|
||||
#pragma once
|
||||
#ifndef STARK252_API_H
|
||||
#define STARK252_API_H
|
||||
|
||||
#include <cuda_runtime.h>
|
||||
#include "gpu-utils/device_context.cuh"
|
||||
#include "fields/stark_fields/stark252.cuh"
|
||||
#include "ntt/ntt.cuh"
|
||||
#include "vec_ops/vec_ops.cuh"
|
||||
|
||||
extern "C" cudaError_t stark252_mul_cuda(
|
||||
stark252::scalar_t* vec_a, stark252::scalar_t* vec_b, int n, vec_ops::VecOpsConfig& config, stark252::scalar_t* result);
|
||||
|
||||
extern "C" cudaError_t stark252_add_cuda(
|
||||
stark252::scalar_t* vec_a, stark252::scalar_t* vec_b, int n, vec_ops::VecOpsConfig& config, stark252::scalar_t* result);
|
||||
|
||||
extern "C" cudaError_t stark252_sub_cuda(
|
||||
stark252::scalar_t* vec_a, stark252::scalar_t* vec_b, int n, vec_ops::VecOpsConfig& config, stark252::scalar_t* result);
|
||||
|
||||
extern "C" cudaError_t stark252_transpose_matrix_cuda(
|
||||
const stark252::scalar_t* input,
|
||||
uint32_t row_size,
|
||||
uint32_t column_size,
|
||||
stark252::scalar_t* output,
|
||||
device_context::DeviceContext& ctx,
|
||||
bool on_device,
|
||||
bool is_async);
|
||||
|
||||
extern "C" void stark252_generate_scalars(stark252::scalar_t* scalars, int size);
|
||||
|
||||
extern "C" cudaError_t stark252_scalar_convert_montgomery(
|
||||
stark252::scalar_t* d_inout, size_t n, bool is_into, device_context::DeviceContext& ctx);
|
||||
|
||||
extern "C" cudaError_t stark252_initialize_domain(
|
||||
stark252::scalar_t* primitive_root, device_context::DeviceContext& ctx, bool fast_twiddles_mode);
|
||||
|
||||
extern "C" cudaError_t stark252_ntt_cuda(
|
||||
const stark252::scalar_t* input, int size, ntt::NTTDir dir, ntt::NTTConfig<stark252::scalar_t>& config, stark252::scalar_t* output);
|
||||
|
||||
extern "C" cudaError_t stark252_release_domain(device_context::DeviceContext& ctx);
|
||||
|
||||
#endif
|
||||
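A hedged usage sketch for the vector-addition entry point declared in the header above. The helper `vec_ops::default_vec_ops_config()` is an assumption, following the `default_*_config` pattern used elsewhere in this codebase (`default_ntt_config`, `default_merkle_config`); check `vec_ops/vec_ops.cuh` for the actual name and the config's device/host flags before relying on it.

```c++
#include "api/stark252.h"

// Hedged sketch: element-wise addition of two host-resident vectors over the stark252 field.
// The config helper name below is assumed; verify against vec_ops/vec_ops.cuh.
void add_vectors_example(stark252::scalar_t* a, stark252::scalar_t* b, stark252::scalar_t* out, int n)
{
  vec_ops::VecOpsConfig config = vec_ops::default_vec_ops_config(); // assumed helper
  // With host pointers, the on-device flags in `config` should remain false.
  cudaError_t err = stark252_add_cuda(a, b, n, config, out);
  if (err != cudaSuccess) { /* handle the error */ }
}
```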
@@ -1,7 +1,7 @@
|
||||
#pragma once
|
||||
|
||||
#include "gpu-utils/sharedmem.cuh"
|
||||
#include "gpu-utils/modifiers.cuh"
|
||||
#include "../gpu-utils/sharedmem.cuh"
|
||||
#include "../gpu-utils/modifiers.cuh"
|
||||
#include <iostream>
|
||||
|
||||
template <class FF>
|
||||
@@ -11,26 +11,26 @@ public:
|
||||
FF x;
|
||||
FF y;
|
||||
|
||||
static HOST_DEVICE_INLINE Affine neg(const Affine& point) { return {point.x, FF::neg(point.y)}; }
|
||||
static Affine neg(const Affine& point) { return {point.x, FF::neg(point.y)}; }
|
||||
|
||||
static HOST_DEVICE_INLINE Affine zero() { return {FF::zero(), FF::zero()}; }
|
||||
static Affine zero() { return {FF::zero(), FF::zero()}; }
|
||||
|
||||
static HOST_DEVICE_INLINE Affine to_montgomery(const Affine& point)
|
||||
static Affine to_montgomery(const Affine& point)
|
||||
{
|
||||
return {FF::to_montgomery(point.x), FF::to_montgomery(point.y)};
|
||||
}
|
||||
|
||||
static HOST_DEVICE_INLINE Affine from_montgomery(const Affine& point)
|
||||
static Affine from_montgomery(const Affine& point)
|
||||
{
|
||||
return {FF::from_montgomery(point.x), FF::from_montgomery(point.y)};
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE bool operator==(const Affine& xs, const Affine& ys)
|
||||
friend bool operator==(const Affine& xs, const Affine& ys)
|
||||
{
|
||||
return (xs.x == ys.x) && (xs.y == ys.y);
|
||||
}
|
||||
|
||||
friend HOST_INLINE std::ostream& operator<<(std::ostream& os, const Affine& point)
|
||||
friend std::ostream& operator<<(std::ostream& os, const Affine& point)
|
||||
{
|
||||
os << "x: " << point.x << "; y: " << point.y;
|
||||
return os;
|
||||
@@ -39,9 +39,9 @@ public:
|
||||
|
||||
template <class FF>
|
||||
struct SharedMemory<Affine<FF>> {
|
||||
__device__ Affine<FF>* getPointer()
|
||||
Affine<FF>* getPointer()
|
||||
{
|
||||
extern __shared__ Affine<FF> s_affine_[];
|
||||
Affine<FF> *s_affine_ = nullptr;
|
||||
return s_affine_;
|
||||
}
|
||||
};
|
||||
|
||||
@@ -1,9 +1,9 @@
|
||||
#pragma once
|
||||
#pragma once
|
||||
#ifndef CURVE_CONFIG_H
|
||||
#define CURVE_CONFIG_H
|
||||
|
||||
#include "fields/id.h"
|
||||
#include "curves/projective.cuh"
|
||||
#include "../fields/id.h"
|
||||
#include "projective.cuh"
|
||||
|
||||
/**
|
||||
* @namespace curve_config
|
||||
@@ -12,23 +12,23 @@
|
||||
* with the `-DCURVE` env variable passed during build.
|
||||
*/
|
||||
#if CURVE_ID == BN254
|
||||
#include "curves/params/bn254.cuh"
|
||||
#include "params/bn254.cuh"
|
||||
namespace curve_config = bn254;
|
||||
|
||||
#elif CURVE_ID == BLS12_381
|
||||
#include "curves/params/bls12_381.cuh"
|
||||
#include "params/bls12_381.cuh"
|
||||
namespace curve_config = bls12_381;
|
||||
|
||||
#elif CURVE_ID == BLS12_377
|
||||
#include "curves/params/bls12_377.cuh"
|
||||
#include "params/bls12_377.cuh"
|
||||
namespace curve_config = bls12_377;
|
||||
|
||||
#elif CURVE_ID == BW6_761
|
||||
#include "curves/params/bw6_761.cuh"
|
||||
#include "params/bw6_761.cuh"
|
||||
namespace curve_config = bw6_761;
|
||||
|
||||
#elif CURVE_ID == GRUMPKIN
|
||||
#include "curves/params/grumpkin.cuh"
|
||||
#include "params/grumpkin.cuh"
|
||||
namespace curve_config = grumpkin;
|
||||
#endif
|
||||
#endif
|
||||
@@ -2,13 +2,13 @@
|
||||
#ifndef BN254_PARAMS_H
|
||||
#define BN254_PARAMS_H
|
||||
|
||||
#include "fields/storage.cuh"
|
||||
#include "../../fields/storage.cuh"
|
||||
|
||||
#include "curves/macro.h"
|
||||
#include "curves/projective.cuh"
|
||||
#include "fields/snark_fields/bn254_base.cuh"
|
||||
#include "fields/snark_fields/bn254_scalar.cuh"
|
||||
#include "fields/quadratic_extension.cuh"
|
||||
#include "../macro.h"
|
||||
#include "../projective.cuh"
|
||||
#include "../../fields/snark_fields/bn254_base.cuh"
|
||||
#include "../../fields/snark_fields/bn254_scalar.cuh"
|
||||
#include "../../fields/quadratic_extension.cuh"
|
||||
|
||||
namespace bn254 {
|
||||
// G1 and G2 generators
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
#pragma once
|
||||
|
||||
#include "affine.cuh"
|
||||
#include "gpu-utils/sharedmem.cuh"
|
||||
#include "../gpu-utils/sharedmem.cuh"
|
||||
|
||||
template <typename FF, class SCALAR_FF, const FF& B_VALUE, const FF& GENERATOR_X, const FF& GENERATOR_Y>
|
||||
class Projective
|
||||
@@ -19,34 +19,34 @@ public:
|
||||
FF y;
|
||||
FF z;
|
||||
|
||||
static HOST_DEVICE_INLINE Projective zero() { return {FF::zero(), FF::one(), FF::zero()}; }
|
||||
static Projective zero() { return {FF::zero(), FF::one(), FF::zero()}; }
|
||||
|
||||
static HOST_DEVICE_INLINE Affine<FF> to_affine(const Projective& point)
|
||||
static Affine<FF> to_affine(const Projective& point)
|
||||
{
|
||||
FF denom = FF::inverse(point.z);
|
||||
return {point.x * denom, point.y * denom};
|
||||
}
|
||||
|
||||
static HOST_DEVICE_INLINE Projective from_affine(const Affine<FF>& point)
|
||||
static Projective from_affine(const Affine<FF>& point)
|
||||
{
|
||||
return point == Affine<FF>::zero() ? zero() : Projective{point.x, point.y, FF::one()};
|
||||
}
|
||||
|
||||
static HOST_DEVICE_INLINE Projective to_montgomery(const Projective& point)
|
||||
static Projective to_montgomery(const Projective& point)
|
||||
{
|
||||
return {FF::to_montgomery(point.x), FF::to_montgomery(point.y), FF::to_montgomery(point.z)};
|
||||
}
|
||||
|
||||
static HOST_DEVICE_INLINE Projective from_montgomery(const Projective& point)
|
||||
static Projective from_montgomery(const Projective& point)
|
||||
{
|
||||
return {FF::from_montgomery(point.x), FF::from_montgomery(point.y), FF::from_montgomery(point.z)};
|
||||
}
|
||||
|
||||
static HOST_DEVICE_INLINE Projective generator() { return {GENERATOR_X, GENERATOR_Y, FF::one()}; }
|
||||
static Projective generator() { return {GENERATOR_X, GENERATOR_Y, FF::one()}; }
|
||||
|
||||
static HOST_DEVICE_INLINE Projective neg(const Projective& point) { return {point.x, FF::neg(point.y), point.z}; }
|
||||
static Projective neg(const Projective& point) { return {point.x, FF::neg(point.y), point.z}; }
|
||||
|
||||
static HOST_DEVICE_INLINE Projective dbl(const Projective& point)
|
||||
static Projective dbl(const Projective& point)
|
||||
{
|
||||
const FF X = point.x;
|
||||
const FF Y = point.y;
|
||||
@@ -74,7 +74,7 @@ public:
|
||||
return {X3, Y3, Z3};
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE Projective operator+(Projective p1, const Projective& p2)
|
||||
friend Projective operator+(Projective p1, const Projective& p2)
|
||||
{
|
||||
const FF X1 = p1.x; // < 2
|
||||
const FF Y1 = p1.y; // < 2
|
||||
@@ -118,9 +118,9 @@ public:
|
||||
return {X3, Y3, Z3};
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE Projective operator-(Projective p1, const Projective& p2) { return p1 + neg(p2); }
|
||||
friend Projective operator-(Projective p1, const Projective& p2) { return p1 + neg(p2); }
|
||||
|
||||
friend HOST_DEVICE_INLINE Projective operator+(Projective p1, const Affine<FF>& p2)
|
||||
friend Projective operator+(Projective p1, const Affine<FF>& p2)
|
||||
{
|
||||
const FF X1 = p1.x; // < 2
|
||||
const FF Y1 = p1.y; // < 2
|
||||
@@ -163,12 +163,12 @@ public:
|
||||
return {X3, Y3, Z3};
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE Projective operator-(Projective p1, const Affine<FF>& p2)
|
||||
friend Projective operator-(Projective p1, const Affine<FF>& p2)
|
||||
{
|
||||
return p1 + Affine<FF>::neg(p2);
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE Projective operator*(SCALAR_FF scalar, const Projective& point)
|
||||
friend Projective operator*(SCALAR_FF scalar, const Projective& point)
|
||||
{
|
||||
Projective res = zero();
|
||||
#ifdef __CUDA_ARCH__
|
||||
@@ -181,27 +181,27 @@ public:
|
||||
return res;
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE Projective operator*(const Projective& point, SCALAR_FF scalar) { return scalar * point; }
|
||||
friend Projective operator*(const Projective& point, SCALAR_FF scalar) { return scalar * point; }
|
||||
|
||||
friend HOST_DEVICE_INLINE bool operator==(const Projective& p1, const Projective& p2)
|
||||
friend bool operator==(const Projective& p1, const Projective& p2)
|
||||
{
|
||||
return (p1.x * p2.z == p2.x * p1.z) && (p1.y * p2.z == p2.y * p1.z);
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE bool operator!=(const Projective& p1, const Projective& p2) { return !(p1 == p2); }
|
||||
friend bool operator!=(const Projective& p1, const Projective& p2) { return !(p1 == p2); }
|
||||
|
||||
friend HOST_INLINE std::ostream& operator<<(std::ostream& os, const Projective& point)
|
||||
friend std::ostream& operator<<(std::ostream& os, const Projective& point)
|
||||
{
|
||||
os << "Point { x: " << point.x << "; y: " << point.y << "; z: " << point.z << " }";
|
||||
return os;
|
||||
}
|
||||
|
||||
static HOST_DEVICE_INLINE bool is_zero(const Projective& point)
|
||||
static bool is_zero(const Projective& point)
|
||||
{
|
||||
return point.x == FF::zero() && point.y != FF::zero() && point.z == FF::zero();
|
||||
}
|
||||
|
||||
static HOST_DEVICE_INLINE bool is_on_curve(const Projective& point)
|
||||
static bool is_on_curve(const Projective& point)
|
||||
{
|
||||
if (is_zero(point)) return true;
|
||||
bool eq_holds =
|
||||
@@ -210,7 +210,7 @@ public:
|
||||
return point.z != FF::zero() && eq_holds;
|
||||
}
|
||||
|
||||
static HOST_INLINE Projective rand_host()
|
||||
static Projective rand_host()
|
||||
{
|
||||
SCALAR_FF rand_scalar = SCALAR_FF::rand_host();
|
||||
return rand_scalar * generator();
|
||||
@@ -231,9 +231,9 @@ public:
|
||||
|
||||
template <typename FF, class SCALAR_FF, const FF& B_VALUE, const FF& GENERATOR_X, const FF& GENERATOR_Y>
|
||||
struct SharedMemory<Projective<FF, SCALAR_FF, B_VALUE, GENERATOR_X, GENERATOR_Y>> {
|
||||
__device__ Projective<FF, SCALAR_FF, B_VALUE, GENERATOR_X, GENERATOR_Y>* getPointer()
|
||||
Projective<FF, SCALAR_FF, B_VALUE, GENERATOR_X, GENERATOR_Y>* getPointer()
|
||||
{
|
||||
extern __shared__ Projective<FF, SCALAR_FF, B_VALUE, GENERATOR_X, GENERATOR_Y> s_projective_[];
|
||||
Projective<FF, SCALAR_FF, B_VALUE, GENERATOR_X, GENERATOR_Y> *s_projective_ = nullptr;
|
||||
return s_projective_;
|
||||
}
|
||||
};
|
||||
|
||||
@@ -18,9 +18,9 @@
|
||||
|
||||
#pragma once
|
||||
|
||||
#include "gpu-utils/error_handler.cuh"
|
||||
#include "gpu-utils/modifiers.cuh"
|
||||
#include "gpu-utils/sharedmem.cuh"
|
||||
#include "../gpu-utils/error_handler.cuh"
|
||||
#include "../gpu-utils/modifiers.cuh"
|
||||
#include "../gpu-utils/sharedmem.cuh"
|
||||
#include "host_math.cuh"
|
||||
#include "ptx.cuh"
|
||||
#include "storage.cuh"
|
||||
@@ -38,11 +38,11 @@ public:
|
||||
static constexpr unsigned TLC = CONFIG::limbs_count;
|
||||
static constexpr unsigned NBITS = CONFIG::modulus_bit_count;
|
||||
|
||||
static constexpr HOST_DEVICE_INLINE Field zero() { return Field{CONFIG::zero}; }
|
||||
static constexpr Field zero() { return Field{CONFIG::zero}; }
|
||||
|
||||
static constexpr HOST_DEVICE_INLINE Field one() { return Field{CONFIG::one}; }
|
||||
static constexpr Field one() { return Field{CONFIG::one}; }
|
||||
|
||||
static constexpr HOST_DEVICE_INLINE Field from(uint32_t value)
|
||||
static constexpr Field from(uint32_t value)
|
||||
{
|
||||
storage<TLC> scalar;
|
||||
scalar.limbs[0] = value;
|
||||
@@ -52,7 +52,7 @@ public:
|
||||
return Field{scalar};
|
||||
}
|
||||
|
||||
static HOST_INLINE Field omega(uint32_t logn)
|
||||
static Field omega(uint32_t logn)
|
||||
{
|
||||
if (logn == 0) { return Field{CONFIG::one}; }
|
||||
|
||||
@@ -62,7 +62,7 @@ public:
|
||||
return Field{omega.storages[logn - 1]};
|
||||
}
|
||||
|
||||
static HOST_INLINE Field omega_inv(uint32_t logn)
|
||||
static Field omega_inv(uint32_t logn)
|
||||
{
|
||||
if (logn == 0) { return Field{CONFIG::one}; }
|
||||
|
||||
@@ -74,7 +74,7 @@ public:
|
||||
return Field{omega_inv.storages[logn - 1]};
|
||||
}
|
||||
|
||||
static HOST_DEVICE_INLINE Field inv_log_size(uint32_t logn)
|
||||
static Field inv_log_size(uint32_t logn)
|
||||
{
|
||||
if (logn == 0) { return Field{CONFIG::one}; }
|
||||
#ifndef __CUDA_ARCH__
|
||||
@@ -91,7 +91,7 @@ public:
|
||||
return Field{inv.storages[logn - 1]};
|
||||
}
|
||||
|
||||
static constexpr HOST_INLINE unsigned get_omegas_count()
|
||||
static constexpr unsigned get_omegas_count()
|
||||
{
|
||||
if constexpr (has_member_omegas_count<CONFIG>()) {
|
||||
return CONFIG::omegas_count;
|
||||
@@ -113,45 +113,45 @@ public:
|
||||
/**
|
||||
* A new addition to the config file - \f$ 2^{32 \cdot num\_limbs} - p \f$.
|
||||
*/
|
||||
static constexpr HOST_DEVICE_INLINE ff_storage get_neg_modulus() { return CONFIG::neg_modulus; }
|
||||
static constexpr ff_storage get_neg_modulus() { return CONFIG::neg_modulus; }
|
||||
|
||||
/**
|
||||
* A new addition to the config file - the number of times to reduce in [reduce](@ref reduce) function.
|
||||
*/
|
||||
static constexpr HOST_DEVICE_INLINE unsigned num_of_reductions() { return CONFIG::num_of_reductions; }
|
||||
static constexpr unsigned num_of_reductions() { return CONFIG::num_of_reductions; }
|
||||
|
||||
static constexpr unsigned slack_bits = 32 * TLC - NBITS;
|
||||
|
||||
struct Wide {
|
||||
ff_wide_storage limbs_storage;
|
||||
|
||||
static constexpr Field HOST_DEVICE_INLINE get_lower(const Wide& xs)
|
||||
static constexpr Field get_lower(const Wide& xs)
|
||||
{
|
||||
Field out{};
|
||||
#ifdef __CUDA_ARCH__
|
||||
UNROLL
|
||||
|
||||
#endif
|
||||
for (unsigned i = 0; i < TLC; i++)
|
||||
out.limbs_storage.limbs[i] = xs.limbs_storage.limbs[i];
|
||||
return out;
|
||||
}
|
||||
|
||||
static constexpr Field HOST_DEVICE_INLINE get_higher(const Wide& xs)
|
||||
static constexpr Field get_higher(const Wide& xs)
|
||||
{
|
||||
Field out{};
|
||||
#ifdef __CUDA_ARCH__
|
||||
UNROLL
|
||||
|
||||
#endif
|
||||
for (unsigned i = 0; i < TLC; i++)
|
||||
out.limbs_storage.limbs[i] = xs.limbs_storage.limbs[i + TLC];
|
||||
return out;
|
||||
}
|
||||
|
||||
static constexpr Field HOST_DEVICE_INLINE get_higher_with_slack(const Wide& xs)
|
||||
static constexpr Field get_higher_with_slack(const Wide& xs)
|
||||
{
|
||||
Field out{};
|
||||
#ifdef __CUDA_ARCH__
|
||||
UNROLL
|
||||
|
||||
#endif
|
||||
for (unsigned i = 0; i < TLC; i++) {
|
||||
#ifdef __CUDA_ARCH__
|
||||
@@ -166,7 +166,7 @@ public:
|
||||
}
|
||||
|
||||
template <unsigned REDUCTION_SIZE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE Wide sub_modulus_squared(const Wide& xs)
|
||||
static constexpr Wide sub_modulus_squared(const Wide& xs)
|
||||
{
|
||||
if (REDUCTION_SIZE == 0) return xs;
|
||||
const ff_wide_storage modulus = get_modulus_squared<REDUCTION_SIZE>();
|
||||
@@ -175,7 +175,7 @@ public:
|
||||
}
|
||||
|
||||
template <unsigned MODULUS_MULTIPLE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE Wide neg(const Wide& xs)
|
||||
static constexpr Wide neg(const Wide& xs)
|
||||
{
|
||||
const ff_wide_storage modulus = get_modulus_squared<MODULUS_MULTIPLE>();
|
||||
Wide rs = {};
|
||||
@@ -183,14 +183,14 @@ public:
|
||||
return rs;
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE Wide operator+(Wide xs, const Wide& ys)
|
||||
friend Wide operator+(Wide xs, const Wide& ys)
|
||||
{
|
||||
Wide rs = {};
|
||||
add_limbs<false>(xs.limbs_storage, ys.limbs_storage, rs.limbs_storage);
|
||||
return sub_modulus_squared<1>(rs);
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE Wide operator-(Wide xs, const Wide& ys)
|
||||
friend Wide operator-(Wide xs, const Wide& ys)
|
||||
{
|
||||
Wide rs = {};
|
||||
uint32_t carry = sub_limbs<true>(xs.limbs_storage, ys.limbs_storage, rs.limbs_storage);
|
||||
@@ -203,7 +203,7 @@ public:
|
||||
|
||||
// return modulus multiplied by 1, 2 or 4
|
||||
template <unsigned MULTIPLIER = 1>
|
||||
static constexpr HOST_DEVICE_INLINE ff_storage get_modulus()
|
||||
static constexpr ff_storage get_modulus()
|
||||
{
|
||||
switch (MULTIPLIER) {
|
||||
case 1:
|
||||
@@ -218,17 +218,17 @@ public:
|
||||
}
|
||||
|
||||
template <unsigned MULTIPLIER = 1>
|
||||
static constexpr HOST_DEVICE_INLINE ff_wide_storage modulus_wide()
|
||||
static constexpr ff_wide_storage modulus_wide()
|
||||
{
|
||||
return CONFIG::modulus_wide;
|
||||
}
|
||||
|
||||
// return m
|
||||
static constexpr HOST_DEVICE_INLINE ff_storage get_m() { return CONFIG::m; }
|
||||
static constexpr ff_storage get_m() { return CONFIG::m; }
|
||||
|
||||
// return modulus^2, helpful for ab +/- cd
|
||||
template <unsigned MULTIPLIER = 1>
|
||||
static constexpr HOST_DEVICE_INLINE ff_wide_storage get_modulus_squared()
|
||||
static constexpr ff_wide_storage get_modulus_squared()
|
||||
{
|
||||
switch (MULTIPLIER) {
|
||||
case 1:
|
||||
@@ -243,7 +243,7 @@ public:
|
||||
}
|
||||
|
||||
template <bool SUBTRACT, bool CARRY_OUT>
|
||||
static constexpr DEVICE_INLINE uint32_t
|
||||
static constexpr uint32_t
|
||||
add_sub_u32_device(const uint32_t* x, const uint32_t* y, uint32_t* r, size_t n = (TLC >> 1))
|
||||
{
|
||||
r[0] = SUBTRACT ? ptx::sub_cc(x[0], y[0]) : ptx::add_cc(x[0], y[0]);
|
||||
@@ -258,7 +258,7 @@ public:
|
||||
|
||||
// add or subtract limbs
|
||||
template <bool SUBTRACT, bool CARRY_OUT>
|
||||
static constexpr DEVICE_INLINE uint32_t
|
||||
static constexpr uint32_t
|
||||
add_sub_limbs_device(const ff_storage& xs, const ff_storage& ys, ff_storage& rs)
|
||||
{
|
||||
const uint32_t* x = xs.limbs;
|
||||
@@ -268,7 +268,7 @@ public:
|
||||
}
|
||||
|
||||
template <bool SUBTRACT, bool CARRY_OUT>
|
||||
static constexpr DEVICE_INLINE uint32_t
|
||||
static constexpr uint32_t
|
||||
add_sub_limbs_device(const ff_wide_storage& xs, const ff_wide_storage& ys, ff_wide_storage& rs)
|
||||
{
|
||||
const uint32_t* x = xs.limbs;
|
||||
@@ -278,7 +278,7 @@ public:
|
||||
}
|
||||
|
||||
template <bool SUBTRACT, bool CARRY_OUT>
|
||||
static constexpr HOST_INLINE uint32_t add_sub_limbs_host(const ff_storage& xs, const ff_storage& ys, ff_storage& rs)
|
||||
static constexpr uint32_t add_sub_limbs_host(const ff_storage& xs, const ff_storage& ys, ff_storage& rs)
|
||||
{
|
||||
const uint32_t* x = xs.limbs;
|
||||
const uint32_t* y = ys.limbs;
|
||||
@@ -291,7 +291,7 @@ public:
|
||||
}
|
||||
|
||||
template <bool SUBTRACT, bool CARRY_OUT>
|
||||
static constexpr HOST_INLINE uint32_t
|
||||
static constexpr uint32_t
|
||||
add_sub_limbs_host(const ff_wide_storage& xs, const ff_wide_storage& ys, ff_wide_storage& rs)
|
||||
{
|
||||
const uint32_t* x = xs.limbs;
|
||||
@@ -305,7 +305,7 @@ public:
|
||||
}
|
||||
|
||||
template <bool CARRY_OUT, typename T>
|
||||
static constexpr HOST_DEVICE_INLINE uint32_t add_limbs(const T& xs, const T& ys, T& rs)
|
||||
static constexpr uint32_t add_limbs(const T& xs, const T& ys, T& rs)
|
||||
{
|
||||
#ifdef __CUDA_ARCH__
|
||||
return add_sub_limbs_device<false, CARRY_OUT>(xs, ys, rs);
|
||||
@@ -315,7 +315,7 @@ public:
|
||||
}
|
||||
|
||||
template <bool CARRY_OUT, typename T>
|
||||
static constexpr HOST_DEVICE_INLINE uint32_t sub_limbs(const T& xs, const T& ys, T& rs)
|
||||
static constexpr uint32_t sub_limbs(const T& xs, const T& ys, T& rs)
|
||||
{
|
||||
#ifdef __CUDA_ARCH__
|
||||
return add_sub_limbs_device<true, CARRY_OUT>(xs, ys, rs);
|
||||
@@ -324,18 +324,18 @@ public:
|
||||
#endif
|
||||
}
|
||||
|
||||
static DEVICE_INLINE void mul_n(uint32_t* acc, const uint32_t* a, uint32_t bi, size_t n = TLC)
|
||||
static void mul_n(uint32_t* acc, const uint32_t* a, uint32_t bi, size_t n = TLC)
|
||||
{
|
||||
UNROLL
|
||||
|
||||
for (size_t i = 0; i < n; i += 2) {
|
||||
acc[i] = ptx::mul_lo(a[i], bi);
|
||||
acc[i + 1] = ptx::mul_hi(a[i], bi);
|
||||
}
|
||||
}
|
||||
|
||||
static DEVICE_INLINE void mul_n_msb(uint32_t* acc, const uint32_t* a, uint32_t bi, size_t n = TLC, size_t start_i = 0)
|
||||
static void mul_n_msb(uint32_t* acc, const uint32_t* a, uint32_t bi, size_t n = TLC, size_t start_i = 0)
|
||||
{
|
||||
UNROLL
|
||||
|
||||
for (size_t i = start_i; i < n; i += 2) {
|
||||
acc[i] = ptx::mul_lo(a[i], bi);
|
||||
acc[i + 1] = ptx::mul_hi(a[i], bi);
|
||||
@@ -343,14 +343,14 @@ public:
|
||||
}
|
||||
|
||||
template <bool CARRY_IN = false>
|
||||
static DEVICE_INLINE void
|
||||
static void
|
||||
cmad_n(uint32_t* acc, const uint32_t* a, uint32_t bi, size_t n = TLC, uint32_t optional_carry = 0)
|
||||
{
|
||||
if (CARRY_IN) ptx::add_cc(UINT32_MAX, optional_carry);
|
||||
acc[0] = CARRY_IN ? ptx::madc_lo_cc(a[0], bi, acc[0]) : ptx::mad_lo_cc(a[0], bi, acc[0]);
|
||||
acc[1] = ptx::madc_hi_cc(a[0], bi, acc[1]);
|
||||
|
||||
UNROLL
|
||||
|
||||
for (size_t i = 2; i < n; i += 2) {
|
||||
acc[i] = ptx::madc_lo_cc(a[i], bi, acc[i]);
|
||||
acc[i + 1] = ptx::madc_hi_cc(a[i], bi, acc[i + 1]);
|
||||
@@ -358,7 +358,7 @@ public:
|
||||
}
|
||||
|
||||
template <bool EVEN_PHASE>
|
||||
static DEVICE_INLINE void cmad_n_msb(uint32_t* acc, const uint32_t* a, uint32_t bi, size_t n = TLC)
|
||||
static void cmad_n_msb(uint32_t* acc, const uint32_t* a, uint32_t bi, size_t n = TLC)
|
||||
{
|
||||
if (EVEN_PHASE) {
|
||||
acc[0] = ptx::mad_lo_cc(a[0], bi, acc[0]);
|
||||
@@ -367,14 +367,14 @@ public:
|
||||
acc[1] = ptx::mad_hi_cc(a[0], bi, acc[1]);
|
||||
}
|
||||
|
||||
UNROLL
|
||||
|
||||
for (size_t i = 2; i < n; i += 2) {
|
||||
acc[i] = ptx::madc_lo_cc(a[i], bi, acc[i]);
|
||||
acc[i + 1] = ptx::madc_hi_cc(a[i], bi, acc[i + 1]);
|
||||
}
|
||||
}
|
||||
|
||||
static DEVICE_INLINE void cmad_n_lsb(uint32_t* acc, const uint32_t* a, uint32_t bi, size_t n = TLC)
|
||||
static void cmad_n_lsb(uint32_t* acc, const uint32_t* a, uint32_t bi, size_t n = TLC)
|
||||
{
|
||||
if (n > 1)
|
||||
acc[0] = ptx::mad_lo_cc(a[0], bi, acc[0]);
|
||||
@@ -382,7 +382,7 @@ public:
|
||||
acc[0] = ptx::mad_lo(a[0], bi, acc[0]);
|
||||
|
||||
size_t i;
|
||||
UNROLL
|
||||
|
||||
for (i = 1; i < n - 1; i += 2) {
|
||||
acc[i] = ptx::madc_hi_cc(a[i - 1], bi, acc[i]);
|
||||
if (i == n - 2)
|
||||
@@ -394,7 +394,7 @@ public:
|
||||
}
|
||||
|
||||
template <bool CARRY_OUT = false, bool CARRY_IN = false>
|
||||
static DEVICE_INLINE uint32_t mad_row(
|
||||
static uint32_t mad_row(
|
||||
uint32_t* odd,
|
||||
uint32_t* even,
|
||||
const uint32_t* a,
|
||||
@@ -419,7 +419,7 @@ public:
|
||||
}
|
||||
|
||||
template <bool EVEN_PHASE>
|
||||
static DEVICE_INLINE void mad_row_msb(uint32_t* odd, uint32_t* even, const uint32_t* a, uint32_t bi, size_t n = TLC)
|
||||
static void mad_row_msb(uint32_t* odd, uint32_t* even, const uint32_t* a, uint32_t bi, size_t n = TLC)
|
||||
{
|
||||
cmad_n_msb<!EVEN_PHASE>(odd, EVEN_PHASE ? a : (a + 1), bi, n - 2);
|
||||
odd[EVEN_PHASE ? (n - 1) : (n - 2)] = ptx::madc_lo_cc(a[n - 1], bi, 0);
|
||||
@@ -428,7 +428,7 @@ public:
|
||||
odd[EVEN_PHASE ? n : (n - 1)] = ptx::addc(odd[EVEN_PHASE ? n : (n - 1)], 0);
|
||||
}
|
||||
|
||||
static DEVICE_INLINE void mad_row_lsb(uint32_t* odd, uint32_t* even, const uint32_t* a, uint32_t bi, size_t n = TLC)
|
||||
static void mad_row_lsb(uint32_t* odd, uint32_t* even, const uint32_t* a, uint32_t bi, size_t n = TLC)
|
||||
{
|
||||
// bi here is constant so we can do a compile-time check for zero (which does happen once for bls12-381 scalar field
|
||||
// modulus)
|
||||
@@ -439,12 +439,12 @@ public:
|
||||
return;
|
||||
}
|
||||
|
||||
static DEVICE_INLINE uint32_t
|
||||
static uint32_t
|
||||
mul_n_and_add(uint32_t* acc, const uint32_t* a, uint32_t bi, uint32_t* extra, size_t n = (TLC >> 1))
|
||||
{
|
||||
acc[0] = ptx::mad_lo_cc(a[0], bi, extra[0]);
|
||||
|
||||
UNROLL
|
||||
|
||||
for (size_t i = 1; i < n - 1; i += 2) {
|
||||
acc[i] = ptx::madc_hi_cc(a[i - 1], bi, extra[i]);
|
||||
acc[i + 1] = ptx::madc_lo_cc(a[i + 1], bi, extra[i + 1]);
|
||||
@@ -467,19 +467,19 @@ public:
|
||||
* \cdot b_0}{2^{32}}} + \dots + \floor{\frac{a_0 \cdot b_{TLC - 2}}{2^{32}}}) \leq 2^{64} + 2\cdot 2^{96} + \dots +
|
||||
* (TLC - 2) \cdot 2^{32(TLC - 1)} + (TLC - 1) \cdot 2^{32(TLC - 1)} \leq 2(TLC - 1) \cdot 2^{32(TLC - 1)}\f$.
|
||||
*/
|
||||
static DEVICE_INLINE void multiply_msb_raw_device(const ff_storage& as, const ff_storage& bs, ff_wide_storage& rs)
|
||||
static void multiply_msb_raw_device(const ff_storage& as, const ff_storage& bs, ff_wide_storage& rs)
|
||||
{
|
||||
if constexpr (TLC > 1) {
|
||||
const uint32_t* a = as.limbs;
|
||||
const uint32_t* b = bs.limbs;
|
||||
uint32_t* even = rs.limbs;
|
||||
__align__(16) uint32_t odd[2 * TLC - 2];
|
||||
uint32_t odd[2 * TLC - 2];
|
||||
|
||||
even[TLC - 1] = ptx::mul_hi(a[TLC - 2], b[0]);
|
||||
odd[TLC - 2] = ptx::mul_lo(a[TLC - 1], b[0]);
|
||||
odd[TLC - 1] = ptx::mul_hi(a[TLC - 1], b[0]);
|
||||
size_t i;
|
||||
UNROLL
|
||||
|
||||
for (i = 2; i < TLC - 1; i += 2) {
|
||||
mad_row_msb<true>(&even[TLC - 2], &odd[TLC - 2], &a[TLC - i - 1], b[i - 1], i + 1);
|
||||
mad_row_msb<false>(&odd[TLC - 2], &even[TLC - 2], &a[TLC - i - 2], b[i], i + 2);
|
||||
@@ -504,7 +504,7 @@ public:
|
||||
* is excluded if \f$ i + j > TLC - 1 \f$ and only the lower half is included if \f$ i + j = TLC - 1 \f$. All other
|
||||
* limb products are included.
|
||||
*/
|
||||
static DEVICE_INLINE void
|
||||
static void
|
||||
multiply_and_add_lsb_neg_modulus_raw_device(const ff_storage& as, ff_storage& cs, ff_storage& rs)
|
||||
{
|
||||
ff_storage bs = get_neg_modulus();
|
||||
@@ -514,7 +514,7 @@ public:
|
||||
uint32_t* even = rs.limbs;
|
||||
|
||||
if constexpr (TLC > 2) {
|
||||
__align__(16) uint32_t odd[TLC - 1];
|
||||
uint32_t odd[TLC - 1];
|
||||
size_t i;
|
||||
// `b[0]` is \f$ 2^{32} \f$ minus the last limb of prime modulus. Because most scalar (and some base) primes
|
||||
// are necessarily NTT-friendly, `b[0]` often turns out to be \f$ 2^{32} - 1 \f$. This actually leads to
|
||||
@@ -528,7 +528,6 @@ public:
|
||||
mul_n(odd, a + 1, b[0], TLC - 1);
|
||||
}
|
||||
mad_row_lsb(&even[2], &odd[0], a, b[1], TLC - 1);
|
||||
UNROLL
|
||||
for (i = 2; i < TLC - 1; i += 2) {
|
||||
mad_row_lsb(&odd[i], &even[i], a, b[i], TLC - i);
|
||||
mad_row_lsb(&even[i + 2], &odd[i], a, b[i + 1], TLC - i - 1);
|
||||
@@ -558,15 +557,15 @@ public:
|
||||
* that the top bit of \f$ a_{hi} \f$ and \f$ b_{hi} \f$ are unset. This ensures correctness by allowing to keep the
|
||||
* result inside TLC limbs and ignore the carries from the highest limb.
|
||||
*/
|
||||
static DEVICE_INLINE void
|
||||
static void
|
||||
multiply_and_add_short_raw_device(const uint32_t* a, const uint32_t* b, uint32_t* even, uint32_t* in1, uint32_t* in2)
|
||||
{
|
||||
__align__(16) uint32_t odd[TLC - 2];
|
||||
uint32_t odd[TLC - 2];
|
||||
uint32_t first_row_carry = mul_n_and_add(even, a, b[0], in1);
|
||||
uint32_t carry = mul_n_and_add(odd, a + 1, b[0], &in2[1]);
|
||||
|
||||
size_t i;
|
||||
UNROLL
|
||||
|
||||
for (i = 2; i < ((TLC >> 1) - 1); i += 2) {
|
||||
carry = mad_row<true, false>(
|
||||
&even[i], &odd[i - 2], a, b[i - 1], TLC >> 1, in1[(TLC >> 1) + i - 2], in1[(TLC >> 1) + i - 1], carry);
|
||||
@@ -587,15 +586,15 @@ public:
|
||||
* This method multiplies `a` and `b` and writes the result into `even`. It assumes that `a` and `b` are TLC/2 limbs
|
||||
* long. The usual schoolbook algorithm is used.
|
||||
*/
|
||||
static DEVICE_INLINE void multiply_short_raw_device(const uint32_t* a, const uint32_t* b, uint32_t* even)
|
||||
static void multiply_short_raw_device(const uint32_t* a, const uint32_t* b, uint32_t* even)
|
||||
{
|
||||
__align__(16) uint32_t odd[TLC - 2];
|
||||
uint32_t odd[TLC - 2];
|
||||
mul_n(even, a, b[0], TLC >> 1);
|
||||
mul_n(odd, a + 1, b[0], TLC >> 1);
|
||||
mad_row(&even[2], &odd[0], a, b[1], TLC >> 1);
|
||||
|
||||
size_t i;
|
||||
UNROLL
|
||||
|
||||
for (i = 2; i < ((TLC >> 1) - 1); i += 2) {
|
||||
mad_row(&odd[i], &even[i], a, b[i], TLC >> 1);
|
||||
mad_row(&even[i + 2], &odd[i], a, b[i + 1], TLC >> 1);
|
||||
@@ -614,7 +613,7 @@ public:
|
||||
* with so far. This method implements [subtractive
|
||||
* Karatsuba](https://en.wikipedia.org/wiki/Karatsuba_algorithm#Implementation).
|
||||
*/
|
||||
static DEVICE_INLINE void multiply_raw_device(const ff_storage& as, const ff_storage& bs, ff_wide_storage& rs)
|
||||
static void multiply_raw_device(const ff_storage& as, const ff_storage& bs, ff_wide_storage& rs)
|
||||
{
|
||||
const uint32_t* a = as.limbs;
|
||||
const uint32_t* b = bs.limbs;
|
||||
@@ -624,8 +623,8 @@ public:
|
||||
// write the results into `r`.
|
||||
multiply_short_raw_device(a, b, r);
|
||||
multiply_short_raw_device(&a[TLC >> 1], &b[TLC >> 1], &r[TLC]);
|
||||
__align__(16) uint32_t middle_part[TLC];
|
||||
__align__(16) uint32_t diffs[TLC];
|
||||
uint32_t middle_part[TLC];
|
||||
uint32_t diffs[TLC];
|
||||
// Differences of halves \f$ a_{hi} - a_{lo}; b_{lo} - b_{hi} \$f are written into `diffs`, signs written to
|
||||
// `carry1` and `carry2`.
|
||||
uint32_t carry1 = add_sub_u32_device<true, true>(&a[TLC >> 1], a, diffs);
|
||||
@@ -644,7 +643,7 @@ public:
|
||||
for (size_t i = TLC + (TLC >> 1); i < 2 * TLC; i++)
|
||||
r[i] = ptx::addc_cc(r[i], 0);
|
||||
} else if (TLC == 2) {
|
||||
__align__(8) uint32_t odd[2];
|
||||
uint32_t odd[2];
|
||||
r[0] = ptx::mul_lo(a[0], b[0]);
|
||||
r[1] = ptx::mul_hi(a[0], b[0]);
|
||||
r[2] = ptx::mul_lo(a[1], b[1]);
|
||||
@@ -662,7 +661,7 @@ public:
|
||||
}
|
||||
}
|
||||
|
||||
static HOST_INLINE void multiply_raw_host(const ff_storage& as, const ff_storage& bs, ff_wide_storage& rs)
|
||||
static void multiply_raw_host(const ff_storage& as, const ff_storage& bs, ff_wide_storage& rs)
|
||||
{
|
||||
const uint32_t* a = as.limbs;
|
||||
const uint32_t* b = bs.limbs;
|
||||
@@ -675,7 +674,7 @@ public:
|
||||
}
|
||||
}
|
||||
|
||||
static HOST_DEVICE_INLINE void multiply_raw(const ff_storage& as, const ff_storage& bs, ff_wide_storage& rs)
|
||||
static void multiply_raw(const ff_storage& as, const ff_storage& bs, ff_wide_storage& rs)
|
||||
{
|
||||
#ifdef __CUDA_ARCH__
|
||||
return multiply_raw_device(as, bs, rs);
|
||||
@@ -684,7 +683,7 @@ public:
|
||||
#endif
|
||||
}
|
||||
|
||||
static HOST_DEVICE_INLINE void
|
||||
static void
|
||||
multiply_and_add_lsb_neg_modulus_raw(const ff_storage& as, ff_storage& cs, ff_storage& rs)
|
||||
{
|
||||
#ifdef __CUDA_ARCH__
|
||||
@@ -697,7 +696,7 @@ public:
|
||||
#endif
|
||||
}
|
||||
|
||||
static HOST_DEVICE_INLINE void multiply_msb_raw(const ff_storage& as, const ff_storage& bs, ff_wide_storage& rs)
|
||||
static void multiply_msb_raw(const ff_storage& as, const ff_storage& bs, ff_wide_storage& rs)
|
||||
{
|
||||
#ifdef __CUDA_ARCH__
|
||||
return multiply_msb_raw_device(as, bs, rs);
|
||||
@@ -709,9 +708,9 @@ public:
|
||||
public:
|
||||
ff_storage limbs_storage;
|
||||
|
||||
HOST_DEVICE_INLINE uint32_t* export_limbs() { return (uint32_t*)limbs_storage.limbs; }
|
||||
uint32_t* export_limbs() { return (uint32_t*)limbs_storage.limbs; }
|
||||
|
||||
HOST_DEVICE_INLINE unsigned get_scalar_digit(unsigned digit_num, unsigned digit_width) const
|
||||
unsigned get_scalar_digit(unsigned digit_num, unsigned digit_width) const
|
||||
{
|
||||
const uint32_t limb_lsb_idx = (digit_num * digit_width) / 32;
|
||||
const uint32_t shift_bits = (digit_num * digit_width) % 32;
|
||||
@@ -723,7 +722,7 @@ public:
|
||||
return rv;
|
||||
}
|
||||
|
||||
static HOST_INLINE Field rand_host()
|
||||
static Field rand_host()
|
||||
{
|
||||
std::random_device rd;
|
||||
std::mt19937_64 generator(rd());
|
||||
@@ -743,7 +742,7 @@ public:
|
||||
}
|
||||
|
||||
template <unsigned REDUCTION_SIZE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE Field sub_modulus(const Field& xs)
|
||||
static constexpr Field sub_modulus(const Field& xs)
|
||||
{
|
||||
if (REDUCTION_SIZE == 0) return xs;
|
||||
const ff_storage modulus = get_modulus<REDUCTION_SIZE>();
|
||||
@@ -764,14 +763,14 @@ public:
|
||||
return os;
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE Field operator+(Field xs, const Field& ys)
|
||||
friend Field operator+(Field xs, const Field& ys)
|
||||
{
|
||||
Field rs = {};
|
||||
add_limbs<false>(xs.limbs_storage, ys.limbs_storage, rs.limbs_storage);
|
||||
return sub_modulus<1>(rs);
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE Field operator-(Field xs, const Field& ys)
|
||||
friend Field operator-(Field xs, const Field& ys)
|
||||
{
|
||||
Field rs = {};
|
||||
uint32_t carry = sub_limbs<true>(xs.limbs_storage, ys.limbs_storage, rs.limbs_storage);
|
||||
@@ -782,7 +781,7 @@ public:
|
||||
}
|
||||
|
||||
template <unsigned MODULUS_MULTIPLE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE Wide mul_wide(const Field& xs, const Field& ys)
|
||||
static constexpr Wide mul_wide(const Field& xs, const Field& ys)
|
||||
{
|
||||
Wide rs = {};
|
||||
multiply_raw(xs.limbs_storage, ys.limbs_storage, rs.limbs_storage);
|
||||
@@ -811,7 +810,7 @@ public:
|
||||
* will cause only 1 reduction to be performed.
|
||||
*/
|
||||
template <unsigned MODULUS_MULTIPLE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE Field reduce(const Wide& xs)
|
||||
static constexpr Field reduce(const Wide& xs)
|
||||
{
|
||||
// `xs` is left-shifted by `2 * slack_bits` and higher half is written to `xs_hi`
|
||||
Field xs_hi = Wide::get_higher_with_slack(xs);
|
||||
@@ -836,19 +835,19 @@ public:
|
||||
return r;
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE Field operator*(const Field& xs, const Field& ys)
|
||||
friend Field operator*(const Field& xs, const Field& ys)
|
||||
{
|
||||
Wide xy = mul_wide(xs, ys); // full mult
|
||||
return reduce(xy); // reduce mod p
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE bool operator==(const Field& xs, const Field& ys)
|
||||
friend bool operator==(const Field& xs, const Field& ys)
|
||||
{
|
||||
#ifdef __CUDA_ARCH__
|
||||
const uint32_t* x = xs.limbs_storage.limbs;
|
||||
const uint32_t* y = ys.limbs_storage.limbs;
|
||||
uint32_t limbs_or = x[0] ^ y[0];
|
||||
UNROLL
|
||||
|
||||
for (unsigned i = 1; i < TLC; i++)
|
||||
limbs_or |= x[i] ^ y[i];
|
||||
return limbs_or == 0;
|
||||
@@ -859,15 +858,15 @@ public:
|
||||
#endif
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE bool operator!=(const Field& xs, const Field& ys) { return !(xs == ys); }
|
||||
friend bool operator!=(const Field& xs, const Field& ys) { return !(xs == ys); }
|
||||
|
||||
template <const Field& multiplier>
|
||||
static HOST_DEVICE_INLINE Field mul_const(const Field& xs)
|
||||
static Field mul_const(const Field& xs)
|
||||
{
|
||||
Field mul = multiplier;
|
||||
static bool is_u32 = true;
|
||||
#ifdef __CUDA_ARCH__
|
||||
UNROLL
|
||||
|
||||
#endif
|
||||
for (unsigned i = 1; i < TLC; i++)
|
||||
is_u32 &= (mul.limbs_storage.limbs[i] == 0);
|
||||
@@ -877,13 +876,13 @@ public:
|
||||
}
|
||||
|
||||
template <uint32_t multiplier, class T, unsigned REDUCTION_SIZE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE T mul_unsigned(const T& xs)
|
||||
static constexpr T mul_unsigned(const T& xs)
|
||||
{
|
||||
T rs = {};
|
||||
T temp = xs;
|
||||
bool is_zero = true;
|
||||
#ifdef __CUDA_ARCH__
|
||||
UNROLL
|
||||
|
||||
#endif
|
||||
for (unsigned i = 0; i < 32; i++) {
|
||||
if (multiplier & (1 << i)) {
|
||||
@@ -897,28 +896,28 @@ public:
|
||||
}
|
||||
|
||||
template <unsigned MODULUS_MULTIPLE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE Wide sqr_wide(const Field& xs)
|
||||
static constexpr Wide sqr_wide(const Field& xs)
|
||||
{
|
||||
// TODO: change to a more efficient squaring
|
||||
return mul_wide<MODULUS_MULTIPLE>(xs, xs);
|
||||
}
|
||||
|
||||
template <unsigned MODULUS_MULTIPLE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE Field sqr(const Field& xs)
|
||||
static constexpr Field sqr(const Field& xs)
|
||||
{
|
||||
// TODO: change to a more efficient squaring
|
||||
return xs * xs;
|
||||
}
|
||||
|
||||
static constexpr HOST_DEVICE_INLINE Field to_montgomery(const Field& xs) { return xs * Field{CONFIG::montgomery_r}; }
|
||||
static constexpr Field to_montgomery(const Field& xs) { return xs * Field{CONFIG::montgomery_r}; }
|
||||
|
||||
static constexpr HOST_DEVICE_INLINE Field from_montgomery(const Field& xs)
|
||||
static constexpr Field from_montgomery(const Field& xs)
|
||||
{
|
||||
return xs * Field{CONFIG::montgomery_r_inv};
|
||||
}
|
||||
|
||||
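A brief round-trip note for the Montgomery helpers above (illustrative, reusing the hypothetical scalar_t alias): to_montgomery is a plain modular multiplication by R = CONFIG::montgomery_r and from_montgomery multiplies by R^-1 = CONFIG::montgomery_r_inv, so the two compose to the identity:

scalar_t x = scalar_t::rand_host();
scalar_t xm = scalar_t::to_montgomery(x);     // x * R mod p
assert(scalar_t::from_montgomery(xm) == x);   // multiplying by R^-1 undoes it (needs <cassert>)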
template <unsigned MODULUS_MULTIPLE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE Field neg(const Field& xs)
|
||||
static constexpr Field neg(const Field& xs)
|
||||
{
|
||||
const ff_storage modulus = get_modulus<MODULUS_MULTIPLE>();
|
||||
Field rs = {};
|
||||
@@ -928,14 +927,14 @@ public:
|
||||
|
||||
// Assumes the number is even!
|
||||
template <unsigned MODULUS_MULTIPLE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE Field div2(const Field& xs)
|
||||
static constexpr Field div2(const Field& xs)
|
||||
{
|
||||
const uint32_t* x = xs.limbs_storage.limbs;
|
||||
Field rs = {};
|
||||
uint32_t* r = rs.limbs_storage.limbs;
|
||||
if constexpr (TLC > 1) {
|
||||
#ifdef __CUDA_ARCH__
|
||||
UNROLL
|
||||
|
||||
#endif
|
||||
for (unsigned i = 0; i < TLC - 1; i++) {
|
||||
#ifdef __CUDA_ARCH__
|
||||
@@ -949,18 +948,18 @@ public:
|
||||
return sub_modulus<MODULUS_MULTIPLE>(rs);
|
||||
}
|
||||
|
||||
static constexpr HOST_DEVICE_INLINE bool lt(const Field& xs, const Field& ys)
|
||||
static constexpr bool lt(const Field& xs, const Field& ys)
|
||||
{
|
||||
ff_storage dummy = {};
|
||||
uint32_t carry = sub_limbs<true>(xs.limbs_storage, ys.limbs_storage, dummy);
|
||||
return carry;
|
||||
}
|
||||
|
||||
static constexpr HOST_DEVICE_INLINE bool is_odd(const Field& xs) { return xs.limbs_storage.limbs[0] & 1; }
|
||||
static constexpr bool is_odd(const Field& xs) { return xs.limbs_storage.limbs[0] & 1; }
|
||||
|
||||
static constexpr HOST_DEVICE_INLINE bool is_even(const Field& xs) { return ~xs.limbs_storage.limbs[0] & 1; }
|
||||
static constexpr bool is_even(const Field& xs) { return ~xs.limbs_storage.limbs[0] & 1; }
|
||||
|
||||
static constexpr HOST_DEVICE_INLINE Field inverse(const Field& xs)
|
||||
static constexpr Field inverse(const Field& xs)
|
||||
{
|
||||
if (xs == zero()) return zero();
|
||||
constexpr Field one = Field{CONFIG::one};
|
||||
@@ -1007,9 +1006,9 @@ struct std::hash<Field<CONFIG>> {
|
||||
|
||||
template <class CONFIG>
|
||||
struct SharedMemory<Field<CONFIG>> {
|
||||
__device__ Field<CONFIG>* getPointer()
|
||||
Field<CONFIG>* getPointer()
|
||||
{
|
||||
extern __shared__ Field<CONFIG> s_scalar_[];
|
||||
Field<CONFIG> *s_scalar_;
|
||||
return s_scalar_;
|
||||
}
|
||||
};
|
||||
@@ -2,8 +2,8 @@
#ifndef FIELD_CONFIG_H
#define FIELD_CONFIG_H

#include "fields/id.h"
#include "fields/field.cuh"
#include "id.h"
#include "field.cuh"

/**
* @namespace field_config
@@ -11,25 +11,28 @@
* with the `-DFIELD` env variable passed during build.
*/
#if FIELD_ID == BN254
#include "fields/snark_fields/bn254_scalar.cuh"
#include "snark_fields/bn254_scalar.cuh"
namespace field_config = bn254;
#elif FIELD_ID == BLS12_381
#include "fields/snark_fields/bls12_381_scalar.cuh"
#include "snark_fields/bls12_381_scalar.cuh"
using bls12_381::fp_config;
namespace field_config = bls12_381;
#elif FIELD_ID == BLS12_377
#include "fields/snark_fields/bls12_377_scalar.cuh"
#include "snark_fields/bls12_377_scalar.cuh"
namespace field_config = bls12_377;
#elif FIELD_ID == BW6_761
#include "fields/snark_fields/bw6_761_scalar.cuh"
#include "snark_fields/bw6_761_scalar.cuh"
namespace field_config = bw6_761;
#elif FIELD_ID == GRUMPKIN
#include "fields/snark_fields/grumpkin_scalar.cuh"
#include "snark_fields/grumpkin_scalar.cuh"
namespace field_config = grumpkin;

#elif FIELD_ID == BABY_BEAR
#include "fields/stark_fields/babybear.cuh"
#include "stark_fields/babybear.cuh"
namespace field_config = babybear;
#elif FIELD_ID == STARK_252
#include "stark_fields/stark252.cuh"
namespace field_config = stark252;
#endif

#endif
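For orientation, a minimal usage sketch of the selection above (the alias names are hypothetical; the build is expected to map its -DFIELD choice to one of the FIELD_ID values from fields/id.h, e.g. STARK_252 = 1002):

// hypothetical translation unit compiled with FIELD_ID == STARK_252
#include "field_config.cuh"
using scalar_t = Field<field_config::fp_config>;  // resolves to stark252::fp_config through the #elif chain above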
@@ -3,98 +3,97 @@
#define HOST_MATH_H

#include <cstdint>
#include <cuda_runtime.h>
#include "gpu-utils/modifiers.cuh"
#include "../gpu-utils/modifiers.cuh"
namespace host_math {

// return x + y with uint32_t operands
static __host__ uint32_t add(const uint32_t x, const uint32_t y) { return x + y; }
// return x + y with uint32_t operands
static uint32_t add(const uint32_t x, const uint32_t y) { return x + y; }

// return x + y + carry with uint32_t operands
static __host__ uint32_t addc(const uint32_t x, const uint32_t y, const uint32_t carry) { return x + y + carry; }
// return x + y + carry with uint32_t operands
static uint32_t addc(const uint32_t x, const uint32_t y, const uint32_t carry) { return x + y + carry; }

// return x + y and carry out with uint32_t operands
static __host__ uint32_t add_cc(const uint32_t x, const uint32_t y, uint32_t& carry)
// return x + y and carry out with uint32_t operands
static uint32_t add_cc(const uint32_t x, const uint32_t y, uint32_t& carry)
{
uint32_t result;
result = x + y;
carry = x > result;
return result;
}

// return x + y + carry and carry out with uint32_t operands
static uint32_t addc_cc(const uint32_t x, const uint32_t y, uint32_t& carry)
{
const uint32_t result = x + y + carry;
carry = carry && x >= result || !carry && x > result;
return result;
}

// return x - y with uint32_t operands
static uint32_t sub(const uint32_t x, const uint32_t y) { return x - y; }

// return x - y - borrow with uint32_t operands
static uint32_t subc(const uint32_t x, const uint32_t y, const uint32_t borrow) { return x - y - borrow; }

// return x - y and borrow out with uint32_t operands
static uint32_t sub_cc(const uint32_t x, const uint32_t y, uint32_t& borrow)
{
uint32_t result;
result = x - y;
borrow = x < result;
return result;
}

// return x - y - borrow and borrow out with uint32_t operands
static uint32_t subc_cc(const uint32_t x, const uint32_t y, uint32_t& borrow)
{
const uint32_t result = x - y - borrow;
borrow = borrow && x <= result || !borrow && x < result;
return result;
}

// return x * y + z + carry and carry out with uint32_t operands
static uint32_t madc_cc(const uint32_t x, const uint32_t y, const uint32_t z, uint32_t& carry)
{
uint32_t result;
uint64_t r = static_cast<uint64_t>(x) * y + z + carry;
carry = (uint32_t)(r >> 32);
result = r & 0xffffffff;
return result;
}

template <unsigned OPS_COUNT = UINT32_MAX, bool CARRY_IN = false, bool CARRY_OUT = false>
struct carry_chain {
unsigned index;

constexpr carry_chain() : index(0) {}

uint32_t add(const uint32_t x, const uint32_t y, uint32_t& carry)
{
uint32_t result;
result = x + y;
carry = x > result;
return result;
index++;
if (index == 1 && OPS_COUNT == 1 && !CARRY_IN && !CARRY_OUT)
return host_math::add(x, y);
else if (index == 1 && !CARRY_IN)
return host_math::add_cc(x, y, carry);
else if (index < OPS_COUNT || CARRY_OUT)
return host_math::addc_cc(x, y, carry);
else
return host_math::addc(x, y, carry);
}

// return x + y + carry and carry out with uint32_t operands
static __host__ uint32_t addc_cc(const uint32_t x, const uint32_t y, uint32_t& carry)
uint32_t sub(const uint32_t x, const uint32_t y, uint32_t& carry)
{
const uint32_t result = x + y + carry;
carry = carry && x >= result || !carry && x > result;
return result;
index++;
if (index == 1 && OPS_COUNT == 1 && !CARRY_IN && !CARRY_OUT)
return host_math::sub(x, y);
else if (index == 1 && !CARRY_IN)
return host_math::sub_cc(x, y, carry);
else if (index < OPS_COUNT || CARRY_OUT)
return host_math::subc_cc(x, y, carry);
else
return host_math::subc(x, y, carry);
}

// return x - y with uint32_t operands
static __host__ uint32_t sub(const uint32_t x, const uint32_t y) { return x - y; }

// return x - y - borrow with uint32_t operands
static __host__ uint32_t subc(const uint32_t x, const uint32_t y, const uint32_t borrow) { return x - y - borrow; }

// return x - y and borrow out with uint32_t operands
static __host__ uint32_t sub_cc(const uint32_t x, const uint32_t y, uint32_t& borrow)
{
uint32_t result;
result = x - y;
borrow = x < result;
return result;
}

// return x - y - borrow and borrow out with uint32_t operands
static __host__ uint32_t subc_cc(const uint32_t x, const uint32_t y, uint32_t& borrow)
{
const uint32_t result = x - y - borrow;
borrow = borrow && x <= result || !borrow && x < result;
return result;
}

// return x * y + z + carry and carry out with uint32_t operands
static __host__ uint32_t madc_cc(const uint32_t x, const uint32_t y, const uint32_t z, uint32_t& carry)
{
uint32_t result;
uint64_t r = static_cast<uint64_t>(x) * y + z + carry;
carry = (uint32_t)(r >> 32);
result = r & 0xffffffff;
return result;
}

template <unsigned OPS_COUNT = UINT32_MAX, bool CARRY_IN = false, bool CARRY_OUT = false>
struct carry_chain {
unsigned index;

constexpr HOST_INLINE carry_chain() : index(0) {}

HOST_INLINE uint32_t add(const uint32_t x, const uint32_t y, uint32_t& carry)
{
index++;
if (index == 1 && OPS_COUNT == 1 && !CARRY_IN && !CARRY_OUT)
return host_math::add(x, y);
else if (index == 1 && !CARRY_IN)
return host_math::add_cc(x, y, carry);
else if (index < OPS_COUNT || CARRY_OUT)
return host_math::addc_cc(x, y, carry);
else
return host_math::addc(x, y, carry);
}

HOST_INLINE uint32_t sub(const uint32_t x, const uint32_t y, uint32_t& carry)
{
index++;
if (index == 1 && OPS_COUNT == 1 && !CARRY_IN && !CARRY_OUT)
return host_math::sub(x, y);
else if (index == 1 && !CARRY_IN)
return host_math::sub_cc(x, y, carry);
else if (index < OPS_COUNT || CARRY_OUT)
return host_math::subc_cc(x, y, carry);
else
return host_math::subc(x, y, carry);
}
};
};
} // namespace host_math

#endif
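A companion sketch for the primitives above (not part of the diff): the carry_chain helper lets a caller write a fixed-length chain of limb additions or subtractions and have the first and last steps pick the cheaper non-carrying variants automatically. Written directly against the free functions, the same idea for an arbitrary 8-limb (256-bit) addition looks like this (hypothetical helper name, assumes host_math.cuh is included):

static uint32_t add_limbs_8(const uint32_t x[8], const uint32_t y[8], uint32_t r[8])
{
  uint32_t carry = 0;
  r[0] = host_math::add_cc(x[0], y[0], carry);     // first limb: no carry-in, produce a carry-out
  for (int i = 1; i < 8; i++)
    r[i] = host_math::addc_cc(x[i], y[i], carry);  // remaining limbs: consume and produce the carry
  return carry;                                    // final carry out of the 256-bit addition
}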
@@ -9,5 +9,6 @@
#define GRUMPKIN 5

#define BABY_BEAR 1001
#define STARK_252 1002

#endif
@@ -1,139 +1,119 @@
|
||||
#pragma once
|
||||
#include <cstdint>
|
||||
#include <cuda_runtime.h>
|
||||
|
||||
namespace ptx {
|
||||
|
||||
__device__ __forceinline__ uint32_t add(const uint32_t x, const uint32_t y)
|
||||
uint32_t add(const uint32_t x, const uint32_t y)
|
||||
{
|
||||
uint32_t result;
|
||||
asm("add.u32 %0, %1, %2;" : "=r"(result) : "r"(x), "r"(y));
|
||||
uint32_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint32_t add_cc(const uint32_t x, const uint32_t y)
|
||||
uint32_t add_cc(const uint32_t x, const uint32_t y)
|
||||
{
|
||||
uint32_t result;
|
||||
asm volatile("add.cc.u32 %0, %1, %2;" : "=r"(result) : "r"(x), "r"(y));
|
||||
uint32_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint32_t addc(const uint32_t x, const uint32_t y)
|
||||
uint32_t addc(const uint32_t x, const uint32_t y)
|
||||
{
|
||||
uint32_t result;
|
||||
asm volatile("addc.u32 %0, %1, %2;" : "=r"(result) : "r"(x), "r"(y));
|
||||
uint32_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint32_t addc_cc(const uint32_t x, const uint32_t y)
|
||||
uint32_t addc_cc(const uint32_t x, const uint32_t y)
|
||||
{
|
||||
uint32_t result;
|
||||
asm volatile("addc.cc.u32 %0, %1, %2;" : "=r"(result) : "r"(x), "r"(y));
|
||||
uint32_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint32_t sub(const uint32_t x, const uint32_t y)
|
||||
uint32_t sub(const uint32_t x, const uint32_t y)
|
||||
{
|
||||
uint32_t result;
|
||||
asm("sub.u32 %0, %1, %2;" : "=r"(result) : "r"(x), "r"(y));
|
||||
uint32_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint32_t sub_cc(const uint32_t x, const uint32_t y)
|
||||
uint32_t sub_cc(const uint32_t x, const uint32_t y)
|
||||
{
|
||||
uint32_t result;
|
||||
asm volatile("sub.cc.u32 %0, %1, %2;" : "=r"(result) : "r"(x), "r"(y));
|
||||
uint32_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint32_t subc(const uint32_t x, const uint32_t y)
|
||||
uint32_t subc(const uint32_t x, const uint32_t y)
|
||||
{
|
||||
uint32_t result;
|
||||
asm volatile("subc.u32 %0, %1, %2;" : "=r"(result) : "r"(x), "r"(y));
|
||||
uint32_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint32_t subc_cc(const uint32_t x, const uint32_t y)
|
||||
uint32_t subc_cc(const uint32_t x, const uint32_t y)
|
||||
{
|
||||
uint32_t result;
|
||||
asm volatile("subc.cc.u32 %0, %1, %2;" : "=r"(result) : "r"(x), "r"(y));
|
||||
uint32_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint32_t mul_lo(const uint32_t x, const uint32_t y)
|
||||
uint32_t mul_lo(const uint32_t x, const uint32_t y)
|
||||
{
|
||||
uint32_t result;
|
||||
asm("mul.lo.u32 %0, %1, %2;" : "=r"(result) : "r"(x), "r"(y));
|
||||
uint32_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint32_t mul_hi(const uint32_t x, const uint32_t y)
|
||||
uint32_t mul_hi(const uint32_t x, const uint32_t y)
|
||||
{
|
||||
uint32_t result;
|
||||
asm("mul.hi.u32 %0, %1, %2;" : "=r"(result) : "r"(x), "r"(y));
|
||||
uint32_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint32_t mad_lo(const uint32_t x, const uint32_t y, const uint32_t z)
|
||||
uint32_t mad_lo(const uint32_t x, const uint32_t y, const uint32_t z)
|
||||
{
|
||||
uint32_t result;
|
||||
asm("mad.lo.u32 %0, %1, %2, %3;" : "=r"(result) : "r"(x), "r"(y), "r"(z));
|
||||
uint32_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint32_t mad_hi(const uint32_t x, const uint32_t y, const uint32_t z)
|
||||
uint32_t mad_hi(const uint32_t x, const uint32_t y, const uint32_t z)
|
||||
{
|
||||
uint32_t result;
|
||||
asm("mad.hi.u32 %0, %1, %2, %3;" : "=r"(result) : "r"(x), "r"(y), "r"(z));
|
||||
uint32_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint32_t mad_lo_cc(const uint32_t x, const uint32_t y, const uint32_t z)
|
||||
uint32_t mad_lo_cc(const uint32_t x, const uint32_t y, const uint32_t z)
|
||||
{
|
||||
uint32_t result;
|
||||
asm volatile("mad.lo.cc.u32 %0, %1, %2, %3;" : "=r"(result) : "r"(x), "r"(y), "r"(z));
|
||||
uint32_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint32_t mad_hi_cc(const uint32_t x, const uint32_t y, const uint32_t z)
|
||||
uint32_t mad_hi_cc(const uint32_t x, const uint32_t y, const uint32_t z)
|
||||
{
|
||||
uint32_t result;
|
||||
asm volatile("mad.hi.cc.u32 %0, %1, %2, %3;" : "=r"(result) : "r"(x), "r"(y), "r"(z));
|
||||
uint32_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint32_t madc_lo(const uint32_t x, const uint32_t y, const uint32_t z)
|
||||
uint32_t madc_lo(const uint32_t x, const uint32_t y, const uint32_t z)
|
||||
{
|
||||
uint32_t result;
|
||||
asm volatile("madc.lo.u32 %0, %1, %2, %3;" : "=r"(result) : "r"(x), "r"(y), "r"(z));
|
||||
uint32_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint32_t madc_hi(const uint32_t x, const uint32_t y, const uint32_t z)
|
||||
uint32_t madc_hi(const uint32_t x, const uint32_t y, const uint32_t z)
|
||||
{
|
||||
uint32_t result;
|
||||
asm volatile("madc.hi.u32 %0, %1, %2, %3;" : "=r"(result) : "r"(x), "r"(y), "r"(z));
|
||||
uint32_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint32_t madc_lo_cc(const uint32_t x, const uint32_t y, const uint32_t z)
|
||||
uint32_t madc_lo_cc(const uint32_t x, const uint32_t y, const uint32_t z)
|
||||
{
|
||||
uint32_t result;
|
||||
asm volatile("madc.lo.cc.u32 %0, %1, %2, %3;" : "=r"(result) : "r"(x), "r"(y), "r"(z));
|
||||
uint32_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint32_t madc_hi_cc(const uint32_t x, const uint32_t y, const uint32_t z)
|
||||
uint32_t madc_hi_cc(const uint32_t x, const uint32_t y, const uint32_t z)
|
||||
{
|
||||
uint32_t result;
|
||||
asm volatile("madc.hi.cc.u32 %0, %1, %2, %3;" : "=r"(result) : "r"(x), "r"(y), "r"(z));
|
||||
uint32_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint64_t mov_b64(uint32_t lo, uint32_t hi)
|
||||
uint64_t mov_b64(uint32_t lo, uint32_t hi)
|
||||
{
|
||||
uint64_t result;
|
||||
asm("mov.b64 %0, {%1,%2};" : "=l"(result) : "r"(lo), "r"(hi));
|
||||
uint64_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
@@ -141,142 +121,124 @@ namespace ptx {
|
||||
// Callers should know exactly what they're calling (no implicit conversions).
|
||||
namespace u64 {
|
||||
|
||||
__device__ __forceinline__ uint64_t add(const uint64_t x, const uint64_t y)
|
||||
uint64_t add(const uint64_t x, const uint64_t y)
|
||||
{
|
||||
uint64_t result;
|
||||
asm("add.u64 %0, %1, %2;" : "=l"(result) : "l"(x), "l"(y));
|
||||
uint64_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint64_t add_cc(const uint64_t x, const uint64_t y)
|
||||
uint64_t add_cc(const uint64_t x, const uint64_t y)
|
||||
{
|
||||
uint64_t result;
|
||||
asm volatile("add.cc.u64 %0, %1, %2;" : "=l"(result) : "l"(x), "l"(y));
|
||||
uint64_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint64_t addc(const uint64_t x, const uint64_t y)
|
||||
uint64_t addc(const uint64_t x, const uint64_t y)
|
||||
{
|
||||
uint64_t result;
|
||||
asm volatile("addc.u64 %0, %1, %2;" : "=l"(result) : "l"(x), "l"(y));
|
||||
uint64_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint64_t addc_cc(const uint64_t x, const uint64_t y)
|
||||
uint64_t addc_cc(const uint64_t x, const uint64_t y)
|
||||
{
|
||||
uint64_t result;
|
||||
asm volatile("addc.cc.u64 %0, %1, %2;" : "=l"(result) : "l"(x), "l"(y));
|
||||
uint64_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint64_t sub(const uint64_t x, const uint64_t y)
|
||||
uint64_t sub(const uint64_t x, const uint64_t y)
|
||||
{
|
||||
uint64_t result;
|
||||
asm("sub.u64 %0, %1, %2;" : "=l"(result) : "l"(x), "l"(y));
|
||||
uint64_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint64_t sub_cc(const uint64_t x, const uint64_t y)
|
||||
uint64_t sub_cc(const uint64_t x, const uint64_t y)
|
||||
{
|
||||
uint64_t result;
|
||||
asm volatile("sub.cc.u64 %0, %1, %2;" : "=l"(result) : "l"(x), "l"(y));
|
||||
uint64_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint64_t subc(const uint64_t x, const uint64_t y)
|
||||
uint64_t subc(const uint64_t x, const uint64_t y)
|
||||
{
|
||||
uint64_t result;
|
||||
asm volatile("subc.u64 %0, %1, %2;" : "=l"(result) : "l"(x), "l"(y));
|
||||
uint64_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint64_t subc_cc(const uint64_t x, const uint64_t y)
|
||||
uint64_t subc_cc(const uint64_t x, const uint64_t y)
|
||||
{
|
||||
uint64_t result;
|
||||
asm volatile("subc.cc.u64 %0, %1, %2;" : "=l"(result) : "l"(x), "l"(y));
|
||||
uint64_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint64_t mul_lo(const uint64_t x, const uint64_t y)
|
||||
uint64_t mul_lo(const uint64_t x, const uint64_t y)
|
||||
{
|
||||
uint64_t result;
|
||||
asm("mul.lo.u64 %0, %1, %2;" : "=l"(result) : "l"(x), "l"(y));
|
||||
uint64_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint64_t mul_hi(const uint64_t x, const uint64_t y)
|
||||
uint64_t mul_hi(const uint64_t x, const uint64_t y)
|
||||
{
|
||||
uint64_t result;
|
||||
asm("mul.hi.u64 %0, %1, %2;" : "=l"(result) : "l"(x), "l"(y));
|
||||
uint64_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint64_t mad_lo(const uint64_t x, const uint64_t y, const uint64_t z)
|
||||
uint64_t mad_lo(const uint64_t x, const uint64_t y, const uint64_t z)
|
||||
{
|
||||
uint64_t result;
|
||||
asm("mad.lo.u64 %0, %1, %2, %3;" : "=l"(result) : "l"(x), "l"(y), "l"(z));
|
||||
uint64_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint64_t mad_hi(const uint64_t x, const uint64_t y, const uint64_t z)
|
||||
uint64_t mad_hi(const uint64_t x, const uint64_t y, const uint64_t z)
|
||||
{
|
||||
uint64_t result;
|
||||
asm("mad.hi.u64 %0, %1, %2, %3;" : "=l"(result) : "l"(x), "l"(y), "l"(z));
|
||||
uint64_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint64_t mad_lo_cc(const uint64_t x, const uint64_t y, const uint64_t z)
|
||||
uint64_t mad_lo_cc(const uint64_t x, const uint64_t y, const uint64_t z)
|
||||
{
|
||||
uint64_t result;
|
||||
asm volatile("mad.lo.cc.u64 %0, %1, %2, %3;" : "=l"(result) : "l"(x), "l"(y), "l"(z));
|
||||
uint64_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint64_t mad_hi_cc(const uint64_t x, const uint64_t y, const uint64_t z)
|
||||
uint64_t mad_hi_cc(const uint64_t x, const uint64_t y, const uint64_t z)
|
||||
{
|
||||
uint64_t result;
|
||||
asm volatile("mad.hi.cc.u64 %0, %1, %2, %3;" : "=l"(result) : "l"(x), "l"(y), "l"(z));
|
||||
uint64_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint64_t madc_lo(const uint64_t x, const uint64_t y, const uint64_t z)
|
||||
uint64_t madc_lo(const uint64_t x, const uint64_t y, const uint64_t z)
|
||||
{
|
||||
uint64_t result;
|
||||
asm volatile("madc.lo.u64 %0, %1, %2, %3;" : "=l"(result) : "l"(x), "l"(y), "l"(z));
|
||||
uint64_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint64_t madc_hi(const uint64_t x, const uint64_t y, const uint64_t z)
|
||||
uint64_t madc_hi(const uint64_t x, const uint64_t y, const uint64_t z)
|
||||
{
|
||||
uint64_t result;
|
||||
asm volatile("madc.hi.u64 %0, %1, %2, %3;" : "=l"(result) : "l"(x), "l"(y), "l"(z));
|
||||
uint64_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint64_t madc_lo_cc(const uint64_t x, const uint64_t y, const uint64_t z)
|
||||
uint64_t madc_lo_cc(const uint64_t x, const uint64_t y, const uint64_t z)
|
||||
{
|
||||
uint64_t result;
|
||||
asm volatile("madc.lo.cc.u64 %0, %1, %2, %3;" : "=l"(result) : "l"(x), "l"(y), "l"(z));
|
||||
uint64_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ uint64_t madc_hi_cc(const uint64_t x, const uint64_t y, const uint64_t z)
|
||||
uint64_t madc_hi_cc(const uint64_t x, const uint64_t y, const uint64_t z)
|
||||
{
|
||||
uint64_t result;
|
||||
asm volatile("madc.hi.cc.u64 %0, %1, %2, %3;" : "=l"(result) : "l"(x), "l"(y), "l"(z));
|
||||
uint64_t result = 0;
|
||||
return result;
|
||||
}
|
||||
|
||||
} // namespace u64
|
||||
|
||||
__device__ __forceinline__ void bar_arrive(const unsigned name, const unsigned count)
|
||||
void bar_arrive(const unsigned name, const unsigned count)
|
||||
{
|
||||
asm volatile("bar.arrive %0, %1;" : : "r"(name), "r"(count) : "memory");
|
||||
return;
|
||||
}
|
||||
|
||||
__device__ __forceinline__ void bar_sync(const unsigned name, const unsigned count)
|
||||
void bar_sync(const unsigned name, const unsigned count)
|
||||
{
|
||||
asm volatile("bar.sync %0, %1;" : : "r"(name), "r"(count) : "memory");
|
||||
return;
|
||||
}
|
||||
|
||||
} // namespace ptx
|
||||
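Before the next file, a short orientation note on the wrappers above (a hedged sketch, not part of the diff): each ptx:: helper is intended to expand to a single PTX instruction on the device, with the .cc / c variants threading the hardware carry flag between consecutive calls. A typical column step of a schoolbook multiply pairs them like this, assuming device code and the signatures shown above:

__device__ void mac_column(uint32_t xi, uint32_t yj, uint32_t& lo, uint32_t& hi)
{
  lo = ptx::mad_lo_cc(xi, yj, lo);  // lo += low 32 bits of xi*yj, sets the carry flag
  hi = ptx::madc_hi(xi, yj, hi);    // hi += high 32 bits of xi*yj plus the carry flag
}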
@@ -1,8 +1,8 @@
|
||||
#pragma once
|
||||
|
||||
#include "field.cuh"
|
||||
#include "gpu-utils/modifiers.cuh"
|
||||
#include "gpu-utils/sharedmem.cuh"
|
||||
#include "../gpu-utils/modifiers.cuh"
|
||||
#include "../gpu-utils/sharedmem.cuh"
|
||||
|
||||
template <typename CONFIG>
|
||||
class ExtensionField
|
||||
@@ -16,12 +16,12 @@ private:
|
||||
FWide real;
|
||||
FWide imaginary;
|
||||
|
||||
friend HOST_DEVICE_INLINE ExtensionWide operator+(ExtensionWide xs, const ExtensionWide& ys)
|
||||
friend ExtensionWide operator+(ExtensionWide xs, const ExtensionWide& ys)
|
||||
{
|
||||
return ExtensionWide{xs.real + ys.real, xs.imaginary + ys.imaginary};
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE ExtensionWide operator-(ExtensionWide xs, const ExtensionWide& ys)
|
||||
friend ExtensionWide operator-(ExtensionWide xs, const ExtensionWide& ys)
|
||||
{
|
||||
return ExtensionWide{xs.real - ys.real, xs.imaginary - ys.imaginary};
|
||||
}
|
||||
@@ -34,21 +34,21 @@ public:
|
||||
FF real;
|
||||
FF imaginary;
|
||||
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionField zero() { return ExtensionField{FF::zero(), FF::zero()}; }
|
||||
static constexpr ExtensionField zero() { return ExtensionField{FF::zero(), FF::zero()}; }
|
||||
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionField one() { return ExtensionField{FF::one(), FF::zero()}; }
|
||||
static constexpr ExtensionField one() { return ExtensionField{FF::one(), FF::zero()}; }
|
||||
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionField to_montgomery(const ExtensionField& xs)
|
||||
static constexpr ExtensionField to_montgomery(const ExtensionField& xs)
|
||||
{
|
||||
return ExtensionField{xs.real * FF{CONFIG::montgomery_r}, xs.imaginary * FF{CONFIG::montgomery_r}};
|
||||
}
|
||||
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionField from_montgomery(const ExtensionField& xs)
|
||||
static constexpr ExtensionField from_montgomery(const ExtensionField& xs)
|
||||
{
|
||||
return ExtensionField{xs.real * FF{CONFIG::montgomery_r_inv}, xs.imaginary * FF{CONFIG::montgomery_r_inv}};
|
||||
}
|
||||
|
||||
static HOST_INLINE ExtensionField rand_host() { return ExtensionField{FF::rand_host(), FF::rand_host()}; }
|
||||
static ExtensionField rand_host() { return ExtensionField{FF::rand_host(), FF::rand_host()}; }
|
||||
|
||||
static void rand_host_many(ExtensionField* out, int size)
|
||||
{
|
||||
@@ -57,7 +57,7 @@ public:
|
||||
}
|
||||
|
||||
template <unsigned REDUCTION_SIZE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionField sub_modulus(const ExtensionField& xs)
|
||||
static constexpr ExtensionField sub_modulus(const ExtensionField& xs)
|
||||
{
|
||||
return ExtensionField{FF::sub_modulus<REDUCTION_SIZE>(&xs.real), FF::sub_modulus<REDUCTION_SIZE>(&xs.imaginary)};
|
||||
}
|
||||
@@ -68,38 +68,38 @@ public:
|
||||
return os;
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE ExtensionField operator+(ExtensionField xs, const ExtensionField& ys)
|
||||
friend ExtensionField operator+(ExtensionField xs, const ExtensionField& ys)
|
||||
{
|
||||
return ExtensionField{xs.real + ys.real, xs.imaginary + ys.imaginary};
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE ExtensionField operator-(ExtensionField xs, const ExtensionField& ys)
|
||||
friend ExtensionField operator-(ExtensionField xs, const ExtensionField& ys)
|
||||
{
|
||||
return ExtensionField{xs.real - ys.real, xs.imaginary - ys.imaginary};
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE ExtensionField operator+(FF xs, const ExtensionField& ys)
|
||||
friend ExtensionField operator+(FF xs, const ExtensionField& ys)
|
||||
{
|
||||
return ExtensionField{xs + ys.real, ys.imaginary};
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE ExtensionField operator-(FF xs, const ExtensionField& ys)
|
||||
friend ExtensionField operator-(FF xs, const ExtensionField& ys)
|
||||
{
|
||||
return ExtensionField{xs - ys.real, FF::neg(ys.imaginary)};
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE ExtensionField operator+(ExtensionField xs, const FF& ys)
|
||||
friend ExtensionField operator+(ExtensionField xs, const FF& ys)
|
||||
{
|
||||
return ExtensionField{xs.real + ys, xs.imaginary};
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE ExtensionField operator-(ExtensionField xs, const FF& ys)
|
||||
friend ExtensionField operator-(ExtensionField xs, const FF& ys)
|
||||
{
|
||||
return ExtensionField{xs.real - ys, xs.imaginary};
|
||||
}
|
||||
|
||||
template <unsigned MODULUS_MULTIPLE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionWide mul_wide(const ExtensionField& xs, const ExtensionField& ys)
|
||||
static constexpr ExtensionWide mul_wide(const ExtensionField& xs, const ExtensionField& ys)
|
||||
{
|
||||
FWide real_prod = FF::mul_wide(xs.real, ys.real);
|
||||
FWide imaginary_prod = FF::mul_wide(xs.imaginary, ys.imaginary);
|
||||
@@ -110,40 +110,40 @@ public:
|
||||
}
|
||||
|
||||
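A reference note for the product assembled above (a worked equation, assuming the usual quadratic-extension construction with u^2 = CONFIG::nonresidue, up to the sign handled by the config): writing elements as a + b*u, the product is (a + b*u)(c + d*u) = (a*c + nonresidue*b*d) + (a*d + b*c)*u. mul_wide therefore only needs four base-field wide products, FF::mul_wide(a, c), FF::mul_wide(b, d), FF::mul_wide(a, d) and FF::mul_wide(b, c) (the hunk shows the first two), and reduce() maps each coordinate back below the modulus afterwards.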
template <unsigned MODULUS_MULTIPLE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionWide mul_wide(const ExtensionField& xs, const FF& ys)
|
||||
static constexpr ExtensionWide mul_wide(const ExtensionField& xs, const FF& ys)
|
||||
{
|
||||
return ExtensionWide{FF::mul_wide(xs.real, ys), FF::mul_wide(xs.imaginary, ys)};
|
||||
}
|
||||
|
||||
template <unsigned MODULUS_MULTIPLE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionWide mul_wide(const FF& xs, const ExtensionField& ys)
|
||||
static constexpr ExtensionWide mul_wide(const FF& xs, const ExtensionField& ys)
|
||||
{
|
||||
return mul_wide(ys, xs);
|
||||
}
|
||||
|
||||
template <unsigned MODULUS_MULTIPLE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionField reduce(const ExtensionWide& xs)
|
||||
static constexpr ExtensionField reduce(const ExtensionWide& xs)
|
||||
{
|
||||
return ExtensionField{
|
||||
FF::template reduce<MODULUS_MULTIPLE>(xs.real), FF::template reduce<MODULUS_MULTIPLE>(xs.imaginary)};
|
||||
}
|
||||
|
||||
template <class T1, class T2>
|
||||
friend HOST_DEVICE_INLINE ExtensionField operator*(const T1& xs, const T2& ys)
|
||||
friend ExtensionField operator*(const T1& xs, const T2& ys)
|
||||
{
|
||||
ExtensionWide xy = mul_wide(xs, ys);
|
||||
return reduce(xy);
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE bool operator==(const ExtensionField& xs, const ExtensionField& ys)
|
||||
friend bool operator==(const ExtensionField& xs, const ExtensionField& ys)
|
||||
{
|
||||
return (xs.real == ys.real) && (xs.imaginary == ys.imaginary);
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE bool operator!=(const ExtensionField& xs, const ExtensionField& ys) { return !(xs == ys); }
|
||||
friend bool operator!=(const ExtensionField& xs, const ExtensionField& ys) { return !(xs == ys); }
|
||||
|
||||
template <const ExtensionField& multiplier>
|
||||
static HOST_DEVICE_INLINE ExtensionField mul_const(const ExtensionField& xs)
|
||||
static ExtensionField mul_const(const ExtensionField& xs)
|
||||
{
|
||||
static constexpr FF mul_real = multiplier.real;
|
||||
static constexpr FF mul_imaginary = multiplier.imaginary;
|
||||
@@ -159,33 +159,33 @@ public:
|
||||
}
|
||||
|
||||
template <uint32_t multiplier, unsigned REDUCTION_SIZE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionField mul_unsigned(const ExtensionField& xs)
|
||||
static constexpr ExtensionField mul_unsigned(const ExtensionField& xs)
|
||||
{
|
||||
return {FF::template mul_unsigned<multiplier>(xs.real), FF::template mul_unsigned<multiplier>(xs.imaginary)};
|
||||
}
|
||||
|
||||
template <unsigned MODULUS_MULTIPLE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionWide sqr_wide(const ExtensionField& xs)
|
||||
static constexpr ExtensionWide sqr_wide(const ExtensionField& xs)
|
||||
{
|
||||
// TODO: change to a more efficient squaring
|
||||
return mul_wide<MODULUS_MULTIPLE>(xs, xs);
|
||||
}
|
||||
|
||||
template <unsigned MODULUS_MULTIPLE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionField sqr(const ExtensionField& xs)
|
||||
static constexpr ExtensionField sqr(const ExtensionField& xs)
|
||||
{
|
||||
// TODO: change to a more efficient squaring
|
||||
return xs * xs;
|
||||
}
|
||||
|
||||
template <unsigned MODULUS_MULTIPLE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionField neg(const ExtensionField& xs)
|
||||
static constexpr ExtensionField neg(const ExtensionField& xs)
|
||||
{
|
||||
return ExtensionField{FF::neg(xs.real), FF::neg(xs.imaginary)};
|
||||
}
|
||||
|
||||
// inverse of zero is set to be zero which is what we want most of the time
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionField inverse(const ExtensionField& xs)
|
||||
static constexpr ExtensionField inverse(const ExtensionField& xs)
|
||||
{
|
||||
ExtensionField xs_conjugate = {xs.real, FF::neg(xs.imaginary)};
|
||||
FF nonresidue_times_im = FF::template mul_unsigned<CONFIG::nonresidue>(FF::sqr(xs.imaginary));
|
||||
@@ -198,9 +198,9 @@ public:
|
||||
|
||||
template <class CONFIG>
|
||||
struct SharedMemory<ExtensionField<CONFIG>> {
|
||||
__device__ ExtensionField<CONFIG>* getPointer()
|
||||
ExtensionField<CONFIG>* getPointer()
|
||||
{
|
||||
extern __shared__ ExtensionField<CONFIG> s_ext2_scalar_[];
|
||||
ExtensionField<CONFIG> *s_ext2_scalar_;
|
||||
return s_ext2_scalar_;
|
||||
}
|
||||
};
|
||||
@@ -1,8 +1,8 @@
|
||||
#pragma once
|
||||
|
||||
#include "field.cuh"
|
||||
#include "gpu-utils/modifiers.cuh"
|
||||
#include "gpu-utils/sharedmem.cuh"
|
||||
#include "../gpu-utils/modifiers.cuh"
|
||||
#include "../gpu-utils/sharedmem.cuh"
|
||||
|
||||
template <typename CONFIG>
|
||||
class ExtensionField
|
||||
@@ -16,12 +16,12 @@ private:
|
||||
FWide im2;
|
||||
FWide im3;
|
||||
|
||||
friend HOST_DEVICE_INLINE ExtensionWide operator+(ExtensionWide xs, const ExtensionWide& ys)
|
||||
friend ExtensionWide operator+(ExtensionWide xs, const ExtensionWide& ys)
|
||||
{
|
||||
return ExtensionWide{xs.real + ys.real, xs.im1 + ys.im1, xs.im2 + ys.im2, xs.im3 + ys.im3};
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE ExtensionWide operator-(ExtensionWide xs, const ExtensionWide& ys)
|
||||
friend ExtensionWide operator-(ExtensionWide xs, const ExtensionWide& ys)
|
||||
{
|
||||
return ExtensionWide{xs.real - ys.real, xs.im1 - ys.im1, xs.im2 - ys.im2, xs.im3 - ys.im3};
|
||||
}
|
||||
@@ -36,31 +36,31 @@ public:
|
||||
FF im2;
|
||||
FF im3;
|
||||
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionField zero()
|
||||
static constexpr ExtensionField zero()
|
||||
{
|
||||
return ExtensionField{FF::zero(), FF::zero(), FF::zero(), FF::zero()};
|
||||
}
|
||||
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionField one()
|
||||
static constexpr ExtensionField one()
|
||||
{
|
||||
return ExtensionField{FF::one(), FF::zero(), FF::zero(), FF::zero()};
|
||||
}
|
||||
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionField to_montgomery(const ExtensionField& xs)
|
||||
static constexpr ExtensionField to_montgomery(const ExtensionField& xs)
|
||||
{
|
||||
return ExtensionField{
|
||||
xs.real * FF{CONFIG::montgomery_r}, xs.im1 * FF{CONFIG::montgomery_r}, xs.im2 * FF{CONFIG::montgomery_r},
|
||||
xs.im3 * FF{CONFIG::montgomery_r}};
|
||||
}
|
||||
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionField from_montgomery(const ExtensionField& xs)
|
||||
static constexpr ExtensionField from_montgomery(const ExtensionField& xs)
|
||||
{
|
||||
return ExtensionField{
|
||||
xs.real * FF{CONFIG::montgomery_r_inv}, xs.im1 * FF{CONFIG::montgomery_r_inv},
|
||||
xs.im2 * FF{CONFIG::montgomery_r_inv}, xs.im3 * FF{CONFIG::montgomery_r_inv}};
|
||||
}
|
||||
|
||||
static HOST_INLINE ExtensionField rand_host()
|
||||
static ExtensionField rand_host()
|
||||
{
|
||||
return ExtensionField{FF::rand_host(), FF::rand_host(), FF::rand_host(), FF::rand_host()};
|
||||
}
|
||||
@@ -72,7 +72,7 @@ public:
|
||||
}
|
||||
|
||||
template <unsigned REDUCTION_SIZE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionField sub_modulus(const ExtensionField& xs)
|
||||
static constexpr ExtensionField sub_modulus(const ExtensionField& xs)
|
||||
{
|
||||
return ExtensionField{
|
||||
FF::sub_modulus<REDUCTION_SIZE>(&xs.real), FF::sub_modulus<REDUCTION_SIZE>(&xs.im1),
|
||||
@@ -86,38 +86,38 @@ public:
|
||||
return os;
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE ExtensionField operator+(ExtensionField xs, const ExtensionField& ys)
|
||||
friend ExtensionField operator+(ExtensionField xs, const ExtensionField& ys)
|
||||
{
|
||||
return ExtensionField{xs.real + ys.real, xs.im1 + ys.im1, xs.im2 + ys.im2, xs.im3 + ys.im3};
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE ExtensionField operator-(ExtensionField xs, const ExtensionField& ys)
|
||||
friend ExtensionField operator-(ExtensionField xs, const ExtensionField& ys)
|
||||
{
|
||||
return ExtensionField{xs.real - ys.real, xs.im1 - ys.im1, xs.im2 - ys.im2, xs.im3 - ys.im3};
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE ExtensionField operator+(FF xs, const ExtensionField& ys)
|
||||
friend ExtensionField operator+(FF xs, const ExtensionField& ys)
|
||||
{
|
||||
return ExtensionField{xs + ys.real, ys.im1, ys.im2, ys.im3};
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE ExtensionField operator-(FF xs, const ExtensionField& ys)
|
||||
friend ExtensionField operator-(FF xs, const ExtensionField& ys)
|
||||
{
|
||||
return ExtensionField{xs - ys.real, FF::neg(ys.im1), FF::neg(ys.im2), FF::neg(ys.im3)};
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE ExtensionField operator+(ExtensionField xs, const FF& ys)
|
||||
friend ExtensionField operator+(ExtensionField xs, const FF& ys)
|
||||
{
|
||||
return ExtensionField{xs.real + ys, xs.im1, xs.im2, xs.im3};
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE ExtensionField operator-(ExtensionField xs, const FF& ys)
|
||||
friend ExtensionField operator-(ExtensionField xs, const FF& ys)
|
||||
{
|
||||
return ExtensionField{xs.real - ys, xs.im1, xs.im2, xs.im3};
|
||||
}
|
||||
|
||||
template <unsigned MODULUS_MULTIPLE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionWide mul_wide(const ExtensionField& xs, const ExtensionField& ys)
|
||||
static constexpr ExtensionWide mul_wide(const ExtensionField& xs, const ExtensionField& ys)
|
||||
{
|
||||
if (CONFIG::nonresidue_is_negative)
|
||||
return ExtensionWide{
|
||||
@@ -144,21 +144,21 @@ public:
|
||||
}
|
||||
|
||||
template <unsigned MODULUS_MULTIPLE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionWide mul_wide(const ExtensionField& xs, const FF& ys)
|
||||
static constexpr ExtensionWide mul_wide(const ExtensionField& xs, const FF& ys)
|
||||
{
|
||||
return ExtensionWide{
|
||||
FF::mul_wide(xs.real, ys), FF::mul_wide(xs.im1, ys), FF::mul_wide(xs.im2, ys), FF::mul_wide(xs.im3, ys)};
|
||||
}
|
||||
|
||||
template <unsigned MODULUS_MULTIPLE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionWide mul_wide(const FF& xs, const ExtensionField& ys)
|
||||
static constexpr ExtensionWide mul_wide(const FF& xs, const ExtensionField& ys)
|
||||
{
|
||||
return ExtensionWide{
|
||||
FF::mul_wide(xs, ys.real), FF::mul_wide(xs, ys.im1), FF::mul_wide(xs, ys.im2), FF::mul_wide(xs, ys.im3)};
|
||||
}
|
||||
|
||||
template <unsigned MODULUS_MULTIPLE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionField reduce(const ExtensionWide& xs)
|
||||
static constexpr ExtensionField reduce(const ExtensionWide& xs)
|
||||
{
|
||||
return ExtensionField{
|
||||
FF::template reduce<MODULUS_MULTIPLE>(xs.real), FF::template reduce<MODULUS_MULTIPLE>(xs.im1),
|
||||
@@ -166,21 +166,21 @@ public:
|
||||
}
|
||||
|
||||
template <class T1, class T2>
|
||||
friend HOST_DEVICE_INLINE ExtensionField operator*(const T1& xs, const T2& ys)
|
||||
friend ExtensionField operator*(const T1& xs, const T2& ys)
|
||||
{
|
||||
ExtensionWide xy = mul_wide(xs, ys);
|
||||
return reduce(xy);
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE bool operator==(const ExtensionField& xs, const ExtensionField& ys)
|
||||
friend bool operator==(const ExtensionField& xs, const ExtensionField& ys)
|
||||
{
|
||||
return (xs.real == ys.real) && (xs.im1 == ys.im1) && (xs.im2 == ys.im2) && (xs.im3 == ys.im3);
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE bool operator!=(const ExtensionField& xs, const ExtensionField& ys) { return !(xs == ys); }
|
||||
friend bool operator!=(const ExtensionField& xs, const ExtensionField& ys) { return !(xs == ys); }
|
||||
|
||||
template <uint32_t multiplier, unsigned REDUCTION_SIZE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionField mul_unsigned(const ExtensionField& xs)
|
||||
static constexpr ExtensionField mul_unsigned(const ExtensionField& xs)
|
||||
{
|
||||
return {
|
||||
FF::template mul_unsigned<multiplier>(xs.real), FF::template mul_unsigned<multiplier>(xs.im1),
|
||||
@@ -188,27 +188,27 @@ public:
|
||||
}
|
||||
|
||||
template <unsigned MODULUS_MULTIPLE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionWide sqr_wide(const ExtensionField& xs)
|
||||
static constexpr ExtensionWide sqr_wide(const ExtensionField& xs)
|
||||
{
|
||||
// TODO: change to a more efficient squaring
|
||||
return mul_wide<MODULUS_MULTIPLE>(xs, xs);
|
||||
}
|
||||
|
||||
template <unsigned MODULUS_MULTIPLE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionField sqr(const ExtensionField& xs)
|
||||
static constexpr ExtensionField sqr(const ExtensionField& xs)
|
||||
{
|
||||
// TODO: change to a more efficient squaring
|
||||
return xs * xs;
|
||||
}
|
||||
|
||||
template <unsigned MODULUS_MULTIPLE = 1>
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionField neg(const ExtensionField& xs)
|
||||
static constexpr ExtensionField neg(const ExtensionField& xs)
|
||||
{
|
||||
return {FF::neg(xs.real), FF::neg(xs.im1), FF::neg(xs.im2), FF::neg(xs.im3)};
|
||||
}
|
||||
|
||||
// inverse of zero is set to be zero which is what we want most of the time
|
||||
static constexpr HOST_DEVICE_INLINE ExtensionField inverse(const ExtensionField& xs)
|
||||
static constexpr ExtensionField inverse(const ExtensionField& xs)
|
||||
{
|
||||
FF x, x0, x2;
|
||||
if (CONFIG::nonresidue_is_negative) {
|
||||
@@ -249,9 +249,9 @@ public:
|
||||
|
||||
template <class CONFIG>
|
||||
struct SharedMemory<ExtensionField<CONFIG>> {
|
||||
__device__ ExtensionField<CONFIG>* getPointer()
|
||||
ExtensionField<CONFIG>* getPointer()
|
||||
{
|
||||
extern __shared__ ExtensionField<CONFIG> s_ext4_scalar_[];
|
||||
ExtensionField<CONFIG> *s_ext4_scalar_=nullptr;
|
||||
return s_ext4_scalar_;
|
||||
}
|
||||
};
|
||||
@@ -2,7 +2,7 @@
#ifndef BN254_BASE_PARAMS_H
#define BN254_BASE_PARAMS_H

#include "fields/storage.cuh"
#include "../storage.cuh"

namespace bn254 {
struct fq_config {

@@ -2,9 +2,9 @@
#ifndef BN254_SCALAR_PARAMS_H
#define BN254_SCALAR_PARAMS_H

#include "fields/storage.cuh"
#include "fields/field.cuh"
#include "fields/quadratic_extension.cuh"
#include "../storage.cuh"
#include "../field.cuh"
#include "../quadratic_extension.cuh"

namespace bn254 {
struct fp_config {

@@ -1,8 +1,8 @@
#pragma once

#include "fields/storage.cuh"
#include "fields/field.cuh"
#include "fields/quartic_extension.cuh"
#include "../storage.cuh"
#include "../field.cuh"
#include "../quartic_extension.cuh"

namespace babybear {
struct fp_config {
631
icicle/include/fields/stark_fields/stark252.cuh
Normal file
631
icicle/include/fields/stark_fields/stark252.cuh
Normal file
@@ -0,0 +1,631 @@
#pragma once

#include "fields/storage.cuh"
#include "fields/field.cuh"

// modulus = 3618502788666131213697322783095070105623107215331596699973092056135872020481 (2^251+17*2^192+1)
namespace stark252 {
struct fp_config {
static constexpr unsigned limbs_count = 8;
static constexpr unsigned modulus_bit_count = 252;
static constexpr unsigned num_of_reductions = 1;
static constexpr unsigned omegas_count = 192;

static constexpr storage<limbs_count> modulus = {0x00000001, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000011, 0x08000000};
static constexpr storage<limbs_count> modulus_2 = {0x00000002, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000022, 0x10000000};
static constexpr storage<limbs_count> modulus_4 = {0x00000004, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000044, 0x20000000};
static constexpr storage<limbs_count> neg_modulus = {0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffee, 0xf7ffffff};
static constexpr storage<2 * limbs_count> modulus_wide = {
0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000011, 0x08000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000};
static constexpr storage<2 * limbs_count> modulus_squared = {
0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000022, 0x10000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000121, 0x10000000, 0x00000001, 0x00400000};
static constexpr storage<2 * limbs_count> modulus_squared_2 = {
0x00000002, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000044, 0x20000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000242, 0x20000000, 0x00000002, 0x00800000};
static constexpr storage<2 * limbs_count> modulus_squared_4 = {
0x00000004, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000088, 0x40000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000484, 0x40000000, 0x00000004, 0x01000000};
static constexpr storage<limbs_count> m = {0x8c81fffb, 0x00000002, 0xfeccf000, 0xffffffff,
0x0000907f, 0x00000000, 0xffffffbc, 0x1fffffff};
static constexpr storage<limbs_count> one = {0x00000001, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000};
static constexpr storage<limbs_count> zero = {0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000};
static constexpr storage<limbs_count> montgomery_r = {0xffffffe1, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xfffffdf0, 0x07ffffff};
static constexpr storage<limbs_count> montgomery_r_inv = {0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000121, 0x10000000, 0x00000001, 0x00400000};
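To tie the limb constants above back to the stated modulus (a quick check, using little-endian 32-bit limbs where limb i carries weight 2^(32*i)): modulus = {0x00000001, 0, 0, 0, 0, 0, 0x00000011, 0x08000000} encodes 1 + 0x11 * 2^192 + 0x08000000 * 2^224 = 1 + 17 * 2^192 + 2^27 * 2^224 = 2^251 + 17 * 2^192 + 1, matching the comment at the top of the file; modulus_2 and modulus_4 are simply 2p and 4p in the same encoding.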
static constexpr storage_array<omegas_count, limbs_count> omega = {
|
||||
{{0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000011, 0x08000000},
|
||||
{0xf41337e3, 0x2a616626, 0xac8320da, 0xc5268e56, 0x4329f8c7, 0x53312066, 0x29a2995b, 0x06250239},
|
||||
{0xee6feebb, 0x3ada5e1d, 0xe4412e87, 0x98c62155, 0x2f9c676e, 0xc90adb1e, 0x0de874d9, 0x063365fe},
|
||||
{0x6021e539, 0x8337c45f, 0xbbf30245, 0xb0bdf467, 0x514425f3, 0x4537602d, 0x88826aba, 0x05ec467b},
|
||||
{0x9b48a8ab, 0x2225638f, 0x1a8e7981, 0x26da375d, 0xce6246af, 0xfcdca219, 0x9ecd5c85, 0x0789ad45},
|
||||
{0xb2703765, 0xd6871506, 0xf9e225ec, 0xd09bd064, 0x10826800, 0x5e869a07, 0xe82b2bb5, 0x0128f0fe},
|
||||
{0xdd4af20f, 0xfdab65db, 0x56f9ddbc, 0xefa66822, 0x1b03a097, 0x587781ce, 0x9556f9b8, 0x000fcad1},
|
||||
{0xff0cb347, 0x9f1bc8d7, 0xd0e87cd5, 0xc4d78992, 0xdd51a717, 0xbc7924d5, 0xfd121b58, 0x00c92ecb},
|
||||
{0xc13a1d0b, 0xcc4074a0, 0xe3bc8e32, 0xa1f811a9, 0x6d4b9bd4, 0x0234b46e, 0x7880b4dc, 0x011d07d9},
|
||||
{0xec89c4f1, 0xa206c054, 0xdc125289, 0x653d9e35, 0x711825f5, 0x72406af6, 0x46a03edd, 0x0659d839},
|
||||
{0x0fa30710, 0x45391692, 0x11b54c6c, 0xd439f572, 0xa3492c1e, 0xed5ebbf4, 0xb5d9a6de, 0x010f4d91},
|
||||
{0x7afd187f, 0x9273dbbc, 0x91ee171f, 0xdb5375bc, 0x6749ae3d, 0xc061f425, 0x6ec477cf, 0x003d14df},
|
||||
{0x3112b02d, 0x8171e1da, 0xadf9bf78, 0x5c4564eb, 0x5689b232, 0x68c34184, 0x6538624f, 0x0363d70a},
|
||||
{0x606082e1, 0x3e5a42f0, 0x76fc314a, 0x5edd09f0, 0x0f673d7c, 0xd650df25, 0x34832dba, 0x0393a32b},
|
||||
{0x13a77460, 0xe3efc75d, 0x62ef8a01, 0x93898bc8, 0x8bdbd9b3, 0x1c3a6e5c, 0x611b7206, 0x034b5d5d},
|
||||
{0x309d9da9, 0x80ee9837, 0xf51eddbc, 0x1646d633, 0x4901fab8, 0xb9d2cd85, 0x9978ee09, 0x01eb6d84},
|
||||
{0x2755bfac, 0xa7b1f98c, 0xeb7aa1c1, 0x9ec8116c, 0x3109e611, 0x0eeadedd, 0xc9761a8a, 0x06a6f98d},
|
||||
{0x9745a046, 0xce7b0a8b, 0xe411ee63, 0x7ff61841, 0x635f8799, 0x34f67453, 0xef852560, 0x04768803},
|
||||
{0xbffaa9db, 0x1727fce0, 0xf973dc22, 0x858f5918, 0x223f6558, 0x3e277fa0, 0xf71614e3, 0x02d25658},
|
||||
{0x8574e81f, 0xe3d47b99, 0x7fc4c648, 0xc727c9af, 0xee93dc85, 0x581d81ca, 0xca8a00d9, 0x0594beaf},
|
||||
{0x0e5ffcb8, 0x00654744, 0xe7c1b2fd, 0x030530a6, 0xecbf157b, 0x27e46d76, 0xbeea04f1, 0x01f4c2bf},
|
||||
{0x3e3a2f4b, 0xead33145, 0xd6482f17, 0xd841544d, 0x8d24a344, 0x9822fb10, 0x31eeac7c, 0x03e43835},
|
||||
{0xb40bdbe8, 0x01af11c3, 0xb32a3b23, 0xd7c9c0a1, 0xcd0be360, 0x81cb2e43, 0xafb3df1a, 0x01054544},
|
||||
{0x77156db2, 0xf6b13488, 0xddc0f211, 0x1ad6f3be, 0xd664f4da, 0xe643d3ea, 0x174a8e80, 0x071a47b8},
|
||||
{0x4ca88ffc, 0xb86b03a4, 0x8ef9a25a, 0x6e3398e6, 0xf5fa4665, 0xce9a0d37, 0x5c437763, 0x06e8e769},
|
||||
{0x4586dbc3, 0x32609f1d, 0xaa2da684, 0x03148f22, 0x4795d346, 0xa679e36b, 0x9e51225c, 0x03d8d2c7},
|
||||
{0xea5f81cf, 0xeac5be9e, 0x64c12e72, 0x102e16b2, 0xfee282e4, 0xce0bc0d9, 0xa93b28f3, 0x01f05206},
|
||||
{0xbb6422f9, 0x258e96d2, 0x617c5468, 0x751615d8, 0x6056f032, 0x27145cb6, 0x81c06d84, 0x057a7971},
|
||||
{0xb030713c, 0xf42231bb, 0x3a96c59e, 0xae9c3f9a, 0xf1ee840c, 0x5397e8e2, 0xf2b87657, 0x05e7deca},
|
||||
{0xf81f58b4, 0x209745aa, 0x91af248d, 0x74a64310, 0xc04b00b7, 0xe566a8e1, 0x80fb4cea, 0x022bde40},
|
||||
{0x5de74517, 0x8265b62b, 0xb9b9f2c9, 0x6a788149, 0xa9565d98, 0x6fec2239, 0x573f0c28, 0x060ac0c4},
|
||||
{0xd3ce8992, 0xc129d0f1, 0x81c43de5, 0x719252eb, 0x48221e1a, 0xfea566de, 0x0be8ced2, 0x050732ed},
|
||||
{0x2216f1c8, 0x9aae0db3, 0xd7220015, 0x95e231ac, 0x6340df6f, 0xbd6ae160, 0x16a6e39c, 0x0166c8e2},
|
||||
{0x76b0a92e, 0x3ccd9d2b, 0x7d671a9d, 0x1feb39d7, 0x2109fd56, 0x3c49a630, 0x5d4ec292, 0x07badc4b},
|
||||
{0x5dd8c4c3, 0x081c3166, 0xec14ba21, 0x9dca12d8, 0xcf93b2e5, 0xf58069e2, 0x571ddc34, 0x02399005},
|
||||
{0x08a616fc, 0x65a19cf4, 0x8aea6ff7, 0x860d442c, 0x6896a559, 0x4f24ab19, 0x3d7f5ae6, 0x0685db92},
|
||||
{0x622478c4, 0x051093f0, 0x3fab8962, 0x5c200627, 0x21254c39, 0x2aa7ae1b, 0x7b116fb9, 0x0100fff9},
|
||||
{0x00637050, 0x2693b834, 0x22440235, 0x3fef7c1b, 0x3481c4fe, 0x31150ac1, 0xf261b6de, 0x0772cb7a},
|
||||
{0xd990d491, 0x6966804c, 0xc7505f35, 0x46aba1bc, 0xaceeb7f7, 0x4f696cba, 0x6474b8f0, 0x02b73cad},
|
||||
{0xf39cd3e8, 0x7d13e948, 0x62a1db76, 0xd5c33593, 0x4d1be159, 0x7fd3b59b, 0x3676644e, 0x066d3f61},
|
||||
{0xb3bd8b7e, 0x5a896ef3, 0xba5762ab, 0x2319450a, 0x1a545f8b, 0x226f0a07, 0x55446d35, 0x02760973},
|
||||
{0x140e5623, 0x38eaa186, 0x94be15ba, 0x5a48d469, 0xad75d32a, 0xe4f1f15b, 0x2f14e2f1, 0x039ccdaa},
|
||||
{0xe6fcfdb2, 0xad7108d3, 0x9c9f7f04, 0xfadfc050, 0x9df95366, 0xdbb20071, 0xe555c739, 0x02c4d3fa},
|
||||
{0xc3111bcb, 0xb640956f, 0xbb11fb86, 0xcd942bbd, 0xa3db81cd, 0xa4b4eb09, 0x684fdb65, 0x041ed5ed},
|
||||
{0xdd5ca525, 0x462b41fa, 0x153c3d28, 0xbcc17ccd, 0x6b06db5c, 0x8a81d137, 0x4a050358, 0x05f5cf39},
|
||||
{0xcc60fb85, 0x374012a6, 0x34d1905d, 0x978f9785, 0x4e17ff38, 0x713383d4, 0x1055c25d, 0x07f3796f},
|
||||
{0x0643771f, 0x852ba56e, 0x86781a31, 0xadfa956c, 0xb26a3811, 0x2ee2fccf, 0xdbd56ba7, 0x009214ce},
|
||||
{0x68bc148c, 0xe2bf6c4b, 0x01c203ce, 0xd38dbf38, 0x97923b55, 0x27f73df4, 0x5081f7d9, 0x030a2e81},
|
||||
{0xf11422a0, 0xbe23b78f, 0x99cdc2e0, 0xd4f3510d, 0xaa13ffe5, 0xcb05b3da, 0xc724e0c5, 0x028d98a5},
|
||||
{0x96934000, 0x15277271, 0x588c8a51, 0x8013dd5e, 0x9ed55af8, 0x77772f7c, 0x03549e60, 0x020895f8},
|
||||
{0x34db29f8, 0xc0cc8556, 0x67455b5d, 0x5582a9ff, 0x8a9a38b5, 0x12862a43, 0xa59fd242, 0x059655bc},
|
||||
{0x94ceaf98, 0x39bc5131, 0xc71ccc0d, 0x99f4d1a0, 0x54acb87c, 0xc565794d, 0xc33590ef, 0x0593fcef},
|
||||
{0xe97bf51c, 0xa2922d09, 0x3200d367, 0xdbb866a2, 0x4ad9302d, 0x05849ed8, 0xdf93f2b5, 0x000c447e},
|
||||
{0x850fb317, 0x2755d6c2, 0xd45eb3f5, 0x36feeeea, 0xdfbc1d97, 0x4f4471d7, 0x4e3003f8, 0x07ec8926},
|
||||
{0xb6a791f1, 0x38b8dc2a, 0x27a1bbb1, 0x79d6de48, 0xcad54cf2, 0x78c40b06, 0xa43bc898, 0x036dd150},
|
||||
{0x1cc4133c, 0xefa72477, 0x477d39be, 0x5327d617, 0x2c5db3a4, 0xfd1de1f9, 0xc9a18a1c, 0x0147819b},
|
||||
{0xf8133966, 0x275e6b02, 0x87969b48, 0x82bc79b9, 0x5d1e2f0e, 0x85b1f9bd, 0xc819531b, 0x00f9ea29},
|
||||
{0x120edfab, 0x9e0392a5, 0xe3681a15, 0x07403ad4, 0x8a1c3817, 0xa8d469d8, 0x89f15c6f, 0x0395e7fc},
|
||||
{0x641826ac, 0x7f405a9f, 0x6861e2ce, 0xa566e755, 0xba82a050, 0x8a3a08ba, 0xea63598d, 0x071dd923},
|
||||
{0x5f65c188, 0x1d2b7538, 0xd6fc9625, 0xcb704d0f, 0xf59deccc, 0x18729111, 0x52fe1979, 0x07595020},
|
||||
{0x8a08756f, 0x0175aa1c, 0x7fa7c6c4, 0x9a76a312, 0x6e93f6f3, 0x0bfa523a, 0x258c2f23, 0x03d70de4},
|
||||
{0x8229376d, 0x8a0b9d02, 0x2c65c94e, 0x08421430, 0xd34b0aa6, 0x1160b441, 0xbbfb9491, 0x03b9eb75},
|
||||
{0x827caf53, 0x91874856, 0x37e8a006, 0xdfdcae7a, 0x04e3af6b, 0x6dcfc3f2, 0xba66ff37, 0x0592823d},
|
||||
{0x72fb8b0d, 0xb0a6628d, 0xa72b1f03, 0x7d3eef8b, 0x8dd54dbe, 0x5be965ba, 0x96d1fe4c, 0x0114a278},
|
||||
{0x06051d55, 0x0256d8e6, 0xb9fa9dcc, 0xbf152353, 0x44140d6e, 0x6ef2c68c, 0xc9c0fea6, 0x015f291a},
|
||||
{0xed992efc, 0xa1826724, 0x771da991, 0x9a58fd99, 0xd0b370a1, 0xce51a153, 0x826df846, 0x03c53bf5},
|
||||
{0xcc7bf8c3, 0x3909aad7, 0xb08ddfa2, 0xd408ae7d, 0xff94d9fc, 0x2e9ab5d6, 0xf11cbcf6, 0x0020a1b2},
|
||||
{0x3e257b43, 0x448fff07, 0x5fd9edca, 0x00f4a128, 0x7b429f71, 0x6f8987e3, 0x0fc8b522, 0x013336c1},
|
||||
{0x062bd860, 0xef78ac4c, 0xf5d787d2, 0x6539ee52, 0xbb65576e, 0x113b6071, 0x9f3d7f85, 0x0160e952},
|
||||
{0xf966d24e, 0x0c4e7c07, 0x318277e8, 0x011853d8, 0x7c287f58, 0x93bae650, 0xf64289f7, 0x00b974a1},
|
||||
{0x30408cb9, 0x66d19420, 0x0430b017, 0x709ca6c6, 0x23d95951, 0xb174ad46, 0x111f4192, 0x030762f8},
|
||||
{0xf246c901, 0xb9d70015, 0x57a1cdec, 0xd3616cb1, 0x0d732fdb, 0x61aab25e, 0x12d620d8, 0x0712858b},
|
||||
{0x16334e1a, 0x8ec7e113, 0xa96aeeab, 0x0021a55b, 0xfd639175, 0x8f4c1366, 0x69bc866a, 0x07acdde9},
|
||||
{0x23088fc7, 0x1fb24e5e, 0x92a88089, 0xcacd65df, 0x17343c48, 0x103ec3c8, 0xc387a3b5, 0x03d296b9},
|
||||
{0xcd9fedee, 0xae703c5b, 0x7853b30d, 0xd0c3e0c6, 0x12abaef5, 0xc1e326b3, 0x5d57bb23, 0x04f42d7f},
|
||||
{0x1824b92c, 0x19cd1b4e, 0x81ebc117, 0xc5daaff4, 0xb8183a1d, 0xeeedaa59, 0xe28baf8a, 0x069d8f0c},
|
||||
{0x9dc50729, 0x9733e8df, 0xf1b9f411, 0xd7e0dbb9, 0x50edf7ea, 0x59e4dbd2, 0x4059cb5f, 0x002259fe},
|
||||
{0xb79a92b1, 0x5e3197fc, 0x59086db1, 0xbfddf5c5, 0xdbea4a69, 0x234d8639, 0x4d0a367d, 0x05dd79b0},
|
||||
{0xa86eec0c, 0x8cc1d845, 0x573b44d7, 0x3cac8839, 0x7b0de880, 0x8b8d8735, 0x68c99722, 0x01c5ef12},
|
||||
{0xc2ba0f23, 0x12680395, 0x471f947e, 0xd43bcf85, 0xcc9d9b24, 0x19935b68, 0x108eec6a, 0x06263e1e},
|
||||
{0x5b7be972, 0x29617bad, 0xc55b1b68, 0x0ab73eef, 0x2544381e, 0x07f12359, 0x63a080a0, 0x0161444d},
|
||||
{0x312f9080, 0x07a4b921, 0x2f530413, 0x64c25a07, 0x7d71ca2f, 0x3f6903d7, 0x04838ba1, 0x06917cab},
|
||||
{0x10bdb6cc, 0xec7cfc1f, 0x3bcf85c7, 0x7046910d, 0x7bc3ff5f, 0x7ef09e22, 0x385306d4, 0x004b0b60},
|
||||
{0x3a41158a, 0x82d06d78, 0xaa690d1f, 0x37c4a361, 0x7117c44a, 0x700766e1, 0xab40d7e4, 0x031261d0},
|
||||
{0x91b88258, 0x384c5e8b, 0x009b84dc, 0xd777abd5, 0xe7eed224, 0x02102b55, 0xdbefe5e9, 0x03b22830},
|
||||
{0x8770a4be, 0xec982f60, 0x961f56ad, 0x4b92533d, 0xf428c4b9, 0x7df85fbb, 0x2d9291a4, 0x057e4876},
|
||||
{0xf4910a60, 0x6ace9477, 0x9fc63b7f, 0xdb5a705f, 0x72328369, 0x4cc157b4, 0xc282db6f, 0x05b8acbc},
|
||||
{0x57269216, 0x4c69edd9, 0xbfee24ac, 0xd04f1eeb, 0x2a069b18, 0xacda8418, 0x5990b523, 0x03761a4f},
|
||||
{0xc608d246, 0x7f2e2048, 0x4664959b, 0xd4f52ed2, 0x11c1d565, 0x354e3bf7, 0x457eabd3, 0x0156d837},
|
||||
{0xd455f483, 0xea8cbefd, 0x5d940684, 0x33cd5725, 0x8091a287, 0x2d89a777, 0x939b3ef3, 0x06159e4a},
|
||||
{0x4fa405aa, 0xe43439f1, 0xdbe5763d, 0xa258cfc7, 0x78d7b607, 0x9491173a, 0x9ad23eac, 0x01775d66},
|
||||
{0xd772d637, 0x2413e92c, 0x5eac4588, 0x22c99c9f, 0x71a0cdd2, 0xa2bd1d06, 0xfdd73a36, 0x05e88acb},
|
||||
{0xb2bfa1ad, 0x68886b35, 0x35d2dfb6, 0x7a969b62, 0x9767a44a, 0x359ddb45, 0x52e5da6d, 0x00f1a46e},
|
||||
{0x1c5a4861, 0x4ef9fe94, 0x1c841a89, 0x1540cf67, 0xa9bed4f5, 0x8b51336f, 0xf63c32ab, 0x0240fc41},
|
||||
{0x87086e50, 0x7f5c626d, 0x049c46e2, 0x38ec0386, 0x0c597ea7, 0x30b003fd, 0x6660a912, 0x07a8faa1},
|
||||
{0x7dac5d19, 0x2810d2b4, 0x80339f39, 0x040470c4, 0xc946ab30, 0x30d97769, 0x52667151, 0x019fa1f9},
|
||||
{0x5e7c57a2, 0x00e13c8e, 0x2a0fb7bd, 0x95490ca0, 0x08451e35, 0x6af2b76d, 0xcf78c579, 0x04c3a3a1},
|
||||
{0x55e39071, 0xa848b2f2, 0xf132ce21, 0x6831da1d, 0xe080e2ec, 0x439bdda4, 0xadd19a7d, 0x06680f09},
|
||||
{0x6be27786, 0xfebd2a8b, 0x093a5a7f, 0x2cdd8f78, 0xdcb004b3, 0xbc0746a1, 0xd12450ed, 0x005f950a},
|
||||
{0x39759f39, 0xe1462ca6, 0x7bbe087d, 0x0c37dca2, 0x0c8661cb, 0x198de347, 0x7e531b52, 0x03602655},
|
||||
{0x66d7eb25, 0xaf24ead2, 0x5ee6eb03, 0x27cea560, 0x4f6267c7, 0xe9aa6d50, 0xe5dd28e0, 0x00c962b1},
|
||||
{0xb11706c9, 0x3c3407a5, 0xcf0e1b88, 0x44370686, 0x9fbda5e3, 0x5d0e7af0, 0x41cf0a6b, 0x010d235f},
|
||||
{0x358cfcc2, 0x1fbc42a3, 0xc78f7dac, 0x5a2e6ea2, 0xa12773f2, 0x33e089ca, 0xed7788c1, 0x04bef156},
|
||||
{0xbea42f88, 0xdb150649, 0x5f3fb72a, 0x71329f69, 0x86b82de7, 0x7aa46ad0, 0xc6093912, 0x07913b17},
|
||||
{0xb3b67067, 0xb2b074ae, 0xc55f4455, 0x4f17674d, 0xdeb0740d, 0x9a112816, 0x316cc0d3, 0x06bd0cde},
|
||||
{0x1a264ab3, 0x962ceb6b, 0xd99f7159, 0xd5930255, 0x24a4096e, 0x7db961b0, 0x3e50dfed, 0x050c8e5c},
|
||||
{0x443af109, 0xc3eebe54, 0x86946633, 0x2ca03fcb, 0x04badff6, 0x6e6eef04, 0x82210754, 0x05d92ab7},
|
||||
{0xa5c0dca4, 0xcbadd8ad, 0x5ac103a0, 0x4cf688cf, 0x26e5d435, 0x571dbdb9, 0x220fc7db, 0x074ffc4d},
|
||||
{0x88740c3e, 0x70b80432, 0x03821aa8, 0x4a959d50, 0xe4df06d8, 0x3eb8c3a0, 0xcac57496, 0x025a425b},
|
||||
{0x55205413, 0xdcadfd29, 0x90b17b01, 0xda7456d2, 0x73696a28, 0x437c2fda, 0x329f6855, 0x00a8a188},
|
||||
{0xa828431e, 0x3cde2cdd, 0x9ed29340, 0x60e6c362, 0x7c13e145, 0xef00dfa9, 0xba288c0b, 0x04159bec},
|
||||
{0x9065f8ee, 0x41d351cd, 0xa4845868, 0x4e2e298f, 0xbdb3834a, 0xbcba6ac1, 0xea85f2ec, 0x042c8871},
|
||||
{0x1fda880f, 0xc4dc0d20, 0x26fc2d5c, 0x4f0f9dc4, 0x86839de7, 0x2c555343, 0xf698dd8f, 0x04d12da8},
|
||||
{0x21bd655a, 0x3a6299bd, 0x8cfd772f, 0x2e4aea22, 0xd2c2590d, 0x09716ad9, 0xb298587d, 0x053b143c},
|
||||
{0xa95e3cbf, 0xd35f3e32, 0x04eac3cf, 0xe380dee7, 0x0f7e3e6b, 0x27e6570a, 0xbed46774, 0x008cd288},
|
||||
{0x9583f023, 0xe42676b0, 0x75cfaa7e, 0x39d57dd6, 0x4f0bb727, 0x10d4a8d0, 0x27c81bdd, 0x016b03c9},
|
||||
{0x4decc603, 0x89b394f7, 0xd24690f4, 0xd7322ee9, 0x947a00fd, 0xbbc12961, 0x82e8fa75, 0x00886d23},
|
||||
{0xeb0faad4, 0x7b48a33b, 0x60e0b0c8, 0x4c11ef26, 0x36f0f791, 0x4163a401, 0xa4074faf, 0x07986fea},
|
||||
{0x31d9587e, 0x96044919, 0x9049fd2d, 0xb1cab341, 0x9c0eea09, 0xf28c83c9, 0x5c6620aa, 0x033b74dd},
|
||||
{0x13ee028c, 0xde558d16, 0x5d4233b0, 0x4dcf3932, 0x2e422803, 0x7bd46887, 0xe1261bff, 0x04b4757d},
|
||||
{0xd48e9b00, 0x6c80848f, 0x10b6a121, 0x937c1e6e, 0xe9f2008c, 0x7782f8b8, 0x2bc7171c, 0x00217358},
|
||||
{0x324228d8, 0xba523265, 0x682ee17c, 0x4ebe5506, 0x3be009f9, 0x6c646fe8, 0x8594b924, 0x046de7bc},
|
||||
{0x3b50645a, 0x270aa33a, 0x2a9c6282, 0x28fd23fd, 0xcfe96515, 0x5b2fa771, 0x3f812377, 0x063039de},
|
||||
{0xaba4060a, 0xa1da52b0, 0x0374be67, 0x7f191fd6, 0x0d7d2126, 0x14c64d05, 0xf7f77381, 0x00419cb7},
|
||||
{0xe4b19319, 0x07eda692, 0x0fef654e, 0x6190d3f6, 0x0b21ca7e, 0x893b0916, 0x073c48b4, 0x0367a3c7},
|
||||
{0xc520e3ea, 0x8fd405b2, 0x487e93c9, 0x73b4f714, 0xd5142cff, 0x70b7ee88, 0xa320eca2, 0x058fb800},
|
||||
{0x72ef3623, 0x3b5a8740, 0xaff370fd, 0xbff4af42, 0xe338258e, 0x64c137b0, 0xc7afafca, 0x05ac9917},
|
||||
{0x82ccc89a, 0x99c46a0d, 0x9ff87868, 0x05ae3209, 0xa489481f, 0x6249b2a4, 0xbaead348, 0x0056c235},
|
||||
{0xba0ea95e, 0x5a0640f3, 0xc03af976, 0x518db5cd, 0x5a250a06, 0x1c3223aa, 0xbc3442eb, 0x0397b942},
|
||||
{0xacf14a4f, 0x164f0705, 0x33eb6c0e, 0x386c2325, 0xd7264573, 0xdfaceff6, 0xd1e22f80, 0x00e94509},
|
||||
{0x9ff51bc7, 0x8964ee48, 0x57bbca04, 0x3e0f5037, 0x6510630c, 0xe78d6c8d, 0xdf0a61c1, 0x041d6351},
|
||||
{0x45aa1b58, 0x47892f3b, 0x915c1c70, 0x5a1787ba, 0x67f20d25, 0xbaa23359, 0x0c4bc4be, 0x00e1919f},
|
||||
{0xb9975332, 0x2a87c37a, 0xcdecebc9, 0x95db523f, 0x1d0db226, 0x703949ee, 0x4c3842dd, 0x03152c1d},
|
||||
{0xecfb6f72, 0x0eff7e6a, 0x9493a628, 0xb3a83455, 0xd596cd51, 0xced58dd1, 0x25ee51ff, 0x033dee78},
|
||||
{0x72a30547, 0x1f4047ca, 0xd40b6d0f, 0x9feefa06, 0x94db1b38, 0x836ffd80, 0xa0992ed5, 0x037c79f6},
|
||||
{0xceb3dffd, 0x7ffa095d, 0x768e2cb3, 0x23097a65, 0x373f6222, 0xd228b1f9, 0xc57feea2, 0x06309a6b},
|
||||
{0xecd4c6f7, 0x7a5bead4, 0x7e70f7de, 0xab92043c, 0x220db8d8, 0xf78f890e, 0x2865a07e, 0x052eeb98},
|
||||
{0xdf253531, 0x8e9a6336, 0xbafa937b, 0xb24b664a, 0x303b1f5a, 0xc89f660e, 0x876bd8c7, 0x07ea9749},
|
||||
{0x1d4c3fec, 0xd958e726, 0x06fbef31, 0xa5eb368f, 0xba6a027d, 0x0c911679, 0x5f80f992, 0x06321b51},
|
||||
{0x046b49b2, 0x3ca61d9e, 0x6aa9c29a, 0x616a47d6, 0x9e9462dc, 0x27a7ffeb, 0x8971b70e, 0x0794ed38},
|
||||
{0x9f47496f, 0xdb259a57, 0xa6b0481c, 0x7f3e3f90, 0x4afab47a, 0x76f42726, 0xc5a79505, 0x07b9da96},
|
||||
{0x57e7aeed, 0x908e6450, 0x81648127, 0xe86db2fb, 0x8dd76882, 0x53f3c573, 0x72327da6, 0x02b37324},
|
||||
{0x73a220ec, 0x82a941c9, 0x7f25beea, 0xb4cbecb7, 0xbfb061d6, 0x746ded71, 0x641b3f3d, 0x00f7af27},
|
||||
{0xcbd4ba67, 0x69b8f4df, 0x3d526981, 0x5ee3ac6f, 0x145cef8c, 0x9372af4e, 0x72a31ef1, 0x05cc1cc6},
|
||||
{0x62d1ba57, 0xce898b0d, 0xee3fa47e, 0x86ba0504, 0x4395b70d, 0xc68233b1, 0x80eb8d60, 0x024cfa58},
|
||||
{0x74d51c41, 0x8fa83850, 0x60f8f9da, 0x5824a285, 0xaf1bea48, 0xa7a2067e, 0x5455acc3, 0x04ba49f2},
|
||||
{0x324c6039, 0x0a1e223e, 0x7b18a9d0, 0x28312228, 0x88b6ecda, 0xb60c1f93, 0x687ba365, 0x053097d8},
|
||||
{0xa7dae551, 0x5604b398, 0xe2e11609, 0x51f02e33, 0xe58e2094, 0x0b51a085, 0x3a3ecc28, 0x078679d6},
|
||||
{0x92d52444, 0xe24b5528, 0x33d0fa70, 0xf77e35ad, 0x9bcbfb57, 0x8af5a7b7, 0x022748d2, 0x015c5f15},
|
||||
{0xc993b168, 0xc002185c, 0x293ad856, 0x5586addb, 0x8ec50726, 0x69c1bfcf, 0x5fd97ea1, 0x00d514fc},
|
||||
{0x8866c747, 0x52d7a9a2, 0x01d6ee05, 0x9bd77465, 0xc3a87a88, 0x576adf96, 0xfa69f0ec, 0x0693e89a},
|
||||
{0x05903be3, 0xcfe50d90, 0xcf739179, 0xbe651dd1, 0x2ae70678, 0xba80ffda, 0xb55b06cc, 0x051dbe40},
|
||||
{0x5585a6f0, 0x4adb5947, 0x9fa37e68, 0x14634b99, 0xa2a910a8, 0x27da5fbf, 0xa99c704d, 0x022a91ce},
|
||||
{0xe2ddaacd, 0xfabab7b8, 0x60cf9603, 0x1edf6a83, 0xbfadddd3, 0x20b04218, 0xa81dbffa, 0x03e0ddb6},
|
||||
{0xda25c9fd, 0xf9c1e3a3, 0xac57ece3, 0x41ff4e1e, 0xdd684055, 0x9ba50868, 0x46d8156a, 0x01b30314},
|
||||
{0xab76a462, 0x30e067cc, 0x08f1b99b, 0x2d84c4c2, 0x73edc56f, 0x6b399ae0, 0x62cfacb2, 0x02f187e1},
|
||||
{0x34fc5356, 0xb085758e, 0xf805fedf, 0xbafe9a1c, 0x95272d01, 0x0bcf423c, 0x1feca651, 0x01df4a81},
|
||||
{0x4c264e97, 0xd3bd9833, 0xc08b1798, 0xc0b192be, 0xdc3ed49e, 0x42724e80, 0xbaee9a58, 0x04100303},
|
||||
{0xe49749c9, 0xb653c919, 0x09f8e2fc, 0x07dbe557, 0xca71e551, 0xbb172d28, 0x7989c8fd, 0x07f5f801},
|
||||
{0xdf1d9004, 0x9412a9f3, 0xbe90d67e, 0xddcf6d66, 0x4692f803, 0x1dbfd679, 0x524c2944, 0x04f4fae1},
|
||||
{0x5707d134, 0xd413afdf, 0x887fd7e9, 0xf8a339cf, 0x84883580, 0xf74544f4, 0x851739e0, 0x0554f72a},
|
||||
{0x59824907, 0xe3827564, 0x421182c9, 0x352eab2a, 0x8f8530f2, 0x19138257, 0x20275950, 0x04e3bf44},
|
||||
{0x33f928b7, 0xef7660f9, 0xf5952362, 0xb7cb0619, 0xf17eb8d7, 0x5b24913b, 0x8e8b8082, 0x00f4804c},
|
||||
{0x5bd84f3e, 0xe7020613, 0x736a1659, 0x7ee777e1, 0x0795844b, 0x34ca7cb6, 0x7503ddc3, 0x07ce12e4},
|
||||
{0x6d8408a5, 0xbbbafb3f, 0x519dadca, 0xe0f02915, 0x0670f5d4, 0x5acba199, 0x4a93340f, 0x0056db45},
|
||||
{0xe404f6c5, 0x73f8a435, 0x01731858, 0x68cd3f7a, 0xd01f3de9, 0x214d3134, 0xd5d75a88, 0x05fb76be},
|
||||
{0xf976eb41, 0x3a66ad86, 0xcd08787a, 0x6401b6d3, 0x7d1e82a8, 0x575950f3, 0x55ee9d49, 0x00e34b33},
|
||||
{0x0cc5cbf4, 0xbff2f4e6, 0xec205dcd, 0x5a6b430d, 0xc94862af, 0xa8114ab3, 0x2fe8be1f, 0x0247ecf5},
|
||||
{0x8b98bf40, 0xded3bc57, 0xe26b66b3, 0xb658c8c4, 0x8d4220db, 0x8bd91c55, 0x94d2adea, 0x00d109f2},
|
||||
{0xedeaec42, 0x0fbfd336, 0x5d407ae8, 0xd94f928d, 0x727e74b5, 0xe5e4a16b, 0xc8c22dd8, 0x06a550df},
|
||||
{0x135e0ee9, 0xe378a012, 0x856a1aef, 0x5be86512, 0xd8febe77, 0x7de04ce2, 0xea43d59b, 0x03ddeed6},
|
||||
{0x005a1d86, 0xc04dc48c, 0x6f29053d, 0x64f4bbd2, 0x9be0aef5, 0x10b1b3db, 0xcc625a0b, 0x03745ca5},
|
||||
{0x1f4f0e85, 0x6c72bd40, 0xc2069cba, 0x4234afd0, 0xb99395f4, 0xc25b262f, 0xae0874e2, 0x0605f6a2},
|
||||
{0xdd756b6d, 0x9513e0d4, 0xf0c137cd, 0x5127a167, 0x7f01c538, 0x1a12a425, 0x00a4483b, 0x068b3aaf},
|
||||
{0x79bc6c86, 0x7a5b3e70, 0x375dc240, 0x5a337909, 0xe111d6ce, 0x46d6fe3c, 0x2ff2ca50, 0x02708b05},
|
||||
{0x1524ad8c, 0x1181eb95, 0x52294490, 0xd0744ddc, 0x848605cf, 0x88ed5b7b, 0xb478c12a, 0x04b9cb49},
|
||||
{0x27105dae, 0x98cb2411, 0xed5c1361, 0x3efa8fae, 0xd498e337, 0x6fa736a5, 0x1e369b4f, 0x038e3b07},
|
||||
{0x98c8db7f, 0xbc5915ae, 0x50425ae8, 0x1f3c8f96, 0xfa86658a, 0x77d60416, 0x28ec2dda, 0x02bc8b30},
|
||||
{0xb94bc10e, 0xad6794f2, 0x7e80093a, 0x7463b3f3, 0x90db4c79, 0x7bf5af53, 0x965c0cc4, 0x031531c6},
|
||||
{0x7cc1083d, 0x66425289, 0xa45d785f, 0x778ba471, 0xbbc94c16, 0xe3f5c599, 0x9b92e036, 0x02606413},
|
||||
{0xcf287faf, 0x191a2ea9, 0x823ddf07, 0xe6406a78, 0xaabe912b, 0xabcf2825, 0x7c48649a, 0x021dab44},
|
||||
{0x65375f6c, 0x9465d77c, 0x65370520, 0x924e189c, 0x918f0105, 0x8be0ca5f, 0xb1925509, 0x07586d27},
|
||||
{0x9302ac44, 0xe4fa93cb, 0xbf87d840, 0xf381ebbd, 0x44793049, 0x5027e7d9, 0xd3f09392, 0x0230b5c3},
|
||||
{0x31d48a82, 0x123e992e, 0x729d40e2, 0xef2990c6, 0x0f331903, 0x946813e3, 0x112a2c4d, 0x022f575e},
|
||||
{0xd4ee8cf7, 0x4b44764e, 0xdb576ebc, 0x4d44cff8, 0x0ab93ba1, 0xc6185d3a, 0x7e3f1e78, 0x0520c2d3},
|
||||
{0xbc46b8b4, 0xd9446736, 0x91e2ede1, 0xc7776293, 0x87689930, 0x0323845f, 0x379293ae, 0x061e359f},
|
||||
{0xb49b3a0a, 0x767a1747, 0x2b58f45e, 0x17e69346, 0x1425ad98, 0x10820519, 0x1b487ae5, 0x0367f384},
|
||||
{0x92f8ac25, 0xe0407696, 0x2beb71a6, 0x9ca9d269, 0x2f0c2471, 0x914017ea, 0xf421a10d, 0x07709cc3},
|
||||
{0xc3bb6a8f, 0x2c8ed622, 0xa2a1a8f2, 0x31c57cb6, 0x4bf6c316, 0x053924d5, 0x09563089, 0x0727b76a},
|
||||
{0x09dc6b5c, 0x567be37f, 0x9476eb5d, 0x57e36f45, 0xee5be5b6, 0xf68488dd, 0x2884c2d7, 0x05ac1ff1},
|
||||
{0x04173760, 0x0fc5b934, 0xda828f00, 0xe43272df, 0x2fad6e9c, 0x7e2ab5fe, 0x0a4995b3, 0x00e0a5eb},
|
||||
{0x42f8ef94, 0x6070024f, 0xe11a6161, 0xad187148, 0x9c8b0fa5, 0x3f046451, 0x87529cfa, 0x005282db}}};
static constexpr storage_array<omegas_count, limbs_count> omega_inv = {
|
||||
{{0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000011, 0x08000000},
|
||||
{0x0becc81e, 0xd59e99d9, 0x537cdf25, 0x3ad971a9, 0xbcd60738, 0xaccedf99, 0xd65d66b5, 0x01dafdc6},
|
||||
{0x4bc9ca34, 0xc8e6df6f, 0x5397aaca, 0xab8bfbc5, 0x94813e6e, 0xb5ea6773, 0xe295dda2, 0x0446ed3c},
|
||||
{0x8145aa75, 0xd7981c5b, 0x3d174c52, 0xb14011ea, 0xe4721c1e, 0x647c9ba3, 0x6f6ac6dd, 0x05c3ed0c},
|
||||
{0x6e0bef41, 0x9de8c5cf, 0xcee1b9b0, 0xec349cbb, 0x2121589c, 0xfe72ab05, 0x24c7669c, 0x03b1c96a},
|
||||
{0x246766d8, 0xb878549e, 0xb5a03ab4, 0x8c5d8531, 0x7f1ec75e, 0x334a83ab, 0x46b146d7, 0x01342b29},
|
||||
{0x31055652, 0x8c71bd50, 0x6081f8c3, 0x2eedac49, 0xab013740, 0x25164a76, 0xbca84bf7, 0x05c0a717},
|
||||
{0xd0a6b4f5, 0x1ad37af3, 0x8ca50294, 0x6dc49fe3, 0x5d9529c3, 0x8357a7ff, 0xcefe8efe, 0x02c161bc},
|
||||
{0x296fbf1c, 0x90a5fa7f, 0xc977b113, 0x18226a39, 0xc178262e, 0x9362d5c9, 0x40d28de5, 0x03a362d3},
|
||||
{0x125ca33a, 0x04eeb1c0, 0x8437c604, 0xaa47a4c0, 0xa4d6bafe, 0x064426a2, 0xb8cc76db, 0x00ffbb44},
|
||||
{0x179e2ebe, 0xecf0daf8, 0x2574403b, 0x942e643e, 0x6bf06f7c, 0x684d31aa, 0x244c675c, 0x003b2bde},
|
||||
{0xfeccfccc, 0x96bc19dc, 0x269130b4, 0xbb26f74e, 0xd511649f, 0x15d57a9f, 0x7dcde3c3, 0x02d852a4},
|
||||
{0x44ad0610, 0xb4a47f4c, 0x06fa1b55, 0xdc2f028f, 0xd25979ac, 0xd73ddcd4, 0x076e7f5d, 0x06ba7cbe},
|
||||
{0x349eea63, 0xb0f43dd2, 0x3e64660d, 0x5e64466c, 0xc3bb94ce, 0x7206f426, 0xed4327aa, 0x036cb7c6},
|
||||
{0xf248b36c, 0x6503e80b, 0xe36060ec, 0xb93dd56f, 0x95c2c067, 0x6d3b2763, 0x155023a7, 0x038e7d59},
|
||||
{0xcdf92351, 0x140437ad, 0x2a5ab630, 0xb7a6e1b4, 0xd48175a5, 0xaa80b742, 0xd4afae89, 0x06a50046},
|
||||
{0xaea51997, 0xe8cde2cd, 0x417e3754, 0x612806f6, 0xb940adf4, 0xe40a4a07, 0xa33929b2, 0x063f5efa},
|
||||
{0x0c07573f, 0x0c0926df, 0xd8d4bee3, 0xa84e9027, 0x6bcd79ea, 0xf3776dfa, 0x523f55a8, 0x043a8517},
|
||||
{0x66984d05, 0x5b7e4e45, 0xdb8c30c4, 0xb9381de7, 0xae86e4f6, 0xd7c15128, 0x809daae7, 0x0718f1ad},
|
||||
{0xc1eae1a6, 0xe4fb0a7d, 0xa90a0813, 0xe5484134, 0x895df525, 0x24cca8f9, 0x1cedd2ee, 0x035fd390},
|
||||
{0x82e87775, 0x0a87a942, 0x971f450b, 0x9f2b4b62, 0x8eae6f09, 0x1dc5aecd, 0x1c5686a6, 0x07547fa3},
|
||||
{0x2e35511a, 0x785975cc, 0xa085c456, 0x4266bc82, 0x3abd5bfd, 0x45cf52e1, 0x7bd95ece, 0x019e8e43},
|
||||
{0xae580194, 0xfad72a75, 0x2989ac16, 0xf2bb5a00, 0x55f2b4d0, 0x53fee728, 0x9c7a91e5, 0x02b9f95d},
|
||||
{0x71200963, 0xb0062d2c, 0x1ac57a23, 0xe16e9f91, 0xc4bd9d3e, 0xaae7b169, 0x7f505f35, 0x07462151},
|
||||
{0x57e31913, 0xcf7bd10e, 0x6a4d0ee4, 0x1a360a91, 0x31869e35, 0xb2ba4914, 0x18005db4, 0x07a62d5c},
|
||||
{0xb4344711, 0x431f11e2, 0x6192c47e, 0x0cc3049c, 0xeb9c1bc3, 0x375dff93, 0x42071ee8, 0x03a75790},
|
||||
{0x9ed81498, 0x4eb14251, 0x98b804ef, 0x5852dbc5, 0x56d7f20c, 0xe0c1be13, 0x20d69181, 0x023e7f68},
|
||||
{0xe34f2d55, 0xf2eeb9b5, 0x2aad6f84, 0x63459f16, 0xbe37dbea, 0xf12099e7, 0x11b1a0fd, 0x06e45493},
|
||||
{0x0d6c93ed, 0x63032f6a, 0x5a04829f, 0xd99cbcc8, 0x89608b5e, 0x80f20416, 0x9df329f4, 0x00bf4231},
|
||||
{0x2710f927, 0xc7fc3d1b, 0x90d8503e, 0xc72d19af, 0x9940e689, 0xa9dcd3b8, 0x2da77ac9, 0x06fd386e},
|
||||
{0x08b27bc2, 0xc800035f, 0x4dfacc03, 0xd98987cf, 0x1256e525, 0x24f8fdbf, 0x1f104273, 0x04c575f1},
|
||||
{0x256c604a, 0x68b16e90, 0x6eba097d, 0x7f51023a, 0x1aeba9c8, 0x52c7629c, 0x4809d8da, 0x0575e850},
|
||||
{0x4ac81249, 0x7439d2f9, 0x4fc31ff2, 0x351e4a62, 0xb3906ded, 0x68fb8313, 0x08507a35, 0x007d43d8},
|
||||
{0x98859a12, 0xa87902b8, 0x73af55b3, 0x2f0d13e0, 0x1b9783c2, 0x5a46c66a, 0x2f5f71d4, 0x01045b06},
|
||||
{0x604fce1e, 0x0c379595, 0x7fccc2b4, 0x20ab6eb8, 0xf1820ae7, 0xac0bc709, 0x93fb2b07, 0x07e7654f},
|
||||
{0x246c4bf0, 0xa0e40811, 0x816b15e0, 0xe12accf5, 0x17938138, 0xee417239, 0x2c9a34fb, 0x004e092e},
|
||||
{0xad2cd984, 0x6304351b, 0x4bf1aafc, 0x38546ca6, 0xf310e99f, 0x1fb81192, 0xb5376275, 0x07e89896},
|
||||
{0x7b2d141d, 0xe4376a0b, 0x6dac220c, 0xea1795e5, 0xb19e1901, 0xd778ab50, 0xa94c274f, 0x077df905},
|
||||
{0x16fcd6c7, 0x7039bab1, 0xa6ea1c94, 0x8eececb7, 0x0f122046, 0x84d26ab5, 0x22fd55a1, 0x053c5d48},
|
||||
{0x72f11f65, 0xd43eb7bb, 0xb2a566d6, 0xfb538785, 0x3f35cbf5, 0xccc2cdc6, 0x7112504a, 0x06df5a9e},
|
||||
{0x60ce9c30, 0x75efb55c, 0x3c541437, 0x991873ed, 0xdf0cbb3b, 0x37eaedcb, 0xb04c2858, 0x0278d7f0},
|
||||
{0x1a06866b, 0x5757dd4e, 0x6570fa7f, 0x15c176b1, 0xafe89a1d, 0x9981b57f, 0xee0cb14c, 0x03c57f4d},
|
||||
{0x503c31cd, 0x3438cd66, 0xc0736d4b, 0x34437e52, 0x2a9d1b28, 0xe825b769, 0x73c06ee7, 0x06955a3a},
|
||||
{0x5c5e530e, 0xbbf0995a, 0x6569a2f9, 0xdee304b3, 0x5bd1a886, 0x3b9c993c, 0xc9cd050a, 0x00f66017},
|
||||
{0xee755737, 0x3666e752, 0x74d0e317, 0xa13bfafc, 0x01d2f1bf, 0x17ab672a, 0x0778f525, 0x079dde3a},
|
||||
{0xed8a25e9, 0x96a003c2, 0x8f347cec, 0x45d258fe, 0x96ea14ac, 0x68ff148d, 0xe148eda9, 0x058f4ec7},
|
||||
{0xe2a700ab, 0x23baf732, 0x5202a945, 0x6434725a, 0x2e693363, 0xa19a338d, 0xbf2f39c6, 0x01d0ea7a},
|
||||
{0x3ab52589, 0x5e571cad, 0x92240361, 0xe2916bb2, 0xdff5e354, 0xe6f8897b, 0x2ffa4707, 0x02a62880},
|
||||
{0xef649a85, 0xaf446c62, 0xed4e461f, 0x14d8072f, 0x59993efa, 0x5a07f4e5, 0x72a3a652, 0x00dc28b6},
|
||||
{0xf21511df, 0x139299d7, 0x4854ebc3, 0x8914e707, 0xbfd102a9, 0x9f3b5913, 0x3a5af894, 0x009dc24f},
|
||||
{0x1f4ba4fa, 0x650e1d91, 0x1977bff0, 0x6ba67806, 0xaa9bbc1b, 0xffbdc531, 0x997408aa, 0x057b69b2},
|
||||
{0x65fb1a91, 0x25c03e81, 0x7fd22618, 0x8682f98b, 0xf46cb453, 0xcad67f13, 0x5a80e5c6, 0x060ca599},
|
||||
{0x94188f2a, 0xa7978a90, 0xdbb9338e, 0xd5fc8f0b, 0xcbdd84f0, 0xf8387e6d, 0xbbc743a3, 0x073ae131},
|
||||
{0x0415bbcc, 0xafd00c46, 0x0df4a52a, 0x1a00eb6c, 0x0b96b594, 0x1ec67c64, 0x8e26b699, 0x01cb82a5},
|
||||
{0x7f740f93, 0xf56319fb, 0x2e2f6ed7, 0xb40d559b, 0x75e19784, 0x63f96f04, 0xc31ba061, 0x06406929},
|
||||
{0xfa5a3239, 0x22349e8b, 0xb9ca6bf9, 0xe1236395, 0x9b0017a4, 0x76ae5a8b, 0x17b7af03, 0x06cfb4ce},
|
||||
{0xb51abfe6, 0x34938785, 0x1249edb6, 0x21f54c80, 0xab038972, 0x3bd1cc16, 0xa4a57a81, 0x0636b37f},
|
||||
{0xf88717cf, 0xfda4a9a1, 0xee19d402, 0xf8fcba35, 0x47c9ba1b, 0x1ac940f6, 0xdd991440, 0x013c0ab3},
|
||||
{0x3743adf4, 0x5082318a, 0x22440f94, 0x3293bae1, 0x8dd2d761, 0x4c2e6d7f, 0xcdc38c82, 0x07124118},
|
||||
{0x76198779, 0xb031f8b7, 0x1b6c1944, 0x6742f602, 0x894a6134, 0xa18290db, 0xaba037dc, 0x035289d8},
|
||||
{0x9f8a9b07, 0x4579e855, 0x4dca3764, 0x1e580662, 0xb8c8ef49, 0xda92152e, 0x8b54508a, 0x0444085a},
|
||||
{0x34696648, 0x7f670ce1, 0xc05768d9, 0x2f00108f, 0x390fb519, 0x2d00a444, 0x1cd6f914, 0x015c468b},
|
||||
{0xfe46c5f2, 0x00666cbf, 0x9f7174d6, 0xca4051c5, 0x8e4277f4, 0x1629882a, 0x6ee002a3, 0x00b3f261},
|
||||
{0xc1dbb4f6, 0x418a2b86, 0x9a6ca270, 0x9f453ccc, 0x1d457b20, 0x1966471f, 0x80fd1319, 0x00b4d831},
|
||||
{0x1c76c8b1, 0xa12f86a8, 0xc0125e48, 0x2772e424, 0x1459dfb8, 0x8d650644, 0xad06d01c, 0x02128e5c},
|
||||
{0x3472799c, 0xcc8cc7f6, 0x2f511cae, 0xfbd97f95, 0x5ebbff71, 0xadd8818b, 0x09af0983, 0x00520540},
|
||||
{0x8ec654cc, 0xcaab5dd4, 0x17ba15a9, 0xc05ad0a7, 0x36300a00, 0x4bda7469, 0x41bb0610, 0x02e486cd},
|
||||
{0x2d6be8b5, 0x077ba983, 0xfe89eb7d, 0xdd5e728f, 0x63f9c51f, 0xe3c872fb, 0xce639995, 0x01f2f7a8},
|
||||
{0xaa2ea7eb, 0xd82b1599, 0xa16489e0, 0x1be5d254, 0x173d3219, 0x19cb236a, 0x1fe63b23, 0x007dd45f},
|
||||
{0x19dba628, 0xa27cc4d3, 0x5fd2e061, 0xf04ac441, 0x9307a758, 0xc7405333, 0x28c40fe4, 0x0103c707},
|
||||
{0x54662aab, 0xb5129fd1, 0x59158f32, 0x2ec5b69b, 0x12c44eec, 0x6c7e6492, 0xe527abb2, 0x046e7c11},
|
||||
{0xe32d46fe, 0xb9bf4936, 0xb08ef006, 0xf23ae18c, 0xe6a5179e, 0x5352cc59, 0x5bf7c0b8, 0x0753a621},
|
||||
{0x9318db3a, 0x19f65bc2, 0x7e3d0014, 0x93ff3f79, 0x6beb580d, 0xf7f93c7f, 0xddd72603, 0x04fdb898},
|
||||
{0xe184a935, 0xf7e1f88f, 0x1ad510f0, 0x82a0f047, 0x4c9ab6ca, 0xce0f7c44, 0x5104a95a, 0x0552304e},
|
||||
{0x985bba5c, 0x06615580, 0xf487a1fb, 0x8ccd29a8, 0xeecf758d, 0xb3e15ed0, 0x857ce648, 0x05328783},
|
||||
{0x6cb042b0, 0x5d1d5a22, 0x0277083c, 0x64375cf4, 0x5fa82215, 0xe8947dab, 0x86932495, 0x05e72829},
|
||||
{0x8c3e2849, 0x5bf6f46a, 0x4924c8f4, 0x7e40314c, 0xdffd6118, 0x3c74a4ba, 0x2f8de20a, 0x05247cdd},
|
||||
{0xd0042d11, 0x25a418c5, 0x2f7da60c, 0x1b60ee9f, 0x02c0b69f, 0x61c041ad, 0x15670214, 0x0632d33a},
|
||||
{0x90e05a92, 0x32b03a5e, 0x78d1e8d6, 0xfb12a1b1, 0x5bc2f5d5, 0xb8af534e, 0xa032918a, 0x05ab4772},
|
||||
{0x0a711a9d, 0x096878a8, 0x6b083c8c, 0x87d070da, 0x87d06afb, 0x77931578, 0xf3104057, 0x03705277},
|
||||
{0xdf993e46, 0x502d2374, 0x35baf646, 0xc1cd2868, 0xe30aa213, 0xa61b54b6, 0xbce34b74, 0x02511017},
|
||||
{0x90a6b9b9, 0xcfb6c51a, 0x8be6ade8, 0x4e0b29ef, 0xd3832d74, 0xa8292467, 0x41ca1e45, 0x02ce7977},
|
||||
{0x3e672d5b, 0x25ee10aa, 0x28597504, 0xb0e60c63, 0xe263c827, 0x4a8d0567, 0xfadefeba, 0x01f4ec42},
|
||||
{0xa5a26158, 0x8b4b15e0, 0x88a71cf2, 0xa59b2df9, 0x5d734341, 0xde44f2e7, 0x4db8d2e8, 0x007a18a0},
|
||||
{0xb4d18100, 0x30fcf001, 0xf8ae0b4f, 0xcdaa5334, 0xe325615a, 0x67017b2b, 0xf0ccbf57, 0x016c6d47},
|
||||
{0xba937732, 0x66afc115, 0xc20be386, 0x917d4890, 0xa017c59d, 0x5dadccff, 0x986c39c1, 0x043fa44e},
|
||||
{0x08baa72a, 0xc57ec886, 0x052364ed, 0xe65a4680, 0x85f9a523, 0x0536b505, 0xfe744ee2, 0x03580609},
|
||||
{0x1bab1ab8, 0x88109415, 0x62f0fa74, 0x02244b19, 0x915618e0, 0x837fcd10, 0x942f12d2, 0x061b83d0},
|
||||
{0x687b7798, 0x823d0bba, 0x84a49784, 0x5f93174a, 0x2574af37, 0xcfd64159, 0xe108057c, 0x0290722e},
|
||||
{0x58a66036, 0x900a7031, 0x6153c2ae, 0xcb443378, 0xa6ccdffe, 0x4c48b8dd, 0xa06e955a, 0x049a9211},
|
||||
{0xea0b9dd9, 0x1b034532, 0x638c79ec, 0x11cba08f, 0x7c5b2d15, 0x16d00728, 0xbb9a759c, 0x05abcbcd},
|
||||
{0x1552d6af, 0x21b4f60e, 0xbed54865, 0x2f7ea9d2, 0x738befdb, 0x39378802, 0x97845360, 0x02adf76c},
|
||||
{0x4026bb92, 0x6e5eb2ca, 0xcbed5570, 0x18f3d8bf, 0xb655ac26, 0x2a5fc8cd, 0x3809a1c5, 0x0031cd25},
|
||||
{0x0ef5e011, 0x2d698950, 0xc018b82d, 0xc0668c45, 0xf520d325, 0xd180ff47, 0xa38122b1, 0x046714c7},
|
||||
{0x12df2cc7, 0x8dec8a4b, 0x963031f8, 0x5eb84a1b, 0x88525708, 0xb75ad701, 0x07df57bd, 0x02054a99},
|
||||
{0x82b2f616, 0xe0013d43, 0x7b385914, 0x2ad34c97, 0x11108f4b, 0xc9969223, 0x9c9fad59, 0x0183f639},
|
||||
{0x06b4dc38, 0xaca9dfbc, 0x962d5774, 0x85596bbc, 0x22f1cd7d, 0xd7023923, 0x2067b180, 0x04d3c939},
|
||||
{0xe4004173, 0x6d13e6ab, 0xaafe8726, 0x3495d095, 0x33dc3303, 0xa22d3e4a, 0x776d2e14, 0x0276dbb2},
|
||||
{0x68c539b6, 0xa03f83cb, 0x7b42a06e, 0xfd3fa839, 0xe8d45ac3, 0xea0f1f15, 0xa414b012, 0x061adb94},
|
||||
{0xb33fb188, 0xd22fc6e3, 0xf723dc18, 0xbebc7978, 0xf6c99f34, 0xa874b584, 0xf67ff454, 0x049beb53},
|
||||
{0x754bed16, 0x7c247948, 0xe50eac10, 0x4a84bcfb, 0xade97580, 0xc00d65df, 0xca79c5ae, 0x0763d73c},
|
||||
{0x7aadbe1a, 0x696e27af, 0x9d8e2a1f, 0x113535e0, 0x4c011766, 0x6953003f, 0xbb52558c, 0x0498a75f},
|
||||
{0x6e09cee7, 0xcf26e897, 0x299b63c7, 0x813a76f2, 0x0939904c, 0x67c02fa7, 0x7e0b9483, 0x045c41a9},
|
||||
{0x4af5adcc, 0xad979914, 0xc2c7c068, 0x7d9267f9, 0x21b4a0a7, 0xda4fa3f8, 0x3386c423, 0x03f4bcc9},
|
||||
{0xd1228595, 0xe5fcd634, 0x12fc8b7c, 0x5571b994, 0x244857f8, 0xd50dcd33, 0x263b93f0, 0x060dc1d6},
|
||||
{0xfee59c89, 0x7040a236, 0x78ceb168, 0x91a4301b, 0x19cdb36a, 0x973b55bd, 0x71008400, 0x06a1c58e},
|
||||
{0x6af1f351, 0x1d3c7ad7, 0xe8ad24dc, 0x8493c0c1, 0x48d5ffd9, 0x076f9dea, 0x5931555f, 0x00b9b2bf},
|
||||
{0xeaa5731c, 0xa3d54d89, 0xba84ee02, 0xfcc41a45, 0xcc1cdac8, 0x7c828f73, 0x5bfe9d23, 0x009c426b},
|
||||
{0x3f1f352c, 0x36fb314c, 0x9feb1120, 0x750a2a5f, 0xd7b06171, 0x3a2f19e8, 0x3b550cd9, 0x06de1885},
|
||||
{0xb69183f6, 0xefc03237, 0x979ee075, 0xb5a14fc3, 0x2dcb1d51, 0xbf114125, 0xb8eca2d3, 0x062364f7},
|
||||
{0x95375861, 0x575f1ea7, 0x80cc8dba, 0x30608586, 0xcf7a8f9f, 0x2beca9f5, 0x5fe60da4, 0x00dfc078},
|
||||
{0x0f86ded5, 0x312928eb, 0xb9c4f0cc, 0x646f5d3e, 0x2fbf14dd, 0x23c69382, 0xc44caa0e, 0x023aae90},
|
||||
{0x13e16243, 0xa7c92faf, 0x92efd5fc, 0x035a3e75, 0x86a744ea, 0x32f44d08, 0x1ea28333, 0x05b45217},
|
||||
{0xc41fdf22, 0xb557d203, 0x4bbc8f76, 0x9697570c, 0x81eaf742, 0x3a6a2cb5, 0xb0d03a0f, 0x07f2c08a},
|
||||
{0x2a18b73a, 0xca806385, 0xdb6a953d, 0xf2015d6d, 0xba5f67b9, 0x51d21a8e, 0x14807dd6, 0x051439d5},
|
||||
{0xf75051de, 0x7b6e0c13, 0x14dd1aa0, 0x114681fb, 0x0fd95a37, 0x72a1cccc, 0xa39e5bb8, 0x02f29d4c},
|
||||
{0x116529cd, 0x4808a0de, 0x5b941d1c, 0x1cf38580, 0xd70796f7, 0xc96a451e, 0x3f24e64f, 0x016d083f},
|
||||
{0x3cf155ee, 0xc71b78d0, 0x0c361b67, 0x0c04a134, 0x7756e4a9, 0xdb546edc, 0x2988eb2c, 0x03474404},
|
||||
{0xf30cef17, 0x1a0b3585, 0x864abd80, 0x63c1de29, 0xc0687c8e, 0x0c171d6e, 0xc9763a97, 0x0353aec8},
|
||||
{0x94192fb8, 0x0a2c9cff, 0x1a7f5bbf, 0x27320b93, 0xe5ceeb75, 0x465d2f9f, 0xd78f1cc3, 0x07ce6f99},
|
||||
{0xe8d1b26d, 0x0f899233, 0xb87a2984, 0xed4b44d2, 0x0bd6354a, 0x0c0712c6, 0xc7032f5c, 0x01eb2a31},
|
||||
{0x46b03b57, 0xc4c03fbd, 0x785ebbe8, 0x989b0ff3, 0x7f0bcb19, 0x5cada62a, 0xa97557c9, 0x01426410},
|
||||
{0x96fb0a26, 0xf1d2e82b, 0x1edb9ce3, 0xe270bc10, 0xfc7aaed8, 0x9549cfd0, 0xd90d7c9c, 0x03e8256c},
|
||||
{0x43ac9984, 0x14eef0ee, 0xa16d6770, 0x2903ff22, 0xa38fbfc0, 0xc66c2690, 0x8755440e, 0x0032a202},
|
||||
{0xf3601782, 0x46a07cf2, 0xaa71d137, 0x79f410f9, 0x8bcabc59, 0xc320c6f1, 0xf8ab64d8, 0x00a706cf},
|
||||
{0x8dbd8d4f, 0x8848a9f0, 0x0085061d, 0xeff89e69, 0xfee62fbe, 0x90e634a7, 0x2ffb456b, 0x03983046},
|
||||
{0xb272ed5c, 0x91ec28a8, 0xdc0cbb77, 0xf8529918, 0x3648d2c5, 0x8f896ddb, 0x74edaf19, 0x0668a86c},
|
||||
{0x128c9bd9, 0x341d5fc8, 0x6b3241c5, 0x592f87d8, 0xb2cc3c97, 0xf8cba6f2, 0x03f396ed, 0x03463bf1},
|
||||
{0xafd9d239, 0xcf3ae525, 0xea20b753, 0x06b8b7b9, 0x3408a993, 0xb2be1e49, 0x9f47063f, 0x02bcb200},
|
||||
{0xa0bd0bc8, 0x7ca02722, 0xb862774d, 0xce8b32ee, 0x5f8da059, 0x424ba5f0, 0x3bb422a0, 0x05c81961},
|
||||
{0x32fd8907, 0x137dad8c, 0xc95a3a5d, 0x301d5119, 0x8937ac08, 0x144b38c3, 0x39338de7, 0x00e66f0e},
|
||||
{0xcfc10885, 0xe68b8875, 0x96147e68, 0x4f24d49a, 0x43032c15, 0x5da9e6fd, 0x9bf25e12, 0x061ab0e6},
|
||||
{0x455c65ad, 0xeab29bbd, 0x2448be64, 0x1c7da0e7, 0x8eedfa1f, 0x8c2c1bcd, 0x698c1197, 0x0400e2d2},
|
||||
{0x04549c13, 0x335d3e9e, 0xd31585cc, 0x546f0d82, 0xe16dbbac, 0x350d5ed5, 0x113c53fd, 0x05f77544},
|
||||
{0x7d8f3b7e, 0x6aa75c04, 0x10a641ae, 0xc70851dd, 0x9a0750fe, 0x4d33edd4, 0xcd1b230f, 0x022802cf},
|
||||
{0xef8170e3, 0x59fa1903, 0x62995788, 0x464a73ef, 0x13369717, 0x338be7fd, 0x52d21278, 0x02e97589},
|
||||
{0x4856ddd5, 0x3f2deca8, 0xfced10e2, 0x969b10e2, 0x52860ee7, 0x09620dde, 0xb620fa3f, 0x04a169bf},
|
||||
{0xa03b49f1, 0xd9beb712, 0xe9af606e, 0x0798af09, 0x63e70b9a, 0xe37f9aea, 0xb35abd7c, 0x02542a44},
|
||||
{0xf6e78973, 0x335d4000, 0x76f1bb23, 0x7bc28fde, 0x1b30e9ca, 0x6cfdc907, 0x0400b651, 0x03ff88aa},
|
||||
{0x36433eaf, 0xfb862981, 0x4111cfa3, 0x15fdc659, 0xeab2909d, 0x569574b9, 0x3cd80f84, 0x01442360},
|
||||
{0xe85c4af3, 0xa8ed8f31, 0xe6aaf3da, 0xf7680fee, 0xc5c1772c, 0x2240e931, 0xaebeeb70, 0x04f44f6f},
|
||||
{0x8846e0af, 0x29de323f, 0x42c25319, 0x33f91593, 0x6cbadd58, 0x863099c1, 0xfd83e5b3, 0x06a603cf},
|
||||
{0x86c77703, 0x1bdd17f3, 0xe02db671, 0x8cee8e78, 0x0b6dffce, 0xed1627af, 0xa0d9b3cc, 0x04491984},
|
||||
{0xcb583661, 0x177f8f9c, 0x73d05bfc, 0x54122d0c, 0xebe37b4a, 0xa9231660, 0xd4826038, 0x06e885db},
|
||||
{0x13c253b9, 0x64cde875, 0x2fbc98a9, 0x8484bccb, 0x4885a9af, 0xbad877c5, 0x0cbc33b6, 0x03007c90},
|
||||
{0x47cfa357, 0x41eb9173, 0x325309ad, 0xb3f06289, 0xaa85421b, 0x029da7c1, 0x84de4bd4, 0x07b7eb0d},
|
||||
{0x56b831e2, 0x2c459a80, 0x321aba19, 0x2b99d098, 0xea73c0e1, 0x96237364, 0xe25ed0ed, 0x02f2c638},
|
||||
{0x9b388bf4, 0xfc8c3228, 0x82cd081d, 0xa4c371e4, 0xc85f75df, 0x11239026, 0x8892896e, 0x01f01c5e},
|
||||
{0x73457917, 0xce1dde59, 0x16dd8b49, 0xdfdaeb19, 0xbfd17b1e, 0x4289a976, 0xc842870a, 0x05e2cf7e},
|
||||
{0xc7705532, 0x72faa825, 0x8f7fe8c2, 0xd24bf942, 0xb695e31b, 0xb7403e13, 0xfc85a0c6, 0x02eac9e7},
|
||||
{0x1ddb2dff, 0xc47638e3, 0x799bb649, 0x78b91a13, 0x552588ed, 0x001800de, 0x9cd9425c, 0x01d0640c},
|
||||
{0xfb431e10, 0x159891e7, 0xa012b461, 0x2f2fb29a, 0xb3333e5d, 0xc1dca804, 0x9a47200d, 0x05b918ec},
|
||||
{0x2d5ce760, 0x379119b5, 0xda2ccdab, 0xf9911f75, 0x47b5c054, 0x92b09490, 0x7298d065, 0x0742a31e},
|
||||
{0x4a73d1f1, 0xe2a1046b, 0xc6ab4d9c, 0xbc85a747, 0xba0701f8, 0x79b0e699, 0xeebc6762, 0x05e5c2cb},
|
||||
{0xe0c0db50, 0xdc644b37, 0x2b8444d2, 0x26f7f083, 0x63479a84, 0x90acf2e7, 0x90ffe372, 0x0590d880},
|
||||
{0x83c0fc9c, 0x3dd1aba4, 0xcfb43020, 0x30a1051f, 0xaf5be716, 0x7d1ca380, 0x1ed8aed9, 0x01d56947},
|
||||
{0x0fa23690, 0x657df8c4, 0x32111be3, 0x61a12fe4, 0xe78236c9, 0xd6cc9942, 0x85e66191, 0x01709635},
|
||||
{0xc6a054f0, 0x96bf35ed, 0x004113cc, 0x9d1e411a, 0x1ac7a3ec, 0xccdb9bc3, 0xd08016b8, 0x07362425},
|
||||
{0x9721b035, 0x72744cce, 0x0beb72e3, 0xb87eb606, 0x60870c2e, 0x00c5e70c, 0x685d7c14, 0x029fa4d3},
|
||||
{0x86e52af4, 0x06d3a7a3, 0x70020878, 0x7b1c814a, 0x52e68007, 0x44373cb7, 0xe403540f, 0x041cf8c0},
|
||||
{0x76a27949, 0xd5dbc8bf, 0x27d9cd12, 0xb41449bc, 0xa7a667a1, 0x93740020, 0x0fbb4e77, 0x000bf807},
|
||||
{0x9969cfe9, 0x274ce281, 0x259ec27c, 0x3234d283, 0xe0b44f04, 0x9ff85b71, 0xffcc1006, 0x0298d060},
|
||||
{0x68ab54f8, 0x5cd8b289, 0x437eaab8, 0x42e3877f, 0x9318bd3e, 0x6490dc61, 0x4e54d968, 0x075b01f3},
|
||||
{0x7b64243c, 0x73100d65, 0x5c802f82, 0x692378be, 0x88184c0c, 0x00283dbb, 0xab6f4f0e, 0x0442efad},
|
||||
{0x72015722, 0xbe83b708, 0xe1cdcf0e, 0x2035319f, 0x398347da, 0x2b1b3351, 0x1a14b8dc, 0x061823d8},
|
||||
{0x378d9803, 0x1090948c, 0x4725c64b, 0x61a558cc, 0x7d7fcd91, 0x9e5bd3b5, 0x57ebda25, 0x061e02a0},
|
||||
{0xf8324dc8, 0x166b4a3c, 0x38133fda, 0xa25b9d11, 0x917171a5, 0x9d602950, 0x417d104e, 0x0632e48b},
|
||||
{0x6a61d5e0, 0x03b9f1b9, 0xe59cfbb7, 0xd906b740, 0x7892fbe4, 0x99a93267, 0xad1b8171, 0x06ddc2a6},
|
||||
{0x67fc3874, 0x6ae4355d, 0xb1ada695, 0x4fa456d8, 0x9f91ac43, 0x4e234065, 0x829d173e, 0x028da309},
|
||||
{0xfc695c2c, 0x1e08dd18, 0xfa687112, 0x1c0a2fad, 0xffd6302a, 0xeb5ebf01, 0xfd1d10f5, 0x012fd387},
|
||||
{0x236e65c9, 0x0b907f2e, 0xb1281d54, 0x92ba7a15, 0xc13f1d75, 0x07f0a6ad, 0xcd6d1e9c, 0x05dfe4e3},
|
||||
{0xc45f33f8, 0xd99cc41a, 0xd373165c, 0xc1c10a71, 0x2ce2936a, 0x6c809230, 0xa0498cf5, 0x018dc832},
|
||||
{0x7b222ad8, 0x8e881eab, 0xb6194efb, 0xc8b48774, 0x963c6b6b, 0x38452dfd, 0xe4c4e0f8, 0x02847f5a},
|
||||
{0x2bf4ad95, 0x2950bb4a, 0xdc39ffb0, 0x37f42c9b, 0x101253a8, 0x3814fa42, 0xb67f2ca5, 0x04d4a34c},
|
||||
{0xa9684ba0, 0x6c40fece, 0x3b13bca4, 0xc7108aad, 0xe7bff9be, 0x98ccc7ea, 0xe9b3b316, 0x048b3a6a},
|
||||
{0x08390a2b, 0x4d908260, 0x74b070bc, 0xd5a641d0, 0x910015c5, 0xc3b19274, 0xd5a998a7, 0x02ac8e74},
|
||||
{0x9698d605, 0x8de03acc, 0xa4c9137f, 0x3b8b720c, 0x354faf46, 0x5bbad6e4, 0xfd9e842f, 0x0054c120},
|
||||
{0xd65aead5, 0x305fa33f, 0x0fe296f9, 0xba02b164, 0x708efc94, 0x64cba43c, 0x8ad7f0ef, 0x034b9ffe},
|
||||
{0x13c2e8f4, 0x59e1179e, 0xc572f8a8, 0x5d823d59, 0x74003bce, 0x0cfdb6ee, 0x011c179e, 0x00763941},
|
||||
{0xa47999a8, 0x29b692ee, 0xbfcd80d8, 0x6436c3f1, 0x959768d7, 0x553444f3, 0x583896d4, 0x01d45a26},
|
||||
{0xc150b3f8, 0x0ce0791d, 0xf493c135, 0x7d3a0c1f, 0x5ede0712, 0x4d37cc23, 0x34fbae9c, 0x036a6a38},
|
||||
{0x2ca1eb78, 0xa8ee8204, 0x66d8b759, 0xc713a1dc, 0xac061800, 0x1813508d, 0x3b1f0da2, 0x05725ca0},
|
||||
{0xf2f391c1, 0xbe6826df, 0x232878f0, 0xeb85b046, 0xf7e1d662, 0xf5a96510, 0xe38c2b64, 0x0419a43b},
|
||||
{0xe69e791b, 0x4b54889b, 0xb5c95ea5, 0xb371eeb0, 0x0b2f26a3, 0x9f53ccca, 0x66f45f71, 0x0040592d},
|
||||
{0xad2e5d5b, 0x4ced12db, 0x0987b849, 0x5f57b16d, 0xd9ec045b, 0xcab0e2e9, 0x6cfbf4df, 0x03e4e405},
|
||||
{0x3ecb72a4, 0xd71a1eee, 0x03a13fb7, 0x6bd9f7ec, 0x5877c6c7, 0xb74a54c8, 0xa28236a5, 0x0377689b},
|
||||
{0x74b3354c, 0x6f558a20, 0x3f776b18, 0xb67f6d10, 0x01165ed8, 0x8c447df2, 0xf3889308, 0x056b8991},
|
||||
{0x0d306b7a, 0x9482eb10, 0xd441cd03, 0xdd738e0f, 0x2de5dfd7, 0x6d186de5, 0x75fd1833, 0x00781b3e},
|
||||
{0x77ec28e5, 0xdbc14748, 0xd26e050c, 0x02ceee41, 0x18457c96, 0x8e5aef74, 0x1823c60f, 0x0461a6e2},
|
||||
{0x2be17c8b, 0x172e551d, 0x49c6a7b8, 0x90e25fa2, 0xa1b3478f, 0x6219e63e, 0xd063a517, 0x00c412f8},
|
||||
{0x65a9b68e, 0xb136b848, 0x673c6cbc, 0x9a9b7169, 0xf8ec7473, 0x15fa1875, 0x3033a5d6, 0x022d72f6}}};
static constexpr storage_array<omegas_count, limbs_count> inv = {
|
||||
{{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x80000000, 0x00000008, 0x04000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xc0000000, 0x0000000c, 0x06000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xe0000000, 0x0000000e, 0x07000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xf0000000, 0x0000000f, 0x07800000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x78000000, 0x00000010, 0x07c00000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xbc000000, 0x00000010, 0x07e00000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xde000000, 0x00000010, 0x07f00000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xef000000, 0x00000010, 0x07f80000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xf7800000, 0x00000010, 0x07fc0000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xfbc00000, 0x00000010, 0x07fe0000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xfde00000, 0x00000010, 0x07ff0000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xfef00000, 0x00000010, 0x07ff8000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xff780000, 0x00000010, 0x07ffc000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xffbc0000, 0x00000010, 0x07ffe000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xffde0000, 0x00000010, 0x07fff000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xffef0000, 0x00000010, 0x07fff800},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xfff78000, 0x00000010, 0x07fffc00},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xfffbc000, 0x00000010, 0x07fffe00},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xfffde000, 0x00000010, 0x07ffff00},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xfffef000, 0x00000010, 0x07ffff80},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xffff7800, 0x00000010, 0x07ffffc0},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xffffbc00, 0x00000010, 0x07ffffe0},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xffffde00, 0x00000010, 0x07fffff0},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xffffef00, 0x00000010, 0x07fffff8},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xfffff780, 0x00000010, 0x07fffffc},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xfffffbc0, 0x00000010, 0x07fffffe},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xfffffde0, 0x00000010, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xfffffef0, 0x80000010, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xffffff78, 0xc0000010, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xffffffbc, 0xe0000010, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xffffffde, 0xf0000010, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xffffffef, 0xf8000010, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x80000000, 0xfffffff7, 0xfc000010, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xc0000000, 0xfffffffb, 0xfe000010, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xe0000000, 0xfffffffd, 0xff000010, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xf0000000, 0xfffffffe, 0xff800010, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x78000000, 0xffffffff, 0xffc00010, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xbc000000, 0xffffffff, 0xffe00010, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xde000000, 0xffffffff, 0xfff00010, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xef000000, 0xffffffff, 0xfff80010, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xf7800000, 0xffffffff, 0xfffc0010, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xfbc00000, 0xffffffff, 0xfffe0010, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xfde00000, 0xffffffff, 0xffff0010, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xfef00000, 0xffffffff, 0xffff8010, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xff780000, 0xffffffff, 0xffffc010, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xffbc0000, 0xffffffff, 0xffffe010, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xffde0000, 0xffffffff, 0xfffff010, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xffef0000, 0xffffffff, 0xfffff810, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xfff78000, 0xffffffff, 0xfffffc10, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xfffbc000, 0xffffffff, 0xfffffe10, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xfffde000, 0xffffffff, 0xffffff10, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xfffef000, 0xffffffff, 0xffffff90, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xffff7800, 0xffffffff, 0xffffffd0, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xffffbc00, 0xffffffff, 0xfffffff0, 0x07ffffff},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xffffde00, 0xffffffff, 0x00000000, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xffffef00, 0xffffffff, 0x00000008, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xfffff780, 0xffffffff, 0x0000000c, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xfffffbc0, 0xffffffff, 0x0000000e, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xfffffde0, 0xffffffff, 0x0000000f, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xfffffef0, 0x7fffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xffffff78, 0xbfffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xffffffbc, 0xdfffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xffffffde, 0xefffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x00000000, 0xffffffef, 0xf7ffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x80000000, 0xfffffff7, 0xfbffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xc0000000, 0xfffffffb, 0xfdffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xe0000000, 0xfffffffd, 0xfeffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xf0000000, 0xfffffffe, 0xff7fffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0x78000000, 0xffffffff, 0xffbfffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xbc000000, 0xffffffff, 0xffdfffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xde000000, 0xffffffff, 0xffefffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xef000000, 0xffffffff, 0xfff7ffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xf7800000, 0xffffffff, 0xfffbffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xfbc00000, 0xffffffff, 0xfffdffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xfde00000, 0xffffffff, 0xfffeffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xfef00000, 0xffffffff, 0xffff7fff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xff780000, 0xffffffff, 0xffffbfff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xffbc0000, 0xffffffff, 0xffffdfff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xffde0000, 0xffffffff, 0xffffefff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xffef0000, 0xffffffff, 0xfffff7ff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xfff78000, 0xffffffff, 0xfffffbff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xfffbc000, 0xffffffff, 0xfffffdff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xfffde000, 0xffffffff, 0xfffffeff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xfffef000, 0xffffffff, 0xffffff7f, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xffff7800, 0xffffffff, 0xffffffbf, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xffffbc00, 0xffffffff, 0xffffffdf, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xffffde00, 0xffffffff, 0xffffffef, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xffffef00, 0xffffffff, 0xfffffff7, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xfffff780, 0xffffffff, 0xfffffffb, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xfffffbc0, 0xffffffff, 0xfffffffd, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xfffffde0, 0xffffffff, 0xfffffffe, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xfffffef0, 0x7fffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xffffff78, 0xbfffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xffffffbc, 0xdfffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xffffffde, 0xefffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x00000000, 0xffffffef, 0xf7ffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x80000000, 0xfffffff7, 0xfbffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xc0000000, 0xfffffffb, 0xfdffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xe0000000, 0xfffffffd, 0xfeffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xf0000000, 0xfffffffe, 0xff7fffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0x78000000, 0xffffffff, 0xffbfffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xbc000000, 0xffffffff, 0xffdfffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xde000000, 0xffffffff, 0xffefffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xef000000, 0xffffffff, 0xfff7ffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xf7800000, 0xffffffff, 0xfffbffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xfbc00000, 0xffffffff, 0xfffdffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xfde00000, 0xffffffff, 0xfffeffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xfef00000, 0xffffffff, 0xffff7fff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xff780000, 0xffffffff, 0xffffbfff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xffbc0000, 0xffffffff, 0xffffdfff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xffde0000, 0xffffffff, 0xffffefff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xffef0000, 0xffffffff, 0xfffff7ff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xfff78000, 0xffffffff, 0xfffffbff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xfffbc000, 0xffffffff, 0xfffffdff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xfffde000, 0xffffffff, 0xfffffeff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xfffef000, 0xffffffff, 0xffffff7f, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xffff7800, 0xffffffff, 0xffffffbf, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xffffbc00, 0xffffffff, 0xffffffdf, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xffffde00, 0xffffffff, 0xffffffef, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xffffef00, 0xffffffff, 0xfffffff7, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xfffff780, 0xffffffff, 0xfffffffb, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xfffffbc0, 0xffffffff, 0xfffffffd, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xfffffde0, 0xffffffff, 0xfffffffe, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xfffffef0, 0x7fffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xffffff78, 0xbfffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xffffffbc, 0xdfffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xffffffde, 0xefffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x00000000, 0xffffffef, 0xf7ffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x80000000, 0xfffffff7, 0xfbffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xc0000000, 0xfffffffb, 0xfdffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xe0000000, 0xfffffffd, 0xfeffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xf0000000, 0xfffffffe, 0xff7fffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0x78000000, 0xffffffff, 0xffbfffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xbc000000, 0xffffffff, 0xffdfffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xde000000, 0xffffffff, 0xffefffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xef000000, 0xffffffff, 0xfff7ffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xf7800000, 0xffffffff, 0xfffbffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xfbc00000, 0xffffffff, 0xfffdffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xfde00000, 0xffffffff, 0xfffeffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xfef00000, 0xffffffff, 0xffff7fff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xff780000, 0xffffffff, 0xffffbfff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xffbc0000, 0xffffffff, 0xffffdfff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xffde0000, 0xffffffff, 0xffffefff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xffef0000, 0xffffffff, 0xfffff7ff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xfff78000, 0xffffffff, 0xfffffbff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xfffbc000, 0xffffffff, 0xfffffdff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xfffde000, 0xffffffff, 0xfffffeff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xfffef000, 0xffffffff, 0xffffff7f, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xffff7800, 0xffffffff, 0xffffffbf, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xffffbc00, 0xffffffff, 0xffffffdf, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xffffde00, 0xffffffff, 0xffffffef, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xffffef00, 0xffffffff, 0xfffffff7, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xfffff780, 0xffffffff, 0xfffffffb, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xfffffbc0, 0xffffffff, 0xfffffffd, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xfffffde0, 0xffffffff, 0xfffffffe, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xfffffef0, 0x7fffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xffffff78, 0xbfffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xffffffbc, 0xdfffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xffffffde, 0xefffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x00000001, 0xffffffef, 0xf7ffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x80000001, 0xfffffff7, 0xfbffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xc0000001, 0xfffffffb, 0xfdffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xe0000001, 0xfffffffd, 0xfeffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xf0000001, 0xfffffffe, 0xff7fffff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0x78000001, 0xffffffff, 0xffbfffff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xbc000001, 0xffffffff, 0xffdfffff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xde000001, 0xffffffff, 0xffefffff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xef000001, 0xffffffff, 0xfff7ffff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xf7800001, 0xffffffff, 0xfffbffff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xfbc00001, 0xffffffff, 0xfffdffff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xfde00001, 0xffffffff, 0xfffeffff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xfef00001, 0xffffffff, 0xffff7fff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xff780001, 0xffffffff, 0xffffbfff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xffbc0001, 0xffffffff, 0xffffdfff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xffde0001, 0xffffffff, 0xffffefff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xffef0001, 0xffffffff, 0xfffff7ff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xfff78001, 0xffffffff, 0xfffffbff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xfffbc001, 0xffffffff, 0xfffffdff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xfffde001, 0xffffffff, 0xfffffeff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xfffef001, 0xffffffff, 0xffffff7f, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xffff7801, 0xffffffff, 0xffffffbf, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xffffbc01, 0xffffffff, 0xffffffdf, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xffffde01, 0xffffffff, 0xffffffef, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xffffef01, 0xffffffff, 0xfffffff7, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xfffff781, 0xffffffff, 0xfffffffb, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xfffffbc1, 0xffffffff, 0xfffffffd, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xfffffde1, 0xffffffff, 0xfffffffe, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xfffffef1, 0x7fffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xffffff79, 0xbfffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xffffffbd, 0xdfffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xffffffdf, 0xefffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000},
|
||||
{0xfffffff0, 0xf7ffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0x00000010, 0x08000000}}};
};

/**
 * Scalar field. Is always a prime field.
 */
typedef Field<fp_config> scalar_t;
} // namespace stark252
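The tables above store each field element as eight 32-bit limbs. A minimal standalone sketch follows; it is not part of the header itself, and it assumes the limbs are little-endian (least-significant word first), which is consistent with the leading entries of omega_inv and inv. Decoding inv[0] this way yields (p + 1) / 2 for the STARK-252 prime p = 2^251 + 17 * 2^192 + 1, i.e. the modular inverse of 2, consistent with an iNTT scaling table.

  // Hypothetical verification helper, not icicle code: reconstruct inv[0] from its limbs.
  #include <cstdint>
  #include <cstdio>

  int main()
  {
    const uint32_t inv0[8] = {0x00000001, 0x00000000, 0x00000000, 0x00000000,
                              0x00000000, 0x80000000, 0x00000008, 0x04000000};
    std::printf("0x");
    for (int i = 7; i >= 0; --i) // print the most-significant limb first
      std::printf("%08x", inv0[i]);
    // prints 0x0400000000000008800000000000000000000000000000000000000000000001 == (p + 1) / 2
    std::printf("\n");
    return 0;
  }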
@@ -1,20 +1,20 @@
#pragma once
#ifndef DEVICE_CONTEXT_H
#define DEVICE_CONTEXT_H

#include <cuda_runtime.h>
#include <cstddef>

namespace device_context {

constexpr std::size_t MAX_DEVICES = 32;
size_t MAX_DEVICES = 32;

/**
 * Properties of the device used in icicle functions.
 */
struct DeviceContext {
  cudaStream_t& stream; /**< Stream to use. Default value: 0. */
  int stream;           /**< Stream to use. Default value: 0. */
  std::size_t device_id; /**< Index of the currently used GPU. Default value: 0. */
  cudaMemPool_t mempool; /**< Mempool to use. Default value: 0. */
  int mempool;           /**< Mempool to use. Default value: 0. */
};
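For illustration only, the context can also be built by aggregate initialization; the sketch below assumes the mocked variant of the struct shown in this diff, where stream and mempool are plain ints (in the CUDA-backed variant, stream is a cudaStream_t reference and cannot be initialized from a literal):

  device_context::DeviceContext ctx{/*stream=*/0, /*device_id=*/0, /*mempool=*/0};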

/**
@@ -22,9 +22,9 @@ namespace device_context {
 */
inline DeviceContext get_default_device_context() // TODO: naming convention ?
{
  static cudaStream_t default_stream = (cudaStream_t)0;
  static int default_stream = 0;
  return DeviceContext{
    (cudaStream_t&)default_stream, // stream
    default_stream, // stream
    0,              // device_id
    0,              // mempool
};
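A short usage sketch; only the type and function names come from the header above, the call site itself is hypothetical:

  // Obtain the default context and hand it to whichever icicle API expects one.
  device_context::DeviceContext ctx = device_context::get_default_device_context();
  // Under the mocked backend ctx.stream, ctx.device_id and ctx.mempool are all zero.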
@@ -3,12 +3,10 @@
#define ERR_H

#include <iostream>

#include <cuda_runtime.h>
#include <stdexcept>
#include <string>

enum class IcicleError_t {
enum IcicleError_t {
  IcicleSuccess = 0,
  InvalidArgument = 1,
  MemoryAllocationError = 2,
@@ -38,14 +36,14 @@ private:

public:
  // Constructor for cudaError_t with optional message
  IcicleError(cudaError_t cudaError, const std::string& msg = "")
      : std::runtime_error("CUDA Error: " + std::string(cudaGetErrorString(cudaError)) + " " + msg),
  IcicleError(int cudaError, const std::string& msg = "")
      : std::runtime_error("Error: " + msg),
        errCode(static_cast<int>(cudaError))
  {
  }

  // Constructor for cudaError_t with const char* message
  IcicleError(cudaError_t cudaError, const char* msg) : IcicleError(cudaError, std::string(msg)) {}
  IcicleError(int cudaError, const char* msg) : IcicleError(cudaError, std::string(msg)) {}

  // Constructor for IcicleError_t with optional message
  IcicleError(IcicleError_t icicleError, const std::string& msg = "")
@@ -67,11 +65,10 @@ public:
#define CHK_LOG(val) check((val), #val, __FILE__, __LINE__)
#define CHK_VAL(val, file, line) check((val), #val, file, line)

cudaError_t inline check(cudaError_t err, const char* const func, const char* const file, const int line)
int inline check(int err, const char* const func, const char* const file, const int line)
{
  if (err != cudaSuccess) {
  if (err != 0) {
    std::cerr << "CUDA Runtime Error by: " << func << " at: " << file << ":" << line << std::endl;
    std::cerr << cudaGetErrorString(err) << std::endl << std::endl;
  }

  return err;
@@ -90,12 +87,12 @@ cudaError_t inline check(cudaError_t err, const char* const func, const char* co
#define THROW_ICICLE_CUDA(val) throwIcicleCudaErr(val, __FUNCTION__, __FILE__, __LINE__)
#define THROW_ICICLE_CUDA_ERR(val, func, file, line) throwIcicleCudaErr(val, func, file, line)
void inline throwIcicleCudaErr(
  cudaError_t err, const char* const func, const char* const file, const int line, bool isUnrecoverable = true)
  int err, const char* const func, const char* const file, const int line, bool isUnrecoverable = true)
{
  // TODO: fmt::format introduced only in C++20
  std::string err_msg = (isUnrecoverable ? "!!!Unrecoverable!!! : " : "") + std::string{cudaGetErrorString(err)} +
                        " : detected by: " + func + " at: " + file + ":" + std::to_string(line) +
                        "\nThe error is reported there and may be caused by prior calls.\n";
  std::string err_msg = (isUnrecoverable ? "!!!Unrecoverable!!! : " : "");
  // + " : detected by: " + func + " at: " + file + ":" + std::to_string(line) +
  // "\nThe error is reported there and may be caused by prior calls.\n";
  std::cerr << err_msg << std::endl; // TODO: Logging
  throw IcicleError{err, err_msg};
}
@@ -111,14 +108,14 @@ void inline throwIcicleErr(
  throw IcicleError{err, err_msg};
}
|
||||
|
||||
cudaError_t inline checkCudaErrorIsSticky(
|
||||
cudaError_t err, const char* const func, const char* const file, const int line, bool isThrowing = true)
|
||||
int inline checkCudaErrorIsSticky(
|
||||
int err, const char* const func, const char* const file, const int line, bool isThrowing = true)
|
||||
{
|
||||
if (err != cudaSuccess) {
|
||||
if (err != 0) {
|
||||
// check for sticky (unrecoverable) error when the only option is to restart process
|
||||
cudaError_t err2 = cudaDeviceSynchronize();
|
||||
int err2 = 0;
|
||||
bool is_logged;
|
||||
if (err2 != cudaSuccess) { // we suspect sticky error
|
||||
if (err2 != 0) { // we suspect sticky error
|
||||
if (err != err2) {
|
||||
is_logged = true;
|
||||
CHK_ERR(err, func, file, line);
|
||||
@@ -139,13 +136,13 @@ cudaError_t inline checkCudaErrorIsSticky(
|
||||
// most common macros to use
|
||||
#define CHK_INIT_IF_RETURN() \
|
||||
{ \
|
||||
cudaError_t err_result = CHK_LAST(); \
|
||||
int err_result = CHK_LAST(); \
|
||||
if (err_result != cudaSuccess) return err_result; \
|
||||
}
|
||||
|
||||
#define CHK_IF_RETURN(val) \
|
||||
{ \
|
||||
cudaError_t err_result = CHK_STICKY(val); \
|
||||
int err_result = CHK_STICKY(val); \
|
||||
if (err_result != cudaSuccess) return err_result; \
|
||||
}
|
||||
|
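A minimal sketch of how the int-based error scheme above is meant to be consumed once cudaError_t becomes a plain int (0 standing in for cudaSuccess). The function names step_a/step_b/run_two_steps are illustrative only; the early-return shape mirrors CHK_IF_RETURN without relying on the macros, and the throw mirrors what throwIcicleCudaErr does with IcicleError.

#include <stdexcept>
#include <string>

static int step_a() { return 0; } // 0 plays the role of cudaSuccess in the mocked backend
static int step_b() { return 2; } // nonzero == failure, e.g. MemoryAllocationError

int run_two_steps()
{
  int err = step_a();
  if (err != 0) return err; // same early-return shape as CHK_IF_RETURN
  err = step_b();
  if (err != 0) {
    // or escalate, as throwIcicleCudaErr does, by throwing an exception carrying the code
    throw std::runtime_error("backend error code " + std::to_string(err));
  }
  return 0;
}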
||||
|
||||
@@ -6,6 +6,6 @@
|
||||
#define UNROLL #pragma unroll
|
||||
#endif
|
||||
|
||||
#define HOST_INLINE __host__ INLINE_MACRO
|
||||
#define DEVICE_INLINE __device__ INLINE_MACRO
|
||||
#define HOST_DEVICE_INLINE __host__ __device__ INLINE_MACRO
|
||||
// #define __host__ INLINE_MACRO
|
||||
// #define INLINE_MACRO
|
||||
// #define __host__ INLINE_MACRO
|
||||
|
||||
@@ -24,7 +24,7 @@
|
||||
* definitions.
|
||||
*
|
||||
* To use dynamically allocated shared memory in a templatized __global__ or
|
||||
* __device__ function, just replace code like this:
|
||||
* function, just replace code like this:
|
||||
*
|
||||
* <pre>
|
||||
* template<class T>
|
||||
@@ -32,7 +32,7 @@
|
||||
* foo( T* d_out, T* d_in)
|
||||
* {
|
||||
* // Shared mem size is determined by the host app at run time
|
||||
* extern __shared__ T sdata[];
|
||||
* T sdata[];
|
||||
* ...
|
||||
* doStuff(sdata);
|
||||
* ...
|
||||
@@ -62,7 +62,7 @@
|
||||
*
|
||||
* This struct uses template specialization on the type \a T to declare
|
||||
* a differently named dynamic shared memory array for each type
|
||||
* (\code extern __shared__ T s_type[] \endcode).
|
||||
* (\code T s_type[] \endcode).
|
||||
*
|
||||
* Currently there are specializations for the following types:
|
||||
* \c int, \c uint, \c char, \c uchar, \c short, \c ushort, \c long,
|
||||
@@ -73,11 +73,10 @@ template <typename T>
|
||||
struct SharedMemory {
|
||||
//! @brief Return a pointer to the runtime-sized shared memory array.
|
||||
//! @returns Pointer to runtime-sized shared memory array
|
||||
__device__ T* getPointer()
|
||||
T* getPointer()
|
||||
{
|
||||
// extern __device__ void Error_UnsupportedType(); // Ensure that we won't compile any un-specialized types
|
||||
// Error_UnsupportedType();
|
||||
return (T*)0;
|
||||
T* a = nullptr; // Initialize pointer to nullptr or allocate memory as needed
|
||||
return a;
|
||||
}
|
||||
// TODO: Use operator overloading to make this class look like a regular array
|
||||
};
|
||||
@@ -88,129 +87,128 @@ struct SharedMemory {
|
||||
|
||||
template <>
|
||||
struct SharedMemory<int> {
|
||||
__device__ int* getPointer()
|
||||
int* getPointer()
|
||||
{
|
||||
extern __shared__ int s_int[];
|
||||
return s_int;
|
||||
return 0;
|
||||
}
|
||||
};
|
||||
|
||||
template <>
|
||||
struct SharedMemory<unsigned int> {
|
||||
__device__ unsigned int* getPointer()
|
||||
unsigned int* getPointer()
|
||||
{
|
||||
extern __shared__ unsigned int s_uint[];
|
||||
return s_uint;
|
||||
return 0;
|
||||
}
|
||||
};
|
||||
|
||||
template <>
|
||||
struct SharedMemory<char> {
|
||||
__device__ char* getPointer()
|
||||
char* getPointer()
|
||||
{
|
||||
extern __shared__ char s_char[];
|
||||
return s_char;
|
||||
char *a = nullptr;
|
||||
return a;
|
||||
}
|
||||
};
|
||||
|
||||
template <>
|
||||
struct SharedMemory<unsigned char> {
|
||||
__device__ unsigned char* getPointer()
|
||||
unsigned char* getPointer()
|
||||
{
|
||||
extern __shared__ unsigned char s_uchar[];
|
||||
return s_uchar;
|
||||
unsigned char* a = nullptr;
|
||||
return a;
|
||||
}
|
||||
};
|
||||
|
||||
template <>
|
||||
struct SharedMemory<short> {
|
||||
__device__ short* getPointer()
|
||||
short* getPointer()
|
||||
{
|
||||
extern __shared__ short s_short[];
|
||||
return s_short;
|
||||
short* a = nullptr;
|
||||
return a;
|
||||
}
|
||||
};
|
||||
|
||||
template <>
|
||||
struct SharedMemory<unsigned short> {
|
||||
__device__ unsigned short* getPointer()
|
||||
unsigned short* getPointer()
|
||||
{
|
||||
extern __shared__ unsigned short s_ushort[];
|
||||
return s_ushort;
|
||||
unsigned short* a = nullptr;
|
||||
return a;
|
||||
}
|
||||
};
|
||||
|
||||
template <>
|
||||
struct SharedMemory<long> {
|
||||
__device__ long* getPointer()
|
||||
long* getPointer()
|
||||
{
|
||||
extern __shared__ long s_long[];
|
||||
long *s_long = nullptr;
|
||||
return s_long;
|
||||
}
|
||||
};
|
||||
|
||||
template <>
|
||||
struct SharedMemory<unsigned long> {
|
||||
__device__ unsigned long* getPointer()
|
||||
unsigned long* getPointer()
|
||||
{
|
||||
extern __shared__ unsigned long s_ulong[];
|
||||
unsigned long *s_ulong = nullptr;
|
||||
return s_ulong;
|
||||
}
|
||||
};
|
||||
|
||||
template <>
|
||||
struct SharedMemory<long long> {
|
||||
__device__ long long* getPointer()
|
||||
long long* getPointer()
|
||||
{
|
||||
extern __shared__ long long s_longlong[];
|
||||
long long *s_longlong;
|
||||
return s_longlong;
|
||||
}
|
||||
};
|
||||
|
||||
template <>
|
||||
struct SharedMemory<unsigned long long> {
|
||||
__device__ unsigned long long* getPointer()
|
||||
unsigned long long* getPointer()
|
||||
{
|
||||
extern __shared__ unsigned long long s_ulonglong[];
|
||||
unsigned long long *s_ulonglong;
|
||||
return s_ulonglong;
|
||||
}
|
||||
};
|
||||
|
||||
template <>
|
||||
struct SharedMemory<bool> {
|
||||
__device__ bool* getPointer()
|
||||
bool* getPointer()
|
||||
{
|
||||
extern __shared__ bool s_bool[];
|
||||
bool *s_bool;
|
||||
return s_bool;
|
||||
}
|
||||
};
|
||||
|
||||
template <>
|
||||
struct SharedMemory<float> {
|
||||
__device__ float* getPointer()
|
||||
float* getPointer()
|
||||
{
|
||||
extern __shared__ float s_float[];
|
||||
float *s_float;
|
||||
return s_float;
|
||||
}
|
||||
};
|
||||
|
||||
template <>
|
||||
struct SharedMemory<double> {
|
||||
__device__ double* getPointer()
|
||||
double* getPointer()
|
||||
{
|
||||
extern __shared__ double s_double[];
|
||||
double *s_double;
|
||||
return s_double;
|
||||
}
|
||||
};
|
||||
|
||||
template <>
|
||||
struct SharedMemory<uchar4> {
|
||||
__device__ uchar4* getPointer()
|
||||
{
|
||||
extern __shared__ uchar4 s_uchar4[];
|
||||
return s_uchar4;
|
||||
}
|
||||
};
|
||||
|
||||
// template <>
|
||||
// struct SharedMemory<uchar4> {
|
||||
// uchar4* getPointer()
|
||||
// {
|
||||
// uchar4 *s_uchar4;
|
||||
// return s_uchar4;
|
||||
// }
|
||||
// };
|
||||
|
||||
#endif //_SHAREDMEM_H_
|
||||
|
||||
|
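The comment block above describes the intended usage pattern for SharedMemory<T>; here is a compact sketch of that pattern against the CPU-mocked struct. On this branch getPointer() returns a null/zero pointer, so the example only illustrates the shape of the call, not a working shared-memory buffer.

// Instead of `extern __shared__ T sdata[];` inside a templated kernel,
// the kernel obtains the array through the per-type specialization:
template <class T>
void foo(T* d_out, const T* d_in, int n)
{
  SharedMemory<T> smem;
  T* sdata = smem.getPointer(); // dynamic shared memory on a real GPU; stubbed here
  (void)sdata;                  // placeholder: the mocked pointer is not dereferenced
  for (int i = 0; i < n; ++i)
    d_out[i] = d_in[i];
}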
||||
@@ -3,9 +3,9 @@
|
||||
#define KECCAK_H
|
||||
|
||||
#include <cstdint>
|
||||
#include "gpu-utils/device_context.cuh"
|
||||
#include "gpu-utils/error_handler.cuh"
|
||||
|
||||
#include "../../gpu-utils/device_context.cuh"
|
||||
#include "../../gpu-utils/error_handler.cuh"
|
||||
typedef int cudaError_t;
|
||||
namespace keccak {
|
||||
/**
|
||||
* @struct KeccakConfig
|
||||
@@ -50,7 +50,7 @@ namespace keccak {
|
||||
*/
|
||||
template <int C, int D>
|
||||
cudaError_t
|
||||
keccak_hash(uint8_t* input, int input_block_size, int number_of_blocks, uint8_t* output, KeccakConfig config);
|
||||
keccak_hash(uint8_t* input, int input_block_size, int number_of_blocks, uint8_t* output, KeccakConfig& config);
|
||||
} // namespace keccak
|
||||
|
||||
#endif
|
||||
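A hedged call-shape sketch for the by-reference KeccakConfig signature above. default_keccak_config() is the helper used by the test program later in this diff (its exact namespace qualification is assumed here), and the C = 512, D = 256 instantiation is the keccak256 variant shown in keccak256_cuda below.

#include <cstdint>

using namespace keccak; // KeccakConfig and (presumably) default_keccak_config live here

void hash_example(uint8_t* input, int input_block_size, int number_of_blocks, uint8_t* output)
{
  KeccakConfig config = default_keccak_config(); // same helper the test program below uses
  // C = 512, D = 256 -> 32 bytes of digest per block
  keccak_hash<512, 256>(input, input_block_size, number_of_blocks, output, config);
}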
@@ -2,13 +2,12 @@
|
||||
#ifndef MSM_H
|
||||
#define MSM_H
|
||||
|
||||
#include <cuda_runtime.h>
|
||||
|
||||
#include "curves/affine.cuh"
|
||||
#include "curves/projective.cuh"
|
||||
#include "fields/field.cuh"
|
||||
#include "gpu-utils/device_context.cuh"
|
||||
#include "gpu-utils/error_handler.cuh"
|
||||
#include "../curves/affine.cuh"
|
||||
#include "../curves/projective.cuh"
|
||||
#include "../fields/field.cuh"
|
||||
#include "../gpu-utils/device_context.cuh"
|
||||
#include "../gpu-utils/error_handler.cuh"
|
||||
|
||||
/**
|
||||
* @namespace msm
|
||||
@@ -124,6 +123,8 @@ namespace msm {
|
||||
* @return `cudaSuccess` if the execution was successful and an error code otherwise.
|
||||
*
|
||||
*/
|
||||
|
||||
typedef int cudaError_t;
|
||||
template <typename S, typename A, typename P>
|
||||
cudaError_t msm(const S* scalars, const A* points, int msm_size, MSMConfig& config, P* results);
|
||||
|
||||
|
||||
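For readers skimming the stubbed backend, this is a reference sketch of the value msm() is specified to compute (the sum of scalars[i]·points[i]). zero(), from_affine() and the scalar-by-point operator* are assumptions taken from the projective interface; the Dummy_Scalar/Dummy_Projective test classes later in this diff expose exactly these operations.

// Naive reference, not the bucket-method implementation the real backend uses.
template <typename S, typename A, typename P>
int msm_reference(const S* scalars, const A* points, int msm_size, P* result)
{
  P acc = P::zero();
  for (int i = 0; i < msm_size; ++i)
    acc = acc + scalars[i] * P::from_affine(points[i]); // assumed affine-to-projective promotion
  *result = acc;
  return 0; // stands in for cudaSuccess in the int-based error scheme
}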
@@ -2,13 +2,11 @@
|
||||
#ifndef NTT_H
|
||||
#define NTT_H
|
||||
|
||||
#include <cuda_runtime.h>
|
||||
|
||||
#include "gpu-utils/device_context.cuh"
|
||||
#include "gpu-utils/error_handler.cuh"
|
||||
#include "gpu-utils/sharedmem.cuh"
|
||||
#include "utils/utils_kernels.cuh"
|
||||
#include "utils/utils.h"
|
||||
#include "../gpu-utils/device_context.cuh"
|
||||
#include "../gpu-utils/error_handler.cuh"
|
||||
#include "../gpu-utils/sharedmem.cuh"
|
||||
#include "../utils/utils_kernels.cuh"
|
||||
#include "../utils/utils.h"
|
||||
|
||||
/**
|
||||
* @namespace ntt
|
||||
@@ -36,6 +34,8 @@ namespace ntt {
|
||||
* primitive_root).
|
||||
* @return `cudaSuccess` if the execution was successful and an error code otherwise.
|
||||
*/
|
||||
typedef int cudaError_t;
|
||||
|
||||
template <typename S>
|
||||
cudaError_t init_domain(S primitive_root, device_context::DeviceContext& ctx, bool fast_twiddles_mode = false);
|
||||
|
||||
|
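A hedged call-shape sketch for init_domain above, reusing get_default_device_context() from the device_context hunk at the top of this diff. get_root_of_unity<S> is assumed to be the templated counterpart of the extern "C" wrapper that appears later in ntt/extern.cpp.

#include <cstdint>

template <typename S>
int setup_ntt_domain(uint32_t max_logn)
{
  device_context::DeviceContext ctx = device_context::get_default_device_context();
  S root = ntt::get_root_of_unity<S>(max_logn); // assumed templated helper
  return ntt::init_domain(root, ctx, /*fast_twiddles_mode=*/false);
}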
||||
@@ -3,8 +3,9 @@
|
||||
#define _NTT_IMPL_H
|
||||
|
||||
#include <stdint.h>
|
||||
#include "ntt/ntt.cuh" // for enum Ordering
|
||||
|
||||
#include "ntt.cuh" // for enum Ordering
|
||||
typedef int cudaError_t;
|
||||
typedef int cudaStream_t;
|
||||
namespace mxntt {
|
||||
|
||||
template <typename S>
|
||||
@@ -32,6 +33,7 @@ namespace mxntt {
|
||||
S* external_twiddles,
|
||||
S* internal_twiddles,
|
||||
S* basic_twiddles,
|
||||
S* linear_twiddle, // twiddles organized as [1,w,w^2,...] for coset-eval in fast-tw mode
|
||||
int ntt_size,
|
||||
int max_logn,
|
||||
int batch_size,
|
||||
|
||||
@@ -1,8 +1,8 @@
|
||||
#pragma once
|
||||
|
||||
#include "gpu-utils/device_context.cuh"
|
||||
#include "fields/field_config.cuh"
|
||||
#include "polynomials/polynomials.h"
|
||||
#include "../../gpu-utils/device_context.cuh"
|
||||
#include "../../fields/field_config.cuh"
|
||||
#include "../polynomials.h"
|
||||
|
||||
using device_context::DeviceContext;
|
||||
|
||||
@@ -11,7 +11,7 @@ namespace polynomials {
|
||||
class CUDAPolynomialFactory : public AbstractPolynomialFactory<C, D, I>
|
||||
{
|
||||
std::vector<DeviceContext> m_device_contexts; // device-id --> device context
|
||||
std::vector<cudaStream_t> m_device_streams; // device-id --> device stream. Storing the streams here as workaround
|
||||
std::vector<int> m_device_streams; // device-id --> device stream. Storing the streams here as workaround
|
||||
// since DeviceContext has a reference to a stream.
|
||||
|
||||
public:
|
||||
|
||||
@@ -6,7 +6,7 @@
|
||||
#include <algorithm> // for std::max
|
||||
#include <cstdint> // for uint64_t, etc.
|
||||
#include <memory>
|
||||
#include "utils/integrity_pointer.h"
|
||||
#include "../utils/integrity_pointer.h"
|
||||
|
||||
namespace polynomials {
|
||||
|
||||
|
||||
@@ -2,8 +2,8 @@
|
||||
|
||||
#include <iostream>
|
||||
#include <memory>
|
||||
#include "utils/integrity_pointer.h"
|
||||
#include "fields/field_config.cuh"
|
||||
#include "../utils/integrity_pointer.h"
|
||||
#include "../fields/field_config.cuh"
|
||||
|
||||
#include "polynomial_context.h"
|
||||
#include "polynomial_backend.h"
|
||||
|
||||
@@ -4,9 +4,9 @@
|
||||
|
||||
#include <cstdint>
|
||||
#include <stdexcept>
|
||||
#include "gpu-utils/device_context.cuh"
|
||||
#include "gpu-utils/error_handler.cuh"
|
||||
#include "utils/utils.h"
|
||||
#include "../gpu-utils/device_context.cuh"
|
||||
#include "../gpu-utils/error_handler.cuh"
|
||||
#include "../utils/utils.h"
|
||||
|
||||
/**
|
||||
* @namespace poseidon
|
||||
@@ -117,6 +117,7 @@ namespace poseidon {
|
||||
/**
|
||||
* Loads pre-calculated optimized constants, moves them to the device
|
||||
*/
|
||||
typedef int cudaError_t;
|
||||
template <typename S>
|
||||
cudaError_t
|
||||
init_optimized_poseidon_constants(int arity, device_context::DeviceContext& ctx, PoseidonConstants<S>* constants);
|
||||
|
||||
@@ -9,7 +9,7 @@ namespace mont {
|
||||
#define MAX_THREADS_PER_BLOCK 256
|
||||
|
||||
template <typename E, bool is_into>
|
||||
__global__ void MontgomeryKernel(const E* input, int n, E* output)
|
||||
void MontgomeryKernel(const E* input, int n, E* output)
|
||||
{
|
||||
int tid = blockIdx.x * blockDim.x + threadIdx.x;
|
||||
if (tid < n) { output[tid] = is_into ? E::to_montgomery(input[tid]) : E::from_montgomery(input[tid]); }
|
||||
|
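Once __global__ is dropped, the element-wise mapping above is just a loop over tid; a sketch of the equivalent host-side form (E is any element type providing to_montgomery/from_montgomery, as in the kernel; the function name is illustrative).

template <typename E, bool is_into>
void montgomery_host(const E* input, int n, E* output)
{
  for (int tid = 0; tid < n; ++tid) // replaces the blockIdx/threadIdx indexing
    output[tid] = is_into ? E::to_montgomery(input[tid]) : E::from_montgomery(input[tid]);
}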
||||
@@ -4,14 +4,13 @@
|
||||
|
||||
namespace utils_internal {
|
||||
template <typename E, typename S>
|
||||
__global__ void NormalizeKernel(E* arr, S scalar, int n)
|
||||
void NormalizeKernel(E* arr, S scalar, int n)
|
||||
{
|
||||
int tid = blockIdx.x * blockDim.x + threadIdx.x;
|
||||
if (tid < n) { arr[tid] = scalar * arr[tid]; }
|
||||
return;
|
||||
}
|
||||
|
||||
template <typename E, typename S>
|
||||
__global__ void BatchMulKernel(
|
||||
void BatchMulKernel(
|
||||
const E* in_vec,
|
||||
int n_elements,
|
||||
int batch_size,
|
||||
@@ -22,12 +21,7 @@ namespace utils_internal {
|
||||
bool bitrev,
|
||||
E* out_vec)
|
||||
{
|
||||
int tid = blockDim.x * blockIdx.x + threadIdx.x;
|
||||
if (tid < n_elements * batch_size) {
|
||||
int64_t scalar_id = tid % n_elements;
|
||||
if (bitrev) scalar_id = __brev(scalar_id) >> (32 - logn);
|
||||
out_vec[tid] = *(scalar_vec + ((scalar_id * step) % n_scalars)) * in_vec[tid];
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
} // namespace utils_internal
|
||||
|
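The same loop-for-kernel rewrite applied to BatchMulKernel above; the bitrev branch is omitted here because __brev is a CUDA intrinsic (a portable replacement is sketched next to bit_rev further down in this diff). The function name is illustrative.

#include <cstdint>

template <typename E, typename S>
void batch_mul_host(const E* in_vec, int n_elements, int batch_size,
                    const S* scalar_vec, int step, int n_scalars, E* out_vec)
{
  for (int tid = 0; tid < n_elements * batch_size; ++tid) {
    int64_t scalar_id = tid % n_elements;
    out_vec[tid] = *(scalar_vec + ((scalar_id * step) % n_scalars)) * in_vec[tid];
  }
}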
||||
@@ -2,7 +2,7 @@
|
||||
#ifndef LDE_H
|
||||
#define LDE_H
|
||||
|
||||
#include "gpu-utils/device_context.cuh"
|
||||
#include "../gpu-utils/device_context.cuh"
|
||||
|
||||
/**
|
||||
* @namespace vec_ops
|
||||
@@ -57,6 +57,7 @@ namespace vec_ops {
|
||||
* @tparam E The type of elements `vec_b` and `result`. Often (but not always) `E=S`.
|
||||
* @return `cudaSuccess` if the execution was successful and an error code otherwise.
|
||||
*/
|
||||
typedef int cudaError_t;
|
||||
template <typename E, typename S>
|
||||
cudaError_t Mul(const S* vec_a, const E* vec_b, int n, VecOpsConfig& config, E* result);
|
||||
|
||||
|
||||
@@ -1,29 +1,26 @@
|
||||
if (G2)
|
||||
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -DG2")
|
||||
endif ()
|
||||
|
||||
set(TARGET icicle_curve)
|
||||
set(FIELD_TARGET icicle_field)
|
||||
|
||||
set(SRC ${CMAKE_SOURCE_DIR}/src)
|
||||
set(SRC ../../)
|
||||
|
||||
set(CURVE_SOURCE ${SRC}/curves/extern.cu)
|
||||
set(CURVE_SOURCE ${SRC}/curves/extern.cpp)
|
||||
if(G2)
|
||||
list(APPEND CURVE_SOURCE ${SRC}/curves/extern_g2.cu)
|
||||
endif()
|
||||
if(MSM)
|
||||
list(APPEND CURVE_SOURCE ${SRC}/msm/extern.cu)
|
||||
if(G2)
|
||||
list(APPEND CURVE_SOURCE ${SRC}/msm/extern_g2.cu)
|
||||
endif()
|
||||
endif()
|
||||
if(ECNTT)
|
||||
list(APPEND CURVE_SOURCE ${SRC}/ntt/extern_ecntt.cu)
|
||||
list(APPEND CURVE_SOURCE ${SRC}/ntt/kernel_ntt.cu)
|
||||
list(APPEND CURVE_SOURCE ${SRC}/curves/extern_g2.cpp)
|
||||
endif()
|
||||
# if(MSM)
|
||||
# list(APPEND CURVE_SOURCE ${SRC}/msm/extern.cpp)
|
||||
# if(G2)
|
||||
# list(APPEND CURVE_SOURCE ${SRC}/msm/extern_g2.cpp)
|
||||
# endif()
|
||||
# endif()
|
||||
# if(ECNTT)
|
||||
# list(APPEND CURVE_SOURCE ${SRC}/ntt/extern_ecntt.cpp)
|
||||
# list(APPEND CURVE_SOURCE ${SRC}/ntt/kernel_ntt.cpp)
|
||||
# endif()
|
||||
|
||||
add_library(${TARGET} STATIC ${CURVE_SOURCE})
|
||||
target_include_directories(${TARGET} PUBLIC ${CMAKE_SOURCE_DIR}/include/)
|
||||
set_target_properties(${TARGET} PROPERTIES OUTPUT_NAME "ingo_curve_${CURVE}")
|
||||
target_compile_definitions(${TARGET} PUBLIC CURVE=${CURVE})
|
||||
target_link_libraries(${TARGET} PRIVATE ${FIELD_TARGET})
|
||||
target_link_libraries(${TARGET} PRIVATE ${FIELD_TARGET})
|
||||
target_compile_features(${TARGET} PUBLIC cxx_std_17)
|
||||
40
icicle/src/curves/extern.cpp
Normal file

@@ -0,0 +1,40 @@
|
||||
#define CURVE_ID BN254
|
||||
#include "../../include/curves/curve_config.cuh"
|
||||
|
||||
using namespace curve_config;
|
||||
|
||||
#include "../../include/gpu-utils/device_context.cuh"
|
||||
#include "../../include/utils/utils.h"
|
||||
// #include "../utils/mont.cuh"
|
||||
|
||||
extern "C" bool CONCAT_EXPAND(CURVE, eq)(projective_t* point1, projective_t* point2)
|
||||
{
|
||||
return true;
|
||||
}
|
||||
|
||||
extern "C" void CONCAT_EXPAND(CURVE, to_affine)(projective_t* point, affine_t* point_out)
|
||||
{
|
||||
return;
|
||||
}
|
||||
|
||||
extern "C" void CONCAT_EXPAND(CURVE, generate_projective_points)(projective_t* points, int size)
|
||||
{
|
||||
return;
|
||||
}
|
||||
|
||||
extern "C" void CONCAT_EXPAND(CURVE, generate_affine_points)(affine_t* points, int size)
|
||||
{
|
||||
return;
|
||||
}
|
||||
|
||||
extern "C" int CONCAT_EXPAND(CURVE, affine_convert_montgomery)(
|
||||
affine_t* d_inout, size_t n, bool is_into, device_context::DeviceContext& ctx)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
|
||||
extern "C" int CONCAT_EXPAND(CURVE, projective_convert_montgomery)(
|
||||
projective_t* d_inout, size_t n, bool is_into, device_context::DeviceContext& ctx)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
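The extern "C" symbols above are produced by CONCAT_EXPAND from utils.h; a sketch of the usual two-level token-pasting definition it is assumed to have, so the expansion is visible at a glance.

// Two macros are needed so that CURVE/FIELD is expanded *before* pasting.
#define CONCAT_DIRECT(a, b) a##_##b
#define CONCAT_EXPAND(a, b) CONCAT_DIRECT(a, b)
// With -DCURVE=bn254, CONCAT_EXPAND(CURVE, eq) expands to the symbol bn254_eq.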
39
icicle/src/curves/extern_g2.cpp
Normal file
@@ -0,0 +1,39 @@
|
||||
#include "curves/curve_config.cuh"
|
||||
|
||||
using namespace curve_config;
|
||||
|
||||
#include "gpu-utils/device_context.cuh"
|
||||
#include "utils/utils.h"
|
||||
#include "utils/mont.cuh"
|
||||
|
||||
extern "C" bool CONCAT_EXPAND(CURVE, g2_eq)(g2_projective_t* point1, g2_projective_t* point2)
|
||||
{
|
||||
return true;
|
||||
}
|
||||
|
||||
extern "C" void CONCAT_EXPAND(CURVE, g2_to_affine)(g2_projective_t* point, g2_affine_t* point_out)
|
||||
{
|
||||
return;
|
||||
}
|
||||
|
||||
extern "C" void CONCAT_EXPAND(CURVE, g2_generate_projective_points)(g2_projective_t* points, int size)
|
||||
{
|
||||
return;
|
||||
}
|
||||
|
||||
extern "C" void CONCAT_EXPAND(CURVE, g2_generate_affine_points)(g2_affine_t* points, int size)
|
||||
{
|
||||
return;
|
||||
}
|
||||
|
||||
extern "C" cudaError_t CONCAT_EXPAND(CURVE, g2_affine_convert_montgomery)(
|
||||
g2_affine_t* d_inout, size_t n, bool is_into, device_context::DeviceContext& ctx)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
|
||||
extern "C" cudaError_t CONCAT_EXPAND(CURVE, g2_projective_convert_montgomery)(
|
||||
g2_projective_t* d_inout, size_t n, bool is_into, device_context::DeviceContext& ctx)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
@@ -1,40 +1,37 @@
|
||||
if (EXT_FIELD)
|
||||
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -DEXT_FIELD")
|
||||
endif ()
|
||||
|
||||
SET(SUPPORTED_FIELDS_WITHOUT_NTT grumpkin)
|
||||
|
||||
set(TARGET icicle_field)
|
||||
|
||||
set(SRC ${CMAKE_SOURCE_DIR}/src)
|
||||
set(SRC ../../)
|
||||
|
||||
set(FIELD_SOURCE ${SRC}/fields/extern.cu)
|
||||
list(APPEND FIELD_SOURCE ${SRC}/vec_ops/extern.cu)
|
||||
set(FIELD_SOURCE ${SRC}/fields/extern.cpp)
|
||||
# list(APPEND FIELD_SOURCE ${SRC}/vec_ops/extern.cu)
|
||||
if(EXT_FIELD)
|
||||
list(APPEND FIELD_SOURCE ${SRC}/fields/extern_extension.cu)
|
||||
list(APPEND FIELD_SOURCE ${SRC}/ntt/extern_extension.cu)
|
||||
list(APPEND FIELD_SOURCE ${SRC}/vec_ops/extern_extension.cu)
|
||||
list(APPEND FIELD_SOURCE ${SRC}/fields/extern_extension.cpp)
|
||||
# list(APPEND FIELD_SOURCE ${SRC}/ntt/extern_extension.cu)
|
||||
# list(APPEND FIELD_SOURCE ${SRC}/vec_ops/extern_extension.cu)
|
||||
endif()
|
||||
|
||||
set(POLYNOMIAL_SOURCE_FILES
|
||||
${SRC}/polynomials/polynomials.cu
|
||||
${SRC}/polynomials/cuda_backend/polynomial_cuda_backend.cu
|
||||
${SRC}/polynomials/polynomials_c_api.cu)
|
||||
# set(POLYNOMIAL_SOURCE_FILES
|
||||
# ${SRC}/polynomials/polynomials.cu
|
||||
# ${SRC}/polynomials/cuda_backend/polynomial_cuda_backend.cu
|
||||
# ${SRC}/polynomials/polynomials_c_api.cu)
|
||||
|
||||
list(APPEND FIELD_SOURCE ${POLYNOMIAL_SOURCE_FILES})
|
||||
# list(APPEND FIELD_SOURCE ${POLYNOMIAL_SOURCE_FILES})
|
||||
|
||||
# TODO: impl poseidon for small fields. note that it needs to be defined over the extension field!
|
||||
if (DEFINED CURVE)
|
||||
list(APPEND FIELD_SOURCE ${SRC}/poseidon/poseidon.cu)
|
||||
list(APPEND FIELD_SOURCE ${SRC}/poseidon/tree/merkle.cu)
|
||||
endif()
|
||||
# if (DEFINED CURVE)
|
||||
# list(APPEND FIELD_SOURCE ${SRC}/poseidon/poseidon.cu)
|
||||
# list(APPEND FIELD_SOURCE ${SRC}/poseidon/tree/merkle.cu)
|
||||
# endif()
|
||||
|
||||
if (NOT FIELD IN_LIST SUPPORTED_FIELDS_WITHOUT_NTT)
|
||||
list(APPEND FIELD_SOURCE ${SRC}/ntt/extern.cu)
|
||||
list(APPEND FIELD_SOURCE ${SRC}/ntt/kernel_ntt.cu)
|
||||
endif()
|
||||
# if (NOT FIELD IN_LIST SUPPORTED_FIELDS_WITHOUT_NTT)
|
||||
# list(APPEND FIELD_SOURCE ${SRC}/ntt/extern.cu)
|
||||
# list(APPEND FIELD_SOURCE ${SRC}/ntt/kernel_ntt.cu)
|
||||
# endif()
|
||||
|
||||
add_library(${TARGET} STATIC ${FIELD_SOURCE})
|
||||
target_include_directories(${TARGET} PUBLIC ${CMAKE_SOURCE_DIR}/include/)
|
||||
set_target_properties(${TARGET} PROPERTIES OUTPUT_NAME "ingo_field_${FIELD}")
|
||||
target_compile_definitions(${TARGET} PUBLIC FIELD=${FIELD})
|
||||
target_compile_features(${TARGET} PUBLIC cxx_std_17)
|
||||
19
icicle/src/fields/extern.cpp
Normal file
@@ -0,0 +1,19 @@
|
||||
#define FIELD_ID BN254
|
||||
#include "../../include/fields/field_config.cuh"
|
||||
|
||||
using namespace field_config;
|
||||
|
||||
//#include "../../include/utils/mont.cuh"
|
||||
#include "../../include/utils/utils.h"
|
||||
#include "../../include/gpu-utils/device_context.cuh"
|
||||
|
||||
extern "C" void CONCAT_EXPAND(FIELD, generate_scalars)(scalar_t* scalars, int size)
|
||||
{
|
||||
return;
|
||||
}
|
||||
|
||||
extern "C" int CONCAT_EXPAND(FIELD, scalar_convert_montgomery)(
|
||||
scalar_t* d_inout, size_t n, bool is_into, device_context::DeviceContext& ctx)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
18
icicle/src/fields/extern_extension.cpp
Normal file
@@ -0,0 +1,18 @@
|
||||
#include "fields/field_config.cuh"
|
||||
|
||||
using namespace field_config;
|
||||
|
||||
#include "utils/mont.cuh"
|
||||
#include "utils/utils.h"
|
||||
#include "gpu-utils/device_context.cuh"
|
||||
|
||||
extern "C" void CONCAT_EXPAND(FIELD, extension_generate_scalars)(extension_t* scalars, int size)
|
||||
{
|
||||
return;
|
||||
}
|
||||
|
||||
extern "C" cudaError_t CONCAT_EXPAND(FIELD, extension_scalar_convert_montgomery)(
|
||||
extension_t* d_inout, size_t n, bool is_into, device_context::DeviceContext& ctx)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
@@ -1,5 +1,6 @@
|
||||
set(TARGET icicle_hash)
|
||||
|
||||
add_library(${TARGET} STATIC keccak/keccak.cu)
|
||||
add_library(${TARGET} STATIC keccak/keccak.cpp)
|
||||
target_include_directories(${TARGET} PUBLIC ${CMAKE_SOURCE_DIR}/include/)
|
||||
set_target_properties(${TARGET} PROPERTIES OUTPUT_NAME "ingo_hash")
|
||||
set_target_properties(${TARGET} PROPERTIES OUTPUT_NAME "ingo_hash")
|
||||
target_compile_features(${TARGET} PUBLIC cxx_std_17)
|
||||
24
icicle/src/hash/keccak/keccak.cpp
Normal file
@@ -0,0 +1,24 @@
|
||||
#include "../../../include/hash/keccak/keccak.cuh"
|
||||
|
||||
typedef int cudaError_t;
|
||||
|
||||
namespace keccak {
|
||||
template <int C, int D>
|
||||
cudaError_t
|
||||
keccak_hash(uint8_t* input, int input_block_size, int number_of_blocks, uint8_t* output, KeccakConfig& config)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
|
||||
extern "C" cudaError_t
|
||||
keccak256_cuda(uint8_t* input, int input_block_size, int number_of_blocks, uint8_t* output, KeccakConfig& config)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
|
||||
extern "C" cudaError_t
|
||||
keccak512_cuda(uint8_t* input, int input_block_size, int number_of_blocks, uint8_t* output, KeccakConfig& config)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
} // namespace keccak
|
||||
@@ -144,14 +144,14 @@ namespace keccak {
|
||||
element ^= rc; \
|
||||
}
|
||||
|
||||
__device__ const uint64_t RC[24] = {0x0000000000000001, 0x0000000000008082, 0x800000000000808a, 0x8000000080008000,
|
||||
const uint64_t RC[24] = {0x0000000000000001, 0x0000000000008082, 0x800000000000808a, 0x8000000080008000,
|
||||
0x000000000000808b, 0x0000000080000001, 0x8000000080008081, 0x8000000000008009,
|
||||
0x000000000000008a, 0x0000000000000088, 0x0000000080008009, 0x000000008000000a,
|
||||
0x000000008000808b, 0x800000000000008b, 0x8000000000008089, 0x8000000000008003,
|
||||
0x8000000000008002, 0x8000000000000080, 0x000000000000800a, 0x800000008000000a,
|
||||
0x8000000080008081, 0x8000000000008080, 0x0000000080000001, 0x8000000080008008};
|
||||
|
||||
__device__ void keccakf(uint64_t s[25])
|
||||
void keccakf(uint64_t s[25])
|
||||
{
|
||||
uint64_t t0, t1, t2, t3, t4;
|
||||
|
||||
@@ -224,7 +224,7 @@ namespace keccak {
|
||||
|
||||
template <int C, int D>
|
||||
cudaError_t
|
||||
keccak_hash(uint8_t* input, int input_block_size, int number_of_blocks, uint8_t* output, KeccakConfig config)
|
||||
keccak_hash(uint8_t* input, int input_block_size, int number_of_blocks, uint8_t* output, KeccakConfig& config)
|
||||
{
|
||||
CHK_INIT_IF_RETURN();
|
||||
cudaStream_t& stream = config.ctx.stream;
|
||||
@@ -245,7 +245,7 @@ namespace keccak {
|
||||
CHK_IF_RETURN(cudaMallocAsync(&output_device, number_of_blocks * (D / 8), stream));
|
||||
}
|
||||
|
||||
int number_of_threads = 1024;
|
||||
int number_of_threads = 512;
|
||||
int number_of_gpu_blocks = (number_of_blocks - 1) / number_of_threads + 1;
|
||||
keccak_hash_blocks<C, D><<<number_of_gpu_blocks, number_of_threads, 0, stream>>>(
|
||||
input_device, input_block_size, number_of_blocks, output_device);
|
||||
@@ -262,13 +262,13 @@ namespace keccak {
|
||||
}
|
||||
|
||||
extern "C" cudaError_t
|
||||
keccak256_cuda(uint8_t* input, int input_block_size, int number_of_blocks, uint8_t* output, KeccakConfig config)
|
||||
keccak256_cuda(uint8_t* input, int input_block_size, int number_of_blocks, uint8_t* output, KeccakConfig& config)
|
||||
{
|
||||
return keccak_hash<512, 256>(input, input_block_size, number_of_blocks, output, config);
|
||||
}
|
||||
|
||||
extern "C" cudaError_t
|
||||
keccak512_cuda(uint8_t* input, int input_block_size, int number_of_blocks, uint8_t* output, KeccakConfig config)
|
||||
keccak512_cuda(uint8_t* input, int input_block_size, int number_of_blocks, uint8_t* output, KeccakConfig& config)
|
||||
{
|
||||
return keccak_hash<1024, 512>(input, input_block_size, number_of_blocks, output, config);
|
||||
}
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
#include "utils/device_context.cuh"
|
||||
#include "gpu-utils/device_context.cuh"
|
||||
#include "keccak.cu"
|
||||
|
||||
// #define DEBUG
|
||||
@@ -51,7 +51,7 @@ int main(int argc, char* argv[])
|
||||
|
||||
START_TIMER(keccak_timer);
|
||||
KeccakConfig config = default_keccak_config();
|
||||
keccak256(in_ptr, input_block_size, number_of_blocks, out_ptr, config);
|
||||
keccak256_cuda(in_ptr, input_block_size, number_of_blocks, out_ptr, config);
|
||||
END_TIMER(keccak_timer, "Keccak")
|
||||
|
||||
for (int i = 0; i < number_of_blocks; i++) {
|
||||
|
||||
28
icicle/src/msm/CMakeLists.txt
Normal file
@@ -0,0 +1,28 @@
|
||||
set(TARGET icicle_msm)
|
||||
set(CURVE_TARGET icicle_curve)
|
||||
set(FIELD_TARGET icicle_field)
|
||||
|
||||
set(SRC ../)
|
||||
|
||||
set(MSM_SOURCE ${SRC}/msm/extern.cpp)
|
||||
if(G2)
|
||||
list(APPEND MSM_SOURCE ${SRC}/msm/extern_g2.cpp)
|
||||
endif()
|
||||
# if(MSM)
|
||||
# list(APPEND CURVE_SOURCE ${SRC}/msm/extern.cpp)
|
||||
# if(G2)
|
||||
# list(APPEND CURVE_SOURCE ${SRC}/msm/extern_g2.cpp)
|
||||
# endif()
|
||||
# endif()
|
||||
# if(ECNTT)
|
||||
# list(APPEND CURVE_SOURCE ${SRC}/ntt/extern_ecntt.cpp)
|
||||
# list(APPEND CURVE_SOURCE ${SRC}/ntt/kernel_ntt.cpp)
|
||||
# endif()
|
||||
|
||||
add_library(${TARGET} STATIC ${MSM_SOURCE})
|
||||
target_include_directories(${TARGET} PUBLIC ${CMAKE_SOURCE_DIR}/include/)
|
||||
set_target_properties(${TARGET} PROPERTIES OUTPUT_NAME "ingo_curve_${CURVE}")
|
||||
target_compile_definitions(${TARGET} PUBLIC CURVE=${CURVE})
|
||||
target_link_libraries(${TARGET} PRIVATE ${FIELD_TARGET})
|
||||
target_link_libraries(${TARGET} PRIVATE ${CURVE_TARGET})
|
||||
target_compile_features(${TARGET} PUBLIC cxx_std_17)
|
||||
46
icicle/src/msm/extern.cpp
Normal file
@@ -0,0 +1,46 @@
|
||||
#define CURVE_ID BN254
|
||||
#define FIELD_ID BN254
|
||||
#include "../../include/curves/curve_config.cuh"
|
||||
#include "../../include/fields/field_config.cuh"
|
||||
#include "../../include/gpu-utils/device_context.cuh"
|
||||
#include "../../include/msm/msm.cuh"
|
||||
|
||||
typedef int cudaError_t;
|
||||
using namespace curve_config;
|
||||
using namespace field_config;
|
||||
|
||||
#include "../../include/utils/utils.h"
|
||||
|
||||
namespace msm {
|
||||
/**
|
||||
* Extern "C" version of [precompute_msm_bases](@ref precompute_msm_bases) function with the following values of
|
||||
* template parameters (where the curve is given by `-DCURVE` env variable during build):
|
||||
* - `A` is the [affine representation](@ref affine_t) of curve points;
|
||||
* @return `cudaSuccess` if the execution was successful and an error code otherwise.
|
||||
*/
|
||||
extern "C" cudaError_t CONCAT_EXPAND(CURVE, precompute_msm_bases_cuda)(
|
||||
affine_t* bases,
|
||||
int bases_size,
|
||||
int precompute_factor,
|
||||
int _c,
|
||||
bool are_bases_on_device,
|
||||
device_context::DeviceContext& ctx,
|
||||
affine_t* output_bases)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
|
||||
/**
|
||||
* Extern "C" version of [msm](@ref msm) function with the following values of template parameters
|
||||
* (where the curve is given by `-DCURVE` env variable during build):
|
||||
* - `S` is the [scalar field](@ref scalar_t) of the curve;
|
||||
* - `A` is the [affine representation](@ref affine_t) of curve points;
|
||||
* - `P` is the [projective representation](@ref projective_t) of curve points.
|
||||
* @return `cudaSuccess` if the execution was successful and an error code otherwise.
|
||||
*/
|
||||
extern "C" cudaError_t CONCAT_EXPAND(CURVE, msm_cuda)(
|
||||
const scalar_t* scalars, const affine_t* points, int msm_size, MSMConfig& config, projective_t* out)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
} // namespace msm
|
||||
43
icicle/src/msm/extern_g2.cpp
Normal file
@@ -0,0 +1,43 @@
|
||||
#include "curves/curve_config.cuh"
|
||||
#include "fields/field_config.cuh"
|
||||
|
||||
using namespace curve_config;
|
||||
using namespace field_config;
|
||||
|
||||
#include "msm.cu"
|
||||
#include "utils/utils.h"
|
||||
|
||||
namespace msm {
|
||||
/**
|
||||
* Extern "C" version of [precompute_msm_bases](@ref precompute_msm_bases) function with the following values of
|
||||
* template parameters (where the curve is given by `-DCURVE` env variable during build):
|
||||
* - `A` is the [affine representation](@ref g2_affine_t) of G2 curve points;
|
||||
* @return `cudaSuccess` if the execution was successful and an error code otherwise.
|
||||
*/
|
||||
extern "C" cudaError_t CONCAT_EXPAND(CURVE, g2_precompute_msm_bases_cuda)(
|
||||
g2_affine_t* bases,
|
||||
int bases_size,
|
||||
int precompute_factor,
|
||||
int _c,
|
||||
bool are_bases_on_device,
|
||||
device_context::DeviceContext& ctx,
|
||||
g2_affine_t* output_bases)
|
||||
{
|
||||
return precompute_msm_bases<g2_affine_t, g2_projective_t>(
|
||||
bases, bases_size, precompute_factor, _c, are_bases_on_device, ctx, output_bases);
|
||||
}
|
||||
|
||||
/**
|
||||
* Extern "C" version of [msm](@ref msm) function with the following values of template parameters
|
||||
* (where the curve is given by `-DCURVE` env variable during build):
|
||||
* - `S` is the [scalar field](@ref scalar_t) of the curve;
|
||||
* - `A` is the [affine representation](@ref g2_affine_t) of G2 curve points;
|
||||
* - `P` is the [projective representation](@ref g2_projective_t) of G2 curve points.
|
||||
* @return `cudaSuccess` if the execution was successful and an error code otherwise.
|
||||
*/
|
||||
extern "C" cudaError_t CONCAT_EXPAND(CURVE, g2_msm_cuda)(
|
||||
const scalar_t* scalars, const g2_affine_t* points, int msm_size, MSMConfig& config, g2_projective_t* out)
|
||||
{
|
||||
return msm<scalar_t, g2_affine_t, g2_projective_t>(scalars, points, msm_size, config, out);
|
||||
}
|
||||
} // namespace msm
|
||||
@@ -20,32 +20,32 @@ public:
|
||||
unsigned p = 10;
|
||||
// unsigned p = 1<<30;
|
||||
|
||||
static HOST_DEVICE_INLINE Dummy_Scalar zero() { return {0}; }
|
||||
static Dummy_Scalar zero() { return {0}; }
|
||||
|
||||
static HOST_DEVICE_INLINE Dummy_Scalar one() { return {1}; }
|
||||
static Dummy_Scalar one() { return {1}; }
|
||||
|
||||
friend HOST_INLINE std::ostream& operator<<(std::ostream& os, const Dummy_Scalar& scalar)
|
||||
friend std::ostream& operator<<(std::ostream& os, const Dummy_Scalar& scalar)
|
||||
{
|
||||
os << scalar.x;
|
||||
return os;
|
||||
}
|
||||
|
||||
HOST_DEVICE_INLINE unsigned get_scalar_digit(unsigned digit_num, unsigned digit_width) const
|
||||
unsigned get_scalar_digit(unsigned digit_num, unsigned digit_width) const
|
||||
{
|
||||
return (x >> (digit_num * digit_width)) & ((1 << digit_width) - 1);
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE Dummy_Scalar operator+(Dummy_Scalar p1, const Dummy_Scalar& p2)
|
||||
friend Dummy_Scalar operator+(Dummy_Scalar p1, const Dummy_Scalar& p2)
|
||||
{
|
||||
return {(p1.x + p2.x) % p1.p};
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE bool operator==(const Dummy_Scalar& p1, const Dummy_Scalar& p2) { return (p1.x == p2.x); }
|
||||
friend bool operator==(const Dummy_Scalar& p1, const Dummy_Scalar& p2) { return (p1.x == p2.x); }
|
||||
|
||||
friend HOST_DEVICE_INLINE bool operator==(const Dummy_Scalar& p1, const unsigned p2) { return (p1.x == p2); }
|
||||
friend bool operator==(const Dummy_Scalar& p1, const unsigned p2) { return (p1.x == p2); }
|
||||
|
||||
static HOST_DEVICE_INLINE Dummy_Scalar neg(const Dummy_Scalar& scalar) { return {scalar.p - scalar.x}; }
|
||||
static HOST_INLINE Dummy_Scalar rand_host()
|
||||
static Dummy_Scalar neg(const Dummy_Scalar& scalar) { return {scalar.p - scalar.x}; }
|
||||
static Dummy_Scalar rand_host()
|
||||
{
|
||||
return {(unsigned)rand() % 10};
|
||||
// return {(unsigned)rand()};
|
||||
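A tiny worked example for get_scalar_digit above, which is the digit decomposition the MSM bucket method relies on: x = 0b101101 split into 2-bit digits gives, LSB first, 01, 11, 10.

unsigned digit_example()
{
  unsigned x = 0b101101u; // 45
  unsigned digit_width = 2, digit_num = 1;
  return (x >> (digit_num * digit_width)) & ((1u << digit_width) - 1u); // == 0b11
}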
@@ -57,32 +57,32 @@ class Dummy_Projective
|
||||
public:
|
||||
Dummy_Scalar x;
|
||||
|
||||
static HOST_DEVICE_INLINE Dummy_Projective zero() { return {0}; }
|
||||
static Dummy_Projective zero() { return {0}; }
|
||||
|
||||
static HOST_DEVICE_INLINE Dummy_Projective one() { return {1}; }
|
||||
static Dummy_Projective one() { return {1}; }
|
||||
|
||||
static HOST_DEVICE_INLINE Dummy_Projective to_affine(const Dummy_Projective& point) { return {point.x}; }
|
||||
static Dummy_Projective to_affine(const Dummy_Projective& point) { return {point.x}; }
|
||||
|
||||
static HOST_DEVICE_INLINE Dummy_Projective from_affine(const Dummy_Projective& point) { return {point.x}; }
|
||||
static Dummy_Projective from_affine(const Dummy_Projective& point) { return {point.x}; }
|
||||
|
||||
static HOST_DEVICE_INLINE Dummy_Projective neg(const Dummy_Projective& point) { return {Dummy_Scalar::neg(point.x)}; }
|
||||
static Dummy_Projective neg(const Dummy_Projective& point) { return {Dummy_Scalar::neg(point.x)}; }
|
||||
|
||||
friend HOST_DEVICE_INLINE Dummy_Projective operator+(Dummy_Projective p1, const Dummy_Projective& p2)
|
||||
friend Dummy_Projective operator+(Dummy_Projective p1, const Dummy_Projective& p2)
|
||||
{
|
||||
return {p1.x + p2.x};
|
||||
}
|
||||
|
||||
// friend HOST_DEVICE_INLINE Dummy_Projective operator-(Dummy_Projective p1, const Dummy_Projective& p2) {
|
||||
// friend Dummy_Projective operator-(Dummy_Projective p1, const Dummy_Projective& p2) {
|
||||
// return p1 + neg(p2);
|
||||
// }
|
||||
|
||||
friend HOST_INLINE std::ostream& operator<<(std::ostream& os, const Dummy_Projective& point)
|
||||
friend std::ostream& operator<<(std::ostream& os, const Dummy_Projective& point)
|
||||
{
|
||||
os << point.x;
|
||||
return os;
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE Dummy_Projective operator*(Dummy_Scalar scalar, const Dummy_Projective& point)
|
||||
friend Dummy_Projective operator*(Dummy_Scalar scalar, const Dummy_Projective& point)
|
||||
{
|
||||
Dummy_Projective res = zero();
|
||||
#ifdef CUDA_ARCH
|
||||
@@ -95,14 +95,14 @@ public:
|
||||
return res;
|
||||
}
|
||||
|
||||
friend HOST_DEVICE_INLINE bool operator==(const Dummy_Projective& p1, const Dummy_Projective& p2)
|
||||
friend bool operator==(const Dummy_Projective& p1, const Dummy_Projective& p2)
|
||||
{
|
||||
return (p1.x == p2.x);
|
||||
}
|
||||
|
||||
static HOST_DEVICE_INLINE bool is_zero(const Dummy_Projective& point) { return point.x == 0; }
|
||||
static bool is_zero(const Dummy_Projective& point) { return point.x == 0; }
|
||||
|
||||
static HOST_INLINE Dummy_Projective rand_host()
|
||||
static Dummy_Projective rand_host()
|
||||
{
|
||||
return {(unsigned)rand() % 10};
|
||||
// return {(unsigned)rand()};
|
||||
|
||||
34
icicle/src/ntt/CMakeLists.txt
Normal file
@@ -0,0 +1,34 @@
|
||||
set(TARGET icicle_ntt)
|
||||
set(CURVE_TARGET icicle_curve)
|
||||
set(FIELD_TARGET icicle_field)
|
||||
|
||||
set(SRC ../)
|
||||
|
||||
set(NTT_SOURCE ${SRC}/ntt/extern.cpp)
|
||||
set(NTT_SOURCE_EXTENSION ${SRC}/ntt/extern_extension.cpp)
|
||||
set(NTT_SOURCE_EC ${SRC}/ntt/extern_ecntt.cpp)
|
||||
set(NTT_SOURCE ${SRC}/ntt/extern.cpp)
|
||||
if(G2)
|
||||
list(APPEND NTT_SOURCE ${SRC}/ntt/extern_g2.cpp)
|
||||
endif()
|
||||
# if(MSM)
|
||||
# list(APPEND CURVE_SOURCE ${SRC}/msm/extern.cpp)
|
||||
# if(G2)
|
||||
# list(APPEND CURVE_SOURCE ${SRC}/msm/extern_g2.cpp)
|
||||
# endif()
|
||||
# endif()
|
||||
# if(ECNTT)
|
||||
# list(APPEND CURVE_SOURCE ${SRC}/ntt/extern_ecntt.cpp)
|
||||
# list(APPEND CURVE_SOURCE ${SRC}/ntt/kernel_ntt.cpp)
|
||||
# endif()
|
||||
|
||||
add_library(${TARGET} STATIC ${NTT_SOURCE} ${NTT_SOURCE_EXTENSION} ${NTT_SOURCE_EC})
|
||||
target_include_directories(${TARGET} PUBLIC ${CMAKE_SOURCE_DIR}/include/)
|
||||
set_target_properties(${TARGET} PROPERTIES OUTPUT_NAME "ingo_curve_${CURVE}")
|
||||
target_compile_definitions(${TARGET} PUBLIC CURVE=${CURVE})
|
||||
target_link_libraries(${TARGET} PRIVATE ${FIELD_TARGET})
|
||||
target_link_libraries(${TARGET} PRIVATE ${CURVE_TARGET})
|
||||
target_compile_features(${TARGET} PUBLIC cxx_std_17)
|
||||
@@ -1,13 +1,21 @@
|
||||
build_verification:
|
||||
mkdir -p work
|
||||
nvcc -o work/test_verification -I. -I../../include tests/verification.cu -std=c++17
|
||||
g++ -o work/test_verification -Intt.cpp -Ikernel_ntt.cpp -Iextern.cpp -Iextern_ecntt.cpp -Iextern_extension.cpp -I../../include tests/verification.cpp -std=c++17
|
||||
|
||||
build_extern:
|
||||
g++ -o work/test_verification -I../../include extern.cpp -std=c++17
|
||||
|
||||
|
||||
extern.o: extern.cpp
|
||||
g++ -std=c++17 -std=gnu++11 -c -o $@ $< -I../../include
|
||||
|
||||
|
||||
test_verification: build_verification
|
||||
work/test_verification
|
||||
|
||||
build_verification_ecntt:
|
||||
mkdir -p work
|
||||
nvcc -o work/test_verification_ecntt -I. -I../../include tests/verification.cu -std=c++17 -DECNTT
|
||||
g++ -o work/test_verification_ecntt -I. -I../../include tests/verification.cpp -std=c++17 -DECNTT
|
||||
|
||||
test_verification_ecntt: build_verification_ecntt
|
||||
work/test_verification_ecntt
|
||||
|
||||
60
icicle/src/ntt/extern.cpp
Normal file
@@ -0,0 +1,60 @@
|
||||
#define FIELD_ID BN254
|
||||
#include "../../include/fields/field_config.cuh"
|
||||
|
||||
using namespace field_config;
|
||||
|
||||
#include "ntt.cpp"
|
||||
|
||||
#include "../../include/gpu-utils/device_context.cuh"
|
||||
#include "../../include/utils/utils.h"
|
||||
|
||||
typedef int cudaError_t;
|
||||
namespace ntt {
|
||||
/**
|
||||
* Extern "C" version of [init_domain](@ref init_domain) function with the following
|
||||
* value of template parameter (where the field is given by `-DFIELD` env variable during build):
|
||||
* - `S` is the [field](@ref scalar_t) - either a scalar field of the elliptic curve or a
|
||||
* stand-alone "STARK field";
|
||||
*/
|
||||
extern "C" cudaError_t CONCAT_EXPAND(FIELD, initialize_domain)(
|
||||
scalar_t* primitive_root, device_context::DeviceContext& ctx, bool fast_twiddles_mode)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
|
||||
/**
|
||||
* Extern "C" version of [ntt](@ref ntt) function with the following values of template parameters
|
||||
* (where the field is given by `-DFIELD` env variable during build):
|
||||
* - `S` is the [field](@ref scalar_t) - either a scalar field of the elliptic curve or a
|
||||
* stand-alone "STARK field";
|
||||
* @return `cudaSuccess` if the execution was successful and an error code otherwise.
|
||||
*/
|
||||
extern "C" cudaError_t CONCAT_EXPAND(FIELD, ntt_cuda)(
|
||||
const scalar_t* input, int size, NTTDir dir, NTTConfig<scalar_t>& config, scalar_t* output)
|
||||
{
|
||||
return ntt<scalar_t, scalar_t>(input, size, dir, config, output);
|
||||
}
|
||||
|
||||
/**
|
||||
* Extern "C" version of [release_domain](@ref release_domain) function with the following values of template
|
||||
* parameters (where the field is given by `-DFIELD` env variable during build):
|
||||
* - `S` is the [field](@ref scalar_t) - either a scalar field of the elliptic curve or a
|
||||
* stand-alone "STARK field";
|
||||
* @return `cudaSuccess` if the execution was successful and an error code otherwise.
|
||||
*/
|
||||
extern "C" cudaError_t CONCAT_EXPAND(FIELD, release_domain)(device_context::DeviceContext& ctx)
|
||||
{
|
||||
return release_domain<scalar_t>(ctx);
|
||||
}
|
||||
|
||||
/**
|
||||
* Extern "C" version of [get_root_of_unity](@ref get_root_of_unity) function with the following
|
||||
* value of template parameter (where the field is given by `-DFIELD` env variable during build):
|
||||
* - `S` is the [field](@ref scalar_t) - either a scalar field of the elliptic curve or a
|
||||
* stand-alone "STARK field";
|
||||
*/
|
||||
extern "C" scalar_t CONCAT_EXPAND(FIELD, get_root_of_unity)(uint32_t logn)
|
||||
{
|
||||
return get_root_of_unity<scalar_t>(logn);
|
||||
}
|
||||
} // namespace ntt
|
||||
28
icicle/src/ntt/extern_ecntt.cpp
Normal file
@@ -0,0 +1,28 @@
|
||||
|
||||
#define FIELD_ID BN254
|
||||
#define CURVE_ID BN254
|
||||
#include "../../include/curves/curve_config.cuh"
|
||||
#include "../../include/fields/field_config.cuh"
|
||||
|
||||
using namespace curve_config;
|
||||
using namespace field_config;
|
||||
|
||||
#include "ntt.cpp"
|
||||
|
||||
#include "../../include/gpu-utils/device_context.cuh"
|
||||
#include "../../include/utils/utils.h"
|
||||
|
||||
namespace ntt {
|
||||
/**
|
||||
* Extern "C" version of [ntt](@ref ntt) function with the following values of template parameters
|
||||
* (where the curve is given by `-DCURVE` env variable during build):
|
||||
* - `S` is the [projective representation](@ref projective_t) of the curve (i.e. EC NTT is computed);
|
||||
* - `E` is the [scalar field](@ref scalar_t) of the curve;
|
||||
* @return `cudaSuccess` if the execution was successful and an error code otherwise.
|
||||
*/
|
||||
extern "C" cudaError_t CONCAT_EXPAND(CURVE, ecntt_cuda)(
|
||||
const projective_t* input, int size, NTTDir dir, NTTConfig<scalar_t>& config, projective_t* output)
|
||||
{
|
||||
return ntt<scalar_t, projective_t>(input, size, dir, config, output);
|
||||
}
|
||||
} // namespace ntt
|
||||
24
icicle/src/ntt/extern_extension.cpp
Normal file
@@ -0,0 +1,24 @@
|
||||
#define FIELD_ID BABY_BEAR
|
||||
#include "../../include/fields/field_config.cuh"
|
||||
|
||||
using namespace field_config;
|
||||
|
||||
#include "ntt.cpp"
|
||||
|
||||
#include "../../include/gpu-utils/device_context.cuh"
|
||||
#include "../../include/utils/utils.h"
|
||||
|
||||
namespace ntt {
|
||||
/**
|
||||
* Extern "C" version of [ntt](@ref ntt) function with the following values of template parameters
|
||||
* (where the field is given by `-DFIELD` env variable during build):
|
||||
* - `E` is the [field](@ref scalar_t);
|
||||
* - `S` is the [extension](@ref extension_t) of `E` of appropriate degree;
|
||||
* @return `cudaSuccess` if the execution was successful and an error code otherwise.
|
||||
*/
|
||||
extern "C" cudaError_t CONCAT_EXPAND(FIELD, extension_ntt_cuda)(
|
||||
const extension_t* input, int size, NTTDir dir, NTTConfig<scalar_t>& config, extension_t* output)
|
||||
{
|
||||
return ntt<scalar_t, extension_t>(input, size, dir, config, output);
|
||||
}
|
||||
} // namespace ntt
|
||||
1070
icicle/src/ntt/kernel_ntt.cpp
Normal file
File diff suppressed because it is too large
@@ -8,7 +8,7 @@ using namespace field_config;
|
||||
|
||||
namespace mxntt {
|
||||
|
||||
static inline __device__ uint32_t dig_rev(uint32_t num, uint32_t log_size, bool dit, bool fast_tw)
|
||||
static inline uint32_t dig_rev(uint32_t num, uint32_t log_size, bool dit, bool fast_tw)
|
||||
{
|
||||
uint32_t rev_num = 0, temp, dig_len;
|
||||
if (dit) {
|
||||
@@ -31,11 +31,11 @@ namespace mxntt {
|
||||
return rev_num;
|
||||
}
|
||||
|
||||
static inline __device__ uint32_t bit_rev(uint32_t num, uint32_t log_size) { return __brev(num) >> (32 - log_size); }
|
||||
static inline uint32_t bit_rev(uint32_t num, uint32_t log_size) { return __brev(num) >> (32 - log_size); }
|
||||
|
||||
enum eRevType { None, RevToMixedRev, MixedRevToRev, NaturalToMixedRev, NaturalToRev, MixedRevToNatural };
|
||||
|
||||
static __device__ uint32_t generalized_rev(uint32_t num, uint32_t log_size, bool dit, bool fast_tw, eRevType rev_type)
|
||||
static uint32_t generalized_rev(uint32_t num, uint32_t log_size, bool dit, bool fast_tw, eRevType rev_type)
|
||||
{
|
||||
switch (rev_type) {
|
||||
case eRevType::RevToMixedRev:
|
||||
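bit_rev above still calls the CUDA intrinsic __brev even though __device__ was stripped; below is a portable 32-bit reversal it could fall back to when the file is compiled as plain C++ (a standard swap-halves implementation, not taken from this repository, with illustrative names).

#include <cstdint>

static inline uint32_t brev32(uint32_t x)
{
  x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u); // swap adjacent bits
  x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u); // swap bit pairs
  x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu); // swap nibbles
  return (x << 24) | ((x & 0xFF00u) << 8) | ((x >> 8) & 0xFF00u) | (x >> 24); // swap bytes and halves
}

static inline uint32_t bit_rev_host(uint32_t num, uint32_t log_size) { return brev32(num) >> (32 - log_size); }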
@@ -134,14 +134,23 @@ namespace mxntt {
|
||||
int n_scalars,
|
||||
uint32_t log_size,
|
||||
eRevType rev_type,
|
||||
bool dit,
|
||||
bool fast_tw,
|
||||
E* out_vec)
|
||||
{
|
||||
int tid = blockDim.x * blockIdx.x + threadIdx.x;
|
||||
if (tid >= size * batch_size) return;
|
||||
int64_t scalar_id = (tid / columns_batch_size) % size;
|
||||
if (rev_type != eRevType::None)
|
||||
scalar_id = generalized_rev((tid / columns_batch_size) & ((1 << log_size) - 1), log_size, dit, false, rev_type);
|
||||
if (rev_type != eRevType::None) {
|
||||
// Note: when we multiply an in_vec that is mixed (by DIF (I)NTT), we want to shuffle the
|
||||
// scalars the same way (then multiply element-wise). This would be a DIT-digit-reverse shuffle. (this is
|
||||
// confusing but) BUT to avoid shuffling the scalars, we instead want to ask which element in the non-shuffled
|
||||
// vec is now placed at index tid, which is the opposite of a DIT-digit-reverse --> this is the DIF-digit-reverse.
|
||||
// Therefore we use the DIF-digit-reverse to know which element moved to index tid and use it to access the
|
||||
// corresponding element in scalars vec.
|
||||
const bool dif = rev_type == eRevType::NaturalToMixedRev;
|
||||
scalar_id =
|
||||
generalized_rev((tid / columns_batch_size) & ((1 << log_size) - 1), log_size, !dif, fast_tw, rev_type);
|
||||
}
|
||||
out_vec[tid] = *(scalar_vec + ((scalar_id * step) % n_scalars)) * in_vec[tid];
|
||||
}
|
||||
|
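A loop form of the index logic in the hunk above, to make the DIT/DIF reasoning in the new comment concrete. generalized_rev and eRevType are the ones defined earlier in this file; the parameters and bounds mirror the kernel, and the function name is illustrative.

#include <cstdint>

template <typename E, typename S>
void coset_mul_reordered_host(const E* in_vec, const S* scalar_vec, int size, int columns_batch_size,
                              int batch_size, int step, int n_scalars, uint32_t log_size,
                              bool fast_tw, eRevType rev_type, E* out_vec)
{
  for (int tid = 0; tid < size * batch_size; ++tid) {
    int64_t scalar_id = (tid / columns_batch_size) % size;
    if (rev_type != eRevType::None) {
      // The DIF digit-reverse tells us which un-shuffled scalar lands at index tid (see comment above).
      const bool dif = rev_type == eRevType::NaturalToMixedRev;
      scalar_id = generalized_rev((tid / columns_batch_size) & ((1 << log_size) - 1), log_size, !dif, fast_tw, rev_type);
    }
    out_vec[tid] = *(scalar_vec + ((scalar_id * step) % n_scalars)) * in_vec[tid];
  }
}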
||||
@@ -903,6 +912,7 @@ namespace mxntt {
|
||||
S* external_twiddles,
|
||||
S* internal_twiddles,
|
||||
S* basic_twiddles,
|
||||
S* linear_twiddle, // twiddles organized as [1,w,w^2,...] for coset-eval in fast-tw mode
|
||||
int ntt_size,
|
||||
int max_logn,
|
||||
int batch_size,
|
||||
@@ -958,8 +968,8 @@ namespace mxntt {
|
||||
if (is_on_coset && !is_inverse) {
|
||||
batch_elementwise_mul_with_reorder_kernel<<<NOF_BLOCKS, NOF_THREADS, 0, cuda_stream>>>(
|
||||
d_input, ntt_size, columns_batch, batch_size, columns_batch ? batch_size : 1,
|
||||
arbitrary_coset ? arbitrary_coset : external_twiddles, arbitrary_coset ? 1 : coset_gen_index, n_twiddles, logn,
|
||||
reverse_coset, dit, d_output);
|
||||
arbitrary_coset ? arbitrary_coset : linear_twiddle, arbitrary_coset ? 1 : coset_gen_index, n_twiddles, logn,
|
||||
reverse_coset, fast_tw, d_output);
|
||||
|
||||
d_input = d_output;
|
||||
}
|
||||
@@ -991,8 +1001,8 @@ namespace mxntt {
|
||||
if (is_on_coset && is_inverse) {
|
||||
batch_elementwise_mul_with_reorder_kernel<<<NOF_BLOCKS, NOF_THREADS, 0, cuda_stream>>>(
|
||||
d_output, ntt_size, columns_batch, batch_size, columns_batch ? batch_size : 1,
|
||||
arbitrary_coset ? arbitrary_coset : external_twiddles + n_twiddles, arbitrary_coset ? 1 : -coset_gen_index,
|
||||
n_twiddles, logn, reverse_coset, dit, d_output);
|
||||
arbitrary_coset ? arbitrary_coset : linear_twiddle + n_twiddles, arbitrary_coset ? 1 : -coset_gen_index,
|
||||
n_twiddles, logn, reverse_coset, fast_tw, d_output);
|
||||
}
|
||||
|
||||
return CHK_LAST();
|
||||
@@ -1021,6 +1031,8 @@ namespace mxntt {
|
||||
scalar_t* external_twiddles,
|
||||
scalar_t* internal_twiddles,
|
||||
scalar_t* basic_twiddles,
|
||||
scalar_t* linear_twiddles,
|
||||
|
||||
int ntt_size,
|
||||
int max_logn,
|
||||
int batch_size,
|
||||
@@ -1039,6 +1051,8 @@ namespace mxntt {
|
||||
scalar_t* external_twiddles,
|
||||
scalar_t* internal_twiddles,
|
||||
scalar_t* basic_twiddles,
|
||||
scalar_t* linear_twiddles,
|
||||
|
||||
int ntt_size,
|
||||
int max_logn,
|
||||
int batch_size,
|
||||
|
||||
338
icicle/src/ntt/ntt.cpp
Normal file
@@ -0,0 +1,338 @@
|
||||
|
||||
|
||||
#define FIELD_ID BN254
|
||||
#include "../../include/fields/field_config.cuh"
|
||||
|
||||
|
||||
using namespace field_config;
|
||||
|
||||
#include "../../include/ntt/ntt.cuh"
|
||||
|
||||
#include <unordered_map>
|
||||
#include <vector>
|
||||
#include <type_traits>
|
||||
|
||||
#include "../../include/gpu-utils/sharedmem.cuh"
|
||||
#include "../../include/utils/utils_kernels.cuh"
|
||||
#include "../../include/utils/utils.h"
|
||||
#include "../../include/ntt/ntt_impl.cuh"
|
||||
#include "../../include/gpu-utils/device_context.cuh"
|
||||
|
||||
#include <mutex>
|
||||
|
||||
#ifdef CURVE_ID
|
||||
#include "../../include/curves/curve_config.cuh"
|
||||
using namespace curve_config;
|
||||
#define IS_ECNTT std::is_same_v<E, projective_t>
|
||||
#else
|
||||
#define IS_ECNTT false
|
||||
#endif
|
||||
|
||||
namespace ntt {
|
||||
|
||||
namespace {
|
||||
// TODO: Set MAX THREADS based on GPU arch
|
||||
const uint32_t MAX_NUM_THREADS = 512; // TODO: hotfix - should be 1024, currently limits shared memory size
|
||||
const uint32_t MAX_THREADS_BATCH = 512;
|
||||
const uint32_t MAX_THREADS_BATCH_ECNTT =
|
||||
128; // TODO: hardcoded - allows (2^18 x 64) ECNTT for sm86, decrease this to allow larger ecntt length, batch
|
||||
// size limited by on-device memory
|
||||
const uint32_t MAX_SHARED_MEM_ELEMENT_SIZE = 32; // TODO: occupancy calculator, hardcoded for sm_86..sm_89
|
||||
const uint32_t MAX_SHARED_MEM = MAX_SHARED_MEM_ELEMENT_SIZE * MAX_NUM_THREADS;
|
||||
|
||||
template <typename E>
|
||||
void reverse_order_kernel(const E* arr, E* arr_reversed, uint32_t n, uint32_t logn, uint32_t batch_size)
|
||||
{
|
||||
return;
|
||||
}
|
||||
|
||||
/**
|
||||
* Bit-reverses a batch of input arrays out-of-place inside GPU.
|
||||
* for example: on input array ([a[0],a[1],a[2],a[3]], 4, 2) it returns
|
||||
* [a[0],a[3],a[2],a[1]] (elements at indices 3 and 1 switch places).
|
||||
* @param arr_in batch of arrays of some object of type T. Should be on GPU.
|
||||
* @param n length of `arr`.
|
||||
* @param logn log(n).
|
||||
* @param batch_size the size of the batch.
|
||||
* @param arr_out buffer of the same size as `arr_in` on the GPU to write the bit-permuted array into.
|
||||
*/
|
||||
template <typename E>
|
||||
void reverse_order_batch(
|
||||
const E* arr_in, uint32_t n, uint32_t logn, uint32_t batch_size, cudaStream_t stream, E* arr_out)
|
||||
{
|
||||
return;
|
||||
}
|
||||
|
||||
/**
|
||||
* Bit-reverses an input array out-of-place inside GPU.
|
||||
* for example: on array ([a[0],a[1],a[2],a[3]], 4, 2) it returns
|
||||
* [a[0],a[3],a[2],a[1]] (elements at indices 3 and 1 switch places).
|
||||
* @param arr_in array of some object of type T of size which is a power of 2. Should be on GPU.
|
||||
* @param n length of `arr`.
|
||||
* @param logn log(n).
|
||||
* @param arr_out buffer of the same size as `arr_in` on the GPU to write the bit-permuted array into.
|
||||
*/
|
||||
template <typename E>
|
||||
void reverse_order(const E* arr_in, uint32_t n, uint32_t logn, cudaStream_t stream, E* arr_out)
|
||||
{
|
||||
reverse_order_batch(arr_in, n, logn, 1, stream, arr_out);
|
||||
}
|
||||
|
||||
/**
|
||||
* Cooley-Tukey NTT.
|
||||
* NOTE! this function assumes that d_twiddles are located in the device memory.
|
||||
* @param arr_in input array of type E (elements).
|
||||
* @param n length of d_arr.
|
||||
* @param twiddles twiddle factors of type S (scalars) array allocated on the device memory (must be a power of 2).
|
||||
* @param n_twiddles length of twiddles, should be negative for intt.
|
||||
* @param max_task max count of parallel tasks.
|
||||
* @param s log2(n) loop index.
|
||||
* @param arr_out buffer for the output.
|
||||
*/
|
||||
template <typename E, typename S>
|
||||
void ntt_template_kernel_shared_rev(
|
||||
const E* __restrict__ arr_in,
|
||||
int n,
|
||||
const S* __restrict__ r_twiddles,
|
||||
int n_twiddles,
|
||||
int max_task,
|
||||
int ss,
|
||||
int logn,
|
||||
E* __restrict__ arr_out)
|
||||
{
|
||||
return;
|
||||
}
|
||||
|
||||
/**
|
||||
* Cooley-Tukey NTT.
|
||||
* NOTE! this function assumes that d_twiddles are located in the device memory.
|
||||
* @param arr_in input array of type E (elements).
|
||||
* @param n length of d_arr.
|
||||
* @param twiddles twiddle factors of type S (scalars) array allocated on the device memory (must be a power of 2).
|
||||
* @param n_twiddles length of twiddles, should be negative for intt.
|
||||
* @param max_task max count of parallel tasks.
|
||||
* @param s log2(n) loop index.
|
||||
* @param arr_out buffer for the output.
|
||||
*/
|
||||
template <typename E, typename S>
|
||||
void ntt_template_kernel_shared(
|
||||
const E* __restrict__ arr_in,
|
||||
int n,
|
||||
const S* __restrict__ r_twiddles,
|
||||
int n_twiddles,
|
||||
int max_task,
|
||||
int s,
|
||||
int logn,
|
||||
E* __restrict__ arr_out)
|
||||
{
|
||||
return;
|
||||
}
|
||||
|
||||
/**
|
||||
* Cooley-Tukey NTT.
|
||||
* NOTE! this function assumes that d_twiddles are located in the device memory.
|
||||
* @param arr input array of type E (elements).
|
||||
* @param n length of d_arr.
|
||||
* @param twiddles twiddle factors of type S (scalars) array allocated on the device memory (must be a power of 2).
|
||||
* @param n_twiddles length of twiddles, should be negative for intt.
|
||||
* @param max_task max count of parallel tasks.
|
||||
* @param s log2(n) loop index.
|
||||
*/
|
||||
template <typename E, typename S>
|
||||
void
|
||||
ntt_template_kernel(const E* arr_in, int n, S* twiddles, int n_twiddles, int max_task, int s, bool rev, E* arr_out)
|
||||
{
|
||||
return;
|
||||
}
|
||||
|
||||
/**
|
||||
* NTT/INTT inplace batch
|
||||
* Note: this function does not perform any bit-reverse permutations on its inputs or outputs.
|
||||
* @param d_input Input array
|
||||
* @param n Size of `d_input`
|
||||
* @param d_twiddles Twiddles
|
||||
* @param n_twiddles Size of `d_twiddles`
|
||||
* @param batch_size The size of the batch; the length of `d_inout` is `n` * `batch_size`.
|
||||
* @param inverse true for iNTT
|
||||
* @param coset should be array of length n or a nullptr if NTT is not computed on a coset
|
||||
* @param stream CUDA stream
|
||||
* @param is_async if false, perform sync of the supplied CUDA stream at the end of processing
|
||||
* @param d_output Output array
|
||||
*/
|
||||
template <typename E, typename S>
|
||||
cudaError_t ntt_inplace_batch_template(
|
||||
const E* d_input,
|
||||
int n,
|
||||
S* d_twiddles,
|
||||
int n_twiddles,
|
||||
int batch_size,
|
||||
int logn,
|
||||
bool inverse,
|
||||
bool dit,
|
||||
S* arbitrary_coset,
|
||||
int coset_gen_index,
|
||||
cudaStream_t stream,
|
||||
E* d_output)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
} // namespace
|
||||
|
||||
/**
|
||||
* @struct Domain
|
||||
* Struct containing information about the domain on which (i)NTT is evaluated i.e. twiddle factors.
|
||||
* Twiddle factors are private, static and can only be set using [init_domain](@ref init_domain) function.
|
||||
* The internal representation of twiddles is subject to change as the [NTT](@ref NTT) algorithm evolves.
|
||||
* @tparam S The type of twiddle factors \f$ \{ \omega^i \} \f$. Must be a field.
|
||||
*/
|
||||
template <typename S>
|
||||
class Domain
|
||||
{
|
||||
// Mutex for protecting access to the domain/device container array
|
||||
static inline std::mutex device_domain_mutex;
|
||||
// The domain-per-device container - assumption is init_domain is called once per device per program.
|
||||
|
||||
int max_size = 0;
|
||||
int max_log_size = 0;
|
||||
S* twiddles = nullptr;
|
||||
bool initialized = false; // protection for multi-threaded case
|
||||
std::unordered_map<S, int> coset_index = {};
|
||||
|
||||
S* internal_twiddles = nullptr; // required by mixed-radix NTT
|
||||
S* basic_twiddles = nullptr; // required by mixed-radix NTT
|
||||
|
||||
// mixed-radix NTT supports a fast-twiddle option at the cost of additional 4N memory (where N is max NTT size)
|
||||
S* fast_external_twiddles = nullptr; // required by mixed-radix NTT (fast-twiddles mode)
|
||||
S* fast_internal_twiddles = nullptr; // required by mixed-radix NTT (fast-twiddles mode)
|
||||
S* fast_basic_twiddles = nullptr; // required by mixed-radix NTT (fast-twiddles mode)
|
||||
S* fast_external_twiddles_inv = nullptr; // required by mixed-radix NTT (fast-twiddles mode)
|
||||
S* fast_internal_twiddles_inv = nullptr; // required by mixed-radix NTT (fast-twiddles mode)
|
||||
S* fast_basic_twiddles_inv = nullptr; // required by mixed-radix NTT (fast-twiddles mode)
|
||||
|
||||
public:
|
||||
template <typename U>
|
||||
friend cudaError_t init_domain(U primitive_root, device_context::DeviceContext& ctx, bool fast_tw);
|
||||
|
||||
template <typename U>
|
||||
friend cudaError_t release_domain(device_context::DeviceContext& ctx);
|
||||
|
||||
template <typename U>
|
||||
friend U get_root_of_unity(uint64_t logn, device_context::DeviceContext& ctx);
|
||||
|
||||
template <typename U>
|
||||
friend U get_root_of_unity_from_domain(uint64_t logn, device_context::DeviceContext& ctx);
|
||||
|
||||
template <typename U, typename E>
|
||||
friend cudaError_t ntt(const E* input, int size, NTTDir dir, NTTConfig<U>& config, E* output);
|
||||
};
|
||||
|
||||
template <typename S>
|
||||
// static inline Domain<S> domains_for_devices[device_context::MAX_DEVICES] = {};
|
||||
static inline Domain<S> domains_for_devices[1] = {};
|
||||
|
||||
template <typename S>
|
||||
cudaError_t init_domain(S primitive_root, device_context::DeviceContext& ctx, bool fast_twiddles_mode)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
|
||||
template <typename S>
|
||||
cudaError_t release_domain(device_context::DeviceContext& ctx)
|
||||
{
|
||||
return 0;
|
||||
}
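A hedged lifecycle sketch for the two functions above, mirroring the verification test further below: the domain is initialized once per device from a primitive root of sufficiently high order, used by any number of NTT calls, and released at shutdown. `ctx` is assumed to be a valid device_context::DeviceContext and the log-size 20 is arbitrary.

// caller-side sketch (assumes a DeviceContext `ctx` already exists)
const scalar_t root = scalar_t::omega(20);                // supports NTT sizes up to 2^20
ntt::init_domain(root, ctx, /*fast_twiddles_mode=*/true); // allocates twiddles on the device
// ... any number of ntt() / get_root_of_unity_from_domain() calls ...
ntt::release_domain<scalar_t>(ctx);                       // frees the per-device twiddles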
|
||||
|
||||
template <typename S>
|
||||
S get_root_of_unity(uint64_t max_size)
|
||||
{
|
||||
// ceil up
|
||||
const auto log_max_size = static_cast<uint32_t>(std::ceil(std::log2(max_size)));
|
||||
return S::omega(log_max_size);
|
||||
}
|
||||
// explicit instantiation to avoid having to include this file
|
||||
template scalar_t get_root_of_unity(uint64_t logn);
|
||||
|
||||
template <typename S>
|
||||
S get_root_of_unity_from_domain(uint64_t logn, device_context::DeviceContext& ctx)
|
||||
{
|
||||
Domain<S>& domain = domains_for_devices<S>[ctx.device_id];
|
||||
if (logn > domain.max_log_size) {
|
||||
std::ostringstream oss;
|
||||
oss << "NTT log_size=" << logn
|
||||
<< " is too large for the domain. Consider generating your domain with a higher order root of unity.\n";
|
||||
THROW_ICICLE_ERR(IcicleError_t::InvalidArgument, oss.str().c_str());
|
||||
}
|
||||
const size_t twiddles_idx = 1ULL << (domain.max_log_size - logn);
|
||||
return domain.twiddles[twiddles_idx];
|
||||
}
|
||||
// explicit instantiation to avoid having to include this file
|
||||
template scalar_t get_root_of_unity_from_domain(uint64_t logn, device_context::DeviceContext& ctx);
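A worked example of the indexing above (an observation, assuming `twiddles` is laid out as successive powers of a root w of order 2^max_log_size):

// max_log_size = 20, logn = 10:
//   twiddles_idx = 1ULL << (20 - 10) = 1024
//   twiddles[1024] = w^1024, whose order is 2^20 / 2^10 = 2^10 = n,
//   i.e. the returned element is a primitive n-th root of unity.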
|
||||
|
||||
template <typename S>
|
||||
static bool is_choosing_radix2_algorithm(int logn, int batch_size, const NTTConfig<S>& config)
|
||||
{
|
||||
const bool is_mixed_radix_alg_supported = (logn > 3 && logn != 7);
|
||||
if (!is_mixed_radix_alg_supported && config.columns_batch)
|
||||
throw IcicleError(IcicleError_t::InvalidArgument, "columns batch is not supported for given NTT size");
|
||||
const bool is_user_selected_radix2_alg = config.ntt_algorithm == NttAlgorithm::Radix2;
|
||||
const bool is_force_radix2 = !is_mixed_radix_alg_supported || is_user_selected_radix2_alg;
|
||||
if (is_force_radix2) return true;
|
||||
|
||||
const bool is_user_selected_mixed_radix_alg = config.ntt_algorithm == NttAlgorithm::MixedRadix;
|
||||
if (is_user_selected_mixed_radix_alg) return false;
|
||||
if (config.columns_batch) return false; // radix2 does not currently support columns batch mode.
|
||||
|
||||
// Heuristic to automatically select an algorithm
|
||||
// Note that generally the decision depends on {logn, batch, ordering, inverse, coset, in-place, coeff-field} and
|
||||
// the specific GPU.
|
||||
// the following heuristic is a simplification based on measurements. Users can try both and select the algorithm
|
||||
// based on the specific case via the 'NTTConfig.ntt_algorithm' field
|
||||
|
||||
if (logn >= 16) return false; // mixed-radix is typically faster in those cases
|
||||
if (logn <= 11) return true; // radix-2 is typically faster for batch<=256 in those cases
|
||||
const int log_batch = (int)log2(batch_size);
|
||||
return (logn + log_batch <= 18); // approximately the cutoff point where both algorithms perform equally
|
||||
}
|
||||
|
||||
template <typename S, typename E>
|
||||
cudaError_t radix2_ntt(
|
||||
const E* d_input,
|
||||
E* d_output,
|
||||
S* twiddles,
|
||||
int ntt_size,
|
||||
int max_size,
|
||||
int batch_size,
|
||||
bool is_inverse,
|
||||
Ordering ordering,
|
||||
S* arbitrary_coset,
|
||||
int coset_gen_index,
|
||||
cudaStream_t cuda_stream)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
|
||||
template <typename S, typename E>
|
||||
cudaError_t ntt(const E* input, int size, NTTDir dir, NTTConfig<S>& config, E* output)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
|
||||
template <typename S>
|
||||
NTTConfig<S> default_ntt_config(const device_context::DeviceContext& ctx)
|
||||
{
|
||||
NTTConfig<S> config = {
|
||||
ctx, // ctx
|
||||
S::one(), // coset_gen
|
||||
1, // batch_size
|
||||
false, // columns_batch
|
||||
Ordering::kNN, // ordering
|
||||
false, // are_inputs_on_device
|
||||
false, // are_outputs_on_device
|
||||
false, // is_async
|
||||
NttAlgorithm::Auto, // ntt_algorithm
|
||||
};
|
||||
return config;
|
||||
}
|
||||
// explicit instantiation to avoid having to include this file
|
||||
template NTTConfig<scalar_t> default_ntt_config(const device_context::DeviceContext& ctx);
|
||||
} // namespace ntt
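A hedged usage sketch of the config and entry point above. It mirrors the benchmark later in this diff and assumes a no-argument default_ntt_config overload bound to the default device context (as the benchmark uses), plus device buffers d_in/d_out allocated elsewhere:

auto cfg = ntt::default_ntt_config<scalar_t>();
cfg.batch_size = 16;
cfg.are_inputs_on_device = true;
cfg.are_outputs_on_device = true;
cfg.ntt_algorithm = ntt::NttAlgorithm::Auto; // or Radix2 / MixedRadix to override the heuristic
cudaError_t err = ntt::ntt(d_in, /*size=*/1 << 12, ntt::NTTDir::kForward, cfg, d_out);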
|
||||
@@ -717,8 +717,7 @@ namespace ntt {
|
||||
d_input, d_output, domain.twiddles, size, domain.max_size, batch_size, is_inverse, config.ordering, coset,
|
||||
coset_index, stream));
|
||||
} else {
|
||||
const bool is_on_coset = (coset_index != 0) || coset;
|
||||
const bool is_fast_twiddles_enabled = (domain.fast_external_twiddles != nullptr) && !is_on_coset;
|
||||
const bool is_fast_twiddles_enabled = (domain.fast_external_twiddles != nullptr);
|
||||
S* twiddles = is_fast_twiddles_enabled
|
||||
? (is_inverse ? domain.fast_external_twiddles_inv : domain.fast_external_twiddles)
|
||||
: domain.twiddles;
|
||||
@@ -728,9 +727,11 @@ namespace ntt {
|
||||
S* basic_twiddles = is_fast_twiddles_enabled
|
||||
? (is_inverse ? domain.fast_basic_twiddles_inv : domain.fast_basic_twiddles)
|
||||
: domain.basic_twiddles;
|
||||
S* linear_twiddles = domain.twiddles; // twiddles organized as [1,w,w^2,...]
|
||||
CHK_IF_RETURN(mxntt::mixed_radix_ntt(
|
||||
d_input, d_output, twiddles, internal_twiddles, basic_twiddles, size, domain.max_log_size, batch_size,
|
||||
config.columns_batch, is_inverse, is_fast_twiddles_enabled, config.ordering, coset, coset_index, stream));
|
||||
d_input, d_output, twiddles, internal_twiddles, basic_twiddles, linear_twiddles, size, domain.max_log_size,
|
||||
batch_size, config.columns_batch, is_inverse, is_fast_twiddles_enabled, config.ordering, coset, coset_index,
|
||||
stream));
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
icicle/src/ntt/tests/verification.cpp (new file, 46 lines)
@@ -0,0 +1,46 @@
|
||||
#include "../../../include/fields/id.h"
|
||||
#define FIELD_ID BN254
|
||||
|
||||
#ifdef ECNTT
|
||||
#define CURVE_ID BN254
|
||||
#include "../../../include/curves/curve_config.cuh"
|
||||
typedef field_config::scalar_t test_scalar;
|
||||
typedef curve_config::projective_t test_data;
|
||||
#else
|
||||
#include "../../../include/fields/field_config.cuh"
|
||||
typedef field_config::scalar_t test_scalar;
|
||||
typedef field_config::scalar_t test_data;
|
||||
#endif
|
||||
|
||||
#include "../../../include/fields/field.cuh"
|
||||
#include "../../../include/curves/projective.cuh"
|
||||
#include <chrono>
|
||||
#include <iostream>
|
||||
#include <vector>
|
||||
|
||||
#include "../ntt.cpp"
|
||||
// #include "../kernel_ntt.cpp"
|
||||
#include <memory>
|
||||
|
||||
void random_samples(test_data* res, uint32_t count)
|
||||
{
|
||||
for (int i = 0; i < count; i++)
|
||||
res[i] = i < 1000 ? test_data::rand_host() : res[i - 1000];
|
||||
}
|
||||
|
||||
void incremental_values(test_scalar* res, uint32_t count)
|
||||
{
|
||||
for (int i = 0; i < count; i++) {
|
||||
res[i] = i ? res[i - 1] + test_scalar::one() : test_scalar::zero();
|
||||
}
|
||||
}
|
||||
|
||||
void transpose_batch(test_scalar* in, test_scalar* out, int row_size, int column_size)
|
||||
{
|
||||
return;
|
||||
}
|
||||
|
||||
int main(int argc, char** argv)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
@@ -1,207 +0,0 @@
|
||||
#include "fields/id.h"
|
||||
#define FIELD_ID BN254
|
||||
|
||||
#ifdef ECNTT
|
||||
#define CURVE_ID BN254
|
||||
#include "curves/curve_config.cuh"
|
||||
typedef field_config::scalar_t test_scalar;
|
||||
typedef curve_config::projective_t test_data;
|
||||
#else
|
||||
#include "fields/field_config.cuh"
|
||||
typedef field_config::scalar_t test_scalar;
|
||||
typedef field_config::scalar_t test_data;
|
||||
#endif
|
||||
|
||||
#include "fields/field.cuh"
|
||||
#include "curves/projective.cuh"
|
||||
#include <chrono>
|
||||
#include <iostream>
|
||||
#include <vector>
|
||||
|
||||
#include "ntt.cu"
|
||||
#include "kernel_ntt.cu"
|
||||
#include <memory>
|
||||
|
||||
void random_samples(test_data* res, uint32_t count)
|
||||
{
|
||||
for (int i = 0; i < count; i++)
|
||||
res[i] = i < 1000 ? test_data::rand_host() : res[i - 1000];
|
||||
}
|
||||
|
||||
void incremental_values(test_scalar* res, uint32_t count)
|
||||
{
|
||||
for (int i = 0; i < count; i++) {
|
||||
res[i] = i ? res[i - 1] + test_scalar::one() : test_scalar::zero();
|
||||
}
|
||||
}
|
||||
|
||||
__global__ void transpose_batch(test_scalar* in, test_scalar* out, int row_size, int column_size)
|
||||
{
|
||||
int tid = blockDim.x * blockIdx.x + threadIdx.x;
|
||||
if (tid >= row_size * column_size) return;
|
||||
out[(tid % row_size) * column_size + (tid / row_size)] = in[tid];
|
||||
}
|
||||
|
||||
int main(int argc, char** argv)
|
||||
{
|
||||
cudaEvent_t icicle_start, icicle_stop, new_start, new_stop;
|
||||
float icicle_time, new_time;
|
||||
|
||||
int NTT_LOG_SIZE = (argc > 1) ? atoi(argv[1]) : 19;
|
||||
int NTT_SIZE = 1 << NTT_LOG_SIZE;
|
||||
bool INPLACE = (argc > 2) ? atoi(argv[2]) : false;
|
||||
int INV = (argc > 3) ? atoi(argv[3]) : false;
|
||||
int BATCH_SIZE = (argc > 4) ? atoi(argv[4]) : 150;
|
||||
bool COLUMNS_BATCH = (argc > 5) ? atoi(argv[5]) : false;
|
||||
int COSET_IDX = (argc > 6) ? atoi(argv[6]) : 2;
|
||||
const ntt::Ordering ordering = (argc > 7) ? ntt::Ordering(atoi(argv[7])) : ntt::Ordering::kNN;
|
||||
bool FAST_TW = (argc > 8) ? atoi(argv[8]) : true;
|
||||
|
||||
// Note: NM, MN are not expected to be equal when comparing mixed-radix and radix-2 NTTs
|
||||
const char* ordering_str = ordering == ntt::Ordering::kNN ? "NN"
|
||||
: ordering == ntt::Ordering::kNR ? "NR"
|
||||
: ordering == ntt::Ordering::kRN ? "RN"
|
||||
: ordering == ntt::Ordering::kRR ? "RR"
|
||||
: ordering == ntt::Ordering::kNM ? "NM"
|
||||
: "MN";
|
||||
|
||||
printf(
|
||||
"running ntt 2^%d, inplace=%d, inverse=%d, batch_size=%d, columns_batch=%d coset-idx=%d, ordering=%s, fast_tw=%d\n",
|
||||
NTT_LOG_SIZE, INPLACE, INV, BATCH_SIZE, COLUMNS_BATCH, COSET_IDX, ordering_str, FAST_TW);
|
||||
|
||||
CHK_IF_RETURN(cudaFree(nullptr)); // init GPU context (warmup)
|
||||
|
||||
// init domain
|
||||
auto ntt_config = ntt::default_ntt_config<test_scalar>();
|
||||
ntt_config.ordering = ordering;
|
||||
ntt_config.are_inputs_on_device = true;
|
||||
ntt_config.are_outputs_on_device = true;
|
||||
ntt_config.batch_size = BATCH_SIZE;
|
||||
ntt_config.columns_batch = COLUMNS_BATCH;
|
||||
|
||||
CHK_IF_RETURN(cudaEventCreate(&icicle_start));
|
||||
CHK_IF_RETURN(cudaEventCreate(&icicle_stop));
|
||||
CHK_IF_RETURN(cudaEventCreate(&new_start));
|
||||
CHK_IF_RETURN(cudaEventCreate(&new_stop));
|
||||
|
||||
auto start = std::chrono::high_resolution_clock::now();
|
||||
const scalar_t basic_root = test_scalar::omega(NTT_LOG_SIZE);
|
||||
ntt::init_domain(basic_root, ntt_config.ctx, FAST_TW);
|
||||
auto stop = std::chrono::high_resolution_clock::now();
|
||||
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
|
||||
std::cout << "initDomain took: " << duration / 1000 << " MS" << std::endl;
|
||||
|
||||
// cpu allocation
|
||||
auto CpuScalars = std::make_unique<test_data[]>(NTT_SIZE * BATCH_SIZE);
|
||||
auto CpuOutputOld = std::make_unique<test_data[]>(NTT_SIZE * BATCH_SIZE);
|
||||
auto CpuOutputNew = std::make_unique<test_data[]>(NTT_SIZE * BATCH_SIZE);
|
||||
|
||||
// gpu allocation
|
||||
scalar_t *GpuScalars, *GpuOutputOld, *GpuOutputNew;
|
||||
scalar_t* GpuScalarsTransposed;
|
||||
CHK_IF_RETURN(cudaMalloc(&GpuScalars, sizeof(test_data) * NTT_SIZE * BATCH_SIZE));
|
||||
CHK_IF_RETURN(cudaMalloc(&GpuScalarsTransposed, sizeof(test_data) * NTT_SIZE * BATCH_SIZE));
|
||||
CHK_IF_RETURN(cudaMalloc(&GpuOutputOld, sizeof(test_data) * NTT_SIZE * BATCH_SIZE));
|
||||
CHK_IF_RETURN(cudaMalloc(&GpuOutputNew, sizeof(test_data) * NTT_SIZE * BATCH_SIZE));
|
||||
|
||||
// init inputs
|
||||
// incremental_values(CpuScalars.get(), NTT_SIZE * BATCH_SIZE);
|
||||
random_samples(CpuScalars.get(), NTT_SIZE * BATCH_SIZE);
|
||||
CHK_IF_RETURN(
|
||||
cudaMemcpy(GpuScalars, CpuScalars.get(), NTT_SIZE * BATCH_SIZE * sizeof(test_data), cudaMemcpyHostToDevice));
|
||||
|
||||
if (COLUMNS_BATCH) {
|
||||
transpose_batch<<<(NTT_SIZE * BATCH_SIZE + 256 - 1) / 256, 256>>>(
|
||||
GpuScalars, GpuScalarsTransposed, NTT_SIZE, BATCH_SIZE);
|
||||
}
|
||||
|
||||
// inplace
|
||||
if (INPLACE) {
|
||||
CHK_IF_RETURN(cudaMemcpy(
|
||||
GpuOutputNew, COLUMNS_BATCH ? GpuScalarsTransposed : GpuScalars, NTT_SIZE * BATCH_SIZE * sizeof(test_data),
|
||||
cudaMemcpyDeviceToDevice));
|
||||
}
|
||||
|
||||
for (int coset_idx = 0; coset_idx < COSET_IDX; ++coset_idx) {
|
||||
ntt_config.coset_gen = ntt_config.coset_gen * basic_root;
|
||||
}
|
||||
|
||||
auto benchmark = [&](bool is_print, int iterations) -> cudaError_t {
|
||||
// NEW
|
||||
CHK_IF_RETURN(cudaEventRecord(new_start, ntt_config.ctx.stream));
|
||||
ntt_config.ntt_algorithm = ntt::NttAlgorithm::MixedRadix;
|
||||
for (size_t i = 0; i < iterations; i++) {
|
||||
CHK_IF_RETURN(ntt::ntt(
|
||||
INPLACE ? GpuOutputNew
|
||||
: COLUMNS_BATCH ? GpuScalarsTransposed
|
||||
: GpuScalars,
|
||||
NTT_SIZE, INV ? ntt::NTTDir::kInverse : ntt::NTTDir::kForward, ntt_config, GpuOutputNew));
|
||||
}
|
||||
CHK_IF_RETURN(cudaEventRecord(new_stop, ntt_config.ctx.stream));
|
||||
CHK_IF_RETURN(cudaStreamSynchronize(ntt_config.ctx.stream));
|
||||
CHK_IF_RETURN(cudaEventElapsedTime(&new_time, new_start, new_stop));
|
||||
|
||||
// OLD
|
||||
CHK_IF_RETURN(cudaEventRecord(icicle_start, ntt_config.ctx.stream));
|
||||
ntt_config.ntt_algorithm = ntt::NttAlgorithm::Radix2;
|
||||
for (size_t i = 0; i < iterations; i++) {
|
||||
CHK_IF_RETURN(
|
||||
ntt::ntt(GpuScalars, NTT_SIZE, INV ? ntt::NTTDir::kInverse : ntt::NTTDir::kForward, ntt_config, GpuOutputOld));
|
||||
}
|
||||
CHK_IF_RETURN(cudaEventRecord(icicle_stop, ntt_config.ctx.stream));
|
||||
CHK_IF_RETURN(cudaStreamSynchronize(ntt_config.ctx.stream));
|
||||
CHK_IF_RETURN(cudaEventElapsedTime(&icicle_time, icicle_start, icicle_stop));
|
||||
|
||||
if (is_print) {
|
||||
printf("Old Runtime=%0.3f MS\n", icicle_time / iterations);
|
||||
printf("New Runtime=%0.3f MS\n", new_time / iterations);
|
||||
}
|
||||
|
||||
return CHK_LAST();
|
||||
};
|
||||
|
||||
CHK_IF_RETURN(benchmark(false /*=print*/, 1)); // warmup
|
||||
int count = INPLACE ? 1 : 10;
|
||||
if (INPLACE) {
|
||||
CHK_IF_RETURN(cudaMemcpy(
|
||||
GpuOutputNew, COLUMNS_BATCH ? GpuScalarsTransposed : GpuScalars, NTT_SIZE * BATCH_SIZE * sizeof(test_data),
|
||||
cudaMemcpyDeviceToDevice));
|
||||
}
|
||||
CHK_IF_RETURN(benchmark(true /*=print*/, count));
|
||||
|
||||
if (COLUMNS_BATCH) {
|
||||
transpose_batch<<<(NTT_SIZE * BATCH_SIZE + 256 - 1) / 256, 256>>>(
|
||||
GpuOutputNew, GpuScalarsTransposed, BATCH_SIZE, NTT_SIZE);
|
||||
CHK_IF_RETURN(cudaMemcpy(
|
||||
GpuOutputNew, GpuScalarsTransposed, NTT_SIZE * BATCH_SIZE * sizeof(test_data), cudaMemcpyDeviceToDevice));
|
||||
}
|
||||
|
||||
// verify
|
||||
CHK_IF_RETURN(
|
||||
cudaMemcpy(CpuOutputNew.get(), GpuOutputNew, NTT_SIZE * BATCH_SIZE * sizeof(test_data), cudaMemcpyDeviceToHost));
|
||||
CHK_IF_RETURN(
|
||||
cudaMemcpy(CpuOutputOld.get(), GpuOutputOld, NTT_SIZE * BATCH_SIZE * sizeof(test_data), cudaMemcpyDeviceToHost));
|
||||
|
||||
bool success = true;
|
||||
for (int i = 0; i < NTT_SIZE * BATCH_SIZE; i++) {
|
||||
// if (i%64==0) printf("\n");
|
||||
if (CpuOutputNew[i] != CpuOutputOld[i]) {
|
||||
success = false;
|
||||
// std::cout << i << " ref " << CpuOutputOld[i] << " != " << CpuOutputNew[i] << std::endl;
|
||||
// break;
|
||||
} else {
|
||||
// std::cout << i << " ref " << CpuOutputOld[i] << " == " << CpuOutputNew[i] << std::endl;
|
||||
// break;
|
||||
}
|
||||
}
|
||||
const char* success_str = success ? "SUCCESS!" : "FAIL!";
|
||||
printf("%s\n", success_str);
|
||||
|
||||
CHK_IF_RETURN(cudaFree(GpuScalars));
|
||||
CHK_IF_RETURN(cudaFree(GpuOutputOld));
|
||||
CHK_IF_RETURN(cudaFree(GpuOutputNew));
|
||||
|
||||
ntt::release_domain<test_scalar>(ntt_config.ctx);
|
||||
|
||||
return CHK_LAST();
|
||||
}
|
||||
@@ -1,721 +0,0 @@
|
||||
#ifndef T_NTT
|
||||
#define T_NTT
|
||||
#pragma once
|
||||
|
||||
#include <stdio.h>
|
||||
#include <stdint.h>
|
||||
#include "gpu-utils/modifiers.cuh"
|
||||
|
||||
struct stage_metadata {
|
||||
uint32_t th_stride;
|
||||
uint32_t ntt_block_size;
|
||||
uint32_t batch_id;
|
||||
uint32_t ntt_block_id;
|
||||
uint32_t ntt_inp_id;
|
||||
};
|
||||
|
||||
#define STAGE_SIZES_DATA \
|
||||
{ \
|
||||
{0, 0, 0, 0, 0}, {0, 0, 0, 0, 0}, {0, 0, 0, 0, 0}, {0, 0, 0, 0, 0}, {4, 0, 0, 0, 0}, {5, 0, 0, 0, 0}, \
|
||||
{6, 0, 0, 0, 0}, {0, 0, 0, 0, 0}, {4, 4, 0, 0, 0}, {5, 4, 0, 0, 0}, {5, 5, 0, 0, 0}, {6, 5, 0, 0, 0}, \
|
||||
{6, 6, 0, 0, 0}, {4, 5, 4, 0, 0}, {4, 6, 4, 0, 0}, {5, 5, 5, 0, 0}, {6, 4, 6, 0, 0}, {6, 5, 6, 0, 0}, \
|
||||
{6, 6, 6, 0, 0}, {6, 5, 4, 4, 0}, {5, 5, 5, 5, 0}, {6, 5, 5, 5, 0}, {6, 5, 5, 6, 0}, {6, 6, 6, 5, 0}, \
|
||||
{6, 6, 6, 6, 0}, {5, 5, 5, 5, 5}, {6, 5, 4, 5, 6}, {6, 5, 5, 5, 6}, {6, 5, 6, 5, 6}, {6, 6, 5, 6, 6}, \
|
||||
{6, 6, 6, 6, 6}, \
|
||||
}
|
||||
uint32_t constexpr STAGE_SIZES_HOST[31][5] = STAGE_SIZES_DATA;
|
||||
__device__ constexpr uint32_t STAGE_SIZES_DEVICE[31][5] = STAGE_SIZES_DATA;
|
||||
|
||||
// construction for fast-twiddles
|
||||
uint32_t constexpr STAGE_PREV_SIZES[31] = {0, 0, 0, 0, 0, 0, 0, 0, 4, 5, 5, 6, 6, 9, 9, 10,
|
||||
11, 11, 12, 15, 15, 16, 16, 18, 18, 20, 21, 21, 22, 23, 24};
|
||||
|
||||
#define STAGE_SIZES_DATA_FAST_TW \
|
||||
{ \
|
||||
{0, 0, 0, 0, 0}, {0, 0, 0, 0, 0}, {0, 0, 0, 0, 0}, {0, 0, 0, 0, 0}, {4, 0, 0, 0, 0}, {5, 0, 0, 0, 0}, \
|
||||
{6, 0, 0, 0, 0}, {0, 0, 0, 0, 0}, {4, 4, 0, 0, 0}, {5, 4, 0, 0, 0}, {5, 5, 0, 0, 0}, {6, 5, 0, 0, 0}, \
|
||||
{6, 6, 0, 0, 0}, {5, 4, 4, 0, 0}, {5, 4, 5, 0, 0}, {5, 5, 5, 0, 0}, {6, 5, 5, 0, 0}, {6, 5, 6, 0, 0}, \
|
||||
{6, 6, 6, 0, 0}, {5, 5, 5, 4, 0}, {5, 5, 5, 5, 0}, {6, 5, 5, 5, 0}, {6, 5, 5, 6, 0}, {6, 6, 6, 5, 0}, \
|
||||
{6, 6, 6, 6, 0}, {5, 5, 5, 5, 5}, {6, 5, 5, 5, 5}, {6, 5, 5, 5, 6}, {6, 5, 5, 6, 6}, {6, 6, 6, 5, 6}, \
|
||||
{6, 6, 6, 6, 6}, \
|
||||
}
|
||||
uint32_t constexpr STAGE_SIZES_HOST_FT[31][5] = STAGE_SIZES_DATA_FAST_TW;
|
||||
__device__ uint32_t constexpr STAGE_SIZES_DEVICE_FT[31][5] = STAGE_SIZES_DATA_FAST_TW;
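How to read the two tables above (an observation from the data, not stated elsewhere in this view):

// Row `logn` lists the log-sizes of the sub-NTT stages used to decompose a
// 2^logn transform (4 -> 16-point, 5 -> 32-point, 6 -> 64-point); the entries
// of each row sum to logn, and all-zero rows mark unsupported sizes (logn < 4
// and logn == 7). For example, row 19 of the fast-twiddles table is
// {5, 5, 5, 4, 0} with 5 + 5 + 5 + 4 = 19, and STAGE_PREV_SIZES[19] = 15 is
// the sum of all stages but the last.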
|
||||
|
||||
template <typename E, typename S>
|
||||
class NTTEngine
|
||||
{
|
||||
public:
|
||||
E X[8];
|
||||
S WB[3];
|
||||
S WI[7];
|
||||
S WE[8];
|
||||
|
||||
DEVICE_INLINE void loadBasicTwiddles(S* basic_twiddles)
|
||||
{
|
||||
UNROLL
|
||||
for (int i = 0; i < 3; i++) {
|
||||
WB[i] = basic_twiddles[i];
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void loadBasicTwiddlesGeneric(S* basic_twiddles, bool inv)
|
||||
{
|
||||
UNROLL
|
||||
for (int i = 0; i < 3; i++) {
|
||||
WB[i] = basic_twiddles[inv ? i + 3 : i];
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void loadInternalTwiddles64(S* data, bool stride)
|
||||
{
|
||||
UNROLL
|
||||
for (int i = 0; i < 7; i++) {
|
||||
WI[i] = data[((stride ? (threadIdx.x >> 3) : (threadIdx.x)) & 0x7) * (i + 1)];
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void loadInternalTwiddles32(S* data, bool stride)
|
||||
{
|
||||
UNROLL
|
||||
for (int i = 0; i < 7; i++) {
|
||||
WI[i] = data[2 * ((stride ? (threadIdx.x >> 4) : (threadIdx.x)) & 0x3) * (i + 1)];
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void loadInternalTwiddles16(S* data, bool stride)
|
||||
{
|
||||
UNROLL
|
||||
for (int i = 0; i < 7; i++) {
|
||||
WI[i] = data[4 * ((stride ? (threadIdx.x >> 5) : (threadIdx.x)) & 0x1) * (i + 1)];
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void loadInternalTwiddlesGeneric64(S* data, bool stride, bool inv)
|
||||
{
|
||||
UNROLL
|
||||
for (int i = 0; i < 7; i++) {
|
||||
uint32_t exp = ((stride ? (threadIdx.x >> 3) : (threadIdx.x)) & 0x7) * (i + 1);
|
||||
WI[i] = data[(inv && exp) ? 64 - exp : exp]; // if exp = 0 we also take exp and not 64-exp
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void loadInternalTwiddlesGeneric32(S* data, bool stride, bool inv)
|
||||
{
|
||||
UNROLL
|
||||
for (int i = 0; i < 7; i++) {
|
||||
uint32_t exp = 2 * ((stride ? (threadIdx.x >> 4) : (threadIdx.x)) & 0x3) * (i + 1);
|
||||
WI[i] = data[(inv && exp) ? 64 - exp : exp];
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void loadInternalTwiddlesGeneric16(S* data, bool stride, bool inv)
|
||||
{
|
||||
UNROLL
|
||||
for (int i = 0; i < 7; i++) {
|
||||
uint32_t exp = 4 * ((stride ? (threadIdx.x >> 5) : (threadIdx.x)) & 0x1) * (i + 1);
|
||||
WI[i] = data[(inv && exp) ? 64 - exp : exp];
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void loadExternalTwiddles64(S* data, uint32_t tw_order, uint32_t tw_log_order, stage_metadata s_meta)
|
||||
{
|
||||
data += tw_order * s_meta.ntt_inp_id + (s_meta.ntt_block_id & (tw_order - 1));
|
||||
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 8; i++) {
|
||||
WE[i] = data[8 * i * tw_order + (1 << tw_log_order + 6) - 1];
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void loadExternalTwiddles32(S* data, uint32_t tw_order, uint32_t tw_log_order, stage_metadata s_meta)
|
||||
{
|
||||
data += tw_order * s_meta.ntt_inp_id * 2 + (s_meta.ntt_block_id & (tw_order - 1));
|
||||
|
||||
UNROLL
|
||||
for (uint32_t j = 0; j < 2; j++) {
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 4; i++) {
|
||||
WE[4 * j + i] = data[(8 * i + j) * tw_order + (1 << tw_log_order + 5) - 1];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void loadExternalTwiddles16(S* data, uint32_t tw_order, uint32_t tw_log_order, stage_metadata s_meta)
|
||||
{
|
||||
data += tw_order * s_meta.ntt_inp_id * 4 + (s_meta.ntt_block_id & (tw_order - 1));
|
||||
|
||||
UNROLL
|
||||
for (uint32_t j = 0; j < 4; j++) {
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 2; i++) {
|
||||
WE[2 * j + i] = data[(8 * i + j) * tw_order + (1 << tw_log_order + 4) - 1];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void loadExternalTwiddlesGeneric64(
|
||||
S* data, uint32_t tw_order, uint32_t tw_log_order, stage_metadata s_meta, uint32_t tw_log_size, bool inv)
|
||||
{
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 8; i++) {
|
||||
uint32_t exp = (s_meta.ntt_inp_id + 8 * i) * (s_meta.ntt_block_id & (tw_order - 1))
|
||||
<< (tw_log_size - tw_log_order - 6);
|
||||
WE[i] = data[(inv && exp) ? ((1 << tw_log_size) - exp) : exp];
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void loadExternalTwiddlesGeneric32(
|
||||
S* data, uint32_t tw_order, uint32_t tw_log_order, stage_metadata s_meta, uint32_t tw_log_size, bool inv)
|
||||
{
|
||||
UNROLL
|
||||
for (uint32_t j = 0; j < 2; j++) {
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 4; i++) {
|
||||
uint32_t exp = (s_meta.ntt_inp_id * 2 + 8 * i + j) * (s_meta.ntt_block_id & (tw_order - 1))
|
||||
<< (tw_log_size - tw_log_order - 5);
|
||||
WE[4 * j + i] = data[(inv && exp) ? ((1 << tw_log_size) - exp) : exp];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void loadExternalTwiddlesGeneric16(
|
||||
S* data, uint32_t tw_order, uint32_t tw_log_order, stage_metadata s_meta, uint32_t tw_log_size, bool inv)
|
||||
{
|
||||
UNROLL
|
||||
for (uint32_t j = 0; j < 4; j++) {
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 2; i++) {
|
||||
uint32_t exp = (s_meta.ntt_inp_id * 4 + 8 * i + j) * (s_meta.ntt_block_id & (tw_order - 1))
|
||||
<< (tw_log_size - tw_log_order - 4);
|
||||
WE[2 * j + i] = data[(inv && exp) ? ((1 << tw_log_size) - exp) : exp];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void
|
||||
loadGlobalData(const E* data, uint32_t data_stride, uint32_t log_data_stride, bool strided, stage_metadata s_meta)
|
||||
{
|
||||
if (strided) {
|
||||
data += (s_meta.ntt_block_id & (data_stride - 1)) + data_stride * s_meta.ntt_inp_id +
|
||||
(s_meta.ntt_block_id >> log_data_stride) * data_stride * s_meta.ntt_block_size;
|
||||
} else {
|
||||
data += s_meta.ntt_block_id * s_meta.ntt_block_size + s_meta.ntt_inp_id;
|
||||
}
|
||||
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 8; i++) {
|
||||
X[i] = data[s_meta.th_stride * i * data_stride];
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void loadGlobalDataColumnBatch(
|
||||
const E* data, uint32_t data_stride, uint32_t log_data_stride, stage_metadata s_meta, uint32_t batch_size)
|
||||
{
|
||||
data += ((s_meta.ntt_block_id & (data_stride - 1)) + data_stride * s_meta.ntt_inp_id +
|
||||
(s_meta.ntt_block_id >> log_data_stride) * data_stride * s_meta.ntt_block_size) *
|
||||
batch_size +
|
||||
s_meta.batch_id;
|
||||
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 8; i++) {
|
||||
X[i] = data[s_meta.th_stride * i * data_stride * batch_size];
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void
|
||||
storeGlobalData(E* data, uint32_t data_stride, uint32_t log_data_stride, bool strided, stage_metadata s_meta)
|
||||
{
|
||||
if (strided) {
|
||||
data += (s_meta.ntt_block_id & (data_stride - 1)) + data_stride * s_meta.ntt_inp_id +
|
||||
(s_meta.ntt_block_id >> log_data_stride) * data_stride * s_meta.ntt_block_size;
|
||||
} else {
|
||||
data += s_meta.ntt_block_id * s_meta.ntt_block_size + s_meta.ntt_inp_id;
|
||||
}
|
||||
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 8; i++) {
|
||||
data[s_meta.th_stride * i * data_stride] = X[i];
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void storeGlobalDataColumnBatch(
|
||||
E* data, uint32_t data_stride, uint32_t log_data_stride, stage_metadata s_meta, uint32_t batch_size)
|
||||
{
|
||||
data += ((s_meta.ntt_block_id & (data_stride - 1)) + data_stride * s_meta.ntt_inp_id +
|
||||
(s_meta.ntt_block_id >> log_data_stride) * data_stride * s_meta.ntt_block_size) *
|
||||
batch_size +
|
||||
s_meta.batch_id;
|
||||
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 8; i++) {
|
||||
data[s_meta.th_stride * i * data_stride * batch_size] = X[i];
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void
|
||||
loadGlobalData32(const E* data, uint32_t data_stride, uint32_t log_data_stride, bool strided, stage_metadata s_meta)
|
||||
{
|
||||
if (strided) {
|
||||
data += (s_meta.ntt_block_id & (data_stride - 1)) + data_stride * s_meta.ntt_inp_id * 2 +
|
||||
(s_meta.ntt_block_id >> log_data_stride) * data_stride * s_meta.ntt_block_size;
|
||||
} else {
|
||||
data += s_meta.ntt_block_id * s_meta.ntt_block_size + s_meta.ntt_inp_id * 2;
|
||||
}
|
||||
|
||||
UNROLL
|
||||
for (uint32_t j = 0; j < 2; j++) {
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 4; i++) {
|
||||
X[4 * j + i] = data[(8 * i + j) * data_stride];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void loadGlobalData32ColumnBatch(
|
||||
const E* data, uint32_t data_stride, uint32_t log_data_stride, stage_metadata s_meta, uint32_t batch_size)
|
||||
{
|
||||
data += ((s_meta.ntt_block_id & (data_stride - 1)) + data_stride * s_meta.ntt_inp_id * 2 +
|
||||
(s_meta.ntt_block_id >> log_data_stride) * data_stride * s_meta.ntt_block_size) *
|
||||
batch_size +
|
||||
s_meta.batch_id;
|
||||
|
||||
UNROLL
|
||||
for (uint32_t j = 0; j < 2; j++) {
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 4; i++) {
|
||||
X[4 * j + i] = data[(8 * i + j) * data_stride * batch_size];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void
|
||||
storeGlobalData32(E* data, uint32_t data_stride, uint32_t log_data_stride, bool strided, stage_metadata s_meta)
|
||||
{
|
||||
if (strided) {
|
||||
data += (s_meta.ntt_block_id & (data_stride - 1)) + data_stride * s_meta.ntt_inp_id * 2 +
|
||||
(s_meta.ntt_block_id >> log_data_stride) * data_stride * s_meta.ntt_block_size;
|
||||
} else {
|
||||
data += s_meta.ntt_block_id * s_meta.ntt_block_size + s_meta.ntt_inp_id * 2;
|
||||
}
|
||||
|
||||
UNROLL
|
||||
for (uint32_t j = 0; j < 2; j++) {
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 4; i++) {
|
||||
data[(8 * i + j) * data_stride] = X[4 * j + i];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void storeGlobalData32ColumnBatch(
|
||||
E* data, uint32_t data_stride, uint32_t log_data_stride, stage_metadata s_meta, uint32_t batch_size)
|
||||
{
|
||||
data += ((s_meta.ntt_block_id & (data_stride - 1)) + data_stride * s_meta.ntt_inp_id * 2 +
|
||||
(s_meta.ntt_block_id >> log_data_stride) * data_stride * s_meta.ntt_block_size) *
|
||||
batch_size +
|
||||
s_meta.batch_id;
|
||||
|
||||
UNROLL
|
||||
for (uint32_t j = 0; j < 2; j++) {
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 4; i++) {
|
||||
data[(8 * i + j) * data_stride * batch_size] = X[4 * j + i];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void
|
||||
loadGlobalData16(const E* data, uint32_t data_stride, uint32_t log_data_stride, bool strided, stage_metadata s_meta)
|
||||
{
|
||||
if (strided) {
|
||||
data += (s_meta.ntt_block_id & (data_stride - 1)) + data_stride * s_meta.ntt_inp_id * 4 +
|
||||
(s_meta.ntt_block_id >> log_data_stride) * data_stride * s_meta.ntt_block_size;
|
||||
} else {
|
||||
data += s_meta.ntt_block_id * s_meta.ntt_block_size + s_meta.ntt_inp_id * 4;
|
||||
}
|
||||
|
||||
UNROLL
|
||||
for (uint32_t j = 0; j < 4; j++) {
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 2; i++) {
|
||||
X[2 * j + i] = data[(8 * i + j) * data_stride];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void loadGlobalData16ColumnBatch(
|
||||
const E* data, uint32_t data_stride, uint32_t log_data_stride, stage_metadata s_meta, uint32_t batch_size)
|
||||
{
|
||||
data += ((s_meta.ntt_block_id & (data_stride - 1)) + data_stride * s_meta.ntt_inp_id * 4 +
|
||||
(s_meta.ntt_block_id >> log_data_stride) * data_stride * s_meta.ntt_block_size) *
|
||||
batch_size +
|
||||
s_meta.batch_id;
|
||||
|
||||
UNROLL
|
||||
for (uint32_t j = 0; j < 4; j++) {
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 2; i++) {
|
||||
X[2 * j + i] = data[(8 * i + j) * data_stride * batch_size];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void
|
||||
storeGlobalData16(E* data, uint32_t data_stride, uint32_t log_data_stride, bool strided, stage_metadata s_meta)
|
||||
{
|
||||
if (strided) {
|
||||
data += (s_meta.ntt_block_id & (data_stride - 1)) + data_stride * s_meta.ntt_inp_id * 4 +
|
||||
(s_meta.ntt_block_id >> log_data_stride) * data_stride * s_meta.ntt_block_size;
|
||||
} else {
|
||||
data += s_meta.ntt_block_id * s_meta.ntt_block_size + s_meta.ntt_inp_id * 4;
|
||||
}
|
||||
|
||||
UNROLL
|
||||
for (uint32_t j = 0; j < 4; j++) {
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 2; i++) {
|
||||
data[(8 * i + j) * data_stride] = X[2 * j + i];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void storeGlobalData16ColumnBatch(
|
||||
E* data, uint32_t data_stride, uint32_t log_data_stride, stage_metadata s_meta, uint32_t batch_size)
|
||||
{
|
||||
data += ((s_meta.ntt_block_id & (data_stride - 1)) + data_stride * s_meta.ntt_inp_id * 4 +
|
||||
(s_meta.ntt_block_id >> log_data_stride) * data_stride * s_meta.ntt_block_size) *
|
||||
batch_size +
|
||||
s_meta.batch_id;
|
||||
|
||||
UNROLL
|
||||
for (uint32_t j = 0; j < 4; j++) {
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 2; i++) {
|
||||
data[(8 * i + j) * data_stride * batch_size] = X[2 * j + i];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void ntt4_2()
|
||||
{
|
||||
UNROLL
|
||||
for (int i = 0; i < 2; i++) {
|
||||
ntt4(X[4 * i], X[4 * i + 1], X[4 * i + 2], X[4 * i + 3]);
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void ntt2_4()
|
||||
{
|
||||
UNROLL
|
||||
for (int i = 0; i < 4; i++) {
|
||||
ntt2(X[2 * i], X[2 * i + 1]);
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void ntt2(E& X0, E& X1)
|
||||
{
|
||||
E T;
|
||||
|
||||
T = X0 + X1;
|
||||
X1 = X0 - X1;
|
||||
X0 = T;
|
||||
}
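In equation form, ntt2 is the 2-point butterfly that the larger transforms here compose with twiddle multiplications from WB/WI/WE in between: X0' = X0 + X1 and X1' = X0 - X1.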
|
||||
|
||||
DEVICE_INLINE void ntt4(E& X0, E& X1, E& X2, E& X3)
|
||||
{
|
||||
E T;
|
||||
|
||||
T = X0 + X2;
|
||||
X2 = X0 - X2;
|
||||
X0 = X1 + X3;
|
||||
X1 = X1 - X3; // T has X0, X0 has X1, X2 has X2, X1 has X3
|
||||
|
||||
X1 = X1 * WB[0];
|
||||
|
||||
X3 = X2 - X1;
|
||||
X1 = X2 + X1;
|
||||
X2 = T - X0;
|
||||
X0 = T + X0;
|
||||
}
|
||||
|
||||
// rbo version
|
||||
DEVICE_INLINE void ntt4rbo(E& X0, E& X1, E& X2, E& X3)
|
||||
{
|
||||
E T;
|
||||
|
||||
T = X0 - X1;
|
||||
X0 = X0 + X1;
|
||||
X1 = X2 + X3;
|
||||
X3 = X2 - X3; // T has X0, X0 has X1, X2 has X2, X1 has X3
|
||||
|
||||
X3 = X3 * WB[0];
|
||||
|
||||
X2 = X0 - X1;
|
||||
X0 = X0 + X1;
|
||||
X1 = T + X3;
|
||||
X3 = T - X3;
|
||||
}
|
||||
|
||||
DEVICE_INLINE void ntt8(E& X0, E& X1, E& X2, E& X3, E& X4, E& X5, E& X6, E& X7)
|
||||
{
|
||||
E T;
|
||||
|
||||
// out of 56,623,104 possible mappings, we have:
|
||||
T = X3 - X7;
|
||||
X7 = X3 + X7;
|
||||
X3 = X1 - X5;
|
||||
X5 = X1 + X5;
|
||||
X1 = X2 + X6;
|
||||
X2 = X2 - X6;
|
||||
X6 = X0 + X4;
|
||||
X0 = X0 - X4;
|
||||
|
||||
T = T * WB[1];
|
||||
X2 = X2 * WB[1];
|
||||
|
||||
X4 = X6 + X1;
|
||||
X6 = X6 - X1;
|
||||
X1 = X3 + T;
|
||||
X3 = X3 - T;
|
||||
T = X5 + X7;
|
||||
X5 = X5 - X7;
|
||||
X7 = X0 + X2;
|
||||
X0 = X0 - X2;
|
||||
|
||||
X1 = X1 * WB[0];
|
||||
X5 = X5 * WB[1];
|
||||
X3 = X3 * WB[2];
|
||||
|
||||
X2 = X6 + X5;
|
||||
X6 = X6 - X5;
|
||||
X5 = X7 - X1;
|
||||
X1 = X7 + X1;
|
||||
X7 = X0 - X3;
|
||||
X3 = X0 + X3;
|
||||
X0 = X4 + T;
|
||||
X4 = X4 - T;
|
||||
}
|
||||
|
||||
DEVICE_INLINE void ntt8win()
|
||||
{
|
||||
E T;
|
||||
|
||||
T = X[3] - X[7];
|
||||
X[7] = X[3] + X[7];
|
||||
X[3] = X[1] - X[5];
|
||||
X[5] = X[1] + X[5];
|
||||
X[1] = X[2] + X[6];
|
||||
X[2] = X[2] - X[6];
|
||||
X[6] = X[0] + X[4];
|
||||
X[0] = X[0] - X[4];
|
||||
|
||||
X[2] = X[2] * WB[0];
|
||||
|
||||
X[4] = X[6] + X[1];
|
||||
X[6] = X[6] - X[1];
|
||||
X[1] = X[3] + T;
|
||||
X[3] = X[3] - T;
|
||||
T = X[5] + X[7];
|
||||
X[5] = X[5] - X[7];
|
||||
X[7] = X[0] + X[2];
|
||||
X[0] = X[0] - X[2];
|
||||
|
||||
X[1] = X[1] * WB[1];
|
||||
X[5] = X[5] * WB[0];
|
||||
X[3] = X[3] * WB[2];
|
||||
|
||||
X[2] = X[6] + X[5];
|
||||
X[6] = X[6] - X[5];
|
||||
|
||||
X[5] = X[1] + X[3];
|
||||
X[3] = X[1] - X[3];
|
||||
|
||||
X[1] = X[7] + X[5];
|
||||
X[5] = X[7] - X[5];
|
||||
X[7] = X[0] - X[3];
|
||||
X[3] = X[0] + X[3];
|
||||
X[0] = X[4] + T;
|
||||
X[4] = X[4] - T;
|
||||
}
|
||||
|
||||
DEVICE_INLINE void SharedData64Columns8(E* shmem, bool store, bool high_bits, bool stride)
|
||||
{
|
||||
uint32_t ntt_id = stride ? threadIdx.x & 0x7 : threadIdx.x >> 3;
|
||||
uint32_t column_id = stride ? threadIdx.x >> 3 : threadIdx.x & 0x7;
|
||||
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 8; i++) {
|
||||
if (store) {
|
||||
shmem[ntt_id * 64 + i * 8 + column_id] = X[i];
|
||||
} else {
|
||||
X[i] = shmem[ntt_id * 64 + i * 8 + column_id];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void SharedData64Rows8(E* shmem, bool store, bool high_bits, bool stride)
|
||||
{
|
||||
uint32_t ntt_id = stride ? threadIdx.x & 0x7 : threadIdx.x >> 3;
|
||||
uint32_t row_id = stride ? threadIdx.x >> 3 : threadIdx.x & 0x7;
|
||||
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 8; i++) {
|
||||
if (store) {
|
||||
shmem[ntt_id * 64 + row_id * 8 + i] = X[i];
|
||||
} else {
|
||||
X[i] = shmem[ntt_id * 64 + row_id * 8 + i];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void SharedData32Columns8(E* shmem, bool store, bool high_bits, bool stride)
|
||||
{
|
||||
uint32_t ntt_id = stride ? threadIdx.x & 0xf : threadIdx.x >> 2;
|
||||
uint32_t column_id = stride ? threadIdx.x >> 4 : threadIdx.x & 0x3;
|
||||
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 8; i++) {
|
||||
if (store) {
|
||||
shmem[ntt_id * 32 + i * 4 + column_id] = X[i];
|
||||
} else {
|
||||
X[i] = shmem[ntt_id * 32 + i * 4 + column_id];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void SharedData32Rows8(E* shmem, bool store, bool high_bits, bool stride)
|
||||
{
|
||||
uint32_t ntt_id = stride ? threadIdx.x & 0xf : threadIdx.x >> 2;
|
||||
uint32_t row_id = stride ? threadIdx.x >> 4 : threadIdx.x & 0x3;
|
||||
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 8; i++) {
|
||||
if (store) {
|
||||
shmem[ntt_id * 32 + row_id * 8 + i] = X[i];
|
||||
} else {
|
||||
X[i] = shmem[ntt_id * 32 + row_id * 8 + i];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void SharedData32Columns4_2(E* shmem, bool store, bool high_bits, bool stride)
|
||||
{
|
||||
uint32_t ntt_id = stride ? threadIdx.x & 0xf : threadIdx.x >> 2;
|
||||
uint32_t column_id = (stride ? threadIdx.x >> 4 : threadIdx.x & 0x3) * 2;
|
||||
|
||||
UNROLL
|
||||
for (uint32_t j = 0; j < 2; j++) {
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 4; i++) {
|
||||
if (store) {
|
||||
shmem[ntt_id * 32 + i * 8 + column_id + j] = X[4 * j + i];
|
||||
} else {
|
||||
X[4 * j + i] = shmem[ntt_id * 32 + i * 8 + column_id + j];
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void SharedData32Rows4_2(E* shmem, bool store, bool high_bits, bool stride)
|
||||
{
|
||||
uint32_t ntt_id = stride ? threadIdx.x & 0xf : threadIdx.x >> 2;
|
||||
uint32_t row_id = (stride ? threadIdx.x >> 4 : threadIdx.x & 0x3) * 2;
|
||||
|
||||
UNROLL
|
||||
for (uint32_t j = 0; j < 2; j++) {
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 4; i++) {
|
||||
if (store) {
|
||||
shmem[ntt_id * 32 + row_id * 4 + 4 * j + i] = X[4 * j + i];
|
||||
} else {
|
||||
X[4 * j + i] = shmem[ntt_id * 32 + row_id * 4 + 4 * j + i];
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void SharedData16Columns8(E* shmem, bool store, bool high_bits, bool stride)
|
||||
{
|
||||
uint32_t ntt_id = stride ? threadIdx.x & 0x1f : threadIdx.x >> 1;
|
||||
uint32_t column_id = stride ? threadIdx.x >> 5 : threadIdx.x & 0x1;
|
||||
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 8; i++) {
|
||||
if (store) {
|
||||
shmem[ntt_id * 16 + i * 2 + column_id] = X[i];
|
||||
} else {
|
||||
X[i] = shmem[ntt_id * 16 + i * 2 + column_id];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void SharedData16Rows8(E* shmem, bool store, bool high_bits, bool stride)
|
||||
{
|
||||
uint32_t ntt_id = stride ? threadIdx.x & 0x1f : threadIdx.x >> 1;
|
||||
uint32_t row_id = stride ? threadIdx.x >> 5 : threadIdx.x & 0x1;
|
||||
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 8; i++) {
|
||||
if (store) {
|
||||
shmem[ntt_id * 16 + row_id * 8 + i] = X[i];
|
||||
} else {
|
||||
X[i] = shmem[ntt_id * 16 + row_id * 8 + i];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void SharedData16Columns2_4(E* shmem, bool store, bool high_bits, bool stride)
|
||||
{
|
||||
uint32_t ntt_id = stride ? threadIdx.x & 0x1f : threadIdx.x >> 1;
|
||||
uint32_t column_id = (stride ? threadIdx.x >> 5 : threadIdx.x & 0x1) * 4;
|
||||
|
||||
UNROLL
|
||||
for (uint32_t j = 0; j < 4; j++) {
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 2; i++) {
|
||||
if (store) {
|
||||
shmem[ntt_id * 16 + i * 8 + column_id + j] = X[2 * j + i];
|
||||
} else {
|
||||
X[2 * j + i] = shmem[ntt_id * 16 + i * 8 + column_id + j];
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void SharedData16Rows2_4(E* shmem, bool store, bool high_bits, bool stride)
|
||||
{
|
||||
uint32_t ntt_id = stride ? threadIdx.x & 0x1f : threadIdx.x >> 1;
|
||||
uint32_t row_id = (stride ? threadIdx.x >> 5 : threadIdx.x & 0x1) * 4;
|
||||
|
||||
UNROLL
|
||||
for (uint32_t j = 0; j < 4; j++) {
|
||||
UNROLL
|
||||
for (uint32_t i = 0; i < 2; i++) {
|
||||
if (store) {
|
||||
shmem[ntt_id * 16 + row_id * 2 + 2 * j + i] = X[2 * j + i];
|
||||
} else {
|
||||
X[2 * j + i] = shmem[ntt_id * 16 + row_id * 2 + 2 * j + i];
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void twiddlesInternal()
|
||||
{
|
||||
UNROLL
|
||||
for (int i = 1; i < 8; i++) {
|
||||
X[i] = X[i] * WI[i - 1];
|
||||
}
|
||||
}
|
||||
|
||||
DEVICE_INLINE void twiddlesExternal()
|
||||
{
|
||||
UNROLL
|
||||
for (int i = 0; i < 8; i++) {
|
||||
X[i] = X[i] * WE[i];
|
||||
}
|
||||
}
|
||||
};
|
||||
|
||||
#endif
|
||||
icicle/src/polynomials/CMakeLists.txt (new file, 27 lines)
@@ -0,0 +1,27 @@
|
||||
set(TARGET icicle_poly)
|
||||
set(CURVE_TARGET icicle_curve)
|
||||
set(FIELD_TARGET icicle_field)
|
||||
|
||||
set(SRC ../)
|
||||
|
||||
set(POLY_SOURCE ${SRC}/polynomials/polynomials.cpp)
|
||||
set(POLY_API_SOURCE ${SRC}/polynomials/polynomials_c_api.cpp)
|
||||
# if(MSM)
|
||||
# list(APPEND CURVE_SOURCE ${SRC}/msm/extern.cpp)
|
||||
# if(G2)
|
||||
# list(APPEND CURVE_SOURCE ${SRC}/msm/extern_g2.cpp)
|
||||
# endif()
|
||||
# endif()
|
||||
# if(ECNTT)
|
||||
# list(APPEND CURVE_SOURCE ${SRC}/ntt/extern_ecntt.cpp)
|
||||
# list(APPEND CURVE_SOURCE ${SRC}/ntt/kernel_ntt.cpp)
|
||||
# endif()
|
||||
|
||||
add_library(${TARGET} STATIC ${POLY_SOURCE})
|
||||
target_sources(${TARGET} PRIVATE ${POLY_API_SOURCE}) # adding a second add_library for the same target would fail; append the API source instead
|
||||
target_include_directories(${TARGET} PUBLIC ${CMAKE_SOURCE_DIR}/include/)
|
||||
set_target_properties(${TARGET} PROPERTIES OUTPUT_NAME "ingo_curve_${CURVE}")
|
||||
target_compile_definitions(${TARGET} PUBLIC CURVE=${CURVE})
|
||||
target_link_libraries(${TARGET} PRIVATE ${FIELD_TARGET})
|
||||
target_link_libraries(${TARGET} PRIVATE ${CURVE_TARGET})
|
||||
target_compile_features(${TARGET} PUBLIC cxx_std_17)
|
||||
@@ -39,7 +39,7 @@ namespace polynomials {
|
||||
|
||||
/*============================== evaluate ==============================*/
|
||||
template <typename T>
|
||||
__device__ T pow(T base, int exp)
|
||||
T pow(T base, int exp)
|
||||
{
|
||||
T result = T::one();
|
||||
while (exp > 0) {
|
||||
|
||||
icicle/src/polynomials/polynomials.cpp (new file, 204 lines)
@@ -0,0 +1,204 @@
|
||||
#define FIELD_ID BN254
|
||||
#include "../../include/polynomials/polynomials.h"
|
||||
namespace polynomials {
|
||||
|
||||
template <typename C, typename D, typename I>
|
||||
Polynomial<C, D, I>::Polynomial()
|
||||
{
|
||||
if (nullptr == s_factory) {
|
||||
throw std::runtime_error("Polynomial factory not initialized. Must call Polynomial::initialize(factory)");
|
||||
}
|
||||
m_context = s_factory->create_context();
|
||||
m_backend = s_factory->create_backend();
|
||||
}
|
||||
|
||||
template <typename C, typename D, typename I>
|
||||
Polynomial<C, D, I> Polynomial<C, D, I>::from_coefficients(const C* coefficients, uint64_t nof_coefficients)
|
||||
{
|
||||
Polynomial<C, D, I> P = {};
|
||||
P.m_backend->from_coefficients(P.m_context, nof_coefficients, coefficients);
|
||||
return P;
|
||||
}
|
||||
|
||||
template <typename C, typename D, typename I>
|
||||
Polynomial<C, D, I> Polynomial<C, D, I>::from_rou_evaluations(const I* evaluations, uint64_t nof_evaluations)
|
||||
{
|
||||
Polynomial<C, D, I> P = {};
|
||||
P.m_backend->from_rou_evaluations(P.m_context, nof_evaluations, evaluations);
|
||||
return P;
|
||||
}
|
||||
|
||||
template <typename C, typename D, typename I>
|
||||
Polynomial<C, D, I> Polynomial<C, D, I>::clone() const
|
||||
{
|
||||
Polynomial<C, D, I> P = {};
|
||||
m_backend->clone(P.m_context, m_context);
|
||||
return P;
|
||||
}
|
||||
|
||||
template <typename C, typename D, typename I>
|
||||
Polynomial<C, D, I> Polynomial<C, D, I>::slice(uint64_t offset, uint64_t stride, uint64_t size)
|
||||
{
|
||||
Polynomial res = {};
|
||||
m_backend->slice(res.m_context, this->m_context, offset, stride, size);
|
||||
return res;
|
||||
}
|
||||
template <typename C, typename D, typename I>
|
||||
Polynomial<C, D, I> Polynomial<C, D, I>::even()
|
||||
{
|
||||
return slice(0, 2, 0 /*all elements*/);
|
||||
}
|
||||
template <typename C, typename D, typename I>
|
||||
Polynomial<C, D, I> Polynomial<C, D, I>::odd()
|
||||
{
|
||||
return slice(1, 2, 0 /*all elements*/);
|
||||
}
|
||||
|
||||
template <typename C, typename D, typename I>
|
||||
Polynomial<C, D, I> Polynomial<C, D, I>::operator+(const Polynomial<C, D, I>& rhs) const
|
||||
{
|
||||
Polynomial<C, D, I> res = {};
|
||||
m_backend->add(res.m_context, m_context, rhs.m_context);
|
||||
return res;
|
||||
}
|
||||
|
||||
template <typename C, typename D, typename I>
|
||||
Polynomial<C, D, I> Polynomial<C, D, I>::operator-(const Polynomial<C, D, I>& rhs) const
|
||||
{
|
||||
Polynomial<C, D, I> res = {};
|
||||
m_backend->subtract(res.m_context, m_context, rhs.m_context);
|
||||
return res;
|
||||
}
|
||||
|
||||
template <typename C, typename D, typename I>
|
||||
Polynomial<C, D, I> Polynomial<C, D, I>::operator*(const Polynomial& rhs) const
|
||||
{
|
||||
Polynomial<C, D, I> res = {};
|
||||
m_backend->multiply(res.m_context, m_context, rhs.m_context);
|
||||
return res;
|
||||
}
|
||||
|
||||
template <typename C, typename D, typename I>
|
||||
Polynomial<C, D, I> Polynomial<C, D, I>::operator*(const D& scalar) const
|
||||
{
|
||||
Polynomial<C, D, I> res = {};
|
||||
m_backend->multiply(res.m_context, m_context, scalar);
|
||||
return res;
|
||||
}
|
||||
|
||||
template <typename C, typename D, typename I>
|
||||
Polynomial<C, D, I> operator*(const D& scalar, const Polynomial<C, D, I>& rhs)
|
||||
{
|
||||
return rhs * scalar;
|
||||
}
|
||||
|
||||
template <typename C, typename D, typename I>
|
||||
std::pair<Polynomial<C, D, I>, Polynomial<C, D, I>> Polynomial<C, D, I>::divide(const Polynomial<C, D, I>& rhs) const
|
||||
{
|
||||
Polynomial<C, D, I> Q = {}, R = {};
|
||||
m_backend->divide(Q.m_context, R.m_context, m_context, rhs.m_context);
|
||||
return std::make_pair(std::move(Q), std::move(R));
|
||||
}
|
||||
|
||||
template <typename C, typename D, typename I>
|
||||
Polynomial<C, D, I> Polynomial<C, D, I>::operator/(const Polynomial& rhs) const
|
||||
{
|
||||
Polynomial<C, D, I> res = {};
|
||||
m_backend->quotient(res.m_context, m_context, rhs.m_context);
|
||||
return res;
|
||||
}
|
||||
|
||||
template <typename C, typename D, typename I>
|
||||
Polynomial<C, D, I> Polynomial<C, D, I>::operator%(const Polynomial& rhs) const
|
||||
{
|
||||
Polynomial<C, D, I> res = {};
|
||||
m_backend->remainder(res.m_context, m_context, rhs.m_context);
|
||||
return res;
|
||||
}
|
||||
|
||||
template <typename C, typename D, typename I>
|
||||
Polynomial<C, D, I> Polynomial<C, D, I>::divide_by_vanishing_polynomial(uint64_t vanishing_polynomial_degree) const
|
||||
{
|
||||
Polynomial<C, D, I> res = {};
|
||||
m_backend->divide_by_vanishing_polynomial(res.m_context, m_context, vanishing_polynomial_degree);
|
||||
return res;
|
||||
}
|
||||
|
||||
template <typename C, typename D, typename I>
|
||||
Polynomial<C, D, I>& Polynomial<C, D, I>::operator+=(const Polynomial& rhs)
|
||||
{
|
||||
m_backend->add(m_context, m_context, rhs.m_context);
|
||||
return *this;
|
||||
}
|
||||
|
||||
template <typename C, typename D, typename I>
|
||||
Polynomial<C, D, I>& Polynomial<C, D, I>::add_monomial_inplace(C monomial_coeff, uint64_t monomial)
|
||||
{
|
||||
m_backend->add_monomial_inplace(m_context, monomial_coeff, monomial);
|
||||
return *this;
|
||||
}
|
||||
|
||||
template <typename C, typename D, typename I>
|
||||
Polynomial<C, D, I>& Polynomial<C, D, I>::sub_monomial_inplace(C monomial_coeff, uint64_t monomial)
|
||||
{
|
||||
m_backend->sub_monomial_inplace(m_context, monomial_coeff, monomial);
|
||||
return *this;
|
||||
}
|
||||
|
||||
template <typename C, typename D, typename I>
|
||||
I Polynomial<C, D, I>::operator()(const D& x) const
|
||||
{
|
||||
I eval = {};
|
||||
evaluate(&x, &eval);
|
||||
return eval;
|
||||
}
|
||||
|
||||
template <typename C, typename D, typename I>
|
||||
void Polynomial<C, D, I>::evaluate(const D* x, I* eval) const
|
||||
{
|
||||
m_backend->evaluate(m_context, x, eval);
|
||||
}
|
||||
|
||||
template <typename C, typename D, typename I>
|
||||
void Polynomial<C, D, I>::evaluate_on_domain(D* domain, uint64_t size, I* evals /*OUT*/) const
|
||||
{
|
||||
return m_backend->evaluate_on_domain(m_context, domain, size, evals);
|
||||
}
|
||||
|
||||
template <typename C, typename D, typename I>
|
||||
int64_t Polynomial<C, D, I>::degree()
|
||||
{
|
||||
return m_backend->degree(m_context);
|
||||
}
|
||||
|
||||
template <typename C, typename D, typename I>
|
||||
C Polynomial<C, D, I>::get_coeff(uint64_t idx) const
|
||||
{
|
||||
return m_backend->get_coeff(m_context, idx);
|
||||
}
|
||||
|
||||
template <typename C, typename D, typename I>
|
||||
uint64_t Polynomial<C, D, I>::copy_coeffs(C* host_coeffs, uint64_t start_idx, uint64_t end_idx) const
|
||||
{
|
||||
return m_backend->copy_coeffs(m_context, host_coeffs, start_idx, end_idx);
|
||||
}
|
||||
|
||||
template <typename C, typename D, typename I>
|
||||
std::tuple<IntegrityPointer<C>, uint64_t /*size*/, uint64_t /*device_id*/>
|
||||
Polynomial<C, D, I>::get_coefficients_view()
|
||||
{
|
||||
return m_backend->get_coefficients_view(m_context);
|
||||
}
|
||||
|
||||
template <typename C, typename D, typename I>
|
||||
std::tuple<IntegrityPointer<I>, uint64_t /*size*/, uint64_t /*device_id*/>
|
||||
Polynomial<C, D, I>::get_rou_evaluations_view(uint64_t nof_evaluations, bool is_reversed)
|
||||
{
|
||||
return m_backend->get_rou_evaluations_view(m_context, nof_evaluations, is_reversed);
|
||||
}
|
||||
|
||||
// explicit instantiation for default type (scalar field)
|
||||
template class Polynomial<scalar_t>;
|
||||
template Polynomial<scalar_t> operator*(const scalar_t& c, const Polynomial<scalar_t>& rhs);
|
||||
|
||||
} // namespace polynomials
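A hedged usage sketch of the Polynomial API defined above. It assumes the single-parameter instantiation Polynomial<scalar_t> (as in the explicit instantiation above) defaults the domain and image types to scalar_t, and that a backend factory was installed first via Polynomial::initialize, which the CUDA C-API wrapper below does through polynomial_init_cuda_backend. The coefficient values are arbitrary:

using namespace polynomials;

scalar_t coeffs[3] = {scalar_t::one(), scalar_t::one(), scalar_t::one()}; // 1 + x + x^2
auto f = Polynomial<scalar_t>::from_coefficients(coeffs, 3);
auto g = f.clone();
auto h = f + g;                    // arithmetic operators return new polynomials
auto [q, r] = f.divide(g);         // quotient and remainder
scalar_t y = f(scalar_t::one());   // evaluation via operator()
int64_t d = h.degree();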
|
||||
icicle/src/polynomials/polynomials_c_api.cpp (new file, 284 lines)
@@ -0,0 +1,284 @@
|
||||
#define FIELD_ID BN254
|
||||
#include "../../include/polynomials/polynomials.h"
|
||||
#include "../../include/fields/field_config.cuh"
|
||||
#include "../../include/utils/utils.h"
|
||||
#include "../../include/utils/integrity_pointer.h"
|
||||
#include "../../include/polynomials/cuda_backend/polynomial_cuda_backend.cuh"
|
||||
|
||||
namespace polynomials {
|
||||
extern "C" {
|
||||
|
||||
// Defines a polynomial instance based on the scalar type from the FIELD configuration.
|
||||
typedef Polynomial<scalar_t> PolynomialInst;
|
||||
|
||||
bool CONCAT_EXPAND(FIELD, polynomial_init_cuda_backend)()
|
||||
{
|
||||
static auto cuda_factory = std::make_shared<CUDAPolynomialFactory<scalar_t>>();
|
||||
PolynomialInst::initialize(cuda_factory);
|
||||
return cuda_factory != nullptr;
|
||||
}
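A hedged sketch of the expected call order from a foreign-language binding against the extern "C" surface in this file. The bn254_ prefix is an assumption about how CONCAT_EXPAND(FIELD, ...) resolves here; the actual exported symbol names should be checked against the built library.

// hypothetical C-side declarations mirroring the wrappers in this file
extern bool bn254_polynomial_init_cuda_backend(void);
extern void* bn254_polynomial_create_from_coefficients(void* coeffs, size_t size);
extern void bn254_polynomial_delete(void* instance);

void example(void* coeffs, size_t size)
{
  bn254_polynomial_init_cuda_backend();                     // once per process: installs the CUDA factory
  void* p = bn254_polynomial_create_from_coefficients(coeffs, size);
  /* ... other bn254_polynomial_* calls on p ... */
  bn254_polynomial_delete(p);                               // the caller owns returned instances
}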
|
||||
|
||||
// Constructs a polynomial from a set of coefficients.
|
||||
// coeffs: Array of coefficients.
|
||||
// size: Number of coefficients in the array.
|
||||
// Returns a pointer to the newly created polynomial instance.
|
||||
PolynomialInst* CONCAT_EXPAND(FIELD, polynomial_create_from_coefficients)(scalar_t* coeffs, size_t size)
|
||||
{
|
||||
auto result = new PolynomialInst(PolynomialInst::from_coefficients(coeffs, size));
|
||||
return result;
|
||||
}
|
||||
|
||||
// Constructs a polynomial from evaluations at the roots of unity.
|
||||
// evals: Array of evaluations.
|
||||
// size: Number of evaluations in the array.
|
||||
// Returns a pointer to the newly created polynomial instance.
|
||||
PolynomialInst* CONCAT_EXPAND(FIELD, polynomial_create_from_rou_evaluations)(scalar_t* evals, size_t size)
|
||||
{
|
||||
auto result = new PolynomialInst(PolynomialInst::from_rou_evaluations(evals, size));
|
||||
return result;
|
||||
}
|
||||
|
||||
// Clones an existing polynomial instance.
|
||||
// p: Pointer to the polynomial instance to clone.
|
||||
// Returns a pointer to the cloned polynomial instance.
|
||||
PolynomialInst* CONCAT_EXPAND(FIELD, polynomial_clone)(const PolynomialInst* p)
|
||||
{
|
||||
auto result = new PolynomialInst(p->clone());
|
||||
return result;
|
||||
}
|
||||
|
||||
// Deletes a polynomial instance, freeing its memory.
|
||||
// instance: Pointer to the polynomial instance to delete.
|
||||
void CONCAT_EXPAND(FIELD, polynomial_delete)(PolynomialInst* instance) { delete instance; }
|
||||
|
||||
// Prints a polynomial to stdout
|
||||
void CONCAT_EXPAND(FIELD, polynomial_print)(PolynomialInst* p) { std::cout << *p << std::endl; }
|
||||
|
||||
// Adds two polynomials.
// a, b: Pointers to the polynomial instances to add.
// Returns a pointer to the resulting polynomial instance.
PolynomialInst* CONCAT_EXPAND(FIELD, polynomial_add)(const PolynomialInst* a, const PolynomialInst* b)
{
  auto result = new PolynomialInst(std::move(*a + *b));
  return result;
}

// Adds a polynomial to another in place.
// a: Pointer to the polynomial to add to.
// b: Pointer to the polynomial to add.
void CONCAT_EXPAND(FIELD, polynomial_add_inplace)(PolynomialInst* a, const PolynomialInst* b) { *a += *b; }

// Subtracts one polynomial from another.
// a, b: Pointers to the polynomial instances (minuend and subtrahend, respectively).
// Returns a pointer to the resulting polynomial instance.
PolynomialInst* CONCAT_EXPAND(FIELD, polynomial_subtract)(const PolynomialInst* a, const PolynomialInst* b)
{
  auto result = new PolynomialInst(std::move(*a - *b));
  return result;
}

// Multiplies two polynomials.
// a, b: Pointers to the polynomial instances to multiply.
// Returns a pointer to the resulting polynomial instance.
PolynomialInst* CONCAT_EXPAND(FIELD, polynomial_multiply)(const PolynomialInst* a, const PolynomialInst* b)
{
  auto result = new PolynomialInst(std::move(*a * *b));
  return result;
}

// Multiplies a polynomial by a scalar.
// a: Pointer to the polynomial instance.
// scalar: Scalar to multiply by.
// Returns a pointer to the resulting polynomial instance.
PolynomialInst* CONCAT_EXPAND(FIELD, polynomial_multiply_by_scalar)(const PolynomialInst* a, const scalar_t& scalar)
{
  auto result = new PolynomialInst(std::move(*a * scalar));
  return result;
}

// Divides one polynomial by another, returning both quotient and remainder.
// a, b: Pointers to the polynomial instances (dividend and divisor, respectively).
// q: Output parameter for the quotient.
// r: Output parameter for the remainder.
void CONCAT_EXPAND(FIELD, polynomial_division)(
  const PolynomialInst* a, const PolynomialInst* b, PolynomialInst** q /*OUT*/, PolynomialInst** r /*OUT*/)
{
  auto [_q, _r] = a->divide(*b);
  *q = new PolynomialInst(std::move(_q));
  *r = new PolynomialInst(std::move(_r));
}
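
// Illustrative usage sketch (not part of the original file): polynomial_division()
// above forwards to Polynomial::divide(), which returns a (quotient, remainder)
// pair, so a == q * b + r is expected to hold. Only calls appearing in this file
// are used; the coefficient values and the single-point spot check are
// assumptions made for the example.
static void division_identity_sketch()
{
  scalar_t a_coeffs[4] = {scalar_t::from(3), scalar_t::from(0), scalar_t::from(1), scalar_t::from(2)};
  scalar_t b_coeffs[2] = {scalar_t::from(1), scalar_t::from(1)};
  auto a = PolynomialInst::from_coefficients(a_coeffs, 4);
  auto b = PolynomialInst::from_coefficients(b_coeffs, 2);

  auto [q, r] = a.divide(b);   // the same call polynomial_division() forwards to
  auto recomposed = q * b + r; // expected to equal a

  // Spot-check both sides on a size-1 domain.
  scalar_t x = scalar_t::from(7);
  scalar_t lhs, rhs;
  a.evaluate_on_domain(&x, 1, &lhs);
  recomposed.evaluate_on_domain(&x, 1, &rhs);
  // lhs and rhs are expected to be equal for any choice of x.
}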

// Calculates the quotient of dividing one polynomial by another.
// a, b: Pointers to the polynomial instances (dividend and divisor, respectively).
// Returns a pointer to the resulting quotient polynomial instance.
PolynomialInst* CONCAT_EXPAND(FIELD, polynomial_quotient)(const PolynomialInst* a, const PolynomialInst* b)
{
  auto result = new PolynomialInst(std::move(*a / *b));
  return result;
}

// Calculates the remainder of dividing one polynomial by another.
// a, b: Pointers to the polynomial instances (dividend and divisor, respectively).
// Returns a pointer to the resulting remainder polynomial instance.
PolynomialInst* CONCAT_EXPAND(FIELD, polynomial_remainder)(const PolynomialInst* a, const PolynomialInst* b)
{
  auto result = new PolynomialInst(std::move(*a % *b));
  return result;
}

// Divides a polynomial by the vanishing polynomial of a given degree over the roots-of-unity (rou) domain.
// p: Pointer to the polynomial instance.
// vanishing_poly_degree: Degree of the vanishing polynomial.
// Returns a pointer to the resulting polynomial instance.
PolynomialInst*
CONCAT_EXPAND(FIELD, polynomial_divide_by_vanishing)(const PolynomialInst* p, uint64_t vanishing_poly_degree)
{
  auto result = new PolynomialInst(std::move(p->divide_by_vanishing_polynomial(vanishing_poly_degree)));
  return result;
}
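
// Illustrative sketch (not part of the original file): over the roots-of-unity
// domain the vanishing polynomial of degree n is x^n - 1, so the specialized
// wrapper above is expected to agree with an explicit quotient by that
// polynomial, built here with the monomial helpers declared below. The
// construction of v and the comparison are assumptions of this example.
static void divide_by_vanishing_sketch(const PolynomialInst& p, uint64_t n)
{
  scalar_t zero = scalar_t::from(0);
  auto v = PolynomialInst::from_coefficients(&zero, 1); // start from the zero polynomial
  v.add_monomial_inplace(scalar_t::from(1), n);         // + x^n
  v.sub_monomial_inplace(scalar_t::from(1), 0);         // - 1   =>   v(x) = x^n - 1

  auto q_fast = p.divide_by_vanishing_polynomial(n); // specialized path
  auto q_ref = p / v;                                // generic quotient
  // q_fast and q_ref are expected to describe the same polynomial.
}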

// Adds a monomial to a polynomial in place.
// p: Pointer to the polynomial instance.
// monomial_coeff: Coefficient of the monomial to add.
// monomial: Degree of the monomial to add.
void CONCAT_EXPAND(FIELD, polynomial_add_monomial_inplace)(
  PolynomialInst* p, const scalar_t& monomial_coeff, uint64_t monomial)
{
  p->add_monomial_inplace(monomial_coeff, monomial);
}

// Subtracts a monomial from a polynomial in place.
// p: Pointer to the polynomial instance.
// monomial_coeff: Coefficient of the monomial to subtract.
// monomial: Degree of the monomial to subtract.
void CONCAT_EXPAND(FIELD, polynomial_sub_monomial_inplace)(
  PolynomialInst* p, const scalar_t& monomial_coeff, uint64_t monomial)
{
  p->sub_monomial_inplace(monomial_coeff, monomial);
}

// Creates a new polynomial instance by slicing an existing polynomial.
// p: Pointer to the original polynomial instance to be sliced.
// offset: Starting index for the slice.
// stride: Interval between elements in the slice.
// size: Number of elements in the slice.
// Returns: Pointer to the new polynomial instance containing the slice.
PolynomialInst*
CONCAT_EXPAND(FIELD, polynomial_slice)(PolynomialInst* p, uint64_t offset, uint64_t stride, uint64_t size)
{
  auto result = new PolynomialInst(std::move(p->slice(offset, stride, size)));
  return result;
}

// Creates a new polynomial instance containing only the even-powered terms of the original polynomial.
// p: Pointer to the original polynomial instance.
// Returns: Pointer to the new polynomial instance containing only even-powered terms.
PolynomialInst* CONCAT_EXPAND(FIELD, polynomial_even)(PolynomialInst* p)
{
  auto result = new PolynomialInst(std::move(p->even()));
  return result;
}

// Creates a new polynomial instance containing only the odd-powered terms of the original polynomial.
// p: Pointer to the original polynomial instance.
// Returns: Pointer to the new polynomial instance containing only odd-powered terms.
PolynomialInst* CONCAT_EXPAND(FIELD, polynomial_odd)(PolynomialInst* p)
{
  auto result = new PolynomialInst(std::move(p->odd()));
  return result;
}

// Evaluates a polynomial on a domain of points.
// p: Pointer to the polynomial instance.
// domain: Array of points constituting the domain.
// domain_size: Number of points in the domain.
// evals: Output array for the evaluations.
void CONCAT_EXPAND(FIELD, polynomial_evaluate_on_domain)(
  const PolynomialInst* p, scalar_t* domain, uint64_t domain_size, scalar_t* evals /*OUT*/)
{
  return p->evaluate_on_domain(domain, domain_size, evals);
}

// Returns the degree of a polynomial.
// p: Pointer to the polynomial instance.
// Returns the degree of the polynomial.
int64_t CONCAT_EXPAND(FIELD, polynomial_degree)(PolynomialInst* p) { return p->degree(); }

// Copies a range of polynomial coefficients to host/device memory.
// p: Pointer to the polynomial instance.
// memory: Array to copy the coefficients into. If NULL, nothing is copied.
// start_idx: Start index of the range to copy.
// end_idx: End index of the range to copy.
// Returns the number of coefficients copied. If memory is NULL, returns the number of coefficients.
uint64_t CONCAT_EXPAND(FIELD, polynomial_copy_coeffs_range)(
  PolynomialInst* p, scalar_t* memory, uint64_t start_idx, uint64_t end_idx)
{
  return p->copy_coeffs(memory, start_idx, end_idx);
}
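
// Illustrative sketch (not part of the original file): reading coefficients back
// into host memory, sizing the buffer from degree(). This assumes p is non-empty,
// treats end_idx as inclusive per the comment above, and uses std::vector for
// brevity (so <vector> is assumed to be available).
static void copy_coeffs_sketch(PolynomialInst& p)
{
  const int64_t deg = p.degree();
  std::vector<scalar_t> host_coeffs(static_cast<size_t>(deg + 1));
  const uint64_t copied =
    p.copy_coeffs(host_coeffs.data(), /*start_idx=*/0, /*end_idx=*/static_cast<uint64_t>(deg));
  // `copied` is expected to equal deg + 1, with host_coeffs now mirroring the
  // polynomial's coefficients.
  (void)copied;
}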

// Retrieves a raw device-memory pointer to the polynomial coefficients.
// p: Pointer to the polynomial instance.
// size: Output parameter for the size of the view.
// device_id: Output parameter for the device ID.
// Returns a raw mutable pointer to the coefficients.
scalar_t* CONCAT_EXPAND(FIELD, polynomial_get_coeffs_raw_ptr)(
  PolynomialInst* p, uint64_t* size /*OUT*/, uint64_t* device_id /*OUT*/)
{
  auto [coeffs, _size, _device_id] = p->get_coefficients_view();
  *size = _size;
  *device_id = _device_id;
  return const_cast<scalar_t*>(coeffs.get());
}

// Retrieves a device-memory view of the polynomial coefficients.
// p: Pointer to the polynomial instance.
// size: Output parameter for the size of the view.
// device_id: Output parameter for the device ID.
// Returns a pointer to an integrity pointer encapsulating the coefficients view.
IntegrityPointer<scalar_t>* CONCAT_EXPAND(FIELD, polynomial_get_coeff_view)(
  PolynomialInst* p, uint64_t* size /*OUT*/, uint64_t* device_id /*OUT*/)
{
  auto [coeffs, _size, _device_id] = p->get_coefficients_view();
  *size = _size;
  *device_id = _device_id;
  return new IntegrityPointer<scalar_t>(std::move(coeffs));
}

// Retrieves a device-memory view of the polynomial's evaluations on the roots of unity.
// p: Pointer to the polynomial instance.
// nof_evals: Number of evaluations.
// is_reversed: Whether the evaluations are in reversed order.
// size: Output parameter for the size of the view.
// device_id: Output parameter for the device ID.
// Returns a pointer to an integrity pointer encapsulating the evaluations view.
IntegrityPointer<scalar_t>* CONCAT_EXPAND(FIELD, polynomial_get_rou_evaluations_view)(
  PolynomialInst* p, uint64_t nof_evals, bool is_reversed, uint64_t* size /*OUT*/, uint64_t* device_id /*OUT*/)
{
  auto [rou_evals, _size, _device_id] = p->get_rou_evaluations_view(nof_evals, is_reversed);
  *size = _size;
  *device_id = _device_id;
  return new IntegrityPointer<scalar_t>(std::move(rou_evals));
}

// Reads the pointer from an integrity pointer.
// p: Pointer to the integrity pointer.
// Returns the raw pointer if still valid, otherwise NULL.
const scalar_t* CONCAT_EXPAND(FIELD, polynomial_intergrity_ptr_get)(IntegrityPointer<scalar_t>* p)
{
  return p->get();
}

// Checks if an integrity pointer is still valid.
// p: Pointer to the integrity pointer.
// Returns true if the pointer is valid, false otherwise.
bool CONCAT_EXPAND(FIELD, polynomial_intergrity_ptr_is_valid)(IntegrityPointer<scalar_t>* p) { return p->isValid(); }

// Destroys an integrity pointer, freeing its resources.
// p: Pointer to the integrity pointer to destroy.
void CONCAT_EXPAND(FIELD, polynomial_intergrity_ptr_destroy)(IntegrityPointer<scalar_t>* p) { delete p; }
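
// Illustrative sketch (not part of the original file): the intended lifecycle of
// a coefficients view. That get_coefficients_view() already yields an
// IntegrityPointer and that mutating the polynomial is what invalidates it are
// assumptions suggested by the wrappers above; only the get()/isValid() contracts
// are documented in this file.
static void coeff_view_sketch(PolynomialInst& p)
{
  auto [coeffs, size, device_id] = p.get_coefficients_view();
  const scalar_t* device_ptr = coeffs.get(); // `size` elements residing on `device_id`
  (void)device_ptr;
  (void)size;
  (void)device_id;

  p.add_monomial_inplace(scalar_t::from(1), 0); // mutate the underlying polynomial
  if (!coeffs.isValid()) {
    // The stale view is expected to report invalid here (and get() to return NULL),
    // so a fresh view should be requested before touching device memory again.
  }
}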

} // extern "C"

} // namespace polynomials
25
icicle/src/poseidon/CMakeLists.txt
Normal file
@@ -0,0 +1,25 @@
set(TARGET icicle_poseidon)
set(CURVE_TARGET icicle_curve)
set(FIELD_TARGET icicle_field)

set(SRC ../)

set(POLY_SOURCE ${SRC}/poseidon/poseidon.cpp)
# if(MSM)
# list(APPEND CURVE_SOURCE ${SRC}/msm/extern.cpp)
# if(G2)
# list(APPEND CURVE_SOURCE ${SRC}/msm/extern_g2.cpp)
# endif()
# endif()
# if(ECNTT)
# list(APPEND CURVE_SOURCE ${SRC}/ntt/extern_ecntt.cpp)
# list(APPEND CURVE_SOURCE ${SRC}/ntt/kernel_ntt.cpp)
# endif()

add_library(${TARGET} STATIC ${POLY_SOURCE})
target_include_directories(${TARGET} PUBLIC ${CMAKE_SOURCE_DIR}/include/)
set_target_properties(${TARGET} PROPERTIES OUTPUT_NAME "ingo_curve_${CURVE}")
target_compile_definitions(${TARGET} PUBLIC CURVE=${CURVE})
target_link_libraries(${TARGET} PRIVATE ${FIELD_TARGET})
target_link_libraries(${TARGET} PRIVATE ${CURVE_TARGET})
target_compile_features(${TARGET} PUBLIC cxx_std_17)
@@ -1,21 +1,21 @@
#include "poseidon/poseidon.cuh"
#include "../../include/poseidon/poseidon.cuh"

/// These are pre-calculated constants for different curves
#include "fields/id.h"
#include "../../include/fields/id.h"
#if FIELD_ID == BN254
#include "poseidon/constants/bn254_poseidon.h"
#include "../../include/poseidon/constants/bn254_poseidon.h"
using namespace poseidon_constants_bn254;
#elif FIELD_ID == BLS12_381
#include "poseidon/constants/bls12_381_poseidon.h"
#include "../../include/poseidon/constants/bls12_381_poseidon.h"
using namespace poseidon_constants_bls12_381;
#elif FIELD_ID == BLS12_377
#include "poseidon/constants/bls12_377_poseidon.h"
#include "../../include/poseidon/constants/bls12_377_poseidon.h"
using namespace poseidon_constants_bls12_377;
#elif FIELD_ID == BW6_761
#include "poseidon/constants/bw6_761_poseidon.h"
#include "../../include/poseidon/constants/bw6_761_poseidon.h"
using namespace poseidon_constants_bw6_761;
#elif FIELD_ID == GRUMPKIN
#include "poseidon/constants/grumpkin_poseidon.h"
#include "../../include/poseidon/constants/grumpkin_poseidon.h"
using namespace poseidon_constants_grumpkin;
#endif

@@ -29,8 +29,8 @@ namespace poseidon {
    device_context::DeviceContext& ctx,
    PoseidonConstants<S>* poseidon_constants)
  {
    CHK_INIT_IF_RETURN();
    cudaStream_t& stream = ctx.stream;
    // CHK_INIT_IF_RETURN();
    int& stream = ctx.stream;
    int width = arity + 1;
    int round_constants_len = width * full_rounds_half * 2 + partial_rounds;
    int mds_matrix_len = width * width;
@@ -39,10 +39,10 @@ namespace poseidon {

    // Malloc memory for copying constants
    S* d_constants;
    CHK_IF_RETURN(cudaMallocAsync(&d_constants, sizeof(S) * constants_len, stream));
    // CHK_IF_RETURN(cudaMallocAsync(&d_constants, sizeof(S) * constants_len, stream));

    // Copy constants
    CHK_IF_RETURN(cudaMemcpyAsync(d_constants, constants, sizeof(S) * constants_len, cudaMemcpyHostToDevice, stream));
    // CHK_IF_RETURN(cudaMemcpyAsync(d_constants, constants, sizeof(S) * constants_len, cudaMemcpyHostToDevice, stream));

    S* round_constants = d_constants;
    S* mds_matrix = round_constants + round_constants_len;
@@ -56,18 +56,18 @@ namespace poseidon {
    S domain_tag = S::from(tree_domain_tag_value);

    // Make sure all the constants have been copied
    CHK_IF_RETURN(cudaStreamSynchronize(stream));
    // CHK_IF_RETURN(cudaStreamSynchronize(stream));
    *poseidon_constants = {arity, partial_rounds, full_rounds_half, round_constants,
                           mds_matrix, non_sparse_matrix, sparse_matrices, domain_tag};

    return CHK_LAST();
    return 0;
  }

  template <typename S>
  cudaError_t init_optimized_poseidon_constants(
    int arity, device_context::DeviceContext& ctx, PoseidonConstants<S>* poseidon_constants)
  {
    CHK_INIT_IF_RETURN();
    //CHK_INIT_IF_RETURN();
    int full_rounds_half = FULL_ROUNDS_DEFAULT;
    int partial_rounds;
    unsigned char* constants;
@@ -96,7 +96,7 @@ namespace poseidon {

    create_optimized_poseidon_constants(arity, full_rounds_half, partial_rounds, h_constants, ctx, poseidon_constants);

    return CHK_LAST();
    return 0; //CHK_LAST();
  }
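
  // Illustrative sketch (not part of the original diff): loading the optimized
  // constants for a supported arity before hashing. scalar_t and
  // device_context::get_default_device_context() are assumed to be available from
  // the surrounding headers; only init_optimized_poseidon_constants and
  // PoseidonConstants are taken from the code above.
  static void poseidon_constants_sketch()
  {
    device_context::DeviceContext ctx = device_context::get_default_device_context(); // assumed helper
    PoseidonConstants<scalar_t> constants;
    cudaError_t err = init_optimized_poseidon_constants<scalar_t>(/*arity=*/2, ctx, &constants);
    // On success, `constants` holds the round constants, the MDS and sparse
    // matrices, and the domain tag, as populated by create_optimized_poseidon_constants.
    (void)err;
  }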

  extern "C" cudaError_t CONCAT_EXPAND(FIELD, create_optimized_poseidon_constants_cuda)(

Some files were not shown because too many files have changed in this diff.