Compare commits

..

2 Commits

Author SHA1 Message Date
guy-ingo
de1d27d846 new api - wip 2023-06-22 23:39:54 +03:00
ImmanuelSegol
392c9f8e2e wip - towrds better rust frontend 2023-06-05 14:41:29 +03:00
822 changed files with 22560 additions and 151676 deletions

View File

@@ -1,39 +0,0 @@
Language: Cpp
AlignAfterOpenBracket: AlwaysBreak
AlignConsecutiveMacros: true
AlignTrailingComments: true
AllowAllParametersOfDeclarationOnNextLine: true
AllowShortBlocksOnASingleLine: true
AllowShortCaseLabelsOnASingleLine: false
AllowShortFunctionsOnASingleLine: All
AllowShortIfStatementsOnASingleLine: true
AlwaysBreakTemplateDeclarations: true
BinPackArguments: true
BinPackParameters: false
BreakBeforeBraces: Custom
BraceWrapping:
AfterClass: true
AfterFunction: true
BreakBeforeBinaryOperators: false
BreakBeforeTernaryOperators: true
ColumnLimit: 120
ContinuationIndentWidth: 2
Cpp11BracedListStyle: true
DisableFormat: false
IndentFunctionDeclarationAfterType: false
IndentWidth: 2
KeepEmptyLinesAtTheStartOfBlocks: false
MaxEmptyLinesToKeep: 1
NamespaceIndentation: All
PointerAlignment: Left
SortIncludes: false
SpaceBeforeAssignmentOperators: true
SpaceBeforeParens: ControlStatements
SpaceInEmptyParentheses: false
SpacesBeforeTrailingComments: 1
SpacesInAngles: false
SpacesInContainerLiterals: false
SpacesInCStyleCastParentheses: false
SpacesInParentheses: false
Standard: c++17
UseTab: Never

View File

@@ -1,6 +0,0 @@
inout
crate
lmit
mut
uint
dout

View File

@@ -2,7 +2,7 @@
name: ":bug: Bug Report"
about: Create a bug report to help us improve the repo
title: "[BUG]: "
labels: type:bug
labels: bug
---
## Description

View File

@@ -2,7 +2,7 @@
name: ":sparkles: Feature Request"
about: Request the inclusion of a new feature or functionality
title: "[FEAT]: "
labels: type:feature
labels: enhancement
---
## Description

View File

@@ -1,25 +0,0 @@
golang:
- wrappers/golang/**/*.go
- wrappers/golang/**/*.h
- wrappers/golang/**/*.tmpl
- go.mod
- .github/workflows/golang.yml
rust:
- wrappers/rust/**/*
- '!wrappers/rust/README.md'
- .github/workflows/rust.yml
cpp:
- icicle/**/*.cu
- icicle/**/*.cuh
- icicle/**/*.cpp
- icicle/**/*.hpp
- icicle/**/*.c
- icicle/**/*.h
- icicle/CMakeLists.txt
- .github/workflows/cpp_cuda.yml
- icicle/cmake/Common.cmake
- icicle/cmake/CurvesCommon.cmake
- icicle/cmake/FieldsCommon.cmake
examples:
- examples/**/*
- .github/workflows/examples.yml

49
.github/workflows/build.yml vendored Normal file
View File

@@ -0,0 +1,49 @@
name: Build
on:
pull_request:
branches:
- "main"
- "dev"
paths:
- "icicle/**"
- "src/**"
- "Cargo.toml"
- "build.rs"
env:
CARGO_TERM_COLOR: always
ARCH_TYPE: sm_70
DEFAULT_STREAM: per-thread
jobs:
build-linux:
runs-on: ubuntu-latest
steps:
# Checkout code
- uses: actions/checkout@v3
# Download (or from cache) and install CUDA Toolkit 12.1.0
- uses: Jimver/cuda-toolkit@v0.2.9
id: cuda-toolkit
with:
cuda: '12.1.0'
use-github-cache: true
# Build from cargo - Rust utils are preinstalled on latest images
# https://github.com/actions/runner-images/blob/main/images/linux/Ubuntu2204-Readme.md#rust-tools
- name: Build
run: cargo build --release --verbose
build-windows:
runs-on: windows-latest
steps:
- uses: actions/checkout@v3
- uses: Jimver/cuda-toolkit@v0.2.9
id: cuda-toolkit
with:
cuda: '12.1.0'
use-github-cache: true
- name: Build
run: cargo build --release --verbose

View File

@@ -1,44 +0,0 @@
name: Check Changed Files
on:
workflow_call:
outputs:
golang:
description: "Flag for if GoLang files changed"
value: ${{ jobs.check-changed-files.outputs.golang }}
rust:
description: "Flag for if Rust files changed"
value: ${{ jobs.check-changed-files.outputs.rust }}
cpp_cuda:
description: "Flag for if C++/CUDA files changed"
value: ${{ jobs.check-changed-files.outputs.cpp_cuda }}
examples:
description: "Flag for if example files changed"
value: ${{ jobs.check-changed-files.outputs.examples }}
jobs:
check-changed-files:
name: Check Changed Files
runs-on: ubuntu-22.04
outputs:
golang: ${{ steps.changed_files.outputs.golang }}
rust: ${{ steps.changed_files.outputs.rust }}
cpp_cuda: ${{ steps.changed_files.outputs.cpp_cuda }}
examples: ${{ steps.changed_files.outputs.examples }}
steps:
- name: Checkout Repo
uses: actions/checkout@v4
- name: Get all changed files
id: changed-files-yaml
uses: tj-actions/changed-files@v39
# https://github.com/tj-actions/changed-files#input_files_yaml_from_source_file
with:
files_yaml_from_source_file: .github/changed-files.yml
- name: Run Changed Files script
id: changed_files
# https://github.com/tj-actions/changed-files#outputs-
run: |
echo "golang=${{ steps.changed-files-yaml.outputs.golang_any_modified }}" >> "$GITHUB_OUTPUT"
echo "rust=${{ steps.changed-files-yaml.outputs.rust_any_modified }}" >> "$GITHUB_OUTPUT"
echo "cpp_cuda=${{ steps.changed-files-yaml.outputs.cpp_any_modified }}" >> "$GITHUB_OUTPUT"
echo "examples=${{ steps.changed-files-yaml.outputs.examples_any_modified }}" >> "$GITHUB_OUTPUT"

View File

@@ -1,20 +0,0 @@
name: Check Spelling
on:
pull_request:
branches:
- main
- V2
jobs:
spelling-checker:
name: Check Spelling
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: codespell-project/actions-codespell@v2
with:
# https://github.com/codespell-project/actions-codespell?tab=readme-ov-file#parameter-skip
skip: ./**/target,./**/build,./docs/*.js,./docs/*.json
# https://github.com/codespell-project/actions-codespell?tab=readme-ov-file#parameter-ignore_words_file
ignore_words_file: .codespellignore

View File

@@ -1,91 +0,0 @@
name: C++/CUDA
on:
pull_request:
branches:
- main
- V2
push:
branches:
- main
- V2
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
check-changed-files:
uses: ./.github/workflows/check-changed-files.yml
check-format:
name: Check Code Format
runs-on: ubuntu-22.04
needs: check-changed-files
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Check clang-format
if: needs.check-changed-files.outputs.cpp_cuda == 'true'
run: if [[ $(find ./ \( -path ./icicle/build -prune -o -path ./**/target -prune -o -path ./examples -prune \) -iname *.h -or -iname *.cuh -or -iname *.cu -or -iname *.c -or -iname *.cpp | xargs clang-format --dry-run -ferror-limit=1 -style=file 2>&1) ]]; then echo "Please run clang-format"; exit 1; fi
test-linux-curve:
name: Test on Linux
runs-on: [self-hosted, Linux, X64, icicle]
needs: [check-changed-files, check-format]
strategy:
matrix:
curve:
- name: bn254
build_args: -DG2=ON -DECNTT=ON
- name: bls12_381
build_args: -DG2=ON -DECNTT=ON
- name: bls12_377
build_args: -DG2=ON -DECNTT=ON
- name: bw6_761
build_args: -DG2=ON -DECNTT=ON
- name: grumpkin
build_args:
steps:
- name: Checkout Repo
uses: actions/checkout@v4
- name: Build curve
working-directory: ./icicle
if: needs.check-changed-files.outputs.cpp_cuda == 'true'
run: |
mkdir -p build && rm -rf build/*
cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTS=ON -DCURVE=${{ matrix.curve.name }} ${{ matrix.curve.build_args }} -S . -B build
cmake --build build -j
- name: Run C++ curve Tests
working-directory: ./icicle/build/tests
if: needs.check-changed-files.outputs.cpp_cuda == 'true'
run: ctest
test-linux-field:
name: Test on Linux
runs-on: [self-hosted, Linux, X64, icicle]
needs: [check-changed-files, check-format]
strategy:
matrix:
field:
- name: babybear
build_args: -DEXT_FIELD=ON
- name: stark252
build_args: -DEXT_FIELD=OFF
- name: m31
build_args: -DEXT_FIELD=ON
steps:
- name: Checkout Repo
uses: actions/checkout@v4
- name: Build field
working-directory: ./icicle
if: needs.check-changed-files.outputs.cpp_cuda == 'true'
run: |
mkdir -p build && rm -rf build/*
cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTS=ON -DFIELD=${{ matrix.field.name }} ${{ matrix.field.build_args }} -S . -B build
cmake --build build -j
- name: Run C++ field Tests
working-directory: ./icicle/build/tests
if: needs.check-changed-files.outputs.cpp_cuda == 'true'
run: ctest

View File

@@ -1,46 +0,0 @@
name: Deploy to GitHub Pages
on:
push:
branches:
- main
paths:
- 'docs/**'
permissions:
contents: write
jobs:
deploy:
name: Deploy to GitHub Pages
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
with:
path: 'repo'
- uses: actions/setup-node@v3
with:
node-version: 18
cache: npm
cache-dependency-path: ./repo/docs/package-lock.json
- name: Install dependencies
run: npm install --frozen-lockfile
working-directory: ./repo/docs
- name: Build website
run: npm run build
working-directory: ./repo/docs
- name: Copy CNAME to build directory
run: echo "dev.ingonyama.com" > ./build/CNAME
working-directory: ./repo/docs
- name: Deploy to GitHub Pages
uses: peaceiris/actions-gh-pages@v3
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
publish_dir: ./repo/docs/build
user_name: github-actions[bot]
user_email: 41898282+github-actions[bot]@users.noreply.github.com
working-directory: ./repo/docs

View File

@@ -1,60 +0,0 @@
# This workflow is a demo of how to run all examples in the Icicle repository.
# For each language directory (c++, Rust, etc.) the workflow
# (1) loops over all examples (msm, ntt, etc.) and
# (2) runs ./compile.sh and ./run.sh in each directory.
# The script ./compile.sh should compile the example and ./run.sh should run it.
# Each script should return 0 for success and 1 otherwise.
name: Examples
on:
pull_request:
branches:
- main
- V2
push:
branches:
- main
- V2
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
check-changed-files:
uses: ./.github/workflows/check-changed-files.yml
run-examples:
runs-on: [self-hosted, Linux, X64, icicle, examples]
needs: check-changed-files
steps:
- name: Checkout
uses: actions/checkout@v4
- name: c++ examples
working-directory: ./examples/c++
if: needs.check-changed-files.outputs.cpp_cuda == 'true' || needs.check-changed-files.outputs.examples == 'true'
run: |
# loop over all directories in the current directory
for dir in $(find . -mindepth 1 -maxdepth 1 -type d); do
if [ -d "$dir" ]; then
echo "Running command in $dir"
cd $dir
./compile.sh
./run.sh
cd -
fi
done
- name: Rust examples
working-directory: ./examples/rust
if: needs.check-changed-files.outputs.rust == 'true' || needs.check-changed-files.outputs.examples == 'true'
run: |
# loop over all directories in the current directory
for dir in $(find . -mindepth 1 -maxdepth 1 -type d); do
if [ -d "$dir" ]; then
echo "Running command in $dir"
cd $dir
cargo run --release
cd -
fi
done

View File

@@ -1,162 +0,0 @@
name: GoLang
on:
pull_request:
branches:
- main
- V2
push:
branches:
- main
- V2
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
check-changed-files:
uses: ./.github/workflows/check-changed-files.yml
check-format:
name: Check Code Format
runs-on: ubuntu-22.04
needs: check-changed-files
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Setup go
uses: actions/setup-go@v5
with:
go-version: '1.20.0'
- name: Check gofmt
if: needs.check-changed-files.outputs.golang == 'true'
run: if [[ $(go list ./... | xargs go fmt) ]]; then echo "Please run go fmt"; exit 1; fi
build-curves-linux:
name: Build and test curves on Linux
runs-on: [self-hosted, Linux, X64, icicle]
needs: [check-changed-files, check-format]
strategy:
matrix:
curve:
- name: bn254
build_args: -g2 -ecntt
- name: bls12_381
build_args: -g2 -ecntt
- name: bls12_377
build_args: -g2 -ecntt
- name: bw6_761
build_args: -g2 -ecntt
- name: grumpkin
build_args:
steps:
- name: Checkout Repo
uses: actions/checkout@v4
- name: Setup go
uses: actions/setup-go@v5
with:
go-version: '1.20.0'
- name: Build
working-directory: ./wrappers/golang
if: needs.check-changed-files.outputs.golang == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
# builds a single curve with the curve's specified build args
run: ./build.sh -curve=${{ matrix.curve.name }} ${{ matrix.curve.build_args }}
- name: Test
working-directory: ./wrappers/golang/curves
if: needs.check-changed-files.outputs.golang == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
run: |
CURVE=$(echo ${{ matrix.curve.name }} | sed -e 's/_//g')
export CPATH=$CPATH:/usr/local/cuda/include
go test ./$CURVE/tests -count=1 -failfast -p 2 -timeout 60m -v
build-fields-linux:
name: Build and test fields on Linux
runs-on: [self-hosted, Linux, X64, icicle]
needs: [check-changed-files, check-format]
strategy:
matrix:
field:
- name: babybear
build_args: -field-ext
steps:
- name: Checkout Repo
uses: actions/checkout@v4
- name: Setup go
uses: actions/setup-go@v5
with:
go-version: '1.20.0'
- name: Build
working-directory: ./wrappers/golang
if: needs.check-changed-files.outputs.golang == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
# builds a single field with the fields specified build args
run: ./build.sh -field=${{ matrix.field.name }} ${{ matrix.field.build_args }}
- name: Test
working-directory: ./wrappers/golang/fields
if: needs.check-changed-files.outputs.golang == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
run: |
FIELD=$(echo ${{ matrix.field.name }} | sed -e 's/_//g')
export CPATH=$CPATH:/usr/local/cuda/include
go test ./$FIELD/tests -count=1 -failfast -p 2 -timeout 60m -v
build-hashes-linux:
name: Build and test hashes on Linux
runs-on: [self-hosted, Linux, X64, icicle]
needs: [check-changed-files, check-format]
strategy:
matrix:
hash:
- name: keccak
build_args:
steps:
- name: Checkout Repo
uses: actions/checkout@v4
- name: Setup go
uses: actions/setup-go@v5
with:
go-version: '1.20.0'
- name: Build
working-directory: ./wrappers/golang
if: needs.check-changed-files.outputs.golang == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
# builds a single hash algorithm with the hash's specified build args
run: ./build.sh -hash=${{ matrix.hash.name }} ${{ matrix.hash.build_args }}
- name: Test
working-directory: ./wrappers/golang/hash
if: needs.check-changed-files.outputs.golang == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
run: |
HASH=$(echo ${{ matrix.hash.name }} | sed -e 's/_//g')
export CPATH=$CPATH:/usr/local/cuda/include
go test ./$HASH/tests -count=1 -failfast -p 2 -timeout 60m -v
# TODO: bw6 on windows requires more memory than the standard runner has
# Add a large runner and then enable this job
# build-windows:
# name: Build on Windows
# runs-on: windows-2022
# needs: [check-changed-files, check-format]
# strategy:
# matrix:
# curve: [bn254, bls12_381, bls12_377, bw6_761]
# steps:
# - name: Checkout Repo
# uses: actions/checkout@v4
# - name: Setup go
# uses: actions/setup-go@v5
# with:
# go-version: '1.20.0'
# - name: Download and Install Cuda
# if: needs.check-changed-files.outputs.golang == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
# id: cuda-toolkit
# uses: Jimver/cuda-toolkit@v0.2.11
# with:
# cuda: '12.0.0'
# method: 'network'
# # https://docs.nvidia.com/cuda/archive/12.0.0/cuda-installation-guide-microsoft-windows/index.html
# sub-packages: '["cudart", "nvcc", "thrust", "visual_studio_integration"]'
# - name: Build libs
# if: needs.check-changed-files.outputs.golang == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
# working-directory: ./wrappers/golang
# env:
# CUDA_PATH: ${{ steps.cuda-toolkit.outputs.CUDA_PATH }}
# shell: pwsh
# run: ./build.ps1 ${{ matrix.curve }} ON # builds a single curve with G2 enabled

View File

@@ -1,50 +0,0 @@
name: Release
on:
workflow_dispatch:
inputs:
releaseType:
description: 'Release type'
required: true
default: 'minor'
type: choice
options:
- patch
- minor
- major
jobs:
release:
name: Release
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
with:
ssh-key: ${{ secrets.DEPLOY_KEY }}
- name: Setup Cache
id: cache
uses: actions/cache@v4
with:
path: |
~/.cargo/bin/
~/.cargo/registry/index/
~/.cargo/registry/cache/
~/.cargo/git/db/
key: ${{ runner.os }}-cargo-${{ hashFiles('~/.cargo/bin/cargo-workspaces') }}
- name: Install cargo-workspaces
if: steps.cache.outputs.cache-hit != 'true'
run: cargo install cargo-workspaces
- name: Bump rust crate versions, commit, and tag
working-directory: wrappers/rust
# https://github.com/pksunkara/cargo-workspaces?tab=readme-ov-file#version
run: |
git config user.name release-bot
git config user.email release-bot@ingonyama.com
cargo workspaces version ${{ inputs.releaseType }} -y --no-individual-tags -m "Bump rust crates' version"
- name: Create draft release
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
LATEST_TAG=$(git describe --tags --abbrev=0)
gh release create $LATEST_TAG --generate-notes -d --verify-tag -t "Release $LATEST_TAG"

View File

@@ -1,112 +0,0 @@
name: Rust
on:
pull_request:
branches:
- main
- V2
push:
branches:
- main
- V2
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
check-changed-files:
uses: ./.github/workflows/check-changed-files.yml
check-format:
name: Check Code Format
runs-on: ubuntu-22.04
needs: check-changed-files
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Check rustfmt
if: needs.check-changed-files.outputs.rust == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
working-directory: ./wrappers/rust
# "-name target -prune" removes searching in any directory named "target"
# Formatting by single file is necessary due to generated files not being present
# before building the project.
# e.g. icicle-cuda-runtime/src/bindings.rs is generated and icicle-cuda-runtime/src/lib.rs includes that module
# causing rustfmt to fail.
run: if [[ $(find . -path ./icicle-curves/icicle-curve-template -prune -o -name target -prune -o -iname *.rs -print | xargs cargo fmt --check --) ]]; then echo "Please run cargo fmt"; exit 1; fi
build-linux:
name: Build on Linux
runs-on: [self-hosted, Linux, X64, icicle]
needs: [check-changed-files, check-format]
steps:
- name: Checkout Repo
uses: actions/checkout@v4
- name: Build
working-directory: ./wrappers/rust
if: needs.check-changed-files.outputs.rust == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
# Building from the root workspace will build all members of the workspace by default
run: cargo build --release --verbose
test-linux:
name: Test on Linux
runs-on: [self-hosted, Linux, X64, icicle]
needs: [check-changed-files, build-linux]
steps:
- name: Checkout Repo
uses: actions/checkout@v4
- name: Run tests
working-directory: ./wrappers/rust
if: needs.check-changed-files.outputs.rust == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
# Running tests from the root workspace will run all workspace members' tests by default
# We need to limit the number of threads to avoid running out of memory on weaker machines
# ignored tests are polynomial tests. Since they conflict with NTT tests, they are executed separately
run: |
cargo test --workspace --exclude icicle-babybear --exclude icicle-stark252 --exclude icicle-m31 --release --verbose --features=g2 -- --test-threads=2 --ignored
cargo test --workspace --exclude icicle-babybear --exclude icicle-stark252 --exclude icicle-m31 --release --verbose --features=g2 -- --test-threads=2
- name: Run baby bear tests
working-directory: ./wrappers/rust/icicle-fields/icicle-babybear
if: needs.check-changed-files.outputs.rust == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
run: |
cargo test --release --verbose -- --ignored
cargo test --release --verbose
- name: Run stark252 tests
working-directory: ./wrappers/rust/icicle-fields/icicle-stark252
if: needs.check-changed-files.outputs.rust == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
run: |
cargo test --release --verbose -- --ignored
cargo test --release --verbose
- name: Run m31 tests
working-directory: ./wrappers/rust/icicle-fields/icicle-m31
if: needs.check-changed-files.outputs.rust == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
run: |
cargo test --release --verbose -- --ignored
cargo test --release --verbose
# build-windows:
# name: Build on Windows
# runs-on: windows-2022
# needs: check-changed-files
# steps:
# - name: Checkout Repo
# uses: actions/checkout@v4
# - name: Download and Install Cuda
# if: needs.check-changed-files.outputs.rust == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
# id: cuda-toolkit
# uses: Jimver/cuda-toolkit@v0.2.11
# with:
# cuda: '12.0.0'
# method: 'network'
# # https://docs.nvidia.com/cuda/archive/12.0.0/cuda-installation-guide-microsoft-windows/index.html
# sub-packages: '["cudart", "nvcc", "thrust", "visual_studio_integration"]'
# - name: Build targets
# working-directory: ./wrappers/rust
# if: needs.check-changed-files.outputs.rust == 'true' || needs.check-changed-files.outputs.cpp_cuda == 'true'
# env:
# CUDA_PATH: ${{ steps.cuda-toolkit.outputs.CUDA_PATH }}
# CUDA_ARCH: 50 # Using CUDA_ARCH=50 env variable since the CI machines have no GPUs
# # Building from the root workspace will build all members of the workspace by default
# run: cargo build --release --verbose

View File

@@ -1,29 +0,0 @@
name: Test Deploy to GitHub Pages
on:
pull_request:
branches:
- main
paths:
- 'docs/**'
jobs:
test-deploy:
name: Test deployment of docs website
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
with:
path: 'repo'
- uses: actions/setup-node@v3
with:
node-version: 18
cache: npm
cache-dependency-path: ./repo/docs/package-lock.json
- name: Install dependencies
run: npm install --frozen-lockfile
working-directory: ./repo/docs
- name: Test build website
run: npm run build
working-directory: ./repo/docs

6
.gitignore vendored
View File

@@ -5,10 +5,6 @@
*.cubin
*.bin
*.fatbin
*.so
*.nsys-rep
*.ncu-rep
*.sage.py
**/target
**/.vscode
**/.*lock*csv#
@@ -16,5 +12,3 @@
**/.DS_Store
**/Cargo.lock
**/icicle/build/
**/wrappers/rust/icicle-cuda-runtime/src/bindings.rs
**/build*

View File

@@ -1,10 +0,0 @@
# https://github.com/rust-lang/rustfmt/blob/master/Configurations.md
# Stable Configs
chain_width = 0
max_width = 120
merge_derives = true
use_field_init_shorthand = true
use_try_shorthand = true
# Unstable Configs

View File

@@ -1,8 +0,0 @@
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Ingonyama"
title: "ICICLE: GPU Library for ZK Acceleration"
version: 1.0.0
date-released: 2024-01-04
url: "https://github.com/ingonyama-zk/icicle"

9
Cargo.toml Normal file
View File

@@ -0,0 +1,9 @@
[workspace]
name = "icicle"
version = "0.1.0"
edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
members = ["icicle-core", "bls12-381", "bls12-377", "bn254"]

View File

@@ -1,28 +0,0 @@
# Use the specified base image
FROM nvidia/cuda:12.0.0-devel-ubuntu22.04
# Update and install dependencies
RUN apt-get update && apt-get install -y \
cmake \
protobuf-compiler \
curl \
build-essential \
&& rm -rf /var/lib/apt/lists/*
# Install Rust
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"
# Install Golang
ENV GOLANG_VERSION 1.21.1
RUN curl -L https://go.dev/dl/go${GOLANG_VERSION}.linux-amd64.tar.gz | tar -xz -C /usr/local
ENV PATH="/usr/local/go/bin:${PATH}"
# Set the working directory in the container
WORKDIR /app
# Copy the content of the local directory to the working directory
COPY . .
# Specify the default command for the container
CMD ["/bin/bash"]

253
README.md
View File

@@ -1,130 +1,150 @@
# ICICLE
<div align="center">Icicle is a library for ZK acceleration using CUDA-enabled GPUs.</div>
<div align="center">ICICLE is a library for ZK acceleration using CUDA-enabled GPUs.</div>
<p align="center">
<img alt="ICICLE" width="300" height="300" src="https://user-images.githubusercontent.com/2446179/223707486-ed8eb5ab-0616-4601-8557-12050df8ccf7.png"/>
</p>
<p align="center">
<a href="https://discord.gg/EVVXTdt6DF">
<img src="https://img.shields.io/discord/1063033227788423299?logo=discord" alt="Chat with us on Discord">
</a>
<a href="https://twitter.com/intent/follow?screen_name=Ingo_zk">
<img src="https://img.shields.io/twitter/follow/Ingo_zk?style=social&logo=twitter" alt="Follow us on Twitter">
<a href="https://github.com/ingonyama-zk/icicle/releases">
<img src="https://img.shields.io/github/v/release/ingonyama-zk/icicle" alt="GitHub Release">
</a>
</p>
![image (4)](https://user-images.githubusercontent.com/2446179/223707486-ed8eb5ab-0616-4601-8557-12050df8ccf7.png)
## Background
Zero Knowledge Proofs (ZKPs) are considered one of the greatest achievements of modern cryptography. Accordingly, ZKPs are expected to disrupt a number of industries and will usher in an era of trustless and privacy preserving services and infrastructure.
We believe GPUs are as important for ZK as for AI.
If we want ZK hardware today we have FPGAs or GPUs which are relatively inexpensive. However, the biggest selling point of GPUs is the software; we talk in particular about CUDA, which makes it easy to write code running on Nvidia GPUs, taking advantage of their highly parallel architecture. Together with the widespread availability of these devices, if we can get GPUs to work on ZK workloads, then we have made a giant step towards accessible and efficient ZK provers.
- GPUs are a perfect match for ZK compute - around 97% of ZK protocol runtime is parallel by nature.
- GPUs are simple for developers to use and scale compared to other hardware platforms.
- GPUs are extremely competitive in terms of power / performance and price (3x cheaper).
- GPUs are popular and readily available.
## Zero Knowledge on GPU
## Getting Started
ICICLE is a CUDA implementation of general functions widely used in ZKP. ICICLE currently provides support for MSM, NTT, and ECNTT, with plans to support Hash functions soon.
ICICLE is a CUDA implementation of general functions widely used in ZKP.
### Supported primitives
> [!NOTE]
> Developers: We highly recommend reading our [documentation]
- Fields
- Scalars
- Points
- Projective: {x, y, z}
- Affine: {x, y}
- Curves
- [BLS12-381]
- [BLS12-377]
- [BN254]
> [!TIP]
> Try out ICICLE by running some [examples] using ICICLE in C++ and our Rust bindings
## Build and usage
### Prerequisites
> NOTE: [NVCC] is a prerequisite for building.
- [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) version 12.0 or newer.
- [CMake](https://cmake.org/files/), version 3.18 and above. Latest version is recommended.
- [GCC](https://gcc.gnu.org/install/download.html) version 9, latest version is recommended.
- Any Nvidia GPU (which supports CUDA Toolkit version 12.0 or above).
1. Define or select a curve for your application; we've provided a [template][CRV_TEMPLATE] for defining a curve
2. Include the curve in [`curve_config.cuh`][CRV_CONFIG]
3. Now you can build the ICICLE library using nvcc
> [!NOTE]
> It is possible to use CUDA 11 for cards which don't support CUDA 12, however we don't officially support this version and in the future there may be issues.
### Accessing Hardware
If you don't have access to an Nvidia GPU we have some options for you.
Checkout [Google Colab](https://colab.google/). Google Colab offers a free [T4 GPU](https://www.nvidia.com/en-us/data-center/tesla-t4/) instance and ICICLE can be used with it, reference this guide for setting up your [Google Colab workplace][GOOGLE-COLAB-ICICLE].
If you require more compute and have an interesting research project, we have [bounty and grant programs][GRANT_PROGRAM].
### Build systems
ICICLE has three build systems.
- [ICICLE core][ICICLE-CORE], C++ and CUDA
- [ICICLE Rust][ICICLE-RUST] bindings, requires [Rust](https://www.rust-lang.org/) version 1.70 and above
- [ICICLE Golang][ICICLE-GO] bindings, requires [Go](https://go.dev/) version 1.20 and above
ICICLE core always needs to be built as part of the other build systems as it contains the core ICICLE primitives implemented in CUDA. Reference these guides for the different build systems, [ICICLE core guide][ICICLE-CORE-README], [ICICLE Rust guide][ICICLE-RUST-README] and [ICICLE Golang guide][ICICLE-GO-README].
### Compiling ICICLE
Running ICICLE via Rust bindings is highly recommended and simple:
- Clone this repo
- go to our [Rust bindings][ICICLE-RUST]
- Enter a [curve](./wrappers/rust/icicle-curves) implementation
- run `cargo build --release` to build or `cargo test` to build and execute tests
In any case you would want to compile and run core icicle c++ tests, just follow these steps:
- Clone this repo
- go to [ICICLE core][ICICLE-CORE]
- execute the small [script](https://github.com/ingonyama-zk/icicle/tree/main/icicle#running-tests) to compile via cmake and run c++ and cuda tests
## Docker
We offer a simple Docker container so you can simply run ICICLE without setting everything up locally.
```
docker build -t <name_of_your_choice> .
docker run --gpus all -it <name_of_your_choice> /bin/bash
```sh
mkdir -p build
nvcc -o build/<ENTER_DIR_NAME> ./icicle/appUtils/ntt/ntt.cu ./icicle/appUtils/msm/msm.cu ./icicle/appUtils/vector_manipulation/ve_mod_mult.cu ./icicle/primitives/projective.cu -lib -arch=native
```
### Testing the CUDA code
We are using [googletest] library for testing. To build and run [the test suite](./icicle/README.md) for finite field and elliptic curve arithmetic, run from the `icicle` folder:
```sh
mkdir -p build
cmake -S . -B build
cmake --build build
cd build && ctest
```
### Rust Bindings
For convenience, we also provide rust bindings to the ICICLE library for the following primitives:
- MSM
- NTT
- Forward NTT
- Inverse NTT
- ECNTT
- Forward ECNTT
- Inverse NTT
- Scalar Vector Multiplication
- Point Vector Multiplication
A custom [build script][B_SCRIPT] is used to compile and link the ICICLE library. The environement variable `ARCH_TYPE` is used to determine which GPU type the library should be compiled for and it defaults to `native` when it is not set allowing the compiler to detect the installed GPU type.
> NOTE: A GPU must be detectable and therefore installed if the `ARCH_TYPE` is not set.
Once you have your parameters set, run:
```sh
cargo build --release
```
You'll find a release ready library at `target/release/libicicle_utils.rlib`.
To benchmark and test the functionality available in RUST, run:
```
cargo bench
cargo test -- --test-threads=1
```
The flag `--test-threads=1` is needed because currently some tests might interfere with one another inside the GPU.
### Example Usage
An example of using the Rust bindings library can be found in our [fast-danksharding implementation][FDI]
### Supporting Additional Curves
Supporting additional curves can be done as follows:
Create a JSON file with the curve parameters. The curve is defined by the following parameters:
- ``curve_name`` - e.g. ``bls12_381``.
- ``modolus_p`` - scalar field modulus (in decimal).
- ``bit_count_p`` - number of bits needed to represent `` modolus_p`` .
- ``limb_p`` - number of bytes needed to represent `` modolus_p`` (rounded).
- ``ntt_size`` - log of the maximal size subgroup of the scalar field.
- ``modolus_q`` - base field modulus (in decimal).
- ``bit_count_q`` - number of bits needed to represent `` modolus_q`` .
- ``limb_q`` - number of bytes needed to represent ``modolus_q`` (rounded).
- ``weierstrass_b`` - Weierstrass constant of the curve.
- ``gen_x`` - x-value of a generator element for the curve.
- ``gen_y`` - y-value of a generator element for the curve.
Here's an example for BLS12-381.
```
{
"curve_name" : "bls12_381",
"modolus_p" : 52435875175126190479447740508185965837690552500527637822603658699938581184513,
"bit_count_p" : 255,
"limb_p" : 8,
"ntt_size" : 32,
"modolus_q" : 4002409555221667393417789825735904156556882819939007885332058136124031650490837864442687629129015664037894272559787,
"bit_count_q" : 381,
"limb_q" : 12,
"weierstrass_b" : 4,
"gen_x" : 3685416753713387016781088315183077757961620795782546409894578378688607592378376318836054947676345821548104185464507,
"gen_y" : 1339506544944476473020471379941921221584933875938349620426543736416511423956333506472724655353366534992391756441569
}
```
Save the parameters JSON file in ``curve_parameters``.
Then run the Python script ``new_curve_script.py `` from the main icicle folder:
```
python3 ./curve_parameters/new_curve_script_rust.py ./curve_parameters/bls12_381.json
```
The script does the following:
- Creates a folder in ``icicle/curves`` with the curve name, which contains all of the files needed for the supported operations in cuda.
- Adds the curve exported operations to ``icicle/curves/index.cu``.
- Creates a file with the curve name in ``src/curves`` with the relevant objects for the curve.
- Creates a test file with the curve name in ``src``.
Testing the new curve could be done by running the tests in ``tests_curve_name`` (e.g. ``tests_bls12_381``).
## Contributions
Join our [Discord Server][DISCORD] and find us on the icicle channel. We will be happy to work together to support your use case and talk features, bugs and design.
### Development Contributions
If you are changing code, please make sure to change your [git hooks path][HOOKS_DOCS] to the repo's [hooks directory][HOOKS_PATH] by running the following command:
```sh
git config core.hooksPath ./scripts/hooks
```
In case `clang-format` is missing on your system, you can install it using the following command:
```sh
sudo apt install clang-format
```
You will also need to install [codespell](https://github.com/codespell-project/codespell?tab=readme-ov-file#installation) to check for typos.
This will ensure our custom hooks are run and will make it easier to follow our coding guidelines.
Join our [Discord Server](https://discord.gg/Y4SkbDf2Ff) and find us on the icicle channel. We will be happy to work together to support your use case and talk features, bugs and design.
### Hall of Fame
- [Robik](https://github.com/robik75), for his ongoing support and mentorship
- [liuxiao](https://github.com/liuxiaobleach), for being a top notch bug smasher
- [gkigiermo](https://github.com/gkigiermo), for making it intuitive to use ICICLE in Google Colab
- [nonam3e](https://github.com/nonam3e), for adding Grumpkin curve support into ICICLE
- [alxiong](https://github.com/alxiong), for adding warmup for CudaStream
- [cyl19970726](https://github.com/cyl19970726), for updating go install source in Dockerfile
- [PatStiles](https://github.com/PatStiles), for adding Stark252 field
## Help & Support
For help and support talk to our devs in our discord channel ["ICICLE"](https://discord.gg/EVVXTdt6DF)
- [Robik](https://github.com/robik75), for his on-going support and mentorship
## License
@@ -133,26 +153,13 @@ ICICLE is distributed under the terms of the MIT License.
See [LICENSE-MIT][LMIT] for details.
<!-- Begin Links -->
[BLS12-381]: ./icicle/curves/
[BLS12-377]: ./icicle/curves/
[BN254]: ./icicle/curves/
[BW6-671]: ./icicle/curves/
[BLS12-381]: ./icicle/curves/bls12_381.cuh
[NVCC]: https://docs.nvidia.com/cuda/#installation-guides
[CRV_TEMPLATE]: ./icicle/curves/curve_template.cuh
[CRV_CONFIG]: ./icicle/curves/curve_config.cuh
[B_SCRIPT]: ./build.rs
[FDI]: https://github.com/ingonyama-zk/fast-danksharding
[LMIT]: ./LICENSE
[DISCORD]: https://discord.gg/Y4SkbDf2Ff
[googletest]: https://github.com/google/googletest/
[HOOKS_DOCS]: https://git-scm.com/docs/githooks
[HOOKS_PATH]: ./scripts/hooks/
[CMAKELISTS]: https://github.com/ingonyama-zk/icicle/blob/f0e6b465611227b858ec4590f4de5432e892748d/icicle/CMakeLists.txt#L28
[GOOGLE-COLAB-ICICLE]: https://dev.ingonyama.com/icicle/colab-instructions
[GRANT_PROGRAM]: https://medium.com/@ingonyama/icicle-for-researchers-grants-challenges-9be1f040998e
[ICICLE-CORE]: ./icicle/
[ICICLE-RUST]: ./wrappers/rust/
[ICICLE-GO]: ./wrappers/golang/
[ICICLE-CORE-README]: ./icicle/README.md
[ICICLE-RUST-README]: ./wrappers/rust/README.md
[ICICLE-GO-README]: ./wrappers/golang/README.md
[documentation]: https://dev.ingonyama.com/icicle/overview
[examples]: ./examples/
<!-- End Links -->

34
bls12-377/Cargo.toml Normal file
View File

@@ -0,0 +1,34 @@
[package]
name = "bls12-377"
version = "0.1.0"
edition = "2021"
authors = [ "Ingonyama" ]
[dependencies]
icicle-core = { path = "../icicle-core" }
hex = "*"
ark-std = "0.3.0"
ark-ff = "0.3.0"
ark-poly = "0.3.0"
ark-ec = { version = "0.3.0", features = [ "parallel" ] }
ark-bls12-377 = "0.3.0"
serde = { version = "1.0", features = ["derive"] }
serde_derive = "1.0"
serde_cbor = "0.11.2"
rustacuda = "0.1"
rustacuda_core = "0.1"
rustacuda_derive = "0.1"
rand = "*" #TODO: move rand and ark dependencies to dev once random scalar/point generation is done "natively"
[build-dependencies]
cc = { version = "1.0", features = ["parallel"] }
[dev-dependencies]
"criterion" = "0.4.0"
[features]
g2 = []

34
bls12-377/build.rs Normal file
View File

@@ -0,0 +1,34 @@
use std::env;
/// Cargo build script: compiles the icicle CUDA sources for BLS12-377 into a
/// static library linked as `ingo_icicle`.
fn main() {
    //TODO: check cargo features selected
    //TODO: can conflict/duplicate with make ?
    println!("cargo:rerun-if-env-changed=CXXFLAGS");
    // NOTE(review): this watches ./icicle but the sources compiled below live
    // under ../icicle-cuda — confirm the rerun path is the intended one.
    println!("cargo:rerun-if-changed=./icicle");

    // GPU architecture and CUDA default-stream mode, overridable via env vars.
    let arch = format!(
        "-arch={}",
        env::var("ARCH_TYPE").unwrap_or_else(|_| String::from("native"))
    );
    let stream = format!(
        "-default-stream={}",
        env::var("DEFAULT_STREAM").unwrap_or_else(|_| String::from("legacy"))
    );

    let mut nvcc = cc::Build::new();
    println!("Compiling icicle library using arch: {}", &arch);

    if cfg!(feature = "g2") {
        nvcc.define("G2_DEFINED", None);
    }
    nvcc.cuda(true);
    nvcc.define("FEATURE_BLS12_377", None);
    nvcc.debug(false);
    nvcc.flag(&arch);
    nvcc.flag(&stream);
    nvcc.files(["../icicle-cuda/curves/index.cu"]);
    nvcc.compile("ingo_icicle"); //TODO: extension??
}

View File

@@ -0,0 +1,4 @@
/// Compile-time description of a finite field whose elements are stored as a
/// fixed number of 32-bit limbs.
///
/// NOTE(review): `MODOLUS` is a typo for "modulus"; it is kept as-is because
/// renaming an associated const would break every implementor.
pub trait Field<const NUM_LIMBS: usize> {
    // The field modulus as u32 limbs — limb order not stated here, TODO confirm.
    const MODOLUS: [u32;NUM_LIMBS];
    // Limb count, mirroring the const generic for use in array sizes.
    const LIMBS: usize = NUM_LIMBS;
}

View File

@@ -0,0 +1,3 @@
pub mod field;
pub mod scalar;
pub mod point;

View File

@@ -0,0 +1,106 @@
use std::ffi::c_uint;
use ark_ec::AffineCurve;
use ark_ff::{BigInteger256, PrimeField};
use std::mem::transmute;
use ark_ff::Field;
use icicle_core::utils::{u32_vec_to_u64_vec, u64_vec_to_u32_vec};
use rustacuda_core::DeviceCopy;
use rustacuda_derive::DeviceCopy;
use super::scalar::{get_fixed_limbs, self};
/// A curve point in projective coordinates (x : y : z) over base field `BF`.
///
/// NOTE(review): #[repr(C)] suggests the (x, y, z) layout must match the
/// CUDA-side struct — confirm against the device code.
#[derive(Debug, Clone, Copy, DeviceCopy)]
#[repr(C)]
pub struct PointT<BF: scalar::ScalarTrait> {
    pub x: BF,
    pub y: BF,
    pub z: BF,
}
/// The default point is the identity element returned by [`PointT::zero`].
impl<BF: DeviceCopy + scalar::ScalarTrait> Default for PointT<BF> {
    fn default() -> Self {
        Self::zero()
    }
}
impl<BF: DeviceCopy + scalar::ScalarTrait> PointT<BF> {
    /// The projective triple (0 : 1 : 0), used here as the group identity.
    pub fn zero() -> Self {
        Self {
            x: BF::zero(),
            y: BF::one(),
            z: BF::zero(),
        }
    }

    /// Alias for [`Self::zero`]: the point at infinity.
    pub fn infinity() -> Self {
        Self::zero()
    }
}
/// An affine curve point (x, y); by construction it cannot represent the
/// point at infinity.
#[derive(Debug, PartialEq, Clone, Copy, DeviceCopy)]
#[repr(C)]
pub struct PointAffineNoInfinityT<BF> {
    pub x: BF,
    pub y: BF,
}
/// Defaults to the all-zero coordinate pair (note: (0, 0) is not necessarily
/// on the curve; this mirrors the zero-initialized C-side value).
impl<BF: scalar::ScalarTrait> Default for PointAffineNoInfinityT<BF> {
    fn default() -> Self {
        Self {
            x: BF::zero(),
            y: BF::zero(),
        }
    }
}
impl<BF: Copy + scalar::ScalarTrait> PointAffineNoInfinityT<BF> {
    /// Builds an affine point from u32 limb slices for x and y.
    pub fn from_limbs(x: &[u32], y: &[u32]) -> Self {
        Self {
            x: BF::from_limbs(x),
            y: BF::from_limbs(y),
        }
    }

    /// Returns the limbs of x followed by the limbs of y.
    pub fn limbs(&self) -> Vec<u32> {
        let mut out = self.x.limbs().to_vec();
        out.extend_from_slice(self.y.limbs());
        out
    }

    /// Lifts to projective coordinates by appending z = 1.
    pub fn to_projective(&self) -> PointT<BF> {
        PointT {
            x: self.x,
            y: self.y,
            z: BF::one(),
        }
    }
}
impl<BF: Copy + scalar::ScalarTrait> PointT<BF> {
    /// Builds a projective point from separate x, y, z limb slices.
    pub fn from_limbs(x: &[u32], y: &[u32], z: &[u32]) -> Self {
        Self {
            x: BF::from_limbs(x),
            y: BF::from_limbs(y),
            z: BF::from_limbs(z),
        }
    }

    /// Builds a projective point from one slice laid out as x ‖ y ‖ z.
    ///
    /// Panics unless `value.len() == 3 * BF::base_limbs()`.
    pub fn from_xy_limbs(value: &[u32]) -> PointT<BF> {
        let limbs = BF::base_limbs();
        assert_eq!(value.len(), 3 * limbs, "length must be 3 * {}", limbs);
        let (x, rest) = value.split_at(limbs);
        let (y, z) = rest.split_at(limbs);
        PointT {
            x: BF::from_limbs(x),
            y: BF::from_limbs(y),
            z: BF::from_limbs(z),
        }
    }

    /// Drops z and reinterprets (x, y) as affine WITHOUT normalizing by z.
    pub fn to_xy_strip_z(&self) -> PointAffineNoInfinityT<BF> {
        PointAffineNoInfinityT {
            x: self.x,
            y: self.y,
        }
    }
}

View File

@@ -0,0 +1,102 @@
use std::ffi::{c_int, c_uint};
use rand::{rngs::StdRng, RngCore, SeedableRng};
use rustacuda_core::DeviceCopy;
use rustacuda_derive::DeviceCopy;
use std::mem::transmute;
use rustacuda::prelude::*;
use rustacuda_core::DevicePointer;
use rustacuda::memory::{DeviceBox, CopyDestination};
use icicle_core::utils::{u32_vec_to_u64_vec, u64_vec_to_u32_vec};
use std::marker::PhantomData;
use std::convert::TryInto;
use super::field::{Field, self};
/// Zero-pads `val` to exactly `NUM_LIMBS` u32 limbs.
///
/// Returns the limbs unchanged when `val` already has `NUM_LIMBS` entries,
/// pads the tail with zeros when it has fewer, and panics when it has more.
pub fn get_fixed_limbs<const NUM_LIMBS: usize>(val: &[u32]) -> [u32; NUM_LIMBS] {
    if val.len() > NUM_LIMBS {
        panic!("slice has too many elements");
    }
    let mut limbs = [0u32; NUM_LIMBS];
    limbs[..val.len()].copy_from_slice(val);
    limbs
}
/// Minimal interface of a limb-backed scalar, used generically by the point
/// types in this crate.
pub trait ScalarTrait{
    /// Number of u32 limbs in the representation.
    fn base_limbs() -> usize;
    /// Additive identity (all limbs zero).
    fn zero() -> Self;
    /// Builds a value from up to `base_limbs()` u32 limbs (zero-padded).
    fn from_limbs(value: &[u32]) -> Self;
    /// Multiplicative identity (limb 0 = 1).
    fn one() -> Self;
    /// Little-endian byte serialization (limb 0 first).
    fn to_bytes_le(&self) -> Vec<u8>;
    /// Borrow of the raw limb representation.
    fn limbs(&self) -> &[u32];
}
/// A scalar stored as `NUM_LIMBS` u32 limbs, tagged with a zero-sized field
/// marker `M` so that scalars of different fields are distinct Rust types.
#[derive(Debug, PartialEq, Clone, Copy)]
#[repr(C)]
pub struct ScalarT<M, const NUM_LIMBS: usize> {
    // Zero-sized marker tying this value to its field type.
    pub(crate) phantom: PhantomData<M>,
    // Limb data; #[repr(C)] so the struct can cross the FFI boundary.
    pub(crate) value : [u32; NUM_LIMBS]
}
impl<M, const NUM_LIMBS: usize> ScalarTrait for ScalarT<M, NUM_LIMBS>
where
    M: Field<NUM_LIMBS>,
{
    /// Number of u32 limbs backing a scalar of this field.
    fn base_limbs() -> usize {
        NUM_LIMBS
    }

    /// Additive identity: all limbs zero.
    fn zero() -> Self {
        ScalarT {
            value: [0u32; NUM_LIMBS],
            phantom: PhantomData,
        }
    }

    /// Builds a scalar from up to `NUM_LIMBS` u32 limbs, zero-padding the
    /// tail; panics if `value` is longer than `NUM_LIMBS`.
    fn from_limbs(value: &[u32]) -> Self {
        Self {
            value: get_fixed_limbs(value),
            phantom: PhantomData,
        }
    }

    /// Multiplicative identity: limb 0 set to one, the rest zero.
    fn one() -> Self {
        let mut s = [0u32; NUM_LIMBS];
        s[0] = 1;
        ScalarT { value: s, phantom: PhantomData }
    }

    /// Serializes the limbs to little-endian bytes (limb 0 first).
    fn to_bytes_le(&self) -> Vec<u8> {
        // flat_map replaces the clippy-flagged map(..).flatten() and drops the
        // intermediate per-limb Vec allocation.
        self.value
            .iter()
            .flat_map(|limb| limb.to_le_bytes())
            .collect()
    }

    /// Borrows the raw limb representation.
    fn limbs(&self) -> &[u32] {
        &self.value
    }
}
impl<M, const NUM_LIMBS: usize> ScalarT<M, NUM_LIMBS>
where
    M: field::Field<NUM_LIMBS>,
{
    /// Builds a scalar from little-endian u32 limbs (limb 0 least significant).
    pub fn from_limbs_le(value: &[u32]) -> ScalarT<M, NUM_LIMBS> {
        Self::from_limbs(value)
    }

    /// Builds a scalar from big-endian u32 limbs by reversing into LE order.
    pub fn from_limbs_be(value: &[u32]) -> ScalarT<M, NUM_LIMBS> {
        let mut value = value.to_vec();
        value.reverse();
        Self::from_limbs_le(&value)
    }

    // Additional Functions

    /// Multi-limb addition with carry propagation (limb 0 least significant).
    ///
    /// BUGFIX: the previous version computed `self.value[0] + other.value[0]`
    /// and broadcast that single sum into every limb, which is not a
    /// multi-limb integer addition (and could overflow-panic in debug builds).
    ///
    /// NOTE(review): no modular reduction is performed — the result wraps at
    /// 2^(32 * NUM_LIMBS); reduction by `M::MODOLUS` is still TODO.
    pub fn add(&self, other: ScalarT<M, NUM_LIMBS>) -> ScalarT<M, NUM_LIMBS> { // overload +
        let mut limbs = [0u32; NUM_LIMBS];
        let mut carry = 0u64;
        for i in 0..NUM_LIMBS {
            let sum = self.value[i] as u64 + other.value[i] as u64 + carry;
            limbs[i] = sum as u32;
            carry = sum >> 32;
        }
        ScalarT { value: limbs, phantom: PhantomData }
    }
}

View File

@@ -0,0 +1,62 @@
use std::ffi::{c_int, c_uint};
use rand::{rngs::StdRng, RngCore, SeedableRng};
use rustacuda_derive::DeviceCopy;
use std::mem::transmute;
use rustacuda::prelude::*;
use rustacuda_core::DevicePointer;
use rustacuda::memory::{DeviceBox, CopyDestination, DeviceCopy};
use std::marker::PhantomData;
use std::convert::TryInto;
use crate::basic_structs::point::{PointT, PointAffineNoInfinityT};
use crate::basic_structs::scalar::ScalarT;
use crate::basic_structs::field::Field;
/// Marker type for the BLS12-377 scalar field (8 × u32 = 256 bits of limbs).
#[derive(Debug, PartialEq, Clone, Copy,DeviceCopy)]
#[repr(C)]
pub struct ScalarField;
impl Field<8> for ScalarField {
    // TODO: placeholder — the real scalar-field modulus is not filled in yet.
    const MODOLUS: [u32; 8] = [0x0;8];
}
/// Marker type for the BLS12-377 base field (12 × u32 = 384 bits of limbs).
#[derive(Debug, PartialEq, Clone, Copy,DeviceCopy)]
#[repr(C)]
pub struct BaseField;
impl Field<12> for BaseField {
    // TODO: placeholder — the real base-field modulus is not filled in yet.
    const MODOLUS: [u32; 12] = [0x0;12];
}
/// A BLS12-377 scalar-field element (8 u32 limbs).
pub type Scalar = ScalarT<ScalarField,8>;
impl Default for Scalar {
    fn default() -> Self {
        Self{value: [0x0;ScalarField::LIMBS], phantom: PhantomData }
    }
}
// SAFETY: Scalar is #[repr(C)] plain data (a u32 array plus a zero-sized
// PhantomData), so a bitwise copy to/from the device is sound.
unsafe impl DeviceCopy for Scalar{}
/// A BLS12-377 base-field element (12 u32 limbs).
pub type Base = ScalarT<BaseField,12>;
impl Default for Base {
    fn default() -> Self {
        Self{value: [0x0;BaseField::LIMBS], phantom: PhantomData }
    }
}
// SAFETY: Base is #[repr(C)] plain data (a u32 array plus a zero-sized
// PhantomData), so a bitwise copy to/from the device is sound.
unsafe impl DeviceCopy for Base{}
/// Projective BLS12-377 G1 point over the base field.
pub type Point = PointT<Base>;
/// Affine BLS12-377 G1 point; cannot represent the point at infinity.
pub type PointAffineNoInfinity = PointAffineNoInfinityT<Base>;
extern "C" {
    // Device-side projective-point equality; the Rust wrapper below treats any
    // nonzero return as "equal".
    // NOTE(review): the unmangled symbol name `eq` is extremely collision-prone
    // across curves/libraries — consider a curve-prefixed symbol.
    fn eq(point1: *const Point, point2: *const Point) -> c_uint;
}
impl PartialEq for Point {
    fn eq(&self, other: &Self) -> bool {
        // SAFETY: both references are valid #[repr(C)] Points for the duration
        // of the call; assumes the C side only reads them — TODO confirm.
        unsafe { eq(self, other) != 0 }
    }
}

798
bls12-377/src/from_cuda.rs Normal file
View File

@@ -0,0 +1,798 @@
use std::ffi::{c_int, c_uint};
use ark_std::UniformRand;
use rand::{rngs::StdRng, RngCore, SeedableRng};
use rustacuda::CudaFlags;
use rustacuda::memory::DeviceBox;
use rustacuda::prelude::{DeviceBuffer, Device, ContextFlags, Context};
use rustacuda_core::DevicePointer;
use std::mem::transmute;
use crate::basic_structs::scalar::ScalarTrait;
use crate::curve_structs::*;
use icicle_core::utils::{u32_vec_to_u64_vec, u64_vec_to_u32_vec};
use std::marker::PhantomData;
use std::convert::TryInto;
use ark_bls12_377::{Fq as Fq_BLS12_377, Fr as Fr_BLS12_377, G1Affine as G1Affine_BLS12_377, G1Projective as G1Projective_BLS12_377};
use ark_ec::AffineCurve;
use ark_ff::{BigInteger384, BigInteger256, PrimeField};
use rustacuda::memory::{CopyDestination, DeviceCopy};
extern "C" {
fn msm_cuda(
out: *mut Point,
points: *const PointAffineNoInfinity,
scalars: *const Scalar,
count: usize,
device_id: usize,
) -> c_uint;
fn msm_batch_cuda(
out: *mut Point,
points: *const PointAffineNoInfinity,
scalars: *const Scalar,
batch_size: usize,
msm_size: usize,
device_id: usize,
) -> c_uint;
fn commit_cuda(
d_out: DevicePointer<Point>,
d_scalars: DevicePointer<Scalar>,
d_points: DevicePointer<PointAffineNoInfinity>,
count: usize,
device_id: usize,
) -> c_uint;
fn commit_batch_cuda(
d_out: DevicePointer<Point>,
d_scalars: DevicePointer<Scalar>,
d_points: DevicePointer<PointAffineNoInfinity>,
count: usize,
batch_size: usize,
device_id: usize,
) -> c_uint;
fn build_domain_cuda(domain_size: usize, logn: usize, inverse: bool, device_id: usize) -> DevicePointer<Scalar>;
fn ntt_cuda(inout: *mut Scalar, n: usize, inverse: bool, device_id: usize) -> c_int;
fn ecntt_cuda(inout: *mut Point, n: usize, inverse: bool, device_id: usize) -> c_int;
fn ntt_batch_cuda(
inout: *mut Scalar,
arr_size: usize,
n: usize,
inverse: bool,
) -> c_int;
fn ecntt_batch_cuda(inout: *mut Point, arr_size: usize, n: usize, inverse: bool) -> c_int;
fn interpolate_scalars_cuda(
d_out: DevicePointer<Scalar>,
d_evaluations: DevicePointer<Scalar>,
d_domain: DevicePointer<Scalar>,
n: usize,
device_id: usize
) -> c_int;
fn interpolate_scalars_batch_cuda(
d_out: DevicePointer<Scalar>,
d_evaluations: DevicePointer<Scalar>,
d_domain: DevicePointer<Scalar>,
n: usize,
batch_size: usize,
device_id: usize
) -> c_int;
fn interpolate_points_cuda(
d_out: DevicePointer<Point>,
d_evaluations: DevicePointer<Point>,
d_domain: DevicePointer<Scalar>,
n: usize,
device_id: usize
) -> c_int;
fn interpolate_points_batch_cuda(
d_out: DevicePointer<Point>,
d_evaluations: DevicePointer<Point>,
d_domain: DevicePointer<Scalar>,
n: usize,
batch_size: usize,
device_id: usize
) -> c_int;
fn evaluate_scalars_cuda(
d_out: DevicePointer<Scalar>,
d_coefficients: DevicePointer<Scalar>,
d_domain: DevicePointer<Scalar>,
domain_size: usize,
n: usize,
device_id: usize
) -> c_int;
fn evaluate_scalars_batch_cuda(
d_out: DevicePointer<Scalar>,
d_coefficients: DevicePointer<Scalar>,
d_domain: DevicePointer<Scalar>,
domain_size: usize,
n: usize,
batch_size: usize,
device_id: usize
) -> c_int;
fn evaluate_points_cuda(
d_out: DevicePointer<Point>,
d_coefficients: DevicePointer<Point>,
d_domain: DevicePointer<Scalar>,
domain_size: usize,
n: usize,
device_id: usize
) -> c_int;
fn evaluate_points_batch_cuda(
d_out: DevicePointer<Point>,
d_coefficients: DevicePointer<Point>,
d_domain: DevicePointer<Scalar>,
domain_size: usize,
n: usize,
batch_size: usize,
device_id: usize
) -> c_int;
fn evaluate_scalars_on_coset_cuda(
d_out: DevicePointer<Scalar>,
d_coefficients: DevicePointer<Scalar>,
d_domain: DevicePointer<Scalar>,
domain_size: usize,
n: usize,
coset_powers: DevicePointer<Scalar>,
device_id: usize
) -> c_int;
fn evaluate_scalars_on_coset_batch_cuda(
d_out: DevicePointer<Scalar>,
d_coefficients: DevicePointer<Scalar>,
d_domain: DevicePointer<Scalar>,
domain_size: usize,
n: usize,
batch_size: usize,
coset_powers: DevicePointer<Scalar>,
device_id: usize
) -> c_int;
fn evaluate_points_on_coset_cuda(
d_out: DevicePointer<Point>,
d_coefficients: DevicePointer<Point>,
d_domain: DevicePointer<Scalar>,
domain_size: usize,
n: usize,
coset_powers: DevicePointer<Scalar>,
device_id: usize
) -> c_int;
fn evaluate_points_on_coset_batch_cuda(
d_out: DevicePointer<Point>,
d_coefficients: DevicePointer<Point>,
d_domain: DevicePointer<Scalar>,
domain_size: usize,
n: usize,
batch_size: usize,
coset_powers: DevicePointer<Scalar>,
device_id: usize
) -> c_int;
fn reverse_order_scalars_cuda(
d_arr: DevicePointer<Scalar>,
n: usize,
device_id: usize
) -> c_int;
fn reverse_order_scalars_batch_cuda(
d_arr: DevicePointer<Scalar>,
n: usize,
batch_size: usize,
device_id: usize
) -> c_int;
fn reverse_order_points_cuda(
d_arr: DevicePointer<Point>,
n: usize,
device_id: usize
) -> c_int;
fn reverse_order_points_batch_cuda(
d_arr: DevicePointer<Point>,
n: usize,
batch_size: usize,
device_id: usize
) -> c_int;
fn vec_mod_mult_point(
inout: *mut Point,
scalars: *const Scalar,
n_elements: usize,
device_id: usize,
) -> c_int;
fn vec_mod_mult_scalar(
inout: *mut Scalar,
scalars: *const Scalar,
n_elements: usize,
device_id: usize,
) -> c_int;
fn matrix_vec_mod_mult(
matrix_flattened: *const Scalar,
input: *const Scalar,
output: *mut Scalar,
n_elements: usize,
device_id: usize,
) -> c_int;
}
/// Multi-scalar multiplication on the GPU: returns Σ scalars[i] · points[i].
///
/// `points` and `scalars` must have equal length — unequal lengths currently
/// hit `todo!` and panic.
///
/// NOTE(review): the status code returned by `msm_cuda` is discarded, so a
/// failed launch silently yields the zero point — consider checking it.
pub fn msm(points: &[PointAffineNoInfinity], scalars: &[Scalar], device_id: usize) -> Point {
    let count = points.len();
    if count != scalars.len() {
        todo!("variable length")
    }
    let mut ret = Point::zero();
    unsafe {
        msm_cuda(
            &mut ret as *mut _ as *mut Point,
            points as *const _ as *const PointAffineNoInfinity,
            scalars as *const _ as *const Scalar,
            scalars.len(),
            device_id,
        )
    };
    ret
}
pub fn msm_batch(
points: &[PointAffineNoInfinity],
scalars: &[Scalar],
batch_size: usize,
device_id: usize,
) -> Vec<Point> {
let count = points.len();
if count != scalars.len() {
todo!("variable length")
}
let mut ret = vec![Point::zero(); batch_size];
unsafe {
msm_batch_cuda(
&mut ret[0] as *mut _ as *mut Point,
points as *const _ as *const PointAffineNoInfinity,
scalars as *const _ as *const Scalar,
batch_size,
count / batch_size,
device_id,
)
};
ret
}
pub fn commit(
points: &mut DeviceBuffer<PointAffineNoInfinity>,
scalars: &mut DeviceBuffer<Scalar>,
) -> DeviceBox<Point> {
let mut res = DeviceBox::new(&Point::zero()).unwrap();
unsafe {
commit_cuda(
res.as_device_ptr(),
scalars.as_device_ptr(),
points.as_device_ptr(),
scalars.len(),
0,
);
}
return res;
}
pub fn commit_batch(
points: &mut DeviceBuffer<PointAffineNoInfinity>,
scalars: &mut DeviceBuffer<Scalar>,
batch_size: usize,
) -> DeviceBuffer<Point> {
let mut res = unsafe { DeviceBuffer::uninitialized(batch_size).unwrap() };
unsafe {
commit_batch_cuda(
res.as_device_ptr(),
scalars.as_device_ptr(),
points.as_device_ptr(),
scalars.len() / batch_size,
batch_size,
0,
);
}
return res;
}
/// Compute an in-place NTT on the input data.
fn ntt_internal(values: &mut [Scalar], device_id: usize, inverse: bool) -> i32 {
let ret_code = unsafe {
ntt_cuda(
values as *mut _ as *mut Scalar,
values.len(),
inverse,
device_id,
)
};
ret_code
}
pub fn ntt(values: &mut [Scalar], device_id: usize) {
ntt_internal(values, device_id, false);
}
pub fn intt(values: &mut [Scalar], device_id: usize) {
ntt_internal(values, device_id, true);
}
/// Compute an in-place NTT on the input data.
fn ntt_internal_batch(
values: &mut [Scalar],
device_id: usize,
batch_size: usize,
inverse: bool,
) -> i32 {
unsafe {
ntt_batch_cuda(
values as *mut _ as *mut Scalar,
values.len(),
batch_size,
inverse,
)
}
}
/// In-place forward NTT over `batch_size` contiguous sub-arrays of `values`.
///
/// Fix: forward `device_id` instead of hard-coding 0, matching the other
/// public wrappers in this file. (`ntt_internal_batch` currently ignores it,
/// so observable behavior is unchanged.)
pub fn ntt_batch(values: &mut [Scalar], batch_size: usize, device_id: usize) {
    ntt_internal_batch(values, device_id, batch_size, false);
}
/// In-place inverse NTT over `batch_size` contiguous sub-arrays of `values`.
///
/// Fix: forward `device_id` instead of hard-coding 0 (previously the
/// parameter was silently ignored).
pub fn intt_batch(values: &mut [Scalar], batch_size: usize, device_id: usize) {
    ntt_internal_batch(values, device_id, batch_size, true);
}
/// Compute an in-place ECNTT on the input data.
fn ecntt_internal(values: &mut [Point], inverse: bool, device_id: usize) -> i32 {
unsafe {
ecntt_cuda(
values as *mut _ as *mut Point,
values.len(),
inverse,
device_id,
)
}
}
pub fn ecntt(values: &mut [Point], device_id: usize) {
ecntt_internal(values, false, device_id);
}
/// Compute an in-place iECNTT on the input data.
pub fn iecntt(values: &mut [Point], device_id: usize) {
ecntt_internal(values, true, device_id);
}
/// Compute an in-place ECNTT on the input data.
fn ecntt_internal_batch(
values: &mut [Point],
device_id: usize,
batch_size: usize,
inverse: bool,
) -> i32 {
unsafe {
ecntt_batch_cuda(
values as *mut _ as *mut Point,
values.len(),
batch_size,
inverse,
)
}
}
/// In-place forward ECNTT over `batch_size` contiguous sub-arrays of `values`.
///
/// Fix: forward `device_id` instead of hard-coding 0 (previously the
/// parameter was silently ignored).
pub fn ecntt_batch(values: &mut [Point], batch_size: usize, device_id: usize) {
    ecntt_internal_batch(values, device_id, batch_size, false);
}
/// Compute an in-place iECNTT on the input data.
///
/// Fix: forward `device_id` instead of hard-coding 0 (previously the
/// parameter was silently ignored).
pub fn iecntt_batch(values: &mut [Point], batch_size: usize, device_id: usize) {
    ecntt_internal_batch(values, device_id, batch_size, true);
}
pub fn build_domain(domain_size: usize, logn: usize, inverse: bool) -> DeviceBuffer<Scalar> {
unsafe {
DeviceBuffer::from_raw_parts(build_domain_cuda(
domain_size,
logn,
inverse,
0
), domain_size)
}
}
pub fn reverse_order_scalars(
d_scalars: &mut DeviceBuffer<Scalar>,
) {
unsafe { reverse_order_scalars_cuda(
d_scalars.as_device_ptr(),
d_scalars.len(),
0
); }
}
pub fn reverse_order_scalars_batch(
d_scalars: &mut DeviceBuffer<Scalar>,
batch_size: usize,
) {
unsafe { reverse_order_scalars_batch_cuda(
d_scalars.as_device_ptr(),
d_scalars.len() / batch_size,
batch_size,
0
); }
}
pub fn reverse_order_points(
d_points: &mut DeviceBuffer<Point>,
) {
unsafe { reverse_order_points_cuda(
d_points.as_device_ptr(),
d_points.len(),
0
); }
}
pub fn reverse_order_points_batch(
d_points: &mut DeviceBuffer<Point>,
batch_size: usize,
) {
unsafe { reverse_order_points_batch_cuda(
d_points.as_device_ptr(),
d_points.len() / batch_size,
batch_size,
0
); }
}
pub fn interpolate_scalars(
d_evaluations: &mut DeviceBuffer<Scalar>,
d_domain: &mut DeviceBuffer<Scalar>
) -> DeviceBuffer<Scalar> {
let mut res = unsafe { DeviceBuffer::uninitialized(d_domain.len()).unwrap() };
unsafe { interpolate_scalars_cuda(
res.as_device_ptr(),
d_evaluations.as_device_ptr(),
d_domain.as_device_ptr(),
d_domain.len(),
0
) };
return res;
}
pub fn interpolate_scalars_batch(
d_evaluations: &mut DeviceBuffer<Scalar>,
d_domain: &mut DeviceBuffer<Scalar>,
batch_size: usize,
) -> DeviceBuffer<Scalar> {
let mut res = unsafe { DeviceBuffer::uninitialized(d_domain.len() * batch_size).unwrap() };
unsafe { interpolate_scalars_batch_cuda(
res.as_device_ptr(),
d_evaluations.as_device_ptr(),
d_domain.as_device_ptr(),
d_domain.len(),
batch_size,
0
) };
return res;
}
pub fn interpolate_points(
d_evaluations: &mut DeviceBuffer<Point>,
d_domain: &mut DeviceBuffer<Scalar>,
) -> DeviceBuffer<Point> {
let mut res = unsafe { DeviceBuffer::uninitialized(d_domain.len()).unwrap() };
unsafe { interpolate_points_cuda(
res.as_device_ptr(),
d_evaluations.as_device_ptr(),
d_domain.as_device_ptr(),
d_domain.len(),
0
) };
return res;
}
pub fn interpolate_points_batch(
d_evaluations: &mut DeviceBuffer<Point>,
d_domain: &mut DeviceBuffer<Scalar>,
batch_size: usize,
) -> DeviceBuffer<Point> {
let mut res = unsafe { DeviceBuffer::uninitialized(d_domain.len() * batch_size).unwrap() };
unsafe { interpolate_points_batch_cuda(
res.as_device_ptr(),
d_evaluations.as_device_ptr(),
d_domain.as_device_ptr(),
d_domain.len(),
batch_size,
0
) };
return res;
}
pub fn evaluate_scalars(
d_coefficients: &mut DeviceBuffer<Scalar>,
d_domain: &mut DeviceBuffer<Scalar>,
) -> DeviceBuffer<Scalar> {
let mut res = unsafe { DeviceBuffer::uninitialized(d_domain.len()).unwrap() };
unsafe {
evaluate_scalars_cuda(
res.as_device_ptr(),
d_coefficients.as_device_ptr(),
d_domain.as_device_ptr(),
d_domain.len(),
d_coefficients.len(),
0
);
}
return res;
}
pub fn evaluate_scalars_batch(
d_coefficients: &mut DeviceBuffer<Scalar>,
d_domain: &mut DeviceBuffer<Scalar>,
batch_size: usize,
) -> DeviceBuffer<Scalar> {
let mut res = unsafe { DeviceBuffer::uninitialized(d_domain.len() * batch_size).unwrap() };
unsafe {
evaluate_scalars_batch_cuda(
res.as_device_ptr(),
d_coefficients.as_device_ptr(),
d_domain.as_device_ptr(),
d_domain.len(),
d_coefficients.len() / batch_size,
batch_size,
0
);
}
return res;
}
pub fn evaluate_points(
d_coefficients: &mut DeviceBuffer<Point>,
d_domain: &mut DeviceBuffer<Scalar>,
) -> DeviceBuffer<Point> {
let mut res = unsafe { DeviceBuffer::uninitialized(d_domain.len()).unwrap() };
unsafe {
evaluate_points_cuda(
res.as_device_ptr(),
d_coefficients.as_device_ptr(),
d_domain.as_device_ptr(),
d_domain.len(),
d_coefficients.len(),
0
);
}
return res;
}
pub fn evaluate_points_batch(
d_coefficients: &mut DeviceBuffer<Point>,
d_domain: &mut DeviceBuffer<Scalar>,
batch_size: usize,
) -> DeviceBuffer<Point> {
let mut res = unsafe { DeviceBuffer::uninitialized(d_domain.len() * batch_size).unwrap() };
unsafe {
evaluate_points_batch_cuda(
res.as_device_ptr(),
d_coefficients.as_device_ptr(),
d_domain.as_device_ptr(),
d_domain.len(),
d_coefficients.len() / batch_size,
batch_size,
0
);
}
return res;
}
pub fn evaluate_scalars_on_coset(
d_coefficients: &mut DeviceBuffer<Scalar>,
d_domain: &mut DeviceBuffer<Scalar>,
coset_powers: &mut DeviceBuffer<Scalar>,
) -> DeviceBuffer<Scalar> {
let mut res = unsafe { DeviceBuffer::uninitialized(d_domain.len()).unwrap() };
unsafe {
evaluate_scalars_on_coset_cuda(
res.as_device_ptr(),
d_coefficients.as_device_ptr(),
d_domain.as_device_ptr(),
d_domain.len(),
d_coefficients.len(),
coset_powers.as_device_ptr(),
0
);
}
return res;
}
pub fn evaluate_scalars_on_coset_batch(
d_coefficients: &mut DeviceBuffer<Scalar>,
d_domain: &mut DeviceBuffer<Scalar>,
batch_size: usize,
coset_powers: &mut DeviceBuffer<Scalar>,
) -> DeviceBuffer<Scalar> {
let mut res = unsafe { DeviceBuffer::uninitialized(d_domain.len() * batch_size).unwrap() };
unsafe {
evaluate_scalars_on_coset_batch_cuda(
res.as_device_ptr(),
d_coefficients.as_device_ptr(),
d_domain.as_device_ptr(),
d_domain.len(),
d_coefficients.len() / batch_size,
batch_size,
coset_powers.as_device_ptr(),
0
);
}
return res;
}
pub fn evaluate_points_on_coset(
d_coefficients: &mut DeviceBuffer<Point>,
d_domain: &mut DeviceBuffer<Scalar>,
coset_powers: &mut DeviceBuffer<Scalar>,
) -> DeviceBuffer<Point> {
let mut res = unsafe { DeviceBuffer::uninitialized(d_domain.len()).unwrap() };
unsafe {
evaluate_points_on_coset_cuda(
res.as_device_ptr(),
d_coefficients.as_device_ptr(),
d_domain.as_device_ptr(),
d_domain.len(),
d_coefficients.len(),
coset_powers.as_device_ptr(),
0
);
}
return res;
}
pub fn evaluate_points_on_coset_batch(
d_coefficients: &mut DeviceBuffer<Point>,
d_domain: &mut DeviceBuffer<Scalar>,
batch_size: usize,
coset_powers: &mut DeviceBuffer<Scalar>,
) -> DeviceBuffer<Point> {
let mut res = unsafe { DeviceBuffer::uninitialized(d_domain.len() * batch_size).unwrap() };
unsafe {
evaluate_points_on_coset_batch_cuda(
res.as_device_ptr(),
d_coefficients.as_device_ptr(),
d_domain.as_device_ptr(),
d_domain.len(),
d_coefficients.len() / batch_size,
batch_size,
coset_powers.as_device_ptr(),
0
);
}
return res;
}
pub fn multp_vec(a: &mut [Point], b: &[Scalar], device_id: usize) {
assert_eq!(a.len(), b.len());
unsafe {
vec_mod_mult_point(
a as *mut _ as *mut Point,
b as *const _ as *const Scalar,
a.len(),
device_id,
);
}
}
pub fn mult_sc_vec(a: &mut [Scalar], b: &[Scalar], device_id: usize) {
assert_eq!(a.len(), b.len());
unsafe {
vec_mod_mult_scalar(
a as *mut _ as *mut Scalar,
b as *const _ as *const Scalar,
a.len(),
device_id,
);
}
}
// Multiply a flattened matrix by a vector:
// `a` - flattened matrix (the previous comment incorrectly said "by a scalar");
// `b` - vector to multiply `a` by;
pub fn mult_matrix_by_vec(a: &[Scalar], b: &[Scalar], device_id: usize) -> Vec<Scalar> {
    // vec! replaces the push loop, which also triggered an unused-variable
    // warning on the loop index.
    let mut c = vec![Scalar::zero(); b.len()];
    unsafe {
        matrix_vec_mod_mult(
            a as *const _ as *const Scalar,
            b as *const _ as *const Scalar,
            c.as_mut_slice() as *mut _ as *mut Scalar,
            b.len(),
            device_id,
        );
    }
    c
}
/// Allocates a new device buffer of the same length and copies `buf` into it.
pub fn clone_buffer<T: DeviceCopy>(buf: &mut DeviceBuffer<T>) -> DeviceBuffer<T> {
    let mut buf_cpy = unsafe { DeviceBuffer::uninitialized(buf.len()).unwrap() };
    // Fix: the CudaResult of the copy was silently dropped, so a failed copy
    // would hand back uninitialized device memory. (The copy itself is a safe
    // API; the unneeded `unsafe` block is removed.)
    buf_cpy.copy_from(buf).unwrap();
    buf_cpy
}
/// Returns a deterministic `StdRng` when `seed` is `Some`, otherwise the
/// thread-local RNG.
pub fn get_rng(seed: Option<u64>) -> Box<dyn RngCore> {
    match seed {
        Some(s) => Box::new(StdRng::seed_from_u64(s)),
        None => Box::new(rand::thread_rng()),
    }
}
/// Initializes the CUDA driver and pushes a context for device 0.
///
/// NOTE(review): the context is bound to `_ctx`, which is dropped when this
/// function returns — verify the context remains current for callers'
/// subsequent CUDA calls, or return/leak it deliberately.
fn set_up_device() {
    // Set up the context, load the module, and create a stream to run kernels in.
    rustacuda::init(CudaFlags::empty()).unwrap();
    let device = Device::get_device(0).unwrap();
    let _ctx = Context::create_and_push(ContextFlags::MAP_HOST | ContextFlags::SCHED_AUTO, device).unwrap();
}
pub fn generate_random_points(
count: usize,
mut rng: Box<dyn RngCore>,
) -> Vec<PointAffineNoInfinity> {
(0..count)
.map(|_| Point::from_ark(G1Projective_BLS12_377::rand(&mut rng)).to_xy_strip_z())
.collect()
}
pub fn generate_random_points_proj(count: usize, mut rng: Box<dyn RngCore>) -> Vec<Point> {
(0..count)
.map(|_| Point::from_ark(G1Projective_BLS12_377::rand(&mut rng)))
.collect()
}
pub fn generate_random_scalars(count: usize, mut rng: Box<dyn RngCore>) -> Vec<Scalar> {
(0..count)
.map(|_| Scalar::from_ark(Fr_BLS12_377::rand(&mut rng).into_repr()))
.collect()
}
/// Initializes the device, builds a `1 << log_domain_size` twiddle domain and
/// generates `test_size` random projective points, returning the host vector,
/// its device copy, and the device-side domain.
pub fn set_up_points(test_size: usize, log_domain_size: usize, inverse: bool) -> (Vec<Point>, DeviceBuffer<Point>, DeviceBuffer<Scalar>) {
    set_up_device();
    let d_domain = build_domain(1 << log_domain_size, log_domain_size, inverse);
    let seed = Some(0); // fix the rng so repeated calls yield identical points
    let vector = generate_random_points_proj(test_size, get_rng(seed));
    // Fix: the previous version cloned the entire vector and carried two
    // needless `mut` bindings; the host vector can be returned directly.
    let d_vector = DeviceBuffer::from_slice(&vector[..]).unwrap();
    (vector, d_vector, d_domain)
}
/// Initializes the device, builds a `1 << log_domain_size` twiddle domain and
/// generates `test_size` random scalars, returning the host vector, its
/// device copy, and the device-side domain.
pub fn set_up_scalars(test_size: usize, log_domain_size: usize, inverse: bool) -> (Vec<Scalar>, DeviceBuffer<Scalar>, DeviceBuffer<Scalar>) {
    set_up_device();
    let d_domain = build_domain(1 << log_domain_size, log_domain_size, inverse);
    let seed = Some(0); // fix the rng so repeated calls yield identical scalars
    // Fix: dropped the needless `mut` bindings flagged by the compiler.
    let vector = generate_random_scalars(test_size, get_rng(seed));
    let d_vector = DeviceBuffer::from_slice(&vector[..]).unwrap();
    (vector, d_vector, d_domain)
}

4
bls12-377/src/lib.rs Normal file
View File

@@ -0,0 +1,4 @@
pub mod test_bls12_377;
pub mod basic_structs;
pub mod from_cuda;
pub mod curve_structs;

View File

@@ -0,0 +1,816 @@
use std::ffi::{c_int, c_uint};
use ark_std::UniformRand;
use rand::{rngs::StdRng, RngCore, SeedableRng};
use rustacuda::CudaFlags;
use rustacuda::memory::DeviceBox;
use rustacuda::prelude::{DeviceBuffer, Device, ContextFlags, Context};
use rustacuda_core::DevicePointer;
use std::mem::transmute;
pub use crate::basic_structs::scalar::ScalarTrait;
pub use crate::curve_structs::*;
use icicle_core::utils::{u32_vec_to_u64_vec, u64_vec_to_u32_vec};
use std::marker::PhantomData;
use std::convert::TryInto;
use ark_bls12_377::{Fq as Fq_BLS12_377, Fr as Fr_BLS12_377, G1Affine as G1Affine_BLS12_377, G1Projective as G1Projective_BLS12_377};
use ark_ec::AffineCurve;
use ark_ff::{BigInteger384, BigInteger256, PrimeField};
use rustacuda::memory::{CopyDestination, DeviceCopy};
impl Scalar {
    /// Repacks the 8 u32 limbs into an ark `BigInteger256` (no reduction).
    ///
    /// NOTE(review): the name says 254 but the type is BigInteger256 — likely
    /// a leftover from another curve; consider renaming.
    pub fn to_biginteger254(&self) -> BigInteger256 {
        BigInteger256::new(u32_vec_to_u64_vec(&self.limbs()).try_into().unwrap())
    }
    /// Same limb repack as `to_biginteger254`, under the conventional name.
    pub fn to_ark(&self) -> BigInteger256 {
        BigInteger256::new(u32_vec_to_u64_vec(&self.limbs()).try_into().unwrap())
    }
    /// Inverse of `to_ark`: splits u64 limbs back into u32 limbs.
    pub fn from_biginteger256(ark: BigInteger256) -> Self {
        Self{ value: u64_vec_to_u32_vec(&ark.0).try_into().unwrap(), phantom : PhantomData}
    }
    /// Bit-level reinterpretation as `BigInteger256`.
    // SAFETY assumption: Scalar and BigInteger256 have identical size and
    // layout (256 bits, little-endian limbs) — TODO confirm.
    pub fn to_biginteger256_transmute(&self) -> BigInteger256 {
        unsafe { transmute(*self) }
    }
    /// Bit-level reinterpretation from `BigInteger256` (see layout note above).
    pub fn from_biginteger_transmute(v: BigInteger256) -> Scalar {
        Scalar{ value: unsafe{ transmute(v)}, phantom : PhantomData }
    }
    /// Bit-level reinterpretation as an ark field element.
    // NOTE(review): this treats the raw limbs as Fr's internal (Montgomery?)
    // representation without conversion — verify that is intended.
    pub fn to_ark_transmute(&self) -> Fr_BLS12_377 {
        unsafe { std::mem::transmute(*self) }
    }
    /// Bit-level reinterpretation from an ark field element (see note above).
    pub fn from_ark_transmute(v: &Fr_BLS12_377) -> Scalar {
        unsafe { std::mem::transmute_copy(v) }
    }
    /// Interprets the limbs as an integer and reduces it mod p via ark's
    /// `Fr::new` constructor path.
    pub fn to_ark_mod_p(&self) -> Fr_BLS12_377 {
        Fr_BLS12_377::new(BigInteger256::new(u32_vec_to_u64_vec(&self.limbs()).try_into().unwrap()))
    }
    /// Interprets the limbs as a canonical representative; panics (unwrap) if
    /// the value is not a valid field-element repr.
    pub fn to_ark_repr(&self) -> Fr_BLS12_377 {
        Fr_BLS12_377::from_repr(BigInteger256::new(u32_vec_to_u64_vec(&self.limbs()).try_into().unwrap())).unwrap()
    }
    /// Builds a Scalar from an ark big integer by splitting its u64 limbs.
    pub fn from_ark(v: BigInteger256) -> Scalar {
        Self { value : u64_vec_to_u32_vec(&v.0).try_into().unwrap(), phantom: PhantomData}
    }
}
impl Base {
    /// Repacks the 12 u32 limbs into an ark `BigInteger384` (no reduction).
    pub fn to_ark(&self) -> BigInteger384 {
        BigInteger384::new(u32_vec_to_u64_vec(&self.limbs()).try_into().unwrap())
    }
    /// Inverse of `to_ark`: splits the u64 limbs back into u32 limbs.
    pub fn from_ark(ark: BigInteger384) -> Self {
        Self::from_limbs(&u64_vec_to_u32_vec(&ark.0))
    }
}
impl Point {
    /// Converts to an ark projective point by way of the affine form.
    pub fn to_ark(&self) -> G1Projective_BLS12_377 {
        self.to_ark_affine().into_projective()
    }
    /// Normalizes this (x : y : z) point to affine coordinates by multiplying
    /// x and y with z⁻¹.
    ///
    /// Panics (via `unwrap`) when z == 0, i.e. the point at infinity.
    pub fn to_ark_affine(&self) -> G1Affine_BLS12_377 {
        //TODO: generic conversion
        use ark_ff::Field;
        use std::ops::Mul;
        let proj_x_field = Fq_BLS12_377::from_le_bytes_mod_order(&self.x.to_bytes_le());
        let proj_y_field = Fq_BLS12_377::from_le_bytes_mod_order(&self.y.to_bytes_le());
        let proj_z_field = Fq_BLS12_377::from_le_bytes_mod_order(&self.z.to_bytes_le());
        let inverse_z = proj_z_field.inverse().unwrap();
        let aff_x = proj_x_field.mul(inverse_z);
        let aff_y = proj_y_field.mul(inverse_z);
        G1Affine_BLS12_377::new(aff_x, aff_y, false)
    }
    /// Builds an icicle point (normalized to z = 1) from an ark point.
    ///
    /// Panics (via `unwrap`) when ark.z == 0.
    // NOTE(review): the z⁻² / z⁻³ factors correspond to ark 0.3's Jacobian
    // representation of G1Projective, whereas `to_ark_affine` divides by z
    // once (standard projective) — presumably intentional because the two
    // sides use different coordinate systems; worth confirming.
    pub fn from_ark(ark: G1Projective_BLS12_377) -> Point {
        use ark_ff::Field;
        let z_inv = ark.z.inverse().unwrap();
        let z_invsq = z_inv * z_inv;
        let z_invq3 = z_invsq * z_inv;
        Point {
            x: Base::from_ark((ark.x * z_invsq).into_repr()),
            y: Base::from_ark((ark.y * z_invq3).into_repr()),
            z: Base::one(),
        }
    }
}
impl PointAffineNoInfinity {
    /// Builds an ark affine point, feeding the raw limbs through `Fq::new`.
    // NOTE(review): `Fq::new` vs `from_repr` below — one of these treats the
    // limbs as ark's internal representation, the other as a canonical
    // integer; confirm which matches the device-side encoding.
    pub fn to_ark(&self) -> G1Affine_BLS12_377 {
        G1Affine_BLS12_377::new(Fq_BLS12_377::new(self.x.to_ark()), Fq_BLS12_377::new(self.y.to_ark()), false)
    }
    /// Builds an ark affine point, interpreting the limbs as canonical
    /// representatives; panics (unwrap) on an invalid representative.
    pub fn to_ark_repr(&self) -> G1Affine_BLS12_377 {
        G1Affine_BLS12_377::new(
            Fq_BLS12_377::from_repr(self.x.to_ark()).unwrap(),
            Fq_BLS12_377::from_repr(self.y.to_ark()).unwrap(),
            false,
        )
    }
    /// Converts an ark affine point into the icicle affine representation.
    pub fn from_ark(p: &G1Affine_BLS12_377) -> Self {
        PointAffineNoInfinity {
            x: Base::from_ark(p.x.into_repr()),
            y: Base::from_ark(p.y.into_repr()),
        }
    }
}
impl Point {
    /// Normalizes to affine coordinates by round-tripping through ark
    /// (`to_ark_affine`); panics when z == 0 (point at infinity).
    pub fn to_affine(&self) -> PointAffineNoInfinity {
        let ark_affine = self.to_ark_affine();
        PointAffineNoInfinity {
            x: Base::from_ark(ark_affine.x.into_repr()),
            y: Base::from_ark(ark_affine.y.into_repr()),
        }
    }
}
#[cfg(test)]
// End-to-end tests for the BLS12-377 CUDA bindings: GPU results (MSM, NTT/ECNTT,
// commit, interpolation, evaluation, coset evaluation, vector ops) are checked
// against arkworks reference computations or round-trip identities.
// NOTE(review): these tests require a CUDA-capable device at runtime.
pub(crate) mod tests_bls12_377 {
use std::ops::Add;
use ark_bls12_377::{Fr, G1Affine, G1Projective};
use ark_ec::{msm::VariableBaseMSM, AffineCurve, ProjectiveCurve};
use ark_ff::{FftField, Field, Zero, PrimeField};
use ark_std::UniformRand;
use rustacuda::prelude::{DeviceBuffer, CopyDestination};
use crate::curve_structs::{Point, Scalar, Base};
use crate::basic_structs::scalar::ScalarTrait;
use crate::from_cuda::{generate_random_points, get_rng, generate_random_scalars, msm, msm_batch, set_up_scalars, commit, commit_batch, ntt, intt, generate_random_points_proj, ecntt, iecntt, ntt_batch, ecntt_batch, iecntt_batch, intt_batch, reverse_order_scalars_batch, interpolate_scalars_batch, set_up_points, reverse_order_points, interpolate_points, reverse_order_points_batch, interpolate_points_batch, evaluate_scalars, interpolate_scalars, reverse_order_scalars, evaluate_points, build_domain, evaluate_scalars_on_coset, evaluate_points_on_coset, mult_matrix_by_vec, mult_sc_vec, multp_vec,evaluate_scalars_batch, evaluate_points_batch, evaluate_scalars_on_coset_batch, evaluate_points_on_coset_batch};
// Samples `nof_elements` uniformly random arkworks projective G1 points.
fn random_points_ark_proj(nof_elements: usize) -> Vec<G1Projective> {
let mut rng = ark_std::rand::thread_rng();
let mut points_ga: Vec<G1Projective> = Vec::new();
for _ in 0..nof_elements {
let aff = G1Projective::rand(&mut rng);
points_ga.push(aff);
}
points_ga
}
// Naive O(n^2) reference (EC)NTT over arkworks points; `inverse` selects the
// inverse root of unity and additionally scales every output by 1/size.
fn ecntt_arc_naive(
points: &Vec<G1Projective>,
size: usize,
inverse: bool,
) -> Vec<G1Projective> {
let mut result: Vec<G1Projective> = Vec::new();
for _ in 0..size {
result.push(G1Projective::zero());
}
// Order-`size` root of unity, or its inverse for the inverse transform.
let rou: Fr;
if !inverse {
rou = Fr::get_root_of_unity(size).unwrap();
} else {
rou = Fr::inverse(&Fr::get_root_of_unity(size).unwrap()).unwrap();
}
// result[k] = sum_l rou^(l*k) * points[l] — textbook DFT matrix product.
for k in 0..size {
for l in 0..size {
let pow: [u64; 1] = [(l * k).try_into().unwrap()];
let mul_rou = Fr::pow(&rou, &pow);
result[k] = result[k].add(points[l].into_affine().mul(mul_rou));
}
}
// The inverse transform divides every entry by `size`.
if inverse {
let size2 = size as u64;
for k in 0..size {
let multfactor = Fr::inverse(&Fr::from(size2)).unwrap();
result[k] = result[k].into_affine().mul(multfactor);
}
}
return result;
}
// Element-wise equality of two point vectors (compares up to `points.len()`).
fn check_eq(points: &Vec<G1Projective>, points2: &Vec<G1Projective>) -> bool {
let mut eq = true;
for i in 0..points.len() {
if points2[i].ne(&points[i]) {
eq = false;
break;
}
}
return eq;
}
// Sanity check for the naive reference: the inverse transform undoes the forward
// transform, and the round-trip result differs from the forward output itself.
fn test_naive_ark_ecntt(size: usize) {
let points = random_points_ark_proj(size);
let result1: Vec<G1Projective> = ecntt_arc_naive(&points, size, false);
let result2: Vec<G1Projective> = ecntt_arc_naive(&result1, size, true);
assert!(!check_eq(&result2, &result1));
assert!(check_eq(&result2, &points));
}
// GPU MSM must match arkworks' VariableBaseMSM for several input sizes.
#[test]
fn test_msm() {
let test_sizes = [6, 9];
for pow2 in test_sizes {
let count = 1 << pow2;
let seed = None; // set Some to provide seed
let points = generate_random_points(count, get_rng(seed));
let scalars = generate_random_scalars(count, get_rng(seed));
let msm_result = msm(&points, &scalars, 0);
let point_r_ark: Vec<_> = points.iter().map(|x| x.to_ark_repr()).collect();
let scalars_r_ark: Vec<_> = scalars.iter().map(|x| x.to_ark()).collect();
let msm_result_ark = VariableBaseMSM::multi_scalar_mul(&point_r_ark, &scalars_r_ark);
assert_eq!(msm_result.to_ark_affine(), msm_result_ark);
assert_eq!(msm_result.to_ark(), msm_result_ark);
assert_eq!(
msm_result.to_ark_affine(),
Point::from_ark(msm_result_ark).to_ark_affine()
);
}
}
// Batched GPU MSM must agree with per-chunk arkworks MSMs.
#[test]
fn test_batch_msm() {
for batch_pow2 in [2, 4] {
for pow2 in [4, 6] {
let msm_size = 1 << pow2;
let batch_size = 1 << batch_pow2;
let seed = None; // set Some to provide seed
let points_batch = generate_random_points(msm_size * batch_size, get_rng(seed));
let scalars_batch = generate_random_scalars(msm_size * batch_size, get_rng(seed));
let point_r_ark: Vec<_> = points_batch.iter().map(|x| x.to_ark_repr()).collect();
let scalars_r_ark: Vec<_> = scalars_batch.iter().map(|x| x.to_ark()).collect();
let expected: Vec<_> = point_r_ark
.chunks(msm_size)
.zip(scalars_r_ark.chunks(msm_size))
.map(|p| Point::from_ark(VariableBaseMSM::multi_scalar_mul(p.0, p.1)))
.collect();
let result = msm_batch(&points_batch, &scalars_batch, batch_size, 0);
assert_eq!(result, expected);
}
}
}
// Device-side commit (MSM on device buffers) must equal host-side msm().
#[test]
fn test_commit() {
let test_size = 1 << 8;
let seed = Some(0);
let (mut scalars, mut d_scalars, _) = set_up_scalars(test_size, 0, false);
let mut points = generate_random_points(test_size, get_rng(seed));
let mut d_points = DeviceBuffer::from_slice(&points[..]).unwrap();
let msm_result = msm(&points, &scalars, 0);
let mut d_commit_result = commit(&mut d_points, &mut d_scalars);
let mut h_commit_result = Point::zero();
d_commit_result.copy_to(&mut h_commit_result).unwrap();
assert_eq!(msm_result, h_commit_result);
assert_ne!(msm_result, Point::zero());
assert_ne!(h_commit_result, Point::zero());
}
// Batched device-side commit must equal host-side msm_batch().
#[test]
fn test_batch_commit() {
let batch_size = 4;
let test_size = 1 << 12;
let seed = Some(0);
let (scalars, mut d_scalars, _) = set_up_scalars(test_size * batch_size, 0, false);
let points = generate_random_points(test_size * batch_size, get_rng(seed));
let mut d_points = DeviceBuffer::from_slice(&points[..]).unwrap();
let msm_result = msm_batch(&points, &scalars, batch_size, 0);
let mut d_commit_result = commit_batch(&mut d_points, &mut d_scalars, batch_size);
let mut h_commit_result: Vec<Point> = (0..batch_size).map(|_| Point::zero()).collect();
d_commit_result.copy_to(&mut h_commit_result[..]).unwrap();
assert_eq!(msm_result, h_commit_result);
for h in h_commit_result {
assert_ne!(h, Point::zero());
}
}
// NTT/INTT and ECNTT/IECNTT round-trips, cross-checked against the naive reference.
#[test]
fn test_ntt() {
//NTT
let seed = None; //some value to fix the rng
let test_size = 1 << 3;
let scalars = generate_random_scalars(test_size, get_rng(seed));
let mut ntt_result = scalars.clone();
ntt(&mut ntt_result, 0);
assert_ne!(ntt_result, scalars);
let mut intt_result = ntt_result.clone();
intt(&mut intt_result, 0);
assert_eq!(intt_result, scalars);
//ECNTT
let points_proj = generate_random_points_proj(test_size, get_rng(seed));
test_naive_ark_ecntt(test_size);
assert!(points_proj[0].to_ark().into_affine().is_on_curve());
//naive ark
let points_proj_ark = points_proj
.iter()
.map(|p| p.to_ark())
.collect::<Vec<G1Projective>>();
let ecntt_result_naive = ecntt_arc_naive(&points_proj_ark, points_proj_ark.len(), false);
let iecntt_result_naive = ecntt_arc_naive(&ecntt_result_naive, points_proj_ark.len(), true);
assert_eq!(points_proj_ark, iecntt_result_naive);
//ingo gpu
let mut ecntt_result = points_proj.to_vec();
ecntt(&mut ecntt_result, 0);
assert_ne!(ecntt_result, points_proj);
let mut iecntt_result = ecntt_result.clone();
iecntt(&mut iecntt_result, 0);
assert_eq!(
iecntt_result_naive,
points_proj
.iter()
.map(|p| p.to_ark_affine())
.collect::<Vec<G1Affine>>()
);
assert_eq!(
iecntt_result
.iter()
.map(|p| p.to_ark_affine())
.collect::<Vec<G1Affine>>(),
points_proj
.iter()
.map(|p| p.to_ark_affine())
.collect::<Vec<G1Affine>>()
);
}
// Batched (EC)NTT must equal chunk-by-chunk single (EC)NTTs and round-trip.
#[test]
fn test_ntt_batch() {
//NTT
let seed = None; //some value to fix the rng
let test_size = 1 << 5;
let batches = 4;
let scalars_batch: Vec<Scalar> =
generate_random_scalars(test_size * batches, get_rng(seed));
let mut scalar_vec_of_vec: Vec<Vec<Scalar>> = Vec::new();
for i in 0..batches {
scalar_vec_of_vec.push(scalars_batch[i * test_size..(i + 1) * test_size].to_vec());
}
let mut ntt_result = scalars_batch.clone();
// do batch ntt
ntt_batch(&mut ntt_result, test_size, 0);
let mut ntt_result_vec_of_vec = Vec::new();
// do ntt for every chunk
for i in 0..batches {
ntt_result_vec_of_vec.push(scalar_vec_of_vec[i].clone());
ntt(&mut ntt_result_vec_of_vec[i], 0);
}
// check that the ntt of each vec of scalars is equal to the intt of the specific batch
for i in 0..batches {
assert_eq!(
ntt_result_vec_of_vec[i],
ntt_result[i * test_size..(i + 1) * test_size]
);
}
// check that ntt output is different from input
assert_ne!(ntt_result, scalars_batch);
let mut intt_result = ntt_result.clone();
// do batch intt
intt_batch(&mut intt_result, test_size, 0);
let mut intt_result_vec_of_vec = Vec::new();
// do intt for every chunk
for i in 0..batches {
intt_result_vec_of_vec.push(ntt_result_vec_of_vec[i].clone());
intt(&mut intt_result_vec_of_vec[i], 0);
}
// check that the intt of each vec of scalars is equal to the intt of the specific batch
for i in 0..batches {
assert_eq!(
intt_result_vec_of_vec[i],
intt_result[i * test_size..(i + 1) * test_size]
);
}
assert_eq!(intt_result, scalars_batch);
// //ECNTT
let points_proj = generate_random_points_proj(test_size * batches, get_rng(seed));
let mut points_vec_of_vec: Vec<Vec<Point>> = Vec::new();
for i in 0..batches {
points_vec_of_vec.push(points_proj[i * test_size..(i + 1) * test_size].to_vec());
}
let mut ntt_result_points = points_proj.clone();
// do batch ecintt
ecntt_batch(&mut ntt_result_points, test_size, 0);
let mut ntt_result_points_vec_of_vec = Vec::new();
for i in 0..batches {
ntt_result_points_vec_of_vec.push(points_vec_of_vec[i].clone());
ecntt(&mut ntt_result_points_vec_of_vec[i], 0);
}
for i in 0..batches {
assert_eq!(
ntt_result_points_vec_of_vec[i],
ntt_result_points[i * test_size..(i + 1) * test_size]
);
}
assert_ne!(ntt_result_points, points_proj);
let mut intt_result_points = ntt_result_points.clone();
// do batch ecintt
iecntt_batch(&mut intt_result_points, test_size, 0);
let mut intt_result_points_vec_of_vec = Vec::new();
// do ecintt for every chunk
for i in 0..batches {
intt_result_points_vec_of_vec.push(ntt_result_points_vec_of_vec[i].clone());
iecntt(&mut intt_result_points_vec_of_vec[i], 0);
}
// check that the ecintt of each vec of scalars is equal to the intt of the specific batch
for i in 0..batches {
assert_eq!(
intt_result_points_vec_of_vec[i],
intt_result_points[i * test_size..(i + 1) * test_size]
);
}
assert_eq!(intt_result_points, points_proj);
}
// Device-side scalar interpolation must match host-side INTT.
#[test]
fn test_scalar_interpolation() {
let log_test_size = 7;
let test_size = 1 << log_test_size;
let (mut evals_mut, mut d_evals, mut d_domain) = set_up_scalars(test_size, log_test_size, true);
reverse_order_scalars(&mut d_evals);
let mut d_coeffs = interpolate_scalars(&mut d_evals, &mut d_domain);
intt(&mut evals_mut, 0);
let mut h_coeffs: Vec<Scalar> = (0..test_size).map(|_| Scalar::zero()).collect();
d_coeffs.copy_to(&mut h_coeffs[..]).unwrap();
assert_eq!(h_coeffs, evals_mut);
}
// Batched device-side scalar interpolation must match host-side batched INTT.
#[test]
fn test_scalar_batch_interpolation() {
let batch_size = 4;
let log_test_size = 10;
let test_size = 1 << log_test_size;
let (mut evals_mut, mut d_evals, mut d_domain) = set_up_scalars(test_size * batch_size, log_test_size, true);
reverse_order_scalars_batch(&mut d_evals, batch_size);
let mut d_coeffs = interpolate_scalars_batch(&mut d_evals, &mut d_domain, batch_size);
intt_batch(&mut evals_mut, test_size, 0);
let mut h_coeffs: Vec<Scalar> = (0..test_size * batch_size).map(|_| Scalar::zero()).collect();
d_coeffs.copy_to(&mut h_coeffs[..]).unwrap();
assert_eq!(h_coeffs, evals_mut);
}
// Device-side point interpolation must match host-side IECNTT.
#[test]
fn test_point_interpolation() {
let log_test_size = 6;
let test_size = 1 << log_test_size;
let (mut evals_mut, mut d_evals, mut d_domain) = set_up_points(test_size, log_test_size, true);
reverse_order_points(&mut d_evals);
let mut d_coeffs = interpolate_points(&mut d_evals, &mut d_domain);
iecntt(&mut evals_mut[..], 0);
let mut h_coeffs: Vec<Point> = (0..test_size).map(|_| Point::zero()).collect();
d_coeffs.copy_to(&mut h_coeffs[..]).unwrap();
assert_eq!(h_coeffs, *evals_mut);
for h in h_coeffs.iter() {
assert_ne!(*h, Point::zero());
}
}
// Batched device-side point interpolation must match host-side batched IECNTT.
#[test]
fn test_point_batch_interpolation() {
let batch_size = 4;
let log_test_size = 6;
let test_size = 1 << log_test_size;
let (mut evals_mut, mut d_evals, mut d_domain) = set_up_points(test_size * batch_size, log_test_size, true);
reverse_order_points_batch(&mut d_evals, batch_size);
let mut d_coeffs = interpolate_points_batch(&mut d_evals, &mut d_domain, batch_size);
iecntt_batch(&mut evals_mut[..], test_size, 0);
let mut h_coeffs: Vec<Point> = (0..test_size * batch_size).map(|_| Point::zero()).collect();
d_coeffs.copy_to(&mut h_coeffs[..]).unwrap();
assert_eq!(h_coeffs, *evals_mut);
for h in h_coeffs.iter() {
assert_ne!(*h, Point::zero());
}
}
// Evaluate-then-interpolate must recover the coefficients (zero-padded to domain size).
#[test]
fn test_scalar_evaluation() {
let log_test_domain_size = 8;
let coeff_size = 1 << 6;
let (h_coeffs, mut d_coeffs, mut d_domain) = set_up_scalars(coeff_size, log_test_domain_size, false);
let (_, _, mut d_domain_inv) = set_up_scalars(0, log_test_domain_size, true);
let mut d_evals = evaluate_scalars(&mut d_coeffs, &mut d_domain);
let mut d_coeffs_domain = interpolate_scalars(&mut d_evals, &mut d_domain_inv);
let mut h_coeffs_domain: Vec<Scalar> = (0..1 << log_test_domain_size).map(|_| Scalar::zero()).collect();
d_coeffs_domain.copy_to(&mut h_coeffs_domain[..]).unwrap();
assert_eq!(h_coeffs, h_coeffs_domain[..coeff_size]);
for i in coeff_size.. (1 << log_test_domain_size) {
assert_eq!(Scalar::zero(), h_coeffs_domain[i]);
}
}
// Batched variant of the evaluate-then-interpolate round-trip for scalars.
#[test]
fn test_scalar_batch_evaluation() {
let batch_size = 6;
let log_test_domain_size = 8;
let domain_size = 1 << log_test_domain_size;
let coeff_size = 1 << 6;
let (h_coeffs, mut d_coeffs, mut d_domain) = set_up_scalars(coeff_size * batch_size, log_test_domain_size, false);
let (_, _, mut d_domain_inv) = set_up_scalars(0, log_test_domain_size, true);
let mut d_evals = evaluate_scalars_batch(&mut d_coeffs, &mut d_domain, batch_size);
let mut d_coeffs_domain = interpolate_scalars_batch(&mut d_evals, &mut d_domain_inv, batch_size);
let mut h_coeffs_domain: Vec<Scalar> = (0..domain_size * batch_size).map(|_| Scalar::zero()).collect();
d_coeffs_domain.copy_to(&mut h_coeffs_domain[..]).unwrap();
for j in 0..batch_size {
assert_eq!(h_coeffs[j * coeff_size..(j + 1) * coeff_size], h_coeffs_domain[j * domain_size..j * domain_size + coeff_size]);
for i in coeff_size..domain_size {
assert_eq!(Scalar::zero(), h_coeffs_domain[j * domain_size + i]);
}
}
}
// Evaluate-then-interpolate round-trip for EC points.
#[test]
fn test_point_evaluation() {
let log_test_domain_size = 7;
let coeff_size = 1 << 7;
let (h_coeffs, mut d_coeffs, mut d_domain) = set_up_points(coeff_size, log_test_domain_size, false);
let (_, _, mut d_domain_inv) = set_up_points(0, log_test_domain_size, true);
let mut d_evals = evaluate_points(&mut d_coeffs, &mut d_domain);
let mut d_coeffs_domain = interpolate_points(&mut d_evals, &mut d_domain_inv);
let mut h_coeffs_domain: Vec<Point> = (0..1 << log_test_domain_size).map(|_| Point::zero()).collect();
d_coeffs_domain.copy_to(&mut h_coeffs_domain[..]).unwrap();
assert_eq!(h_coeffs[..], h_coeffs_domain[..coeff_size]);
for i in coeff_size..(1 << log_test_domain_size) {
assert_eq!(Point::zero(), h_coeffs_domain[i]);
}
for i in 0..coeff_size {
assert_ne!(h_coeffs_domain[i], Point::zero());
}
}
// Batched variant of the point evaluate-then-interpolate round-trip.
#[test]
fn test_point_batch_evaluation() {
let batch_size = 4;
let log_test_domain_size = 6;
let domain_size = 1 << log_test_domain_size;
let coeff_size = 1 << 5;
let (h_coeffs, mut d_coeffs, mut d_domain) = set_up_points(coeff_size * batch_size, log_test_domain_size, false);
let (_, _, mut d_domain_inv) = set_up_points(0, log_test_domain_size, true);
let mut d_evals = evaluate_points_batch(&mut d_coeffs, &mut d_domain, batch_size);
let mut d_coeffs_domain = interpolate_points_batch(&mut d_evals, &mut d_domain_inv, batch_size);
let mut h_coeffs_domain: Vec<Point> = (0..domain_size * batch_size).map(|_| Point::zero()).collect();
d_coeffs_domain.copy_to(&mut h_coeffs_domain[..]).unwrap();
for j in 0..batch_size {
assert_eq!(h_coeffs[j * coeff_size..(j + 1) * coeff_size], h_coeffs_domain[j * domain_size..(j * domain_size + coeff_size)]);
for i in coeff_size..domain_size {
assert_eq!(Point::zero(), h_coeffs_domain[j * domain_size + i]);
}
for i in j * domain_size..(j * domain_size + coeff_size) {
assert_ne!(h_coeffs_domain[i], Point::zero());
}
}
}
#[test]
fn test_scalar_evaluation_on_trivial_coset() {
// checks that the evaluations on the subgroup is the same as on the coset generated by 1
let log_test_domain_size = 8;
let coeff_size = 1 << 6;
let (_, mut d_coeffs, mut d_domain) = set_up_scalars(coeff_size, log_test_domain_size, false);
let (_, _, mut d_domain_inv) = set_up_scalars(coeff_size, log_test_domain_size, true);
let mut d_trivial_coset_powers = build_domain(1 << log_test_domain_size, 0, false);
let mut d_evals = evaluate_scalars(&mut d_coeffs, &mut d_domain);
let mut h_coeffs: Vec<Scalar> = (0..1 << log_test_domain_size).map(|_| Scalar::zero()).collect();
d_evals.copy_to(&mut h_coeffs[..]).unwrap();
let mut d_evals_coset = evaluate_scalars_on_coset(&mut d_coeffs, &mut d_domain, &mut d_trivial_coset_powers);
let mut h_evals_coset: Vec<Scalar> = (0..1 << log_test_domain_size).map(|_| Scalar::zero()).collect();
d_evals_coset.copy_to(&mut h_evals_coset[..]).unwrap();
assert_eq!(h_coeffs, h_evals_coset);
}
#[test]
fn test_scalar_evaluation_on_coset() {
// checks that evaluating a polynomial on a subgroup and its coset is the same as evaluating on a 2x larger subgroup
let log_test_size = 8;
let test_size = 1 << log_test_size;
let (_, mut d_coeffs, mut d_domain) = set_up_scalars(test_size, log_test_size, false);
let (_, _, mut d_large_domain) = set_up_scalars(0, log_test_size + 1, false);
let mut d_coset_powers = build_domain(test_size, log_test_size + 1, false);
let mut d_evals_large = evaluate_scalars(&mut d_coeffs, &mut d_large_domain);
let mut h_evals_large: Vec<Scalar> = (0..2 * test_size).map(|_| Scalar::zero()).collect();
d_evals_large.copy_to(&mut h_evals_large[..]).unwrap();
let mut d_evals = evaluate_scalars(&mut d_coeffs, &mut d_domain);
let mut h_evals: Vec<Scalar> = (0..test_size).map(|_| Scalar::zero()).collect();
d_evals.copy_to(&mut h_evals[..]).unwrap();
let mut d_evals_coset = evaluate_scalars_on_coset(&mut d_coeffs, &mut d_domain, &mut d_coset_powers);
let mut h_evals_coset: Vec<Scalar> = (0..test_size).map(|_| Scalar::zero()).collect();
d_evals_coset.copy_to(&mut h_evals_coset[..]).unwrap();
assert_eq!(h_evals[..], h_evals_large[..test_size]);
assert_eq!(h_evals_coset[..], h_evals_large[test_size..2 * test_size]);
}
#[test]
fn test_scalar_batch_evaluation_on_coset() {
// checks that evaluating a polynomial on a subgroup and its coset is the same as evaluating on a 2x larger subgroup
let batch_size = 4;
let log_test_size = 6;
let test_size = 1 << log_test_size;
let (_, mut d_coeffs, mut d_domain) = set_up_scalars(test_size * batch_size, log_test_size, false);
let (_, _, mut d_large_domain) = set_up_scalars(0, log_test_size + 1, false);
let mut d_coset_powers = build_domain(test_size, log_test_size + 1, false);
let mut d_evals_large = evaluate_scalars_batch(&mut d_coeffs, &mut d_large_domain, batch_size);
let mut h_evals_large: Vec<Scalar> = (0..2 * test_size * batch_size).map(|_| Scalar::zero()).collect();
d_evals_large.copy_to(&mut h_evals_large[..]).unwrap();
let mut d_evals = evaluate_scalars_batch(&mut d_coeffs, &mut d_domain, batch_size);
let mut h_evals: Vec<Scalar> = (0..test_size * batch_size).map(|_| Scalar::zero()).collect();
d_evals.copy_to(&mut h_evals[..]).unwrap();
let mut d_evals_coset = evaluate_scalars_on_coset_batch(&mut d_coeffs, &mut d_domain, batch_size, &mut d_coset_powers);
let mut h_evals_coset: Vec<Scalar> = (0..test_size * batch_size).map(|_| Scalar::zero()).collect();
d_evals_coset.copy_to(&mut h_evals_coset[..]).unwrap();
for i in 0..batch_size {
assert_eq!(h_evals_large[2 * i * test_size..(2 * i + 1) * test_size], h_evals[i * test_size..(i + 1) * test_size]);
assert_eq!(h_evals_large[(2 * i + 1) * test_size..(2 * i + 2) * test_size], h_evals_coset[i * test_size..(i + 1) * test_size]);
}
}
#[test]
fn test_point_evaluation_on_coset() {
// checks that evaluating a polynomial on a subgroup and its coset is the same as evaluating on a 2x larger subgroup
let log_test_size = 8;
let test_size = 1 << log_test_size;
let (_, mut d_coeffs, mut d_domain) = set_up_points(test_size, log_test_size, false);
let (_, _, mut d_large_domain) = set_up_points(0, log_test_size + 1, false);
let mut d_coset_powers = build_domain(test_size, log_test_size + 1, false);
let mut d_evals_large = evaluate_points(&mut d_coeffs, &mut d_large_domain);
let mut h_evals_large: Vec<Point> = (0..2 * test_size).map(|_| Point::zero()).collect();
d_evals_large.copy_to(&mut h_evals_large[..]).unwrap();
let mut d_evals = evaluate_points(&mut d_coeffs, &mut d_domain);
let mut h_evals: Vec<Point> = (0..test_size).map(|_| Point::zero()).collect();
d_evals.copy_to(&mut h_evals[..]).unwrap();
let mut d_evals_coset = evaluate_points_on_coset(&mut d_coeffs, &mut d_domain, &mut d_coset_powers);
let mut h_evals_coset: Vec<Point> = (0..test_size).map(|_| Point::zero()).collect();
d_evals_coset.copy_to(&mut h_evals_coset[..]).unwrap();
assert_eq!(h_evals[..], h_evals_large[..test_size]);
assert_eq!(h_evals_coset[..], h_evals_large[test_size..2 * test_size]);
for i in 0..test_size {
assert_ne!(h_evals[i], Point::zero());
assert_ne!(h_evals_coset[i], Point::zero());
assert_ne!(h_evals_large[2 * i], Point::zero());
assert_ne!(h_evals_large[2 * i + 1], Point::zero());
}
}
#[test]
fn test_point_batch_evaluation_on_coset() {
// checks that evaluating a polynomial on a subgroup and its coset is the same as evaluating on a 2x larger subgroup
let batch_size = 2;
let log_test_size = 6;
let test_size = 1 << log_test_size;
let (_, mut d_coeffs, mut d_domain) = set_up_points(test_size * batch_size, log_test_size, false);
let (_, _, mut d_large_domain) = set_up_points(0, log_test_size + 1, false);
let mut d_coset_powers = build_domain(test_size, log_test_size + 1, false);
let mut d_evals_large = evaluate_points_batch(&mut d_coeffs, &mut d_large_domain, batch_size);
let mut h_evals_large: Vec<Point> = (0..2 * test_size * batch_size).map(|_| Point::zero()).collect();
d_evals_large.copy_to(&mut h_evals_large[..]).unwrap();
let mut d_evals = evaluate_points_batch(&mut d_coeffs, &mut d_domain, batch_size);
let mut h_evals: Vec<Point> = (0..test_size * batch_size).map(|_| Point::zero()).collect();
d_evals.copy_to(&mut h_evals[..]).unwrap();
let mut d_evals_coset = evaluate_points_on_coset_batch(&mut d_coeffs, &mut d_domain, batch_size, &mut d_coset_powers);
let mut h_evals_coset: Vec<Point> = (0..test_size * batch_size).map(|_| Point::zero()).collect();
d_evals_coset.copy_to(&mut h_evals_coset[..]).unwrap();
for i in 0..batch_size {
assert_eq!(h_evals_large[2 * i * test_size..(2 * i + 1) * test_size], h_evals[i * test_size..(i + 1) * test_size]);
assert_eq!(h_evals_large[(2 * i + 1) * test_size..(2 * i + 2) * test_size], h_evals_coset[i * test_size..(i + 1) * test_size]);
}
for i in 0..test_size * batch_size {
assert_ne!(h_evals[i], Point::zero());
assert_ne!(h_evals_coset[i], Point::zero());
assert_ne!(h_evals_large[2 * i], Point::zero());
assert_ne!(h_evals_large[2 * i + 1], Point::zero());
}
}
// testing matrix multiplication by comparing the result of FFT with the naive multiplication by the DFT matrix
#[test]
fn test_matrix_multiplication() {
let seed = None; // some value to fix the rng
let test_size = 1 << 5;
let rou = Fr::get_root_of_unity(test_size).unwrap();
// Build the DFT matrix: entry (row, col) = rou^(row * col), flattened row-major.
let matrix_flattened: Vec<Scalar> = (0..test_size).map(
|row_num| { (0..test_size).map(
|col_num| {
let pow: [u64; 1] = [(row_num * col_num).try_into().unwrap()];
Scalar::from_ark(Fr::pow(&rou, &pow).into_repr())
}).collect::<Vec<Scalar>>()
}).flatten().collect::<Vec<_>>();
let vector: Vec<Scalar> = generate_random_scalars(test_size, get_rng(seed));
let result = mult_matrix_by_vec(&matrix_flattened, &vector, 0);
let mut ntt_result = vector.clone();
ntt(&mut ntt_result, 0);
// we don't use the same roots of unity as arkworks, so the results are permutations
// of one another and the only guaranteed fixed scalars are the following ones:
assert_eq!(result[0], ntt_result[0]);
assert_eq!(result[test_size >> 1], ntt_result[test_size >> 1]);
}
// Element-wise scalar-vector product: [1,1,0] * [1,0,0] == [1,0,0].
#[test]
#[allow(non_snake_case)]
fn test_vec_scalar_mul() {
let mut intoo = [Scalar::one(), Scalar::one(), Scalar::zero()];
let expected = [Scalar::one(), Scalar::zero(), Scalar::zero()];
mult_sc_vec(&mut intoo, &expected, 0);
assert_eq!(intoo, expected);
}
// Element-wise point-by-scalar product against a hand-computed expectation.
#[test]
#[allow(non_snake_case)]
fn test_vec_point_mul() {
// NOTE(review): (1, 1, 1) is a placeholder value, not necessarily a curve point.
let dummy_one = Point {
x: Base::one(),
y: Base::one(),
z: Base::one(),
};
let mut inout = [dummy_one, dummy_one, Point::zero()];
let scalars = [Scalar::one(), Scalar::zero(), Scalar::zero()];
let expected = [dummy_one, Point::zero(), Point::zero()];
multp_vec(&mut inout, &scalars, 0);
assert_eq!(inout, expected);
}
}

34
bls12-381/Cargo.toml Normal file
View File

@@ -0,0 +1,34 @@
[package]
name = "bls12-381"
version = "0.1.0"
edition = "2021"
authors = [ "Ingonyama" ]
[dependencies]
icicle-core = { path = "../icicle-core" }
hex = "*"
ark-std = "0.3.0"
ark-ff = "0.3.0"
ark-poly = "0.3.0"
ark-ec = { version = "0.3.0", features = [ "parallel" ] }
ark-bls12-381 = "0.3.0"
serde = { version = "1.0", features = ["derive"] }
serde_derive = "1.0"
serde_cbor = "0.11.2"
rustacuda = "0.1"
rustacuda_core = "0.1"
rustacuda_derive = "0.1"
rand = "*" #TODO: move rand and ark dependencies to dev once random scalar/point generation is done "natively"
[build-dependencies]
cc = { version = "1.0", features = ["parallel"] }
[dev-dependencies]
"criterion" = "0.4.0"
[features]
g2 = []

36
bls12-381/build.rs Normal file
View File

@@ -0,0 +1,36 @@
use std::env;
/// Build script: compiles the icicle CUDA sources for BLS12-381 via `cc`/nvcc.
///
/// Reads the `ARCH_TYPE` (default `"native"`) and `DEFAULT_STREAM`
/// (default `"legacy"`) environment variables to form nvcc flags, and
/// defines `G2_DEFINED` when the `g2` cargo feature is enabled.
fn main() {
    //TODO: check cargo features selected
    //TODO: can conflict/duplicate with make ?
    println!("cargo:rerun-if-env-changed=CXXFLAGS");
    println!("cargo:rerun-if-changed=./icicle");
    // `unwrap_or_else` defers the default-string allocation to the miss path
    // (clippy::or_fun_call); behavior is otherwise unchanged.
    let arch_type = env::var("ARCH_TYPE").unwrap_or_else(|_| String::from("native"));
    let stream_type = env::var("DEFAULT_STREAM").unwrap_or_else(|_| String::from("legacy"));
    let arch = format!("-arch={}", arch_type);
    let stream = format!("-default-stream={}", stream_type);
    let mut nvcc = cc::Build::new();
    println!("Compiling icicle library using arch: {}", &arch);
    if cfg!(feature = "g2") {
        nvcc.define("G2_DEFINED", None);
    }
    nvcc.cuda(true);
    nvcc.define("FEATURE_BLS12_381", None);
    nvcc.debug(false);
    nvcc.flag(&arch);
    nvcc.flag(&stream);
    nvcc.shared_flag(false);
    // nvcc.static_flag(true);
    nvcc.files([
        "../icicle-cuda/curves/index.cu",
    ]);
    nvcc.compile("ingo_icicle"); //TODO: extension??
}

View File

@@ -0,0 +1,4 @@
/// Compile-time description of a finite field whose elements are stored as
/// `NUM_LIMBS` 32-bit limbs.
pub trait Field<const NUM_LIMBS: usize> {
    // NOTE(review): "MODOLUS" is a typo of "MODULUS"; renaming would break all
    // implementors, so the name is kept here and only flagged.
    const MODOLUS: [u32;NUM_LIMBS];
    // Number of u32 limbs per field element, re-exported as an associated const.
    const LIMBS: usize = NUM_LIMBS;
}

View File

@@ -0,0 +1,3 @@
// Curve-agnostic building blocks: field descriptions, scalar and point types.
pub mod field;
pub mod scalar;
pub mod point;

View File

@@ -0,0 +1,106 @@
use std::ffi::c_uint;
use ark_ec::AffineCurve;
use ark_ff::{BigInteger256, PrimeField};
use std::mem::transmute;
use ark_ff::Field;
use icicle_core::utils::{u32_vec_to_u64_vec, u64_vec_to_u32_vec};
use rustacuda_core::DeviceCopy;
use rustacuda_derive::DeviceCopy;
use super::scalar::{get_fixed_limbs, self};
/// Projective point (x : y : z) with coordinates in base field `BF`.
/// `#[repr(C)]` so it can be passed across the FFI boundary to CUDA.
#[derive(Debug, Clone, Copy, DeviceCopy)]
#[repr(C)]
pub struct PointT<BF: scalar::ScalarTrait> {
    pub x: BF,
    pub y: BF,
    pub z: BF,
}
/// The default projective point is the group identity (see [`PointT::zero`]).
impl<BF: DeviceCopy + scalar::ScalarTrait> Default for PointT<BF> {
    fn default() -> Self {
        Self::zero()
    }
}
/// Identity constructors shared by all projective point types.
impl<BF: DeviceCopy + scalar::ScalarTrait> PointT<BF> {
    /// The projective identity, encoded as (0 : 1 : 0).
    pub fn zero() -> Self {
        Self {
            x: BF::zero(),
            y: BF::one(),
            z: BF::zero(),
        }
    }

    /// Alias for [`Self::zero`]: the group identity is the point at infinity.
    pub fn infinity() -> Self {
        Self::zero()
    }
}
/// Affine point (x, y) over base field `BF`; the point at infinity is not
/// representable in this type. `#[repr(C)]` for FFI use.
#[derive(Debug, PartialEq, Clone, Copy, DeviceCopy)]
#[repr(C)]
pub struct PointAffineNoInfinityT<BF> {
    pub x: BF,
    pub y: BF,
}
/// Defaults to (0, 0) — all limbs zero in both coordinates.
impl<BF: scalar::ScalarTrait> Default for PointAffineNoInfinityT<BF> {
    fn default() -> Self {
        Self {
            x: BF::zero(),
            y: BF::zero(),
        }
    }
}
impl<BF: Copy + scalar::ScalarTrait> PointAffineNoInfinityT<BF> {
    /// Builds an affine point from the u32 limbs of x and y.
    pub fn from_limbs(x: &[u32], y: &[u32]) -> Self {
        Self {
            x: BF::from_limbs(x),
            y: BF::from_limbs(y),
        }
    }

    /// Returns the limbs of x followed by the limbs of y in one flat vector.
    pub fn limbs(&self) -> Vec<u32> {
        let mut all = self.x.limbs().to_vec();
        all.extend_from_slice(self.y.limbs());
        all
    }

    /// Lifts to projective coordinates by fixing z = 1.
    pub fn to_projective(&self) -> PointT<BF> {
        PointT {
            x: self.x,
            y: self.y,
            z: BF::one(),
        }
    }
}
impl<BF: Copy + scalar::ScalarTrait> PointT<BF> {
    /// Builds a projective point from the u32 limbs of x, y and z.
    pub fn from_limbs(x: &[u32], y: &[u32], z: &[u32]) -> Self {
        Self {
            x: BF::from_limbs(x),
            y: BF::from_limbs(y),
            z: BF::from_limbs(z),
        }
    }

    /// Builds a projective point from a single flat slice laid out as x ‖ y ‖ z.
    /// Panics unless the slice holds exactly 3 coordinates' worth of limbs.
    pub fn from_xy_limbs(value: &[u32]) -> PointT<BF> {
        let nl = BF::base_limbs();
        assert_eq!(value.len(), 3 * nl, "length must be 3 * {}", nl);
        let (xs, rest) = value.split_at(nl);
        let (ys, zs) = rest.split_at(nl);
        PointT {
            x: BF::from_limbs(xs),
            y: BF::from_limbs(ys),
            z: BF::from_limbs(zs),
        }
    }

    /// Returns (x, y), discarding z without any normalization.
    pub fn to_xy_strip_z(&self) -> PointAffineNoInfinityT<BF> {
        PointAffineNoInfinityT {
            x: self.x,
            y: self.y,
        }
    }
}

View File

@@ -0,0 +1,102 @@
use std::ffi::{c_int, c_uint};
use rand::{rngs::StdRng, RngCore, SeedableRng};
use rustacuda_core::DeviceCopy;
use rustacuda_derive::DeviceCopy;
use std::mem::transmute;
use rustacuda::prelude::*;
use rustacuda_core::DevicePointer;
use rustacuda::memory::{DeviceBox, CopyDestination};
use icicle_core::utils::{u32_vec_to_u64_vec, u64_vec_to_u32_vec};
use std::marker::PhantomData;
use std::convert::TryInto;
use super::field::{Field, self};
/// Zero-pads `val` into a fixed-size `[u32; NUM_LIMBS]` limb array.
///
/// # Panics
/// Panics if `val` holds more than `NUM_LIMBS` elements.
pub fn get_fixed_limbs<const NUM_LIMBS: usize>(val: &[u32]) -> [u32; NUM_LIMBS] {
    if val.len() > NUM_LIMBS {
        panic!("slice has too many elements");
    }
    let mut limbs = [0u32; NUM_LIMBS];
    limbs[..val.len()].copy_from_slice(val);
    limbs
}
/// Minimal interface shared by limb-based field elements.
pub trait ScalarTrait{
    /// Number of u32 limbs in one element.
    fn base_limbs() -> usize;
    /// The additive identity.
    fn zero() -> Self;
    /// Builds an element from u32 limbs, zero-padding short input.
    fn from_limbs(value: &[u32]) -> Self;
    /// The multiplicative identity.
    fn one() -> Self;
    /// Little-endian byte serialization of the limbs.
    fn to_bytes_le(&self) -> Vec<u8>;
    /// Borrow of the raw limb slice.
    fn limbs(&self) -> &[u32];
}
/// Element of field `M` stored as `NUM_LIMBS` u32 limbs (little-endian limb
/// order — see `from_limbs_be`, which reverses into this order).
#[derive(Debug, PartialEq, Clone, Copy)]
#[repr(C)]
pub struct ScalarT<M, const NUM_LIMBS: usize> {
    // Zero-sized marker tying the element to its field `M`.
    pub(crate) phantom: PhantomData<M>,
    pub(crate) value : [u32; NUM_LIMBS]
}
/// Limb-level implementation of [`ScalarTrait`] for any field description `M`.
impl<M, const NUM_LIMBS: usize> ScalarTrait for ScalarT<M, NUM_LIMBS>
where
    M: Field<NUM_LIMBS>,
{
    /// Number of u32 limbs per element.
    fn base_limbs() -> usize {
        // Idiom: expression tail instead of explicit `return`.
        NUM_LIMBS
    }
    /// Additive identity: all limbs zero.
    fn zero() -> Self {
        ScalarT {
            value: [0u32; NUM_LIMBS],
            phantom: PhantomData,
        }
    }
    /// Builds an element from limbs, zero-padding short input.
    fn from_limbs(value: &[u32]) -> Self {
        Self {
            value: get_fixed_limbs(value),
            phantom: PhantomData,
        }
    }
    /// Multiplicative identity: lowest limb one, all others zero.
    fn one() -> Self {
        let mut s = [0u32; NUM_LIMBS];
        s[0] = 1;
        ScalarT { value: s, phantom: PhantomData }
    }
    /// Little-endian byte serialization of the limbs.
    fn to_bytes_le(&self) -> Vec<u8> {
        // `flat_map` replaces the former `map(..).flatten()` chain (clippy::map_flatten).
        self.value
            .iter()
            .flat_map(|s| s.to_le_bytes().to_vec())
            .collect::<Vec<_>>()
    }
    fn limbs(&self) -> &[u32] {
        &self.value
    }
}
impl<M, const NUM_LIMBS: usize> ScalarT<M, NUM_LIMBS> where M: field::Field<NUM_LIMBS>{
    /// Builds an element from little-endian u32 limbs.
    pub fn from_limbs_le(value: &[u32]) -> ScalarT<M,NUM_LIMBS> {
        Self::from_limbs(value)
    }
    /// Builds an element from big-endian u32 limbs by reversing into
    /// little-endian limb order first.
    pub fn from_limbs_be(value: &[u32]) -> ScalarT<M,NUM_LIMBS> {
        let mut value = value.to_vec();
        value.reverse();
        Self::from_limbs_le(&value)
    }
    // Additional Functions
    /// Limb-wise addition with carry propagation.
    ///
    /// BUGFIX: the previous version filled *every* limb with
    /// `self.value[0] + other.value[0]`, ignoring all higher limbs and
    /// panicking on u32 overflow in debug builds.
    ///
    /// NOTE(review): no modular reduction is performed yet — `M::MODOLUS` is
    /// still a placeholder; a final carry out of the top limb is discarded.
    pub fn add(&self, other:ScalarT<M, NUM_LIMBS>) -> ScalarT<M,NUM_LIMBS>{ // overload +
        let mut out = [0u32; NUM_LIMBS];
        let mut carry = 0u64;
        for i in 0..NUM_LIMBS {
            let sum = self.value[i] as u64 + other.value[i] as u64 + carry;
            out[i] = sum as u32;
            carry = sum >> 32;
        }
        ScalarT { value: out, phantom: PhantomData }
    }
}

View File

@@ -0,0 +1,62 @@
use std::ffi::{c_int, c_uint};
use rand::{rngs::StdRng, RngCore, SeedableRng};
use rustacuda_derive::DeviceCopy;
use std::mem::transmute;
use rustacuda::prelude::*;
use rustacuda_core::DevicePointer;
use rustacuda::memory::{DeviceBox, CopyDestination, DeviceCopy};
use std::marker::PhantomData;
use std::convert::TryInto;
use crate::basic_structs::point::{PointT, PointAffineNoInfinityT};
use crate::basic_structs::scalar::ScalarT;
use crate::basic_structs::field::Field;
/// Marker type for the scalar field (8 u32 limbs = 256 bits).
#[derive(Debug, PartialEq, Clone, Copy,DeviceCopy)]
#[repr(C)]
pub struct ScalarField;
impl Field<8> for ScalarField {
    // NOTE(review): placeholder modulus (all zeros) — the real modulus still
    // needs to be filled in; nothing in view reads it yet.
    const MODOLUS: [u32; 8] = [0x0;8];
}
/// Marker type for the base field (12 u32 limbs = 384 bits).
#[derive(Debug, PartialEq, Clone, Copy,DeviceCopy)]
#[repr(C)]
pub struct BaseField;
impl Field<12> for BaseField {
    // NOTE(review): placeholder modulus (all zeros) — fill in the real value.
    const MODOLUS: [u32; 12] = [0x0;12];
}
/// Scalar-field element in the CUDA-facing limb representation.
pub type Scalar = ScalarT<ScalarField,8>;
impl Default for Scalar {
    // All-zero limbs, i.e. the additive identity.
    fn default() -> Self {
        Self{value: [0x0;ScalarField::LIMBS], phantom: PhantomData }
    }
}
// SAFETY: `Scalar` is `#[repr(C)]` plain u32 limb data with no pointers or drop glue,
// so a bitwise copy to/from the device is valid.
unsafe impl DeviceCopy for Scalar{}
/// Base-field element in the CUDA-facing limb representation.
pub type Base = ScalarT<BaseField,12>;
impl Default for Base {
    // All-zero limbs, i.e. the additive identity.
    fn default() -> Self {
        Self{value: [0x0;BaseField::LIMBS], phantom: PhantomData }
    }
}
// SAFETY: `Base` is `#[repr(C)]` plain u32 limb data with no pointers or drop glue,
// so a bitwise copy to/from the device is valid.
unsafe impl DeviceCopy for Base{}
/// Projective point over the base field.
pub type Point = PointT<Base>;
/// Affine point over the base field (infinity not representable).
pub type PointAffineNoInfinity = PointAffineNoInfinityT<Base>;
extern "C" {
    // CUDA-side projective-point comparison; presumably returns nonzero when the
    // points are equal (inferred from the `!= 0` use below — confirm against the kernel).
    fn eq(point1: *const Point, point2: *const Point) -> c_uint;
}
// Equality is delegated to the CUDA library so it matches device-side semantics.
impl PartialEq for Point {
    fn eq(&self, other: &Self) -> bool {
        // SAFETY: both pointers come from valid Rust references to `#[repr(C)]`
        // points, which remain live for the duration of the call.
        unsafe { eq(self, other) != 0 }
    }
}

798
bls12-381/src/from_cuda.rs Normal file
View File

@@ -0,0 +1,798 @@
use std::ffi::{c_int, c_uint};
use ark_std::UniformRand;
use rand::{rngs::StdRng, RngCore, SeedableRng};
use rustacuda::CudaFlags;
use rustacuda::memory::DeviceBox;
use rustacuda::prelude::{DeviceBuffer, Device, ContextFlags, Context};
use rustacuda_core::DevicePointer;
use std::mem::transmute;
use crate::basic_structs::scalar::ScalarTrait;
use crate::curve_structs::*;
use icicle_core::utils::{u32_vec_to_u64_vec, u64_vec_to_u32_vec};
use std::marker::PhantomData;
use std::convert::TryInto;
use ark_bls12_381::{Fq as Fq_BLS12_381, Fr as Fr_BLS12_381, G1Affine as G1Affine_BLS12_381, G1Projective as G1Projective_BLS12_381};
use ark_ec::AffineCurve;
use ark_ff::{BigInteger384, BigInteger256, PrimeField};
use rustacuda::memory::{CopyDestination, DeviceCopy};
// Raw FFI bindings to the icicle CUDA backend.
// Conventions observed by the safe wrappers below:
//  - plain `*const`/`*mut` pointers refer to host memory, `DevicePointer`
//    arguments refer to device allocations;
//  - `device_id` selects the GPU;
//  - each call returns a C status code, which the wrappers in this module
//    currently ignore (presumably 0 means success — confirm on the C side).
extern "C" {
    // --- multi-scalar multiplication (host-resident inputs) ---
    fn msm_cuda(
        out: *mut Point,
        points: *const PointAffineNoInfinity,
        scalars: *const Scalar,
        count: usize,
        device_id: usize,
    ) -> c_uint;
    fn msm_batch_cuda(
        out: *mut Point,
        points: *const PointAffineNoInfinity,
        scalars: *const Scalar,
        batch_size: usize,
        msm_size: usize,
        device_id: usize,
    ) -> c_uint;
    // --- commitments: MSM over device-resident buffers ---
    fn commit_cuda(
        d_out: DevicePointer<Point>,
        d_scalars: DevicePointer<Scalar>,
        d_points: DevicePointer<PointAffineNoInfinity>,
        count: usize,
        device_id: usize,
    ) -> c_uint;
    fn commit_batch_cuda(
        d_out: DevicePointer<Point>,
        d_scalars: DevicePointer<Scalar>,
        d_points: DevicePointer<PointAffineNoInfinity>,
        count: usize,
        batch_size: usize,
        device_id: usize,
    ) -> c_uint;
    // --- NTT / ECNTT, in place on host slices ---
    // Allocates a device array of `domain_size` twiddle factors and returns it;
    // the Rust wrapper `build_domain` assumes ownership of the allocation.
    fn build_domain_cuda(domain_size: usize, logn: usize, inverse: bool, device_id: usize) -> DevicePointer<Scalar>;
    fn ntt_cuda(inout: *mut Scalar, n: usize, inverse: bool, device_id: usize) -> c_int;
    fn ecntt_cuda(inout: *mut Point, n: usize, inverse: bool, device_id: usize) -> c_int;
    // NOTE(review): the batch NTT entry points take no `device_id`, so the
    // Rust wrappers cannot forward one to them.
    fn ntt_batch_cuda(
        inout: *mut Scalar,
        arr_size: usize,
        n: usize,
        inverse: bool,
    ) -> c_int;
    fn ecntt_batch_cuda(inout: *mut Point, arr_size: usize, n: usize, inverse: bool) -> c_int;
    // --- interpolation (iNTT) over device buffers ---
    fn interpolate_scalars_cuda(
        d_out: DevicePointer<Scalar>,
        d_evaluations: DevicePointer<Scalar>,
        d_domain: DevicePointer<Scalar>,
        n: usize,
        device_id: usize
    ) -> c_int;
    fn interpolate_scalars_batch_cuda(
        d_out: DevicePointer<Scalar>,
        d_evaluations: DevicePointer<Scalar>,
        d_domain: DevicePointer<Scalar>,
        n: usize,
        batch_size: usize,
        device_id: usize
    ) -> c_int;
    fn interpolate_points_cuda(
        d_out: DevicePointer<Point>,
        d_evaluations: DevicePointer<Point>,
        d_domain: DevicePointer<Scalar>,
        n: usize,
        device_id: usize
    ) -> c_int;
    fn interpolate_points_batch_cuda(
        d_out: DevicePointer<Point>,
        d_evaluations: DevicePointer<Point>,
        d_domain: DevicePointer<Scalar>,
        n: usize,
        batch_size: usize,
        device_id: usize
    ) -> c_int;
    // --- evaluation (NTT of `n` coefficients onto a `domain_size` domain) ---
    fn evaluate_scalars_cuda(
        d_out: DevicePointer<Scalar>,
        d_coefficients: DevicePointer<Scalar>,
        d_domain: DevicePointer<Scalar>,
        domain_size: usize,
        n: usize,
        device_id: usize
    ) -> c_int;
    fn evaluate_scalars_batch_cuda(
        d_out: DevicePointer<Scalar>,
        d_coefficients: DevicePointer<Scalar>,
        d_domain: DevicePointer<Scalar>,
        domain_size: usize,
        n: usize,
        batch_size: usize,
        device_id: usize
    ) -> c_int;
    fn evaluate_points_cuda(
        d_out: DevicePointer<Point>,
        d_coefficients: DevicePointer<Point>,
        d_domain: DevicePointer<Scalar>,
        domain_size: usize,
        n: usize,
        device_id: usize
    ) -> c_int;
    fn evaluate_points_batch_cuda(
        d_out: DevicePointer<Point>,
        d_coefficients: DevicePointer<Point>,
        d_domain: DevicePointer<Scalar>,
        domain_size: usize,
        n: usize,
        batch_size: usize,
        device_id: usize
    ) -> c_int;
    // --- evaluation on a coset (domain shifted by `coset_powers`) ---
    fn evaluate_scalars_on_coset_cuda(
        d_out: DevicePointer<Scalar>,
        d_coefficients: DevicePointer<Scalar>,
        d_domain: DevicePointer<Scalar>,
        domain_size: usize,
        n: usize,
        coset_powers: DevicePointer<Scalar>,
        device_id: usize
    ) -> c_int;
    fn evaluate_scalars_on_coset_batch_cuda(
        d_out: DevicePointer<Scalar>,
        d_coefficients: DevicePointer<Scalar>,
        d_domain: DevicePointer<Scalar>,
        domain_size: usize,
        n: usize,
        batch_size: usize,
        coset_powers: DevicePointer<Scalar>,
        device_id: usize
    ) -> c_int;
    fn evaluate_points_on_coset_cuda(
        d_out: DevicePointer<Point>,
        d_coefficients: DevicePointer<Point>,
        d_domain: DevicePointer<Scalar>,
        domain_size: usize,
        n: usize,
        coset_powers: DevicePointer<Scalar>,
        device_id: usize
    ) -> c_int;
    fn evaluate_points_on_coset_batch_cuda(
        d_out: DevicePointer<Point>,
        d_coefficients: DevicePointer<Point>,
        d_domain: DevicePointer<Scalar>,
        domain_size: usize,
        n: usize,
        batch_size: usize,
        coset_powers: DevicePointer<Scalar>,
        device_id: usize
    ) -> c_int;
    // --- bit-reversal permutation of device arrays ---
    fn reverse_order_scalars_cuda(
        d_arr: DevicePointer<Scalar>,
        n: usize,
        device_id: usize
    ) -> c_int;
    fn reverse_order_scalars_batch_cuda(
        d_arr: DevicePointer<Scalar>,
        n: usize,
        batch_size: usize,
        device_id: usize
    ) -> c_int;
    fn reverse_order_points_cuda(
        d_arr: DevicePointer<Point>,
        n: usize,
        device_id: usize
    ) -> c_int;
    fn reverse_order_points_batch_cuda(
        d_arr: DevicePointer<Point>,
        n: usize,
        batch_size: usize,
        device_id: usize
    ) -> c_int;
    // --- element-wise modular arithmetic on host slices ---
    fn vec_mod_mult_point(
        inout: *mut Point,
        scalars: *const Scalar,
        n_elements: usize,
        device_id: usize,
    ) -> c_int;
    fn vec_mod_mult_scalar(
        inout: *mut Scalar,
        scalars: *const Scalar,
        n_elements: usize,
        device_id: usize,
    ) -> c_int;
    fn matrix_vec_mod_mult(
        matrix_flattened: *const Scalar,
        input: *const Scalar,
        output: *mut Scalar,
        n_elements: usize,
        device_id: usize,
    ) -> c_int;
}
/// Multi-scalar multiplication on GPU `device_id`: returns Σ scalars[i]·points[i].
///
/// `points` and `scalars` must have equal length; mismatched lengths are
/// currently unimplemented and panic via `todo!`. The CUDA status code is
/// ignored, consistent with the rest of this module.
pub fn msm(points: &[PointAffineNoInfinity], scalars: &[Scalar], device_id: usize) -> Point {
    if points.len() != scalars.len() {
        todo!("variable length")
    }
    let mut ret = Point::zero();
    // SAFETY: `ret` is a valid out-pointer and both slices outlive the call.
    // `.as_ptr()` replaces the fragile `as *const _ as *const T` slice casts.
    unsafe {
        msm_cuda(
            &mut ret,
            points.as_ptr(),
            scalars.as_ptr(),
            scalars.len(),
            device_id,
        )
    };
    ret
}
/// Batched MSM: `points`/`scalars` hold `batch_size` contiguous sub-arrays of
/// `count / batch_size` elements each; returns one accumulated point per batch.
///
/// Panics via `todo!` when `points` and `scalars` differ in length.
pub fn msm_batch(
    points: &[PointAffineNoInfinity],
    scalars: &[Scalar],
    batch_size: usize,
    device_id: usize,
) -> Vec<Point> {
    let count = points.len();
    if count != scalars.len() {
        todo!("variable length")
    }
    let mut ret = vec![Point::zero(); batch_size];
    // SAFETY: `ret` has room for `batch_size` results, input slices outlive
    // the call. `.as_mut_ptr()`/`.as_ptr()` replace the raw slice casts.
    unsafe {
        msm_batch_cuda(
            ret.as_mut_ptr(),
            points.as_ptr(),
            scalars.as_ptr(),
            batch_size,
            count / batch_size,
            device_id,
        )
    };
    ret
}
/// Device-side commitment (MSM over device-resident buffers) on GPU 0.
/// The result stays in device memory as a `DeviceBox`.
pub fn commit(
    points: &mut DeviceBuffer<PointAffineNoInfinity>,
    scalars: &mut DeviceBuffer<Scalar>,
) -> DeviceBox<Point> {
    let mut result = DeviceBox::new(&Point::zero()).unwrap();
    // The CUDA status code is ignored, as elsewhere in this module.
    unsafe {
        commit_cuda(
            result.as_device_ptr(),
            scalars.as_device_ptr(),
            points.as_device_ptr(),
            scalars.len(),
            0,
        );
    }
    result
}
/// Batched device-side commitment on GPU 0: one output point per batch,
/// each batch covering `scalars.len() / batch_size` device-resident pairs.
pub fn commit_batch(
    points: &mut DeviceBuffer<PointAffineNoInfinity>,
    scalars: &mut DeviceBuffer<Scalar>,
    batch_size: usize,
) -> DeviceBuffer<Point> {
    // SAFETY: the kernel writes all `batch_size` uninitialized output slots.
    let mut results = unsafe { DeviceBuffer::uninitialized(batch_size).unwrap() };
    unsafe {
        commit_batch_cuda(
            results.as_device_ptr(),
            scalars.as_device_ptr(),
            points.as_device_ptr(),
            scalars.len() / batch_size,
            batch_size,
            0,
        );
    }
    results
}
/// Compute an in-place (i)NTT over `values` on GPU `device_id`.
/// Returns the raw CUDA status code.
fn ntt_internal(values: &mut [Scalar], device_id: usize, inverse: bool) -> i32 {
    // SAFETY: the slice's pointer/length pair stays valid for the whole call.
    // `.as_mut_ptr()` replaces the `as *mut _ as *mut Scalar` slice cast, and
    // the pointless `ret_code` binding is gone.
    unsafe {
        ntt_cuda(
            values.as_mut_ptr(),
            values.len(),
            inverse,
            device_id,
        )
    }
}
/// Forward NTT, in place, on GPU `device_id`. The status code is discarded.
pub fn ntt(values: &mut [Scalar], device_id: usize) {
    let _ = ntt_internal(values, device_id, false);
}
/// Inverse NTT, in place, on GPU `device_id`. The status code is discarded.
pub fn intt(values: &mut [Scalar], device_id: usize) {
    let _ = ntt_internal(values, device_id, true);
}
/// Batched in-place (i)NTT: `values` holds contiguous sub-vectors of length
/// `batch_size` each (the value forwarded as the kernel's transform size `n`).
/// `device_id` is accepted for symmetry with `ntt_internal` but cannot be
/// forwarded: `ntt_batch_cuda` takes no device argument.
fn ntt_internal_batch(
    values: &mut [Scalar],
    device_id: usize,
    batch_size: usize,
    inverse: bool,
) -> i32 {
    let _ = device_id; // see doc comment: the C entry point has no device_id
    // SAFETY: the slice's pointer/length pair stays valid for the whole call.
    unsafe {
        ntt_batch_cuda(
            values.as_mut_ptr(),
            values.len(),
            batch_size,
            inverse,
        )
    }
}
/// Batched forward NTT in place. `batch_size` is the length of each sub-vector
/// (forwarded to the kernel as the transform size).
pub fn ntt_batch(values: &mut [Scalar], batch_size: usize, device_id: usize) {
    // Forward the caller's device_id instead of the previous hard-coded 0
    // (currently a no-op — the batch kernel ignores it — but no longer
    // silently discards the argument).
    ntt_internal_batch(values, device_id, batch_size, false);
}
/// Batched inverse NTT in place. `batch_size` is the length of each sub-vector.
pub fn intt_batch(values: &mut [Scalar], batch_size: usize, device_id: usize) {
    // Forward the caller's device_id instead of the previous hard-coded 0.
    ntt_internal_batch(values, device_id, batch_size, true);
}
/// Compute an in-place (i)ECNTT over `values` on GPU `device_id`.
/// Returns the raw CUDA status code.
fn ecntt_internal(values: &mut [Point], inverse: bool, device_id: usize) -> i32 {
    // SAFETY: the slice's pointer/length pair stays valid for the whole call.
    // `.as_mut_ptr()` replaces the raw `as *mut _ as *mut Point` slice cast.
    unsafe {
        ecntt_cuda(
            values.as_mut_ptr(),
            values.len(),
            inverse,
            device_id,
        )
    }
}
/// Forward ECNTT, in place, on GPU `device_id`. The status code is discarded.
pub fn ecntt(values: &mut [Point], device_id: usize) {
    let _ = ecntt_internal(values, false, device_id);
}
/// Inverse ECNTT, in place, on GPU `device_id`. The status code is discarded.
pub fn iecntt(values: &mut [Point], device_id: usize) {
    let _ = ecntt_internal(values, true, device_id);
}
/// Batched in-place (i)ECNTT: `values` holds contiguous sub-vectors of length
/// `batch_size` each. Like `ntt_internal_batch`, `device_id` cannot be
/// forwarded — `ecntt_batch_cuda` takes no device argument.
fn ecntt_internal_batch(
    values: &mut [Point],
    device_id: usize,
    batch_size: usize,
    inverse: bool,
) -> i32 {
    let _ = device_id; // the C entry point has no device_id parameter
    // SAFETY: the slice's pointer/length pair stays valid for the whole call.
    unsafe {
        ecntt_batch_cuda(
            values.as_mut_ptr(),
            values.len(),
            batch_size,
            inverse,
        )
    }
}
/// Batched forward ECNTT in place. `batch_size` is the length of each sub-vector.
pub fn ecntt_batch(values: &mut [Point], batch_size: usize, device_id: usize) {
    // Forward the caller's device_id instead of the previous hard-coded 0.
    ecntt_internal_batch(values, device_id, batch_size, false);
}
/// Batched inverse ECNTT in place. `batch_size` is the length of each sub-vector.
pub fn iecntt_batch(values: &mut [Point], batch_size: usize, device_id: usize) {
    // Forward the caller's device_id instead of the previous hard-coded 0.
    ecntt_internal_batch(values, device_id, batch_size, true);
}
/// Build an evaluation domain of `domain_size` (= 2^logn) twiddle factors on
/// GPU 0; `inverse` selects the iNTT domain. Ownership of the device
/// allocation is taken over by the returned buffer.
pub fn build_domain(domain_size: usize, logn: usize, inverse: bool) -> DeviceBuffer<Scalar> {
    // SAFETY: assumes the C side allocates exactly `domain_size` scalars and
    // relinquishes the allocation to us.
    let raw = unsafe { build_domain_cuda(domain_size, logn, inverse, 0) };
    unsafe { DeviceBuffer::from_raw_parts(raw, domain_size) }
}
/// Apply the bit-reversal permutation to a device-resident scalar array (GPU 0).
pub fn reverse_order_scalars(
    d_scalars: &mut DeviceBuffer<Scalar>,
) {
    let n = d_scalars.len();
    unsafe {
        reverse_order_scalars_cuda(d_scalars.as_device_ptr(), n, 0);
    }
}
/// Bit-reverse each of the `batch_size` contiguous scalar sub-arrays (GPU 0).
pub fn reverse_order_scalars_batch(
    d_scalars: &mut DeviceBuffer<Scalar>,
    batch_size: usize,
) {
    let per_batch = d_scalars.len() / batch_size;
    unsafe {
        reverse_order_scalars_batch_cuda(d_scalars.as_device_ptr(), per_batch, batch_size, 0);
    }
}
/// Apply the bit-reversal permutation to a device-resident point array (GPU 0).
pub fn reverse_order_points(
    d_points: &mut DeviceBuffer<Point>,
) {
    let n = d_points.len();
    unsafe {
        reverse_order_points_cuda(d_points.as_device_ptr(), n, 0);
    }
}
/// Bit-reverse each of the `batch_size` contiguous point sub-arrays (GPU 0).
pub fn reverse_order_points_batch(
    d_points: &mut DeviceBuffer<Point>,
    batch_size: usize,
) {
    let per_batch = d_points.len() / batch_size;
    unsafe {
        reverse_order_points_batch_cuda(d_points.as_device_ptr(), per_batch, batch_size, 0);
    }
}
/// Interpolate device-resident scalar evaluations over `d_domain` (iNTT),
/// returning the coefficients in a fresh device buffer.
pub fn interpolate_scalars(
    d_evaluations: &mut DeviceBuffer<Scalar>,
    d_domain: &mut DeviceBuffer<Scalar>
) -> DeviceBuffer<Scalar> {
    // SAFETY: the kernel fills every uninitialized output slot.
    let mut d_coeffs = unsafe { DeviceBuffer::uninitialized(d_domain.len()).unwrap() };
    unsafe {
        interpolate_scalars_cuda(
            d_coeffs.as_device_ptr(),
            d_evaluations.as_device_ptr(),
            d_domain.as_device_ptr(),
            d_domain.len(),
            0,
        )
    };
    d_coeffs
}
/// Batched scalar interpolation: one iNTT of size `d_domain.len()` per batch.
pub fn interpolate_scalars_batch(
    d_evaluations: &mut DeviceBuffer<Scalar>,
    d_domain: &mut DeviceBuffer<Scalar>,
    batch_size: usize,
) -> DeviceBuffer<Scalar> {
    let out_len = d_domain.len() * batch_size;
    // SAFETY: the kernel fills every uninitialized output slot.
    let mut d_coeffs = unsafe { DeviceBuffer::uninitialized(out_len).unwrap() };
    unsafe {
        interpolate_scalars_batch_cuda(
            d_coeffs.as_device_ptr(),
            d_evaluations.as_device_ptr(),
            d_domain.as_device_ptr(),
            d_domain.len(),
            batch_size,
            0,
        )
    };
    d_coeffs
}
/// Interpolate device-resident point evaluations over `d_domain` (iECNTT),
/// returning the coefficients in a fresh device buffer.
pub fn interpolate_points(
    d_evaluations: &mut DeviceBuffer<Point>,
    d_domain: &mut DeviceBuffer<Scalar>,
) -> DeviceBuffer<Point> {
    // SAFETY: the kernel fills every uninitialized output slot.
    let mut d_coeffs = unsafe { DeviceBuffer::uninitialized(d_domain.len()).unwrap() };
    unsafe {
        interpolate_points_cuda(
            d_coeffs.as_device_ptr(),
            d_evaluations.as_device_ptr(),
            d_domain.as_device_ptr(),
            d_domain.len(),
            0,
        )
    };
    d_coeffs
}
/// Batched point interpolation: one iECNTT of size `d_domain.len()` per batch.
pub fn interpolate_points_batch(
    d_evaluations: &mut DeviceBuffer<Point>,
    d_domain: &mut DeviceBuffer<Scalar>,
    batch_size: usize,
) -> DeviceBuffer<Point> {
    let out_len = d_domain.len() * batch_size;
    // SAFETY: the kernel fills every uninitialized output slot.
    let mut d_coeffs = unsafe { DeviceBuffer::uninitialized(out_len).unwrap() };
    unsafe {
        interpolate_points_batch_cuda(
            d_coeffs.as_device_ptr(),
            d_evaluations.as_device_ptr(),
            d_domain.as_device_ptr(),
            d_domain.len(),
            batch_size,
            0,
        )
    };
    d_coeffs
}
/// Evaluate `d_coefficients` on every element of `d_domain` (NTT); the
/// coefficient count may be smaller than the domain size.
pub fn evaluate_scalars(
    d_coefficients: &mut DeviceBuffer<Scalar>,
    d_domain: &mut DeviceBuffer<Scalar>,
) -> DeviceBuffer<Scalar> {
    // SAFETY: the kernel fills every uninitialized output slot.
    let mut d_evals = unsafe { DeviceBuffer::uninitialized(d_domain.len()).unwrap() };
    unsafe {
        evaluate_scalars_cuda(
            d_evals.as_device_ptr(),
            d_coefficients.as_device_ptr(),
            d_domain.as_device_ptr(),
            d_domain.len(),
            d_coefficients.len(),
            0,
        );
    }
    d_evals
}
/// Batched scalar evaluation: `batch_size` polynomials of
/// `d_coefficients.len() / batch_size` coefficients each, evaluated over `d_domain`.
pub fn evaluate_scalars_batch(
    d_coefficients: &mut DeviceBuffer<Scalar>,
    d_domain: &mut DeviceBuffer<Scalar>,
    batch_size: usize,
) -> DeviceBuffer<Scalar> {
    let out_len = d_domain.len() * batch_size;
    // SAFETY: the kernel fills every uninitialized output slot.
    let mut d_evals = unsafe { DeviceBuffer::uninitialized(out_len).unwrap() };
    unsafe {
        evaluate_scalars_batch_cuda(
            d_evals.as_device_ptr(),
            d_coefficients.as_device_ptr(),
            d_domain.as_device_ptr(),
            d_domain.len(),
            d_coefficients.len() / batch_size,
            batch_size,
            0,
        );
    }
    d_evals
}
/// Evaluate point-valued coefficients on every element of `d_domain` (ECNTT).
pub fn evaluate_points(
    d_coefficients: &mut DeviceBuffer<Point>,
    d_domain: &mut DeviceBuffer<Scalar>,
) -> DeviceBuffer<Point> {
    // SAFETY: the kernel fills every uninitialized output slot.
    let mut d_evals = unsafe { DeviceBuffer::uninitialized(d_domain.len()).unwrap() };
    unsafe {
        evaluate_points_cuda(
            d_evals.as_device_ptr(),
            d_coefficients.as_device_ptr(),
            d_domain.as_device_ptr(),
            d_domain.len(),
            d_coefficients.len(),
            0,
        );
    }
    d_evals
}
/// Batched point evaluation: `batch_size` polynomials evaluated over `d_domain`.
pub fn evaluate_points_batch(
    d_coefficients: &mut DeviceBuffer<Point>,
    d_domain: &mut DeviceBuffer<Scalar>,
    batch_size: usize,
) -> DeviceBuffer<Point> {
    let out_len = d_domain.len() * batch_size;
    // SAFETY: the kernel fills every uninitialized output slot.
    let mut d_evals = unsafe { DeviceBuffer::uninitialized(out_len).unwrap() };
    unsafe {
        evaluate_points_batch_cuda(
            d_evals.as_device_ptr(),
            d_coefficients.as_device_ptr(),
            d_domain.as_device_ptr(),
            d_domain.len(),
            d_coefficients.len() / batch_size,
            batch_size,
            0,
        );
    }
    d_evals
}
/// Evaluate scalar coefficients on the coset of `d_domain` shifted by
/// `coset_powers`.
pub fn evaluate_scalars_on_coset(
    d_coefficients: &mut DeviceBuffer<Scalar>,
    d_domain: &mut DeviceBuffer<Scalar>,
    coset_powers: &mut DeviceBuffer<Scalar>,
) -> DeviceBuffer<Scalar> {
    // SAFETY: the kernel fills every uninitialized output slot.
    let mut d_evals = unsafe { DeviceBuffer::uninitialized(d_domain.len()).unwrap() };
    unsafe {
        evaluate_scalars_on_coset_cuda(
            d_evals.as_device_ptr(),
            d_coefficients.as_device_ptr(),
            d_domain.as_device_ptr(),
            d_domain.len(),
            d_coefficients.len(),
            coset_powers.as_device_ptr(),
            0,
        );
    }
    d_evals
}
/// Batched coset evaluation of scalar polynomials (see `evaluate_scalars_on_coset`).
pub fn evaluate_scalars_on_coset_batch(
    d_coefficients: &mut DeviceBuffer<Scalar>,
    d_domain: &mut DeviceBuffer<Scalar>,
    batch_size: usize,
    coset_powers: &mut DeviceBuffer<Scalar>,
) -> DeviceBuffer<Scalar> {
    let out_len = d_domain.len() * batch_size;
    // SAFETY: the kernel fills every uninitialized output slot.
    let mut d_evals = unsafe { DeviceBuffer::uninitialized(out_len).unwrap() };
    unsafe {
        evaluate_scalars_on_coset_batch_cuda(
            d_evals.as_device_ptr(),
            d_coefficients.as_device_ptr(),
            d_domain.as_device_ptr(),
            d_domain.len(),
            d_coefficients.len() / batch_size,
            batch_size,
            coset_powers.as_device_ptr(),
            0,
        );
    }
    d_evals
}
/// Evaluate point-valued coefficients on the coset of `d_domain` shifted by
/// `coset_powers`.
pub fn evaluate_points_on_coset(
    d_coefficients: &mut DeviceBuffer<Point>,
    d_domain: &mut DeviceBuffer<Scalar>,
    coset_powers: &mut DeviceBuffer<Scalar>,
) -> DeviceBuffer<Point> {
    // SAFETY: the kernel fills every uninitialized output slot.
    let mut d_evals = unsafe { DeviceBuffer::uninitialized(d_domain.len()).unwrap() };
    unsafe {
        evaluate_points_on_coset_cuda(
            d_evals.as_device_ptr(),
            d_coefficients.as_device_ptr(),
            d_domain.as_device_ptr(),
            d_domain.len(),
            d_coefficients.len(),
            coset_powers.as_device_ptr(),
            0,
        );
    }
    d_evals
}
/// Batched coset evaluation of point polynomials (see `evaluate_points_on_coset`).
pub fn evaluate_points_on_coset_batch(
    d_coefficients: &mut DeviceBuffer<Point>,
    d_domain: &mut DeviceBuffer<Scalar>,
    batch_size: usize,
    coset_powers: &mut DeviceBuffer<Scalar>,
) -> DeviceBuffer<Point> {
    let out_len = d_domain.len() * batch_size;
    // SAFETY: the kernel fills every uninitialized output slot.
    let mut d_evals = unsafe { DeviceBuffer::uninitialized(out_len).unwrap() };
    unsafe {
        evaluate_points_on_coset_batch_cuda(
            d_evals.as_device_ptr(),
            d_coefficients.as_device_ptr(),
            d_domain.as_device_ptr(),
            d_domain.len(),
            d_coefficients.len() / batch_size,
            batch_size,
            coset_powers.as_device_ptr(),
            0,
        );
    }
    d_evals
}
/// Element-wise modular point-by-scalar multiplication, in place on `a`:
/// a[i] = b[i] * a[i]. Panics if the slices differ in length.
pub fn multp_vec(a: &mut [Point], b: &[Scalar], device_id: usize) {
    assert_eq!(a.len(), b.len());
    // SAFETY: both slices outlive the call; `.as_mut_ptr()`/`.as_ptr()`
    // replace the raw `as *mut _` slice casts.
    unsafe {
        vec_mod_mult_point(
            a.as_mut_ptr(),
            b.as_ptr(),
            a.len(),
            device_id,
        );
    }
}
/// Element-wise modular scalar multiplication, in place on `a`:
/// a[i] = a[i] * b[i]. Panics if the slices differ in length.
pub fn mult_sc_vec(a: &mut [Scalar], b: &[Scalar], device_id: usize) {
    assert_eq!(a.len(), b.len());
    // SAFETY: both slices outlive the call; idiomatic pointer accessors
    // replace the raw `as *mut _` slice casts.
    unsafe {
        vec_mod_mult_scalar(
            a.as_mut_ptr(),
            b.as_ptr(),
            a.len(),
            device_id,
        );
    }
}
/// Multiply a flattened matrix `a` by the vector `b` on GPU `device_id`,
/// returning the product vector of length `b.len()`.
/// (Presumably `a` is row-major and square, b.len() x b.len() — confirm
/// against the `matrix_vec_mod_mult` kernel.)
pub fn mult_matrix_by_vec(a: &[Scalar], b: &[Scalar], device_id: usize) -> Vec<Scalar> {
    // Allocate the zero-filled output in one shot instead of push-ing
    // zeroes in a loop with an unused index variable.
    let mut c = vec![Scalar::zero(); b.len()];
    // SAFETY: all three buffers outlive the call and `c` has `b.len()` slots.
    unsafe {
        matrix_vec_mod_mult(
            a.as_ptr(),
            b.as_ptr(),
            c.as_mut_ptr(),
            b.len(),
            device_id,
        );
    }
    c
}
/// Deep-copy a device buffer into a freshly allocated buffer of equal length.
pub fn clone_buffer<T: DeviceCopy>(buf: &mut DeviceBuffer<T>) -> DeviceBuffer<T> {
    // SAFETY: every uninitialized slot is overwritten by the copy below.
    let mut copy = unsafe { DeviceBuffer::uninitialized(buf.len()).unwrap() };
    unsafe { copy.copy_from(buf) };
    copy
}
/// RNG factory: a seeded `StdRng` (reproducible) when `seed` is `Some`,
/// otherwise the OS-seeded thread-local RNG.
pub fn get_rng(seed: Option<u64>) -> Box<dyn RngCore> {
    match seed {
        Some(s) => Box::new(StdRng::seed_from_u64(s)),
        None => Box::new(rand::thread_rng()),
    }
}
// Initialize CUDA and create a context on device 0.
fn set_up_device() {
    // Set up the context, load the module, and create a stream to run kernels in.
    rustacuda::init(CudaFlags::empty()).unwrap();
    let device = Device::get_device(0).unwrap();
    // NOTE(review): `_ctx` is dropped when this function returns, which
    // destroys the context it just created — presumably the FFI side manages
    // its own context or the callers rely on the primary context; confirm,
    // or return the `Context` so the caller keeps it alive.
    let _ctx = Context::create_and_push(ContextFlags::MAP_HOST | ContextFlags::SCHED_AUTO, device).unwrap();
}
/// Sample `count` uniformly random affine curve points (via arkworks'
/// projective sampler, then dropping the Z coordinate).
pub fn generate_random_points(
    count: usize,
    mut rng: Box<dyn RngCore>,
) -> Vec<PointAffineNoInfinity> {
    let mut points = Vec::with_capacity(count);
    for _ in 0..count {
        let p = G1Projective_BLS12_381::rand(&mut rng);
        points.push(Point::from_ark(p).to_xy_strip_z());
    }
    points
}
/// Sample `count` uniformly random projective curve points.
pub fn generate_random_points_proj(count: usize, mut rng: Box<dyn RngCore>) -> Vec<Point> {
    let mut points = Vec::with_capacity(count);
    for _ in 0..count {
        points.push(Point::from_ark(G1Projective_BLS12_381::rand(&mut rng)));
    }
    points
}
/// Sample `count` uniformly random scalars (canonical Fr representation).
pub fn generate_random_scalars(count: usize, mut rng: Box<dyn RngCore>) -> Vec<Scalar> {
    let mut scalars = Vec::with_capacity(count);
    for _ in 0..count {
        scalars.push(Scalar::from_ark(Fr_BLS12_381::rand(&mut rng).into_repr()));
    }
    scalars
}
/// Test fixture: initialize the device, build a 2^log_domain_size domain, and
/// generate `test_size` reproducible random points, returned both as a host
/// vector and as a device copy.
pub fn set_up_points(test_size: usize, log_domain_size: usize, inverse: bool) -> (Vec<Point>, DeviceBuffer<Point>, DeviceBuffer<Scalar>) {
    set_up_device();
    let d_domain = build_domain(1 << log_domain_size, log_domain_size, inverse);
    let seed = Some(0); // fix the rng to get two equal scalar
    let vector = generate_random_points_proj(test_size, get_rng(seed));
    // Copy to device, then hand the host vector back directly — the previous
    // version cloned the whole vector for no reason.
    let d_vector = DeviceBuffer::from_slice(&vector[..]).unwrap();
    (vector, d_vector, d_domain)
}
/// Test fixture: initialize the device, build a 2^log_domain_size domain, and
/// generate `test_size` reproducible random scalars, returned both as a host
/// vector and as a device copy.
pub fn set_up_scalars(test_size: usize, log_domain_size: usize, inverse: bool) -> (Vec<Scalar>, DeviceBuffer<Scalar>, DeviceBuffer<Scalar>) {
    set_up_device();
    let d_domain = build_domain(1 << log_domain_size, log_domain_size, inverse);
    let seed = Some(0); // fixed seed -> reproducible scalars across calls
    let scalars = generate_random_scalars(test_size, get_rng(seed));
    let d_scalars = DeviceBuffer::from_slice(&scalars[..]).unwrap();
    (scalars, d_scalars, d_domain)
}

4
bls12-381/src/lib.rs Normal file
View File

@@ -0,0 +1,4 @@
pub mod test_bls12_381;
pub mod basic_structs;
pub mod from_cuda;
pub mod curve_structs;

View File

@@ -0,0 +1,816 @@
use std::ffi::{c_int, c_uint};
use ark_std::UniformRand;
use rand::{rngs::StdRng, RngCore, SeedableRng};
use rustacuda::CudaFlags;
use rustacuda::memory::DeviceBox;
use rustacuda::prelude::{DeviceBuffer, Device, ContextFlags, Context};
use rustacuda_core::DevicePointer;
use std::mem::transmute;
pub use crate::basic_structs::scalar::ScalarTrait;
pub use crate::curve_structs::*;
use icicle_core::utils::{u32_vec_to_u64_vec, u64_vec_to_u32_vec};
use std::marker::PhantomData;
use std::convert::TryInto;
use ark_bls12_381::{Fq as Fq_BLS12_381, Fr as Fr_BLS12_381, G1Affine as G1Affine_BLS12_381, G1Projective as G1Projective_BLS12_381};
use ark_ec::AffineCurve;
use ark_ff::{BigInteger384, BigInteger256, PrimeField};
use rustacuda::memory::{CopyDestination, DeviceCopy};
impl Scalar {
    /// Reinterpret the u32 limbs as an arkworks 256-bit big integer.
    /// NOTE(review): the name says 254 but the type is `BigInteger256` —
    /// identical to `to_ark` below.
    pub fn to_biginteger254(&self) -> BigInteger256 {
        BigInteger256::new(u32_vec_to_u64_vec(&self.limbs()).try_into().unwrap())
    }
    /// Limbs as an arkworks big integer (no Montgomery conversion).
    pub fn to_ark(&self) -> BigInteger256 {
        BigInteger256::new(u32_vec_to_u64_vec(&self.limbs()).try_into().unwrap())
    }
    /// Build a scalar from an arkworks big integer (limb repacking only).
    pub fn from_biginteger256(ark: BigInteger256) -> Self {
        Self{ value: u64_vec_to_u32_vec(&ark.0).try_into().unwrap(), phantom : PhantomData}
    }
    // The transmute-based conversions below rely on Scalar and BigInteger256
    // (respectively Fr_BLS12_381) having identical size and layout — assumes
    // both are effectively #[repr(C)] arrays of the same 32 bytes; confirm
    // before changing either type.
    pub fn to_biginteger256_transmute(&self) -> BigInteger256 {
        unsafe { transmute(*self) }
    }
    pub fn from_biginteger_transmute(v: BigInteger256) -> Scalar {
        Scalar{ value: unsafe{ transmute(v)}, phantom : PhantomData }
    }
    /// Bit-cast to an arkworks field element. NOTE(review): arkworks stores Fr
    /// in Montgomery form, so this is only meaningful if this Scalar already
    /// holds a Montgomery representation — confirm at call sites.
    pub fn to_ark_transmute(&self) -> Fr_BLS12_381 {
        unsafe { std::mem::transmute(*self) }
    }
    pub fn from_ark_transmute(v: &Fr_BLS12_381) -> Scalar {
        unsafe { std::mem::transmute_copy(v) }
    }
    /// Interpret the limbs as an (unreduced) integer and construct Fr from it
    /// without Montgomery conversion.
    pub fn to_ark_mod_p(&self) -> Fr_BLS12_381 {
        Fr_BLS12_381::new(BigInteger256::new(u32_vec_to_u64_vec(&self.limbs()).try_into().unwrap()))
    }
    /// Interpret the limbs as a canonical representative and convert into
    /// arkworks' internal (Montgomery) form; panics if out of range.
    pub fn to_ark_repr(&self) -> Fr_BLS12_381 {
        Fr_BLS12_381::from_repr(BigInteger256::new(u32_vec_to_u64_vec(&self.limbs()).try_into().unwrap())).unwrap()
    }
    /// Build a scalar from an arkworks big integer (limb repacking only).
    pub fn from_ark(v: BigInteger256) -> Scalar {
        Self { value : u64_vec_to_u32_vec(&v.0).try_into().unwrap(), phantom: PhantomData}
    }
}
impl Base {
    /// Base-field limbs as an arkworks 384-bit big integer (limb repacking only).
    pub fn to_ark(&self) -> BigInteger384 {
        BigInteger384::new(u32_vec_to_u64_vec(&self.limbs()).try_into().unwrap())
    }
    /// Build a base-field element from an arkworks 384-bit big integer.
    pub fn from_ark(ark: BigInteger384) -> Self {
        Self::from_limbs(&u64_vec_to_u32_vec(&ark.0))
    }
}
impl Point {
    /// Convert to an arkworks projective point, via the affine form.
    pub fn to_ark(&self) -> G1Projective_BLS12_381 {
        self.to_ark_affine().into_projective()
    }
    /// Convert to arkworks affine, treating `self` as homogeneous projective:
    /// (X, Y, Z) -> (X/Z, Y/Z).
    pub fn to_ark_affine(&self) -> G1Affine_BLS12_381 {
        //TODO: generic conversion
        use ark_ff::Field;
        use std::ops::Mul;
        let proj_x_field = Fq_BLS12_381::from_le_bytes_mod_order(&self.x.to_bytes_le());
        let proj_y_field = Fq_BLS12_381::from_le_bytes_mod_order(&self.y.to_bytes_le());
        let proj_z_field = Fq_BLS12_381::from_le_bytes_mod_order(&self.z.to_bytes_le());
        // NOTE(review): `unwrap` panics when Z == 0 (the point at infinity) —
        // confirm callers never pass it.
        let inverse_z = proj_z_field.inverse().unwrap();
        let aff_x = proj_x_field.mul(inverse_z);
        let aff_y = proj_y_field.mul(inverse_z);
        G1Affine_BLS12_381::new(aff_x, aff_y, false)
    }
    /// Convert from arkworks projective. The x·Z⁻², y·Z⁻³ scaling matches
    /// arkworks' Jacobian coordinates; the result is normalized to Z = 1.
    pub fn from_ark(ark: G1Projective_BLS12_381) -> Point {
        use ark_ff::Field;
        // NOTE(review): panics when ark.z == 0 (point at infinity).
        let z_inv = ark.z.inverse().unwrap();
        let z_invsq = z_inv * z_inv;
        let z_invq3 = z_invsq * z_inv;
        Point {
            x: Base::from_ark((ark.x * z_invsq).into_repr()),
            y: Base::from_ark((ark.y * z_invq3).into_repr()),
            z: Base::one(),
        }
    }
}
impl PointAffineNoInfinity {
    /// Convert to arkworks affine using `Fq::new`, i.e. treating the stored
    /// limbs as an arkworks-internal (Montgomery) representation.
    pub fn to_ark(&self) -> G1Affine_BLS12_381 {
        G1Affine_BLS12_381::new(Fq_BLS12_381::new(self.x.to_ark()), Fq_BLS12_381::new(self.y.to_ark()), false)
    }
    /// Convert to arkworks affine treating the stored limbs as canonical
    /// representatives (performs the Montgomery conversion); panics if a
    /// coordinate is out of range.
    pub fn to_ark_repr(&self) -> G1Affine_BLS12_381 {
        G1Affine_BLS12_381::new(
            Fq_BLS12_381::from_repr(self.x.to_ark()).unwrap(),
            Fq_BLS12_381::from_repr(self.y.to_ark()).unwrap(),
            false,
        )
    }
    /// Build from an arkworks affine point (canonical limb repacking).
    /// The point at infinity is not representable here.
    pub fn from_ark(p: &G1Affine_BLS12_381) -> Self {
        PointAffineNoInfinity {
            x: Base::from_ark(p.x.into_repr()),
            y: Base::from_ark(p.y.into_repr()),
        }
    }
}
impl Point {
    /// Convert to affine by round-tripping through the arkworks representation
    /// (drops Z; the point at infinity is not representable in the target type).
    pub fn to_affine(&self) -> PointAffineNoInfinity {
        let affine = self.to_ark_affine();
        PointAffineNoInfinity {
            x: Base::from_ark(affine.x.into_repr()),
            y: Base::from_ark(affine.y.into_repr()),
        }
    }
}
#[cfg(test)]
pub(crate) mod tests_bls12_381 {
use std::ops::Add;
use ark_bls12_381::{Fr, G1Affine, G1Projective};
use ark_ec::{msm::VariableBaseMSM, AffineCurve, ProjectiveCurve};
use ark_ff::{FftField, Field, Zero, PrimeField};
use ark_std::UniformRand;
use rustacuda::prelude::{DeviceBuffer, CopyDestination};
use crate::curve_structs::{Point, Scalar, Base};
use crate::basic_structs::scalar::ScalarTrait;
use crate::from_cuda::{generate_random_points, get_rng, generate_random_scalars, msm, msm_batch, set_up_scalars, commit, commit_batch, ntt, intt, generate_random_points_proj, ecntt, iecntt, ntt_batch, ecntt_batch, iecntt_batch, intt_batch, reverse_order_scalars_batch, interpolate_scalars_batch, set_up_points, reverse_order_points, interpolate_points, reverse_order_points_batch, interpolate_points_batch, evaluate_scalars, interpolate_scalars, reverse_order_scalars, evaluate_points, build_domain, evaluate_scalars_on_coset, evaluate_points_on_coset, mult_matrix_by_vec, mult_sc_vec, multp_vec,evaluate_scalars_batch, evaluate_points_batch, evaluate_scalars_on_coset_batch, evaluate_points_on_coset_batch};
/// Sample `nof_elements` random arkworks projective points (thread RNG).
fn random_points_ark_proj(nof_elements: usize) -> Vec<G1Projective> {
    let mut rng = ark_std::rand::thread_rng();
    let mut points_ga: Vec<G1Projective> = Vec::with_capacity(nof_elements);
    for _ in 0..nof_elements {
        points_ga.push(G1Projective::rand(&mut rng));
    }
    points_ga
}
// Reference O(n^2) EC-NTT: result[k] = Σ_l points[l] · ω^(l·k), where ω is a
// size-th root of unity (or its inverse). The inverse transform additionally
// multiplies every output by size⁻¹. Used to validate the CUDA kernels.
fn ecntt_arc_naive(
    points: &Vec<G1Projective>,
    size: usize,
    inverse: bool,
) -> Vec<G1Projective> {
    let mut result: Vec<G1Projective> = Vec::new();
    for _ in 0..size {
        result.push(G1Projective::zero());
    }
    // Forward uses ω, inverse uses ω⁻¹.
    let rou: Fr;
    if !inverse {
        rou = Fr::get_root_of_unity(size).unwrap();
    } else {
        rou = Fr::inverse(&Fr::get_root_of_unity(size).unwrap()).unwrap();
    }
    for k in 0..size {
        for l in 0..size {
            let pow: [u64; 1] = [(l * k).try_into().unwrap()];
            let mul_rou = Fr::pow(&rou, &pow);
            result[k] = result[k].add(points[l].into_affine().mul(mul_rou));
        }
    }
    if inverse {
        // Scale by 1/size to complete the inverse transform.
        let size2 = size as u64;
        for k in 0..size {
            let multfactor = Fr::inverse(&Fr::from(size2)).unwrap();
            result[k] = result[k].into_affine().mul(multfactor);
        }
    }
    return result;
}
/// Element-wise vector equality; indexes `points2` by `points`' length, so it
/// panics (like the original loop) if `points2` is shorter.
fn check_eq(points: &Vec<G1Projective>, points2: &Vec<G1Projective>) -> bool {
    (0..points.len()).all(|i| points2[i] == points[i])
}
// Self-check of the naive reference transform: forward then inverse must
// round-trip, and the forward transform must not be the identity.
fn test_naive_ark_ecntt(size: usize) {
    let points = random_points_ark_proj(size);
    let result1: Vec<G1Projective> = ecntt_arc_naive(&points, size, false);
    let result2: Vec<G1Projective> = ecntt_arc_naive(&result1, size, true);
    // The transform actually changed the data...
    assert!(!check_eq(&result2, &result1));
    // ...and the inverse restored the input.
    assert!(check_eq(&result2, &points));
}
#[test]
fn test_msm() {
    // Cross-check the CUDA MSM against arkworks' VariableBaseMSM reference,
    // at two sizes, through all three conversion paths.
    let test_sizes = [6, 9];
    for pow2 in test_sizes {
        let count = 1 << pow2;
        let seed = None; // set Some to provide seed
        let points = generate_random_points(count, get_rng(seed));
        let scalars = generate_random_scalars(count, get_rng(seed));
        let msm_result = msm(&points, &scalars, 0);
        let point_r_ark: Vec<_> = points.iter().map(|x| x.to_ark_repr()).collect();
        let scalars_r_ark: Vec<_> = scalars.iter().map(|x| x.to_ark()).collect();
        let msm_result_ark = VariableBaseMSM::multi_scalar_mul(&point_r_ark, &scalars_r_ark);
        assert_eq!(msm_result.to_ark_affine(), msm_result_ark);
        assert_eq!(msm_result.to_ark(), msm_result_ark);
        assert_eq!(
            msm_result.to_ark_affine(),
            Point::from_ark(msm_result_ark).to_ark_affine()
        );
    }
}
#[test]
fn test_batch_msm() {
    // Batched CUDA MSM must match per-chunk arkworks MSMs.
    for batch_pow2 in [2, 4] {
        for pow2 in [4, 6] {
            let msm_size = 1 << pow2;
            let batch_size = 1 << batch_pow2;
            let seed = None; // set Some to provide seed
            let points_batch = generate_random_points(msm_size * batch_size, get_rng(seed));
            let scalars_batch = generate_random_scalars(msm_size * batch_size, get_rng(seed));
            let point_r_ark: Vec<_> = points_batch.iter().map(|x| x.to_ark_repr()).collect();
            let scalars_r_ark: Vec<_> = scalars_batch.iter().map(|x| x.to_ark()).collect();
            // Reference: one arkworks MSM per msm_size-sized chunk.
            let expected: Vec<_> = point_r_ark
                .chunks(msm_size)
                .zip(scalars_r_ark.chunks(msm_size))
                .map(|p| Point::from_ark(VariableBaseMSM::multi_scalar_mul(p.0, p.1)))
                .collect();
            let result = msm_batch(&points_batch, &scalars_batch, batch_size, 0);
            assert_eq!(result, expected);
        }
    }
}
#[test]
fn test_commit() {
    // Device-side commit must agree with the host-slice MSM on the same data.
    let test_size = 1 << 8;
    let seed = Some(0);
    let (mut scalars, mut d_scalars, _) = set_up_scalars(test_size, 0, false);
    let mut points = generate_random_points(test_size, get_rng(seed));
    let mut d_points = DeviceBuffer::from_slice(&points[..]).unwrap();
    let msm_result = msm(&points, &scalars, 0);
    let mut d_commit_result = commit(&mut d_points, &mut d_scalars);
    let mut h_commit_result = Point::zero();
    d_commit_result.copy_to(&mut h_commit_result).unwrap();
    assert_eq!(msm_result, h_commit_result);
    // Guard against a trivially-zero result masking a broken kernel.
    assert_ne!(msm_result, Point::zero());
    assert_ne!(h_commit_result, Point::zero());
}
#[test]
fn test_batch_commit() {
    // Batched device-side commit must agree with the batched host-slice MSM.
    let batch_size = 4;
    let test_size = 1 << 12;
    let seed = Some(0);
    let (scalars, mut d_scalars, _) = set_up_scalars(test_size * batch_size, 0, false);
    let points = generate_random_points(test_size * batch_size, get_rng(seed));
    let mut d_points = DeviceBuffer::from_slice(&points[..]).unwrap();
    let msm_result = msm_batch(&points, &scalars, batch_size, 0);
    let mut d_commit_result = commit_batch(&mut d_points, &mut d_scalars, batch_size);
    let mut h_commit_result: Vec<Point> = (0..batch_size).map(|_| Point::zero()).collect();
    d_commit_result.copy_to(&mut h_commit_result[..]).unwrap();
    assert_eq!(msm_result, h_commit_result);
    // No batch result should be the identity point.
    for h in h_commit_result {
        assert_ne!(h, Point::zero());
    }
}
#[test]
fn test_ntt() {
    // Scalar NTT round-trip, then ECNTT validated against the naive reference.
    //NTT
    let seed = None; //some value to fix the rng
    let test_size = 1 << 3;
    let scalars = generate_random_scalars(test_size, get_rng(seed));
    let mut ntt_result = scalars.clone();
    ntt(&mut ntt_result, 0);
    assert_ne!(ntt_result, scalars);
    let mut intt_result = ntt_result.clone();
    intt(&mut intt_result, 0);
    // Inverse NTT restores the original scalars.
    assert_eq!(intt_result, scalars);
    //ECNTT
    let points_proj = generate_random_points_proj(test_size, get_rng(seed));
    test_naive_ark_ecntt(test_size);
    assert!(points_proj[0].to_ark().into_affine().is_on_curve());
    //naive ark
    let points_proj_ark = points_proj
        .iter()
        .map(|p| p.to_ark())
        .collect::<Vec<G1Projective>>();
    let ecntt_result_naive = ecntt_arc_naive(&points_proj_ark, points_proj_ark.len(), false);
    let iecntt_result_naive = ecntt_arc_naive(&ecntt_result_naive, points_proj_ark.len(), true);
    assert_eq!(points_proj_ark, iecntt_result_naive);
    //ingo gpu
    let mut ecntt_result = points_proj.to_vec();
    ecntt(&mut ecntt_result, 0);
    assert_ne!(ecntt_result, points_proj);
    let mut iecntt_result = ecntt_result.clone();
    iecntt(&mut iecntt_result, 0);
    // Naive round-trip agrees with the original points (projective == affine
    // comparison uses arkworks' cross-representation PartialEq).
    assert_eq!(
        iecntt_result_naive,
        points_proj
            .iter()
            .map(|p| p.to_ark_affine())
            .collect::<Vec<G1Affine>>()
    );
    // GPU round-trip agrees with the original points in affine form.
    assert_eq!(
        iecntt_result
            .iter()
            .map(|p| p.to_ark_affine())
            .collect::<Vec<G1Affine>>(),
        points_proj
            .iter()
            .map(|p| p.to_ark_affine())
            .collect::<Vec<G1Affine>>()
    );
}
#[test]
fn test_ntt_batch() {
    // Batched (EC)NTT must agree chunk-by-chunk with the single-vector
    // transforms, and the inverse must round-trip the whole batch.
    //NTT
    let seed = None; //some value to fix the rng
    let test_size = 1 << 5;
    let batches = 4;
    let scalars_batch: Vec<Scalar> =
        generate_random_scalars(test_size * batches, get_rng(seed));
    let mut scalar_vec_of_vec: Vec<Vec<Scalar>> = Vec::new();
    for i in 0..batches {
        scalar_vec_of_vec.push(scalars_batch[i * test_size..(i + 1) * test_size].to_vec());
    }
    let mut ntt_result = scalars_batch.clone();
    // do batch ntt
    ntt_batch(&mut ntt_result, test_size, 0);
    let mut ntt_result_vec_of_vec = Vec::new();
    // do ntt for every chunk
    for i in 0..batches {
        ntt_result_vec_of_vec.push(scalar_vec_of_vec[i].clone());
        ntt(&mut ntt_result_vec_of_vec[i], 0);
    }
    // check that the ntt of each vec of scalars is equal to the intt of the specific batch
    for i in 0..batches {
        assert_eq!(
            ntt_result_vec_of_vec[i],
            ntt_result[i * test_size..(i + 1) * test_size]
        );
    }
    // check that ntt output is different from input
    assert_ne!(ntt_result, scalars_batch);
    let mut intt_result = ntt_result.clone();
    // do batch intt
    intt_batch(&mut intt_result, test_size, 0);
    let mut intt_result_vec_of_vec = Vec::new();
    // do intt for every chunk
    for i in 0..batches {
        intt_result_vec_of_vec.push(ntt_result_vec_of_vec[i].clone());
        intt(&mut intt_result_vec_of_vec[i], 0);
    }
    // check that the intt of each vec of scalars is equal to the intt of the specific batch
    for i in 0..batches {
        assert_eq!(
            intt_result_vec_of_vec[i],
            intt_result[i * test_size..(i + 1) * test_size]
        );
    }
    assert_eq!(intt_result, scalars_batch);
    // //ECNTT — same structure, with point vectors.
    let points_proj = generate_random_points_proj(test_size * batches, get_rng(seed));
    let mut points_vec_of_vec: Vec<Vec<Point>> = Vec::new();
    for i in 0..batches {
        points_vec_of_vec.push(points_proj[i * test_size..(i + 1) * test_size].to_vec());
    }
    let mut ntt_result_points = points_proj.clone();
    // do batch ecintt
    ecntt_batch(&mut ntt_result_points, test_size, 0);
    let mut ntt_result_points_vec_of_vec = Vec::new();
    for i in 0..batches {
        ntt_result_points_vec_of_vec.push(points_vec_of_vec[i].clone());
        ecntt(&mut ntt_result_points_vec_of_vec[i], 0);
    }
    for i in 0..batches {
        assert_eq!(
            ntt_result_points_vec_of_vec[i],
            ntt_result_points[i * test_size..(i + 1) * test_size]
        );
    }
    assert_ne!(ntt_result_points, points_proj);
    let mut intt_result_points = ntt_result_points.clone();
    // do batch ecintt
    iecntt_batch(&mut intt_result_points, test_size, 0);
    let mut intt_result_points_vec_of_vec = Vec::new();
    // do ecintt for every chunk
    for i in 0..batches {
        intt_result_points_vec_of_vec.push(ntt_result_points_vec_of_vec[i].clone());
        iecntt(&mut intt_result_points_vec_of_vec[i], 0);
    }
    // check that the ecintt of each vec of scalars is equal to the intt of the specific batch
    for i in 0..batches {
        assert_eq!(
            intt_result_points_vec_of_vec[i],
            intt_result_points[i * test_size..(i + 1) * test_size]
        );
    }
    assert_eq!(intt_result_points, points_proj);
}
#[test]
fn test_scalar_interpolation() {
    // Device-side interpolation (bit-reverse + interpolate) must match the
    // host-slice inverse NTT on the same evaluations.
    let log_test_size = 7;
    let test_size = 1 << log_test_size;
    let (mut evals_mut, mut d_evals, mut d_domain) = set_up_scalars(test_size, log_test_size, true);
    reverse_order_scalars(&mut d_evals);
    let mut d_coeffs = interpolate_scalars(&mut d_evals, &mut d_domain);
    intt(&mut evals_mut, 0);
    let mut h_coeffs: Vec<Scalar> = (0..test_size).map(|_| Scalar::zero()).collect();
    d_coeffs.copy_to(&mut h_coeffs[..]).unwrap();
    assert_eq!(h_coeffs, evals_mut);
}
#[test]
fn test_scalar_batch_interpolation() {
    // Batched device interpolation vs. batched host-slice inverse NTT.
    let batch_size = 4;
    let log_test_size = 10;
    let test_size = 1 << log_test_size;
    let (mut evals_mut, mut d_evals, mut d_domain) = set_up_scalars(test_size * batch_size, log_test_size, true);
    reverse_order_scalars_batch(&mut d_evals, batch_size);
    let mut d_coeffs = interpolate_scalars_batch(&mut d_evals, &mut d_domain, batch_size);
    intt_batch(&mut evals_mut, test_size, 0);
    let mut h_coeffs: Vec<Scalar> = (0..test_size * batch_size).map(|_| Scalar::zero()).collect();
    d_coeffs.copy_to(&mut h_coeffs[..]).unwrap();
    assert_eq!(h_coeffs, evals_mut);
}
#[test]
fn test_point_interpolation() {
    // Device-side point interpolation must match the host-slice inverse ECNTT.
    let log_test_size = 6;
    let test_size = 1 << log_test_size;
    let (mut evals_mut, mut d_evals, mut d_domain) = set_up_points(test_size, log_test_size, true);
    reverse_order_points(&mut d_evals);
    let mut d_coeffs = interpolate_points(&mut d_evals, &mut d_domain);
    iecntt(&mut evals_mut[..], 0);
    let mut h_coeffs: Vec<Point> = (0..test_size).map(|_| Point::zero()).collect();
    d_coeffs.copy_to(&mut h_coeffs[..]).unwrap();
    assert_eq!(h_coeffs, *evals_mut);
    // No coefficient should be the identity point.
    for h in h_coeffs.iter() {
        assert_ne!(*h, Point::zero());
    }
}
#[test]
fn test_point_batch_interpolation() {
let batch_size = 4;
let log_test_size = 6;
let test_size = 1 << log_test_size;
let (mut evals_mut, mut d_evals, mut d_domain) = set_up_points(test_size * batch_size, log_test_size, true);
reverse_order_points_batch(&mut d_evals, batch_size);
let mut d_coeffs = interpolate_points_batch(&mut d_evals, &mut d_domain, batch_size);
iecntt_batch(&mut evals_mut[..], test_size, 0);
let mut h_coeffs: Vec<Point> = (0..test_size * batch_size).map(|_| Point::zero()).collect();
d_coeffs.copy_to(&mut h_coeffs[..]).unwrap();
assert_eq!(h_coeffs, *evals_mut);
for h in h_coeffs.iter() {
assert_ne!(*h, Point::zero());
}
}
#[test]
fn test_scalar_evaluation() {
let log_test_domain_size = 8;
let coeff_size = 1 << 6;
let (h_coeffs, mut d_coeffs, mut d_domain) = set_up_scalars(coeff_size, log_test_domain_size, false);
let (_, _, mut d_domain_inv) = set_up_scalars(0, log_test_domain_size, true);
let mut d_evals = evaluate_scalars(&mut d_coeffs, &mut d_domain);
let mut d_coeffs_domain = interpolate_scalars(&mut d_evals, &mut d_domain_inv);
let mut h_coeffs_domain: Vec<Scalar> = (0..1 << log_test_domain_size).map(|_| Scalar::zero()).collect();
d_coeffs_domain.copy_to(&mut h_coeffs_domain[..]).unwrap();
assert_eq!(h_coeffs, h_coeffs_domain[..coeff_size]);
for i in coeff_size.. (1 << log_test_domain_size) {
assert_eq!(Scalar::zero(), h_coeffs_domain[i]);
}
}
#[test]
fn test_scalar_batch_evaluation() {
let batch_size = 6;
let log_test_domain_size = 8;
let domain_size = 1 << log_test_domain_size;
let coeff_size = 1 << 6;
let (h_coeffs, mut d_coeffs, mut d_domain) = set_up_scalars(coeff_size * batch_size, log_test_domain_size, false);
let (_, _, mut d_domain_inv) = set_up_scalars(0, log_test_domain_size, true);
let mut d_evals = evaluate_scalars_batch(&mut d_coeffs, &mut d_domain, batch_size);
let mut d_coeffs_domain = interpolate_scalars_batch(&mut d_evals, &mut d_domain_inv, batch_size);
let mut h_coeffs_domain: Vec<Scalar> = (0..domain_size * batch_size).map(|_| Scalar::zero()).collect();
d_coeffs_domain.copy_to(&mut h_coeffs_domain[..]).unwrap();
for j in 0..batch_size {
assert_eq!(h_coeffs[j * coeff_size..(j + 1) * coeff_size], h_coeffs_domain[j * domain_size..j * domain_size + coeff_size]);
for i in coeff_size..domain_size {
assert_eq!(Scalar::zero(), h_coeffs_domain[j * domain_size + i]);
}
}
}
#[test]
fn test_point_evaluation() {
let log_test_domain_size = 7;
let coeff_size = 1 << 7;
let (h_coeffs, mut d_coeffs, mut d_domain) = set_up_points(coeff_size, log_test_domain_size, false);
let (_, _, mut d_domain_inv) = set_up_points(0, log_test_domain_size, true);
let mut d_evals = evaluate_points(&mut d_coeffs, &mut d_domain);
let mut d_coeffs_domain = interpolate_points(&mut d_evals, &mut d_domain_inv);
let mut h_coeffs_domain: Vec<Point> = (0..1 << log_test_domain_size).map(|_| Point::zero()).collect();
d_coeffs_domain.copy_to(&mut h_coeffs_domain[..]).unwrap();
assert_eq!(h_coeffs[..], h_coeffs_domain[..coeff_size]);
for i in coeff_size..(1 << log_test_domain_size) {
assert_eq!(Point::zero(), h_coeffs_domain[i]);
}
for i in 0..coeff_size {
assert_ne!(h_coeffs_domain[i], Point::zero());
}
}
#[test]
fn test_point_batch_evaluation() {
let batch_size = 4;
let log_test_domain_size = 6;
let domain_size = 1 << log_test_domain_size;
let coeff_size = 1 << 5;
let (h_coeffs, mut d_coeffs, mut d_domain) = set_up_points(coeff_size * batch_size, log_test_domain_size, false);
let (_, _, mut d_domain_inv) = set_up_points(0, log_test_domain_size, true);
let mut d_evals = evaluate_points_batch(&mut d_coeffs, &mut d_domain, batch_size);
let mut d_coeffs_domain = interpolate_points_batch(&mut d_evals, &mut d_domain_inv, batch_size);
let mut h_coeffs_domain: Vec<Point> = (0..domain_size * batch_size).map(|_| Point::zero()).collect();
d_coeffs_domain.copy_to(&mut h_coeffs_domain[..]).unwrap();
for j in 0..batch_size {
assert_eq!(h_coeffs[j * coeff_size..(j + 1) * coeff_size], h_coeffs_domain[j * domain_size..(j * domain_size + coeff_size)]);
for i in coeff_size..domain_size {
assert_eq!(Point::zero(), h_coeffs_domain[j * domain_size + i]);
}
for i in j * domain_size..(j * domain_size + coeff_size) {
assert_ne!(h_coeffs_domain[i], Point::zero());
}
}
}
#[test]
fn test_scalar_evaluation_on_trivial_coset() {
    // checks that the evaluations on the subgroup is the same as on the coset generated by 1
    let log_test_domain_size = 8;
    let coeff_size = 1 << 6;
    let (_, mut d_coeffs, mut d_domain) = set_up_scalars(coeff_size, log_test_domain_size, false);
    let (_, _, mut d_domain_inv) = set_up_scalars(coeff_size, log_test_domain_size, true);
    // logn = 0 presumably yields the trivial coset (all powers equal one) —
    // confirm against the CUDA `build_domain` implementation.
    let mut d_trivial_coset_powers = build_domain(1 << log_test_domain_size, 0, false);
    let mut d_evals = evaluate_scalars(&mut d_coeffs, &mut d_domain);
    let mut h_coeffs: Vec<Scalar> = (0..1 << log_test_domain_size).map(|_| Scalar::zero()).collect();
    d_evals.copy_to(&mut h_coeffs[..]).unwrap();
    let mut d_evals_coset = evaluate_scalars_on_coset(&mut d_coeffs, &mut d_domain, &mut d_trivial_coset_powers);
    let mut h_evals_coset: Vec<Scalar> = (0..1 << log_test_domain_size).map(|_| Scalar::zero()).collect();
    d_evals_coset.copy_to(&mut h_evals_coset[..]).unwrap();
    assert_eq!(h_coeffs, h_evals_coset);
}

#[test]
fn test_scalar_evaluation_on_coset() {
    // checks that evaluating a polynomial on a subgroup and its coset is the same as evaluating on a 2x larger subgroup
    let log_test_size = 8;
    let test_size = 1 << log_test_size;
    let (_, mut d_coeffs, mut d_domain) = set_up_scalars(test_size, log_test_size, false);
    let (_, _, mut d_large_domain) = set_up_scalars(0, log_test_size + 1, false);
    let mut d_coset_powers = build_domain(test_size, log_test_size + 1, false);
    // Reference: evaluations on the 2x larger subgroup.
    let mut d_evals_large = evaluate_scalars(&mut d_coeffs, &mut d_large_domain);
    let mut h_evals_large: Vec<Scalar> = (0..2 * test_size).map(|_| Scalar::zero()).collect();
    d_evals_large.copy_to(&mut h_evals_large[..]).unwrap();
    let mut d_evals = evaluate_scalars(&mut d_coeffs, &mut d_domain);
    let mut h_evals: Vec<Scalar> = (0..test_size).map(|_| Scalar::zero()).collect();
    d_evals.copy_to(&mut h_evals[..]).unwrap();
    let mut d_evals_coset = evaluate_scalars_on_coset(&mut d_coeffs, &mut d_domain, &mut d_coset_powers);
    let mut h_evals_coset: Vec<Scalar> = (0..test_size).map(|_| Scalar::zero()).collect();
    d_evals_coset.copy_to(&mut h_evals_coset[..]).unwrap();
    // Large-domain evaluations split into subgroup half and coset half.
    assert_eq!(h_evals[..], h_evals_large[..test_size]);
    assert_eq!(h_evals_coset[..], h_evals_large[test_size..2 * test_size]);
}

#[test]
fn test_scalar_batch_evaluation_on_coset() {
    // checks that evaluating a polynomial on a subgroup and its coset is the same as evaluating on a 2x larger subgroup
    let batch_size = 4;
    let log_test_size = 6;
    let test_size = 1 << log_test_size;
    let (_, mut d_coeffs, mut d_domain) = set_up_scalars(test_size * batch_size, log_test_size, false);
    let (_, _, mut d_large_domain) = set_up_scalars(0, log_test_size + 1, false);
    let mut d_coset_powers = build_domain(test_size, log_test_size + 1, false);
    let mut d_evals_large = evaluate_scalars_batch(&mut d_coeffs, &mut d_large_domain, batch_size);
    let mut h_evals_large: Vec<Scalar> = (0..2 * test_size * batch_size).map(|_| Scalar::zero()).collect();
    d_evals_large.copy_to(&mut h_evals_large[..]).unwrap();
    let mut d_evals = evaluate_scalars_batch(&mut d_coeffs, &mut d_domain, batch_size);
    let mut h_evals: Vec<Scalar> = (0..test_size * batch_size).map(|_| Scalar::zero()).collect();
    d_evals.copy_to(&mut h_evals[..]).unwrap();
    let mut d_evals_coset = evaluate_scalars_on_coset_batch(&mut d_coeffs, &mut d_domain, batch_size, &mut d_coset_powers);
    let mut h_evals_coset: Vec<Scalar> = (0..test_size * batch_size).map(|_| Scalar::zero()).collect();
    d_evals_coset.copy_to(&mut h_evals_coset[..]).unwrap();
    // Each batch's large-domain chunk splits into subgroup and coset halves.
    for i in 0..batch_size {
        assert_eq!(h_evals_large[2 * i * test_size..(2 * i + 1) * test_size], h_evals[i * test_size..(i + 1) * test_size]);
        assert_eq!(h_evals_large[(2 * i + 1) * test_size..(2 * i + 2) * test_size], h_evals_coset[i * test_size..(i + 1) * test_size]);
    }
}
#[test]
fn test_point_evaluation_on_coset() {
    // checks that evaluating a polynomial on a subgroup and its coset is the same as evaluating on a 2x larger subgroup
    let log_test_size = 8;
    let test_size = 1 << log_test_size;
    let (_, mut d_coeffs, mut d_domain) = set_up_points(test_size, log_test_size, false);
    let (_, _, mut d_large_domain) = set_up_points(0, log_test_size + 1, false);
    let mut d_coset_powers = build_domain(test_size, log_test_size + 1, false);
    // Reference: evaluations on the 2x larger subgroup.
    let mut d_evals_large = evaluate_points(&mut d_coeffs, &mut d_large_domain);
    let mut h_evals_large: Vec<Point> = (0..2 * test_size).map(|_| Point::zero()).collect();
    d_evals_large.copy_to(&mut h_evals_large[..]).unwrap();
    let mut d_evals = evaluate_points(&mut d_coeffs, &mut d_domain);
    let mut h_evals: Vec<Point> = (0..test_size).map(|_| Point::zero()).collect();
    d_evals.copy_to(&mut h_evals[..]).unwrap();
    let mut d_evals_coset = evaluate_points_on_coset(&mut d_coeffs, &mut d_domain, &mut d_coset_powers);
    let mut h_evals_coset: Vec<Point> = (0..test_size).map(|_| Point::zero()).collect();
    d_evals_coset.copy_to(&mut h_evals_coset[..]).unwrap();
    assert_eq!(h_evals[..], h_evals_large[..test_size]);
    assert_eq!(h_evals_coset[..], h_evals_large[test_size..2 * test_size]);
    // Sanity check: every evaluation is a nontrivial point.
    for i in 0..test_size {
        assert_ne!(h_evals[i], Point::zero());
        assert_ne!(h_evals_coset[i], Point::zero());
        assert_ne!(h_evals_large[2 * i], Point::zero());
        assert_ne!(h_evals_large[2 * i + 1], Point::zero());
    }
}

#[test]
fn test_point_batch_evaluation_on_coset() {
    // checks that evaluating a polynomial on a subgroup and its coset is the same as evaluating on a 2x larger subgroup
    let batch_size = 2;
    let log_test_size = 6;
    let test_size = 1 << log_test_size;
    let (_, mut d_coeffs, mut d_domain) = set_up_points(test_size * batch_size, log_test_size, false);
    let (_, _, mut d_large_domain) = set_up_points(0, log_test_size + 1, false);
    let mut d_coset_powers = build_domain(test_size, log_test_size + 1, false);
    let mut d_evals_large = evaluate_points_batch(&mut d_coeffs, &mut d_large_domain, batch_size);
    let mut h_evals_large: Vec<Point> = (0..2 * test_size * batch_size).map(|_| Point::zero()).collect();
    d_evals_large.copy_to(&mut h_evals_large[..]).unwrap();
    let mut d_evals = evaluate_points_batch(&mut d_coeffs, &mut d_domain, batch_size);
    let mut h_evals: Vec<Point> = (0..test_size * batch_size).map(|_| Point::zero()).collect();
    d_evals.copy_to(&mut h_evals[..]).unwrap();
    let mut d_evals_coset = evaluate_points_on_coset_batch(&mut d_coeffs, &mut d_domain, batch_size, &mut d_coset_powers);
    let mut h_evals_coset: Vec<Point> = (0..test_size * batch_size).map(|_| Point::zero()).collect();
    d_evals_coset.copy_to(&mut h_evals_coset[..]).unwrap();
    // Each batch's large-domain chunk splits into subgroup and coset halves.
    for i in 0..batch_size {
        assert_eq!(h_evals_large[2 * i * test_size..(2 * i + 1) * test_size], h_evals[i * test_size..(i + 1) * test_size]);
        assert_eq!(h_evals_large[(2 * i + 1) * test_size..(2 * i + 2) * test_size], h_evals_coset[i * test_size..(i + 1) * test_size]);
    }
    // Sanity check: every evaluation (written as index pairs over the large
    // buffer) is a nontrivial point.
    for i in 0..test_size * batch_size {
        assert_ne!(h_evals[i], Point::zero());
        assert_ne!(h_evals_coset[i], Point::zero());
        assert_ne!(h_evals_large[2 * i], Point::zero());
        assert_ne!(h_evals_large[2 * i + 1], Point::zero());
    }
}
// testing matrix multiplication by comparing the result of FFT with the naive multiplication by the DFT matrix
#[test]
fn test_matrix_multiplication() {
    let seed = None; // None -> unseeded rng; pass Some(value) to make the test reproducible
    let test_size = 1 << 5;
    // Build the DFT matrix from powers of a root of unity: M[row][col] = rou^(row*col).
    let rou = Fr::get_root_of_unity(test_size).unwrap();
    let matrix_flattened: Vec<Scalar> = (0..test_size).map(
        |row_num| { (0..test_size).map(
            |col_num| {
                let pow: [u64; 1] = [(row_num * col_num).try_into().unwrap()];
                Scalar::from_ark(Fr::pow(&rou, &pow).into_repr())
            }).collect::<Vec<Scalar>>()
        }).flatten().collect::<Vec<_>>();
    let vector: Vec<Scalar> = generate_random_scalars(test_size, get_rng(seed));
    let result = mult_matrix_by_vec(&matrix_flattened, &vector, 0);
    let mut ntt_result = vector.clone();
    ntt(&mut ntt_result, 0);
    // we don't use the same roots of unity as arkworks, so the results are permutations
    // of one another and the only guaranteed fixed scalars are the following ones:
    assert_eq!(result[0], ntt_result[0]);
    assert_eq!(result[test_size >> 1], ntt_result[test_size >> 1]);
}
#[test]
fn test_vec_scalar_mul() {
    // Element-wise scalar-vector product: [1, 1, 0] ⊙ [1, 0, 0] == [1, 0, 0].
    // Dropped the stale `#[allow(non_snake_case)]` (no identifier here violates
    // snake_case) and fixed the `intoo` typo to `inout`, matching the sibling
    // `test_vec_point_mul`.
    let mut inout = [Scalar::one(), Scalar::one(), Scalar::zero()];
    let expected = [Scalar::one(), Scalar::zero(), Scalar::zero()];
    mult_sc_vec(&mut inout, &expected, 0);
    assert_eq!(inout, expected);
}
#[test]
fn test_vec_point_mul() {
    // Element-wise point-by-scalar product: multiplying by one keeps the point,
    // multiplying by zero yields the point at infinity. Dropped the stale
    // `#[allow(non_snake_case)]` — no identifier here violates snake_case.
    let dummy_one = Point {
        x: Base::one(),
        y: Base::one(),
        z: Base::one(),
    };
    let mut inout = [dummy_one, dummy_one, Point::zero()];
    let scalars = [Scalar::one(), Scalar::zero(), Scalar::zero()];
    let expected = [dummy_one, Point::zero(), Point::zero()];
    multp_vec(&mut inout, &scalars, 0);
    assert_eq!(inout, expected);
}
}

34
bn254/Cargo.toml Normal file
View File

@@ -0,0 +1,34 @@
[package]
name = "bn254"
version = "0.1.0"
edition = "2021"
authors = [ "Ingonyama" ]

[dependencies]
icicle-core = { path = "../icicle-core" }
# TODO(review): pin the wildcard versions ("*") below to concrete ranges for
# reproducible builds.
hex = "*"
ark-std = "0.3.0"
ark-ff = "0.3.0"
ark-poly = "0.3.0"
ark-ec = { version = "0.3.0", features = [ "parallel" ] }
ark-bn254 = "0.3.0"
serde = { version = "1.0", features = ["derive"] }
serde_derive = "1.0"
serde_cbor = "0.11.2"
rustacuda = "0.1"
rustacuda_core = "0.1"
rustacuda_derive = "0.1"
rand = "*" #TODO: move rand and ark dependencies to dev once random scalar/point generation is done "natively"

[build-dependencies]
# Used by build.rs to compile the CUDA sources.
cc = { version = "1.0", features = ["parallel"] }

[dev-dependencies]
"criterion" = "0.4.0"

[features]
# Enables G2 support in the native library (build.rs defines G2_DEFINED).
g2 = []

36
bn254/build.rs Normal file
View File

@@ -0,0 +1,36 @@
use std::env;
/// Build script: compiles the CUDA sources into the static `ingo_icicle`
/// library that the Rust FFI bindings link against.
fn main() {
    //TODO: check cargo features selected
    //TODO: can conflict/duplicate with make ?
    // Rebuild when the compiler flags or the CUDA sources change.
    println!("cargo:rerun-if-env-changed=CXXFLAGS");
    println!("cargo:rerun-if-changed=./icicle");
    // GPU architecture and CUDA default-stream mode are overridable via the
    // ARCH_TYPE / DEFAULT_STREAM environment variables.
    let arch_type = env::var("ARCH_TYPE").unwrap_or(String::from("native"));
    let stream_type = env::var("DEFAULT_STREAM").unwrap_or(String::from("legacy"));
    let mut arch = String::from("-arch=");
    arch.push_str(&arch_type);
    let mut stream = String::from("-default-stream=");
    stream.push_str(&stream_type);
    let mut nvcc = cc::Build::new();
    println!("Compiling icicle library using arch: {}", &arch);
    // The `g2` cargo feature toggles G2 code paths in the CUDA sources.
    if cfg!(feature = "g2") {
        nvcc.define("G2_DEFINED", None);
    }
    nvcc.cuda(true);
    nvcc.define("FEATURE_BN254", None);
    nvcc.debug(false);
    nvcc.flag(&arch);
    nvcc.flag(&stream);
    nvcc.shared_flag(false);
    // nvcc.static_flag(true);
    nvcc.files([
        "../icicle-cuda/curves/index.cu",
    ]);
    nvcc.compile("ingo_icicle"); //TODO: extension??
}

View File

@@ -0,0 +1,4 @@
pub trait Field<const NUM_LIMBS: usize> {
const MODOLUS: [u32;NUM_LIMBS];
const LIMBS: usize = NUM_LIMBS;
}

View File

@@ -0,0 +1,3 @@
pub mod field;  // `Field` trait: compile-time limb count and modulus.
pub mod scalar; // Generic limb-based scalar type over a `Field`.
pub mod point;  // Projective / affine point types over a scalar type.

View File

@@ -0,0 +1,108 @@
use std::ffi::c_uint;
use ark_bn254::{Fq as Fq_BN254, Fr as Fr_BN254, G1Affine as G1Affine_BN254, G1Projective as G1Projective_BN254};
use ark_ec::AffineCurve;
use ark_ff::{BigInteger256, PrimeField};
use std::mem::transmute;
use ark_ff::Field;
use icicle_core::utils::{u32_vec_to_u64_vec, u64_vec_to_u32_vec};
use rustacuda_core::DeviceCopy;
use rustacuda_derive::DeviceCopy;
use super::scalar::{get_fixed_limbs, self};
/// Projective point with coordinates (x, y, z) in the base field type `BF`.
#[derive(Debug, Clone, Copy, DeviceCopy)]
#[repr(C)]
pub struct PointT<BF: scalar::ScalarTrait> {
    pub x: BF,
    pub y: BF,
    pub z: BF,
}

impl<BF: DeviceCopy + scalar::ScalarTrait> Default for PointT<BF> {
    // Default is the point at infinity.
    fn default() -> Self {
        PointT::zero()
    }
}

impl<BF: DeviceCopy + scalar::ScalarTrait> PointT<BF> {
    /// The projective triple (0, 1, 0) — the conventional representation of
    /// the point at infinity.
    pub fn zero() -> Self {
        PointT {
            x: BF::zero(),
            y: BF::one(),
            z: BF::zero(),
        }
    }

    /// Alias for [`PointT::zero`].
    pub fn infinity() -> Self {
        Self::zero()
    }
}
/// Affine point (x, y); the point at infinity is not representable here.
#[derive(Debug, PartialEq, Clone, Copy, DeviceCopy)]
#[repr(C)]
pub struct PointAffineNoInfinityT<BF> {
    pub x: BF,
    pub y: BF,
}

impl<BF: scalar::ScalarTrait> Default for PointAffineNoInfinityT<BF> {
    // Default is (0, 0). NOTE(review): this is not a curve point for typical
    // short-Weierstrass curves — confirm intended use.
    fn default() -> Self {
        PointAffineNoInfinityT {
            x: BF::zero(),
            y: BF::zero(),
        }
    }
}

impl<BF: Copy + scalar::ScalarTrait> PointAffineNoInfinityT<BF> {
    /// Builds the point from the u32 limbs of x and y.
    pub fn from_limbs(x: &[u32], y: &[u32]) -> Self {
        PointAffineNoInfinityT {
            x: BF::from_limbs(x),
            y: BF::from_limbs(y)
        }
    }

    /// Flattened limbs of x followed by the limbs of y.
    pub fn limbs(&self) -> Vec<u32> {
        [self.x.limbs(), self.y.limbs()].concat()
    }

    /// Lifts to projective coordinates by appending z = 1.
    pub fn to_projective(&self) -> PointT<BF> {
        PointT {
            x: self.x,
            y: self.y,
            z: BF::one(),
        }
    }
}
impl<BF: Copy + scalar::ScalarTrait> PointT<BF> {
    /// Builds a projective point from the u32 limbs of x, y and z.
    pub fn from_limbs(x: &[u32], y: &[u32], z: &[u32]) -> Self {
        PointT {
            x: BF::from_limbs(x),
            y: BF::from_limbs(y),
            z: BF::from_limbs(z)
        }
    }

    /// Builds a point from one flat limb slice laid out as x ‖ y ‖ z.
    ///
    /// Panics unless the slice holds exactly `3 * BF::base_limbs()` limbs.
    pub fn from_xy_limbs(value: &[u32]) -> PointT<BF> {
        let l = value.len();
        assert_eq!(l, 3 * BF::base_limbs(), "length must be 3 * {}", BF::base_limbs());
        PointT {
            x: BF::from_limbs(value[..BF::base_limbs()].try_into().unwrap()),
            y: BF::from_limbs(value[BF::base_limbs()..BF::base_limbs() * 2].try_into().unwrap()),
            z: BF::from_limbs(value[BF::base_limbs() * 2..].try_into().unwrap())
        }
    }

    /// Drops the z coordinate without normalizing — this is NOT a
    /// projective-to-affine conversion unless z is already one.
    pub fn to_xy_strip_z(&self) -> PointAffineNoInfinityT<BF> {
        PointAffineNoInfinityT {
            x: self.x,
            y: self.y,
        }
    }
}

View File

@@ -0,0 +1,102 @@
use std::ffi::{c_int, c_uint};
use rand::{rngs::StdRng, RngCore, SeedableRng};
use rustacuda_core::DeviceCopy;
use rustacuda_derive::DeviceCopy;
use std::mem::transmute;
use rustacuda::prelude::*;
use rustacuda_core::DevicePointer;
use rustacuda::memory::{DeviceBox, CopyDestination};
use icicle_core::utils::{u32_vec_to_u64_vec, u64_vec_to_u32_vec};
use std::marker::PhantomData;
use std::convert::TryInto;
use super::field::{Field, self};
/// Copies `val` into a fixed-size limb array of length `NUM_LIMBS`.
///
/// Shorter inputs are zero-padded on the high end (little-endian limb order);
/// an input of exactly `NUM_LIMBS` limbs is converted directly.
///
/// # Panics
/// Panics if `val` contains more than `NUM_LIMBS` elements.
pub fn get_fixed_limbs<const NUM_LIMBS: usize>(val: &[u32]) -> [u32; NUM_LIMBS] {
    match val.len() {
        // Zero-pad short inputs; reuse the matched length instead of
        // re-reading `val.len()`.
        n if n < NUM_LIMBS => {
            let mut padded = [0u32; NUM_LIMBS];
            padded[..n].copy_from_slice(val);
            padded
        }
        n if n == NUM_LIMBS => val.try_into().unwrap(),
        _ => panic!("slice has too many elements"),
    }
}
/// Operations every limb-based scalar type must provide.
pub trait ScalarTrait {
    /// Number of u32 limbs in one element.
    fn base_limbs() -> usize;
    /// Additive identity.
    fn zero() -> Self;
    /// Builds an element from up to `base_limbs()` limbs (see implementors for
    /// padding behavior).
    fn from_limbs(value: &[u32]) -> Self;
    /// Multiplicative identity.
    fn one() -> Self;
    /// Little-endian byte serialization of the limbs.
    fn to_bytes_le(&self) -> Vec<u8>;
    /// Borrow of the raw limbs.
    fn limbs(&self) -> &[u32];
}

/// Scalar stored as `NUM_LIMBS` u32 limbs; the marker type `M` ties the value
/// to its field at compile time.
#[derive(Debug, PartialEq, Clone, Copy)]
#[repr(C)]
pub struct ScalarT<M, const NUM_LIMBS: usize> {
    // Zero-sized marker carrying the field type.
    pub(crate) phantom: PhantomData<M>,
    pub(crate) value: [u32; NUM_LIMBS]
}
impl<M, const NUM_LIMBS: usize> ScalarTrait for ScalarT<M, NUM_LIMBS>
where
    M: Field<NUM_LIMBS>,
{
    /// Number of u32 limbs in one element.
    fn base_limbs() -> usize {
        // Idiom: expression instead of explicit `return`.
        NUM_LIMBS
    }

    /// All-zero limbs — the additive identity.
    fn zero() -> Self {
        ScalarT {
            value: [0u32; NUM_LIMBS],
            phantom: PhantomData,
        }
    }

    /// Builds a scalar from up to `NUM_LIMBS` limbs, zero-padding the rest
    /// (see `get_fixed_limbs`); panics on oversized input.
    fn from_limbs(value: &[u32]) -> Self {
        Self {
            value: get_fixed_limbs(value),
            phantom: PhantomData,
        }
    }

    /// Lowest limb set to 1, the rest zero.
    fn one() -> Self {
        let mut s = [0u32; NUM_LIMBS];
        s[0] = 1;
        ScalarT { value: s, phantom: PhantomData }
    }

    /// Serializes the limbs to little-endian bytes.
    fn to_bytes_le(&self) -> Vec<u8> {
        // Idiom: `flat_map` instead of `map(..).flatten()`, and no per-limb
        // intermediate Vec allocation.
        self.value
            .iter()
            .flat_map(|s| s.to_le_bytes())
            .collect()
    }

    fn limbs(&self) -> &[u32] {
        &self.value
    }
}
impl<M, const NUM_LIMBS: usize> ScalarT<M, NUM_LIMBS>
where
    M: field::Field<NUM_LIMBS>,
{
    /// Builds a scalar from little-endian u32 limbs.
    pub fn from_limbs_le(value: &[u32]) -> ScalarT<M, NUM_LIMBS> {
        Self::from_limbs(value)
    }

    /// Builds a scalar from big-endian u32 limbs.
    pub fn from_limbs_be(value: &[u32]) -> ScalarT<M, NUM_LIMBS> {
        let mut value = value.to_vec();
        value.reverse();
        Self::from_limbs_le(&value)
    }

    // Additional Functions

    /// Limb-wise addition with carry propagation (little-endian limb order).
    ///
    /// Fixes the previous placeholder, which filled *every* limb with
    /// `self.value[0] + other.value[0]` (and could overflow-panic in debug
    /// builds). NOTE(review): no modular reduction is performed —
    /// `Field::MODOLUS` is still a zero placeholder — and a carry out of the
    /// top limb is dropped, i.e. the result wraps modulo 2^(32*NUM_LIMBS).
    pub fn add(&self, other: ScalarT<M, NUM_LIMBS>) -> ScalarT<M, NUM_LIMBS> { // overload +
        let mut value = [0u32; NUM_LIMBS];
        let mut carry = 0u64;
        for i in 0..NUM_LIMBS {
            // Widen to u64 so limb sum + carry cannot overflow.
            let sum = self.value[i] as u64 + other.value[i] as u64 + carry;
            value[i] = sum as u32;
            carry = sum >> 32;
        }
        ScalarT { value, phantom: PhantomData }
    }
}

View File

@@ -0,0 +1,62 @@
use std::ffi::{c_int, c_uint};
use rand::{rngs::StdRng, RngCore, SeedableRng};
use rustacuda_derive::DeviceCopy;
use std::mem::transmute;
use rustacuda::prelude::*;
use rustacuda_core::DevicePointer;
use rustacuda::memory::{DeviceBox, CopyDestination, DeviceCopy};
use std::marker::PhantomData;
use std::convert::TryInto;
use crate::basic_structs::point::{PointT, PointAffineNoInfinityT};
use crate::basic_structs::scalar::ScalarT;
use crate::basic_structs::field::Field;
/// Marker type for the BN254 scalar field (8 x u32 limbs).
#[derive(Debug, PartialEq, Clone, Copy,DeviceCopy)]
#[repr(C)]
pub struct ScalarField;

impl Field<8> for ScalarField {
    // NOTE(review): placeholder — the real BN254 scalar-field modulus has not
    // been filled in yet.
    const MODOLUS: [u32; 8] = [0x0;8];
}

/// Marker type for the BN254 base field (8 x u32 limbs).
#[derive(Debug, PartialEq, Clone, Copy,DeviceCopy)]
#[repr(C)]
pub struct BaseField;

impl Field<8> for BaseField {
    // NOTE(review): placeholder modulus, as above.
    const MODOLUS: [u32; 8] = [0x0;8];
}
/// Scalar-field element: 8 u32 limbs tagged with `ScalarField`.
pub type Scalar = ScalarT<ScalarField,8>;

impl Default for Scalar {
    // All-zero limbs.
    fn default() -> Self {
        Self{value: [0x0;ScalarField::LIMBS], phantom: PhantomData }
    }
}

// SAFETY: `ScalarT` is `#[repr(C)]` and contains only a u32 array plus a
// zero-sized `PhantomData`, so a bitwise copy to/from device memory is valid.
unsafe impl DeviceCopy for Scalar{}

/// Base-field element: 8 u32 limbs tagged with `BaseField`.
pub type Base = ScalarT<BaseField,8>;

impl Default for Base {
    // All-zero limbs.
    fn default() -> Self {
        Self{value: [0x0;BaseField::LIMBS], phantom: PhantomData }
    }
}

// SAFETY: same layout argument as for `Scalar` above.
unsafe impl DeviceCopy for Base{}

/// BN254 projective point over the base field.
pub type Point = PointT<Base>;
/// BN254 affine point (the point at infinity is not representable).
pub type PointAffineNoInfinity = PointAffineNoInfinityT<Base>;
extern "C" {
    // Point equality implemented in the native library; returns non-zero when
    // the two points are equal.
    fn eq(point1: *const Point, point2: *const Point) -> c_uint;
}

impl PartialEq for Point {
    // Projective points are compared via the native routine rather than
    // field-by-field (coordinate triples are not unique representations).
    fn eq(&self, other: &Self) -> bool {
        // SAFETY: both references coerce to valid, non-null pointers that
        // outlive the call.
        unsafe { eq(self, other) != 0 }
    }
}

797
bn254/src/from_cuda.rs Normal file
View File

@@ -0,0 +1,797 @@
use std::ffi::{c_int, c_uint};
use ark_std::UniformRand;
use rand::{rngs::StdRng, RngCore, SeedableRng};
use rustacuda::CudaFlags;
use rustacuda::memory::DeviceBox;
use rustacuda::prelude::{DeviceBuffer, Device, ContextFlags, Context};
use rustacuda_core::DevicePointer;
use std::mem::transmute;
use crate::basic_structs::scalar::ScalarTrait;
use crate::curve_structs::*;
use icicle_core::utils::{u32_vec_to_u64_vec, u64_vec_to_u32_vec};
use std::marker::PhantomData;
use std::convert::TryInto;
use ark_bn254::{Fq as Fq_BN254, Fr as Fr_BN254, G1Affine as G1Affine_BN254, G1Projective as G1Projective_BN254};
use ark_ec::AffineCurve;
use ark_ff::{BigInteger384, BigInteger256, PrimeField};
use rustacuda::memory::{CopyDestination, DeviceCopy};
// Raw FFI bindings into the native CUDA library built by build.rs.
// Convention (from the safe wrappers below): parameters named `d_*` /
// `DevicePointer<_>` refer to device memory, plain pointers are host memory,
// and integer return values are native status codes — TODO confirm the exact
// code semantics against the CUDA sources.
extern "C" {
    // --- Multi-scalar multiplication and commitments ---
    fn msm_cuda(
        out: *mut Point,
        points: *const PointAffineNoInfinity,
        scalars: *const Scalar,
        count: usize,
        device_id: usize,
    ) -> c_uint;
    fn msm_batch_cuda(
        out: *mut Point,
        points: *const PointAffineNoInfinity,
        scalars: *const Scalar,
        batch_size: usize,
        msm_size: usize,
        device_id: usize,
    ) -> c_uint;
    fn commit_cuda(
        d_out: DevicePointer<Point>,
        d_scalars: DevicePointer<Scalar>,
        d_points: DevicePointer<PointAffineNoInfinity>,
        count: usize,
        device_id: usize,
    ) -> c_uint;
    fn commit_batch_cuda(
        d_out: DevicePointer<Point>,
        d_scalars: DevicePointer<Scalar>,
        d_points: DevicePointer<PointAffineNoInfinity>,
        count: usize,
        batch_size: usize,
        device_id: usize,
    ) -> c_uint;
    // --- Domain construction and (EC)NTT transforms ---
    fn build_domain_cuda(domain_size: usize, logn: usize, inverse: bool, device_id: usize) -> DevicePointer<Scalar>;
    fn ntt_cuda(inout: *mut Scalar, n: usize, inverse: bool, device_id: usize) -> c_int;
    fn ecntt_cuda(inout: *mut Point, n: usize, inverse: bool, device_id: usize) -> c_int;
    // Note: the batch entry points take no device id.
    fn ntt_batch_cuda(
        inout: *mut Scalar,
        arr_size: usize,
        n: usize,
        inverse: bool,
    ) -> c_int;
    fn ecntt_batch_cuda(inout: *mut Point, arr_size: usize, n: usize, inverse: bool) -> c_int;
    // --- Interpolation (evaluations -> coefficients) ---
    fn interpolate_scalars_cuda(
        d_out: DevicePointer<Scalar>,
        d_evaluations: DevicePointer<Scalar>,
        d_domain: DevicePointer<Scalar>,
        n: usize,
        device_id: usize
    ) -> c_int;
    fn interpolate_scalars_batch_cuda(
        d_out: DevicePointer<Scalar>,
        d_evaluations: DevicePointer<Scalar>,
        d_domain: DevicePointer<Scalar>,
        n: usize,
        batch_size: usize,
        device_id: usize
    ) -> c_int;
    fn interpolate_points_cuda(
        d_out: DevicePointer<Point>,
        d_evaluations: DevicePointer<Point>,
        d_domain: DevicePointer<Scalar>,
        n: usize,
        device_id: usize
    ) -> c_int;
    fn interpolate_points_batch_cuda(
        d_out: DevicePointer<Point>,
        d_evaluations: DevicePointer<Point>,
        d_domain: DevicePointer<Scalar>,
        n: usize,
        batch_size: usize,
        device_id: usize
    ) -> c_int;
    // --- Evaluation (coefficients -> evaluations), plain and on cosets ---
    fn evaluate_scalars_cuda(
        d_out: DevicePointer<Scalar>,
        d_coefficients: DevicePointer<Scalar>,
        d_domain: DevicePointer<Scalar>,
        domain_size: usize,
        n: usize,
        device_id: usize
    ) -> c_int;
    fn evaluate_scalars_batch_cuda(
        d_out: DevicePointer<Scalar>,
        d_coefficients: DevicePointer<Scalar>,
        d_domain: DevicePointer<Scalar>,
        domain_size: usize,
        n: usize,
        batch_size: usize,
        device_id: usize
    ) -> c_int;
    fn evaluate_points_cuda(
        d_out: DevicePointer<Point>,
        d_coefficients: DevicePointer<Point>,
        d_domain: DevicePointer<Scalar>,
        domain_size: usize,
        n: usize,
        device_id: usize
    ) -> c_int;
    fn evaluate_points_batch_cuda(
        d_out: DevicePointer<Point>,
        d_coefficients: DevicePointer<Point>,
        d_domain: DevicePointer<Scalar>,
        domain_size: usize,
        n: usize,
        batch_size: usize,
        device_id: usize
    ) -> c_int;
    fn evaluate_scalars_on_coset_cuda(
        d_out: DevicePointer<Scalar>,
        d_coefficients: DevicePointer<Scalar>,
        d_domain: DevicePointer<Scalar>,
        domain_size: usize,
        n: usize,
        coset_powers: DevicePointer<Scalar>,
        device_id: usize
    ) -> c_int;
    fn evaluate_scalars_on_coset_batch_cuda(
        d_out: DevicePointer<Scalar>,
        d_coefficients: DevicePointer<Scalar>,
        d_domain: DevicePointer<Scalar>,
        domain_size: usize,
        n: usize,
        batch_size: usize,
        coset_powers: DevicePointer<Scalar>,
        device_id: usize
    ) -> c_int;
    fn evaluate_points_on_coset_cuda(
        d_out: DevicePointer<Point>,
        d_coefficients: DevicePointer<Point>,
        d_domain: DevicePointer<Scalar>,
        domain_size: usize,
        n: usize,
        coset_powers: DevicePointer<Scalar>,
        device_id: usize
    ) -> c_int;
    fn evaluate_points_on_coset_batch_cuda(
        d_out: DevicePointer<Point>,
        d_coefficients: DevicePointer<Point>,
        d_domain: DevicePointer<Scalar>,
        domain_size: usize,
        n: usize,
        batch_size: usize,
        coset_powers: DevicePointer<Scalar>,
        device_id: usize
    ) -> c_int;
    // --- In-place element reordering on device buffers ---
    fn reverse_order_scalars_cuda(
        d_arr: DevicePointer<Scalar>,
        n: usize,
        device_id: usize
    ) -> c_int;
    fn reverse_order_scalars_batch_cuda(
        d_arr: DevicePointer<Scalar>,
        n: usize,
        batch_size: usize,
        device_id: usize
    ) -> c_int;
    fn reverse_order_points_cuda(
        d_arr: DevicePointer<Point>,
        n: usize,
        device_id: usize
    ) -> c_int;
    fn reverse_order_points_batch_cuda(
        d_arr: DevicePointer<Point>,
        n: usize,
        batch_size: usize,
        device_id: usize
    ) -> c_int;
    // --- Element-wise and matrix-vector modular products ---
    fn vec_mod_mult_point(
        inout: *mut Point,
        scalars: *const Scalar,
        n_elements: usize,
        device_id: usize,
    ) -> c_int;
    fn vec_mod_mult_scalar(
        inout: *mut Scalar,
        scalars: *const Scalar,
        n_elements: usize,
        device_id: usize,
    ) -> c_int;
    fn matrix_vec_mod_mult(
        matrix_flattened: *const Scalar,
        input: *const Scalar,
        output: *mut Scalar,
        n_elements: usize,
        device_id: usize,
    ) -> c_int;
}
/// Host-side multi-scalar multiplication: Σ scalars[i] · points[i], computed
/// on device `device_id`.
///
/// `points` and `scalars` must have the same length (variable lengths are not
/// supported yet). NOTE(review): the native status code is discarded.
pub fn msm(points: &[PointAffineNoInfinity], scalars: &[Scalar], device_id: usize) -> Point {
    if points.len() != scalars.len() {
        todo!("variable length")
    }
    let mut ret = Point::zero();
    // SAFETY: `ret` is a valid out-pointer, and both slices stay borrowed for
    // the whole call; the native side reads exactly `scalars.len()` elements
    // from each. (`as_ptr()` replaces the previous double `as` casts.)
    unsafe {
        msm_cuda(
            &mut ret,
            points.as_ptr(),
            scalars.as_ptr(),
            scalars.len(),
            device_id,
        )
    };
    ret
}

/// Runs `batch_size` independent MSMs over equal-length chunks of the inputs
/// and returns one result point per chunk.
pub fn msm_batch(
    points: &[PointAffineNoInfinity],
    scalars: &[Scalar],
    batch_size: usize,
    device_id: usize,
) -> Vec<Point> {
    let count = points.len();
    if count != scalars.len() {
        todo!("variable length")
    }
    let mut ret = vec![Point::zero(); batch_size];
    // SAFETY: `ret` has room for `batch_size` results; each sub-MSM spans
    // `count / batch_size` input elements. Status code is discarded.
    unsafe {
        msm_batch_cuda(
            ret.as_mut_ptr(),
            points.as_ptr(),
            scalars.as_ptr(),
            batch_size,
            count / batch_size,
            device_id,
        )
    };
    ret
}
/// Commitment (MSM) over buffers already resident on the GPU; runs on device 0.
///
/// Uses `scalars.len()` as the element count; `points` presumably must hold at
/// least that many elements — TODO confirm and validate. The native status
/// code is currently discarded.
pub fn commit(
    points: &mut DeviceBuffer<PointAffineNoInfinity>,
    scalars: &mut DeviceBuffer<Scalar>,
) -> DeviceBox<Point> {
    let mut res = DeviceBox::new(&Point::zero()).unwrap();
    // SAFETY: all device pointers come from live buffers owned by the caller.
    unsafe {
        commit_cuda(
            res.as_device_ptr(),
            scalars.as_device_ptr(),
            points.as_device_ptr(),
            scalars.len(),
            0,
        );
    }
    // Idiom: expression instead of `return res;`.
    res
}

/// Batched device-side commitment: one MSM result per batch element, each over
/// `scalars.len() / batch_size` inputs; runs on device 0.
pub fn commit_batch(
    points: &mut DeviceBuffer<PointAffineNoInfinity>,
    scalars: &mut DeviceBuffer<Scalar>,
    batch_size: usize,
) -> DeviceBuffer<Point> {
    // One output point per batch element; contents are written by the kernel.
    let mut res = unsafe { DeviceBuffer::uninitialized(batch_size).unwrap() };
    // SAFETY: `res` holds exactly `batch_size` points; the input buffers are
    // live for the duration of the call. Status code is discarded.
    unsafe {
        commit_batch_cuda(
            res.as_device_ptr(),
            scalars.as_device_ptr(),
            points.as_device_ptr(),
            scalars.len() / batch_size,
            batch_size,
            0,
        );
    }
    res
}
/// Compute an in-place NTT (or iNTT when `inverse`) on `values`; returns the
/// native status code.
fn ntt_internal(values: &mut [Scalar], device_id: usize, inverse: bool) -> i32 {
    // SAFETY: `values` is a valid, exclusively borrowed host buffer of
    // `values.len()` scalars for the duration of the call. (`as_mut_ptr()`
    // replaces the previous double `as` cast.)
    unsafe {
        ntt_cuda(
            values.as_mut_ptr(),
            values.len(),
            inverse,
            device_id,
        )
    }
}

/// Forward NTT in place on device `device_id`.
/// NOTE(review): the native status code is discarded.
pub fn ntt(values: &mut [Scalar], device_id: usize) {
    ntt_internal(values, device_id, false);
}

/// Inverse NTT in place on device `device_id`.
pub fn intt(values: &mut [Scalar], device_id: usize) {
    ntt_internal(values, device_id, true);
}

/// Batched in-place NTT over `values`.
///
/// The native batch entry point takes no device id, so `_device_id` is unused.
/// `batch_size` is forwarded as the native `n` parameter; callers in this repo
/// pass the per-transform size here (see the tests) — TODO confirm the naming.
fn ntt_internal_batch(
    values: &mut [Scalar],
    _device_id: usize,
    batch_size: usize,
    inverse: bool,
) -> i32 {
    // SAFETY: `values` is a valid, exclusively borrowed host buffer.
    unsafe {
        ntt_batch_cuda(
            values.as_mut_ptr(),
            values.len(),
            batch_size,
            inverse,
        )
    }
}

/// Batched forward NTT in place.
pub fn ntt_batch(values: &mut [Scalar], batch_size: usize, device_id: usize) {
    // Fix: forward the caller's `device_id` instead of a hard-coded 0 (the
    // native call currently ignores it either way, so behavior is unchanged).
    ntt_internal_batch(values, device_id, batch_size, false);
}

/// Batched inverse NTT in place.
pub fn intt_batch(values: &mut [Scalar], batch_size: usize, device_id: usize) {
    ntt_internal_batch(values, device_id, batch_size, true);
}
/// Compute an in-place ECNTT (or inverse when `inverse`) on `values`; returns
/// the native status code.
fn ecntt_internal(values: &mut [Point], inverse: bool, device_id: usize) -> i32 {
    // SAFETY: `values` is a valid, exclusively borrowed host buffer for the
    // duration of the call.
    unsafe {
        ecntt_cuda(
            values.as_mut_ptr(),
            values.len(),
            inverse,
            device_id,
        )
    }
}

/// Forward ECNTT in place on device `device_id`.
/// NOTE(review): the native status code is discarded.
pub fn ecntt(values: &mut [Point], device_id: usize) {
    ecntt_internal(values, false, device_id);
}

/// Inverse ECNTT in place on device `device_id`.
pub fn iecntt(values: &mut [Point], device_id: usize) {
    ecntt_internal(values, true, device_id);
}

/// Batched in-place ECNTT over `values`.
///
/// The native batch entry point takes no device id, so `_device_id` is unused.
/// `batch_size` is forwarded as the native `n` parameter; callers in this repo
/// pass the per-transform size here (see the tests) — TODO confirm the naming.
fn ecntt_internal_batch(
    values: &mut [Point],
    _device_id: usize,
    batch_size: usize,
    inverse: bool,
) -> i32 {
    // SAFETY: `values` is a valid, exclusively borrowed host buffer.
    unsafe {
        ecntt_batch_cuda(
            values.as_mut_ptr(),
            values.len(),
            batch_size,
            inverse,
        )
    }
}

/// Batched forward ECNTT in place.
pub fn ecntt_batch(values: &mut [Point], batch_size: usize, device_id: usize) {
    // Fix: forward the caller's `device_id` instead of a hard-coded 0 (the
    // native call currently ignores it either way, so behavior is unchanged).
    ecntt_internal_batch(values, device_id, batch_size, false);
}

/// Batched inverse ECNTT in place.
pub fn iecntt_batch(values: &mut [Point], batch_size: usize, device_id: usize) {
    ecntt_internal_batch(values, device_id, batch_size, true);
}
/// Builds the (inverse, when `inverse`) evaluation domain of `domain_size`
/// scalars on device 0 and wraps the returned raw device allocation.
pub fn build_domain(domain_size: usize, logn: usize, inverse: bool) -> DeviceBuffer<Scalar> {
    // SAFETY: the native call returns a device pointer whose ownership is
    // transferred to the DeviceBuffer. NOTE(review): this assumes the native
    // allocation really holds `domain_size` elements — confirm against the
    // CUDA implementation of `build_domain_cuda`.
    unsafe {
        DeviceBuffer::from_raw_parts(build_domain_cuda(
            domain_size,
            logn,
            inverse,
            0
        ), domain_size)
    }
}
/// Reverses the element order of a device scalar array in place — presumably
/// the NTT bit-reversal permutation; confirm against the CUDA kernel.
/// Runs on device 0; the native status code is discarded.
pub fn reverse_order_scalars(
    d_scalars: &mut DeviceBuffer<Scalar>,
) {
    // SAFETY: pointer and length come from the same live device buffer.
    unsafe { reverse_order_scalars_cuda(
        d_scalars.as_device_ptr(),
        d_scalars.len(),
        0
    ); }
}

/// Reorders each of `batch_size` equal-sized chunks of a device scalar array
/// independently (per-chunk size is `len / batch_size`).
pub fn reverse_order_scalars_batch(
    d_scalars: &mut DeviceBuffer<Scalar>,
    batch_size: usize,
) {
    // SAFETY: as above.
    unsafe { reverse_order_scalars_batch_cuda(
        d_scalars.as_device_ptr(),
        d_scalars.len() / batch_size,
        batch_size,
        0
    ); }
}

/// Point analogue of `reverse_order_scalars`.
pub fn reverse_order_points(
    d_points: &mut DeviceBuffer<Point>,
) {
    // SAFETY: pointer and length come from the same live device buffer.
    unsafe { reverse_order_points_cuda(
        d_points.as_device_ptr(),
        d_points.len(),
        0
    ); }
}

/// Point analogue of `reverse_order_scalars_batch`.
pub fn reverse_order_points_batch(
    d_points: &mut DeviceBuffer<Point>,
    batch_size: usize,
) {
    // SAFETY: as above.
    unsafe { reverse_order_points_batch_cuda(
        d_points.as_device_ptr(),
        d_points.len() / batch_size,
        batch_size,
        0
    ); }
}
pub fn interpolate_scalars(
d_evaluations: &mut DeviceBuffer<Scalar>,
d_domain: &mut DeviceBuffer<Scalar>
) -> DeviceBuffer<Scalar> {
let mut res = unsafe { DeviceBuffer::uninitialized(d_domain.len()).unwrap() };
unsafe { interpolate_scalars_cuda(
res.as_device_ptr(),
d_evaluations.as_device_ptr(),
d_domain.as_device_ptr(),
d_domain.len(),
0
) };
return res;
}
pub fn interpolate_scalars_batch(
d_evaluations: &mut DeviceBuffer<Scalar>,
d_domain: &mut DeviceBuffer<Scalar>,
batch_size: usize,
) -> DeviceBuffer<Scalar> {
let mut res = unsafe { DeviceBuffer::uninitialized(d_domain.len() * batch_size).unwrap() };
unsafe { interpolate_scalars_batch_cuda(
res.as_device_ptr(),
d_evaluations.as_device_ptr(),
d_domain.as_device_ptr(),
d_domain.len(),
batch_size,
0
) };
return res;
}
/// Interpolates point coefficients from `d_evaluations` over `d_domain` on
/// device 0, returning a new device buffer of `d_domain.len()` points.
pub fn interpolate_points(
    d_evaluations: &mut DeviceBuffer<Point>,
    d_domain: &mut DeviceBuffer<Scalar>,
) -> DeviceBuffer<Point> {
    let n = d_domain.len();
    // The kernel fills every output slot, so the buffer may start uninitialized.
    let mut d_coeffs = unsafe { DeviceBuffer::uninitialized(n).unwrap() };
    unsafe {
        interpolate_points_cuda(
            d_coeffs.as_device_ptr(),
            d_evaluations.as_device_ptr(),
            d_domain.as_device_ptr(),
            n,
            0,
        )
    };
    d_coeffs
}
/// Batched variant of `interpolate_points`: interpolates `batch_size`
/// independent point-evaluation vectors over the same domain on device 0.
pub fn interpolate_points_batch(
    d_evaluations: &mut DeviceBuffer<Point>,
    d_domain: &mut DeviceBuffer<Scalar>,
    batch_size: usize,
) -> DeviceBuffer<Point> {
    let n = d_domain.len();
    // One domain-sized output vector per batch element.
    let mut d_coeffs = unsafe { DeviceBuffer::uninitialized(n * batch_size).unwrap() };
    unsafe {
        interpolate_points_batch_cuda(
            d_coeffs.as_device_ptr(),
            d_evaluations.as_device_ptr(),
            d_domain.as_device_ptr(),
            n,
            batch_size,
            0,
        )
    };
    d_coeffs
}
/// Evaluates the polynomial given by `d_coefficients` on every point of
/// `d_domain` (device 0), returning `d_domain.len()` evaluations.
pub fn evaluate_scalars(
    d_coefficients: &mut DeviceBuffer<Scalar>,
    d_domain: &mut DeviceBuffer<Scalar>,
) -> DeviceBuffer<Scalar> {
    let domain_size = d_domain.len();
    let n_coeffs = d_coefficients.len();
    // The kernel writes every output slot, so the buffer may start uninitialized.
    let mut d_evals = unsafe { DeviceBuffer::uninitialized(domain_size).unwrap() };
    unsafe {
        evaluate_scalars_cuda(
            d_evals.as_device_ptr(),
            d_coefficients.as_device_ptr(),
            d_domain.as_device_ptr(),
            domain_size,
            n_coeffs,
            0,
        );
    }
    d_evals
}
/// Batched variant of `evaluate_scalars`: evaluates `batch_size` independent
/// coefficient vectors over the same domain on device 0.
pub fn evaluate_scalars_batch(
    d_coefficients: &mut DeviceBuffer<Scalar>,
    d_domain: &mut DeviceBuffer<Scalar>,
    batch_size: usize,
) -> DeviceBuffer<Scalar> {
    let domain_size = d_domain.len();
    // Each batch element contributes len / batch_size coefficients.
    let coeffs_per_batch = d_coefficients.len() / batch_size;
    let mut d_evals = unsafe { DeviceBuffer::uninitialized(domain_size * batch_size).unwrap() };
    unsafe {
        evaluate_scalars_batch_cuda(
            d_evals.as_device_ptr(),
            d_coefficients.as_device_ptr(),
            d_domain.as_device_ptr(),
            domain_size,
            coeffs_per_batch,
            batch_size,
            0,
        );
    }
    d_evals
}
/// Evaluates the point-valued polynomial given by `d_coefficients` on every
/// point of `d_domain` (device 0), returning `d_domain.len()` evaluations.
pub fn evaluate_points(
    d_coefficients: &mut DeviceBuffer<Point>,
    d_domain: &mut DeviceBuffer<Scalar>,
) -> DeviceBuffer<Point> {
    let domain_size = d_domain.len();
    let n_coeffs = d_coefficients.len();
    // The kernel writes every output slot, so the buffer may start uninitialized.
    let mut d_evals = unsafe { DeviceBuffer::uninitialized(domain_size).unwrap() };
    unsafe {
        evaluate_points_cuda(
            d_evals.as_device_ptr(),
            d_coefficients.as_device_ptr(),
            d_domain.as_device_ptr(),
            domain_size,
            n_coeffs,
            0,
        );
    }
    d_evals
}
/// Batched variant of `evaluate_points`: evaluates `batch_size` independent
/// point-coefficient vectors over the same domain on device 0.
pub fn evaluate_points_batch(
    d_coefficients: &mut DeviceBuffer<Point>,
    d_domain: &mut DeviceBuffer<Scalar>,
    batch_size: usize,
) -> DeviceBuffer<Point> {
    let domain_size = d_domain.len();
    // Each batch element contributes len / batch_size coefficients.
    let coeffs_per_batch = d_coefficients.len() / batch_size;
    let mut d_evals = unsafe { DeviceBuffer::uninitialized(domain_size * batch_size).unwrap() };
    unsafe {
        evaluate_points_batch_cuda(
            d_evals.as_device_ptr(),
            d_coefficients.as_device_ptr(),
            d_domain.as_device_ptr(),
            domain_size,
            coeffs_per_batch,
            batch_size,
            0,
        );
    }
    d_evals
}
/// Evaluates the polynomial given by `d_coefficients` on the coset of
/// `d_domain` defined by `coset_powers`, on device 0.
pub fn evaluate_scalars_on_coset(
    d_coefficients: &mut DeviceBuffer<Scalar>,
    d_domain: &mut DeviceBuffer<Scalar>,
    coset_powers: &mut DeviceBuffer<Scalar>,
) -> DeviceBuffer<Scalar> {
    let domain_size = d_domain.len();
    let n_coeffs = d_coefficients.len();
    // The kernel writes every output slot, so the buffer may start uninitialized.
    let mut d_evals = unsafe { DeviceBuffer::uninitialized(domain_size).unwrap() };
    unsafe {
        evaluate_scalars_on_coset_cuda(
            d_evals.as_device_ptr(),
            d_coefficients.as_device_ptr(),
            d_domain.as_device_ptr(),
            domain_size,
            n_coeffs,
            coset_powers.as_device_ptr(),
            0,
        );
    }
    d_evals
}
/// Batched variant of `evaluate_scalars_on_coset`: evaluates `batch_size`
/// independent coefficient vectors on the same coset, on device 0.
pub fn evaluate_scalars_on_coset_batch(
    d_coefficients: &mut DeviceBuffer<Scalar>,
    d_domain: &mut DeviceBuffer<Scalar>,
    batch_size: usize,
    coset_powers: &mut DeviceBuffer<Scalar>,
) -> DeviceBuffer<Scalar> {
    let domain_size = d_domain.len();
    // Each batch element contributes len / batch_size coefficients.
    let coeffs_per_batch = d_coefficients.len() / batch_size;
    let mut d_evals = unsafe { DeviceBuffer::uninitialized(domain_size * batch_size).unwrap() };
    unsafe {
        evaluate_scalars_on_coset_batch_cuda(
            d_evals.as_device_ptr(),
            d_coefficients.as_device_ptr(),
            d_domain.as_device_ptr(),
            domain_size,
            coeffs_per_batch,
            batch_size,
            coset_powers.as_device_ptr(),
            0,
        );
    }
    d_evals
}
/// Evaluates the point-valued polynomial given by `d_coefficients` on the
/// coset of `d_domain` defined by `coset_powers`, on device 0.
pub fn evaluate_points_on_coset(
    d_coefficients: &mut DeviceBuffer<Point>,
    d_domain: &mut DeviceBuffer<Scalar>,
    coset_powers: &mut DeviceBuffer<Scalar>,
) -> DeviceBuffer<Point> {
    let domain_size = d_domain.len();
    let n_coeffs = d_coefficients.len();
    // The kernel writes every output slot, so the buffer may start uninitialized.
    let mut d_evals = unsafe { DeviceBuffer::uninitialized(domain_size).unwrap() };
    unsafe {
        evaluate_points_on_coset_cuda(
            d_evals.as_device_ptr(),
            d_coefficients.as_device_ptr(),
            d_domain.as_device_ptr(),
            domain_size,
            n_coeffs,
            coset_powers.as_device_ptr(),
            0,
        );
    }
    d_evals
}
/// Batched variant of `evaluate_points_on_coset`: evaluates `batch_size`
/// independent point-coefficient vectors on the same coset, on device 0.
pub fn evaluate_points_on_coset_batch(
    d_coefficients: &mut DeviceBuffer<Point>,
    d_domain: &mut DeviceBuffer<Scalar>,
    batch_size: usize,
    coset_powers: &mut DeviceBuffer<Scalar>,
) -> DeviceBuffer<Point> {
    let domain_size = d_domain.len();
    // Each batch element contributes len / batch_size coefficients.
    let coeffs_per_batch = d_coefficients.len() / batch_size;
    let mut d_evals = unsafe { DeviceBuffer::uninitialized(domain_size * batch_size).unwrap() };
    unsafe {
        evaluate_points_on_coset_batch_cuda(
            d_evals.as_device_ptr(),
            d_coefficients.as_device_ptr(),
            d_domain.as_device_ptr(),
            domain_size,
            coeffs_per_batch,
            batch_size,
            coset_powers.as_device_ptr(),
            0,
        );
    }
    d_evals
}
/// Element-wise point-by-scalar multiplication, in place: `a[i] *= b[i]`.
///
/// # Panics
/// Panics if `a` and `b` have different lengths.
pub fn multp_vec(a: &mut [Point], b: &[Scalar], device_id: usize) {
    assert_eq!(a.len(), b.len());
    unsafe {
        // Use the slice pointer accessors instead of double fat-pointer casts —
        // same pointers, clearer intent.
        vec_mod_mult_point(
            a.as_mut_ptr(),
            b.as_ptr(),
            a.len(),
            device_id,
        );
    }
}
/// Element-wise scalar multiplication, in place: `a[i] *= b[i]`.
///
/// # Panics
/// Panics if `a` and `b` have different lengths.
pub fn mult_sc_vec(a: &mut [Scalar], b: &[Scalar], device_id: usize) {
    assert_eq!(a.len(), b.len());
    unsafe {
        // Use the slice pointer accessors instead of double fat-pointer casts —
        // same pointers, clearer intent.
        vec_mod_mult_scalar(
            a.as_mut_ptr(),
            b.as_ptr(),
            a.len(),
            device_id,
        );
    }
}
// Multiply a matrix by a vector:
// `a` - flattened matrix;
// `b` - vector to multiply `a` by;
// Returns the product vector (same length as `b`).
pub fn mult_matrix_by_vec(a: &[Scalar], b: &[Scalar], device_id: usize) -> Vec<Scalar> {
    // Pre-size and zero-fill the output instead of pushing in a loop
    // (the original loop also left an unused index variable).
    let mut c = vec![Scalar::zero(); b.len()];
    unsafe {
        matrix_vec_mod_mult(
            a.as_ptr(),
            b.as_ptr(),
            c.as_mut_ptr(),
            b.len(),
            device_id,
        );
    }
    c
}
/// Returns a device-side copy of `buf`.
pub fn clone_buffer<T: DeviceCopy>(buf: &mut DeviceBuffer<T>) -> DeviceBuffer<T> {
    // The copy below fills the whole buffer, so it may start uninitialized.
    let mut buf_cpy = unsafe { DeviceBuffer::uninitialized(buf.len()).unwrap() };
    // Propagate copy failures instead of silently discarding the CudaResult,
    // which would hand back a buffer of garbage on error.
    unsafe { buf_cpy.copy_from(buf).unwrap() };
    buf_cpy
}
/// Returns a deterministically seeded `StdRng` when `seed` is provided,
/// otherwise the thread-local RNG.
pub fn get_rng(seed: Option<u64>) -> Box<dyn RngCore> {
    match seed {
        Some(s) => Box::new(StdRng::seed_from_u64(s)),
        None => Box::new(rand::thread_rng()),
    }
}
/// Initializes the CUDA driver and pushes a new context on device 0.
fn set_up_device() {
    // Set up the context, load the module, and create a stream to run kernels in.
    rustacuda::init(CudaFlags::empty()).unwrap();
    let device = Device::get_device(0).unwrap();
    // NOTE(review): `_ctx` is dropped when this function returns, which (per
    // rustacuda's Drop impl) destroys the context it guards — confirm that a
    // current context still exists for the CUDA calls made after this helper,
    // or deliberately leak the context instead.
    let _ctx = Context::create_and_push(ContextFlags::MAP_HOST | ContextFlags::SCHED_AUTO, device).unwrap();
}
/// Samples `count` uniformly random curve points in affine (x, y) form.
pub fn generate_random_points(
    count: usize,
    mut rng: Box<dyn RngCore>,
) -> Vec<PointAffineNoInfinity> {
    let mut points = Vec::with_capacity(count);
    for _ in 0..count {
        let p = G1Projective_BN254::rand(&mut rng);
        points.push(Point::from_ark(p).to_xy_strip_z());
    }
    points
}
/// Samples `count` uniformly random curve points in projective form.
pub fn generate_random_points_proj(count: usize, mut rng: Box<dyn RngCore>) -> Vec<Point> {
    let mut points = Vec::with_capacity(count);
    for _ in 0..count {
        points.push(Point::from_ark(G1Projective_BN254::rand(&mut rng)));
    }
    points
}
/// Samples `count` uniformly random scalar field elements.
pub fn generate_random_scalars(count: usize, mut rng: Box<dyn RngCore>) -> Vec<Scalar> {
    let mut scalars = Vec::with_capacity(count);
    for _ in 0..count {
        scalars.push(Scalar::from_ark(Fr_BN254::rand(&mut rng).into_repr()));
    }
    scalars
}
/// Test fixture: initializes the device, builds an NTT domain of size
/// `2^log_domain_size` (inverse roots when `inverse` is set), and generates
/// `test_size` random points both on the host and as a device buffer.
/// Returns `(host_points, device_points, device_domain)`.
pub fn set_up_points(test_size: usize, log_domain_size: usize, inverse: bool) -> (Vec<Point>, DeviceBuffer<Point>, DeviceBuffer<Scalar>) {
    set_up_device();
    let d_domain = build_domain(1 << log_domain_size, log_domain_size, inverse);
    let seed = Some(0); // fix the rng so repeated set-ups generate identical points
    // The host vector is copied to the device, so there is no need to clone it
    // (the original cloned the whole Vec and returned the copy).
    let vector = generate_random_points_proj(test_size, get_rng(seed));
    let d_vector = DeviceBuffer::from_slice(&vector[..]).unwrap();
    (vector, d_vector, d_domain)
}
/// Test fixture: initializes the device, builds an NTT domain of size
/// `2^log_domain_size` (inverse roots when `inverse` is set), and generates
/// `test_size` random scalars both on the host and as a device buffer.
/// Returns `(host_scalars, device_scalars, device_domain)`.
pub fn set_up_scalars(test_size: usize, log_domain_size: usize, inverse: bool) -> (Vec<Scalar>, DeviceBuffer<Scalar>, DeviceBuffer<Scalar>) {
    set_up_device();
    let d_domain = build_domain(1 << log_domain_size, log_domain_size, inverse);
    let seed = Some(0); // fix the rng so repeated set-ups generate identical scalars
    // Neither binding is mutated here, so drop the `mut` qualifiers the
    // original carried (they only produced unused_mut warnings).
    let vector = generate_random_scalars(test_size, get_rng(seed));
    let d_vector = DeviceBuffer::from_slice(&vector[..]).unwrap();
    (vector, d_vector, d_domain)
}

4
bn254/src/lib.rs Normal file
View File

@@ -0,0 +1,4 @@
pub mod test_bn254;
pub mod basic_structs;
pub mod from_cuda;
pub mod curve_structs;

816
bn254/src/test_bn254.rs Normal file
View File

@@ -0,0 +1,816 @@
use std::ffi::{c_int, c_uint};
use ark_std::UniformRand;
use rand::{rngs::StdRng, RngCore, SeedableRng};
use rustacuda::CudaFlags;
use rustacuda::memory::DeviceBox;
use rustacuda::prelude::{DeviceBuffer, Device, ContextFlags, Context};
use rustacuda_core::DevicePointer;
use std::mem::transmute;
pub use crate::basic_structs::scalar::ScalarTrait;
pub use crate::curve_structs::*;
use icicle_core::utils::{u32_vec_to_u64_vec, u64_vec_to_u32_vec};
use std::marker::PhantomData;
use std::convert::TryInto;
use ark_bn254::{Fq as Fq_BN254, Fr as Fr_BN254, G1Affine as G1Affine_BN254, G1Projective as G1Projective_BN254};
use ark_ec::AffineCurve;
use ark_ff::{BigInteger384, BigInteger256, PrimeField};
use rustacuda::memory::{CopyDestination, DeviceCopy};
impl Scalar {
    /// Converts the scalar's u32 limbs into an ark `BigInteger256`.
    /// NOTE(review): the name says 254 but the type is BigInteger256 —
    /// presumably named after the ~254-bit BN254 scalar field; confirm.
    pub fn to_biginteger254(&self) -> BigInteger256 {
        BigInteger256::new(u32_vec_to_u64_vec(&self.limbs()).try_into().unwrap())
    }
    /// Same limb conversion as `to_biginteger254`; no Montgomery-form handling.
    pub fn to_ark(&self) -> BigInteger256 {
        BigInteger256::new(u32_vec_to_u64_vec(&self.limbs()).try_into().unwrap())
    }
    /// Builds a scalar from an ark big integer by splitting u64 limbs into u32s.
    pub fn from_biginteger256(ark: BigInteger256) -> Self {
        Self{ value: u64_vec_to_u32_vec(&ark.0).try_into().unwrap(), phantom : PhantomData}
    }
    /// Bit-casts the scalar into a `BigInteger256`.
    /// SAFETY-NOTE(review): assumes Scalar's limb storage has exactly the same
    /// size and layout as BigInteger256 — confirm the repr of both types.
    pub fn to_biginteger256_transmute(&self) -> BigInteger256 {
        unsafe { transmute(*self) }
    }
    /// Bit-casts a `BigInteger256` into a scalar (inverse of the above).
    pub fn from_biginteger_transmute(v: BigInteger256) -> Scalar {
        Scalar{ value: unsafe{ transmute(v)}, phantom : PhantomData }
    }
    /// Bit-casts the scalar into an ark field element without any form
    /// conversion — only valid if both sides use the same representation
    /// (e.g. both Montgomery form); confirm before relying on this.
    pub fn to_ark_transmute(&self) -> Fr_BN254 {
        unsafe { std::mem::transmute(*self) }
    }
    /// Bit-casts an ark field element into a scalar (inverse of the above).
    pub fn from_ark_transmute(v: &Fr_BN254) -> Scalar {
        unsafe { std::mem::transmute_copy(v) }
    }
    /// Interprets the limbs as an integer and reduces it mod p into the field.
    pub fn to_ark_mod_p(&self) -> Fr_BN254 {
        Fr_BN254::new(BigInteger256::new(u32_vec_to_u64_vec(&self.limbs()).try_into().unwrap()))
    }
    /// Interprets the limbs as a canonical (non-Montgomery) representation;
    /// panics if the value is not a valid field element.
    pub fn to_ark_repr(&self) -> Fr_BN254 {
        Fr_BN254::from_repr(BigInteger256::new(u32_vec_to_u64_vec(&self.limbs()).try_into().unwrap())).unwrap()
    }
    /// Builds a scalar from an ark big integer (same as `from_biginteger256`).
    pub fn from_ark(v: BigInteger256) -> Scalar {
        Self { value : u64_vec_to_u32_vec(&v.0).try_into().unwrap(), phantom: PhantomData}
    }
}
impl Base {
    /// Converts the base-field element's u32 limbs into an ark `BigInteger256`.
    pub fn to_ark(&self) -> BigInteger256 {
        BigInteger256::new(u32_vec_to_u64_vec(&self.limbs()).try_into().unwrap())
    }
    /// Builds a base-field element from an ark big integer's u64 limbs.
    pub fn from_ark(ark: BigInteger256) -> Self {
        Self::from_limbs(&u64_vec_to_u32_vec(&ark.0))
    }
}
impl Point {
    /// Converts this point to an ark projective point (via the affine form).
    pub fn to_ark(&self) -> G1Projective_BN254 {
        self.to_ark_affine().into_projective()
    }
    /// Converts this point to an ark affine point by dividing x and y by z
    /// (i.e. treating (x, y, z) as standard projective coordinates).
    /// NOTE(review): `from_ark` below uses z^2/z^3 (Jacobian) factors — the two
    /// conversions assume different coordinate systems; confirm which one the
    /// CUDA side actually uses.
    pub fn to_ark_affine(&self) -> G1Affine_BN254 {
        //TODO: generic conversion
        use ark_ff::Field;
        use std::ops::Mul;
        let proj_x_field = Fq_BN254::from_le_bytes_mod_order(&self.x.to_bytes_le());
        let proj_y_field = Fq_BN254::from_le_bytes_mod_order(&self.y.to_bytes_le());
        let proj_z_field = Fq_BN254::from_le_bytes_mod_order(&self.z.to_bytes_le());
        // Panics if z == 0 (the point at infinity has no affine form).
        let inverse_z = proj_z_field.inverse().unwrap();
        let aff_x = proj_x_field.mul(inverse_z);
        let aff_y = proj_y_field.mul(inverse_z);
        G1Affine_BN254::new(aff_x, aff_y, false)
    }
    /// Builds a normalized (z = 1) point from an ark point by applying the
    /// Jacobian-to-affine factors x/z^2, y/z^3.
    pub fn from_ark(ark: G1Projective_BN254) -> Point {
        use ark_ff::Field;
        // Panics if ark.z == 0 (point at infinity).
        let z_inv = ark.z.inverse().unwrap();
        let z_invsq = z_inv * z_inv;
        let z_invq3 = z_invsq * z_inv;
        Point {
            x: Base::from_ark((ark.x * z_invsq).into_repr()),
            y: Base::from_ark((ark.y * z_invq3).into_repr()),
            z: Base::one(),
        }
    }
}
impl PointAffineNoInfinity {
    /// Converts to an ark affine point, taking the limbs as raw (possibly
    /// Montgomery-form) representation via `Fq::new`.
    pub fn to_ark(&self) -> G1Affine_BN254 {
        G1Affine_BN254::new(Fq_BN254::new(self.x.to_ark()), Fq_BN254::new(self.y.to_ark()), false)
    }
    /// Converts to an ark affine point, taking the limbs as canonical
    /// representation; panics if either coordinate is not a valid field element.
    pub fn to_ark_repr(&self) -> G1Affine_BN254 {
        G1Affine_BN254::new(
            Fq_BN254::from_repr(self.x.to_ark()).unwrap(),
            Fq_BN254::from_repr(self.y.to_ark()).unwrap(),
            false,
        )
    }
    /// Builds an affine point from an ark affine point's canonical coordinates.
    /// Note: the infinity flag of `p` is discarded — this type cannot represent it.
    pub fn from_ark(p: &G1Affine_BN254) -> Self {
        PointAffineNoInfinity {
            x: Base::from_ark(p.x.into_repr()),
            y: Base::from_ark(p.y.into_repr()),
        }
    }
}
impl Point {
pub fn to_affine(&self) -> PointAffineNoInfinity {
let ark_affine = self.to_ark_affine();
PointAffineNoInfinity {
x: Base::from_ark(ark_affine.x.into_repr()),
y: Base::from_ark(ark_affine.y.into_repr()),
}
}
}
#[cfg(test)]
pub(crate) mod tests_bn254 {
use std::ops::Add;
use ark_bn254::{Fr, G1Affine, G1Projective};
use ark_ec::{msm::VariableBaseMSM, AffineCurve, ProjectiveCurve};
use ark_ff::{FftField, Field, Zero, PrimeField};
use ark_std::UniformRand;
use rustacuda::prelude::{DeviceBuffer, CopyDestination};
use crate::curve_structs::{Point, Scalar, Base};
use crate::basic_structs::scalar::ScalarTrait;
use crate::from_cuda::{generate_random_points, get_rng, generate_random_scalars, msm, msm_batch, set_up_scalars, commit, commit_batch, ntt, intt, generate_random_points_proj, ecntt, iecntt, ntt_batch, ecntt_batch, iecntt_batch, intt_batch, reverse_order_scalars_batch, interpolate_scalars_batch, set_up_points, reverse_order_points, interpolate_points, reverse_order_points_batch, interpolate_points_batch, evaluate_scalars, interpolate_scalars, reverse_order_scalars, evaluate_points, build_domain, evaluate_scalars_on_coset, evaluate_points_on_coset, mult_matrix_by_vec, mult_sc_vec, multp_vec,evaluate_scalars_batch, evaluate_points_batch, evaluate_scalars_on_coset_batch, evaluate_points_on_coset_batch};
/// Generates `nof_elements` random projective points using ark's thread RNG.
fn random_points_ark_proj(nof_elements: usize) -> Vec<G1Projective> {
    let mut rng = ark_std::rand::thread_rng();
    (0..nof_elements)
        .map(|_| G1Projective::rand(&mut rng))
        .collect()
}
/// Naive O(n^2) reference (i)ECNTT over ark types, used to cross-check GPU results.
fn ecntt_arc_naive(
    points: &Vec<G1Projective>,
    size: usize,
    inverse: bool,
) -> Vec<G1Projective> {
    // Root of unity for the forward transform, or its inverse for the inverse one.
    let rou = if inverse {
        Fr::inverse(&Fr::get_root_of_unity(size).unwrap()).unwrap()
    } else {
        Fr::get_root_of_unity(size).unwrap()
    };
    let mut result = vec![G1Projective::zero(); size];
    for k in 0..size {
        for l in 0..size {
            // result[k] += points[l] * rou^(l*k)
            let pow: [u64; 1] = [(l * k).try_into().unwrap()];
            let mul_rou = Fr::pow(&rou, &pow);
            result[k] = result[k].add(points[l].into_affine().mul(mul_rou));
        }
    }
    if inverse {
        // Complete the inverse transform by scaling every entry by 1/n.
        let n = size as u64;
        for k in 0..size {
            let n_inv = Fr::inverse(&Fr::from(n)).unwrap();
            result[k] = result[k].into_affine().mul(n_inv);
        }
    }
    result
}
/// Returns true iff the first `points.len()` entries of both vectors match.
fn check_eq(points: &Vec<G1Projective>, points2: &Vec<G1Projective>) -> bool {
    // Early-return on the first mismatch instead of tracking a flag.
    for i in 0..points.len() {
        if points2[i].ne(&points[i]) {
            return false;
        }
    }
    true
}
/// Sanity-checks the naive reference implementation: iECNTT(ECNTT(p)) == p.
fn test_naive_ark_ecntt(size: usize) {
    let points = random_points_ark_proj(size);
    let result1: Vec<G1Projective> = ecntt_arc_naive(&points, size, false);
    let result2: Vec<G1Projective> = ecntt_arc_naive(&result1, size, true);
    // The round trip must differ from the forward transform alone...
    assert!(!check_eq(&result2, &result1));
    // ...and must restore the original points exactly.
    assert!(check_eq(&result2, &points));
}
/// Checks the GPU MSM against arkworks' VariableBaseMSM for two input sizes.
#[test]
fn test_msm() {
    let test_sizes = [6, 9];
    for pow2 in test_sizes {
        let count = 1 << pow2;
        let seed = None; // set Some to provide seed
        let points = generate_random_points(count, get_rng(seed));
        let scalars = generate_random_scalars(count, get_rng(seed));
        let msm_result = msm(&points, &scalars, 0);
        // Reference result computed with ark over the same inputs.
        let point_r_ark: Vec<_> = points.iter().map(|x| x.to_ark_repr()).collect();
        let scalars_r_ark: Vec<_> = scalars.iter().map(|x| x.to_ark()).collect();
        let msm_result_ark = VariableBaseMSM::multi_scalar_mul(&point_r_ark, &scalars_r_ark);
        // Compare through each conversion path to catch conversion bugs too.
        assert_eq!(msm_result.to_ark_affine(), msm_result_ark);
        assert_eq!(msm_result.to_ark(), msm_result_ark);
        assert_eq!(
            msm_result.to_ark_affine(),
            Point::from_ark(msm_result_ark).to_ark_affine()
        );
    }
}
/// Checks the batched GPU MSM against per-chunk arkworks MSMs over a grid of
/// batch sizes and MSM sizes.
#[test]
fn test_batch_msm() {
    for batch_pow2 in [2, 4] {
        for pow2 in [4, 6] {
            let msm_size = 1 << pow2;
            let batch_size = 1 << batch_pow2;
            let seed = None; // set Some to provide seed
            let points_batch = generate_random_points(msm_size * batch_size, get_rng(seed));
            let scalars_batch = generate_random_scalars(msm_size * batch_size, get_rng(seed));
            let point_r_ark: Vec<_> = points_batch.iter().map(|x| x.to_ark_repr()).collect();
            let scalars_r_ark: Vec<_> = scalars_batch.iter().map(|x| x.to_ark()).collect();
            // Reference: one independent ark MSM per msm_size-sized chunk.
            let expected: Vec<_> = point_r_ark
                .chunks(msm_size)
                .zip(scalars_r_ark.chunks(msm_size))
                .map(|p| Point::from_ark(VariableBaseMSM::multi_scalar_mul(p.0, p.1)))
                .collect();
            let result = msm_batch(&points_batch, &scalars_batch, batch_size, 0);
            assert_eq!(result, expected);
        }
    }
}
/// Checks that the device-buffer `commit` agrees with the host-slice `msm`
/// on the same inputs, and that neither returns the zero point.
#[test]
fn test_commit() {
    let test_size = 1 << 8;
    let seed = Some(0);
    let (mut scalars, mut d_scalars, _) = set_up_scalars(test_size, 0, false);
    let mut points = generate_random_points(test_size, get_rng(seed));
    let mut d_points = DeviceBuffer::from_slice(&points[..]).unwrap();
    let msm_result = msm(&points, &scalars, 0);
    let mut d_commit_result = commit(&mut d_points, &mut d_scalars);
    // Copy the single device-side result point back to the host for comparison.
    let mut h_commit_result = Point::zero();
    d_commit_result.copy_to(&mut h_commit_result).unwrap();
    assert_eq!(msm_result, h_commit_result);
    // Guard against both paths trivially returning zero.
    assert_ne!(msm_result, Point::zero());
    assert_ne!(h_commit_result, Point::zero());
}
#[test]
fn test_batch_commit() {
let batch_size = 4;
let test_size = 1 << 12;
let seed = Some(0);
let (scalars, mut d_scalars, _) = set_up_scalars(test_size * batch_size, 0, false);
let points = generate_random_points(test_size * batch_size, get_rng(seed));
let mut d_points = DeviceBuffer::from_slice(&points[..]).unwrap();
let msm_result = msm_batch(&points, &scalars, batch_size, 0);
let mut d_commit_result = commit_batch(&mut d_points, &mut d_scalars, batch_size);
let mut h_commit_result: Vec<Point> = (0..batch_size).map(|_| Point::zero()).collect();
d_commit_result.copy_to(&mut h_commit_result[..]).unwrap();
assert_eq!(msm_result, h_commit_result);
for h in h_commit_result {
assert_ne!(h, Point::zero());
}
}
/// Round-trips NTT/iNTT on scalars and ECNTT/iECNTT on points, and
/// cross-checks the GPU ECNTT round trip against the naive ark reference.
#[test]
fn test_ntt() {
    //NTT: intt(ntt(x)) == x, and ntt(x) != x
    let seed = None; //some value to fix the rng
    let test_size = 1 << 3;
    let scalars = generate_random_scalars(test_size, get_rng(seed));
    let mut ntt_result = scalars.clone();
    ntt(&mut ntt_result, 0);
    assert_ne!(ntt_result, scalars);
    let mut intt_result = ntt_result.clone();
    intt(&mut intt_result, 0);
    assert_eq!(intt_result, scalars);
    //ECNTT: same round trip on EC points
    let points_proj = generate_random_points_proj(test_size, get_rng(seed));
    test_naive_ark_ecntt(test_size);
    // Sanity: generated points lie on the curve.
    assert!(points_proj[0].to_ark().into_affine().is_on_curve());
    //naive ark reference round trip
    let points_proj_ark = points_proj
        .iter()
        .map(|p| p.to_ark())
        .collect::<Vec<G1Projective>>();
    let ecntt_result_naive = ecntt_arc_naive(&points_proj_ark, points_proj_ark.len(), false);
    let iecntt_result_naive = ecntt_arc_naive(&ecntt_result_naive, points_proj_ark.len(), true);
    assert_eq!(points_proj_ark, iecntt_result_naive);
    //ingo gpu round trip
    let mut ecntt_result = points_proj.to_vec();
    ecntt(&mut ecntt_result, 0);
    assert_ne!(ecntt_result, points_proj);
    let mut iecntt_result = ecntt_result.clone();
    iecntt(&mut iecntt_result, 0);
    // Compare in affine form to ignore differing projective representations.
    assert_eq!(
        iecntt_result_naive,
        points_proj
            .iter()
            .map(|p| p.to_ark_affine())
            .collect::<Vec<G1Affine>>()
    );
    assert_eq!(
        iecntt_result
            .iter()
            .map(|p| p.to_ark_affine())
            .collect::<Vec<G1Affine>>(),
        points_proj
            .iter()
            .map(|p| p.to_ark_affine())
            .collect::<Vec<G1Affine>>()
    );
}
/// Checks that batched (i)NTT on scalars and batched (i)ECNTT on points agree
/// with running the non-batched transforms on each chunk, and that the
/// batched round trips restore the input.
#[test]
fn test_ntt_batch() {
    //NTT
    let seed = None; //some value to fix the rng
    let test_size = 1 << 5;
    let batches = 4;
    let scalars_batch: Vec<Scalar> =
        generate_random_scalars(test_size * batches, get_rng(seed));
    // Split the flat batch into per-chunk vectors for the reference path.
    let mut scalar_vec_of_vec: Vec<Vec<Scalar>> = Vec::new();
    for i in 0..batches {
        scalar_vec_of_vec.push(scalars_batch[i * test_size..(i + 1) * test_size].to_vec());
    }
    let mut ntt_result = scalars_batch.clone();
    // do batch ntt
    ntt_batch(&mut ntt_result, test_size, 0);
    let mut ntt_result_vec_of_vec = Vec::new();
    // do ntt for every chunk
    for i in 0..batches {
        ntt_result_vec_of_vec.push(scalar_vec_of_vec[i].clone());
        ntt(&mut ntt_result_vec_of_vec[i], 0);
    }
    // check that the ntt of each vec of scalars is equal to the intt of the specific batch
    for i in 0..batches {
        assert_eq!(
            ntt_result_vec_of_vec[i],
            ntt_result[i * test_size..(i + 1) * test_size]
        );
    }
    // check that ntt output is different from input
    assert_ne!(ntt_result, scalars_batch);
    let mut intt_result = ntt_result.clone();
    // do batch intt
    intt_batch(&mut intt_result, test_size, 0);
    let mut intt_result_vec_of_vec = Vec::new();
    // do intt for every chunk
    for i in 0..batches {
        intt_result_vec_of_vec.push(ntt_result_vec_of_vec[i].clone());
        intt(&mut intt_result_vec_of_vec[i], 0);
    }
    // check that the intt of each vec of scalars is equal to the intt of the specific batch
    for i in 0..batches {
        assert_eq!(
            intt_result_vec_of_vec[i],
            intt_result[i * test_size..(i + 1) * test_size]
        );
    }
    // Round trip: batch intt of batch ntt restores the original scalars.
    assert_eq!(intt_result, scalars_batch);
    // //ECNTT — same structure over EC points
    let points_proj = generate_random_points_proj(test_size * batches, get_rng(seed));
    let mut points_vec_of_vec: Vec<Vec<Point>> = Vec::new();
    for i in 0..batches {
        points_vec_of_vec.push(points_proj[i * test_size..(i + 1) * test_size].to_vec());
    }
    let mut ntt_result_points = points_proj.clone();
    // do batch ecintt
    ecntt_batch(&mut ntt_result_points, test_size, 0);
    let mut ntt_result_points_vec_of_vec = Vec::new();
    for i in 0..batches {
        ntt_result_points_vec_of_vec.push(points_vec_of_vec[i].clone());
        ecntt(&mut ntt_result_points_vec_of_vec[i], 0);
    }
    for i in 0..batches {
        assert_eq!(
            ntt_result_points_vec_of_vec[i],
            ntt_result_points[i * test_size..(i + 1) * test_size]
        );
    }
    assert_ne!(ntt_result_points, points_proj);
    let mut intt_result_points = ntt_result_points.clone();
    // do batch ecintt
    iecntt_batch(&mut intt_result_points, test_size, 0);
    let mut intt_result_points_vec_of_vec = Vec::new();
    // do ecintt for every chunk
    for i in 0..batches {
        intt_result_points_vec_of_vec.push(ntt_result_points_vec_of_vec[i].clone());
        iecntt(&mut intt_result_points_vec_of_vec[i], 0);
    }
    // check that the ecintt of each vec of scalars is equal to the intt of the specific batch
    for i in 0..batches {
        assert_eq!(
            intt_result_points_vec_of_vec[i],
            intt_result_points[i * test_size..(i + 1) * test_size]
        );
    }
    // Round trip: batch iecntt of batch ecntt restores the original points.
    assert_eq!(intt_result_points, points_proj);
}
/// Checks that device-side interpolation (bit-reverse + interpolate) matches
/// the host-side iNTT of the same evaluations.
#[test]
fn test_scalar_interpolation() {
    let log_test_size = 7;
    let test_size = 1 << log_test_size;
    let (mut evals_mut, mut d_evals, mut d_domain) = set_up_scalars(test_size, log_test_size, true);
    // The interpolate kernel expects bit-reversed input order.
    reverse_order_scalars(&mut d_evals);
    let mut d_coeffs = interpolate_scalars(&mut d_evals, &mut d_domain);
    // Reference: host-side inverse NTT of the same evaluations.
    intt(&mut evals_mut, 0);
    let mut h_coeffs: Vec<Scalar> = (0..test_size).map(|_| Scalar::zero()).collect();
    d_coeffs.copy_to(&mut h_coeffs[..]).unwrap();
    assert_eq!(h_coeffs, evals_mut);
}
/// Batched version of `test_scalar_interpolation`: device-side batched
/// interpolation must match the host-side batched iNTT.
#[test]
fn test_scalar_batch_interpolation() {
    let batch_size = 4;
    let log_test_size = 10;
    let test_size = 1 << log_test_size;
    let (mut evals_mut, mut d_evals, mut d_domain) = set_up_scalars(test_size * batch_size, log_test_size, true);
    // The interpolate kernel expects bit-reversed input order per batch.
    reverse_order_scalars_batch(&mut d_evals, batch_size);
    let mut d_coeffs = interpolate_scalars_batch(&mut d_evals, &mut d_domain, batch_size);
    // Reference: host-side batched inverse NTT of the same evaluations.
    intt_batch(&mut evals_mut, test_size, 0);
    let mut h_coeffs: Vec<Scalar> = (0..test_size * batch_size).map(|_| Scalar::zero()).collect();
    d_coeffs.copy_to(&mut h_coeffs[..]).unwrap();
    assert_eq!(h_coeffs, evals_mut);
}
/// Checks that device-side point interpolation matches the host-side iECNTT,
/// and that the resulting coefficients are non-trivial.
#[test]
fn test_point_interpolation() {
    let log_test_size = 6;
    let test_size = 1 << log_test_size;
    let (mut evals_mut, mut d_evals, mut d_domain) = set_up_points(test_size, log_test_size, true);
    // The interpolate kernel expects bit-reversed input order.
    reverse_order_points(&mut d_evals);
    let mut d_coeffs = interpolate_points(&mut d_evals, &mut d_domain);
    // Reference: host-side inverse ECNTT of the same evaluations.
    iecntt(&mut evals_mut[..], 0);
    let mut h_coeffs: Vec<Point> = (0..test_size).map(|_| Point::zero()).collect();
    d_coeffs.copy_to(&mut h_coeffs[..]).unwrap();
    assert_eq!(h_coeffs, *evals_mut);
    // Guard against a kernel that writes all zeros.
    for h in h_coeffs.iter() {
        assert_ne!(*h, Point::zero());
    }
}
/// Batched version of `test_point_interpolation`: device-side batched point
/// interpolation must match the host-side batched iECNTT.
#[test]
fn test_point_batch_interpolation() {
    let batch_size = 4;
    let log_test_size = 6;
    let test_size = 1 << log_test_size;
    let (mut evals_mut, mut d_evals, mut d_domain) = set_up_points(test_size * batch_size, log_test_size, true);
    // The interpolate kernel expects bit-reversed input order per batch.
    reverse_order_points_batch(&mut d_evals, batch_size);
    let mut d_coeffs = interpolate_points_batch(&mut d_evals, &mut d_domain, batch_size);
    // Reference: host-side batched inverse ECNTT of the same evaluations.
    iecntt_batch(&mut evals_mut[..], test_size, 0);
    let mut h_coeffs: Vec<Point> = (0..test_size * batch_size).map(|_| Point::zero()).collect();
    d_coeffs.copy_to(&mut h_coeffs[..]).unwrap();
    assert_eq!(h_coeffs, *evals_mut);
    // Guard against a kernel that writes all zeros.
    for h in h_coeffs.iter() {
        assert_ne!(*h, Point::zero());
    }
}
/// Evaluates a low-degree polynomial on a larger domain, interpolates back,
/// and checks that the original coefficients are recovered and the remaining
/// (higher-degree) coefficients are zero.
#[test]
fn test_scalar_evaluation() {
    let log_test_domain_size = 8;
    let coeff_size = 1 << 6; // polynomial is shorter than the domain
    let (h_coeffs, mut d_coeffs, mut d_domain) = set_up_scalars(coeff_size, log_test_domain_size, false);
    // Inverse-root domain of the same size, for interpolating back.
    let (_, _, mut d_domain_inv) = set_up_scalars(0, log_test_domain_size, true);
    let mut d_evals = evaluate_scalars(&mut d_coeffs, &mut d_domain);
    let mut d_coeffs_domain = interpolate_scalars(&mut d_evals, &mut d_domain_inv);
    let mut h_coeffs_domain: Vec<Scalar> = (0..1 << log_test_domain_size).map(|_| Scalar::zero()).collect();
    d_coeffs_domain.copy_to(&mut h_coeffs_domain[..]).unwrap();
    // Low coefficients round-trip; the padding coefficients must be zero.
    assert_eq!(h_coeffs, h_coeffs_domain[..coeff_size]);
    for i in coeff_size.. (1 << log_test_domain_size) {
        assert_eq!(Scalar::zero(), h_coeffs_domain[i]);
    }
}
/// Batched version of `test_scalar_evaluation`: evaluate-then-interpolate per
/// batch element must recover each chunk's coefficients with zero padding.
#[test]
fn test_scalar_batch_evaluation() {
    let batch_size = 6;
    let log_test_domain_size = 8;
    let domain_size = 1 << log_test_domain_size;
    let coeff_size = 1 << 6; // each polynomial is shorter than the domain
    let (h_coeffs, mut d_coeffs, mut d_domain) = set_up_scalars(coeff_size * batch_size, log_test_domain_size, false);
    // Inverse-root domain of the same size, for interpolating back.
    let (_, _, mut d_domain_inv) = set_up_scalars(0, log_test_domain_size, true);
    let mut d_evals = evaluate_scalars_batch(&mut d_coeffs, &mut d_domain, batch_size);
    let mut d_coeffs_domain = interpolate_scalars_batch(&mut d_evals, &mut d_domain_inv, batch_size);
    let mut h_coeffs_domain: Vec<Scalar> = (0..domain_size * batch_size).map(|_| Scalar::zero()).collect();
    d_coeffs_domain.copy_to(&mut h_coeffs_domain[..]).unwrap();
    // Per batch element: low coefficients round-trip, padding is zero.
    for j in 0..batch_size {
        assert_eq!(h_coeffs[j * coeff_size..(j + 1) * coeff_size], h_coeffs_domain[j * domain_size..j * domain_size + coeff_size]);
        for i in coeff_size..domain_size {
            assert_eq!(Scalar::zero(), h_coeffs_domain[j * domain_size + i]);
        }
    }
}
/// Point version of `test_scalar_evaluation`: evaluate-then-interpolate must
/// recover the point coefficients (zero padding above the degree) and the
/// recovered coefficients must be non-trivial.
#[test]
fn test_point_evaluation() {
    let log_test_domain_size = 7;
    let coeff_size = 1 << 7; // polynomial exactly fills the domain here
    let (h_coeffs, mut d_coeffs, mut d_domain) = set_up_points(coeff_size, log_test_domain_size, false);
    // Inverse-root domain of the same size, for interpolating back.
    let (_, _, mut d_domain_inv) = set_up_points(0, log_test_domain_size, true);
    let mut d_evals = evaluate_points(&mut d_coeffs, &mut d_domain);
    let mut d_coeffs_domain = interpolate_points(&mut d_evals, &mut d_domain_inv);
    let mut h_coeffs_domain: Vec<Point> = (0..1 << log_test_domain_size).map(|_| Point::zero()).collect();
    d_coeffs_domain.copy_to(&mut h_coeffs_domain[..]).unwrap();
    assert_eq!(h_coeffs[..], h_coeffs_domain[..coeff_size]);
    // Padding (empty range when coeff_size == domain size) must be zero.
    for i in coeff_size..(1 << log_test_domain_size) {
        assert_eq!(Point::zero(), h_coeffs_domain[i]);
    }
    // Guard against trivially-zero results.
    for i in 0..coeff_size {
        assert_ne!(h_coeffs_domain[i], Point::zero());
    }
}
/// Batched point version of the evaluate-then-interpolate round trip: each
/// batch element's coefficients are recovered, padding is zero, and the
/// recovered coefficients are non-trivial.
#[test]
fn test_point_batch_evaluation() {
    let batch_size = 4;
    let log_test_domain_size = 6;
    let domain_size = 1 << log_test_domain_size;
    let coeff_size = 1 << 5; // each polynomial is shorter than the domain
    let (h_coeffs, mut d_coeffs, mut d_domain) = set_up_points(coeff_size * batch_size, log_test_domain_size, false);
    // Inverse-root domain of the same size, for interpolating back.
    let (_, _, mut d_domain_inv) = set_up_points(0, log_test_domain_size, true);
    let mut d_evals = evaluate_points_batch(&mut d_coeffs, &mut d_domain, batch_size);
    let mut d_coeffs_domain = interpolate_points_batch(&mut d_evals, &mut d_domain_inv, batch_size);
    let mut h_coeffs_domain: Vec<Point> = (0..domain_size * batch_size).map(|_| Point::zero()).collect();
    d_coeffs_domain.copy_to(&mut h_coeffs_domain[..]).unwrap();
    for j in 0..batch_size {
        // Low coefficients round-trip per batch element.
        assert_eq!(h_coeffs[j * coeff_size..(j + 1) * coeff_size], h_coeffs_domain[j * domain_size..(j * domain_size + coeff_size)]);
        // Padding must be zero...
        for i in coeff_size..domain_size {
            assert_eq!(Point::zero(), h_coeffs_domain[j * domain_size + i]);
        }
        // ...and the real coefficients must be non-zero.
        for i in j * domain_size..(j * domain_size + coeff_size) {
            assert_ne!(h_coeffs_domain[i], Point::zero());
        }
    }
}
// checks that the evaluations on the subgroup is the same as on the coset generated by 1
#[test]
fn test_scalar_evaluation_on_trivial_coset() {
    let log_test_domain_size = 8;
    let coeff_size = 1 << 6;
    let (_, mut d_coeffs, mut d_domain) = set_up_scalars(coeff_size, log_test_domain_size, false);
    let (_, _, mut d_domain_inv) = set_up_scalars(coeff_size, log_test_domain_size, true);
    // logn = 0 makes every coset power equal to 1 (the trivial coset).
    let mut d_trivial_coset_powers = build_domain(1 << log_test_domain_size, 0, false);
    let mut d_evals = evaluate_scalars(&mut d_coeffs, &mut d_domain);
    let mut h_coeffs: Vec<Scalar> = (0..1 << log_test_domain_size).map(|_| Scalar::zero()).collect();
    d_evals.copy_to(&mut h_coeffs[..]).unwrap();
    let mut d_evals_coset = evaluate_scalars_on_coset(&mut d_coeffs, &mut d_domain, &mut d_trivial_coset_powers);
    let mut h_evals_coset: Vec<Scalar> = (0..1 << log_test_domain_size).map(|_| Scalar::zero()).collect();
    d_evals_coset.copy_to(&mut h_evals_coset[..]).unwrap();
    // Coset-by-1 evaluation must match plain subgroup evaluation exactly.
    assert_eq!(h_coeffs, h_evals_coset);
}
// checks that evaluating a polynomial on a subgroup and its coset is the same as evaluating on a 2x larger subgroup
#[test]
fn test_scalar_evaluation_on_coset() {
    let log_test_size = 8;
    let test_size = 1 << log_test_size;
    let (_, mut d_coeffs, mut d_domain) = set_up_scalars(test_size, log_test_size, false);
    // Domain twice as large: its even entries form the subgroup, odd the coset.
    let (_, _, mut d_large_domain) = set_up_scalars(0, log_test_size + 1, false);
    let mut d_coset_powers = build_domain(test_size, log_test_size + 1, false);
    let mut d_evals_large = evaluate_scalars(&mut d_coeffs, &mut d_large_domain);
    let mut h_evals_large: Vec<Scalar> = (0..2 * test_size).map(|_| Scalar::zero()).collect();
    d_evals_large.copy_to(&mut h_evals_large[..]).unwrap();
    let mut d_evals = evaluate_scalars(&mut d_coeffs, &mut d_domain);
    let mut h_evals: Vec<Scalar> = (0..test_size).map(|_| Scalar::zero()).collect();
    d_evals.copy_to(&mut h_evals[..]).unwrap();
    let mut d_evals_coset = evaluate_scalars_on_coset(&mut d_coeffs, &mut d_domain, &mut d_coset_powers);
    let mut h_evals_coset: Vec<Scalar> = (0..test_size).map(|_| Scalar::zero()).collect();
    d_evals_coset.copy_to(&mut h_evals_coset[..]).unwrap();
    // First half of the large evaluation equals subgroup evals,
    // second half equals coset evals (this encodes the large domain's layout).
    assert_eq!(h_evals[..], h_evals_large[..test_size]);
    assert_eq!(h_evals_coset[..], h_evals_large[test_size..2 * test_size]);
}
// checks that evaluating a polynomial on a subgroup and its coset is the same as evaluating on a 2x larger subgroup
#[test]
fn test_scalar_batch_evaluation_on_coset() {
    let batch_size = 4;
    let log_test_size = 6;
    let test_size = 1 << log_test_size;
    let (_, mut d_coeffs, mut d_domain) = set_up_scalars(test_size * batch_size, log_test_size, false);
    // Domain twice as large: per batch element, its halves correspond to
    // subgroup and coset evaluations respectively.
    let (_, _, mut d_large_domain) = set_up_scalars(0, log_test_size + 1, false);
    let mut d_coset_powers = build_domain(test_size, log_test_size + 1, false);
    let mut d_evals_large = evaluate_scalars_batch(&mut d_coeffs, &mut d_large_domain, batch_size);
    let mut h_evals_large: Vec<Scalar> = (0..2 * test_size * batch_size).map(|_| Scalar::zero()).collect();
    d_evals_large.copy_to(&mut h_evals_large[..]).unwrap();
    let mut d_evals = evaluate_scalars_batch(&mut d_coeffs, &mut d_domain, batch_size);
    let mut h_evals: Vec<Scalar> = (0..test_size * batch_size).map(|_| Scalar::zero()).collect();
    d_evals.copy_to(&mut h_evals[..]).unwrap();
    let mut d_evals_coset = evaluate_scalars_on_coset_batch(&mut d_coeffs, &mut d_domain, batch_size, &mut d_coset_powers);
    let mut h_evals_coset: Vec<Scalar> = (0..test_size * batch_size).map(|_| Scalar::zero()).collect();
    d_evals_coset.copy_to(&mut h_evals_coset[..]).unwrap();
    // Per batch element: large-domain halves match subgroup and coset evals.
    for i in 0..batch_size {
        assert_eq!(h_evals_large[2 * i * test_size..(2 * i + 1) * test_size], h_evals[i * test_size..(i + 1) * test_size]);
        assert_eq!(h_evals_large[(2 * i + 1) * test_size..(2 * i + 2) * test_size], h_evals_coset[i * test_size..(i + 1) * test_size]);
    }
}
// checks that evaluating a polynomial on a subgroup and its coset is the same as evaluating on a 2x larger subgroup
#[test]
fn test_point_evaluation_on_coset() {
    let log_test_size = 8;
    let test_size = 1 << log_test_size;
    let (_, mut d_coeffs, mut d_domain) = set_up_points(test_size, log_test_size, false);
    // Domain twice as large: its halves correspond to subgroup and coset evals.
    let (_, _, mut d_large_domain) = set_up_points(0, log_test_size + 1, false);
    let mut d_coset_powers = build_domain(test_size, log_test_size + 1, false);
    let mut d_evals_large = evaluate_points(&mut d_coeffs, &mut d_large_domain);
    let mut h_evals_large: Vec<Point> = (0..2 * test_size).map(|_| Point::zero()).collect();
    d_evals_large.copy_to(&mut h_evals_large[..]).unwrap();
    let mut d_evals = evaluate_points(&mut d_coeffs, &mut d_domain);
    let mut h_evals: Vec<Point> = (0..test_size).map(|_| Point::zero()).collect();
    d_evals.copy_to(&mut h_evals[..]).unwrap();
    let mut d_evals_coset = evaluate_points_on_coset(&mut d_coeffs, &mut d_domain, &mut d_coset_powers);
    let mut h_evals_coset: Vec<Point> = (0..test_size).map(|_| Point::zero()).collect();
    d_evals_coset.copy_to(&mut h_evals_coset[..]).unwrap();
    // Large-domain halves match subgroup and coset evaluations.
    assert_eq!(h_evals[..], h_evals_large[..test_size]);
    assert_eq!(h_evals_coset[..], h_evals_large[test_size..2 * test_size]);
    // Guard against trivially-zero evaluations on every path.
    for i in 0..test_size {
        assert_ne!(h_evals[i], Point::zero());
        assert_ne!(h_evals_coset[i], Point::zero());
        assert_ne!(h_evals_large[2 * i], Point::zero());
        assert_ne!(h_evals_large[2 * i + 1], Point::zero());
    }
}
#[test]
fn test_point_batch_evaluation_on_coset() {
    // checks that evaluating a polynomial on a subgroup and its coset is the same as evaluating on a 2x larger subgroup
    let batch_size = 2;
    let log_test_size = 6;
    let test_size = 1 << log_test_size;
    // `batch_size` polynomials of `test_size` coefficients each, sharing one size-`test_size` domain.
    let (_, mut d_coeffs, mut d_domain) = set_up_points(test_size * batch_size, log_test_size, false);
    // A domain twice as large (2 * test_size); the point data itself is unused (size 0).
    let (_, _, mut d_large_domain) = set_up_points(0, log_test_size + 1, false);
    // Powers used to shift evaluation onto the coset.
    // NOTE(review): inferred from its use in evaluate_points_on_coset_batch — confirm build_domain's contract.
    let mut d_coset_powers = build_domain(test_size, log_test_size + 1, false);
    // Batched evaluation on the 2x domain; results are laid out batch-major on the host.
    let mut d_evals_large = evaluate_points_batch(&mut d_coeffs, &mut d_large_domain, batch_size);
    let mut h_evals_large: Vec<Point> = (0..2 * test_size * batch_size).map(|_| Point::zero()).collect();
    d_evals_large.copy_to(&mut h_evals_large[..]).unwrap();
    // Batched evaluation on the size-`test_size` subgroup.
    let mut d_evals = evaluate_points_batch(&mut d_coeffs, &mut d_domain, batch_size);
    let mut h_evals: Vec<Point> = (0..test_size * batch_size).map(|_| Point::zero()).collect();
    d_evals.copy_to(&mut h_evals[..]).unwrap();
    // Batched evaluation on the coset of that subgroup.
    let mut d_evals_coset = evaluate_points_on_coset_batch(&mut d_coeffs, &mut d_domain, batch_size, &mut d_coset_powers);
    let mut h_evals_coset: Vec<Point> = (0..test_size * batch_size).map(|_| Point::zero()).collect();
    d_evals_coset.copy_to(&mut h_evals_coset[..]).unwrap();
    // Per batch element: the first test_size large-domain entries must equal the subgroup
    // evaluations and the next test_size entries the coset evaluations.
    for i in 0..batch_size {
        assert_eq!(h_evals_large[2 * i * test_size..(2 * i + 1) * test_size], h_evals[i * test_size..(i + 1) * test_size]);
        assert_eq!(h_evals_large[(2 * i + 1) * test_size..(2 * i + 2) * test_size], h_evals_coset[i * test_size..(i + 1) * test_size]);
    }
    // Sanity: no evaluation degenerated to the identity point.
    for i in 0..test_size * batch_size {
        assert_ne!(h_evals[i], Point::zero());
        assert_ne!(h_evals_coset[i], Point::zero());
        assert_ne!(h_evals_large[2 * i], Point::zero());
        assert_ne!(h_evals_large[2 * i + 1], Point::zero());
    }
}
// Validates the FFT against a naive DFT: multiplying the input vector by the
// flattened DFT matrix must agree with `ntt` on the entries that are invariant
// under the root-of-unity permutation.
#[test]
fn test_matrix_multiplication() {
    let seed = None; // pass Some(value) to make the rng deterministic
    let test_size = 1 << 5;
    let omega = Fr::get_root_of_unity(test_size).unwrap();
    // Row-major flattened DFT matrix: entry (r, c) = omega^(r * c).
    let dft_matrix: Vec<Scalar> = (0..test_size)
        .flat_map(|r| {
            (0..test_size)
                .map(|c| {
                    let exponent: [u64; 1] = [(r * c).try_into().unwrap()];
                    Scalar::from_ark(Fr::pow(&omega, &exponent).into_repr())
                })
                .collect::<Vec<Scalar>>()
        })
        .collect();
    let input: Vec<Scalar> = generate_random_scalars(test_size, get_rng(seed));
    let dft_result = mult_matrix_by_vec(&dft_matrix, &input, 0);
    let mut fft_result = input.clone();
    ntt(&mut fft_result, 0);
    // Our roots of unity differ from arkworks', so the two outputs are permutations
    // of one another; only indices 0 and test_size/2 are guaranteed to coincide.
    assert_eq!(dft_result[0], fft_result[0]);
    assert_eq!(dft_result[test_size >> 1], fft_result[test_size >> 1]);
}
// In-place element-wise scalar multiplication: [1, 1, 0] .* [1, 0, 0] == [1, 0, 0].
#[test]
#[allow(non_snake_case)]
fn test_vec_scalar_mul() {
    let expected = [Scalar::one(), Scalar::zero(), Scalar::zero()];
    let mut product = [Scalar::one(), Scalar::one(), Scalar::zero()];
    // `expected` doubles as the multiplier vector; the result overwrites `product`.
    mult_sc_vec(&mut product, &expected, 0);
    assert_eq!(product, expected);
}
// In-place point-by-scalar vector multiplication: multiplying by the scalar one
// keeps a point, multiplying by zero collapses it to the identity.
#[test]
#[allow(non_snake_case)]
fn test_vec_point_mul() {
    // A non-identity placeholder point with every coordinate set to one.
    let unit_point = Point {
        x: Base::one(),
        y: Base::one(),
        z: Base::one(),
    };
    let mut points = [unit_point, unit_point, Point::zero()];
    let multipliers = [Scalar::one(), Scalar::zero(), Scalar::zero()];
    let expected = [unit_point, Point::zero(), Point::zero()];
    multp_vec(&mut points, &multipliers, 0);
    assert_eq!(points, expected);
}
}

View File

@@ -0,0 +1,13 @@
{
"curve_name" : "bls12_377",
"modolus_p" : 8444461749428370424248824938781546531375899335154063827935233455917409239041,
"bit_count_p" : 253,
"limb_p" : 8,
"ntt_size" : 32,
"modolus_q" : 258664426012969094010652733694893533536393512754914660539884262666720468348340822774968888139573360124440321458177,
"bit_count_q" : 377,
"limb_q" : 12,
"weierstrass_b" : 1,
"gen_x" : 81937999373150964239938255573465948239988671502647976594219695644855304257327692006745978603320413799295628339695,
"gen_y" : 241266749859715473739788878240585681733927191168601896383759122102112907357779751001206799952863815012735208165030
}

View File

@@ -0,0 +1,13 @@
{
"curve_name" : "bls12_381",
"modolus_p" : 52435875175126190479447740508185965837690552500527637822603658699938581184513,
"bit_count_p" : 255,
"limb_p" : 8,
"ntt_size" : 32,
"modolus_q" : 4002409555221667393417789825735904156556882819939007885332058136124031650490837864442687629129015664037894272559787,
"bit_count_q" : 381,
"limb_q" : 12,
"weierstrass_b" : 4,
"gen_x" : 3685416753713387016781088315183077757961620795782546409894578378688607592378376318836054947676345821548104185464507,
"gen_y" : 1339506544944476473020471379941921221584933875938349620426543736416511423956333506472724655353366534992391756441569
}

View File

@@ -0,0 +1,13 @@
{
"curve_name" : "bn254",
"modolus_p" : 21888242871839275222246405745257275088548364400416034343698204186575808495617,
"bit_count_p" : 254,
"limb_p" : 8,
"ntt_size" : 16,
"modolus_q" : 21888242871839275222246405745257275088696311157297823662689037894645226208583,
"bit_count_q" : 254,
"limb_q" : 8,
"weierstrass_b" : 3,
"gen_x" : 1,
"gen_y" : 2
}

View File

@@ -0,0 +1,203 @@
import json
import math
import os
from sympy.ntheory import isprime, primitive_root
import subprocess
import random
import sys

# Read the curve description JSON whose path is given as the first CLI argument
# (see the bn254 / bls12_377 / bls12_381 config files in this repo).
data = None
with open(sys.argv[1]) as json_file:
    data = json.load(json_file)

# Curve parameters. The "modolus" spelling matches the JSON key names — keep in sync.
curve_name = data["curve_name"]
modolus_p = data["modolus_p"]          # scalar (fp) field modulus
bit_count_p = data["bit_count_p"]      # bit length of modolus_p
limb_p = data["limb_p"]                # fp limb count (32-bit words, matching to_hex's 8-hex-digit chunks)
ntt_size = data["ntt_size"]            # number of power-of-two NTT domains to precompute roots for
modolus_q = data["modolus_q"]          # base (fq) field modulus
bit_count_q = data["bit_count_q"]      # bit length of modolus_q
limb_q = data["limb_q"]                # fq limb count (32-bit words)
weierstrass_b = data["weierstrass_b"]  # curve coefficient b (name suggests short Weierstrass form y^2 = x^3 + b)
gen_x = data["gen_x"]                  # group generator, affine x coordinate
gen_y = data["gen_y"]                  # group generator, affine y coordinate
def to_hex(val, length):
    """Render `val` as comma-separated little-endian 32-bit hex words.

    Each word is a `0x`-prefixed 8-hex-digit chunk; the least-significant word
    comes first and the result keeps a trailing ", ". `length` is the desired
    total number of hex digits: shorter values are zero-extended, while values
    already wider than `length` are kept intact (never truncated).
    """
    digits = format(val, "x")
    # Round the digit count up to a whole number of 8-digit (32-bit) words.
    digits = digits.zfill(len(digits) + (-len(digits)) % 8)
    # Zero-extend to the requested width; no-op when already at least that wide.
    if len(digits) < length:
        digits = digits.zfill(length)
    words = [digits[i:i + 8] for i in range(0, len(digits), 8)]
    return "".join("0x" + word + ", " for word in reversed(words))
def get_root_of_unity(order: int) -> int:
    # Returns an element of multiplicative order dividing `order` in the scalar field.
    # `order` must divide p - 1 for a subgroup of that size to exist.
    assert (modolus_p - 1) % order == 0
    # NOTE(review): 5^((p-1)/order) has order exactly `order` only if 5 is not an
    # `order`-th power residue mod p — this holds for the shipped curve configs,
    # but confirm before adding a new curve (sympy's primitive_root is imported
    # above and could be used instead of the hard-coded base 5).
    return pow(5, (modolus_p - 1) // order, modolus_p)
def create_field_parameters_struct(modulus, modulus_bits_count,limbs,ntt,size,name):
    """Generate the C++ source of a field-parameter struct named `name`.

    Emits `static constexpr storage<...>` constants (modulus and small multiples,
    wide/squared variants, one, zero) and, when `ntt` is true, `size` pairs of
    root-of-unity / inverse-root / power-of-two-inverse tables for NTT domains
    of size 2^1 .. 2^size. The exact identifier spellings (including the (sic)
    "modulus_sqared") are part of the generated API consumed by the CUDA field
    code — do not rename here without renaming the consumers.
    """
    s = " struct "+name+"{\n"
    s += " static constexpr unsigned limbs_count = " + str(limbs)+";\n"
    s += " static constexpr storage<limbs_count> modulus = {"+to_hex(modulus,8*limbs)[:-2]+"};\n"
    s += " static constexpr storage<limbs_count> modulus_2 = {"+to_hex(modulus*2,8*limbs)[:-2]+"};\n"
    s += " static constexpr storage<limbs_count> modulus_4 = {"+to_hex(modulus*4,8*limbs)[:-2]+"};\n"
    s += " static constexpr storage<2*limbs_count> modulus_wide = {"+to_hex(modulus,8*limbs*2)[:-2]+"};\n"
    # The squared values below are passed the single-width digit count (8*limbs);
    # to_hex never truncates, so the naturally double-width value still produces
    # the 2*limbs_count words the storage type expects.
    s += " static constexpr storage<2*limbs_count> modulus_sqared = {"+to_hex(modulus*modulus,8*limbs)[:-2]+"};\n"
    s += " static constexpr storage<2*limbs_count> modulus_sqared_2 = {"+to_hex(modulus*modulus*2,8*limbs)[:-2]+"};\n"
    s += " static constexpr storage<2*limbs_count> modulus_sqared_4 = {"+to_hex(modulus*modulus*2*2,8*limbs)[:-2]+"};\n"
    s += " static constexpr unsigned modulus_bits_count = "+str(modulus_bits_count)+";\n"
    # m = floor(2^(2*bits) / modulus) — presumably the Barrett reduction constant;
    # confirm against the CUDA field implementation that consumes it.
    m = int(math.floor(int(pow(2,2*modulus_bits_count) // modulus)))
    s += " static constexpr storage<limbs_count> m = {"+ to_hex(m,8*limbs)[:-2] +"};\n"
    s += " static constexpr storage<limbs_count> one = {"+ to_hex(1,8*limbs)[:-2] +"};\n"
    s += " static constexpr storage<limbs_count> zero = {"+ to_hex(0,8*limbs)[:-2] +"};\n"
    if ntt:
        # Roots of unity omega_k of order 2^(k+1), their inverses, and the
        # inverses of 2^(k+1) (used for the final NTT scaling).
        for k in range(size):
            omega = get_root_of_unity(int(pow(2,k+1)))
            s += " static constexpr storage<limbs_count> omega"+str(k+1)+"= {"+ to_hex(omega,8*limbs)[:-2]+"};\n"
        for k in range(size):
            omega = get_root_of_unity(int(pow(2,k+1)))
            s += " static constexpr storage<limbs_count> omega_inv"+str(k+1)+"= {"+ to_hex(pow(omega, -1, modulus),8*limbs)[:-2]+"};\n"
        for k in range(size):
            s += " static constexpr storage<limbs_count> inv"+str(k+1)+"= {"+ to_hex(pow(int(pow(2,k+1)), -1, modulus),8*limbs)[:-2]+"};\n"
    s+=" };\n"
    return s
def create_gen():
    """Emit the C++ `group_generator` struct holding the curve generator's
    affine coordinates, encoded as base-field (`fq_config`) storage words."""
    lines = [
        " struct group_generator {\n",
        " static constexpr storage<fq_config::limbs_count> generator_x = {" + to_hex(gen_x, 8 * limb_q)[:-2] + "};\n",
        " static constexpr storage<fq_config::limbs_count> generator_y = {" + to_hex(gen_y, 8 * limb_q)[:-2] + "};\n",
        " };\n",
    ]
    return "".join(lines)
def get_config_file_content(modolus_p, bit_count_p, limb_p, ntt_size, modolus_q, bit_count_q, limb_q, weierstrass_b):
    """Assemble the complete text of the curve's generated params.cuh:
    a PARAMS_<CURVE> namespace with the scalar field (fp, NTT-enabled),
    the base field (fq), the Weierstrass b coefficient and the generator."""
    parts = [
        "#pragma once\n#include \"../../utils/storage.cuh\"\n",
        "namespace PARAMS_" + curve_name.upper() + "{\n",
        create_field_parameters_struct(modolus_p, bit_count_p, limb_p, True, ntt_size, "fp_config"),
        create_field_parameters_struct(modolus_q, bit_count_q, limb_q, False, 0, "fq_config"),
        " static constexpr unsigned weierstrass_b = " + str(weierstrass_b) + ";\n",
        create_gen(),
        "}\n",
    ]
    return "".join(parts)
# ---------------------------------------------------------------------------
# Generate the CUDA and Rust interfaces for the configured curve.
# ---------------------------------------------------------------------------

def _render_template(template_path, output_path, replacements):
    """Copy a template file to `output_path`, applying `replacements` in order."""
    with open(template_path, "r") as template_file:
        content = template_file.read()
    # Order matters: e.g. "_limbs_p" must be replaced before its substring "limbs_p".
    for placeholder, value in replacements.items():
        content = content.replace(placeholder, value)
    with open(output_path, "w") as output_file:
        output_file.write(content)

# Substitutions shared by every curve template.
curve_replacements = {
    "CURVE_NAME_U": curve_name.upper(),
    "CURVE_NAME_L": curve_name.lower(),
}

# Create Cuda interface
newpath = "./icicle-cuda/curves/" + curve_name
os.makedirs(newpath, exist_ok=True)

# params.cuh: all field/curve constants for this curve.
fc = get_config_file_content(modolus_p, bit_count_p, limb_p, ntt_size, modolus_q, bit_count_q, limb_q, weierstrass_b)
with open(newpath + "/params.cuh", "w") as params_file:
    params_file.write(fc)

# lde.cu / msm.cu / ve_mod_mult.cu differ only by the curve-name substitution.
for unit in ("lde", "msm", "ve_mod_mult"):
    _render_template(
        "./icicle-cuda/curves/curve_template/" + unit + ".cu",
        newpath + "/" + unit + ".cu",
        curve_replacements,
    )

# curve_config.cuh: typedefs tying the params to the generic field/curve templates.
# (The original template carried stray "\" line continuations that glued three
# typedefs onto a single generated line; each typedef now gets its own line.)
namespace = '#include "params.cuh"\n' + '''namespace CURVE_NAME_U {
typedef Field<PARAMS_CURVE_NAME_U::fp_config> scalar_field_t;
typedef scalar_field_t scalar_t;
typedef Field<PARAMS_CURVE_NAME_U::fq_config> point_field_t;
typedef Projective<point_field_t, scalar_field_t, PARAMS_CURVE_NAME_U::group_generator, PARAMS_CURVE_NAME_U::weierstrass_b> projective_t;
typedef Affine<point_field_t> affine_t;
}'''
with open(newpath + '/curve_config.cuh', 'w') as f:
    f.write(namespace.replace("CURVE_NAME_U", curve_name.upper()))

# projective.cu: the extern "C" point-equality entry point.
eq = '''
#include <cuda.h>\n
#include "curve_config.cuh"\n
#include "../../primitives/projective.cuh"\n
extern "C" bool eq_CURVE_NAME_L(CURVE_NAME_U::projective_t *point1, CURVE_NAME_U::projective_t *point2)
{
  return (*point1 == *point2);
}'''
with open(newpath + '/projective.cu', 'w') as f:
    f.write(eq.replace("CURVE_NAME_U", curve_name.upper()).replace("CURVE_NAME_L", curve_name.lower()))

# supported_operations.cu: umbrella include for all of the curve's kernels.
# (No placeholders to substitute — written verbatim.)
supported_operations = '''
#include "projective.cu"
#include "lde.cu"
#include "msm.cu"
#include "ve_mod_mult.cu"
'''
with open(newpath + '/supported_operations.cu', 'w') as f:
    f.write(supported_operations)

# Register the new curve in the top-level CUDA index.
with open('./icicle-cuda/curves/index.cu', 'a') as f:
    f.write('\n#include "' + curve_name.lower() + '/supported_operations.cu"')

# Create Rust interface and tests
rust_replacements = dict(curve_replacements)
rust_replacements["_limbs_p"] = str(limb_p * 8 * 4)  # bit width of fp (limbs * 4 bytes * 8 bits)
rust_replacements["limbs_p"] = str(limb_p)
if limb_p == limb_q:
    # fp and fq share a limb count, so one template covers both fields.
    rust_template = "./src/curve_templates/curve_same_limbs.rs"
else:
    rust_template = "./src/curve_templates/curve_different_limbs.rs"
    rust_replacements["_limbs_q"] = str(limb_q * 8 * 4)
    rust_replacements["limbs_q"] = str(limb_q)
_render_template(rust_template, "./src/curves/" + curve_name + ".rs", rust_replacements)

_render_template("./src/curve_templates/test.rs", "./src/test_" + curve_name + ".rs", curve_replacements)

# Register the new modules in the Rust crate.
with open('./src/curves/mod.rs', 'a') as f:
    f.write('\n pub mod ' + curve_name + ';')
with open('./src/lib.rs', 'a') as f:
    f.write('\npub mod ' + curve_name + ';')

View File

@@ -1 +0,0 @@
ICICLE

17
docs/.gitignore vendored
View File

@@ -1,17 +0,0 @@
.docusaurus/
node_modules/
yarn.lock
.DS_Store
# tex build artifacts
.aux
.bbl
.bcf
.blg
.fdb_latexmk
.fls
.log
.out
.xml
.gz
.toc

View File

@@ -1,17 +0,0 @@
.docusaurus/
node_modules/
yarn.lock
.DS_Store
# tex build artifacts
.aux
.bbl
.bcf
.blg
.fdb_latexmk
.fls
.log
.out
.xml
.gz
.toc

View File

@@ -1,10 +0,0 @@
{
"semi": false,
"singleQuote": true,
"trailingComma": "es5",
"printWidth": 80,
"tabWidth": 2,
"useTabs": false,
"proseWrap": "preserve",
"endOfLine": "lf"
}

View File

@@ -1 +0,0 @@
dev.ingonyama.com

View File

@@ -1,39 +0,0 @@
# Website
This website is built using [Docusaurus 2](https://docusaurus.io/), a modern static website generator.
### Installation
```
$ npm i
```
### Local Development
```
$ npm start
```
This command starts a local development server and opens up a browser window. Most changes are reflected live without having to restart the server.
### Build
```
$ npm run build
```
This command generates static content into the `build` directory and can be served using any static contents hosting service.
### Deployment
Using SSH:
```
$ USE_SSH=true npm run deploy
```
Not using SSH:
```
$ GIT_USER=<Your GitHub username> npm run deploy
```

View File

@@ -1,3 +0,0 @@
module.exports = {
presets: [require.resolve('@docusaurus/core/lib/babel/preset')],
};

View File

@@ -1,12 +0,0 @@
# ZKContainer
We found that developing ZK provers with ICICLE gives developers the ability to scale ZK provers across many machines and many GPUs. To make this possible we developed the ZKContainer.
## What is a ZKContainer?
A ZKContainer is a standardized, optimized and secure docker container that we configured with ICICLE applications in mind. A developer using our ZKContainer can deploy an ICICLE application on a single machine or on a thousand GPU machines in a data center with minimal concerns regarding compatibility.
ZKContainer has been used by Ingonyama clients to achieve scalability across large data centers.
We suggest you read our [article](https://medium.com/@ingonyama/product-announcement-zk-containers-0e2a1f2d0a2b) regarding ZKContainer to understand the benefits of using them.
![ZKContainer inside a ZK data center](../static/img/architecture-zkcontainer.png)

View File

@@ -1,23 +0,0 @@
# Contributor's Guide
We welcome all contributions with open arms. At Ingonyama we take a village approach, believing it takes many hands and minds to build an ecosystem.
## Contributing to ICICLE
- Make suggestions or report bugs via [GitHub issues](https://github.com/ingonyama-zk/icicle/issues)
- Contribute to ICICLE by opening a [pull request](https://github.com/ingonyama-zk/icicle/pulls).
- Contribute to our [documentation](https://github.com/ingonyama-zk/icicle/tree/main/docs) and [examples](https://github.com/ingonyama-zk/icicle/tree/main/examples).
- Ask questions on Discord
### Opening a pull request
When opening a [pull request](https://github.com/ingonyama-zk/icicle/pulls) please keep the following in mind.
- `Clear Purpose` - The pull request should solve a single issue and be clean of any unrelated changes.
- `Clear description` - If the pull request is for a new feature describe what you built, why you added it and how it's best that we test it. For bug fixes please describe the issue and the solution.
- `Consistent style` - Rust and Golang code should be linted by the official linters (golang fmt and rust fmt) and maintain a proper style. For CUDA and C++ code we use [`clang-format`](https://github.com/ingonyama-zk/icicle/blob/main/.clang-format), [here](https://github.com/ingonyama-zk/icicle/blob/605c25f9d22135c54ac49683b710fe2ce06e2300/.github/workflows/main-format.yml#L46) you can see how we run it.
- `Minimal Tests` - please add tests which cover the basic usage of your changes.
## Questions?
Find us on [Discord](https://discord.gg/6vYrE7waPj).

View File

@@ -1,23 +0,0 @@
# Ingonyama Grant programs
Ingonyama understands the importance of supporting and fostering a vibrant community of researchers and builders to advance ZK. To encourage progress, we are not only developing in the open but also sharing resources with researchers and builders through various programs.
## ICICLE ZK-GPU Ecosystem Grant
Ingonyama invites researchers and practitioners to collaborate in advancing ZK acceleration. We are allocating $100,000 for grants to support this initiative.
### Bounties & Grants
Eligibility for grants includes:
1. **Students**: Utilize ICICLE in your research.
2. **Performance Improvement**: Enhance the performance of accelerated primitives in ICICLE.
3. **Protocol Porting**: Migrate existing ZK protocols to ICICLE.
4. **New Primitives**: Contribute new primitives to ICICLE.
5. **Benchmarking**: Compare ZK benchmarks against ICICLE.
## Contact
For questions or submissions: [grants@ingonyama.com](mailto:grants@ingonyama.com)
**Read the full article [here](https://www.ingonyama.com/blog/icicle-for-researchers-grants-challenges)**

View File

@@ -1,138 +0,0 @@
# Run ICICLE on Google Colab
Google Colab lets you use a GPU free of charge, it's an Nvidia T4 GPU with 16 GB of memory, capable of running latest CUDA (tested on Cuda 12.2)
As Colab is able to interact with shell commands, a user can also install a framework and load git repositories into Colab space.
## Prepare Colab environment
First thing to do in a notebook is to set the runtime type to a T4 GPU.
- in the upper corner click on the dropdown menu and select "change runtime type"
![Change runtime](../../static/img/colab_change_runtime.png)
- In the window select "T4 GPU" and press Save
![T4 GPU](../../static/img/t4_gpu.png)
Installing Rust is rather simple, just execute the following command:
```sh
!apt install rustc cargo
```
To test the installation of Rust:
```sh
!rustc --version
!cargo --version
```
A successful installation will result in a rustc and cargo version print, a faulty installation will look like this:
```sh
/bin/bash: line 1: rustc: command not found
/bin/bash: line 1: cargo: command not found
```
Now we will check the environment:
```sh
!nvcc --version
!gcc --version
!cmake --version
!nvidia-smi
```
A correct environment should print the output of the `nvidia-smi` command with no bash errors and report a **Tesla T4 GPU** device:
```sh
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
cmake version 3.27.9
CMake suite maintained and supported by Kitware (kitware.com/cmake).
Wed Jan 17 13:10:18 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 39C P8 9W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
```
## Cloning ICICLE and running test
Now we are ready to clone the ICICLE repository:
```sh
!git clone https://github.com/ingonyama-zk/icicle.git
```
We now can browse the repository and run tests to check the runtime environment:
```sh
!ls -la
%cd icicle
```
Let's run a test!
Navigate to icicle/wrappers/rust/icicle-curves/icicle-bn254 and run cargo test:
```sh
%cd wrappers/rust/icicle-curves/icicle-bn254/
!cargo test --release
```
:::note
Compiling the first time may take a while
:::
Test run should end like this:
```sh
running 15 tests
test curve::tests::test_ark_point_convert ... ok
test curve::tests::test_ark_scalar_convert ... ok
test curve::tests::test_affine_projective_convert ... ok
test curve::tests::test_point_equality ... ok
test curve::tests::test_field_convert_montgomery ... ok
test curve::tests::test_scalar_equality ... ok
test curve::tests::test_points_convert_montgomery ... ok
test msm::tests::test_msm ... ok
test msm::tests::test_msm_skewed_distributions ... ok
test ntt::tests::test_ntt ... ok
test ntt::tests::test_ntt_arbitrary_coset ... ok
test msm::tests::test_msm_batch has been running for over 60 seconds
test msm::tests::test_msm_batch ... ok
test ntt::tests::test_ntt_coset_from_subgroup ... ok
test ntt::tests::test_ntt_device_async ... ok
test ntt::tests::test_ntt_batch ... ok
test result: ok. 15 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 99.39s
```
Voilà — ICICLE in Colab!

View File

@@ -1,196 +0,0 @@
# ICICLE Core
ICICLE Core is a library written in C++/CUDA. All the ICICLE primitives are implemented within ICICLE Core.
The Core is split into logical modules that can be compiled into static libraries using different [strategies](#compilation-strategies). You can then [link](#linking) these libraries with your C++ project or write your own [bindings](#writing-new-bindings-for-icicle) for other programming languages. If you want to use ICICLE with existing bindings please refer to the [Rust](/icicle/rust-bindings) or [Golang](/icicle/golang-bindings) bindings documentation.
## Supported curves, fields and operations
### Supported curves and operations
| Operation\Curve | [bn254](https://neuromancer.sk/std/bn/bn254) | [bls12-377](https://neuromancer.sk/std/bls/BLS12-377) | [bls12-381](https://neuromancer.sk/std/bls/BLS12-381) | [bw6-761](https://eprint.iacr.org/2020/351) | grumpkin |
| --- | :---: | :---: | :---: | :---: | :---: |
| [MSM][MSM_DOCS] | ✅ | ✅ | ✅ | ✅ | ✅ |
| G2 | ✅ | ✅ | ✅ | ✅ | ❌ |
| [NTT][NTT_DOCS] | ✅ | ✅ | ✅ | ✅ | ❌ |
| ECNTT | ✅ | ✅ | ✅ | ✅ | ❌ |
| [VecOps][VECOPS_CODE] | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Polynomials][POLY_DOCS] | ✅ | ✅ | ✅ | ✅ | ❌ |
| [Poseidon](primitives/poseidon) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Merkle Tree](primitives/poseidon#the-tree-builder) | ✅ | ✅ | ✅ | ✅ | ✅ |
### Supported fields and operations
| Operation\Field | [babybear](https://eprint.iacr.org/2023/824.pdf) | [Stark252](https://docs.starknet.io/documentation/architecture_and_concepts/Cryptography/p-value/) |
| --- | :---: | :---: |
| [VecOps][VECOPS_CODE] | ✅ | ✅ |
| [Polynomials][POLY_DOCS] | ✅ | ✅ |
| [NTT][NTT_DOCS] | ✅ | ✅ |
| Extension Field | ✅ | ❌ |
### Supported hashes
| Hash | Sizes |
| --- | :---: |
| Keccak | 256, 512 |
## Compilation strategies
Most of the codebase is curve/field agnostic, which means it can be compiled for different curves and fields. When you build ICICLE Core you choose a single curve or field. If you need multiple curves or fields, you compile ICICLE once per curve or field that is needed. It's that simple. Currently, the following choices are supported:
- [Field mode][COMPILE_FIELD_MODE] - used for STARK fields like BabyBear / Mersenne / Goldilocks. Includes field arithmetic, NTT, Poseidon, Extension fields and other primitives.
- [Curve mode][COMPILE_CURVE_MODE] - used for SNARK curves like BN254 / BLS curves / Grumpkin / etc. Curve mode is built upon field mode, so it includes everything that field mode does. It also includes curve operations / MSM / ECNTT / G2 and other curve-related primitives.
:::info
If you only want to use a curve's scalar or base field, you still need to use curve mode. You can disable MSM with [options](#compilation-options)
:::
### Compiling for a field
You can compile ICICLE for a field using this command:
```sh
cd icicle
mkdir -p build
cmake -DFIELD=<FIELD> -S . -B build
cmake --build build -j
```
This command will output `libingo_field_<FIELD>.a` into `build/lib`.
### Compiling for a curve
:::note
Field related primitives will be compiled for the scalar field of the curve
:::
You can compile ICICLE for a SNARK curve using this command:
```sh
cd icicle
mkdir -p build
cmake -DCURVE=<CURVE> -S . -B build
cmake --build build -j
```
Where `<CURVE>` can be one of `bn254`/`bls12_377`/`bls12_381`/`bw6_761`/`grumpkin`.
This command will output both `libingo_curve_<CURVE>.a` and `libingo_field_<CURVE>.a` into `build/lib`.
### Compilation options
There exist multiple options that allow you to customize your build or enable additional functionality.
#### EXT_FIELD
Used only in [field mode][COMPILE_FIELD_MODE] to add an Extension field. Adds all supported field operations for the extension field.
Default: `OFF`
Usage: `-DEXT_FIELD=ON`
#### G2
Used only in [curve mode][COMPILE_CURVE_MODE] to add G2 definitions. Also adds G2 MSM.
Default: `OFF`
Usage: `-DG2=ON`
#### ECNTT
Used only in [curve mode][COMPILE_CURVE_MODE] to add ECNTT function.
Default: `OFF`
Usage: `-DECNTT=ON`
#### MSM
Used only in [curve mode][COMPILE_CURVE_MODE] to add MSM function. As MSM takes a lot of time to build, you can disable it with this option to reduce compilation time.
Default: `ON`
Usage: `-DMSM=OFF`
#### BUILD_HASH
Can be used in any mode to build a hash library. Currently it only includes Keccak hash function, but more are coming.
Default: `OFF`
Usage: `-DBUILD_HASH=ON`
#### BUILD_TESTS
Can be used in any mode to include tests runner binary.
Default: `OFF`
USAGE: `-DBUILD_TESTS=ON`
#### BUILD_BENCHMARKS
Can be used in any mode to include benchmarks runner binary.
Default: `OFF`
USAGE: `-DBUILD_BENCHMARKS=ON`
#### DEVMODE
Can be used in any mode to include debug symbols in the build.
Default: `OFF`
USAGE: `-DEVMODE=ON`
## Linking
To link ICICLE with your project you first need to compile ICICLE with options of your choice. After that you can use CMake `target_link_libraries` to link with the generated static libraries and `target_include_directories` to include ICICLE headers (located in `icicle/include`).
Refer to our [c++ examples](https://github.com/ingonyama-zk/icicle/tree/main/examples/c%2B%2B) for more info. Take a look at this [CMakeLists.txt](https://github.com/ingonyama-zk/icicle/blob/main/examples/c%2B%2B/msm/CMakeLists.txt#L22)
## Writing new bindings for ICICLE
Since ICICLE Core is written in CUDA / C++, it's really simple to generate static libraries. These static libraries can be installed on any system and called by higher level languages such as Golang.
Static libraries can be loaded into memory once and used by multiple programs, reducing memory usage and potentially improving performance. They also allow you to separate functionality into distinct modules so your static library may need to compile only specific features that you want to use.
Let's review the [Golang bindings][GOLANG_BINDINGS], since it's a pretty verbose example (compared to Rust, which hides it pretty well) of using static libraries. Golang has a library named `CGO` which can be used to link static libraries. Here's a basic example on how you can use cgo to link these libraries:
```go
/*
#cgo LDFLAGS: -L/path/to/shared/libs -lbn254 -lbls12_381 -lbls12_377 -lbw6_761
#include "icicle.h" // make sure you use the correct header file(s)
*/
import "C"
func main() {
// Now you can call the C functions from the ICICLE libraries.
// Note that C function calls are prefixed with 'C.' in Go code.
out := (*C.BN254_projective_t)(unsafe.Pointer(p))
in := (*C.BN254_affine_t)(unsafe.Pointer(affine))
C.projective_from_affine_bn254(out, in)
}
```
The comments on the first line tell `CGO` which libraries to import as well as which header files to include. You can then call methods which are part of the static library and defined in the header file, `C.projective_from_affine_bn254` is an example.
If you wish to create your own bindings for a language of your choice we suggest you start by investigating how you can call static libraries.
<!-- Begin Links -->
[GOLANG_BINDINGS]: golang-bindings.md
[COMPILE_CURVE_MODE]: #compiling-for-a-curve
[COMPILE_FIELD_MODE]: #compiling-for-a-field
[NTT_DOCS]: primitives/ntt
[MSM_DOCS]: primitives/msm
[POLY_DOCS]: polynomials/overview
[VECOPS_CODE]: https://github.com/ingonyama-zk/icicle/blob/main/icicle/include/vec_ops/vec_ops.cuh
<!-- End Links -->

View File

@@ -1,136 +0,0 @@
# Golang bindings
Golang bindings allow you to use ICICLE as a golang library.
The source code for all Golang packages can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang).
The Golang bindings are comprised of multiple packages.
[`core`](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang/core) which defines all shared methods and structures, such as configuration structures, or memory slices.
[`cuda-runtime`](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang/cuda_runtime) which defines abstractions for CUDA methods for allocating memory, initializing and managing streams, and `DeviceContext` which enables users to define and keep track of devices.
Each supported curve, field, and hash has its own package which you can find in the respective directories [here](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang). If your project uses BN254 you only need to import that single package named [`bn254`](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang/curves/bn254).
## Using ICICLE Golang bindings in your project
To add ICICLE to your `go.mod` file.
```bash
go get github.com/ingonyama-zk/icicle
```
If you want to specify a specific branch
```bash
go get github.com/ingonyama-zk/icicle@<branch_name>
```
For a specific commit
```bash
go get github.com/ingonyama-zk/icicle@<commit_id>
```
To build the shared libraries you can run [this](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang/build.sh) script:
```sh
./build.sh [-curve=<curve>] [-field=<field>] [-hash=<hash>] [-cuda_version=<version>] [-g2] [-ecntt] [-devmode]
curve - The name of the curve to build or "all" to build all supported curves
field - The name of the field to build or "all" to build all supported fields
hash - The name of the hash to build or "all" to build all supported hashes
-g2 - Optional - build with G2 enabled
-ecntt - Optional - build with ECNTT enabled
-devmode - Optional - build in devmode
-help - Optional - Displays usage information
```
:::note
If more than one curve or more than one field or more than one hash is supplied, the last one supplied will be built
:::
To build ICICLE libraries for all supported curves with G2 and ECNTT enabled.
```bash
./build.sh -curve=all -g2 -ecntt
```
If you wish to build for a specific curve, for example bn254, without G2 or ECNTT enabled.
``` bash
./build.sh -curve=bn254
```
Now you can import ICICLE into your project
```go
import (
"github.com/stretchr/testify/assert"
"testing"
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
)
...
```
## Running tests
To run all tests, for all curves:
```bash
go test ./... -count=1
```
If you wish to run test for a specific curve:
```bash
go test <path_to_curve> -count=1
```
## How do Golang bindings work?
The libraries produced from the CUDA code compilation are used to bind Golang to ICICLE's CUDA code.
1. These libraries (named `libingo_curve_<curve>.a` and `libingo_field_<curve>.a`) can be imported in your Go project to leverage the GPU accelerated functionalities provided by ICICLE.
2. In your Go project, you can use `cgo` to link these libraries. Here's a basic example on how you can use `cgo` to link these libraries:
```go
/*
#cgo LDFLAGS: -L/path/to/shared/libs -lingo_curve_bn254 -L$/path/to/shared/libs -lingo_field_bn254 -lstdc++ -lm
#include "icicle.h" // make sure you use the correct header file(s)
*/
import "C"
func main() {
// Now you can call the C functions from the ICICLE libraries.
// Note that C function calls are prefixed with 'C.' in Go code.
}
```
Replace `/path/to/shared/libs` with the actual path where the shared libraries are located on your system.
## Supported curves, fields and operations
### Supported curves and operations
| Operation\Curve | bn254 | bls12_377 | bls12_381 | bw6-761 | grumpkin |
| --- | :---: | :---: | :---: | :---: | :---: |
| MSM | ✅ | ✅ | ✅ | ✅ | ✅ |
| G2 | ✅ | ✅ | ✅ | ✅ | ❌ |
| NTT | ✅ | ✅ | ✅ | ✅ | ❌ |
| ECNTT | ✅ | ✅ | ✅ | ✅ | ❌ |
| VecOps | ✅ | ✅ | ✅ | ✅ | ✅ |
| Polynomials | ✅ | ✅ | ✅ | ✅ | ❌ |
### Supported fields and operations
| Operation\Field | babybear |
| --- | :---: |
| VecOps | ✅ |
| Polynomials | ✅ |
| NTT | ✅ |
| Extension Field | ✅ |

View File

@@ -1,92 +0,0 @@
# ECNTT
## ECNTT Method
The `ECNtt[T any]()` function performs the Elliptic Curve Number Theoretic Transform (EC-NTT) on the input points slice, using the provided dir (direction), cfg (configuration), and stores the results in the results slice.
```go
func ECNtt[T any](points core.HostOrDeviceSlice, dir core.NTTDir, cfg *core.NTTConfig[T], results core.HostOrDeviceSlice) core.IcicleError
```
### Parameters
- **`points`**: A slice of elliptic curve points (in projective coordinates) that will be transformed. The slice can be stored on the host or the device, as indicated by the `core.HostOrDeviceSlice` type.
- **`dir`**: The direction of the EC-NTT transform, either `core.KForward` or `core.KInverse`.
- **`cfg`**: A pointer to an `NTTConfig` object, containing configuration options for the NTT operation.
- **`results`**: A slice that will store the transformed elliptic curve points (in projective coordinates). The slice can be stored on the host or the device, as indicated by the `core.HostOrDeviceSlice` type.
### Return Value
- **`CudaError`**: A `core.IcicleError` value, which will be `core.IcicleErrorCode(0)` if the EC-NTT operation was successful, or an error if something went wrong.
## NTT Configuration (NTTConfig)
The `NTTConfig` structure holds configuration parameters for the NTT operation, allowing customization of its behavior to optimize performance based on the specifics of your protocol.
```go
type NTTConfig[T any] struct {
Ctx cr.DeviceContext
CosetGen T
BatchSize int32
ColumnsBatch bool
Ordering Ordering
areInputsOnDevice bool
areOutputsOnDevice bool
IsAsync bool
NttAlgorithm NttAlgorithm
}
```
### Fields
- **`Ctx`**: Device context containing details like device ID and stream ID.
- **`CosetGen`**: Coset generator used for coset (i)NTTs, defaulting to no coset being used.
- **`BatchSize`**: The number of NTTs to compute in one operation, defaulting to 1.
- **`ColumnsBatch`**: If true the function will compute the NTTs over the columns of the input matrix and not over the rows. Defaults to `false`.
- **`Ordering`**: Ordering of inputs and outputs (`KNN`, `KNR`, `KRN`, `KRR`), affecting how data is arranged.
- **`areInputsOnDevice`**: Indicates if input scalars are located on the device.
- **`areOutputsOnDevice`**: Indicates if results are stored on the device.
- **`IsAsync`**: Controls whether the NTT operation runs asynchronously.
- **`NttAlgorithm`**: Explicitly select the NTT algorithm. ECNTT supports running on `Radix2` algorithm.
### Default Configuration
Use `GetDefaultNTTConfig` to obtain a default configuration, customizable as needed.
```go
func GetDefaultNTTConfig[T any](cosetGen T) NTTConfig[T]
```
## ECNTT Example
```go
package main
import (
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
)
func main() {
// Obtain the default NTT configuration with a predefined coset generator.
cfg := GetDefaultNttConfig()
// Define the size of the input scalars.
size := 1 << 18
// Generate Points for the ECNTT operation.
points := GenerateProjectivePoints(size)
// Set the direction of the NTT (forward or inverse).
dir := core.KForward
// Allocate memory for the results of the NTT operation.
results := make(core.HostSlice[Projective], size)
// Perform the NTT operation.
err := ECNtt(points, dir, &cfg, results)
if err.CudaErrorCode != cr.CudaSuccess {
panic("ECNTT operation failed")
}
}
```

View File

@@ -1,94 +0,0 @@
# Keccak
## Keccak Example
```go
package main
import (
"encoding/hex"
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/hash/keccak"
)
func createHostSliceFromHexString(hexString string) core.HostSlice[uint8] {
byteArray, err := hex.DecodeString(hexString)
if err != nil {
panic("Not a hex string")
}
return core.HostSliceFromElements([]uint8(byteArray))
}
func main() {
input := createHostSliceFromHexString("1725b6")
outHost256 := make(core.HostSlice[uint8], 32)
cfg := keccak.GetDefaultHashConfig()
e := keccak.Keccak256(input, int32(input.Len()), 1, outHost256, &cfg)
if e.CudaErrorCode != cr.CudaSuccess {
panic("Keccak256 hashing failed")
}
outHost512 := make(core.HostSlice[uint8], 64)
e = keccak.Keccak512(input, int32(input.Len()), 1, outHost512, &cfg)
if e.CudaErrorCode != cr.CudaSuccess {
panic("Keccak512 hashing failed")
}
numberOfBlocks := 3
outHostBatch256 := make(core.HostSlice[uint8], 32*numberOfBlocks)
e = keccak.Keccak256(input, int32(input.Len()/numberOfBlocks), int32(numberOfBlocks), outHostBatch256, &cfg)
if e.CudaErrorCode != cr.CudaSuccess {
panic("Keccak256 batch hashing failed")
}
}
```
## Keccak Methods
```go
func Keccak256(input core.HostOrDeviceSlice, inputBlockSize, numberOfBlocks int32, output core.HostOrDeviceSlice, config *HashConfig) core.IcicleError
func Keccak512(input core.HostOrDeviceSlice, inputBlockSize, numberOfBlocks int32, output core.HostOrDeviceSlice, config *HashConfig) core.IcicleError
```
### Parameters
- **`input`**: A slice containing the input data for the Keccak256 hash function. It can reside in either host memory or device memory.
- **`inputBlockSize`**: An integer specifying the size of the input data for a single hash.
- **`numberOfBlocks`**: An integer specifying the number of results in the hash batch.
- **`output`**: A slice where the resulting hash will be stored. This slice can be in host or device memory.
- **`config`**: A pointer to a `HashConfig` object, which contains various configuration options for the Keccak256 operation.
### Return Value
- **`CudaError`**: Returns a CUDA error code indicating the success or failure of the Keccak256/Keccak512 operation.
## HashConfig
The `HashConfig` structure holds configuration parameters for the Keccak256/Keccak512 operation, allowing customization of its behavior to optimize performance based on the specifics of the operation or the underlying hardware.
```go
type HashConfig struct {
Ctx cr.DeviceContext
areInputsOnDevice bool
areOutputsOnDevice bool
IsAsync bool
}
```
### Fields
- **`Ctx`**: Device context containing details like device id and stream.
- **`areInputsOnDevice`**: Indicates if input data is located on the device.
- **`areOutputsOnDevice`**: Indicates if output hash is stored on the device.
- **`IsAsync`**: If true, runs the Keccak256/Keccak512 operation asynchronously.
### Default Configuration
Use `GetDefaultHashConfig` to obtain a default configuration, which can then be customized as needed.
```go
func GetDefaultHashConfig() HashConfig
```

View File

@@ -1,99 +0,0 @@
# MSM Pre computation
To understand the theory behind MSM pre computation technique refer to Niall Emmart's [talk](https://youtu.be/KAWlySN7Hm8?feature=shared&t=1734).
## Core package
### MSM PrecomputePoints
`PrecomputePoints` and `G2PrecomputePoints` exists for all supported curves.
#### Description
This function extends each provided base point $(P)$ with its multiples $(2^lP, 2^{2l}P, ..., 2^{(\text{precompute\_factor} - 1) \cdot l}P)$, where $(l)$ is a level of precomputation determined by the `precompute_factor`. The extended set of points facilitates faster MSM computations by allowing the MSM algorithm to leverage precomputed multiples of base points, reducing the number of point additions required during the computation.
The precomputation process is crucial for optimizing MSM operations, especially when dealing with large sets of points and scalars. By precomputing and storing multiples of the base points, the MSM function can more efficiently compute the scalar-point multiplications.
#### `PrecomputePoints`
Precomputes points for MSM by extending each base point with its multiples.
```go
func PrecomputePoints(points core.HostOrDeviceSlice, msmSize int, cfg *core.MSMConfig, outputBases core.DeviceSlice) cr.CudaError
```
##### Parameters
- **`points`**: A slice of the original affine points to be extended with their multiples.
- **`msmSize`**: The size of a single msm in order to determine optimal parameters.
- **`cfg`**: The MSM configuration parameters.
- **`outputBases`**: The device slice allocated for storing the extended points.
##### Example
```go
package main
import (
"log"
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
bn254 "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254"
)
func main() {
cfg := bn254.GetDefaultMSMConfig()
points := bn254.GenerateAffinePoints(1024)
var precomputeFactor int32 = 8
var precomputeOut core.DeviceSlice
precomputeOut.Malloc(points[0].Size()*points.Len()*int(precomputeFactor), points[0].Size())
err := bn254.PrecomputePoints(points, 1024, &cfg, precomputeOut)
if err != cr.CudaSuccess {
log.Fatalf("PrecomputeBases failed: %v", err)
}
}
```
#### `G2PrecomputePoints`
This method is the same as `PrecomputePoints` but for G2 points. Extends each G2 curve base point with its multiples for optimized MSM computations.
```go
func G2PrecomputePoints(points core.HostOrDeviceSlice, msmSize int, cfg *core.MSMConfig, outputBases core.DeviceSlice) cr.CudaError
```
##### Parameters
- **`points`**: A slice of the original affine points to be extended with their multiples.
- **`msmSize`**: The size of a single msm in order to determine optimal parameters.
- **`cfg`**: The MSM configuration parameters.
- **`outputBases`**: The device slice allocated for storing the extended points.
##### Example
```go
package main
import (
"log"
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
g2 "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254/g2"
)
func main() {
cfg := g2.G2GetDefaultMSMConfig()
points := g2.G2GenerateAffinePoints(1024)
var precomputeFactor int32 = 8
var precomputeOut core.DeviceSlice
precomputeOut.Malloc(points[0].Size()*points.Len()*int(precomputeFactor), points[0].Size())
	err := g2.G2PrecomputePoints(points, 1024, &cfg, precomputeOut)
if err != cr.CudaSuccess {
log.Fatalf("PrecomputeBases failed: %v", err)
}
}
```

View File

@@ -1,198 +0,0 @@
# MSM
## MSM Example
```go
package main
import (
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254"
bn254_msm "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254/msm"
)
func main() {
// Obtain the default MSM configuration.
cfg := core.GetDefaultMSMConfig()
// Define the size of the problem, here 2^18.
size := 1 << 18
// Generate scalars and points for the MSM operation.
scalars := bn254.GenerateScalars(size)
points := bn254.GenerateAffinePoints(size)
// Create a CUDA stream for asynchronous operations.
stream, _ := cr.CreateStream()
var p bn254.Projective
// Allocate memory on the device for the result of the MSM operation.
var out core.DeviceSlice
_, e := out.MallocAsync(p.Size(), p.Size(), stream)
if e != cr.CudaSuccess {
panic(e)
}
// Set the CUDA stream in the MSM configuration.
cfg.Ctx.Stream = &stream
cfg.IsAsync = true
// Perform the MSM operation.
e = bn254_msm.Msm(scalars, points, &cfg, out)
if e != cr.CudaSuccess {
panic(e)
}
// Allocate host memory for the results and copy the results from the device.
outHost := make(core.HostSlice[bn254.Projective], 1)
cr.SynchronizeStream(&stream)
outHost.CopyFromDevice(&out)
// Free the device memory allocated for the results.
out.Free()
}
```
## MSM Method
```go
func Msm(scalars core.HostOrDeviceSlice, points core.HostOrDeviceSlice, cfg *core.MSMConfig, results core.HostOrDeviceSlice) cr.CudaError
```
### Parameters
- **`scalars`**: A slice containing the scalars for multiplication. It can reside either in host memory or device memory.
- **`points`**: A slice containing the points to be multiplied with scalars. Like scalars, these can also be in host or device memory.
- **`cfg`**: A pointer to an `MSMConfig` object, which contains various configuration options for the MSM operation.
- **`results`**: A slice where the results of the MSM operation will be stored. This slice can be in host or device memory.
### Return Value
- **`CudaError`**: Returns a CUDA error code indicating the success or failure of the MSM operation.
## MSMConfig
The `MSMConfig` structure holds configuration parameters for the MSM operation, allowing customization of its behavior to optimize performance based on the specifics of the operation or the underlying hardware.
```go
type MSMConfig struct {
Ctx cr.DeviceContext
PrecomputeFactor int32
C int32
Bitsize int32
LargeBucketFactor int32
batchSize int32
areScalarsOnDevice bool
AreScalarsMontgomeryForm bool
arePointsOnDevice bool
ArePointsMontgomeryForm bool
areResultsOnDevice bool
IsBigTriangle bool
IsAsync bool
}
```
### Fields
- **`Ctx`**: Device context containing details like device id and stream.
- **`PrecomputeFactor`**: Controls the number of extra points to pre-compute.
- **`C`**: Window bitsize, a key parameter in the "bucket method" for MSM.
- **`Bitsize`**: Number of bits of the largest scalar.
- **`LargeBucketFactor`**: Sensitivity to frequently occurring buckets.
- **`batchSize`**: Number of results to compute in one batch.
- **`areScalarsOnDevice`**: Indicates if scalars are located on the device.
- **`AreScalarsMontgomeryForm`**: True if scalars are in Montgomery form.
- **`arePointsOnDevice`**: Indicates if points are located on the device.
- **`ArePointsMontgomeryForm`**: True if point coordinates are in Montgomery form.
- **`areResultsOnDevice`**: Indicates if results are stored on the device.
- **`IsBigTriangle`**: If `true` MSM will run in Large triangle accumulation if `false` Bucket accumulation will be chosen. Default value: false.
- **`IsAsync`**: If true, runs MSM asynchronously.
### Default Configuration
Use `GetDefaultMSMConfig` to obtain a default configuration, which can then be customized as needed.
```go
func GetDefaultMSMConfig() MSMConfig
```
## How do I toggle between the supported algorithms?
When creating your MSM Config you may state which algorithm you wish to use. `cfg.IsBigTriangle = true` will activate Large triangle reduction and `cfg.IsBigTriangle = false` will activate iterative reduction.
```go
...
// Obtain the default MSM configuration.
cfg := GetDefaultMSMConfig()
cfg.IsBigTriangle = true
...
```
## How do I toggle between MSM modes?
Toggling between MSM modes occurs automatically based on the number of results you are expecting from the `MSM` function.
The number of results is interpreted from the size of `var out core.DeviceSlice`. Thus it's important when allocating memory for `var out core.DeviceSlice` to make sure that you are allocating `<number of results> X <size of a single point>`.
```go
...
batchSize := 3
var p G2Projective
var out core.DeviceSlice
out.Malloc(batchSize*p.Size(), p.Size())
...
```
## Parameters for optimal performance
Please refer to the [primitive description](../primitives/msm#choosing-optimal-parameters)
## Support for G2 group
To activate G2 support first you must make sure you are building the static libraries with G2 feature enabled as described in the [Golang building instructions](../golang-bindings.md#using-icicle-golang-bindings-in-your-project).
Now you may import `g2` package of the specified curve.
```go
import (
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254/g2"
)
```
This package includes `G2Projective` and `G2Affine` points as well as a `G2Msm` method.
```go
package main
import (
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
bn254 "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254"
g2 "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254/g2"
)
func main() {
cfg := core.GetDefaultMSMConfig()
size := 1 << 12
batchSize := 3
totalSize := size * batchSize
scalars := bn254.GenerateScalars(totalSize)
points := g2.G2GenerateAffinePoints(totalSize)
var p g2.G2Projective
var out core.DeviceSlice
out.Malloc(batchSize*p.Size(), p.Size())
g2.G2Msm(scalars, points, &cfg, out)
}
```
`G2Msm` works the same way as normal MSM, the difference is that it uses G2 Points.

View File

@@ -1,155 +0,0 @@
# Multi GPU APIs
To learn more about the theory of Multi GPU programming refer to [this part](../multi-gpu.md) of documentation.
Here we will cover the core multi GPU apis and an [example](#a-multi-gpu-example)
## A Multi GPU example
In this example we will display how you can
1. Fetch the number of devices installed on a machine
2. For every GPU launch a thread and set an active device per thread.
3. Execute a MSM on each GPU
```go
package main
import (
"fmt"
"sync"
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
bn254 "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254"
)
func main() {
numDevices, _ := cr.GetDeviceCount()
fmt.Println("There are ", numDevices, " devices available")
wg := sync.WaitGroup{}
for i := 0; i < numDevices; i++ {
wg.Add(1)
// RunOnDevice makes sure each MSM runs on a single thread
cr.RunOnDevice(i, func(args ...any) {
defer wg.Done()
cfg := bn254.GetDefaultMSMConfig()
cfg.IsAsync = true
for _, power := range []int{10, 18} {
				size := 1 << power // 2^power
// generate random scalars
scalars := bn254.GenerateScalars(size)
points := bn254.GenerateAffinePoints(size)
// create a stream and allocate result pointer
stream, _ := cr.CreateStream()
var p bn254.Projective
var out core.DeviceSlice
out.MallocAsync(p.Size(), p.Size(), stream)
// assign stream to device context
cfg.Ctx.Stream = &stream
// execute MSM
bn254.Msm(scalars, points, &cfg, out)
// read result from device
outHost := make(core.HostSlice[bn254.Projective], 1)
outHost.CopyFromDeviceAsync(&out, stream)
out.FreeAsync(stream)
// sync the stream
cr.SynchronizeStream(&stream)
}
})
}
wg.Wait()
}
```
This example demonstrates a basic pattern for distributing tasks across multiple GPUs. The `RunOnDevice` function ensures that each goroutine is executed on its designated GPU and a corresponding thread.
## Device Management API
To streamline device management we offer as part of `cuda_runtime` package methods for dealing with devices.
### `RunOnDevice`
Runs a given function on a specific GPU device, ensuring that all CUDA calls within the function are executed on the selected device.
In Go, most concurrency can be done via Goroutines. However, there is no guarantee that a goroutine stays on a specific host thread.
`RunOnDevice` was designed to solve this caveat and ensure that the goroutine will stay on a specific host thread.
`RunOnDevice` locks a goroutine into a specific host thread, sets a current GPU device, runs a provided function, and unlocks the goroutine from the host thread after the provided function finishes.
While the goroutine is locked to the host thread, the Go runtime will not assign other goroutines to that host thread.
**Parameters:**
- **`deviceId int`**: The ID of the device on which to run the provided function. Device IDs start from 0.
- **`funcToRun func(args ...any)`**: The function to be executed on the specified device.
- **`args ...any`**: Arguments to be passed to `funcToRun`.
**Behavior:**
- The function `funcToRun` is executed in a new goroutine that is locked to a specific OS thread to ensure that all CUDA calls within the function target the specified device.
:::note
Any goroutines launched within `funcToRun` are not automatically bound to the same GPU device. If necessary, `RunOnDevice` should be called again within such goroutines with the same `deviceId`.
:::
**Example:**
```go
RunOnDevice(0, func(args ...any) {
fmt.Println("This runs on GPU 0")
// CUDA-related operations here will target GPU 0
}, nil)
```
### `SetDevice`
Sets the active device for the current host thread. All subsequent CUDA calls made from this thread will target the specified device.
:::warning
This function should not be used directly in conjunction with goroutines. If you want to run multi-gpu scenarios with goroutines you should use [RunOnDevice](#runondevice)
:::
**Parameters:**
- **`device int`**: The ID of the device to set as the current device.
**Returns:**
- **`CudaError`**: Error code indicating the success or failure of the operation.
### `GetDeviceCount`
Retrieves the number of CUDA-capable devices available on the host.
**Returns:**
- **`(int, CudaError)`**: The number of devices and an error code indicating the success or failure of the operation.
### `GetDevice`
Gets the ID of the currently active device for the calling host thread.
**Returns:**
- **`(int, CudaError)`**: The ID of the current device and an error code indicating the success or failure of the operation.
### `GetDeviceFromPointer`
Retrieves the device associated with a given pointer.
**Parameters:**
- **`ptr unsafe.Pointer`**: Pointer to query.
**Returns:**
- **`int`**: The device ID associated with the memory pointed to by `ptr`.
This documentation should provide a clear understanding of how to effectively manage multiple GPUs in Go applications using CUDA, with a particular emphasis on the `RunOnDevice` function for executing tasks on specific GPUs.

View File

@@ -1,151 +0,0 @@
# NTT
## NTT Example
```go
package main
import (
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
bn254 "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254"
"github.com/consensys/gnark-crypto/ecc/bn254/fr/fft"
)
func init() {
cfg := bn254.GetDefaultNttConfig()
initDomain(18, cfg)
}
func initDomain[T any](largestTestSize int, cfg core.NTTConfig[T]) core.IcicleError {
rouMont, _ := fft.Generator(uint64(1 << largestTestSize))
rou := rouMont.Bits()
rouIcicle := bn254.ScalarField{}
rouIcicle.FromLimbs(rou[:])
e := bn254.InitDomain(rouIcicle, cfg.Ctx, false)
return e
}
func main() {
// Obtain the default NTT configuration with a predefined coset generator.
cfg := bn254.GetDefaultNttConfig()
// Define the size of the input scalars.
size := 1 << 18
// Generate scalars for the NTT operation.
scalars := bn254.GenerateScalars(size)
// Set the direction of the NTT (forward or inverse).
dir := core.KForward
// Allocate memory for the results of the NTT operation.
results := make(core.HostSlice[bn254.ScalarField], size)
// Perform the NTT operation.
err := bn254.Ntt(scalars, dir, &cfg, results)
if err.CudaErrorCode != cr.CudaSuccess {
panic("NTT operation failed")
}
}
```
## NTT Method
```go
func Ntt[T any](scalars core.HostOrDeviceSlice, dir core.NTTDir, cfg *core.NTTConfig[T], results core.HostOrDeviceSlice) core.IcicleError
```
### Parameters
- **`scalars`**: A slice containing the input scalars for the transform. It can reside either in host memory or device memory.
- **`dir`**: The direction of the NTT operation (`KForward` or `KInverse`).
- **`cfg`**: A pointer to an `NTTConfig` object, containing configuration options for the NTT operation.
- **`results`**: A slice where the results of the NTT operation will be stored. This slice can be in host or device memory.
### Return Value
- **`CudaError`**: Returns a CUDA error code indicating the success or failure of the NTT operation.
## NTT Configuration (NTTConfig)
The `NTTConfig` structure holds configuration parameters for the NTT operation, allowing customization of its behavior to optimize performance based on the specifics of your protocol.
```go
type NTTConfig[T any] struct {
Ctx cr.DeviceContext
CosetGen T
BatchSize int32
ColumnsBatch bool
Ordering Ordering
areInputsOnDevice bool
areOutputsOnDevice bool
IsAsync bool
NttAlgorithm NttAlgorithm
}
```
### Fields
- **`Ctx`**: Device context containing details like device ID and stream ID.
- **`CosetGen`**: Coset generator used for coset (i)NTTs, defaulting to no coset being used.
- **`BatchSize`**: The number of NTTs to compute in one operation, defaulting to 1.
- **`ColumnsBatch`**: If true the function will compute the NTTs over the columns of the input matrix and not over the rows. Defaults to `false`.
- **`Ordering`**: Ordering of inputs and outputs (`KNN`, `KNR`, `KRN`, `KRR`, `KMN`, `KNM`), affecting how data is arranged.
- **`areInputsOnDevice`**: Indicates if input scalars are located on the device.
- **`areOutputsOnDevice`**: Indicates if results are stored on the device.
- **`IsAsync`**: Controls whether the NTT operation runs asynchronously.
- **`NttAlgorithm`**: Explicitly select the NTT algorithm. Default value: Auto (the implementation selects radix-2 or mixed-radix algorithm based on heuristics).
### Default Configuration
Use `GetDefaultNTTConfig` to obtain a default configuration, customizable as needed.
```go
func GetDefaultNTTConfig[T any](cosetGen T) NTTConfig[T]
```
### Initializing the NTT Domain
Before performing NTT operations, it's necessary to initialize the NTT domain; it only needs to be called once per GPU since the twiddles are cached.
```go
func InitDomain(primitiveRoot ScalarField, ctx cr.DeviceContext, fastTwiddles bool) core.IcicleError
```
This function initializes the domain with a given primitive root, optionally using fast twiddle factors to optimize the computation.
### Releasing the domain
The `ReleaseDomain` function is responsible for releasing the resources associated with a specific domain in the CUDA device context.
```go
func ReleaseDomain(ctx cr.DeviceContext) core.IcicleError
```
### Parameters
- **`ctx`**: a reference to the `DeviceContext` object, which represents the CUDA device context.
### Return Value
The function returns a `core.IcicleError`, which represents the result of the operation. If the operation is successful, the function returns `core.IcicleErrorCode(0)`.
### Example
```go
import (
"github.com/icicle-crypto/icicle-core/cr"
"github.com/icicle-crypto/icicle-core/core"
)
func example() {
cfg := GetDefaultNttConfig()
err := ReleaseDomain(cfg.Ctx)
if err != nil {
// Handle the error
}
}
```

View File

@@ -1,188 +0,0 @@
# Vector Operations
## Overview
Icicle exposes a number of vector operations which a user can use:
* The VecOps API provides efficient vector operations such as addition, subtraction, and multiplication.
* MatrixTranspose API allows a user to perform a transpose on a vector representation of a matrix
## VecOps API Documentation
### Example
#### Vector addition
```go
package main
import (
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
bn254 "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254"
)
func main() {
testSize := 1 << 12
a := bn254.GenerateScalars(testSize)
b := bn254.GenerateScalars(testSize)
out := make(core.HostSlice[bn254.ScalarField], testSize)
cfg := core.DefaultVecOpsConfig()
	// Perform vector addition
err := bn254.VecOp(a, b, out, cfg, core.Add)
if err != cr.CudaSuccess {
panic("Vector addition failed")
}
}
```
#### Vector Subtraction
```go
package main
import (
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
bn254 "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254"
)
func main() {
testSize := 1 << 12
a := bn254.GenerateScalars(testSize)
b := bn254.GenerateScalars(testSize)
out := make(core.HostSlice[bn254.ScalarField], testSize)
cfg := core.DefaultVecOpsConfig()
	// Perform vector subtraction
err := bn254.VecOp(a, b, out, cfg, core.Sub)
if err != cr.CudaSuccess {
panic("Vector subtraction failed")
}
}
```
#### Vector Multiplication
```go
package main
import (
"github.com/ingonyama-zk/icicle/v2/wrappers/golang/core"
cr "github.com/ingonyama-zk/icicle/v2/wrappers/golang/cuda_runtime"
bn254 "github.com/ingonyama-zk/icicle/v2/wrappers/golang/curves/bn254"
)
func main() {
testSize := 1 << 12
a := bn254.GenerateScalars(testSize)
b := bn254.GenerateScalars(testSize)
out := make(core.HostSlice[bn254.ScalarField], testSize)
cfg := core.DefaultVecOpsConfig()
// Perform vector multiplication
err := bn254.VecOp(a, b, out, cfg, core.Mul)
if err != cr.CudaSuccess {
panic("Vector multiplication failed")
}
}
```
### VecOps Method
```go
func VecOp(a, b, out core.HostOrDeviceSlice, config core.VecOpsConfig, op core.VecOps) (ret cr.CudaError)
```
#### Parameters
- **`a`**: The first input vector.
- **`b`**: The second input vector.
- **`out`**: The output vector where the result of the operation will be stored.
- **`config`**: A `VecOpsConfig` object containing various configuration options for the vector operations.
- **`op`**: The operation to perform, specified as one of the constants (`Sub`, `Add`, `Mul`) from the `VecOps` type.
#### Return Value
- **`CudaError`**: Returns a CUDA error code indicating the success or failure of the vector operation.
### VecOpsConfig
The `VecOpsConfig` structure holds configuration parameters for the vector operations, allowing customization of its behavior.
```go
type VecOpsConfig struct {
Ctx cr.DeviceContext
isAOnDevice bool
isBOnDevice bool
isResultOnDevice bool
IsAsync bool
}
```
#### Fields
- **Ctx**: Device context containing details like device ID and stream ID.
- **isAOnDevice**: Indicates if vector `a` is located on the device.
- **isBOnDevice**: Indicates if vector `b` is located on the device.
- **isResultOnDevice**: Specifies where the result vector should be stored (device or host memory).
- **IsAsync**: Controls whether the vector operation runs asynchronously.
#### Default Configuration
Use `DefaultVecOpsConfig` to obtain a default configuration, customizable as needed.
```go
func DefaultVecOpsConfig() VecOpsConfig
```
## MatrixTranspose API Documentation
This section describes the functionality of the `TransposeMatrix` function used for matrix transposition.
The function takes a matrix represented as a 1D slice and transposes it, storing the result in another 1D slice.
### Function
```go
func TransposeMatrix(in, out core.HostOrDeviceSlice, columnSize, rowSize int, ctx cr.DeviceContext, onDevice, isAsync bool) (ret core.IcicleError)
```
## Parameters
- **`in`**: The input matrix is a `core.HostOrDeviceSlice`, stored as a 1D slice.
- **`out`**: The output matrix is a `core.HostOrDeviceSlice`, which will be the transpose of the input matrix, stored as a 1D slice.
- **`columnSize`**: The number of columns in the input matrix.
- **`rowSize`**: The number of rows in the input matrix.
- **`ctx`**: The device context `cr.DeviceContext` to be used for the matrix transpose operation.
- **`onDevice`**: Indicates whether the input and output slices are stored on the device (GPU) or the host (CPU).
- **`isAsync`**: Indicates whether the matrix transpose operation should be executed asynchronously.
## Return Value
The function returns a `core.IcicleError` value, which represents the result of the matrix transpose operation. If the operation is successful, the returned value will be `0`.
## Example Usage
```go
var input = make(core.HostSlice[ScalarField], 20)
var output = make(core.HostSlice[ScalarField], 20)
// Populate the input matrix
// ...
// Get device context
ctx, _ := cr.GetDefaultDeviceContext()
// Transpose the matrix
err := TransposeMatrix(input, output, 5, 4, ctx, false, false)
if err.IcicleErrorCode != core.IcicleErrorCode(0) {
// Handle the error
}
// Use the transposed matrix
// ...
```
In this example, the `TransposeMatrix` function is used to transpose a 5x4 matrix stored in a 1D slice. The input and output slices are stored on the host (CPU), and the operation is executed synchronously.

Binary file not shown.

Before

Width:  |  Height:  |  Size: 35 KiB

View File

@@ -1,97 +0,0 @@
# ICICLE integrated provers
ICICLE has been used by companies and projects such as [Celer Network](https://github.com/celer-network), [Consensys Gnark](https://github.com/Consensys/gnark), [EZKL](https://blog.ezkl.xyz/post/acceleration/), [ZKWASM](https://twitter.com/DelphinusLab/status/1762604988797513915) and others to accelerate their ZK proving pipeline.
Many of these integrations have been a collaboration between Ingonyama and the integrating company. We have learned a lot about designing GPU based ZK provers.
If you're interested in understanding these integrations better or learning how you can use ICICLE to accelerate your existing ZK proving pipeline this is the place for you.
## A primer to building your own integrations
Lets illustrate an ICICLE integration, so you can understand the core API and design overview of ICICLE.
![ICICLE architecture](../../static/img/architecture-high-level.png)
Engineers usually use a cryptographic library to implement their ZK protocols. These libraries implement efficient primitives which are used as building blocks for the protocol; ICICLE is such a library. The difference is that ICICLE is designed from the start to run on GPUs; the Rust and Golang APIs abstract away all low level CUDA details. Our goal was to allow developers with no GPU experience to quickly get started with ICICLE.
A developer may use ICICLE with two main approaches in mind.
1. Drop-in replacement approach.
2. End-to-End GPU replacement approach.
The first approach for GPU-accelerating your Prover with ICICLE is quick to implement, but it has limitations, such as reduced memory optimization and limited protocol tuning for GPUs. It's a solid starting point, but those committed to fully leveraging GPU acceleration should consider a more comprehensive approach.
An End-to-End GPU replacement means performing the entire ZK proof on the GPU. This approach will reduce latency to a minimum and requires you to change the way you implement the protocol to be more GPU friendly. This approach will take full advantage of GPU acceleration. Redesigning your prover this way may take more engineering effort but we promise you that it's worth it!
## Using ICICLE integrated provers
Here we cover how a developer can run existing circuits on ICICLE integrated provers.
### Gnark
[Gnark](https://github.com/Consensys/gnark) officially supports GPU proving with ICICLE. Currently only Groth16 on curve `BN254` is supported. This means that if you are currently using Gnark to write your circuits you can enjoy GPU acceleration without making many changes.
:::info
Currently ICICLE has been merged to Gnark [master branch](https://github.com/Consensys/gnark), however the [latest release](https://github.com/Consensys/gnark/releases/tag/v0.9.1) is from October 2023.
:::
Make sure your golang circuit project has `gnark` as a dependency and that you are using the master branch for now.
```
go get github.com/consensys/gnark@master
```
You should see two indirect dependencies added.
```
...
github.com/ingonyama-zk/icicle v0.1.0 // indirect
github.com/ingonyama-zk/iciclegnark v0.1.1 // indirect
...
```
:::info
As you may notice we are using ICICLE v0.1 here since golang bindings are only supported in ICICLE v0.1 for the time being.
:::
To switch over to ICICLE proving, make sure to change the backend you are using, below is an example of how this should be done.
```
// toggle on
proofIci, err := groth16.Prove(ccs, pk, secretWitness, backend.WithIcicleAcceleration())
// toggle off
proof, err := groth16.Prove(ccs, pk, secretWitness)
```
Now that you have enabled the `WithIcicleAcceleration` backend, simply change the way you run your circuits to:
```
go run -tags=icicle main.go
```
Your logs should look something like this if everything went as expected.
```
13:12:05 INF compiling circuit
13:12:05 INF parsed circuit inputs nbPublic=1 nbSecret=1
13:12:05 INF building constraint builder nbConstraints=3
13:12:05 DBG precomputing proving key in GPU acceleration=icicle backend=groth16 curve=bn254 nbConstraints=3
13:12:05 DBG constraint system solver done nbConstraints=3 took=0.070259
13:12:05 DBG prover done acceleration=icicle backend=groth16 curve=bn254 nbConstraints=3 took=80.356684
13:12:05 DBG verifier done backend=groth16 curve=bn254 took=1.843888
```
`acceleration=icicle` indicates that the prover is running in acceleration mode with ICICLE.
You can reference the [Gnark docs](https://github.com/Consensys/gnark?tab=readme-ov-file#gpu-support) for further information.
### Halo2
[Halo2](https://github.com/zkonduit/halo2) fork integrated with ICICLE for GPU acceleration. This means that you can run your existing Halo2 circuits with GPU acceleration just by activating a feature flag.
To enable GPU acceleration just enable `icicle_gpu` [feature flag](https://github.com/zkonduit/halo2/blob/3d7b5e61b3052680ccb279e05bdcc21dd8a8fedf/halo2_proofs/Cargo.toml#L102).
This feature flag will seamlessly toggle on GPU acceleration for you.

View File

@@ -1,247 +0,0 @@
# Getting started with ICICLE
This guide is oriented towards developers who want to start writing code with the ICICLE libraries. If you just want to run your existing ZK circuits on GPU, please refer to [this guide](./integrations.md#using-icicle-integrations).
## ICICLE repository overview
![ICICLE API overview](../../static/img/apilevels.png)
The diagram above displays the general architecture of ICICLE and the API layers that exist. The CUDA API, which we also call ICICLE Core, is the lowest level and is comprised of CUDA kernels which implement all primitives such as MSM as well as C++ wrappers which expose these methods for different curves.
ICICLE Core compiles into a static library. This library can be used with our official Golang and Rust wrappers or linked with your C++ project. You can also implement a wrapper for it in any other language.
Based on this dependency architecture, the ICICLE repository has three main sections:
- [ICICLE Core](#icicle-core)
- [ICICLE Rust bindings](#icicle-rust-and-golang-bindings)
- [ICICLE Golang bindings](#icicle-rust-and-golang-bindings)
### ICICLE Core
[ICICLE Core](/icicle/core) is a library that directly works with GPU by defining CUDA kernels and algorithms that invoke them. It contains code for [fast field arithmetic](https://github.com/ingonyama-zk/icicle/tree/main/icicle/include/field/field.cuh), cryptographic primitives used in ZK such as [NTT](https://github.com/ingonyama-zk/icicle/tree/main/icicle/src/ntt/), [MSM](https://github.com/ingonyama-zk/icicle/tree/main/icicle/src/msm/), [Poseidon Hash](https://github.com/ingonyama-zk/icicle/tree/main/icicle/src/poseidon/), [Polynomials](https://github.com/ingonyama-zk/icicle/tree/main/icicle/src/polynomials/) and others.
ICICLE Core would typically be compiled into a static library and either used in a third party language such as Rust or Golang, or linked with your own C++ project.
### ICICLE Rust and Golang bindings
- [ICICLE Rust bindings](/icicle/rust-bindings)
- [ICICLE Golang bindings](/icicle/golang-bindings)
These bindings allow you to easily use ICICLE in a Rust or Golang project. Setting up Golang bindings requires a bit of extra steps compared to the Rust bindings which utilize the `cargo build` tool.
## Running ICICLE
This guide assumes that you have a Linux or Windows machine with an Nvidia GPU installed. If you don't have access to an Nvidia GPU you can access one for free on [Google Colab](https://colab.google/).
:::info note
ICICLE can only run on Linux or Windows. **MacOS is not supported**.
:::
### Prerequisites
- NVCC (version 12.0 or newer)
- cmake 3.18 and above
- GCC - version 9 or newer is recommended.
- Any Nvidia GPU
- Linux or Windows operating system.
#### Optional Prerequisites
- Docker, latest version.
- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html)
If you don't wish to install these prerequisites you can follow this tutorial using a [ZK-Container](https://github.com/ingonyama-zk/icicle/blob/main/Dockerfile) (docker container). To learn more about using ZK-Containers [read this](../ZKContainers.md).
### Setting up ICICLE and running tests
The objective of this guide is to make sure you can run the ICICLE Core, Rust and Golang tests. Achieving this will ensure you know how to setup ICICLE and run an ICICLE program. For simplicity, we will be using the ICICLE docker container as our environment, however, you may install the prerequisites on your machine and [skip](#icicle-core-1) the docker section.
#### Setting up environment with Docker
Lets begin by cloning the ICICLE repository:
```sh
git clone https://github.com/ingonyama-zk/icicle
```
We will proceed to build the docker image [found here](https://github.com/ingonyama-zk/icicle/blob/main/Dockerfile):
```sh
docker build -t icicle-demo .
docker run -it --runtime=nvidia --gpus all --name icicle_container icicle-demo
```
- `-it` runs the container in interactive mode with a terminal.
- `--gpus all` Allocate all available GPUs to the container. You can also specify which GPUs to use if you don't want to allocate all.
- `--runtime=nvidia` Use the NVIDIA runtime, necessary for GPU support.
To read more about these settings reference this [article](https://developer.nvidia.com/nvidia-container-runtime).
If you accidentally close your terminal and want to reconnect just call:
```sh
docker exec -it icicle_container bash
```
Lets make sure that we have the correct CUDA version before proceeding
```sh
nvcc --version
```
You should see something like this
```sh
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
```
Make sure the release version is at least 12.0.
#### ICICLE Core
ICICLE Core is found under [`<project_root>/icicle`](https://github.com/ingonyama-zk/icicle/tree/main/icicle). To build and run the tests first:
```sh
cd icicle
```
For this example, we are going to compile ICICLE for a `bn254` curve. However other compilation strategies are supported.
```sh
mkdir -p build
cmake -S . -B build -DCURVE=bn254 -DBUILD_TESTS=ON
cmake --build build -j
```
`-DBUILD_TESTS` option compiles the tests, without this flag `ctest` won't work.
`-DCURVE` option tells the compiler which curve to build. You can find a list of supported curves [here](https://github.com/ingonyama-zk/icicle/tree/main/icicle/cmake/CurvesCommon.cmake#L2).
The output in `build` folder should include the static libraries for the compiled curve.
To run the test
```sh
cd build/tests
ctest
```
#### ICICLE Rust
The rust bindings work by first compiling the CUDA static libraries as seen [here](https://github.com/ingonyama-zk/icicle/blob/main/wrappers/rust/icicle-curves/icicle-bn254/build.rs). The compilation of CUDA and the Rust library is all handled by the rust build toolchain.
Similar to ICICLE Core here we also have to compile per curve.
Lets compile curve `bn254`
```sh
cd wrappers/rust/icicle-curves/icicle-bn254
```
Now lets build our library
```sh
cargo build --release
```
This may take a couple of minutes since we are compiling both the CUDA and Rust code.
To run the tests
```sh
cargo test
```
We also include some benchmarks
```sh
cargo bench
```
#### ICICLE Golang
The Golang bindings require compiling ICICLE Core first. We supply a [build script](https://github.com/ingonyama-zk/icicle/blob/main/wrappers/golang/build.sh) to help build what you need.
Script usage:
```sh
./build.sh [-curve=<curve>] [-field=<field>] [-hash=<hash>] [-cuda_version=<version>] [-g2] [-ecntt] [-devmode]
curve - The name of the curve to build or "all" to build all supported curves
field - The name of the field to build or "all" to build all supported fields
hash - The name of the hash to build or "all" to build all supported hashes
-g2 - Optional - build with G2 enabled
-ecntt - Optional - build with ECNTT enabled
-devmode - Optional - build in devmode
```
:::note
If more than one curve or more than one field or more than one hash is supplied, the last one supplied will be built
:::
Once the library has been built, you can use and test the Golang bindings.
To test a specific curve, field or hash, change to its directory and then run:
```sh
go test ./tests -count=1 -failfast -timeout 60m -p 2 -v
```
You will be able to see each test that runs, how long it takes and whether it passed or failed
### Running ICICLE examples
ICICLE examples can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/examples) these examples cover some simple use cases using C++, rust and golang.
Lets run one of our C++ examples, in this case the [MSM example](https://github.com/ingonyama-zk/icicle/blob/main/examples/c%2B%2B/msm/example.cu).
```sh
cd examples/c++/msm
./compile.sh
./run.sh
```
:::tip
Read through the compile.sh and CMakeLists.txt to understand how to link your own C++ project with ICICLE
:::
#### Running with Docker
In each example directory, ZK-container files are located in a subdirectory `.devcontainer`.
```sh
msm/
├── .devcontainer
├── devcontainer.json
└── Dockerfile
```
Now lets build our docker file and run the test inside it. Make sure you have installed the [optional prerequisites](#optional-prerequisites).
```sh
docker build -t icicle-example-msm -f .devcontainer/Dockerfile .
```
Lets start and enter the container
```sh
docker run -it --rm --gpus all -v .:/icicle-example icicle-example-msm
```
Inside the container you can run the same commands:
```sh
./compile.sh
./run.sh
```
You can now experiment with our other examples, perhaps try to run a rust or golang example next.

View File

@@ -1,61 +0,0 @@
# Multi GPU with ICICLE
:::info
If you are looking for the Multi GPU API documentation refer [here](./rust-bindings/multi-gpu.md) for Rust and [here](./golang-bindings/multi-gpu.md) for Golang.
:::
One common challenge with Zero-Knowledge computation is managing the large input sizes. It's not uncommon to encounter circuits surpassing 2^25 constraints, pushing the capabilities of even advanced GPUs to their limits. To effectively scale and process such large circuits, leveraging multiple GPUs in tandem becomes a necessity.
Multi-GPU programming involves developing software to operate across multiple GPU devices. Let's first explore different approaches to Multi-GPU programming, then we will cover how ICICLE allows you to easily develop your ZK computations to run across many GPUs.
## Approaches to Multi GPU programming
There are many [different strategies](https://github.com/NVIDIA/multi-gpu-programming-models) available for implementing multi GPU, however, it can be split into two categories.
### GPU Server approach
This approach usually involves a single or multiple CPUs opening threads to read / write from multiple GPUs. You can think about it as a scaled up HOST - Device model.
![alt text](image.png)
This approach won't let us tackle larger computation sizes but it will allow us to compute multiple computations which we wouldn't be able to load onto a single GPU.
For example, let's say that you had to compute two MSMs of size 2^26 on a 16GB VRAM GPU; you would normally have to perform them one after the other. However, if you double the number of GPUs in your system you can now run them in parallel.
### Inter GPU approach
This approach involves a more sophisticated approach to multi GPU computation. Using technologies such as [GPUDirect, NCCL, NVSHMEM](https://www.nvidia.com/en-us/on-demand/session/gtcspring21-cwes1084/) and NVLink it's possible to combine multiple GPUs and split a computation among different devices.
This approach requires redesigning the algorithm at the software level to be compatible with splitting amongst devices. In some cases, to lower latency to a minimum, special inter GPU connections would be installed on a server to allow direct communication between multiple GPUs.
## Writing ICICLE Code for Multi GPUs
The approach we have taken for the moment is a GPU Server approach; we assume you have a machine with multiple GPUs and you wish to run some computation on each GPU.
To dive deeper and learn about the API check out the docs for our different ICICLE API
- [Rust Multi GPU APIs](./rust-bindings/multi-gpu.md)
- [Golang Multi GPU APIs](./golang-bindings/multi-gpu.md)
- C++ Multi GPU APIs
## Best practices
- Never hardcode device IDs, if you want your software to take advantage of all GPUs on a machine use methods such as `get_device_count` to support arbitrary number of GPUs.
- Launch one CPU thread per GPU. To avoid [nasty errors](https://developer.nvidia.com/blog/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/) and hard to read code we suggest that for every GPU you create a dedicated thread. Within a CPU thread you should be able to launch as many tasks as you wish for a GPU as long as they all run on the same GPU id. This will make your code way more manageable, easy to read and performant.
## ZKContainer support for multi GPUs
Multi GPU support should work with ZK-Containers by simply defining which devices the docker container should interact with:
```sh
docker run -it --gpus '"device=0,2"' zk-container-image
```
If you wish to expose all GPUs
```sh
docker run --gpus all zk-container-image
```

View File

@@ -1,58 +0,0 @@
# What is ICICLE?
[![GitHub Release](https://img.shields.io/github/v/release/ingonyama-zk/icicle)](https://github.com/ingonyama-zk/icicle/releases)
[ICICLE](https://github.com/ingonyama-zk/icicle) is a cryptography library for ZK using GPUs. ICICLE implements blazing fast cryptographic primitives such as EC operations, MSM, NTT, Poseidon hash and more on GPU.
ICICLE allows developers with minimal GPU experience to effortlessly accelerate their ZK application; from our experiments, even the most naive implementation may yield 10X improvement in proving times.
ICICLE has been used by many leading ZK companies such as [Celer Network](https://github.com/celer-network), [Gnark](https://github.com/Consensys/gnark) and others to accelerate their ZK proving pipeline.
## Dont have access to a GPU?
We understand that not all developers have access to a GPU and we don't want this to limit anyone from developing with ICICLE.
Here are some ways we can help you gain access to GPUs:
:::note
If none of the following options suit your needs, contact us on [telegram](https://t.me/RealElan) for assistance. We're committed to ensuring that a lack of a GPU doesn't become a bottleneck for you. If you need help with setup or any other issues, we're here to help you.
:::
### Grants
At Ingonyama we are interested in accelerating the progress of ZK and cryptography. If you are an engineer, developer or an academic researcher we invite you to check out [our grant program](https://www.ingonyama.com/blog/icicle-for-researchers-grants-challenges). We will give you access to GPUs and even pay you to do your dream research!
### Google Colab
This is a great way to get started with ICICLE instantly. Google Colab offers free GPU access to a NVIDIA T4 instance with 16 GB of memory which should be enough for experimenting and even prototyping with ICICLE.
For an extensive guide on how to setup Google Colab with ICICLE refer to [this article](./colab-instructions.md).
### Vast.ai
[Vast.ai](https://vast.ai/) is a global GPU marketplace where you can rent many different types of GPUs by the hour for [competitive pricing](https://vast.ai/pricing). They provide on-demand and interruptible rentals depending on your need or use case; you can learn more about their rental types [here](https://vast.ai/faq#rental-types).
## What can you do with ICICLE?
[ICICLE](https://github.com/ingonyama-zk/icicle) can be used in the same way you would use any other cryptography library. While developing and integrating ICICLE into many proof systems, we found some use case categories:
### Circuit developers
If you are a circuit developer and are experiencing bottlenecks while running your circuits, an ICICLE integrated prover may be the solution.
ICICLE has been integrated into a number of popular ZK provers including [Gnark prover](https://github.com/Consensys/gnark) and [Halo2](https://github.com/zkonduit/halo2). This means that you can enjoy GPU acceleration for your existing circuits immediately without writing a single line of code by simply switching on the GPU prover flag!
### Integrating into existing ZK provers
From our collaborations we have learned that its possible to accelerate a specific part of your prover to solve for a specific bottleneck.
ICICLE can be used to accelerate specific parts of your prover without completely rewriting your ZK prover.
### Developing your own ZK provers
If your goal is to build a ZK prover from the ground up, ICICLE is an ideal tool for creating a highly optimized and scalable ZK prover. A key benefit of using GPUs with ICICLE is the ability to scale your ZK prover efficiently across multiple machines within a data center.
### Developing proof of concepts
ICICLE is also ideal for developing small prototypes. ICICLE has Golang and Rust bindings so you can easily develop a library implementing a specific primitive using ICICLE. An example would be to develop a KZG commitment library using ICICLE.

View File

@@ -1,27 +0,0 @@
@startuml
skinparam componentStyle uml2
' Define Components
component "C++ Template\nComponent" as CppTemplate {
[Parameterizable Interface]
}
component "C API Wrapper\nComponent" as CApiWrapper {
[C API Interface]
}
component "Rust Code\nComponent" as RustCode {
[Macro Interface\n(Template Instantiation)]
}
' Define Artifact
artifact "Static Library\n«artifact»" as StaticLib
' Connections
CppTemplate -down-> CApiWrapper : Instantiates
CApiWrapper .down.> StaticLib : Compiles into
RustCode -left-> StaticLib : Links against\nand calls via FFI
' Notes
note right of CppTemplate : Generic C++\ntemplate implementation
note right of CApiWrapper : Exposes C API for FFI\nto Rust/Go
note right of RustCode : Uses macros to\ninstantiate templates
@enduml

View File

@@ -1,86 +0,0 @@
@startuml
' Define Interface for Polynomial Backend Operations
interface IPolynomialBackend {
+add()
+subtract()
+multiply()
+divide()
+evaluate()
}
' Define Interface for Polynomial Context (State Management)
interface IPolynomialContext {
+initFromCoeffs()
+initFromEvals()
+getCoeffs()
+getEvals()
}
' PolynomialAPI now uses two strategies: Backend and Context
class PolynomialAPI {
-backendStrategy: IPolynomialBackend
-contextStrategy: IPolynomialContext
-setBackendStrategy(IPolynomialBackend)
-setContextStrategy(IPolynomialContext)
+add()
+subtract()
+multiply()
+divide()
+evaluate()
}
' Backend Implementations
class GPUPolynomialBackend implements IPolynomialBackend {
#gpuResources: Resource
+add()
+subtract()
+multiply()
+divide()
+evaluate()
}
class ZPUPolynomialBackend implements IPolynomialBackend {
#zpuResources: Resource
+add()
+subtract()
+multiply()
+divide()
+evaluate()
}
class TracerPolynomialBackend implements IPolynomialBackend {
#traceData: Data
+add()
+subtract()
+multiply()
+divide()
+evaluate()
}
' Context Implementations (Placeholder for actual implementation)
class GPUContext implements IPolynomialContext {
+initFromCoeffs()
+initFromEvals()
+getCoeffs()
+getEvals()
}
class ZPUContext implements IPolynomialContext {
+initFromCoeffs()
+initFromEvals()
+getCoeffs()
+getEvals()
}
class TracerContext implements IPolynomialContext {
+initFromCoeffs()
+initFromEvals()
+getCoeffs()
+getEvals()
}
' Relationships
PolynomialAPI o-- IPolynomialBackend : uses
PolynomialAPI o-- IPolynomialContext : uses
@enduml

View File

@@ -1,388 +0,0 @@
# Polynomial API Overview
:::note
Read our paper on the Polynomials API in ICICLE v2 by clicking [here](https://eprint.iacr.org/2024/973).
:::
## Introduction
The Polynomial API offers a robust framework for polynomial operations within a computational environment. It's designed for flexibility and efficiency, supporting a broad range of operations like arithmetic, evaluation, and manipulation, all while abstracting from the computation and storage specifics. This enables adaptability to various backend technologies, employing modern C++ practices.
## Key Features
### Backend Agnostic Architecture
Our API is structured to be independent of any specific computational backend. While a CUDA backend is currently implemented, the architecture facilitates easy integration of additional backends. This capability allows users to perform polynomial operations without the need to tailor their code to specific hardware, enhancing code portability and scalability.
### Templating in the Polynomial API
The Polynomial API is designed with a templated structure to accommodate different data types for coefficients, the domain, and images. This flexibility allows the API to be adapted for various computational needs and types of data.
```cpp
template <typename Coeff, typename Domain = Coeff, typename Image = Coeff>
class Polynomial {
// Polynomial class definition
}
```
In this template:
- **`Coeff`**: Represents the type of the coefficients of the polynomial.
- **`Domain`**: Specifies the type for the input values over which the polynomial is evaluated. By default, it is the same as the type of the coefficients but can be specified separately to accommodate different computational contexts.
- **`Image`**: Defines the type of the output values of the polynomial. This is typically the same as the coefficients.
#### Default instantiation
```cpp
extern template class Polynomial<scalar_t>;
```
#### Extended use cases
The templated nature of the Polynomial API also supports more complex scenarios. For example, coefficients and images could be points on an elliptic curve (EC points), which are useful in cryptographic applications and advanced algebraic structures. This approach allows the API to be extended easily to support new algebraic constructions without modifying the core implementation.
### Supported Operations
The Polynomial class encapsulates a polynomial, providing a variety of operations:
- **Construction**: Create polynomials from coefficients or evaluations on roots-of-unity domains.
- **Arithmetic Operations**: Perform addition, subtraction, multiplication, and division.
- **Evaluation**: Directly evaluate polynomials at specific points or across a domain.
- **Manipulation**: Features like slicing polynomials, adding or subtracting monomials inplace, and computing polynomial degrees.
- **Memory Access**: Access internal states or obtain device-memory views of polynomials.
## Usage
This section outlines how to use the Polynomial API in C++. Bindings for Rust and Go are detailed under the Bindings sections.
### Backend Initialization
Initialization with an appropriate factory is required to configure the computational context and backend.
```cpp
#include "polynomials/polynomials.h"
#include "polynomials/cuda_backend/polynomial_cuda_backend.cuh"
// Initialize with a CUDA backend
Polynomial::initialize(std::make_shared<CUDAPolynomialFactory>());
```
:::note
Initialization of a factory must be done per linked curve or field.
:::
### Construction
Polynomials can be constructed from coefficients, from evaluations on roots-of-unity domains, or by cloning existing polynomials.
```cpp
// Construction
static Polynomial from_coefficients(const Coeff* coefficients, uint64_t nof_coefficients);
static Polynomial from_rou_evaluations(const Image* evaluations, uint64_t nof_evaluations);
// Clone the polynomial
Polynomial clone() const;
```
Example:
```cpp
auto p_from_coeffs = Polynomial_t::from_coefficients(coeff /* :scalar_t* */, nof_coeffs);
auto p_from_rou_evals = Polynomial_t::from_rou_evaluations(rou_evals /* :scalar_t* */, nof_evals);
auto p_cloned = p.clone(); // p_cloned and p do not share memory
```
:::note
The coefficients or evaluations may be allocated either on host or device memory. In both cases the memory is copied to the backend device.
:::
### Arithmetic
Constructed polynomials can be used for various arithmetic operations:
```cpp
// Addition
Polynomial operator+(const Polynomial& rhs) const;
Polynomial& operator+=(const Polynomial& rhs); // inplace addition
// Subtraction
Polynomial operator-(const Polynomial& rhs) const;
// Multiplication
Polynomial operator*(const Polynomial& rhs) const;
Polynomial operator*(const Domain& scalar) const; // scalar multiplication
// Division A(x) = B(x)Q(x) + R(x)
std::pair<Polynomial, Polynomial> divide(const Polynomial& rhs) const; // returns (Q(x), R(x))
Polynomial operator/(const Polynomial& rhs) const; // returns quotient Q(x)
Polynomial operator%(const Polynomial& rhs) const; // returns remainder R(x)
Polynomial divide_by_vanishing_polynomial(uint64_t degree) const; // division by the vanishing polynomial V(x)=X^N-1
```
#### Example
Given polynomials A(x),B(x),C(x) and V(x) the vanishing polynomial.
$$
H(x)=\frac{A(x) \cdot B(x) - C(x)}{V(x)} \space where \space V(x) = X^{N}-1
$$
```cpp
auto H = (A*B-C).divide_by_vanishing_polynomial(N);
```
### Evaluation
Evaluate polynomials at arbitrary domain points, across a domain or on a roots-of-unity domain.
```cpp
Image operator()(const Domain& x) const; // evaluate f(x)
void evaluate(const Domain* x, Image* evals /*OUT*/) const;
void evaluate_on_domain(Domain* domain, uint64_t size, Image* evals /*OUT*/) const; // caller allocates memory
void evaluate_on_rou_domain(uint64_t domain_log_size, Image* evals /*OUT*/) const; // caller allocates memory
```
Example:
```cpp
Coeff x = rand();
Image f_x = f(x); // evaluate f at x
// evaluate f(x) on a domain
uint64_t domain_size = ...;
auto domain = /*build domain*/; // host or device memory
auto evaluations = std::make_unique<scalar_t[]>(domain_size); // can be device memory too
f.evaluate_on_domain(domain, domain_size, evaluations);
// evaluate f(x) on roots of unity domain
uint64_t domain_log_size = ...;
auto evaluations_rou_domain = std::make_unique<scalar_t[]>(1 << domain_log_size); // can be device memory too
f.evaluate_on_rou_domain(domain_log_size, evaluations_rou_domain);
```
### Manipulations
Beyond arithmetic, the API supports efficient polynomial manipulations:
#### Monomials
```cpp
// Monomial operations
Polynomial& add_monomial_inplace(Coeff monomial_coeff, uint64_t monomial = 0);
Polynomial& sub_monomial_inplace(Coeff monomial_coeff, uint64_t monomial = 0);
```
The ability to add or subtract monomials directly and in-place is an efficient way to manipulate polynomials.
Example:
```cpp
f.add_monomial_inplace(scalar_t::from(5)); // f(x) += 5
f.sub_monomial_inplace(scalar_t::from(3), 8); // f(x) -= 3x^8
```
#### Computing the degree of a Polynomial
```cpp
// Degree computation
int64_t degree();
```
The degree of a polynomial is a fundamental characteristic that describes the highest power of the variable in the polynomial expression with a non-zero coefficient.
The `degree()` function in the API returns the degree of the polynomial, corresponding to the highest exponent with a non-zero coefficient.
- For the polynomial $f(x) = x^5 + 2x^3 + 4$, the degree is 5 because the highest power of $x$ with a non-zero coefficient is 5.
- For a scalar value such as a constant term (e.g., $f(x) = 7$), the degree is considered 0, as it corresponds to $x^0$.
- The degree of the zero polynomial, $f(x) = 0$, where there are no non-zero coefficients, is defined as -1. This special case often represents an "empty" or undefined state in many mathematical contexts.
Example:
```cpp
auto f = /*some expression*/;
auto degree_of_f = f.degree();
```
#### Slicing
```cpp
// Slicing and selecting even or odd components.
Polynomial slice(uint64_t offset, uint64_t stride, uint64_t size = 0 /*0 means take all elements*/);
Polynomial even();
Polynomial odd();
```
The Polynomial API provides methods for slicing polynomials and selecting specific components, such as even or odd indexed terms. Slicing allows extracting specific sections of a polynomial based on an offset, stride, and size.
The following examples demonstrate folding a polynomial's even and odd parts and arbitrary slicing:
```cpp
// folding a polynomial's even and odd parts with randomness
auto x = rand();
auto even = f.even();
auto odd = f.odd();
auto fold_poly = even + odd * x;
// arbitrary slicing (first quarter)
auto first_quarter = f.slice(0 /*offset*/, 1 /*stride*/, f.degree()/4 /*size*/);
```
### Memory access (copy/view)
Access to the polynomial's internal state can be vital for operations like commitment schemes or when more efficient custom operations are necessary. This can be done either by copying or viewing the polynomial.
#### Copying
Copies the polynomial coefficients to either host or device allocated memory.
:::note
Copying to host memory is backend agnostic while copying to device memory requires the memory to be allocated on the corresponding backend.
:::
```cpp
Coeff get_coeff(uint64_t idx) const; // copy single coefficient to host
uint64_t copy_coeffs(Coeff* coeffs, uint64_t start_idx, uint64_t end_idx) const;
```
Example:
```cpp
auto coeffs_device = /*allocate CUDA or host memory*/
f.copy_coeffs(coeffs_device, 0/*start*/, f.degree());
MSMConfig cfg = msm::defaultMSMConfig();
cfg.are_points_on_device = true; // assuming copy to device memory
auto rv = msm::MSM(coeffs_device, points, msm_size, cfg, results);
```
#### Views
The Polynomial API supports efficient data handling through the use of memory views. These views provide direct access to the polynomial's internal state without the need to copy data. This feature is particularly useful for operations that require direct access to device memory, enhancing both performance and memory efficiency.
##### What is a Memory View?
A memory view is essentially a pointer to data stored in device memory. By providing a direct access pathway to the data, it eliminates the need for data duplication, thus conserving both time and system resources. This is especially beneficial in high-performance computing environments where data size and operation speed are critical factors.
##### Applications of Memory Views
Memory views are extremely versatile and can be employed in various computational contexts such as:
- **Commitments**: Views can be used to commit polynomial states in cryptographic schemes, such as Multi-Scalar Multiplications (MSM).
- **External Computations**: They allow external functions or algorithms to utilize the polynomial's data directly, facilitating operations outside the core polynomial API. This is useful for custom operations that are not covered by the API.
##### Obtaining and Using Views
To create and use views within the Polynomial API, functions are provided to obtain pointers to both coefficients and evaluation data. Here's how they are generally structured:
```cpp
// Obtain a view of the polynomial's coefficients
std::tuple<IntegrityPointer<Coeff>, uint64_t /*size*/, uint64_t /*device_id*/> get_coefficients_view();
```
Example usage:
```cpp
auto [coeffs_view, size, device_id] = polynomial.get_coefficients_view();
// Use coeffs_view in a computational routine that requires direct access to polynomial coefficients
// Example: Passing the view to a GPU-accelerated function
gpu_accelerated_function(coeffs_view.get(),...);
```
##### Integrity-Pointer: Managing Memory Views
Within the Polynomial API, memory views are managed through a specialized tool called the Integrity-Pointer. This pointer type is designed to safeguard operations by monitoring the validity of the memory it points to. It can detect if the memory has been modified or released, thereby preventing unsafe access to stale or non-existent data.
The Integrity-Pointer not only acts as a regular pointer but also provides additional functionality to ensure the integrity of the data it references. Here are its key features:
```cpp
// Checks whether the pointer is still considered valid
bool isValid() const;
// Retrieves the raw pointer or nullptr if pointer is invalid
const T* get() const;
// Dereferences the pointer. Throws exception if the pointer is invalid.
const T& operator*() const;
//Provides access to the member of the pointed-to object. Throws exception if the pointer is invalid.
const T* operator->() const;
```
Consider the Following case:
```cpp
auto [coeff_view, size, device] = f.get_coefficients_view();
// Use the coefficients view to perform external operations
commit_to_polynomial(coeff_view.get(), size);
// Modification of the original polynomial
f += g; // Any operation that modifies 'f' potentially invalidates 'coeff_view'
// Check if the view is still valid before using it further
if (coeff_view.isValid()) {
perform_additional_computation(coeff_view.get(), size);
} else {
handle_invalid_data();
}
```
## Multi-GPU Support with CUDA Backend
The Polynomial API includes comprehensive support for multi-GPU environments, a crucial feature for leveraging the full computational power of systems equipped with multiple NVIDIA GPUs. This capability is part of the API's CUDA backend, which is designed to efficiently manage polynomial computations across different GPUs.
### Setting the CUDA Device
Like other components of the icicle framework, the Polynomial API allows explicit setting of the current CUDA device:
```cpp
cudaSetDevice(int deviceID);
```
This function sets the active CUDA device. All subsequent operations that allocate or deal with polynomial data will be performed on this device.
### Allocation Consistency
Polynomials are always allocated on the current CUDA device at the time of their creation. It is crucial to ensure that the device context is correctly set before initiating any operation that involves memory allocation:
```cpp
// Set the device before creating polynomials
cudaSetDevice(0);
Polynomial p1 = Polynomial::from_coefficients(coeffs, size);
cudaSetDevice(1);
Polynomial p2 = Polynomial::from_coefficients(coeffs, size);
```
### Matching Devices for Operations
When performing operations that result in the creation of new polynomials (such as addition or multiplication), it is imperative that both operands are on the same CUDA device. If the operands reside on different devices, an exception is thrown:
```cpp
// Ensure both operands are on the same device
cudaSetDevice(0);
auto p3 = p1 + p2; // Throws an exception if p1 and p2 are not on the same device
```
### Device-Agnostic Operations
Operations that do not involve the creation of new polynomials, such as computing the degree of a polynomial or performing in-place modifications, can be executed regardless of the current device setting:
```cpp
// 'degree' and in-place operations do not require device matching
int deg = p1.degree();
p1 += p2; // Valid if p1 and p2 are on the same device, throws otherwise
```
### Error Handling
The API is designed to throw exceptions if operations are attempted across polynomials that are not located on the same GPU. This ensures that all polynomial operations are performed consistently and without data integrity issues due to device mismatches.
### Best Practices
To maximize the performance and avoid runtime errors in a multi-GPU setup, always ensure that:
- The CUDA device is set correctly before polynomial allocation.
- Operations involving new polynomial creation are performed with operands on the same device.
By adhering to these guidelines, developers can effectively harness the power of multiple GPUs to handle large-scale polynomial computations efficiently.

Binary file not shown.

Before

Width:  |  Height:  |  Size: 220 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 215 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 322 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 113 KiB

View File

@@ -1,75 +0,0 @@
# Keccak
[Keccak](https://keccak.team/files/Keccak-implementation-3.2.pdf) is a cryptographic hash function designed by Guido Bertoni, Joan Daemen, Michaël Peeters, and Gilles Van Assche. It was selected as the winner of the NIST hash function competition, becoming the basis for the [SHA-3 standard](https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.202.pdf).
Keccak operates on a message input of any length and produces a fixed-size hash output. The hash function is built upon the sponge construction, which involves absorbing the input data followed by squeezing out the hash value.
At its core, Keccak consists of a permutation function operating on a state array. The permutation function employs a round function that operates iteratively on the state array. Each round consists of five main steps:
- **Theta:** This step introduces diffusion by performing a bitwise XOR operation between the state and a linear combination of its neighboring columns.
- **Rho:** This step performs bit rotation operations on each lane of the state array.
- **Pi:** This step rearranges the positions of the lanes in the state array.
- **Chi:** This step applies a nonlinear mixing operation to each lane of the state array.
- **Iota:** This step introduces a round constant to the state array.
## Using Keccak
ICICLE Keccak supports batch hashing, which can be utilized for constructing a merkle tree or running multiple hashes in parallel.
### Supported Bindings
- [Golang](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/golang/hash/keccak)
- [Rust](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/rust/icicle-hash)
### Example usage
This is an example of running 1024 Keccak-256 hashes in parallel, where input strings are of size 136 bytes:
```rust
use icicle_core::hash::HashConfig;
use icicle_cuda_runtime::memory::HostSlice;
use icicle_hash::keccak::keccak256;
let config = HashConfig::default();
let input_block_len = 136;
let number_of_hashes = 1024;
let preimages = vec![1u8; number_of_hashes * input_block_len];
let mut digests = vec![0u8; number_of_hashes * 64];
let preimages_slice = HostSlice::from_slice(&preimages);
let digests_slice = HostSlice::from_mut_slice(&mut digests);
keccak256(
preimages_slice,
input_block_len as u32,
number_of_hashes as u32,
digests_slice,
&config,
)
.unwrap();
```
### Merkle Tree
You can build a keccak merkle tree using the corresponding functions:
```rust
use icicle_core::tree::{merkle_tree_digests_len, TreeBuilderConfig};
use icicle_cuda_runtime::memory::HostSlice;
use icicle_hash::keccak::build_keccak256_merkle_tree;
let mut config = TreeBuilderConfig::default();
config.arity = 2;
let height = 22;
let input_block_len = 136;
let leaves = vec![1u8; (1 << height) * input_block_len];
let mut digests = vec![0u64; merkle_tree_digests_len((height + 1) as u32, 2, 1)];
let leaves_slice = HostSlice::from_slice(&leaves);
let digests_slice = HostSlice::from_mut_slice(&mut digests);
build_keccak256_merkle_tree(leaves_slice, digests_slice, height, input_block_len, &config).unwrap();
```
In the example above, a binary tree of height 22 is being built. Each leaf is considered to be a 136 byte long array. The leaves and digests are aligned in a flat array. You can also use keccak512 in `build_keccak512_merkle_tree` function.

View File

@@ -1,195 +0,0 @@
# MSM - Multi scalar multiplication
MSM stands for Multi scalar multiplication, it's defined as:
<math xmlns="http://www.w3.org/1998/Math/MathML">
<mi>M</mi>
<mi>S</mi>
<mi>M</mi>
<mo stretchy="false">(</mo>
<mi>a</mi>
<mo>,</mo>
<mi>G</mi>
<mo stretchy="false">)</mo>
<mo>=</mo>
<munderover>
<mo data-mjx-texclass="OP" movablelimits="false">&#x2211;</mo>
<mrow data-mjx-texclass="ORD">
<mi>j</mi>
<mo>=</mo>
<mn>0</mn>
</mrow>
<mrow data-mjx-texclass="ORD">
<mi>n</mi>
<mo>&#x2212;</mo>
<mn>1</mn>
</mrow>
</munderover>
<msub>
<mi>a</mi>
<mi>j</mi>
</msub>
<msub>
<mi>G</mi>
<mi>j</mi>
</msub>
</math>
Where
$G_j \in G$ - points from an Elliptic Curve group.
$a_0, \ldots, a_n$ - Scalars
$MSM(a, G) \in G$ - a single EC (elliptic curve) point
In words, MSM is the sum of scalar and EC point multiplications. We can see from this definition that the core operations occurring are Modular Multiplication and Elliptic curve point addition. It's obvious that multiplication can be computed in parallel and then the products summed, making MSM inherently parallelizable.
Accelerating MSM is crucial to a ZK protocol's performance due to the [large percent of run time](https://hackmd.io/@0xMonia/SkQ6-oRz3#Hardware-acceleration-in-action) they take when generating proofs.
You can learn more about how MSMs work from this [video](https://www.youtube.com/watch?v=Bl5mQA7UL2I) and from our resource list on [Ingopedia](https://www.ingonyama.com/ingopedia/msm).
## Supported Bindings
- [Golang](../golang-bindings/msm.md)
- [Rust](../rust-bindings//msm.md)
## Algorithm description
We follow the bucket method algorithm. The GPU implementation consists of four phases:
1. Preparation phase - The scalars are split into smaller scalars of `c` bits each. These are the bucket indices. The points are grouped according to their corresponding bucket index and the buckets are sorted by size.
2. Accumulation phase - Each bucket accumulates all of its points using a single thread. More than one thread is assigned to large buckets, in proportion to their size. A bucket is considered large if its size is above the large bucket threshold that is determined by the `large_bucket_factor` parameter. The large bucket threshold is the expected average bucket size times the `large_bucket_factor` parameter.
3. Buckets Reduction phase - bucket results are multiplied by their corresponding bucket number and each bucket module is reduced to a small number of final results. By default, this is done by an iterative algorithm which is highly parallel. Setting `is_big_triangle` to `true` will switch this phase to the running sum algorithm described in the above YouTube talk which is much less parallel.
4. Final accumulation phase - The final results from the last phase are accumulated using the double-and-add algorithm.
## Batched MSM
The MSM supports batch mode - running multiple MSMs in parallel. It's always better to use the batch mode instead of running single msms in serial as long as there is enough memory available. We support running a batch of MSMs that share the same points as well as a batch of MSMs that use different points.
## MSM configuration
```cpp
/**
* @struct MSMConfig
* Struct that encodes MSM parameters to be passed into the [MSM](@ref MSM) function. The intended use of this struct
* is to create it using [default_msm_config](@ref default_msm_config) function and then you'll hopefully only need to
* change a small number of default values for each of your MSMs.
*/
struct MSMConfig {
device_context::DeviceContext ctx; /**< Details related to the device such as its id and stream id. */
int points_size; /**< Number of points in the MSM. If a batch of MSMs needs to be computed, this should be
* a number of different points. So, if each MSM re-uses the same set of points, this
* variable is set equal to the MSM size. And if every MSM uses a distinct set of
* points, it should be set to the product of MSM size and [batch_size](@ref
* batch_size). Default value: 0 (meaning it's equal to the MSM size). */
int precompute_factor; /**< The number of extra points to pre-compute for each point. See the
* [precompute_msm_points](@ref precompute_msm_points) function, `precompute_factor` passed
* there needs to be equal to the one used here. Larger values decrease the
* number of computations to make, on-line memory footprint, but increase the static
* memory footprint. Default value: 1 (i.e. don't pre-compute). */
int c; /**< \f$ c \f$ value, or "window bitsize" which is the main parameter of the "bucket
* method" that we use to solve the MSM problem. As a rule of thumb, larger value
* means more on-line memory footprint but also more parallelism and less computational
* complexity (up to a certain point). Currently pre-computation is independent of
* \f$ c \f$, however in the future value of \f$ c \f$ here and the one passed into the
* [precompute_msm_points](@ref precompute_msm_points) function will need to be identical.
* Default value: 0 (the optimal value of \f$ c \f$ is chosen automatically). */
int bitsize; /**< Number of bits of the largest scalar. Typically equals the bitsize of scalar field,
* but if a different (better) upper bound is known, it should be reflected in this
* variable. Default value: 0 (set to the bitsize of scalar field). */
int large_bucket_factor; /**< Variable that controls how sensitive the algorithm is to the buckets that occur
* very frequently. Useful for efficient treatment of non-uniform distributions of
* scalars and "top windows" with few bits. Can be set to 0 to disable separate
* treatment of large buckets altogether. Default value: 10. */
int batch_size; /**< The number of MSMs to compute. Default value: 1. */
bool are_scalars_on_device; /**< True if scalars are on device and false if they're on host. Default value:
* false. */
bool are_scalars_montgomery_form; /**< True if scalars are in Montgomery form and false otherwise. Default value:
* true. */
bool are_points_on_device; /**< True if points are on device and false if they're on host. Default value: false. */
bool are_points_montgomery_form; /**< True if coordinates of points are in Montgomery form and false otherwise.
* Default value: true. */
bool are_results_on_device; /**< True if the results should be on device and false if they should be on host. If set
* to false, `is_async` won't take effect because a synchronization is needed to
* transfer results to the host. Default value: false. */
bool is_big_triangle; /**< Whether to do "bucket accumulation" serially. Decreases computational complexity
* but also greatly decreases parallelism, so only suitable for large batches of MSMs.
* Default value: false. */
bool is_async; /**< Whether to run the MSM asynchronously. If set to true, the MSM function will be
* non-blocking and you'd need to synchronize it explicitly by running
* `cudaStreamSynchronize` or `cudaDeviceSynchronize`. If set to false, the MSM
* function will block the current CPU thread. */
};
```
## Choosing optimal parameters
`is_big_triangle` should be `false` in almost all cases. It might provide better results only for very small MSMs (smaller than $2^8$) with a large batch (larger than 100) but this should be tested per scenario.
Large buckets exist in two cases:
1. When the scalar distribution isn't uniform.
2. When `c` does not divide the scalar bit-size.
`large_bucket_factor` that is equal to 10 yields good results for most cases, but it's best to fine tune this parameter per `c` and per scalar distribution.
The two most important parameters for performance are `c` and the `precompute_factor`. They affect the number of EC additions as well as the memory size. When the points are not known in advance we cannot use precomputation. In this case the best `c` value is usually around $log_2(msmSize) - 4$. However, in most protocols the points are known in advance and precomputation can be used unless limited by memory. Usually it's best to use maximum precomputation (such that we end up with only a single bucket module) combined with a `c` value around $log_2(msmSize) - 1$.
## Memory usage estimation
The main memory requirements of the MSM are the following:
- Scalars - `sizeof(scalar_t) * msm_size * batch_size`
- Scalar indices - `~6 * sizeof(unsigned) * nof_bucket_modules * msm_size * batch_size`
- Points - `sizeof(affine_t) * msm_size * precomp_factor * batch_size`
- Buckets - `sizeof(projective_t) * nof_bucket_modules * 2^c * batch_size`
where `nof_bucket_modules = ceil(ceil(bitsize / c) / precompute_factor)`
During the MSM computation first the memory for scalars and scalar indices is allocated, then the indices are freed and points and buckets are allocated. This is why a good estimation for the required memory is the following formula:
$max(scalars + scalarIndices, scalars + points + buckets)$
This gives a good approximation within 10% of the actual required memory for most cases.
## Example parameters
Here is a useful table showing optimal parameters for different MSMs. They are optimal for BLS12-377 curve when running on NVIDIA GeForce RTX 3090 Ti. This is the configuration used:
```cpp
msm::MSMConfig config = {
ctx, // DeviceContext
N, // points_size
precomp_factor, // precompute_factor
user_c, // c
0, // bitsize
10, // large_bucket_factor
batch_size, // batch_size
false, // are_scalars_on_device
false, // are_scalars_montgomery_form
true, // are_points_on_device
false, // are_points_montgomery_form
true, // are_results_on_device
false, // is_big_triangle
true // is_async
};
```
Here are the parameters and the results for the different cases:
| MSM size | Batch size | Precompute factor | c | Memory estimation (GB) | Actual memory (GB) | Single MSM time (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| 10 | 1 | 1 | 9 | 0.00227 | 0.00277 | 9.2 |
| 10 | 1 | 23 | 11 | 0.00259 | 0.00272 | 1.76 |
| 10 | 1000 | 1 | 7 | 0.94 | 1.09 | 0.051 |
| 10 | 1000 | 23 | 11 | 2.59 | 2.74 | 0.025 |
| 15 | 1 | 1 | 11 | 0.011 | 0.019 | 9.9 |
| 15 | 1 | 16 | 16 | 0.061 | 0.065 | 2.4 |
| 15 | 100 | 1 | 11 | 1.91 | 1.92 | 0.84 |
| 15 | 100 | 19 | 14 | 6.32 | 6.61 | 0.56 |
| 18 | 1 | 1 | 14 | 0.128 | 0.128 | 14.4 |
| 18 | 1 | 15 | 17 | 0.40 | 0.42 | 5.9 |
| 22 | 1 | 1 | 17 | 1.64 | 1.65 | 68 |
| 22 | 1 | 13 | 21 | 5.67 | 5.94 | 54 |
| 24 | 1 | 1 | 18 | 6.58 | 6.61 | 232 |
| 24 | 1 | 7 | 21 | 12.4 | 13.4 | 199 |
The optimal values can vary per GPU and per curve. It is best to try a few combinations until you get the best results for your specific case.

View File

@@ -1,159 +0,0 @@
# NTT - Number Theoretic Transform
The Number Theoretic Transform (NTT) is a variant of the Fourier Transform used over finite fields, particularly those of integers modulo a prime number. NTT operates in a discrete domain and is used primarily in applications requiring modular arithmetic, such as cryptography and polynomial multiplication.
NTT is defined similarly to the Discrete Fourier Transform (DFT), but instead of using complex roots of unity, it uses roots of unity within a finite field. The definition hinges on the properties of the finite field, specifically the existence of a primitive root of unity of order $N$ (where $N$ is typically a power of 2), and the modulo operation is performed with respect to a specific prime number that supports these roots.
Formally, given a sequence of integers $a_0, a_1, ..., a_{N-1}$, the NTT of this sequence is another sequence of integers $A_0, A_1, ..., A_{N-1}$, computed as follows:
$$
A_k = \sum_{n=0}^{N-1} a_n \cdot \omega^{nk} \mod p
$$
where:
- $N$ is the size of the input sequence and is a power of 2,
- $p$ is a prime number such that $p = kN + 1$ for some integer $k$, ensuring that $p$ supports the existence of $N$th roots of unity,
- $\omega$ is a primitive $N$th root of unity modulo $p$, meaning $\omega^N \equiv 1 \mod p$ and no smaller positive power of $\omega$ is congruent to 1 modulo $p$,
- $k$ ranges from 0 to $N-1$, and it indexes the output sequence.
NTT is particularly useful because it enables efficient polynomial multiplication under modulo arithmetic, crucial for algorithms in cryptographic protocols and other areas requiring fast modular arithmetic operations.
There exists also INTT which is the inverse operation of NTT. INTT can take as input an output sequence of integers from an NTT and reconstruct the original sequence.
## Using NTT
### Supported Bindings
- [Golang](../golang-bindings/ntt.md)
- [Rust](../rust-bindings/ntt.md)
### Examples
- [Rust API examples](https://github.com/ingonyama-zk/icicle/blob/d84ffd2679a4cb8f8d1ac2ad2897bc0b95f4eeeb/examples/rust/ntt/src/main.rs#L1)
- [C++ API examples](https://github.com/ingonyama-zk/icicle/blob/d84ffd2679a4cb8f8d1ac2ad2897bc0b95f4eeeb/examples/c%2B%2B/ntt/example.cu#L1)
### Ordering
The `Ordering` enum defines how inputs and outputs are arranged for the NTT operation, offering flexibility in handling data according to different algorithmic needs or compatibility requirements. It primarily affects the sequencing of data points for the transform, which can influence both performance and the compatibility with certain algorithmic approaches. The available ordering options are:
- **`kNN` (Natural-Natural):** Both inputs and outputs are in their natural order. This is the simplest form of ordering, where data is processed in the sequence it is given, without any rearrangement.
- **`kNR` (Natural-Reversed):** Inputs are in natural order, while outputs are in bit-reversed order. This ordering is typically used in algorithms that benefit from having the output in a bit-reversed pattern.
- **`kRN` (Reversed-Natural):** Inputs are in bit-reversed order, and outputs are in natural order. This is often used with the Cooley-Tukey FFT algorithm.
- **`kRR` (Reversed-Reversed):** Both inputs and outputs are in bit-reversed order.
- **`kNM` (Natural-Mixed):** Inputs are provided in their natural order, while outputs are arranged in a digit-reversed (mixed) order. This ordering is good for mixed radix NTT operations, where the mixed or digit-reversed ordering of outputs is a generalization of the bit-reversal pattern seen in simpler, radix-2 cases.
- **`kMN` (Mixed-Natural):** Inputs are in a digit-reversed (mixed) order, while outputs are restored to their natural order. This ordering would primarily be used for mixed radix NTT
Choosing an algorithm is heavily dependent on your use case. For example Cooley-Tukey will often use `kRN` and Gentleman-Sande often uses `kNR`.
### Modes
NTT also supports two different modes `Batch NTT` and `Single NTT`
Deciding whether to use `batch NTT` vs `single NTT` is highly dependent on your application and use case.
#### Single NTT
Single NTT will launch a single NTT computation.
Choose this mode when your application requires processing individual NTT operations in isolation.
#### Batch NTT Mode
Batch NTT allows you to run many NTTs with a single API call. Batch NTT mode can significantly reduce read/write times as well as computation overhead by executing multiple NTT operations in parallel. Batch mode may also offer better utilization of computational resources (memory and compute).
## Supported algorithms
Our NTT implementation supports two algorithms `radix-2` and `mixed-radix`.
### Radix 2
At its core, the Radix-2 NTT algorithm divides the problem into smaller sub-problems, leveraging the properties of "divide and conquer" to reduce the overall computational complexity. The algorithm operates on sequences whose lengths are powers of two.
1. **Input Preparation:**
The input is a sequence of integers $a_0, a_1, \ldots, a_{N-1}, \text{ where } N$ is a power of two.
2. **Recursive Decomposition:**
The algorithm recursively divides the input sequence into smaller sequences. At each step, it separates the sequence into even-indexed and odd-indexed elements, forming two subsequences that are then processed independently.
3. **Butterfly Operations:**
The core computational element of the Radix-2 NTT is the "butterfly" operation, which combines pairs of elements from the sequences obtained in the decomposition step.
Each butterfly operation involves multiplication by a "twiddle factor," which is a root of unity in the finite field, and addition or subtraction of the results, all performed modulo the prime modulus.
$$
X_k = (A_k + B_k \cdot W^k) \mod p
$$
$X_k$ - The output of the butterfly operation for the $k$-th element
$A_k$ - an element from the even-indexed subset
$B_k$ - an element from the odd-indexed subset
$p$ - prime modulus
$k$ - The index of the current operation within the butterfly or the transform stage
The twiddle factors are precomputed to save runtime and improve performance.
4. **Bit-Reversal Permutation:**
A final step involves rearranging the output sequence into the correct order. Due to the halving process in the decomposition steps, the elements of the transformed sequence are initially in a bit-reversed order. A bit-reversal permutation is applied to obtain the final sequence in natural order.
### Mixed Radix
The Mixed Radix NTT algorithm extends the concepts of the Radix-2 algorithm by allowing the decomposition of the input sequence based on various factors of its length. Specifically, ICICLE's implementation splits the input into blocks of sizes 16, 32, or 64, compared to Radix-2, which always splits such that we end with NTTs of size 2. This approach offers enhanced flexibility and efficiency, especially for input sizes that are composite numbers, by leveraging the "divide and conquer" strategy across multiple radices.
The NTT blocks in Mixed Radix are implemented more efficiently, based on the Winograd NTT, and also have better-optimized memory and register usage compared to Radix-2.
Mixed Radix can reduce the number of stages required to compute for large inputs.
1. **Input Preparation:**
The input to the Mixed Radix NTT is a sequence of integers $a_0, a_1, \ldots, a_{N-1}$, where $N$ is not strictly required to be a power of two. Instead, $N$ can be any composite number, ideally factorized into primes or powers of primes.
2. **Factorization and Decomposition:**
Unlike the Radix-2 algorithm, which strictly divides the computational problem into halves, the Mixed Radix NTT algorithm implements a flexible decomposition approach which isn't limited to prime factorization.
For example, an NTT of size 256 can be decomposed into two stages of $16 \times \text{NTT}_{16}$, leveraging a composite factorization strategy rather than decomposing into eight stages of $\text{NTT}_{2}$. This exemplifies the use of composite factors (in this case, $256 = 16 \times 16$) to apply smaller NTT transforms, optimizing computational efficiency by adapting the decomposition strategy to the specific structure of $N$.
3. **Butterfly Operations with Multiple Radices:**
The Mixed Radix algorithm utilizes butterfly operations for various radix sizes. Each sub-transform involves specific butterfly operations characterized by multiplication with twiddle factors appropriate for the radix in question.
The generalized butterfly operation for a radix-$r$ element can be expressed as:
$$
X_{k,r} = \sum_{j=0}^{r-1} (A_{j,k} \cdot W^{jk}) \mod p
$$
where:
$X_{k,r}$ - is the output of the $radix-r$ butterfly operation for the $k-th$ set of inputs
$A_{j,k}$ - represents the $j-th$ input element for the $k-th$ operation
$W$ - is the twiddle factor
$p$ - is the prime modulus
4. **Recombination and Reordering:**
After applying the appropriate butterfly operations across all decomposition levels, the Mixed Radix algorithm recombines the results into a single output sequence. Due to the varied sizes of the sub-transforms, a more complex reordering process may be required compared to Radix-2. This involves digit-reversal permutations to ensure that the final output sequence is correctly ordered.
### Which algorithm should I choose ?
Both work only on inputs of power of 2 (e.g., 256, 512, 1024).
Radix 2 is faster for small NTTs. A small NTT would be around logN = 16 and batch size 1. Radix 2 won't necessarily perform better for smaller `logn` with larger batches.
Mixed radix on the other hand works better for larger NTTs with larger input sizes.
Performance really depends on logn size, batch size, ordering, inverse, coset, coeff-field and which GPU you are using.
For this reason we implemented our [heuristic auto-selection](https://github.com/ingonyama-zk/icicle/blob/main/icicle/src/ntt/ntt.cu#L573) which should choose the most efficient algorithm in most cases.
We still recommend you benchmark for your specific use case if you think a different configuration would yield better results.

View File

@@ -1,12 +0,0 @@
# ICICLE Primitives
This section of the documentation is dedicated to the ICICLE primitives, we will cover the usage and internal details of our primitives such as hashing algorithms, MSM and NTT.
## Supported primitives
- [MSM](./msm.md)
- [NTT](./ntt.md)
- [Keccak Hash](./keccak.md)
- [Poseidon Hash](./poseidon.md)

View File

@@ -1,216 +0,0 @@
# Poseidon
[Poseidon](https://eprint.iacr.org/2019/458.pdf) is a popular hash in the ZK ecosystem primarily because it's optimized to work over large prime fields, a common setting for ZK proofs, thereby minimizing the number of multiplicative operations required.
Poseidon has also been specifically designed to be efficient when implemented within ZK circuits, Poseidon uses far less constraints compared to other hash functions like Keccak or SHA-256 in the context of ZK circuits.
Poseidon has been used in many popular ZK protocols such as Filecoin and [Plonk](https://drive.google.com/file/d/1bZZvKMQHaZGA4L9eZhupQLyGINkkFG_b/view?usp=drive_open).
Our implementation of Poseidon is implemented in accordance with the optimized [Filecoin version](https://spec.filecoin.io/algorithms/crypto/poseidon/).
Let's understand how Poseidon works.
## Initialization
Poseidon starts with the initialization of its internal state, which is composed of the input elements and some pre-generated constants. An initial round constant is added to each element of the internal state. Adding the round constants ensures the state is properly mixed from the beginning.
This is done to prevent collisions and to prevent certain cryptographic attacks by ensuring that the internal state is sufficiently mixed and unpredictable.
![Poseidon initialization of internal state added with pre-generated round constants](https://github.com/ingonyama-zk/icicle/assets/122266060/52257f5d-6097-47c4-8f17-7b6449b9d162)
## Applying full and partial rounds
To generate a secure hash output, the algorithm goes through a series of "full rounds" and "partial rounds" as well as transformations between these sets of rounds in the following order:
```First full rounds -> apply S-box and Round constants -> partial rounds -> Last full rounds -> Apply S-box```
### Full rounds
![Full round iterations consisting of S box operations, adding round constants, and a Full MDS matrix multiplication](https://github.com/ingonyama-zk/icicle/assets/122266060/e4ce0e98-b90b-4261-b83e-3cd8cce069cb)
**Uniform Application of S-box:** In full rounds, the S-box (a non-linear transformation) is applied uniformly to every element of the hash function's internal state. This ensures a high degree of mixing and diffusion, contributing to the hash function's security. The function's S-box involves raising each element of the state to a certain power denoted by `α`, a member of the finite field defined by the prime `p`; `α` can be different depending on the implementation and user configuration.
**Linear Transformation:** After applying the S-box, a linear transformation is performed on the state. This involves multiplying the state by an MDS (Maximum Distance Separable) matrix, which further diffuses the transformations applied by the S-box across the entire state.
**Addition of Round Constants:** Each element of the state is then modified by adding a unique round constant. These constants are different for each round and are precomputed as part of the hash function's initialization. The addition of round constants ensures that even minor changes to the input produce significant differences in the output.
### Partial Rounds
![Partial round iterations consisting of selective S box operation, adding a round constant and performing an MDS multiplication with a sparse matrix](https://github.com/ingonyama-zk/icicle/assets/122266060/e8c198b4-7aa4-4b4d-9ec4-604e39e07692)
**Selective Application of S-Box:** Partial rounds apply the S-box transformation to only one element of the internal state per round, rather than to all elements. This selective application significantly reduces the computational complexity of the hash function without compromising its security. The choice of which element to apply the S-box to can follow a specific pattern or be fixed, depending on the design of the hash function.
**Linear Transformation and Round Constants:** A linear transformation is performed and round constants are added. The linear transformation in partial rounds can be designed to be less computationally intensive (this is done by using a sparse matrix) than in full rounds, further optimizing the function's efficiency.
The user of Poseidon can often choose how many partial or full rounds they wish to apply; more full rounds will increase security but degrade performance. The choice and balance are highly dependent on the use case.
## Using Poseidon
ICICLE Poseidon is implemented for GPU and parallelization is performed for each element of the state rather than for each state.
What that means is we calculate multiple hash-sums over multiple pre-images in parallel, rather than going block by block over the input vector.
So for Poseidon of arity 2 and input of size 1024 * 2, we would expect 1024 elements of output, which means each block would be of size 2, resulting in 1024 Poseidon hashes being performed.
### Supported Bindings
[`Go`](https://github.com/ingonyama-zk/icicle/blob/main/wrappers/golang/curves/bn254/poseidon/poseidon.go)
[`Rust`](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/rust/icicle-core/src/poseidon)
### Constants
Poseidon is extremely customizable and using different constants will produce different hashes, security levels and performance results.
We support pre-calculated and optimized constants for each of the [supported curves](../core#supported-curves-and-operations). The constants can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/icicle/include/poseidon/constants) and are labeled clearly per curve `<curve_name>_poseidon.h`.
If you wish to generate your own constants you can use our python script which can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/icicle/include/poseidon/constants/generate_parameters.py).
Prerequisites:
- Install python 3
- `pip install poseidon-hash`
- `pip install galois==0.3.7`
- `pip install numpy`
You will then need to modify the following values before running the script.
```python
# Modify these
arity = 11 # we support arity 2, 4, 8 and 11.
p = 0x73EDA753299D7D483339D80809A1D80553BDA402FFFE5BFEFFFFFFFF00000001 # bls12-381
# p = 0x12ab655e9a2ca55660b44d1e5c37b00159aa76fed00000010a11800000000001 # bls12-377
# p = 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593f0000001 # bn254
# p = 0x1ae3a4617c510eac63b05c06ca1493b1a22d9f300f5138f1ef3622fba094800170b5d44300000008508c00000000001 # bw6-761
prime_bit_len = 255
field_bytes = 32
...
# primitive_element = None
primitive_element = 7 # bls12-381
# primitive_element = 22 # bls12-377
# primitive_element = 5 # bn254
# primitive_element = 15 # bw6-761
```
### Rust API
This is the most basic way to use the Poseidon API.
```rust
let test_size = 1 << 10;
let arity = 2u32;
let ctx = get_default_device_context();
let poseidon = Poseidon::load(arity, &ctx).unwrap();
let config = HashConfig::default();
let inputs = vec![F::one(); test_size * arity as usize];
let outputs = vec![F::zero(); test_size];
let mut input_slice = HostOrDeviceSlice::on_host(inputs);
let mut output_slice = HostOrDeviceSlice::on_host(outputs);
poseidon.hash_many::<F>(
&mut input_slice,
&mut output_slice,
test_size as u32,
arity as u32,
1, // Output length
&config,
)
.unwrap();
```
The `HashConfig` can be modified, by default the inputs and outputs are set to be on `Host` for example.
```rust
impl<'a> Default for HashConfig<'a> {
fn default() -> Self {
let ctx = get_default_device_context();
Self {
ctx,
are_inputs_on_device: false,
are_outputs_on_device: false,
is_async: false,
}
}
}
```
In the example above `Poseidon::load(arity, &ctx).unwrap();` is used which will load the correct constants based on arity and curve. It's possible to [generate](#constants) your own constants and load them.
```rust
let ctx = get_default_device_context();
let custom_poseidon = Poseidon::new(
arity, // The arity of poseidon hash. The width will be equal to arity + 1
alpha, // The S-box power
full_rounds_half,
partial_rounds,
round_constants,
mds_matrix,
non_sparse_matrix,
sparse_matrices,
domain_tag,
ctx,
)
.unwrap();
```
## The Tree Builder
The tree builder allows you to build Merkle trees using Poseidon.
You can define both the tree's `height` and its `arity`. The tree `height` determines the number of layers in the tree, including the root and the leaf layer. The `arity` determines how many children each internal node can have.
```rust
use icicle_bn254::tree::Bn254TreeBuilder;
use icicle_bn254::poseidon::Poseidon;
let mut config = TreeBuilderConfig::default();
let arity = 2;
config.arity = arity as u32;
let input_block_len = arity;
let leaves = vec![F::one(); (1 << height) * arity];
let mut digests = vec![F::zero(); merkle_tree_digests_len((height + 1) as u32, arity as u32, 1)];
let leaves_slice = HostSlice::from_slice(&leaves);
let digests_slice = HostSlice::from_mut_slice(&mut digests);
let ctx = device_context::DeviceContext::default();
let hash = Poseidon::load(2, &ctx).unwrap();
let mut config = TreeBuilderConfig::default();
config.keep_rows = 5;
Bn254TreeBuilder::build_merkle_tree(
leaves_slice,
digests_slice,
height,
input_block_len,
&hash,
&hash,
&config,
)
.unwrap();
```
Similar to Poseidon, you can also configure the Tree Builder `TreeBuilderConfig::default()`
- `keep_rows`: The number of rows which will be written to output, 0 will write all rows.
- `are_inputs_on_device`: Have the inputs been loaded into device memory?
- `is_async`: Should the TreeBuilder run asynchronously? `False` will block the current CPU thread. `True` will require you call `cudaStreamSynchronize` or `cudaDeviceSynchronize` to retrieve the result.
### Benchmarks
We ran the Poseidon tree builder on:
**CPU**: 12th Gen Intel(R) Core(TM) i9-12900K/
**GPU**: RTX 3090 Ti
**Tree height**: 30 (2^29 elements)
The benchmarks include copying data from and to the device.
| Rows to keep parameter | Run time, Icicle | Supranational PC2
| ----------- | ----------- | -----------
| 10 | 9.4 seconds | 13.6 seconds
| 20 | 9.5 seconds | 13.6 seconds
| 29 | 13.7 seconds | 13.6 seconds

View File

@@ -1,88 +0,0 @@
# Poseidon2
[Poseidon2](https://eprint.iacr.org/2023/323) is a recently released optimized version of Poseidon1. The two versions differ in two crucial points. First, Poseidon is a sponge hash function, while Poseidon2 can be either a sponge or a compression function depending on the use case. Secondly, Poseidon2 is instantiated by new and more efficient linear layers with respect to Poseidon. These changes decrease the number of multiplications in the linear layer by up to 90% and the number of constraints in Plonk circuits by up to 70%. This makes Poseidon2 currently the fastest arithmetization-oriented hash function without lookups.
## Using Poseidon2
ICICLE Poseidon2 is implemented for GPU and parallelization is performed for each state.
We calculate multiple hash-sums over multiple pre-images in parallel, rather than going block by block over the input vector.
For example, for Poseidon2 of width 16, input rate 8, output elements 8 and input of size 1024 * 8, we would expect 1024 * 8 elements of output, which means each input block would be of size 8, resulting in 1024 Poseidon2 hashes being performed.
### Supported Bindings
[`Rust`](https://github.com/ingonyama-zk/icicle/tree/main/wrappers/rust/icicle-core/src/poseidon2)
### Constants
Poseidon2 is also extremely customizable and using different constants will produce different hashes, security levels and performance results.
We support pre-calculated constants for each of the [supported curves](../core#supported-curves-and-operations). The constants can be found [here](https://github.com/ingonyama-zk/icicle/tree/main/icicle/include/poseidon2/constants) and are labeled clearly per curve `<curve_name>_poseidon2.h`.
You can also use your own set of constants as shown [here](https://github.com/ingonyama-zk/icicle/blob/main/wrappers/rust/icicle-fields/icicle-babybear/src/poseidon2/mod.rs#L290)
### Rust API
This is the most basic way to use the Poseidon2 API.
```rust
let test_size = 1 << 10;
let width = 16;
let rate = 8;
let ctx = get_default_device_context();
let poseidon = Poseidon2::load(width, rate, MdsType::Default, DiffusionStrategy::Default, &ctx).unwrap();
let config = HashConfig::default();
let inputs = vec![F::one(); test_size * rate as usize];
let outputs = vec![F::zero(); test_size];
let mut input_slice = HostOrDeviceSlice::on_host(inputs);
let mut output_slice = HostOrDeviceSlice::on_host(outputs);
poseidon.hash_many::<F>(
&mut input_slice,
&mut output_slice,
test_size as u32,
rate as u32,
8, // Output length
&config,
)
.unwrap();
```
In the example above `Poseidon2::load(width, rate, MdsType::Default, DiffusionStrategy::Default, &ctx).unwrap();` is used to load the correct constants based on width and curve. Here, the default MDS matrices and diffusion are used. If you want to get a Plonky3 compliant version, set them to `MdsType::Plonky` and `DiffusionStrategy::Montgomery` respectively.
## The Tree Builder
Similar to Poseidon1, you can use Poseidon2 in a tree builder.
```rust
use icicle_bn254::tree::Bn254TreeBuilder;
use icicle_bn254::poseidon2::Poseidon2;
let mut config = TreeBuilderConfig::default();
let arity = 2;
config.arity = arity as u32;
let input_block_len = arity;
let leaves = vec![F::one(); (1 << height) * arity];
let mut digests = vec![F::zero(); merkle_tree_digests_len((height + 1) as u32, arity as u32, 1)];
let leaves_slice = HostSlice::from_slice(&leaves);
let digests_slice = HostSlice::from_mut_slice(&mut digests);
let ctx = device_context::DeviceContext::default();
let hash = Poseidon2::load(arity, arity, MdsType::Default, DiffusionStrategy::Default, &ctx).unwrap();
let mut config = TreeBuilderConfig::default();
config.keep_rows = 5;
Bn254TreeBuilder::build_merkle_tree(
leaves_slice,
digests_slice,
height,
input_block_len,
&hash,
&hash,
&config,
)
.unwrap();
```

View File

@@ -1,87 +0,0 @@
# Rust bindings
Rust bindings allow you to use ICICLE as a rust library.
`icicle-core` defines all interfaces, macros and common methods.
`icicle-cuda-runtime` defines DeviceContext which can be used to manage a specific GPU as well as wrapping common CUDA methods.
`icicle-curves` implements all interfaces and macros from icicle-core for each curve. For example icicle-bn254 implements curve bn254. Each curve has its own build script which will build the CUDA libraries for that curve as part of the rust-toolchain build.
## Using ICICLE Rust bindings in your project
Simply add the following to your `Cargo.toml`.
```toml
# GPU Icicle integration
icicle-cuda-runtime = { git = "https://github.com/ingonyama-zk/icicle.git" }
icicle-core = { git = "https://github.com/ingonyama-zk/icicle.git" }
icicle-bn254 = { git = "https://github.com/ingonyama-zk/icicle.git" }
```
`icicle-bn254` being the curve you wish to use and `icicle-core` and `icicle-cuda-runtime` contain ICICLE utilities and CUDA wrappers.
If you wish to point to a specific ICICLE branch add `branch = "<name_of_branch>"` or `tag = "<name_of_tag>"` to the ICICLE dependency. For a specific commit add `rev = "<commit_id>"`.
When you build your project ICICLE will be built as part of the build command.
## How do the rust bindings work?
The rust bindings are just rust wrappers for ICICLE Core static libraries which can be compiled. We integrate the compilation of the static libraries into Rust's toolchain to make usage seamless and easy. This is achieved by [extending Rust's build command](https://github.com/ingonyama-zk/icicle/blob/main/wrappers/rust/icicle-curves/icicle-bn254/build.rs).
```rust
use cmake::Config;
use std::env::var;
fn main() {
println!("cargo:rerun-if-env-changed=CXXFLAGS");
println!("cargo:rerun-if-changed=../../../../icicle");
let cargo_dir = var("CARGO_MANIFEST_DIR").unwrap();
let profile = var("PROFILE").unwrap();
let out_dir = Config::new("../../../../icicle")
.define("BUILD_TESTS", "OFF") //TODO: feature
.define("CURVE", "bn254")
.define("CMAKE_BUILD_TYPE", "Release")
.build_target("icicle")
.build();
println!("cargo:rustc-link-search={}/build", out_dir.display());
println!("cargo:rustc-link-lib=ingo_bn254");
println!("cargo:rustc-link-lib=stdc++");
// println!("cargo:rustc-link-search=native=/usr/local/cuda/lib64");
println!("cargo:rustc-link-lib=cudart");
}
```
## Supported curves, fields and operations
### Supported curves and operations
| Operation\Curve | bn254 | bls12_377 | bls12_381 | bw6-761 | grumpkin |
| --- | :---: | :---: | :---: | :---: | :---: |
| MSM | ✅ | ✅ | ✅ | ✅ | ✅ |
| G2 | ✅ | ✅ | ✅ | ✅ | ❌ |
| NTT | ✅ | ✅ | ✅ | ✅ | ❌ |
| ECNTT | ✅ | ✅ | ✅ | ✅ | ❌ |
| VecOps | ✅ | ✅ | ✅ | ✅ | ✅ |
| Polynomials | ✅ | ✅ | ✅ | ✅ | ❌ |
| Poseidon | ✅ | ✅ | ✅ | ✅ | ✅ |
| Merkle Tree | ✅ | ✅ | ✅ | ✅ | ✅ |
### Supported fields and operations
| Operation\Field | babybear | stark252 |
| --- | :---: | :---: |
| VecOps | ✅ | ✅ |
| Polynomials | ✅ | ✅ |
| NTT | ✅ | ✅ |
| Extension Field | ✅ | ❌ |
### Supported hashes
| Hash | Sizes |
| --- | :---: |
| Keccak | 256, 512 |

View File

@@ -1,31 +0,0 @@
# ECNTT
## ECNTT Method
The `ecntt` function computes the Elliptic Curve Number Theoretic Transform (EC-NTT) or its inverse on a batch of points of a curve.
```rust
pub fn ecntt<C: Curve>(
input: &(impl HostOrDeviceSlice<Projective<C>> + ?Sized),
dir: NTTDir,
cfg: &NTTConfig<C::ScalarField>,
output: &mut (impl HostOrDeviceSlice<Projective<C>> + ?Sized),
) -> IcicleResult<()>
where
C::ScalarField: FieldImpl,
<C::ScalarField as FieldImpl>::Config: ECNTT<C>,
{
// ... function implementation ...
}
```
## Parameters
- **`input`**: The input data as a slice of `Projective<C>`. This represents points on a specific elliptic curve `C`.
- **`dir`**: The direction of the NTT. It can be `NTTDir::kForward` for forward NTT or `NTTDir::kInverse` for inverse NTT.
- **`cfg`**: The NTT configuration object of type `NTTConfig<C::ScalarField>`. This object specifies parameters for the NTT computation, such as the batch size and algorithm to use.
- **`output`**: The output buffer to write the results into. This should be a slice of `Projective<C>` with the same size as the input.
## Return Value
- **`IcicleResult<()>`**: This function returns an `IcicleResult` which is a wrapper type that indicates success or failure of the NTT computation. On success, it contains `Ok(())`.

View File

@@ -1,96 +0,0 @@
# Keccak
## Keccak Example
```rust
use icicle_cuda_runtime::memory::{DeviceVec, HostSlice};
use icicle_hash::keccak::{keccak256, HashConfig};
use rand::{self, Rng};
fn main() {
let mut rng = rand::thread_rng();
let initial_data: Vec<u8> = (0..120).map(|_| rng.gen::<u8>()).collect();
println!("initial data: {}", hex::encode(&initial_data));
let input = HostSlice::<u8>::from_slice(initial_data.as_slice());
let mut output = DeviceVec::<u8>::cuda_malloc(32).unwrap();
let mut config = HashConfig::default();
keccak256(input, initial_data.len() as i32, 1, &mut output[..], &mut config).expect("Failed to execute keccak256 hashing");
let mut output_host = vec![0_u8; 32];
output.copy_to_host(HostSlice::from_mut_slice(&mut output_host[..])).unwrap();
println!("keccak256 result: {}", hex::encode(&output_host));
}
```
## Keccak Methods
```rust
pub fn keccak256(
input: &(impl HostOrDeviceSlice<u8> + ?Sized),
input_block_size: i32,
number_of_blocks: i32,
output: &mut (impl HostOrDeviceSlice<u8> + ?Sized),
config: &mut HashConfig,
) -> IcicleResult<()>
pub fn keccak512(
input: &(impl HostOrDeviceSlice<u8> + ?Sized),
input_block_size: i32,
number_of_blocks: i32,
output: &mut (impl HostOrDeviceSlice<u8> + ?Sized),
config: &mut HashConfig,
) -> IcicleResult<()>
```
### Parameters
- **`input`**: A slice containing the input data for the Keccak256 hash function. It can reside in either host memory or device memory.
- **`input_block_size`**: An integer specifying the size of the input data for a single hash.
- **`number_of_blocks`**: An integer specifying the number of results in the hash batch.
- **`output`**: A slice where the resulting hash will be stored. This slice can be in host or device memory.
- **`config`**: A pointer to a `HashConfig` object, which contains various configuration options for the Keccak256 operation.
### Return Value
- **`IcicleResult`**: Returns a CUDA error code indicating the success or failure of the Keccak256/Keccak512 operation.
## HashConfig
The `HashConfig` structure holds configuration parameters for the Keccak256/Keccak512 operation, allowing customization of its behavior to optimize performance based on the specifics of the operation or the underlying hardware.
```rust
pub struct HashConfig<'a> {
pub ctx: DeviceContext<'a>,
pub are_inputs_on_device: bool,
pub are_outputs_on_device: bool,
pub is_async: bool,
}
```
### Fields
- **`ctx`**: Device context containing details like device id and stream.
- **`are_inputs_on_device`**: Indicates if input data is located on the device.
- **`are_outputs_on_device`**: Indicates if output hash is stored on the device.
- **`is_async`**: If true, runs the Keccak256/Keccak512 operation asynchronously.
### Usage
Example initialization with default settings:
```rust
let default_config = HashConfig::default();
```
Customizing the configuration:
```rust
let custom_config = NTTConfig {
ctx: custom_device_context,
are_inputs_on_device: true,
are_outputs_on_device: true,
is_async: false,
};
```

View File

@@ -1,45 +0,0 @@
# MSM Pre computation
To understand the theory behind the MSM precomputation technique, refer to Niall Emmart's [talk](https://youtu.be/KAWlySN7Hm8?feature=shared&t=1734).
## `precompute_points`
Precomputes bases for the multi-scalar multiplication (MSM) by extending each base point with its multiples, facilitating more efficient MSM calculations.
```rust
pub fn precompute_points<C: Curve + MSM<C>>(
points: &(impl HostOrDeviceSlice<Affine<C>> + ?Sized),
msm_size: i32,
cfg: &MSMConfig,
output_bases: &mut DeviceSlice<Affine<C>>,
) -> IcicleResult<()>
```
### Parameters
- **`points`**: The original set of affine points (\(P_1, P_2, ..., P_n\)) to be used in the MSM. For batch MSM operations, this should include all unique points concatenated together.
- **`msm_size`**: The size of a single msm in order to determine optimal parameters.
- **`cfg`**: The MSM configuration parameters.
- **`output_bases`**: The output buffer for the extended bases. Its size must be `points.len() * precompute_factor`. This buffer should be allocated on the device for GPU computations.
#### Returns
`Ok(())` if the operation is successful, or an `IcicleResult` error otherwise.
#### Description
This function extends each provided base point $P$ with its multiples $2^{l}P, 2^{2l}P, \ldots, 2^{(\text{precompute\_factor} - 1) \cdot l}P$, where $l$ is a level of precomputation determined by the `precompute_factor`. The extended set of points facilitates faster MSM computations by allowing the MSM algorithm to leverage precomputed multiples of base points, reducing the number of point additions required during the computation.
The precomputation process is crucial for optimizing MSM operations, especially when dealing with large sets of points and scalars. By precomputing and storing multiples of the base points, the MSM function can more efficiently compute the scalar-point multiplications.
#### Example Usage
```rust
let cfg = MSMConfig::default();
let precompute_factor = 4; // Number of points to precompute
let mut extended_bases = HostOrDeviceSlice::cuda_malloc(expected_size).expect("Failed to allocate memory for extended bases");
// Precompute the bases using the specified factor
precompute_points(&points, msm_size, &cfg, &mut extended_bases)
.expect("Failed to precompute bases");
```

View File

@@ -1,170 +0,0 @@
# MSM
## Example
```rust
use icicle_bn254::curve::{CurveCfg, G1Projective, ScalarCfg};
use icicle_core::{curve::Curve, msm, traits::GenerateRandom};
use icicle_cuda_runtime::{memory::HostOrDeviceSlice, stream::CudaStream};
fn main() {
let size: usize = 1 << 10; // Define the number of points and scalars
// Generate random points and scalars
println!("Generating random G1 points and scalars for BN254...");
let points = CurveCfg::generate_random_affine_points(size);
let scalars = ScalarCfg::generate_random(size);
// Wrap points and scalars in HostOrDeviceSlice for MSM
let points_host = HostOrDeviceSlice::Host(points);
let scalars_host = HostOrDeviceSlice::Host(scalars);
// Allocate memory on the CUDA device for MSM results
let mut msm_results: HostOrDeviceSlice<'_, G1Projective> = HostOrDeviceSlice::cuda_malloc(1).expect("Failed to allocate CUDA memory for MSM results");
// Create a CUDA stream for asynchronous execution
let stream = CudaStream::create().expect("Failed to create CUDA stream");
let mut cfg = msm::MSMConfig::default();
cfg.ctx.stream = &stream;
cfg.is_async = true; // Enable asynchronous execution
// Execute MSM on the device
println!("Executing MSM on device...");
msm::msm(&scalars_host, &points_host, &cfg, &mut msm_results).expect("Failed to execute MSM");
// Synchronize CUDA stream to ensure MSM execution is complete
stream.synchronize().expect("Failed to synchronize CUDA stream");
// Optionally, move results to host for further processing or printing
println!("MSM execution complete.");
}
```
## MSM API Overview
```rust
pub fn msm<C: Curve>(
scalars: &HostOrDeviceSlice<C::ScalarField>,
points: &HostOrDeviceSlice<Affine<C>>,
cfg: &MSMConfig,
results: &mut HostOrDeviceSlice<Projective<C>>,
) -> IcicleResult<()>
```
### Parameters
- **`scalars`**: A buffer containing the scalar values to be multiplied with corresponding points.
- **`points`**: A buffer containing the points to be multiplied by the scalars.
- **`cfg`**: MSM configuration specifying additional parameters for the operation.
- **`results`**: A buffer where the results of the MSM operations will be stored.
### MSM Config
```rust
pub struct MSMConfig<'a> {
pub ctx: DeviceContext<'a>,
points_size: i32,
pub precompute_factor: i32,
pub c: i32,
pub bitsize: i32,
pub large_bucket_factor: i32,
batch_size: i32,
are_scalars_on_device: bool,
pub are_scalars_montgomery_form: bool,
are_points_on_device: bool,
pub are_points_montgomery_form: bool,
are_results_on_device: bool,
pub is_big_triangle: bool,
pub is_async: bool,
}
```
- **`ctx: DeviceContext`**: Specifies the device context, device id and the CUDA stream for asynchronous execution.
- **`points_size: i32`**: The number of points in the MSM (private field — not settable by the user).
- **`precompute_factor: i32`**: Determines the number of extra points to pre-compute for each point, affecting memory footprint and performance.
- **`c: i32`**: The "window bitsize," a parameter controlling the computational complexity and memory footprint of the MSM operation.
- **`bitsize: i32`**: The number of bits of the largest scalar, typically equal to the bit size of the scalar field.
- **`large_bucket_factor: i32`**: Adjusts the algorithm's sensitivity to frequently occurring buckets, useful for non-uniform scalar distributions.
- **`batch_size: i32`**: The number of MSMs to compute in a single batch, for leveraging parallelism.
- **`are_scalars_montgomery_form`**: Set to `true` if scalars are in montgomery form.
- **`are_points_montgomery_form`**: Set to `true` if points are in montgomery form.
- **`are_scalars_on_device: bool`**, **`are_points_on_device: bool`**, **`are_results_on_device: bool`**: Indicate whether the corresponding buffers are on the device memory.
- **`is_big_triangle`**: If `true` MSM will run in Large triangle accumulation if `false` Bucket accumulation will be chosen. Default value: false.
- **`is_async: bool`**: Whether to perform the MSM operation asynchronously.
### Usage
The `msm` function is designed to compute the sum of multiple scalar-point multiplications efficiently. It supports both single MSM operations and batched operations for increased performance. The configuration allows for detailed control over the execution environment and performance characteristics of the MSM operation.
When performing MSM operations, it's crucial to match the size of the `scalars` and `points` arrays correctly and ensure that the `results` buffer is appropriately sized to hold the output. The `MSMConfig` should be set up to reflect the specifics of the operation, including whether the operation should be asynchronous and any device-specific settings.
## How do I toggle between the supported algorithms?
When creating your MSM Config you may state which algorithm you wish to use. `is_big_triangle=true` will activate Large triangle reduction and `is_big_triangle=false` will activate iterative reduction.
```rust
...
let mut cfg_bls12377 = msm::get_default_msm_config::<BLS12377CurveCfg>();
// is_big_triangle will determine which algorithm to use
cfg_bls12377.is_big_triangle = true;
msm::msm(&scalars, &points, &cfg, &mut msm_results).unwrap();
...
```
You may reference the rust code [here](https://github.com/ingonyama-zk/icicle/blob/77a7613aa21961030e4e12bf1c9a78a2dadb2518/wrappers/rust/icicle-core/src/msm/mod.rs#L54).
## How do I toggle between MSM modes?
Toggling between MSM modes occurs automatically based on the number of results you are expecting from the `msm::msm` function. If you are expecting an array of `msm_results`, ICICLE will automatically split `scalars` and `points` into equal parts and run them as multiple MSMs in parallel.
```rust
...
let mut msm_result: HostOrDeviceSlice<'_, G1Projective> = HostOrDeviceSlice::cuda_malloc(1).unwrap();
msm::msm(&scalars, &points, &cfg, &mut msm_result).unwrap();
...
```
In the example above we allocate a single expected result which the MSM method will interpret as `batch_size=1` and run a single MSM.
In the next example, we are expecting 10 results which sets `batch_size=10` and runs 10 MSMs in batch mode.
```rust
...
let mut msm_results: HostOrDeviceSlice<'_, G1Projective> = HostOrDeviceSlice::cuda_malloc(10).unwrap();
msm::msm(&scalars, &points, &cfg, &mut msm_results).unwrap();
...
```
Here is a [reference](https://github.com/ingonyama-zk/icicle/blob/77a7613aa21961030e4e12bf1c9a78a2dadb2518/wrappers/rust/icicle-core/src/msm/mod.rs#L108) to the code which automatically sets the batch size. For more MSM examples have a look [here](https://github.com/ingonyama-zk/icicle/blob/77a7613aa21961030e4e12bf1c9a78a2dadb2518/examples/rust/msm/src/main.rs#L1).
## Parameters for optimal performance
Please refer to the [primitive description](../primitives/msm#choosing-optimal-parameters)
## Support for G2 group
MSM also supports G2 group.
Using MSM in G2 requires a G2 config, and of course your Points should also be G2 Points.
```rust
...
let scalars = HostOrDeviceSlice::Host(upper_scalars[..size].to_vec());
let g2_points = HostOrDeviceSlice::Host(g2_upper_points[..size].to_vec());
let mut g2_msm_results: HostOrDeviceSlice<'_, G2Projective> = HostOrDeviceSlice::cuda_malloc(1).unwrap();
let mut g2_cfg = msm::get_default_msm_config::<G2CurveCfg>();
msm::msm(&scalars, &g2_points, &g2_cfg, &mut g2_msm_results).unwrap();
...
```
Here you can [find an example](https://github.com/ingonyama-zk/icicle/blob/5a96f9937d0a7176d88c766bd3ef2062b0c26c37/examples/rust/msm/src/main.rs#L114) of MSM on G2 Points.

View File

@@ -1,202 +0,0 @@
# Multi GPU APIs
To learn more about the theory of Multi GPU programming refer to [this part](../multi-gpu.md) of documentation.
Here we will cover the core multi-GPU APIs and an [example](#a-multi-gpu-example)
## A Multi GPU example
In this example we will display how you can
1. Fetch the number of devices installed on a machine
2. For every GPU launch a thread and set an active device per thread.
3. Execute a MSM on each GPU
```rust
...
let device_count = get_device_count().unwrap();
(0..device_count)
.into_par_iter()
.for_each(move |device_id| {
set_device(device_id).unwrap();
// you can allocate points and scalars_d here
let mut cfg = MSMConfig::default_for_device(device_id);
cfg.ctx.stream = &stream;
cfg.is_async = true;
cfg.are_scalars_montgomery_form = true;
msm(&scalars_d, &HostOrDeviceSlice::on_host(points), &cfg, &mut msm_results).unwrap();
// collect and process results
})
...
```
We use `get_device_count` to fetch the number of connected devices, device IDs will be `0, 1, 2, ..., device_count - 1`
[`into_par_iter`](https://docs.rs/rayon/latest/rayon/iter/trait.IntoParallelIterator.html#tymethod.into_par_iter) is a parallel iterator, you should expect it to launch a thread for every iteration.
We then call `set_device(device_id).unwrap();` it should set the context of that thread to the selected `device_id`.
Any data you now allocate from the context of this thread will be linked to the `device_id`. We create our `MSMConfig` with the selected device ID `let mut cfg = MSMConfig::default_for_device(device_id);`, behind the scenes this will create for us a `DeviceContext` configured for that specific GPU.
We finally call our `msm` method.
## Device management API
To streamline device management we offer as part of `icicle-cuda-runtime` package methods for dealing with devices.
#### [`set_device`](https://github.com/ingonyama-zk/icicle/blob/e6035698b5e54632f2c44e600391352ccc11cad4/wrappers/rust/icicle-cuda-runtime/src/device.rs#L6)
Sets the current CUDA device by its ID, when calling `set_device` it will set the current thread to a CUDA device.
**Parameters:**
- **`device_id: usize`**: The ID of the device to set as the current device. Device IDs start from 0.
**Returns:**
- **`CudaResult<()>`**: An empty result indicating success if the device is set successfully. In case of failure, returns a `CudaError`.
**Errors:**
- Returns a `CudaError` if the specified device ID is invalid or if a CUDA-related error occurs during the operation.
**Example:**
```rust
let device_id = 0; // Device ID to set
match set_device(device_id) {
Ok(()) => println!("Device set successfully."),
Err(e) => eprintln!("Failed to set device: {:?}", e),
}
```
#### [`get_device_count`](https://github.com/ingonyama-zk/icicle/blob/e6035698b5e54632f2c44e600391352ccc11cad4/wrappers/rust/icicle-cuda-runtime/src/device.rs#L10)
Retrieves the number of CUDA devices available on the machine.
**Returns:**
- **`CudaResult<usize>`**: The number of available CUDA devices. On success, contains the count of CUDA devices. On failure, returns a `CudaError`.
**Errors:**
- Returns a `CudaError` if a CUDA-related error occurs during the retrieval of the device count.
**Example:**
```rust
match get_device_count() {
Ok(count) => println!("Number of devices available: {}", count),
Err(e) => eprintln!("Failed to get device count: {:?}", e),
}
```
#### [`get_device`](https://github.com/ingonyama-zk/icicle/blob/e6035698b5e54632f2c44e600391352ccc11cad4/wrappers/rust/icicle-cuda-runtime/src/device.rs#L15)
Retrieves the ID of the current CUDA device.
**Returns:**
- **`CudaResult<usize>`**: The ID of the current CUDA device. On success, contains the device ID. On failure, returns a `CudaError`.
**Errors:**
- Returns a `CudaError` if a CUDA-related error occurs during the retrieval of the current device ID.
**Example:**
```rust
match get_device() {
Ok(device_id) => println!("Current device ID: {}", device_id),
Err(e) => eprintln!("Failed to get current device: {:?}", e),
}
```
## Device context API
The `DeviceContext` is embedded into `NTTConfig`, `MSMConfig` and `PoseidonConfig`, meaning you can simply pass a `device_id` to your existing config and the same computation will be triggered on a different device.
#### [`DeviceContext`](https://github.com/ingonyama-zk/icicle/blob/e6035698b5e54632f2c44e600391352ccc11cad4/wrappers/rust/icicle-cuda-runtime/src/device_context.rs#L11)
Represents the configuration of a CUDA device, encapsulating the device's stream, ID, and memory pool. The default device is always `0`.
```rust
pub struct DeviceContext<'a> {
pub stream: &'a CudaStream,
pub device_id: usize,
pub mempool: CudaMemPool,
}
```
##### Fields
- **`stream: &'a CudaStream`**
A reference to a `CudaStream`. This stream is used for executing CUDA operations. By default, it points to a null stream, CUDA's default execution stream.
- **`device_id: usize`**
The index of the GPU currently in use. The default value is `0`, indicating the first GPU in the system.
In some cases assuming `CUDA_VISIBLE_DEVICES` was configured, for example as `CUDA_VISIBLE_DEVICES=2,3,7` in the system with 8 GPUs - the `device_id=0` will correspond to GPU with id 2. So the mapping may not always be a direct reflection of the number of GPUs installed on a system.
- **`mempool: CudaMemPool`**
Represents the memory pool used for CUDA memory allocations. The default is set to a null pointer, which signifies the use of the default CUDA memory pool.
##### Implementation Notes
- The `DeviceContext` structure is cloneable and can be debugged, facilitating easier logging and duplication of contexts when needed.
#### [`DeviceContext::default_for_device(device_id: usize) -> DeviceContext<'static>`](https://github.com/ingonyama-zk/icicle/blob/e6035698b5e54632f2c44e600391352ccc11cad4/wrappers/rust/icicle-cuda-runtime/src/device_context.rs#L30)
Provides a default `DeviceContext` with system-wide defaults, ideal for straightforward setups.
#### Returns
A `DeviceContext` instance configured with:
- The default stream (`null_mut()`).
- The default device ID (`0`).
- The default memory pool (`null_mut()`).
#### Parameters
- **`device_id: usize`**: The ID of the device for which to create the context.
#### Returns
A `DeviceContext` instance with the provided `device_id` and default settings for the stream and memory pool.
#### [`check_device(device_id: i32)`](https://github.com/vhnatyk/icicle/blob/eef6876b037a6b0797464e7cdcf9c1ecfcf41808/wrappers/rust/icicle-cuda-runtime/src/device_context.rs#L42)
Validates that the specified `device_id` matches the ID of the currently active device, ensuring operations are targeted correctly.
#### Parameters
- **`device_id: i32`**: The device ID to verify against the currently active device.
#### Behavior
- **`Panics`** if the `device_id` does not match the active device's ID, preventing cross-device operation errors.
#### Example
```rust
let device_id: i32 = 0; // Example device ID
check_device(device_id);
// Ensures that the current context is correctly set for the specified device ID.
```

View File

@@ -1,200 +0,0 @@
# NTT
## Example
```rust
use icicle_bn254::curve::{ScalarCfg, ScalarField};
use icicle_core::{ntt::{self, NTT}, traits::GenerateRandom};
use icicle_cuda_runtime::{device_context::DeviceContext, memory::HostOrDeviceSlice, stream::CudaStream};
fn main() {
    let size = 1 << 12; // Define the size of your input, e.g., 2^12
let icicle_omega = <Bn254Fr as FftField>::get_root_of_unity(
size.try_into()
.unwrap(),
    );
// Generate random inputs
println!("Generating random inputs...");
let scalars = HostOrDeviceSlice::Host(ScalarCfg::generate_random(size));
// Allocate memory on CUDA device for NTT results
let mut ntt_results: HostOrDeviceSlice<'_, ScalarField> = HostOrDeviceSlice::cuda_malloc(size).expect("Failed to allocate CUDA memory");
// Create a CUDA stream
let stream = CudaStream::create().expect("Failed to create CUDA stream");
let ctx = DeviceContext::default(); // Assuming default device context
ScalarCfg::initialize_domain(ScalarField::from_ark(icicle_omega), &ctx, true).unwrap();
// Configure NTT
let mut cfg = ntt::NTTConfig::default();
cfg.ctx.stream = &stream;
cfg.is_async = true; // Set to true for asynchronous execution
// Execute NTT on device
println!("Executing NTT on device...");
ntt::ntt(&scalars, ntt::NTTDir::kForward, &cfg, &mut ntt_results).expect("Failed to execute NTT");
// Synchronize CUDA stream to ensure completion
stream.synchronize().expect("Failed to synchronize CUDA stream");
// Optionally, move results to host for further processing or verification
println!("NTT execution complete.");
}
```
## NTT API overview
```rust
pub fn ntt<F>(
input: &HostOrDeviceSlice<F>,
dir: NTTDir,
cfg: &NTTConfig<F>,
output: &mut HostOrDeviceSlice<F>,
) -> IcicleResult<()>
```
`ntt::ntt` expects:
- **`input`** - buffer to read the inputs of the NTT from.
- **`dir`** - whether to compute forward or inverse NTT.
- **`cfg`** - config used to specify extra arguments of the NTT.
- **`output`** - buffer to write the NTT outputs into. Must be of the same size as input.
The `input` and `output` buffers can be on device or on host. Being on host means that they will be transferred to device during runtime.
### NTT Config
```rust
pub struct NTTConfig<'a, S> {
pub ctx: DeviceContext<'a>,
pub coset_gen: S,
pub batch_size: i32,
pub columns_batch: bool,
pub ordering: Ordering,
are_inputs_on_device: bool,
are_outputs_on_device: bool,
pub is_async: bool,
pub ntt_algorithm: NttAlgorithm,
}
```
The `NTTConfig` struct is a configuration object used to specify parameters for an NTT instance.
#### Fields
- **`ctx: DeviceContext<'a>`**: Specifies the device context, including the device ID and the stream ID.
- **`coset_gen: S`**: Defines the coset generator used for coset (i)NTTs. By default, this is set to `S::one()`, indicating that no coset is being used.
- **`batch_size: i32`**: Determines the number of NTTs to compute in a single batch. The default value is 1, meaning that operations are performed on individual inputs without batching. Batch processing can significantly improve performance by leveraging parallelism in GPU computations.
- **`columns_batch`**: If true the function will compute the NTTs over the columns of the input matrix and not over the rows. Defaults to `false`.
- **`ordering: Ordering`**: Controls the ordering of inputs and outputs for the NTT operation. This field can be used to specify decimation strategies (in time or in frequency) and the type of butterfly algorithm (Cooley-Tukey or Gentleman-Sande). The ordering is crucial for compatibility with various algorithmic approaches and can impact the efficiency of the NTT.
- **`are_inputs_on_device: bool`**: Indicates whether the input data has been preloaded on the device memory. If `false` inputs will be copied from host to device.
- **`are_outputs_on_device: bool`**: Indicates whether the output data is preloaded in device memory. If `false` outputs will be copied from device to host. If the inputs and outputs are the same pointer NTT will be computed in place.
- **`is_async: bool`**: Specifies whether the NTT operation should be performed asynchronously. When set to `true`, the NTT function will not block the CPU, allowing other operations to proceed concurrently. Asynchronous execution requires careful synchronization to ensure data integrity and correctness.
- **`ntt_algorithm: NttAlgorithm`**: Can be one of `Auto`, `Radix2`, `MixedRadix`.
`Auto` will select `Radix 2` or `Mixed Radix` algorithm based on heuristics.
`Radix2` and `MixedRadix` will force the use of an algorithm regardless of the input size or other considerations. You should use one of these options when you know for sure that you want to use a specific algorithm.
#### Usage
Example initialization with default settings:
```rust
let default_config = NTTConfig::default();
```
Customizing the configuration:
```rust
let custom_config = NTTConfig {
ctx: custom_device_context,
coset_gen: my_coset_generator,
batch_size: 10,
columns_batch: false,
ordering: Ordering::kRN,
are_inputs_on_device: true,
are_outputs_on_device: true,
is_async: false,
ntt_algorithm: NttAlgorithm::MixedRadix,
};
```
### Modes
NTT supports two different modes `Batch NTT` and `Single NTT`
You may toggle between single and batch NTT by simply configuring `batch_size` to be larger than 1 in your `NTTConfig`.
```rust
let mut cfg = ntt::get_default_ntt_config::<ScalarField>();
cfg.batch_size = 10; // your ntt using this config will run in batch mode.
```
`batch_size=1` would keep our NTT in single NTT mode.
Deciding whether to use `batch NTT` vs `single NTT` is highly dependent on your application and use case.
### Initializing the NTT Domain
Before performing NTT operations, it's necessary to initialize the NTT domain. It only needs to be called once per GPU since the twiddles are cached.
```rust
ScalarCfg::initialize_domain(ScalarField::from_ark(icicle_omega), &ctx, true).unwrap();
```
### `initialize_domain`
```rust
pub fn initialize_domain<F>(primitive_root: F, ctx: &DeviceContext, fast_twiddles: bool) -> IcicleResult<()>
where
F: FieldImpl,
<F as FieldImpl>::Config: NTT<F>;
```
#### Parameters
- **`primitive_root`**: The primitive root of unity, chosen based on the maximum NTT size required for the computations. It must be of an order that is a power of two. This root is used to generate twiddle factors that are essential for the NTT operations.
- **`ctx`**: A reference to a `DeviceContext` specifying which device and stream the computation should be executed on.
#### Returns
- **`IcicleResult<()>`**: Will return an error if the operation fails.
### Releasing the domain
The `release_domain` function is responsible for releasing the resources associated with a specific domain in the CUDA device context.
```rust
pub fn release_domain<F>(ctx: &DeviceContext) -> IcicleResult<()>
where
F: FieldImpl,
<F as FieldImpl>::Config: NTT<F>
```
#### Parameters
- **`ctx`**: A reference to a `DeviceContext` specifying which device and stream the computation should be executed on.
#### Returns
The function returns an `IcicleResult<()>`, which represents the result of the operation. If the operation is successful, the function returns `Ok(())`, otherwise it returns an error.

Some files were not shown because too many files have changed in this diff Show More