# ICICLE best practices: Concurrent Data Transfer and NTT Computation
The Number Theoretic Transform (NTT) is an integral component of many cryptographic algorithms, such as polynomial multiplication in Zero-Knowledge Proofs. The performance bottleneck of NTT on GPUs is the data transfer between the host (CPU) and the device (GPU): on a typical NVIDIA GPU, this transfer dominates the total NTT execution time.
## Key-Takeaway
When you have to run several NTTs, consider Concurrent Data Download, Upload, and Computation to improve data bus (PCIe) and GPU utilization, and get better total execution time.
Typically, you concurrently:

- Download the output of the previous NTT back to the host
- Upload the input of the next NTT to the device
- Run the current NTT on the device
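The three-way overlap above can be sketched with CUDA streams and events. This is a minimal sketch under assumptions: `ntt_kernel`, the launch configuration, and all variable names are hypothetical illustrations, not ICICLE's actual API.

```cuda
#include <cuda_runtime.h>
#include <stdint.h>

// Hypothetical in-place NTT kernel; ICICLE's real entry point differs.
__global__ void ntt_kernel(uint32_t* data, int log_n);

void pipelined_ntts(const uint32_t* h_in, uint32_t* h_out,
                    int n_ntts, size_t n, int log_n) {
  const size_t bytes = n * sizeof(uint32_t);
  cudaStream_t s_up, s_comp, s_down;
  cudaStreamCreate(&s_up);
  cudaStreamCreate(&s_comp);
  cudaStreamCreate(&s_down);

  // Two device buffers (double buffering): NTT i uses buffer i % 2,
  // so the upload of NTT i+1 overlaps the compute of NTT i and the
  // download of NTT i-1.
  uint32_t* d_buf[2];
  cudaEvent_t up_done[2], comp_done[2], down_done[2];
  for (int b = 0; b < 2; ++b) {
    cudaMalloc(&d_buf[b], bytes);
    cudaEventCreate(&up_done[b]);
    cudaEventCreate(&comp_done[b]);
    cudaEventCreate(&down_done[b]);
  }

  for (int i = 0; i < n_ntts; ++i) {
    const int b = i % 2;

    // Don't overwrite buffer b until its previous download has finished.
    if (i >= 2) cudaStreamWaitEvent(s_up, down_done[b], 0);
    cudaMemcpyAsync(d_buf[b], h_in + i * n, bytes,
                    cudaMemcpyHostToDevice, s_up);
    cudaEventRecord(up_done[b], s_up);

    // Compute waits only for this buffer's upload, not for the whole bus.
    cudaStreamWaitEvent(s_comp, up_done[b], 0);
    ntt_kernel<<<256, 256, 0, s_comp>>>(d_buf[b], log_n);  // in-place
    cudaEventRecord(comp_done[b], s_comp);

    // Download result i while NTT i+1 is being uploaded and computed.
    cudaStreamWaitEvent(s_down, comp_done[b], 0);
    cudaMemcpyAsync(h_out + i * n, d_buf[b], bytes,
                    cudaMemcpyDeviceToHost, s_down);
    cudaEventRecord(down_done[b], s_down);
  }
  cudaDeviceSynchronize();
  // Cleanup of buffers, events, and streams omitted for brevity.
}
```

The events enforce the per-buffer ordering upload → compute → download while the three streams let transfers in both directions run concurrently with the kernel.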
## Note

This approach requires two on-device vectors (double buffering), halving the maximum NTT size that fits in device memory.
## Best-Practices
- Use three separate CUDA streams for the Download, Upload, and Compute operations
- Use pinned (page-locked) host memory to speed up data bus transfers; calling `cudaHostAlloc` allocates pinned memory
- Use an in-place NTT to save device memory
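Pinned allocation from the bullets above might look like the following sketch; the function and variable names are illustrative, not part of ICICLE.

```cuda
#include <cuda_runtime.h>
#include <stdint.h>

// Pinned (page-locked) host buffers let cudaMemcpyAsync run truly
// asynchronously; pageable memory would force staging copies that
// serialize with GPU work.
void alloc_host_buffers(uint32_t** h_in, uint32_t** h_out,
                        size_t total_bytes) {
  cudaHostAlloc((void**)h_in,  total_bytes, cudaHostAllocDefault);
  cudaHostAlloc((void**)h_out, total_bytes, cudaHostAllocDefault);
}

// Release with cudaFreeHost(h_in) / cudaFreeHost(h_out), not free().
```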
## Running the example
To change the default curve BN254, edit `compile.sh` and `CMakeLists.txt`.

```sh
./compile.sh
./run.sh
```
To compare with the baseline (i.e. non-concurrent) ICICLE NTT, you can run this example.