# ICICLE best practices: Concurrent Data Transfer and NTT Computation
The Number Theoretic Transform (NTT) is an integral component of many cryptographic algorithms, such as polynomial multiplication in Zero-Knowledge Proofs. The performance bottleneck of NTT on GPUs is the data transfer between the host (CPU) and the device (GPU): on a typical NVIDIA GPU, this transfer dominates the total NTT execution time.
## Key-Takeaway
When you have to run several NTTs, consider Concurrent Data Download, Upload, and Computation to improve data bus (PCIe) and GPU utilization, and get better total execution time.
Typically, you concurrently:

- Download the output of the previous NTT back to the host
- Upload the input of the next NTT to the device
- Run the current NTT on the device
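The three-way overlap above can be sketched with CUDA streams and events. This is a minimal sketch under assumptions: `ntt_kernel`, the launch configuration, and all variable names are hypothetical illustrations, not ICICLE's actual API.

```cuda
#include <cuda_runtime.h>
#include <stdint.h>

// Hypothetical in-place NTT kernel; ICICLE's real entry point differs.
__global__ void ntt_kernel(uint32_t* data, int log_n);

void pipelined_ntts(const uint32_t* h_in, uint32_t* h_out,
                    int n_ntts, size_t n, int log_n) {
  const size_t bytes = n * sizeof(uint32_t);
  cudaStream_t s_up, s_comp, s_down;
  cudaStreamCreate(&s_up);
  cudaStreamCreate(&s_comp);
  cudaStreamCreate(&s_down);

  // Two device buffers (double buffering): NTT i uses buffer i % 2,
  // so the upload of NTT i+1 overlaps the compute of NTT i and the
  // download of NTT i-1.
  uint32_t* d_buf[2];
  cudaEvent_t up_done[2], comp_done[2], down_done[2];
  for (int b = 0; b < 2; ++b) {
    cudaMalloc(&d_buf[b], bytes);
    cudaEventCreate(&up_done[b]);
    cudaEventCreate(&comp_done[b]);
    cudaEventCreate(&down_done[b]);
  }

  for (int i = 0; i < n_ntts; ++i) {
    const int b = i % 2;

    // Don't overwrite buffer b until its previous download has finished.
    if (i >= 2) cudaStreamWaitEvent(s_up, down_done[b], 0);
    cudaMemcpyAsync(d_buf[b], h_in + i * n, bytes,
                    cudaMemcpyHostToDevice, s_up);
    cudaEventRecord(up_done[b], s_up);

    // Compute waits only for this buffer's upload, not for the whole bus.
    cudaStreamWaitEvent(s_comp, up_done[b], 0);
    ntt_kernel<<<256, 256, 0, s_comp>>>(d_buf[b], log_n);  // in-place
    cudaEventRecord(comp_done[b], s_comp);

    // Download result i while NTT i+1 is being uploaded and computed.
    cudaStreamWaitEvent(s_down, comp_done[b], 0);
    cudaMemcpyAsync(h_out + i * n, d_buf[b], bytes,
                    cudaMemcpyDeviceToHost, s_down);
    cudaEventRecord(down_done[b], s_down);
  }
  cudaDeviceSynchronize();
  // Cleanup of buffers, events, and streams omitted for brevity.
}
```

The events enforce the per-buffer ordering upload → compute → download while the three streams let transfers in both directions run concurrently with the kernel.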
## Note

This approach requires two on-device vectors (double buffering), halving the maximum NTT size that fits in device memory.
## Best-Practices
- Use three separate CUDA streams for the Download, Upload, and Compute operations
- Use pinned (page-locked) host memory to speed up data bus transfers; calling `cudaHostAlloc` allocates pinned memory
- Use an in-place NTT to save device memory
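Pinned allocation from the bullets above might look like the following sketch; the function and variable names are illustrative, not part of ICICLE.

```cuda
#include <cuda_runtime.h>
#include <stdint.h>

// Pinned (page-locked) host buffers let cudaMemcpyAsync run truly
// asynchronously; pageable memory would force staging copies that
// serialize with GPU work.
void alloc_host_buffers(uint32_t** h_in, uint32_t** h_out,
                        size_t total_bytes) {
  cudaHostAlloc((void**)h_in,  total_bytes, cudaHostAllocDefault);
  cudaHostAlloc((void**)h_out, total_bytes, cudaHostAllocDefault);
}

// Release with cudaFreeHost(h_in) / cudaFreeHost(h_out), not free().
```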
## Running the example
To change the default curve BN254, edit `compile.sh` and `CMakeLists.txt`.

```sh
./compile.sh
./run.sh
```
To compare with the baseline (i.e. non-concurrent) ICICLE NTT, you can run this example.