The bug is in how twiddles array is indexed when multiplied by a mixed
(M) vector to implement (I)NTT on cosets.
The fix is to use the DIF-digit-reverse to compute the index of the element in the
natural (N) vector that moved to index 'i' in the M vector. This is
emulating a DIT-digit-reverse (which is mixing like a DIF-compute)
reorder of the twiddles array and element-wise multiplication without
reordering the twiddles memory.
## Describe the changes
This PR adds a NTT ReleaseDomain API in Golang and Rust
## Linked Issues
Resolves #
---------
Co-authored-by: Yuval Shekel <yshekel@gmail.com>
## Describe the changes
This PR adds an extern C link to the transpose kernel, now in
vec_ops.cu.
Also Rust binding, and I updated the test check_ntt_batch to use the new
transpose function.
The test passes.
## Linked Issues
Resolves #
---------
Co-authored-by: LeonHibnik <leon@ingonyama.com>
## Describe the changes
Due to Rust's ownership rules, we can't run NTT inplace using the
[`ntt`](https://github.com/ingonyama-zk/icicle/blob/v1.9.1/wrappers/rust/icicle-core/src/ntt/mod.rs#L139)
function. Which is why we saw a need to add a separate function a couple
of times.
Incidentally an issue with radix-2 NTT was found when ran inplace,
`__syncthreads()` was used in reverse order kernel as if it was a global
barrier for all blocks and not block-local one. Thus data race happened
that is fixed by this PR.
## Brief description
This PR adds pre-computation to the MSM, for some theory see
[this](https://youtu.be/KAWlySN7Hm8?si=XeR-htjbnK_ySbUo&t=1734) timecode
of Niall Emmart's talk.
In terms of public APIs, one method is added. It does the
pre-computation on-device leaving resulting data on-device as well. No
extra structures are added, only `precompute_factor` from `MSMConfig` is
now activated.
## Performance
While performance gains are for now often limited by our inflexibility
in choice of `c` (for example, very large MSMs get basically no speedup
from pre-compute because currently `c` cannot be larger than 16),
there's still a number of MSM sizes which get noticeable improvement:
| Pre-computation factor | bn254 size `2^20` MSM, ms. | bn254 size
`2^12` MSM, size `2^10` batch, ms. | bls12-381 size `2^20` MSM, ms. |
bls12-381 size `2^12` MSM, size `2^10` batch, ms. |
| ------------- | ------------- | ------------- | ------------- |
------------- |
| 1 | 14.1 | 82.8 | 25.5 | 136.7 |
| 2 | 11.8 | 76.6 | 20.3 | 123.8 |
| 4 | 10.9 | 73.8 | 18.1 | 117.8 |
| 8 | 10.6 | 73.7 | 17.2 | 116.0 |
Here for example pre-computation factor = 4 means that alongside each
original base point, we pre-compute and pass into the MSM 3 of its
"shifted" versions. Pre-computation factor = 1 means no pre-computation.
GPU used for benchmarks is a 3090Ti.
## TODOs and open questions
- Golang APIs are missing;
- I mentioned that to utilise pre-compute to its full potential we need
arbitrary choice of `c`. One issue with this is that pre-compute will
become dependent on `c`. For now this is not the case as `c` can only be
a power of 2 and powers of 2 can always share the same pre-computation.
So apparently we need to make `c` a parameter of the precompute function
to future-proof it from a breaking change. This is pretty unnatural and
counterintuitive as `c` is typically chosen in runtime after pre-compute
is done but I don't really see another way, pls let me know if you do.
UPD: `c` is added into pre-compute function, for now it's unused and
it's documented how it will change in the future.
Resolves https://github.com/ingonyama-zk/icicle/issues/147
Co-authored with @ChickenLover
---------
Co-authored-by: ChickenLover <romangg81@gmail.com>
Co-authored-by: nonam3e <timur@ingonyama.com>
Co-authored-by: nonam3e <71525212+nonam3e@users.noreply.github.com>
Co-authored-by: LeonHibnik <leon@ingonyama.com>
This PR adds the columns batch feature - enabling batch NTT computation
to be performed directly on the columns of a matrix without having to
transpose it beforehand, as requested in issue #264.
Also some small fixes to the reordering kernels were added and some
unnecessary parameters were removes from functions interfaces.
---------
Co-authored-by: DmytroTym <dmytrotym1@gmail.com>
## Describe the changes
Add a non-blocking `warmup` function to `CudaStream`
> when you run the benchmark (e.g. the msm example you have) the first
instance is always slow, with a constant overhead of 200~300ms cuda
stream warmup. and I want to get rid of that in my application by
warming it up in parallel while my host do something else.
## Describe the changes
This PR:
- Moves common crate attributes to the workspace Cargo.toml.
- Adds a manual release flow for bumping, tagging, and draft release
This PR is a compilation of small improvements
- Lock bindgen version for `icicle-cuda-runtime`
- Add an error message when trying to build on Mac (or any non
windows/linux machine)
- Add documentation and template files for adding new curve
- Add documentation on _params.cuh contents
- Add the script to bump all the rust crates versions to the same
version
Resolves#313
- this mode is allocating additional 4N twiddle-factors to achieve faster computation
- enabled by flag for initDomain(). Defaults to false.
Co-authored-by: hadaringonyama <hadar@ingonyama.com>
* Improved MSM
* Zero point handling in large buckets
* Fixed affine zero point conversion for arkworks
* cargo fmt
* Addressed comments
* MSM comments
* All zero scalars case handled
* clang format
- Mixed-radix NTT orderings support
- radix-2 small refactor: split core logic to function and renamed ct_butterfly to dit
- testing both radix2 and mixed-radix algs for all ntt tests
* BW scalar field is now the same as BLS base field
* add poseidon
* add merkle tree builder
* poseidon rust bindings
* implement rust bindings
* add doc comments
* remove global poseidon constants
* add custom constants API and script for generating new constants
* add the rest of the curves for poseidon
* add all the curves for real
* misname bls12-377
* typo
* partial rounds
* minor fixes
* small tweak for big performance boost
* add CHK_INIT_IF_RETURN
---------
Co-authored-by: DmytroTym <dmytrotym1@gmail.com>
* BW scalar field is now the same as BLS base field
* add poseidon
* add merkle tree builder
* poseidon rust bindings
* implement rust bindings
* add doc comments
* remove global poseidon constants
* add custom constants API and script for generating new constants
* add the rest of the curves for poseidon
* add all the curves for real
* misname bls12-377
* typo
* partial rounds
* minor fixes
* small tweak for big performance boost
* add CHK_INIT_IF_RETURN
---------
Co-authored-by: DmytroTym <dmytrotym1@gmail.com>
- Make the curve config's omegas_count conditionally accessed when creating fields
- Remove the extern C function that returns a UDT containing non-POD types and replace it with a default_config function on the Rust bindings side