## Describe the changes
This PR fixes affine to projective functions in bindings by adding a
condition if the point in affine form is zero then return the projective zero
---------
Co-authored-by: Jeremy Felder <jeremy.felder1@gmail.com>
## Describe the changes
This PR adds the capability to pin host memory in golang bindings
allowing data transfers to be quicker. Memory can be pinned once for
multiple devices by passing the flag
`cuda_runtime.CudaHostRegisterPortable` or
`cuda_runtime.CudaHostAllocPortable` depending on how pinned memory is
called
This PR enables using MSM with any value of c.
Note: default c isn't necessarily optimal, the user is expected to
choose c and the precomputation factor that give the best results for
the relevant case.
---------
Co-authored-by: Jeremy Felder <jeremy.felder1@gmail.com>
- removed poly API to access view of evaluations. This is a problematic API since it cannot handle small domains and for large domains requires the polynomial to use more memory than need to.
- added evaluate_on_rou_domain() API instead that supports any domain size (powers of two size).
- the new API can compute to HOST or DEVICE memory
- Rust wrapper for evaluate_on_rou_domain()
- updated documentation: overview and Rust wrappers
- faster division by vanishing poly for common case where numerator is 2N and vanishing poly is of degree N.
- allow division a/b where deg(a)<deg(b) instead of throwing an error.
# This PR
1. Adds C++ API
2. Renames a lot of API functions
3. Adds inplace poseidon2
4. Makes input const at all poseidon functions
5. Adds benchmark for poseidon2
## Describe the changes
This PR fixes an issue when `RunOnDevice` is called for multi-gpu while
other goroutines calling device operations are run outside of
`RunOnDevice`. The issue comes from setting a device other than the
default device (device 0) on a host thread within `RunOnDevice` and not
unsetting that host threads device when `RunOnDevice` finishes.
When `RunOnDevice` locks a host thread to ensure that all other calls in
the go routine are on the same device, it never unsets that thread’s
device. Once the thread is unlocked, other go routines can get scheduled
to it but it still has the device set to whatever it was before while it
was locked so its possible that the following sequence happens:
1. NTT domain is initialized on thread 2 via a goroutine on device 0
2. MSM multiGPU test runs and is locked on thread 3 setting its device
to 1
3. Other tests run concurrently on threads other than 3 (since it is
locked)
4. MSM multiGPU test finishes and release thread 3 back to the pool but
its device is still 1
5. NTT test runs and is assigned to thread 3 --> this will fail because
the thread’s device wasn’t released back
We really only want to set a thread's device while the thread is locked.
But once we unlock a thread, it’s device should return to whatever it
was set at originally. In theory, it should always be 0 if `SetDevice`
is never used outside of `RunOnDevice` - which it shouldn’t be in most
situations
The bug is in how twiddles array is indexed when multiplied by a mixed
(M) vector to implement (I)NTT on cosets.
The fix is to use the DIF-digit-reverse to compute the index of the element in the
natural (N) vector that moved to index 'i' in the M vector. This is
emulating a DIT-digit-reverse (which is mixing like a DIF-compute)
reorder of the twiddles array and element-wise multiplication without
reordering the twiddles memory.
## Describe the changes
This PR adds a NTT ReleaseDomain API in Golang and Rust
## Linked Issues
Resolves #
---------
Co-authored-by: Yuval Shekel <yshekel@gmail.com>
## Describe the changes
This PR adds an extern C link to the transpose kernel, now in
vec_ops.cu.
Also Rust binding, and I updated the test check_ntt_batch to use the new
transpose function.
The test passes.
## Linked Issues
Resolves #
---------
Co-authored-by: LeonHibnik <leon@ingonyama.com>
## Describe the changes
This PR adds the capability to slice a DeviceSlice, allowing portions of
data that are already on the device to be reused.
Additionally, this PR removes the need for a HostSlice underlying type
to implement a Size function and uses unsafe.Sizeof instead. This
together with #407 will allow direct usage of gnark-crypto types with
HostSlice without the need for converting to ICICLE types
---------
Co-authored-by: nonam3e <timur@ingonyama.com>
## Describe the changes
Due to Rust's ownership rules, we can't run NTT inplace using the
[`ntt`](https://github.com/ingonyama-zk/icicle/blob/v1.9.1/wrappers/rust/icicle-core/src/ntt/mod.rs#L139)
function. Which is why we saw a need to add a separate function a couple
of times.
Incidentally an issue with radix-2 NTT was found when ran inplace,
`__syncthreads()` was used in reverse order kernel as if it was a global
barrier for all blocks and not block-local one. Thus data race happened
that is fixed by this PR.
## Brief description
This PR adds pre-computation to the MSM, for some theory see
[this](https://youtu.be/KAWlySN7Hm8?si=XeR-htjbnK_ySbUo&t=1734) timecode
of Niall Emmart's talk.
In terms of public APIs, one method is added. It does the
pre-computation on-device leaving resulting data on-device as well. No
extra structures are added, only `precompute_factor` from `MSMConfig` is
now activated.
## Performance
While performance gains are for now often limited by our inflexibility
in choice of `c` (for example, very large MSMs get basically no speedup
from pre-compute because currently `c` cannot be larger than 16),
there's still a number of MSM sizes which get noticeable improvement:
| Pre-computation factor | bn254 size `2^20` MSM, ms. | bn254 size
`2^12` MSM, size `2^10` batch, ms. | bls12-381 size `2^20` MSM, ms. |
bls12-381 size `2^12` MSM, size `2^10` batch, ms. |
| ------------- | ------------- | ------------- | ------------- |
------------- |
| 1 | 14.1 | 82.8 | 25.5 | 136.7 |
| 2 | 11.8 | 76.6 | 20.3 | 123.8 |
| 4 | 10.9 | 73.8 | 18.1 | 117.8 |
| 8 | 10.6 | 73.7 | 17.2 | 116.0 |
Here for example pre-computation factor = 4 means that alongside each
original base point, we pre-compute and pass into the MSM 3 of its
"shifted" versions. Pre-computation factor = 1 means no pre-computation.
GPU used for benchmarks is a 3090Ti.
## TODOs and open questions
- Golang APIs are missing;
- I mentioned that to utilise pre-compute to its full potential we need
arbitrary choice of `c`. One issue with this is that pre-compute will
become dependent on `c`. For now this is not the case as `c` can only be
a power of 2 and powers of 2 can always share the same pre-computation.
So apparently we need to make `c` a parameter of the precompute function
to future-proof it from a breaking change. This is pretty unnatural and
counterintuitive as `c` is typically chosen in runtime after pre-compute
is done but I don't really see another way, pls let me know if you do.
UPD: `c` is added into pre-compute function, for now it's unused and
it's documented how it will change in the future.
Resolves https://github.com/ingonyama-zk/icicle/issues/147
Co-authored with @ChickenLover
---------
Co-authored-by: ChickenLover <romangg81@gmail.com>
Co-authored-by: nonam3e <timur@ingonyama.com>
Co-authored-by: nonam3e <71525212+nonam3e@users.noreply.github.com>
Co-authored-by: LeonHibnik <leon@ingonyama.com>