## Describe the changes
This PR refactors the various affine-to-projective conversion
functions to use the C function.
It also includes a small bug fix for the ProjectiveToAffine() function in Go.
## Linked Issues
Resolves #
# Updates:
## Hashing
- Added SpongeHasher class
- Can be used to accept any hash function as an argument
- Absorb and squeeze are now separated
- Memory management is now mostly done by SpongeHasher class, each hash
function only describes permutation kernels
## Tree builder
- Tree builder is now hash-agnostic.
- Tree builder now supports 2D input (matrices)
- Tree builder can now use two different hash functions for layer 0 and
compression layers
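
As a rough illustration of the tree builder points above, here is a minimal Go sketch of a hash-agnostic Merkle root computation over 2D input, with one hash for layer 0 and another for the compression layers. It is not the ICICLE API: `HashFn`, `taggedSha` and `BuildRoot` are hypothetical names, SHA-256 is only a stand-in hash, and the number of rows is assumed to be a non-zero power of two.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// HashFn is a hypothetical hash signature used only in this sketch.
type HashFn func(data []byte) []byte

// taggedSha returns SHA-256 with a one-byte domain prefix, so the
// layer-0 hash and the compression hash are genuinely different functions.
func taggedSha(tag byte) HashFn {
	return func(data []byte) []byte {
		sum := sha256.Sum256(append([]byte{tag}, data...))
		return sum[:]
	}
}

// BuildRoot hashes every row of the 2D input with leafHash (layer 0),
// then compresses pairs of digests with compress until one digest remains.
// Assumption: len(rows) is a non-zero power of two.
func BuildRoot(rows [][]byte, leafHash, compress HashFn) []byte {
	layer := make([][]byte, len(rows))
	for i, row := range rows {
		layer[i] = leafHash(row)
	}
	for len(layer) > 1 {
		next := make([][]byte, len(layer)/2)
		for i := range next {
			// Concatenate the two children into a fresh buffer before hashing.
			pair := append(append([]byte{}, layer[2*i]...), layer[2*i+1]...)
			next[i] = compress(pair)
		}
		layer = next
	}
	return layer[0]
}

func main() {
	rows := [][]byte{[]byte("a"), []byte("b"), []byte("c"), []byte("d")}
	root := BuildRoot(rows, taggedSha(0), taggedSha(1))
	fmt.Printf("%x\n", root)
}
```

The builder only sees the two function values, which is the sense in which it is hash-agnostic here; swapping in another hash means passing a different `HashFn`.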
## Poseidon1
- Interface changed to classes
- Now allows for any alpha
- Now allows passing constants not in a single vector
- Now allows for any domain tag
- Constants are now released upon going out of scope
- Rust wrappers changed to Poseidon struct
## Poseidon2
- Interface changed to classes
- Constants are now released upon going out of scope
- Rust wrappers changed to Poseidon2 struct
## Keccak
- Added Keccak class which inherits SpongeHasher
- No longer uses GPU registers to store state
To do:
- [x] Update poseidon1 golang bindings
- [x] Update poseidon1 examples
- [x] Fix poseidon2 cuda test
- [x] Fix poseidon2 merkle tree builder test
- [x] Update keccak class with new design
- [x] Update keccak test
- [x] Check keccak correctness
- [x] Update tree builder rust wrappers
- [x] Leave doc comments
Future work:
- [ ] Add keccak merkle tree builder externs
- [ ] Add keccak rust tree builder wrappers
- [ ] Write docs
- [ ] Add example
- [ ] Fix device output for tree builder
---------
Co-authored-by: Jeremy Felder <jeremy.felder1@gmail.com>
Co-authored-by: nonam3e <71525212+nonam3e@users.noreply.github.com>
## Describe the changes
This PR fixes the affine-to-projective functions in the bindings by adding a
check: if the point in affine form is zero, the projective zero is returned.
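
A minimal sketch of the check described above, using toy `uint64` coordinates instead of real multi-limb field elements. The types and method names are illustrative, and the encodings (affine zero as `(0, 0)`, projective zero as `(0, 1, 0)`) are assumptions of this sketch rather than something stated in this PR.

```go
package main

import "fmt"

// Toy point types; real bindings use multi-limb field elements.
type Affine struct{ X, Y uint64 }
type Projective struct{ X, Y, Z uint64 }

// IsZero reports whether the point is the point at infinity.
// Assumption: affine zero is encoded as (0, 0).
func (a Affine) IsZero() bool { return a.X == 0 && a.Y == 0 }

// ToProjective converts to projective coordinates. Without the zero
// check, (0, 0) would become (0, 0, 1), which is not the projective zero.
// Assumption: projective zero is encoded as (0, 1, 0).
func (a Affine) ToProjective() Projective {
	if a.IsZero() {
		return Projective{X: 0, Y: 1, Z: 0}
	}
	return Projective{X: a.X, Y: a.Y, Z: 1}
}

func main() {
	fmt.Println(Affine{}.ToProjective())
	fmt.Println(Affine{X: 3, Y: 5}.ToProjective())
}
```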
---------
Co-authored-by: Jeremy Felder <jeremy.felder1@gmail.com>
## Describe the changes
This PR adds the capability to pin host memory in the golang bindings,
allowing faster data transfers. Memory can be pinned once for
multiple devices by passing the flag
`cuda_runtime.CudaHostRegisterPortable` or
`cuda_runtime.CudaHostAllocPortable`, depending on whether the memory is
registered or allocated as pinned.
This PR enables using MSM with any value of c.
Note: the default c isn't necessarily optimal; the user is expected to
choose c and the precomputation factor that give the best results for
the relevant case.
---------
Co-authored-by: Jeremy Felder <jeremy.felder1@gmail.com>
## Describe the changes
This PR adds the capability to slice a DeviceSlice, allowing portions of
data that are already on the device to be reused.
Additionally, this PR removes the need for a HostSlice underlying type
to implement a Size function and uses unsafe.Sizeof instead. Together
with #407, this will allow direct usage of gnark-crypto types with
HostSlice without converting to ICICLE types.
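
To illustrate the `unsafe.Sizeof` point, here is a small sketch of a generic host slice that derives its element size from the type itself. `HostSlice`, `SizeOfElement`, `SizeInBytes` and `ThirdPartyElement` are names chosen for this example and may not match the bindings exactly; `ThirdPartyElement` is a stand-in for an external four-limb field element type.

```go
package main

import (
	"fmt"
	"unsafe"
)

// HostSlice is a sketch of a generic host-side slice wrapper.
type HostSlice[T any] []T

// SizeOfElement needs no Size method on T: the size comes from the type.
func (h HostSlice[T]) SizeOfElement() int {
	var zero T
	return int(unsafe.Sizeof(zero))
}

// SizeInBytes is the number of bytes a copy to the device would transfer.
func (h HostSlice[T]) SizeInBytes() int {
	return len(h) * h.SizeOfElement()
}

// ThirdPartyElement stands in for an external type made of four 64-bit limbs.
type ThirdPartyElement [4]uint64

func main() {
	points := HostSlice[ThirdPartyElement](make([]ThirdPartyElement, 8))
	fmt.Println(points.SizeOfElement(), points.SizeInBytes()) // prints "32 256"
}
```

Because the size is computed from `T`, any plain-data type can be wrapped without adding methods to it.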
---------
Co-authored-by: nonam3e <timur@ingonyama.com>
## Brief description
This PR adds pre-computation to the MSM. For some theory, see
[this](https://youtu.be/KAWlySN7Hm8?si=XeR-htjbnK_ySbUo&t=1734) timecode
of Niall Emmart's talk.
In terms of public APIs, one method is added. It does the
pre-computation on-device, leaving the resulting data on-device as well. No
extra structures are added; only `precompute_factor` from `MSMConfig` is
now activated.
## Performance
While performance gains are for now often limited by our inflexibility
in the choice of `c` (for example, very large MSMs get basically no speedup
from pre-compute because currently `c` cannot be larger than 16),
there are still a number of MSM sizes that see a noticeable improvement:
| Pre-computation factor | bn254 size `2^20` MSM, ms. | bn254 size `2^12` MSM, size `2^10` batch, ms. | bls12-381 size `2^20` MSM, ms. | bls12-381 size `2^12` MSM, size `2^10` batch, ms. |
| ------------- | ------------- | ------------- | ------------- | ------------- |
| 1 | 14.1 | 82.8 | 25.5 | 136.7 |
| 2 | 11.8 | 76.6 | 20.3 | 123.8 |
| 4 | 10.9 | 73.8 | 18.1 | 117.8 |
| 8 | 10.6 | 73.7 | 17.2 | 116.0 |
Here, for example, pre-computation factor = 4 means that alongside each
original base point, we pre-compute and pass into the MSM 3 of its
"shifted" versions. Pre-computation factor = 1 means no pre-computation.
The GPU used for the benchmarks is a 3090Ti.
## TODOs and open questions
- Golang APIs are missing;
- I mentioned that to utilise pre-compute to its full potential we need
arbitrary choice of `c`. One issue with this is that pre-compute will
become dependent on `c`. For now this is not the case as `c` can only be
a power of 2 and powers of 2 can always share the same pre-computation.
So apparently we need to make `c` a parameter of the precompute function
to future-proof it from a breaking change. This is pretty unnatural and
counterintuitive, as `c` is typically chosen at runtime after pre-compute
is done, but I don't really see another way; please let me know if you do.
UPD: `c` has been added to the pre-compute function; for now it is unused, and
the documentation describes how this will change in the future.
Resolves https://github.com/ingonyama-zk/icicle/issues/147
Co-authored with @ChickenLover
---------
Co-authored-by: ChickenLover <romangg81@gmail.com>
Co-authored-by: nonam3e <timur@ingonyama.com>
Co-authored-by: nonam3e <71525212+nonam3e@users.noreply.github.com>
Co-authored-by: LeonHibnik <leon@ingonyama.com>
This PR adds the columns batch feature - enabling batch NTT computation
to be performed directly on the columns of a matrix without having to
transpose it beforehand, as requested in issue #264.
Some small fixes to the reordering kernels were also added, and some
unnecessary parameters were removed from function interfaces.
---------
Co-authored-by: DmytroTym <dmytrotym1@gmail.com>
## Describe the changes
This PR adds multi gpu support in the golang bindings.
The main changes are to DeviceSlice, which now includes a `deviceId`
attribute specifying which device the underlying data resides on, and
checks for the correct deviceId and current device when using DeviceSlices
in any operation.
In Go, most concurrency can be done via Goroutines (described as
lightweight threads - in reality, more of a threadpool manager),
however, there is no guarantee that a goroutine stays on a specific host
thread. Therefore, a function `RunOnDevice` was added to the
cuda_runtime package which locks a goroutine into a specific host
thread, sets a current GPU device, runs a provided function, and unlocks
the goroutine from the host thread after the provided function finishes.
While the goroutine is locked to the host thread, the Go runtime will
not assign other goroutines to that host thread.