All tests passed. :)
Operations:
- shl<0> A1, A2, B -> C1, C2
- shr<1> A1, A2, B -> C1, C2
- `A1` and `A2` are the 16-bit limbs of the 32-bit `A`, in little-endian
order. Likewise for `C1` and `C2`.
Implementation:
- We adopted an implementation similar to our prior shift machine, which
decomposes `A` into 4 bytes and looks each byte up in a lookup table of
`[A_byte, B (shift amount), block row, operation id]`, so the size of
the lookup table is `256 * 32 * 4 * 2 = 65536`. Each row looks up the
resulting byte after the bit shift, and the results are added together
to obtain `C`.
- In our design, instead of looking up a 32-bit `C` column, we look up
two 16-bit columns `C1` and `C2`. Overall, there are more witness
columns due to decomposing into 16-bit limbs in the main shift machine,
and one more fixed column in the lookup table, but the same number of
lookups is performed.
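The byte-decomposition scheme can be sketched in Python (names and the limb convention here are illustrative, not the actual column names):

```python
MASK32 = 0xFFFFFFFF

def table_value(a_byte, b, row, op):
    """Value stored in the lookup table for [A_byte, B, block row, operation id]."""
    v = a_byte << (8 * row)  # place the byte at its position within A
    return (v << b) & MASK32 if op == 0 else v >> b  # op 0 = shl, op 1 = shr

def shift(a, b, op):
    # Decompose A into 4 bytes, look each one up, and sum the results.
    c = sum(table_value((a >> (8 * r)) & 0xFF, b, r, op) for r in range(4)) & MASK32
    return c & 0xFFFF, c >> 16  # little-endian 16-bit limbs C1, C2
```

Since the bytes occupy disjoint bit ranges, the per-byte shifted contributions sum to the full shifted value with no cross-byte carries.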
Future optimization:
- There's ample room for "reshaping" the main machine to have more
columns but fewer rows, doing more lookups in each row so that we are
not just processing one byte per row. In the most aggressive case, we
could even process everything in a single row.
- For example, processing two bytes per row (instead of one) should
halve the number of rows while less than doubling the number of
columns, and therefore use fewer cells in total. This should benefit
provers whose time is linear in the number of cells.
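A back-of-the-envelope calculation of the cell-count argument (all column counts below are made up for illustration):

```python
def total_cells(n_ops, shared_cols, cols_per_byte, bytes_per_row):
    # 4 bytes per operation, processed bytes_per_row at a time
    rows = 4 * n_ops // bytes_per_row
    cols = shared_cols + bytes_per_row * cols_per_byte
    return rows * cols

one = total_cells(1024, 10, 3, 1)  # one byte per row
two = total_cells(1024, 10, 3, 2)  # two bytes per row
assert two < one  # half the rows, less than twice the columns
```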
---------
Co-authored-by: onurinanc <e191322@metu.edu.tr>
We currently hardcode the range of degrees that variable degree machines
are preprocessed for. Expose that in machines instead.
This changes pil namespaces to accept a min and max degree:
```
namespace main(123..456);
namespace main(5); // allowed for backward compatibility, translates to `5..5`
```
It adds two new builtins:
```
std::prover::min_degree
std::prover::max_degree
```
And sets the behavior of the `std::prover::degree` builtin to only
succeed if `min_degree` and `max_degree` are equal.
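A minimal Python model of the intended semantics (hypothetical, just to pin down the behavior):

```python
def degree(min_degree, max_degree):
    # std::prover::degree only succeeds if the range is a single size
    if min_degree != max_degree:
        raise ValueError("degree is ambiguous: min_degree != max_degree")
    return min_degree

assert degree(5, 5) == 5  # `namespace main(5)` translates to `5..5`
```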
Input for the syscall is one memory pointer to the state array of 25
Goldilocks field elements. Output is calculated in place.
All functions outside of keccakf are executed in Rust. We might need to
delete everything except keccakf from keccakf.asm (including the
padding, updating, and byte <-> u64 conversions).
Not tested. Should I wait until all the needed infrastructure is merged?
Technically, I could also write the following for the machine, and I
think I can test it already:
```
pol commit input_state[25];
pol commit output_state[25];
operation keccakf_permutation<0> input_state[0], input_state[1], ..., input_state[24] -> output_state[0], output_state[1], ..., output_state[24]
array::zip(input_state, output_state, |input_state, output_state| keccakf(input_state) - output_state = 0);
```
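Independently of the constraint formulation, the in-place calling convention can be sketched like this (hypothetical helper; `keccakf` itself is assumed to be provided elsewhere):

```python
def keccakf_syscall(mem, ptr, keccakf):
    # Read the 25-lane state starting at the pointer, permute, write back in place.
    state = [mem[ptr + i] for i in range(25)]
    for i, lane in enumerate(keccakf(state)):
        mem[ptr + i] = lane

# With the identity permutation, memory is unchanged:
mem = {100 + i: i for i in range(25)}
keccakf_syscall(mem, 100, lambda s: s)
assert mem[100 + 7] == 7
```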
Builds on #1687. Fixes #1572.
With this PR, we are using dynamic VADCOP in the RISC-V zk-VM.
There were a few smaller fixes needed to make this work. In summary, the
changes are as follows:
- We set the degree of the main machine to `None`, and all fixed lookup
machines to their appropriate sizes. As a consequence, the CPU, all
block machines & memory have a dynamic size.
- As a consequence, I had to adjust some tests (set the size of all
machines, so they can still be run with monolithic provers) *and* was
able to remove the `Memory_<size>` machines 🎉
- With the main machine being of flexible size, the prover can choose
how long to run it. We run it for `1 << (MAX_DEGREE_LOG - 2)` steps and
compute the bootloader inputs accordingly. With this choice, we can
guarantee that the register memory (which can be up to 4x larger than
the main machine) does not run out of rows.
Note that while we do access `MAX_DEGREE_LOG` in a bunch of places now,
this will go away once #1667 is merged, which will allow us to configure
the degree range in ASM and for each machine individually.
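The chunk-length choice can be sanity-checked with a little arithmetic (assuming, as stated above, that register memory grows by up to 4 rows per main-machine step):

```python
MAX_DEGREE_LOG = 18  # illustrative value
max_rows = 1 << MAX_DEGREE_LOG
main_steps = 1 << (MAX_DEGREE_LOG - 2)  # chunk length chosen by the prover
# Register memory can be up to 4x larger than the main machine, and still fits:
assert 4 * main_steps <= max_rows
```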
### Example:
```bash
export MAX_DEGREE_LOG=18
cargo run -r --bin powdr-rs compile riscv/tests/riscv_data/many_chunks -o output --continuations
cargo run -r --bin powdr-rs execute output/many_chunks.asm -o output --continuations -w
cargo run -r --features plonky3,halo2 prove output/many_chunks.asm -d output/chunk_0 --field gl --backend plonky3-composite
```
This leads to the following output:
```
== Proving machine: main (size 65536), stage 0
==> Proof stage computed in 1.918317417s
== Proving machine: main__rom (size 8192), stage 0
==> Proof stage computed in 45.847375ms
== Proving machine: main_binary (size 1024), stage 0
==> Proof stage computed in 27.718416ms
== Proving machine: main_bit2 (size 4), stage 0
==> Proof stage computed in 15.280667ms
== Proving machine: main_bit6 (size 64), stage 0
==> Proof stage computed in 17.449875ms
== Proving machine: main_bit7 (size 128), stage 0
==> Proof stage computed in 20.717834ms
== Proving machine: main_bootloader_inputs (size 262144), stage 0
==> Proof stage computed in 524.013375ms
== Proving machine: main_byte (size 256), stage 0
==> Proof stage computed in 17.280167ms
== Proving machine: main_byte2 (size 65536), stage 0
==> Proof stage computed in 164.709625ms
== Proving machine: main_byte_binary (size 262144), stage 0
==> Proof stage computed in 504.743917ms
== Proving machine: main_byte_compare (size 65536), stage 0
==> Proof stage computed in 169.881542ms
== Proving machine: main_byte_shift (size 65536), stage 0
==> Proof stage computed in 146.235916ms
== Proving machine: main_memory (size 32768), stage 0
==> Proof stage computed in 326.522167ms
== Proving machine: main_poseidon_gl (size 16384), stage 0
==> Proof stage computed in 1.324662625s
== Proving machine: main_regs (size 262144), stage 0
==> Proof stage computed in 2.009408667s
== Proving machine: main_shift (size 32), stage 0
==> Proof stage computed in 13.71825ms
== Proving machine: main_split_gl (size 16384), stage 0
==> Proof stage computed in 108.019334ms
Proof generation took 7.364567s
Proof size: 8432928 bytes
Writing output/chunk_0/many_chunks_proof.bin.
```
Note that `main_bootloader_inputs` still has the maximum size; we
should fix that in a follow-up PR!
Pulled out of #1683
Because our memory machine requires a `Byte2` machine to range-check
some values, we need to provide that machine in any test that uses
memory. But that machine needs a size of $2^{16}$, which is a lot for
just some simple tests.
Previously, the `Byte2` machine did not have a size set (this will
change in #1683), so it would just get the size of the main machine.
This works because in our simple tests, the values are small enough in
practice. But really, this breaks an assumption of the `Byte2` machine.
With this PR, this is explicit, with all the affected tests (that we
want to keep small) using `std::machines::test_util::FakeByte2` instead
of `std::machines::range::Byte2`. Note that this exploits the fact that
we're not type-checking the machine passed to memory. But the
alternative would be to copy the memory machine and use that in the
test.
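The assumption at play can be sketched in Python (illustrative only, not the actual machine code): a lookup into a Byte2-style table only proves values fit in 16 bits if the table actually has all 2^16 rows, and a smaller fake table happens to work while the looked-up values stay small.

```python
def range_check(values, table_size):
    # A Byte2-style machine is a fixed table of 0..table_size-1; a lookup
    # succeeds only if every looked-up value appears in the table.
    table = set(range(table_size))
    return all(v in table for v in values)

assert range_check([0, 5, 1000], 1 << 16)  # real Byte2: all 16-bit values
assert range_check([0, 5, 1000], 2048)     # a small fake table also passes...
assert not range_check([40000], 2048)      # ...but only while values stay small
```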
When creating the ROM machine, give it a degree:
- if the VM has a degree, pass the same degree to its ROM machine (this
is required to support the monolithic case where all degrees must match)
- otherwise, pass the length of the ROM
---------
Co-authored-by: Georg Wiese <georgwiese@gmail.com>
Fixes #1604
With this PR, we bypass machine detection during witness generation of
stages > 0. See [this
comment](https://github.com/powdr-labs/powdr/issues/1604#issuecomment-2257059636)
for a motivation.
This currently needs to be tested manually, as follows:
```
$ RUST_LOG=trace cargo run pil test_data/asm/block_to_block_with_bus.asm -o output -f --field bn254 --prove-with halo2-mock
...
===== Summary for row 7:
main.acc1 = 20713437912485111384541749944547180564950035591542371144095269313127123163196
main.acc2 = 5162472027861336027760332823162682203738251621730423286600997430635718406729
main.z = 3
main.res = 9
main_arith.acc1 = 1174804959354163837704655800710094523598328808873663199602934873448685332421
main_arith.acc2 = 16725770843977939194486072922094592884810112778685611057097206755940090088888
main_arith.acc1_next = 463668501342879563405020640323131794083013726819708055681247370540753473777
main_arith.acc2_next = 20043340305711842349747334022818855888193664738087810292789887394167185113571
main_arith.y = 1
main_arith.z = 1
main.dummy = 0
main.acc1_next = 0
main.acc2_next = 0
main_arith.x = 0
main_arith.sel[0] = 0
---------------------
...
```
Computing `main.acc1 + main_arith.acc1` and `main.acc2 +
main_arith.acc2` both yield
`21888242871839275222246405745257275088548364400416034343698204186575808495617`,
which is the BN254 scalar field prime! In other words, the partial
accumulators sum to 0.
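This can be re-checked directly with the values from the trace log:

```python
# BN254 scalar field prime:
P = 21888242871839275222246405745257275088548364400416034343698204186575808495617

main_acc1 = 20713437912485111384541749944547180564950035591542371144095269313127123163196
arith_acc1 = 1174804959354163837704655800710094523598328808873663199602934873448685332421
main_acc2 = 5162472027861336027760332823162682203738251621730423286600997430635718406729
arith_acc2 = 16725770843977939194486072922094592884810112778685611057097206755940090088888

# Both pairs of partial accumulators sum to 0 modulo the field prime:
assert main_acc1 + arith_acc1 == P
assert main_acc2 + arith_acc2 == P
```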
---------
Co-authored-by: Leo <leo@powdrlabs.com>
LogUp implementation using Bus
As @georgwiese suggested in
https://github.com/powdr-labs/powdr/issues/1573#issuecomment-2250875201
- That makes the machine detection simpler, because we wouldn't have
constraints that depend on both the LHS and RHS of a lookup.
- It's what we'll actually need for VADCOP, so maybe we don't even have
to support std::protocols::lookup as a first step.
Fixes #1190
Fixes https://github.com/powdr-labs/powdr/issues/1488
```
// Creates an intermediate column.
let x: inter = ...
// Same here, but expects an array on the RHS.
let x: inter[k] = ...
// Creates an intermediate column; this is printed from Analyzed.
col x = ...;
// Creates an array of intermediate columns; this is printed from Analyzed and is actually new syntax.
col x[k] = ...;
// Old syntax for intermediate columns; after the change, this just defines a "generic variable",
// essentially an intermediate column that is always inlined.
let x: expr = ...;
```
---------
Co-authored-by: Leo <leo@powdrlabs.com>
What we can have after this PR:
- only dynamic namespaces
- dynamic main, others dynamic or static
What we still cannot have:
- a static main with other machines dynamic
The change to `_linker_first_step` can be reverted after #1565
The previous version of link_merging.asm did not work correctly, since
the `operation sub<1>` implementation was wrong. This PR fixes the
`operation sub<1>` implementation.
I started this in order to test #1603. However, due to #1604, only one
machine has a bus so far. This should be fixed as part of #1604.
Then, what we would expect is that both machines have the same final
accumulator values, assuming challenges are shared across machines.
Fixes #1496
Also, a step towards #1572
This PR implements the steps needed in `CompositeBackend` to implement
dynamic VADCOP.
In summary:
- If a machine's size (a.k.a. "degree") is set to `None`, fixed columns
are computed for all powers of two in some hard-coded range. This fixes
#1572. As a result, machines with a size set to `None` are available in
multiple sizes. If the size is explicitly set by the user, the machine
is only available in that one size.
- Note that the ASM linker still sets the size of machines without a
size. So, currently, this can only happen when coming from PIL directly.
- `CompositeBackend` instantiates a new backend for each machine *and
size*:
- The verification key contains a key for each machine and size.
- When proving, it uses the backend of whatever size the witness has.
The size chosen is also stored in the proof.
- When verifying, the verification key of the reported size is used.
- Witness generation currently chooses the largest available size. This
will change in a future PR.
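A rough Python model of the fixed-column behavior (the range bounds here are made up; the actual hard-coded range lives in the backend):

```python
def fixed_column_sizes(f, size=None, min_log=5, max_log=10):
    """Materialize a fixed column f at one size, or at all powers of two in a range."""
    if size is not None:
        return {size: [f(i) for i in range(size)]}
    return {1 << k: [f(i) for i in range(1 << k)] for k in range(min_log, max_log + 1)}

# Explicit size: the machine is available in exactly that one size.
assert list(fixed_column_sizes(lambda i: i % 2, size=8)) == [8]
# Size None: available in every power-of-two size in the range.
assert list(fixed_column_sizes(lambda i: i % 2)) == [32, 64, 128, 256, 512, 1024]
```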
This is an example:
```
$ cargo run pil test_data/pil/vm_to_block_dynamic_length.pil -o output -f --field bn254 --prove-with halo2-mock-composite
...
== Proving machine: main (size 256)
==> Machine proof of 256 rows (0 bytes) computed in 209.101166ms
== Proving machine: main__rom (size 256)
==> Machine proof of 256 rows (0 bytes) computed in 226.87175ms
== Proving machine: main_arith (size 1024)
==> Machine proof of 1024 rows (0 bytes) computed in 432.807583ms
```
Related to the issue of implementing a basic bus (#1497), I have
implemented it together with an example (`permutation_via_bus.asm`), as
specified inside the issue.
Currently, `test_data/std/bus_permutation_via_challenges.asm` works as
intended. (To make it sound, the stage-1 witness columns need to be
exposed publicly, and the verifier needs to check that `out_z1 + out_z2
= 0`.) We can already check, using RUST_LOG=trace, that the final z1
and z2 sum to 0.
However, `test_data/std/bus_permutation_via_challenges_ext.asm` does
not yet work as intended. This will be fixed in the following commits.
With this PR:
- Simplifies the `build_machine_pil` function by simply re-parsing a
subset of the PIL. This has the extra advantage that all IDs are
consecutive, which enables supporting STARK backends as well!
- With this approach, all definitions have to be available, which makes
it necessary to merge `std` namespaces. Those are detected as
namespaces that don't define any column.
- Also, some backends expect at least one witness column and constraint,
so I added a dummy column & constraint for now.
- With this PR, we have everything done for static VadCop! So I added a
test & removed the `should_panic` in an existing test.
This PR splits from the main Trait implementation PR #1450 to simplify
the review process.
It includes only the parsing of the traits (not impls) and some
functionality necessary for the code to compile.
---------
Co-authored-by: chriseth <chris@ethereum.org>
Implements #1055 for the Poseidon machines. Pulled out of #1508.
Specifically, this PR adds a new `PoseidonGLMemory` machine which
receives 2 memory points and then:
- Reads 24 32-bit words and packs them into 12 field elements
- Computes the Poseidon permutation (just like `PoseidonGL`)
- For each of the 4 output field elements, it:
- Invokes the `SplitGL` machine to get the canonical `u64`
representation
- Writes the 8 32-bit words to memory at the provided memory pointer
The read and write memory regions can even overlap! 🎉
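The packing and splitting can be modeled like this (a simplified sketch; `split_gl` stands in for what the `SplitGL` machine computes):

```python
GL = 2**64 - 2**32 + 1  # Goldilocks prime

def pack(words):
    """Pack 24 32-bit words into 12 field elements (two words per element)."""
    assert len(words) == 24
    return [(words[2 * i] | (words[2 * i + 1] << 32)) % GL for i in range(12)]

def split_gl(fe):
    """Split the canonical u64 representation back into two 32-bit words."""
    assert 0 <= fe < GL
    return fe & 0xFFFFFFFF, fe >> 32

state = pack([7] * 24)
assert split_gl(state[0]) == (7, 7)
```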
This should simplify our RISC-V machine, as the syscall already expects
two memory pointers. We can simply pass them to the machine directly.
I started doing that in #1533, but I think it makes sense to wait until
#1443 is merged.
To test:
```
cargo run -r pil test_data/std/poseidon_gl_test.asm -o output -f --export-csv --prove-with estark-starky
```
I recommend reviewing the diff between
`std/machines/hash/poseidon_gl.asm` and
`std/machines/hash/poseidon_gl_memory.asm`
### Discussion
The overhead of the memory read / write is quite high (18 extra witness
columns, see [this
comment](40bdca4368/std/machines/hash/poseidon_gl_memory.asm (L13-L23)),
mostly because we now need to have the input available in all rows
(which previously was only the case for the outputs). If we had offsets
other than 0 and 1, this could be avoided. Doing 24 parallel memory
reads in the first row would *not* help, because we'd have to add 24
witness columns (instead of 2 now) to store the result of the memory
operation.
A few more notes:
- With Vadcop, 18 extra witness columns in a secondary machine is *a
lot* better than introducing more registers (either "regular" registers
or assignment registers) in the main machine
- As mentioned
[here](40bdca4368/std/machines/hash/poseidon_gl_memory.asm (L111-L113)),
we could get rid of two permutations if either:
- We were able to express explicitly that we want to call at most one
operation in the current row, or
- We had an optimizer that would be smart enough to batch the memory
reads and writes.
- We could also have just 1 read or write at a time (instead of 2), but
we'd have to increase the block size from 31 to 32 and the
implementation would be more complicated.
- We could also store the full final state of the Poseidon permutation,
instead of just the first 4 elements. This would need 8 more witness
columns to make the entire output available in all rows. Then, one
could use the machine to implement a Poseidon sponge.
- Looking at the bootloader, maybe it makes sense to pass 3 input
pointers instead of 1: One for the first 4 elements, one for the next 4,
and one for the capacity (often just a constant). For example, when
computing a Merkle root, you'd pass pointers for the two children hashes
and a pointer to the capacity constant.
Fixes #1494
- use cbor for witness and constant files (moving polygon serialisation
to the relevant backend)
- add `degree` field on `Symbol`, inherited from the namespace degree
- have each machine in witgen operate over its own degree
- fail in the backend if we have multiple degrees
Includes #1511, but also removes the declarations of the challenges in
the `permutation` and `lookup` modules.
With these changes, we never declare a column or challenge outside a
machine namespace. This greatly simplifies #1470.
---------
Co-authored-by: schaeff <thibaut@schaeff.fr>
Co-authored-by: chriseth <chris@ethereum.org>
Merge compatible `link`s into a single permutation/lookup.
We only consider merging links from different instructions, since only
a single instruction can be active at a time.
Links with next references are ignored due to a limitation in witgen (I
left a TODO so it's easily fixed once witgen supports them).
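The merging idea can be sketched as follows (illustrative Python; the real merge operates on `link` declarations in the linker): since at most one instruction flag is active per row, the flag-weighted sums select exactly one link's arguments.

```python
def merge_links(flags, args_per_link):
    # flags are mutually exclusive 0/1 instruction flags in a given row;
    # the merged lookup argument is the flag-weighted sum per position.
    assert sum(flags) <= 1
    return [sum(f * a for f, a in zip(flags, column)) for column in zip(*args_per_link)]

# Row where the second instruction is active: the merged lookup sees its args.
assert merge_links([0, 1], [[10, 20], [30, 40]]) == [30, 40]
```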
This PR builds on top of #1393.
It mainly modifies the grammar by changing the way SelectedExpressions
are declared, to allow blocks to be empty.
---------
Co-authored-by: chriseth <chris@ethereum.org>
Fixes#1493
Allow passing machines as argument when instantiating submachines, as
in:
```
use std::machines::binary::Binary;

machine Main with degree: 262144 {
    reg pc[@pc];
    reg X[<=];
    reg Y[<=];
    reg Z[<=];
    reg A;

    Binary binary;
    WithArg sub(binary);

    instr and X, Y -> Z ~ binary.and;
    instr or X, Y -> Z ~ binary.or;
    instr xor X, Y -> Z ~ binary.xor;
    ...
}

machine WithArg(bin: Binary) {
    reg pc[@pc];
    reg X[<=];
    reg Y[<=];
    reg Z[<=];
    reg A;
    reg B;

    instr and X, Y -> Z ~ bin.and;
    instr or X, Y -> Z ~ bin.or;
    instr xor X, Y -> Z ~ bin.xor;
    ...
}
```
This PR solves #1374
There are 3 examples included in the PR:
1. `lookup_via_challenges_ext_simple.asm` is the most basic example,
which implements {a} in {b} using the extension field, without
selectors.
2. `lookup_via_challenges_ext_no_selector.asm` implements {a1, a2} in
{b1, b2} using the extension field, without selectors.
3. `lookup_via_challenges_ext.asm` is a more complex example, which
implements {a1, a2, a3} in {b1, b2, b3} using the extension field (it
also handles tuples using Reed-Solomon fingerprinting). It also uses
different LHS and RHS selectors.
---------
Co-authored-by: Georg Wiese <georgwiese@gmail.com>
This PR adds a new memory machine to the standard library that also
supports a `mstore_bootloader` operation: It's like a normal `mstore`,
but the first access to every memory cell *has to* come from this
operation.
As a follow-up, we'll be able to use it in the RISC-V machine (#1462).
I think this is best reviewed by reviewing the diff to the normal memory
machine and comparing it to [the code we have currently inlined in the
RISC-V
machine](abbe26618f/riscv/src/code_gen.rs (L500-L553)):
```diff
3,5c3,7
< // A read/write memory, similar to that of Polygon:
< // https://github.com/0xPolygonHermez/zkevm-proverjs/blob/main/pil/mem.pil
< machine Memory with
---
> /// This machine is a slightly extended version of std::machines::memory::Memory,
> /// where in addition to mstore, there is an mstore_bootloader operation. It behaves
> /// just like mstore, except that the first access to each memory cell must come
> /// from the mstore_bootloader operation.
> machine MemoryWithBootloaderWrite with
7c9
< operation_id: m_is_write,
---
> operation_id: operation_id,
13a16
> operation mstore_bootloader<2> m_addr, m_step, m_value ->;
29a33
> col witness m_is_bootloader_write;
30a35,36
> std::utils::force_bool(m_is_bootloader_write);
> col operation_id = m_is_write + 2 * m_is_bootloader_write;
35a42
> (1 - is_mem_op) * m_is_bootloader_write = 0;
37,39c44,45
< // If the next line is a not a write and we have an address change,
< // then the value is zero.
< (1 - m_is_write') * m_change * m_value' = 0;
---
> // The first operation of a new address has to be a bootloader write
> m_change * (1 - m_is_bootloader_write') = 0;
41,42c47,52
< // change has to be 1 in the last row, so that a first read on row zero is constrained to return 0
< (1 - m_change) * LAST = 0;
---
> // m_change has to be 1 in the last row, so that the above constraint is triggered.
> // An exception to this when the last address is -1, which is only possible if there is
> // no memory operation in the entire chunk (because addresses are 32 bit unsigned).
> // This exception is necessary so that there can be valid assignment in this case.
> pol m_change_or_no_memory_operations = (1 - m_change) * (m_addr + 1);
> LAST * m_change_or_no_memory_operations = 0;
46c56
< (1 - m_is_write') * (1 - m_change) * (m_value' - m_value) = 0;
---
> (1 - m_is_write' - m_is_bootloader_write') * (1 - m_change) * (m_value' - m_value) = 0;
```
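A sketch of the intended semantics as a trace check (a Python model of the constraints, not the PIL itself):

```python
def check_trace(rows):
    """rows: (addr, is_write, is_bootloader_write, value), sorted by (addr, step)."""
    prev_addr, prev_value = None, None
    for addr, is_write, is_bw, value in rows:
        if addr != prev_addr:
            # The first access to every memory cell must be an mstore_bootloader.
            assert is_bw, "first access must come from mstore_bootloader"
        elif not (is_write or is_bw):
            # A read returns the last stored value.
            assert value == prev_value
        prev_addr, prev_value = addr, value

# Bootloader write, read, overwrite on one cell, then a new cell:
check_trace([(0, 0, 1, 5), (0, 0, 0, 5), (0, 1, 0, 7), (1, 0, 1, 9)])
```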
Allow VM instructions to use the `link` notation, unifying the way
machines are linked from VMs and block machines.
The previous syntax for "external instructions" is not allowed anymore;
the new `link` syntax should be used instead.
Makes the permutation argument sound on the Goldilocks field by
evaluating polynomials on the extension field introduced in #1310.
I also used the new `Constr::Permutation` variant!
A few test cases (also tested in CI):
#### No extension field
`cargo run pil test_data/std/permutation_via_challenges.asm -o output -f
--field bn254 --prove-with halo2-mock`
This still works and produces the same output as before, thanks to the
PIL evaluator removing multiplications by 0 etc:
```
col witness stage(1) z;
(std::protocols::permutation::is_first * (main.z - 1)) = 0;
((((1 - main.first_four) * ((std::protocols::permutation::beta1 - ((std::protocols::permutation::alpha1 * main.b1) + main.b2)) - 1)) + 1) * main.z') = (((main.first_four * ((std::protocols::permutation::beta1 - ((std::protocols::permutation::alpha1 * main.a1) + main.a2)) - 1)) + 1) * main.z);
```
#### With extension field
`cargo run pil test_data/std/permutation_via_challenges_ext.asm -o
output -f --field bn254 --prove-with halo2-mock`
The constraints are significantly more complex but seem correct to me:
```
col witness stage(1) z1;
col witness stage(1) z2;
(std::protocols::permutation::is_first * (main.z1 - 1)) = 0;
(std::protocols::permutation::is_first * main.z2) = 0;
(((((1 - main.first_four) * ((std::protocols::permutation::beta1 - ((std::protocols::permutation::alpha1 * main.b1) + main.b2)) - 1)) + 1) * main.z1') + ((7 * ((1 - main.first_four) * (std::protocols::permutation::beta2 - (std::protocols::permutation::alpha2 * main.b1)))) * main.z2')) = ((((main.first_four * ((std::protocols::permutation::beta1 - ((std::protocols::permutation::alpha1 * main.a1) + main.a2)) - 1)) + 1) * main.z1) + ((7 * (main.first_four * (std::protocols::permutation::beta2 - (std::protocols::permutation::alpha2 * main.a1)))) * main.z2));
((((1 - main.first_four) * (std::protocols::permutation::beta2 - (std::protocols::permutation::alpha2 * main.b1))) * main.z1') + ((((1 - main.first_four) * ((std::protocols::permutation::beta1 - ((std::protocols::permutation::alpha1 * main.b1) + main.b2)) - 1)) + 1) * main.z2')) = (((main.first_four * (std::protocols::permutation::beta2 - (std::protocols::permutation::alpha2 * main.a1))) * main.z1) + (((main.first_four * ((std::protocols::permutation::beta1 - ((std::protocols::permutation::alpha1 * main.a1) + main.a2)) - 1)) + 1) * main.z2));
```
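The shape of these constraints matches multiplication in the extension field Fp[X]/(X^2 - 7) (reading the literal 7 above as the non-residue; treat this as an assumption):

```python
def fp2_mul(a, b, p):
    # (a1 + a2*X) * (b1 + b2*X) mod (X^2 - 7), coefficients mod p
    a1, a2 = a
    b1, b2 = b
    return ((a1 * b1 + 7 * a2 * b2) % p, (a1 * b2 + a2 * b1) % p)

# X * X = 7, as expected:
assert fp2_mul((0, 1), (0, 1), 101) == (7, 0)
```

The two z' constraints above are exactly the two coordinates of such a product, with the selector-folded factors playing the roles of a and b.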
#### On Goldilocks
Running the first example on GL fails, because using the permutation
argument without the extension field would not be sound. The second
example works, but because we don't support challenges on GL yet, it
doesn't actually run the second-phase witness generation.
---------
Co-authored-by: chriseth <chris@ethereum.org>
This simulates one approach we could go for when moving registers to
memory. The memory machine remains completely unchanged, but the time
step is increased by more than 1 in each step of the main machine. This
way,
from the point of view of memory, all the memory operations happen at
different time steps, which allows for:
- Reading from the same address twice
- Writing to the same address that we read from (which from the point of
view of memory should happen *after* the read)
The only downside I see with this approach is that it makes the
differences of time steps between memory accesses bigger: before, the
difference was at most the degree; now it is some small constant times
the degree (in this example, 3). The way the memory machine is
currently built, the difference can be at most $2^{32} - 1$, so I think
this is fine in practice. E.g., for a degree-$2^{30}$ machine we could
do up to 4 parallel reads/writes.
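The bound can be checked numerically (constants are illustrative):

```python
degree_log = 30
accesses_per_row = 4  # step increased by 4 per main-machine row
# Largest possible time-step difference between two memory accesses:
max_step_diff = (1 << degree_log) * accesses_per_row - 1
assert max_step_diff <= 2**32 - 1  # fits the memory machine's difference bound
```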