### PR: Update Powdr's `stwo` Dependency and Align Toolchain
This PR updates Powdr's `stwo` dependency to the latest version, which
now uses the `nightly-2024-12-17` Rust toolchain. To ensure
compatibility, Powdr's toolchain has also been aligned with this new
nightly version.
As part of this update:
- Several modifications were made to address stricter rules and lints
introduced by the newer version of Clippy.
- System dependencies, including `uuid-dev` and `libgrpc++-dev`, were
added to resolve build and runtime issues brought about by updated
dependencies and toolchain requirements.
This adds a function to range constraints that splits a range constraint into two disjoint range constraints of roughly equal size whose combination equals the initial range constraint.
This function is planned to be used for branching in autowitjitgen.
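A minimal sketch of the splitting idea, using hypothetical names (this is not Powdr's actual `RangeConstraint` API): a contiguous range is bisected at its midpoint into two disjoint halves whose union is the original range.

```rust
// Illustrative sketch only — names and representation are hypothetical,
// not Powdr's actual range constraint type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Range {
    min: u64,
    max: u64, // inclusive
}

impl Range {
    /// Splits the range into two disjoint ranges of roughly equal size
    /// that together cover the original range.
    fn bisect(self) -> (Range, Range) {
        assert!(self.min < self.max, "cannot split a single-value range");
        let mid = self.min + (self.max - self.min) / 2;
        (
            Range { min: self.min, max: mid },
            Range { min: mid + 1, max: self.max },
        )
    }
}

fn main() {
    let (lo, hi) = Range { min: 0, max: 0xffff }.bisect();
    // The two halves are disjoint and cover the original range.
    assert_eq!(lo, Range { min: 0, max: 0x7fff });
    assert_eq!(hi, Range { min: 0x8000, max: 0xffff });
    println!("lo = {lo:?}, hi = {hi:?}");
}
```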
This PR puts together the pieces to run compile-time witgen for block
machines. There are still many cases where it doesn't work yet, in which
case it falls back to run-time solving. These cases should be fixed in
future PRs.
It also fixes two bugs:
- When multiplying two affine expressions, the case where one of them is
zero is now handled properly.
- `WitgenInference` now handles intermediate columns.
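To illustrate the first fix, here is a hedged sketch of affine-expression multiplication (a simplified one-variable stand-in, not Powdr's actual `AffineExpression`): the product of two affine expressions is affine only if one factor is constant, but a zero factor makes the product zero even when the other factor is non-constant.

```rust
// Simplified stand-in for an affine expression in one variable:
// either a constant, or coeff * X + offset. Not Powdr's actual type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Affine {
    Constant(i64),
    Linear { coeff: i64, offset: i64 },
}

fn mul(a: Affine, b: Affine) -> Option<Affine> {
    use Affine::*;
    match (a, b) {
        // The previously mishandled case: a zero factor yields zero
        // even if the other factor contains an unknown.
        (Constant(0), _) | (_, Constant(0)) => Some(Constant(0)),
        (Constant(c), Constant(d)) => Some(Constant(c * d)),
        (Constant(c), Linear { coeff, offset }) | (Linear { coeff, offset }, Constant(c)) => {
            Some(Linear { coeff: coeff * c, offset: offset * c })
        }
        // linear * linear would be quadratic — not representable as affine.
        (Linear { .. }, Linear { .. }) => None,
    }
}

fn main() {
    let x = Affine::Linear { coeff: 3, offset: 1 };
    assert_eq!(mul(Affine::Constant(0), x), Some(Affine::Constant(0)));
    assert_eq!(mul(Affine::Constant(2), x), Some(Affine::Linear { coeff: 6, offset: 2 }));
    assert_eq!(mul(x, x), None);
    println!("ok");
}
```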
Note that this PR could slow down witgen, because in block machines it
attempts to compile code once per incoming connection and input/output
combination. I think this overhead should be negligible, though, and it
means that much of the new pipeline is already exercised by the tests
and elsewhere.
# Benchmark results
I tested the code with different opt levels on a benchmark that computes
ca. $2^{16}$ Poseidon hashes.
## Baseline
```
== Witgen profile (393220 events)
93.0% ( 30.8s): Secondary machine 0: main_poseidon (BlockMachine)
4.1% ( 1.4s): witgen (outer code)
2.3% ( 750.8ms): Main machine (Dynamic)
0.6% ( 204.4ms): FixedLookup
0.0% ( 3.2µs): range constraint multiplicity witgen
---------------------------
==> Total: 33.109672458s
```
## JIT (opt level 1)
```
== Witgen profile (393222 events)
52.3% ( 7.7s): JIT-compilation
32.0% ( 4.7s): Secondary machine 0: main_poseidon (BlockMachine)
9.2% ( 1.3s): witgen (outer code)
5.1% ( 748.3ms): Main machine (Dynamic)
1.4% ( 213.5ms): FixedLookup
0.0% ( 417.0ns): range constraint multiplicity witgen
---------------------------
==> Total: 14.729149333s
```
## JIT (opt level 3)
```
== Witgen profile (393222 events)
94.6% ( 107.9s): JIT-compilation
3.4% ( 3.9s): Secondary machine 0: main_poseidon (BlockMachine)
1.1% ( 1.3s): witgen (outer code)
0.7% ( 746.5ms): Main machine (Dynamic)
0.2% ( 204.1ms): FixedLookup
0.0% ( 542.0ns): range constraint multiplicity witgen
---------------------------
==> Total: 114.036571291s
```
Add a function to the Machine trait that can be used to determine if a
submachine call with certain known inputs and certain range constraints
can be fully processed by the called machine.
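A hypothetical shape of such a trait method (the actual name and signature in Powdr may differ; this only illustrates the idea):

```rust
// Illustrative sketch — not Powdr's actual Machine trait.
trait Machine {
    /// Returns true if a call where exactly the inputs flagged in
    /// `known_inputs` are known can be answered entirely by this
    /// machine at compile time.
    fn can_process_call(&self, known_inputs: &[bool]) -> bool;
}

/// A lookup-style machine that needs all of its inputs to be known.
struct FixedLookup {
    num_inputs: usize,
}

impl Machine for FixedLookup {
    fn can_process_call(&self, known_inputs: &[bool]) -> bool {
        known_inputs.len() == self.num_inputs && known_inputs.iter().all(|&k| k)
    }
}

fn main() {
    let m = FixedLookup { num_inputs: 2 };
    assert!(m.can_process_call(&[true, true]));
    assert!(!m.can_process_call(&[true, false]));
    println!("ok");
}
```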
---------
Co-authored-by: Georg Wiese <georgwiese@gmail.com>
This is mostly a reduced copy of the Goldilocks implementation we
already have, with the main difference that the division tries to perform
integer division first if it can be done without remainder.
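A sketch of that division strategy over a prime field (the Mersenne31 prime is used here purely as an example modulus): try exact integer division first, and only fall back to multiplying by the modular inverse when there is a remainder.

```rust
// Illustrative sketch; the example modulus is the Mersenne31 prime.
const P: u64 = (1 << 31) - 1;

fn mul_mod(a: u64, b: u64) -> u64 {
    (a as u128 * b as u128 % P as u128) as u64
}

/// Modular exponentiation by squaring.
fn pow_mod(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul_mod(acc, base);
        }
        base = mul_mod(base, base);
        exp >>= 1;
    }
    acc
}

/// Field division that prefers cheap integer division.
fn div(a: u64, b: u64) -> u64 {
    assert!(b != 0 && a < P && b < P);
    if a % b == 0 {
        a / b // exact — no field inversion needed
    } else {
        // Fermat's little theorem: b^(P-2) is the inverse of b mod P.
        mul_mod(a, pow_mod(b, P - 2))
    }
}

fn main() {
    assert_eq!(div(10, 2), 5);
    // 1 / 2 has no exact integer quotient; verify it via multiplication.
    assert_eq!(mul_mod(div(1, 2), 2), 1);
    println!("ok");
}
```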
This fixes a bug in witjitgen where the bit operations used for masking
in bit-decomposition would use field elements as the type for the masks.
The problem is that we sometimes mask using masks outside the field, and
the proper type for masks is `FieldElement::Integer`.
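A small illustration of the issue (the Goldilocks modulus is shown for context; the values are made up): bit-decomposition masks live in the integer domain, and e.g. the all-ones 64-bit mask exceeds the field modulus, so typing masks as field elements is wrong.

```rust
fn main() {
    // Goldilocks modulus, for context: 2^64 - 2^32 + 1.
    const GOLDILOCKS_MODULUS: u64 = 0xffff_ffff_0000_0001;

    // The all-ones 64-bit mask exceeds the modulus, so it cannot be a
    // canonical field element — it must be typed as a plain integer
    // (FieldElement::Integer in Powdr's terms).
    const FULL_MASK: u64 = u64::MAX;
    assert!(FULL_MASK >= GOLDILOCKS_MODULUS);

    // Bit decomposition via integer masks: split into two 32-bit limbs.
    let value: u64 = 0x0123_4567_89ab_cdef;
    let low = value & 0xffff_ffff;
    let high = (value >> 32) & 0xffff_ffff;
    assert_eq!((high << 32) | low, value);
    println!("low = {low:#x}, high = {high:#x}");
}
```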
This PR adds a basic block machine processor. Currently, we assume a
rectangular block shape and just iterate over all rows and identities of
the block until no more progress is made. This is sufficient to generate
code for Poseidon.
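The iteration strategy can be sketched as follows (a toy model, not the actual processor): sweep over all rows and identities of the block, apply whatever can be derived, and stop once a full sweep makes no progress.

```rust
/// Toy model of the block processor's loop: `try_process(row, identity)`
/// returns whether it derived anything new; we sweep over the whole
/// block until one full pass makes no progress.
fn iterate_to_fixed_point<F>(rows: usize, identities: usize, mut try_process: F) -> usize
where
    F: FnMut(usize, usize) -> bool,
{
    let mut sweeps = 0;
    loop {
        sweeps += 1;
        let mut progress = false;
        for row in 0..rows {
            for id in 0..identities {
                progress |= try_process(row, id);
            }
        }
        if !progress {
            return sweeps;
        }
    }
}

fn main() {
    // Example: "identity" i makes cell i+1 known once cell i is known,
    // starting from a single known cell.
    let mut known = vec![true, false, false, false];
    let sweeps = iterate_to_fixed_point(1, 3, |_row, id| {
        if known[id] && !known[id + 1] {
            known[id + 1] = true;
            true
        } else {
            false
        }
    });
    assert!(known.iter().all(|&k| k));
    println!("solved in {sweeps} sweeps");
}
```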
Formats a vector of "effects" coming from the witgen solver into Rust
code, then compiles and loads it.
Submachine calls and receiving arguments will be done in another PR.
This code assumes `known` to be a padded bit vector (#2230).
---------
Co-authored-by: Georg Wiese <georgwiese@gmail.com>
A few preparations for #2226:
- Extracted a `test_utils` module
- Introduced a new `Variable` type, which can refer to either a cell or
a "parameter" (either input or output of a machine call). I think in the
future, we could have more variants (e.g. scalar publics). `Variable` is
now used instead of `Cell` in `WitgenInference`.
- `WitgenInference::process_identity` now also returns whether any
progress has been made.
- Renamed `lookup` -> `machine_call` when rendering `Effect`s
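The new `Variable` type can be pictured roughly like this (the field names are illustrative, not the actual definition):

```rust
// Rough sketch of the Variable type described above; the actual fields
// in Powdr differ.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
enum Variable {
    /// A cell in the trace, identified by column ID and row offset.
    Cell { column: u64, row_offset: i32 },
    /// An input or output of a machine call.
    Parameter { identity: u64, index: usize },
}

fn main() {
    let vars = [
        Variable::Cell { column: 3, row_offset: 0 },
        Variable::Parameter { identity: 7, index: 1 },
    ];
    for v in &vars {
        println!("{v:?}");
    }
}
```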
This PR adds a component that can derive assignments and other code from
identities across multiple rows. It keeps track of which cells in the
trace are already known and which are not. Access to fixed rows is
abstracted because the component has no concept of an absolute row.
While this might work for block machines with cyclic fixed columns, it
does not work in the general case.
What it does not do:
- determine a sequence of which identities to consider on which rows
- provide a mechanism that determines when it is finished
---------
Co-authored-by: Georg Wiese <georgwiese@gmail.com>
This PR performs preliminary preparations in the block machine so that
it will be able to JIT-compile and evaluate lookups into this machine
given a certain combination of "known inputs".
---------
Co-authored-by: Georg Wiese <georgwiese@gmail.com>
This module is an equivalent of the existing affine_expression.rs, but
for compile-time execution on symbolic values instead of run-time
execution on concrete values.
Using the operators defined on that type, you can build a
SymbolicAffineExpression from a polynomial identity and then use
`.solve()` to try to solve for one unknown variable. The result (instead
of a concrete assignment as in affine_expression.rs) is a
SymbolicExpression, i.e. a complex expression involving variables
(assumed to have a concrete value known at run time), constants and
certain operators on them.
The idea is to apply SymbolicAffineExpression to polynomial identities
in turn, solving for one cell of the trace after another. The resulting
SymbolicExpression can be translated to Rust or PIL.
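The solving step can be illustrated like this (a toy expression type, not the module's actual one): for an identity `coeff * X + offset = 0` with symbolic `coeff` and `offset`, solving for `X` produces a symbolic expression rather than a concrete value.

```rust
// Toy symbolic expression type — illustrative only.
#[derive(Clone, Debug, PartialEq)]
enum Sym {
    /// A variable whose concrete value is only known at run time.
    Var(&'static str),
    Neg(Box<Sym>),
    Div(Box<Sym>, Box<Sym>),
}

/// Solve `coeff * X + offset = 0` for X, symbolically: X = -offset / coeff.
fn solve_affine(coeff: Sym, offset: Sym) -> Sym {
    Sym::Div(Box::new(Sym::Neg(Box::new(offset))), Box::new(coeff))
}

fn main() {
    let x = solve_affine(Sym::Var("a"), Sym::Var("b"));
    // X = (-b) / a — an expression over run-time-known variables,
    // which could later be translated to Rust or PIL.
    assert_eq!(
        x,
        Sym::Div(
            Box::new(Sym::Neg(Box::new(Sym::Var("b")))),
            Box::new(Sym::Var("a"))
        )
    );
    println!("{x:?}");
}
```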
This PR refactors a few things:
- `process_lookup_direct` no longer has a default implementation.
Eventually, we want all machines to implement it, so I figured it would
be better to explicitly panic in each machine.
- Refactored the implementation of
`FixedLookupMachine::process_plookup`, pulling some stuff out into a new
`CallerData` struct. This is similar to what @chriseth has done on
[`call_jit_from_block`](https://github.com/powdr-labs/powdr/compare/main...call_jit_from_block),
see the comment below.
- As a first test, I implemented `process_lookup_direct` for the
"large"-field memory machine (and `process_plookup` by wrapping
`process_lookup_direct`)
This PR:
- Removes the `acc_next` columns, which were only needed because of a
limitation of prover functions. The prover function that existed is now
removed entirely, because we use the hand-written witgen anyway, see
#2191.
- Materializes the folded tuple. This lowers the degree of the
constraints if the tuples being sent have a degree > 1. It also
enables next references in the tuple being sent.
As a result, we can now generate Plonky3 proofs with a bus!
```bash
cargo run -r --features plonky3 --bin powdr-rs compile riscv/tests/riscv_data/keccak -o output --max-degree-log 18 --field gl
cargo run -r --features plonky3 pil output/keccak.asm -o output -f --field gl --prove-with plonky3 --linker-mode bus
```
The proof generation takes 8.32s (of which 394ms are spent on generating
the second-stage witness). This compares to 2.07s proof time without a
bus.
Builds on #2194 and #2183.
This PR gives us (relatively) fast witness generation for the bus, by
writing custom code instead of relying on the generic solver + prover
functions:
```
$ cargo run -r --features plonky3 --bin powdr-rs compile riscv/tests/riscv_data/keccak -o output --max-degree-log 18 --field gl
$ cargo run -r --features plonky3 pil output/$TEST.asm -o output -f --field gl --prove-with mock --linker-mode bus
...
Running main machine for 262144 rows
[00:00:05 (ETA: 00:00:05)] █████████░░░░░░░░░░░ 48% - 24283 rows/s, 3169k identities/s, 92% progress
Found loop with period 1 starting at row 127900
[00:00:05 (ETA: 00:00:00)] ████████████████████ 100% - 151125 rows/s, 16170k identities/s, 100% progress
Witness generation took 5.748081s
Writing output/commits.bin.
Backend setup for mock...
Setup took 0.54769236s
Generating later-stage witnesses took 0.29s
Proof generation took 2.0383847s
```
On `main`, second-stage witgen for the main machine alone takes about 5
minutes.
This PR:
- Renames the current `executor::witgen::ExpressionEvaluator` to
`executor::witgen::evaluators::partial_expression_evaluator::PartialExpressionEvaluator`
- It is used when solving and evaluates to an
`AffineResult<AlgebraicVariable<'a>, T>`, which might still contain
unknown variables.
- Adds a new `ExpressionEvaluator` that simply evaluates to `T`
- Changes `MockBackend` to use the new `ExpressionEvaluator` (previously
wrapped what is now called the `PartialExpressionEvaluator`)
As a result, the code in `MockBackend` can be simplified. Also, I'm
building on this in #2191 for fast witness generation for the bus.
This PR adds a `Constr::PhantomBusInteraction` variant. For now, it is
ignored - if users want to use a bus, they need to express this in terms
of phantom lookups / permutations as before this PR.
I added a few `TODO(bus_interaction)` and opened #2184 to track support
for phantom bus interactions.
One use case it could serve even before that, though, is to trigger
"hand-written" witness generation for the bus, as discussed in the chat.
Builds on #2169
With this PR, second-stage witness generation works for the bus used in
the RISC-V machine 🎉
This is an end-to-end test:
```bash
cargo run -r --bin powdr-rs compile riscv/tests/riscv_data/sum -o output --max-degree-log 15 --field gl
cargo run -r pil output/sum.asm -o output -f --field gl --prove-with mock --linker-mode bus -i 1,1,1
```
What's needed are two small changes to `VmProcessor`:
- The degree is now passed by the caller (`DynamicMachine` or
`SecondStageMachine`). That way, `SecondStageMachine` can set it to the
actual final size, instead of the maximum allowed degree.
- I disabled loop detection for second-stage witness generation for now.
Cherry-picked from #2174
With this PR, we run all prover functions in parallel when solving for
the witness in `VmProcessor`. Interestingly, this didn't require any
changes to the order in which things are done: We already ran the
functions independently and applied the combined updates. So, this is a
classic map-reduce.
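The map-reduce structure can be sketched with scoped threads (Powdr uses a parallel iterator in practice; the types and names here are illustrative):

```rust
/// Illustrative sketch: each "prover function" independently computes a
/// set of (column, value) updates (map), which are then combined into
/// one update set (reduce). Not Powdr's actual code.
fn run_all<F>(funcs: &[F]) -> Vec<(usize, u64)>
where
    F: Fn() -> Vec<(usize, u64)> + Sync,
{
    let mut updates: Vec<(usize, u64)> = std::thread::scope(|s| {
        // map: run every prover function on its own thread
        let handles: Vec<_> = funcs.iter().map(|f| s.spawn(move || f())).collect();
        // reduce: join the threads and combine all returned updates
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    });
    updates.sort_unstable();
    updates
}

fn main() {
    let funcs: Vec<Box<dyn Fn() -> Vec<(usize, u64)> + Sync>> = vec![
        Box::new(|| vec![(0, 10)]),
        Box::new(|| vec![(1, 20), (2, 30)]),
    ];
    let updates = run_all(&funcs);
    assert_eq!(updates, vec![(0, 10), (1, 20), (2, 30)]);
    println!("combined {} updates", updates.len());
}
```

This only works because the functions were already run independently, with the combined updates applied afterwards, as described above.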
I think this change always makes sense, but is especially useful for the
prover functions we have to set bus accumulator values. For example, in
our RISC-V machine, the main machine has ~30 bus interactions, with a
fairly expensive prover function for each.
When used on top of #2173 and #2175, this accelerates second-stage
witness generation for the main machine from ~10s to ~6s for the example
mentioned in #2173.
This introduces a new machine which is always used for the second-stage
witness generation. Currently it is a copy of DynamicMachine, but in the
end it will be optimized for second-stage witness generation.
With this PR, we get much better error messages for connection errors.
This is an example of an issue we're currently debugging:
```
$ cargo run -r pil test_data/std/keccakf16_memory_test.asm -o output -f --prove-with mock --export-witness-csv
...
Errors in 50 / 213 connections:
Connection failed between main_keccakf16_memory and main_memory:
main_keccakf16_memory::sel[0] * main_keccakf16_memory::step_flags[0] $ [0, main_keccakf16_memory::input_addr_h, main_keccakf16_memory::input_addr_l, main_keccakf16_memory::time_step, main_keccakf16_memory::preimage[3], main_keccakf16_memory::preimage[2]] is main_memory::selectors[2] $ [main_memory::m_is_write, main_memory::m_addr_high, main_memory::m_addr_low, main_memory::m_step_high * 65536 + main_memory::m_step_low, main_memory::m_value1, main_memory::m_value2];
The following tuples appear in main_memory, but not in main_keccakf16_memory:
Row 24: (0, 0, 0, 32, 0, 0)
Connection failed between main_keccakf16_memory and main_memory:
main_keccakf16_memory::sel[0] * main_keccakf16_memory::step_flags[0] $ [0, main_keccakf16_memory::addr_h[0], main_keccakf16_memory::addr_l[0], main_keccakf16_memory::time_step, main_keccakf16_memory::preimage[1], main_keccakf16_memory::preimage[0]] is main_memory::selectors[3] $ [main_memory::m_is_write, main_memory::m_addr_high, main_memory::m_addr_low, main_memory::m_step_high * 65536 + main_memory::m_step_low, main_memory::m_value1, main_memory::m_value2];
The following tuples appear in main_memory, but not in main_keccakf16_memory:
Row 24: (0, 0, 4, 32, 0, 0)
Connection failed between main_keccakf16_memory and main_memory:
main_keccakf16_memory::sel[0] * main_keccakf16_memory::step_flags[0] $ [0, main_keccakf16_memory::addr_h[1], main_keccakf16_memory::addr_l[1], main_keccakf16_memory::time_step, main_keccakf16_memory::preimage[7], main_keccakf16_memory::preimage[6]] is main_memory::selectors[4] $ [main_memory::m_is_write, main_memory::m_addr_high, main_memory::m_addr_low, main_memory::m_step_high * 65536 + main_memory::m_step_low, main_memory::m_value1, main_memory::m_value2];
The following tuples appear in main_memory, but not in main_keccakf16_memory:
Row 24: (0, 0, 8, 32, 0, 0)
Connection failed between main_keccakf16_memory and main_memory:
main_keccakf16_memory::sel[0] * main_keccakf16_memory::step_flags[0] $ [0, main_keccakf16_memory::addr_h[2], main_keccakf16_memory::addr_l[2], main_keccakf16_memory::time_step, main_keccakf16_memory::preimage[5], main_keccakf16_memory::preimage[4]] is main_memory::selectors[5] $ [main_memory::m_is_write, main_memory::m_addr_high, main_memory::m_addr_low, main_memory::m_step_high * 65536 + main_memory::m_step_low, main_memory::m_value1, main_memory::m_value2];
The following tuples appear in main_memory, but not in main_keccakf16_memory:
Row 24: (0, 0, 12, 32, 0, 1)
Connection failed between main_keccakf16_memory and main_memory:
main_keccakf16_memory::sel[0] * main_keccakf16_memory::step_flags[0] $ [0, main_keccakf16_memory::addr_h[3], main_keccakf16_memory::addr_l[3], main_keccakf16_memory::time_step, main_keccakf16_memory::preimage[11], main_keccakf16_memory::preimage[10]] is main_memory::selectors[6] $ [main_memory::m_is_write, main_memory::m_addr_high, main_memory::m_addr_low, main_memory::m_step_high * 65536 + main_memory::m_step_low, main_memory::m_value1, main_memory::m_value2];
The following tuples appear in main_memory, but not in main_keccakf16_memory:
Row 24: (0, 0, 16, 32, 0, 0)
... and 45 more errors
thread 'main' panicked at cli/src/main.rs:727:14:
called `Result::unwrap()` on an `Err` value: ["Constraint check failed"]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```
With this PR, we compute the later-stage witnesses per machine instead
of globally. This has two advantages:
- We're able to handle machines of different sizes
- We can parallelize later-stage witness generation
This affects the two backends that can deal with multiple machines in
the first place: `Plonky3Backend` and `CompositeBackend`.
Multiplicity columns of *global* range constraints used to end up in the
main machine, which is a problem if it has a length different from the
range check machine.
Prepares #2129
With this PR, later-stage witness columns & identities referencing them
(or later-stage challenges) are completely ignored. The columns are not
assigned to any machine. Previously, they would end up in the main
machine and never receive any updates. That doesn't work if machines
have different sizes, though.
---------
Co-authored-by: Thibaut Schaeffer <schaeffer.thibaut@gmail.com>
MutableState is the main way to get access to sub-machines during
witness generation. We used to create copies of MutableState for each
lookup, extracting the "current machine" where we needed mutable access
and creating copies of references to the other machines. This caused an
allocation for each lookup (including fixed lookups, I think), which is
bad for performance.
This PR changes the approach to use RefCell instead: MutableState now
owns the machines and for each call to a machine, we mutably borrow that
machine. The RefCell mechanism ensures that there are no recursive calls
to machines and also avoids allocations.
I also changed the query callback to use non-mut references.
The first and third commits only move code around.
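The RefCell approach described above can be sketched like this (illustrative types, not the actual implementation): `MutableState` owns the machines, and each call mutably borrows exactly one machine through a shared reference, with no allocation; a recursive borrow of the same machine would panic at run time, enforcing the no-recursive-calls invariant.

```rust
use std::cell::RefCell;

// Illustrative machine: just counts how often it was called.
struct Machine {
    calls: usize,
}

impl Machine {
    fn process_call(&mut self) {
        self.calls += 1;
    }
}

struct MutableState {
    machines: Vec<RefCell<Machine>>,
}

impl MutableState {
    /// Mutable access to one machine through a shared reference to
    /// MutableState — no copies, no allocations per call.
    fn call(&self, machine_index: usize) {
        self.machines[machine_index].borrow_mut().process_call();
    }
}

fn main() {
    let state = MutableState {
        machines: vec![
            RefCell::new(Machine { calls: 0 }),
            RefCell::new(Machine { calls: 0 }),
        ],
    };
    state.call(0);
    state.call(0);
    state.call(1);
    assert_eq!(state.machines[0].borrow().calls, 2);
    assert_eq!(state.machines[1].borrow().calls, 1);
    println!("ok");
}
```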