Allows fixed columns to be evaluated in the "single step jit processor"
by pre-computing whether a fixed column has a constant value apart from
the first and last row.
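A minimal sketch of that pre-computation (the helper name and signature are made up, not the PR's actual code):

```rust
// Hypothetical helper: returns the constant value of a fixed column if all
// rows except possibly the first and last hold the same value.
fn constant_inner_value<T: Copy + PartialEq>(column: &[T]) -> Option<T> {
    // Take all rows except the first and last; require at least one such row.
    let inner = column.get(1..column.len().saturating_sub(1))?;
    let (&first, rest) = inner.split_first()?;
    rest.iter().all(|&v| v == first).then_some(first)
}
```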
While working on #2306, @Schaeff came across several bugs in
multiplicity witness generation. These went undetected because we
ignored multiplicities in the mock prover, which will be fixed by #2310.
With this PR, #2310 will be green.
The issue was that counting multiplicities inside
`Machine::process_plookup()` fails if the caller actually discards the
result. This happens in a few places, for example during our loop
optimization in the "dynamic machine".
With this PR, we instead have a centralized
`MultiplicityColumnGenerator` that counts multiplicities after the fact,
by going over each lookup, evaluating the two selected tuples on all
rows, and counting how often each element in the LHS appears in the RHS.
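A rough sketch of the counting step for a single lookup (names and signatures are illustrative, not the actual `MultiplicityColumnGenerator` interface):

```rust
use std::collections::HashMap;

// Illustrative only: for one lookup, count how often each RHS tuple is
// matched by an LHS tuple. The result is the multiplicity column for the RHS.
fn count_multiplicities<T: std::hash::Hash + Eq>(
    lhs_tuples: impl IntoIterator<Item = Vec<T>>,
    rhs_tuples: &[Vec<T>],
) -> Vec<u64> {
    // Index each RHS tuple by its row (assuming the RHS tuples are distinct).
    let index: HashMap<&Vec<T>, usize> = rhs_tuples
        .iter()
        .enumerate()
        .map(|(row, tuple)| (tuple, row))
        .collect();
    let mut multiplicities = vec![0u64; rhs_tuples.len()];
    for tuple in lhs_tuples {
        if let Some(&row) = index.get(&tuple) {
            multiplicities[row] += 1;
        }
    }
    multiplicities
}
```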
To measure the runtime of this, I ran:
```sh
export TEST=keccak
export POWDR_JIT_OPT_LEVEL=0
cargo run -r --bin powdr-rs compile riscv/tests/riscv_data/$TEST -o output --max-degree-log 18
cargo run -r --features plonky3,halo2 pil output/$TEST.asm -o output -f --field gl --linker-mode bus
```
I get the following profile on the server:
```
== Witgen profile (2554126 events)
32.4% ( 2.6s): Secondary machine 0: main_binary (BlockMachine)
23.1% ( 1.9s): Main machine (Dynamic)
12.7% ( 1.0s): Secondary machine 4: main_regs (DoubleSortedWitnesses32)
10.0% ( 809.9ms): FixedLookup
7.7% ( 621.1ms): Secondary machine 5: main_shift (BlockMachine)
5.6% ( 454.6ms): Secondary machine 2: main_poseidon_gl (BlockMachine)
3.8% ( 312.3ms): multiplicity witgen
3.8% ( 308.2ms): witgen (outer code)
0.6% ( 45.3ms): Secondary machine 1: main_memory (DoubleSortedWitnesses32)
0.4% ( 33.4ms): Secondary machine 6: main_split_gl (BlockMachine)
0.0% ( 8.0µs): Secondary machine 3: main_publics (WriteOnceMemory)
---------------------------
==> Total: 8.114630092s
```
So the cost of multiplicity witgen is ~4% of total witgen time. I'm sure
it can be optimized further, but I'd like to leave that to a future PR.
This is a first step toward generalizing the expression evaluator. For
now, it just merges the `TraceValues` and `GlobalValues` traits, since
having two of them seemed a bit unnecessary.
Eventually, I want to have a more generic `ExpressionWalker`, of which
`ExpressionEvaluator` would be a special case. That walker could also be
used to do things like:
- degree computation
- checking whether the expression (or any referenced intermediate) has a next reference

All of this should preserve the caching behavior, so that these
operations don't take exponentially long either (see the sketch below).
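As an illustration of that caching idea (using a toy expression type rather than powdr's `AlgebraicExpression`, so all names here are hypothetical), a degree computation that memoizes intermediate polynomials could look like this:

```rust
use std::collections::HashMap;

// Toy expression type for illustration only.
enum Expr {
    Column,              // a committed/fixed column reference, degree 1
    Intermediate(usize), // index into the list of intermediate definitions
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// Degree computation with memoization: each intermediate definition is
// visited at most once, so shared intermediates don't cause the exponential
// blow-up that full inlining would.
fn degree(expr: &Expr, intermediates: &[Expr], cache: &mut HashMap<usize, usize>) -> usize {
    match expr {
        Expr::Column => 1,
        Expr::Intermediate(id) => {
            if let Some(&d) = cache.get(id) {
                return d;
            }
            let d = degree(&intermediates[*id], intermediates, cache);
            cache.insert(*id, d);
            d
        }
        Expr::Add(a, b) => degree(a, intermediates, cache).max(degree(b, intermediates, cache)),
        Expr::Mul(a, b) => degree(a, intermediates, cache) + degree(b, intermediates, cache),
    }
}
```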
This PR moves the `ExpressionEvaluator` to `executor-utils` and
generalizes it such that it can evaluate an `AlgebraicExpression<T>` to
any type, not just `T`. This makes it possible to use the evaluator in
the backends. I used it in Plonky3 and Stwo, which leads to significant
code deletion.
The main feature of `ExpressionEvaluator` is that it handles
intermediate polynomials by caching during evaluation. This is cheaper
than using `Analyzed::identities_with_inlined_intermediate_polynomials`,
which might build exponentially large expressions. The Plonky3
implementation already did the same; the Stwo implementation previously
used `identities_with_inlined_intermediate_polynomials()` and now
handles intermediates properly.
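To illustrate the generalization, here is a minimal sketch with a toy expression type (the names and the trait are made up, not the actual `executor-utils` API): terminal values are resolved through a trait, so the caller picks the output type, e.g. concrete field elements during witgen or symbolic values in a backend like Plonky3 or Stwo.

```rust
use std::ops::{Add, Mul};

// Toy stand-in for AlgebraicExpression<T>; illustration only.
enum Expr<T> {
    Number(T),
    Column(usize),
    Add(Box<Expr<T>>, Box<Expr<T>>),
    Mul(Box<Expr<T>>, Box<Expr<T>>),
}

// The caller decides what numbers and column references evaluate to.
trait TerminalAccess<T, V> {
    fn number(&self, n: &T) -> V;
    fn column(&self, id: usize) -> V;
}

// The evaluator itself is generic over the output type V.
fn evaluate<T, V, A>(expr: &Expr<T>, access: &A) -> V
where
    V: Add<Output = V> + Mul<Output = V>,
    A: TerminalAccess<T, V>,
{
    match expr {
        Expr::Number(n) => access.number(n),
        Expr::Column(id) => access.column(*id),
        Expr::Add(a, b) => evaluate(a, access) + evaluate(b, access),
        Expr::Mul(a, b) => evaluate(a, access) * evaluate(b, access),
    }
}
```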
With this PR, we compute the later-stage witnesses per machine instead
of globally. This has two advantages:
- We're able to handle machines of different sizes
- We can parallelize later-stage witness generation (see the sketch below)
This affects the two backends that can deal with multiple machines in
the first place: `Plonky3Backend` and `CompositeBackend`.
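A rough sketch of the shape this takes (names and signatures are illustrative, not the actual backend code, and rayon is assumed for the parallelism):

```rust
use rayon::prelude::*;

// Illustrative only: later-stage witness columns are computed per machine,
// so machines of different sizes are handled independently and the work
// parallelizes across machines.
fn later_stage_witnesses<M, W>(
    machines: &[M],
    compute: impl Fn(&M) -> Vec<W> + Sync,
) -> Vec<Vec<W>>
where
    M: Sync,
    W: Send,
{
    machines.par_iter().map(|machine| compute(machine)).collect()
}
```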