While working on #2306, @Schaeff came across several bugs in
multiplicity witness generation. These went undetected because we
ignored multiplicities in the mock prover, which #2310 fixes.
With this PR, #2310 will be green.

The issue was that counting multiplicities inside
`Machine::process_plookup()` yields wrong counts if the caller
discards the result. This happens in a few places, for example during
our loop optimization in the "dynamic machine".
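
To make the failure mode concrete, here is a minimal sketch (not powdr code). `EagerMachine` and `process_lookup` are hypothetical stand-ins for a machine that increments multiplicities inside the lookup call, and `u64` pairs stand in for field elements. A caller that evaluates a lookup speculatively and then discards it still leaves its count behind:

```rust
/// Hypothetical stand-in for a machine that counts multiplicities eagerly,
/// i.e. inside the lookup call itself (this is NOT the actual powdr type).
struct EagerMachine {
    /// The RHS table: one `(key, value)` tuple per row.
    table: Vec<(u64, u64)>,
    /// Multiplicity per RHS row, incremented on every lookup call.
    multiplicities: Vec<u64>,
}

impl EagerMachine {
    fn new(table: Vec<(u64, u64)>) -> Self {
        let rows = table.len();
        Self { table, multiplicities: vec![0; rows] }
    }

    /// Answers a lookup and bumps the matching row's count right away,
    /// even if the caller later throws the answer away.
    fn process_lookup(&mut self, key: u64) -> Option<u64> {
        let row = self.table.iter().position(|&(k, _)| k == key)?;
        self.multiplicities[row] += 1;
        Some(self.table[row].1)
    }
}

fn main() {
    let mut machine = EagerMachine::new(vec![(1, 10), (2, 20)]);
    // A speculative call whose result is discarded (as an optimization
    // that tries something and backs out might do) ...
    let _discarded = machine.process_lookup(1);
    // ... followed by the call whose result actually ends up in the trace.
    assert_eq!(machine.process_lookup(1), Some(10));
    // Key 1 is used once in the final trace, but row 0 was counted twice.
    assert_eq!(machine.multiplicities, vec![2, 0]);
}
```
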

With this PR, we instead have a centralized
`MultiplicityColumnGenerator` that counts multiplicities after the fact:
for each lookup, it evaluates the two selected tuples on all rows and
counts, for each RHS row, how often its tuple appears in the LHS.
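
A minimal sketch of this after-the-fact counting, assuming a simplified data model: `count_multiplicities` is a hypothetical function, not the actual `MultiplicityColumnGenerator` API, `u64` vectors stand in for field-element tuples, and `None` marks rows where the selector is off. It indexes the selected RHS tuples in a `HashMap` and, for each selected LHS tuple, increments the multiplicity of the matching RHS row:

```rust
use std::collections::HashMap;

/// A tuple of values; plain `u64`s stand in for field elements here.
type Tuple = Vec<u64>;

/// Hypothetical post-hoc multiplicity counting for one lookup, given the
/// already-computed trace. `lhs_rows` / `rhs_rows` hold `Some(tuple)` on rows
/// where the respective selector is active and `None` elsewhere.
/// Returns one multiplicity per RHS row.
fn count_multiplicities(lhs_rows: &[Option<Tuple>], rhs_rows: &[Option<Tuple>]) -> Vec<u64> {
    // Index the selected RHS tuples by value so each LHS lookup is O(1).
    let mut rhs_index: HashMap<&Tuple, usize> = HashMap::new();
    for (row, tuple) in rhs_rows.iter().enumerate() {
        if let Some(t) = tuple {
            // If a tuple appears on several RHS rows, count it on the first one.
            rhs_index.entry(t).or_insert(row);
        }
    }

    let mut multiplicities = vec![0u64; rhs_rows.len()];
    // `flatten` skips rows where the LHS selector is off.
    for tuple in lhs_rows.iter().flatten() {
        let row = rhs_index
            .get(tuple)
            .expect("LHS tuple not found in RHS: lookup is not satisfied");
        multiplicities[*row] += 1;
    }
    multiplicities
}

fn main() {
    // RHS: three selected rows; row 3 is not selected.
    let rhs = vec![Some(vec![0, 1]), Some(vec![1, 2]), Some(vec![2, 3]), None];
    // LHS: looks up (1, 2) twice and (2, 3) once; row 2 is not selected.
    let lhs = vec![Some(vec![1, 2]), Some(vec![2, 3]), None, Some(vec![1, 2])];
    let m = count_multiplicities(&lhs, &rhs);
    assert_eq!(m, vec![0, 2, 1, 0]);
    println!("{m:?}"); // prints [0, 2, 1, 0]
}
```

Because the counts are derived from the final trace, a lookup that was evaluated and then discarded during witgen never contributes to them. Building the index first keeps the work linear in the number of rows instead of quadratic.
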

To measure the runtime of the new multiplicity witgen, I ran:
```sh
export TEST=keccak
export POWDR_JIT_OPT_LEVEL=0
cargo run -r --bin powdr-rs compile riscv/tests/riscv_data/$TEST -o output --max-degree-log 18
cargo run -r --features plonky3,halo2 pil output/$TEST.asm -o output -f --field gl --linker-mode bus
```
I get the following profile on the server:
```
== Witgen profile (2554126 events)
32.4% ( 2.6s): Secondary machine 0: main_binary (BlockMachine)
23.1% ( 1.9s): Main machine (Dynamic)
12.7% ( 1.0s): Secondary machine 4: main_regs (DoubleSortedWitnesses32)
10.0% ( 809.9ms): FixedLookup
7.7% ( 621.1ms): Secondary machine 5: main_shift (BlockMachine)
5.6% ( 454.6ms): Secondary machine 2: main_poseidon_gl (BlockMachine)
3.8% ( 312.3ms): multiplicity witgen
3.8% ( 308.2ms): witgen (outer code)
0.6% ( 45.3ms): Secondary machine 1: main_memory (DoubleSortedWitnesses32)
0.4% ( 33.4ms): Secondary machine 6: main_split_gl (BlockMachine)
0.0% ( 8.0µs): Secondary machine 3: main_publics (WriteOnceMemory)
---------------------------
==> Total: 8.114630092s
```

So the cost is ~4% of the total witgen time. I'm sure it can be
optimized further, but I would like to leave this to a future PR.
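
The ~4% figure follows from the `multiplicity witgen` line and the total in the profile above:

```math
\frac{312.3\,\text{ms}}{8114.6\,\text{ms}} \approx 0.0385 \approx 3.8\%
```
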

With this PR, we also compute the later-stage witnesses per machine
instead of globally. This has two advantages:
- We're able to handle machines of different sizes
- We can parallelize later-stage witness generation
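
As a rough illustration of both points, here is a sketch using `std::thread::scope`. `MachineTrace`, `later_stage_witness`, and `all_later_stage_witnesses` are hypothetical names, and the running sum is only a placeholder for the real later-stage computation. It assumes each machine's later-stage columns can be computed from that machine's own trace, so machines may have different row counts and can be processed on separate threads:

```rust
use std::collections::BTreeMap;
use std::thread;

/// Hypothetical per-machine witness: column name -> values.
type Witness = BTreeMap<String, Vec<u64>>;

/// Hypothetical stand-in for one machine's trace after stage 0.
struct MachineTrace {
    name: String,
    /// Each machine can have its own size (number of rows).
    stage0: Vec<u64>,
}

/// Placeholder later-stage witgen for a single machine: a running sum over
/// its own stage-0 column. It only reads this machine's data.
fn later_stage_witness(m: &MachineTrace) -> Witness {
    let acc: Vec<u64> = m
        .stage0
        .iter()
        .scan(0u64, |sum, &x| {
            *sum += x;
            Some(*sum)
        })
        .collect();
    let mut w = Witness::new();
    w.insert(format!("{}::acc", m.name), acc);
    w
}

/// Runs later-stage witgen for every machine on its own scoped thread.
fn all_later_stage_witnesses(machines: &[MachineTrace]) -> Vec<Witness> {
    thread::scope(|s| {
        let handles: Vec<_> = machines
            .iter()
            .map(|m| s.spawn(move || later_stage_witness(m)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    // Two machines of different sizes.
    let machines = vec![
        MachineTrace { name: "main".into(), stage0: vec![1, 2, 3, 4] },
        MachineTrace { name: "main_binary".into(), stage0: vec![5, 6] },
    ];
    let witnesses = all_later_stage_witnesses(&machines);
    assert_eq!(witnesses[0]["main::acc"], vec![1, 3, 6, 10]);
    assert_eq!(witnesses[1]["main_binary::acc"], vec![5, 11]);
}
```

A real implementation would more likely use a thread pool such as rayon than one OS thread per machine; scoped threads just keep the sketch dependency-free.
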

This affects the two backends that can deal with multiple machines in the
first place: `Plonky3Backend` and `CompositeBackend`.