This PR refactors `QuadraticSymbolicExpression`: I removed the reference
to
[`SymbolicExpression`](f60e9d9f69/constraint-solver/src/quadratic_symbolic_expression.rs (L70)).
It can represent values that are known at runtime, represented as an
expression. We don't need this in the context of APC: Variables are
either unknown or known at compile time, and therefore can be
represented as a `FieldElement` instead.
The idea is to introduce [this
trait](4989be08f3/constraint-solver/src/runtime_constant.rs (L11)),
which is implemented by `SymbolicExpression` and any `T: FieldElement`.
This way, we can continue to use the solver with
`QuadraticSymbolicExpression<SymbolicExpression<T, V>, V>` (equivalent
to the current `QuadraticSymbolicExpression<T, V>`) in the old JIT
pipeline and use the simpler `QuadraticSymbolicExpression<T, V>` in the
APC pipeline.
---------
Co-authored-by: chriseth <chriseth.github@gmail.com>
Simple one. Update patch so that it matches up exactly to the git
dependencies.
Before this PR, I had many compilation errors (cuz we are missing a
couple local crates) when syncing up patches during local development,
so might as well push to main so others don't need to fix up in the
future ;)
Pgo input is now "max total columns" instead of "max apc columns",
because there can be many non APC columns. To be exact, roughly ~580 for
RV32 chip set and ~15K in the set used in `reth-benchmark`. Tested and
works for Keccak.
This PR calculates the total non APC columns number, logs them by chip
in case we interested in what airs they belong to, and then subtract the
non APC total from the max total the user supplies to PGO to obtain the
max APC columns, which is then used in the KnapSack algorithm to limit
APC generation in Pgo::Cell mode.
This PR's method is a bit of simplification, we are not sure if all
chips in the original config's chip inventory get eventually proven.
There are three scenarios (in order of ascending restrictiveness):
1. All columns in original inventory stay in.
2. Those that get executed stay in.
3. Those that get proven stay in.
Scenario 3 is more restrictive than scenario 2, because we need all
dummy chips executed for witgen but not necessarily all dummy columns
for proving.
I think I'll focus more on PGO tmr related to APC vs non APC columns +
some more tests, so might as well merge this for now.
Conclusion for shift chip: no further optimization detected. Did a nice
job removing compile time flags and `imm` value related markers.
In cell pgo:
- rank apcs by `cells_saved / column_count`
- select them as long as the total column count doesn't cross a max
value, possibly skipping some
---------
Co-authored-by: Steve Wang <qian.wang.wg24@wharton.upenn.edu>
Again, leaving comments here as I mark things or detect further
optimizations.
Intended single intruction tests for Branch Eq and Branch Lt chips (6
instructions total). Should be the most popular chips after ALU and LS.
Findings: after fully inspecting these branch related instructions, I
conclude that there aren't further optimizations needed. These chips
mostly revolve around value comparison (lots of diff and inverse
columns), all of which require knowing register values to further
simplify, which aren't typically solved at compile time, and therefore
almost all original constraints and bus interactions stay in except
those that involve compile time opcode flags.
Added an instruction builder macro for branch opcodes only, which has
fixed `d=1` and `e=1` (both of which means register address space)
according to the OVM ISA specification. We only need to supply
`rs1_ptr`, `rs2_ptr`, and `imm` to the instruction, and it'll perform
`to_pc = pc + imm if load(REG, rs1_ptr) OP load(REG, rs2_ptr) else pc +
4`.
This PR covers APC single instruction tests for ALU (5 instructions) and
LoadStore (8 instructions). The PR itself is quite simple, but I'm also
reviewing the constraints and bus interactions line by line to see if we
are missing any potential optimization and also leaving comments for
potential things we could do.
LoadStore non-sign extend instructions (6 of them) have 14 variants,
because the non 4-byte aligned ones (half word or byte instructions) can
have different shift amounts (0, 2 for half word and 0, 1, 2, 3 for
byte). These 14 variants are expressed in a combination of 4 flags
(`flags[0, 1, 2, 3]`), as shown in code snippet from OVM below. While we
have optimized away some more "high level" LoadStore flags, (e.g.
`LOADW` flags must be `2, 0, 0, 0`), we haven't optimized some more
"fine grained" cases, (e.g. `LOADBU` flags must be `0, 0/2, 0/2, 0` but
we currently constrain them to `0, 0/1/2, 0/1/2, 0`). There are many
such cases, and I'm leaving a bunch of comments to identify them below.
```
match (opcode, record.shift) {
(LOADW, 0) => flags[0] = F::TWO,
(LOADHU, 0) => flags[1] = F::TWO,
(LOADHU, 2) => flags[2] = F::TWO,
(LOADBU, 0) => flags[3] = F::TWO,
(LOADBU, 1) => flags[0] = F::ONE,
(LOADBU, 2) => flags[1] = F::ONE,
(LOADBU, 3) => flags[2] = F::ONE,
(STOREW, 0) => flags[3] = F::ONE,
(STOREH, 0) => (flags[0], flags[1]) = (F::ONE, F::ONE),
(STOREH, 2) => (flags[0], flags[2]) = (F::ONE, F::ONE),
(STOREB, 0) => (flags[0], flags[3]) = (F::ONE, F::ONE),
(STOREB, 1) => (flags[1], flags[2]) = (F::ONE, F::ONE),
(STOREB, 2) => (flags[1], flags[3]) = (F::ONE, F::ONE),
(STOREB, 3) => (flags[2], flags[3]) = (F::ONE, F::ONE),
_ => unreachable!(),
};
```
- Introduce `OriginalAirs` struct in `powdr_openvm` to reduce air
extraction from once per opcode to once per air
- Expose a method to get air widths per opcode, useful in PGO
- Introduce `InstructionAirHandler` trait in `powdr_autoprecompiles` to
avoid this change leaking into that crate. The only thing that
`powdr_autoprecompiles` needs from the instruction airs is to fetch them
by opcode id.
- Implement this trait for `&OriginalAirs`
Currently, `powdr_autoprecompile` unwraps the result from getting the
air for a given opcode. In the future, we could encode "blocklisted"
opcodes as `get_instruction_air` returning Err.
The OVM CLI `Pgo` command and `powdr_ovm::pgo` function mostly collects
the execution frequency of each PC in the original program, and
therefore has become a misnomer now that the concept of PGO is mostly
implemented in `customize_exe.rs`.
This PR removes the `Pgo` command and also inlines `powdr_ovm::pgo`
function to `get_pc_idx_count`, which should be the correct name for the
intended purpose.
Depends on #2928.
New features:
1. Moved opcode related utilites from `customize_exe.rs` to `opcode.rs`,
so we don't overload the file and group all opcode stuffs in the same
file.
2. Bigint related opcodes are derived an updated macro in `opcode.rs`.
`ALL_OPCODES` contain bigint related opcodes.
3. Opcode lists can stay as `&[usize]` instead of forced to `BTreeSet`,
because both data structure can invoke `contains()` call.
Because we know the number of costliest APCs to compute at compile time,
we only need to cache the top `APC + APC_SKIP` costliest APCs for
Pgo::Cell mode instead of all of them. This is achieved by:
1. Maintaining a binary heap data structure which tracks the cost of
blocks and if we are over the max number of cache, insert the new cache
and eject the lowest costed block if the former is not the minimum
costed block.
2. Still keeping APC generation parallel by adding mutex guard around
the binary heap which is used for cost ordering as well as the map of
APC caches.
Create opcode formatter which instruction formatter uses. This is a
refactor that doesn't really increase the number of source code lines.
This is a required feature to format opcode in #2930.
Currently Pgo::Cell creates APCs for all BBs, but immediately after the
sorting the first `n_acc` BBs are created again regardless of PgoType.
This PR isolates the apc_caching for the first `n_acc` BBs as a separate
API that's only invoked by Pgo::Instruction and Pgo::None.
As per my reth bench mark test `MODE="execute" APC=1000 APC_SKIP=0
PGO_TYPE=cell /usr/bin/time -v ./run.sh` PLUS `panic!` immediately after
compilation, this reduces maximum resident size (memory consumption)
from 92 gigs to 72 gigs.
Currently we use a black list for opcode that shouldn't appear in a BB
for APC. This method has the disadvantage of encountering many bugs due
to new opcodes appearing that we can't create APC for (whether due to
unsupported expression type or compile time memory overload).
With the new reth-benchmark, we identified a few custom opcodes (likely
manual precompiles) that get injected in an APC, which causes an
abnormal amount of memory usage up to the 260 gigs limit of our server.
This calls for a systematic way to use an allow list for opcodes so that
we are robust against future manual opcodes.
A few key changes:
1. Use a macro to derive the vector of allowed opcodes from public
constants.
2. Allow list = basic opcodes - hint opcodes + bigint branch opcodes
Co-authored-by: schaeff <thibaut@powdrlabs.com>
Some refactoring:
- avoid collecting a large string in `export_pil`, add basic test
- introduce `OriginalVmConfig` which wraps SdkVmConfig and serves airs,
bus map, etc, with internal caching
- refactor to make the external API higher level. `compile_exe_with_elf`
can be used directly by `openvm-reth-benchmark`, see
https://github.com/powdr-labs/openvm-reth-benchmark/pull/7
This PR implements copy constraints in Plonk circuit.
Copy constraints are implemented as bus interactions.
Given an example:
$w_0 \times w_1 = temp_0$
$temp_0 \times w_0 = temp_2$
$w_0 + w_2 = temp_3$
$temp_0 \times temp_3 = temp_4$
In this example, there are 6 extra columns needed to implement the copy
constraint:
| a_id | b_id | c_id | a_perm | b_perm | c_perm |
|-------------|---------------------------|------------|---------------------|---------------------|---------------------|
| 0 | 1 | 2 | 4 | 1 | 3 |
| 3 | 4 | 5 |9 | 6 | 5 |
| 6 |7 |8 | 0 |7 |10 |
| 9 | 10 |11|2 | 8 |11|
For example, the variable $w_0$ appears in [a_id,b_id,c_id] with
position [0, 4, 6] (I call it permutation set). These positions are
shifted cyclically (by one) and placed in the same place of the
corresponding permutation column, as shown below
| a_id | b_id | c_id | a_perm | b_perm | c_perm |
|-------------|---------------------------|------------|---------------------|---------------------|---------------------|
| 0 | - | - | 4 | - | - |
| - | 4 | - |- | 6 | - |
| 6 |- |- | 0 |- |- |
| - | - |-|- | - |-|
The bus interaction are
receive: ([a, a_id],1), ([b,b_id],1), ([c, c_id],1)
send: ([a, a_perm],-1), ([b,b_perm],-1), ([c, c_perm],-1)
This PR creates permutation sets for only the
Witness(AlgebraicReference) cells, and then creates their corresponding
permutation columns as well.
Next PR will do the same for Temp(id) cells.
This PR changes `BusMap: bus_ids: BTreeMap<u64, BusType>` to `BusMap:
bus_ids: BTreeMap<u64, Option<BusType>>`, by doing this we can avoid
adding CopyConstraintLookup into BusType enum when adding this new bus
type by plonk chip, get_pil can behave correctly, see discussion
[here](https://github.com/powdr-labs/powdr/pull/2871#discussion_r2149829356)
Verbose, but working.
After this, the default config could be moved back into the
autoprecompiles repo to be used only in unit tests.
Update: not verbose anymore. We use `find_chip` with the periphery chips
that we are expecting. The exhaustive approach seemed like too much code
for the benefit: we would know if a new chip gets added, but we wouldn't
if a new bus gets added to an existing chip. So the current version is
good enough.
Depends on https://github.com/powdr-labs/powdr/pull/2876
Created this PR in response to comments:
https://github.com/powdr-labs/powdr/pull/2876#discussion_r2142388765
Currently, I composed the doc for each instruction such that it contains
the OVM doc for the adapter and core chips of each instruction. This has
the advantage of fully describing the constraints expected from each
instruction.
Because the doc should be repeated for each group of instructions, e.g.
(ADD and SUB), for an easy review, I'm not copying the docs to opcodes
that it should be repeated for yet, but eventually I would attempt to
use some macro schenanigans to avoid duplication as much as possible,
but in the worst case, repeat them for some opcodes.
P.S. They look pretty nice when you hover over the constructor function
with an IDE. :)
This PR enables us to remove an ugly heuristic: #2867
For context, this was the problem:
- In OpenVMs load/store chip, the memory interactions going to register
address space still include unknown columns in the address expression.
This is correct (we can't solve for these columns at compile time), but
we can narrow down the possible values for these columns. If you do
that, it turns our that for all possible assignments, the address
evaluates to the same constant value.
- #2866 added this type of reasoning to variables. But before this PR,
only witness columns are considered variables. The (register) memory
address is not stored in a witness column though.
This PR solves this by introducing variables for bus interaction fields.
When the solver is started, variables are wrapped with this type:
b23cfc5f70/constraint-solver/src/solver/bus_interaction_variable_wrapper.rs (L12-L17)
and bus interaction fields are replaced with `Variable::
BusInteractionField` instances. Additional constraints are added to bind
the new variable to the expression that was replaced.
In the end, the variables are unwrapped again and any bus interaction
field assignments are returned to the caller.
In the end, the solver is slower because of [this
addition](b23cfc5f70/constraint-solver/src/solver/exhaustive_search.rs (L132-L148)).
Measuring `time cargo test -r tests::keccak_machine_pgo -- --nocapture`
on my system, I get:
- 45s on `main`
- 98s on `introduce-bus-interaction-vars2` (this PR)
- 99s on `remove-is-load-heuristic` (#2867)
- 48s on `tweak-exhaustive-search-params` (#2908) after tweaking some
parameters
Changes:
- Removed `PolyID`, including the polynomial type. Now
`AlgebraicReference` is just an ID with a name (for display purposes)
- Removed the `Column` type, which would have been essentially the same
as `AlgebraicReference`
- Renamed `legacy_expression` -> `expression`
- Because of the type change, I needed to update checked-in CBOR files.
With this change, expected machines are stored in separate files. If the
optimizer changes and the test fails, one simply has to delete the file
and it gets regenerated.
We used to store the name in a string in `AlgebraicReference`. Now, it's
an `Arc<String>`, so that if a column is mentioned many times, its name
is only allocated once.
Builds on #2890
This PR removes the "next" flag from `AlgebraicReference`. This requires
a few changes, because OpenVM's AIRs do use them. It should still be
removed, because we can't deal with it when generating APCs.
The solution is to introduce a new type that wraps `AlgebraicReference`:
1cee98dbba/openvm/src/utils.rs (L25-L30)
Note that this also takes out the `is_{first,last,transition}`
reference, which we also can't handle when compiling APCs!
When exporting OpenVM's chips to PIL, we use the `OpenVmReference` type;
for everything else, we convert to `AlgebraicReference`. For this, we
need to make sure that we don't do that for blacklisted opcodes (because
they might contain next references).
---------
Co-authored-by: Leo <leo@powdrlabs.com>
All APCs need access to periphery chips such as the range checker or the
xor chip.
These chips are expensive to create in memory, but they are not needed
in the end and just get thrown away.
Before this PR, each APC creates one instance of them.
After this PR, a single instance is created and then shared across all
APCs.