Compare commits

1 Commits

Author SHA1 Message Date
yongkangc
89a1235a40 feat(db): add MDBX database warmup at startup
Add opt-in database warmup using MDBX's native mdbx_env_warmup() API
to prime the OS page cache after node restarts, reducing cold-start
latency.

New CLI flag: --db.warmup [off|madvise|force]
- off (default): no warmup
- madvise: async prefetch via madvise(MADV_WILLNEED)
- force: synchronously fault in all pages (recommended for ≥128GB RAM)

Inspired by Nethermind's Db.StateDbEnableFileWarmer feature.

Amp-Thread-ID: https://ampcode.com/threads/T-019c30e4-067e-76bd-a13b-0ecf46557093
2026-02-06 03:48:02 +00:00
12 changed files with 2452 additions and 1 deletions

124
OPTIMIZATION_ROADMAP.md Normal file

@@ -0,0 +1,124 @@
# Nethermind → Reth Optimization Roadmap
## Phase 1: Quick Wins (< 1 week each)
| # | Description | Target File | Impact |
|---|-------------|-------------|--------|
| 1 | Cursor metrics: use `as_ref()` instead of `Arc::clone()` per op | `crates/storage/db/src/implementation/mdbx/cursor.rs` | 5-10% cursor overhead reduction |
| 2 | `reinsert_reorged_blocks()`: accept `&[ExecutedBlock]` instead of clone | `crates/engine/tree/src/tree/mod.rs` | Eliminates 2 Vec clones per reorg |
| 3 | Fast-path `find_disk_reorg()`: early-return if persisted tip in canonical chain | `crates/engine/tree/src/tree/mod.rs` | Skips O(n) walk in common case |
| 4 | `BlockBuffer::remove_block()`: replace `VecDeque::retain` with `HashSet` | `crates/engine/tree/src/tree/block_buffer.rs` | O(1) vs O(n) per removal |
| 5 | Wrap overlay in `Arc` in `StateProviderBuilder` | `crates/engine/tree/src/tree/mod.rs` | O(1) clone vs O(n) Vec realloc |
| 6 | Enable MDBX readahead during staged sync, disable for live following | `crates/storage/db/src/implementation/mdbx/mod.rs` | 10-20% faster initial sync I/O |
| 7 | `is_canonical()`: maintain `HashSet<B256>` for O(1) lookup | `crates/engine/tree/src/tree/state.rs` | Eliminates O(n) chain walks |
| 8 | `blocks_by_number`: store `B256` hashes instead of `ExecutedBlock` clones | `crates/engine/tree/src/tree/state.rs` | Less memory, fewer Arc bumps |
| 9 | Parallelize account trie reveals with `par_bridge_buffered()` | `crates/trie/sparse/src/state.rs` L466-499 | 10-20% faster proof reveals |
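
Item 7 in the table above can be illustrated with a small side index. This is a minimal sketch under stated assumptions: `CanonicalIndex` is a hypothetical type, and `[u8; 32]` stands in for `B256` so the example needs no external crates. It only shows the lookup idea, not how Reth's tree state would keep the set in sync.

```rust
use std::collections::HashSet;

/// Stand-in for `B256`; an assumption so the sketch has no dependencies.
type Hash = [u8; 32];

/// Hypothetical index of canonical block hashes, kept alongside the chain.
#[derive(Default)]
pub struct CanonicalIndex {
    hashes: HashSet<Hash>,
}

impl CanonicalIndex {
    /// Records a block as canonical when it is appended to the chain.
    pub fn insert(&mut self, hash: Hash) {
        self.hashes.insert(hash);
    }

    /// Removes a block, for example when a reorg unwinds it.
    pub fn remove(&mut self, hash: &Hash) {
        self.hashes.remove(hash);
    }

    /// Average O(1) membership test instead of walking parent links.
    pub fn is_canonical(&self, hash: &Hash) -> bool {
        self.hashes.contains(hash)
    }
}
```

The cost of this design is that every canonical-chain update and every reorg must also update the set, otherwise lookups go stale.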
## Phase 2: Medium Refactors (1-4 weeks each)
| # | Description | Target File | Impact |
|---|-------------|-------------|--------|
| 1 | Batch `write_state_changes` across all blocks (match `write_hashed_state` pattern) | `crates/storage/provider/src/providers/database/provider.rs` L609-637 | 2-5x fewer cursor open/close for batches |
| 2 | Optimize DupSort storage writes: sort by (addr,slot), walk cursor sequentially | `crates/storage/provider/src/providers/database/provider.rs` L2395-2461 | 30-50% fewer MDBX ops |
| 3 | Add hot-state LRU cache at provider layer for top accounts/slots | `crates/storage/provider/src/providers/state/latest.rs` | 20-40% fewer state reads |
| 4 | DB file warmer: background `madvise(MADV_WILLNEED)` scan at startup | `crates/storage/db/src/implementation/mdbx/mod.rs` | Eliminates weeks-long cold start |
| 5 | Cache recently-persisted `ExecutedBlock`s (LRU-64) to avoid trie recomputation | `crates/engine/tree/src/tree/mod.rs` (canonical_block_by_hash) | Critical for reorg performance |
| 6 | Combine save + prune into single MDBX transaction | `crates/engine/tree/src/persistence.rs` L108-126 | Saves 1 fsync per cycle |
| 7 | Sender-grouped parallel prewarming (sequential intra-sender, parallel across) | `crates/engine/tree/src/tree/payload_processor/prewarm.rs` | 30-50% block processing speedup |
| 8 | Deferred history index updates to background task | `crates/storage/provider/src/providers/database/provider.rs` L1228-1276 | 15-25% faster save_blocks |
## Phase 3: Big Bets (1+ month each)
| # | Description | Target File | Impact |
|---|-------------|-------------|--------|
| 1 | Integrate `ParallelSparseTrie` into engine pipeline (replace serial) | `crates/engine/tree/src/tree/payload_processor/sparse_trie.rs` | 30-50% faster state root |
| 2 | EIP-7928 Block Access List prefetching (currently `todo!()`) | `crates/engine/tree/src/tree/payload_processor/sparse_trie.rs` L441 | 20-40% fewer proof round-trips |
| 3 | Adaptive MDBX tuning: switch config between sync and tip-following modes | `crates/storage/db/src/implementation/mdbx/mod.rs` | Lower write amplification |
| 4 | Sharded dirty trie node cache (256 shards, memory-budgeted, batch persist) | `crates/trie/` | 3x reduction in DB writes |
| 5 | Pipeline persistence with double-buffering (overlap I/O with prep) | `crates/engine/tree/src/persistence.rs` | Higher persistence throughput |
## Top 10 Recommendations (Mini-RFCs)
### 1. Sender-Grouped Prewarming
- **Problem:** Prewarming doesn't handle intra-sender tx dependencies, reducing cache hit rate.
- **Nethermind:** Groups txs by sender; sequential within sender, parallel across senders.
- **Reth change:** In `prewarm.rs`, group by `tx.signer()`, use rayon `par_iter` per group. Feed results into `cached_state.rs` pre-block cache with clear-on-new-block semantics.
- **Gain:** 30-50% block processing reduction. **Risk:** Low — prewarming is speculative.
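
The grouping described above can be sketched with the standard library alone. Everything here is an assumption for illustration: `Tx` is a hypothetical two-field transaction, `warm` is a caller-supplied callback, and one scoped thread per sender replaces the rayon pool the proposal mentions.

```rust
use std::collections::HashMap;
use std::thread;

/// Hypothetical minimal transaction: a sender id and a nonce.
#[derive(Clone, Debug, PartialEq)]
pub struct Tx {
    pub sender: u64,
    pub nonce: u64,
}

/// Groups transactions by sender, preserving block order inside each group.
pub fn group_by_sender(txs: &[Tx]) -> Vec<Vec<Tx>> {
    let mut slot_of: HashMap<u64, usize> = HashMap::new();
    let mut groups: Vec<Vec<Tx>> = Vec::new();
    for tx in txs {
        let slot = *slot_of.entry(tx.sender).or_insert_with(|| {
            groups.push(Vec::new());
            groups.len() - 1
        });
        groups[slot].push(tx.clone());
    }
    groups
}

/// Runs `warm` sequentially within each sender group and in parallel across
/// groups. Returns how many transactions each group processed, in group order.
pub fn prewarm<F>(groups: &[Vec<Tx>], warm: F) -> Vec<usize>
where
    F: Fn(&Tx) + Sync,
{
    thread::scope(|s| {
        // Spawn every group first so the groups actually run concurrently.
        let handles: Vec<_> = groups
            .iter()
            .map(|group| {
                let warm = &warm;
                s.spawn(move || {
                    for tx in group {
                        warm(tx); // same sender: keep the original order
                    }
                    group.len()
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("prewarm worker panicked"))
            .collect()
    })
}
```

A thread per sender is only reasonable for a demonstration; a block with hundreds of distinct senders would want a bounded worker pool.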
### 2. ParallelSparseTrie for State Root
- **Problem:** Final `root_with_updates()` hashes the entire trie serially on one thread.
- **Nethermind:** Adaptive parallel commit: 1-3 levels of parallelism based on dirty node count.
- **Reth change:** Wire `ParallelSparseTrie` (already in `sparse-parallel/`) into `SparseTrieCacheTask`. Use its 256-subtrie rayon parallelism. Add adaptive thresholds to skip parallelism for small updates.
- **Gain:** 30-50% root computation reduction. **Risk:** Medium — integration complexity.
### 3. Batched Plain State Writes
- **Problem:** `write_state()` called per-block; opens new cursors N times in a batch.
- **Nethermind:** Single commit buffer absorbing hundreds of blocks.
- **Reth change:** In `provider.rs`, merge all `StateChangeset` across the block batch before writing. Use same pattern as `write_hashed_state` (L641-658).
- **Gain:** 2-5x cursor overhead reduction for multi-block batches. **Risk:** Low.
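
The merge step could look roughly like the following. It is a sketch under assumptions: `BlockChanges` and `merge_changes` are hypothetical names, fixed-size byte arrays stand in for the address and slot types, and a `u64` stands in for the storage value. It covers only the latest-value view of state, not per-block changesets.

```rust
use std::collections::BTreeMap;

/// Stand-ins for address and storage slot; assumptions for the sketch.
type Address = [u8; 20];
type Slot = [u8; 32];

/// One block's storage writes: (address, slot, new value).
pub type BlockChanges = Vec<(Address, Slot, u64)>;

/// Folds the writes of a batch of blocks into one sorted map.
/// Later blocks overwrite earlier ones, so each key is written once.
pub fn merge_changes(blocks: &[BlockChanges]) -> BTreeMap<(Address, Slot), u64> {
    let mut merged = BTreeMap::new();
    for block in blocks {
        for (address, slot, value) in block {
            merged.insert((*address, *slot), *value);
        }
    }
    merged
}
```

Iterating the resulting `BTreeMap` yields keys in `(address, slot)` order, so a single cursor could be opened once and advanced forward through the batch.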
### 4. DB File Warmer
- **Problem:** After restart, OS page cache takes weeks to warm naturally.
- **Nethermind:** `StateDbEnableFileWarmer` reads all DB files sequentially at startup.
- **Reth change:** Spawn a background thread at startup that reads the MDBX data file sequentially, hinting the access pattern with `posix_fadvise(POSIX_FADV_SEQUENTIAL)`. ~50 lines of code.
- **Gain:** Minutes vs weeks for warm cache. **Risk:** Very low.
### 5. Hot State LRU Cache
- **Problem:** Every SLOAD goes to MDBX B-tree + Compact decode; no app-level cache.
- **Nethermind:** Aggressive caching via `PreBlockCaches` (concurrent dictionaries for accounts, storage, RLP).
- **Reth change:** Add `DashMap`-based LRU in `LatestStateProviderRef` for top ~10K accounts and ~100K slots. Clear on reorg. Invalidate on write.
- **Gain:** 20-40% fewer state reads. **Risk:** Medium — cache invalidation correctness.
### 6. DupSort Write Optimization
- **Problem:** Storage writes do seek-delete-upsert per slot (4000 random ops/block).
- **Nethermind:** Batch-oriented writes with path-sorted keys.
- **Reth change:** In `provider.rs` L2395-2461, sort all entries by (address, slot), open cursor once, walk forward sequentially instead of random seeks.
- **Gain:** 30-50% fewer MDBX ops on storage-heavy blocks. **Risk:** Low.
### 7. Sharded Dirty Node Cache
- **Problem:** Trie writes go directly to MDBX; no write absorption layer.
- **Nethermind:** 256-shard `ConcurrentDictionary` cache with 1-4GB budget, batch-persist at finality.
- **Reth change:** Add a `DashMap`-backed dirty node cache between trie computation and MDBX writes in `crates/trie/`. Persist only at finality boundaries. Track memory usage for eviction.
- **Gain:** 3x reduction in SSD writes. **Risk:** Medium — memory management, crash recovery.
### 8. EIP-7928 Block Access List Prefetching
- **Problem:** `MultiProofMessage::BlockAccessList` handler is `todo!()`.
- **Nethermind:** Pre-warms all addresses + access lists before execution.
- **Reth change:** Parse BAL at block start in `sparse_trie.rs` L441, convert to proof targets, dispatch all proofs before execution. Eliminates iterative reveal-execute-reveal cycle.
- **Gain:** 20-40% state root time reduction. **Risk:** Medium — depends on EIP adoption.
### 9. Thread Priority Elevation
- **Problem:** Block processing thread can be preempted by RPC/networking work.
- **Nethermind:** `Thread.SetHighestPriority()` during block processing.
- **Reth change:** In `crates/engine/tree/src/tree/mod.rs` engine loop, call `libc::setpriority` or `sched_setscheduler` to elevate thread during `on_new_payload`. Restore after.
- **Gain:** 5-15% tail latency reduction. **Risk:** Low — requires CAP_SYS_NICE.
### 10. Trie Write Deduplication
- **Problem:** Unchanged trie nodes may be rewritten to DB unnecessarily.
- **Nethermind:** Tracks `skipped_writes` metric; skips writes when RLP hasn't changed.
- **Reth change:** In trie commit path, compare new node bytes against existing before MDBX `upsert`. Add `reth_trie_skipped_writes` metric.
- **Gain:** Variable, significant for read-heavy blocks. **Risk:** Very low.
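
The compare-before-write idea can be shown against an in-memory map. This is a minimal sketch: `NodeStore` and `upsert_if_changed` are hypothetical names, and a `HashMap` stands in for the MDBX trie tables.

```rust
use std::collections::HashMap;

/// Hypothetical node store standing in for the MDBX trie tables.
#[derive(Default)]
pub struct NodeStore {
    nodes: HashMap<Vec<u8>, Vec<u8>>,
    /// Counts writes avoided because the encoded node was unchanged.
    pub skipped_writes: u64,
}

impl NodeStore {
    /// Writes `bytes` under `path` unless the stored value is already identical.
    /// Returns `true` if a write happened.
    pub fn upsert_if_changed(&mut self, path: &[u8], bytes: &[u8]) -> bool {
        if self.nodes.get(path).map(Vec::as_slice) == Some(bytes) {
            self.skipped_writes += 1;
            return false;
        }
        self.nodes.insert(path.to_vec(), bytes.to_vec());
        true
    }
}
```

Against a real database the comparison costs a read per node, so the saving depends on how often nodes are in fact unchanged.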
## Reth-Only Bugs to Fix
| # | Issue | File | Fix |
|---|-------|------|-----|
| 1 | `canonical_block_by_hash()` recomputes trie updates from DB on every call | `crates/engine/tree/src/tree/mod.rs` L1904-1955 | Cache last N persisted blocks |
| 2 | `MeteredStateHook::on_state()` recomputes counts O(n) per transaction | `crates/engine/tree/src/tree/mod.rs` L232-244 | Track incrementally |
| 3 | `InsertExecutedBlock` path clones block 3 times unnecessarily | `crates/engine/tree/src/tree/mod.rs` L1462-1486 | Use original for last consumer |
| 4 | `get_canonical_blocks_to_persist()` walks backwards from head instead of using BTreeMap range | `crates/engine/tree/src/tree/mod.rs` L1827-1871 | Use `blocks_by_number.range()` |
| 5 | Execution cache miss on non-prewarm path doesn't populate cache for intra-block reuse | `crates/engine/tree/src/tree/cached_state.rs` L310-325 | Populate on miss |
| 6 | `select_biased!` in sparse trie prioritizes updates over proof results, delaying proof application | `crates/engine/tree/src/tree/payload_processor/sparse_trie.rs` L364 | Alternate priority |
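
The fix suggested for item 4 in the table above relies on `BTreeMap::range`. A minimal sketch, assuming a simplified map with one canonical hash per block number; `blocks_to_persist` and its parameters are hypothetical names, not Reth's signature.

```rust
use std::collections::BTreeMap;

/// Stand-in for a block hash.
type Hash = [u8; 32];

/// Returns the hashes with numbers in `(last_persisted, target]`, oldest first.
pub fn blocks_to_persist(
    blocks_by_number: &BTreeMap<u64, Hash>,
    last_persisted: u64,
    target: u64,
) -> Vec<Hash> {
    // `range` panics if start > end, so handle the empty case up front.
    if target <= last_persisted {
        return Vec::new();
    }
    blocks_by_number
        .range(last_persisted + 1..=target)
        .map(|(_, hash)| *hash)
        .collect()
}
```

The range query visits only the requested block numbers, in ascending order, so no backwards walk or final reversal is needed.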
## Do NOT Port
| Technique | Why |
|-----------|-----|
| GC-aware scheduling (GCScheduler) | Rust has no GC; irrelevant |
| LOH compaction / memory decommit | .NET-specific memory management |
| Large Array Pool (1-8MB buffers) | Rust allocator doesn't have .NET's >1MB discard behavior |
| Object pooling for TX processing envs | Rust ownership model handles this differently; no GC pressure |
| `NoResizeClear` for ConcurrentDictionary | Rust `HashMap::clear()` already preserves capacity |
| Paprika custom DB engine | Too large a departure; MDBX is well-suited for Reth's architecture |
| HalfPath key scheme | Reth already uses flat `PlainState` tables for reads; trie is only for root computation |
| Compact receipt storage toggle | Reth already uses compact encoding for receipts |

View File

@@ -9,7 +9,7 @@ use clap::{
value_parser, Arg, Args, Command, Error,
};
use reth_db::{
mdbx::{MaxReadTransactionDuration, SyncMode},
mdbx::{DatabaseWarmupMode, MaxReadTransactionDuration, SyncMode},
ClientVersion,
};
use reth_storage_errors::db::LogLevel;
@@ -60,6 +60,17 @@ pub struct DatabaseArgs {
value_parser = value_parser!(SyncMode),
)]
pub sync_mode: Option<SyncMode>,
/// Warm up the database at startup by loading pages into the OS page cache.
///
/// - `off`: No warmup (default).
/// - `madvise`: Async prefetch via madvise(MADV_WILLNEED).
/// - `force`: Synchronously fault in all pages (recommended only with ≥128GB RAM).
#[arg(
long = "db.warmup",
value_parser = value_parser!(DatabaseWarmupMode),
default_value_t = DatabaseWarmupMode::Off,
)]
pub warmup: DatabaseWarmupMode,
}
impl DatabaseArgs {
@@ -89,6 +100,7 @@ impl DatabaseArgs {
.with_growth_step(self.growth_step)
.with_max_readers(self.max_readers)
.with_sync_mode(self.sync_mode)
.with_warmup(self.warmup)
}
}

View File

@@ -52,6 +52,44 @@ const DEFAULT_MAX_READERS: u64 = 32_000;
/// See [`reth_libmdbx::EnvironmentBuilder::set_handle_slow_readers`] for more information.
const MAX_SAFE_READER_SPACE: usize = 10 * GIGABYTE;
/// Controls how the database is warmed up at startup to prime the OS page cache.
#[derive(Clone, Copy, Debug, Default, Eq, PartialEq)]
pub enum DatabaseWarmupMode {
/// No warmup — rely on natural page faults (default).
#[default]
Off,
/// Async prefetch via `madvise(MADV_WILLNEED)`. Low overhead, non-deterministic.
Madvise,
/// Synchronously fault in all allocated pages. Higher I/O but deterministic warmup.
/// Recommended only for nodes with sufficient RAM (≥128 GB).
Force,
}
impl std::fmt::Display for DatabaseWarmupMode {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
Self::Off => write!(f, "off"),
Self::Madvise => write!(f, "madvise"),
Self::Force => write!(f, "force"),
}
}
}
impl std::str::FromStr for DatabaseWarmupMode {
type Err = String;
fn from_str(s: &str) -> Result<Self, Self::Err> {
match s.trim().to_ascii_lowercase().as_str() {
"off" => Ok(Self::Off),
"madvise" => Ok(Self::Madvise),
"force" => Ok(Self::Force),
_ => Err(format!(
"invalid value '{s}' for warmup mode. valid values: off, madvise, force"
)),
}
}
}
/// Environment used when opening a MDBX environment. RO/RW.
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
pub enum DatabaseEnvKind {
@@ -119,6 +157,8 @@ pub struct DatabaseArguments {
/// environments). Choose `SafeNoSync` if performance is more important and occasional data
/// loss is acceptable (e.g., testing or ephemeral data).
sync_mode: SyncMode,
/// Controls database warmup at startup.
warmup_mode: DatabaseWarmupMode,
}
impl Default for DatabaseArguments {
@@ -143,6 +183,7 @@ impl DatabaseArguments {
exclusive: None,
max_readers: None,
sync_mode: SyncMode::Durable,
warmup_mode: DatabaseWarmupMode::Off,
}
}
@@ -215,6 +256,17 @@ impl DatabaseArguments {
self
}
/// Sets the database warmup mode.
pub const fn with_warmup(mut self, warmup_mode: DatabaseWarmupMode) -> Self {
self.warmup_mode = warmup_mode;
self
}
/// Returns the configured warmup mode.
pub const fn warmup(&self) -> DatabaseWarmupMode {
self.warmup_mode
}
/// Returns the client version if any.
pub const fn client_version(&self) -> &ClientVersion {
&self.client_version
@@ -504,6 +556,39 @@ impl DatabaseEnv {
self
}
/// Spawns a background thread that warms up the database according to the given mode.
///
/// Does nothing if `mode` is [`DatabaseWarmupMode::Off`].
pub fn start_warmup(&self, mode: DatabaseWarmupMode) {
let flags = match mode {
DatabaseWarmupMode::Off => return,
DatabaseWarmupMode::Madvise => ffi::MDBX_warmup_default,
DatabaseWarmupMode::Force => ffi::MDBX_warmup_force | ffi::MDBX_warmup_oomsafe,
};
let inner = self.inner.clone();
std::thread::Builder::new()
.name("reth-db-warmup".to_string())
.spawn(move || match inner.warmup(flags, 0) {
Ok(_) => {
reth_tracing::tracing::info!(
target: "reth::db",
?mode,
"Database warmup completed"
);
}
Err(err) => {
reth_tracing::tracing::warn!(
target: "reth::db",
%err,
?mode,
"Database warmup failed"
);
}
})
.ok();
}
/// Creates all the tables defined in [`Tables`], if necessary.
///
/// This keeps track of the created table handles and stores them for better efficiency.

View File

@@ -46,10 +46,12 @@ pub fn init_db_for<P: AsRef<Path>, TS: TableSet>(
args: DatabaseArguments,
) -> eyre::Result<DatabaseEnv> {
let client_version = args.client_version().clone();
let warmup = args.warmup();
let mut db = create_db(path, args)?;
db.create_and_track_tables_for::<TS>()?;
db.record_client_version(client_version)?;
drop_orphan_tables(&db);
db.start_warmup(warmup);
Ok(db)
}

View File

@@ -149,6 +149,22 @@ impl Environment {
f(self.env_ptr())
}
/// Warms up the database by loading pages into the OS page cache.
///
/// `flags` are `MDBX_warmup_flags_t` bitflags (e.g. `MDBX_warmup_default`,
/// `MDBX_warmup_force | MDBX_warmup_oomsafe`).
///
/// `timeout_seconds_16dot16` is a 16.16 fixed-point timeout in seconds (0 = no timeout).
pub fn warmup(
&self,
flags: ffi::MDBX_warmup_flags_t,
timeout_seconds_16dot16: u32,
) -> Result<bool> {
mdbx_result(unsafe {
ffi::mdbx_env_warmup(self.env_ptr(), ptr::null(), flags, timeout_seconds_16dot16)
})
}
/// Flush the environment data buffers to disk.
pub fn sync(&self, force: bool) -> Result<bool> {
mdbx_result(unsafe { ffi::mdbx_env_sync_ex(self.env_ptr(), force, false) })

View File

@@ -0,0 +1,346 @@
# DB File Warmer for Reth — Exploration & Implementation Plan
## Problem
After a restart, the OS page cache is cold. MDBX reads hit disk for weeks until naturally warmed.
Nethermind solved this with `Db.StateDbEnableFileWarmer`, which reads all DB files sequentially at startup to prime the OS page cache.
## Current State in Reth
**No file warmer exists.** A search for "warm", "fadvise", "madvise", "page cache", "readahead" found only:
- The `no_rdahead: true` flag set during DB open (line 427 of `crates/storage/db/src/implementation/mdbx/mod.rs`) — this *disables* readahead for random access patterns
- MDBX's vendored C code has `mdbx_env_warmup()` API (discussed below)
- No Rust-side warmup logic exists anywhere
## Key Findings
### 1. MDBX Has a Built-in Warmup API
The vendored libmdbx already exposes `mdbx_env_warmup()` in `mdbx.h` (line 3278) and it's available in the auto-generated FFI bindings (`ffi::mdbx_env_warmup`). This is **not currently wrapped** in the Rust `reth-libmdbx` crate.
```c
// mdbx.h:3278
LIBMDBX_API int mdbx_env_warmup(
const MDBX_env *env,
const MDBX_txn *txn,
MDBX_warmup_flags_t flags,
unsigned timeout_seconds_16dot16 // fixed-point 16.16 format
);
```
Available flags (from FFI bindings):
| Flag | Value | Description |
|------|-------|-------------|
| `MDBX_warmup_default` | 0 | Ask OS kernel to async prefetch pages (uses `madvise(MADV_WILLNEED)`) |
| `MDBX_warmup_force` | 1 | Force-read all allocated pages sequentially into memory |
| `MDBX_warmup_oomsafe` | 2 | Use syscalls instead of direct access to avoid OOM-killer |
| `MDBX_warmup_lock` | 4 | Lock pages in memory via `mlock()` |
| `MDBX_warmup_touchlimit` | 8 | Auto-adjust resource limits for lock |
| `MDBX_warmup_release` | 16 | Release a previous lock |
**This is the preferred approach** — it's already built into MDBX, handles the mmap'd file correctly, and respects the database's internal geometry (only warms allocated pages, not the full file).
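
To make the flag combinations concrete, here is a small sketch of how a user-facing mode could map onto the bit values listed in the table. The constants are redefined locally from that table purely for illustration (real code would use the generated `ffi` bindings), and `Mode` and `warmup_flags` are hypothetical names.

```rust
/// Flag values copied from the table above; assumptions for this sketch only.
pub const WARMUP_DEFAULT: u32 = 0;
pub const WARMUP_FORCE: u32 = 1;
pub const WARMUP_OOMSAFE: u32 = 2;

/// Hypothetical user-facing warmup mode.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Mode {
    Off,
    Madvise,
    Force,
}

/// Maps a mode to the flag bits for `mdbx_env_warmup`, or `None` to skip warmup.
pub fn warmup_flags(mode: Mode) -> Option<u32> {
    match mode {
        Mode::Off => None,
        // Value 0: only ask the kernel for an async prefetch.
        Mode::Madvise => Some(WARMUP_DEFAULT),
        // Force-read all allocated pages, using the OOM-safe syscall path.
        Mode::Force => Some(WARMUP_FORCE | WARMUP_OOMSAFE),
    }
}
```

Returning `Option` keeps "no warmup" distinct from `MDBX_warmup_default`, which is also the value 0 but still triggers a prefetch request.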
### 2. How Nethermind Does It (for reference)
Nethermind uses RocksDB (not MDBX), so their approach is different. From `DbOnTheRocks.cs`:
```csharp
private void WarmupFile(string basePath, RocksDb db)
{
// Gets live SST file metadata from RocksDB
// Sorts by creation time (oldest first)
// Reads each file sequentially with 512KB buffer using Parallel.ForEach
// Logs progress as percentage
byte[] buffer = new byte[512.KiB()];
using FileStream stream = File.OpenRead(fullPath);
int readCount = buffer.Length;
while (readCount == buffer.Length)
{
readCount = stream.Read(buffer);
Interlocked.Add(ref totalRead, readCount);
}
}
```
Key details:
- Enabled via `Db.StateDbEnableFileWarmer: true`
- Recommended for systems with ≥128GB RAM
- Reads files in parallel
- 512KB read buffer
- Logs progress as percentage
### 3. MDBX DB File Locations
MDBX stores two files in the DB directory:
- `mdbx.dat` — the main data file (defined as `MDBX_DATANAME "/mdbx.dat"` in mdbx.h:802)
- `mdbx.lck` — the lock file (defined as `MDBX_LOCKNAME "/mdbx.lck"` in mdbx.h:787)
The DB directory path is resolved via:
- `ChainPath::db()` → `<datadir>/<chain_id>/db` (crates/node/core/src/dirs.rs:288)
- Passed to `init_db(path, args)` (crates/storage/db/src/mdbx.rs:38)
- Then to `DatabaseEnv::open(path, kind, args)` (crates/storage/db/src/implementation/mdbx/mod.rs:348)
### 4. Database Opening Flow
```
NodeCommand::run()
→ init_db(db_path, args) // crates/storage/db/src/mdbx.rs:38
→ create_db(path, args) // crates/storage/db/src/mdbx.rs:17
→ DatabaseEnv::open(path, RW, args) // crates/storage/db/src/implementation/mdbx/mod.rs:348
→ EnvironmentBuilder::open(path) // crates/storage/libmdbx-rs/src/environment.rs:611
→ mdbx_env_create() + mdbx_env_open()
```
## Implementation Plan
### Approach: Use MDBX's Native `mdbx_env_warmup()` API
This is superior to manual file reading because:
1. It understands MDBX's internal page layout (only warms allocated pages)
2. Works correctly with MDBX's mmap (the data is already mmap'd — we just need to fault in pages)
3. Handles edge cases (OOM safety, timeouts)
4. Already implemented and tested in libmdbx
5. ~30 lines of Rust code total
### Step 1: Add `warmup()` to `Environment` in `reth-libmdbx`
**File:** `crates/storage/libmdbx-rs/src/environment.rs`
Add a method to `Environment`:
```rust
impl Environment {
/// Warms up the database by loading pages into memory.
///
/// Uses `mdbx_env_warmup()` to ask the OS to prefetch database pages,
/// optionally force-loading them. This eliminates cold-start penalties
/// after a restart.
pub fn warmup(&self, flags: ffi::MDBX_warmup_flags_t, timeout_seconds: Option<u64>) -> Result<bool> {
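// Timeout is 16.16 fixed-point: the upper 16 bits hold whole seconds.
// Clamp to `u16::MAX` so the shift cannot overflow; `None` maps to 0 (no timeout).
let timeout = timeout_seconds.map_or(0, |s| (s.min(u64::from(u16::MAX)) as u32) << 16);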
mdbx_result(unsafe {
ffi::mdbx_env_warmup(
self.env_ptr(),
ptr::null(), // no specific txn
flags,
timeout,
)
})
}
}
```
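
The 16.16 fixed-point encoding of the timeout can be worked through in isolation. A minimal sketch, assuming a hypothetical helper `to_16dot16` that takes a `Duration` instead of whole seconds; it is not part of `reth-libmdbx`.

```rust
use std::time::Duration;

/// Converts a timeout to 16.16 fixed-point seconds: the upper 16 bits hold
/// whole seconds, the lower 16 bits hold 1/65536ths of a second.
/// Durations beyond the representable range saturate at `u32::MAX`.
pub fn to_16dot16(timeout: Duration) -> u32 {
    let whole = timeout.as_secs();
    if whole > u64::from(u16::MAX) {
        return u32::MAX;
    }
    // subsec_nanos() < 1_000_000_000, so the fraction always fits in 16 bits.
    let frac = (u64::from(timeout.subsec_nanos()) << 16) / 1_000_000_000;
    ((whole as u32) << 16) | frac as u32
}
```

For example, 1.5 seconds encodes as `(1 << 16) | 32768`. Because 0 is documented as "no timeout", a caller passing a duration shorter than 1/65536 of a second would end up disabling the timeout rather than shortening it, which may warrant rounding up in real code.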
### Step 2: Add `warmup()` to `DatabaseEnv`
**File:** `crates/storage/db/src/implementation/mdbx/mod.rs`
```rust
impl DatabaseEnv {
/// Spawns a background thread to warm up the database by loading pages into memory.
///
/// This uses MDBX's native `mdbx_env_warmup()` which asks the OS kernel to
/// prefetch database pages into the page cache, eliminating cold-start penalties.
pub fn warmup(&self) -> Result<(), DatabaseError> {
let inner = self.inner.clone();
std::thread::Builder::new()
.name("reth-db-warmup".to_string())
.spawn(move || {
let start = std::time::Instant::now();
reth_tracing::tracing::info!(
target: "reth::db",
"Starting database warmup..."
);
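// `MDBX_warmup_default` would only request an async prefetch via madvise(MADV_WILLNEED).
// `MDBX_warmup_force` synchronously touches every allocated page; `MDBX_warmup_oomsafe`
// uses syscalls instead of direct access to avoid the OOM killer.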
let flags = ffi::MDBX_warmup_force | ffi::MDBX_warmup_oomsafe;
match inner.warmup(flags, None) {
Ok(_) => {
reth_tracing::tracing::info!(
target: "reth::db",
elapsed = ?start.elapsed(),
"Database warmup complete"
);
}
Err(e) => {
reth_tracing::tracing::warn!(
target: "reth::db",
%e,
"Database warmup failed"
);
}
}
})
.map_err(|e| DatabaseError::Other(e.to_string()))?;
Ok(())
}
}
```
### Step 3: Add `enable_warmup` to `DatabaseArguments`
**File:** `crates/storage/db/src/implementation/mdbx/mod.rs`
Add to `DatabaseArguments`:
```rust
pub struct DatabaseArguments {
// ... existing fields ...
/// Whether to warm up the database by loading pages into memory at startup.
enable_warmup: bool,
}
```
With builder method:
```rust
impl DatabaseArguments {
pub const fn with_warmup(mut self, enable: bool) -> Self {
self.enable_warmup = enable;
self
}
}
```
### Step 4: Call warmup in `init_db`
**File:** `crates/storage/db/src/mdbx.rs`
After the database is opened in `init_db_for`, conditionally start warmup:
```rust
pub fn init_db_for<P: AsRef<Path>, TS: TableSet>(
path: P,
args: DatabaseArguments,
) -> eyre::Result<DatabaseEnv> {
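// Note: `enable_warmup` is a private field of `DatabaseArguments`, which lives in another
// module, so this read needs a getter or `pub(crate)` visibility.
let enable_warmup = args.enable_warmup;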
let client_version = args.client_version().clone();
let mut db = create_db(path, args)?;
db.create_and_track_tables_for::<TS>()?;
db.record_client_version(client_version)?;
drop_orphan_tables(&db);
if enable_warmup {
if let Err(e) = db.warmup() {
reth_tracing::tracing::warn!(
target: "reth::db",
%e,
"Failed to start database warmup"
);
}
}
Ok(db)
}
```
### Step 5: Add CLI flag
Add `--db.warmup` flag to the database arguments, plumbed through the existing `DatabaseArgs` CLI struct.
**File:** `crates/node/core/src/args/database.rs` (or equivalent)
```rust
/// Enable database warmup at startup to prime the OS page cache.
#[arg(long = "db.warmup", default_value_t = false)]
pub warmup: bool,
```
## Alternative Approach: Manual File Reading
If we wanted to avoid touching the libmdbx wrapper, we could do what Nethermind does — manually read the `mdbx.dat` file:
```rust
fn warmup_file(path: PathBuf) {
std::thread::Builder::new()
.name("reth-db-warmup".to_string())
.spawn(move || {
let file_path = path.join("mdbx.dat");
let file = match std::fs::File::open(&file_path) {
Ok(f) => f,
Err(e) => {
warn!(target: "reth::db", %e, "Failed to open mdbx.dat for warmup");
return;
}
};
// Hint sequential access
#[cfg(unix)]
{
use std::os::unix::io::AsRawFd;
unsafe {
libc::posix_fadvise(file.as_raw_fd(), 0, 0, libc::POSIX_FADV_SEQUENTIAL);
}
}
let file_size = file.metadata().map(|m| m.len()).unwrap_or(0);
let mut reader = std::io::BufReader::with_capacity(1024 * 1024, file); // 1MB buffer
let mut buf = vec![0u8; 1024 * 1024];
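let mut total_read: u64 = 0;
// Reads may be short, so track the next 1 GiB threshold rather than testing for an exact multiple.
let mut next_report: u64 = 1024 * 1024 * 1024;
let start = std::time::Instant::now();
loop {
match std::io::Read::read(&mut reader, &mut buf) {
Ok(0) => break,
Ok(n) => {
total_read += n as u64;
if total_read >= next_report {
next_report += 1024 * 1024 * 1024;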
info!(
target: "reth::db",
progress = format!("{:.1}%", total_read as f64 / file_size as f64 * 100.0),
elapsed = ?start.elapsed(),
"Database warmup progress"
);
}
}
Err(e) => {
warn!(target: "reth::db", %e, "Database warmup read error");
break;
}
}
}
info!(
target: "reth::db",
elapsed = ?start.elapsed(),
size_gb = total_read / (1024 * 1024 * 1024),
"Database warmup complete"
);
})
.ok();
}
```
**However, the MDBX native approach is strongly preferred** because MDBX uses mmap — the data file is already memory-mapped, and `mdbx_env_warmup` correctly faults in only the allocated pages through the existing mapping, which is what the page cache actually needs.
## Recommendation
**Use the MDBX native `mdbx_env_warmup()` API.** It's:
- Already available in the FFI bindings (auto-generated from mdbx.h)
- Handles mmap'd files correctly (touches pages through the mapping)
- OOM-safe with `MDBX_warmup_oomsafe` flag
- Supports timeouts
- Ignores lock files and only warms the data file
- ~30 lines of new Rust code across 4 files
- Zero new dependencies
The recommended flags for production:
- `MDBX_warmup_force | MDBX_warmup_oomsafe` — synchronously touches all pages, OOM-safe
- For async prefetch only (lighter): `MDBX_warmup_default` (just calls `madvise(MADV_WILLNEED)`)
## Files to Modify
| File | Change |
|------|--------|
| `crates/storage/libmdbx-rs/src/environment.rs` | Add `warmup()` method to `Environment` |
| `crates/storage/db/src/implementation/mdbx/mod.rs` | Add `warmup()` to `DatabaseEnv`, add `enable_warmup` to `DatabaseArguments` |
| `crates/storage/db/src/mdbx.rs` | Call warmup in `init_db_for` |
| `crates/node/core/src/args/database.rs` | Add `--db.warmup` CLI flag |
## Risks
- **Very low risk.** The warmup runs in a background thread and does not block DB operations, although a forced warmup adds read I/O while it runs.
- If warmup fails, it logs a warning and the node continues normally.
- The OOM-safe flag reduces the risk of the process being killed if memory is insufficient.
- The MDBX `mdbx_env_warmup` function is mature — used by mdbx_dump, mdbx_chk, and mdbx_copy tools.

View File

@@ -0,0 +1,187 @@
# Hot State LRU Cache Exploration
## 1. Current State Read Path
### EVM SLOAD → MDBX Read Trace
```
EVM SLOAD instruction
→ revm State<DB>::storage()
→ revm checks CacheAccount in-memory (per-block cache)
→ on miss: StateProviderDatabase<DB>::storage_ref()
→ StateProvider::storage(address, storage_key)
→ LatestStateProviderRef::storage() [latest.rs:155-167]
→ tx.cursor_dup_read::<PlainStorageState>()
→ seek_by_key_subkey(account, storage_key)
→ MDBX B-tree lookup + Compact decode
```
### Key files in the path:
- **StateProvider trait**: [`crates/storage/storage-api/src/state.rs#L34-L48`](file:///home/ubuntu/reth/crates/storage/storage-api/src/state.rs#L34-L48) — defines `fn storage(&self, account: Address, storage_key: StorageKey) -> ProviderResult<Option<StorageValue>>`
- **LatestStateProviderRef**: [`crates/storage/provider/src/providers/state/latest.rs#L155-L167`](file:///home/ubuntu/reth/crates/storage/provider/src/providers/state/latest.rs#L155-L167) — opens a dup cursor on `PlainStorageState` table, seeks by key+subkey
- **StateProviderDatabase**: [`crates/revm/src/database.rs#L105-L171`](file:///home/ubuntu/reth/crates/revm/src/database.rs#L105-L171) — bridges revm `Database` trait to reth `EvmStateProvider`
- **BasicBlockExecutor**: [`crates/evm/evm/src/execute.rs#L528-L541`](file:///home/ubuntu/reth/crates/evm/evm/src/execute.rs#L528-L541) — wraps DB in `State::builder().with_database(db).with_bundle_update().without_state_clear().build()`
### Account reads follow a similar path:
```
EVM basic() → StateProviderDatabase::basic_ref()
→ AccountReader::basic_account(address)
→ LatestStateProviderRef::basic_account() [latest.rs:40-42]
→ tx.get_by_encoded_key::<PlainAccountState>(address)
→ MDBX point lookup + Compact decode
```
## 2. Existing Caching Mechanisms
### 2a. Revm's Per-Block `State<DB>` Cache (intra-block)
The `BasicBlockExecutor` creates `State::builder().with_database(db).with_bundle_update().without_state_clear().build()`. Revm's `State` maintains a `CacheAccount` map internally — once an account/slot is read within a block execution, subsequent reads within the same block are served from memory. However, this cache is **per-block** and does not persist across blocks.
### 2b. `ExecutionCache` / `CachedStateProvider` (cross-block, engine tree)
**This is the primary existing cross-block cache.** Located in [`crates/engine/tree/src/tree/cached_state.rs`](file:///home/ubuntu/reth/crates/engine/tree/src/tree/cached_state.rs).
Architecture:
- Uses `fixed-cache` crate (v0.1.7) — a concurrent, fixed-size hash map with O(1) epoch-based invalidation
- Three separate caches: accounts (`Address → Option<Account>`), storage (`(Address, StorageKey) → StorageValue`), bytecode (`B256 → Option<Bytecode>`)
- Default allocation: 88.88% storage, 5.56% accounts, 5.56% bytecode
- `CachedStateProvider` wraps any `StateProvider` and intercepts reads
- Cross-block persistence via `SavedCache` — tracks block hash and uses `Arc<()>` reference counting to ensure exclusive access
- `PayloadExecutionCache` (in `payload_processor/mod.rs`) stores a `SavedCache` behind `Arc<RwLock<Option<SavedCache>>>` and matches on `parent_hash` to reuse across sequential blocks
- After block execution, `insert_state(&BundleState)` updates the cache with all modified accounts/storage/code from the block
- On reorg (parent hash mismatch): clears and reuses the cache structure
**Invalidation strategy:**
- After each block: `insert_state()` overwrites entries for all modified accounts/slots
- On `SELFDESTRUCT` of a contract: clears entire account + storage caches (rare post-Dencun)
- On fork/reorg: `SavedCache` is cleared if parent hash doesn't match
- Epoch-based bulk invalidation via `fixed-cache`'s `EPOCHS` feature
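
Epoch-based bulk invalidation, the last strategy listed above, can be illustrated generically. This sketch shows the general technique only and makes no claim about how `fixed-cache` implements it; `EpochCache` is a hypothetical type built on a plain `HashMap`.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Cache whose entries are tagged with the epoch in which they were written.
/// Bumping the epoch hides every older entry without touching it.
pub struct EpochCache<K, V> {
    epoch: u64,
    entries: HashMap<K, (u64, V)>,
}

impl<K: Hash + Eq, V: Clone> EpochCache<K, V> {
    pub fn new() -> Self {
        Self { epoch: 0, entries: HashMap::new() }
    }

    /// Stores a value tagged with the current epoch.
    pub fn insert(&mut self, key: K, value: V) {
        self.entries.insert(key, (self.epoch, value));
    }

    /// Returns the value only if it was written in the current epoch.
    pub fn get(&self, key: &K) -> Option<V> {
        match self.entries.get(key) {
            Some((epoch, value)) if *epoch == self.epoch => Some(value.clone()),
            _ => None,
        }
    }

    /// O(1) bulk invalidation, for example on a parent-hash mismatch.
    pub fn invalidate_all(&mut self) {
        self.epoch += 1;
    }
}
```

Stale entries stay allocated until they are overwritten, which is the trade-off for not having to walk the whole table on a reorg.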
### 2c. Overlay State Provider (trie computation)
[`crates/storage/provider/src/providers/state/overlay.rs`](file:///home/ubuntu/reth/crates/storage/provider/src/providers/state/overlay.rs) — uses `DashMap<BlockNumber, Overlay>` to cache trie updates and hashed post-state per block. This is for trie/proof computation, NOT for EVM state reads.
### 2d. Other caches in the codebase
- **RPC layer**: `schnellru::LruMap` for blocks/receipts/headers in [`crates/rpc/rpc-eth-types/src/cache/`](file:///home/ubuntu/reth/crates/rpc/rpc-eth-types/src/cache)
- **Networking**: `schnellru` for peer discovery, DNS, invalid headers
- **Precompile cache**: `DashMap` in [`crates/engine/tree/src/tree/precompile_cache.rs`](file:///home/ubuntu/reth/crates/engine/tree/src/tree/precompile_cache.rs)
- **RPC bytecode**: `DashMap` code_store in `crates/storage/rpc-provider/`
## 3. Where Should a New Cache Live?
### Current situation: The cache ALREADY EXISTS
The `CachedStateProvider` + `ExecutionCache` system in `crates/engine/tree/src/tree/cached_state.rs` is already a hot-state cache for the engine's block execution path (it evicts on hash collision rather than by strict LRU order). It:
- Caches accounts, storage, and bytecode
- Persists across blocks via `SavedCache`
- Uses `fixed-cache` (open-addressing hash map, power-of-two sized, epoch-tagged)
- Is updated after each block with `BundleState` diffs
- Is cleared on reorg
### What's NOT cached
1. **RPC reads**: When an `eth_call` or `eth_getStorageAt` hits the latest state provider, it goes directly through `LatestStateProviderRef` → MDBX. The `CachedStateProvider` is only used in the engine tree's block execution path, not for RPC reads.
2. **Historical state reads**: `HistoricalStateProviderRef` always hits the DB + changesets.
3. **Staged sync reads**: During initial sync, the pipeline stages read directly from MDBX.
### Potential layers for an additional/shared cache
| Layer | Pros | Cons |
|-------|------|------|
| **Wrap `LatestStateProviderRef`** | Simple, catches all latest-state reads including RPC | Only helps latest state, not historical; must invalidate on every block commit |
| **Wrap `DatabaseProvider`** | Catches everything including RPC, historical state | Complex invalidation; historical reads are less cacheable |
| **Inside `ConsistentDbView`** | Natural boundary for trie computation | Too narrow scope; only used for proof generation |
| **Shared `ExecutionCache` exposed to RPC** | Reuses existing infrastructure; engine already populates it | Thread safety concerns; engine execution shouldn't be slowed by RPC reads; cache is currently tied to payload processor lifecycle |
### Recommendation
The highest-impact approach would be to **share the existing `ExecutionCache` with RPC providers**. The engine tree already populates the cache with every account/storage/code touched during block execution. If RPC `eth_call` and `eth_getStorageAt` could read from this same cache, many hot-state reads would be served from memory.
The second approach would be a **new cache layer wrapping `LatestStateProviderRef`** specifically for RPC, populated on read-through (miss → MDBX → insert).
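
A minimal sketch of the read-through pattern (miss → backend → insert), assuming a hypothetical `StorageBackend` trait in place of `LatestStateProviderRef` and an unbounded `HashMap` in place of a bounded LRU. `CountingBackend` exists only to make the sketch testable; none of these names come from Reth.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::RwLock;

type Address = [u8; 20];

/// Hypothetical stand-in for the provider that ultimately reads from MDBX.
trait StorageBackend {
    fn storage(&self, address: Address, slot: u64) -> Option<u64>;
}

/// In-memory backend that counts reads (illustration only).
#[derive(Default)]
struct CountingBackend {
    data: HashMap<(Address, u64), u64>,
    reads: AtomicUsize,
}

impl StorageBackend for CountingBackend {
    fn storage(&self, address: Address, slot: u64) -> Option<u64> {
        self.reads.fetch_add(1, Ordering::Relaxed);
        self.data.get(&(address, slot)).copied()
    }
}

/// Read-through cache: a miss falls through to the backend and is then cached.
struct ReadThroughCache<B> {
    backend: B,
    entries: RwLock<HashMap<(Address, u64), Option<u64>>>,
}

impl<B: StorageBackend> ReadThroughCache<B> {
    fn new(backend: B) -> Self {
        Self { backend, entries: RwLock::new(HashMap::new()) }
    }

    fn storage(&self, address: Address, slot: u64) -> Option<u64> {
        let key = (address, slot);
        // The read guard is released at the end of this statement.
        let cached = self.entries.read().unwrap().get(&key).copied();
        if let Some(hit) = cached {
            return hit;
        }
        let value = self.backend.storage(address, slot);
        self.entries.write().unwrap().insert(key, value);
        value
    }

    /// Drops entries modified by a committed block so later reads refetch them.
    fn invalidate(&self, touched: &[(Address, u64)]) {
        let mut entries = self.entries.write().unwrap();
        for key in touched {
            entries.remove(key);
        }
    }
}
```

The map here never evicts; a production version would need a size bound (for example an LRU behind the lock) and would have to call `invalidate` on every block commit, which is the main cost noted in the table above.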
## 4. Implementation Options
### Available libraries (already in Cargo.toml)
| Library | Version | Current usage |
|---------|---------|---------------|
| `schnellru` | 0.2 | Networking, RPC block/receipt cache, invalid headers |
| `dashmap` | 6.0 | Overlay cache, precompile cache, static file jars |
| `fixed-cache` | 0.1.7 | `ExecutionCache` (accounts, storage, bytecode) |
### Cache data structures comparison
| Structure | Thread-safe | Eviction | Best for |
|-----------|-------------|----------|----------|
| `fixed-cache` | Yes (lock-free) | Open-addressing collision | High-throughput concurrent reads/writes (engine) |
| `schnellru::LruMap` | No (needs external lock) | True LRU | Single-threaded or behind RwLock |
| `DashMap` | Yes (sharded) | None (manual) | Concurrent maps without size bounds |
| `parking_lot::RwLock<schnellru::LruMap>` | Yes | True LRU | Moderate-throughput shared cache |
### Cache key design
```rust
// Accounts
// Key:   Address (20 bytes)
// Value: Option<Account> (nonce: u64, balance: U256, bytecode_hash: Option<B256>), 72 bytes
// Storage
// Key:   (Address, StorageKey) (20 + 32 = 52 bytes)
// Value: StorageValue (U256, 32 bytes)
// Entry sizes (fixed-cache, 128-byte aligned):
// Account entry: 128 bytes
// Storage entry: 128 bytes
```
### Sizing estimates
| Cache | Entries | Memory |
|-------|---------|--------|
| Accounts | 10K | ~1.3 MB |
| Storage slots | 100K | ~12.8 MB |
| Bytecode | 1K | Variable (up to 24KB per contract × 1K ≈ 24 MB) |
| **Total** | — | **~14-38 MB** |
The existing `ExecutionCache` already allows configurable total size.
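
The figures in the table follow from simple arithmetic, reproduced here as a small sketch. It assumes the 128-byte entry size stated above and uses the EIP-170 maximum contract size (24,576 bytes) as the per-contract upper bound; the function names are illustrative only.

```rust
/// Entry size assumed in the table above (fixed-cache, 128-byte aligned).
const ENTRY_BYTES: u64 = 128;
/// Maximum contract code size under EIP-170, used as the bytecode upper bound.
const MAX_CODE_BYTES: u64 = 24_576;

/// Memory used by `entries` fixed-size cache entries.
fn fixed_entries_bytes(entries: u64) -> u64 {
    entries * ENTRY_BYTES
}

/// Returns (lower, upper) bounds in bytes: the lower bound ignores bytecode,
/// the upper bound assumes every cached contract has maximum-size code.
fn total_bounds_bytes(accounts: u64, slots: u64, contracts: u64) -> (u64, u64) {
    let fixed = fixed_entries_bytes(accounts) + fixed_entries_bytes(slots);
    (fixed, fixed + contracts * MAX_CODE_BYTES)
}
```

With 10,000 accounts, 100,000 slots, and 1,000 contracts this gives 14,080,000 and 38,656,000 bytes, matching the ~14-38 MB range in the table.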
## 5. Risk Analysis
### Cache invalidation correctness
| Scenario | Current handling in `ExecutionCache` |
|----------|--------------------------------------|
| Normal block execution | `insert_state(&BundleState)` overwrites touched entries |
| Reorg to different fork | `SavedCache::clear()` if parent hash mismatch |
| `SELFDESTRUCT` (pre-Dencun) | Clears entire account + storage cache |
| RPC reads concurrent with execution | NOT handled — cache is engine-only today |
### Key risks of sharing the cache with RPC
1. **Stale reads during block execution**: If RPC reads from the cache while a block is mid-execution, it could see partially-updated state. Mitigation: epoch tagging in `fixed-cache` + only expose the cache *after* block commit.
2. **Cache pollution from RPC workloads**: RPC `eth_call` traces could evict hot engine entries. Mitigation: read-only access for RPC (no insertion on miss), or separate read-through cache.
3. **Memory pressure**: Adding RPC reads increases cache pressure but doesn't increase memory usage if the cache size is fixed.
4. **Consensus correctness**: The cache is advisory — every miss falls through to MDBX which is the source of truth. Cache corruption would cause performance degradation, not consensus failures, as long as reads always fall through on miss.
### Thread safety
`fixed-cache` is lock-free and designed for concurrent access. `DashMap` uses sharded locks. Both are safe for multi-threaded use. The `SavedCache` lifecycle management (via `Arc<()>` guards) ensures exclusive mutation access.
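
One way an `Arc<()>` guard can signal exclusive access is sketched below. This is an assumption about the general pattern, not a copy of `SavedCache`: the `GuardedCache` type and its methods are hypothetical.

```rust
use std::sync::Arc;

/// Hypothetical cache whose mutation is gated on an `Arc<()>` usage guard.
struct GuardedCache {
    usage_guard: Arc<()>,
    entries: Vec<u64>,
}

impl GuardedCache {
    fn new() -> Self {
        Self { usage_guard: Arc::new(()), entries: Vec::new() }
    }

    /// Hands out a guard; while any clone is alive the cache counts as in use.
    fn acquire(&self) -> Arc<()> {
        Arc::clone(&self.usage_guard)
    }

    /// True when no outstanding guards exist (only our own reference remains).
    fn is_available(&self) -> bool {
        Arc::strong_count(&self.usage_guard) == 1
    }

    /// Mutates only when nobody else holds a guard; otherwise reports failure.
    fn try_update(&mut self, value: u64) -> bool {
        if !self.is_available() {
            return false;
        }
        self.entries.push(value);
        true
    }
}
```

`Arc::strong_count` is only a snapshot, so this check is sound only if guards are handed out from the same thread that performs the update, or under additional synchronization.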
## 6. Conclusion
**Reth already has a sophisticated hot state cache** in `ExecutionCache` / `CachedStateProvider`. The gap is:
1. **RPC reads don't benefit from it** — the cache lives in the engine tree and is not exposed to the RPC provider layer
2. **The cache is tied to the `PayloadProcessor` lifecycle** — it's created/destroyed with the payload execution context
The highest-value next step would be to:
1. Make the `ExecutionCache` (or a read-only view of it) accessible to `LatestStateProviderRef` when serving RPC requests
2. Or create a second, read-through `schnellru`-based cache at the `LatestStateProviderRef` level that serves RPC reads, sized at ~10K accounts + ~100K storage slots (~15 MB)
Either approach avoids touching consensus-critical code — the cache is purely advisory, and all misses fall through to MDBX.


@@ -0,0 +1,307 @@
# Nethermind Engine Pipeline Performance Optimizations — Catalog
> Research date: 2026-02-06
> Target: NethermindEth/nethermind (C#/.NET)
> Goal: Identify techniques portable to Reth (`crates/engine/tree/`)
---
## Table of Contents
1. [State Prewarming (Parallel Transaction Pre-Execution)](#1-state-prewarming)
2. [Sender-Grouped Parallel Prewarming](#2-sender-grouped-parallel-prewarming)
3. [Address & Access List Prewarming](#3-address--access-list-prewarming)
4. [Parallel Bloom Calculation](#4-parallel-bloom-calculation)
5. [GC-Aware Block Processing Scheduling](#5-gc-aware-block-processing-scheduling)
6. [Engine API Memory Compaction & Decommit](#6-engine-api-memory-compaction--decommit)
7. [Object Pooling for Transaction Processing Envs](#7-object-pooling-for-transaction-processing-envs)
8. [PreBlockCaches with Clear-on-New-Block](#8-preblockcaches-with-clear-on-new-block)
9. [In-Memory State Pruning with Unpersisted Block Budget](#9-in-memory-state-pruning-with-unpersisted-block-budget)
10. [Dirty Node Sharded Cache](#10-dirty-node-sharded-cache)
11. [DB File Warmer (OS Page Cache Priming)](#11-db-file-warmer)
12. [Adaptive DB Tuning Modes](#12-adaptive-db-tuning-modes)
13. [HalfPath State Database Optimization](#13-halfpath-state-database-optimization)
14. [Array-Based LRU Cache (Allocation-Free)](#14-array-based-lru-cache)
15. [Large Array Pool (Oversized Buffer Reuse)](#15-large-array-pool)
16. [Recovery Queue Bypass for Single Blocks](#16-recovery-queue-bypass)
17. [Main Processing Thread Priority Elevation](#17-main-processing-thread-priority-elevation)
18. [Precompile Result Caching](#18-precompile-result-caching)
19. [Compact Receipt Storage](#19-compact-receipt-storage)
20. [Parallel Block Validation Work](#20-parallel-block-validation-work)
---
## 1. State Prewarming
- **Area**: Block processing / Payload handling
- **Technique**: `PreWarmStateOnBlockProcessing` — parallel speculative transaction execution
- **Description**: Before the main block processing loop begins, Nethermind spawns a background task that speculatively executes all transactions in the incoming block against read-only copies of the world state. This populates the node-storage cache and OS page cache with the trie nodes and storage slots the main processor will need. The main execution then hits warm caches instead of cold SSD reads. Nethermind's docs claim **up to 2x speed-up** on main-loop block processing. The prewarming runs concurrently with — and is cancelled when — the main processing completes.
- **Code location**: `src/Nethermind/Nethermind.Consensus/Processing/BlockCachePreWarmer.cs` → `PreWarmCaches()`, `WarmupTransactions()`
- **Evidence of impact**:
- Nethermind docs: "up to 2x speed-up in the main loop block processing" ([Performance Tuning](https://docs.nethermind.io/fundamentals/performance-tuning/))
- Enabled by default (`Blocks.PreWarmStateOnBlockProcessing = true`)
- Nethermind leads GigaGas benchmarks at 697 MGas/s mean throughput ([Blog](https://www.nethermind.io/blog/getting-ethereum-ready-for-gigagas))
- **Applicability to Reth**: Reth already has `crates/engine/tree/src/tree/payload_processor/prewarm.rs`. The key difference is that Nethermind groups transactions by sender (see #2) and actually runs full `TransactionProcessor.Warmup()` calls — not just address warming. Reth's issue [#17833](https://github.com/paradigmxyz/reth/issues/17833) proposes reusing prewarming results, which Nethermind does implicitly via its pre-block caches.
## 2. Sender-Grouped Parallel Prewarming
- **Area**: Block processing / Prefetching
- **Technique**: Group-by-sender parallelism with sequential intra-sender execution
- **Description**: Transactions are grouped by `SenderAddress`. Different sender groups are warmed in parallel (via `ParallelUnbalancedWork.For`), but transactions within the same sender are executed sequentially. This ensures that balance/nonce/storage changes from tx[N] are visible to tx[N+1] for the same sender, while still maximizing parallelism across independent senders.
- **Code location**: `BlockCachePreWarmer.cs` → `WarmupTransactions()`, `GroupTransactionsBySender()`
- **Evidence of impact**: Without sender grouping, prewarming of same-sender transaction chains would produce incorrect state reads, leading to cache misses on the main thread. This is a correctness optimization that makes prewarming effective rather than just noisy.
- **Applicability to Reth**: Reth's `prewarm.rs` should consider this sender-grouping pattern. Currently transaction prewarming may not properly handle intra-sender dependencies, reducing cache hit rates on the main execution thread.
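
A rough Rust sketch of the grouping idea, using only the standard library. The `Tx` type, `group_by_sender`, and `prewarm` are hypothetical; `warm` is a placeholder for speculative execution, and spawning one thread per sender group is a simplification (a real implementation would use a bounded pool such as rayon).

```rust
use std::collections::HashMap;
use std::thread;

/// Toy transaction: one-byte sender id plus nonce (illustration only).
#[derive(Clone, Debug, PartialEq)]
struct Tx {
    sender: u8,
    nonce: u64,
}

/// Groups transactions by sender, preserving block order inside each group.
fn group_by_sender(txs: &[Tx]) -> Vec<Vec<Tx>> {
    let mut index: HashMap<u8, usize> = HashMap::new();
    let mut groups: Vec<Vec<Tx>> = Vec::new();
    for tx in txs {
        let slot = *index.entry(tx.sender).or_insert_with(|| {
            groups.push(Vec::new());
            groups.len() - 1
        });
        groups[slot].push(tx.clone());
    }
    groups
}

/// Warms sender groups in parallel; transactions within a group run in order.
/// Returns the number of transactions passed to `warm`.
fn prewarm(groups: &[Vec<Tx>], warm: impl Fn(&Tx) + Sync) -> usize {
    let warm = &warm;
    thread::scope(|scope| {
        let handles: Vec<_> = groups
            .iter()
            .map(|group| {
                scope.spawn(move || {
                    // Sequential within one sender so nonce/balance changes
                    // from tx[N] are visible to tx[N+1].
                    for tx in group {
                        warm(tx);
                    }
                    group.len()
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum::<usize>()
    })
}
```

For a block with senders `[1, 2, 1]`, `group_by_sender` yields two groups, and both of sender 1's transactions are warmed on the same thread in nonce order.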
## 3. Address & Access List Prewarming
- **Area**: Block processing / Prefetching
- **Technique**: Parallel warming of sender/recipient addresses, withdrawal addresses, and EIP-2930 access lists
- **Description**: Separate from full transaction prewarming, Nethermind runs an `AddressWarmer` as a `ThreadPool` work item that warms up:
1. All sender and recipient addresses (account trie nodes)
2. Withdrawal addresses
3. System transaction access lists (beacon root, blockhash store)
4. EIP-2930 access lists from each transaction
This runs concurrently with the transaction prewarming via `ThreadPool.UnsafeQueueUserWorkItem`.
- **Code location**: `BlockCachePreWarmer.cs` → `AddressWarmer` class, `WarmupAddresses()`, `WarmupWithdrawals()`
- **Evidence of impact**: Address warming is a lightweight operation that ensures account-level trie nodes are in cache before the heavier transaction execution touches them.
- **Applicability to Reth**: Reth's prewarming could separate address/trie-node warming from full EVM execution warming. Address warming is cheap and has very high hit rates.
## 4. Parallel Bloom Calculation
- **Area**: Block processing
- **Technique**: `ParallelUnbalancedWork.For` for receipt bloom computation
- **Description**: After transactions are processed, Nethermind calculates bloom filters for all receipts in parallel using `ParallelUnbalancedWork.For(0, receipts.Length, ...)`. Each receipt's bloom is computed independently.
- **Code location**: `BlockProcessor.cs` → `CalculateBlooms()`
- **Evidence of impact**: For blocks with hundreds of transactions, this parallelizes a CPU-bound operation that was previously serial.
- **Applicability to Reth**: Reth already computes receipt roots in a background task (`receipt_root_task.rs`). Bloom computation could similarly be parallelized with rayon if not already done.
## 5. GC-Aware Block Processing Scheduling
- **Area**: Engine pipeline
- **Technique**: `GCScheduler` integration with block processing loop
- **Description**: Nethermind's `BlockchainProcessor` integrates with a `GCScheduler` that controls background garbage collection timing:
- When a block arrives, background GC timer is switched **off** (`SwitchOffBackgroundGC`)
- After processing completes and queue is empty, background GC timer is switched **on** (`SwitchOnBackgroundGC`)
This prevents GC pauses during active block processing while allowing cleanup during idle periods between blocks.
- **Code location**: `BlockchainProcessor.cs` → `RunProcessingLoop()` calls to `GCScheduler.Instance`
- **Evidence of impact**: .NET GC pauses can be 10-100ms. Avoiding them during the 12-second block slot is critical for attestation timing.
- **Applicability to Reth**: Rust doesn't have GC, but the principle maps to: avoid heavy background work (compaction, flushing, persistence) during active block processing. Reth's persistence state machine could be more careful about when it triggers heavy DB operations.
## 6. Engine API Memory Compaction & Decommit
- **Area**: Engine pipeline / Memory management
- **Technique**: Configurable LOH compaction and memory decommit tied to Engine API call frequency
- **Description**: Nethermind has three config knobs under `Merge.*`:
- `CollectionsPerDecommit` (default 25): After N engine API calls, request the OS to release process memory
- `CompactMemory` (default Yes): Compact the Large Object Heap (LOH) after GC
- `SweepMemory` (implicit Gen2): Controls which GC generation triggers compaction
This addresses .NET's tendency to hold onto freed memory, keeping RSS high.
- **Code location**: `Nethermind.Merge` configuration, engine API handler post-processing
- **Evidence of impact**: Without this, Nethermind nodes can show 40GB+ memory usage due to LOH fragmentation ([Issue #8020](https://github.com/NethermindEth/nethermind/issues/8020)).
- **Applicability to Reth**: Rust doesn't have LOH/GC but the equivalent concern is jemalloc arena fragmentation and mmap retention. Reth could periodically call `jemalloc_ctl::epoch` and `purge` after processing bursts.
## 7. Object Pooling for Transaction Processing Envs
- **Area**: Block processing / Memory management
- **Technique**: `ObjectPool<IReadOnlyTxProcessorSource>` for prewarming environments
- **Description**: Prewarming needs to create multiple isolated world-state views and transaction processors. Rather than allocating these per-block, Nethermind uses `ObjectPool` (`_envPool`) to reuse `IReadOnlyTxProcessorSource` instances. Each parallel warmup thread Gets from the pool, uses it, and Returns it.
- **Code location**: `BlockCachePreWarmer.cs` → `_envPool`, `ReadOnlyTxProcessingEnvPooledObjectPolicy`
- **Evidence of impact**: Avoids heavy per-block allocation of state provider instances. With blocks arriving every 12 seconds, this prevents significant allocation pressure.
- **Applicability to Reth**: Reth could pool `EvmState` or database cursor objects used during prewarming rather than recreating them per block.
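
The get/use/return cycle can be sketched with a small mutex-protected pool. The `Pool` type below is hypothetical and generic; what Reth would actually pool (EVM state, cursors) is left open.

```rust
use std::sync::Mutex;

/// Hypothetical object pool: reuses returned items instead of reallocating.
struct Pool<T> {
    idle: Mutex<Vec<T>>,
    make: fn() -> T,
}

impl<T> Pool<T> {
    fn new(make: fn() -> T) -> Self {
        Self { idle: Mutex::new(Vec::new()), make }
    }

    /// Takes an idle item if one exists, otherwise builds a fresh one.
    fn get(&self) -> T {
        // The lock is released before `make` runs.
        let reused = self.idle.lock().unwrap().pop();
        reused.unwrap_or_else(self.make)
    }

    /// Returns an item to the pool; callers should reset it first.
    fn put(&self, item: T) {
        self.idle.lock().unwrap().push(item);
    }

    /// Number of items currently waiting for reuse.
    fn idle_len(&self) -> usize {
        self.idle.lock().unwrap().len()
    }
}
```

A warmup worker would call `get()`, run its transactions, clear any per-block state, and `put()` the environment back so the next block reuses the allocation.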
## 8. PreBlockCaches with Clear-on-New-Block
- **Area**: Caching / Block processing
- **Technique**: Dedicated pre-block caches populated by prewarming, consumed by main processing, cleared per block
- **Description**: Nethermind maintains `PreBlockCaches` that sit between the prewarming pass and the main execution. Prewarming fills these caches (account state, storage, code, RLP-encoded trie nodes). The main processor reads from them. On a new block, `ClearCaches()` is called immediately. A separate `NodeStorageCache` captures RLP-encoded trie node reads and is also cleared and toggled per block. If caches are non-empty when a new block arrives, a warning is logged.
- **Code location**: `BlockCachePreWarmer.cs` → `PreWarmCaches()` entry point, `ClearCaches()`
- **Evidence of impact**: This is the mechanism that makes prewarming effective — without the cache bridge, prewarming would only help the OS page cache.
- **Applicability to Reth**: Reth's `cached_state.rs` provides similar per-block caching. The key insight is Nethermind's explicit clear-on-new-block + warning-if-stale pattern, which catches bugs in cache lifecycle management.
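
The clear-and-warn lifecycle might look like this in Rust. The `PreBlockCache` type, its toy key, and the `stale_clears` counter are assumptions for illustration; the warning goes to stderr here, where a node would use its logging framework.

```rust
use std::collections::HashMap;

/// Hypothetical per-block cache with a clear-on-new-block lifecycle.
#[derive(Default)]
struct PreBlockCache {
    /// Toy storage map: (address byte, slot) -> value.
    storage: HashMap<(u8, u64), u64>,
    /// How many times leftover entries were found at block start.
    stale_clears: u64,
}

impl PreBlockCache {
    /// Called when a new block arrives. Returns `true` if stale entries were
    /// found, which points at a lifecycle bug in the previous block's handling.
    fn start_block(&mut self, block_number: u64) -> bool {
        let stale = !self.storage.is_empty();
        if stale {
            self.stale_clears += 1;
            eprintln!("warning: pre-block cache not empty at start of block {block_number}");
            self.storage.clear();
        }
        stale
    }

    /// Normal path: the block is done, so its cache contents are dropped.
    fn finish_block(&mut self) {
        self.storage.clear();
    }
}
```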
## 9. In-Memory State Pruning with Unpersisted Block Budget
- **Area**: Caching / Persistence
- **Technique**: `Pruning.MaxUnpersistedBlockCount` — keep up to 297 blocks of state diffs in memory before flushing
- **Description**: Nethermind keeps state diffs for recent blocks in memory (default: 297 blocks ≈ 1 hour of mainnet). State is only persisted to DB when this budget is exceeded. The `MinUnpersistedBlockCount` prevents over-eager flushing. This reduces SSD writes by batching state commits.
- **Code location**: Nethermind.Pruning configuration, state commit logic
- **Evidence of impact**: Combined with `Pruning.CacheMb` (default 1280 MB), this reduces SSD write amplification by ~3x when the cache is increased to 2000 MB.
- **Applicability to Reth**: Reth's persistence state machine flushes after configurable intervals. The insight is that keeping MORE state in memory (hundreds of blocks) with explicit budgeting dramatically reduces I/O pressure during block processing.
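
A sketch of explicit budgeting, under the assumption that exceeding the maximum triggers a flush down to the minimum. That exact interplay between the two limits is a guess made for illustration, not a description of Nethermind's or Reth's logic, and `UnpersistedBudget` is a hypothetical type.

```rust
use std::collections::VecDeque;

/// Hypothetical budget tracker for blocks held in memory before persistence.
struct UnpersistedBudget {
    max_unpersisted: usize,
    min_unpersisted: usize,
    /// Block numbers executed but not yet persisted, oldest first.
    pending: VecDeque<u64>,
}

impl UnpersistedBudget {
    fn new(min_unpersisted: usize, max_unpersisted: usize) -> Self {
        assert!(min_unpersisted <= max_unpersisted);
        Self { max_unpersisted, min_unpersisted, pending: VecDeque::new() }
    }

    /// Records an executed block and returns the blocks that should now be
    /// persisted, oldest first. Nothing is returned until the budget is exceeded.
    fn on_block_executed(&mut self, block_number: u64) -> Vec<u64> {
        self.pending.push_back(block_number);
        let mut to_persist = Vec::new();
        if self.pending.len() > self.max_unpersisted {
            // Flush in one batch, but keep the most recent blocks in memory.
            while self.pending.len() > self.min_unpersisted {
                if let Some(oldest) = self.pending.pop_front() {
                    to_persist.push(oldest);
                }
            }
        }
        to_persist
    }
}
```

With `min = 2` and `max = 4`, the first four blocks return nothing and the fifth returns blocks 1 through 3 as a single batch.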
## 10. Dirty Node Sharded Cache
- **Area**: Caching / Block processing
- **Technique**: `Pruning.DirtyNodeShardBit` — shard the dirty trie node cache into 2^N segments
- **Description**: The dirty node cache (nodes modified but not yet persisted) is sharded by key hash into multiple segments (default 2^8 = 256 shards). This reduces lock contention when multiple threads write to the cache concurrently during block processing and prewarming.
- **Code location**: Nethermind.Pruning configuration (`DirtyNodeShardBit = 8`)
- **Evidence of impact**: Reduces lock contention on the hot path of trie modification during parallel operations.
- **Applicability to Reth**: Reth's in-memory state uses `HashMap`-based structures. Sharding these by key prefix (similar to `DashMap` or striped locks) would reduce contention in multi-threaded scenarios like parallel state root computation.
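
A minimal sharded map illustrating the 2^N-segments idea with per-shard mutexes. `ShardedCache` is a hypothetical type built on the standard library; in practice `DashMap` already provides this pattern.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

/// Hypothetical cache split into 2^shard_bits independently locked shards.
struct ShardedCache<K, V> {
    shards: Vec<Mutex<HashMap<K, V>>>,
    mask: usize,
}

impl<K: Hash + Eq, V: Clone> ShardedCache<K, V> {
    fn new(shard_bits: u32) -> Self {
        let count = 1usize << shard_bits;
        Self {
            shards: (0..count).map(|_| Mutex::new(HashMap::new())).collect(),
            mask: count - 1,
        }
    }

    /// Picks a shard from the low bits of the key's hash.
    fn shard_index(&self, key: &K) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() as usize) & self.mask
    }

    /// Locks only the shard that owns `key`, leaving other shards uncontended.
    fn insert(&self, key: K, value: V) {
        let index = self.shard_index(&key);
        self.shards[index].lock().unwrap().insert(key, value);
    }

    fn get(&self, key: &K) -> Option<V> {
        let shard = self.shards[self.shard_index(key)].lock().unwrap();
        shard.get(key).cloned()
    }
}
```

`ShardedCache::new(8)` creates 256 shards, mirroring the `DirtyNodeShardBit = 8` default mentioned above.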
## 11. DB File Warmer
- **Area**: Caching / Startup
- **Technique**: `Db.StateDbEnableFileWarmer` — sequentially read all DB files at startup to prime OS page cache
- **Description**: On startup, Nethermind optionally reads through all state database files to force them into the OS page cache. Without this, natural cache warming can take **weeks**. This is recommended for machines with 128GB+ RAM where the entire state DB can fit in OS cache.
- **Code location**: Nethermind.Db configuration
- **Evidence of impact**: Eliminates cold-start SSD latency that can persist for weeks after restart.
- **Applicability to Reth**: Reth could implement a startup DB warming pass using `madvise(MADV_WILLNEED)` or sequential reads. This is particularly valuable for MDBX which is memory-mapped. A simple background thread doing sequential reads of the state DB would achieve this.
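
The sequential-read variant can be sketched with the standard library alone. `warm_file` is a hypothetical helper; the 1 MiB chunk size is an arbitrary choice, and whether pages stay resident afterwards depends on available RAM and the OS.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Reads a file front to back in 1 MiB chunks and discards the data, so the
/// OS pulls its pages into the page cache. Returns the number of bytes read.
fn warm_file(path: &Path) -> io::Result<u64> {
    let mut file = File::open(path)?;
    let mut buffer = vec![0u8; 1 << 20];
    let mut total = 0u64;
    loop {
        let read = file.read(&mut buffer)?;
        if read == 0 {
            break; // end of file
        }
        total += read as u64;
    }
    Ok(total)
}
```

Running this on a background thread at startup keeps it off the critical path; a memory-mapped database could instead use the kernel's prefetch hints, which avoids copying data into a user-space buffer.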
## 12. Adaptive DB Tuning Modes
- **Area**: Persistence / Block processing
- **Technique**: `Sync.TuneDbMode` — switch RocksDB configuration between sync/processing modes
- **Description**: Nethermind dynamically adjusts RocksDB parameters based on the current operation:
- `Default`: Balanced for general use
- `HeavyWrite`: Larger memtables, delayed compaction (used during snap sync)
- `AggressiveHeavyWrite`: Even larger write buffers
- `DisableCompaction`: No background compaction (fastest sync, higher memory)
Additional per-mode options include write buffer sizes (`StateDbWriteBufferSize`), compression settings, index types, and MMAP configuration.
- **Code location**: Nethermind.Db and Nethermind.Sync configuration
- **Evidence of impact**: Snap sync completes in as little as 25 minutes with optimal tuning. Block processing benefits from non-partitioned index (`kBinarySearch`) for lower latency.
- **Applicability to Reth**: Reth uses MDBX which has different tuning knobs, but the principle of mode-switching (sync vs. follow-chain) applies. During tip-following, MDBX could be configured for lower write amplification; during initial sync, for maximum write throughput.
## 13. HalfPath State Database Optimization
- **Area**: Block processing / State access
- **Technique**: Store leaf node data separately from trie structure for O(1) leaf access
- **Description**: Instead of traversing the full Merkle Patricia Trie path for every state read, the HalfPath optimization stores leaf node data (account state, storage values) in a flat key-value mapping alongside the trie. Reads first check the flat mapping (O(1) hash lookup) and only fall back to trie traversal for cache misses. The trie is still maintained for state root computation.
- **Code location**: Nethermind experimental release, state DB layer
- **Evidence of impact**:
- **40-50% faster block processing**
- **80-100% faster state reads**
- ([Reddit announcement](https://www.reddit.com/r/ethstaker/comments/19an48s/the_nethermind_client_experimental_release_is/))
- **Applicability to Reth**: Reth already uses a similar pattern with separate account/storage tables in MDBX (not pure trie traversal for reads). The equivalent concern is ensuring the hot path for state reads bypasses trie traversal entirely, which Reth's `PlainState` tables already achieve.
## 14. Array-Based LRU Cache
- **Area**: Caching / Memory management
- **Technique**: Replace LinkedList-based LRU with array-of-structs + int indices
- **Description**: Nethermind replaced its LRU cache implementation from `Dictionary<K, LinkedListNode<V>>` + `LinkedList` to `Dictionary<K, int>` + `Node[]` where `Node` is a struct with prev/next as int indices. This eliminates object headers and pointer indirection, saving 40 bytes per entry on 64-bit systems and improving cache locality.
- **Code location**: `LruCache<TKey, TValue>` ([PR #2497](https://github.com/NethermindEth/nethermind/pull/2497))
- **Evidence of impact**: 40MB saved per million cached entries. ~20% faster cache operations due to better CPU cache locality. ([Blog](https://blog.scooletz.com/2020/11/23/improving-Nethermind-performance))
- **Applicability to Reth**: Reth's caches (e.g., in the engine tree's `cached_state.rs`) should use flat-array or `Vec`-backed structures rather than pointer-linked ones such as `LinkedHashMap` for better cache line utilization. The `schnellru` crate, which Reth already uses, takes a similar approach by keeping its entries in a flat table linked by indices.
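
The index-linked layout translates to Rust as a `Vec` of nodes whose `prev`/`next` fields are indices rather than pointers. The `ArrayLru` type below is a hypothetical sketch of that technique, not Nethermind's or any crate's implementation.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Sentinel index meaning "no node".
const NIL: usize = usize::MAX;

struct Node<K, V> {
    key: K,
    value: V,
    prev: usize,
    next: usize,
}

/// Hypothetical LRU whose linked list lives inside one contiguous `Vec`.
struct ArrayLru<K, V> {
    map: HashMap<K, usize>,
    nodes: Vec<Node<K, V>>,
    head: usize, // most recently used
    tail: usize, // least recently used
    capacity: usize,
}

impl<K: Hash + Eq + Clone, V> ArrayLru<K, V> {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0);
        Self {
            map: HashMap::with_capacity(capacity),
            nodes: Vec::with_capacity(capacity),
            head: NIL,
            tail: NIL,
            capacity,
        }
    }

    /// Detaches node `i` from the list by patching its neighbours' indices.
    fn unlink(&mut self, i: usize) {
        let (prev, next) = (self.nodes[i].prev, self.nodes[i].next);
        if prev != NIL { self.nodes[prev].next = next } else { self.head = next }
        if next != NIL { self.nodes[next].prev = prev } else { self.tail = prev }
    }

    /// Makes node `i` the most recently used entry.
    fn push_front(&mut self, i: usize) {
        self.nodes[i].prev = NIL;
        self.nodes[i].next = self.head;
        if self.head != NIL { self.nodes[self.head].prev = i } else { self.tail = i }
        self.head = i;
    }

    fn get(&mut self, key: &K) -> Option<&V> {
        let i = *self.map.get(key)?;
        self.unlink(i);
        self.push_front(i);
        Some(&self.nodes[i].value)
    }

    fn put(&mut self, key: K, value: V) {
        if let Some(&i) = self.map.get(&key) {
            self.nodes[i].value = value;
            self.unlink(i);
            self.push_front(i);
            return;
        }
        let i = if self.nodes.len() < self.capacity {
            self.nodes.push(Node { key: key.clone(), value, prev: NIL, next: NIL });
            self.nodes.len() - 1
        } else {
            // Full: reuse the least recently used slot in place, no allocation.
            let i = self.tail;
            self.unlink(i);
            let old_key = std::mem::replace(&mut self.nodes[i].key, key.clone());
            self.map.remove(&old_key);
            self.nodes[i].value = value;
            i
        };
        self.map.insert(key, i);
        self.push_front(i);
    }
}
```

Once the `Vec` reaches capacity, eviction overwrites the tail slot, so steady-state operation performs no per-entry node allocation.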
## 15. Large Array Pool
- **Area**: Memory management
- **Technique**: Custom `LargerArrayPool` for 1-8MB buffers
- **Description**: .NET's default `ArrayPool` discards buffers >1MB, causing allocation spikes during heavy EVM calls. Nethermind added a custom pool layer that reuses buffers in the 1-8MB range.
- **Code location**: `LargerArrayPool` ([PR #2493](https://github.com/NethermindEth/nethermind/pull/2493))
- **Evidence of impact**: Eliminated GC pressure spikes from large EVM memory allocations.
- **Applicability to Reth**: Rust's allocator doesn't have this specific issue, but the principle applies to revm's memory expansion. Reusing `SharedMemory` buffers across transaction executions within a block avoids repeated large allocations.
## 16. Recovery Queue Bypass
- **Area**: Engine pipeline
- **Technique**: Skip recovery queue when processing queue is empty
- **Description**: Nethermind's `BlockchainProcessor` has a two-stage pipeline: recovery queue (sender address recovery/signature verification) → processing queue. When `_queueCount == 1` (only one block, the current one), the block bypasses the recovery channel entirely and goes directly to the processing queue. This eliminates channel overhead and scheduling latency for the common case of tip-following.
- **Code location**: `BlockchainProcessor.cs` → `Enqueue()`: `if (_queueCount > 1) { _recoveryQueue... } else { _blockQueue... }`
- **Evidence of impact**: Reduces latency by one channel hop for the critical single-block-at-tip case.
- **Applicability to Reth**: Reth's engine tree processes `engine_newPayload` synchronously in the handler. The equivalent optimization is ensuring that the hot path for a single new payload doesn't go through unnecessary queuing/channel abstractions.
## 17. Main Processing Thread Priority Elevation
- **Area**: Engine pipeline
- **Technique**: `Thread.CurrentThread.SetHighestPriority()` during block processing
- **Description**: When the processing loop picks up a block, it temporarily elevates the thread priority to highest. This ensures the block processing thread isn't preempted by less critical work (RPC, networking, etc.) during the critical 12-second window.
- **Code location**: `BlockchainProcessor.cs` → `RunProcessingLoop()`: `using var handle = Thread.CurrentThread.SetHighestPriority();`
- **Evidence of impact**: Prevents scheduling jitter from impacting attestation timing.
- **Applicability to Reth**: Reth could use `libc::sched_setscheduler` or `nice` to elevate the block processing thread's priority. This is particularly relevant on busy nodes running RPC alongside consensus.
## 18. Precompile Result Caching
- **Area**: Block processing / EVM
- **Technique**: Cache results of precompile calls within a block
- **Description**: When identical precompile inputs appear repeatedly within a block (common with ModExp, ecrecover), the first result is cached and reused. This was discovered during gas benchmarking when repetitive precompile calls dominated worst-case blocks.
- **Code location**: EVM precompile execution layer (gas benchmarks context)
- **Evidence of impact**: Significant improvement on repetition-heavy blocks. Nethermind's gas benchmarking blog notes this was later excluded from official benchmarks to ensure fairness, but it remains a production optimization.
- **Applicability to Reth**: Reth already has `precompile_cache.rs` in the engine tree. The key is ensuring the cache key includes the full input and that it's cleared per-block. Reth's implementation appears to already follow this pattern.
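
A simplified sketch of a per-block precompile result cache keyed on the precompile address plus the full input. The `PrecompileCache` type and its `call` signature are hypothetical and do not reflect `precompile_cache.rs`; a real cache would also store gas used and any error result.

```rust
use std::collections::HashMap;

type Address = [u8; 20];

/// Hypothetical per-block cache of precompile outputs.
#[derive(Default)]
struct PrecompileCache {
    /// Keyed on (precompile address, full input bytes).
    results: HashMap<(Address, Vec<u8>), Vec<u8>>,
    hits: u64,
}

impl PrecompileCache {
    /// Returns the cached output for identical input, or runs the precompile
    /// via `run` and remembers the result.
    fn call(
        &mut self,
        address: Address,
        input: &[u8],
        run: impl FnOnce(&[u8]) -> Vec<u8>,
    ) -> Vec<u8> {
        let key = (address, input.to_vec());
        if let Some(output) = self.results.get(&key) {
            self.hits += 1;
            return output.clone();
        }
        let output = run(input);
        self.results.insert(key, output.clone());
        output
    }

    /// Called at the block boundary so results never leak across blocks.
    fn clear(&mut self) {
        self.results.clear();
    }
}
```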
## 19. Compact Receipt Storage
- **Area**: Persistence
- **Technique**: `Receipt.CompactReceiptStore` and `Receipt.CompactTxIndex`
- **Description**: Nethermind offers compact receipt encoding that trades RPC query performance for reduced database size. The compact format omits redundant fields that can be reconstructed from the block.
- **Code location**: Nethermind.Receipt configuration
- **Evidence of impact**: Significant reduction in receipt DB size, which reduces I/O pressure during block processing when receipts are stored.
- **Applicability to Reth**: Reth already uses compact encoding for receipts. The insight is that receipt storage should be configurable — validators who don't serve RPC can use minimal encoding.
## 20. Parallel Block Validation Work
- **Area**: Block processing
- **Technique**: Parallelize independent validation steps
- **Description**: Nethermind marks the main processing thread via `IsMainProcessingThread` (AsyncLocal) and only performs certain work (like `GetAccountChanges()`) on the main thread. This implies that validation work that doesn't need main-thread state can be offloaded.
- **Code location**: `BlockProcessor.cs` → `if (BlockchainProcessor.IsMainProcessingThread) { block.AccountChanges = ... }`
- **Evidence of impact**: Reduces work on the critical path by deferring non-essential operations.
- **Applicability to Reth**: Reth's engine tree already parallelizes state root computation (`sparse_trie.rs`, `multiproof.rs`). The pattern of explicitly tracking "is this the critical path?" helps identify work that can be deferred.
---
## Top 5 Most Impactful Techniques (Ranked by Expected Performance Gain for Reth)
### 1. 🥇 State Prewarming with Sender-Grouped Parallelism (#1, #2, #3)
**Expected gain: 30-50% block processing latency reduction**
Nethermind's prewarming is their single biggest performance differentiator, delivering up to 2x speedup. The key insights Reth can adopt:
- **Group by sender**: Execute same-sender transactions sequentially, different senders in parallel
- **Reuse results**: Reth issue [#17833](https://github.com/paradigmxyz/reth/issues/17833) already identifies this — track reads during prewarming and feed them into the main execution as a warm cache
- **Separate address warming**: Run a lightweight address/trie-node warming pass concurrently with heavier EVM prewarming
- Maps to: `crates/engine/tree/src/tree/payload_processor/prewarm.rs`
### 2. 🥈 DB File Warming & Adaptive Tuning (#11, #12)
**Expected gain: 20-40% reduction in cold-start and steady-state I/O latency**
Nethermind's file warmer eliminates weeks-long cache warmup periods. For Reth:
- Implement `madvise(MADV_WILLNEED)` scan of MDBX state tables at startup
- Consider mode-switching MDBX parameters between sync and tip-following
- Maps to: MDBX configuration in `crates/storage/`
### 3. 🥉 In-Memory State Budget with Deferred Persistence (#9, #10)
**Expected gain: 15-30% reduction in SSD write pressure during block processing**
Keeping 297 blocks of state in memory before flushing, combined with sharded dirty-node caches:
- Increase the in-memory state budget to absorb write bursts
- Shard concurrent state caches (DashMap or striped-lock patterns)
- Maps to: `crates/engine/tree/src/tree/persistence_state.rs`, `cached_state.rs`
### 4. 🏅 Processing Thread Priority & GC/Background Work Scheduling (#5, #17)
**Expected gain: 5-15% reduction in tail latency / missed attestations**
Ensuring block processing isn't interrupted:
- Elevate thread priority during `engine_newPayload` processing
- Defer heavy persistence/compaction to idle periods between blocks
- Maps to: `crates/engine/tree/src/tree/mod.rs` engine loop
### 5. 🏅 PreBlockCaches Bridge (Prewarming → Main Execution) (#8, #7)
**Expected gain: 10-20% effective cache hit rate improvement**
The explicit cache bridge between prewarming and main execution, with pool-based environment reuse:
- Pool revm `EvmState` / database cursor objects
- Use a dedicated per-block cache that prewarming populates and main execution consumes
- Clear-and-warn pattern catches lifecycle bugs
- Maps to: `crates/engine/tree/src/tree/cached_state.rs`
---
## Sources
| Source | URL |
|--------|-----|
| Nethermind Performance Tuning Docs | https://docs.nethermind.io/fundamentals/performance-tuning/ |
| Nethermind Configuration Reference | https://docs.nethermind.io/fundamentals/configuration/ |
| GigaGas Benchmark Blog | https://www.nethermind.io/blog/getting-ethereum-ready-for-gigagas |
| Gas Benchmarking Framework Blog | https://www.nethermind.io/blog/measuring-ethereums-execution-limits-the-gas-benchmarking-framework |
| BlockProcessor.cs | https://github.com/NethermindEth/nethermind/blob/master/src/Nethermind/Nethermind.Consensus/Processing/BlockProcessor.cs |
| BlockCachePreWarmer.cs | https://github.com/NethermindEth/nethermind/blob/master/src/Nethermind/Nethermind.Consensus/Processing/BlockCachePreWarmer.cs |
| BlockchainProcessor.cs | https://github.com/NethermindEth/nethermind/blob/master/src/Nethermind/Nethermind.Consensus/Processing/BlockchainProcessor.cs |
| Improving Nethermind Performance (Scooletz) | https://blog.scooletz.com/2020/11/23/improving-Nethermind-performance |
| LRU Cache PR #2497 | https://github.com/NethermindEth/nethermind/pull/2497 |
| LargeArrayPool PR #2493 | https://github.com/NethermindEth/nethermind/pull/2493 |
| HalfPath Announcement | https://www.reddit.com/r/ethstaker/comments/19an48s/ |
| HalfPath Blog | https://medium.com/nethermind-eth/nethermind-client-3-experimental-approaches-to-state-database-change-8498e3d89771 |
| Reth Prewarming Reuse Issue | https://github.com/paradigmxyz/reth/issues/17833 |
| Nethermind Memory Config Issue | https://github.com/NethermindEth/nethermind/issues/8433 |
| Execution Payloads Benchmarks Repo | https://github.com/NethermindEth/execution-payloads-benchmarks |


@@ -0,0 +1,326 @@
# Nethermind State / Trie / DB Performance Optimizations - Catalog for Reth
**Date**: 2026-02-06
**Source codebase**: [NethermindEth/nethermind](https://github.com/NethermindEth/nethermind) (C#/.NET)
**Target codebase**: [paradigmxyz/reth](https://github.com/paradigmxyz/reth) (Rust)
**Context**: Nethermind consistently leads execution throughput benchmarks (697 MGas/s on real mainnet blocks per the [GigaGas benchmark](https://www.nethermind.io/blog/getting-ethereum-ready-for-gigagas), Dec 2025). Reth showed 2x performance degradation on merged 100-block payloads. This document catalogs the techniques behind Nethermind's lead.
---
## 1. State Prewarming (Parallel Speculative Execution)
- **Area**: state / caching
- **Technique**: `PreWarmStateOnBlockProcessing`
- **Description**: When a new block arrives, Nethermind speculatively executes all transactions in the block **in parallel** on background threads. This doesn't produce valid state (conflicts are discarded), but populates four concurrent caches (`PreBlockCaches`):
- `StorageCache`: `ConcurrentDictionary<StorageCell, byte[]>` — pre-reads storage slots
- `StateCache`: `ConcurrentDictionary<AddressAsKey, Account>` — pre-reads account data
- `RlpCache`: `ConcurrentDictionary<NodeKey, byte[]?>` — pre-reads trie node RLP from disk
- `PrecompileCache`: `ConcurrentDictionary<PrecompileCacheKey, Result<byte[]>>` — caches precompile results
The main sequential block processing then reads from these warm caches instead of hitting disk. Documented as providing "up to 2x speed-up in the main loop block processing."
- **Code location**: `Nethermind.State/PreBlockCaches.cs`, `Nethermind.Trie/PreCachedTrieStore.cs`
- **Evidence of impact**: Official docs: "can lead to an up to 2x speed-up." Enabled by default (`Blocks.PreWarmStateOnBlockProcessing = True`).
- **Applicability to Reth**: Reth already has parallel state root computation but does not pre-warm state *before* sequential EVM execution. The key insight is to use access lists / speculative execution to populate a `ConcurrentHashMap<Address, Account>` and `ConcurrentHashMap<StorageSlot, Value>` before the main execution loop runs. This would hide SSD latency for SLOAD/account reads. Most impactful for validators where block processing time = attestation delay.
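
A minimal sketch of the prewarm-then-execute pattern, using only the standard library. `slow_read` is a hypothetical stand-in for a disk-backed state read, and a single `RwLock<HashMap>` replaces the concurrent maps described above; neither is Reth or Nethermind code.

```rust
use std::collections::HashMap;
use std::sync::RwLock;
use std::thread;

/// Hypothetical stand-in for a slow, disk-backed storage read.
fn slow_read(slot: u64) -> u64 {
    slot.wrapping_mul(31).wrapping_add(7)
}

/// Prewarm `slots` on up to `workers` scoped threads, filling a shared cache.
fn prewarm(slots: &[u64], workers: usize) -> RwLock<HashMap<u64, u64>> {
    let cache = RwLock::new(HashMap::with_capacity(slots.len()));
    let chunk = slots.len().div_ceil(workers.max(1)).max(1);
    thread::scope(|s| {
        for part in slots.chunks(chunk) {
            let cache = &cache;
            s.spawn(move || {
                for &slot in part {
                    let value = slow_read(slot); // do the slow read outside the lock
                    cache.write().unwrap().insert(slot, value);
                }
            });
        }
    });
    cache
}

/// Main execution path: check the warm cache first, fall back to the slow read.
/// The boolean reports whether the value came from the cache.
fn read_slot(cache: &RwLock<HashMap<u64, u64>>, slot: u64) -> (u64, bool) {
    if let Some(&v) = cache.read().unwrap().get(&slot) {
        return (v, true);
    }
    (slow_read(slot), false)
}
```

In a real client the prewarm threads would run speculative EVM execution rather than a fixed slot list, and a sharded map would likely replace the single lock.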
---
## 2. HalfPath Key Scheme (Path-Prefixed DB Keys)
- **Area**: DB / trie
- **Technique**: HalfPath node storage
- **Description**: Instead of using only the node's Keccak hash as the database key (32 bytes), Nethermind prefixes keys with the trie path. The key structure is:
**State nodes** (42 bytes):
```
[section_byte | 8 bytes from path | path_length_byte | 32 byte hash]
```
**Storage nodes** (74 bytes):
```
[section_byte | 32 byte address | 8 bytes from path | path_length_byte | 32 byte hash]
```
Key innovations:
1. **Sorted by trie locality**: Nodes near each other in the trie are near each other on disk, dramatically improving RocksDB block cache hit rate
2. **Section separation**: Top-level state nodes (path ≤ 5) get section byte `0`, deeper state gets `1`, storage gets `2`. This isolates hot upper nodes from cold leaves
3. **Unique keys per path**: Because the path is part of the key, different canonical versions of the same node at the same path don't share keys. With no key sharing, old nodes can be pruned in real time
4. **Readahead-friendly**: Sequential trie traversals benefit from OS/RocksDB readahead since keys are ordered by path
- **Code location**: `Nethermind.Trie/NodeStorage.cs` — `GetHalfPathNodeStoragePathSpan()`
- **Evidence of impact**: [PR #6331](https://github.com/NethermindEth/nethermind/pull/6331) — "speeds up block processing by almost 50%", "database more compressible, shrinking size by about 25%", "database size only grew by 1.28% vs usual 14.62% in 12 days"
- **Applicability to Reth**: Reth's `crates/trie/` uses hash-based keys in MDBX. Implementing path-prefixed keys would:
- Improve MDBX page cache locality during trie traversal
- Enable the readahead hint optimization (sequential reads along paths)
- Reduce database growth by enabling real-time deletion of superseded nodes
This is the highest-impact structural change in this catalog.
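
The 42-byte state key layout can be sketched as follows. The function name is hypothetical, and packing two nibbles per byte with zero padding is an assumption; the source only specifies the field widths and the section-byte rule.

```rust
/// Build a 42-byte HalfPath-style key for a state trie node.
/// `path` holds one nibble (0..=15) per element.
/// Assumption: the first 16 nibbles are packed two per byte, zero-padded.
fn halfpath_state_key(path: &[u8], hash: &[u8; 32]) -> [u8; 42] {
    let mut key = [0u8; 42];
    // Section byte: 0 for top-level nodes (path length <= 5), 1 for deeper state nodes.
    key[0] = if path.len() <= 5 { 0 } else { 1 };
    // 8 bytes taken from the path.
    for (i, &nibble) in path.iter().take(16).enumerate() {
        let shift = if i % 2 == 0 { 4 } else { 0 };
        key[1 + i / 2] |= (nibble & 0x0f) << shift;
    }
    // Path length byte, then the 32-byte node hash.
    key[9] = path.len() as u8;
    key[10..].copy_from_slice(hash);
    key
}
```

Because the section byte and path prefix come first, byte-wise key ordering follows trie position rather than hash value, which is what gives the locality described above.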
---
## 3. Sharded Dirty Node Cache with Memory Budget
- **Area**: trie / caching
- **Technique**: Sharded in-memory dirty cache with memory-budgeted eviction
- **Description**: Nethermind maintains an in-memory cache of all recently modified ("dirty") trie nodes. Key features:
- **256 shards** (`_shardedDirtyNodeCount = 256`) to reduce lock contention
- Each shard is a `ConcurrentDictionary` storing `NodeRecord(TrieNode node, long lastCommit)`
- Configurable memory budget (default 1 GB, recommended 2-4 GB): `Pruning.CacheMb`
- Dirty nodes are kept until finalized, then batch-persisted to disk
- **Commit buffer**: When memory pruning is active, new commits go to a buffer to avoid blocking
- Memory tracking per-node: `node.GetMemorySize(false) + KeyMemoryUsage`
- Persisted-node pruning: runs incrementally (`PrunePersistedNodePortion = 0.05`) to avoid spikes
The cache absorbs roughly one hour of blocks (default `MaxUnpersistedBlockCount = 297`; 297 × 12 s ≈ 59 min), reducing DB writes by 3x with a 2 GB cache.
- **Code location**: `Nethermind.Trie/Pruning/TrieStore.cs`, `Nethermind.Trie/Pruning/TrieStoreDirtyNodesCache.cs`
- **Evidence of impact**: Docs: "reducing total SSD writes by roughly a factor of 3" with 2 GB cache. Metrics: `nethermind_state_skipped_writes` tracks avoided writes.
- **Applicability to Reth**: Reth's trie operations go through `crates/trie/`. A sharded write cache that batches dirty nodes and persists only at finalization boundaries would significantly reduce MDBX write amplification. The 256-shard approach maps well to Rust's `DashMap` or an array of `RwLock<HashMap>`.
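
A minimal sketch of the "array of locked maps" variant with byte-level budget tracking. The type and method names are hypothetical, and selecting the shard by the first hash byte is an assumption.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

const SHARDS: usize = 256;

/// Dirty-node cache split into 256 independently locked shards.
struct ShardedDirtyCache {
    shards: Vec<Mutex<HashMap<[u8; 32], Vec<u8>>>>,
    bytes_used: AtomicUsize,
    budget_bytes: usize,
}

impl ShardedDirtyCache {
    fn new(budget_bytes: usize) -> Self {
        let shards = (0..SHARDS).map(|_| Mutex::new(HashMap::new())).collect();
        Self { shards, bytes_used: AtomicUsize::new(0), budget_bytes }
    }

    /// Insert a dirty node. Assumption: the first hash byte selects the shard.
    fn insert(&self, hash: [u8; 32], rlp: Vec<u8>) {
        let added = hash.len() + rlp.len();
        let mut shard = self.shards[hash[0] as usize].lock().unwrap();
        if let Some(old) = shard.insert(hash, rlp) {
            // Replacing an entry: release the bytes counted for the old value.
            self.bytes_used.fetch_sub(hash.len() + old.len(), Ordering::Relaxed);
        }
        self.bytes_used.fetch_add(added, Ordering::Relaxed);
    }

    /// True once tracked bytes exceed the configured budget.
    fn over_budget(&self) -> bool {
        self.bytes_used.load(Ordering::Relaxed) > self.budget_bytes
    }

    /// Drain every shard into one batch, e.g. to persist at a finalization boundary.
    /// Simplification: assumes no inserts run concurrently with the drain.
    fn drain_for_persist(&self) -> Vec<([u8; 32], Vec<u8>)> {
        let mut batch = Vec::new();
        for shard in &self.shards {
            batch.extend(shard.lock().unwrap().drain());
        }
        self.bytes_used.store(0, Ordering::Relaxed);
        batch
    }
}
```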
---
## 4. Parallel Trie Commit (Concurrent Hashing)
- **Area**: trie
- **Technique**: Multi-level parallel trie commit
- **Description**: During `PatriciaTree.Commit()`, Nethermind decides how many levels of the trie to parallelize based on the number of dirty writes:
```
> 4 * 16 * 16 => parallelize at 3 top levels (up to 4096 tasks)
> 4 * 16 => parallelize at 2 top levels (up to 256 tasks)
> 4 => parallelize at 1 top level (up to 16 tasks)
<= 4 => sequential (no parallelism)
```
Each branch child at the parallelization boundary is committed in a separate `Task.Run()` with `Task.Yield()` for async scheduling. A concurrency quota mechanism prevents over-subscription.
- **Code location**: `Nethermind.Trie/PatriciaTree.cs` — `Commit()`, `CreateTaskForPath()`
- **Evidence of impact**: Key contributor to the 697 MGas/s throughput. The adaptive thresholds avoid parallelism overhead for small updates.
- **Applicability to Reth**: Reth already has parallel state root computation in `crates/trie/trie-parallel/`. However, Nethermind's adaptive threshold approach (based on actual dirty write count rather than fixed configuration) is worth adopting. Additionally, Nethermind parallelizes the *commit* (encoding + hashing), not just the root computation.
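
The adaptive thresholds above translate directly into a small selection function. The function names are hypothetical; the constants are the ones listed in the threshold table.

```rust
/// Number of top trie levels to commit in parallel for a given dirty write count.
/// 0 means the commit runs sequentially.
fn parallel_levels(dirty_writes: usize) -> u32 {
    if dirty_writes > 4 * 16 * 16 {
        3
    } else if dirty_writes > 4 * 16 {
        2
    } else if dirty_writes > 4 {
        1
    } else {
        0
    }
}

/// Upper bound on independent units of work: one per branch child at the
/// parallelization boundary (16 children per level). 1 means sequential.
fn max_tasks(levels: u32) -> usize {
    16usize.pow(levels)
}
```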
---
## 5. Parallel Trie Persistence
- **Area**: trie / DB
- **Technique**: Parallel `PersistNodeStartingFrom` with batched write disposal
- **Description**: When persisting a finalized block's trie changes to disk, Nethermind identifies "parallel start nodes" (subtree roots) and persists each subtree in parallel via `Task.Run()`. Write batches are disposed through a bounded channel with dedicated disposal tasks, preventing write batch disposal from becoming a bottleneck.
- **Code location**: `Nethermind.Trie/Pruning/TrieStore.cs` — `ParallelPersistBlockCommitSet()`
- **Evidence of impact**: `Metrics.SnapshotPersistenceTime` tracks this. Parallelism is critical for large state updates.
- **Applicability to Reth**: Reth writes trie nodes sequentially during state root computation. Parallelizing the DB writes for independent subtrees (using MDBX's multi-cursor support or batched writes) would reduce persistence latency.
---
## 6. DB File Warmer (OS Cache Priming)
- **Area**: DB
- **Technique**: `StateDbEnableFileWarmer`
- **Description**: On startup, Nethermind reads through all SST files of the state database sequentially, populating the OS page cache. Without this, the OS cache can take "several weeks to warm up naturally." This is a simple but highly effective optimization for nodes with 128+ GB RAM.
- **Code location**: `Nethermind.Db.Rocks/` — RocksDB configuration
- **Evidence of impact**: Recommended for 128+ GB systems. Eliminates weeks of cold-start performance degradation.
- **Applicability to Reth**: Straightforward to implement. At startup, spawn a background thread that reads all MDBX data files sequentially, hinting the kernel with `posix_fadvise(POSIX_FADV_SEQUENTIAL)`. This would warm the page cache in minutes instead of weeks.
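
A minimal sketch of the manual sequential-read approach, using only the standard library. `warm_file` is a hypothetical helper, and it omits the `posix_fadvise` hint because that requires a platform binding. The commit in this PR uses MDBX's native `mdbx_env_warmup()` instead.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Read `path` front to back in large chunks so the OS pulls its pages into
/// the page cache. The data is discarded; the return value is the byte count.
fn warm_file(path: &Path, chunk_bytes: usize) -> io::Result<u64> {
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; chunk_bytes.max(4096)];
    let mut total = 0u64;
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break; // end of file
        }
        total += n as u64;
    }
    Ok(total)
}
```

At startup this would typically run inside `std::thread::spawn` so that node initialization is not blocked while the file is read.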
---
## 7. Adaptive RocksDB Tuning (TuneDbMode)
- **Area**: DB
- **Technique**: Dynamic RocksDB configuration switching
- **Description**: Nethermind dynamically adjusts RocksDB settings based on the current workload phase:
- **HeavyWrite** (default during sync): Larger write buffers, more memtables, reduced compaction frequency
- **AggressiveHeavyWrite**: Even more aggressive write buffering for slow SSDs
- **DisableCompaction**: Completely disables compaction during snap sync, doing one big compaction at the end (~10 min pause)
- **Normal**: Standard settings for steady-state block processing
Additional per-memory-tier tuning:
- 32 GB: Switch from partitioned index to `kBinarySearch` (saves lookup latency, costs ~500 MB RAM)
- 128 GB: Enable file warmer
- 350 GB: Disable compression + enable MMAP reads (skip RocksDB block cache entirely)
- **Code location**: `Nethermind.Db.Rocks/`, `Sync.TuneDbMode` configuration
- **Evidence of impact**: DisableCompaction provides "lowest total writes possible" during sync. Compression removal at 350 GB provides "more CPU-efficient encoding."
- **Applicability to Reth**: Reth uses MDBX, not RocksDB, but the principle of workload-adaptive configuration applies. During initial sync, MDBX's `NOSYNC` + larger map sizes could be used, switching to normal settings for steady state.
---
## 8. Struct-Based LRU Cache (Zero-Allocation Cache)
- **Area**: caching
- **Technique**: Array-of-structs LRU replacing LinkedList
- **Description**: Nethermind replaced the standard `Dictionary + LinkedList<LruCacheItem>` LRU cache with an array-of-structs design:
- `LinkedListNode` + `LruCacheItem` = 48 bytes overhead per entry (64-bit)
- Array struct with `int prev/next` = 8 bytes overhead per entry
- Savings: **40 bytes per entry** → **40 MB saved for 1M entries**
- Also ~20% faster due to reduced pointer chasing (better cache locality)
- **Code location**: `Nethermind.Core/Caching/` — LRU cache implementations
- **Evidence of impact**: [PR #2497](https://github.com/NethermindEth/nethermind/pull/2497) — saves 40MB per million entries, 20% response time improvement
- **Applicability to Reth**: Reth uses various caches (e.g., in `crates/trie/`). Rust's struct layout is already compact, but reviewing cache implementations for unnecessary heap allocations and ensuring struct-of-arrays / arena patterns are used where possible would yield similar gains.
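
A minimal sketch of an index-linked LRU in Rust, where the recency list lives in a single `Vec` instead of separately allocated list nodes. `ArrayLru` is a hypothetical type, not an existing Reth cache; `u64` keys and `usize` links are chosen for brevity.

```rust
use std::collections::HashMap;

const NIL: usize = usize::MAX;

struct Entry<V> {
    key: u64,
    value: V,
    prev: usize,
    next: usize,
}

/// LRU cache whose recency list is stored in one Vec and linked by indices.
struct ArrayLru<V> {
    map: HashMap<u64, usize>, // key -> slot index
    slots: Vec<Entry<V>>,
    head: usize, // most recently used
    tail: usize, // least recently used
    capacity: usize,
}

impl<V> ArrayLru<V> {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0);
        Self {
            map: HashMap::with_capacity(capacity),
            slots: Vec::with_capacity(capacity),
            head: NIL,
            tail: NIL,
            capacity,
        }
    }

    /// Detach slot `i` from the recency list.
    fn unlink(&mut self, i: usize) {
        let (prev, next) = (self.slots[i].prev, self.slots[i].next);
        if prev != NIL { self.slots[prev].next = next } else { self.head = next }
        if next != NIL { self.slots[next].prev = prev } else { self.tail = prev }
    }

    /// Make slot `i` the most recently used entry.
    fn push_front(&mut self, i: usize) {
        self.slots[i].prev = NIL;
        self.slots[i].next = self.head;
        if self.head != NIL { self.slots[self.head].prev = i } else { self.tail = i }
        self.head = i;
    }

    fn get(&mut self, key: u64) -> Option<&V> {
        let i = *self.map.get(&key)?;
        self.unlink(i);
        self.push_front(i);
        Some(&self.slots[i].value)
    }

    fn put(&mut self, key: u64, value: V) {
        if let Some(&i) = self.map.get(&key) {
            self.slots[i].value = value;
            self.unlink(i);
            self.push_front(i);
        } else if self.slots.len() < self.capacity {
            let i = self.slots.len();
            self.slots.push(Entry { key, value, prev: NIL, next: NIL });
            self.map.insert(key, i);
            self.push_front(i);
        } else {
            // Full: recycle the least recently used slot in place, no allocation.
            let i = self.tail;
            self.unlink(i);
            self.map.remove(&self.slots[i].key);
            self.slots[i].key = key;
            self.slots[i].value = value;
            self.map.insert(key, i);
            self.push_front(i);
        }
    }
}
```

Switching the links to `u32` would bring the per-entry overhead down to the 8 bytes quoted above, at the cost of a few casts.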
---
## 9. Trie Write Deduplication (Skip Unchanged Writes)
- **Area**: trie
- **Technique**: Net-new write tracking
- **Description**: Nethermind tracks and skips trie node writes that would write unchanged data. Metrics `nethermind_state_skipped_writes` and `nethermind_storage_skipped_writes` count how many writes were avoided. When a trie node's content hasn't changed (same RLP), the write is skipped entirely.
- **Code location**: Trie commit logic in `TrieStore.cs`
- **Evidence of impact**: Visible in Nethermind metrics. For blocks with many SLOADs but few SSTOREs, this avoids significant wasted I/O.
- **Applicability to Reth**: Reth could compare trie node RLP before writing to MDBX. A simple hash comparison or byte comparison at the leaf level would avoid unnecessary DB writes, especially for storage tries where contracts are read but not modified.
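
A minimal sketch of the byte-comparison variant, with an in-memory map standing in for the database. `DedupStore` and its counters are hypothetical names that mirror the skipped-writes metric described above.

```rust
use std::collections::HashMap;

/// Hypothetical node store that skips writes whose bytes match the stored value.
#[derive(Default)]
struct DedupStore {
    nodes: HashMap<Vec<u8>, Vec<u8>>, // path -> RLP bytes
    writes: u64,
    skipped_writes: u64,
}

impl DedupStore {
    /// Returns true if the write reached the backing store.
    fn put(&mut self, path: &[u8], rlp: &[u8]) -> bool {
        if self.nodes.get(path).map(Vec::as_slice) == Some(rlp) {
            // Same bytes already stored: count the write as skipped.
            self.skipped_writes += 1;
            return false;
        }
        self.nodes.insert(path.to_vec(), rlp.to_vec());
        self.writes += 1;
        true
    }
}
```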
---
## 10. Paprika: Custom Page-Based Storage Engine
- **Area**: DB (experimental/future)
- **Technique**: Paprika custom database
- **Description**: A ground-up replacement for RocksDB with Ethereum-specific design:
1. **4 KB page-based storage** with memory-mapped files (inspired by PostgreSQL/LMDB)
2. **Copy-on-Write concurrency**: Lock-free readers, single writer
3. **Path-based access**: Direct account/storage lookup by address, no trie traversal for reads
4. **Deferred Merkleization**: State root computed as a separate phase after all state changes, not inline
5. **Finality-aware flushing**: Non-finalized blocks kept in memory (unmanaged memory to avoid GC pressure), flushed to disk only on finality
6. **SlottedArray pages**: PostgreSQL-style slot arrays for variable-length key-value storage within fixed-size pages
7. **Combined Keccak+RLP**: Custom implementations that compute Keccak over RLP without intermediate allocations
8. **Page reuse via abandoned pages**: COW'd pages tracked and recycled after reorg depth threshold
- **Code location**: [NethermindEth/Paprika](https://github.com/NethermindEth/Paprika) — `docs/design.md`
- **Evidence of impact**: Blog: "cut down on the time the EVM spends on data retrieval." Still experimental but represents Nethermind's R&D direction.
- **Applicability to Reth**: Several Paprika ideas are directly applicable:
- **Deferred Merkleization**: Compute state root only when needed (end of block), not during execution. Reth already does this to some extent.
- **Finality-aware caching**: Keep N blocks of state diffs in memory, only flush to MDBX at finality (Reth could use this via its `ExExs` or engine pipeline)
- **Combined Keccak+RLP**: Avoid allocating intermediate RLP buffers — compute Keccak directly from RLP encoding streaming
---
## 11. Path-Based Storage (Flat State Access)
- **Area**: DB / state
- **Technique**: Path-based storage for direct leaf access
- **Description**: Uses the path to the node (i.e., the account address or storage key) as the database key instead of the node hash. Benefits:
- **O(1) leaf access**: No need to traverse the trie from root to leaf for reads
- **Natural pruning**: Old values at the same path are overwritten, no pruning needed
- **Snap serving**: Path-based layout enables efficient range queries for snap sync serving
- **Code location**: [PR #6499](https://github.com/NethermindEth/nethermind/pull/6499) (in development)
- **Evidence of impact**: Blog: "Shorter data access time as reading leaf nodes of MPT does not require traversing the trie."
- **Applicability to Reth**: This is conceptually similar to Reth's existing "flat" account/storage tables (`PlainAccountState`, `PlainStorageState`). Reth already has flat tables for EVM execution. The Nethermind approach confirms that this pattern is correct and could be extended (e.g., using flat tables as the primary state access path, with trie nodes only for state root computation).
---
## 12. CachedTrieStore (Read-Ahead Cache for Traversals)
- **Area**: trie / caching
- **Technique**: Per-traversal trie node cache
- **Description**: For read-only trie traversals (e.g., during proofs or full DB scans), Nethermind wraps the trie store in a `CachedTrieStore` that caches resolved nodes in a `ConcurrentDictionary<(TreePath, Hash256), TrieNode>`. This prevents redundant lookups when multiple operations traverse the same trie regions. For HalfPath scheme, `ReadFlags.HintReadAhead` is used since nodes are ordered.
- **Code location**: `Nethermind.Trie/CachedTrieStore.cs`
- **Evidence of impact**: Essential for operations like `eth_getProof` that traverse multiple storage paths sharing common ancestors.
- **Applicability to Reth**: When computing proofs or doing multi-key lookups, caching resolved trie nodes in a per-operation hash map would avoid redundant MDBX reads.
---
## 13. ReadAhead Hints with Key-Scheme Awareness
- **Area**: DB
- **Technique**: ReadFlag-based readahead separation
- **Description**: When performing full DB scans (pruning, snap serving), Nethermind sets read flags based on the key scheme:
- **HalfPath**: Sets `HintReadAhead` (nodes are path-ordered, sequential read benefits from prefetching). Further separated into `HintReadAhead2` (deep state) and `HintReadAhead3` (storage) for different RocksDB column families or iterators
- **Hash-based**: Sets `HintCacheMiss` (nodes are randomly ordered, don't pollute block cache with scan data)
- **Code location**: `Nethermind.Trie/PatriciaTree.cs` — `Accept()` method, `Nethermind.Trie/NodeStorage.cs`
- **Evidence of impact**: Prevents block cache pollution during full pruning, maintains read performance for block processing
- **Applicability to Reth**: MDBX supports `NORDAHEAD` flag and cursor operations. Reth should use `madvise(MADV_SEQUENTIAL)` for trie scans and avoid polluting the page cache during pruning operations.
---
## 14. TrackPastKeys (Incremental Pruning)
- **Area**: trie / pruning
- **Technique**: Real-time obsolete key tracking
- **Description**: With `Pruning.TrackPastKeys = true` (default), Nethermind tracks which trie node keys become obsolete when new nodes replace them at the same path. Combined with HalfPath (where paths uniquely identify nodes), this enables:
- Real-time deletion of old nodes as new ones are written
- ~90% pruning effectiveness without full pruning
- Database growth of only 1.28% over 12 days vs 14.62% without
Operates via `_persistedHashes` (per-shard `ConcurrentDictionary<HashAndTinyPath, Hash256?>`) that tracks the last persisted hash at each path. When a new hash is persisted at the same path, the old one can be deleted.
- **Code location**: `Nethermind.Trie/Pruning/TrieStore.cs` — `PersistedNodeRecorder()`, `CanDelete()`
- **Evidence of impact**: [PR #6331](https://github.com/NethermindEth/nethermind/pull/6331) — 90% reduction in database growth
- **Applicability to Reth**: Reth's pruning operates via pipeline stages. Implementing per-path tracking of superseded nodes would reduce the need for full trie traversal pruning. This is especially important as Ethereum state grows.
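
The per-path tracking idea can be sketched with a single map from path to last persisted hash. `PastKeyTracker` is a hypothetical, single-shard simplification of the per-shard structure described above.

```rust
use std::collections::HashMap;

/// Tracks the last persisted node hash at each trie path.
#[derive(Default)]
struct PastKeyTracker {
    persisted: HashMap<Vec<u8>, [u8; 32]>,
}

impl PastKeyTracker {
    /// Record that `hash` is now persisted at `path`.
    /// Returns the superseded hash, if any, whose DB entry can be deleted.
    fn record(&mut self, path: &[u8], hash: [u8; 32]) -> Option<[u8; 32]> {
        match self.persisted.insert(path.to_vec(), hash) {
            Some(old) if old != hash => Some(old),
            _ => None,
        }
    }
}
```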
---
## 15. NoResizeClear for ConcurrentDictionary Reuse
- **Area**: caching / memory
- **Technique**: Clear-without-resize pattern
- **Description**: Nethermind's `PreBlockCaches` and `NodeStorageCache` use `NoResizeClear()` — a custom extension that clears the dictionary contents without deallocating the underlying hash table buckets. This avoids repeated allocation/deallocation cycles for per-block caches that are reused every 12 seconds.
- **Code location**: `Nethermind.Core.Collections/CollectionExtensions.cs`, `Nethermind.State/PreBlockCaches.cs`
- **Evidence of impact**: Reduces GC pressure in .NET (analogous to avoiding reallocation in Rust)
- **Applicability to Reth**: Rust equivalent: reuse `HashMap` by calling `.clear()` (which already preserves capacity) or use pre-allocated `Vec`-based maps. Ensure per-block caches in Reth are reused rather than dropped and recreated.
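
A short sketch of the reuse pattern in Rust. `BlockCache` is a hypothetical per-block cache; the only behavior relied on is that `HashMap::clear` removes entries while keeping the allocated memory.

```rust
use std::collections::HashMap;

/// Per-block cache that is cleared and reused instead of dropped and rebuilt.
struct BlockCache {
    slots: HashMap<u64, u64>,
}

impl BlockCache {
    fn with_capacity(n: usize) -> Self {
        Self { slots: HashMap::with_capacity(n) }
    }

    /// Reset between blocks; the hash table allocation is kept for the next block.
    fn reset(&mut self) {
        self.slots.clear();
    }
}
```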
---
## 16. Precompile Result Caching
- **Area**: state / EVM
- **Technique**: Cross-transaction precompile cache
- **Description**: `PreBlockCaches` includes a `PrecompileCache` that caches precompile results (`ConcurrentDictionary<PrecompileCacheKey, Result<byte[]>>`). The key is `(Address, data_hash)`. If the same precompile is called with identical inputs within the same block (common for ecrecover, bn256 pairing), the result is reused.
- **Code location**: `Nethermind.State/PreBlockCaches.cs`
- **Evidence of impact**: [GigaGas benchmark article](https://www.nethermind.io/blog/getting-ethereum-ready-for-gigagas): "precompile caching... eliminated redundant computation" on repetition-heavy blocks
- **Applicability to Reth**: Reth's EVM execution in `crates/evm/` could add a per-block precompile cache. For blocks with many ecrecover calls to the same signature or repeated BN256 pairings, this avoids expensive crypto operations.
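
A minimal sketch of a per-block precompile cache. `PrecompileCache` is hypothetical and not part of revm or Reth. To stay dependency-free it keys on the raw input bytes rather than the `(Address, data_hash)` pair described above.

```rust
use std::collections::HashMap;

type Address = [u8; 20];

/// Per-block precompile result cache keyed by (precompile address, input bytes).
#[derive(Default)]
struct PrecompileCache {
    results: HashMap<(Address, Vec<u8>), Vec<u8>>,
    hits: u64,
}

impl PrecompileCache {
    /// Return the cached output, or run `compute` once and remember its result.
    fn get_or_compute(
        &mut self,
        addr: Address,
        input: &[u8],
        compute: impl FnOnce(&[u8]) -> Vec<u8>,
    ) -> Vec<u8> {
        let key = (addr, input.to_vec());
        if let Some(out) = self.results.get(&key) {
            self.hits += 1;
            return out.clone();
        }
        let out = compute(input);
        self.results.insert(key, out.clone());
        out
    }
}
```

Hashing the input first would shrink the keys for large inputs such as pairing checks, at the cost of one extra hash per call.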
---
## 17. Large Array Pool (LargerArrayPool)
- **Area**: memory
- **Technique**: Custom pooling for 1-8 MB buffers
- **Description**: .NET's default `ArrayPool<byte>.Shared` only pools buffers up to 1 MB. Larger allocations are allocated and discarded, causing GC pressure. Nethermind added a `LargerArrayPool` that pools 1-8 MB buffers (sized based on CPU count to correlate with expected parallelism).
- **Code location**: Various pooling utilities
- **Evidence of impact**: [PR #2493](https://github.com/NethermindEth/nethermind/pull/2493) — eliminated GC spikes from large EVM memory allocations
- **Applicability to Reth**: Reth should use arena allocators or `Vec` pools for large temporary buffers in the EVM (e.g., contract memory expansion beyond 1 MB). Rust's allocator story is different from .NET, but the principle applies: pool large buffers rather than allocating/deallocating.
---
## 18. Capped Array Pool for Trie Operations (TrackingCappedArrayPool)
- **Area**: trie / memory
- **Technique**: Size-capped buffer pool for RLP encoding
- **Description**: `TrackingCappedArrayPool` provides pooled byte arrays for trie node RLP encoding/decoding operations. It caps the maximum size to prevent unbounded memory growth and tracks usage for diagnostics.
- **Code location**: `Nethermind.Trie/TrackingCappedArrayPool.cs`
- **Evidence of impact**: Reduces allocation pressure during trie commit operations
- **Applicability to Reth**: Reth's RLP encoding in trie operations should use buffer pools. This is especially important during parallel state root computation where many threads encode trie nodes simultaneously.
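
A minimal sketch of a size-capped buffer pool in Rust. `BufferPool` and both of its limits are hypothetical; a multi-threaded version would need one pool per thread or a lock around the free list.

```rust
/// Size-capped pool of reusable byte buffers.
struct BufferPool {
    free: Vec<Vec<u8>>,
    max_pooled: usize,       // at most this many idle buffers are retained
    max_buffer_bytes: usize, // buffers that grew beyond this are dropped, not pooled
}

impl BufferPool {
    fn new(max_pooled: usize, max_buffer_bytes: usize) -> Self {
        Self { free: Vec::new(), max_pooled, max_buffer_bytes }
    }

    /// Take an empty buffer, reusing a pooled allocation when one is available.
    fn take(&mut self) -> Vec<u8> {
        self.free.pop().unwrap_or_default()
    }

    /// Return a buffer. Its contents are cleared but its capacity is kept.
    fn give(&mut self, mut buf: Vec<u8>) {
        if self.free.len() < self.max_pooled && buf.capacity() <= self.max_buffer_bytes {
            buf.clear();
            self.free.push(buf);
        }
        // Otherwise the buffer is dropped here, which bounds pooled memory.
    }
}
```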
---
## Top 5 Most Impactful Techniques (Ranked by Expected Gain for Reth)
### 1. **State Prewarming** (Optimization #1)
**Expected gain**: 30-100% block processing speedup for validator workloads
**Why**: This is Nethermind's single biggest advantage. Reth processes blocks sequentially and pays SSD latency for every SLOAD that misses the cache. Prewarming hides most of that latency by populating caches in parallel before execution. For 12-second slots with ~200 transactions, this is transformative.
**Implementation effort**: Medium-high. Requires running EVM speculatively on background threads, collecting read sets into concurrent caches, and wrapping the state provider to check caches first.
### 2. **HalfPath Key Scheme** (Optimization #2)
**Expected gain**: 30-50% block processing speedup + 25% DB size reduction
**Why**: The most impactful structural change. Ordering DB keys by trie path improves the page cache hit rate for every trie read in MDBX. This compounds: better cache hit rate → less I/O → faster blocks → less memory pressure. It also enables real-time pruning.
**Implementation effort**: High. Requires changes to `crates/trie/` key encoding and a DB migration path.
### 3. **Sharded Dirty Node Cache** (Optimization #3)
**Expected gain**: 3x reduction in DB writes, smoother block processing
**Why**: Absorbing ~1 hour of trie mutations in memory and batch-persisting at finality boundaries dramatically reduces write amplification. The 256-shard design reduces lock contention. Combined with memory budget awareness, this prevents OOM while maximizing cache effectiveness.
**Implementation effort**: Medium. Reth could implement this as a layer between trie computation and MDBX writes.
### 4. **DB File Warmer** (Optimization #6)
**Expected gain**: Eliminates cold-start problem (weeks of suboptimal perf → minutes)
**Why**: Trivial to implement, enormous bang-for-buck. A single background thread reading MDBX data files sequentially at startup populates the OS page cache. Essential for any node restart.
**Implementation effort**: Low. ~50 lines of Rust code.
### 5. **Precompile Result Caching** (Optimization #16)
**Expected gain**: Variable, up to 10-50% for ecrecover-heavy / BN256-heavy blocks
**Why**: Many real-world blocks contain repeated precompile calls (e.g., DEX routers calling ecrecover for multiple signatures). Caching is cheap and the payoff per avoided computation is high (ecrecover ~3000 gas, BN256 pairing ~45000+ gas).
**Implementation effort**: Low. A per-block `HashMap<(Address, Hash), Vec<u8>>` in the EVM executor.
---
## Additional References
- [Nethermind: 3 Experimental Approaches to State Database Change](https://www.nethermind.io/blog/nethermind-client-3-experimental-approaches-to-state-database-change) (Jan 2024)
- [Nethermind Performance Tuning Docs](https://docs.nethermind.io/fundamentals/performance-tuning/)
- [Getting Ethereum Ready for GigaGas](https://www.nethermind.io/blog/getting-ethereum-ready-for-gigagas) (Dec 2025)
- [Measuring Ethereum's Execution Limits](https://www.nethermind.io/blog/measuring-ethereums-execution-limits-the-gas-benchmarking-framework) (Nov 2025)
- [Improving Nethermind Performance (LRU Cache, ArrayPool)](https://blog.scooletz.com/2020/11/23/improving-Nethermind-performance) (2020)
- [Paprika Design Document](https://github.com/NethermindEth/Paprika/blob/main/docs/design.md)
- HalfPath PR: [#6331](https://github.com/NethermindEth/nethermind/pull/6331)
- Path-Based Storage PR: [#6499](https://github.com/NethermindEth/nethermind/pull/6499)

399
findings_reth_db_storage.md Normal file

@@ -0,0 +1,399 @@
# Reth Storage & Database Performance Analysis
**Date:** 2026-02-06
**Scope:** `crates/storage/` — MDBX database, codecs, providers, static files
**Goal:** Identify bottlenecks and optimization opportunities for Nethermind-style performance gains
---
## 1. Architecture Overview
```
┌────────────────────────────────────────────────────────────────┐
│                   Engine (newPayload / FCU)                    │
└───────────────────────────┬────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────┐
│            ConsistentProvider / BlockchainProvider             │
│     (snapshot of in-memory + disk state at creation time)     │
│      crates/storage/provider/src/providers/consistent.rs       │
└───────────┬──────────────────┬────────────────────┬────────────┘
            │                  │                    │
            ▼                  ▼                    ▼
┌──────────────────┐ ┌─────────────────┐  ┌──────────────────────┐
│ In-Memory State  │ │ DatabaseProvider│  │ StaticFileProvider   │
│ (CanonicalIMSt)  │ │  (MDBX via DbTx)│  │ (NippyJar / mmap)    │
│ chain_state      │ │ provider.rs     │  │ manager.rs           │
└──────────────────┘ └────────┬────────┘  └──────────┬───────────┘
                              │                      │
                     ┌────────┴────────┐    ┌────────┴───────────┐
                     │ DatabaseEnv     │    │ NippyJar (zstd)    │
                     │ (MDBX wrapper)  │    │ Static segments:   │
                     │ mdbx/mod.rs     │    │  Headers, Txs,     │
                     │                 │    │  Receipts, Senders │
                     │ Tx<RO> / Tx<RW> │    │  AccountChangeSets │
                     │ Cursor<K, T>    │    │  StorageChangeSets │
                     └────────┬────────┘    └────────────────────┘
                     ┌────────┴────────┐
                     │ libmdbx (C)     │
                     │ B+ tree engine  │
                     │ mmap'd file     │
                     └─────────────────┘
```
### Storage backends:
- **MDBX** — Primary key-value store for mutable state (accounts, storage, trie, changesets, indices)
- **Static Files (NippyJar)** — Immutable, append-only compressed segments for historical data
- **RocksDB** — Optional secondary store for specific tables (history indices, tx hash lookups)
- **In-memory overlay** — `CanonicalInMemoryState` holds recent blocks not yet persisted
### Key tables (defined in `crates/storage/db-api/src/tables/mod.rs`):
| Table | Key | Value | Type | Purpose |
|-------|-----|-------|------|---------|
| `PlainAccountState` | `Address` | `Account` | Table | Current account state |
| `PlainStorageState` | `Address` | `StorageEntry` (SubKey: `B256`) | DupSort | Current storage slots |
| `HashedAccounts` | `B256` | `Account` | Table | Hashed accounts for trie |
| `HashedStorages` | `B256` | `StorageEntry` (SubKey: `B256`) | DupSort | Hashed storage for trie |
| `AccountChangeSets` | `BlockNumber` | `AccountBeforeTx` (SubKey: `Address`) | DupSort | Account history |
| `StorageChangeSets` | `BlockNumberAddress` | `StorageEntry` (SubKey: `B256`) | DupSort | Storage history |
| `AccountsHistory` | `ShardedKey<Address>` | `BlockNumberList` | Table | Sharded account history index |
| `StoragesHistory` | `StorageShardedKey` | `BlockNumberList` | Table | Sharded storage history index |
| `AccountsTrie` | `StoredNibbles` | `BranchNodeCompact` | Table | Merkle trie nodes |
| `StoragesTrie` | `B256` | `StorageTrieEntry` (SubKey: `StoredNibblesSubKey`) | DupSort | Storage trie nodes |
---
## 2. Transaction Pattern Analysis
### 2.1 Transaction Creation
- **File:** `crates/storage/db/src/implementation/mdbx/mod.rs` L241-261
- `tx()` → `begin_ro_txn()` — creates read-only MDBX snapshot
- `tx_mut()` → `begin_rw_txn()` — creates read-write transaction (exclusive write lock)
- All transactions clone `Arc<HashMap<&str, MDBX_dbi>>` for DBI cache — cheap pointer copy
### 2.2 Transaction Lifetimes
- **Long-lived read transactions** are dangerous: they prevent MDBX from reclaiming pages. A safety mechanism at `crates/storage/db/src/implementation/mdbx/tx.rs` L27 sets `LONG_TRANSACTION_DURATION = 60s` and logs backtraces.
- `ConsistentProvider` (`consistent.rs` L57-77) takes a snapshot: `head_state` + `database_provider_ro()`. The RO transaction lives for the entire duration of the provider.
- **FCU/newPayload path**: `save_blocks()` at `provider.rs` L509-693 holds a single RW transaction for the entire batch of blocks, which is correct — one commit per batch.
### 2.3 Write Transaction Scope
- `save_blocks()` is the main entry point for persisting execution results. One RW tx processes N blocks:
1. Insert block structure (headers, body indices, senders, tx hashes)
2. Write state changes (per-block: plain state, bytecodes, storage)
3. Write hashed state (batched across all blocks — good optimization)
4. Write trie updates (batched across all blocks — good optimization)
5. Update history indices (batched across all blocks)
6. Commit
### 2.4 No Nested Transactions
MDBX supports nested transactions but Reth does **not** use them. All writes within `save_blocks` share a single RW transaction.
---
## 3. Write Path Analysis
### 3.1 Full Write Path: Engine → Disk
```
engine::newPayload
  → execute block (EVM)
  → ExecutedBlock { bundle_state, hashed_state, trie_updates }
  → save_blocks()  [provider.rs L509]
      ├── Static file thread: headers, txs, senders, receipts
      ├── RocksDB thread (optional): tx hashes, history indices
      └── MDBX main thread:
          ├── insert_block_mdbx_only  [L699]
          │   ├── TransactionSenders (cursor append)
          │   ├── HeaderNumbers (put)
          │   ├── BlockBodyIndices (cursor append)
          │   ├── TransactionBlocks (cursor append)
          │   └── Ommers/Withdrawals (put)
          ├── write_state (per block)  [L623]
          │   ├── write_state_changes  [L2365]
          │   │   ├── PlainAccountState (cursor upsert/delete)
          │   │   ├── Bytecodes (cursor upsert)
          │   │   └── PlainStorageState (seek+delete+upsert per slot)
          │   ├── write_receipts  [L2246-2283]
          │   └── write_state_reverts  [L2286]
          │       ├── StorageChangeSets (append via EitherWriter)
          │       └── AccountChangeSets (append via EitherWriter)
          ├── write_hashed_state (batched)  [L2427]
          │   ├── HashedAccounts (upsert/delete)
          │   └── HashedStorages (seek+delete+upsert per slot)
          ├── write_trie_updates_sorted (batched)  [L656]
          │   ├── AccountsTrie (upsert/delete)
          │   └── StoragesTrie (upsert/delete via dup cursor)
          └── update_history_indices  [L3081]
              ├── AccountsHistory (seek+upsert, sharded)
              └── StoragesHistory (seek+upsert, sharded)
  → commit()  [static files → RocksDB → MDBX]
```
### 3.2 Data Written Per Block (Approximate)
For a block with 200 transactions touching 500 accounts and 2000 storage slots:
| Table | Operations | Data per op | Total |
|-------|-----------|-------------|-------|
| PlainAccountState | 500 upserts | ~40B | ~20KB |
| PlainStorageState | 2000 seek+delete+upsert | ~64B | ~128KB |
| HashedAccounts | 500 upserts | ~40B | ~20KB |
| HashedStorages | 2000 seek+delete+upsert | ~64B | ~128KB |
| AccountChangeSets | 500 appends | ~40B | ~20KB |
| StorageChangeSets | 2000 appends | ~84B | ~168KB |
| AccountsHistory | ~500 seek+upsert | variable | ~50KB |
| StoragesHistory | ~2000 seek+upsert | variable | ~200KB |
| AccountsTrie | ~100 upserts | variable | ~10KB |
| StoragesTrie | ~500 upserts | variable | ~50KB |
| TransactionSenders | 200 appends | 20B | ~4KB |
| HeaderNumbers | 1 put | 40B | ~40B |
| BlockBodyIndices | 1 append | 16B | ~16B |
| **Total** | | | **~800KB** |
**Write amplification**: Effective state change ≈ 150KB (accounts + storage). Total written ≈ 800KB. **Write amplification ≈ 5.3x** due to changesets, hashed state copies, history indices, and trie updates.
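
The arithmetic behind that figure can be reproduced directly from the table. The sketch below hard-codes the table's per-operation sizes and estimated totals (all constant and function names are illustrative, not Reth identifiers). With unrounded values the ratio comes out slightly above the rounded 800KB / 150KB estimate.

```rust
/// (table name, operations, approximate bytes per operation), copied from the table above.
pub const FIXED_SIZE_TABLES: &[(&str, u64, u64)] = &[
    ("PlainAccountState", 500, 40),
    ("PlainStorageState", 2000, 64),
    ("HashedAccounts", 500, 40),
    ("HashedStorages", 2000, 64),
    ("AccountChangeSets", 500, 40),
    ("StorageChangeSets", 2000, 84),
    ("TransactionSenders", 200, 20),
    ("HeaderNumbers", 1, 40),
    ("BlockBodyIndices", 1, 16),
];

/// (table name, estimated total bytes) for tables with variable-size entries.
pub const VARIABLE_SIZE_TABLES: &[(&str, u64)] = &[
    ("AccountsHistory", 50_000),
    ("StoragesHistory", 200_000),
    ("AccountsTrie", 10_000),
    ("StoragesTrie", 50_000),
];

/// Sum of all bytes written for the example block.
pub fn total_bytes_written() -> u64 {
    let fixed: u64 = FIXED_SIZE_TABLES.iter().map(|(_, ops, size)| ops * size).sum();
    let variable: u64 = VARIABLE_SIZE_TABLES.iter().map(|(_, total)| total).sum();
    fixed + variable
}

/// Plain account + plain storage bytes: the "effective" state change.
pub fn effective_state_bytes() -> u64 {
    500 * 40 + 2000 * 64
}

/// Ratio of bytes written to effective state change.
pub fn write_amplification() -> f64 {
    total_bytes_written() as f64 / effective_state_bytes() as f64
}
```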
---
## 4. Read Path Analysis
### 4.1 Latest State Access (EVM execution)
```
EVM SLOAD(address, slot)
→ LatestStateProviderRef [latest.rs L38-43]
→ tx.get_by_encoded_key::<PlainAccountState>(address)
→ MDBX point lookup: O(log N) B-tree traversal
→ cursor_dup_read::<PlainStorageState>()
→ seek_by_key_subkey(address, slot)
→ MDBX dup-sort lookup: O(log N) for key + O(log M) for subkey
```
### 4.2 Historical State Access
```
eth_getBalance(address, block=N)
→ HistoricalStateProviderRef [historical.rs L103]
→ account_history_lookup(address)
→ seek AccountsHistory for ShardedKey(address, ?)
→ find block > N in shard → InChangeset(block_number)
→ seek AccountChangeSets at block_number
→ seek_by_key_subkey(block_number, address)
→ return pre-state value
```
This involves **two to three B-tree lookups**: the history index, then the changeset, with an optional fallback to plain state.
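
A simplified model of this lookup is sketched below. It assumes one sorted list of change blocks per account (standing in for the sharded history index), a changeset map holding the pre-state value, and a plain-state map as the fallback. `HistoryDb` and `balance_at` are hypothetical names, and balances are reduced to `u64` for brevity.

```rust
use std::collections::{BTreeMap, HashMap};

pub type Address = [u8; 20];

#[derive(Default)]
pub struct HistoryDb {
    /// Stand-in for AccountsHistory: sorted block numbers at which the account changed.
    pub history: BTreeMap<Address, Vec<u64>>,
    /// Stand-in for AccountChangeSets: value *before* the given block executed
    /// (`None` means the account did not exist yet).
    pub changesets: BTreeMap<(u64, Address), Option<u64>>,
    /// Stand-in for PlainAccountState: value at the current tip.
    pub plain: HashMap<Address, u64>,
}

impl HistoryDb {
    /// Balance of `address` as of the end of `block`.
    pub fn balance_at(&self, address: Address, block: u64) -> Option<u64> {
        // Lookup 1: history index for this address.
        let changes: &[u64] = self.history.get(&address).map(Vec::as_slice).unwrap_or(&[]);
        // First change strictly after `block`, if any.
        let idx = changes.partition_point(|&b| b <= block);
        match changes.get(idx) {
            // Lookup 2a: the changeset at that block stores the pre-state we want.
            Some(&changed_at) => self.changesets.get(&(changed_at, address)).copied().flatten(),
            // Lookup 2b: no later change, so the plain state is still valid.
            None => self.plain.get(&address).copied(),
        }
    }
}
```

In this model a query for a block before the account's creation returns `None`, because the changeset at the creation block records that no prior value existed.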
### 4.3 Provider Hierarchy
```
ConsistentProvider
├── In-memory blocks (recent, not yet persisted)
│ └── MemoryOverlayStateProviderRef
├── DatabaseProvider (MDBX)
│ ├── LatestStateProviderRef (current tip)
│ └── HistoricalStateProviderRef (past blocks)
└── StaticFileProvider (immutable historical data)
└── NippyJar (mmap'd, zstd-compressed)
```
---
## 5. Bottleneck Shortlist
### B1: DupSort Storage Writes — Seek-Delete-Upsert Pattern [HIGH]
- **File:** `crates/storage/provider/src/providers/database/provider.rs` L2395-2421
- **Problem:** For every storage slot update in `PlainStorageState`, the code does:
1. `seek_by_key_subkey(address, entry.key)` — random B-tree lookup
2. `delete_current()` — if found
3. `upsert(address, &entry)` — reinsert
Same pattern repeated at L2440-2461 for `HashedStorages`.
- **Impact:** 4000 slot updates per block (2000 plain + 2000 hashed) for storage-heavy blocks, each costing up to three cursor operations (seek, delete, upsert).
- **Severity:** HIGH — This is the hottest write path during block execution.
- **Proposed fix:** Batch storage writes: sort all entries by (address, slot), open cursor once, and use sequential seek to avoid random lookups. For blocks where an account's storage is wiped, skip individual deletes and use `delete_current_duplicates()` first.
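
The batching idea in B1 can be sketched as a preparation step followed by one ordered pass. This is a minimal illustration, not Reth's code: a `BTreeMap` stands in for the dup-sorted table, `StorageWrite`, `prepare_batch`, and `apply_batch` are hypothetical names, and a zero value is assumed to mean "delete the slot".

```rust
use std::collections::BTreeMap;

pub type Address = [u8; 20];
pub type Slot = [u8; 32];

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct StorageWrite {
    pub address: Address,
    pub slot: Slot,
    pub value: u64,
}

/// Sort writes by (address, slot) and keep only the last write per key.
pub fn prepare_batch(mut writes: Vec<StorageWrite>) -> Vec<StorageWrite> {
    // Stable sort: writes to the same key keep their original relative order.
    writes.sort_by_key(|w| (w.address, w.slot));
    let mut out: Vec<StorageWrite> = Vec::with_capacity(writes.len());
    for w in writes {
        let same_key = out
            .last()
            .map_or(false, |l| l.address == w.address && l.slot == w.slot);
        if same_key {
            // Later write to the same slot supersedes the earlier one.
            *out.last_mut().expect("checked non-empty") = w;
        } else {
            out.push(w);
        }
    }
    out
}

/// Apply a prepared batch in key order. With a real cursor this would be a
/// forward-only pass instead of one random seek per slot.
pub fn apply_batch(table: &mut BTreeMap<(Address, Slot), u64>, batch: &[StorageWrite]) {
    for w in batch {
        if w.value == 0 {
            table.remove(&(w.address, w.slot));
        } else {
            table.insert((w.address, w.slot), w.value);
        }
    }
}
```

Deduplicating before the pass also removes redundant delete/upsert pairs when a slot is written more than once in the batch.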
### B2: Per-Block State Writing Instead of Batched [MEDIUM-HIGH]
- **File:** `crates/storage/provider/src/providers/database/provider.rs` L609-637
- **Problem:** `write_state()` is called **per block** inside the loop (L623), while `write_hashed_state()` and `write_trie_updates()` are batched across all blocks (L641-658). This means `PlainAccountState`, `PlainStorageState`, `Bytecodes`, and changeset tables open new cursors N times instead of once.
- **Impact:** For 128-block batches, this creates 128x cursor open/close overhead and prevents sequential write ordering.
- **Proposed fix:** Batch `write_state_changes` and `write_state_reverts` across all blocks, similar to how hashed state is already batched. Merge all `StateChangeset` objects first, then write once with a single cursor pass.
### B3: History Index Sharding — Expensive Seek on Every Update [MEDIUM]
- **File:** `crates/storage/provider/src/providers/database/provider.rs` L1228-1276
- **Problem:** `append_history_index()` does `cursor.seek_exact(last_key)` for **every** account/storage key to find the last shard. For 500 changed accounts per block, that's 500 random B-tree lookups in `AccountsHistory`.
- **Impact:** MEDIUM — History tables grow large on mainnet (100M+ entries), making seeks expensive.
- **Proposed fix:** Sort updates by key and use cursor walking to amortize seeks. When processing multiple updates for the same key, batch them before the shard lookup.
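
A rough sketch of the grouping step in B3: collect all block numbers per key first, so the last shard is located once per key rather than once per update. `SHARD_CAPACITY` is a made-up small value for illustration, and shards are modelled as plain vectors rather than MDBX entries.

```rust
use std::collections::BTreeMap;

/// Hypothetical shard capacity, kept tiny so the behaviour is easy to see.
pub const SHARD_CAPACITY: usize = 4;

pub type Key = [u8; 20];

/// Group (key, block number) updates by key, in key order, with sorted and
/// deduplicated block numbers per key.
pub fn group_by_key(updates: &[(Key, u64)]) -> BTreeMap<Key, Vec<u64>> {
    let mut grouped: BTreeMap<Key, Vec<u64>> = BTreeMap::new();
    for (key, block) in updates {
        grouped.entry(*key).or_default().push(*block);
    }
    for blocks in grouped.values_mut() {
        blocks.sort_unstable();
        blocks.dedup();
    }
    grouped
}

/// Append block numbers to one key's shards. The last shard is filled first;
/// a new shard is started whenever the current one is full.
pub fn append_to_shards(shards: &mut Vec<Vec<u64>>, blocks: &[u64]) {
    for &block in blocks {
        let need_new = shards.last().map_or(true, |s| s.len() >= SHARD_CAPACITY);
        if need_new {
            shards.push(Vec::new());
        }
        shards.last_mut().expect("a shard exists").push(block);
    }
}
```

Because `BTreeMap` iterates in key order, walking the grouped map visits keys in the same order a forward-moving cursor would.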
### B4: Transaction Hash Sorting Before Insert [LOW-MEDIUM]
- **File:** `crates/storage/provider/src/providers/database/provider.rs` L576-606
- **Problem:** All tx hashes are collected into a Vec, then `sort_unstable_by_key` is called to sort by hash for "optimal MDBX insertion performance". This is O(n log n) for potentially thousands of transactions.
- **Impact:** LOW-MEDIUM — This is already a good optimization (sorted inserts are O(1) in MDBX vs O(log N) for random), but the sort itself is not free.
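- **Proposed fix:** Consider a radix sort for B256 keys, which is O(n·k) for a fixed key width of k = 32 bytes rather than O(n log n). Benchmark before adopting it: for a few thousand hashes the comparison sort may already be fast enough.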
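
For reference, a least-significant-byte radix sort over 32-byte keys looks roughly like the sketch below. It sorts bare keys only (the real code sorts hash/number pairs), `radix_sort_b256` is a hypothetical name, and whether 32 counting passes beat a comparison sort at block-sized inputs is something only a benchmark can settle.

```rust
pub type B256 = [u8; 32];

/// LSD radix sort: one stable counting-sort pass per byte, from the last byte
/// to the first, yielding lexicographic (big-endian) order.
pub fn radix_sort_b256(keys: &mut Vec<B256>) {
    let n = keys.len();
    if n < 2 {
        return;
    }
    let mut scratch: Vec<B256> = vec![[0u8; 32]; n];
    for byte in (0..32).rev() {
        // Histogram of the current byte.
        let mut counts = [0usize; 256];
        for k in keys.iter() {
            counts[k[byte] as usize] += 1;
        }
        // Exclusive prefix sums give each bucket's starting offset.
        let mut offsets = [0usize; 256];
        let mut running = 0;
        for b in 0..256 {
            offsets[b] = running;
            running += counts[b];
        }
        // Stable scatter into the scratch buffer.
        for k in keys.iter() {
            let b = k[byte] as usize;
            scratch[offsets[b]] = *k;
            offsets[b] += 1;
        }
        // The scratch buffer now holds this pass's result; swap it into `keys`.
        std::mem::swap(keys, &mut scratch);
    }
}
```

The sort needs a second buffer of the same size as the input, which is an extra cost the in-place `sort_unstable_by_key` does not have.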
### B5: Compact Codec Vec Serialization Overhead [LOW]
- **File:** `crates/storage/codecs/src/lib.rs` L235-260
- **Problem:** `Compact for &[T]` stages **each element** in a temporary `Vec<u8>` (capacity 64, L247) before copying it into the output buffer. For lists with many elements, this means an extra copy per element, plus a reallocation whenever an element outgrows the buffer.
- **Impact:** LOW — Hot path for transaction/receipt serialization but allocation is reused via `clear()`.
- **Proposed fix:** Accept the compaction buffer as a parameter to avoid per-element allocation, or use a `SmallVec` to keep small elements on the stack.
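
The caller-provided buffer variant could look like the following sketch. `Encode` and `encode_list` are hypothetical stand-ins, not the real `Compact` trait, and the wire format (4-byte count, 1-byte element length) is invented for the example.

```rust
/// Hypothetical minimal encoding trait, standing in for `Compact`.
pub trait Encode {
    fn encode(&self, out: &mut Vec<u8>);
}

impl Encode for u64 {
    /// Big-endian bytes with leading zero bytes stripped.
    fn encode(&self, out: &mut Vec<u8>) {
        let bytes = self.to_be_bytes();
        let skip = (self.leading_zeros() / 8) as usize;
        out.extend_from_slice(&bytes[skip..]);
    }
}

/// Encode a list using a scratch buffer owned by the caller, so repeated calls
/// reuse one allocation instead of creating a temporary per list.
/// Assumes every encoded element is shorter than 256 bytes.
pub fn encode_list<T: Encode>(items: &[T], scratch: &mut Vec<u8>, out: &mut Vec<u8>) {
    out.extend_from_slice(&(items.len() as u32).to_le_bytes());
    for item in items {
        scratch.clear(); // keeps capacity, drops contents
        item.encode(scratch);
        out.push(scratch.len() as u8);
        out.extend_from_slice(scratch);
    }
}
```

A caller writing many lists in a row would create `scratch` once outside its loop and pass it to every `encode_list` call.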
### B6: Metrics Clone on Every Cursor Operation [LOW]
- **File:** `crates/storage/db/src/implementation/mdbx/cursor.rs` L50-61
- **Problem:** `execute_with_operation_metric()` clones `self.metrics` (an `Option<Arc<DatabaseEnvMetrics>>`) on every cursor operation. While `Arc::clone()` is cheap, it's an atomic increment/decrement on the hottest path.
- **Impact:** LOW — Atomic operations are ~5-15ns each, but cursor ops run thousands of times per block (see the operation counts in Section 3.2).
- **Proposed fix:** Use a reference instead of clone. The cursor already borrows the metrics via `Arc`; use `as_ref()` to avoid the atomic ref-count bump.
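
The difference between cloning and borrowing the metrics handle can be observed through `Arc::strong_count`. The types below are simplified stand-ins (`Metrics`, `Cursor`, `op_cloning`, `op_borrowing` are hypothetical), and the sketch assumes the wrapped closure does not need mutable access to the cursor; if it does, a plain borrow conflicts with `&mut self` and a different refactor is required.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

#[derive(Default)]
pub struct Metrics {
    pub ops: AtomicU64,
}

impl Metrics {
    pub fn record(&self) {
        self.ops.fetch_add(1, Ordering::Relaxed);
    }
}

pub struct Cursor {
    pub metrics: Option<Arc<Metrics>>,
}

impl Cursor {
    /// Clones the `Arc`: one atomic increment now, one decrement on drop.
    pub fn op_cloning<R>(&self, f: impl FnOnce() -> R) -> R {
        let metrics = self.metrics.clone();
        let result = f();
        if let Some(m) = metrics {
            m.record();
        }
        result
    }

    /// Borrows through the `Option<Arc<_>>`: no reference-count traffic.
    pub fn op_borrowing<R>(&self, f: impl FnOnce() -> R) -> R {
        let result = f();
        if let Some(m) = self.metrics.as_deref() {
            m.record();
        }
        result
    }
}
```

Calling `Arc::strong_count` inside the closure shows one extra reference while `op_cloning` runs and none while `op_borrowing` runs.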
### B7: No Read Cache in Provider Layer [MEDIUM]
- **Problem:** There is no application-level cache for frequently-accessed accounts or storage slots. Every `SLOAD` during EVM execution goes to MDBX. While MDBX has its own page cache, the decode overhead (Compact → Account/StorageEntry) is paid on every read.
- **Impact:** MEDIUM — Hot accounts (e.g., WETH, Uniswap routers) are read thousands of times per block.
- **Proposed fix:** Add an LRU cache at the `LatestStateProviderRef` level for account lookups and hot storage slots. Nethermind uses aggressive caching of hot state.
### B8: WriteMap Mode Always Enabled for RW [LOW-MEDIUM]
- **File:** `crates/storage/db/src/implementation/mdbx/mod.rs` L366-367
- **Problem:** `write_map()` is always enabled for RW mode. While this offers faster writes by modifying pages in-place via mmap, it means dirty pages are held in the OS page cache longer. On memory-constrained systems, this can cause page eviction of read-hot pages.
- **Impact:** LOW-MEDIUM — Generally beneficial, but worth monitoring on nodes with <16GB RAM.
### B9: Readahead Disabled Globally [LOW]
- **File:** `crates/storage/db/src/implementation/mdbx/mod.rs` L425-427
- **Problem:** `no_rdahead: true` is set for all environments. This is correct for random access patterns (EVM execution) but harmful during initial sync where sequential table scans dominate.
- **Impact:** LOW — Only affects initial sync, not live following.
- **Proposed fix:** Make readahead configurable, or enable it during staged sync and disable during live following.
---
## 6. Optimization Candidates (Ranked by Impact)
### Rank 1: Batch Plain State + Changeset Writes Across Blocks
- **Current:** `write_state()` called per block in `save_blocks` loop (L609-637)
- **Target:** Merge all `StateChangeset` and `PlainStateReverts` across the block batch, then write once
- **Expected gain:** 2-5x reduction in cursor open/close overhead for multi-block batches
- **Effort:** Medium — Requires restructuring `write_state` to accept merged changesets
- **Evidence:** `write_hashed_state` and `write_trie_updates` already do this successfully (L641-658)
### Rank 2: Optimize DupSort Storage Write Pattern
- **Current:** Seek-delete-upsert per slot in PlainStorageState/HashedStorages
- **Target:** Batch all slots per address, open cursor once, walk sequentially
- **Expected gain:** 30-50% reduction in MDBX operations for storage-heavy blocks
- **Effort:** Medium — Sort slots, then walk cursor forward instead of random seeks
- **Files:** `provider.rs` L2395-2421 (PlainStorageState), L2440-2461 (HashedStorages)
### Rank 3: Add Hot State Cache
- **Current:** Every EVM SLOAD goes through MDBX B-tree lookup + Compact decode
- **Target:** LRU cache for top ~10K accounts and ~100K storage slots
- **Expected gain:** 20-40% reduction in state reads during execution (Zipf distribution of account access)
- **Effort:** Low — Add `DashMap` or sharded `LruCache` to `LatestStateProviderRef`
- **Risk:** Cache invalidation complexity; must be cleared on reorg
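
To make the cache proposal concrete, here is a deliberately small LRU sketch built on `HashMap` with a logical clock. It is not a recommendation of this exact structure: eviction scans all entries (O(n)), there is no sharding or concurrency, and `LruCache` here is a hand-rolled illustration rather than any particular crate's type. The `clear` method marks where reorg invalidation would hook in.

```rust
use std::collections::HashMap;
use std::hash::Hash;

pub struct LruCache<K, V> {
    capacity: usize,
    tick: u64,
    /// Value plus the logical time of its last access.
    entries: HashMap<K, (V, u64)>,
}

impl<K: Hash + Eq + Clone, V: Clone> LruCache<K, V> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        LruCache { capacity, tick: 0, entries: HashMap::new() }
    }

    /// Returns a copy of the value and marks the entry as most recently used.
    pub fn get(&mut self, key: &K) -> Option<V> {
        self.tick += 1;
        let tick = self.tick;
        self.entries.get_mut(key).map(|(value, last_used)| {
            *last_used = tick;
            value.clone()
        })
    }

    /// Inserts or overwrites; evicts the least recently used entry when full.
    pub fn insert(&mut self, key: K, value: V) {
        self.tick += 1;
        if !self.entries.contains_key(&key) && self.entries.len() >= self.capacity {
            // Linear scan for the oldest entry; fine for a sketch, slow at scale.
            let oldest = self
                .entries
                .iter()
                .min_by_key(|(_, (_, last_used))| *last_used)
                .map(|(k, _)| k.clone());
            if let Some(k) = oldest {
                self.entries.remove(&k);
            }
        }
        self.entries.insert(key, (value, self.tick));
    }

    /// Drops everything, e.g. when a reorg invalidates cached state.
    pub fn clear(&mut self) {
        self.entries.clear();
    }
}
```

Clearing the whole cache on reorg is the simplest correct invalidation policy; finer-grained invalidation is where the complexity noted under Risk comes from.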
### Rank 4: Reduce Write Amplification via Deferred History
- **Current:** History indices (AccountsHistory, StoragesHistory) updated synchronously in `save_blocks`
- **Target:** Defer history index updates to a background task, or batch them across multiple `save_blocks` calls
- **Expected gain:** 15-25% reduction in `save_blocks` latency
- **Effort:** Medium-High — Requires background worker + crash recovery considerations
- **Evidence:** Metrics show `update_history_indices` is a significant portion of `save_blocks` time
### Rank 5: Cursor Metrics Optimization
- **Current:** `Arc::clone()` per cursor operation + HashMap lookup per metric recording
- **Target:** Store metric references directly in cursor, avoid atomic refcount
- **Expected gain:** 5-10% reduction in cursor operation overhead
- **Effort:** Low — Refactor `Cursor` to hold `&DatabaseEnvMetrics` instead of `Option<Arc<...>>`
- **File:** `cursor.rs` L50-61
### Rank 6: Compact Codec Improvements
- **Current:** Per-element temporary buffer allocation in `Vec<T>` serialization
- **Target:** Thread-local or caller-provided buffer for Compact encoding
- **Expected gain:** Minor — reduces allocation pressure during heavy writes
- **Effort:** Low
- **File:** `codecs/src/lib.rs` L247
### Rank 7: Configurable Readahead for Sync Stages
- **Current:** `no_rdahead: true` always
- **Target:** Enable readahead during `SenderRecovery`, `Execution` stages where access is sequential
- **Expected gain:** 10-20% improvement for initial sync disk I/O
- **Effort:** Low — Pass flag through `DatabaseArguments`
- **File:** `mdbx/mod.rs` L427
---
## 7. MDBX Configuration Analysis
**File:** `crates/storage/db/src/implementation/mdbx/mod.rs` L344-499
| Parameter | Value | Analysis |
|-----------|-------|----------|
| Map size | 0..8TB | Reasonable for mainnet (~2TB current) |
| Growth step | 4GB | Good — avoids frequent remaps |
| Page size | `default_page_size()` (4KB on most systems) | Standard; 8KB could reduce tree depth for large values |
| Max readers | 32,000 | Conservative but safe |
| Sync mode | `Durable` (default) | Safest; `SafeNoSync` could give a 5-10x write speedup at risk of data loss |
| WriteMap | Enabled for RW | Good — in-place modification via mmap |
| no_rdahead | `true` | Correct for random access; harmful for sync |
| coalesce | `true` | Good — merges adjacent free pages |
| rp_augment_limit | 256K pages | Good — prioritizes freelist reuse over file growth |
### Configuration recommendations:
1. **Consider `SafeNoSync` mode** for live following if the node can recover from unclean shutdown via re-executing from last checkpoint. This alone could provide 5-10x write throughput improvement.
2. **Investigate 8KB page size** for tables with large values (StorageChangeSets, AccountsHistory) to reduce overflow pages.
---
## 8. Existing Metrics Coverage
The storage layer has **good metrics instrumentation**:
### Database-level metrics (`crates/storage/db/src/metrics.rs`):
- Per-table, per-operation call counts
- Large value (>4KB) operation durations
- Transaction open/close durations
- Commit latency breakdown (preparation, GC, audit, write, sync)
### Provider-level metrics (`crates/storage/provider/src/providers/database/metrics.rs`):
- `save_blocks` total, MDBX, static file, RocksDB durations
- Per-sub-step histograms: insert_block, write_state, write_hashed_state, write_trie_updates, update_history_indices
- Last-value gauges for real-time monitoring
### Overlay metrics (`crates/storage/provider/src/providers/state/overlay.rs` L31-48):
- Provider creation, trie/hashed state retrieval durations
- Cache miss counters
### Missing metrics:
- **No per-table write volume tracking** (bytes written per table per block)
- **No decode/encode duration tracking** (Compact codec overhead)
- **No cache hit/miss rates** (because there's no application-level cache)
- **No MDBX page fault tracking** (would reveal cold vs hot page access patterns)
---
## 9. Static File Interaction
### Architecture
- Static files use NippyJar (zstd-compressed, mmap'd) for immutable historical data
- Segments: Headers, Transactions, TransactionSenders, Receipts, AccountChangeSets, StorageChangeSets
- Files are organized in fixed-size ranges (default 500K blocks per file)
- Provider at `crates/storage/provider/src/providers/static_file/manager.rs`
### Parallelism
- `save_blocks()` parallelizes static file writes with MDBX writes using `thread::scope` (L554-692)
- Static files written on dedicated OS thread via `spawn_scoped_os_thread`
- RocksDB also on a dedicated thread when enabled
### Redundancy concerns
- **No redundancy** for active state tables — data exists in either MDBX or static files, controlled by `EitherWriter`
- The `EitherWriterDestination` mechanism (`either_writer.rs`) routes writes to the appropriate backend based on `StorageSettings`
- Changesets can be in MDBX, static files, or RocksDB — clean separation
---
## 10. Summary of Top Recommendations
| Priority | Optimization | Expected Speedup | Effort |
|----------|-------------|-------------------|--------|
| P0 | Batch plain state writes across blocks | 2-5x for multi-block saves | Medium |
| P0 | Optimize DupSort seek-delete-upsert pattern | 30-50% fewer MDBX ops | Medium |
| P1 | Hot state LRU cache | 20-40% fewer state reads | Low |
| P1 | Deferred history index updates | 15-25% faster save_blocks | Medium-High |
| P2 | Cursor metrics refactoring | 5-10% less overhead | Low |
| P2 | Configurable readahead for sync | 10-20% faster initial sync | Low |
| P3 | SafeNoSync mode option | 5-10x write throughput | Config change |
| P3 | Compact codec buffer reuse | Minor allocation savings | Low |


@@ -0,0 +1,371 @@
# Reth Engine Pipeline — Architecture Analysis & Performance Findings
## 1. Architecture Overview
### Pipeline Flow Diagram
```
┌─────────────────────────────────────────────┐
│ Consensus Layer (CL) │
└──────────────┬──────────────────────────────┘
│ BeaconEngineMessage
┌──────────────────────────────┐
│ EngineHandler<T, S, D> │ (engine.rs)
│ - Routes CL messages │
│ - Manages block downloader │
└──────────────┬───────────────┘
│ crossbeam channel
┌───────────────────────────────────────────────────────────────────────┐
│ EngineApiTreeHandler (tree/mod.rs) │
│ Runs on dedicated OS thread │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Main Event Loop │ │
│ │ wait_for_event() ─► select_biased! { │ │
│ │ persistence_rx ─► on_persistence_complete() │ │
│ │ incoming ─► on_engine_message() │ │
│ │ } │ │
│ │ advance_persistence() ─► should_persist() │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ on_new_payload(payload) │
│ ├─ find_invalid_ancestor() │
│ ├─ try_insert_payload(payload) / try_buffer_payload(payload) │
│ │ ├─ payload_validator.convert_payload_to_block() │
│ │ ├─ state_provider_builder(parent_hash) │
│ │ ├─ payload_validator.validate_payload(payload, ctx) │
│ │ │ ├─ PayloadProcessor::spawn() ──────────────────────────┐ │
│ │ │ │ ├─ Prewarm task (rayon) │ │
│ │ │ │ ├─ Multi-proof task (tokio blocking) │ │
│ │ │ │ ├─ Sparse trie task (std::thread) │ │
│ │ │ │ └─ Receipt root task │ │
│ │ │ ├─ Execute transactions sequentially │ │
│ │ │ ├─ Feed state updates → multi-proof → sparse trie │ │
│ │ │ └─ Await state_root from sparse trie │ │
│ │ ├─ insert_executed(block) → TreeState │ │
│ │ └─ emit CanonicalBlockAdded / ForkBlockAdded │ │
│ └─ try_connect_buffered_blocks() │ │
│ │
│ on_forkchoice_updated(state, attrs, version) │
│ ├─ validate_forkchoice_state() │
│ ├─ handle_canonical_head() — already canonical, process attrs │
│ ├─ apply_chain_update() │
│ │ ├─ on_new_head() — build NewCanonicalChain (Commit/Reorg) │
│ │ └─ on_canonical_chain_update() │
│ │ ├─ update_chain() on CanonicalInMemoryState │
│ │ ├─ notify_canon_state() │
│ │ └─ emit CanonicalChainCommitted │
│ └─ handle_missing_block() — download needed │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ PersistenceHandle │ │
│ │ save_blocks(Vec<ExecutedBlock>) ─────────────► │ │
│ │ remove_blocks_above(num) ────────────────────► │ │
│ └──────────────────┬───────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────────┘
│ std::sync::mpsc
┌──────────────────────────────────────────────┐
│ PersistenceService (persistence.rs) │
│ Runs on dedicated OS thread │
│ │
│ on_save_blocks(blocks) │
│ ├─ provider.save_blocks(blocks, Full) │
│ ├─ save finalized/safe block numbers │
│ ├─ provider.commit() │
│ └─ prune_before() if needed │
│ │
│ on_remove_blocks_above(tip) │
│ ├─ remove_block_and_execution_above(tip) │
│ └─ provider.commit() │
└────────────────────────────────────────────────┘
```
### Key Threading Model
| Thread | Role |
|--------|------|
| Engine thread (`spawn_os_thread("engine")`) | Main event loop: processes CL messages, executes blocks, manages state |
| Persistence thread (`spawn_os_thread("persistence")`) | Writes blocks/state to DB, runs pruner |
| Rayon pool | Prewarm tasks, sparse trie parallelism, overlay preparation |
| Tokio blocking pool | Multi-proof computation, I/O-heavy proof tasks |
---
## 2. Bottleneck Shortlist
### B-01: `is_canonical()` is O(n) chain walk on every `remove_canonical_until`
**File**: [`crates/engine/tree/src/tree/state.rs#L216-L230`](file:///home/ubuntu/reth/crates/engine/tree/src/tree/state.rs#L216-L230)
**Severity**: Medium
**Description**: `is_canonical()` walks the in-memory chain backwards from the canonical head on every call. It is called in `remove_canonical_until()`, which itself runs after every persistence completion. For a chain with N in-memory blocks, this is O(N) per call.
**Proposed fix**: Maintain a `HashSet<B256>` of canonical block hashes as blocks are inserted/removed. Reduces `is_canonical()` to O(1).
---
### B-02: `BlockBuffer::remove_block()` uses `VecDeque::retain` — O(n) per removal
**File**: [`crates/engine/tree/src/tree/block_buffer.rs#L158`](file:///home/ubuntu/reth/crates/engine/tree/src/tree/block_buffer.rs#L158)
**Severity**: Medium
**Description**: `remove_block()` calls `self.block_queue.retain(|h| h != hash)` which scans the entire `VecDeque` (up to `max_blocks` entries). This is called in a loop by `remove_children()` and `remove_old_blocks()`, making bulk removals O(n²).
**Proposed fix**: Replace `VecDeque` with a linked `HashMap` or simply skip the retain and let eviction handle stale entries. Alternatively, use a `HashSet` for O(1) membership check and a separate `VecDeque` for ordering — remove from the `HashSet` immediately, and lazily skip stale entries during eviction.
---
### B-03: `get_canonical_blocks_to_persist()` walks backwards from head every call
**File**: [`crates/engine/tree/src/tree/mod.rs#L1827-L1871`](file:///home/ubuntu/reth/crates/engine/tree/src/tree/mod.rs#L1827-L1871)
**Severity**: Low
**Description**: This walks the entire canonical chain from head backwards to find blocks to persist, cloning each one. With many in-memory blocks, this is both CPU and allocation heavy. It also walks past `target_number` to `last_persisted_number` just to break.
**Proposed fix**: Use `blocks_by_number` (the BTreeMap) to directly iterate the range `(last_persisted_number..target_number]` and filter for canonical blocks, avoiding the hash-chain walk.
---
### B-04: `on_new_head()` reorg detection clones every block during chain walk
**File**: [`crates/engine/tree/src/tree/mod.rs#L733-L822`](file:///home/ubuntu/reth/crates/engine/tree/src/tree/mod.rs#L733-L822)
**Severity**: Medium
**Description**: `on_new_head()` clones every `ExecutedBlock` (which is `Arc`-wrapped, so cheap) while walking the new chain. However, `canonical_block_by_hash()` (line 782, 799) for the OLD chain does a full DB fetch + trie computation for each block that isn't in memory. During reorgs against persisted blocks, this means N full DB reads + N trie computations.
**Proposed fix**: Cache recently persisted `ExecutedBlock`s so reorgs against just-persisted blocks don't trigger full reconstructions. Also, the trie computation in `canonical_block_by_hash()` (line 1928) could be deferred or skipped for reorg-detection purposes.
---
### B-05: `canonical_block_by_hash()` recomputes trie updates from DB every call
**File**: [`crates/engine/tree/src/tree/mod.rs#L1904-L1955`](file:///home/ubuntu/reth/crates/engine/tree/src/tree/mod.rs#L1904-L1955)
**Severity**: High
**Description**: When a block is not in memory but needed (e.g., for reorgs or FCU on canonical ancestors), this method fetches the block, its execution output, hashed post state, AND recomputes trie updates via `reth_trie_db::compute_block_trie_updates()`. This is an expensive operation that opens a read-only DB provider and walks the trie. It's called from `on_new_head()` and `apply_canonical_ancestor_via_reorg()`.
**Proposed fix**: Cache recently persisted `ExecutedBlock`s (the last N blocks). Alternatively, store pre-computed trie updates alongside blocks in the database to avoid recomputation.
---
### B-06: `StateProviderBuilder::build()` clones the overlay `Vec<ExecutedBlock>` every time
**File**: [`crates/engine/tree/src/tree/mod.rs#L126`](file:///home/ubuntu/reth/crates/engine/tree/src/tree/mod.rs#L126)
**Severity**: Low
**Description**: `build()` does `self.overlay.clone()` which clones the `Vec<ExecutedBlock<N>>`. While `ExecutedBlock` is `Arc`-wrapped internally (so it's reference counting, not deep copies), the vector itself is reallocated. For parallel execution contexts where `build()` may be called multiple times, this is wasteful.
**Proposed fix**: Store the overlay as `Arc<Vec<ExecutedBlock<N>>>` so cloning is just an Arc increment.
---
### B-07: `reinsert_reorged_blocks()` clones the entire `Vec<ExecutedBlock>` parameter
**File**: [`crates/engine/tree/src/tree/mod.rs#L2404-L2405`](file:///home/ubuntu/reth/crates/engine/tree/src/tree/mod.rs#L2404-L2405)
**Severity**: Low
**Description**: `on_canonical_chain_update()` calls `self.reinsert_reorged_blocks(new.clone())` and `self.reinsert_reorged_blocks(old.clone())`. These clone the entire block vectors. Since `reinsert_reorged_blocks` only reads the blocks to check if they exist in `TreeState`, it could take a reference instead.
**Proposed fix**: Change `reinsert_reorged_blocks` to accept `&[ExecutedBlock<N>]`.
---
### B-08: Persistence is single-threaded and blocks on commit
**File**: [`crates/engine/tree/src/persistence.rs#L96-L136`](file:///home/ubuntu/reth/crates/engine/tree/src/persistence.rs#L96-L136)
**Severity**: Medium
**Description**: The persistence service processes one action at a time in a sequential loop. `on_save_blocks()` calls `provider_rw.commit()` which is a blocking fsync. While this is happening, no new persistence actions can be processed. The engine thread can continue processing payloads, but persistence is the bottleneck for draining in-memory blocks.
**Proposed fix**: Pipeline persistence — start preparing the next batch while the current one is being committed. Or use write-ahead logging to defer the fsync.
---
### B-09: `blocks_by_hash()` in `TreeState` clones every block while walking the chain
**File**: [`crates/engine/tree/src/tree/state.rs#L87-L97`](file:///home/ubuntu/reth/crates/engine/tree/src/tree/state.rs#L87-L97)
**Severity**: Low
**Description**: `blocks_by_hash()` clones every `ExecutedBlock` while walking the chain from the given hash back to the anchor. This is used in `state_provider_builder()`, `prepare_canonical_overlay()`, and other hot paths. While `ExecutedBlock` internals are `Arc`-wrapped, the clone still involves atomic reference count bumps for multiple `Arc` fields.
**Proposed fix**: Return references where possible, or cache the chain walk result.
---
### B-10: `insert_executed()` in `TreeState` clones the block for `blocks_by_number`
**File**: [`crates/engine/tree/src/tree/state.rs#L160-L174`](file:///home/ubuntu/reth/crates/engine/tree/src/tree/state.rs#L160-L174)
**Severity**: Low
**Description**: `insert_executed()` inserts the block into both `blocks_by_hash` and `blocks_by_number`, cloning it for the second insertion. This means every inserted block gets its Arc ref counts bumped.
**Proposed fix**: Store the block only in `blocks_by_hash` and have `blocks_by_number` store `B256` hashes instead of full `ExecutedBlock`s. This saves memory and eliminates clones.
---
### B-11: `MeteredStateHook::on_state()` iterates all accounts/storage on every state change
**File**: [`crates/engine/tree/src/tree/mod.rs#L232-L244`](file:///home/ubuntu/reth/crates/engine/tree/src/tree/mod.rs#L232-L244)
**Severity**: Low
**Description**: On every transaction execution, this hook counts accounts (`.keys().len()`), storage slots (`.values().map(...).sum()`), and bytecodes. For large state changes, this is O(n) work on every transaction.
**Proposed fix**: The counts could be tracked incrementally during execution rather than recomputed from scratch.
---
### B-12: `find_disk_reorg()` does up to 2*N `sealed_header_by_hash` lookups
**File**: [`crates/engine/tree/src/tree/mod.rs#L2340-L2382`](file:///home/ubuntu/reth/crates/engine/tree/src/tree/mod.rs#L2340-L2382)
**Severity**: Low
**Description**: This function walks both the canonical and persisted chains backwards to find a common ancestor. Each step calls `sealed_header_by_hash()` which may hit the database. In the worst case (long persisted chain divergence), this is many DB reads.
**Proposed fix**: Most of the time there is no disk reorg. Short-circuit earlier by checking if the persisted tip hash matches a block in the canonical chain before walking.
---
### B-13: `InsertExecutedBlock` path clones block 3 times
**File**: [`crates/engine/tree/src/tree/mod.rs#L1462-L1486`](file:///home/ubuntu/reth/crates/engine/tree/src/tree/mod.rs#L1462-L1486)
**Severity**: Low
**Description**: When handling `InsertExecutedBlock`, the block is used four times, the first three via clone:
1. `set_pending_block(block.clone())` (line 1478)
2. `insert_executed(block.clone())` (line 1481)
3. `on_inserted_executed_block(block.clone())` (line 1482)
4. `CanonicalBlockAdded(block, ...)` (line 1485), which consumes the original
That's 3 clones. While `ExecutedBlock` is Arc-wrapped, each clone bumps ~5 Arc counters.
**Proposed fix**: The event emission already takes the original `block` as its last use. Reduce the earlier clones by passing `&ExecutedBlock` wherever ownership isn't needed.
---
## 3. Quick Wins (< 50 lines each)
### QW-01: Store hashes instead of blocks in `blocks_by_number`
Change `blocks_by_number: BTreeMap<BlockNumber, Vec<ExecutedBlock<N>>>` to `BTreeMap<BlockNumber, Vec<B256>>` and look up the actual block from `blocks_by_hash`. This eliminates the clone in `insert_executed()` and reduces memory usage. **~20 lines changed** across `state.rs`.
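
A reduced model of this layout is sketched below. `Block` is a stand-in for `ExecutedBlock` wrapped in a single `Arc` (the real type holds several `Arc` fields), and `TreeState` keeps only the two maps relevant here. The point is that insertion moves the block into `blocks_by_hash` once and records only its hash by number.

```rust
use std::collections::{BTreeMap, HashMap};
use std::sync::Arc;

pub type Hash = [u8; 32];

/// Simplified stand-in for an executed block.
pub struct Block {
    pub number: u64,
    pub hash: Hash,
}

#[derive(Default)]
pub struct TreeState {
    /// Single owner of the block data.
    pub blocks_by_hash: HashMap<Hash, Arc<Block>>,
    /// Only hashes per height; blocks are resolved via `blocks_by_hash`.
    pub blocks_by_number: BTreeMap<u64, Vec<Hash>>,
}

impl TreeState {
    /// Inserts without cloning the block: the hash is copied, the block is moved.
    pub fn insert_executed(&mut self, block: Arc<Block>) {
        self.blocks_by_number.entry(block.number).or_default().push(block.hash);
        self.blocks_by_hash.insert(block.hash, block);
    }

    /// All blocks at a height, resolved through the hash index.
    pub fn blocks_at(&self, number: u64) -> Vec<&Arc<Block>> {
        self.blocks_by_number
            .get(&number)
            .into_iter()
            .flatten()
            .filter_map(|hash| self.blocks_by_hash.get(hash))
            .collect()
    }
}
```

The trade-off is one extra hash lookup per block when iterating by number, in exchange for a single owner per block.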
### QW-02: Make `reinsert_reorged_blocks` take a reference
```rust
fn reinsert_reorged_blocks(&mut self, new_chain: &[ExecutedBlock<N>]) {
```
Change callers from `new.clone()` / `old.clone()` to `&new` / `&old` since the method only reads blocks. **~5 lines changed** in `mod.rs`.
### QW-03: Add fast-path to `find_disk_reorg()`
Add an early return when the persisted tip is in the canonical chain:
```rust
if self.state.tree_state.blocks_by_hash.contains_key(&persisted.hash) {
    // Caveat: `blocks_by_hash` also holds fork blocks, so confirm canonicality before relying on this.
    return Ok(None); // persisted tip is still tracked in memory
}
```
**~3 lines added** to `mod.rs`.
### QW-04: Replace `VecDeque::retain` with `HashSet` tracking in `BlockBuffer`
Add a `HashSet<BlockHash>` for O(1) membership checks during eviction. In `remove_block()`, remove from the set instead of calling `retain`. **~15 lines changed** in `block_buffer.rs`.
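
The lazy-removal scheme can be sketched as follows. `BlockQueue` is a hypothetical reduction of `BlockBuffer` to just its eviction order; removal only touches the set, and stale queue entries are skipped when eviction reaches them.

```rust
use std::collections::{HashSet, VecDeque};

pub type Hash = [u8; 32];

pub struct BlockQueue {
    /// Insertion order; may contain stale hashes that were already removed.
    order: VecDeque<Hash>,
    /// Source of truth for membership.
    live: HashSet<Hash>,
    max_blocks: usize,
}

impl BlockQueue {
    pub fn new(max_blocks: usize) -> Self {
        BlockQueue { order: VecDeque::new(), live: HashSet::new(), max_blocks }
    }

    pub fn contains(&self, hash: &Hash) -> bool {
        self.live.contains(hash)
    }

    /// Inserts a hash; returns the evicted hash if the limit was exceeded.
    pub fn insert(&mut self, hash: Hash) -> Option<Hash> {
        if !self.live.insert(hash) {
            return None; // already buffered
        }
        self.order.push_back(hash);
        if self.live.len() > self.max_blocks {
            return self.evict_oldest();
        }
        None
    }

    /// O(1): drops membership only, leaving a stale entry in `order`.
    pub fn remove(&mut self, hash: &Hash) -> bool {
        self.live.remove(hash)
    }

    /// Pops from the front, skipping entries that are no longer live.
    fn evict_oldest(&mut self) -> Option<Hash> {
        while let Some(hash) = self.order.pop_front() {
            if self.live.remove(&hash) {
                return Some(hash);
            }
        }
        None
    }
}
```

Two limitations of this sketch: a hash that is removed and later re-inserted keeps its old queue position, and `order` can accumulate stale entries if removals outpace evictions, so a real implementation would likely compact it periodically.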
### QW-05: Wrap overlay in `Arc` in `StateProviderBuilder`
```rust
overlay: Option<Arc<Vec<ExecutedBlock<N>>>>,
```
This makes `build()` cloning O(1) instead of O(n). **~10 lines changed**.
---
## 4. Medium Refactors (50-500 lines)
### MR-01: Maintain a canonical hash set for O(1) `is_canonical()`
Add a `HashSet<B256>` to `TreeState` that tracks canonical block hashes. Update it on `insert_executed`, `remove_by_hash`, `set_canonical_head`, and reorgs. This turns `is_canonical()` from O(chain_length) to O(1), benefiting `remove_canonical_until()` and `get_canonical_blocks_to_persist()`.
**Impact**: Eliminates O(n) walks on every persistence completion.
**Estimated effort**: ~100 lines across `state.rs` and `mod.rs`.
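
A toy version of the canonical hash set is sketched below, next to the chain walk it would replace. `Tree` tracks only parent links; the set is updated incrementally when the head simply extends the chain and rebuilt on a reorg. All names are illustrative, and a production version would update the set by diffing old and new chains rather than rebuilding.

```rust
use std::collections::{HashMap, HashSet};

pub type Hash = [u8; 32];

pub struct Tree {
    /// In-memory blocks: hash -> parent hash.
    parent_of: HashMap<Hash, Hash>,
    /// Hashes on the current canonical chain (including the anchor).
    canonical: HashSet<Hash>,
    head: Hash,
}

impl Tree {
    /// `anchor` is the last persisted block, the root of the in-memory tree.
    pub fn new(anchor: Hash) -> Self {
        Tree { parent_of: HashMap::new(), canonical: HashSet::from([anchor]), head: anchor }
    }

    pub fn insert_block(&mut self, hash: Hash, parent: Hash) {
        self.parent_of.insert(hash, parent);
    }

    pub fn set_canonical_head(&mut self, new_head: Hash) {
        if self.parent_of.get(&new_head) == Some(&self.head) {
            // Common case: the new head extends the current chain.
            self.canonical.insert(new_head);
        } else {
            // Reorg: rebuild the set by walking back from the new head.
            self.canonical.clear();
            let mut current = new_head;
            loop {
                self.canonical.insert(current);
                match self.parent_of.get(&current) {
                    Some(parent) => current = *parent,
                    None => break,
                }
            }
        }
        self.head = new_head;
    }

    /// O(1) membership check.
    pub fn is_canonical(&self, hash: &Hash) -> bool {
        self.canonical.contains(hash)
    }

    /// The O(n) walk from the head, kept for comparison.
    pub fn is_canonical_walk(&self, hash: &Hash) -> bool {
        let mut current = self.head;
        loop {
            if current == *hash {
                return true;
            }
            match self.parent_of.get(&current) {
                Some(parent) => current = *parent,
                None => return false,
            }
        }
    }
}
```

Keeping the walk-based method around during a migration makes it easy to assert that both answers agree in tests.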
### MR-02: Cache recently-persisted `ExecutedBlock`s
Keep the last 64 `ExecutedBlock`s that were persisted to disk in a bounded LRU cache. This prevents `canonical_block_by_hash()` from doing expensive DB reads + trie recomputation during reorgs or FCU-to-canonical-ancestor operations.
**Impact**: Critical for reorg performance. Currently, reverting to a just-persisted block requires full DB reconstruction.
**Estimated effort**: ~80 lines, new cache struct + integration in `on_save_blocks` and `canonical_block_by_hash`.
### MR-03: Use `blocks_by_number` BTreeMap for `get_canonical_blocks_to_persist()`
Instead of walking backwards from the head hash, use:
```rust
for (num, blocks) in self.state.tree_state.blocks_by_number.range(start..=end) {
    // find the canonical block at this height
}
```
This requires a way to identify the canonical block at each height (see MR-01's canonical hash set). Eliminates the backwards walk and avoids touching non-canonical blocks.
**Impact**: Persistence triggering becomes O(batch_size) instead of O(chain_length).
**Estimated effort**: ~50 lines.
### MR-04: Pipeline persistence with double-buffering
Allow the persistence service to accept a new batch while the previous one is being committed. Use two alternating write providers:
1. While batch A is committing (fsync), prepare batch B's data
2. When batch A completes, start committing batch B
**Impact**: Increases persistence throughput by overlapping I/O with preparation.
**Estimated effort**: ~200 lines in `persistence.rs`.
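
The overlap between preparation and commit can be illustrated with two threads and a bounded channel. This is only a shape sketch under strong assumptions: `prepare` and `commit` are placeholder functions, preparation is modelled as pure serialization that needs no write transaction, and the channel capacity of 1 plays the role of the second buffer.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Placeholder for CPU-side preparation (e.g. encoding a batch).
pub fn prepare(batch: &[u64]) -> Vec<u8> {
    batch.iter().flat_map(|n| n.to_le_bytes()).collect()
}

/// Placeholder for the blocking commit; returns the number of bytes "written".
pub fn commit(prepared: &[u8]) -> usize {
    prepared.len()
}

/// Prepares batches on the calling thread while a second thread commits them.
/// At most one prepared batch waits in the channel while another is committing.
pub fn run_pipeline(batches: Vec<Vec<u64>>) -> Vec<usize> {
    let (tx, rx) = sync_channel::<Vec<u8>>(1);
    let committer = thread::spawn(move || {
        // Commits strictly in arrival order.
        rx.iter().map(|prepared| commit(&prepared)).collect::<Vec<usize>>()
    });
    for batch in &batches {
        // Blocks only when the committer is more than one batch behind.
        tx.send(prepare(batch)).expect("committer thread exited early");
    }
    drop(tx); // closing the channel lets the committer finish
    committer.join().expect("committer thread panicked")
}
```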
---
## 5. Missing Optimizations
### MO-01: No pre-fetching of parent state for anticipated payloads
The engine knows the canonical head and can predict that the next payload will build on it. State pre-fetching (warming the MDBX page cache and execution cache) before the payload arrives would reduce latency.
**Current state**: The `prepare_canonical_overlay()` (state.rs:107) pre-computes the trie overlay, which is good. But there's no pre-fetching of account/storage state that will likely be needed.
**Recommendation**: After canonicalizing a block, identify the top N accounts/slots that were touched and pre-warm the execution cache with their current values for the next block.
### MO-02: No batching of state root updates during multi-block connect
When `try_connect_buffered_blocks()` connects multiple buffered blocks (e.g., after backfill), each block goes through full independent execution and state root computation. State roots could be deferred: execute all blocks, then compute the final state root once.
**Current limitation**: Each block needs its own state root for validation. But the multiproof/sparse trie infrastructure could be reused across blocks.
### MO-03: Persistence doesn't batch prune with save
The persistence service first saves blocks, then checks if pruning is needed and runs it as a separate operation. Both operations open their own read-write provider and commit separately, causing two fsyncs.
**File**: [`crates/engine/tree/src/persistence.rs#L108-L126`](file:///home/ubuntu/reth/crates/engine/tree/src/persistence.rs#L108-L126)
**Recommendation**: Combine save + prune into a single transaction when possible, saving one fsync per cycle.
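The shape of the combined operation, sketched with mock types. `Store` and `WriteTx` are hypothetical stand-ins for the provider and its read-write transaction, and each `commit` represents one fsync.

```rust
/// Hypothetical write transaction that buffers operations until `commit`.
#[derive(Default)]
struct WriteTx {
    ops: Vec<String>,
}

/// Hypothetical store that counts commits; each commit stands for one fsync.
#[derive(Default)]
struct Store {
    commits: usize,
    log: Vec<String>,
}

impl Store {
    fn begin(&self) -> WriteTx {
        WriteTx::default()
    }

    fn commit(&mut self, tx: WriteTx) {
        self.log.extend(tx.ops);
        self.commits += 1;
    }
}

/// Saves blocks and, if requested, prunes in the same transaction.
fn save_and_prune(store: &mut Store, blocks: &[u64], prune_below: Option<u64>) {
    let mut tx = store.begin();
    for number in blocks {
        tx.ops.push(format!("save {number}"));
    }
    if let Some(limit) = prune_below {
        tx.ops.push(format!("prune < {limit}"));
    }
    store.commit(tx); // One commit covers both steps.
}
```

The trade-off is a longer-lived write transaction: a slow prune now delays the point at which the saved blocks become durable.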
### MO-04: No parallel execution of independent fork chains
When multiple fork chains exist (e.g., two competing heads), they are processed sequentially in the engine thread. Since they have independent state, they could be executed in parallel on separate threads with separate state providers.
### MO-05: `CanonicalInMemoryState` is updated multiple times per FCU
During `on_canonical_chain_update()`, the canonical in-memory state is updated with `update_chain()`, then `set_canonical_head()`, then `notify_canon_state()`. Each of these may involve locks and notifications. Batching these into a single atomic update would reduce contention.
**File**: [`crates/engine/tree/src/tree/mod.rs#L2387-L2423`](file:///home/ubuntu/reth/crates/engine/tree/src/tree/mod.rs#L2387-L2423)
### MO-06: Changeset cache eviction is conservative
The changeset cache evicts below `min(finalized_block, last_persisted - 64)`. On L2s where finalized is often unset, it falls back to `last_persisted - 64`, so only about 64 blocks of changesets behind the last persisted block are retained. That may be too few for deep reorg scenarios.
**File**: [`crates/engine/tree/src/tree/mod.rs#L1397-L1417`](file:///home/ubuntu/reth/crates/engine/tree/src/tree/mod.rs#L1397-L1417)
**Recommendation**: Make the retention configurable and consider adaptive sizing based on observed reorg depth.
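A sketch of the eviction rule with the 64-block window turned into a parameter. `eviction_threshold` and `retention` are illustrative names, not existing configuration.

```rust
/// Computes the block number below which cached changesets are evicted.
///
/// `retention` replaces the fixed 64-block window described above. When no
/// finalized block is known, only the retention window applies.
fn eviction_threshold(finalized: Option<u64>, last_persisted: u64, retention: u64) -> u64 {
    // Saturate so short chains never underflow.
    let window_start = last_persisted.saturating_sub(retention);
    match finalized {
        Some(finalized_block) => finalized_block.min(window_start),
        None => window_start,
    }
}
```

For example, with `last_persisted = 1000` and no finalized block, a retention of 64 evicts below block 936, while a retention of 256 evicts below block 744.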
### MO-07: No block hash → number index for fast lookups
The `TreeState` has `blocks_by_hash` and `blocks_by_number`, but when given a hash, finding its number requires a lookup in `blocks_by_hash` first. The `find_canonical_header()` method (line 2768) checks `canonical_in_memory_state` and then falls back to the database provider. A simple in-memory `hash → number` map would speed up many operations.
### MO-08: Execution cache miss on non-prewarm path doesn't populate cache
**File**: [`crates/engine/tree/src/tree/cached_state.rs#L310-L325`](file:///home/ubuntu/reth/crates/engine/tree/src/tree/cached_state.rs#L310-L325)
When `is_prewarm()` is false (the normal execution path), cache misses just go to the state provider without populating the cache. The cache is only populated after execution via `insert_state()`. This means if the same account is read multiple times within a single block execution (e.g., by different transactions), it will hit the DB each time.
**Mitigation**: The EVM's internal `State` database already caches within block execution, so this is less impactful than it appears. But for cross-transaction reads of the same cold account within a block, there's a redundant DB hit.
---
## Summary
| Category | Count | Top Priority |
|----------|-------|-------------|
| Critical | 0 | — |
| High | 1 | B-05: `canonical_block_by_hash()` recomputes trie from DB |
| Medium | 4 | B-01, B-02, B-04, B-08 |
| Low | 8 | B-03, B-06, B-07, B-09, B-10, B-11, B-12, B-13 |
| Quick Wins | 5 | QW-02, QW-03 (< 5 lines each) |
| Medium Refactors | 4 | MR-01 (canonical hash set), MR-02 (persisted block cache) |
| Missing Optimizations | 8 | MO-01 (state pre-fetch), MO-03 (batch save+prune) |
The engine pipeline is well-architected with good use of dedicated threads, Arc-wrapped shared state, and lazy/deferred computation (e.g., `LazyOverlay`, `DeferredTrieData`). The main performance concerns are:
1. **O(n) chain walks** where O(1) lookups are possible (B-01, B-03, B-12)
2. **Expensive DB reconstruction** of just-persisted blocks during reorgs (B-04, B-05)
3. **Sequential persistence** without pipelining (B-08, MO-03)
4. **Unnecessary cloning** in hot paths, mitigated by Arc but still adding atomic contention (B-07, B-13)


@@ -0,0 +1,276 @@
# Reth Trie & State Root Performance Analysis
## 1. Architecture Overview
Reth's trie system is split across six crates in `crates/trie/`:
### Crate Map
| Crate | Purpose |
|-------|---------|
| `trie/common/` | Shared types: `Nibbles`, `PrefixSet`, `HashBuilder`, `BranchNodeCompact`, RLP encoding |
| `trie/trie/` | Core sequential algorithms: `StateRoot`, `StorageRoot`, `TrieWalker`, `TrieNodeIter`, proof generation |
| `trie/db/` | MDBX-backed cursor factories (`DatabaseTrieCursorFactory`, `DatabaseHashedCursorFactory`), changeset cache |
| `trie/sparse/` | In-memory sparse trie (`SerialSparseTrie`, `SparseStateTrie`), lazy node revealing, proof-based approach |
| `trie/sparse-parallel/` | `ParallelSparseTrie` — splits trie into upper (depth <2) + 256 lower subtries for parallel hashing |
| `trie/parallel/` | Parallel state root calculator, parallel proof generation, worker pools (`ProofWorkerHandle`) |
### End-to-End Flow: State Root After Block Execution
The modern (preferred) path uses the **sparse trie + proof-based approach**, orchestrated from the engine:
1. **Block execution** proceeds transaction-by-transaction in `crates/engine/tree/src/tree/payload_processor/`.
2. After each transaction (or batch), a `MultiProofMessage::StateUpdate` is sent with the touched accounts/storage.
3. A **multiproof task** (`multiproof.rs`) converts these into proof targets, dispatches to `ProofWorkerHandle` worker pools.
4. Workers compute multiproofs (account + storage) using DB cursors, return `ProofResultMessage`.
5. A **sparse trie task** (`sparse_trie.rs`) receives proofs, reveals nodes into `SparseStateTrie`, applies leaf updates.
6. After all transactions, `root_with_updates()` is called on the sparse trie to compute the final state root.
The **legacy path** (`ParallelStateRoot` in `parallel/src/root.rs`) walks the entire trie using `TrieWalker` + `TrieNodeIter`, pre-computing storage roots in parallel via tokio's `spawn_blocking`. This is explicitly documented as a **fallback** (line 37).
### Two Computation Strategies
| Strategy | File | When Used |
|----------|------|-----------|
| **Sparse Trie (preferred)** | `sparse_trie.rs`, `SparseStateTrie` | Engine pipeline, modern payload processing |
| **Full Trie Walk (fallback)** | `parallel/src/root.rs`, `trie/src/trie.rs` | Initial sync, fallback scenarios |
---
## 2. Parallel Root Analysis
### 2.1 Legacy Parallel Root (`parallel/src/root.rs`)
**Strategy**: Spawn storage root computations as tokio `spawn_blocking` tasks, then walk the account trie sequentially.
**Bottlenecks**:
1. **Sequential account trie walk** (L136-199): The account trie walker runs on a single thread. Storage roots are parallelized, but the account trie iteration is strictly serial.
2. **Sorted iteration forces ordering** (L103-104): `storage_root_targets.into_iter().sorted_unstable_by_key(...)` sorts all targets before spawning. This is O(n log n) and delays the first spawn.
3. **mpsc::sync_channel(1) per account** (L110): Each storage root gets its own channel. When encountering a leaf, the main thread blocks on `rx.recv()` (L155). If the storage root computation hasn't finished, the account trie walk stalls.
4. **Missed leaves cause synchronous fallback** (L164-176): When a leaf has no pre-computed storage root, it falls back to synchronous `StorageRoot::calculate()` on the main thread, completely blocking progress.
5. **No prefetching**: The tokio runtime's thread pool is used, but there's no explicit prefetching of DB pages or control over thread count beyond `thread_keep_alive(15s)`.
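Bottlenecks 3 and 4 can be reproduced in a few lines. The sketch below uses plain threads instead of the tokio runtime and a toy `storage_root` function. It only illustrates the control flow (one bounded channel per account, blocking `recv()` during the serial walk, inline fallback for missed leaves) and is not the code in `root.rs`.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// Toy stand-in for a storage root computation.
fn storage_root(account: u64) -> u64 {
    account.wrapping_mul(31).wrapping_add(7)
}

/// Walks `accounts` in order and returns each account's storage root.
/// Roots for `prespawned` accounts are computed on worker threads; any other
/// account is a "missed leaf" and is computed inline on the walking thread.
fn walk_accounts(accounts: &[u64], prespawned: &[u64]) -> Vec<(u64, u64)> {
    // One bounded channel per pre-spawned storage root.
    let mut pending: HashMap<u64, Receiver<u64>> = HashMap::new();
    for &account in prespawned {
        let (tx, rx) = sync_channel(1);
        thread::spawn(move || {
            // The receiver may be gone if the walk never reaches this account.
            let _ = tx.send(storage_root(account));
        });
        pending.insert(account, rx);
    }

    accounts
        .iter()
        .map(|&account| {
            let root = match pending.remove(&account) {
                // Blocks the serial walk until this worker has finished.
                Some(rx) => rx.recv().expect("worker dropped its sender"),
                // Missed leaf: synchronous fallback on the walking thread.
                None => storage_root(account),
            };
            (account, root)
        })
        .collect()
}
```

Because results are consumed in walk order, one slow storage root early in the order stalls the walk even when every later root is already finished.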
### 2.2 Sparse Trie Approach (`SparseTrieCacheTask`)
**Strategy**: Interleave proof computation with block execution. Uses `crossbeam_channel::select_biased!` to process updates and proof results concurrently.
**Architecture** (from `sparse_trie.rs` L360-418):
```
Execution → MultiProofMessage → SparseTrieCacheTask
├── on_state_update() → encode leaf updates
├── process_leaf_updates() → apply to trie (parallel via par_bridge)
├── dispatch_pending_targets() → send to ProofWorkerHandle
└── on_proof_result() → reveal_decoded_multiproof_v2()
```
**Key parallelism points**:
1. **Worker pools** (`proof_task.rs` L128-277): Pre-spawned storage + account worker pools with dedicated DB transactions. Workers are long-lived, avoiding per-request DB transaction creation overhead.
2. **Storage proof parallelism** (L186-228): Storage workers process proofs independently. Availability tracked via `AtomicUsize` counters.
3. **Parallel storage reveals** (`state.rs` L292-351): When revealing multiproofs, storage trie nodes are revealed in parallel using `par_bridge_buffered()`.
4. **Parallel leaf updates** (`sparse_trie.rs` L585-600): Storage leaf updates applied via `par_bridge_buffered()`.
**Bottlenecks**:
1. **Single-threaded account trie reveal** (`state.rs` L466-499): Account trie nodes are revealed serially — only storage tries get parallel reveals.
2. **Final root computation is serial** (`sparse_trie.rs` L422-426): `root_with_updates()` hashes the entire trie on a single thread. This is the critical path bottleneck.
3. **`select_biased!` ordering** (`sparse_trie.rs` L364): Updates are prioritized over proof results, which can delay applying proofs when execution is fast.
### 2.3 ParallelSparseTrie (`sparse-parallel/src/trie.rs`)
**Strategy**: Split trie into upper subtrie (depth < 2) + 256 lower subtries. Hash lower subtries in parallel with rayon.
**Implementation** (L837-893):
- `update_subtrie_hashes()` takes changed lower subtries, updates their hashes via `into_par_iter().map(...)`.
- Has configurable `ParallelismThresholds` (min_revealed_nodes, min_updated_nodes) to fall back to serial when parallelism overhead exceeds benefit.
- Parallel reveal of nodes: groups by subtrie index, zips with `into_par_iter()` (L268-285).
**Bottleneck**: Upper subtrie hashing is always serial. With depth=2, the upper subtrie contains at most ~256 branch children, so this is typically fast.
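The split and the threshold fallback can be sketched with the standard library alone. This uses scoped threads in place of rayon, a toy hash in place of RLP plus Keccak, and a single hypothetical `min_subtries` threshold in place of `ParallelismThresholds`. It is an illustration of the idea, not the crate's implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::thread;

/// Index of the lower subtrie for a path, taken from its first two nibbles.
/// Paths shorter than two nibbles belong to the upper subtrie (`None`).
fn lower_subtrie_index(path: &[u8]) -> Option<usize> {
    match path {
        [first, second, ..] => {
            Some((((*first as usize) & 0x0f) << 4) | ((*second as usize) & 0x0f))
        }
        _ => None,
    }
}

/// Toy hash of one subtrie's leaves; a real trie would hash RLP-encoded nodes.
fn hash_subtrie(leaves: &[Vec<u8>]) -> u64 {
    let mut hasher = DefaultHasher::new();
    leaves.hash(&mut hasher);
    hasher.finish()
}

/// Hashes every subtrie, in parallel only when at least `min_subtries` changed.
fn hash_lower_subtries(subtries: &[Vec<Vec<u8>>], min_subtries: usize) -> Vec<u64> {
    if subtries.len() < min_subtries {
        // Below the threshold the thread overhead outweighs the benefit.
        return subtries.iter().map(|s| hash_subtrie(s)).collect();
    }
    thread::scope(|scope| {
        // One scoped thread per subtrie; a work-stealing pool would be used in practice.
        let handles: Vec<_> = subtries
            .iter()
            .map(|s| scope.spawn(move || hash_subtrie(s)))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("hash worker panicked"))
            .collect()
    })
}
```

Two nibbles give 16 × 16 = 256 possible indices, which is where the 256 lower subtries come from.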
---
## 3. Node Access Patterns
### 3.1 DB Read Patterns
**Trie cursors** (`db/src/trie_cursor.rs`): Thin wrappers around MDBX cursors.
- `seek_exact()` and `seek()` are the primary operations — one DB call per trie node lookup.
- Each `TrieWalker::node()` call does one `seek` or `seek_exact` (L286-296 in walker.rs).
- **No batching**: Nodes are fetched one at a time. No read-ahead or batch prefetch.
**Hashed cursors** (`db/src/hashed_cursor.rs`): Direct MDBX cursor operations.
- `seek()` + `next()` pattern for iterating hashed entries.
- **Optimization exists**: `TrieNodeIter` caches `last_seeked_hashed_entry` and `last_next_result` to avoid redundant seeks (L124-157 in node_iter.rs).
### 3.2 Caching
1. **ChangesetCache** (`db/src/changesets.rs`): LRU cache for trie changesets keyed by block number. Used during reorgs. Bounded memory with explicit eviction (L287).
2. **Sparse trie hash memoization**: `SparseNode` stores `hash: Option<B256>` on every node (L1873-1916 in trie.rs). When a node's subtree hasn't changed (not in prefix set), the cached hash is returned immediately — **this is the incremental hashing mechanism**.
3. **Cached storage roots** (`value_encoder.rs`): `DashMap<B256, B256>` shared across workers caches computed storage roots (L69, 227). Avoids recomputing roots for unchanged accounts.
4. **Proof node deduplication** (`state.rs` L471-499): `revealed_account_paths` / `revealed_paths` track which proof paths were already revealed, skipping duplicate work.
5. **No general trie node cache**: There is no LRU/FIFO cache for trie nodes between blocks. Each new state root computation starts with fresh DB cursors.
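Item 2, the hash memoization, is the mechanism most worth seeing in code. The sketch below is a simplified model: `Node` is not Reth's `SparseNode`, the prefix set is a plain `BTreeSet` of changed keys, and the hash is a toy function. It shows how a cached hash is reused whenever no changed key falls under a node's path.

```rust
use std::collections::BTreeSet;

/// Simplified node: a path, a cached hash, and children. Not Reth's `SparseNode`.
struct Node {
    path: Vec<u8>,
    cached_hash: Option<u64>,
    children: Vec<Node>,
}

/// True if any changed key starts with `path`, i.e. the subtree is dirty.
fn is_dirty(changed: &BTreeSet<Vec<u8>>, path: &[u8]) -> bool {
    // Keys sharing a prefix are contiguous in sorted order, so it is enough
    // to inspect the first key that is >= `path`.
    changed
        .range(path.to_vec()..)
        .next()
        .map_or(false, |key| key.starts_with(path))
}

/// Returns the node hash, recomputing only subtrees touched by `changed`.
/// `recomputed` counts how many nodes were actually rehashed.
fn node_hash(node: &mut Node, changed: &BTreeSet<Vec<u8>>, recomputed: &mut usize) -> u64 {
    if let Some(hash) = node.cached_hash.filter(|_| !is_dirty(changed, &node.path)) {
        return hash; // Clean subtree: reuse the memoized hash.
    }
    // Toy hash over the path and the child hashes.
    let mut hash = node
        .path
        .iter()
        .fold(17u64, |acc, nibble| acc.wrapping_mul(31).wrapping_add(*nibble as u64));
    for child in &mut node.children {
        hash = hash.wrapping_mul(31).wrapping_add(node_hash(child, changed, recomputed));
    }
    node.cached_hash = Some(hash);
    *recomputed += 1;
    hash
}
```

With one changed key under path `[1]`, only the root and the `[1]` child are rehashed on a warm trie; the `[2]` child returns its cached hash without descending.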
### 3.3 Prefetching
- **Prewarm/prefetch via multiproofs** (`multiproof.rs` L56-62): `PREFETCH_MAX_BATCH_TARGETS = 512`, `PREFETCH_MAX_BATCH_MESSAGES = 16`. Prefetch proofs are dispatched ahead of execution.
- **No MDBX-level prefetch**: The code does not use `madvise()` or similar to prefetch DB pages.
- **Block Access List (EIP-7928)**: Support exists (`MultiProofMessage::BlockAccessList`) but is `todo!()` at L441.
---
## 4. Bottleneck Shortlist
### B1: Serial Final State Root Computation
- **File**: `crates/engine/tree/src/tree/payload_processor/sparse_trie.rs` L182-183
- **Also**: `crates/trie/sparse/src/state.rs` L886-903
- **Description**: `root_with_updates()` calls `revealed.root()` which runs the full `rlp_node()` loop serially on a single thread.
- **Severity**: **CRITICAL** — This is on the critical path of every block. For blocks with many state changes, this can take tens to hundreds of milliseconds.
- **Proposed Fix**: Use `ParallelSparseTrie` which hashes 256 lower subtries in parallel. The infrastructure exists in `sparse-parallel/` but integration into the engine pipeline needs completion.
### B2: Account Trie Reveals Are Serial
- **File**: `crates/trie/sparse/src/state.rs` L466-499
- **Description**: Account multiproof nodes are revealed serially while storage nodes get parallel reveals via `par_bridge_buffered()`.
- **Severity**: **HIGH** — Account trie can have thousands of nodes to reveal per block.
- **Proposed Fix**: Apply the same `par_bridge_buffered()` pattern used for storage reveals (L292-351).
### B3: Per-Node DB Seeks in Walker
- **File**: `crates/trie/trie/src/walker.rs` L286-296
- **Description**: Each `TrieWalker::node()` call does a single `cursor.seek()` or `cursor.seek_exact()`. No batch reads.
- **Severity**: **MEDIUM** — In the legacy path, this dominates I/O time. In the sparse path, it affects proof workers.
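- **Proposed Fix**: Implement batch-seek or cursor prefetch. MDBX supports `mdbx_cursor_get` with `MDBX_GET_MULTIPLE`, but only on `MDBX_DUPFIXED` tables, so it would not apply to the trie tables unless they use that layout.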
### B4: Sorted Iteration Before Spawning Storage Roots
- **File**: `crates/trie/parallel/src/root.rs` L103-104
- **Description**: `sorted_unstable_by_key` sorts all storage root targets before spawning any tasks. Delays first spawn.
- **Severity**: **LOW** (legacy path only) — Could start spawning immediately in hash order.
- **Proposed Fix**: Remove the sort, or spawn in parallel using `par_iter()` instead of a sequential loop.
### B5: HashMap Allocation in Sparse Trie
- **File**: `crates/trie/sparse/src/trie.rs` L337, L346
- **Description**: `SerialSparseTrie` stores nodes in `HashMap<Nibbles, SparseNode>` and values in `HashMap<Nibbles, Vec<u8>>`. Inserts that outgrow the current capacity trigger a resize and rehash of the whole table. Nibbles keys are small but numerous.
- **Severity**: **MEDIUM** — HashMap overhead is significant with 10K+ nodes per trie.
- **Proposed Fix**: Consider a more cache-friendly data structure (e.g., sorted Vec, BTreeMap, or a custom trie-aware allocator). The `clear()` + reuse pattern already helps (L1014-1023).
### B6: Redundant RLP Encoding
- **File**: `crates/trie/trie/src/trie.rs` L670 (`alloy_rlp::encode_fixed_size`)
- **Also**: `crates/trie/sparse/src/trie.rs` L1584 (leaf RLP), L1606 (extension RLP), L1763 (branch RLP)
- **Description**: RLP encoding happens inline during hash computation. Each `rlp_node()` call clears and refills `rlp_buf`. For nodes not in the prefix set, the cached hash skips encoding — but dirty nodes always re-encode.
- **Severity**: **LOW** — The `rlp_buf` reuse is already good. Could pre-allocate based on expected size.
- **Proposed Fix**: Pool/arena allocation for RLP buffers across parallel subtrie hashing.
### B7: DeferredDrops Accumulate Memory
- **File**: `crates/trie/sparse/src/state.rs` L26-34
- **Description**: `DeferredDrops` collects proof node buffers to avoid expensive deallocations during root computation. But these can accumulate significant memory across many proof reveals.
- **Severity**: **LOW** — Trading memory for latency is intentional.
- **Proposed Fix**: Consider periodic flushing when memory pressure is high.
### B8: Blinded Node Fallback During Leaf Operations
- **File**: `crates/trie/sparse/src/trie.rs` L656, L780-782
- **Description**: When updating or removing a leaf, if a blinded `SparseNode::Hash` is encountered, the operation fails with a `BlindedNode` error. This requires an additional proof fetch round-trip.
- **Also**: `reveal_remaining_child_on_leaf_removal` (L1336-1402) does synchronous DB lookups via `provider.trie_node()`.
- **Severity**: **MEDIUM** — Causes extra proof fetch cycles, delaying final root computation.
- **Proposed Fix**: Pre-reveal neighbor nodes when initially revealing proofs (expand proof targets to include siblings).
### B9: BlockAccessList Integration Incomplete
- **File**: `crates/engine/tree/src/tree/payload_processor/sparse_trie.rs` L441
- **Description**: `MultiProofMessage::BlockAccessList(_) => todo!()` — EIP-7928 block access lists could enable perfect prefetching of all required trie nodes before execution begins.
- **Severity**: **HIGH** (opportunity cost) — This is a major optimization opportunity.
- **Proposed Fix**: Implement BAL-based prefetch to pre-reveal all required trie paths before execution starts.
---
## 5. Optimization Candidates (Ranked by Expected Impact)
### Rank 1: Parallel Subtrie Hashing in Engine Pipeline
**Expected impact**: 30-50% reduction in final root computation time
**Current state**: `ParallelSparseTrie` exists in `sparse-parallel/` with rayon-based parallel hashing of 256 lower subtries. However, the engine pipeline's `SparseTrieCacheTask` uses `SerialSparseTrie`.
**Implementation**: Replace `SerialSparseTrie` with `ParallelSparseTrie` in `SparseStateTrie<A, S>` type parameters in `SparseTrieCacheTask`.
**Key files**: `sparse-parallel/src/trie.rs` L837-893, `engine/tree/src/tree/payload_processor/sparse_trie.rs`
### Rank 2: Block Access List (EIP-7928) Prefetching
**Expected impact**: 20-40% reduction in total state root time by eliminating proof fetch round-trips
**Current state**: Infrastructure exists (`MultiProofMessage::BlockAccessList`) but handler is `todo!()`.
**Implementation**: Parse BAL at block start, convert to proof targets, dispatch all proofs before execution begins. This eliminates the iterative reveal-execute-reveal cycle.
**Key files**: `sparse_trie.rs` L441, `multiproof.rs` L3-4
### Rank 3: Parallel Account Trie Reveals
**Expected impact**: 10-20% reduction in proof reveal time
**Current state**: Storage reveals are parallel (`par_bridge_buffered()`), account reveals are serial.
**Implementation**: Apply same parallel reveal pattern for account proof nodes. Requires making account trie node reveal lock-free or batching into chunks.
**Key file**: `sparse/src/state.rs` L466-499
### Rank 4: Incremental Hash Caching Across Blocks
**Expected impact**: 10-30% reduction in per-block root computation for consecutive blocks
**Current state**: The sparse trie already caches hashes in `SparseNode::hash` fields, and `SparseTrieCacheTask` supports trie reuse across payloads via `into_trie_for_reuse()` with pruning. However, cache effectiveness depends on prune depth.
**Implementation**: Tune prune depth and max_storage_tries parameters. Consider adaptive pruning based on block content (keep hot paths longer).
**Key files**: `sparse/src/state.rs` L1162-1175, `sparse_trie.rs` L318-330
### Rank 5: Batch DB Reads for Proof Workers
**Expected impact**: 5-15% reduction in proof computation I/O time
**Current state**: Each `TrieWalker::node()` call does one MDBX seek. Workers doing storage proofs may seek hundreds of nodes.
**Implementation**: Implement a prefetching cursor wrapper that reads ahead N nodes when a seek lands on a trie branch with many children. MDBX's `MDBX_GET_MULTIPLE` could be used, but only for tables opened with `MDBX_DUPFIXED` (fixed-size duplicates).
**Key files**: `db/src/trie_cursor.rs` L65-98, `trie/src/walker.rs` L286-296
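A read-ahead wrapper of this kind might look as follows. `Backend` is a hypothetical stand-in for an MDBX table that counts round trips, and keys are plain integers rather than nibble paths. The sketch shows only the buffering logic, not an MDBX binding.

```rust
use std::collections::BTreeMap;

/// Hypothetical backend standing in for an MDBX table; counts round trips.
struct Backend {
    table: BTreeMap<u64, u64>,
    reads: usize,
}

impl Backend {
    /// Returns up to `n` entries with key >= `key` in a single round trip.
    fn read_from(&mut self, key: u64, n: usize) -> Vec<(u64, u64)> {
        self.reads += 1;
        self.table.range(key..).take(n).map(|(k, v)| (*k, *v)).collect()
    }
}

/// Cursor wrapper that reads ahead `window` entries on every backend access.
struct PrefetchCursor {
    backend: Backend,
    window: usize,
    buffer: Vec<(u64, u64)>,
}

impl PrefetchCursor {
    /// Returns the first entry with key >= `key`, like a cursor `seek`.
    fn seek(&mut self, key: u64) -> Option<(u64, u64)> {
        // Serve from the buffer only when `key` falls inside the span it covers.
        // The buffer is a contiguous slice of the table, so no entry is skipped.
        let covered = match (self.buffer.first(), self.buffer.last()) {
            (Some(first), Some(last)) => first.0 <= key && key <= last.0,
            _ => false,
        };
        if !covered {
            self.buffer = self.backend.read_from(key, self.window.max(1));
        }
        self.buffer.iter().copied().find(|(k, _)| *k >= key)
    }
}
```

The wrapper only pays off when consecutive seeks land close together in key order; random access refills the buffer on every call and reads more data than the unbuffered cursor.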
### Rank 6: Reduce HashMap Overhead in Sparse Trie
**Expected impact**: 5-10% reduction in allocation-related overhead
**Current state**: `SerialSparseTrie` uses `HashMap<Nibbles, SparseNode>` which has non-trivial per-entry overhead.
**Implementation**: Consider a `Vec`-backed structure with path indexing, or a `BTreeMap` for better cache locality. For `ParallelSparseTrie`, the subtries are already smaller so this matters less per-subtrie.
**Key file**: `sparse/src/trie.rs` L337
### Rank 7: Smarter Proof Target Expansion
**Expected impact**: 5-10% reduction in blinded-node fallback round-trips
**Current state**: Proofs are fetched exactly for touched paths. Sibling nodes may not be revealed, causing `BlindedNode` errors during leaf operations.
**Implementation**: When generating proof targets, include sibling paths for accounts with storage changes. This pre-reveals nodes that `update_leaf` and `remove_leaf` might need.
**Key file**: `sparse/src/trie.rs` L634-768 (update_leaf), L770-1402 (remove_leaf)
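Generating the sibling paths for a target is straightforward. The helper below is hypothetical and treats a path as a slice of nibbles. Expanding to all 15 siblings is the broadest possible choice, and a real implementation would likely restrict it to branches that can collapse on removal.

```rust
/// Returns the sibling paths of `path`: same parent, different last nibble.
/// Paths are nibble sequences (each element in 0..16); the root has no siblings.
fn sibling_paths(path: &[u8]) -> Vec<Vec<u8>> {
    let Some((last, parent)) = path.split_last() else {
        return Vec::new();
    };
    (0u8..16)
        .filter(|nibble| nibble != last)
        .map(|nibble| {
            let mut sibling = parent.to_vec();
            sibling.push(nibble);
            sibling
        })
        .collect()
}
```

Each extra target adds proof work up front, so the expansion is only worthwhile if it removes more blinded-node round-trips than it costs.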
### Rank 8: Async Storage Root Computation Pipeline
**Expected impact**: Already partially implemented via `AsyncAccountValueEncoder`
**Current state**: `value_encoder.rs` implements async storage root computation with pre-dispatched proofs, cached roots, and sync fallback. Stats show `dispatched_count`, `from_cache_count`, `sync_count`.
**Implementation**: Monitor `sync_count` and `dispatched_missing_root_count` metrics to identify when dispatch coverage is insufficient and tune pre-dispatch logic.
**Key file**: `parallel/src/value_encoder.rs` L21-37
### Rank 9: Arena Allocation for RLP Buffers
**Expected impact**: 2-5% reduction in allocation overhead during hashing
**Current state**: Each `rlp_node()` call reuses a single `rlp_buf`, but `branch_value_stack_buf` and `branch_child_buf` are reallocated per branch.
**Implementation**: Use `RlpNodeBuffers` with pre-sized allocations based on trie depth. The `update_actions_buffers` pool in `ParallelSparseTrie` (L123) already demonstrates this pattern.
**Key file**: `sparse/src/trie.rs` L1545-1825
---
## 6. Key Observations
### What's Already Well-Optimized
1. **Incremental hashing via prefix sets**: The `PrefixSet` tracks which paths changed, so `rlp_node()` skips unchanged subtrees by checking `hash.filter(|_| !prefix_set_contains(&path))`. This is the most important optimization — it avoids rehashing the entire trie.
2. **Deferred drops**: `DeferredDrops` (L26-34 in state.rs) avoids expensive deallocations during the critical root computation path.
3. **Trie reuse across payloads**: `into_trie_for_reuse()` (L318-330 in sparse_trie.rs) preserves the sparse trie between payloads, with smart pruning based on subtrie heat.
4. **Worker pools with dedicated transactions**: Pre-spawned workers in `ProofWorkerHandle` avoid per-proof transaction creation overhead.
5. **V2 proof format**: The newer `DecodedMultiProofV2` uses `Vec<ProofTrieNode>` instead of `HashMap`, reducing allocation overhead and enabling simpler processing.
6. **Seek deduplication**: `TrieNodeIter` caches the last seeked entry and last next result (L121-172 in node_iter.rs) to avoid redundant DB operations.
### Architecture Gaps vs. Nethermind
1. **Nethermind's half-path optimization**: Nethermind computes state root incrementally during execution, not as a separate step afterward. Reth's architecture separates execution and state root computation, which adds a proof-fetch round-trip cycle.
2. **Nethermind's paprika tree**: Uses a custom page-based trie storage that eliminates DB cursor overhead. Reth uses MDBX with generic cursor abstractions, adding indirection.
3. **Nethermind's parallel witness generation**: Computes witnesses for all touched accounts in parallel before execution. Reth's `MultiProofMessage::PrefetchProofs` is the closest equivalent but is less integrated.