Instead of iterating all ancestors from scratch in merge_ancestors_into_overlay,
we now find the most recent ancestor that has a cached anchored_trie_input and
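use that as the base overlay. This reduces the work from O(N) to
O(N - cached_depth), where N is the number of ancestors and cached_depth is
the number already covered by the cached overlay.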
This optimization is particularly helpful at persist/reorg boundaries where
the slow path is triggered. If any ancestor in the chain already has a cached
overlay, we skip rebuilding everything before it.
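
A minimal sketch of the lookup described above, not the actual implementation. The `Ancestor` type, its fields, the `BTreeMap` overlay, and the function name `merge_from_nearest_cache` are all hypothetical stand-ins; the real code works on `anchored_trie_input` inside `merge_ancestors_into_overlay`.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for one in-memory ancestor block.
struct Ancestor {
    /// State changes introduced by this block alone.
    changes: BTreeMap<u64, u64>,
    /// Overlay covering this block and all older ones, if already computed.
    cached_overlay: Option<BTreeMap<u64, u64>>,
}

/// Builds the overlay for a chain ordered oldest to newest.
/// Returns the overlay and the number of ancestors that had to be merged.
fn merge_from_nearest_cache(ancestors: &[Ancestor]) -> (BTreeMap<u64, u64>, usize) {
    // Search from the newest end for an ancestor with a cached overlay.
    let cached_idx = ancestors.iter().rposition(|a| a.cached_overlay.is_some());

    // Start from the cached overlay if one exists, otherwise from scratch.
    let (mut overlay, start) = match cached_idx {
        Some(i) => (ancestors[i].cached_overlay.clone().unwrap(), i + 1),
        None => (BTreeMap::new(), 0),
    };

    // Merge only the ancestors newer than the cached one; later writes win.
    for a in &ancestors[start..] {
        overlay.extend(a.changes.iter().map(|(k, v)| (*k, *v)));
    }
    (overlay, ancestors.len() - start)
}
```

In this sketch, a chain of four ancestors whose second entry carries a cached overlay needs only two merges instead of four.
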
Also adds metrics that track Arc::make_mut clones, to help identify when
copy-on-write (CoW) clones are triggered (i.e. when strong_count > 1).
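
One way such tracking could look, as a sketch only. The `COW_CLONES` counter and the `make_mut_tracked` helper are hypothetical; a plain atomic stands in for whatever metrics handle the real code uses.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical counter standing in for a real metrics handle.
static COW_CLONES: AtomicU64 = AtomicU64::new(0);

/// Wraps Arc::make_mut and records when it is expected to clone.
fn make_mut_tracked<T: Clone>(arc: &mut Arc<T>) -> &mut T {
    // Arc::make_mut clones the inner value when other strong references exist.
    // The count is read before the call, so under concurrency this is approximate.
    if Arc::strong_count(arc) > 1 {
        COW_CLONES.fetch_add(1, Ordering::Relaxed);
    }
    Arc::make_mut(arc)
}
```

With a uniquely owned `Arc` the counter stays unchanged; once a second strong reference exists, the next `make_mut_tracked` call increments it and leaves the other reference pointing at the old value.
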