diff --git a/rlog/2024-12-30-merkle-tree.mdx b/rlog/2024-12-30-merkle-tree.mdx new file mode 100644 index 00000000..7a7b55ef --- /dev/null +++ b/rlog/2024-12-30-merkle-tree.mdx @@ -0,0 +1,214 @@ +--- +title: 'Vac 101: Climbing Merkle Trees' +date: 2024-12-30 12:00:00 +authors: marvin +published: true +slug: climbing-merkle-trees +categories: research + +toc_min_heading_level: 2 +toc_max_heading_level: 4 +--- + +In this post, we introduce a crucial data structure used throughout web3. + + + +## Introduction + +A large amount of data is swapped between users on a blockchain in the form of transactions. +Over the entire life of a blockchain, +the storage space required to maintain a copy of every transaction becomes untenable for most users. +However, the integrity of a blockchain relies on a large pool of users that can validate the blockchain's history from its inception to its present state. +The data representing the blockchain's state is compressed. +This compression addresses the issue of scalability that would otherwise greatly restrict the pool of users. + +Data compression alone is not the end goal. +As mentioned, it is essential for users to be able to validate the blockchain's history. +The property of compression and validation was solved in Bitcoin by the use of Merkle trees. +Merkle trees were introduced first by Ralph Merkle in his dissertation [[1](https://www.ralphmerkle.com/papers/Thesis1979.pdf)]. +A Merkle tree is a data structure that compresses a digest of data to a constant size while still providing a method for proving membership of elements of the digest. +A previous rlog[[2](https://vac.dev/rlog/rln-light-verifiers/)] described how Merkle trees with their proof of membership could be used for lightweight clients for RLN. + +## Tree structure + +A tree is a special data structure that organizes nodes so that there is exactly one path between any two nodes. +The trees that we consider can be arranged in layers with multiple nodes (children) merged into a single node (parent) in the preceding layer. +A single node exists in the base layer; +this special node is called the root node. +The highest level of the tree consists of childless nodes called leaves. + +![Image 1](/img/vac101/vac101_tree.png) + +A binary tree has one additional property: +each nonleaf node has exactly two children nodes. +That is, we assume that nodes in a binary tree are either a parent node with two children or a leaf. +As strange as it sounds, each child node has exactly one parental node. + +![Image 2](/img/vac101/vac101_binary_tree.png) + +A binary tree with $2^n$ leaves consists of $n+1$ layers. +Additionally, such a tree has $2^{n+1}-1$ nodes. + +## Merkle trees + +A Merkle tree is a specialized tree in which each node contains the evaluation of a hash function. +Merkle trees are usually taken to have a binary tree structure. +As such, the presentation we provide in this section will be for binary trees. + +### Construction +In this section, we show how Merkle trees are constructed to compress a digest $D$. +Suppose that the digest $D$ consists of $2^n$ entries; +we assume that the digest $D$ has this many entries since a Merkle tree is a binary tree. +Additionally, each digest can be padded to ensure that $D$ has the desired number of entries. + +Each leaf of the Merkle tree contains the hash of a digest entry. +Each parent node contains the hash of the concatenation of their child nodes. +Through this iterative construction, we reach the root of the tree. +The value contained in the root node is called the root hash. +The root hash is a compressed representation of the digest $D$. + +![Image 3](/img/vac101/vac101_merkle_tree.png) + +Each node in the Merkle tree is computed by taking a hash. +Since a binary tree with $2^n$ leaves has $2^{n+1}-1$ nodes, +then we need to evaluate $2^{n+1}-1$ hashes to construct the Merkle tree. + +### Merkle tree intregrity + +A large quantity of data can be compressed to a single hash value. +A natural question to ask is: could a clever party find another digest that yields a Merkle tree with the same root hash? +If possible, this would compromise the ledger since the blockchain's history could be altered. +Fortunately, Merkle trees are quite secure. +In fact, Merkle trees can be used to both bind and hide a digest. + +The Merkle tree is able to bind a digest with one of the properties of hash functions (see our previous Vac 101 [[3](https://vac.dev/rlog/vac101-fiat-shamir#hash-functions)] for information on hash functions). +A hash function is collision resistant; it is infeasible for a malicious party to find two values share the same hash value. + +This collision resistance property, essentially, fixes the input to each leaf and into their parent, their parent's parent, and so on. + +In certain applications, +it may be desirable for the digest of a Merkle tree to be kept confidential. +This is achieved with the preimage resistant property of hash functions. +A hash function is preimage resistant provided that it is difficult to reverse the hashing operation. +It would be necessary for a malicious party to find preimages to each node starting from the root node to determine the original digest. + +Now, we see that Merkle trees are secured structures that are tamper resistant. + +### Proof of membership +An interesting and critical property of Merkle trees is their ability to prove that any piece of data is part of its digest. +This can be done with logarithmic storage and logarithmic computation time. + +Suppose that we want to show that data $\ell$ is part of the Merkle tree's digest. +Additionally, suppose that $\mathsf{hash}$ is the hash function used to construct the tree. +We assume that the hash function $\mathsf{hash}$ can be computed in constant-time for any input. + +Suppose that a prover provides data $\ell$ to a verifier, +and tells the verifier that $\ell$ corresponds to the $i$th leaf of the Merkle tree. +For the verifier to be convinced that $\ell$ is part of the digest, he needs to be able to construct the tree's root hash using $\mathsf{hash}$, $i$, $\ell$ and some additional information from the prover. +Specifically, the prover must provide the sibling hashes for each value that the verifier can compute. +This enables the verifier to compute the parents of the siblings that the prover provides and the values that he was able to produce himself. +The last of the computed parents is the root. + +The leaf index $i$ indicates whether a hash value provided by the prover is a left or right sibling. +This is done by looking at the binary expansion of $i$. + +The verifier can compute the leaf $h_0 = \mathsf{hash}(\ell)$. +Next, using $h_0$'s sibling, $h'_0$, provided by the prover, +the verifier can compute $h_1 = \mathsf{hash}(h_0 \|h'_0)$ or $h_1 = \mathsf{hash}(h'_0 \| h_0)$ +depending on whether $h'_0$ is a left or right sibling. +This pathing continues until the verifier either successfully computes the root hash (in $n+1$ hashes) or fails to do so. + +The prover has to provide $n$ sibling nodes for the proof of membership. + +There is a key detail that is essential for the proof of membership to be secure. +The root hash has to be provided to the verifier prior to the selection of data $\ell$. +Otherwise, the prover could generate a series of hash values (with the corresponding root hash) to forge a proof of membership. + +#### Capped proof of membership +Polygon provides an implementation [[4](https://github.com/0xPolygonZero/plonky2/blob/main/plonky2/src/hash/merkle_tree.rs)] of a shortened proof of membership with a slight modification. +A specific layer of the Merkle tree is published instead of just the root hash. +By doing this, a capped proof of membership is just the path from leaf to the published layer. + +## Extensions of Merkle trees + +Merkle trees can be extended in multiple ways. +In this section, we explore a select few of these extensions. + +### Sparse Merkle trees + +A sparse Merkle tree (SMT) is a special Merkle tree that can be used to represent digests with nonconsecutive entries. +Specifically, each digest entry has a particular leaf index. +For simplicity, we assume that the index value is computed by taking the hash of the entry. +We note that this is a sorted SMT. + +Let $n$ denote the number of bits that a hash value can possess. This means that our SMT can have at most $2^n$ leaves. + +An SMT is treated as a Merkle tree in which each entry is placed in the leaf corresponding to its hash value, and the other entries have a $\mathsf{null}$ marker inserted in. +This means that we can prove membership in the way described. +However, we can also prove nonmembership of an element by showing that $\mathsf{null}$ is located in the element's hash location. +The crucial difference between a sorted and unsorted SMT is that the unsorted variant cannot be used to prove nonmembership. + +We can take advantage of the sparse nature of SMTs to provide shortened proofs. +Specifically, it is unlikely for entries to cluster together. +Thus, it is efficient to maintain a list of values: + +|Null values | +|---| +|$d_0 := \mathsf{Hash(null)},$| +|$d_1 := \mathsf{Hash(d_0 \|\| d_0)},$| +|$d_2 := \mathsf{Hash(d_1 \|\| d_1)},$| +|$\vdots$| +|$d_{n-1} := \mathsf{Hash(d_{n-2} \|\| d_{n-2})}.$| + +Each of the $d_i$'s represents the root hash of a Merkle tree with $2^i$ leaves containing $\mathsf{null}$. +These values can be used to shorten the time needed to construct an SMT and the length of proofs. + +### Proof of nonmembership + +In the first Vac 101 [[5](https://vac.dev/rlog/vac101-membership-with-bloom-filters-and-cuckoo-filters)], we examined Bloom and Cuckoo filters that could be used for proof of membership and nonmembership. +However, the proof of membership may result in false positives due to collisions. +This would affect nonmembership proofs as well. +Sparse Merkle trees can be adapted to provide greater assurance that a given piece of data is not a member of the digest. + +Why is sorting essential? +The sorting mechanism of data can be arbitrarily chosen. +However, it is essential that there are no gaps in the ordering. +The maximum number of elements that could ever exist in the digest must be known. +A simple method for this is to use a hash function to provide fingerprints to the data. +Each hash using either SHA-256 or Keccak has 256-bits. +Our entire digest could consist of a maximum of $2^{256}$ entries. +This assumes that our digest does not contain collisions. + +The fingerprint of a piece of data $\ell$ indicates which leaf of the SMT it is contained in. +This means that a nonmembership of $\ell$ in the SMT becomes a matter of proving that $\mathsf{null}$ is contained in $\ell$'s location. + +It is crucial for the SMT to be sorted. +Otherwise, a malicious party can append the entry $\ell$ to a random location. +This allows for the malicious party to provide contradictory proofs that prove both membership and nonmembership. +We note that the requirement that an SMT is sorted may be too strong of an assumption in centralized cases. +However, sortedness is a necessary property of SMTs for decentralized systems. + +### Verkle Trees +A proof of membership grows in length as the Merkle tree grows. +The most obvious approach to remedy this scalability issue is to use Merkle trees in which each node has more than two children. +However, this does not fix the issue. +A proof of membership in a $k$-nary Merkle tree [[6](https://math.mit.edu/research/highschool/primes/materials/2018/Kuszmaul.pdf)] (each node has $k$ children) has a proof size $\log_k(n)(k-1)$. +The multiple $k-1$ is the number of silbings that a node has on each layer. +Hence, the proof size grows faster than a logarithmic function of the digest size. + +An alternate approach is to use a different data structure: Verkle trees [[6](https://math.mit.edu/research/highschool/primes/materials/2018/Kuszmaul.pdf)]. +A Verkle tree replaces hash functions with polynomial commitments [[7](https://ethresear.ch/t/using-polynomial-commitments-to-replace-state-roots/7095), [8](https://dankradfeist.de/ethereum/2020/06/16/kate-polynomial-commitments.html)]. +We will explore Verkle trees in a future Vac 101 edition. + +## References + +- 1. [Secrecy, Authentication, and Public Key Systems](https://www.ralphmerkle.com/papers/Thesis1979.pdf) +- 2. [Verifying RLN Proofs in Light Clients with Subtrees](https://vac.dev/rlog/rln-light-verifiers/) +- 3. [Vac 101: Transforming an Interactive Protocol to a Noninteractive Argument](https://vac.dev/rlog/vac101-fiat-shamir#hash-functions) +- 4. [Capped merkle tree in Plonky2](https://github.com/0xPolygonZero/plonky2/blob/main/plonky2/src/hash/merkle_tree.rs) +- 5. [Vac 101: Membership with Bloom Filters and Cuckoo Filters](https://vac.dev/rlog/vac101-membership-with-bloom-filters-and-cuckoo-filters) +- 6. [Verkle Trees](https://math.mit.edu/research/highschool/primes/materials/2018/Kuszmaul.pdf) +- 7. [Using polynomial commitments to replace state roots](https://ethresear.ch/t/using-polynomial-commitments-to-replace-state-roots/7095) +- 8. [KZG polynomial commitments](https://dankradfeist.de/ethereum/2020/06/16/kate-polynomial-commitments.html) +- 9. [O1 labs' Verkle Tree repo](https://github.com/o1-labs/verkle-tree) \ No newline at end of file diff --git a/static/img/vac101/vac101_binary_tree.png b/static/img/vac101/vac101_binary_tree.png new file mode 100644 index 00000000..e09efd1e Binary files /dev/null and b/static/img/vac101/vac101_binary_tree.png differ diff --git a/static/img/vac101/vac101_merkle_tree.png b/static/img/vac101/vac101_merkle_tree.png new file mode 100644 index 00000000..cc446578 Binary files /dev/null and b/static/img/vac101/vac101_merkle_tree.png differ diff --git a/static/img/vac101/vac101_tree.png b/static/img/vac101/vac101_tree.png new file mode 100644 index 00000000..bda48be8 Binary files /dev/null and b/static/img/vac101/vac101_tree.png differ