From 83179e8bbe0f591168e17f8b3dc31d4b7f676574 Mon Sep 17 00:00:00 2001
From: jonesmarvin8 <83104039+jonesmarvin8@users.noreply.github.com>
Date: Fri, 19 Jul 2024 10:17:08 -0400
Subject: [PATCH] feat(bloomfilters): init rlog (#146)

---
 rlog/2024-07-19-bloomfilter.mdx | 266 ++++++++++++++++++++++++++++++++
 rlog/authors.yml                |   6 +-
 2 files changed, 271 insertions(+), 1 deletion(-)
 create mode 100644 rlog/2024-07-19-bloomfilter.mdx

diff --git a/rlog/2024-07-19-bloomfilter.mdx b/rlog/2024-07-19-bloomfilter.mdx
new file mode 100644
index 00000000..10db5000
--- /dev/null
+++ b/rlog/2024-07-19-bloomfilter.mdx
@@ -0,0 +1,266 @@
---
title: 'Membership with Bloom Filters and Cuckoo Filters'
date: 2024-07-19 12:00:00
authors: marvin
published: false
slug: membership-with-bloom-filters-and-cuckoo-filters
categories: research

toc_min_heading_level: 2
toc_max_heading_level: 4
---

We examine two data structures: Bloom filters and Cuckoo filters.

## Membership with Bloom Filters and Cuckoo Filters

The ability to efficiently query the membership of an element in a given data set is crucial.
In certain applications, it is more important to output a result quickly than to have a 'perfect' result.
In particular, false positives may be an acceptable tradeoff for speed.
In this post, we examine [Bloom](https://dl.acm.org/doi/10.1145/362686.362692) filters and [Cuckoo](https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf) filters.
Both of these filters are data structures that can be used for membership proofs.

Everyone is familiar with the process of creating a new account for various websites, whether it is an e-mail account or a social media account.
Consider the moment you enter your desired username.
Many sites provide real-time feedback, as you type, on the availability of a given string.
In this scenario, the result must appear instant, regardless of the number of existing accounts.
However, it is not essential that every username flagged as unavailable is, in fact, in use.
That is, it is sufficient to have a probabilistic check for membership.

**Bloom filters** and **Cuckoo filters** are data structures that can be used to accumulate data with a fixed amount of space.
The associated filter $F$ for a digest of data $D$ can be queried to determine whether an element is (possibly) a member of $D$:

- **0:** The queried element is definitely not a member of the digest $D$.
- **1:** The queried element is possibly a member of the digest $D$.

The algorithms associated with Bloom filters and Cuckoo filters, which we will discuss shortly, are deterministic.
The possibility of false positives arises from hash collisions, which can cause the query algorithm to report membership for elements that were never added.

## Bloom filters
A **Bloom filter** is a data structure that can be used to accumulate an arbitrary amount of data with a fixed amount of space.
Bloom filters have been a popular data structure for proofs of non-membership due to their small storage size.
Specifically, a Bloom filter consists of a binary string ${\bf{v}} \in \{0,1\}^n$ and $k$ hash functions $\{h_i: \{0,1\}^* \rightarrow \{0,\dots,n-1\}\}_{i=0}^{k-1}$.
Each hash function $h_i$ determines an index of the binary string ${\bf{v}}$ whose bit is flipped to 1.
The binary string ${\bf{v}}$ is initialized with every entry set to 0.
The hash functions do not need to be cryptographic hash functions.

- **Append:** Suppose that we wish to add the element $x$ to the Bloom filter.
  - Define the vector ${\bf{b}} \in \{0,\dots,n-1\}^k$ so that ${\bf{b}}[i] := h_i(x)$ for each $i \in \{0,\dots,k-1\}$.
  - Update the binary string ${\bf{v}}[{\bf{b}}[i]] \leftarrow 1$ for each $i \in \{0,\dots,k-1\}$.

- **Query:** Suppose that we wish to query the Bloom filter for element $y$.
  - Return 1 provided ${\bf{v}}[h_i(y)] = 1$ for every $i \in \{0,\dots,k-1\}$. Otherwise, return 0.
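A minimal Python sketch of **Append** and **Query** follows; deriving the $k$ hash functions by salting a single SHA-256 call is an implementation convenience, not part of the definition.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: an n-bit array v and k hash functions."""

    def __init__(self, n: int, k: int):
        self.n = n
        self.k = k
        self.bits = [0] * n  # the binary string v, initialized to all zeros

    def _indices(self, item: str):
        # Derive h_0(x), ..., h_{k-1}(x) by salting a single hash function with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.n

    def append(self, item: str) -> None:
        # Set v[h_i(x)] <- 1 for every i.
        for index in self._indices(item):
            self.bits[index] = 1

    def query(self, item: str) -> bool:
        # Return True (possibly a member) only if every probed bit is 1.
        return all(self.bits[index] == 1 for index in self._indices(item))

bf = BloomFilter(n=32, k=3)
for word in ["add", "sum", "equal"]:
    bf.append(word)
print(bf.query("add"))       # True
print(bf.query("subtract"))  # False, unless we hit a false positive
```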
The algorithm **Query** will output 1 for every element $y$ that has been added to the Bloom filter.
This is a consequence of the **Append** algorithm.
However, due to potential collisions across the set of hash functions, false positives can occur.
Moreover, the possibility of collisions makes it impossible to remove elements from the Bloom filter.

### Complexity
The storage of a Bloom filter requires constant space.
Specifically, the Bloom filter uses $n$ bits regardless of the size of the digest.
So, regardless of the number of elements that we append, the Bloom filter will use $n$ bits.
Further, if we assume that each of the $k$ hash functions runs in constant time, then we can append or query an entry in $O(k)$.

### Example
Suppose that $k = 3$ and $n = 10$.
Our Bloom filter is initialized as $\bf{v} = \begin{pmatrix}0&0&0&0&0&0&0&0&0&0\end{pmatrix}.$
Now, we will append the words $add$, $sum$, and $equal$.
Suppose that

$\begin{matrix}
h_0(add) = 1 & h_1(add) = 4 & h_2(add) = 7\\
h_0(sum) = 9 & h_1(sum) = 2 & h_2(sum) = 1\\
h_0(equal) = 5 & h_1(equal) = 8 & h_2(equal) = 0.
\end{matrix}$

After appending these words, the Bloom filter is $\bf{v} = \begin{pmatrix}1&1&1&0&1&1&0&1&1&1\end{pmatrix}.$

Now, suppose that we query the words $subtract$ and $multiple$ so that

$\begin{matrix}
h_0(subtract) = 3 & h_1(subtract) = 5 & h_2(subtract) = 1\\
h_0(multiple) = 7 & h_1(multiple) = 1 & h_2(multiple) = 4.
\end{matrix}$

The query for $subtract$ returns 0 since ${\bf{v}}[3]=0$.
On the other hand, the query for $multiple$ returns 1 since ${\bf{v}}[1]=1, {\bf{v}}[4] = 1$, and ${\bf{v}}[7]=1$.
Even though $multiple$ was not used to generate the Bloom filter ${\bf{v}}$, our query returns a false positive.

### Probability of false positives
For our analysis, we will assume that the relevant events are independent.
This assumption can be removed, and the same approximation is still obtained.

We note that, for a single hash function, the probability that a specific bit is flipped to 1 is $1/n$.
So, the probability that the specific bit is not flipped by the hash function is $1-1/n$.
Applying our assumption that the $k$ hash functions are 'independent,'
the probability that the specific bit is not flipped by any of the hash functions is $(1-1/n)^k$.

Recall the calculus fact $\lim_{n \to \infty} (1-1/n)^n = e^{-1}$.
That is, as we increase the number of bits $n$ that our Bloom filter uses, we have $(1-1/n)^k = \left((1-1/n)^n\right)^{k/n} \approx e^{-k/n}$,
so the approximate probability that a given bit is not flipped by any of the $k$ hash functions is $e^{-k/n}$.

Suppose that $\ell$ entries have been added to the Bloom filter.
The probability that a specific bit is still 0 after the $\ell$ entries have been added is approximately $e^{-\ell k/n}$.
The probability that a queried element is erroneously claimed as a member of the digest is therefore approximately $(1-e^{-\ell k/n})^k$.

The following table provides concrete values for these approximations.

| $n$ | $k$ | $\ell$ | $(1-e^{-\ell k/n})^k$ |
| --- | --- | --- | --- |
| 32 | 3 | 3 | 0.01474 |
| 32 | 3 | 7 | 0.11143 |
| 32 | 3 | 12 | 0.30802 |
| 32 | 3 | 17 | 0.50595 |
| 32 | 3 | 28 | 0.79804 |

Notice that the probability of false positives increases as the number of elements ($\ell$) that have been added to the digest increases.
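As a quick sanity check, the approximation above can be evaluated directly:

```python
from math import exp

def bloom_false_positive_rate(n: int, k: int, num_entries: int) -> float:
    """Approximate false-positive probability (1 - e^(-num_entries * k / n))^k."""
    return (1 - exp(-num_entries * k / n)) ** k

for num_entries in [3, 7, 12, 17, 28]:
    rate = bloom_false_positive_rate(n=32, k=3, num_entries=num_entries)
    print(num_entries, round(rate, 5))  # matches the table above up to rounding
```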
### Sliding-Window Bloom filter
Our toy example and the table above illustrate an issue with Bloom filters:
the number of entries that can be added to a Bloom filter is restricted by our choice of $k$ and $n$.
Not only does the probability of false positives increase as entries are added,
but our vector ${\bf{v}}$ can eventually become a string of all 1s.
[Szepieniec and Værge](https://eprint.iacr.org/2023/1208.pdf) proposed a modification to Bloom filters to handle this.

Instead of having a fixed number of bits for our Bloom filter, we dynamically allot memory based on the number of entries that have been added to the filter.
Given a predetermined threshold $b$ on the number of entries, we shift our 'window' of flippable bits by $s$ positions (the window offset is $\lfloor t/b \rfloor s$).
Note that it is therefore necessary to keep track of when a given entry was added to the digest:
querying the Sliding-Window Bloom filter with different timestamps will yield different results.

This can be done with $k$ hash functions, as before.
Alternatively, Szepieniec and Værge proposed using a single hash function to produce all $k$ indices in the current window.
Specifically, we obtain the bits we wish to flip to 1 by computing $h(X || i)$ for each $i \in \{0,\dots, k-1\}$, where $X$ is defined next.
For Sliding-Window Bloom filters, $X$ is more than just the element we wish to append to the filter.
Instead, $X$ consists of the element $x$ and a timestamp $t$.
The timestamp $t$ is used to locate the correct window of bits, as we see below:

- **Append:** Suppose that we wish to add the element $x$ with timestamp $t$ to the Sliding-Window Bloom filter.
  - Define the vector ${\bf{b}} \in \{0,\dots,n-1\}^k$ so that ${\bf{b}}[i] := h(x||t||i)$ for each $i \in \{0,\dots,k-1\}$.
  - Update the binary string ${\bf{v}}[{\bf{b}}[i]+\lfloor t/b \rfloor s] \leftarrow 1$ for each $i \in \{0,\dots,k-1\}$.

- **Query:** Suppose that we wish to query the Bloom filter for element $y$ with timestamp $t$.
  - Return 1 provided ${\bf{v}}[h(y||t||i) + \lfloor t/b \rfloor s] = 1$ for every $i \in \{0,\dots,k-1\}$. Otherwise, return 0.

By incorporating a shifting window, we maintain efficient querying and appending, but we give up constant space.
In exchange for losing constant space, we gain 'infinite' scalability.
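A minimal sketch of the windowed operations, assuming the bit array is simply extended on demand and that $h(x||t||i)$ is instantiated with a salted SHA-256 call:

```python
import hashlib

class SlidingWindowBloomFilter:
    """Sketch of a sliding-window Bloom filter: the window of flippable bits
    shifts by s positions for every b units of the timestamp t."""

    def __init__(self, n: int, k: int, b: int, s: int):
        self.n, self.k, self.b, self.s = n, k, b, s
        self.bits = [0] * n  # grows as the window slides forward

    def _indices(self, item: str, t: int):
        offset = (t // self.b) * self.s  # window offset floor(t/b) * s
        for i in range(self.k):
            digest = hashlib.sha256(f"{item}|{t}|{i}".encode()).digest()  # h(x||t||i)
            yield int.from_bytes(digest, "big") % self.n + offset

    def append(self, item: str, t: int) -> None:
        for index in self._indices(item, t):
            if index >= len(self.bits):  # dynamically allot memory as the window slides
                self.bits.extend([0] * (index + 1 - len(self.bits)))
            self.bits[index] = 1

    def query(self, item: str, t: int) -> bool:
        return all(index < len(self.bits) and self.bits[index] == 1
                   for index in self._indices(item, t))
```

Note that the same element queried with a different timestamp probes a different window, which is why the timestamp must be stored alongside the element.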
## Cuckoo filters
A Cuckoo filter is a data structure for probabilistic membership proofs based on Cuckoo hash tables.
The specific design goal for Cuckoo filters is to address the inability to remove elements from a Bloom filter.
This is done by replacing the list of bits with a fixed-length list of 'fingerprints,' where a fingerprint can be thought of as a short hash value of an entry in the digest.
If the maximum number of entries that a Cuckoo filter can hold is $n$ and a fingerprint occupies $f$ bits,
then the Cuckoo filter occupies $nf$ bits.

Now, we describe the algorithms associated with a Cuckoo filter $C$ with hash function $hash(X)$ and fingerprint function $fingerprint(X)$.

- **Append:** Suppose that we wish to add the element $x$ to the Cuckoo filter.
  - If either position $i_x := hash(x)$ or $j_x := i_x \oplus hash(fingerprint(x))$ of $C$ is empty,
    then $fingerprint(x)$ is inserted into an empty position.
  - If both $i_x$ and $j_x$ are occupied by fingerprints distinct from $fingerprint(x)$,
    then we select either $i_x$ or $j_x$ and insert $fingerprint(x)$ there.
    The fingerprint that previously occupied this position cannot be discarded.
    Instead, we insert it into its alternate location.
    This reshuffling process either ends with every fingerprint in its own bucket or with a fingerprint that cannot be inserted.
    If we are left with a fingerprint that cannot be inserted, then the Cuckoo filter is overfilled.

- **Query:** Suppose that we wish to query the Cuckoo filter for element $y$.
  - Return 1 provided $fingerprint(y)$ is in either position $i_y$ or $j_y$. Otherwise, return 0.

- **Delete:** Suppose that we wish to delete the element $y$ from the Cuckoo filter.
  - If $y$ has been added to the Cuckoo filter, then $fingerprint(y)$ is in either position $i_y$ or $j_y$.
    We remove $fingerprint(y)$ from the appropriate position.

We note that false positives in a Cuckoo filter occur only when the queried element shares a fingerprint with an element already stored in one of its two candidate buckets.

### Example
In this example, we will append the words $add$, $sum$, and $equal$ to a Cuckoo filter with 8 slots.

For each word $x$, we compute two indices:
$i_x := hash(x) \text{ and } j_x := i_x \oplus hash(fingerprint(x)).$
Suppose that we have the following values for our words,
written as three-bit strings with the least significant bit first (so $(0,1,0)$ denotes bucket 2 and $(1,0,0)$ denotes bucket 1):

| word | $i_x$ | $j_x$ |
| --- | --- | --- |
| $add$ | $(0,1,0)$ | $(1,0,0)$ |
| $sum$ | $(1,0,1)$ | $(1,1,0)$ |
| $equal$ | $(0,1,0)$ | $(1,0,1)$ |

For clarity, we write the words themselves in the buckets rather than their fingerprints.

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| append $add$ | | | $add$ | | | | | |
| append $sum$ | | | $add$ | | | $sum$ | | |

Notice that both of the buckets (2 and 5) that $equal$ can map to are occupied.
So, we select one of these buckets (say 2) to insert $equal$ into.
Then, we must relocate $add$ to its alternate bucket, bucket 1.
This leaves us with the Cuckoo filter:

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | $add$ | $equal$ | | | $sum$ | | |

### Complexity
Notice that deletions and queries in a Cuckoo filter are done in constant time:
only two locations need to be checked for any element $x$.
Appends, however, may require shuffling previously added fingerprints to their alternate locations.
As such, an append does not run in constant time in the worst case.
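A minimal sketch of these operations, under a few simplifying assumptions: one fingerprint per bucket, a power-of-two number of buckets so that the XOR-based alternate index stays in range, and a cap on relocations. The original paper uses buckets with several slots, but the relocation logic is otherwise the same.

```python
import hashlib
import random

class CuckooFilter:
    """Minimal Cuckoo filter: one fingerprint per bucket, two candidate buckets per element."""

    MAX_RELOCATIONS = 500  # give up and report the filter as overfilled after this many kicks

    def __init__(self, num_buckets: int, fingerprint_bits: int = 8):
        assert num_buckets & (num_buckets - 1) == 0, "power-of-two size keeps XOR indices in range"
        self.num_buckets = num_buckets
        self.fingerprint_bits = fingerprint_bits
        self.buckets = [None] * num_buckets  # each bucket holds at most one fingerprint

    def _hash(self, data: bytes) -> int:
        return int.from_bytes(hashlib.sha256(data).digest(), "big")

    def _fingerprint(self, item: str) -> int:
        fp = self._hash(b"fp:" + item.encode()) % (2 ** self.fingerprint_bits)
        return fp or 1  # avoid 0 so that an empty bucket is unambiguous

    def _alternate(self, index: int, fp: int) -> int:
        # Partial-key cuckoo hashing: i and i XOR hash(fingerprint) are each other's alternates.
        return index ^ (self._hash(fp.to_bytes(4, "big")) % self.num_buckets)

    def _candidates(self, item: str):
        fp = self._fingerprint(item)
        i = self._hash(item.encode()) % self.num_buckets
        return fp, i, self._alternate(i, fp)

    def append(self, item: str) -> bool:
        fp, i, j = self._candidates(item)
        for index in (i, j):
            if self.buckets[index] is None or self.buckets[index] == fp:
                self.buckets[index] = fp
                return True
        # Both candidate buckets are occupied: evict a resident fingerprint to its alternate bucket.
        index = random.choice((i, j))
        for _ in range(self.MAX_RELOCATIONS):
            fp, self.buckets[index] = self.buckets[index], fp  # place the new fingerprint, evict the old
            index = self._alternate(index, fp)
            if self.buckets[index] is None:
                self.buckets[index] = fp
                return True
        return False  # the filter is overfilled

    def query(self, item: str) -> bool:
        fp, i, j = self._candidates(item)
        return self.buckets[i] == fp or self.buckets[j] == fp

    def delete(self, item: str) -> bool:
        fp, i, j = self._candidates(item)
        for index in (i, j):
            if self.buckets[index] == fp:
                self.buckets[index] = None
                return True
        return False
```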
## Bloom filters vs Cuckoo filters
The design of Bloom filters is focused on space efficiency and quick query time.
Even though Cuckoo filters also occupy a fixed amount of space,
they require significantly more space for $n$ items than Bloom filters.
The worst-case append in a Cuckoo filter is slower than an append in a Bloom filter.
However, an append that does not require any shuffling can be quicker than an append in a Bloom filter.
Cuckoo filters make up for these disadvantages with quicker queries and the ability to delete entries.
Further, the probability of false positives in a Cuckoo filter is lower than in a Bloom filter.

## Combining Filters with RLN
In a series of posts ([1](https://vac.dev/rlog/rln-anonymous-dos-prevention), [2](https://vac.dev/rlog/rln-v3/), [3](https://vac.dev/rlog/rln-light-verifiers)),
various versions of rate limiting nullifiers (RLN), as used by Waku, have been discussed.
RLN uses a sparse Merkle tree for the membership set.
The computational power required to construct the Merkle tree prevents light clients from participating in verifying membership proofs.
In [Verifying RLN Proofs in Light Clients with Subtrees](https://vac.dev/rlog/rln-light-verifiers),
it was proposed to move the membership set on-chain so that a light client would not need to construct the entire Merkle tree locally.
Unfortunately, the naive approach is not practical, as the gas limit for a single call is too restrictive for an appropriately sized tree.
Instead, it was proposed to make use of subtrees.
In this section, we discuss an alternative solution for light clients: using filters for the membership set.
The two [parts of RLN](https://rate-limiting-nullifier.github.io/rln-docs/rln_in_details.html) that we will focus on are user registration and deletion.

Both Bloom and Cuckoo filters support user registration, since registration can be performed as an append.
The fixed size of these filters would restrict the total number of users that can register.
This can be mitigated by using a Sliding-Window Bloom filter, which supports system growth.
The sliding-window approach can be adapted to Cuckoo filters as well.
In the case of a sliding-window filter, a user would keep track of the epoch in which they registered.
The registration of new users in a Bloom filter can be done in constant time, which is a significant improvement over appending to subtrees.
Unfortunately, the complexity of registration in a Cuckoo filter is not as easy to bound, since an append may trigger a chain of relocations.

A user could be slashed from the RLN membership set for sending too many messages in a given epoch.
Unfortunately, Bloom filters do not support the deletion of members.
Luckily, Cuckoo filters allow deletions, which can be performed in constant time.

A Cuckoo filter with a sliding window could therefore be used so that light clients are able to verify proofs of membership in the RLN.
Because false positives are allowed, these proofs are not a substitute for the usual proofs that a heavy client can verify.
However, by allowing false positives, a light client can participate in verifying RLN proofs in an efficient manner.
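To make the mapping concrete, the following is a purely illustrative sketch of how registration, slashing, and a light client's membership check could be expressed in terms of the filter operations above. All names here are hypothetical, the filter object is assumed to expose the `append`/`delete`/`query` operations of the Cuckoo filter sketch from the previous section, and this is not Waku or RLN code.

```python
class FilterBackedRlnMembership:
    """Illustrative only: tracks RLN registrations in any filter object exposing
    append/delete/query, e.g. the CuckooFilter sketch above. Not Waku or RLN code."""

    def __init__(self, membership_filter):
        self.filter = membership_filter

    @staticmethod
    def _key(commitment: str, epoch: int) -> str:
        # Sliding-window style: a user keeps track of the epoch in which they registered.
        return f"{commitment}@{epoch}"

    def register(self, commitment: str, epoch: int) -> bool:
        return self.filter.append(self._key(commitment, epoch))  # registration is an append

    def slash(self, commitment: str, epoch: int) -> bool:
        return self.filter.delete(self._key(commitment, epoch))  # slashing requires deletion

    def maybe_member(self, commitment: str, epoch: int) -> bool:
        # A light client's check: False is definitive, True may be a false positive.
        return self.filter.query(self._key(commitment, epoch))

# For example: membership = FilterBackedRlnMembership(CuckooFilter(num_buckets=1 << 20))
```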
### References
- [Space/Time Trade-offs in Hash Coding with Allowable Errors](https://dl.acm.org/doi/10.1145/362686.362692)
- [David Wagner's Lecture Notes on Bloom filters](https://people.eecs.berkeley.edu/~daw/teaching/cs170-s03/Notes/lecture10.pdf)
- [Mutator Sets and their Application to Scalable Privacy](https://eprint.iacr.org/2023/1208)
- [Cuckoo Filter: Practically Better than Bloom](https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf)
- [Strengthening Anonymous DoS Prevention with Rate Limiting Nullifiers in Waku](https://vac.dev/rlog/rln-anonymous-dos-prevention)
- [RLN-v3: Towards a Flexible and Cost-Efficient Implementation](https://vac.dev/rlog/rln-v3/)
- [Verifying RLN Proofs in Light Clients with Subtrees](https://vac.dev/rlog/rln-light-verifiers)
- [RLN in details](https://rate-limiting-nullifier.github.io/rln-docs/rln_in_details.html)

diff --git a/rlog/authors.yml b/rlog/authors.yml
index b8d6edee..7e8e2803 100644
--- a/rlog/authors.yml
+++ b/rlog/authors.yml
@@ -54,4 +54,8 @@ p1ge0nh8er:
 farooq:
   name: 'Umar Farooq'
-  github: 'ufarooqstatus'
\ No newline at end of file
+  github: 'ufarooqstatus'
+
+marvin:
+  name: 'Marvin'
+  github: 'jonesmarvin8'
\ No newline at end of file