Add minimal Kademlia DHT spec (#108)

Specifies the Kademlia Distributed Hash Table (DHT) subsystem in libp2p.

Co-authored-by: John Hiesey <john@hiesey.com>
Co-authored-by: Steven Allen <steven@stebalien.com>
Co-authored-by: Max Inden <mail@max-inden.de>
kad-dht/README.md

# libp2p Kademlia DHT specification
| Lifecycle Stage | Maturity | Status | Latest Revision |
|-----------------|----------------|--------|-----------------|
| 3A | Recommendation | Active | r0, 2021-05-07 |
Authors: [@raulk], [@jhiesey], [@mxinden]
Interest Group:
[@raulk]: https://github.com/raulk
[@jhiesey]: https://github.com/jhiesey
[@mxinden]: https://github.com/mxinden
See the [lifecycle document][lifecycle-spec] for context about maturity level
and spec status.
[lifecycle-spec]: https://github.com/libp2p/specs/blob/master/00-framework-01-spec-lifecycle.md
---
## Overview
The Kademlia Distributed Hash Table (DHT) subsystem in libp2p is a DHT
implementation largely based on the Kademlia [0] whitepaper, augmented with
notions from S/Kademlia [1], Coral [2] and the [BitTorrent DHT][bittorrent].
This specification assumes the reader has prior knowledge of those systems. So
rather than explaining DHT mechanics from scratch, we focus on differential
areas:
1. Specialisations and peculiarities of the libp2p implementation.
2. Actual wire messages.
3. Other algorithmic or non-standard behaviours worth pointing out.
For everything else that isn't explicitly stated herein, it is safe to assume
behaviour similar to Kademlia-based libraries.
Code snippets use a Go-like syntax.
## Definitions
### Replication parameter (`k`)
The amount of replication is governed by the replication parameter `k`. The
recommended value for `k` is 20.
### Distance
In all cases, the distance between two keys is `XOR(sha256(key1),
sha256(key2))`.
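
As a small illustration, the distance between two keys can be computed and compared as follows. This is a minimal sketch; keys are modelled as raw byte strings, and distances are compared as big-endian 256-bit integers.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// distance returns XOR(sha256(a), sha256(b)) as a 256-bit big-endian value.
func distance(a, b []byte) [32]byte {
	ha, hb := sha256.Sum256(a), sha256.Sum256(b)
	var d [32]byte
	for i := range d {
		d[i] = ha[i] ^ hb[i]
	}
	return d
}

func main() {
	d1 := distance([]byte("key1"), []byte("key2"))
	d2 := distance([]byte("key1"), []byte("key3"))
	// Distances are compared as big-endian integers, e.g. when sorting
	// candidate peers by closeness to a target key.
	fmt.Println(bytes.Compare(d1[:], d2[:]))
}
```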
### Kademlia routing table
An implementation of this specification must try to maintain `k` peers with
shared key prefix of length `L`, for every `L` in `[0..(keyspace-length - 1)]`,
in its routing table. Given the keyspace length of 256 through the sha256 hash
function, `L` can take values between 0 (inclusive) and 255 (inclusive). The
local node shares a prefix length of 256 with its own key only.
Implementations may use any data structure to maintain their routing table.
Examples are the k-bucket data structure outlined in the Kademlia paper [0] or
XOR-tries (see [go-libp2p-xor]).
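
For a k-bucket style table, the bucket a peer falls into is determined by the shared prefix length between the peer's hashed key and the local node's hashed key. A minimal sketch of that computation:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/bits"
)

// commonPrefixLen returns the number of leading bits shared by the
// sha256 hashes of two keys. A routing table keyed by this value keeps
// up to k peers per prefix length L in [0, 255].
func commonPrefixLen(a, b []byte) int {
	ha, hb := sha256.Sum256(a), sha256.Sum256(b)
	n := 0
	for i := 0; i < len(ha); i++ {
		x := ha[i] ^ hb[i]
		if x == 0 {
			n += 8
			continue
		}
		n += bits.LeadingZeros8(x)
		break
	}
	return n
}

func main() {
	fmt.Println(commonPrefixLen([]byte("self"), []byte("peer")))
	fmt.Println(commonPrefixLen([]byte("self"), []byte("self"))) // 256: own key only
}
```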
### Alpha concurrency parameter (`α`)
The concurrency of node and value lookups is limited by parameter `α`, with a
default value of 3. This implies that each lookup process can perform no more
than 3 in-flight requests at any given time.
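
One common way to bound in-flight requests, sketched below, is a buffered channel used as a semaphore; `queryPeer` is a hypothetical stand-in for a single lookup RPC.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const alpha = 3 // concurrency parameter α

// queryPeer is a hypothetical stand-in for one FIND_NODE or GET_VALUE request.
func queryPeer(id string) {
	time.Sleep(10 * time.Millisecond)
	fmt.Println("queried", id)
}

func main() {
	candidates := []string{"p1", "p2", "p3", "p4", "p5"}
	sem := make(chan struct{}, alpha) // at most α requests in flight
	var wg sync.WaitGroup
	for _, p := range candidates {
		sem <- struct{}{} // blocks while α requests are outstanding
		wg.Add(1)
		go func(p string) {
			defer wg.Done()
			defer func() { <-sem }()
			queryPeer(p)
		}(p)
	}
	wg.Wait()
}
```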
## DHT operations
The libp2p Kademlia DHT offers the following types of operations:
- **Peer routing**
- Finding the closest nodes to a given key via `FIND_NODE`.
- **Value storage and retrieval**
- Storing a value on the nodes closest to the value's key by looking up the
closest nodes via `FIND_NODE` and then putting the value to those nodes via
`PUT_VALUE`.
- Getting a value by its key from the nodes closest to that key via
`GET_VALUE`.
- **Content provider advertisement and discovery**
- Adding oneself to the list of providers for a given key at the nodes closest
to that key by finding the closest nodes via `FIND_NODE` and then adding
oneself via `ADD_PROVIDER`.
- Getting providers for a given key from the nodes closest to that key via
`GET_PROVIDERS`.
In addition, the libp2p Kademlia DHT offers the auxiliary _bootstrap_ operation.
### Peer routing
The below is one possible algorithm to find nodes closest to a given key on the
DHT. Implementations may diverge from this base algorithm as long as they adhere
to the wire format and make progress towards the target key.
Let's assume we're looking for nodes closest to key `Key`. We then enter an
iterative network search.
We keep track of the set of peers we've already queried (`Pq`) and the set of
next query candidates sorted by distance from `Key` in ascending order (`Pn`).
At initialization `Pn` is seeded with the `k` peers from our routing table we
know are closest to `Key`, based on the XOR distance function (see [distance
definition](#distance)).
Then we loop (a condensed sketch in Go follows the list):
1. > The lookup terminates when the initiator has queried and gotten responses
   > from the k (see [replication parameter `k`](#replication-parameter-k))
   > closest nodes it has seen.

   (See Kademlia paper [0].)

   The lookup may terminate early if the local node has queried all known
   nodes and fewer than `k` nodes are known.
2. Pick as many peers from the candidate peers (`Pn`) as the `α` concurrency
   factor allows. Send each a `FIND_NODE(Key)` request, and mark each as
   _queried_ in `Pq`.
3. Upon a response:
   1. If successful, the response will contain the `k` closest nodes the peer
      knows to the key `Key`. Add them to the candidate list `Pn`, except for
      those that have already been queried.
   2. If an error or timeout occurs, discard it.
4. Go to 1.
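
The following is a minimal, sequential (α = 1) sketch of this loop. `findNode` is a hypothetical stand-in for one `FIND_NODE` RPC, and peer IDs are modelled as plain strings; a real implementation would issue up to `α` requests concurrently and handle timeouts.

```go
package kaddht

import (
	"bytes"
	"crypto/sha256"
	"sort"
)

type Peer string

const k = 20

// findNode is a hypothetical stand-in for one FIND_NODE RPC; it returns
// the k closest peers the remote peer knows to key.
func findNode(p Peer, key []byte) ([]Peer, error) { return nil, nil }

// xorDist is the distance defined earlier: XOR(sha256(a), sha256(b)).
func xorDist(p Peer, key []byte) []byte {
	hp, hk := sha256.Sum256([]byte(p)), sha256.Sum256(key)
	d := make([]byte, len(hp))
	for i := range d {
		d[i] = hp[i] ^ hk[i]
	}
	return d
}

// lookup returns (up to) the k closest peers it could find to key.
func lookup(key []byte, seed []Peer) []Peer {
	queried := map[Peer]bool{} // Pq
	seen := map[Peer]bool{}
	candidates := append([]Peer(nil), seed...) // Pn, seeded from the routing table
	for _, p := range seed {
		seen[p] = true
	}

	for {
		// Keep Pn sorted by ascending XOR distance to the key.
		sort.Slice(candidates, func(i, j int) bool {
			return bytes.Compare(xorDist(candidates[i], key), xorDist(candidates[j], key)) < 0
		})
		// Pick the closest candidate among the k closest not yet queried.
		var next Peer
		for i, c := range candidates {
			if i == k {
				break
			}
			if !queried[c] {
				next = c
				break
			}
		}
		// Termination: the k closest candidates seen so far have all been
		// queried (or fewer than k peers are known at all).
		if next == "" {
			if len(candidates) > k {
				candidates = candidates[:k]
			}
			return candidates
		}
		queried[next] = true
		closer, err := findNode(next, key)
		if err != nil {
			continue // discard errors and timeouts
		}
		for _, p := range closer {
			if !seen[p] {
				seen[p] = true
				candidates = append(candidates, p)
			}
		}
	}
}
```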
### Value storage and retrieval
#### Value storage
To _put_ a value the DHT finds the `k` closest peers to the key of the value
(or fewer, if fewer than `k` peers are known) using the `FIND_NODE` RPC (see
[peer routing section](#peer-routing)), and then sends a `PUT_VALUE` RPC
message with the record value to each of those peers.
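
As a sketch, the put path simply composes the `lookup` from the peer routing sketch above with one RPC per peer; `putValue` is a hypothetical stand-in for a single `PUT_VALUE` RPC.

```go
package kaddht

// putValue is a hypothetical stand-in for a single PUT_VALUE RPC.
func putValue(p Peer, key, value []byte) error { return nil }

// PutValue stores the record on the k (or fewer) closest peers found by
// the iterative lookup sketched in the peer routing section.
func PutValue(key, value []byte, seed []Peer) {
	for _, p := range lookup(key, seed) {
		// Per-peer errors are ignored here; implementations may retry
		// or require a minimum number of successful stores.
		_ = putValue(p, key, value)
	}
}
```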
#### Value retrieval
When _getting_ a value from the DHT, implementations may use a mechanism like
quorums to define confidence in the values found on the DHT; put differently, a
mechanism to determine when a query is _finished_. E.g. with quorums one would
collect at least `Q` (quorum) responses from distinct nodes to check for
consistency before returning an answer.
Entry validation: Should the responses from different peers diverge, the
implementation should use some validation mechanism to resolve the conflict and
select the _best_ result (see [entry validation section](#entry-validation)).
Entry correction: Nodes that returned _worse_ records, and nodes that returned
no record but were among the closest to the key, are updated via a direct
`PUT_VALUE` RPC call when the lookup completes. Thus the DHT network eventually
converges to the best value for each record, as a result of nodes collaborating
with one another.
The below is one possible algorithm to lookup a value on the DHT.
Implementations may diverge from this base algorithm as long as they adhere to
the wire format and make progress towards the target key.
Let's assume we're looking for key `Key`. We first try to fetch the value from
the local store. If found, and `Q` is 0 or 1, the search is complete.
Otherwise, the local result counts for one towards the search of `Q` values. We
then enter an iterative network search.
We keep track of:
* the number of values we've fetched (`cnt`).
* the best value we've found (`best`), and which peers returned it (`Pb`)
* the set of peers we've already queried (`Pq`) and the set of next query
candidates sorted by distance from `Key` in ascending order (`Pn`).
* the set of peers with outdated values (`Po`).
At initialization we seed `Pn` with the `α` peers from our routing table we know
are closest to `Key`, based on the XOR distance function.
Then we loop (see the sketch after this list):
1. If we have collected `Q` or more answers, we cancel outstanding requests and
   return `best`. If there are no outstanding requests and `Pn` is empty, we
   terminate early and return `best`. In either case, we send
   `PUT_VALUE(Key, best)` messages to notify the peers holding an outdated
   value (`Po`), as well as the peers that returned no value despite being
   among the `k` closest peers to the key, of the best value we discovered.
2. Pick as many peers from the candidate peers (`Pn`) as the `α` concurrency
   factor allows. Send each a `GET_VALUE(Key)` request, and mark each as
   _queried_ in `Pq`.
3. Upon a response:
   1. If successful, and we receive a value:
      1. If this is the first value we've seen, we store it in `best`, along
         with the peer who sent it in `Pb`.
      2. Otherwise, we resolve the conflict by e.g. calling
         `Validator.Select(best, new)`:
         1. If the new value wins, store it in `best`, and mark all formerly
            "best" peers (`Pb`) as _outdated peers_ (`Po`). The current peer
            becomes the new best peer (`Pb`).
         2. If the new value loses, we add the current peer to `Po`.
   2. If successful with or without a value, the response will contain the
      closest nodes the peer knows to the key `Key`. Add them to the candidate
      list `Pn`, except for those that have already been queried.
   3. If an error or timeout occurs, discard it.
4. Go to 1.
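
A condensed, sequential (α = 1) sketch of this loop, reusing `Peer`, `xorDist`, and `putValue` from the earlier sketches; `getValue` and `selectBest` are hypothetical stand-ins for one `GET_VALUE` RPC and `Validator.Select`. For brevity it treats every responding peer without a value as outdated, rather than only those among the `k` closest.

```go
package kaddht

import (
	"bytes"
	"sort"
)

// getValue is a hypothetical stand-in for one GET_VALUE RPC; selectBest
// stands in for Validator.Select and returns 1 if b beats a.
func getValue(p Peer, key []byte) (value []byte, closer []Peer, err error) { return nil, nil, nil }
func selectBest(key, a, b []byte) int                                      { return 0 }

// popClosest removes and returns the candidate closest to key.
func popClosest(ps *[]Peer, key []byte) Peer {
	sort.Slice(*ps, func(i, j int) bool {
		return bytes.Compare(xorDist((*ps)[i], key), xorDist((*ps)[j], key)) < 0
	})
	p := (*ps)[0]
	*ps = (*ps)[1:]
	return p
}

func GetValue(key []byte, seed []Peer, q int) []byte {
	var best []byte
	var bestPeers, outdated []Peer             // Pb, Po
	queried := map[Peer]bool{}                 // Pq
	candidates := append([]Peer(nil), seed...) // Pn
	cnt := 0

	for cnt < q && len(candidates) > 0 {
		p := popClosest(&candidates, key)
		if queried[p] {
			continue
		}
		queried[p] = true
		val, closer, err := getValue(p, key)
		if err != nil {
			continue // discard errors and timeouts
		}
		for _, c := range closer {
			if !queried[c] {
				candidates = append(candidates, c)
			}
		}
		if val == nil {
			outdated = append(outdated, p) // responded, but holds no value
			continue
		}
		cnt++
		switch {
		case best == nil:
			best, bestPeers = val, []Peer{p} // first value seen
		case selectBest(key, best, val) == 1: // new value wins
			outdated = append(outdated, bestPeers...)
			best, bestPeers = val, []Peer{p}
		default: // new value loses
			outdated = append(outdated, p)
		}
	}
	// Entry correction: bring outdated peers up to date
	// (a no-op in this sketch if no value was found at all).
	for _, p := range outdated {
		_ = putValue(p, key, best)
	}
	return best
}
```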
#### Entry validation
Implementations should validate DHT entries during retrieval and before
storage, e.g. by allowing the user to supply a record `Validator` when
constructing a DHT node. Below is a sample interface of such a `Validator`:
```go
// Validator is an interface that should be implemented by record
// validators.
type Validator interface {
	// Validate validates the given record, returning an error if it's
	// invalid (e.g., expired, signed by the wrong key, etc.).
	Validate(key string, value []byte) error

	// Select selects the best record from the set of records (e.g., the
	// newest).
	//
	// Decisions made by select should be stable.
	Select(key string, values [][]byte) (int, error)
}
```
`Validate()` should be a pure function that reports the validity of a record.
It may validate a cryptographic signature, or any other property of the record.
It is called on two occasions:
1. To validate values retrieved in a `GET_VALUE` query.
2. To validate values received in a `PUT_VALUE` query before storing them in the
local data store.
Similarly, `Select()` is a pure function that returns the best record out of
two or more candidates. It may use a sequence number, a timestamp, or another
heuristic of the value to make the decision.
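
For illustration, a hypothetical `Validator` that orders records by an 8-byte big-endian sequence number prefixed to the value might look as follows. The prefix layout is an assumption of this sketch, not part of the spec.

```go
package kaddht

import (
	"encoding/binary"
	"errors"
)

// SeqnoValidator is a hypothetical Validator that expects each value to
// begin with an 8-byte big-endian sequence number.
type SeqnoValidator struct{}

func (SeqnoValidator) Validate(key string, value []byte) error {
	if len(value) < 8 {
		return errors.New("record too short: missing sequence number")
	}
	return nil
}

// Select returns the index of the record with the highest sequence
// number; ties keep the earliest index, so decisions are stable.
func (SeqnoValidator) Select(key string, values [][]byte) (int, error) {
	best, bestSeq := -1, uint64(0)
	for i, v := range values {
		if len(v) < 8 {
			continue // skip invalid records
		}
		if seq := binary.BigEndian.Uint64(v[:8]); best == -1 || seq > bestSeq {
			best, bestSeq = i, seq
		}
	}
	if best == -1 {
		return 0, errors.New("no valid records")
	}
	return best, nil
}
```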
### Content provider advertisement and discovery
Nodes must keep track of which nodes advertise that they provide a given key
(CID). These provider advertisements should expire, by default, after 24 hours.
These records are managed through the `ADD_PROVIDER` and `GET_PROVIDERS`
messages.
#### Content provider advertisement
When the local node wants to indicate that it provides the value for a given
key, the DHT finds the closest peers to the key using the `FIND_NODE` RPC (see
[peer routing section](#peer-routing)), and then sends an `ADD_PROVIDER` RPC with
its own `PeerInfo` to each of these peers.
Each peer that receives the `ADD_PROVIDER` RPC should validate that the received
`PeerInfo` matches the sender's `peerID`, and if it does, that peer should store
the `PeerInfo` in its datastore. Implementations may choose not to store the
addresses of the providing peer, e.g. to reduce the amount of required storage
or to avoid storing potentially outdated address information.
#### Content provider discovery
_Getting_ the providers for a given key is done in the same way as _getting_ a
value for a given key (see [value retrieval section](#value-retrieval)) except
that instead of using the `GET_VALUE` RPC message the `GET_PROVIDERS` RPC
message is used.
When a node receives a `GET_PROVIDERS` RPC, it must look up the requested
key in its datastore, and respond with any corresponding records in its
datastore, plus a list of closer peers in its routing table.
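
A minimal sketch of the receiving side of both provider RPCs, modelling `PeerInfo` as just the peer ID for brevity; the store layout and helper names are assumptions of this sketch.

```go
package kaddht

import "time"

type providerEntry struct {
	peer  Peer
	added time.Time
}

// ProviderStore maps a key (CID bytes) to its unexpired providers.
type ProviderStore struct {
	providers map[string][]providerEntry
	ttl       time.Duration // provider record lifetime, default: 24h
}

func NewProviderStore(ttl time.Duration) *ProviderStore {
	return &ProviderStore{providers: map[string][]providerEntry{}, ttl: ttl}
}

// handleAddProvider records the sender as a provider for key, but only
// for PeerInfos that actually match the sender.
func (s *ProviderStore) handleAddProvider(sender Peer, key []byte, infos []Peer) {
	for _, p := range infos {
		if p != sender {
			continue // ignore advertisements on behalf of other peers
		}
		s.providers[string(key)] = append(s.providers[string(key)],
			providerEntry{peer: p, added: time.Now()})
	}
}

// handleGetProviders returns unexpired providers for key; the caller
// also attaches the closest peers from its routing table.
func (s *ProviderStore) handleGetProviders(key []byte) []Peer {
	var out []Peer
	for _, e := range s.providers[string(key)] {
		if time.Since(e.added) < s.ttl {
			out = append(out, e.peer)
		}
	}
	return out
}
```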
### Bootstrap process
The bootstrap process is responsible for keeping the routing table filled and
healthy throughout time. The below is one possible algorithm to bootstrap.
Implementations may diverge from this base algorithm as long as they adhere to
the wire format and keep their routing table up-to-date, especially with peers
closest to themselves.
The process runs once on startup, then periodically with a configurable
frequency (default: 5 minutes). On every run, we generate a random peer ID and
we look it up via the process defined in [peer routing](#peer-routing). Peers
encountered throughout the search are inserted in the routing table, as usual.
This is repeated as many times per run as the configuration parameter
`QueryCount` specifies (default: 1). In addition, to improve awareness of nodes
close to oneself, implementations should include a lookup for their own peer ID.
Every repetition is subject to a `QueryTimeout` (default: 10 seconds), which
upon firing, aborts the run.
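
A minimal sketch of this schedule, reusing `Peer` and `lookup` from the earlier sketches; `randomKey` and `selfKey` are hypothetical stand-ins for generating a random target and for the local node's own ID.

```go
package kaddht

import (
	"context"
	"crypto/rand"
	"time"
)

// randomKey draws a random 32-byte target; selfKey is a hypothetical
// accessor for the local node's own ID.
func randomKey() []byte { b := make([]byte, 32); rand.Read(b); return b }
func selfKey() []byte   { return nil }

// bootstrap runs once on startup, then on every tick, refreshing the
// routing table via lookups for random keys and for the node's own ID.
func bootstrap(ctx context.Context, period, queryTimeout time.Duration, queryCount int, seed []Peer) {
	run := func() {
		for i := 0; i < queryCount; i++ { // QueryCount, default: 1
			boundedLookup(ctx, queryTimeout, randomKey(), seed)
		}
		boundedLookup(ctx, queryTimeout, selfKey(), seed) // self-lookup
	}
	run() // once on startup
	ticker := time.NewTicker(period) // default: 5 minutes
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			run()
		case <-ctx.Done():
			return
		}
	}
}

// boundedLookup bounds one lookup by QueryTimeout (default: 10 seconds).
func boundedLookup(ctx context.Context, d time.Duration, key []byte, seed []Peer) {
	ctx, cancel := context.WithTimeout(ctx, d)
	defer cancel()
	_ = ctx // a real implementation threads ctx through each RPC
	lookup(key, seed)
}
```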
## RPC messages
Remote procedure calls are performed by:
1. Opening a new stream.
2. Sending the RPC request message.
3. Listening for the RPC response message.
4. Closing the stream.
On any error, the stream is reset.
Implementations may choose to re-use streams by sending one or more RPC request
messages on a single outgoing stream before closing it. Implementations must
handle additional RPC request messages on an incoming stream.
All RPC messages sent over a stream are prefixed with the message length in
bytes, encoded as an unsigned variable length integer as defined by the
[multiformats unsigned-varint spec][uvarint-spec].
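
As a sketch of the framing, the helpers below use Go's `encoding/binary` uvarint functions, which are wire-compatible with the multiformats unsigned-varint encoding for the message sizes involved; (de)serializing the message body itself is left to a protobuf library.

```go
package kaddht

import (
	"bufio"
	"encoding/binary"
	"io"
)

// readRPC reads one length-prefixed message body from a stream.
func readRPC(r *bufio.Reader) ([]byte, error) {
	length, err := binary.ReadUvarint(r) // uvarint length prefix
	if err != nil {
		return nil, err
	}
	buf := make([]byte, length)
	if _, err := io.ReadFull(r, buf); err != nil {
		return nil, err
	}
	return buf, nil // pass to the protobuf decoder
}

// writeRPC writes one serialized message, prefixed with its length.
func writeRPC(w io.Writer, msg []byte) error {
	var lenBuf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(lenBuf[:], uint64(len(msg)))
	if _, err := w.Write(lenBuf[:n]); err != nil {
		return err
	}
	_, err := w.Write(msg)
	return err
}
```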
All RPC messages conform to the following protobuf:
```protobuf
// Record represents a dht record that contains a value
// for a key value pair
message Record {
  // The key that references this record
  bytes key = 1;

  // The actual value this record is storing
  bytes value = 2;

  // Note: These fields were removed from the Record message
  //
  // Hash of the authors public key
  // optional string author = 3;
  // A PKI signature for the key+value+author
  // optional bytes signature = 4;

  // Time the record was received, set by receiver
  // Formatted according to https://datatracker.ietf.org/doc/html/rfc3339
  string timeReceived = 5;
};

message Message {
  enum MessageType {
    PUT_VALUE = 0;
    GET_VALUE = 1;
    ADD_PROVIDER = 2;
    GET_PROVIDERS = 3;
    FIND_NODE = 4;
    PING = 5;
  }

  enum ConnectionType {
    // sender does not have a connection to peer, and no extra information (default)
    NOT_CONNECTED = 0;

    // sender has a live connection to peer
    CONNECTED = 1;

    // sender recently connected to peer
    CAN_CONNECT = 2;

    // sender recently tried to connect to peer repeatedly but failed to connect
    // ("try" here is loose, but this should signal "made strong effort, failed")
    CANNOT_CONNECT = 3;
  }

  message Peer {
    // ID of a given peer.
    bytes id = 1;

    // multiaddrs for a given peer
    repeated bytes addrs = 2;

    // used to signal the sender's connection capabilities to the peer
    ConnectionType connection = 3;
  }

  // defines what type of message it is.
  MessageType type = 1;

  // defines what coral cluster level this query/response belongs to.
  // in case we want to implement coral's cluster rings in the future.
  int32 clusterLevelRaw = 10; // NOT USED

  // Used to specify the key associated with this message.
  // PUT_VALUE, GET_VALUE, ADD_PROVIDER, GET_PROVIDERS
  bytes key = 2;

  // Used to return a value
  // PUT_VALUE, GET_VALUE
  Record record = 3;

  // Used to return peers closer to a key in a query
  // GET_VALUE, GET_PROVIDERS, FIND_NODE
  repeated Peer closerPeers = 8;

  // Used to return Providers
  // GET_VALUE, ADD_PROVIDER, GET_PROVIDERS
  repeated Peer providerPeers = 9;
}
```
These are the requirements for each `MessageType`:
* `FIND_NODE`: In the request `key` must be set to the binary `PeerId` of the
node to be found. In the response `closerPeers` is set to the `k` closest
`Peer`s.
* `GET_VALUE`: In the request `key` is an unstructured array of bytes. `record`
is set to the value for the given key (if found in the datastore) and
`closerPeers` is set to the `k` closest peers.
* `PUT_VALUE`: In the request `key` is an unstructured array of bytes. The
target node validates `record`, and if it is valid, it stores it in the
datastore.
* `GET_PROVIDERS`: In the request `key` is set to a CID. The target node
returns the closest known `providerPeers` (if any) and the `k` closest known
`closerPeers`.
* `ADD_PROVIDER`: In the request `key` is set to a CID. The target node
  verifies that `key` is a valid CID; all `providerPeers` that match the RPC
  sender's `PeerId` are recorded as providers.
* `PING`: Deprecated message type replaced by the dedicated [ping
protocol][ping]. Implementations may still handle incoming `PING` requests for
backwards compatibility. Implementations must not actively send `PING`
requests.
Note: Any time a relevant `Peer` record is encountered, the associated
multiaddrs are stored in the node's peerbook.
---
## References
[0]: Maymounkov, P., & Mazières, D. (2002). Kademlia: A Peer-to-Peer Information System Based on the XOR Metric. In P. Druschel, F. Kaashoek, & A. Rowstron (Eds.), Peer-to-Peer Systems (pp. 53–65). Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/3-540-45748-8_5
[1]: Baumgart, I., & Mies, S. (2007). S/Kademlia: A Practicable Approach Towards Secure Key-Based Routing. In 2007 International Conference on Parallel and Distributed Systems. https://doi.org/10.1109/ICPADS.2007.4447808
[2]: Freedman, M. J., & Mazières, D. (2003). Sloppy Hashing and Self-Organizing Clusters. In IPTPS. Springer Berlin / Heidelberg. Retrieved from www.coralcdn.org/docs/coral-iptps03.ps
[bittorrent]: http://bittorrent.org/beps/bep_0005.html
[uvarint-spec]: https://github.com/multiformats/unsigned-varint
[ping]: https://github.com/libp2p/specs/issues/183
[go-libp2p-xor]: https://github.com/libp2p/go-libp2p-xor