The IPFS Distributed Hash Table (DHT) specification defines a structured overlay network used for peer and content routing in the InterPlanetary File System (IPFS). It extends the libp2p Kademlia DHT specification, adapting and adding features to support IPFS-specific requirements.
The Kademlia Distributed Hash Table (DHT) is a decentralized key-value store designed to enable efficient and scalable peer-to-peer routing. It provides a structured overlay network that allows nodes to locate peers and content in a distributed system without relying on centralized servers.
The primary goal of the Kademlia routing algorithm is to progressively discover and interact with nodes that are closest to a given key, based on the network's distance metric. Once a node has identified the closest peers, it can either store records on them or retrieve records from them.
The IPFS Kademlia DHT specification extends the libp2p Kademlia DHT, with practical details related to CID, IPNS, and content providing.
It is possible to use an alternative DHT specification alongside an IPFS implementation, rather than the one detailed here. This document specifically outlines all protocol customizations and adaptations required for participation in the Amino DHT. If you're designing a new Kademlia-based DHT for use with IPFS, some details in this specification may appear overly specific or prescriptive.
Nodes participating in the public Amino DHT Swarm MUST implement the IPFS Kademlia DHT specification. The IPFS Kademlia DHT specification MAY be used in other DHT swarms as well.
A DHT swarm is a group of interconnected nodes running the IPFS Kademlia DHT protocol, collectively identified by a unique libp2p protocol identifier. IPFS nodes MAY participate in multiple DHT swarms simultaneously. DHT swarms can be either public or private.
Every DHT swarm is associated with a specific libp2p protocol identifier, and all nodes within that swarm must use it. Public DHT swarms MUST use a unique libp2p protocol identifier, whereas private swarms SHOULD use a distinct identifier. Although private swarms may reuse an identifier if their networks remain isolated, they will merge upon interaction. Therefore, unique identifiers SHOULD be used.
The Amino DHT is a public instance of the IPFS Kademlia DHT specification mounted under the /ipfs/kad/1.0.0 libp2p protocol. It is also referred to as the Public IPFS DHT.
The Amino DHT is used by multiple IPFS implementations, including kubo and helia, and can be joined by using the public good Amino DHT Bootstrappers.
IPFS LAN DHTs are DHT swarms operating exclusively within a local network.
They are accessible only to nodes within the same network and are identified by
the libp2p protocol /ipfs/lan/kad/1.0.0.
Although many IPFS LAN DHTs use the same protocol identifier, each LAN DHT swarm is distinct because its scope is limited to its own local network.
Nodes MAY participate in LAN DHTs, enabling fast peer and content discovery in their local network.
Custom DHT swarms can be created to serve specific use cases by meeting these requirements:
- The swarm's libp2p protocol identifier SHOULD follow the format /<swarm-prefix>/kad/<version>. Note that if two public swarms share the same protocol identifier and encounter each other, they will merge.

A node operating in Server Mode (or DHT Server) is responsible for responding to lookup queries from other nodes and storing records. It stores a share of the global DHT state, and needs to ensure that this state is up-to-date.
A node operating in Client Mode (or DHT Client) is simply a client able to make requests to DHT Servers. DHT Clients do not answer queries and do not store records.
Having a large number of reliable DHT Servers benefits the network by distributing the load of handling queries and storing records. Nodes SHOULD operate in Server Mode if they are publicly reachable and have sufficient resources. Conversely, nodes behind NATs or firewalls, or with intermittent availability, low bandwidth, or limited CPU, RAM, or storage resources, SHOULD operate in Client Mode. Operating a DHT server without the capacity to respond quickly to queries negatively impacts network performance and SHOULD be avoided.
DHT Servers MUST advertise the libp2p Kademlia protocol identifier via the libp2p identify protocol. In addition, DHT Servers MUST accept incoming streams using the libp2p Kademlia protocol identifier.
DHT Clients MUST NOT advertise support for the libp2p Kademlia protocol identifier nor offer the libp2p Kademlia protocol identifier for incoming streams.
DHT Clients MAY provide Content and Records to the network; content providing is not exclusive to DHT Servers.
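As a non-normative illustration, the sketch below shows how a node built on go-libp2p might select its DHT mode at construction time. It assumes recent versions of go-libp2p and go-libp2p-kad-dht; option names and defaults may differ between releases.

```go
package main

import (
	"context"
	"log"

	"github.com/libp2p/go-libp2p"
	dht "github.com/libp2p/go-libp2p-kad-dht"
)

func main() {
	ctx := context.Background()

	// Create a libp2p host with the library's default transports and security.
	h, err := libp2p.New()
	if err != nil {
		log.Fatal(err)
	}

	// Publicly reachable nodes with sufficient resources can run in server
	// mode (dht.ModeServer); constrained or NATed nodes should stay in
	// client mode.
	kadDHT, err := dht.New(ctx, h, dht.Mode(dht.ModeClient))
	if err != nil {
		log.Fatal(err)
	}
	defer kadDHT.Close()
}
```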
All nodes MUST run the libp2p network stack.
DHT Servers MUST support the libp2p ping protocol to allow probing by other DHT nodes.
DHT Servers MUST support TCP with Yamux multiplexing
and SHOULD support QUIC over UDP as a modern alternative to TCP.
For secure communication, DHT Servers MUST support both
TLS and
Noise.
It is essential that all DHT Servers are able to open a connection to each
other. Additionally, DHT Servers SHOULD support WebRTC direct,
Secure
WebSockets
and
WebTransport.
Adoption of browser-based transports by DHT Servers is encouraged, to allow
browser-based DHT Clients to interact with the DHT.
DHT Clients SHOULD support TCP with Yamux multiplexing
and QUIC over UDP whenever possible. For secure communication, clients SHOULD support both
TLS and
Noise.
They MAY also support additional libp2p transports. However,
to guarantee discovery of existing records in the DHT, a client MUST implement
at least one transport (QUIC or TCP+Yamux) with at least one security
protocol (TLS or Noise).
Clients that cannot support the required transports and security protocols (e.g., browser-based nodes) MAY still act as DHT Clients, but their ability to find records in the DHT will be limited.
Kademlia [0] operates on a binary keyspace defined as $\lbrace 0,1 \rbrace^m$. In
particular, the IPFS Kademlia DHT uses a keyspace of length $m=256$, containing
all bitstrings of 256 bits. The distance between any pair of keys is defined as
the bitwise XOR of the two keys, resulting in a new key representing the
distance between the two keys. This keyspace is used for indexing both nodes
and content.
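The following sketch illustrates the XOR distance metric on 256-bit keys using only the Go standard library; the sample key values are arbitrary.

```go
package main

import (
	"fmt"
	"math/big"
)

// xorDistance returns the Kademlia XOR distance between two 256-bit keys,
// interpreted as an unsigned integer.
func xorDistance(a, b [32]byte) *big.Int {
	var d [32]byte
	for i := range a {
		d[i] = a[i] ^ b[i]
	}
	return new(big.Int).SetBytes(d[:])
}

func main() {
	var a, b [32]byte
	a[31] = 0x05 // ...0101
	b[31] = 0x03 // ...0011
	fmt.Println(xorDistance(a, b)) // 6: XOR of 0101 and 0011 is 0110
}
```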
The Kademlia node identifier is derived from the libp2p node's Peer ID: it is the SHA2-256 digest of the binary representation of the Peer ID. The resulting 256-bit number is used as the node's identifier in the Kademlia keyspace.
Example:
PeerID b58 representation: 12D3KooWLU2znyJMtDiHArqAGbZn8CgUGp92kxDBtefftEEaHSZS
PeerID CID b32: bafzaajaiaejcbhr3im6l2mocxctoxpoktgf5b5gccqojzgxviixjoycrwhtdv4kn
PeerID CID b36: k51qzi5uqu5dk4kbd5bpmklj30q0q8n3091bncahugkx18e84p1od2rk25olsd
PeerID binary (hex): 0024080112209e3b433cbd31c2b8a6ebbdca998bd0f4c2141c9c9af5422e976051b1e63af14d
Kademlia identifier (hex): e43d28f0996557c0d5571d75c62a57a59d7ac1d30a51ecedcdb9d5e4afa56100
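The identifier above can be reproduced with a short sketch using only the Go standard library: the Kademlia identifier is the SHA2-256 digest of the Peer ID's binary (multihash) representation.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

func main() {
	// Binary (multihash) representation of the Peer ID from the example above.
	peerIDBytes, err := hex.DecodeString(
		"0024080112209e3b433cbd31c2b8a6ebbdca998bd0f4c2141c9c9af5422e976051b1e63af14d")
	if err != nil {
		panic(err)
	}

	// The Kademlia identifier is the SHA2-256 digest of the Peer ID bytes.
	kadID := sha256.Sum256(peerIDBytes)
	fmt.Println(hex.EncodeToString(kadID[:]))
	// Expected (per the example above):
	// e43d28f0996557c0d5571d75c62a57a59d7ac1d30a51ecedcdb9d5e4afa56100
}
```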
The Kademlia Routing Table maintains contact information about other DHT Servers in the network. It has knowledge about all nearby nodes and progressively fewer nodes as the XOR distance increases. This structure allows efficient and rapid navigation of the network during lookups.
The Routing Table MUST contain information about at least k DHT Servers whose
Kademlia Identifier shares a common prefix of length l with the local node,
for every l in [0, 255], provided such nodes exist. The set of k peers
sharing a common prefix of length l with the local node is called the
bucket l.
In practice, buckets with smaller indices will typically be full, as many nodes in the network share shorter prefix lengths with the local node. Conversely, buckets beyond a certain index usually remain empty, since it's statistically unlikely that any node will have an identifier sharing a very long common prefix with the local node. For more information see bucket population measurements.
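The bucket index for a peer is the length of the common prefix between its Kademlia identifier and the local node's identifier. A minimal sketch, using only the Go standard library:

```go
package main

import (
	"fmt"
	"math/bits"
)

// commonPrefixLen returns the number of leading bits shared by two 256-bit
// Kademlia identifiers. A peer whose identifier shares a common prefix of
// length l with the local node belongs to bucket l.
func commonPrefixLen(a, b [32]byte) int {
	for i := range a {
		if x := a[i] ^ b[i]; x != 0 {
			return i*8 + bits.LeadingZeros8(x)
		}
	}
	return 256 // identical identifiers
}

func main() {
	var local, remote [32]byte
	local[0] = 0b1010_0000
	remote[0] = 0b1011_0000
	fmt.Println(commonPrefixLen(local, remote)) // 3
}
```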
The IPFS Kademlia DHT uses a bucket size of k = 20. This corresponds to the
k value as defined in the original Kademlia paper [0]. The k value is also
used as a replication factor and defines how many peers are returned to a
lookup request.
Note that DHT Clients are never included in a Routing Table.
Each DHT Server MUST store the public multiaddresses for every node in its Routing Table. DHT Servers MUST discard nodes with only private and/or relay multiaddresses. Additionally, DHT Servers MUST verify that these nodes are reachable and replace any nodes that are no longer accessible.
Nodes MUST NOT be removed from the Routing Table as long as they remain online. Therefore, the bucket replacement policy is based on seniority, ensuring that the most stable peers are eventually retained in the Routing Table.
DHT servers SHOULD implement an IP Diversity Filter to ensure that nodes in their routing table originate from a diverse set of Autonomous System Numbers (ASNs). This measure helps mitigate Sybil attacks and enhances the network’s resilience.
Implementations SHOULD impose the following limits:
- For the entire routing table, a maximum of 3 nodes sharing the same IP grouping should be allowed.
- Within a single bucket, a maximum of 2 nodes from the same IP grouping should be permitted.
- For IP grouping: addresses are grouped by /16 prefixes, except for legacy Class A blocks, which are grouped by /8 prefixes.

Since a single node can advertise multiple addresses, a peer MUST NOT be added to the routing table if any of its addresses already exceed the allowed representation within the table.
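The sketch below illustrates one possible IP grouping function. The treatment of "legacy Class A blocks" is approximated here as the classful Class A range (leading bit 0); real implementations may use a more precise allocation list, and IPv6 grouping (for example by ASN) is omitted.

```go
package main

import (
	"fmt"
	"net"
)

// ipGroupKey returns a coarse grouping key for an IPv4 address, for use in an
// IP diversity filter. Addresses are grouped by their /16 prefix, except for
// addresses in the (approximated) legacy Class A range, which are grouped by
// their /8 prefix.
func ipGroupKey(ip net.IP) string {
	v4 := ip.To4()
	if v4 == nil {
		return "" // IPv6 grouping is out of scope for this sketch
	}
	if v4[0] < 128 { // classful Class A range, used here as an approximation
		return fmt.Sprintf("%d.0.0.0/8", v4[0])
	}
	return fmt.Sprintf("%d.%d.0.0/16", v4[0], v4[1])
}

func main() {
	fmt.Println(ipGroupKey(net.ParseIP("203.0.113.7")))  // 203.0.0.0/16
	fmt.Println(ipGroupKey(net.ParseIP("100.64.12.34"))) // 100.0.0.0/8
}
```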
There are several strategies a DHT Server can use to verify that nodes in its Routing Table remain reachable. Implementations may choose their own methods, provided they avoid serving unresponsive nodes. The recommended strategy is to periodically refresh the Routing Table.
When using periodic refresh, DHT Servers SHOULD perform a Routing Table Refresh
every 10 minutes. During this process, the server sends a ping request to all
nodes it hasn't heard from recently (e.g., in the last 5 minutes). Any peer that
fails to respond MUST be removed from the Routing Table.
After removing unresponsive peers, any buckets that are not full MUST be
replenished with fresh, online peers. This can be accomplished by either adding
recently connected peers or by executing a FIND_NODE RPC
message with a randomly generated Peer ID matching the bucket.
FIND_NODE requests should only be run for buckets up to the last non-empty
bucket.
Finally, the refresh process concludes by executing a FIND_NODE request for
the local node's Peer ID, ensuring the DHT Server maintains up-to-date
information on its closest peers.
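A non-normative sketch of the refresh procedure is shown below. The RoutingTable and DHT interfaces are hypothetical, introduced only to make the steps concrete; they do not correspond to any specific library API.

```go
package main

import (
	"context"
	"time"
)

// RoutingTable and DHT are minimal, hypothetical interfaces used only to
// sketch the refresh procedure described above.
type RoutingTable interface {
	PeersNotSeenSince(t time.Time) []string // peers with no activity since t
	Remove(peerID string)
	NonFullBuckets() []int // non-full bucket indices, up to the last non-empty one
}

type DHT interface {
	Ping(ctx context.Context, peerID string) error
	FindNode(ctx context.Context, key []byte) error // iterative FIND_NODE lookup
	RandomKeyForBucket(bucket int) []byte           // key whose prefix falls into the bucket
	SelfKey() []byte
}

// refresh performs one Routing Table Refresh round: ping stale peers, drop
// unresponsive ones, replenish non-full buckets, and finally look up the
// local node's own key.
func refresh(ctx context.Context, rt RoutingTable, d DHT) {
	// Ping every peer not heard from recently (e.g. in the last 5 minutes).
	for _, p := range rt.PeersNotSeenSince(time.Now().Add(-5 * time.Minute)) {
		if err := d.Ping(ctx, p); err != nil {
			rt.Remove(p) // unresponsive peers MUST be removed
		}
	}
	// Replenish non-full buckets with a FIND_NODE for a random key falling
	// into each such bucket.
	for _, b := range rt.NonFullBuckets() {
		_ = d.FindNode(ctx, d.RandomKeyForBucket(b))
	}
	// Conclude with a lookup for the local node's own key.
	_ = d.FindNode(ctx, d.SelfKey())
}

func main() {
	// In a real node, refresh would run every 10 minutes, e.g. via time.Ticker.
}
```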
When performing a lookup for a Kademlia Identifier in the DHT, a node begins by sending requests to known DHT servers whose identifiers are close to the target. Each response provides information on peers that are even closer to the target identifier, and the process continues iteratively until the absolute closest peers are discovered.
In an iterative lookup, the querying node sends requests to several known DHT servers. Each server returns a list of peers that are closer to the target Kademlia Identifier, but does not continue the lookup process. The querying node then directly contacts these closer peers, repeating the process until the closest nodes are found.
In a recursive lookup, the querying node delegates the task to a peer that is closer to the target. That peer then queries its own closer peers on behalf of the original node, and this delegation continues recursively until the target is reached.
The IPFS Kademlia DHT uses an iterative lookup approach because recursive lookups can enable amplification attacks and make error handling more complex.
Upon receiving a lookup request for a Kademlia Identifier kid, a DHT Server
MUST return the Peer ID and multiaddresses of the k closest DHT Servers to
kid that are stored in its Routing Table. It SHOULD NOT include itself or the
requester in the response. If its own Peer ID or the requester's Peer ID is
among the k closest DHT Servers, it SHOULD return the next closest nodes
instead, so that a total of k DHT Servers is returned. DHT Servers SHOULD NOT return any
information about unresponsive nodes.
In public DHT swarms, DHT Servers MUST filter out private and loopback multiaddresses, and MUST NOT include DHT Servers whose only addresses are private or loopback.
A DHT Server SHOULD always return information about the k closest DHT Servers
to kid (excluding self and the requester), provided its routing table
contains enough DHT Servers, even if these DHT Servers are not closer to kid
than self or the requester.
When a client initiates a lookup for a Kademlia Identifier kid (DHT Servers
can initiate lookups as clients), it starts by selecting the known nodes
closest to kid in XOR distance and putting them in a candidate list. It then
sends requests for kid to the closest candidates from this list (see
concurrency).
Upon receiving a response, the client adds freshly received peers to the list
of closest peers. It sends a request to the closest peer to kid that hasn't
been queried yet. The client ignores timeouts and invalid responses.
The resilience parameter (β) defines the number of closest reachable peers
that must be successfully queried before a lookup is considered complete.
Implementations SHOULD set β to 3, ensuring that multiple nodes confirm the
lookup result for increased reliability.
The lookup process continues until the β closest reachable peers to kid
have been queried. However, the process MAY terminate earlier if the
request-specific success criteria are met. Additionally, if all candidate peers
have been queried without discovering any new ones, the lookup MUST terminate.
A client MAY have multiple concurrent in-flight queries to distinct nodes for the same lookup. This behavior is specific to the client and does not affect how DHT servers operate.
The maximum number of in-flight requests (denoted by α) SHOULD be set to 10.
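The following sketch outlines the iterative lookup loop in a simplified, sequential form (the α concurrent in-flight requests are omitted for clarity). The Peer and queryFn types are hypothetical stand-ins for an implementation's own peer and RPC abstractions.

```go
package main

import (
	"bytes"
	"context"
	"crypto/sha256"
	"sort"
)

// Peer is a node known to the lookup, addressed by its Kademlia identifier.
type Peer struct {
	KadID [32]byte
}

// queryFn models a FIND_NODE-style request to a single peer: it returns peers
// that are closer to the target, or an error. Timeouts and invalid responses
// are treated the same way: the peer is skipped.
type queryFn func(ctx context.Context, p Peer, target [32]byte) ([]Peer, error)

func xorDist(a, b [32]byte) []byte {
	d := make([]byte, 32)
	for i := range a {
		d[i] = a[i] ^ b[i]
	}
	return d
}

// lookup keeps querying the closest unqueried candidate until the beta
// closest reachable peers have all been queried, or no candidates remain.
func lookup(ctx context.Context, target [32]byte, seed []Peer, query queryFn, beta int) []Peer {
	candidates := append([]Peer(nil), seed...)
	known := map[[32]byte]bool{}
	queried := map[[32]byte]bool{}
	failed := map[[32]byte]bool{}
	for _, p := range seed {
		known[p.KadID] = true
	}

	for {
		// Keep candidates ordered by XOR distance to the target.
		sort.Slice(candidates, func(i, j int) bool {
			return bytes.Compare(xorDist(candidates[i].KadID, target), xorDist(candidates[j].KadID, target)) < 0
		})

		// Walk candidates closest-first: stop once the beta closest reachable
		// peers have been queried, otherwise pick the closest unqueried peer.
		next := -1
		reachableQueried := 0
		for i, c := range candidates {
			if failed[c.KadID] {
				continue
			}
			if queried[c.KadID] {
				reachableQueried++
				if reachableQueried >= beta {
					break
				}
				continue
			}
			next = i
			break
		}
		if next == -1 || reachableQueried >= beta {
			return candidates // lookup complete
		}

		p := candidates[next]
		closer, err := query(ctx, p, target)
		if err != nil {
			failed[p.KadID] = true // ignore timeouts and invalid responses
			continue
		}
		queried[p.KadID] = true
		for _, c := range closer {
			if !known[c.KadID] {
				known[c.KadID] = true
				candidates = append(candidates, c)
			}
		}
	}
}

func main() {
	target := sha256.Sum256([]byte("some key"))
	seed := []Peer{{KadID: sha256.Sum256([]byte("bootstrap peer"))}}
	// Stub query function that returns no closer peers.
	stub := func(ctx context.Context, p Peer, t [32]byte) ([]Peer, error) { return nil, nil }
	_ = lookup(context.Background(), target, seed, stub, 3)
}
```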
Implementations typically provide two interfaces for peer routing using the
FIND_NODE RPC:
- FindPeer, which locates a specific Peer ID, and
- GetClosestPeers, which finds the k closest peers to a given key.

FindPeer is the process of discovering the multiaddresses of a given Peer ID.
The requester uses the FIND_NODE RPC, including the bytes representation of
the target Peer ID in the key field. The lookup eventually converges on the
target Peer ID. The lookup process terminates early if the requester has
established a connection to the target Peer ID.
DHT clients that want to remain routable MUST ensure their multiaddresses are stored in the peerstore of the DHT Servers closest to them in XOR distance. Since peerstore entries expire over time, DHT Clients SHOULD periodically reconnect to their closest DHT servers to prevent their information from being removed. Implementations SHOULD perform this reconnection every 10 minutes.
When receiving a FIND_NODE request for a given Peer ID, DHT Servers MUST
always respond with the information of that Peer ID, if it is included in their
peerstore, even if the target node isn't a DHT Server or only advertises
private addresses. Moreover, if the target Peer ID is self, or the requester's
Peer ID, the corresponding peer information should be included in addition to
the k closest DHT Servers.
GetClosestPeers also makes use of the FIND_NODE RPC, but allows the sender
to look for the k closest peers to any key. The key provided to FIND_NODE
corresponds to the preimage of the Kademlia Identifier, as described
below.
GetClosestPeers is used for the purpose of Content Routing
(Provider Record Routing).
Provider Record Routing is the IPFS-specific process of locating peers that provide a specific piece of content, identified by its CID. This is achieved by storing and retrieving Provider Records in the DHT.
A Provider Record is an entry stored in the DHT associating a CID with one or more Peer IDs providing the corresponding content. Instead of storing the content itself, the DHT stores provider records pointing to the peers hosting the content.
A Provider Record is identified by the multihash contained in the CID. It functions as an append-only list, where multiple providers can add themselves as content hosts. Since strict consistency across the network is not required, different DHT servers MAY store slightly different sets of providers, but the lookup mechanism ensures that clients can still discover multiple sources efficiently.
The Kademlia Identifier associated with a CID is derived from the multihash contained in the CID, by hashing it with the SHA2-256 hash function. The resulting 256-bit digest is used as the Kademlia Identifier for the content.
Example:
CIDv1 (base32) : bafybeihfg3d7rdltd43u3tfvncx7n5loqofbsobojcadtmokrljfthuc7y
Multihash from CID (hex) : 1220e536c7f88d731f374dccb568aff6f56e838a19382e488039b1ca8ad2599e82fe
Kademlia Identifier (hex): d623250f3f660ab4c3a53d3c97b3f6a0194c548053488d093520206248253bcb
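As with Peer IDs, the identifier above can be reproduced with the Go standard library by hashing the multihash bytes extracted from the CID:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

func main() {
	// Multihash extracted from the CID in the example above.
	mh, err := hex.DecodeString(
		"1220e536c7f88d731f374dccb568aff6f56e838a19382e488039b1ca8ad2599e82fe")
	if err != nil {
		panic(err)
	}

	// The Kademlia identifier of the content is the SHA2-256 digest of the
	// multihash contained in the CID (not of the CID string itself).
	kadID := sha256.Sum256(mh)
	fmt.Println(hex.EncodeToString(kadID[:]))
	// Expected (per the example above):
	// d623250f3f660ab4c3a53d3c97b3f6a0194c548053488d093520206248253bcb
}
```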
When a node wants to indicate that it provides the content associated with a
given CID, it first finds the k closest DHT Servers to the Kademlia
Identifier associated with the CID using GetClosestPeers.
The key in the FIND_NODE payload is set to the multihash contained in the
CID.
Once the k closest DHT Servers are found, the node sends each of them an
ADD_PROVIDER RPC, using the same key and setting its own Peer ID as
providerPeers. Providers MUST indicate their listen multiaddresses to be
cached and served with the provider record.
The DHT Servers MUST make 2 checks before adding the provider record to their
datastore:

- key is set, and doesn't exceed 80 bytes in size.
- providerPeers entries whose Peer ID does not match the sender's Peer ID are ignored.

Upon successful verification, the DHT Server stores the Provider Record in its datastore, and caches the provided public multiaddresses. It responds by echoing the request to confirm success. If verification fails, the server MUST close the stream without sending a response.
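A minimal sketch of these checks is shown below. The addProviderRequest struct is a simplified stand-in for the protobuf Message defined later in this document, not an actual wire type.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// addProviderRequest is a simplified stand-in for the ADD_PROVIDER RPC message.
type addProviderRequest struct {
	Key           []byte   // multihash of the provided content
	ProviderPeers [][]byte // binary Peer IDs claiming to provide the content
}

// validateAddProvider applies the two checks described above: the key must be
// set and must not exceed 80 bytes, and providerPeers entries that do not
// match the sender's Peer ID are ignored.
func validateAddProvider(req addProviderRequest, senderPeerID []byte) ([][]byte, error) {
	if len(req.Key) == 0 || len(req.Key) > 80 {
		return nil, errors.New("key must be set and must not exceed 80 bytes")
	}
	var accepted [][]byte
	for _, p := range req.ProviderPeers {
		if bytes.Equal(p, senderPeerID) {
			accepted = append(accepted, p)
		}
	}
	return accepted, nil
}

func main() {
	sender := []byte("peer-A")
	req := addProviderRequest{
		Key:           []byte("example-multihash-bytes"),
		ProviderPeers: [][]byte{[]byte("peer-A"), []byte("peer-B")},
	}
	accepted, err := validateAddProvider(req, sender)
	fmt.Println(len(accepted), err) // 1 <nil>: only the sender's own entry is kept
}
```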
Provide Validity defines the time-to-live (TTL) of a Provider Record on a DHT
Server. DHT Servers MUST implement a Provide Validity of 48h, and discard the
record after expiration.
Because of the churn in the network, Provider Records need to be republished
more often than their validity period. DHT Clients SHOULD republish Provider
Records every 22h
(rationale).
DHT Servers SHOULD persist the multiaddresses of providers for 24h after the
PROVIDE operation. This allows DHT Servers to serve the multiaddresses of the
content provider alongside the provider record, avoiding an additional DHT walk
for the Client
(rationale).
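For reference, these intervals can be captured as constants; the names below are illustrative and not taken from any particular implementation.

```go
package main

import (
	"fmt"
	"time"
)

// Intervals from this section (illustrative names).
const (
	ProvideValidity   = 48 * time.Hour // TTL of a Provider Record on a DHT Server
	RepublishInterval = 22 * time.Hour // how often DHT Clients republish Provider Records
	ProviderAddrTTL   = 24 * time.Hour // how long servers keep a provider's multiaddresses
)

func main() {
	fmt.Println(ProvideValidity, RepublishInterval, ProviderAddrTTL)
}
```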
To find providers for a given CID, a node initiates a lookup using the
GET_PROVIDERS RPC. This process follows the same approach as a FIND_NODE
lookup, but with one key difference: if a DHT server holds a matching provider
record, it MUST include it in the response.
Clients MAY terminate the lookup early if they are satisfied with the returned
providers. If a node does not find any provider records and is unable to
discover closer DHT servers after querying the β closest reachable servers,
the request is considered a failure.
The IPFS Kademlia DHT allows users to store and retrieve records directly within the DHT. These records serve as key-value mappings, where the key and value are defined as arrays of bytes. Each record belongs to a specific keyspace, which defines its type and structure.
The IPFS Kademlia DHT supports two types of records, each stored in its own keyspace:
- Public Key Records (/pk/) – Used to store public keys that cannot be derived from Peer IDs.
- IPNS Records (/ipns/) – Used for decentralized naming and content resolution. See IPNS Routing Record and IPNS Record Verification.

Records with the above prefixes MUST meet validity criteria specific to their record type before being stored or updated. DHT Servers MUST verify the validity of each record before accepting it. Records with other prefixes are not supported by the IPFS Kademlia DHT and MUST be rejected.
The Kademlia Identifier of a record is derived by applying the SHA2-256 hash function to the record’s key and using the resulting digest in binary format.
To store a value in the DHT, a client first finds the k closest peers to the
record’s Kademlia Identifier using GetClosestPeers. The client then sends a
PUT_VALUE RPC to each of these peers, including the key and the record.
DHT servers MUST validate the record based on its type before accepting it.
Retrieving values from the DHT follows a process similar to provider record
lookups. Clients send a GET_VALUE RPC, which directs the search toward the
k closest nodes to the target key. If a DHT Server holds a matching
record, it MUST include it in its response. The conditions for terminating
the lookup depend on the specific record type.
Some public keys are too large to be embedded within libp2p Peer IDs (keys
larger than 42
bytes).
In such cases, the Peer ID is derived from the hash of the public key, but the
full key still needs to be accessible. To facilitate retrieval, public keys MAY
be stored directly in the DHT under the /pk/ keyspace.
Public key records are stored under the key /pk/<PeerID> (binary Peer ID format). DHT servers MUST verify that the Peer ID derived from the full public key matches the Peer ID encoded in the key. If the derived Peer ID does not match, the record MUST be rejected.
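A sketch of this validation using go-libp2p's crypto and peer packages is shown below. It is not the exact validator used by any implementation, and the Ed25519 key in the demo is used only to keep the example short (such keys are normally inlined in the Peer ID and would not need a /pk/ record).

```go
package main

import (
	"crypto/rand"
	"fmt"

	"github.com/libp2p/go-libp2p/core/crypto"
	"github.com/libp2p/go-libp2p/core/peer"
)

// validatePublicKeyRecord checks that the Peer ID encoded in a /pk/<PeerID>
// key matches the Peer ID derived from the public key stored as the value.
func validatePublicKeyRecord(key, value []byte) error {
	const prefix = "/pk/"
	if len(key) <= len(prefix) || string(key[:len(prefix)]) != prefix {
		return fmt.Errorf("not a /pk/ record key")
	}
	encodedID := peer.ID(key[len(prefix):]) // binary Peer ID embedded in the key

	pub, err := crypto.UnmarshalPublicKey(value)
	if err != nil {
		return fmt.Errorf("invalid public key: %w", err)
	}
	derivedID, err := peer.IDFromPublicKey(pub)
	if err != nil {
		return err
	}
	if derivedID != encodedID {
		return fmt.Errorf("peer ID in key does not match the public key")
	}
	return nil
}

func main() {
	// Generate a key pair and build a matching /pk/ record to validate.
	_, pub, _ := crypto.GenerateEd25519Key(rand.Reader)
	pid, _ := peer.IDFromPublicKey(pub)
	value, _ := crypto.MarshalPublicKey(pub)
	key := append([]byte("/pk/"), []byte(pid)...)

	fmt.Println(validatePublicKeyRecord(key, value)) // <nil>
}
```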
IPNS (InterPlanetary Naming System) allows peers to publish mutable records
that point to content in IPFS. These records MAY be stored in the DHT under the
/ipns/ namespace.
Record format and validation is documented in the IPNS specification.
IPNS implementations MUST follow IPNS Routing Record, IPNS Record Verification, and IPNS Record Size Limit.
A quorum is the minimum number of distinct responses a client must collect from DHT Servers to determine a valid result. Since different DHT Servers may store different versions of an IPNS record, a client fetches the record from multiple DHT Servers to increase the likelihood of retrieving the most recent version.
For IPNS lookups, implementations SHOULD use a quorum value of 16, meaning the client attempts
to collect responses from at least 16 DHT Servers out of 20 before
determining the best available record.
Because some DHT servers may store outdated versions of a record, clients need
to ensure that the latest valid version is propagated. After obtaining a
quorum, the client MUST send the most recent valid record to any of the k
closest DHT Servers to the record’s Kademlia Identifier that did not return the
latest version.
Remote procedure calls are performed by:

- opening a new libp2p stream on the DHT swarm's protocol identifier,
- sending the RPC request message,
- listening for the RPC response message, and
- closing the stream.
On any error, the stream is reset.
Implementations MAY re-use streams by sending one or more RPC request messages on a single outgoing stream before closing it. Implementations MUST handle additional RPC request messages on an incoming stream.
All RPC messages sent over a stream are prefixed with the message length in bytes, encoded as an unsigned variable length integer as defined by the multiformats unsigned-varint spec.
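The framing can be sketched with the Go standard library; encoding/binary's unsigned varint uses the same LEB128 encoding as the multiformats unsigned-varint for the message lengths involved here. The helper names are illustrative.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// writeLengthPrefixed frames a serialized RPC message with its byte length,
// encoded as an unsigned varint, before writing it to the stream.
func writeLengthPrefixed(w io.Writer, msg []byte) error {
	var lenBuf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(lenBuf[:], uint64(len(msg)))
	if _, err := w.Write(lenBuf[:n]); err != nil {
		return err
	}
	_, err := w.Write(msg)
	return err
}

// readLengthPrefixed reads one varint-length-prefixed message back.
func readLengthPrefixed(r *bytes.Reader) ([]byte, error) {
	length, err := binary.ReadUvarint(r)
	if err != nil {
		return nil, err
	}
	msg := make([]byte, length)
	_, err = io.ReadFull(r, msg)
	return msg, err
}

func main() {
	var stream bytes.Buffer
	_ = writeLengthPrefixed(&stream, []byte("serialized protobuf message"))
	msg, _ := readLengthPrefixed(bytes.NewReader(stream.Bytes()))
	fmt.Printf("%s\n", msg) // serialized protobuf message
}
```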
All RPC messages conform to the following protobuf:
syntax = "proto3";
// Record represents a dht record that contains a value
// for a key value pair
message Record {
// The key that references this record
bytes key = 1;
// The actual value this record is storing
bytes value = 2;
// Note: These fields were removed from the Record message
//
// Hash of the authors public key
// optional string author = 3;
// A PKI signature for the key+value+author
// optional bytes signature = 4;
// Time the record was received, set by receiver
// Formatted according to https://datatracker.ietf.org/doc/html/rfc3339
string timeReceived = 5;
};
message Message {
enum MessageType {
PUT_VALUE = 0;
GET_VALUE = 1;
ADD_PROVIDER = 2;
GET_PROVIDERS = 3;
FIND_NODE = 4;
PING = 5; // DEPRECATED
}
enum ConnectionType {
// sender does not have a connection to peer, and no extra information (default)
NOT_CONNECTED = 0;
// sender has a live connection to peer
CONNECTED = 1;
// sender recently connected to peer
CAN_CONNECT = 2;
// sender recently tried to connect to peer repeatedly but failed to connect
// ("try" here is loose, but this should signal "made strong effort, failed")
CANNOT_CONNECT = 3;
}
message Peer {
// ID of a given peer.
bytes id = 1;
// multiaddrs for a given peer
repeated bytes addrs = 2;
// used to signal the sender's connection capabilities to the peer
ConnectionType connection = 3;
}
// defines what type of message it is.
MessageType type = 1;
// defines what coral cluster level this query/response belongs to.
// in case we want to implement coral's cluster rings in the future.
int32 clusterLevelRaw = 10; // NOT USED
// Used to specify the key associated with this message.
// PUT_VALUE, GET_VALUE, ADD_PROVIDER, GET_PROVIDERS
bytes key = 2;
// Used to return a value
// PUT_VALUE, GET_VALUE
Record record = 3;
// Used to return peers closer to a key in a query
// GET_VALUE, GET_PROVIDERS, FIND_NODE
repeated Peer closerPeers = 8;
// Used to return Providers
// GET_VALUE, ADD_PROVIDER, GET_PROVIDERS
repeated Peer providerPeers = 9;
}
These are the requirements for each MessageType:
FIND_NODE: In the request key must be set to the binary PeerId of the
node to be found. In the response closerPeers is set to the DHT Server's k
closest Peers.
GET_VALUE: In the request key is an unstructured array of bytes.
closerPeers is set to the k closest peers. If key is found in the
datastore, record is set to the value for the given key.
PUT_VALUE: In the request record is set to the record to be stored and
key on Message is set to equal key of the Record. The target node
validates record, and if it is valid, it stores it in the datastore and as a
response echoes the request.
GET_PROVIDERS: In the request key is set to the multihash contained in
the target CID. The target node returns the known providerPeers (if any) and
the k closest known closerPeers.
ADD_PROVIDER: In the request key is set to the multihash contained in the
target CID. The target node verifies that key is a valid multihash; all
providerPeers matching the RPC sender's Peer ID are recorded as providers.
PING: Deprecated message type replaced by the dedicated ping
protocol.
If a DHT server receives an invalid request, it simply closes the libp2p stream without responding.
Implementations MAY join multiple DHT swarms simultaneously—for example, both a local and a public swarm. Typically, write operations are executed on both swarms, while read operations are performed in parallel, returning the result from whichever responds first.
Using a local DHT alongside a global one enables faster discovery of peers and content within the same network.
Implementations MAY perform additional checks to ensure that DHT servers behave correctly before adding them to the routing table. In the past, misconfigured nodes have been added to routing tables, leading to network slowdowns.
For example, kubo verifies a DHT server by sending a FIND_NODE request for its own Peer ID before adding it to the routing table (reference). The server is only added if its response contains at least one peer. This check is skipped during the initial routing table setup.
libp2p/go-libp2p-kad-dht

[0]: Maymounkov, P., & Mazières, D. (2002). Kademlia: A Peer-to-Peer Information System Based on the XOR Metric. In P. Druschel, F. Kaashoek, & A. Rowstron (Eds.), Peer-to-Peer Systems (pp. 53–65). Berlin, Heidelberg: Springer Berlin Heidelberg. DOI PDF
We gratefully acknowledge the following individuals for their valuable contributions, ranging from minor suggestions to major insights, which have shaped and improved this specification.