A Node is the smallest unit present in a graph, and it comes from graph theory. In UnixFS, there is a 1-to-1 mapping between nodes and blocks. Therefore, they are used interchangeably in this document.
A node is addressed by a CID. In order to be able to read a node, its CID is required. A CID includes two important pieces of information:
Thus, when a block is retrieved and its bytes are hashed using the hash function specified in the multihash, this gives the same multihash value contained in the CID.
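The check described above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: it assumes a base32 (`b` multibase) CIDv1 string and a sha2-256 multihash, and the helper names `read_varint`, `parse_cidv1` and `verify_block` are hypothetical.

```python
import base64
import hashlib

RAW, DAG_PB, SHA2_256 = 0x55, 0x70, 0x12


def read_varint(buf, pos):
    """Decode one unsigned varint; return (value, next position)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def parse_cidv1(cid):
    """Split a base32 CIDv1 string into (codec, hash code, digest)."""
    if not cid.startswith("b"):
        raise ValueError("this sketch only handles base32 CIDv1 strings")
    body = cid[1:].upper()
    # Python's decoder expects padding, which multibase base32 omits.
    raw = base64.b32decode(body + "=" * (-len(body) % 8))
    version, pos = read_varint(raw, 0)
    if version != 1:
        raise ValueError("not a CIDv1")
    codec, pos = read_varint(raw, pos)
    hash_code, pos = read_varint(raw, pos)
    length, pos = read_varint(raw, pos)
    digest = raw[pos:pos + length]
    if len(digest) != length:
        raise ValueError("truncated multihash digest")
    return codec, hash_code, digest


def verify_block(cid, block):
    """Re-hash the block bytes and compare with the digest in the CID."""
    _, hash_code, digest = parse_cidv1(cid)
    if hash_code != SHA2_256:
        raise ValueError("this sketch only verifies sha2-256")
    return hashlib.sha256(block).digest() == digest
```

For instance, the empty raw block CID listed in the test vectors (`bafkreihdwdcefgh4dqkjv67uzcmw7ojee6xedzdetojuzjevtenxquvyku`) verifies against zero bytes of content, and its codec parses as `0x55`.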
In UnixFS, a node can be encoded using two different multicodecs, listed below. More details are provided in the following sections:
raw Node
The simplest nodes use raw encoding and are implicitly a File. They can be recognized because their CIDs are encoded using the raw (0x55) codec. For these nodes, Tsize in the parent is equal to blocksize.
Important: Do not confuse raw codec blocks (0x55) with the deprecated Raw DataType (enum value 0):
- raw codec: Modern way to store data without a protobuf wrapper (used for small files and leaves)
- Raw DataType: Legacy UnixFS type that wrapped raw data in dag-pb protobuf (implementations MUST NOT produce, MAY read for compatibility)
dag-pb Node
More complex nodes use the dag-pb (0x70) encoding. These nodes require two steps of decoding. The first step is to decode the outer container of the block. This is encoded using the dag-pb specification, which uses Protocol Buffers and can be summarized as follows:
message PBLink {
  // Binary representation of CID (https://github.com/multiformats/cid) of the target object.
  // This contains raw CID bytes (either CIDv0 or CIDv1) with no multibase prefix.
  // CIDv1 is a binary format composed of unsigned varints, while CIDv0 is a raw multihash.
  // In both cases, the bytes are stored directly without any additional prefix.
  bytes Hash = 1;
  // UTF-8 string name
  string Name = 2;
  // cumulative size of target object
  uint64 Tsize = 3;
}

message PBNode {
  // refs to other objects
  repeated PBLink Links = 2;
  // opaque user data
  bytes Data = 1;
}
After decoding the node, we obtain a PBNode. This PBNode contains a field Data that contains the bytes that require the second decoding. This will also be a protobuf message specified in the UnixFSV1 format:
message Data {
  enum DataType {
    Raw = 0; // deprecated, use raw codec blocks without dag-pb instead
    Directory = 1;
    File = 2;
    Metadata = 3; // reserved for future use
    Symlink = 4;
    HAMTShard = 5;
  }
  DataType Type = 1; // MUST be present - validate at application layer
  bytes Data = 2; // file content (File), symlink target (Symlink), bitmap (HAMTShard), unused (Directory)
  uint64 filesize = 3; // mandatory for Type=File and Type=Raw, defaults to 0 if omitted
  repeated uint64 blocksizes = 4; // required for multi-block files (Type=File) with Links
  uint64 hashType = 5; // required for Type=HAMTShard (currently always murmur3-x64-64)
  uint64 fanout = 6; // required for Type=HAMTShard (power of 2, max 1024)
  uint32 mode = 7; // opt-in, AKA UnixFS 1.5
  UnixTime mtime = 8; // opt-in, AKA UnixFS 1.5
}

message Metadata {
  string MimeType = 1; // reserved for future use
}

message UnixTime {
  int64 Seconds = 1; // MUST be present when UnixTime is used
  fixed32 FractionalNanoseconds = 2;
}
Summarizing, a dag-pb UnixFS node is a dag-pb protobuf, whose Data field is a UnixFSV1 Protobuf message. For clarity, the specification document may represent these nested Protobufs as one object. In this representation, it is implied that the PBNode.Data field is protobuf-encoded.
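The two decoding steps can be illustrated without a protobuf library. The sketch below is a simplification under stated assumptions: it reads only the varint and length-delimited wire types that appear at the top level of PBNode, PBLink and the UnixFS Data message, and the function names (`read_fields`, `decode_pbnode`, `decode_unixfs_type`) are hypothetical.

```python
def read_varint(buf, pos):
    """Decode one unsigned varint; return (value, next position)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def read_fields(buf):
    """Yield (field number, wire type, value) for each protobuf field."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field, wire = key >> 3, key & 7
        if wire == 0:  # varint
            value, pos = read_varint(buf, pos)
        elif wire == 2:  # length-delimited
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            if len(value) != length:
                raise ValueError("truncated field")
            pos += length
        else:
            raise ValueError(f"unsupported wire type {wire}")
        yield field, wire, value


def decode_pbnode(block):
    """Step 1: decode the outer dag-pb container."""
    node = {"Links": [], "Data": None}
    for field, wire, value in read_fields(block):
        if field == 2 and wire == 2:  # PBLink
            link = {}
            for f, _, v in read_fields(value):
                if f == 1:
                    link["Hash"] = v
                elif f == 2:
                    link["Name"] = v.decode("utf-8")
                elif f == 3:
                    link["Tsize"] = v
            node["Links"].append(link)
        elif field == 1 and wire == 2:  # Data
            node["Data"] = value
    return node


DATA_TYPES = {0: "Raw", 1: "Directory", 2: "File",
              3: "Metadata", 4: "Symlink", 5: "HAMTShard"}


def decode_unixfs_type(data):
    """Step 2: read the Type field of the UnixFS message in PBNode.Data."""
    for field, wire, value in read_fields(data or b""):
        if field == 1 and wire == 0 and value in DATA_TYPES:
            return DATA_TYPES[value]
    raise ValueError("UnixFS Type field is missing or unknown")
```

Applied to the four bytes `0a020801`, the first step yields no links and a two-byte Data field, and the second step reports a Directory.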
dag-pb Types
A dag-pb UnixFS node supports different types, which are defined in decode(PBNode.Data).Type. Every type is handled differently.
dag-pb File
A File is a container over an arbitrary sized amount of bytes. Files are either single block or multi-block. A multi-block file is a concatenation of multiple child files.
Single-block files SHOULD prefer the raw codec (0x55) over dag-pb for the canonical CID, as it is more efficient and avoids the protobuf overhead. The raw encoding is described in the raw Node section.
PBNode.Links and decode(PBNode.Data).blocksizes
The sister-lists are the key point of why dag-pb is important for files. They allow us to concatenate smaller files together.
Linked files would be loaded recursively with the same process following a DFS (Depth-First-Search) order.
Child nodes must be of type File; either a dag-pb File, or a raw block.
For example, consider this pseudo-json block:
{
  "Links": [{"Hash":"Qmfoo"}, {"Hash":"Qmbar"}],
  "Data": {
    "Type": "File",
    "blocksizes": [20, 30]
  }
}
This indicates that this file is the concatenation of the Qmfoo and Qmbar files.
When reading a file represented with dag-pb, the blocksizes array gives us the size in bytes of the partial file content present in children DAGs. Each index in PBNode.Links MUST have a corresponding chunk size stored at the same index in decode(PBNode.Data).blocksizes.
The child blocks containing the partial file data can be either:
- raw blocks (0x55): Direct file data without protobuf wrapper
- dag-pb blocks (0x70): File data wrapped in protobuf, potentially with further children
Implementers need to be extra careful to ensure the values in Data.blocksizes are calculated by following the definition from Blocksize.
This allows for fast indexing into the file. For example, if someone is trying to read bytes 25 to 35, we can compute an offset list by summing all previous entries in blocksizes, then do a search to find which entries contain the range we are interested in.
In the example above, the offset list would be [0, 20]. Thus, we know we only need to download Qmbar to get the range we are interested in.
A UnixFS parser MUST reject the node and halt processing if the blocksizes array and Links array contain different numbers of elements. Implementations SHOULD return a descriptive error indicating the array length mismatch rather than silently failing or attempting to process partial data.
decode(PBNode.Data).Data
An array of bytes that is file content and precedes the content of the linked children. This must be taken into account when doing offset calculations; that is, the length of decode(PBNode.Data).Data defines the value of the zeroth element of the offset list when computing offsets.
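The offset list, the length check and the inline-data adjustment described above fit together as in the following sketch. The function names and the plain-list representation of a decoded node are assumptions made for illustration, and a linear scan stands in for whatever search an implementation would use.

```python
def offset_list(inline_data_len, blocksizes, link_count):
    """Byte offset at which each child's content starts within the file.

    inline_data_len is len(decode(PBNode.Data).Data); it becomes the
    zeroth offset because inline bytes precede the linked children.
    """
    if len(blocksizes) != link_count:
        raise ValueError(
            f"blocksizes has {len(blocksizes)} entries but Links has {link_count}"
        )
    offsets = []
    position = inline_data_len
    for size in blocksizes:
        offsets.append(position)
        position += size
    return offsets


def children_for_range(offsets, blocksizes, first_byte, last_byte):
    """Indexes of the children overlapping bytes first_byte..last_byte (inclusive)."""
    return [
        index
        for index, (start, size) in enumerate(zip(offsets, blocksizes))
        if size > 0 and start <= last_byte and first_byte < start + size
    ]
```

With the pseudo-json node from the example (no inline data, blocksizes `[20, 30]`), `offset_list(0, [20, 30], 2)` gives `[0, 20]`, and a read of bytes 25 to 35 touches only the second child.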
PBNode.Links[].Name
The Name field is primarily used in directories to identify child entries.
For internal file chunks:
- Name fields are omitted (the field should be absent in the protobuf, not an empty string)
- If Name is present in an internal file chunk, the parser MUST reject the file and halt processing as this indicates an invalid file structure
decode(PBNode.Data).Blocksize
This field is not directly present in the block, but rather a computable property of a dag-pb, which would be used in the parent node in decode(PBNode.Data).blocksizes.
Important: blocksize represents only the raw file data size, NOT including the protobuf envelope overhead.
It is calculated as:
- For dag-pb blocks: the length of the decode(PBNode.Data).Data field plus the sum of all child blocksizes
- For raw blocks (small files, raw leaves): the length of the entire raw block
Examples of where blocksize is useful:
- The blocksizes array allows calculating byte offsets (see Offset List) to determine which blocks contain the requested range without downloading unnecessary blocks.
decode(PBNode.Data).filesize
For Type=File (2) and Type=Raw (0), this field is mandatory. While marked as "optional" in the protobuf schema (for compatibility with other types like Directory), implementations:
- If present, the value MUST equal the result of the Blocksize computation above, otherwise the file is invalid
dag-pb File Path Resolution
A file terminates a UnixFS content path. Any attempt to resolve a path past a file MUST be rejected with an error indicating that UnixFS files cannot have children.
dag-pb Directory
A Directory, also known as folder, is a named collection of child Nodes:
- Every link in PBNode.Links is an entry (child) of the directory, and PBNode.Links[].Name gives you the name of that child.
- Two elements of PBNode.Links CANNOT have the same Name. Names are considered identical if they are byte-for-byte equal (not just semantically equivalent). If two identical names are present in a directory, the decoder MUST fail.
- When the entries do not fit in a single Directory block, use the [HAMTDirectory] type instead.
The PBNode.Data field MUST contain valid UnixFS protobuf data for all UnixFS nodes.
For directories (DataType==1), the minimum valid PBNode.Data field is as follows:
{
  "Type": "Directory"
}
For historical compatibility, implementations MAY encounter dag-pb nodes with empty or missing Data fields from older IPFS versions, but MUST NOT produce such nodes.
dag-pb Directory Link Ordering
Directory links SHOULD be sorted lexicographically by the Name field when creating new directories. This ensures consistent, deterministic directory structures across implementations.
While decoders MUST accept directories with any link ordering, encoders SHOULD use lexicographic sorting for better interoperability and deterministic CIDs.
A decoder SHOULD, if it can, preserve the order of the original files. This "sort on write, not on read" approach maintains DAG stability - existing unsorted directories remain unchanged when accessed or traversed, preventing unintentional mutations of intermediate nodes that could alter their CIDs.
Note: Lexicographic sorting was chosen as the standard because it provides a universal, locale-independent ordering that works consistently across all implementations and languages. Sorting on write (when the Links list is modified) helps with deduplication detection and enables more efficient directory traversal algorithms in some implementations.
dag-pb Directory Path Resolution
Pop the left-most component of the path, and match it to the Name of a child under PBNode.Links.
Duplicate names are not allowed in UnixFS directories. However, when reading third-party data that contains duplicates, implementations MUST always return the first matching entry and ignore subsequent ones (following the Robustness Principle). Similarly, when writers mutate a UnixFS directory that has duplicate names, they MUST drop the redundant entries and only keep the first occurrence of each name.
Assuming no errors were raised, you can continue to the path resolution on the remaining components and on the CID you popped.
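As a rough illustration of the read and write behaviour described for directories, the sketch below resolves a name by first match and prepares a link list for writing. It assumes links are plain dicts with `Name` and `Hash` keys, and it assumes "lexicographic" means byte-wise comparison of the UTF-8 encoded names; both are choices made for this example only.

```python
def resolve_child(links, name):
    """Return the Hash of the first link whose Name matches, ignoring later duplicates."""
    for link in links:
        if link["Name"] == name:
            return link["Hash"]
    raise KeyError(f"no entry named {name!r}")


def links_for_write(links):
    """Keep the first occurrence of each name, then sort by Name bytes."""
    seen = set()
    unique = []
    for link in links:
        if link["Name"] not in seen:
            seen.add(link["Name"])
            unique.append(link)
    return sorted(unique, key=lambda link: link["Name"].encode("utf-8"))
```

Sorting happens only in `links_for_write`; `resolve_child` walks the links in their stored order, so reading an unsorted directory leaves it untouched.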
dag-pb HAMTDirectory
A HAMT Directory is a Hashed-Array-Mapped-Trie data structure representing a Directory. It is generally used to represent directories that cannot fit inside a single block. These are also known as "sharded directories", since they allow you to split large directories into multiple blocks, known as "shards".
The HAMT directory is configured through the UnixFS metadata in PBNode.Data:
- decode(PBNode.Data).Type MUST be HAMTShard (value 5)
- decode(PBNode.Data).hashType indicates the multihash function to use to digest the path components for sharding. Currently, all HAMT implementations use murmur3-x64-64 (0x22), and this value MUST be consistent across all shards within the same HAMT structure
- decode(PBNode.Data).fanout is REQUIRED for HAMTShard nodes (though marked optional in the protobuf schema). The value MUST be a power of two, a multiple of 8 (for byte-aligned bitfields), and at most 1024.
  - This determines the number of possible bucket indices (permutations) at each level of the trie.
  - For example, fanout=256 provides 256 possible buckets (0x00 to 0xFF), requiring 8 bits from the hash.
  - The hex prefix length is log2(fanout)/4 characters (since each hex character represents 4 bits).
  - The same fanout value is used throughout all levels of a single HAMT structure
  - Implementations that onboard user data to create new HAMTDirectory structures are free to choose a fanout value or allow users to configure it based on their use case.
  - Implementations MUST limit the fanout parameter to a maximum of 1024 to prevent denial-of-service attacks. Excessively large fanout values can cause memory exhaustion when allocating bucket arrays. See CVE-2023-23625 and GHSA-q264-w97q-q778 for details on this vulnerability.
- decode(PBNode.Data).Data contains a bitfield indicating which buckets contain entries. Each bit corresponds to one bucket (0 to fanout-1), with bit value 1 indicating the bucket is occupied. The bitfield is stored in little-endian byte order. The bitfield size in bytes is fanout/8, which is why fanout MUST be a multiple of 8.
The field Name of an element of PBNode.Links for a HAMT uses a hex-encoded prefix corresponding to the bucket index, zero-padded to a width of log2(fanout)/4 characters.
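The fanout constraints above are simple to check before any bucket array is allocated. The following sketch is illustrative only; the function name `validate_fanout` and the returned tuple are assumptions of this example.

```python
MAX_FANOUT = 1024  # upper bound required by the text above


def validate_fanout(fanout):
    """Check the fanout rules; return (hash bits per level, bitfield size in bytes)."""
    if fanout <= 0 or fanout & (fanout - 1):
        raise ValueError("fanout must be a power of two")
    if fanout % 8:
        raise ValueError("fanout must be a multiple of 8")
    if fanout > MAX_FANOUT:
        raise ValueError("fanout must be at most 1024")
    bits_per_level = fanout.bit_length() - 1  # log2(fanout)
    bitfield_bytes = fanout // 8
    return bits_per_level, bitfield_bytes
```

For `fanout=256` this returns 8 bits per level and a 32-byte bitfield. Rejecting oversized values up front is what keeps a hostile shard from forcing a large allocation.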
To illustrate the HAMT structure with a concrete example:
// Root HAMT shard (bafybeidbclfqleg2uojchspzd4bob56dqetqjsj27gy2cq3klkkgxtpn4i)
// This shard contains 1000 files distributed across buckets
message PBNode {
  // UnixFS metadata in Data field
  Data = {
    Type = HAMTShard // Type = 5
    Data = 0xffffff... // Bitmap: bits set for populated buckets
    hashType = 0x22 // murmur3-x64-64
    fanout = 256 // 256 buckets (8-bit width)
  }
  // Links to sub-shards or entries
  Links = [
    {
      Hash = bafybeiaebmuestgbpqhkkbrwl2qtjtvs3whkmp2trkbkimuod4yv7oygni
      Name = "00" // Bucket 0x00
      Tsize = 2693 // Cumulative size of this subtree
    },
    {
      Hash = bafybeia322onepwqofne3l3ptwltzns52fgapeauhmyynvoojmcvchxptu
      Name = "01" // Bucket 0x01
      Tsize = 7977
    },
    // ... more buckets as needed up to "FF"
  ]
}

// Sub-shard for bucket "00" (multiple files hash to 00 at first level)
message PBNode {
  Data = {
    Type = HAMTShard // Still a HAMT at second level
    Data = 0x800000... // Bitmap for this sub-level
    hashType = 0x22 // murmur3-x64-64
    fanout = 256 // Same fanout throughout
  }
  Links = [
    {
      Hash = bafybeigcisqd7m5nf3qmuvjdbakl5bdnh4ocrmacaqkpuh77qjvggmt2sa
      Name = "6E470.txt" // Bucket 0x6E + filename
      Tsize = 1271
    },
    {
      Hash = bafybeigcisqd7m5nf3qmuvjdbakl5bdnh4ocrmacaqkpuh77qjvggmt2sa
      Name = "FF742.txt" // Bucket 0xFF + filename
      Tsize = 1271
    }
  ]
}
dag-pb HAMTDirectory Path Resolution
To resolve a path inside a HAMT:
1. Hash the path component using the hash function identified by decode(PBNode.Data).hashType
2. Take log2(fanout) bits from the hash digest (lowest/least significant bits first), then hex encode those bits using little endian to form the bucket prefix. The prefix MUST use uppercase hex characters (00-FF, not 00-ff)
3. Find the link whose Name starts with this hex prefix:
   - If Name equals the prefix exactly → this is a sub-shard, follow the link and repeat from step 2
   - If Name equals prefix + filename → target found
Note: Empty intermediate shards are typically collapsed during deletion operations to maintain consistency and avoid having HAMT structures that differ based on insertion/deletion history.
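Example: Finding "470.txt" in a HAMT with fanout=256 (see HAMT Sharded Directory test vector)
Given a HAMT-sharded directory containing 1000 files:
- Hash the filename "470.txt" with murmur3-x64-64 (0x22)
- The first 8 bits of the hash select root bucket 0x00 → link name "00"
- Follow link "00" to the sub-shard (bafybeiaebmuestgbpqhkkbrwl2qtjtvs3whkmp2trkbkimuod4yv7oygni)
- The next 8 bits of the hash select bucket 0x6E, and the link named "6E470.txt" is the target
- Link names at the leaf level have the form [hex_prefix][original_filename]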
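The link-matching part of HAMT resolution can be sketched without the hash function itself. In the example below the bucket indexes (0x00, then 0x6E) are taken from the worked example in this section rather than computed, the prefix helper deliberately covers only fanout=256, and the names `bucket_prefix` and `match_link` are hypothetical.

```python
def bucket_prefix(bucket_index, fanout=256):
    """Uppercase, zero-padded hex prefix for a bucket (fanout=256 only in this sketch)."""
    if fanout != 256:
        raise NotImplementedError("this sketch only covers fanout=256")
    if not 0 <= bucket_index < fanout:
        raise ValueError("bucket index out of range")
    return f"{bucket_index:02X}"


def match_link(links, prefix, filename):
    """Classify the shard's links for one prefix: sub-shard, target, or missing."""
    for link in links:
        if link["Name"] == prefix:
            return "sub-shard", link  # follow it and consume more hash bits
        if link["Name"] == prefix + filename:
            return "target", link
    return "missing", None


# Link names copied from the two shards shown in the example above.
root_links = [{"Name": "00"}, {"Name": "01"}]
sub_links = [{"Name": "6E470.txt"}, {"Name": "FF742.txt"}]

step_one = match_link(root_links, bucket_prefix(0x00), "470.txt")
step_two = match_link(sub_links, bucket_prefix(0x6E), "470.txt")
```

Here `step_one` reports a sub-shard for link "00" and `step_two` reports the target link "6E470.txt". A real resolver would derive each bucket index from the murmur3-x64-64 digest as described in step 2.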
Implementations typically convert regular directories to HAMT when the serialized directory node exceeds a size threshold between 256 KiB and 1 MiB. This threshold:
See Block Size Considerations for details on block size limits and conventions.
dag-pb Symlink
A Symlink represents a POSIX symbolic link.
A symlink MUST NOT have children in PBNode.Links.
The PBNode.Data.Data field is a POSIX path that MAY be inserted in front of the currently remaining path component stack.
dag-pb Symlink Path Resolution
Symlink path resolution SHOULD follow the POSIX specification, over the current UnixFS path context, as much as is applicable.
There is no current consensus on how pathing over symlinks should behave. Some implementations return symlink objects and fail if a consumer tries to follow them through.
dag-pb TSize (cumulative DAG size)
Tsize is an optional field in PBNode.Links[] which represents the cumulative size of the entire DAG rooted at that link, including all protobuf encoding overhead.
While optional in the protobuf schema, implementations SHOULD include Tsize for:
Key distinction from blocksize:
- blocksize: Only the raw file data (no protobuf overhead)
- Tsize: Total size of all serialized blocks in the DAG (includes protobuf overhead)
To compute Tsize: sum the serialized size of the current dag-pb block and the Tsize values of all child links.
Example: Directory with multi-block file
Consider the Simple Directory fixture (bafybeihchr7vmgjaasntayyatmp5sv6xza57iy2h4xj7g46bpjij6yhrmy):
The directory has a total Tsize of 1572 bytes:
The multiblock.txt file within this directory demonstrates how Tsize accumulates:
- Tsize: 245 + 1026 = 1271 bytes
This shows how Tsize includes both the protobuf overhead and all child data, while blocksize only counts the raw file data.
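The difference between the two sizes is easy to see in code. The sketch below models a block with a small hypothetical `Block` dataclass (not part of any UnixFS library) and uses the figures quoted for multiblock.txt in this document: a 245-byte dag-pb root over raw leaves of 256, 256, 256, 256 and 2 bytes.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    serialized_size: int        # bytes of the block as stored
    inline_data_size: int = 0   # len(decode(PBNode.Data).Data) for dag-pb blocks
    children: list = field(default_factory=list)
    is_raw: bool = False


def blocksize(block):
    """Raw file data only: no protobuf envelope is counted."""
    if block.is_raw:
        return block.serialized_size
    return block.inline_data_size + sum(blocksize(c) for c in block.children)


def tsize(block):
    """Serialized size of this block plus the Tsize of every child."""
    return block.serialized_size + sum(tsize(c) for c in block.children)


leaves = [Block(size, is_raw=True) for size in (256, 256, 256, 256, 2)]
multiblock = Block(serialized_size=245, children=leaves)
```

For this model, `blocksize(multiblock)` is 1026 and `tsize(multiblock)` is 1271, matching the 245 + 1026 breakdown above.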
Examples of where Tsize is useful:
An implementation SHOULD NOT assume the Tsize values are correct. The value is only a hint that provides performance optimization for better UX.
Following the Robustness Principle, implementations SHOULD be able to decode nodes where the Tsize field is wrong (not matching the sizes of sub-DAGs), or partially or completely missing.
When total data size is needed for important purposes such as accounting, billing, and cost estimation, the Tsize SHOULD NOT be used, and instead a full DAG walk SHOULD be performed.
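dag-pb Optional Metadata
UnixFS defines the following optional metadata fields.
mode Field
The mode (introduced in UnixFS v1.5) is for persisting the file permissions in numeric notation [spec].
- If unspecified, the default is 0755 for directories/HAMT shards and 0644 for all other types where applicable
- The nine least significant bits represent ugo-rwx
- The next three least significant bits represent setuid, setgid and the sticky bit
- The remaining bits have no defined meaning. When cloning or copying a value, they are preserved and only the defined bits are modified:
  clonedValue = ( modifiedBits & 07777 ) | ( originalValue & 0xFFFFF000 )
- When interpreting a value, bits without a defined meaning are masked off:
  interpretedValue = originalValue & 07777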
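The two expressions above translate directly into code. In this sketch the function names are hypothetical and the masks are written as Python octal and hex literals.

```python
DEFINED_BITS = 0o7777       # rwx for user/group/other plus setuid, setgid, sticky
RESERVED_BITS = 0xFFFFF000  # upper 20 bits of the uint32


def clone_mode(original_value, modified_bits):
    """Change only the defined bits; carry the remaining bits over untouched."""
    return (modified_bits & DEFINED_BITS) | (original_value & RESERVED_BITS)


def interpret_mode(original_value):
    """Mask off bits that have no defined meaning."""
    return original_value & DEFINED_BITS
```

For example, cloning a value whose top bit is set while changing the permissions from 0644 to 0755 keeps the top bit, whereas interpreting that value yields only the permission bits.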
Implementation guidance:
mtime Field
The mtime (introduced in UnixFS v1.5) is a two-element structure (Seconds, FractionalNanoseconds) representing the modification time in seconds relative to the unix epoch 1970-01-01T00:00:00Z.
The two fields are:
- Seconds (always present, signed 64bit integer): represents the amount of seconds after or before the epoch.
- FractionalNanoseconds (optional, 32bit unsigned integer): when specified, represents the fractional part of the mtime as the amount of nanoseconds. The valid range for this value is the integers [1, 999999999].
Implementations encoding or decoding wire-representations MUST observe the following:
- An mtime structure with FractionalNanoseconds outside of the on-wire range [1, 999999999] is not valid. This includes a fractional value of 0. Implementations encountering such values should consider the entire enclosing metadata block malformed and abort the processing of the corresponding DAG.
- The mtime structure is optional. Its absence implies unspecified rather than 0.
- Encoder APIs MUST allow a fractional value of 0 as input, while at the same time MUST ensure it is stripped from the final structure before encoding, satisfying the above constraints.
Implementations interpreting the mtime metadata in order to apply it within a non-IPFS target MUST observe the following:
- If the target supports a distinction between unspecified and 0/1970-01-01T00:00:00Z, the distinction must be preserved within the target. For example, if no mtime structure is available, a web gateway must not render a Last-Modified: header.
- If the target requires an mtime (e.g. a FUSE interface) and no mtime is supplied OR the supplied mtime falls outside of the target's accepted range:
  - When no mtime is specified or the resulting UnixTime is negative: implementations must assume 0/1970-01-01T00:00:00Z (note that such values are not merely academic: e.g. the OpenVMS epoch is 1858-11-17T00:00:00Z)
  - When the resulting UnixTime is larger than the target's range (e.g. 32bit vs 64bit mismatch), implementations must assume the highest possible value in the target's range. In most cases, this would be 2038-01-19T03:14:07Z.
Implementation guidance:
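The wire-range and clamping rules for mtime stated above could be sketched as follows. The dict representation of UnixTime, the function names, and the assumption that the target accepts seconds in the range 0 to 2^31 - 1 are all choices made for this example.

```python
NANOS_MAX = 999_999_999
INT32_MAX = 2**31 - 1  # 2038-01-19T03:14:07Z on a signed 32-bit target


def encode_mtime(seconds, fractional_nanoseconds=0):
    """Build the structure to encode; a fractional 0 is accepted but stripped."""
    if not 0 <= fractional_nanoseconds <= NANOS_MAX:
        raise ValueError("FractionalNanoseconds out of range")
    mtime = {"Seconds": seconds}
    if fractional_nanoseconds:
        mtime["FractionalNanoseconds"] = fractional_nanoseconds
    return mtime


def validate_wire_mtime(mtime):
    """Reject structures that are not valid on the wire."""
    if "Seconds" not in mtime:
        raise ValueError("Seconds MUST be present")
    nanos = mtime.get("FractionalNanoseconds")
    if nanos is not None and not 1 <= nanos <= NANOS_MAX:
        raise ValueError("FractionalNanoseconds outside [1, 999999999]")


def seconds_for_target(mtime, target_max=INT32_MAX):
    """Value for a target that requires an mtime and accepts 0..target_max."""
    if mtime is None or mtime["Seconds"] < 0:
        return 0
    return min(mtime["Seconds"], target_max)
```

A target that can represent "unspecified" should skip `seconds_for_target` when no mtime is present, so that the distinction between unspecified and 0 survives.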
Path resolution describes how IPFS systems traverse UnixFS DAGs. While path resolution behavior is mostly IPFS semantics layered over UnixFS data structures, certain UnixFS types (notably HAMTDirectory) define specific resolution algorithms as part of their data structure specification. Each UnixFS type includes a "Path Resolution" subsection documenting its specific requirements.
Paths begin with a <CID>/ or /ipfs/<CID>/, where <CID> is a [multibase] encoded CID. The CID encoding MUST NOT use a multibase alphabet that contains / (0x2f) unicode codepoints. However, CIDs may use a multibase encoding with a / in the alphabet if the encoded CID does not contain / once encoded.
Everything following the CID is a collection of path components (some bytes) separated by / (0x2F). UnixFS paths read from left to right, and are inspired by POSIX paths.
- Path components cannot contain / unicode codepoints because it would break the path into two components.
Behavior is not defined.
Until we agree on a specification for this, implementations SHOULD NOT depend on any escape sequences and/or non-ASCII characters for mission-critical applications, or limit escaping to specific context.
- \ may be used to trigger an escape sequence. However, it is currently broken and inconsistent across implementations.
Relative path components MUST be resolved before trying to work on the path:
- . points to the current node and MUST be removed.
- .. points to the parent node and MUST be removed left to right. When removing a .., the path component on the left MUST also be removed. If there is no path component on the left, implementations MUST reject the path with an error to avoid out-of-bounds path resolution.
- Paths that traverse above the root CID (example: /ipfs/cid/../foo) MUST be rejected with an error indicating invalid path traversal.
The following names SHOULD NOT be used in UnixFS directories:
- The . string, as it represents the self node in POSIX pathing.
- The .. string, as it represents the parent node in POSIX pathing.
- Any string containing a NULL (0x00) byte, as this is often used to signify string terminations in some systems, such as C-compatible systems. Many unix file systems do not accept this character in path components.
Implementations SHOULD validate against these test vectors and reference implementations before production use.
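Returning to the relative-component and naming rules of the path resolution section, the following is a minimal sketch of how they might be applied. It operates on the list of components that follow the CID, and the function names are hypothetical.

```python
def resolve_relative(components):
    """Remove '.' and '..' left to right; reject traversal above the root CID."""
    resolved = []
    for component in components:
        if component == ".":
            continue
        if component == "..":
            if not resolved:
                raise ValueError("invalid path traversal above the root CID")
            resolved.pop()
        else:
            resolved.append(component)
    return resolved


def is_discouraged_name(name):
    """True for names that SHOULD NOT be used as directory entries."""
    return name in (".", "..") or "\x00" in name
```

For example, the components `["a", ".", "b", "..", "c"]` resolve to `["a", "c"]`, while `["..", "foo"]` (the `/ipfs/cid/../foo` case) raises an error.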
This section provides test vectors organized by UnixFS structure type, progressing from simple to complex within each category.
Test vectors for UnixFS file structures, progressing from simple single-block files to complex multi-block files.
raw Block File
Fixture: dir-with-files.car
CID: bafkreifjjcie6lypi6ny7amxnfftagclbuxndqonfipmb64f2km2devei4 (hello.txt)
Type: raw Node
Block size (ipfs block stat): 12 bytes
Data size (ipfs cat): 12 bytes
Encoding: raw codec, no protobuf wrapper
dag-pb Block File
CID: bafybeifx7yeb55armcsxwwitkymga5xf53dxiarykms3ygqic223w5sk3m
Type: dag-pb File with data in the same block
Block size (ipfs block stat): 40 bytes
Data size (ipfs cat): 32 bytes
📄 small-file.txt # bafybeifx7yeb55armcsxwwitkymga5xf53dxiarykms3ygqic223w5sk3m (dag-pb)
└── 📦 Data.Data # "Hello from IPFS Gateway Checker\n" (32 bytes, stored inline in UnixFS protobuf)
Fixture: dir-with-files.car
CID: bafybeigcisqd7m5nf3qmuvjdbakl5bdnh4ocrmacaqkpuh77qjvggmt2sa (multiblock.txt)
Type: dag-pb File with multiple raw Node leaves
Block size (ipfs block stat): 245 bytes (dag-pb)
Data size (ipfs cat): 1026 bytes
📄 multiblock.txt # bafybeigcisqd7m5nf3qmuvjdbakl5bdnh4ocrmacaqkpuh77qjvggmt2sa (dag-pb root)
├── 📦 [0-255] # bafkreie5noke3mb7hqxukzcy73nl23k6lxszxi5w3dtmuwz62wnvkpsscm (raw, 256 bytes)
├── 📦 [256-511] # bafkreih4ephajybraj6wnxsbwjwa77fukurtpl7oj7t7pfq545duhot7cq (raw, 256 bytes)
├── 📦 [512-767] # bafkreigu7buvm3cfunb35766dn7tmqyh2um62zcio63en2btvxuybgcpue (raw, 256 bytes)
├── 📦 [768-1023] # bafkreicll3huefkc3qnrzeony7zcfo7cr3nbx64hnxrqzsixpceg332fhe (raw, 256 bytes)
└── 📦 [1024-1025] # bafkreifst3pqztuvj57lycamoi7z34b4emf7gawxs74nwrc2c7jncmpaqm (raw, 2 bytes)
Fixture: bafybeibfhhww5bpsu34qs7nz25wp7ve36mcc5mxd5du26sr45bbnjhpkei.dag-pb
CID: bafybeibfhhww5bpsu34qs7nz25wp7ve36mcc5mxd5du26sr45bbnjhpkei
Type: dag-pb File with 7 links to child blocks
📄 large-file # bafybeibfhhww5bpsu34qs7nz25wp7ve36mcc5mxd5du26sr45bbnjhpkei (dag-pb root)
├── ⚠️ block[0] # (missing child block)
├── ⚠️ block[1] # (missing child block)
├── ⚠️ block[2] # (missing child block)
├── ⚠️ block[3] # (missing child block)
├── ⚠️ block[4] # (missing child block)
├── ⚠️ block[5] # (missing child block)
└── ⚠️ block[6] # (missing child block)
Fixture: file-3k-and-3-blocks-missing-block.car
CID: QmYhmPjhFjYFyaoiuNzYv8WGavpSRDwdHWe5B4M5du5Rtk
Type: dag-pb File with 3 links but middle block intentionally missing
📄 file-3k # QmYhmPjhFjYFyaoiuNzYv8WGavpSRDwdHWe5B4M5du5Rtk (dag-pb root)
├── 📦 [0-1023] # QmPKt7ptM2ZYSGPUc8PmPT2VBkLDK3iqpG9TBJY7PCE9rF (raw, 1024 bytes)
├── ⚠️ [1024-2047] # (missing block - intentionally removed)
└── 📦 [2048-3071] # QmWXY482zQdwecnfBsj78poUUuPXvyw2JAFAEMw4tzTavV (raw, 1024 bytes)
The first block (QmPKt7ptM2ZYSGPUc8PmPT2VBkLDK3iqpG9TBJY7PCE9rF) and third block (QmWXY482zQdwecnfBsj78poUUuPXvyw2JAFAEMw4tzTavV) can be fetched independently.
Test vectors for UnixFS directory structures, progressing from simple flat directories to complex HAMT-sharded directories.
Fixture: dir-with-files.car
CID: bafybeihchr7vmgjaasntayyatmp5sv6xza57iy2h4xj7g46bpjij6yhrmy
Type: dag-pb Directory
Block size (ipfs block stat): 185 bytes
📁 / # bafybeihchr7vmgjaasntayyatmp5sv6xza57iy2h4xj7g46bpjij6yhrmy
├── 📄 ascii-copy.txt # bafkreifkam6ns4aoolg3wedr4uzrs3kvq66p4pecirz6y2vlrngla62mxm (raw, 31 bytes) "hello application/vnd.ipld.car"
├── 📄 ascii.txt # bafkreifkam6ns4aoolg3wedr4uzrs3kvq66p4pecirz6y2vlrngla62mxm (raw, 31 bytes) "hello application/vnd.ipld.car"
├── 📄 hello.txt # bafkreifjjcie6lypi6ny7amxnfftagclbuxndqonfipmb64f2km2devei4 (raw, 12 bytes) "hello world\n"
└── 📄 multiblock.txt # bafybeigcisqd7m5nf3qmuvjdbakl5bdnh4ocrmacaqkpuh77qjvggmt2sa (dag-pb, 1026 bytes) Lorem ipsum text
Fixture: subdir-with-two-single-block-files.car
CID: bafybeietjm63oynimmv5yyqay33nui4y4wx6u3peezwetxgiwvfmelutzu
Type: dag-pb Directory containing another Directory
📁 / # bafybeietjm63oynimmv5yyqay33nui4y4wx6u3peezwetxgiwvfmelutzu
└── 📁 subdir/ # bafybeiggghzz6dlue3m6nb2dttnbrygxh3lrjl5764f2m4gq7dgzdt55o4 (dag-pb Directory)
    ├── 📄 ascii.txt # bafkreifkam6ns4aoolg3wedr4uzrs3kvq66p4pecirz6y2vlrngla62mxm (raw, 31 bytes) "hello application/vnd.ipld.car"
    └── 📄 hello.txt # bafkreifjjcie6lypi6ny7amxnfftagclbuxndqonfipmb64f2km2devei4 (raw, 12 bytes) "hello world\n"
Implementations should resolve the /subdir/hello.txt path correctly.
Fixture: dag-pb.car
CID: bafybeiegxwlgmoh2cny7qlolykdf7aq7g6dlommarldrbm7c4hbckhfcke
Type: dag-pb Directory
📁 / # bafybeiegxwlgmoh2cny7qlolykdf7aq7g6dlommarldrbm7c4hbckhfcke
├── 📁 foo/ # bafybeidryarwh34ygbtyypbu7qjkl4euiwxby6cql6uvosonohkq2kwnkm (dag-pb Directory)
│   └── 📄 bar.txt # bafkreigzafgemjeejks3vqyuo46ww2e22rt7utq5djikdofjtvnjl5zp6u (raw, 14 bytes) "Hello, world!"
└── 📄 foo.txt # bafkreic3ondyhizrzeoufvoodehinugpj3ecruwokaygl7elezhn2khqfa (raw, 13 bytes) "Hello, IPFS!"
Fixture: path_gateway_tar/fixtures.car
CID: bafybeig6ka5mlwkl4subqhaiatalkcleo4jgnr3hqwvpmsqfca27cijp3i
Type: dag-pb Directory with nested subdirectories
📁 / # bafybeig6ka5mlwkl4subqhaiatalkcleo4jgnr3hqwvpmsqfca27cijp3i
└── 📁 ą/ # (dag-pb Directory)
    └── 📁 ę/ # (dag-pb Directory)
        └── 📄 file-źł.txt # (raw, 34 bytes) "I am a txt file on path with utf8"
Example path: /ipfs/bafybeig6ka5mlwkl4subqhaiatalkcleo4jgnr3hqwvpmsqfca27cijp3i/ą/ę/file-źł.txt
Fixture: dir-with-percent-encoded-filename.car
CID: bafybeig675grnxcmshiuzdaz2xalm6ef4thxxds6o6ypakpghm5kghpc34
Type: dag-pb Directory
📁 / # bafybeig675grnxcmshiuzdaz2xalm6ef4thxxds6o6ypakpghm5kghpc34
└── 📄 Portugal%2C+España=Peninsula Ibérica.txt # bafkreihfmctcb2kuvoljqeuphqr2fg2r45vz5cxgq5c2yrxnqg5erbitmq (raw, 38 bytes) "hello from a percent encoded filename"
The filename contains a percent-encoded sequence (%2C), plus signs, equals, and non-ASCII characters.
In a URL path, the literal %2C is written as %252C (/ipfs//Portugal%252C+España=Peninsula%20Ibérica.txt).
The literal %2C is kept in the filename to avoid conflicts with URL encoding.
Fixture: bafybeigcsevw74ssldzfwhiijzmg7a35lssfmjkuoj2t5qs5u5aztj47tq.dag-pb
CID: bafybeigcsevw74ssldzfwhiijzmg7a35lssfmjkuoj2t5qs5u5aztj47tq
Type: dag-pb Directory
📁 / # bafybeigcsevw74ssldzfwhiijzmg7a35lssfmjkuoj2t5qs5u5aztj47tq
├── ⚠️ audio_only.m4a # (link to missing block, ~24MB)
├── ⚠️ chat.txt # (link to missing block, ~1KB)
├── ⚠️ playback.m3u # (link to missing block, ~116 bytes)
└── ⚠️ zoom_0.mp4 # (link to missing block)
Fixture: single-layer-hamt-with-multi-block-files.car
CID: bafybeidbclfqleg2uojchspzd4bob56dqetqjsj27gy2cq3klkkgxtpn4i
Type: dag-pb HAMTDirectory
Block size (ipfs block stat): 12046 bytes
📂 / # bafybeidbclfqleg2uojchspzd4bob56dqetqjsj27gy2cq3klkkgxtpn4i (HAMT root)
├── 📄 1.txt # (dag-pb file, multi-block)
├── 📄 2.txt # (dag-pb file, multi-block)
├── ...
└── 📄 1000.txt # (dag-pb file, multi-block)
Test vectors for special UnixFS features and edge cases.
Common empty structures that implementations frequently encounter:
Empty dag-pb directory
CIDv0: QmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn
CIDv1: bafybeiczsscdsbs7ffqz55asqdf3smv6klcw3gofszvwlyarci47bgf354
CIDv1: bafyaabakaieac (identity multihash)
Empty dag-pb file
CIDv0: QmbFMke1KXqnYyBBWxB74N4c5SBnJMVAiMNRcGu6x1AwQH
CIDv1: bafybeif7ztnhq65lumvvtr4ekcwd2ifwgm3awq4zfr3srh462rwyinlb4y
Empty raw block
CIDv1: bafkreihdwdcefgh4dqkjv67uzcmw7ojee6xedzdetojuzjevtenxquvyku
CIDv1: bafkqaaa (identity multihash)
These CIDs appear frequently in UnixFS implementations and are often hardcoded for performance optimization.
Fixture: symlink.car
CID: QmWvY6FaqFMS89YAQ9NAPjVP4WZKA1qbHbicc9HeSKQTgt
Type: dag-pb Directory containing dag-pb Symlink
Symlink bar (QmTB8BaCJdCH5H3k7GrxJsxgDNmNYGGR71C58ERkivXoj5): 9 bytes
File foo (Qme2y5HA5kvo2jAx13UsnV5bQJVijiAJCPvaW3JGQWhvJZ): 16 bytes
📁 / # QmWvY6FaqFMS89YAQ9NAPjVP4WZKA1qbHbicc9HeSKQTgt
├── 📄 foo # Qme2y5HA5kvo2jAx13UsnV5bQJVijiAJCPvaW3JGQWhvJZ - file containing "content"
└── 🔗 bar # QmTB8BaCJdCH5H3k7GrxJsxgDNmNYGGR71C58ERkivXoj5 - symlink pointing to "foo"
Fixture: subdir-with-mixed-block-files.car
CID: bafybeidh6k2vzukelqtrjsmd4p52cpmltd2ufqrdtdg6yigi73in672fwu
Type: dag-pb Directory with subdirectory
📁 / # bafybeidh6k2vzukelqtrjsmd4p52cpmltd2ufqrdtdg6yigi73in672fwu
└── 📁 subdir/ # bafybeicnmple4ehlz3ostv2sbojz3zhh5q7tz5r2qkfdpqfilgggeen7xm
    ├── 📄 ascii.txt # bafkreifkam6ns4aoolg3wedr4uzrs3kvq66p4pecirz6y2vlrngla62mxm (raw, 31 bytes) "hello application/vnd.ipld.car"
    ├── 📄 hello.txt # bafkreifjjcie6lypi6ny7amxnfftagclbuxndqonfipmb64f2km2devei4 (raw, 12 bytes) "hello world\n"
    └── 📄 multiblock.txt # bafybeigcisqd7m5nf3qmuvjdbakl5bdnh4ocrmacaqkpuh77qjvggmt2sa (dag-pb, 1271 bytes total)
Fixture: dir-with-duplicate-files.car
CID: bafybeihchr7vmgjaasntayyatmp5sv6xza57iy2h4xj7g46bpjij6yhrmy
Type: dag-pb Directory
📁 / # bafybeihchr7vmgjaasntayyatmp5sv6xza57iy2h4xj7g46bpjij6yhrmy
├── 🔗 ascii-copy.txt # bafkreifkam6ns4aoolg3wedr4uzrs3kvq66p4pecirz6y2vlrngla62mxm (same CID as ascii.txt)
├── 📄 ascii.txt # bafkreifkam6ns4aoolg3wedr4uzrs3kvq66p4pecirz6y2vlrngla62mxm (raw, 31 bytes) "hello application/vnd.ipld.car"
├── 📄 hello.txt # bafkreifjjcie6lypi6ny7amxnfftagclbuxndqonfipmb64f2km2devei4 (raw, 12 bytes) "hello world\n"
└── 📄 multiblock.txt # bafybeigcisqd7m5nf3qmuvjdbakl5bdnh4ocrmacaqkpuh77qjvggmt2sa (dag-pb, multi-block)
These fixtures test raw dag-pb codec capabilities and serve as invalid test vectors for UnixFS implementations. Most lack UnixFS metadata - meaning their dag-pb Data field either doesn't exist, is empty, or contains bytes that aren't a valid UnixFS protobuf (which requires at minimum a Type field specifying File/Directory/Symlink etc).
These validate that implementations properly reject malformed or non-UnixFS dag-pb nodes rather than crashing or behaving unpredictably:
- bafybeihdwdcefgh4dqkjv67uzcmw7ojee6xedzdetojuzjevtenxquvyku.dag-pb - Empty dag-pb node, 0 bytes (no UnixFS metadata)
- bafybeihyivpglm6o6wrafbe36fp5l67abmewk7i2eob5wacdbhz7as5obe.dag-pb - Single link without data, bytes: 12240a2212207521fe19c374a97759226dc5c0c8e674e73950e81b211f7dd3b6b30883a08a51 (no UnixFS metadata)
- bafybeibh647pmxyksmdm24uad6b5f7tx4dhvilzbg2fiqgzll4yek7g7y4.dag-pb - Two links with data, bytes: 12340a2212208ab7a6c5e74737878ac73863cb76739d15d4666de44e5756bf55a2f9e9ab5f431209736f6d65206c696e6b1880c2d72f12370a2212208ab7a6c5e74737878ac73863cb76739d15d4666de44e5756bf55a2f9e9ab5f44120f736f6d65206f74686572206c696e6b18080a09736f6d652064617461 (invalid UnixFS protobuf)
- bafybeie7xh3zqqmeedkotykfsnj2pi4sacvvsjq6zddvcff4pq7dvyenhu.dag-pb - Eleven unnamed links with data (invalid UnixFS protobuf)
- bafybeibazl2z4vqp2tmwcfag6wirmtpnomxknqcgrauj7m2yisrz3qjbom.dag-pb - Node with data field populated, bytes: 0a050001020304 (invalid UnixFS protobuf)
- bafybeiaqfni3s5s2k2r6rgpxz4hohdsskh44ka5tk6ztbjerqpvxwfkwaq.dag-pb - Node with empty data field, bytes: 0a00 (no UnixFS metadata)
- bafybeia53f5n75ituvc3yupuf7tdnxf6fqetrmo2alc6g6iljkmk7ys5mm.dag-pb - Links with hash only, bytes: 120b0a09015500050001020304 (no UnixFS metadata)
- bafybeifq4hcxma3kjljrpxtunnljtc6tvbkgsy3vldyfpfbx2lij76niyu.dag-pb - Links with hash and name, bytes: 12160a090155000500010203041209736f6d65206e616d65 (no UnixFS metadata)
- bafybeie7fstnkm4yshfwnmpp7d3mlh4f4okmk7a54d6c3ffr755q7qzk44.dag-pb - Links with hash but empty name, bytes: 120d0a090155000500010203041200 (no UnixFS metadata)
- bafybeiezymjvhwfuharanxmzxwuomzjjuzqjewjolr4phaiyp6l7qfwo64.dag-pb - Links with hash and Tsize, bytes: 12140a0901550005000102030418ffffffffffffff0f (no UnixFS metadata)
- bafybeichjs5otecmbvwh5azdr4jc45mp2qcofh2fr54wjdxhz4znahod2i.dag-pb - Links with hash but zero Tsize, bytes: 120d0a090155000500010203041800 (no UnixFS metadata)
- bafybeia2qk4u55f2qj7zimmtpulejgz7urp7rzs44cvledcaj42gltkk3u.dag-pb - Simple form variant 1, bytes: 0a03010203 (invalid UnixFS protobuf)
- bafybeiahfgovhod2uvww72vwdgatl5r6qkoeegg7at2bghiokupfphqcku.dag-pb - Simple form variant 2, bytes: 120b0a0901550005000102030412100a09015500050001020304120362617212100a090155000500010203041203666f6f (no UnixFS metadata)
- bafybeidrg2f6slbv4yzydqtgmsi2vzojajnt7iufcreynfpxndca4z5twm.dag-pb - Simple form variant 3, bytes: 120b0a09015500050001020304120e0a09015500050001020304120161120e0a09015500050001020304120161 (no UnixFS metadata)
- bafybeieube7zxmzoc5bgttub2aqofi6xdzimv5munkjseeqccn36a6v6j4.dag-pb - Simple form variant 4, bytes: 120e0a09015500050001020304120161120e0a09015500050001020304120161 (no UnixFS metadata)
Gateway Conformance Suite: ipfs/gateway-conformance
Test fixture generator: go-fixtureplate
Report specification issues or submit corrections via ipfs/specs.
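To inspect the byte strings listed for the dag-pb fixtures above, the outer PBNode container can be decoded by hand. The following is a minimal sketch, not a conforming dag-pb decoder: the helper names (read_varint, parse_fields, decode_pbnode) are hypothetical, only varint and length-delimited fields are handled, and none of the strictness rules of the dag-pb specification (such as field ordering) are enforced.

```python
def read_varint(buf, pos):
    """Read one protobuf varint starting at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_fields(buf):
    """Return a list of (field_number, value) pairs found in a protobuf message."""
    pos, fields = 0, []
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:  # varint (used by Tsize)
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited (Hash, Name, Data, Links)
            length, pos = read_varint(buf, pos)
            value = bytes(buf[pos:pos + length])
            pos += length
        else:
            raise ValueError(f"unexpected wire type {wire_type}")
        fields.append((number, value))
    return fields


def decode_pbnode(buf):
    """Decode the outer dag-pb container: Data is field 1, Links is field 2."""
    node = {"Data": None, "Links": []}
    for number, value in parse_fields(buf):
        if number == 1:
            node["Data"] = value
        elif number == 2:
            link = dict(parse_fields(value))  # Hash = 1, Name = 2, Tsize = 3
            name = link.get(2)
            node["Links"].append({
                "Hash": link.get(1),
                "Name": name.decode("utf-8") if name is not None else None,
                "Tsize": link.get(3),
            })
    return node


fixture = bytes.fromhex("12160a090155000500010203041209736f6d65206e616d65")
print(decode_pbnode(fixture))
```

A Data value of None (field absent) is kept distinct from an empty byte string, which matches the difference between the "Empty dag-pb node" and "Node with empty data field" fixtures. Deciding whether Data holds a valid UnixFS message is a second decoding step that this sketch does not attempt.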
This section and included subsections are not authoritative.
@helia/unixfs implementation of a filesystem compatible with Helia SDK
ipfs/boxo/../unixfs.proto
ipfs/boxo/files
ipfs/boxo/ipld/unixfs
go-ipld-prime implementation: ipfs/go-unixfsnode
While UnixFS itself does not mandate specific block size limits, implementations typically enforce practical constraints for operational efficiency:
These limits affect several UnixFS behaviors:
raw blocks or within the Data field of a single dag-pb node
Note that specific block size policies are implementation-dependent and may be configurable. If you want to maximize the interoperability of your data, make sure to keep chunk sizes no bigger than 1 MiB. Consult your implementation's documentation for exact limits and configuration options.
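As an illustration of the 1 MiB recommendation, a fixed-size chunker can be sketched as follows. The function name and the constant are assumptions made for this example; real implementations may offer other chunking strategies and different defaults.

```python
CHUNK_SIZE = 1024 * 1024  # 1 MiB, the interoperability-friendly upper bound


def chunk(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Split data into consecutive chunks of at most `size` bytes."""
    if size <= 0:
        raise ValueError("chunk size must be positive")
    return [data[i:i + size] for i in range(0, len(data), size)]


chunks = chunk(b"\x00" * (2 * CHUNK_SIZE + 1))
print([len(c) for c in chunks])
```

Each resulting chunk could then be stored as its own raw block and referenced from a parent dag-pb node.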
raw Example
In this example, we will build a single raw block with the string test as its content.
First, hash the data:
$ echo -n "test" | sha256sum
9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08 -
Add the CID prefix:
f01551220
9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
f is the multibase prefix. We need it because we are working with a hex CID; it is omitted for binary CIDs
01 is the CID version, here version 1
55 is the codec; here we MUST use raw because this is a raw block
12 is the hashing function used, here sha256
20 is the digest length, 32 bytes
9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08 is the digest we computed earlier
Done. Assuming we stored this block in some implementation of our choice, which makes it accessible to our client, we can try to decode it.
$ ipfs cat f015512209f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
test
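The same steps can be sketched in Python using only the standard library. The function name raw_cid_hex is hypothetical, and the sketch covers only this one combination (CIDv1, raw codec, sha2-256, base16 multibase).

```python
import hashlib


def raw_cid_hex(data: bytes) -> str:
    """Build the base16 (multibase 'f') CIDv1 of a raw block hashed with sha2-256."""
    digest = hashlib.sha256(data).digest()
    prefix = bytes([
        0x01,         # CID version 1
        0x55,         # raw codec
        0x12,         # sha2-256 multihash code
        len(digest),  # digest length, 0x20 (32 bytes)
    ])
    return "f" + (prefix + digest).hex()


print(raw_cid_hex(b"test"))
```

All four prefix values fit in a single byte, so they can be written directly; larger codec or hash codes would need unsigned varint encoding.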
The offset list is not the only way to use blocksizes and reach a correct implementation, but it is a simple canonical one. Python pseudocode to compute it looks like this:
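def offsetlist(node):
    # decodeDataField stands for decoding the UnixFS message held in the Data field
    unixfs = decodeDataField(node.Data)
    if len(node.Links) != len(unixfs.Blocksizes):
        # error messages are implementation details
        raise ValueError("unmatched sister-lists")
    # inline data, if present, covers the first bytes of this node
    cursor = len(unixfs.Data) if unixfs.Data else 0
    # the := operator requires Python 3.8 or later
    return [cursor] + [cursor := cursor + size for size in unixfs.Blocksizes[:-1]]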
For each index, this gives the offset inside this node at which the corresponding child starts to cover bytes (using [x,y) ranging).
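The sketch below shows one way such an offset list could be used to find the child that covers a given byte position. It is an illustration under stated assumptions: the function names are hypothetical, the inline data length and blocksizes are passed in directly rather than decoded from a node, and no upper-bound check against the total file size is performed.

```python
from bisect import bisect_right
from itertools import accumulate


def offset_list(inline_len, blocksizes):
    """Start offset of every child, given the inline Data length and the blocksizes."""
    if not blocksizes:
        return []
    # accumulate(..., initial=...) requires Python 3.8 or later
    return list(accumulate(blocksizes[:-1], initial=inline_len))


def child_for_offset(offsets, position):
    """Index of the child whose [start, end) range contains position.

    Returns -1 when the position falls before the first child, that is,
    inside the inline data (or when there are no children).
    """
    return bisect_right(offsets, position) - 1


offsets = offset_list(0, [10, 20, 30])
print(offsets, child_for_offset(offsets, 10))
```

Because the ranges are half-open, a position equal to a child's start offset belongs to that child, not to the previous one.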
The section below explains some historical decisions. It is not part of the specification and is provided here only for extra context.
Metadata support in UnixFSv1.5 has been expanded to increase the number of possible use cases. These include rsync and filesystem-based package managers.
Several metadata systems were evaluated, as discussed in the following sections.
UnixFS 1.5 stores optional mode and mtime metadata in the Data field of the root dag-pb node. However, the analysis below may be useful when additional metadata is being discussed or the UnixFS 1.5 approach is revisited.
In this scheme, the existing Metadata message is expanded to include additional metadata types (mtime, mode, etc). It contains links to the actual file data, but never the file data itself.
This was ultimately rejected for a number of reasons:
File node. This would not be possible with an intermediate Metadata node.
File node already contains some metadata (e.g. the file size), so metadata would be stored in multiple places. This complicates forwards compatibility with UnixFSv2, as mapping between metadata formats potentially requires multiple fetch operations.
Repeated Metadata messages are added to UnixFS Directory and HAMTShard nodes, the index of which indicates which entry they are to be applied to. Where entries are HAMTShards, an empty message is added.
One advantage of this method is that, if we expand stored metadata to include entry types and sizes, we can perform directory listings without needing to fetch further entry nodes (excepting HAMTShard nodes). However, without removing the storage of this data elsewhere in the spec, we run the risk of having non-canonical data locations and perhaps conflicting data as we traverse through trees containing both UnixFS v1 and v1.5 nodes.
This was rejected for the following reasons:
This adds new fields to the UnixFS Data message to represent the various metadata fields.
It has the advantage of being simple to implement. Metadata is maintained whether the file is accessed directly via its CID or via an IPFS path that includes a containing directory. In addition, metadata is kept small enough that we can inline root UnixFS nodes into their CIDs so that we can end up fetching the same number of nodes if we decide to keep file data in a leaf node for deduplication reasons.
Downsides to this approach are:
mtimes being different. If the content is stored in another node, its CID will be constant between the two users, but you can't navigate to it unless you have the parent node, which will be less available due to the proliferation of CIDs.
With this approach, we would maintain a separate data structure outside of the UnixFS tree to hold metadata.
This was rejected due to concerns about added complexity, recovery after system crashes while writing, and having to make extra requests to fetch metadata nodes when resolving CIDs from peers.
This scheme would see metadata stored in an external database.
The downsides to this are that metadata would not be transferred from one node to another when syncing, as Bitswap is not aware of the database and in-tree metadata.
The integer portion of UnixTime is represented on the wire using a varint encoding. While this is inefficient for negative values, it avoids introducing zig-zag encoding. Values before the year 1970 are exceedingly rare, and it would be handy to have such cases stand out, while ensuring that the "usual" positive values are easily readable. The varint representing the time of writing this text is 5 bytes long. It will remain so until October 26, 3058 (34,359,738,367).
Fractional values are effectively a random number in the range 1 to 999,999,999. In most cases, such values will exceed 2^28 (268,435,456) nanoseconds. Therefore, the fractional part is represented as a 4-byte fixed32, as per Google's recommendation.
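The size arguments above can be checked with a small sketch of the two protobuf wire encodings involved. The helper names are hypothetical, and the code illustrates only encoded lengths; it is not a full protobuf encoder (field tags are omitted).

```python
import struct


def encode_varint(value: int) -> bytes:
    """Encode an int64 as a protobuf varint (negatives use 64-bit two's complement)."""
    value &= 0xFFFFFFFFFFFFFFFF
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_fixed32(value: int) -> bytes:
    """Encode an unsigned 32-bit integer as little-endian fixed32."""
    return struct.pack("<I", value)


print(len(encode_varint(34_359_738_367)))  # largest value that fits in 5 bytes
print(len(encode_varint(34_359_738_368)))  # one more second needs a sixth byte
print(len(encode_varint(-1)))              # negative values take the full 10 bytes
print(len(encode_varint(2**28)))           # as a varint, 2^28 and above need 5 bytes
print(len(encode_fixed32(999_999_999)))    # fixed32 is always 4 bytes
```

A 5-byte varint carries 35 payload bits, which is where the 34,359,738,367 limit comes from. A 4-byte varint carries 28 bits, so any nanosecond value of 2^28 or more is smaller as a fixed32.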
We gratefully acknowledge the following individuals for their valuable contributions, ranging from minor suggestions to major insights, which have shaped and improved this specification.