Contents manifest specification: Difference between revisions

From OLPC
Jump to navigation Jump to search
m (added some category tags)
(→‎Notes on resource-bounded parsing: Document contents manifest handling of non-ASCII filenames.)
 
(18 intermediate revisions by 4 users not shown)
Line 30: Line 30:
opaque string describing the type of the object, semi-opaque integers
opaque string describing the type of the object, semi-opaque integers
providing version information, and json-encoded contents. Envelopes
providing version information, and json-encoded contents. Envelopes
are represented by tuples (JSON 4-element lists) of the form:
are represented by 3-tuples (JSON 3-element lists) of the form:


(type, version, subversion, data) :: (String, Integer, Integer, Value)
(type, version, data) :: (String, Integer, Value)
^ ^ ^ ^
^ ^ ^
| | | \________ a JSON value in a format governed by
| | \________ a JSON value in a format governed by
| | | (type, version, subversion)
| | (type, version)
| | \____ transparent version identifier; see below.
| \______ an opaque version identifier; see below.
| \______ an opaque version identifier; see below.
\_________ a hint about how to interpret the data field that follows.
\_________ a hint about how to interpret the data field that follows.
Line 44: Line 43:
be supported by compliant clients. Version numbers are incomparable:
be supported by compliant clients. Version numbers are incomparable:
only equality testing of versions is valid.
only equality testing of versions is valid.

On the other hand, the subversion number is transparent, and is
defined to be monotonically increasing. It is designed to allow the
identification of backwards-compatible protocol extensions.

In security-critical applications, only exact matches on version and
subversion information should be allowed, and any unknown keys in maps
or entries in lists should trigger a fatal parsing error.

In non-security-critical applications, version numbers must still be
matched exactly, but a subversion number greater than or equal to the
one(s) known to be parseable may be allowed. Unknown map keys or list
items should be silently ignored.

In the remainder, the version and subversion information will often be
specified as a dotted integer. For example, "1.2" represents version
1 and subversion 2.


==Keys==
==Keys==


Keys are envelopes with type "key" and version 1.0. A key's datum consists of a
Keys are envelopes with type "key" and version 1. A key's datum consists of a
tuple (JSON 3-element list):
tuple (JSON 3-element list):


Line 79: Line 61:


In practice, we might see a key encoded as:
In practice, we might see a key encoded as:
["key", 1, 0, ["rsa-2048-pub", "12508713460986140897", "ae1908579871250981230896109761"]]
["key", 1, ["rsa-2048-pub", "12508713460986140897", "ae1908579871250981230896109761"]]


Initially, the key data should be the same hexadecimal-encoded octet string given in the [[Firmware Key and Signature Formats#Key|firmware key spec]], the key fingerprint should be the trailing 64 characters of that string, as in the [[Firmware Key and Signature Formats#Signature|firmware signature spec]], and the algorithm string should be "rsa-2048-pub" exactly. An encoding for private keys is not (yet) given.


==Credentials==
==Credentials==


A credential is an envelope with type "credential" and version 1.0.
A credential is an envelope with type "sig" and version 1.
Its datum consists of a list of [[Firmware Key and Signature Formats#Signature|signatures]] in the [[Firmware Key and Signature Formats#Version_1|sig01]] or [[Firmware Key and Signature Formats#Version_2|sig02]] formats.
Conceptually, its datum consists of a relation whose tuples are
A valid credential must have at least one signature in the list, and *all*
signatures of the form:

(signature-algorithm, key, message, signature)

However, in practice, the message is defined implicitly by context
(i.e. the manifest with which this credential is paired) and we wish
to use the credential/manifest system to help detect accidental
corruption as early as possible, including corruption of the manifest or confusion of manifest/signature pairs.

Therefore, we actually propose to represent the datum of a message credential
as a list of tuples, where each tuple (JSON 5-item list) is:

(hash-alg, sig-alg, expected-hash, key-identifier, signature)

Each tuple in the list specifies a different signature algorithm. A
valid credential must have at least one tuple in the list, and *all*
signatures must verify for the credential to be valid.
signatures must verify for the credential to be valid.
(Implementations may chose to verify only a subset of the provided
(Implementations may chose to verify only a subset of the provided
signatures, either for performance reasons or to allow migration to
signatures, either for performance reasons or to allow migration to
future preferred signature algorithms, but any invalid signature
future preferred signature algorithms, but any invalid signature
invalidates the credential.) Since the list of tuples is nominally
invalidates the credential.) Since the list of signatures is nominally
unordered, we will require that the encoding of the credential list
unordered, we will require that the encoding of the credential lists signatures in lexicographic order.
All signatures must be unique: the credential is invalid if the same signature string is
tuples in lexicographic order. That is, tuples are sorted first
lexicographically on hash-alg, then tuples with the same hash-alg are
lexicographically sorted by sig-alg, and further ties are broken by
the expected hash, key-identifier, and (if necessary) signature field.
All tuples must be unique: the credential is invalid if the same
(hash-alg, sig-alg, expected-hash, key-identifier, signature) tuple is
present more than once.
present more than once.


Note that detached signatures are a pervasive implementation detail within bitfrost; we assume some external mechanism to associate this signature with the appropriate message.
We include the message hash as a convenient way to verify the
(implicit) message to which this signature applies, as well as to
check that the manifest was not accidentally damaged in transit
(e.g. by a flash corruption bug). Note that detached signatures are a
pervasive implementation detail within bitfrost.

The key fingerprint identifies the key that should be used to verify the
signature.

The hash-algorithm and signature-algorithm are used to verify the signature
against the *computed* message hash (not the 'expected-hash' of the
credential tuple). The signature algorithm may use a different hash
function or different message encapsulation, so it is not necessarily
expected that the hash which is signed is the same as the
'expected-hash' field.

Finally, the signature itself is opaque.

As with keys, the specific hash and signature algorithms should be
careful to unambiguously encoded the hashed data and signature. For
example, if hex digits are used, only lowercase hex digits should be
permitted.


In actual JSON, we might imagine that such a message would look like:
In actual JSON, we might imagine that such a message would look like:
["sig", 1, [ "sig01: sha256 807b41...0001 c06dc..cb3c\n", "sig01: rmd160 807b41...0001 c06dc..1234\n" ]]
["credential", 1, 0, [ ["sha-256", "rsa-2048", "deadbeef", "123mykey", "0paqu3s1g"],
["whirlpool-XXX", "ecc-YYY", "123095", "9912key", "8567a"] ] ]



==Manifests==
==Contents Manifests==


Manifests describe the contents of a filesystem. Their primary use is
Contents manifests describe the contents of a filesystem. Their primary use is
in describing software releases and allowing upgrades between these.
in describing software releases and allowing upgrades between these.


In version 1.0 of the manifest format, we have decided to omit
In version 1 of the contents format, we have decided to omit
unnecessary file metadata from the manifest. This will allow more
unnecessary file metadata from the manifest. This will allow more
efficient updates, because we don't have to synchronize irrelevant
efficient updates, because we don't have to synchronize irrelevant
metadata. Omitted metadata includes file creation, modification, and
metadata. Omitted metadata includes file creation, modification, and
access times, and hard link and xattr information. We also use
access times, and hard link and xattr information. We also use both
numeric user and group ids, which may not correlate with the target
numeric and textual user and group ids, which may require care on the target
system if the user/group databases are not synchronized. Our design
system if the user/group databases are not synchronized. Our design
is sufficient for the present; we will consider adding additional
is sufficient for the present; we will consider adding additional
metadata to the manifest in future versions if a need can be
metadata to the contents manifest in future versions if a need can be
demonstrated.
demonstrated.


Line 164: Line 104:
unnecessary at the present, and (b) it is difficult to properly
unnecessary at the present, and (b) it is difficult to properly
support in our current containerization copy-on-write framework.
support in our current containerization copy-on-write framework.
Manifest-generation tools should check that hard links are not used.)
Contents-generation tools should check that hard links are not used.)


A manifest is an envelope with type "manifest" and version 1.0.
A contents manifest is an envelope with type "manifest" and version 1.0.
A manifest's datum is a 2-tuple (JSON 2-element list):
A contents manifest's datum is a list:


(hash-algorithm-list, directory-list) ::
directory-list :: List<Directory>
(List<String>, List<Directory>)


The hash-algorithm-list is an ordered list of opaque algorithm
identifiers. There must be at least one item in the list. This hash algorithm order is used by the included [[#Directory objects|directory objects]].
The directory-list is an list of [[#Directory objects|directory]] envelopes, in post-order.
The directory-list is an list of [[#Directory objects|directory]] envelopes, in post-order.
The first directory corresponds to the root of the filesystem defined by the manifest, and it is followed by directory objects for its subdirectories (and their subdirectories) in the order in which they appear in the root directory object. For example, the directory objects for the directory tree:
The first directory corresponds to the root of the filesystem defined by the manifest, and it is followed by directory objects for its subdirectories (and their subdirectories) in the order in which they appear in the root directory object. For example, the directory objects for the directory tree:
Line 187: Line 124:
are written out in the order:
are written out in the order:


[a, b, c, d, e, f, g]
[/, a, b, c, d, e, f, g]


The directory-list must include the 'root' object, but it need not
The directory-list must include the 'root' object (/), but it need not
be complete: subdirectory trees may be omitted from the manifest object if
be complete: subdirectory trees may be omitted from the manifest object if
they can be obtained by other means. In the example above, 'f' could be omitted, and 'e' and 'f' could be omitted, but 'f' should not be included in the directory list if 'e' is omitted.
they can be obtained by other means. In the example above, 'f' could be omitted, and 'e' and 'f' could be omitted, but 'f' should not be included in the directory list if 'e' is omitted.
Omitted subdirectories can be discovered by comparing the hash of each directory object encountered with the hash declared in its parent directory, and skipping omitted directories until the hashes match.
Omitted subdirectories can be discovered by comparing the hash of each directory object encountered with the hash declared in its parent directory, and skipping omitted directories until the hashes match.


Only the root directory need be included in a manifest transmitted for the purpose
Only the root directory need be included in a contents manifest transmitted for the purpose
of validating a directory tree: hashes for the subdirectories can be
of validating a directory tree: hashes for the subdirectories can be
recomputed from the filesystem contents and compared to those named in
recomputed from the filesystem contents and compared to those named in
the root directory object. Similarly, one may chose to store
the root directory object. Similarly, one may chose to store
subdirectory manifest objects in special hidden files in the
subdirectory contents manifest objects in special hidden files in the
directories themselves, either as bare JSON-encoded directory objects,
directories themselves, either as bare JSON-encoded directory objects,
or as full-fledged manifests with their (sub)directory lists.
or as full-fledged contents manifests with their (sub)directory lists.
The manifest format allows a completely self-contained manifest, however,
The contents format allows a completely self-contained contents manifest, however,
where all subdirectories are contained in the directory list.
where all subdirectories are contained in the directory list.


Since subdirectories may be elided from the directory list, the credential authenticating a manifest should sign the root directory object (ie, 'a' in the example above). The validity of the remaining subdirectories is assured by the [http://en.wikipedia.org/wiki/Hash_tree hash tree] construction of the directory objects.
Since subdirectories may be elided from the directory list, '''the credential authenticating a contents manifest should sign (a hash of) the root directory object''' (ie, '/' in the example above). The validity of the remaining subdirectories is assured by the [http://en.wikipedia.org/wiki/Hash_tree hash tree] construction of the directory objects. The credential need not share hash algorithms with the contents manifest.


Although objects may be omitted from the directory list, it should
Although objects may be omitted from the directory list, it should
be considered an error to include extraneous directories not directly referenced from the included directory objects.
be considered an error to include extraneous directories not directly referenced from the included directory objects.


A contents manifest is just a convenience object for bundling a number of related directory objects; it should not be directly signed. Instead, the root directory object in the contents manifest should be the element which receives a signature.
In version 1.0 of the manifest format, there must be exactly two
hash algorithms specified, and they must be 'sha-256' and
'whirlpool-XXX'. (Future subversions of the manifest format may add new
algorithms.)


===Directory objects===
Note that adding a bogus hash algorithm to the hash-algorithm-list
will allow injection of arbitrary data into the manifest, which could
comprise an attack on the hash algorithm(s) used. For this reason,
security-conscious applications will require a fixed algorithm list
(as version 1.0 does). Migration to a new hash algorithm requires
first distributing an upgrade which recognizes a new manifest version
(either a new subversion, if the hash algorithm list is simply
extended, or a new version if hash algorithms are removed). After the
upgrade, manifests with the new version (and thus hash algorithm list)
can be published.


Directory objects describe the contents of a single directory; a contents
Similarly, a manifest with N listed hash algorithms admits N! possible
manifest is built from a collection of directory objects.
variants where the algorithms are listed in a different order. Since
N! grows much faster than 2^N, it is wise to keep the list of hash
algorithms small.


A directory object is an envelope with type "dir" and version
This attack does not apply to hash and signature algorithms in a
1. The datum is a 2-tuple (JSON 2-element list):
credential authenticating this manifest's root directory object, since the credential object is not itself hashed or
signed. Any desired set of hash and signature algorithms can be used in the credential. It is recommended, however, that the
hash algorithms used in the credential exactly match those named in the
manifest header.


(hash-algorithm-list, contents-map) ::
===Directory objects===
(List<String>, Map)


The hash-algorithm-list is an ordered list of opaque algorithm
Directory objects describe the contents of a single directory; a
identifiers. There must be at least one item in the list. This hash algorithm list is used by the contents-map, which contains:
manifest is built from a collection of directory objects.


A directory object is an envelope with type "dir" and version
1.0. The datum is a map:
{ filename1: file-descriptor1, ..., filenameN: file-descriptorN } ::
{ filename1: file-descriptor1, ..., filenameN: file-descriptorN } ::
Map<String, Map<String, Value>>
Map<String, Map<String, Value>>
Line 261: Line 180:
h: an ordered list of hashes for this entry (for regular files or
h: an ordered list of hashes for this entry (for regular files or
directories ONLY)
directories ONLY)
dl: length of the canonical-JSON encoding of the directory object
(for directories ONLY)
ml: length of the canonical-JSON encoding of the manifest for this
directory and its children (for directories ONLY)


The list of hashes under key 'h' should match the order of
The list of hashes under key 'h' should match the order of
hash-algorithm-list in a containing manifest object. The hash for a
hash-algorithm-list. The hash for a
entry naming a subdirectory is defined to be the hash of the
entry naming a subdirectory is defined to be the hash of the
"canonical JSON"-encoded directory object describing that directory.
"canonical JSON"-encoded directory object describing that directory; the contents and order of the hash-algorithm-list in the subdirectory must exactly match the one for this directory.

The 'dl' and 'ml' keys are for entries naming subdirectories and provide length information for random access to individual directory entries within a contents manifest. The 'ml' attribute is 16 plus the sum of (1 + dlen) for the referenced directory and all its children, where 'dlen' is the length of the JSON-encoded directory object.


Note that both numeric and symbolic user and group names are included. This is the same strategy [http://www.gnu.org/software/tar/manual/html_node/Attributes.html#index-numeric_002downer-485 used by the 'tar' utility]. When extracting or verifying a manifest, certain of this information may be ignored: for example, activities in the user's home directory will typically belong to the user, while 'system' activities may belong to root. Care should be taken when generating manifests to ensure that user and group information has standardized content, even if that information will be ignored in the expected use, to avoid having multiple valid manifests for a given activity or library bundle.
Note that both numeric and symbolic user and group names are included. This is the same strategy [http://www.gnu.org/software/tar/manual/html_node/Attributes.html#index-numeric_002downer-485 used by the 'tar' utility]. When extracting or verifying a manifest, certain of this information may be ignored: for example, activities in the user's home directory will typically belong to the user, while 'system' activities may belong to root. Care should be taken when generating manifests to ensure that user and group information has standardized content, even if that information will be ignored in the expected use, to avoid having multiple valid manifests for a given activity or library bundle.

In version 1 of the directory format, there must be exactly two
hash algorithms specified in the hash-algorithm-list, and they must be 'sha-256' and
'ripemd-160', in that order.

Note that adding a bogus hash algorithm to the hash-algorithm-list
will allow injection of arbitrary data into the directory object, which could
comprise an attack on the hash algorithm(s) used. For this reason,
security-conscious applications will require a fixed algorithm list
(as version 1 does). Migration to a new hash algorithm requires
first distributing an upgrade which recognizes a new directory version. After the
upgrade, directories with the new version (and thus hash algorithm list)
can be published.

Similarly, a directory with N listed hash algorithms admits N! possible
variants where the algorithms are listed in a different order. Since
N! grows much faster than 2^N, it is wise to keep the list of hash
algorithms small.

This attack does not apply to hash and signature algorithms in a
credential authenticating this directory object, since the credential object is not itself hashed or signed. Any desired set of hash and signature algorithms can be used in the credential.


A small example of a directory object might be:
A small example of a directory object might be:
["dir", 1, 0, {
["dir",1,[["sha-256","ripemd-160"],{
"bar": { g: "users", g#: 1000, h: [ "e242ed3bffccdf271b7fbaf34ed72d089537b42f", "c157a79031e1c40f85931829bc5fc552" ], m: 33188, u: "olpc", u#: 1000 },
"bar":{"g":"users","g#":1000,"h":["7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730","7d4e874a231f57b72509087d1e509942fdb6eac6"],"m":33188,"u":"olpc","u#":1000},
"fifo": { g: "users", g#: 1000, m: 4516, u: "olpc", u#: 1000 },
"fifo":{"g":"users","g#":1000,"m":4516,"u":"olpc","u#":1000},
"frobnitz": { g: "users", g#: 1000, l: "bar", m: 41380, u: "olpc", u#: 1000 },
"frobnitz":{"g":"users","g#":1000,"l":"bar","m":41471,"u":"olpc","u#":1000},
"null": { d: 2049, g: "users", g#: 1000, m: 8612, u: "olpc", u#: 1000 },
"null":{"d":259,"g":"users","g#":1000,"m":8612,"u":"olpc","u#":1000},
"subdir": { g: "users", g#: 1000, h: [ "da39a3ee5e6b4b0d3255bfef95601890afd80709", "d41d8cd98f00b204e9800998ecf8427e" ], m: 16804, u: "olpc", u#: 1000 }
"subdir":{"dl":39,"g":"users","g#":1000,"h":["19b46e0c53a25994e5f5e4d133bf308df3f99a3879b7e954d75b51f8393523f1","75fc670c37b3d1aaf0f402c531dc98325862e8ae"],"m":16877,"ml":56,"u":"olpc","u#":1000}
} ]
}]]

(Of course, the JSON encoding should quote map keys and use normalized whitespace
(Of course, the JSON encoding should remove whitespace to actually be [[Canonical JSON]].)

to actually be canonical JSON.)
A simple contents manifest containing this directory might be:
["manifest",1,[
["dir",1,[["sha-256","ripemd-160"],{
"bar":{"g":"users","g#":1000,"h":["7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730","7d4e874a231f57b72509087d1e509942fdb6eac6"],"m":33188,"u":"olpc","u#":1000},
"fifo":{"g":"users","g#":1000,"m":4516,"u":"olpc","u#":1000},
"frobnitz":{"g":"users","g#":1000,"l":"bar","m":41471,"u":"olpc","u#":1000},
"null":{"d":259,"g":"users","g#":1000,"m":8612,"u":"olpc","u#":1000},
"subdir":{"dl":39,"g":"users","g#":1000,"h":["19b46e0c53a25994e5f5e4d133bf308df3f99a3879b7e954d75b51f8393523f1","75fc670c37b3d1aaf0f402c531dc98325862e8ae"],"m":16877,"ml":56,"u":"olpc","u#":1000}
}]],
["dir",1,[["sha-256","ripemd-160"],{}]]
]]

==Notes on non-ASCII filenames==
Generally speaking, filenames on a UNIX system are assumed to be uninterpreted byte streams; they are explicitly *not* required to be valid in UTF-8 or any other encoding. ''However'', for the Sugar environment we insist that all filenames are valid UTF-8 (since various services exposed over DBus require this). For this reason, ''some implementations'' (in particular old versions of fv.c from [http://dev.laptop.org/git/projects/olpc-contents olpc-contents]) will treat filenames and directory names as unicode strings, converting from UTF-8 and [[Canonical JSON|canonicalizing]] them when the contents manifest is created, and renormalizing before comparison during verification. This is problematic: it allows multiple encodings of the same name to match the same manifest.


A better solution (implemented in olpc-contents 2.5 and later) is to insist on normalized filenames during contents manifest creation, and allow only exact comparisons during verification.
A simple manifest containing this directory might be:
["manifest", 1, 0, [
[ "sha1", "md5" ], [
["dir", 1, 0, {
"bar": { g: "users", g#: 1000, h: [ "e242ed3bffccdf271b7fbaf34ed72d089537b42f", "c157a79031e1c40f85931829bc5fc552" ], m: 33188, u: "olpc", u#: 1000 },
"fifo": { g: "users", g#: 1000, m: 4516, u: "olpc", u#: 1000 },
"frobnitz": { g: "users", g#: 1000, l: "bar", m: 41380, u: "olpc", u#: 1000 },
"null": { d: 2049, g: "users", g#: 1000, m: 8612, u: "olpc", u#: 1000 },
"subdir": { g: "users", g#: 1000, h: [ "da39a3ee5e6b4b0d3255bfef95601890afd80709", "d41d8cd98f00b204e9800998ecf8427e" ], m: 16804, u: "olpc", u#: 1000 }
} ],
[dir,1,0,{}]
] ] ]


==Notes on resource-bounded parsing==
==Notes on resource-bounded parsing==
Line 313: Line 262:


[[Category:software]]
[[Category:software]]
[[Category:Specifications]]
[[Category:Security]]

Latest revision as of 17:37, 28 August 2008

  This page is monitored by the OLPC team.
Author
Michael Stone <michael@laptop.org>, Noah Kantrowitz <noah@laptop.org>, C. Scott Ananian <cscott@laptop.org>
Date
August 12, 2007

Our system for update verification manipulates five kinds of data: envelopes, keys, credentials, directory objects, and manifests.

Presently, we intend to use JSON as a transfer encoding because we often would like to represent structured data: trees, lists, tuples. JSON is a "modern S-expression" format which is easily and safely parseable in Python and other languages.

In security-sensitive applications, we will use a "canonical JSON". Canonical JSON is restricted such that there is only one possible encoding of a given data structure: whitespace is normalized, keys in mappings are ordered, and string encoding is restricted. Canonical JSON can be read by any standard JSON parser, although security-conscious applications will want to verify that the restrictions of canonical JSON format are observed. Canonical JSON format is fully specified in another document.

If we learn that we need a streamable or more space-efficient encoding, then we will consider alternatives.


Envelopes

Envelopes are fundamental self-describing objects. They contain an opaque string describing the type of the object, semi-opaque integers providing version information, and json-encoded contents. Envelopes are represented by 3-tuples (JSON 3-element lists) of the form:

 (type, version, data) :: (String, Integer, Value)
    ^       ^       ^
    |       |       \________ a JSON value in a format governed by
    |       |                   (type, version)
    |       \______ an opaque version identifier; see below.
    \_________ a hint about how to interpret the data field that follows.


Version information is opaque, and specifies the data format that must be supported by compliant clients. Version numbers are incomparable: only equality testing of versions is valid.

Keys

Keys are envelopes with type "key" and version 1. A key's datum consists of a tuple (JSON 3-element list):

 (algorithm, fingerprint, key)  :: (String, String, String)
      ^          ^         ^
      |          |         \______ an opaque string specifiying the actual key
      |          \______ an opaque string to use to refer to the key 
      \_________ an opaque algorithm identifier; 

Algorithms should be careful to specify a canonical representation for the key and fingerprint. In particular, if the key data and fingerprint are represented as hex digits, then (for example) only lowercase hex digits should be allowed.

In practice, we might see a key encoded as:

 ["key", 1, ["rsa-2048-pub", "12508713460986140897", "ae1908579871250981230896109761"]]

Initially, the key data should be the same hexadecimal-encoded octet string given in the firmware key spec, the key fingerprint should be the trailing 64 characters of that string, as in the firmware signature spec, and the algorithm string should be "rsa-2048-pub" exactly. An encoding for private keys is not (yet) given.

Credentials

A credential is an envelope with type "sig" and version 1. Its datum consists of a list of signatures in the sig01 or sig02 formats. A valid credential must have at least one signature in the list, and *all* signatures must verify for the credential to be valid. (Implementations may chose to verify only a subset of the provided signatures, either for performance reasons or to allow migration to future preferred signature algorithms, but any invalid signature invalidates the credential.) Since the list of signatures is nominally unordered, we will require that the encoding of the credential lists signatures in lexicographic order. All signatures must be unique: the credential is invalid if the same signature string is present more than once.

Note that detached signatures are a pervasive implementation detail within bitfrost; we assume some external mechanism to associate this signature with the appropriate message.

In actual JSON, we might imagine that such a message would look like:

 ["sig", 1, [ "sig01: sha256 807b41...0001 c06dc..cb3c\n", "sig01: rmd160 807b41...0001 c06dc..1234\n" ]]

Contents Manifests

Contents manifests describe the contents of a filesystem. Their primary use is in describing software releases and allowing upgrades between these.

In version 1 of the contents format, we have decided to omit unnecessary file metadata from the manifest. This will allow more efficient updates, because we don't have to synchronize irrelevant metadata. Omitted metadata includes file creation, modification, and access times, and hard link and xattr information. We also use both numeric and textual user and group ids, which may require care on the target system if the user/group databases are not synchronized. Our design is sufficient for the present; we will consider adding additional metadata to the contents manifest in future versions if a need can be demonstrated.

(Hard link information is omitted because (a) it seems to be unnecessary at the present, and (b) it is difficult to properly support in our current containerization copy-on-write framework. Contents-generation tools should check that hard links are not used.)

A contents manifest is an envelope with type "manifest" and version 1.0. A contents manifest's datum is a list:

 directory-list :: List<Directory>

The directory-list is an list of directory envelopes, in post-order. The first directory corresponds to the root of the filesystem defined by the manifest, and it is followed by directory objects for its subdirectories (and their subdirectories) in the order in which they appear in the root directory object. For example, the directory objects for the directory tree:

  /
  /a
  /a/b
  /a/c
  /d
  /d/e
  /d/e/f
  /g

are written out in the order:

  [/, a, b, c, d, e, f, g]

The directory-list must include the 'root' object (/), but it need not be complete: subdirectory trees may be omitted from the manifest object if they can be obtained by other means. In the example above, 'f' could be omitted, and 'e' and 'f' could be omitted, but 'f' should not be included in the directory list if 'e' is omitted. Omitted subdirectories can be discovered by comparing the hash of each directory object encountered with the hash declared in its parent directory, and skipping omitted directories until the hashes match.

Only the root directory need be included in a contents manifest transmitted for the purpose of validating a directory tree: hashes for the subdirectories can be recomputed from the filesystem contents and compared to those named in the root directory object. Similarly, one may chose to store subdirectory contents manifest objects in special hidden files in the directories themselves, either as bare JSON-encoded directory objects, or as full-fledged contents manifests with their (sub)directory lists. The contents format allows a completely self-contained contents manifest, however, where all subdirectories are contained in the directory list.

Since subdirectories may be elided from the directory list, the credential authenticating a contents manifest should sign (a hash of) the root directory object (ie, '/' in the example above). The validity of the remaining subdirectories is assured by the hash tree construction of the directory objects. The credential need not share hash algorithms with the contents manifest.

Although objects may be omitted from the directory list, it should be considered an error to include extraneous directories not directly referenced from the included directory objects.

A contents manifest is just a convenience object for bundling a number of related directory objects; it should not be directly signed. Instead, the root directory object in the contents manifest should be the element which receives a signature.

Directory objects

Directory objects describe the contents of a single directory; a contents manifest is built from a collection of directory objects.

A directory object is an envelope with type "dir" and version 1. The datum is a 2-tuple (JSON 2-element list):

 (hash-algorithm-list, contents-map) ::
                     (List<String>, Map)

The hash-algorithm-list is an ordered list of opaque algorithm identifiers. There must be at least one item in the list. This hash algorithm list is used by the contents-map, which contains:

 { filename1: file-descriptor1, ..., filenameN: file-descriptorN } ::
                                      Map<String, Map<String, Value>>

The map keys are filenames for the contents in the directory. Path components are not included in these keys. The values are maps describing the file entry, with the following keys:

  m:  numeric POSIX mode of the entry
  u:  username of the entry's owner (string)
  u#: numeric UID of the entry's owner
  g:  group name of the entry's group (string)
  g#: numeric GID of the entry's group
  d:  numeric device (for entries where mode indicates a character or
                      block device ONLY)
      [This is 'st_rdev' in the entry's lstat]
  l: link target (for entries where mode indicates a symlink ONLY)
  h: an ordered list of hashes for this entry (for regular files or
                                               directories ONLY)
  dl: length of the canonical-JSON encoding of the directory object
      (for directories ONLY)
  ml: length of the canonical-JSON encoding of the manifest for this
      directory and its children (for directories ONLY)

The list of hashes under key 'h' should match the order of hash-algorithm-list. The hash for a entry naming a subdirectory is defined to be the hash of the "canonical JSON"-encoded directory object describing that directory; the contents and order of the hash-algorithm-list in the subdirectory must exactly match the one for this directory.

The 'dl' and 'ml' keys are for entries naming subdirectories and provide length information for random access to individual directory entries within a contents manifest. The 'ml' attribute is 16 plus the sum of (1 + dlen) for the referenced directory and all its children, where 'dlen' is the length of the JSON-encoded directory object.

Note that both numeric and symbolic user and group names are included. This is the same strategy used by the 'tar' utility. When extracting or verifying a manifest, certain of this information may be ignored: for example, activities in the user's home directory will typically belong to the user, while 'system' activities may belong to root. Care should be taken when generating manifests to ensure that user and group information has standardized content, even if that information will be ignored in the expected use, to avoid having multiple valid manifests for a given activity or library bundle.

In version 1 of the directory format, there must be exactly two hash algorithms specified in the hash-algorithm-list, and they must be 'sha-256' and 'ripemd-160', in that order.

Note that adding a bogus hash algorithm to the hash-algorithm-list will allow injection of arbitrary data into the directory object, which could comprise an attack on the hash algorithm(s) used. For this reason, security-conscious applications will require a fixed algorithm list (as version 1 does). Migration to a new hash algorithm requires first distributing an upgrade which recognizes a new directory version. After the upgrade, directories with the new version (and thus hash algorithm list) can be published.

Similarly, a directory with N listed hash algorithms admits N! possible variants where the algorithms are listed in a different order. Since N! grows much faster than 2^N, it is wise to keep the list of hash algorithms small.

This attack does not apply to hash and signature algorithms in a credential authenticating this directory object, since the credential object is not itself hashed or signed. Any desired set of hash and signature algorithms can be used in the credential.

A small example of a directory object might be:

 ["dir",1,[["sha-256","ripemd-160"],{
  "bar":{"g":"users","g#":1000,"h":["7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730","7d4e874a231f57b72509087d1e509942fdb6eac6"],"m":33188,"u":"olpc","u#":1000},
  "fifo":{"g":"users","g#":1000,"m":4516,"u":"olpc","u#":1000},
  "frobnitz":{"g":"users","g#":1000,"l":"bar","m":41471,"u":"olpc","u#":1000},
  "null":{"d":259,"g":"users","g#":1000,"m":8612,"u":"olpc","u#":1000},
  "subdir":{"dl":39,"g":"users","g#":1000,"h":["19b46e0c53a25994e5f5e4d133bf308df3f99a3879b7e954d75b51f8393523f1","75fc670c37b3d1aaf0f402c531dc98325862e8ae"],"m":16877,"ml":56,"u":"olpc","u#":1000}
 }]]

(Of course, the JSON encoding should remove whitespace to actually be Canonical JSON.)

A simple contents manifest containing this directory might be:

 ["manifest",1,[
   ["dir",1,[["sha-256","ripemd-160"],{
    "bar":{"g":"users","g#":1000,"h":["7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730","7d4e874a231f57b72509087d1e509942fdb6eac6"],"m":33188,"u":"olpc","u#":1000},
    "fifo":{"g":"users","g#":1000,"m":4516,"u":"olpc","u#":1000},
    "frobnitz":{"g":"users","g#":1000,"l":"bar","m":41471,"u":"olpc","u#":1000},
    "null":{"d":259,"g":"users","g#":1000,"m":8612,"u":"olpc","u#":1000},
    "subdir":{"dl":39,"g":"users","g#":1000,"h":["19b46e0c53a25994e5f5e4d133bf308df3f99a3879b7e954d75b51f8393523f1","75fc670c37b3d1aaf0f402c531dc98325862e8ae"],"m":16877,"ml":56,"u":"olpc","u#":1000}
   }]],
   ["dir",1,[["sha-256","ripemd-160"],{}]]
 ]]

Notes on non-ASCII filenames

Generally speaking, filenames on a UNIX system are assumed to be uninterpreted byte streams; they are explicitly *not* required to be valid in UTF-8 or any other encoding. However, for the Sugar environment we insist that all filenames are valid UTF-8 (since various services exposed over DBus require this). For this reason, some implementations (in particular old versions of fv.c from olpc-contents) will treat filenames and directory names as unicode strings, converting from UTF-8 and canonicalizing them when the contents manifest is created, and renormalizing before comparison during verification. This is problematic: it allows multiple encodings of the same name to match the same manifest.

A better solution (implemented in olpc-contents 2.5 and later) is to insist on normalized filenames during contents manifest creation, and allow only exact comparisons during verification.

Notes on resource-bounded parsing

Small bounds can be enforced when parsing directory entry descriptions in directory objects: the numeric components can be limited to 2^32=10 digits (2^64=20 digits for the 'size' field), and the string components (hash, symlink, entry key) can be limited to 256 characters. It is a little more problematic to limit the number of entries "reasonably" found in a directory, but some empirical bound can be set here as well.

Credentials can be resource-bounded as well, using small bounds on the number of signatures included and the size of the hash and signature algorithm descriptors and the signature data.

For manifests, small bounds on the hash algorithm list are wise to limit the possible permutations which can be generated. The 'directory-list' in the manifest will have as many elements as there are directories in the filesystem. However, the directory list can be directly checked as the entries are parsed, requiring space proportional only to the depth of the directory tree (since we must keep around subdirectory hashes until we parse the subdirectory's object). Thus, a limit on directory depth can be used instead of limits on aggregate size of the manifest.