Contents manifest specification: Difference between revisions

From OLPC
Jump to navigation Jump to search
m (Verification System Data moved to Manifest Specification: Manifests are not just used for the verification system, they are also used to guide upload/downloads and to perform hard linking to share space.)
(My counter-proposal for the manifest specification.)
Line 1: Line 1:
;Author : Michael Stone <michael@laptop.org>, Noah Kantrowitz <noah@laptop.org>
;Author : Michael Stone <michael@laptop.org>, Noah Kantrowitz <noah@laptop.org>, C. Scott Ananian <cscott@laptop.org>
;Date : August 8, 2007
;Date : August 12, 2007


Our system for update verification manipulates four kinds of data:
Our system for update verification manipulates five kinds of data:
envelopes, keys, credentials, and manifests.
envelopes, keys, credentials, directory objects, and manifests.


Presently, we intend to use JSON as a transfer encoding because we only need to
Presently, we intend to use JSON as a transfer encoding because we
often would like to represent structured data: trees, lists, tuples.
represent tree-structured data and because we want an encoding that is easily
JSON is a "modern S-expression" format which is easily and safely
and safely parseable in pure Python and in other languages.
parseable in Python and other languages.


In security-sensitive applications, we will use a "canonical JSON".
If we learn that we need a canonical encoding, a more space-efficient encoding,
Canonical JSON is restricted such that there is only one possible
or an encoding that can be more efficiently streamed, then we will consider
encoding of a given data structure: whitespace is normalized, keys in
alternatives.
mappings are ordered, and string encoding is restricted. Canonical
JSON can be read by any standard JSON parser, although
security-conscious applications will want to verify that the
restrictions of canonical JSON format are observed. Canonical JSON
format is fully specified in another document.

If we learn that we need a streamable or more space-efficient encoding,
then we will consider alternatives.




==Envelopes==
==Envelopes==


Envelopes are tuples of the form
Envelopes are fundamental self-describing objects. They contain an
opaque string describing the type of the object, semi-opaque integers
providing version information, and json-encoded contents. Envelopes
are represented by tuples (JSON 4-element lists) of the form:


(version, type, data) :: (Integer, String, Value)
(type, version, subversion, data) :: (String, Integer, Integer, Value)
^ ^ ^
^ ^ ^ ^
| | \________ a JSON value in a format governed by (version, type)
| | | \________ a JSON value in a format governed by
| \______ a hint about how to interpret the data field that follows.
| | | (type, version, subversion)
\_________ an opaque version identifier; see below.
| | \____ transparent version identifier; see below.
| \______ an opaque version identifier; see below.
\_________ a hint about how to interpret the data field that follows.


The version identifier is intended to specify a minimal dialect that must be
supported by compliant clients. Backwards-compatible extensions of this dialect
may be included in an enveloped value without changing the version identifier.


Version information is opaque, and specifies the data format that must
Note: Noah suggests make the version identifier transparent and declare it to
be supported by compliant clients. Version numbers are incomparable:
be an integer with a a monotonically increasing value. I'm okay with this,
only equality testing of versions is valid.
but since we anticipate changing the identifer only when making

backwards-incompatible protocol changes, I think it might as well be opaque.
On the other hand, the subversion number is transparent, and is
defined to be monotonically increasing. It is designed to allow the
identification of backwards-compatible protocol extensions.

In security-critical applications, only exact matches on version and
subversion information should be allowed, and any unknown keys in maps
or entries in lists should trigger a fatal parsing error.

In non-security-critical applications, version numbers must still be
matched exactly, but a subversion number greater than or equal to the
one(s) known to be parseable may be allowed. Unknown map keys or list
items should be silently ignored.

In the remainder, the version and subversion information will often be
specified as a dotted integer. For example, "1.2" represents version
1 and subversion 2.


==Keys==
==Keys==


Keys are envelopes with version 1 and type "key". A key's datum consists of a
Keys are envelopes with type "key" and version 1.0. A key's datum consists of a
tuple (JSON 3-element list):
tuple


(algorithm, fingerprint, key) :: (String, String, String)
(algorithm, fingerprint, key) :: (String, String, String)
Line 43: Line 70:
| \______ an opaque string to use to refer to the key
| \______ an opaque string to use to refer to the key
\_________ an opaque algorithm identifier;
\_________ an opaque algorithm identifier;

Algorithms should be careful to specify a canonical representation for
the key and fingerprint. In particular, if the key data and
fingerprint are represented as hex digits, then (for example) only
lowercase hex digits should be allowed.


In practice, we might see a key encoded as:
In practice, we might see a key encoded as:
[1, "key", ["rsa", "12508713460986140897", "AE1908579871250981230896109761"]]
["key", 1, 0, ["rsa-2048-pub", "12508713460986140897", "ae1908579871250981230896109761"]]




==Credentials==
==Credentials==


A credential is an envelope with version 1 and type "credential". Conceptually,
A credential is an envelope with type "credential" and version 1.0.
its datum consists of a relation whose tuples are signatures of the form:
Conceptually, its datum consists of a relation whose tuples are
signatures of the form:


(signature-algorithm, key, message, signature)
(signature-algorithm, key, message, signature)


However, in practice, the message is defined implicitly by context (i.e. as the
However, in practice, the message is defined implicitly by context
hash of the manifest with which this credential is paired) and we wish to use
(i.e. the manifest with which this credential is paired) and we wish
the credential/manifest system to help detect accidental corruption as early as
to use the credential/manifest system to help detect accidental
possible, including corruption of the manifest.
corruption as early as possible, including corruption of the manifest.


Therefore, we actually propose to represent the datum of a manifest credential
Therefore, we actually propose to represent the datum of a message credential
as a list of tuples, where each tuple (JSON 5-item list) is:
as a tuple of


(hash-alg, sig-alg, expected-hash, key-identifier, signature)
(hash-alg, sig-alg, expected-hash, key-identifier, signature)


Each tuple in the list specifies a different signature algorithm. A
We include the manifest hash as a convenient way to check that the manifest
valid credential must have at least one tuple in the list, and *all*
was not accidentally damaged in transit (e.g. by a flash corruption bug).
signatures must verify for the credential to be valid.
(Implementations may chose to verify only a subset of the provided
signatures, either for performance reasons or to allow migration to
future preferred signature algorithms, but any invalid signature
invalidates the credential.) Since the list of tuples is nominally
unordered, we will require that the encoding of the credential list
tuples in lexicographic order. That is, tuples are sorted first
lexicographically on hash-alg, then tuples with the same hash-alg are
lexicographically sorted by sig-alg, and further ties are broken by
the expected hash, key-identifier, and (if necessary) signature field.
All tuples must be unique: the credential is invalid if the same
(hash-alg, sig-alg, expected-hash, key-identifier, signature) tuple is
present more than once.

We include the message hash as a convenient way to verify the
(implicit) message to which this signature applies, as well as to
check that the manifest was not accidentally damaged in transit
(e.g. by a flash corruption bug).


The key fingerprint identifies the key that should be used to verify the
The key fingerprint identifies the key that should be used to verify the
Line 72: Line 123:


The hash-algorithm and signature-algorithm are used to verify the signature
The hash-algorithm and signature-algorithm are used to verify the signature
against the *computed* manifest-hash (not the one cached in the signature
against the *computed* message hash (not the 'expected-hash' of the
credential tuple). The signature algorithm may use a different hash
message).
function or different message encapsulation, so it is not necessarily
expected that the hash which is signed is the same as the
'expected-hash' field.


Finally, the signature itself is opaque.
Finally, the signature itself is opaque.

As with keys, the specific hash and signature algorithms should be
careful to unambiguously encoded the hashed data and signature. For
example, if hex digits are used, only lowercase hex digits should be
permitted.


In actual JSON, we might imagine that such a message would look like:
In actual JSON, we might imagine that such a message would look like:
[1, "credential", [ ["sha1", "rsa", "DEADBEEF", "123MYKEY", "0PAQU3S1G"],
["credential", 1, 0, [ ["sha-256", "rsa-2048", "deadbeef", "123mykey", "0paqu3s1g"],
["whirlpool", "ecc", "123095", "9912KEY", "8567A"] ] ]
["whirlpool-XXX", "ecc-YYY", "123095", "9912key", "8567a"] ] ]





==Manifests==
==Manifests==


Manifests are the most complicated piece of the system because they have
Manifests describe the contents of a filesystem. Their primary use is
in describing software releases and allowing upgrades between these.
different conceptual and transfer-encoded forms.


In version 1.0 of the manifest format, we have decided to omit
Conceptually, a manifest's datum is a constraint-set. Implementations of this
unnecessary file metadata from the manifest. This will allow more
constraint language are free to support checking as many or as few kinds of
efficient updates, because we don't have to synchronize irrelevant
constraint as they wish. Clearly, though, adequate verification of a file-system
metadata. Omitted metadata includes file creation, modification, and
against a manifest depends on both the presence of a constraint set that
access times, and hard link and xattr information. We also use
sufficiently constrains the properties of the file-system being checked and a
numeric user and group ids, which may not correlate with the target
correct and complete implementation of a checker for that constraint-set.
system if the user/group databases are not synchronized. Our design
is sufficient for the present; we will consider adding additional
metadata to the manifest in future versions if a need can be
demonstrated.


(Hard link information is omitted because (a) it seems to be
This model controls the implementation cost of both the verification machine
unnecessary at the present, and (b) it is difficult to properly
(and the manifest generation tool) while allowing great flexibility in the
support in our current containerization copy-on-write framework.
underlying properties being verified and in the degree to which we compromise
Manifest-generation tools should check that hard links are not used.)
our implementation goals in the face of time pressure.


A manifest is an envelope with type "manifest" and version 1.0.
===Details===
A manifest's datum is a 3-tuple (JSON 3-element list):


(hash-algorithm-list, object-table, credential) ::
There are several basic constraints we wish to support. These include
(List<String>, Map<String,Directory>, Credential)
constraints on file content, file type (normal, hardlink, symlink, device, ...),
permissions, ownership, and on the contents of directories.


The hash-algorithm-list is an ordered list of opaque algorithm
Manifests are envelopes with version 1 and type "manifest". A manifest's datum
identifiers. There must be at least one item in the list, and the
consists of a tuple of a dictionary of optimization hints and a dictionary of
first hash algorithm in the list is special: we'll call it hash-alg1.
constraints. Structurally, the datum looks like
The object table contains envelopes, keyed by the hash-alg1 hash of
their canonical JSON encoding. The credential is a "credential"
(Hint -> Value, Constraint -> (Query, Result))
envelope as described above, and gives the signature information for
this manifest. The credential identifies (using its expected-hash
field) and signs a 'root' directory object in the manifest, and for
this reason at least one of the 'hash-alg's mentioned in the
credential tuples must match hash-alg1.

The object hash table must include the 'root' object, but it need not
be complete: keys and data may be omitted from the manifest object if
they can be obtained by other means. For example, only the root
directory need be included in a manifest transmitted for the purpose
of validating a directory tree: hashes for the subdirectories can be
recomputed from the filesystem contents and compared to those named in
the root directory object. Similarly, one may chose to store
subdirectory manifest objects in special hidden files in the
directories themselves, either as bare JSON-encoded directory objects,
or as full-fledged manifests with their own object hash tables. The
manifest format allows a completely self-contained manifest, however,
where all subdirectory manifests are contained in the object hash
table.

Although objects may be omitted from the object table, it should
be considered an error to include extraneous objects not transitively
referenced from the root directory object.

In version 1.0 of the manifest format, there must be exactly two
hash algorithms specified, and they must be 'sha-256' and
'whirlpool-XXX'. (Future subversions of the manifest format may add new
algorithms.)

Note that adding a bogus hash algorithm to the hash-algorithm-list
will allow injection of arbitrary data into the manifest, which could
comprise an attack on the hash algorithm(s) used. For this reason,
security-conscious applications will require a fixed algorithm list
(as version 1.0 does). Migration to a new hash algorithm requires
first distributing an upgrade which recognizes a new manifest version
(either a new subversion, if the hash algorithm list is simply
extended, or a new version if hash algorithms are removed). After the
upgrade, manifests with the new version (and thus hash algorithm list)
can be published.

Similarly, a manifest with N listed hash algorithms admits N! possible
variants where the algorithms are listed in a different order. Since
N! grows much faster than 2^N, it is wise to keep the list of hash
algorithms small.

This attack does not apply to hash and signature algorithms in the
credential, since the credential object is not itself hashed or
signed. Any desired set of hash and signature algorithms can be used
there, subject to the requirement that at least one must match the
hash algorithm used in the object table. It is recommended that the
hash algorithms used in the credential exactly match those named in the
manifest header.

===Directory objects===

Directory objects describe the contents of a single directory; a
manifest is built from a collection of directory objects.

A directory object is an envelope with type "dir" and version
1.0. The datum is a map:
{ filename1: file-descriptor1, ..., filenameN: file-descriptorN } ::
Map<String, Map<String, Value>>

The map keys are filenames for the contents in the directory. Path
components are not included in these keys. The values are maps
describing the file entry, with the following keys:

m: numeric POSIX mode of the entry
u: numeric UID of the entry's owner
g: numeric GID of the entry's group
d: numeric device (for entries where mode indicates a character or
block device ONLY)
[This is 'st_rdev' in the entry's lstat]
s: file size (for entries where mode indicates a regular file ONLY)
l: link target (for entries where mode indicates a symlink ONLY)
h: an ordered list of hashes for this entry (for regular files or
directories ONLY)

The list of hashes under key 'h' should match the order of
hash-algorithm-list in a containing manifest object. The hash for a
entry naming a subdirectory is defined to be the hash of the
"canonical JSON"-encoded directory object describing that directory.

A small example of a directory object might be:
["dir", 1, 0, {
"bar": { g: 1000, h: [ "e242ed3bffccdf271b7fbaf34ed72d089537b42f", "c157a79031e1c40f85931829bc5fc552" ], m: 33188, s: 4, u: 1000 },
"fifo": { g: 1000, m: 4516, u: 1000 },
"frobnitz": { g: 1000, l: "bar", m: 41380, u: 1000 },
"null": { d: 2049, g: 1000, m: 8612, u: 1000 },
"subdir": { g: 1000, h: [ "da39a3ee5e6b4b0d3255bfef95601890afd80709", "d41d8cd98f00b204e9800998ecf8427e" ], m: 16804, u: 1000 }
} ]
(Of course, the JSON encoding should use normalized whitespace and
quote all map keys to actually be canonical JSON.)


A simple manifest containing this directory might be:
Here "hints" are opaque strings that may be given some uninterpreted data as an
["manifest", 1, 0, [
argument (e.g. "invert-filesystem" : true). These are provided because checking
[ "sha1", "md5" ], {
some of the global constraints may rely on expensive-to-compute structures such
"8c98dc63b862e0901263f5b0304480487a424248":
as an inverted index that maps inodes to the paths that point to them.
["dir", 1, 0, {
"bar": { g: 1000, h: [ "e242ed3bffccdf271b7fbaf34ed72d089537b42f", "c157a79031e1c40f85931829bc5fc552" ], m: 33188, s: 4, u: 1000 },
"fifo": { g: 1000, m: 4516, u: 1000 },
"frobnitz": { g: 1000, l: "bar", m: 41380, u: 1000 },
"null": { d: 2049, g: 1000, m: 8612, u: 1000 },
"subdir": { g: 1000, h: [ "6052a04eab133434e4c418261bc074e8c4446346", "ce57e96b0e70715b8bec635a34a0ca54" ], m: 16804, u: 1000 }
} ],
"6052a04eab133434e4c418261bc074e8c4446346": [dir,1,0,{}]
},
["credential", 1, 0, [
["sha1", "rsa-2048", "8c98dc63b862e0901263f5b0304480487a424248", "123MYKEY", "0PAQU3S1G"]
] ] ] ]


The constraint itself dictionary is also made slightly more complicated in the
name of speed and memory usage. While conceptually, we wish to store a ternary
relation with columns


==Notes on resource-bounded parsing==
(constraint-algorithm, query, expected-result)


Small bounds can be enforced when parsing directory entry descriptions
in practice, we expect that there will very few (~5) constraint algorithms and
in directory objects: the numeric components can be limited to 2^32=10 digits
very many (~100000) (query, expected-result) pairs. Hence we suggest
(2^64=20 digits for the 'size' field), and the string components
optimizing for this case by grouping the relation's tuples by
(hash, symlink, entry key) can be limited to 256 characters.
constraint-algorithm, as shown above in the structural description.
It is a little more problematic to limit the number of entries
"reasonably" found in a directory, but some empirical bound can be set
here as well.


Credentials can be resource-bounded as well, using small bounds on the
A small example of a manifest might be:
number of signatures included and the size of the hash and signature
algorithm descriptors and the signature data.


For manifests, small bounds on the hash algorithm list are wise to limit
[1, "manifest", [
the possible permutations which can be generated. The manifest
{"invert-filesystem": ["/links"]},
credential is resource-limited as described above. This leaves the
{"file-sha1" : [ ["/etc/passwd", "BLAH18AK"],
'object-table' in the manifest, which will have as many keys as there
["/bin/bash", "LA81985ED"]],
are directories in the filesystem. A limit on the aggregate size of
"permissions" : [ ["/my/secret", 448] ],
the object table is probably most reasonable.
"same-inode" : [ [["/links/1", "/links/2", "/links/3"], null] ],
"dir-contains" : [ ["/", ["bin", "sbin"]] ],
"update-excludes": [ ["/security", null] ]
}]]

Revision as of 16:33, 14 August 2007

Author
Michael Stone <michael@laptop.org>, Noah Kantrowitz <noah@laptop.org>, C. Scott Ananian <cscott@laptop.org>
Date
August 12, 2007

Our system for update verification manipulates five kinds of data: envelopes, keys, credentials, directory objects, and manifests.

Presently, we intend to use JSON as a transfer encoding because we often would like to represent structured data: trees, lists, tuples. JSON is a "modern S-expression" format which is easily and safely parseable in Python and other languages.

In security-sensitive applications, we will use a "canonical JSON". Canonical JSON is restricted such that there is only one possible encoding of a given data structure: whitespace is normalized, keys in mappings are ordered, and string encoding is restricted. Canonical JSON can be read by any standard JSON parser, although security-conscious applications will want to verify that the restrictions of canonical JSON format are observed. Canonical JSON format is fully specified in another document.

If we learn that we need a streamable or more space-efficient encoding, then we will consider alternatives.


Envelopes

Envelopes are fundamental self-describing objects. They contain an opaque string describing the type of the object, semi-opaque integers providing version information, and json-encoded contents. Envelopes are represented by tuples (JSON 4-element lists) of the form:

 (type, version, subversion, data) :: (String, Integer, Integer, Value)
    ^       ^       ^        ^
    |       |       |        \________ a JSON value in a format governed by
    |       |       |                   (type, version, subversion)
    |       |       \____ transparent version identifier; see below.
    |       \______ an opaque version identifier; see below.
    \_________ a hint about how to interpret the data field that follows.


Version information is opaque, and specifies the data format that must be supported by compliant clients. Version numbers are incomparable: only equality testing of versions is valid.

On the other hand, the subversion number is transparent, and is defined to be monotonically increasing. It is designed to allow the identification of backwards-compatible protocol extensions.

In security-critical applications, only exact matches on version and subversion information should be allowed, and any unknown keys in maps or entries in lists should trigger a fatal parsing error.

In non-security-critical applications, version numbers must still be matched exactly, but a subversion number greater than or equal to the one(s) known to be parseable may be allowed. Unknown map keys or list items should be silently ignored.

In the remainder, the version and subversion information will often be specified as a dotted integer. For example, "1.2" represents version 1 and subversion 2.

Keys

Keys are envelopes with type "key" and version 1.0. A key's datum consists of a tuple (JSON 3-element list):

 (algorithm, fingerprint, key)  :: (String, String, String)
      ^          ^         ^
      |          |         \______ an opaque string specifiying the actual key
      |          \______ an opaque string to use to refer to the key 
      \_________ an opaque algorithm identifier; 

Algorithms should be careful to specify a canonical representation for the key and fingerprint. In particular, if the key data and fingerprint are represented as hex digits, then (for example) only lowercase hex digits should be allowed.

In practice, we might see a key encoded as:

 ["key", 1, 0, ["rsa-2048-pub", "12508713460986140897", "ae1908579871250981230896109761"]]


Credentials

A credential is an envelope with type "credential" and version 1.0. Conceptually, its datum consists of a relation whose tuples are signatures of the form:

 (signature-algorithm, key, message, signature)

However, in practice, the message is defined implicitly by context (i.e. the manifest with which this credential is paired) and we wish to use the credential/manifest system to help detect accidental corruption as early as possible, including corruption of the manifest.

Therefore, we actually propose to represent the datum of a message credential as a list of tuples, where each tuple (JSON 5-item list) is:

 (hash-alg, sig-alg, expected-hash, key-identifier, signature)

Each tuple in the list specifies a different signature algorithm. A valid credential must have at least one tuple in the list, and *all* signatures must verify for the credential to be valid. (Implementations may chose to verify only a subset of the provided signatures, either for performance reasons or to allow migration to future preferred signature algorithms, but any invalid signature invalidates the credential.) Since the list of tuples is nominally unordered, we will require that the encoding of the credential list tuples in lexicographic order. That is, tuples are sorted first lexicographically on hash-alg, then tuples with the same hash-alg are lexicographically sorted by sig-alg, and further ties are broken by the expected hash, key-identifier, and (if necessary) signature field. All tuples must be unique: the credential is invalid if the same (hash-alg, sig-alg, expected-hash, key-identifier, signature) tuple is present more than once.

We include the message hash as a convenient way to verify the (implicit) message to which this signature applies, as well as to check that the manifest was not accidentally damaged in transit (e.g. by a flash corruption bug).

The key fingerprint identifies the key that should be used to verify the signature.

The hash-algorithm and signature-algorithm are used to verify the signature against the *computed* message hash (not the 'expected-hash' of the credential tuple). The signature algorithm may use a different hash function or different message encapsulation, so it is not necessarily expected that the hash which is signed is the same as the 'expected-hash' field.

Finally, the signature itself is opaque.

As with keys, the specific hash and signature algorithms should be careful to unambiguously encoded the hashed data and signature. For example, if hex digits are used, only lowercase hex digits should be permitted.

In actual JSON, we might imagine that such a message would look like:

 ["credential", 1, 0, [ ["sha-256", "rsa-2048", "deadbeef", "123mykey", "0paqu3s1g"], 
                        ["whirlpool-XXX", "ecc-YYY", "123095", "9912key", "8567a"] ] ]


Manifests

Manifests describe the contents of a filesystem. Their primary use is in describing software releases and allowing upgrades between these.

In version 1.0 of the manifest format, we have decided to omit unnecessary file metadata from the manifest. This will allow more efficient updates, because we don't have to synchronize irrelevant metadata. Omitted metadata includes file creation, modification, and access times, and hard link and xattr information. We also use numeric user and group ids, which may not correlate with the target system if the user/group databases are not synchronized. Our design is sufficient for the present; we will consider adding additional metadata to the manifest in future versions if a need can be demonstrated.

(Hard link information is omitted because (a) it seems to be unnecessary at the present, and (b) it is difficult to properly support in our current containerization copy-on-write framework. Manifest-generation tools should check that hard links are not used.)

A manifest is an envelope with type "manifest" and version 1.0. A manifest's datum is a 3-tuple (JSON 3-element list):

 (hash-algorithm-list, object-table, credential) ::
                     (List<String>, Map<String,Directory>, Credential)

The hash-algorithm-list is an ordered list of opaque algorithm identifiers. There must be at least one item in the list, and the first hash algorithm in the list is special: we'll call it hash-alg1. The object table contains envelopes, keyed by the hash-alg1 hash of their canonical JSON encoding. The credential is a "credential" envelope as described above, and gives the signature information for this manifest. The credential identifies (using its expected-hash field) and signs a 'root' directory object in the manifest, and for this reason at least one of the 'hash-alg's mentioned in the credential tuples must match hash-alg1.

The object hash table must include the 'root' object, but it need not be complete: keys and data may be omitted from the manifest object if they can be obtained by other means. For example, only the root directory need be included in a manifest transmitted for the purpose of validating a directory tree: hashes for the subdirectories can be recomputed from the filesystem contents and compared to those named in the root directory object. Similarly, one may chose to store subdirectory manifest objects in special hidden files in the directories themselves, either as bare JSON-encoded directory objects, or as full-fledged manifests with their own object hash tables. The manifest format allows a completely self-contained manifest, however, where all subdirectory manifests are contained in the object hash table.

Although objects may be omitted from the object table, it should be considered an error to include extraneous objects not transitively referenced from the root directory object.

In version 1.0 of the manifest format, there must be exactly two hash algorithms specified, and they must be 'sha-256' and 'whirlpool-XXX'. (Future subversions of the manifest format may add new algorithms.)

Note that adding a bogus hash algorithm to the hash-algorithm-list will allow injection of arbitrary data into the manifest, which could comprise an attack on the hash algorithm(s) used. For this reason, security-conscious applications will require a fixed algorithm list (as version 1.0 does). Migration to a new hash algorithm requires first distributing an upgrade which recognizes a new manifest version (either a new subversion, if the hash algorithm list is simply extended, or a new version if hash algorithms are removed). After the upgrade, manifests with the new version (and thus hash algorithm list) can be published.

Similarly, a manifest with N listed hash algorithms admits N! possible variants where the algorithms are listed in a different order. Since N! grows much faster than 2^N, it is wise to keep the list of hash algorithms small.

This attack does not apply to hash and signature algorithms in the credential, since the credential object is not itself hashed or signed. Any desired set of hash and signature algorithms can be used there, subject to the requirement that at least one must match the hash algorithm used in the object table. It is recommended that the hash algorithms used in the credential exactly match those named in the manifest header.

Directory objects

Directory objects describe the contents of a single directory; a manifest is built from a collection of directory objects.

A directory object is an envelope with type "dir" and version 1.0. The datum is a map:

 { filename1: file-descriptor1, ..., filenameN: file-descriptorN } ::
                                      Map<String, Map<String, Value>>

The map keys are filenames for the contents in the directory. Path components are not included in these keys. The values are maps describing the file entry, with the following keys:

  m: numeric POSIX mode of the entry
  u: numeric UID of the entry's owner
  g: numeric GID of the entry's group
  d: numeric device (for entries where mode indicates a character or
                     block device ONLY)
     [This is 'st_rdev' in the entry's lstat]
  s: file size (for entries where mode indicates a regular file ONLY)
  l: link target (for entries where mode indicates a symlink ONLY)
  h: an ordered list of hashes for this entry (for regular files or
                                               directories ONLY)

The list of hashes under key 'h' should match the order of hash-algorithm-list in a containing manifest object. The hash for a entry naming a subdirectory is defined to be the hash of the "canonical JSON"-encoded directory object describing that directory.

A small example of a directory object might be:

 ["dir", 1, 0, {
   "bar": { g: 1000, h: [ "e242ed3bffccdf271b7fbaf34ed72d089537b42f", "c157a79031e1c40f85931829bc5fc552" ], m: 33188, s: 4, u: 1000 },
   "fifo": { g: 1000, m: 4516, u: 1000 },
   "frobnitz": { g: 1000, l: "bar", m: 41380, u: 1000 },
   "null": { d: 2049, g: 1000, m: 8612, u: 1000 },
   "subdir": { g: 1000, h: [ "da39a3ee5e6b4b0d3255bfef95601890afd80709", "d41d8cd98f00b204e9800998ecf8427e" ], m: 16804, u: 1000 }
 } ]

(Of course, the JSON encoding should use normalized whitespace and quote all map keys to actually be canonical JSON.)

A simple manifest containing this directory might be:

 ["manifest", 1, 0, [
   [ "sha1", "md5" ], {
     "8c98dc63b862e0901263f5b0304480487a424248":
       ["dir", 1, 0, {
         "bar": { g: 1000, h: [ "e242ed3bffccdf271b7fbaf34ed72d089537b42f", "c157a79031e1c40f85931829bc5fc552" ], m: 33188, s: 4, u: 1000 },
         "fifo": { g: 1000, m: 4516, u: 1000 },
         "frobnitz": { g: 1000, l: "bar", m: 41380, u: 1000 },
         "null": { d: 2049, g: 1000, m: 8612, u: 1000 },
         "subdir": { g: 1000, h: [ "6052a04eab133434e4c418261bc074e8c4446346", "ce57e96b0e70715b8bec635a34a0ca54" ], m: 16804, u: 1000 }
       } ],
     "6052a04eab133434e4c418261bc074e8c4446346": [dir,1,0,{}]
     },
   ["credential", 1, 0, [
     ["sha1", "rsa-2048", "8c98dc63b862e0901263f5b0304480487a424248", "123MYKEY", "0PAQU3S1G"]
   ] ] ] ]


Notes on resource-bounded parsing

Small bounds can be enforced when parsing directory entry descriptions in directory objects: the numeric components can be limited to 2^32=10 digits (2^64=20 digits for the 'size' field), and the string components (hash, symlink, entry key) can be limited to 256 characters. It is a little more problematic to limit the number of entries "reasonably" found in a directory, but some empirical bound can be set here as well.

Credentials can be resource-bounded as well, using small bounds on the number of signatures included and the size of the hash and signature algorithm descriptors and the signature data.

For manifests, small bounds on the hash algorithm list are wise to limit the possible permutations which can be generated. The manifest credential is resource-limited as described above. This leaves the 'object-table' in the manifest, which will have as many keys as there are directories in the filesystem. A limit on the aggregate size of the object table is probably most reasonable.