Contents manifest specification

From OLPC
Revision as of 16:33, 14 August 2007 by CScott (talk | contribs) (My counter-proposal for the manifest specification.)
Jump to navigation Jump to search
Author
Michael Stone <michael@laptop.org>, Noah Kantrowitz <noah@laptop.org>, C. Scott Ananian <cscott@laptop.org>
Date
August 12, 2007

Our system for update verification manipulates five kinds of data: envelopes, keys, credentials, directory objects, and manifests.

Presently, we intend to use JSON as a transfer encoding because we often would like to represent structured data: trees, lists, tuples. JSON is a "modern S-expression" format which is easily and safely parseable in Python and other languages.

In security-sensitive applications, we will use a "canonical JSON". Canonical JSON is restricted such that there is only one possible encoding of a given data structure: whitespace is normalized, keys in mappings are ordered, and string encoding is restricted. Canonical JSON can be read by any standard JSON parser, although security-conscious applications will want to verify that the restrictions of canonical JSON format are observed. Canonical JSON format is fully specified in another document.

If we learn that we need a streamable or more space-efficient encoding, then we will consider alternatives.


Envelopes

Envelopes are fundamental self-describing objects. They contain an opaque string describing the type of the object, semi-opaque integers providing version information, and json-encoded contents. Envelopes are represented by tuples (JSON 4-element lists) of the form:

 (type, version, subversion, data) :: (String, Integer, Integer, Value)
    ^       ^       ^        ^
    |       |       |        \________ a JSON value in a format governed by
    |       |       |                   (type, version, subversion)
    |       |       \____ transparent version identifier; see below.
    |       \______ an opaque version identifier; see below.
    \_________ a hint about how to interpret the data field that follows.


Version information is opaque, and specifies the data format that must be supported by compliant clients. Version numbers are incomparable: only equality testing of versions is valid.

On the other hand, the subversion number is transparent, and is defined to be monotonically increasing. It is designed to allow the identification of backwards-compatible protocol extensions.

In security-critical applications, only exact matches on version and subversion information should be allowed, and any unknown keys in maps or entries in lists should trigger a fatal parsing error.

In non-security-critical applications, version numbers must still be matched exactly, but a subversion number greater than or equal to the one(s) known to be parseable may be allowed. Unknown map keys or list items should be silently ignored.

In the remainder, the version and subversion information will often be specified as a dotted integer. For example, "1.2" represents version 1 and subversion 2.

Keys

Keys are envelopes with type "key" and version 1.0. A key's datum consists of a tuple (JSON 3-element list):

 (algorithm, fingerprint, key)  :: (String, String, String)
      ^          ^         ^
      |          |         \______ an opaque string specifiying the actual key
      |          \______ an opaque string to use to refer to the key 
      \_________ an opaque algorithm identifier; 

Algorithms should be careful to specify a canonical representation for the key and fingerprint. In particular, if the key data and fingerprint are represented as hex digits, then (for example) only lowercase hex digits should be allowed.

In practice, we might see a key encoded as:

 ["key", 1, 0, ["rsa-2048-pub", "12508713460986140897", "ae1908579871250981230896109761"]]


Credentials

A credential is an envelope with type "credential" and version 1.0. Conceptually, its datum consists of a relation whose tuples are signatures of the form:

 (signature-algorithm, key, message, signature)

However, in practice, the message is defined implicitly by context (i.e. the manifest with which this credential is paired) and we wish to use the credential/manifest system to help detect accidental corruption as early as possible, including corruption of the manifest.

Therefore, we actually propose to represent the datum of a message credential as a list of tuples, where each tuple (JSON 5-item list) is:

 (hash-alg, sig-alg, expected-hash, key-identifier, signature)

Each tuple in the list specifies a different signature algorithm. A valid credential must have at least one tuple in the list, and *all* signatures must verify for the credential to be valid. (Implementations may chose to verify only a subset of the provided signatures, either for performance reasons or to allow migration to future preferred signature algorithms, but any invalid signature invalidates the credential.) Since the list of tuples is nominally unordered, we will require that the encoding of the credential list tuples in lexicographic order. That is, tuples are sorted first lexicographically on hash-alg, then tuples with the same hash-alg are lexicographically sorted by sig-alg, and further ties are broken by the expected hash, key-identifier, and (if necessary) signature field. All tuples must be unique: the credential is invalid if the same (hash-alg, sig-alg, expected-hash, key-identifier, signature) tuple is present more than once.

We include the message hash as a convenient way to verify the (implicit) message to which this signature applies, as well as to check that the manifest was not accidentally damaged in transit (e.g. by a flash corruption bug).

The key fingerprint identifies the key that should be used to verify the signature.

The hash-algorithm and signature-algorithm are used to verify the signature against the *computed* message hash (not the 'expected-hash' of the credential tuple). The signature algorithm may use a different hash function or different message encapsulation, so it is not necessarily expected that the hash which is signed is the same as the 'expected-hash' field.

Finally, the signature itself is opaque.

As with keys, the specific hash and signature algorithms should be careful to unambiguously encoded the hashed data and signature. For example, if hex digits are used, only lowercase hex digits should be permitted.

In actual JSON, we might imagine that such a message would look like:

 ["credential", 1, 0, [ ["sha-256", "rsa-2048", "deadbeef", "123mykey", "0paqu3s1g"], 
                        ["whirlpool-XXX", "ecc-YYY", "123095", "9912key", "8567a"] ] ]


Manifests

Manifests describe the contents of a filesystem. Their primary use is in describing software releases and allowing upgrades between these.

In version 1.0 of the manifest format, we have decided to omit unnecessary file metadata from the manifest. This will allow more efficient updates, because we don't have to synchronize irrelevant metadata. Omitted metadata includes file creation, modification, and access times, and hard link and xattr information. We also use numeric user and group ids, which may not correlate with the target system if the user/group databases are not synchronized. Our design is sufficient for the present; we will consider adding additional metadata to the manifest in future versions if a need can be demonstrated.

(Hard link information is omitted because (a) it seems to be unnecessary at the present, and (b) it is difficult to properly support in our current containerization copy-on-write framework. Manifest-generation tools should check that hard links are not used.)

A manifest is an envelope with type "manifest" and version 1.0. A manifest's datum is a 3-tuple (JSON 3-element list):

 (hash-algorithm-list, object-table, credential) ::
                     (List<String>, Map<String,Directory>, Credential)

The hash-algorithm-list is an ordered list of opaque algorithm identifiers. There must be at least one item in the list, and the first hash algorithm in the list is special: we'll call it hash-alg1. The object table contains envelopes, keyed by the hash-alg1 hash of their canonical JSON encoding. The credential is a "credential" envelope as described above, and gives the signature information for this manifest. The credential identifies (using its expected-hash field) and signs a 'root' directory object in the manifest, and for this reason at least one of the 'hash-alg's mentioned in the credential tuples must match hash-alg1.

The object hash table must include the 'root' object, but it need not be complete: keys and data may be omitted from the manifest object if they can be obtained by other means. For example, only the root directory need be included in a manifest transmitted for the purpose of validating a directory tree: hashes for the subdirectories can be recomputed from the filesystem contents and compared to those named in the root directory object. Similarly, one may chose to store subdirectory manifest objects in special hidden files in the directories themselves, either as bare JSON-encoded directory objects, or as full-fledged manifests with their own object hash tables. The manifest format allows a completely self-contained manifest, however, where all subdirectory manifests are contained in the object hash table.

Although objects may be omitted from the object table, it should be considered an error to include extraneous objects not transitively referenced from the root directory object.

In version 1.0 of the manifest format, there must be exactly two hash algorithms specified, and they must be 'sha-256' and 'whirlpool-XXX'. (Future subversions of the manifest format may add new algorithms.)

Note that adding a bogus hash algorithm to the hash-algorithm-list will allow injection of arbitrary data into the manifest, which could comprise an attack on the hash algorithm(s) used. For this reason, security-conscious applications will require a fixed algorithm list (as version 1.0 does). Migration to a new hash algorithm requires first distributing an upgrade which recognizes a new manifest version (either a new subversion, if the hash algorithm list is simply extended, or a new version if hash algorithms are removed). After the upgrade, manifests with the new version (and thus hash algorithm list) can be published.

Similarly, a manifest with N listed hash algorithms admits N! possible variants where the algorithms are listed in a different order. Since N! grows much faster than 2^N, it is wise to keep the list of hash algorithms small.

This attack does not apply to hash and signature algorithms in the credential, since the credential object is not itself hashed or signed. Any desired set of hash and signature algorithms can be used there, subject to the requirement that at least one must match the hash algorithm used in the object table. It is recommended that the hash algorithms used in the credential exactly match those named in the manifest header.

Directory objects

Directory objects describe the contents of a single directory; a manifest is built from a collection of directory objects.

A directory object is an envelope with type "dir" and version 1.0. The datum is a map:

 { filename1: file-descriptor1, ..., filenameN: file-descriptorN } ::
                                      Map<String, Map<String, Value>>

The map keys are filenames for the contents in the directory. Path components are not included in these keys. The values are maps describing the file entry, with the following keys:

  m: numeric POSIX mode of the entry
  u: numeric UID of the entry's owner
  g: numeric GID of the entry's group
  d: numeric device (for entries where mode indicates a character or
                     block device ONLY)
     [This is 'st_rdev' in the entry's lstat]
  s: file size (for entries where mode indicates a regular file ONLY)
  l: link target (for entries where mode indicates a symlink ONLY)
  h: an ordered list of hashes for this entry (for regular files or
                                               directories ONLY)

The list of hashes under key 'h' should match the order of hash-algorithm-list in a containing manifest object. The hash for a entry naming a subdirectory is defined to be the hash of the "canonical JSON"-encoded directory object describing that directory.

A small example of a directory object might be:

 ["dir", 1, 0, {
   "bar": { g: 1000, h: [ "e242ed3bffccdf271b7fbaf34ed72d089537b42f", "c157a79031e1c40f85931829bc5fc552" ], m: 33188, s: 4, u: 1000 },
   "fifo": { g: 1000, m: 4516, u: 1000 },
   "frobnitz": { g: 1000, l: "bar", m: 41380, u: 1000 },
   "null": { d: 2049, g: 1000, m: 8612, u: 1000 },
   "subdir": { g: 1000, h: [ "da39a3ee5e6b4b0d3255bfef95601890afd80709", "d41d8cd98f00b204e9800998ecf8427e" ], m: 16804, u: 1000 }
 } ]

(Of course, the JSON encoding should use normalized whitespace and quote all map keys to actually be canonical JSON.)

A simple manifest containing this directory might be:

 ["manifest", 1, 0, [
   [ "sha1", "md5" ], {
     "8c98dc63b862e0901263f5b0304480487a424248":
       ["dir", 1, 0, {
         "bar": { g: 1000, h: [ "e242ed3bffccdf271b7fbaf34ed72d089537b42f", "c157a79031e1c40f85931829bc5fc552" ], m: 33188, s: 4, u: 1000 },
         "fifo": { g: 1000, m: 4516, u: 1000 },
         "frobnitz": { g: 1000, l: "bar", m: 41380, u: 1000 },
         "null": { d: 2049, g: 1000, m: 8612, u: 1000 },
         "subdir": { g: 1000, h: [ "6052a04eab133434e4c418261bc074e8c4446346", "ce57e96b0e70715b8bec635a34a0ca54" ], m: 16804, u: 1000 }
       } ],
     "6052a04eab133434e4c418261bc074e8c4446346": [dir,1,0,{}]
     },
   ["credential", 1, 0, [
     ["sha1", "rsa-2048", "8c98dc63b862e0901263f5b0304480487a424248", "123MYKEY", "0PAQU3S1G"]
   ] ] ] ]


Notes on resource-bounded parsing

Small bounds can be enforced when parsing directory entry descriptions in directory objects: the numeric components can be limited to 2^32=10 digits (2^64=20 digits for the 'size' field), and the string components (hash, symlink, entry key) can be limited to 256 characters. It is a little more problematic to limit the number of entries "reasonably" found in a directory, but some empirical bound can be set here as well.

Credentials can be resource-bounded as well, using small bounds on the number of signatures included and the size of the hash and signature algorithm descriptors and the signature data.

For manifests, small bounds on the hash algorithm list are wise to limit the possible permutations which can be generated. The manifest credential is resource-limited as described above. This leaves the 'object-table' in the manifest, which will have as many keys as there are directories in the filesystem. A limit on the aggregate size of the object table is probably most reasonable.