Contents manifest specification

From OLPC
Revision as of 17:37, 28 August 2008 by CScott (talk | contribs) (→‎Notes on resource-bounded parsing: Document contents manifest handling of non-ASCII filenames.)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search
  This page is monitored by the OLPC team.
Author
Michael Stone <michael@laptop.org>, Noah Kantrowitz <noah@laptop.org>, C. Scott Ananian <cscott@laptop.org>
Date
August 12, 2007

Our system for update verification manipulates five kinds of data: envelopes, keys, credentials, directory objects, and manifests.

Presently, we intend to use JSON as a transfer encoding because we often would like to represent structured data: trees, lists, tuples. JSON is a "modern S-expression" format which is easily and safely parseable in Python and other languages.

In security-sensitive applications, we will use a "canonical JSON". Canonical JSON is restricted such that there is only one possible encoding of a given data structure: whitespace is normalized, keys in mappings are ordered, and string encoding is restricted. Canonical JSON can be read by any standard JSON parser, although security-conscious applications will want to verify that the restrictions of canonical JSON format are observed. Canonical JSON format is fully specified in another document.

If we learn that we need a streamable or more space-efficient encoding, then we will consider alternatives.


Envelopes

Envelopes are fundamental self-describing objects. They contain an opaque string describing the type of the object, semi-opaque integers providing version information, and json-encoded contents. Envelopes are represented by 3-tuples (JSON 3-element lists) of the form:

 (type, version, data) :: (String, Integer, Value)
    ^       ^       ^
    |       |       \________ a JSON value in a format governed by
    |       |                   (type, version)
    |       \______ an opaque version identifier; see below.
    \_________ a hint about how to interpret the data field that follows.


Version information is opaque, and specifies the data format that must be supported by compliant clients. Version numbers are incomparable: only equality testing of versions is valid.

Keys

Keys are envelopes with type "key" and version 1. A key's datum consists of a tuple (JSON 3-element list):

 (algorithm, fingerprint, key)  :: (String, String, String)
      ^          ^         ^
      |          |         \______ an opaque string specifiying the actual key
      |          \______ an opaque string to use to refer to the key 
      \_________ an opaque algorithm identifier; 

Algorithms should be careful to specify a canonical representation for the key and fingerprint. In particular, if the key data and fingerprint are represented as hex digits, then (for example) only lowercase hex digits should be allowed.

In practice, we might see a key encoded as:

 ["key", 1, ["rsa-2048-pub", "12508713460986140897", "ae1908579871250981230896109761"]]

Initially, the key data should be the same hexadecimal-encoded octet string given in the firmware key spec, the key fingerprint should be the trailing 64 characters of that string, as in the firmware signature spec, and the algorithm string should be "rsa-2048-pub" exactly. An encoding for private keys is not (yet) given.

Credentials

A credential is an envelope with type "sig" and version 1. Its datum consists of a list of signatures in the sig01 or sig02 formats. A valid credential must have at least one signature in the list, and *all* signatures must verify for the credential to be valid. (Implementations may chose to verify only a subset of the provided signatures, either for performance reasons or to allow migration to future preferred signature algorithms, but any invalid signature invalidates the credential.) Since the list of signatures is nominally unordered, we will require that the encoding of the credential lists signatures in lexicographic order. All signatures must be unique: the credential is invalid if the same signature string is present more than once.

Note that detached signatures are a pervasive implementation detail within bitfrost; we assume some external mechanism to associate this signature with the appropriate message.

In actual JSON, we might imagine that such a message would look like:

 ["sig", 1, [ "sig01: sha256 807b41...0001 c06dc..cb3c\n", "sig01: rmd160 807b41...0001 c06dc..1234\n" ]]

Contents Manifests

Contents manifests describe the contents of a filesystem. Their primary use is in describing software releases and allowing upgrades between these.

In version 1 of the contents format, we have decided to omit unnecessary file metadata from the manifest. This will allow more efficient updates, because we don't have to synchronize irrelevant metadata. Omitted metadata includes file creation, modification, and access times, and hard link and xattr information. We also use both numeric and textual user and group ids, which may require care on the target system if the user/group databases are not synchronized. Our design is sufficient for the present; we will consider adding additional metadata to the contents manifest in future versions if a need can be demonstrated.

(Hard link information is omitted because (a) it seems to be unnecessary at the present, and (b) it is difficult to properly support in our current containerization copy-on-write framework. Contents-generation tools should check that hard links are not used.)

A contents manifest is an envelope with type "manifest" and version 1.0. A contents manifest's datum is a list:

 directory-list :: List<Directory>

The directory-list is an list of directory envelopes, in post-order. The first directory corresponds to the root of the filesystem defined by the manifest, and it is followed by directory objects for its subdirectories (and their subdirectories) in the order in which they appear in the root directory object. For example, the directory objects for the directory tree:

  /
  /a
  /a/b
  /a/c
  /d
  /d/e
  /d/e/f
  /g

are written out in the order:

  [/, a, b, c, d, e, f, g]

The directory-list must include the 'root' object (/), but it need not be complete: subdirectory trees may be omitted from the manifest object if they can be obtained by other means. In the example above, 'f' could be omitted, and 'e' and 'f' could be omitted, but 'f' should not be included in the directory list if 'e' is omitted. Omitted subdirectories can be discovered by comparing the hash of each directory object encountered with the hash declared in its parent directory, and skipping omitted directories until the hashes match.

Only the root directory need be included in a contents manifest transmitted for the purpose of validating a directory tree: hashes for the subdirectories can be recomputed from the filesystem contents and compared to those named in the root directory object. Similarly, one may chose to store subdirectory contents manifest objects in special hidden files in the directories themselves, either as bare JSON-encoded directory objects, or as full-fledged contents manifests with their (sub)directory lists. The contents format allows a completely self-contained contents manifest, however, where all subdirectories are contained in the directory list.

Since subdirectories may be elided from the directory list, the credential authenticating a contents manifest should sign (a hash of) the root directory object (ie, '/' in the example above). The validity of the remaining subdirectories is assured by the hash tree construction of the directory objects. The credential need not share hash algorithms with the contents manifest.

Although objects may be omitted from the directory list, it should be considered an error to include extraneous directories not directly referenced from the included directory objects.

A contents manifest is just a convenience object for bundling a number of related directory objects; it should not be directly signed. Instead, the root directory object in the contents manifest should be the element which receives a signature.

Directory objects

Directory objects describe the contents of a single directory; a contents manifest is built from a collection of directory objects.

A directory object is an envelope with type "dir" and version 1. The datum is a 2-tuple (JSON 2-element list):

 (hash-algorithm-list, contents-map) ::
                     (List<String>, Map)

The hash-algorithm-list is an ordered list of opaque algorithm identifiers. There must be at least one item in the list. This hash algorithm list is used by the contents-map, which contains:

 { filename1: file-descriptor1, ..., filenameN: file-descriptorN } ::
                                      Map<String, Map<String, Value>>

The map keys are filenames for the contents in the directory. Path components are not included in these keys. The values are maps describing the file entry, with the following keys:

  m:  numeric POSIX mode of the entry
  u:  username of the entry's owner (string)
  u#: numeric UID of the entry's owner
  g:  group name of the entry's group (string)
  g#: numeric GID of the entry's group
  d:  numeric device (for entries where mode indicates a character or
                      block device ONLY)
      [This is 'st_rdev' in the entry's lstat]
  l: link target (for entries where mode indicates a symlink ONLY)
  h: an ordered list of hashes for this entry (for regular files or
                                               directories ONLY)
  dl: length of the canonical-JSON encoding of the directory object
      (for directories ONLY)
  ml: length of the canonical-JSON encoding of the manifest for this
      directory and its children (for directories ONLY)

The list of hashes under key 'h' should match the order of hash-algorithm-list. The hash for a entry naming a subdirectory is defined to be the hash of the "canonical JSON"-encoded directory object describing that directory; the contents and order of the hash-algorithm-list in the subdirectory must exactly match the one for this directory.

The 'dl' and 'ml' keys are for entries naming subdirectories and provide length information for random access to individual directory entries within a contents manifest. The 'ml' attribute is 16 plus the sum of (1 + dlen) for the referenced directory and all its children, where 'dlen' is the length of the JSON-encoded directory object.

Note that both numeric and symbolic user and group names are included. This is the same strategy used by the 'tar' utility. When extracting or verifying a manifest, certain of this information may be ignored: for example, activities in the user's home directory will typically belong to the user, while 'system' activities may belong to root. Care should be taken when generating manifests to ensure that user and group information has standardized content, even if that information will be ignored in the expected use, to avoid having multiple valid manifests for a given activity or library bundle.

In version 1 of the directory format, there must be exactly two hash algorithms specified in the hash-algorithm-list, and they must be 'sha-256' and 'ripemd-160', in that order.

Note that adding a bogus hash algorithm to the hash-algorithm-list will allow injection of arbitrary data into the directory object, which could comprise an attack on the hash algorithm(s) used. For this reason, security-conscious applications will require a fixed algorithm list (as version 1 does). Migration to a new hash algorithm requires first distributing an upgrade which recognizes a new directory version. After the upgrade, directories with the new version (and thus hash algorithm list) can be published.

Similarly, a directory with N listed hash algorithms admits N! possible variants where the algorithms are listed in a different order. Since N! grows much faster than 2^N, it is wise to keep the list of hash algorithms small.

This attack does not apply to hash and signature algorithms in a credential authenticating this directory object, since the credential object is not itself hashed or signed. Any desired set of hash and signature algorithms can be used in the credential.

A small example of a directory object might be:

 ["dir",1,[["sha-256","ripemd-160"],{
  "bar":{"g":"users","g#":1000,"h":["7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730","7d4e874a231f57b72509087d1e509942fdb6eac6"],"m":33188,"u":"olpc","u#":1000},
  "fifo":{"g":"users","g#":1000,"m":4516,"u":"olpc","u#":1000},
  "frobnitz":{"g":"users","g#":1000,"l":"bar","m":41471,"u":"olpc","u#":1000},
  "null":{"d":259,"g":"users","g#":1000,"m":8612,"u":"olpc","u#":1000},
  "subdir":{"dl":39,"g":"users","g#":1000,"h":["19b46e0c53a25994e5f5e4d133bf308df3f99a3879b7e954d75b51f8393523f1","75fc670c37b3d1aaf0f402c531dc98325862e8ae"],"m":16877,"ml":56,"u":"olpc","u#":1000}
 }]]

(Of course, the JSON encoding should remove whitespace to actually be Canonical JSON.)

A simple contents manifest containing this directory might be:

 ["manifest",1,[
   ["dir",1,[["sha-256","ripemd-160"],{
    "bar":{"g":"users","g#":1000,"h":["7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730","7d4e874a231f57b72509087d1e509942fdb6eac6"],"m":33188,"u":"olpc","u#":1000},
    "fifo":{"g":"users","g#":1000,"m":4516,"u":"olpc","u#":1000},
    "frobnitz":{"g":"users","g#":1000,"l":"bar","m":41471,"u":"olpc","u#":1000},
    "null":{"d":259,"g":"users","g#":1000,"m":8612,"u":"olpc","u#":1000},
    "subdir":{"dl":39,"g":"users","g#":1000,"h":["19b46e0c53a25994e5f5e4d133bf308df3f99a3879b7e954d75b51f8393523f1","75fc670c37b3d1aaf0f402c531dc98325862e8ae"],"m":16877,"ml":56,"u":"olpc","u#":1000}
   }]],
   ["dir",1,[["sha-256","ripemd-160"],{}]]
 ]]

Notes on non-ASCII filenames

Generally speaking, filenames on a UNIX system are assumed to be uninterpreted byte streams; they are explicitly *not* required to be valid in UTF-8 or any other encoding. However, for the Sugar environment we insist that all filenames are valid UTF-8 (since various services exposed over DBus require this). For this reason, some implementations (in particular old versions of fv.c from olpc-contents) will treat filenames and directory names as unicode strings, converting from UTF-8 and canonicalizing them when the contents manifest is created, and renormalizing before comparison during verification. This is problematic: it allows multiple encodings of the same name to match the same manifest.

A better solution (implemented in olpc-contents 2.5 and later) is to insist on normalized filenames during contents manifest creation, and allow only exact comparisons during verification.

Notes on resource-bounded parsing

Small bounds can be enforced when parsing directory entry descriptions in directory objects: the numeric components can be limited to 2^32=10 digits (2^64=20 digits for the 'size' field), and the string components (hash, symlink, entry key) can be limited to 256 characters. It is a little more problematic to limit the number of entries "reasonably" found in a directory, but some empirical bound can be set here as well.

Credentials can be resource-bounded as well, using small bounds on the number of signatures included and the size of the hash and signature algorithm descriptors and the signature data.

For manifests, small bounds on the hash algorithm list are wise to limit the possible permutations which can be generated. The 'directory-list' in the manifest will have as many elements as there are directories in the filesystem. However, the directory list can be directly checked as the entries are parsed, requiring space proportional only to the depth of the directory tree (since we must keep around subdirectory hashes until we parse the subdirectory's object). Thus, a limit on directory depth can be used instead of limits on aggregate size of the manifest.