Contents manifest specification
- Author
- Michael Stone <michael@laptop.org>, Noah Kantrowitz <noah@laptop.org>
- Date
- August 8, 2007
Our system for update verification manipulates four kinds of data: envelopes, keys, credentials, and manifests.
Presently, we intend to use JSON as a transfer encoding because we only need to represent tree-structured data and because we want an encoding that is easily and safely parseable in pure Python and in other languages.
If we learn that we need a canonical encoding, a more space-efficient encoding, or an encoding that can be more efficiently streamed, then we will consider alternatives.
Envelopes
Envelopes are tuples of the form
(version, type, data) :: (Integer, String, Value) ^ ^ ^ | | \________ a JSON value in a format governed by (version, type) | \______ a hint about how to interpret the data field that follows. \_________ an opaque version identifier; see below.
The version identifier is intended to specify a minimal dialect that must be supported by compliant clients. Backwards-compatible extensions of this dialect may be included in an enveloped value without changing the version identifier.
Note: Noah suggests make the version identifier transparent and declare it to be an integer with a a monotonically increasing value. I'm okay with this, but since we anticipate changing the identifer only when making backwards-incompatible protocol changes, I think it might as well be opaque.
Keys
Keys are envelopes with version 1 and type "key". A key's datum consists of a tuple
(algorithm, fingerprint, key) :: (String, String, String) ^ ^ ^ | | \______ an opaque string specifiying the actual key | \______ an opaque string to use to refer to the key \_________ an opaque algorithm identifier;
In practice, we might see a key encoded as:
[1, "key", ["rsa", "12508713460986140897", "AE1908579871250981230896109761"]]
Credentials
A credential is an envelope with version 1 and type "credential". Conceptually, its datum consists of a relation whose tuples are signatures of the form:
(signature-algorithm, key, message, signature)
However, in practice, the message is defined implicitly by context (i.e. as the hash of the manifest with which this credential is paired) and we wish to use the credential/manifest system to help detect accidental corruption as early as possible, including corruption of the manifest.
Therefore, we actually propose to represent the datum of a manifest credential as a tuple of
(hash-alg, sig-alg, expected-hash, key-identifier, signature)
We include the manifest hash as a convenient way to check that the manifest was not accidentally damaged in transit (e.g. by a flash corruption bug).
The key fingerprint identifies the key that should be used to verify the signature.
The hash-algorithm and signature-algorithm are used to verify the signature against the *computed* manifest-hash (not the one cached in the signature message).
Finally, the signature itself is opaque.
In actual JSON, we might imagine that such a message would look like:
[1, "credential", [ ["sha1", "rsa", "DEADBEEF", "123MYKEY", "0PAQU3S1G"], ["whirlpool", "ecc", "123095", "9912KEY", "8567A"] ] ]
Manifests
Manifests are the most complicated piece of the system because they have different conceptual and transfer-encoded forms.
Conceptually, a manifest's datum is a constraint-set. Implementations of this constraint language are free to support checking as many or as few kinds of constraint as they wish. Clearly, though, adequate verification of a file-system against a manifest depends on both the presence of a constraint set that sufficiently constrains the properties of the file-system being checked and a correct and complete implementation of a checker for that constraint-set.
This model controls the implementation cost of both the verification machine (and the manifest generation tool) while allowing great flexibility in the underlying properties being verified and in the degree to which we compromise our implementation goals in the face of time pressure.
Details
There are several basic constraints we wish to support. These include constraints on file content, file type (normal, hardlink, symlink, device, ...), permissions, ownership, and on the contents of directories.
Manifests are envelopes with version 1 and type "manifest". A manifest's datum consists of a tuple of a dictionary of optimization hints and a dictionary of constraints. Structurally, the datum looks like
(Hint -> Value, Constraint -> (Query, Result))
Here "hints" are opaque strings that may be given some uninterpreted data as an argument (e.g. "invert-filesystem" : true). These are provided because checking some of the global constraints may rely on expensive-to-compute structures such as an inverted index that maps inodes to the paths that point to them.
The constraint itself dictionary is also made slightly more complicated in the name of speed and memory usage. While conceptually, we wish to store a ternary relation with columns
(constraint-algorithm, query, expected-result)
in practice, we expect that there will very few (~5) constraint algorithms and very many (~100000) (query, expected-result) pairs. Hence we suggest optimizing for this case by grouping the relation's tuples by constraint-algorithm, as shown above in the structural description.
A small example of a manifest might be:
[1, "manifest", [ {"invert-filesystem": ["/links"]}, {"file-sha1" : [ ["/etc/passwd", "BLAH18AK"], ["/bin/bash", "LA81985ED"]], "permissions" : [ ["/my/secret", 448] ], "same-inode" : [ [["/links/1", "/links/2", "/links/3"], null] ], "dir-contains" : [ ["/", ["bin", "sbin"]] ], "update-excludes": [ ["/security", null] ] }]]