Olpcfs

From OLPC
Revision as of 15:39, 17 March 2008 by 18.85.46.232 (talk) (Flesh out attribute types, examples, and inodes.)
Jump to: navigation, search

This document presents a design for a new filesystem to support:

  1. OLPC's journal and bulletin board UI abstractions (see also the second-generation journal design)
  2. Bitfrost P_SF_RUN and P_SF_CORE protections (current implementation)

In our first-generation software, the journal was based with the "datastore" and Bitfrost protections were intended to be provided with a copy-on-write filesystem, originally based on vserver. This new filesystem, which I am calling "olpcfs", attempts to unify these implementations based on a powerful distributed fully-persistent versioned filesystem abstraction.

Design Goals

In this section we enumerate a number of the design goals for the olpcfs filesystem, based on our experiences with the first-generation datastore.

POSIX semantics

Our first-generation datastore was a bespoke system primarily accessible through a Python API. The on-disk representation was opaque, and users frequently expressed frustration at their inability to make journal contents interact with non-XO systems. For example, simply locating a desired file in the datastore was difficult, as files were stored on disk named only by content hashes. In addition, making ordinary Linux applications work on the XO was complicated by the fact that saving and loading objects from the datastore required the use of our unique API.

A primary design goal for olpcfs is thus to have a sane and reasonable POSIX semantics, so that unmodified applications using the POSIX APIs (ie, linux, Windows, MacOS, BeOS,...) can interact sensibly with the journal. We do not need to export every functionality of olpcfs over POSIX APIs, but we strive to. (An tip of the hat here goes to Plan 9, which taught the virtues of uniform access methods.)

In particular, the journal should be able to direct existing applications to open a document from the datastore by passing in a POSIX pathname to the document contents, and applications should be able to save a file to an appropriate place (say, '~/Documents') and have the new content naturally show up in the Journal. These documents may not have the rich metadata associated with them which native XO applications can provide, but the basic journal features should "just work".

Content-addressable

Cloud storage is emerging as an elegant method of providing globally-accessible document stores, which are key to providing the uniform location-independent model of sharing Sugar aims to provide. When documents are stored into a global pool, naming each uniquely and consistently becomes a concern. Content-addressable storage (CAS) provides a solution to these issues by using (a function of) the content itself as a name. There are many implementations of CAS floating around; we've been most influenced by the emerging XAM standard API, but Sun's Project Honeycomb, Apache's Jackrabbit and (again) Plan 9's Venti are all possible backing storage for a CAS-based olpcfs filesystem.

Continuous versioning

Many versioned filesystems (for example, Fossil, ext3cow, WAFL, and AFS) use snapshot-based versioning. Rather than store every version of every file, snapshots are taken at regular intervals, and only the versions present at the time of a snapshot are accessible.

This is not consistent with the user interaction model underlying the XO, which strives to allow *every* action to be revokable, whether or not a snapshot intervened, to support painless exploration and discovery. It should be impossible for a child to "break" his machine in a way which is not easy fixed — or which requires losing all work since the last snapshot in order to fix.

Further, snapshots are usually taken of a full filesystem. Instead, we want fine-grained browsing of revisions made to individual documents. Ultimately we would like to empower applications to be able to intelligently display and merge revisions; fine-grained storage of the modification chain is an important enabler of this capability.

Recent work has shown that continuous versioning can be implemented at reasonable cost. Apple's Time Machine, and the experimental systems CVFS, VersionFS, Wayback, and PersiFS have demonstrated that this goal is achievable. The on-disk costs of continuous versioning will be mitigated by aggressively pruning unneeded versions at application-level.

Fully-persistent versions

Many versioned filesystems are partially persistent; that is, they support modifications only to the "most recent" version of a file, although all versions are available for read access. Fully-persistent data structures also allow modification of any version. This is obviously necessary to enable distributed editing, since independent parties are not guaranteed to be able to synchronize their edits. The "P_SF_RUN" Bitfrost protection and OLPC update mechanism also require full persistence: new base system images are independent roots for modification, and a user can switch back and forth from one base system image to the other, potentially making (independent) changes to each.

As an aside, other desirable properties may include data structures which are functional or confluently persistent. Functional data structures are immutable after creation, which is a nice property for the CAS backend but not strictly necessary on local disk. Confluently persistent data structures allows a subtree from version A to be efficiently copied into version B. Confluent persistence could be desirable to allow changes to one base system image to be applied to another; but hyper-efficiency for this operation is probably unnecessary for our system. (An unpublished paper by Demaine, Langerman, and Price contains a good discussion of these varieties of persistence, with references and an several examples of functional and confluently-persistent data structures; when the paper is accepted hopefully I'll be able to cite it here.)

POSIX API

In this section we'll describe how files, metadata, and indices are exposed through the standard POSIX API. To some degree, this is independent of actual implementation: one can imagine several different on-disk representations that export the given API, with different costs and properties. We expect that the first implementations will use FUSE on top of an on-disk representation substantially similar to the API exported. An intermediate representation may build on a CAS backend to perform the actual storage; an advanced implementation may eventually achieve high performance via direct disk access. Regardless, the behavior described in this section constitutes the principal API used by legacy applications and the Journal to access content.

Extended Attributes

The basic primitive in olpcfs is a file, with a rich set of metadata. The metadata includes typical POSIX attributes such as creation time, modification time, size, type, and permission bits. As in the Be File System, however, application-specific tagged data can also be added to a file. There is a reasonable (256-character UTF-8) limit on tag size, but no limit to the amount of data stored in a tag. We will discuss uses for tags in the Suggested metadata fields section below. Tagged metadata are indexed, as in BFS and XAM. For uniformity of access, the actual contents of a file are stored as the value of a 'content' tag. (This notion of arbitrary-sized metadata could be considered similar to the notion of alternate data streams in NTFS or resource forks in HFS. Wikipedia contains a lengthy discussion.)

We export our metadata in two different formats. The first is for human browsing and existing filesystem exploration tools. We will present it by example. Let us say that there exists a file 'foo' with content 'bar\n' in the current directory, and that there is an extended attribute 'mime-type' on 'foo' with content 'text/plain'. Then:

$ ls
foo
$ cat foo
bar
$ ls foo/
ctime mtime size content mime-type xuid parent
$ cat foo/content
bar
$ cat foo/mime-type
text/plain
$ cat foo/size
4
$ cat foo/xuid
deadbeef

That is, a 'file' is treated as just a special type of directory, with internal nodes representing the various tagged metadata. For compatibility with legacy software, opening the 'file' without a trailing slash on the name returns you the value of its 'content' tag. The mode bits for 'foo' report it as a standard file, so the only real 'hack' here is the ability to cd into a 'file' and list its attributes. The uniform treatment of "alternate data streams" as files means that one could write a utility to combine markup and text (see below) and invoke it as:

$ ./combine-markup foo/content foo/markup > marked-up.html
$ echo 'text/html' > marked-up.html/mime-type

The metadata attributes are also exported as standard POSIX xattrs, which allows efficient backup/recovery with the standard zip, tar, and rsync tools (which support xattrs). Unfortunately, POSIX xattr support is not well integrated into the standard Unix toolkit. There are no standard tools (other than archiving tools) for manipulating xattrs, although there is a partial standard defining naming conventions and common tags for better interoperability. Further, POSIX does not export a read/write API for metadata, which makes it impossible to (say) use the 'grep' tool over metadata content. Our "file as directory" hack addresses these infelicities.

Indexing

At the root, an olpcfs filesystem exports indexes of metadata, which appear as simple directories:

$ ls /.index/size/
123  4  56  592
$ ls /.index/size/4/
cafebabe  deadbeef
$ ls /.index/size/4/cafebabe
ctime mtime size content mime-type xuid parent
$ cat /.index/size/4/cafebabe/content
bar

Three basic issues arise with this fundamental idea:

  • How are duplicate keys handled?
  • How are "strange" attribute values handled?

We address the presence of duplicate keys by having the leaf of each index be a directory. The contents of that directory are named by xuid and enumerate the different files on the system which match the query. When an index is created, an attribute can be tagged as unique, in which case an attempt to create a duplicate value will return EEXIST. The leaf nodes for unique indexes are files, rather than (1-entry) directories. Combined with path-valued attributes, the standard directory view of a filesystem is (more or less) just the index on the "path" attribute.

The second issue is exemplified by typical values of the "mime-type" attribute: "text/plain", "text/html", etc. Since leaf nodes of the index are named by the attribute value, how do we handle attributes with illegal filename characters (like '/')? And what would we do if the user attempted to index the "content" attribute, which may contain gigabytes of data for certain files?

Possible solutions might be to display files with ls that one couldn't actually access with the POSIX API, or to hide "bad" entries, or to only export indices that didn't exhibit these problems. These solutions result in irregular indices which aren't actually usable for anything except browsing; any "real" queries would have to be performed with an alternate API.

We have chosen instead to make the index as close to fully-functional as possible, even if this means that manual browsing is a little more difficult. We URL-encode problematic characters in attribute values, and truncate them to the first 128 bytes. This results in a consistent programmatic API, even if it makes directory listings a little uglier. A command-line tool to URI-encode an input string allows shell tools to directly access the index, even if the desired key is "funny". Only searches through large attribute values require use of an alternate query API.

Query API

A full query API allows arbitrary boolean combinations of key terms, as well as range and subset queries. This document won't attempt to nail this interface down precisely, since the goal is a robustly functional POSIX API, but a BFS or SQLish query delivered by ioctl() or getxattr/setxattr call is a reasonable basis for discussion. (A FUSE implementation of olpcfs wouldn't permit an ioctl-based API.) For example:

char querytag[TAGLEN];

getxattr("/.query",
         "SELECT .xuid WHERE mime-type = 'text/html' OR mime-type = 'text/plain'",
         querytag, TAGLEN);

while (1) {
  char xuid[XUIDLEN], pathbuf[PATHLEN];
  struct stat st;

  if (!getxattr("/.query-result", querytag, xuid, XUIDLEN))
    break; /* no more results */

  /* do something with the file, for example: */
  snprintf(pathbuf, PATHLEN, "/.index/xuid/%.*s", XUIDLEN, xuid);
  stat(pathbuf, &st);
}

Helper functions which handled mount points properly, etc, could be added to a library to make this interface easier to use. "Live queries" (another debt to BFS) could also be supported if the getxattr calls above are allowed to block until additional matches are found, and to return a special "no longer a match" result for XUIDs modified since the start of a query.

If a minimal API is desired, efficient compound searches can be performed entirely in userland given only a means for performing a range query on a index directory. An API could be crafted around this mechanism, instead of the direct statement query interface shown above. All query processing could then occur in an application library rather than inside the filesystem. The trade-off would be that live queries and atomicity guarantees would be much more difficult.

Multilevel indexes

The views of indices we've shown so far allow for simple exact-match queries on a single term. We can extend the power of this interface to allow exact match queries on multiple ANDed terms by exporting a .index/ directory in the result directory of every index query:

$ ls /.index/size/
123  4  56  592
$ ls /.index/size/4/
cafebabe  deadbeef
$ ls /.index/size/4/.index/ctime/
20080303T123456Z  20071225012345Z

Since "unique" index terms have files, not directories, as leaves, unique indexes must appear only as the last term of a multilevel query. This non-orthogonality is a slightly unfortunate, since it requires a query tool using the POSIX API to handle a special case: if at any point in traversing the query path you end up with a file instead of a directory, there are either 0 or 1 results to your query. Further checks should be done directly on the extended attributes of the 1 possible result. This special case doesn't seem to be too onerous; after all, prioritizing unique indices is basic query optimization.

Typed attributes

A survey of interesting file attributes yields several special cases of structured data which deserve support: numeric attributes, path-valued attributes, and set-valued attributes.

Numeric attributes are the simplest: attributes such as "file size" or "inode number" should be sorted by their numeric value, not lexicographically by their string representation. BFS handled this case by adding type information for every attribute, with a simple type system including integers and floating point values of various lengths in addition to (UTF-8) string-valued attributes.

We prefer to handle all attribute values uniformly as octet streams, and specify additional type hints when indexes are created. This raises the possibility that non-numeric data might be stored in a "numeric" attribute. The index can handle this case by either refusing to index files with unparsable attibutes, or by assigning some arbitrary ordering -- for example, that all unparsable attributes are sorted lexicographically among themselves, and that they are sorted *after* all parsable attributes. This allows us to support numeric-sorted indices without complicating our API with type information.

Set-valued attributes have multiple notional values. A good example is a "tags" field on a file, which may have the values "schoolwork images insect bee". BFS had no special support for these "multi-valued" fields, and so a text searches in their file browser for "bee" invoked a wildcard query for "*bee*" which forced iteration through all tagged files -- and returned matches with the tags "beer" and "wildebeest"!

A record-separation convention could prevent the extraneous matches. We dictate that the null character '\0' is used to separate fields in set-valued attributes, and we reward adherence to this convention by allowing efficient searches for items in set-valued attributes. For example, the file described above would show up in /.index/tags/schoolwork/, /.index/tags/images/, /.index/tags/insect/, and /.index/tags/bee/. The null character as separator also has the advantage of (a) already being a forbidden character in path names (so it interacts well with path-valued attributes, below), and (b) adhering to the defacto standard of POSIX tools such as xargs and find, which have options to emit and accept null-delimited arguments. For example:

$ cat foo/tags | xargs -0 echo
schoolwork
images
insect
bee

Path-valued attributes are ordered lists, with elements separated by the slash character. For example:

$ cat foo/path
/home/olpc/Documents/foo

A path attribute on files can be both path-valued and set-valued:

$ cat bee/path | xargs -0 echo
/home/olpc/Documents/Schoolwork/bee-image
/home/olpc/Documents/bee

Searches on path-valued attributes traverse each level independently, for example:

$ ls /.index/mime_type/text
$ ls /.index/mime_type/text/html

Just to be clear: path-valued, set-valued, and numeric are properties of indices, not of the attributes themselves.

TO DO: how is the file with mime-type "text" listed in the index? Is there a separate leaf-value directory ".objects", or do we name files in the index so that they don't conflict with the directory names continuing the index?

TO DO: our index allows efficient enumeration of "all items below node X" (regardless of exact subdirectory: X/Y, X/Y/Z, X/Z, etc) and efficient "is X a parent of Y" tests (regardless of the length of the path between X and Y). How do we export these queries?

TO DO: exact correlation between searches on a 'path' attribute and "directory structure". Can we have .index in any directory, narrowing the implicit query on the "path" attribute?

Inodes

Proper support for hardlinks under POSIX semantics requires us to introduce an extra level of indirection between directory entries and file contents. Paths are technically properties of inodes, not files themselves. If file content is copied in the file system, there may be two or more inodes pointing to the same file. Writes to a file via an inode affect all the path references sharing the inode (and only those path references).

We treat the inode as a hidden indirection. Some metadata properties are actually properties of the inode, not of the file itself. These properties are merged with the attributes of the file when the file is accessed via a POSIX path, but may be absent when the file is enumerated by an index. TO DO: we may want to export a reverse index from file to inodes, so that we can obtain paths to content discovered by an index (if paths exist).

Versioning

Notes:
* restrict version graph to a tree (not a graph or a DAG).
* allow explicit traversal of version trees.  ie, root/mod1/mod2/mod3.
* hard link to a specific version tree identifier to give it a name?
* named versions are pairs: <root, current>, where 'root' identifies the
  'pristine' version of the named tree, and 'current' points at the
  last-modified copy.
* how do you delete or compress a version tree to free up its space?
* what happens when I make a new tree 'A', modify it to create 'B', then
  "name" the 'B' revision as tree "C", modify B to make BB and C to make CC,
  and then delete tree 'A'?  Ideally the tree A->B->BB path would be deleted,
  while the C->CC path would be untouched (and note that B==C).
* Each version only retains a link to its parent.  We never store the
  full version paths.  We use order-maintenance data structures to
  ensure that we can efficiently identify whether a version is on the
  path from 'root' to 'current'.
* Probably export filesystem version as a path-valued attribute; can
  we integrate this with a file-oriented version as well?  Further,
  we'd like Sugar to be able to maintain its own version information,
  and index this appropriately.

Metadata fields of inodes

Most of these names are taken from freedesktop's Shared File Metadata spec, although those names conflict with the recommendations in freedesktop's Common Extended Attributes spec. We're not going to take sides in that fight yet; these names are subject to change.

File.Name
POSIX name(s) excluding path; set-valued.
File.Path
POSIX path(s); path- and set-valued
File.Permissions
string in unix format; eg "-rw-r--r--" (We may diverge from spec and use an integer instead.)
File.Modified
Last modified datetime. Spec allows timezone and separators; we'll insist on 16-character ISO8601 in UTC and basic format as we do in the Firmware Key and Signature Formats
File.Accessed
Last access datetime. Format as above. We will probably support a relatime mount option.
File.Owner
unix owner of the file, as an integer. (Not in Shared File Metadata spec.)
File.Group
unix group of the file, as an integer. (Not in Shared File Metadata spec.)
File.Created
Creation datatime. (Not in Shared File Metadata spec.)

Metadata fields of files

These fields are automatically created for a files written with the standard POSIX API.

File.Size
numeric
File.Content
actual content of file. Although freedesktop spec recommends that this is the 'plain text' version suitable for indexers, we return the raw contents (but see the discussion on the 'markup' attribute below).
File.XUID
content-based unique identifier, conformant with XAM spec.

TODO: versioning-related metadata.

Suggested application-level metadata fields

These fields are not created without explicit intervention, but are used to support application-level functionality.

Doc.Markup
This idea is borrowed from BeOS: some formatted text is stored as plain/text in the 'content' of a file. Markup information (ie, lists of spanned characters for HTML markup tags) is stored in the 'markup' stream of the file. This allows easy editing and searching of formatted text, and allows formatted text to be used directly as source code: the compiler or python runtime only sees the plain text content.
Doc.Keywords
set-valued tags
Doc.Title
title of document.
Doc.RootXUID
XUID of the "base version" of this document. Doing a descendents search on RootXUID will return the complete version history of this document. XXX integrate with filesystem-scope versioning.
Doc.Action
this timestamp-valued attribute is used to group multiple documents into a the activity-centric journal view. Range queries on this attribute will sort the entries appropriately. XXX: perhaps this should be combined with File.Modified or Doc.Created to provide better integration with files created via the POSIX API. Otherwise the journal would have to manually search for files without a Doc.Action tag and synthesize an appropriate one.

Implementation

Efficient implementation of olpcfs revolves around only two fundamental data structures: fully-persistent B+trees, and external order-maintence trees. Nodes in path-valued attributes are internally numbered in preorder and postorder, and these numberings are maintained using an external version of Bender, Cole, Demaine, and Farach-Colton's algorithm. Ancestor and descendent testing and enumeration can efficiently be performed using the node numberings.

XXX: Flesh this out.

Absent content and merging remote stores

THIS SECTION IS TENTATIVE AND INCOMPLETE.'

Absent content is *not* present in directory structure, although it may appear in indexes. We don't try to make the local store completely network-transparent; management of local store is done by application, not by the filesystem.

Although we have treated files as named objects residing in a directory above, in reality objects live in a flat space and are simply named by various directory objects residing in the system. Directories are in fact a special type of index -- special, because the indexed data (the 'path name') does not in fact exist as an independent property of the object, since a given file can have many 'path names'.

Merging remote stores: merging happens at the index level, not the directory level.

Related work

The Be File System also uses a directory structure to list extended attributes, although the directory inode number was hidden from the user. It also introduced the idea that metadata indices and directories were both just B+Trees on disk, and could use the same representation. As with the attribute "directory", BFS allocated an "index directory" inode, with the same structure as other directories, but hid this inode from the user. In the spirit of Plan 9 ("everything is a file") we've chosen to expose these directories to the user for direct access and traversal. The idea of direct read/write access to attributes as streams comes from the XAM specification (which likely inherited it from Centera or another pre-existing CAS system).

BFS had difficulties with duplicate keys in their B+Trees, which were worked-around with various ad-hoc measures. In our system, we use the XUID as a suffix to any key type to break ties, avoiding the problem completely.

The Linux NTFS driver exports "alternate data streams" using a trailing colon syntax, for example: "filename:stream". A special tool is necessary to list the streams available for a given file.

The ext3cow filesystem exports versions using a trailing at-sign syntax. The version of a file or directory available at 2008-03-12 16:47:34 UTC (1,205,340,454 seconds since the epoch) can be accessed with "filename@1205340454". There is a nice Time-Travelling File Manager based on this interface, and the whitepaper contains a useful discussion of the application-level versus user-visible semantics of this naming scheme (starting on page 6).

Plutus inspired (will inspire?) the privacy and group management features of olpcfs.

There are three specifications for file metadata: freedesktop's Shared File Metadata spec; freedesktop's Common Extended Attributes spec; and Apple's Spotlight spec. The Shared File Metadata spec is closest to our goals, but it defines a comma-delimited "array of string" in the place of our null-delimited set and slash-delimited path attributes.

Open Questions

  • How should complex queries by exposed via the POSIX API? Only via ioctl/xattr (as described above), by read/write to a special 'query' file, via magic directories, or what?
  • Synchronization of object sets with a remote store is an application-level task. Some ideas on efficient synchronization algorithms have been described, we do not have a formal proof of their efficiency (and they may well fail to perform efficiently for certain inputs).

Notes

Notes and latest ideas:

* files in the search result are inodes?  can directly access xuid
  from inode.
* Hence mime-type is a
  special two-valued path key!  mime-subtype is a set key.
* further, there's are special 'direntry' objects in the system, with a
  path-valued 'path' entry and an numeric 'inode' property.  The
  inode.  So the root filesystem is actually just the index for
  'path'.
* How do "set-valued" and "path-valued" attributes interact with
  "unique" attributes?  Paths want to be set-valued, path-valued,
  *and* unique.
* How do unique attributes interact with versioning?  Only one
  instance of the given attribute in a given version, presumably.
  But queries with don't have a version context won't necessary
  have the 'unique' magic applied at the leaves.  Maybe we should
  just generalize unique to say that *any* query that would otherwise
  result in a 1-item directory instead results in the file itself.
* lots of good filesystem testing tools at
  http://ltp.sourceforge.net/tooltable.php
  That should make a robust implementation significantly easier.