Olpcfs

From OLPC
Revision as of 16:51, 12 March 2008 by CScott (talk | contribs) (Basic POSIX semantics; historical notes.)
Jump to navigation Jump to search

This document presents a design for a new filesystem to support:

  1. OLPC's journal and bulletin board UI abstractions (see also the second-generation journal design)
  2. Bitfrost P_SF_RUN and P_SF_CORE protections (current implementation)

In our first-generation software, the journal was based with the "datastore" and Bitfrost protections were intended to be provided with a copy-on-write filesystem, originally based on vserver. This new filesystem, which I am calling "olpcfs", attempts to unify these implementations based on a powerful distributed fully-persistent versioned filesystem abstraction.

Design Goals

In this section we enumerate a number of the design goals for the olpcfs filesystem, based on our experiences with the first-generation datastore.

POSIX semantics

Our first-generation datastore was a bespoke system primarily accessible through a Python API. The on-disk representation was opaque, and users frequently expressed frustration at their inability to make journal contents interact with non-XO systems. For example, simply locating a desired file in the datastore was difficult, as files were stored on disk named only by content hashes. In addition, making ordinary Linux applications work on the XO was complicated by the fact that saving and loading objects from the datastore required the use of our unique API.

A primary design goal for olpcfs is thus to have a sane and reasonable POSIX semantics, so that unmodified applications using the POSIX APIs (ie, linux, Windows, MacOS, BeOS,...) can interact sensibly with the journal. We do not need to export every functionality of olpcfs over POSIX APIs, but we strive to. (An ideological tip-of-the-hat here goes to Plan 9, which taught the virtues of uniform access methods.)

In particular, the journal should be able to direct existing applications to open a document from the datastore by passing in a POSIX pathname to the document contents, and applications should be able to save a file to an appropriate place (say, '~/Documents') and have the new content naturally show up in the Journal. These documents may not have the rich metadata associated with them which native XO applications can provide, but the basic journal features should "just work".

Content-Addressable

Cloud storage is emerging as an elegant method of providing globally-accessible document stores, which are key to providing the uniform location-independent model of sharing Sugar aims to provide. When documents are stored into a global pool, naming each uniquely and consistently becomes a concern. Content-addressable storage (CAS) provides a solution to these issues by using (a function of) the content itself as a name. There are many implementations of CAS floating around; we've been most influenced by the emerging XAM standard API, but Sun's Project Honeycomb, Apache's Jackrabbit and (again) Plan 9's Venti are all possible backing storage for a CAS-based olpcfs filesystem.

Continuous versioning

Many versioned filesystems (for example, Fossil, ext3cow, WAFL, and AFS) use snapshot-based versioning. Rather than store every version of every file, snapshots are taken at regular intervals, and only the versions present at the time of a snapshot are accessible.

This is not consistent with the user interaction model underlying the XO, which strives to allow *every* action to be revokable, whether or not a snapshot intervened, to support painless exploration and discovery. It should be impossible for a child to "break" his machine in a way which is not easy fixed — or which requires losing all work since the last snapshot in order to fix.

Further, snapshots are usually taken of a full filesystem. Instead, we want fine-grained browsing of revisions made to individual documents. Ultimately we would like to empower applications to be able to intelligently display and merge revisions; fine-grained storage of the modification chain is an important enabler of this capability.

Recent work has shown that continuous versioning can be implemented at reasonable cost. Apple's Time Machine, and the experimental systems CVFS, VersionFS, Wayback, and PersiFS have demonstrated that this goal is achievable. The on-disk costs of continuous versioning will be mitigated by aggressively pruning unneeded versions at application-level.

Fully-persistent versions

Many versioned filesystems are partially persistent; that is, they support modifications only to the "most recent" version of a file, although all versions are available for read access. Fully-persistent data structures also allow modification of any version. This is obviously necessary to enable distributed editing, since independent parties are not guaranteed to be able to synchronize their edits. The "P_SF_RUN" Bitfrost protection and OLPC update mechanism also require full persistence: new base system images are independent roots for modification, and a user can switch back and forth from one base system image to the other, potentially making (independent) changes to each.

As an aside, other desirable properties may include data structures which are functional or confluently persistent. Functional data structures are immutable after creation, which is a nice property for the CAS backend but not strictly necessary on local disk. Confluently persistent data structures allows a subtree from version A to be efficiently copied into version B. Confluent persistence could be desirable to allow changes to one base system image to be applied to another; but hyper-efficiency for this operation is probably unnecessary for our system. (An unpublished paper by Demaine, Langerman, and Price contains a good discussion of these varieties of persistence, with references and an several examples of functional and confluently-persistent data structures; when the paper is accepted hopefully I'll be able to cite it here.)

POSIX API

In this section we'll describe how files, metadata, and indices are exposed through the standard POSIX API. To some degree, this is independent of actual implementation: one can imagine several different on-disk representations that export the given API, with different costs and properties. We expect that the first implementations will use FUSE on top of an on-disk representation substantially similar to the API exported. An intermediate representation may build on a CAS backend to perform the actual storage; an advanced implementation may eventually achieve high performance via direct disk access. Regardless, the behavior described in this section constitutes the principal API used by legacy applications and the Journal to access content.

Extended Attributes

The basic primitive in olpcfs is a file, with a rich set of metadata. The metadata includes typical POSIX attributes such as creation time, modification time, size, type, and permission bits. As in the [http://en.wikipedia.org/wiki/Be_File_System Be File System], however, application-specific tagged data can also be added to a file. There is a reasonable (256-character UTF-8) limit on tag size, but no limit to the amount of data stored in a tag. We will discuss uses for tags in the Suggested metadata fields section below. Tagged metadata are indexed, as in BFS and XAM. For uniformity of access, the actual contents of a file are stored as the value of a 'content' tag. (This notion of arbitrary-sized metadata could be considered similar to the notion of alternate data streams in NTFS or resource forks in HFS. Wikipedia contains a lengthy discussion.)

We export our metadata in two different formats. The first is for human browsing and existing filesystem exploration tools. We will present it by example. Let us say that there exists a file 'foo' with content 'bar\n' in the current directory, and that there is an extended attribute 'mime-type' on 'foo' with content 'text/plain'. Then:

$ ls
foo
$ cat foo
bar
$ ls foo/
ctime mtime size content mime-type xuid parent
$ cat foo/content
bar
$ cat foo/mime-type
text/plain
$ cat foo/size
4
$ cat foo/xuid
deadbeef

That is, a 'file' is treated as just a special type of directory, with internal nodes representing the various tagged metadata. For compatibility with legacy software, opening the 'file' without a trailing slash on the name returns you the value of its 'content' tag. The mode bits for 'foo' report it as a standard file, so the only real 'hack' here is the ability to cd into a 'file' and list its attributes. The uniform treatment of "alternate data streams" as files means that one could write a utility to combine markup and text (see below) and invoke it as:

$ ./combine-markup foo/content foo/markup > marked-up.html
$ echo 'text/html' > marked-up.html/mime-type

The metadata attributes are also exported as standard POSIX xattrs, which allows efficient backup/recovery with the standard zip, tar, and rsync tools (which support xattrs). Unfortunately, POSIX xattr support is not well integrated into the standard Unix toolkit. There are no standard tools (other than archiving tools) for manipulating xattrs, although there is a partial standard defining naming conventions and common tags for better interoperability. Further, POSIX does not export a read/write API for metadata, which makes it impossible to (say) use the 'grep' tool over metadata content.

Indexing

At the root, an olpcfs filesystem exports indexes of metadata, which appear as simple directories:

$ ls /index/size/
4
56
56
123
592
$ ls /index/size/4/
ctime mtime size content mime-type xuid parent
$ cat foo/content
bar

Two difficulties: duplicate keys appear as duplicate filenames in a directory. The interface exported by the kernel includes an inode number, and the entries are distinguishable. FIXME

Accessing objects by XUID

Although we have treated files as named objects residing in a directory above, in reality objects live in a flat space and are simply named by various directory objects residing in the system. Directories are in fact a special type of index -- special, because the indexed data (the 'path name') does not in fact exist as an independent property of the object, since a given file can have many 'path names'. FIXME


Suggested metadata fields

markup
This idea is borrowed from BeOS: some formatted text is stored as plain/text in the 'content' of a file. Markup information (ie, lists of spanned characters for HTML markup tags) is stored in the 'markup' stream of the file. This allows easy editing and searching of formatted text, and allows formatted text to be used directly as source code: the compiler or python runtime only sees the plain text content.

Implementation

Indexing

Absent content

Merging remote stores

Historical note

The Be File System also used a directory structure to list extended attributes, although the directory inode number was hidden from the user. It also introduced the idea that metadata indices and directories were both just B+Trees on disk, and could use the same representation. As with the attribute "directory", BFS allocated an "index directory" inode, with the same structure as other directories, but hid this inode from the user. In the spirit of Plan 9 ("everything is a file") we've chosen to expose these directories to the user for direct access and traversal. The idea of direct read/write access to attributes as streams comes from the XAM specification (which no doubt inherited it from Centera or another other of the pre-existing CAS systems which influenced XAM's design).

BFS had difficulties with duplicate keys in their B+Trees, which were worked-around with various ad-hoc measures. In our system, we can use the XUID as a suffix to any key type to break ties, avoiding the problem completely.

The Linux NTFS driver exports "alternate data streams" using a trailing colon syntax, for example: "filename:stream". A special tool is necessary to list the streams available for a given file.

The ext3cow filesystem exports versions using a trailing at-sign syntax. The version of a file or directory available at 2008-03-12 16:47:34 UTC (1,205,340,454 seconds since the epoch) can be accessed with "filename@1205340454".

Open Questions

  • How should complex queries by exposed via the POSIX API? Only via

ioctl, or by read/write to a special 'query' file, or via magic directories, or what?