Olpcfs

From OLPC
Revision as of 13:04, 11 March 2008 by CScott (talk | contribs) (Statement of goals. Lots of content still to be written.)

This document presents a design for a new filesystem to support:

  1. OLPC's journal and bulletin board UI abstractions (see also the second-generation journal design)
  2. Bitfrost P_SF_RUN and P_SF_CORE protections (current implementation)

In our first-generation software, the journal was based on the "datastore", and Bitfrost protections were intended to be provided by a copy-on-write filesystem, originally based on vserver. This new filesystem, which I am calling "olpcfs", attempts to unify these implementations on top of a powerful, distributed, fully-persistent versioned filesystem abstraction.

Design Goals

In this section we enumerate a number of the design goals for the olpcfs filesystem, based on our experiences with the first-generation datastore.

POSIX semantics

Our first-generation datastore was a bespoke system primarily accessible through a Python API. The on-disk representation was opaque, and users frequently expressed frustration at their inability to make journal contents interact with non-XO systems. For example, simply locating a desired file in the datastore was difficult, as files were stored on disk named only by content hashes. In addition, making ordinary Linux applications work on the XO was complicated by the fact that saving and loading objects from the datastore required the use of our unique API.

A primary design goal for olpcfs is thus to have sane and reasonable POSIX semantics, so that unmodified applications using the POSIX APIs (e.g., on Linux, Windows, Mac OS, BeOS, ...) can interact sensibly with the journal. We do not need to export every feature of olpcfs over POSIX APIs, but we strive to. (An ideological tip of the hat here goes to Plan 9, which taught the virtues of uniform access methods.)

In particular, the journal should be able to direct existing applications to open a document from the datastore by passing in a POSIX pathname to the document contents, and applications should be able to save a file to an appropriate place (say, '~/Documents') and have the new content naturally show up in the Journal. These documents may not have the rich metadata associated with them which native XO applications can provide, but the basic journal features should "just work".

Content-Addressable

Cloud storage is emerging as an elegant method of providing globally-accessible document stores, which are key to the uniform, location-independent model of sharing that Sugar aims to provide. When documents are stored into a global pool, naming each uniquely and consistently becomes a concern. Content-addressable storage (CAS) solves this problem by using (a function of) the content itself as a name. There are many implementations of CAS floating around; we've been most influenced by the emerging XAM standard API, but Sun's Project Honeycomb, Apache's Jackrabbit and (again) Plan 9's Venti are all possible backing stores for a CAS-based olpcfs filesystem.
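As a rough illustration of the content-addressing idea, the following minimal sketch names each blob by a hash of its bytes. The class name ContentStore, the in-memory dictionary, and the choice of SHA-256 are assumptions made for this example only; they are not part of the olpcfs design or of any of the CAS systems mentioned above.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressable store (hypothetical, illustration only)."""

    def __init__(self):
        self._blobs = {}  # maps content hash -> content bytes

    def put(self, data: bytes) -> str:
        # The name is a function of the content itself; SHA-256 is an
        # assumption of this sketch, not a choice made by the design.
        key = hashlib.sha256(data).hexdigest()
        # Storing identical content twice is a no-op: same bytes, same name.
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentStore()
first = store.put(b"hello")
second = store.put(b"hello")  # the same content yields the same name
```

Because the name depends only on the content, two machines that store the same document independently arrive at the same name without coordinating, which is what makes the scheme attractive for a global pool.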

Continuous versioning

Many versioned filesystems (for example, Fossil, ext3cow, WAFL, and AFS) use snapshot-based versioning. Rather than store every version of every file, snapshots are taken at regular intervals, and only the versions present at the time of a snapshot are accessible.

This is not consistent with the user interaction model underlying the XO, which strives to make *every* action revocable, whether or not a snapshot intervened, to support painless exploration and discovery. It should be impossible for a child to "break" his machine in a way which is not easily fixed, or which requires losing all work since the last snapshot in order to fix.

Further, snapshots are usually taken of a full filesystem. Instead, we want fine-grained browsing of revisions made to individual documents. Ultimately we would like to empower applications to be able to intelligently display and merge revisions; fine-grained storage of the modification chain is an important enabler of this capability.

Recent work has shown that continuous versioning can be implemented at reasonable cost. Apple's Time Machine, and the experimental systems CVFS, VersionFS, Wayback, and PersiFS have demonstrated that this goal is achievable.
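The difference between snapshot-based and continuous versioning can be made concrete with a small simulation. The sketch below is only an illustration: the function names, the (time, content) representation of writes, and the fixed snapshot interval are all assumptions of this example, not a description of any of the systems named above.

```python
def continuous_versions(writes):
    """Continuous versioning: every write yields a retrievable version."""
    return [content for _, content in writes]


def snapshot_versions(writes, interval, end):
    """Snapshot versioning: only the state present at each snapshot survives.

    `writes` is a time-ordered list of (time, content) pairs; snapshots are
    taken at interval, 2*interval, ... up to and including `end`.
    """
    kept = []
    t = interval
    while t <= end:
        current = None
        for when, content in writes:
            if when <= t:
                current = content  # latest write at or before the snapshot
        # Record the state unless nothing exists yet or nothing changed.
        if current is not None and (not kept or kept[-1] != current):
            kept.append(current)
        t += interval
    return kept


# Three quick edits, then one later edit (times in arbitrary units).
writes = [(1, "a"), (2, "ab"), (3, "abc"), (12, "abcd")]
all_versions = continuous_versions(writes)         # four versions
snapshots_only = snapshot_versions(writes, 10, 20)  # two versions
```

With a snapshot every 10 time units, the intermediate states "a" and "ab" are never captured and so cannot be restored; under continuous versioning all four states remain available.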

Fully-persistent versions

Many versioned filesystems are partially persistent; that is, they support modifications only to the "most recent" version of a file, although all versions are available for read access. Fully-persistent data structures also allow modification of any version. This is obviously necessary to enable distributed editing, since independent parties are not guaranteed to be able to synchronize their edits. The "P_SF_RUN" Bitfrost protection and OLPC update mechanism also require full persistence: new base system images are independent roots for modification, and a user can switch back and forth from one base system image to the other, potentially making (independent) changes to each.
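The distinction between partial and full persistence can be sketched with a version tree in which any existing version may serve as the base for a new one. The names Version and VersionedFile, and the use of integer version ids, are hypothetical and chosen only for this illustration; a real implementation would share storage between versions rather than keep full copies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Version:
    """One immutable version; `parent` links form a tree, not a chain."""
    content: bytes
    parent: Optional[int]


class VersionedFile:
    """Toy fully-persistent file (hypothetical, illustration only)."""

    def __init__(self, initial: bytes = b""):
        self._versions: List[Version] = [Version(initial, None)]

    def read(self, vid: int) -> bytes:
        return self._versions[vid].content

    def write(self, base: int, content: bytes) -> int:
        # Full persistence: `base` may be ANY existing version. A partially
        # persistent store would insist that base is the newest version.
        if not 0 <= base < len(self._versions):
            raise KeyError(base)
        self._versions.append(Version(content, base))
        return len(self._versions) - 1

    def history(self, vid: int) -> List[int]:
        """Return the chain of version ids from the root down to `vid`."""
        chain = []
        cur: Optional[int] = vid
        while cur is not None:
            chain.append(cur)
            cur = self._versions[cur].parent
        return chain[::-1]


f = VersionedFile(b"base image")
v1 = f.write(0, b"base image + local change")    # edit on top of the root
v2 = f.write(0, b"base image + another change")  # independent edit, same root
v3 = f.write(v1, b"base image + local change, again")
```

Here v1 and v2 are siblings that both descend from version 0, much as independent changes may be made to a base system image, and neither edit invalidates the other.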

As an aside, other desirable properties may include data structures which are functional or confluently persistent. Functional data structures are immutable after creation, which is a nice property for the CAS backend but not strictly necessary on local disk. Confluently persistent data structures allow a subtree from version A to be efficiently copied into version B. Confluent persistence could be desirable to allow changes to one base system image to be applied to another; but hyper-efficiency for this operation is probably unnecessary for our system. (An unpublished paper by Demaine, Langerman, and Price contains a good discussion of these varieties of persistence, with references; when the paper is accepted hopefully I'll be able to cite it here.)

POSIX API

Notes:

 Export both an 'extended attribute' view and a pseudo-directory
 view of file metadata; this allows efficient backup/recovery with
 the standard zip and tar tools.  The pseudo-directory view enables
 browsing with ls and friends.
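To make the two views in the notes above a little more concrete, here is a minimal sketch of how one set of metadata might be presented both ways. Everything specific in it is an assumption made for illustration: the "user.olpcfs." attribute namespace, the ".meta" pseudo-directory suffix, and the metadata keys are hypothetical and are not defined anywhere in this design.

```python
def xattr_view(metadata, namespace="user.olpcfs."):
    """Present metadata as extended-attribute names mapped to byte values.

    The namespace prefix is a hypothetical choice for this sketch.
    """
    return {namespace + key: value.encode("utf-8")
            for key, value in metadata.items()}


def pseudo_dir_view(path, metadata, suffix=".meta"):
    """Present the same metadata as small files in a pseudo-directory.

    The "<path>.meta/<key>" layout is a hypothetical choice for this sketch.
    """
    return {f"{path}{suffix}/{key}": value.encode("utf-8")
            for key, value in metadata.items()}


meta = {"title": "My essay", "activity": "Write"}
as_xattrs = xattr_view(meta)
as_files = pseudo_dir_view("Documents/essay.txt", meta)
```

In this sketch the pseudo-directory entries are ordinary-looking paths that ls could list, while the extended-attribute names travel with the file itself; both are derived from the same underlying metadata.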

Implementation

Indexing

Absent content

Merging remote stores