Olpcfs

This document presents a design for a new filesystem to support:

  1. OLPC's journal and bulletin board UI abstractions (see also the second-generation journal design)
  2. Bitfrost P_SF_RUN and P_SF_CORE protections (current implementation)

In our first-generation software, the journal was built on the "datastore", and Bitfrost protections were intended to be provided by a copy-on-write filesystem, originally based on vserver. This new filesystem, which I am calling "olpcfs", attempts to unify these implementations based on a powerful distributed fully-persistent versioned filesystem abstraction.

A proof-of-concept partial implementation is available at http://dev.laptop.org/git?p=users/cscott/olpcfs1 with API documentation and source code browsable at http://dev.laptop.org/~cscott/olpcfs1/ . Slides from a talk on olpcfs are available from Presentations/April_2008_Technical_Mini-conference.

Design Goals

In this section we enumerate a number of the design goals for the olpcfs filesystem, based on our experiences with the first-generation datastore.

POSIX semantics

Our first-generation datastore was a bespoke system primarily accessible through a Python API. The on-disk representation was opaque, and users frequently expressed frustration at their inability to make journal contents interact with non-XO systems. For example, simply locating a desired file in the datastore was difficult, as files were stored on disk named only by content hashes. In addition, making ordinary Linux applications work on the XO was complicated by the fact that saving and loading objects from the datastore required the use of our unique API.

A primary design goal for olpcfs is thus to have sane and reasonable POSIX semantics, so that unmodified applications using the POSIX APIs (i.e., on Linux, Windows, MacOS, BeOS, ...) can interact sensibly with the journal. We do not need to export every olpcfs feature over the POSIX APIs, but we strive to. (A tip of the hat here goes to Plan 9, which taught the virtues of uniform access methods.)

In particular, the journal should be able to direct existing applications to open a document from the datastore by passing in a POSIX pathname to the document contents, and applications should be able to save a file to an appropriate place (say, '~/Documents') and have the new content naturally show up in the Journal. These documents may not have the rich metadata associated with them which native XO applications can provide, but the basic journal features should "just work".

Content-addressable

Cloud storage is emerging as an elegant method of providing globally-accessible document stores, which are key to providing the uniform location-independent model of sharing Sugar aims to provide. When documents are stored into a global pool, naming each uniquely and consistently becomes a concern. Content-addressable storage (CAS) provides a solution to these issues by using (a function of) the content itself as a name. There are many implementations of CAS floating around; we've been most influenced by the emerging XAM standard API, but Sun's Project Honeycomb, Apache's Jackrabbit and (again) Plan 9's Venti are all possible backing storage for a CAS-based olpcfs filesystem. DHash or OpenDHT would make interesting candidates for a link-local look-aside cache to collaboratively cache shared documents, although extending this through NATs could be a challenge.
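
To make the content-addressing idea concrete, here is a minimal Python sketch; the choice of SHA-256 and an in-memory dict are assumptions for illustration, not part of this design (a real XUID follows the XAM format):

import hashlib

class ContentStore:
    """Toy CAS: a block's name is derived from its content."""
    def __init__(self):
        self.blocks = {}

    def put(self, data: bytes) -> str:
        xuid = hashlib.sha256(data).hexdigest()
        self.blocks[xuid] = data
        return xuid

    def get(self, xuid: str) -> bytes:
        return self.blocks[xuid]

store = ContentStore()
name = store.put(b"bar\n")
# Identical content always yields the same name, wherever it is stored:
assert store.put(b"bar\n") == name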

Continuous versioning

Many versioned filesystems (for example, Fossil, ext3cow, WAFL, and AFS) use snapshot-based versioning. Rather than store every version of every file, snapshots are taken at regular intervals, and only the versions present at the time of a snapshot are accessible.

This is not consistent with the user interaction model underlying the XO, which strives to allow *every* action to be revocable, whether or not a snapshot intervened, to support painless exploration and discovery. It should be impossible for a child to "break" his machine in a way which is not easily fixed — or which requires losing all work since the last snapshot in order to fix.

Further, snapshots are usually taken of a full filesystem. Instead, we want fine-grained browsing of revisions made to individual documents. Ultimately we would like to empower applications to be able to intelligently display and merge revisions; fine-grained storage of the modification chain is an important enabler of this capability.

Recent work has shown that continuous versioning can be implemented at reasonable cost: Apple's Time Machine and the experimental systems CVFS, VersionFS, Wayback, and PersiFS have all demonstrated that this goal is achievable. The on-disk costs of continuous versioning will be mitigated by aggressively pruning unneeded versions at the application level, using mechanisms similar to the Elephant File System.

Fully-persistent versions

Many versioned filesystems are partially persistent; that is, they support modifications only to the "most recent" version of a file, although all versions are available for read access. Fully-persistent data structures also allow modification of any version. This is obviously necessary to enable distributed editing, since independent parties are not guaranteed to be able to synchronize their edits. The "P_SF_RUN" Bitfrost protection and OLPC update mechanism also require full persistence: new base system images are independent roots for modification, and a user can switch back and forth from one base system image to the other, potentially making (independent) changes to each.

As an aside, other desirable properties may include data structures which are functional or confluently persistent. Functional data structures are immutable after creation, which is a nice property for the CAS backend but not strictly necessary on local disk. Confluently persistent data structures allow a subtree from version A to be efficiently copied into version B. Confluent persistence could be desirable to allow changes to one base system image to be applied to another; but hyper-efficiency for this operation is probably unnecessary for our system. (An unpublished paper by Demaine, Langerman, and Price contains a good discussion of these varieties of persistence, with references and several examples of functional and confluently-persistent data structures; when the paper is accepted hopefully I'll be able to cite it here.)
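
A minimal Python sketch of full persistence: a write never happens in place but instead produces a new child version, so *any* version (including an old base system image) remains a valid starting point for modification. Class and field names here are illustrative only:

class Version:
    def __init__(self, contents, parent=None):
        self.contents = dict(contents)   # copy; real trees share structure
        self.parent = parent

    def write(self, path, data):
        """Return a new child version; the receiver is never mutated."""
        child = Version(self.contents, parent=self)
        child.contents[path] = data
        return child

base = Version({"/etc/motd": "hello"})
update1 = base.write("/etc/motd", "one")       # one branch from base
update2 = base.write("/etc/motd", "two")       # an independent second branch
assert base.contents["/etc/motd"] == "hello"   # base itself is intact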

POSIX API

In this section we'll describe how files, metadata, and indices are exposed through the standard POSIX API. To some degree, this is independent of actual implementation: one can imagine several different on-disk representations that export the given API, with different costs and properties. We expect that the first implementations will use FUSE on top of an on-disk representation substantially similar to the API exported. An intermediate representation may build on a CAS backend to perform the actual storage; an advanced implementation may eventually achieve high performance via direct disk access. Regardless, the behavior described in this section constitutes the principal API used by legacy applications and the Journal to access content.

Extended Attributes

The basic primitive in olpcfs is a file, with a rich set of metadata. The metadata includes typical POSIX attributes such as creation time, modification time, size, type, and permission bits. As in the Be File System, however, application-specific tagged data can also be added to a file. There is a reasonable (256-character UTF-8) limit on tag size, but no limit to the amount of data stored in a tag. We will discuss uses for tags in the Suggested metadata fields section below. Tagged metadata are indexed, as in BFS and XAM. For uniformity of access, the actual contents of a file are stored as the value of a 'content' tag. (This notion of arbitrary-sized metadata could be considered similar to the notion of alternate data streams in NTFS or resource forks in HFS. Wikipedia contains a lengthy discussion.)

We export our metadata in two different formats. The first is for human browsing and existing filesystem exploration tools. We will present it by example. Let us say that there exists a file 'foo' with content 'bar\n' in the current directory, and that there is an extended attribute 'mime-type' on 'foo' with content 'text/plain'. Then:

$ ls
foo
$ cat foo
bar
$ ls foo:/
ctime mtime size content mime-type xuid parent
$ cat foo:/content
bar
$ cat foo:/mime-type
text/plain
$ cat foo:/size
4
$ cat foo:/xuid
deadbeef

That is, by suffixing a filename with a colon, you get a special type of directory, with internal nodes representing the various tagged metadata. For compatibility with legacy software, opening the 'file' without a trailing slash on the name returns you the value of its 'content' tag. We accept "file:attr" as a shortcut for "file:/attr". The uniform treatment of "alternate data streams" as files means that one could write a utility to combine markup and text (see below) and invoke it as:

$ ./combine-markup foo:/content foo:/markup > marked-up.html
$ echo 'text/html' > marked-up.html:/mime-type

The metadata attributes are also exported as standard POSIX xattrs, which allows efficient backup/recovery with the standard zip, tar, and rsync tools (which support xattrs). Unfortunately, POSIX xattr support is not well integrated into the standard Unix toolkit. There are no standard tools (other than archiving tools) for manipulating xattrs, although there is a partial standard defining naming conventions and common tags for better interoperability. Further, POSIX does not export a read/write API for metadata, which makes it impossible to (say) use the 'grep' tool over metadata content. Our directory hack addresses these infelicities to make the metadata more discoverable and mutable.
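
For illustration, here is the same attribute reached through both access paths. The sketch assumes olpcfs tags would be surfaced in Linux's user.* xattr namespace, which is an implementation choice this document does not fix:

import os

# Standard POSIX xattr API (Linux), as used by tar, rsync, etc.:
os.setxattr("foo", "user.mime-type", b"text/plain")
print(os.getxattr("foo", "user.mime-type"))    # b'text/plain'

# The same metadata through the colon-directory view, readable and
# writable with ordinary file tools (olpcfs path syntax):
with open("foo:/mime-type") as f:
    print(f.read())                            # text/plain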

Indexing

At the root, an olpcfs filesystem exports indexes of metadata, which appear as simple directories:

$ ls /.index/size/
123  4  56  592
$ ls /.index/size/4/
cafebabe  deadbeef
$ ls /.index/size/4/cafebabe
ctime mtime size content mime-type xuid parent
$ cat /.index/size/4/cafebabe/content
bar

Two basic issues arise with this fundamental idea:

  • How are duplicate keys handled?
  • How are "strange" attribute values handled?

We address the presence of duplicate keys by having the leaf of each index be a directory. The contents of that directory are named by xuid and enumerate the different files on the system which match the query. When an index is created, an attribute can be tagged as unique, in which case an attempt to create a duplicate value will return EEXIST. The leaf nodes for unique indexes are files, rather than (1-entry) directories. Combined with path-valued attributes, the standard directory view of a filesystem is (more or less) just the index on the "path" attribute.

The second issue is exemplified by typical values of the "mime-type" attribute: "text/plain", "text/html", etc. Since leaf nodes of the index are named by the attribute value, how do we handle attributes with illegal filename characters (like '/')? And what would we do if the user attempted to index the "content" attribute, which may contain gigabytes of data for certain files?

Possible solutions might be to have ls display files that couldn't actually be accessed through the POSIX API, or to hide "bad" entries, or to export only those indices which avoid these problems. These solutions result in irregular indices which aren't actually usable for anything except browsing; any "real" queries would have to be performed with an alternate API.

We have chosen instead to make the index as close to fully-functional as possible, even if this means that manual browsing is a little more difficult. We URL-encode problematic characters in attribute values, and truncate them to the first 128 bytes. This results in a consistent programmatic API, even if it makes directory listings a little uglier. A command-line tool to URI-encode an input string allows shell tools to directly access the index, even if the desired key is "funny". Only searches through large attribute values require use of an alternate query API.
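
A Python sketch of the key-mangling rule just described; the exact set of characters left unencoded, and how truncation interacts with multi-byte characters, are details left open here:

from urllib.parse import quote

def index_key(value: bytes) -> str:
    """Truncate to 128 bytes, then percent-encode problematic characters."""
    return quote(value[:128], safe="")

print(index_key(b"text/plain"))    # -> text%2Fplain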

Multilevel indexes

The views of indices we've shown so far allow for simple exact-match queries on a single term. We can extend the power of this interface to allow exact match queries on multiple ANDed terms by exporting a .index/ directory in the result directory of every index query:

$ ls /.index/size/
123  4  56  592
$ ls /.index/size/4/
cafebabe  deadbeef
$ ls /.index/size/4/.index/ctime/
20080303T123456Z  20071225T012345Z

Since "unique" index terms have files, not directories, as leaves, unique indexes must appear only as the last term of a multilevel query. This non-orthogonality is a slightly unfortunate, since it requires a query tool using the POSIX API to handle a special case: if at any point in traversing the query path you end up with a file instead of a directory, there are either 0 or 1 results to your query. Further checks should be done directly on the extended attributes of the 1 possible result. This special case doesn't seem to be too onerous; after all, prioritizing unique indices is basic query optimization.

Query API

A full query API allows arbitrary boolean combinations of key terms, as well as range and subset queries. This document won't attempt to nail this interface down precisely, since the goal is a robustly functional POSIX API, but a BFS or SQLish query delivered by ioctl() or getxattr/setxattr call is a reasonable basis for discussion. (A FUSE implementation of olpcfs wouldn't permit an ioctl-based API.) For example:

/* Sketch only: TAGLEN, XUIDLEN, and PATHLEN are hypothetical constants
   that would be defined by an olpcfs header. */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/xattr.h>

char querytag[TAGLEN];

/* The query statement is passed as the attribute name; the value read
   back is an opaque tag identifying this query. */
getxattr("/.query",
         "SELECT .xuid WHERE mime-type = 'text/html' OR mime-type = 'text/plain'",
         querytag, TAGLEN);

while (1) {
  char xuid[XUIDLEN], pathbuf[PATHLEN];
  struct stat st;

  /* Reading the tag from /.query-result yields the next matching xuid;
     a zero-length result means the query is exhausted. */
  if (!getxattr("/.query-result", querytag, xuid, XUIDLEN))
    break; /* no more results */

  /* do something with the file, for example: */
  snprintf(pathbuf, PATHLEN, "/.index/xuid/%.*s", XUIDLEN, xuid);
  stat(pathbuf, &st);
}

Helper functions which handle mount points properly, etc., could be added to a library to make this interface easier to use. "Live queries" (another debt to BFS) could also be supported if the getxattr calls above are allowed to block until additional matches are found, and to return a special "no longer a match" result for XUIDs modified since the start of a query.

If a minimal API is desired, efficient compound searches can be performed entirely in userland given only a means for performing a range query on a index directory. An API could be crafted around this mechanism, instead of the direct statement query interface shown above. All query processing could then occur in an application library rather than inside the filesystem. The trade-off would be that live queries and atomicity guarantees would be much more difficult.
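
Under that minimal API, ANDing two exact-match terms reduces to intersecting two index directory listings, as in this sketch; enumerating the smaller result set first is the obvious optimization, and range queries would generalize the plain listdir calls:

import os

def and_query(term1, term2):
    """Each term is an (attribute, value) pair; index entries are xuids."""
    dir1 = "/.index/%s/%s" % term1
    dir2 = "/.index/%s/%s" % term2
    small, large = sorted([dir1, dir2], key=lambda d: len(os.listdir(d)))
    return set(os.listdir(small)).intersection(os.listdir(large))

# and_query(("size", "4"), ("mime-type", "text%2Fplain"))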

Typed attributes

A survey of interesting file attributes yields several special cases of structured data which deserve support: numeric attributes, path-valued attributes, and set-valued attributes.

Numeric attributes are the simplest: attributes such as "file size" or "inode number" should be sorted by their numeric value, not lexicographically by their string representation. BFS handled this case by adding type information for every attribute, with a simple type system including integers and floating point values of various lengths in addition to (UTF-8) string-valued attributes.

We prefer to handle all attribute values uniformly as octet streams, and specify additional type hints when indexes are created. This raises the possibility that non-numeric data might be stored in a "numeric" attribute. The index can handle this case by either refusing to index files with unparsable attributes, or by assigning some arbitrary ordering -- for example, that all unparsable attributes are sorted lexicographically among themselves, and that they are sorted *after* all parsable attributes. This allows us to support numerically-sorted indices without complicating our API with type information.
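
The ordering rule can be captured in a single sort key, as in this Python sketch: parsable values sort numerically and come first; unparsable values sort lexicographically among themselves, after all parsable ones:

def numeric_index_key(value: bytes):
    try:
        return (0, float(value), b"")    # parsable: numeric order first
    except ValueError:
        return (1, 0.0, value)           # unparsable: lexicographic, last

keys = [b"10", b"2", b"n/a", b"abc"]
print(sorted(keys, key=numeric_index_key))
# [b'2', b'10', b'abc', b'n/a']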

Set-valued attributes have multiple notional values. A good example is a "tags" field on a file, which may have the values "schoolwork images insect bee". BFS had no special support for these "multi-valued" fields, and so a text search in their file browser for "bee" invoked a wildcard query for "*bee*" which forced iteration through all tagged files -- and returned matches with the tags "beer" and "wildebeest"!

A record-separation convention could prevent the extraneous matches. We dictate that the null character '\0' is used to separate fields in set-valued attributes, and we reward adherence to this convention by allowing efficient searches for items in set-valued attributes. For example, the file described above would show up in /.index/tags/schoolwork/, /.index/tags/images/, /.index/tags/insect/, and /.index/tags/bee/. The null character as separator also has the advantage of (a) already being a forbidden character in path names (so it interacts well with path-valued attributes, below), and (b) adhering to the de facto standard of POSIX tools such as xargs and find, which have options to emit and accept null-delimited arguments. For example:

$ cat foo:/tags | xargs -0 -n1 echo
schoolwork
images
insect
bee

Path-valued attributes are ordered lists, with elements separated by the slash character. For example:

$ cat foo:/path
/home/olpc/Documents/foo

A path attribute on files can be both path-valued and set-valued:

$ cat bee:/path | xargs -0 -n1 echo
/home/olpc/Documents/Schoolwork/bee-image
/home/olpc/Documents/bee

Searches on path-valued attributes traverse each level independently, for example:

$ ls /.index/mime-type/text
$ ls /.index/mime-type/text/html

Just to be clear: path-valued, set-valued, and numeric are properties of indices, not of the attributes themselves.

TO DO: how is the file with mime-type "text" listed in the index? Is there a separate leaf-value directory ".objects", or do we name files in the index so that they don't conflict with the directory names continuing the index?

TO DO: our index allows efficient enumeration of "all items below node X" (regardless of exact subdirectory: X/Y, X/Y/Z, X/Z, etc) and efficient "is X a parent of Y" tests (regardless of the length of the path between X and Y). How do we export these queries?

TO DO: exact correlation between searches on a 'path' attribute and "directory structure". Can we have .index in any directory, narrowing the implicit query on the "path" attribute?

Inodes

Proper support for hardlinks under POSIX semantics requires us to introduce an extra level of indirection between directory entries and file contents. Paths are technically properties of inodes, not files themselves. If file content is copied in the file system, there may be two or more inodes pointing to the same file. Writes to a file via an inode affect all the path references sharing the inode (and only those path references).

We treat the inode as a hidden indirection. Some metadata properties are actually properties of the inode, not of the file itself; the Metadata fields of inodes section below contains a list. These properties are merged with the attributes of the file when the file is accessed via a POSIX path, but may be absent when the file is enumerated by an index. TO DO: we may want to export a reverse index from file to inodes, so that we can obtain paths to content discovered by an index (if paths exist).

Versioning

We restrict the version graph for files and filesystems to be a tree, not a graph or a DAG. Merging operations are not supported, although copying between branches is clearly allowed.

Versions are named as branch_point.time, for example: update1.20080319T154100Z. The branch_point can be omitted, in which case the "current branch" is used. Access to older versions is consistent with the syntax used by the Elephant file system and ext3cow, that is:

$ cat foo@20080101

lists the contents of file 'foo' as it existed on January 1, 2008. Similarly,

$ cd .@20080101T010203

allows you to browse the contents of the current directory as of that date. As demonstrated, we allow partial date specifications as well as integer 'seconds since the epoch' values in time specifications for ease of use. A missing time specification or the special value 'now' is equivalent to "the end of time"; that is, it will always correspond to the latest version on the branch.

The above syntax is actually only sugar for a '@' suffix, used in the same way as the ':' suffix described for metadata:

$ cd foo@/
$ ls
20080101T010203  20080415T112312
$ cat foo@/20080101T010203
Hello, there.

Time-travel is read-only by default; in order to modify files on a different time-line you need to explicitly create a named branch using an olpcfs utility. Branch names implicitly denote a path from a single "most recent version" to the root of the branch; time-travel "into the future" occurs only along this path. This disambiguates the choice of future version when branches occur on a version graph.

The root of a branch is assumed to have existed for all times before the creation of the branch; thus:

$ cd /@build656.0

accesses the "pristine" original root version of the build656 branch, and

$ cd /@build656

or

$ cd /@build656.now

switches to the latest version on the build656 branch.

As in the Elephant system, separate version information is kept for directory entries and for inodes. For example:

$ cat foo@20080101

lists a past version of the inode currently corresponding to foo (whatever its former name), while

$ cat .@20080101/foo

lists the contents of the file named 'foo' at the given time (whatever its current inode number). As with similar systems, time specifications can be added at any place in the path, or even multiple places. For example, this command lists the Jan 1 version of the inode that was named /foo on Mar 3:

$ cat /@20080303/foo@20080101

Old versions are reclaimed according to policies set by an olpcfs utility. The policies correspond to those defined for the Elephant file system. The default policy keeps all versions in a "second-chance interval", so that all actions in the past (say) 2 days can be undone. In addition to the second-chance interval, the default policy also keeps "landmark" versions. Landmark versions are versions left untouched for more than the "landmark interval" (say, 1 hour), as well as any versions explicitly marked as landmarks by the user. Policies applied to a directory apply by default to all files contained in it; in particular, manually landmarking a directory ensures that all contained files have a landmark at that point. Alternate policies can be set for specific directories and wildcard patterns, so that temporary cache files and core files are not preserved by default.
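
The default policy is compact enough to state in code. This sketch keeps a version if it falls within the second-chance interval, stood unmodified for at least the landmark interval before being superseded, or was explicitly marked; the interval values and function name are illustrative:

SECOND_CHANCE = 2 * 24 * 3600    # undo window, in seconds
LANDMARK_GAP = 3600              # untouched this long => landmark

def keep(version_times, now, marked=()):
    """version_times: sorted creation timestamps of one file's versions."""
    kept = []
    for i, t in enumerate(version_times):
        superseded = version_times[i + 1] if i + 1 < len(version_times) else now
        if (now - t <= SECOND_CHANCE               # every recent action undoable
                or superseded - t >= LANDMARK_GAP  # stood for a landmark interval
                or t in marked):                   # explicitly landmarked
            kept.append(t)
    return kept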

Cleaning up old versions might mean that it is impossible to reconstruct the exact contents of a given directory at a specific time. For this reason, each past version also stores the version timestamp of the next file known to replace it, so that it is possible to tell whether a given filesystem view is consistent while time-travelling.


Import/export mechanisms allow us to export the entire version history of a file, as well as to import versions of a file from an external source. Imports of an externally modified file will typically create a new named branch; we do not usually attempt to automatically merge new versions of files, only to make them available.

The "cleaner" utility which removes old versions operates in userspace, via a special interface. This also allows an export utility to remove old versions once they've been backed up to external storage.

All indexes are temporal. By default, index searches will return only the current files on the current branch, but historic and range searches are also possible. In particular, the path name index is used to answer "what versions of /foo/bar/bat are available" and the inode index is used for "what versions of inode X are available"? Temporal index queries are performed via /.index@timespec. XXX: when we define a means to perform range queries via the POSIX API, it will presumably apply to temporal queries.

Notes:
* Branch tags need only be unique in their scope (directory/file).
  Traversal of multiple tags as: branch1/branch2/branch3.time?
* maybe use a hard link to a specific version to give it a branch name?
* how do you delete or compress a version tree to free up its space?
  can you just use 'rm'?  what's the interface the cleaner uses?
* what happens when I make a new tree 'A', modify it to create 'B', then
  "name" the 'B' revision as tree "C", modify B to make BB and C to make CC,
  and then delete tree 'A'?  Ideally the tree A->B->BB path would be deleted,
  while the C->CC path would be untouched (and note that B==C).
  ANSWER: branch roots must be landmarked if they are to be preserved.
* Each version only retains a link to its parent.  We never store the
  full version paths.  We use order-maintenance data structures to
  ensure that we can efficiently identify whether a version is on the
  path from 'root' to 'current'.
* Export filesystem version as a path-valued attribute? can
  we integrate this with a file-oriented version as well?  Further,
  we'd like Sugar to be able to maintain its own version information,
  and index this appropriately.

Metadata fields of inodes

Most of these names are taken from freedesktop's Shared File Metadata spec, although those names conflict with the recommendations in freedesktop's Common Extended Attributes spec. We're not going to take sides in that fight yet; these names are subject to change.

File.Name
POSIX name(s) excluding path; set-valued.
File.Path
POSIX path(s); path- and set-valued
File.Permissions
string in unix format; e.g. "-rw-r--r--" (We may diverge from spec and use an integer instead.)
File.Modified
Last modified datetime.
File.Accessed
Last access datetime. We will probably support a relatime mount option.
File.Owner
unix owner of the file, as an integer. (Not in Shared File Metadata spec.)
File.Group
unix group of the file, as an integer. (Not in Shared File Metadata spec.)
File.Created
Creation datetime. (Not in Shared File Metadata spec.)

The Shared File Metadata spec allows timezone specifications and separators for datetimes. In olpcfs, all datetimes are in 16-character basic format UTC ISO8601, as in the Firmware Key and Signature Formats. We may elect to depart slightly from the firmware format and use a fixed number of fractional second digits (say, 3).
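
For reference, producing the 16-character timestamp in Python (without the possible fractional-second extension):

from datetime import datetime, timezone

def olpcfs_time(dt):
    return dt.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")

print(olpcfs_time(datetime(2008, 3, 19, 15, 41, tzinfo=timezone.utc)))
# -> 20080319T154100Z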

Metadata fields of files

These fields are automatically created for files written with the standard POSIX API.

File.Size
numeric
File.Content
actual content of the file. Although the freedesktop spec recommends that this be the 'plain text' version suitable for indexers, we return the raw contents (but see the discussion of the 'markup' attribute below).
File.XUID
content-based unique identifier, conformant with XAM spec.

TODO: versioning-related metadata.

Suggested application-level metadata fields

These fields are not created without explicit intervention, but are used to support application-level functionality.

Doc.Markup
This idea is borrowed from BeOS: some formatted text is stored as plain text in the 'content' of a file. Markup information (i.e., lists of spanned characters for HTML markup tags) is stored in the 'markup' stream of the file. This allows easy editing and searching of formatted text, and allows formatted text to be used directly as source code: the compiler or Python runtime only sees the plain text content. This mechanism can also be used to localize portions of source code.
Doc.Keywords
set-valued tags
Doc.Title
title of document.
Doc.RootXUID
XUID of the "base version" of this document. Doing a descendents search on RootXUID will return the complete version history of this document. XXX integrate with filesystem-scope versioning.
Doc.Action
this timestamp-valued attribute is used to group multiple documents into the activity-centric journal view. Range queries on this attribute will sort the entries appropriately. XXX: perhaps this should be combined with File.Modified or Doc.Created to provide better integration with files created via the POSIX API — otherwise the journal would have to explicitly search for files in ~/Documents without a Doc.Action tag and synthesize an appropriate one.

Implementation

Efficient implementation of olpcfs revolves around only two fundamental data structures: fully-persistent B+trees, and external order-maintenance trees. Nodes in path-valued attributes are internally numbered in preorder and postorder, and these numberings are maintained using an external version of Bender, Cole, Demaine, Farach-Colton, and Zito's algorithm. Ancestor and descendant testing and enumeration can efficiently be performed using the node numberings.
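
The numbering trick is worth spelling out: x is an ancestor of y exactly when pre(x) < pre(y) and post(x) > post(y). This sketch computes the numbers with a plain recursive walk; olpcfs would maintain them incrementally with the order-maintenance structure instead, but the ancestor test is the same:

def number_tree(children, root):
    """children: {node: [child, ...]} adjacency; returns pre/post numbers."""
    pre, post, clock = {}, {}, [0]
    def visit(n):
        pre[n] = clock[0]; clock[0] += 1
        for c in children.get(n, []):
            visit(c)
        post[n] = clock[0]; clock[0] += 1
    visit(root)
    return pre, post

def is_ancestor(pre, post, x, y):
    return pre[x] < pre[y] and post[x] > post[y]

children = {"/": ["/a", "/b"], "/a": ["/a/x"]}
pre, post = number_tree(children, "/")
assert is_ancestor(pre, post, "/", "/a/x")
assert not is_ancestor(pre, post, "/b", "/a/x")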

XXX: Flesh this out.

* Delta-encoding can yield as much space savings as compression.  Use this
  in underlying filesystem, or in user-space gc?
   http://www.usenix.org/events/usenix03/tech/douglis.html
   http://www.usenix.org/publications/library/proceedings/sf94/manber.1.html
   http://dx.doi.org/10.1007/3-540-45123-4
  Store Rabin fingerprints in metadata when files are written, to allow efficient
  identification of delta-encoding targets during filesystem version gc.

Proof-of-concept: olpcfs1

The olpcfs1 proof-of-concept implementation uses Berkeley DB's B-Trees, and the order-maintenance data structure for versions is computed incrementally on-demand. Version indices are added to the B-Tree keys to implement full persistence, although performance with large numbers of versions of a specific file is not as good as could be obtained with a specialized fully-persistent B-Tree implementation.
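
The "version indices in the B-Tree keys" trick amounts to keying records by (path, version) so that a point-in-time lookup becomes a single range probe. In this sketch a sorted list stands in for Berkeley DB's B-Tree:

import bisect

entries = sorted([
    (b"/foo", 100, b"v1"),
    (b"/foo", 200, b"v2"),
    (b"/foo", 350, b"v3"),
])

def lookup(path, t):
    """Latest version of `path` with timestamp <= t, else None."""
    i = bisect.bisect_right(entries, (path, t, b"\xff"))
    if i and entries[i - 1][0] == path:
        return entries[i - 1][2]
    return None

assert lookup(b"/foo", 250) == b"v2"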

Implementations on flash memory

Implementations tuned for flash memory can use Adaptive Packed Memory Arrays for efficient CAS. This is an open research topic, but there is some evidence that a high-performance scalable flash-based filesystem can more profitably be implemented using cache-oblivious B-Trees and APMAs, rather than more "traditional" filesystem structures.

Implementation on removable media

An important practical concern for our datastore is file transfer to removable media devices. Removable media tend to be removed at the worst possible time, and our first-generation datastore suffered frequent data and index corruption when used on removable media devices.

This design proposes a specialized implementation of olpcfs for removable devices. This allows the Journal interface to the filesystem on these devices to be unchanged, while trading off index performance and versioning in exchange for reliable fault-tolerant storage.

In particular, an olpcfs implementation for removable devices would have the following features:

No versioning.
The API related to versioning would remain the same (there would still be @ directories, for example), but every file would report that it had only a single version, with an appropriate timestamp.
No on-disk index.
Maintaining index consistency on removable devices is a real challenge, especially since the filesystem may be mounted and modified in arbitrary ways on non-XO machines. Again, the index API will remain the same, but the search results will be assembled by exhaustive search through the filesystem contents, not by consulting an on-disk data structure.
  • Caching searches or maintaining a partial in-memory index may improve performance, if that is necessary.
  • Alternatively, we can use on-disk search caches and rigorously sanity-check and verify their results before reporting them. This would allow quick reporting of the first matches for a search, although the complete exhaustive search would still need to be performed before we can be certain that all matches have been reported (since matching files may have been added to the disk without updating the cache).
Files are stored in their canonical locations on-disk.
Implementations which store multiple versions will typically store the files in some form of Content-Addressable data structure in order to gracefully handle the reality of multiple file contents with the same path. For removable devices, we want the files to be browsable even if the device is mounted using a legacy file system (like FAT or FAT32), so we store the files at their "real" names -- a task made easier by refusing to store multiple versions.
Metadata is stored separate from the file.
As with Windows, Mac OS X, and OS/2, metadata not supported by VFAT is stored in a hidden directory in the root. We aggressively validate this data, so that changes "behind the back" of olpcfs do not break the filesystem; a sketch of this validation follows below.

This simplified olpcfs allows us to use the same datastore representation and journal code for removable media, while addressing the fragility and compatibility issues these devices present.
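
A sketch of the validation rule mentioned above: a sidecar record (whose layout is assumed here, not specified by this design) is trusted only while the file's size and mtime still match, and anything stale is silently rebuilt rather than allowed to break the filesystem:

import json, os

def load_metadata(path, sidecar=".olpcfs"):
    record_file = os.path.join(sidecar, path + ".json")
    st = os.stat(path)
    try:
        with open(record_file) as f:
            record = json.load(f)
        if record["size"] == st.st_size and record["mtime"] == st.st_mtime:
            return record              # record still describes this file
    except (OSError, ValueError, KeyError):
        pass                           # missing or corrupt record
    # The file changed behind our back: rebuild a minimal record.
    record = {"size": st.st_size, "mtime": st.st_mtime}
    os.makedirs(sidecar, exist_ok=True)
    with open(record_file, "w") as f:
        json.dump(record, f)
    return record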

Absent content and merging remote stores

THIS SECTION IS TENTATIVE.

Absent content is not present in the directory structure, although it may appear in indexes. We don't try to make the local store completely network-transparent; management of the local store is done by the application, not by the filesystem.

Although we have treated files as named objects residing in a directory above, in reality objects live in a flat space and are simply named by various directory objects residing in the system. Directories are in fact a special type of index -- special, because the indexed data (the 'path name') does not in fact exist as an independent property of the object, since a given file can have many "path names". Path names are sorted first by depth, then by full path name, so that members of a directory can be efficiently identified with a range search on the path index.
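
A sketch of that range search: with keys sorted by (depth, full path), the immediate members of a directory occupy one contiguous run of keys. A sorted list stands in for the on-disk path index:

import bisect

paths = sorted((p.count("/"), p) for p in [
    "/home", "/home/olpc", "/home/olpc/Documents",
    "/home/olpc/Documents/bee", "/home/olpc/Documents/foo",
    "/home/olpc/quux",
])

def list_dir(d):
    """Immediate members of directory d, via one contiguous key range."""
    depth = d.count("/") + 1
    lo = bisect.bisect_left(paths, (depth, d + "/"))
    hi = bisect.bisect_right(paths, (depth, d + "/\xff"))
    return [p for _, p in paths[lo:hi]]

print(list_dir("/home/olpc"))
# ['/home/olpc/Documents', '/home/olpc/quux']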

Merging remote stores: merging happens at the index level, not the directory level. When content is imported into the local filesystem, it may appear without naming the new content in a directory (available only via non-temporal-scoped index queries). It can also be explicitly imported into the directory structure either as a new file, a new tagged version of an existing file, or by replacing the current version of the file. This choice is left to application/user level. XXX: how is import exposed to user?

The content-addressable store is the foundation for this functionality: it provides a concise unique name (the xuid) for the absent content, so that it can be referred to by local data structures and recognized when found. When an SD card is inserted with a collection of objects, existing references on the laptop can be resolved with content from the SD card.

Related work

The Be File System also uses a directory structure to list extended attributes, although the directory inode number was hidden from the user. It also introduced the idea that metadata indices and directories were both just B+Trees on disk, and could use the same representation. As with the attribute "directory", BFS allocated an "index directory" inode, with the same structure as other directories, but hid this inode from the user. In the spirit of Plan 9 ("everything is a file") we've chosen to expose these directories to the user for direct access and traversal. The idea of direct read/write access to attributes as streams comes from the XAM specification (which likely inherited it from Centera or another pre-existing CAS system).

BFS had difficulties with duplicate keys in their B+Trees, which were worked around with various ad-hoc measures. In our system, we use the XUID as a suffix to any key type to break ties, avoiding the problem.

The Linux NTFS driver exports "alternate data streams" using a trailing colon syntax, for example: "filename:stream". A special tool is necessary to list the streams available for a given file.

The ext3cow filesystem exports versions using a trailing at-sign syntax. The version of a file or directory available at 2008-03-12 16:47:34 UTC (1,205,340,454 seconds since the epoch) can be accessed with "filename@1205340454". There is a nice Time-Travelling File Manager based on this interface, and the whitepaper contains a useful discussion of the application-level versus user-visible semantics of this naming scheme (starting on page 6). Our suffix syntax is compatible, so the time-travelling file manager ought to work unmodified on olpcfs.

There are three specifications for file metadata: freedesktop's Shared File Metadata spec; freedesktop's Common Extended Attributes spec; and Apple's Spotlight spec. The Shared File Metadata spec is closest to our goals, but it defines a comma-delimited "array of string" in the place of our null-delimited set and slash-delimited path attributes.

TierStore is a distributed filesystem for disconnected networks. It presupposes a shared namespace for all nodes, which seems infeasible for journal content. Its version numbering scheme is very similar to the one we propose for Journal content, however. TierStore's use of Delay-Tolerant Networking is admirable, and its web and email distribution networks seem worthy of emulation. We prefer to expose "write conflicts" as divergence in the version tree, rather than as problems necessitating manual resolution. TierStore's current code base seems to have gone dormant.

Related work on security: Plutus has solid privacy and group management features we'd like to borrow for olpcfs. Pastiche borrowed from FarSite the use of convergent encryption to allow sharing of encrypted content blocks with minimal compromise in security: only users which already possess the content can verify the encrypted block's contents; a property already true of hash-based CAS. This should be straightforward to implement when private blocks are exported from the system.

The filesystem testing tools listed at http://ltp.sourceforge.net/tooltable.php should make developing a robust implementation significantly easier. The Python FUSE binding should make prototyping pleasant.

Open Questions

  • How should complex queries be exposed via the POSIX API? Only via ioctl/xattr (as described above), by read/write to a special 'query' file, via magic directories, or what?
  • Synchronization of object sets with a remote store is an application-level task. Some ideas on efficient synchronization algorithms have been described, but we do not have a formal proof of their efficiency (and they may well fail to perform efficiently for certain inputs).
  • Can an Adaptive Packed Memory Array be adapted to efficiently handle variable-length keys? If so, it may yield an efficient CAS filesystem for MTD (flash) devices.

Credits

Author
C. Scott Ananian (cscott a t laptop.org)

Notes

Notes and latest ideas:

* files in the search result are inodes?  can directly access xuid
  from inode.
* Hence mime-type is a
  special two-valued path key!  mime-subtype is a set key.
* further, there are special 'direntry' objects in the system, with a
  path-valued 'path' entry and a numeric 'inode' property.  So the root
  filesystem is actually just the index for 'path'.
* How do "set-valued" and "path-valued" attributes interact with
  "unique" attributes?  Paths want to be set-valued, path-valued,
  *and* unique.
* How do unique attributes interact with versioning?  Only one
  instance of the given attribute in a given version, presumably.
  But queries which don't have a version context won't necessarily
  have the 'unique' magic applied at the leaves.  Maybe we should
  just generalize unique to say that *any* query that would otherwise
  result in a 1-item directory instead results in the file itself.
* http://pages.cs.wisc.edu/~driscoll/fuse-nt.pdf presents a summary of
  the difficulties of implementing FUSE on NT (thanks, NoiseEHC)