Olpcfs: Difference between revisions

From OLPC
Jump to navigation Jump to search
(Statement of goals. Lots of content still to be written.)
 
(Basic POSIX semantics; historical notes.)
Line 90: Line 90:
and the experimental systems CVFS, VersionFS, Wayback, and
and the experimental systems CVFS, VersionFS, Wayback, and
PersiFS have demonstrated that this goal is achievable.
PersiFS have demonstrated that this goal is achievable.
The on-disk costs of continuous versioning will be mitigated by
aggressively pruning unneeded versions at application-level.


=== Fully-persistent versions ===
=== Fully-persistent versions ===
Line 114: Line 116:
unnecessary for our system. (An unpublished paper by Demaine,
unnecessary for our system. (An unpublished paper by Demaine,
Langerman, and Price contains a good discussion of these varieties of
Langerman, and Price contains a good discussion of these varieties of
persistence, with references; when the paper is accepted hopefully
persistence, with references and an several examples of functional and
confluently-persistent data structures; when the paper is accepted
I'll be able to cite it here.)
hopefully I'll be able to cite it here.)


== POSIX API ==
== POSIX API ==


In this section we'll describe how files, metadata, and indices are
Notes:
exposed through the standard POSIX API. To some degree, this is
Export both an 'extended attribute' as well as a pseudo-directory
independent of actual implementation: one can imagine several
view of file metadata; this allows efficient backup/recovery with
different on-disk representations that export the given API, with
the standard zip and tar tools. The pseudo-directory view enables
different costs and properties. We expect that the first
browsing with ls and friends.
implementations will use
[http://en.wikipedia.org/wiki/Filesystem_in_Userspace FUSE] on top of
an on-disk representation substantially similar to the API exported.
An intermediate representation may build on a CAS backend to perform
the actual storage; an advanced implementation may eventually achieve
high performance via direct disk access. Regardless, the behavior
described in this section constitutes the principal API used by legacy
applications and the Journal to access content.

=== Extended Attributes ===
The basic primitive in olpcfs is a ''file'', with a rich set of
metadata. The metadata includes typical POSIX attributes such as
creation time, modification time, size, type, and permission bits.
As in the [http://en.wikipedia.org/wiki/Be_File_System Be File
System], however, application-specific tagged data can also be added
to a file. There is a reasonable (256-character UTF-8) limit on tag
size, but no limit to the amount of data stored in a tag.
We will discuss uses for tags in the
''[[#Suggested metadata fields|Suggested metadata fields]]'' section
below. Tagged metadata are indexed, as in BFS and XAM. For
uniformity of access, the actual contents of a file are stored as the
value of a 'content' tag. (This notion of arbitrary-sized metadata
could be considered similar to the notion of ''alternate data streams''
in [http://en.wikipedia.org/wiki/Ntfs NTFS] or ''resource forks'' in
[http://en.wikipedia.org/wiki/Hierarchical_File_System HFS].
Wikipedia contains a
[http://en.wikipedia.org/wiki/Fork_%28filesystem%29 lengthy discussion].)

We export our metadata in two different formats. The first is for
human browsing and existing filesystem exploration tools. We will
present it by example. Let us say that there exists a
file 'foo' with content 'bar\n' in the current directory, and that there
is an extended attribute 'mime-type' on 'foo' with content
'text/plain'. Then:

$ ls
foo
$ cat foo
bar
$ ls foo/
ctime mtime size content mime-type xuid parent
$ cat foo/content
bar
$ cat foo/mime-type
text/plain
$ cat foo/size
4
$ cat foo/xuid
deadbeef

That is, a 'file' is treated as just a special type of directory, with
internal nodes representing the various tagged metadata. For
compatibility with legacy software, opening the 'file' without a
trailing slash on the name returns you the value of its 'content' tag.
The mode bits for 'foo' report it as a standard file, so the only real
'hack' here is the ability to cd into a 'file' and list its
attributes. The uniform treatment of "alternate data streams" as
files means that one could write a utility to combine markup and text
(see below) and invoke it as:
$ ./combine-markup foo/content foo/markup > marked-up.html
$ echo 'text/html' > marked-up.html/mime-type

The metadata attributes are also exported as standard POSIX xattrs,
which allows efficient backup/recovery with the standard zip, tar, and rsync
tools (which support xattrs). Unfortunately, POSIX xattr support is not
well integrated into the standard Unix toolkit. There are no standard
tools (other than archiving tools) for manipulating xattrs, although
there is
[http://www.freedesktop.org/wiki/CommonExtendedAttributes a partial standard]
defining naming conventions and common tags for better
interoperability. Further, POSIX does not export a read/write API for
metadata, which makes it impossible to (say) use the 'grep' tool over
metadata content.

=== Indexing ===
At the root, an olpcfs filesystem exports indexes of metadata, which
appear as simple directories:

$ ls /index/size/
4
56
56
123
592
$ ls /index/size/4/
ctime mtime size content mime-type xuid parent
$ cat foo/content
bar

Two difficulties: duplicate keys appear as duplicate filenames in a
directory. The interface exported by the kernel includes an inode
number, and the entries are distinguishable. '''FIXME'''

=== Accessing objects by XUID ===
Although we have treated files as named objects residing in a
directory above, in reality objects live in a flat space and are
simply named by various directory objects residing in the system.
Directories are in fact a special type of index -- special, because
the indexed data (the 'path name') does not in fact exist as an
independent property of the object, since a given file can have many
'path names'. '''FIXME'''


===Suggested metadata fields===

; markup
: This idea is borrowed from BeOS: some formatted text is stored as plain/text in the 'content' of a file. Markup information (ie, lists of spanned characters for HTML markup tags) is stored in the 'markup' stream of the file. This allows easy editing and searching of formatted text, and allows formatted text to be used directly as source code: the compiler or python runtime only sees the plain text content.


== Implementation ==
== Implementation ==
Line 132: Line 242:


== Merging remote stores ==
== Merging remote stores ==

== Historical note ==
The Be File System also used a directory structure to list extended
attributes, although the directory inode number was hidden from the
user. It also introduced the idea that metadata indices and directories were
both just B+Trees on disk, and could use the same representation. As
with the attribute "directory", BFS allocated an "index directory"
inode, with the same structure as other directories, but hid this
inode from the user. In the spirit of Plan 9 ("everything is a file")
we've chosen to expose these directories to the user for direct access
and traversal. The idea of direct read/write access to attributes as
streams comes from the XAM specification (which no doubt inherited it
from Centera or another other of the pre-existing CAS systems which
influenced XAM's design).

BFS had difficulties with duplicate keys in their B+Trees, which were
worked-around with various ad-hoc measures. In our system, we can use
the XUID as a suffix to any key type to break ties, avoiding the
problem completely.

The [http://linux.die.net/man/8/ntfs-3g Linux NTFS driver] exports
"alternate data streams" using a trailing colon syntax, for example:
"filename:stream". A special tool is necessary to list the streams
available for a given file.

The [http://en.wikipedia.org/wiki/Ext3cow ext3cow] filesystem exports
versions using a trailing at-sign syntax. The version of a file or
directory available at 2008-03-12 16:47:34 UTC (1,205,340,454 seconds
since the epoch) can be accessed with "filename@1205340454".

== Open Questions ==
* How should complex queries by exposed via the POSIX API? Only via
ioctl, or by read/write to a special 'query' file, or via magic
directories, or what?

Revision as of 16:51, 12 March 2008

This document presents a design for a new filesystem to support:

  1. OLPC's journal and bulletin board UI abstractions (see also the second-generation journal design)
  2. Bitfrost P_SF_RUN and P_SF_CORE protections (current implementation)

In our first-generation software, the journal was based with the "datastore" and Bitfrost protections were intended to be provided with a copy-on-write filesystem, originally based on vserver. This new filesystem, which I am calling "olpcfs", attempts to unify these implementations based on a powerful distributed fully-persistent versioned filesystem abstraction.

Design Goals

In this section we enumerate a number of the design goals for the olpcfs filesystem, based on our experiences with the first-generation datastore.

POSIX semantics

Our first-generation datastore was a bespoke system primarily accessible through a Python API. The on-disk representation was opaque, and users frequently expressed frustration at their inability to make journal contents interact with non-XO systems. For example, simply locating a desired file in the datastore was difficult, as files were stored on disk named only by content hashes. In addition, making ordinary Linux applications work on the XO was complicated by the fact that saving and loading objects from the datastore required the use of our unique API.

A primary design goal for olpcfs is thus to have a sane and reasonable POSIX semantics, so that unmodified applications using the POSIX APIs (ie, linux, Windows, MacOS, BeOS,...) can interact sensibly with the journal. We do not need to export every functionality of olpcfs over POSIX APIs, but we strive to. (An ideological tip-of-the-hat here goes to Plan 9, which taught the virtues of uniform access methods.)

In particular, the journal should be able to direct existing applications to open a document from the datastore by passing in a POSIX pathname to the document contents, and applications should be able to save a file to an appropriate place (say, '~/Documents') and have the new content naturally show up in the Journal. These documents may not have the rich metadata associated with them which native XO applications can provide, but the basic journal features should "just work".

Content-Addressable

Cloud storage is emerging as an elegant method of providing globally-accessible document stores, which are key to providing the uniform location-independent model of sharing Sugar aims to provide. When documents are stored into a global pool, naming each uniquely and consistently becomes a concern. Content-addressable storage (CAS) provides a solution to these issues by using (a function of) the content itself as a name. There are many implementations of CAS floating around; we've been most influenced by the emerging XAM standard API, but Sun's Project Honeycomb, Apache's Jackrabbit and (again) Plan 9's Venti are all possible backing storage for a CAS-based olpcfs filesystem.

Continuous versioning

Many versioned filesystems (for example, Fossil, ext3cow, WAFL, and AFS) use snapshot-based versioning. Rather than store every version of every file, snapshots are taken at regular intervals, and only the versions present at the time of a snapshot are accessible.

This is not consistent with the user interaction model underlying the XO, which strives to allow *every* action to be revokable, whether or not a snapshot intervened, to support painless exploration and discovery. It should be impossible for a child to "break" his machine in a way which is not easy fixed — or which requires losing all work since the last snapshot in order to fix.

Further, snapshots are usually taken of a full filesystem. Instead, we want fine-grained browsing of revisions made to individual documents. Ultimately we would like to empower applications to be able to intelligently display and merge revisions; fine-grained storage of the modification chain is an important enabler of this capability.

Recent work has shown that continuous versioning can be implemented at reasonable cost. Apple's Time Machine, and the experimental systems CVFS, VersionFS, Wayback, and PersiFS have demonstrated that this goal is achievable. The on-disk costs of continuous versioning will be mitigated by aggressively pruning unneeded versions at application-level.

Fully-persistent versions

Many versioned filesystems are partially persistent; that is, they support modifications only to the "most recent" version of a file, although all versions are available for read access. Fully-persistent data structures also allow modification of any version. This is obviously necessary to enable distributed editing, since independent parties are not guaranteed to be able to synchronize their edits. The "P_SF_RUN" Bitfrost protection and OLPC update mechanism also require full persistence: new base system images are independent roots for modification, and a user can switch back and forth from one base system image to the other, potentially making (independent) changes to each.

As an aside, other desirable properties may include data structures which are functional or confluently persistent. Functional data structures are immutable after creation, which is a nice property for the CAS backend but not strictly necessary on local disk. Confluently persistent data structures allows a subtree from version A to be efficiently copied into version B. Confluent persistence could be desirable to allow changes to one base system image to be applied to another; but hyper-efficiency for this operation is probably unnecessary for our system. (An unpublished paper by Demaine, Langerman, and Price contains a good discussion of these varieties of persistence, with references and an several examples of functional and confluently-persistent data structures; when the paper is accepted hopefully I'll be able to cite it here.)

POSIX API

In this section we'll describe how files, metadata, and indices are exposed through the standard POSIX API. To some degree, this is independent of actual implementation: one can imagine several different on-disk representations that export the given API, with different costs and properties. We expect that the first implementations will use FUSE on top of an on-disk representation substantially similar to the API exported. An intermediate representation may build on a CAS backend to perform the actual storage; an advanced implementation may eventually achieve high performance via direct disk access. Regardless, the behavior described in this section constitutes the principal API used by legacy applications and the Journal to access content.

Extended Attributes

The basic primitive in olpcfs is a file, with a rich set of metadata. The metadata includes typical POSIX attributes such as creation time, modification time, size, type, and permission bits. As in the [http://en.wikipedia.org/wiki/Be_File_System Be File System], however, application-specific tagged data can also be added to a file. There is a reasonable (256-character UTF-8) limit on tag size, but no limit to the amount of data stored in a tag. We will discuss uses for tags in the Suggested metadata fields section below. Tagged metadata are indexed, as in BFS and XAM. For uniformity of access, the actual contents of a file are stored as the value of a 'content' tag. (This notion of arbitrary-sized metadata could be considered similar to the notion of alternate data streams in NTFS or resource forks in HFS. Wikipedia contains a lengthy discussion.)

We export our metadata in two different formats. The first is for human browsing and existing filesystem exploration tools. We will present it by example. Let us say that there exists a file 'foo' with content 'bar\n' in the current directory, and that there is an extended attribute 'mime-type' on 'foo' with content 'text/plain'. Then:

$ ls
foo
$ cat foo
bar
$ ls foo/
ctime mtime size content mime-type xuid parent
$ cat foo/content
bar
$ cat foo/mime-type
text/plain
$ cat foo/size
4
$ cat foo/xuid
deadbeef

That is, a 'file' is treated as just a special type of directory, with internal nodes representing the various tagged metadata. For compatibility with legacy software, opening the 'file' without a trailing slash on the name returns you the value of its 'content' tag. The mode bits for 'foo' report it as a standard file, so the only real 'hack' here is the ability to cd into a 'file' and list its attributes. The uniform treatment of "alternate data streams" as files means that one could write a utility to combine markup and text (see below) and invoke it as:

$ ./combine-markup foo/content foo/markup > marked-up.html
$ echo 'text/html' > marked-up.html/mime-type

The metadata attributes are also exported as standard POSIX xattrs, which allows efficient backup/recovery with the standard zip, tar, and rsync tools (which support xattrs). Unfortunately, POSIX xattr support is not well integrated into the standard Unix toolkit. There are no standard tools (other than archiving tools) for manipulating xattrs, although there is a partial standard defining naming conventions and common tags for better interoperability. Further, POSIX does not export a read/write API for metadata, which makes it impossible to (say) use the 'grep' tool over metadata content.

Indexing

At the root, an olpcfs filesystem exports indexes of metadata, which appear as simple directories:

$ ls /index/size/
4
56
56
123
592
$ ls /index/size/4/
ctime mtime size content mime-type xuid parent
$ cat foo/content
bar

Two difficulties: duplicate keys appear as duplicate filenames in a directory. The interface exported by the kernel includes an inode number, and the entries are distinguishable. FIXME

Accessing objects by XUID

Although we have treated files as named objects residing in a directory above, in reality objects live in a flat space and are simply named by various directory objects residing in the system. Directories are in fact a special type of index -- special, because the indexed data (the 'path name') does not in fact exist as an independent property of the object, since a given file can have many 'path names'. FIXME


Suggested metadata fields

markup
This idea is borrowed from BeOS: some formatted text is stored as plain/text in the 'content' of a file. Markup information (ie, lists of spanned characters for HTML markup tags) is stored in the 'markup' stream of the file. This allows easy editing and searching of formatted text, and allows formatted text to be used directly as source code: the compiler or python runtime only sees the plain text content.

Implementation

Indexing

Absent content

Merging remote stores

Historical note

The Be File System also used a directory structure to list extended attributes, although the directory inode number was hidden from the user. It also introduced the idea that metadata indices and directories were both just B+Trees on disk, and could use the same representation. As with the attribute "directory", BFS allocated an "index directory" inode, with the same structure as other directories, but hid this inode from the user. In the spirit of Plan 9 ("everything is a file") we've chosen to expose these directories to the user for direct access and traversal. The idea of direct read/write access to attributes as streams comes from the XAM specification (which no doubt inherited it from Centera or another other of the pre-existing CAS systems which influenced XAM's design).

BFS had difficulties with duplicate keys in their B+Trees, which were worked-around with various ad-hoc measures. In our system, we can use the XUID as a suffix to any key type to break ties, avoiding the problem completely.

The Linux NTFS driver exports "alternate data streams" using a trailing colon syntax, for example: "filename:stream". A special tool is necessary to list the streams available for a given file.

The ext3cow filesystem exports versions using a trailing at-sign syntax. The version of a file or directory available at 2008-03-12 16:47:34 UTC (1,205,340,454 seconds since the epoch) can be accessed with "filename@1205340454".

Open Questions

  • How should complex queries by exposed via the POSIX API? Only via

ioctl, or by read/write to a special 'query' file, or via magic directories, or what?