Olpcfs: Difference between revisions
(Basic POSIX semantics; historical notes.) |
(More work, mostly on indexes.) |
||
Line 145: | Line 145: | ||
size, but no limit to the amount of data stored in a tag. |
size, but no limit to the amount of data stored in a tag. |
||
We will discuss uses for tags in the |
We will discuss uses for tags in the |
||
''[[#Suggested metadata fields|Suggested metadata fields]]'' section |
''[[#Suggested application-level metadata fields|Suggested metadata fields]]'' section |
||
below. Tagged metadata are indexed, as in BFS and XAM. For |
below. Tagged metadata are indexed, as in BFS and XAM. For |
||
uniformity of access, the actual contents of a file are stored as the |
uniformity of access, the actual contents of a file are stored as the |
||
Line 199: | Line 199: | ||
interoperability. Further, POSIX does not export a read/write API for |
interoperability. Further, POSIX does not export a read/write API for |
||
metadata, which makes it impossible to (say) use the 'grep' tool over |
metadata, which makes it impossible to (say) use the 'grep' tool over |
||
metadata content. |
metadata content. Our 'file as directory' hack addresses these infelicities. |
||
=== Indexing === |
=== Indexing === |
||
Line 205: | Line 205: | ||
appear as simple directories: |
appear as simple directories: |
||
$ ls /index/size/ |
$ ls /.index/size/ |
||
123 4 56 592 |
|||
4 |
|||
$ ls /.index/size/4/ |
|||
56 |
|||
cafebabe deadbeef |
|||
56 |
|||
$ ls /.index/size/4/cafebabe |
|||
123 |
|||
592 |
|||
$ ls /index/size/4/ |
|||
ctime mtime size content mime-type xuid parent |
ctime mtime size content mime-type xuid parent |
||
$ cat |
$ cat /.index/size/4/cafebabe/content |
||
bar |
bar |
||
Three basic issues arise with this fundamental idea: |
|||
Two difficulties: duplicate keys appear as duplicate filenames in a |
|||
* How are duplicate keys handled? |
|||
directory. The interface exported by the kernel includes an inode |
|||
* How are "strange" attribute values handled? |
|||
number, and the entries are distinguishable. '''FIXME''' |
|||
We address the presence of duplicate keys by having the leaf of each |
|||
=== Accessing objects by XUID === |
|||
index be a directory. The contents of that directory are named by |
|||
Although we have treated files as named objects residing in a |
|||
[[#Accessing objects by XUID|xuid]] |
|||
directory above, in reality objects live in a flat space and are |
|||
and enumerate the different files on the system which match the |
|||
simply named by various directory objects residing in the system. |
|||
query. When an index is created, an attribute can be tagged as |
|||
Directories are in fact a special type of index -- special, because |
|||
''unique'', in which case an attempt to create a duplicate value will |
|||
the indexed data (the 'path name') does not in fact exist as an |
|||
return EEXIST. The leaf nodes for unique indexes are files, rather |
|||
independent property of the object, since a given file can have many |
|||
than (1-entry) directories. Combined with ''path-valued'' attributes, |
|||
'path names'. '''FIXME''' |
|||
the standard directory view of a filesystem is |
|||
([[#Special attributes|more or less]]) just the index on the |
|||
"path" attribute. |
|||
The second issue is exemplified by typical values of the "mime-type" |
|||
attribute: "text/plain", "text/html", etc. Since leaf nodes of the |
|||
index are named by the attribute value, how do we handle attributes |
|||
with illegal filename characters (like '/')? And what would we do if |
|||
the user attempted to index the "content" attribute, which may contain |
|||
gigabytes of data for certain files? |
|||
One solution would be to display files with 'ls' that one couldn't |
|||
===Suggested metadata fields=== |
|||
actually access with the POSIX API, or to hide "bad" entries, or |
|||
to only export indices that didn't exhibit these problems. These |
|||
solutions result in irregular indices which aren't actually usable for |
|||
anything except browsing; any "real" queries would have to be |
|||
performed with an alternate API. |
|||
We have chosen instead to make the index as close to fully-functional |
|||
as possible, even if this means that manual browsing is a little more |
|||
difficult. We URL-encode problematic characters in attribute values, |
|||
and truncate them to the first 128 bytes. This results in a |
|||
consistent programmatic API, even if it makes directory listings a |
|||
little uglier. A command-line tool to URI-encode an input string |
|||
allows shell tools to directly access the index, even if the desired |
|||
key is "funny". Only searches through large attributes require use of |
|||
a non-POSIX query API. |
|||
==== Query API ==== |
|||
A full query API would allow arbitrary boolean combinations of key |
|||
terms, as well as range and subset queries. This document won't |
|||
attempt to nail this interface down precisely, since the goal is a robustly |
|||
functional POSIX API, but a BFS or SQLish query delivered by ioctl() |
|||
or getxattr/setxattr call is a reasonable basis for discussion. |
|||
(A FUSE implementation of olpcfs |
|||
[http://fuse.sourceforge.net/wiki/index.php/FAQ#Why_doesnx27.t_FUSE_forward_ioctlx28.x29._calls_to_the_filesystemx3f. wouldn't permit an ioctl-based API].) |
|||
For example: |
|||
<pre><nowiki> |
|||
char querytag[TAGLEN]; |
|||
getxattr("/.query", |
|||
"SELECT .xuid WHERE mime-type = 'text/html' OR mime-type = 'text/plain'", |
|||
querytag, TAGLEN); |
|||
while (1) { |
|||
char xuid[XUIDLEN], pathbuf[PATHLEN]; |
|||
struct stat st; |
|||
if (!getxattr("/.query-result", querytag, xuid, XUIDLEN)) |
|||
break; /* no more results */ |
|||
/* do something with the file, for example: */ |
|||
snprintf(pathbuf, PATHLEN, "/.index/xuid/%.*s", XUIDLEN, xuid); |
|||
stat(pathbuf, &st); |
|||
} |
|||
</nowiki></pre> |
|||
Helper functions which handled mount points properly, etc, could be |
|||
added to a library to make this interface easier to use. |
|||
"Live queries" (another debt to BFS) could also be supported if the |
|||
getxattr calls above are allowed to block until additional matches are |
|||
found, and to return a special "no longer a match" result for XUIDs |
|||
modified since the start of a query. |
|||
If a minimal API is desired, efficient compound searches can be |
|||
performed entirely in userland given only a means for performing a |
|||
range query on a index directory. An API could be crafted |
|||
around this mechanism, instead of the direct statement query interface |
|||
shown above. All query processing could then happen in an application |
|||
library, rather than inside the filesystem. The trade-off would be that |
|||
live queries and atomicity guarantees would be much harder to guarantee. |
|||
==== Multilevel indexes ==== |
|||
The POSIX view of indices we've shown so far allows for simple |
|||
exact-match queries on a single term. We can extend the power of this |
|||
interface to allow exact match queries on multiple ANDed terms by |
|||
exporting a <code>.index/</code> directory in the result directory of every |
|||
index query: |
|||
$ ls /.index/size/ |
|||
123 4 56 592 |
|||
$ ls /.index/size/4/ |
|||
cafebabe deadbeef |
|||
$ ls /.index/size/4/.index/ctime/ |
|||
20080303T123456Z 20071225012345Z |
|||
Since "unique" index terms have files, not directories, as leaves, |
|||
unique indexes must appear only as the last term of a multilevel |
|||
query. This non-orthogonality is a slightly unfortunate, since it requires |
|||
a query tool using the POSIX API to handle a special case: if at any |
|||
point in traversing the query path you end up with a file instead of a |
|||
directory, there are either 0 or 1 results to the query. Further |
|||
checks should be done directly on the extended attributes of the file. |
|||
This special case doesn't seem to be too onerous; after all, |
|||
prioritizing unique indices is basic query optimization. |
|||
==== Special attributes ==== |
|||
Path-level attributes, set-valued attributes. |
|||
==== Inodes ==== |
|||
===Inherent metadata fields=== |
|||
These fields are automatically created for a files written with the |
|||
standard POSIX API. '''WRITEME''' |
|||
===Suggested application-level metadata fields=== |
|||
These fields are not created without explicit intervention, but are |
|||
used to support application-level functionality. |
|||
; markup |
; markup |
||
: This idea is borrowed from BeOS: some formatted text is stored as plain/text in the 'content' of a file. Markup information (ie, lists of spanned characters for HTML markup tags) is stored in the 'markup' stream of the file. This allows easy editing and searching of formatted text, and allows formatted text to be used directly as source code: the compiler or python runtime only sees the plain text content. |
: This idea is borrowed from BeOS: some formatted text is stored as plain/text in the 'content' of a file. Markup information (ie, lists of spanned characters for HTML markup tags) is stored in the 'markup' stream of the file. This allows easy editing and searching of formatted text, and allows formatted text to be used directly as source code: the compiler or python runtime only sees the plain text content. |
||
'''WRITEME''' |
|||
== Implementation == |
== Implementation == |
||
Line 273: | Line 376: | ||
== Open Questions == |
== Open Questions == |
||
* How should complex queries by exposed via the POSIX API? Only via |
* How should complex queries by exposed via the POSIX API? Only via ioctl, or by read/write to a special 'query' file, or via magic directories, or what? |
||
ioctl, or by read/write to a special 'query' file, or via magic |
|||
directories, or what? |
Revision as of 19:06, 13 March 2008
This document presents a design for a new filesystem to support:
- OLPC's journal and bulletin board UI abstractions (see also the second-generation journal design)
- Bitfrost P_SF_RUN and P_SF_CORE protections (current implementation)
In our first-generation software, the journal was based with the "datastore" and Bitfrost protections were intended to be provided with a copy-on-write filesystem, originally based on vserver. This new filesystem, which I am calling "olpcfs", attempts to unify these implementations based on a powerful distributed fully-persistent versioned filesystem abstraction.
Design Goals
In this section we enumerate a number of the design goals for the olpcfs filesystem, based on our experiences with the first-generation datastore.
POSIX semantics
Our first-generation datastore was a bespoke system primarily accessible through a Python API. The on-disk representation was opaque, and users frequently expressed frustration at their inability to make journal contents interact with non-XO systems. For example, simply locating a desired file in the datastore was difficult, as files were stored on disk named only by content hashes. In addition, making ordinary Linux applications work on the XO was complicated by the fact that saving and loading objects from the datastore required the use of our unique API.
A primary design goal for olpcfs is thus to have a sane and reasonable POSIX semantics, so that unmodified applications using the POSIX APIs (ie, linux, Windows, MacOS, BeOS,...) can interact sensibly with the journal. We do not need to export every functionality of olpcfs over POSIX APIs, but we strive to. (An ideological tip-of-the-hat here goes to Plan 9, which taught the virtues of uniform access methods.)
In particular, the journal should be able to direct existing applications to open a document from the datastore by passing in a POSIX pathname to the document contents, and applications should be able to save a file to an appropriate place (say, '~/Documents') and have the new content naturally show up in the Journal. These documents may not have the rich metadata associated with them which native XO applications can provide, but the basic journal features should "just work".
Content-Addressable
Cloud storage is emerging as an elegant method of providing globally-accessible document stores, which are key to providing the uniform location-independent model of sharing Sugar aims to provide. When documents are stored into a global pool, naming each uniquely and consistently becomes a concern. Content-addressable storage (CAS) provides a solution to these issues by using (a function of) the content itself as a name. There are many implementations of CAS floating around; we've been most influenced by the emerging XAM standard API, but Sun's Project Honeycomb, Apache's Jackrabbit and (again) Plan 9's Venti are all possible backing storage for a CAS-based olpcfs filesystem.
Continuous versioning
Many versioned filesystems (for example, Fossil, ext3cow, WAFL, and AFS) use snapshot-based versioning. Rather than store every version of every file, snapshots are taken at regular intervals, and only the versions present at the time of a snapshot are accessible.
This is not consistent with the user interaction model underlying the XO, which strives to allow *every* action to be revokable, whether or not a snapshot intervened, to support painless exploration and discovery. It should be impossible for a child to "break" his machine in a way which is not easy fixed — or which requires losing all work since the last snapshot in order to fix.
Further, snapshots are usually taken of a full filesystem. Instead, we want fine-grained browsing of revisions made to individual documents. Ultimately we would like to empower applications to be able to intelligently display and merge revisions; fine-grained storage of the modification chain is an important enabler of this capability.
Recent work has shown that continuous versioning can be implemented at reasonable cost. Apple's Time Machine, and the experimental systems CVFS, VersionFS, Wayback, and PersiFS have demonstrated that this goal is achievable. The on-disk costs of continuous versioning will be mitigated by aggressively pruning unneeded versions at application-level.
Fully-persistent versions
Many versioned filesystems are partially persistent; that is, they support modifications only to the "most recent" version of a file, although all versions are available for read access. Fully-persistent data structures also allow modification of any version. This is obviously necessary to enable distributed editing, since independent parties are not guaranteed to be able to synchronize their edits. The "P_SF_RUN" Bitfrost protection and OLPC update mechanism also require full persistence: new base system images are independent roots for modification, and a user can switch back and forth from one base system image to the other, potentially making (independent) changes to each.
As an aside, other desirable properties may include data structures which are functional or confluently persistent. Functional data structures are immutable after creation, which is a nice property for the CAS backend but not strictly necessary on local disk. Confluently persistent data structures allows a subtree from version A to be efficiently copied into version B. Confluent persistence could be desirable to allow changes to one base system image to be applied to another; but hyper-efficiency for this operation is probably unnecessary for our system. (An unpublished paper by Demaine, Langerman, and Price contains a good discussion of these varieties of persistence, with references and an several examples of functional and confluently-persistent data structures; when the paper is accepted hopefully I'll be able to cite it here.)
POSIX API
In this section we'll describe how files, metadata, and indices are exposed through the standard POSIX API. To some degree, this is independent of actual implementation: one can imagine several different on-disk representations that export the given API, with different costs and properties. We expect that the first implementations will use FUSE on top of an on-disk representation substantially similar to the API exported. An intermediate representation may build on a CAS backend to perform the actual storage; an advanced implementation may eventually achieve high performance via direct disk access. Regardless, the behavior described in this section constitutes the principal API used by legacy applications and the Journal to access content.
Extended Attributes
The basic primitive in olpcfs is a file, with a rich set of metadata. The metadata includes typical POSIX attributes such as creation time, modification time, size, type, and permission bits. As in the [http://en.wikipedia.org/wiki/Be_File_System Be File System], however, application-specific tagged data can also be added to a file. There is a reasonable (256-character UTF-8) limit on tag size, but no limit to the amount of data stored in a tag. We will discuss uses for tags in the Suggested metadata fields section below. Tagged metadata are indexed, as in BFS and XAM. For uniformity of access, the actual contents of a file are stored as the value of a 'content' tag. (This notion of arbitrary-sized metadata could be considered similar to the notion of alternate data streams in NTFS or resource forks in HFS. Wikipedia contains a lengthy discussion.)
We export our metadata in two different formats. The first is for human browsing and existing filesystem exploration tools. We will present it by example. Let us say that there exists a file 'foo' with content 'bar\n' in the current directory, and that there is an extended attribute 'mime-type' on 'foo' with content 'text/plain'. Then:
$ ls foo $ cat foo bar $ ls foo/ ctime mtime size content mime-type xuid parent $ cat foo/content bar $ cat foo/mime-type text/plain $ cat foo/size 4 $ cat foo/xuid deadbeef
That is, a 'file' is treated as just a special type of directory, with internal nodes representing the various tagged metadata. For compatibility with legacy software, opening the 'file' without a trailing slash on the name returns you the value of its 'content' tag. The mode bits for 'foo' report it as a standard file, so the only real 'hack' here is the ability to cd into a 'file' and list its attributes. The uniform treatment of "alternate data streams" as files means that one could write a utility to combine markup and text (see below) and invoke it as:
$ ./combine-markup foo/content foo/markup > marked-up.html $ echo 'text/html' > marked-up.html/mime-type
The metadata attributes are also exported as standard POSIX xattrs, which allows efficient backup/recovery with the standard zip, tar, and rsync tools (which support xattrs). Unfortunately, POSIX xattr support is not well integrated into the standard Unix toolkit. There are no standard tools (other than archiving tools) for manipulating xattrs, although there is a partial standard defining naming conventions and common tags for better interoperability. Further, POSIX does not export a read/write API for metadata, which makes it impossible to (say) use the 'grep' tool over metadata content. Our 'file as directory' hack addresses these infelicities.
Indexing
At the root, an olpcfs filesystem exports indexes of metadata, which appear as simple directories:
$ ls /.index/size/ 123 4 56 592 $ ls /.index/size/4/ cafebabe deadbeef $ ls /.index/size/4/cafebabe ctime mtime size content mime-type xuid parent $ cat /.index/size/4/cafebabe/content bar
Three basic issues arise with this fundamental idea:
- How are duplicate keys handled?
- How are "strange" attribute values handled?
We address the presence of duplicate keys by having the leaf of each index be a directory. The contents of that directory are named by xuid and enumerate the different files on the system which match the query. When an index is created, an attribute can be tagged as unique, in which case an attempt to create a duplicate value will return EEXIST. The leaf nodes for unique indexes are files, rather than (1-entry) directories. Combined with path-valued attributes, the standard directory view of a filesystem is (more or less) just the index on the "path" attribute.
The second issue is exemplified by typical values of the "mime-type" attribute: "text/plain", "text/html", etc. Since leaf nodes of the index are named by the attribute value, how do we handle attributes with illegal filename characters (like '/')? And what would we do if the user attempted to index the "content" attribute, which may contain gigabytes of data for certain files?
One solution would be to display files with 'ls' that one couldn't actually access with the POSIX API, or to hide "bad" entries, or to only export indices that didn't exhibit these problems. These solutions result in irregular indices which aren't actually usable for anything except browsing; any "real" queries would have to be performed with an alternate API.
We have chosen instead to make the index as close to fully-functional as possible, even if this means that manual browsing is a little more difficult. We URL-encode problematic characters in attribute values, and truncate them to the first 128 bytes. This results in a consistent programmatic API, even if it makes directory listings a little uglier. A command-line tool to URI-encode an input string allows shell tools to directly access the index, even if the desired key is "funny". Only searches through large attributes require use of a non-POSIX query API.
Query API
A full query API would allow arbitrary boolean combinations of key terms, as well as range and subset queries. This document won't attempt to nail this interface down precisely, since the goal is a robustly functional POSIX API, but a BFS or SQLish query delivered by ioctl() or getxattr/setxattr call is a reasonable basis for discussion. (A FUSE implementation of olpcfs wouldn't permit an ioctl-based API.) For example:
char querytag[TAGLEN]; getxattr("/.query", "SELECT .xuid WHERE mime-type = 'text/html' OR mime-type = 'text/plain'", querytag, TAGLEN); while (1) { char xuid[XUIDLEN], pathbuf[PATHLEN]; struct stat st; if (!getxattr("/.query-result", querytag, xuid, XUIDLEN)) break; /* no more results */ /* do something with the file, for example: */ snprintf(pathbuf, PATHLEN, "/.index/xuid/%.*s", XUIDLEN, xuid); stat(pathbuf, &st); }
Helper functions which handled mount points properly, etc, could be added to a library to make this interface easier to use. "Live queries" (another debt to BFS) could also be supported if the getxattr calls above are allowed to block until additional matches are found, and to return a special "no longer a match" result for XUIDs modified since the start of a query.
If a minimal API is desired, efficient compound searches can be performed entirely in userland given only a means for performing a range query on a index directory. An API could be crafted around this mechanism, instead of the direct statement query interface shown above. All query processing could then happen in an application library, rather than inside the filesystem. The trade-off would be that live queries and atomicity guarantees would be much harder to guarantee.
Multilevel indexes
The POSIX view of indices we've shown so far allows for simple
exact-match queries on a single term. We can extend the power of this
interface to allow exact match queries on multiple ANDed terms by
exporting a .index/
directory in the result directory of every
index query:
$ ls /.index/size/ 123 4 56 592 $ ls /.index/size/4/ cafebabe deadbeef $ ls /.index/size/4/.index/ctime/ 20080303T123456Z 20071225012345Z
Since "unique" index terms have files, not directories, as leaves, unique indexes must appear only as the last term of a multilevel query. This non-orthogonality is a slightly unfortunate, since it requires a query tool using the POSIX API to handle a special case: if at any point in traversing the query path you end up with a file instead of a directory, there are either 0 or 1 results to the query. Further checks should be done directly on the extended attributes of the file. This special case doesn't seem to be too onerous; after all, prioritizing unique indices is basic query optimization.
Special attributes
Path-level attributes, set-valued attributes.
Inodes
Inherent metadata fields
These fields are automatically created for a files written with the standard POSIX API. WRITEME
Suggested application-level metadata fields
These fields are not created without explicit intervention, but are used to support application-level functionality.
- markup
- This idea is borrowed from BeOS: some formatted text is stored as plain/text in the 'content' of a file. Markup information (ie, lists of spanned characters for HTML markup tags) is stored in the 'markup' stream of the file. This allows easy editing and searching of formatted text, and allows formatted text to be used directly as source code: the compiler or python runtime only sees the plain text content.
WRITEME
Implementation
Indexing
Absent content
Merging remote stores
Historical note
The Be File System also used a directory structure to list extended attributes, although the directory inode number was hidden from the user. It also introduced the idea that metadata indices and directories were both just B+Trees on disk, and could use the same representation. As with the attribute "directory", BFS allocated an "index directory" inode, with the same structure as other directories, but hid this inode from the user. In the spirit of Plan 9 ("everything is a file") we've chosen to expose these directories to the user for direct access and traversal. The idea of direct read/write access to attributes as streams comes from the XAM specification (which no doubt inherited it from Centera or another other of the pre-existing CAS systems which influenced XAM's design).
BFS had difficulties with duplicate keys in their B+Trees, which were worked-around with various ad-hoc measures. In our system, we can use the XUID as a suffix to any key type to break ties, avoiding the problem completely.
The Linux NTFS driver exports "alternate data streams" using a trailing colon syntax, for example: "filename:stream". A special tool is necessary to list the streams available for a given file.
The ext3cow filesystem exports versions using a trailing at-sign syntax. The version of a file or directory available at 2008-03-12 16:47:34 UTC (1,205,340,454 seconds since the epoch) can be accessed with "filename@1205340454".
Open Questions
- How should complex queries by exposed via the POSIX API? Only via ioctl, or by read/write to a special 'query' file, or via magic directories, or what?