Experiments with unordered paths
The Journal -- and many "Web 2.0" applications -- are built around the idea of tag search. In discussions about extending the Journal to more traditional file management tasks -- how should mounted USB keys appear in the Journal? how should the Journal appear if mounted as a filesystem -- I have always taken as an article of faith that "ordered tags" would be necessary to translate the directory tree metaphor into tag search. In filesystems, a/b is not the same file as b/a; in tag sets "a b" is exactly the same search as "b a".
I was challenged by Eben and Eduardo, among others, who were unconvinced by my intuition that ordering was important in filesystem paths. Their intuition told them that additional context was all that was necessary -- additional tags in the search. Sure
Bach/Disc1 was a different directory from
Beethoven/Disc1, but it was the "Bach" and "Beethoven" tags which were important, not the ordering.
Disc1/Bach might be the same thing, and that's okay.
I decided to actually do the experiment. I wrote a short script which went through all the files on my laptop -- crammed to the brim with stuff from the past decade, legacy code, various organizational strategies -- and try to prove that path component ordering was important. Surely this search would come up with some compelling examples of different directories that were identical if you ignored the order of the path components.
My first search found no ambiguities. My mind exploded.
Later, I found a bug in my script. Now I could find a handful of existing directories that were made ambiguous by ignoring the path ordering, but nothing compelling. Only 21 such directories in among the 900,000 files present in my home directory! It turns out that repeated components are important -- x/y/x is different than x/y -- but not ordering.
Further more, only about 3 unique tags were necessary to reach any directory in my home. Instead of:
$ cd ~/Projects/OLPC/git/sugar-toolkit/sugar/graphics
I ought to be able to use the tags "OLPC graphics" instead -- much shorter!
On this page I will collect some of my further experiments with "unordered paths", attempting to get some experience using a system structured in this fashion to inform the redesign of the Journal for 9.1.0.
- A "cd" replacement that uses tags instead of paths, implements intelligent tab-completion, and offers suggestions for how to reach places faster in the future. (draft source code)
- Links to Eduardo's walkthrough of the "dynamic tag" system in Epiphany, and how that might inform the next-gen Journal
- Implementing fast tag search and completion
- What this might look like as a filesystem
- Security considerations in an world with unordered paths (User:Mstone ought to help here!)
- Statistics and experience reports!
- Tagged Journal Proposal, based on this work
- Earlier Ephiphany discussion (thanks, Eduardo!)
- Security implications of search
- Building a vector space search engine in perl
- Files Apache Lucene's index formats
- Roll your own search engine (in perl)
- Organize framework
- VTagFS: a FUSE filesystem which is similar in concept
Other journal-like interfaces
- Tracker (wikipedia article)
- GNOME Storage -- ambitious, and dead
- Beagle (oink)
- Doodle -- suffix tree based
- Gnome Medusa (dead? can't find sources) suffix tree based?
- rlocate - an implementation of the "locate" command that is always up-to-date
- Swish++ / glimpse
- Shared file metadata spec
My own views: Pinot seems like the most promising Xapian-based search backend, but information is thin. It might require modification to expose the Xapian chronological-ordering mechanism. Good internationalization, active development, fedora-native. Nice RSS/Sherlock integration which goes well with Eben's long term ideas there.
Tracker seems the most mature of the mainstream solutions, but I'm not clear that it efficiently supports the chronological support we need, and since it is based on it's own sqlite database, it doesn't have native support for query completion or relevance ordering of tag suggestions. It's nice that Recoll is based on Xapian, which has many of the features we need (and many we don't: relevance sorting is only needed for query suggestions, not for the basic search), but the codebase seems too tightly tied to its GUI (and C++/Qt). Strigi has very nice bundle-search support, which could be used to implement zip files as the canonical container format in Sugar, but the realtime indexing support is "experimental", seems to support very few file types, has a KDE/Gnome impedance mismatch. Uses CLucene, which is promising.