Experiments with unordered paths: Difference between revisions

Latest revision as of 23:08, 15 December 2008

Introduction

The Journal -- and many "Web 2.0" applications -- are built around the idea of tag search. In discussions about extending the Journal to more traditional file management tasks -- how should mounted USB keys appear in the Journal? how should the Journal appear if mounted as a filesystem? -- I have always taken as an article of faith that "ordered tags" would be necessary to translate the directory tree metaphor into tag search. In filesystems, a/b is not the same file as b/a; in tag sets "a b" is exactly the same search as "b a".

I was challenged by Eben and Eduardo, among others, who were unconvinced by my intuition that ordering was important in filesystem paths. Their intuition told them that additional context was all that was necessary -- additional tags in the search. Sure, Bach/Disc1 was a different directory from Beethoven/Disc1, but it was the "Bach" and "Beethoven" tags which were important, not the ordering. Bach/Disc1 and Disc1/Bach might be the same thing, and that's okay.

I decided to actually do the experiment. I wrote a short script that went through all the files on my laptop -- crammed to the brim with stuff from the past decade, legacy code, various organizational strategies -- and tried to prove that path component ordering was important. Surely this search would come up with some compelling examples of distinct directories that became identical once you ignored the order of the path components.

My first search found no ambiguities. My mind exploded.

...

Later, I found a bug in my script. With the bug fixed, I could find a handful of existing directories that were made ambiguous by ignoring the path ordering, but nothing compelling. Only 21 such directories among the 900,000 files present in my home directory! It turns out that repeated components are important -- x/y/x is different from x/y -- but ordering is not.
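
A minimal sketch of this kind of check, written for illustration and not the original script: it walks a directory tree and groups directories whose path components are the same once order is ignored. The function name find_order_ambiguities is hypothetical. The key is a sorted tuple rather than a set, on the assumption (taken from the result above) that repeated components should still count.

```python
import os
from collections import defaultdict


def find_order_ambiguities(root):
    """Group directories under `root` whose components match when order is ignored."""
    groups = defaultdict(list)
    for dirpath, _dirnames, _filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        if rel == os.curdir:
            continue  # skip the root itself
        # Sorting discards order but keeps repeats, so x/y/x stays distinct from x/y.
        key = tuple(sorted(rel.split(os.sep)))
        groups[key].append(rel)
    # Only keys shared by two or more directories are ambiguous.
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```

Calling find_order_ambiguities(os.path.expanduser("~")) would list each group of directories that an unordered search could not tell apart.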

Furthermore, only about 3 unique tags were necessary to reach any directory in my home directory. Instead of:

$ cd ~/Projects/OLPC/git/sugar-toolkit/sugar/graphics

I ought to be able to use the tags "OLPC graphics" instead -- much shorter!
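
One way to measure how few tags a directory needs is a brute-force search over its own components. The sketch below rests on stated assumptions rather than on the original script: paths are "/"-separated strings, a query matches any directory that carries every queried tag, and matches below the target are ignored because they lie inside it. The name shortest_tag_set is hypothetical.

```python
from itertools import combinations


def shortest_tag_set(target, all_dirs):
    """Return the fewest components of `target` that no rival directory also carries."""
    target_tags = sorted(set(target.split("/")))
    # Rivals are all other directories except those nested inside the target.
    rivals = [
        set(d.split("/"))
        for d in all_dirs
        if d != target and not d.startswith(target + "/")
    ]
    for size in range(1, len(target_tags) + 1):
        for combo in combinations(target_tags, size):
            wanted = set(combo)
            # A rival matches the query when it carries every queried tag.
            if not any(wanted <= tags for tags in rivals):
                return combo
    return None  # some rival carries every one of the target's tags
```

Given the path above, its parent directories, and a rival such as Projects/Other/graphics, this returns the pair ("OLPC", "graphics").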

On this page I will collect some of my further experiments with "unordered paths", attempting to get some experience using a system structured in this fashion to inform the redesign of the Journal for 9.1.0.

Talk

User:CScott gave a talk based on these ideas on 2008-10-15 at 1cc. See Journal, reloaded for links to the talk and other materials.

To come:

A "cd" replacement that uses tags instead of paths, implements intelligent tab-completion, and offers suggestions for how to reach places faster in the future. (draft source code)
Links to Eduardo's walkthrough of the "dynamic tag" system in Epiphany, and how that might inform the next-gen Journal
Implementing fast tag search and completion
What this might look like as a filesystem
Security considerations in a world with unordered paths (User:Mstone ought to help here!)
Statistics and experience reports!
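
As a rough illustration of the tag search and completion items above, the following sketch builds an inverted index from path components to directories. It is not the tagcd draft code; the class TagIndex and its methods are hypothetical, and paths are assumed to be "/"-separated strings. Completion here is a plain linear scan, so a fast implementation would presumably need a sorted or prefix-indexed structure instead.

```python
from collections import defaultdict


class TagIndex:
    """Inverted index from path component ("tag") to the directories carrying it."""

    def __init__(self, directories):
        self.by_tag = defaultdict(set)
        for d in directories:
            for tag in d.split("/"):
                self.by_tag[tag].add(d)

    def search(self, tags):
        """Directories carrying every tag in `tags`, in any order."""
        sets = [self.by_tag.get(t, set()) for t in tags]
        return set.intersection(*sets) if sets else set()

    def complete(self, tags, prefix):
        """Tags starting with `prefix` that still match something alongside `tags`."""
        narrowed = self.search(tags) if tags else None
        return sorted(
            tag
            for tag, dirs in self.by_tag.items()
            if tag.startswith(prefix)
            and tag not in tags
            and (narrowed is None or dirs & narrowed)
        )
```

With an index over Music/Bach/Disc1 and Music/Beethoven/Disc1, searching for ["Disc1", "Bach"] and ["Bach", "Disc1"] gives the same single result, which is the unordered behavior discussed in the introduction.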

Other journal-like interfaces

Desktop search

Pinot
Tracker (wikipedia article)
Strigi
Recoll
GNOME Storage -- ambitious, and dead
Beagle (oink)
Doodle -- suffix tree based
Gnome Medusa (dead? can't find sources) suffix tree based?
Namazu
Teardrop
rlocate - an implementation of the "locate" command that is always up-to-date
Swish++ / glimpse
Olpcfs
Shared file metadata spec

Comparisons

My own views: Pinot seems like the most promising Xapian-based search backend, but information is thin. It might require modification to expose the Xapian chronological-ordering mechanism. It has good internationalization and active development, and is Fedora-native. Its RSS/Sherlock integration is nice and goes well with Eben's long-term ideas there.

Tracker seems the most mature of the mainstream solutions, but I'm not clear that it efficiently supports the chronological ordering we need, and since it is based on its own SQLite database, it doesn't have native support for query completion or relevance ordering of tag suggestions. It's nice that Recoll is based on Xapian, which has many of the features we need (and many we don't: relevance sorting is only needed for query suggestions, not for the basic search), but the codebase seems too tightly tied to its GUI (and C++/Qt). Strigi has very nice bundle-search support, which could be used to implement zip files as the canonical container format in Sugar, but its realtime indexing support is "experimental", it seems to support very few file types, and it has a KDE/GNOME impedance mismatch. It uses CLucene, which is promising.

@@ Line 1: / Line 1: @@
+== Introduction ==
 The [[Journal]] -- and many "[http://en.wikipedia.org/wiki/Web_2.0 Web 2.0]" applications -- are built around the idea of tag search.  In discussions about extending the Journal to more traditional file management tasks -- how should mounted USB keys appear in the Journal?  how should the Journal appear if mounted as a filesystem -- [[User:CScott|I]] have always taken as an article of faith that "ordered tags" would be necessary to translate the directory tree metaphor into tag search.  In filesystems, a/b is not the same file as b/a; in tag sets "a b" is exactly the same search as "b a".
@@ Line 16: / Line 18: @@
 On this page I will collect some of my further experiments with "unordered paths", attempting to get some experience using a system structured in this fashion to inform the redesign of the Journal for [[9.1.0]].
+== Talk ==
+[[User:CScott]] gave a talk based on these ideas on 2008-10-15 at 1cc.  See [[Journal, reloaded]] for links to the talk and other materials.
 ==To come: ==
 * A "cd" replacement that uses tags instead of paths, implements intelligent tab-completion, and offers suggestions for how to reach places faster in the future. ([http://dev.laptop.org/git/users/cscott/tagcd draft source code])
 * Links to [[User:HoboPrimate|Eduardo]]'s walkthrough of the "dynamic tag" system in Epiphany, and how that might inform the next-gen Journal
@@ Line 32: / Line 38: @@
 * [http://lucene.apache.org/java/2_3_2/fileformats.html#Per-Index Files Apache Lucene's index formats]
 * [http://web.archive.org/web/20010711201252/hotwired.lycos.com/webmonkey/templates/print_template.htmlt?meta=/webmonkey/97/16/index2a_meta.html Roll your own search engine (in perl)]
+* [http://www.organise-fw.org/ Organize framework]
+* [http://seb.dbzteam.org/swp/pages/vtagfs.html VTagFS]: a FUSE filesystem which is similar in concept
 === Other journal-like interfaces ===
 * [http://www.iola.dk/nemo/ Nemo]
@@ Line 37: / Line 46: @@
 === Desktop search ===
+* [http://pinot.berlios.de/ Pinot]
+* [http://www.gnome.org/projects/tracker/faq.html Tracker] ([http://en.wikipedia.org/wiki/Tracker_(desktop_search_software) wikipedia article])
 * [http://strigi.sourceforge.net/ Strigi]
-* [http://en.wikipedia.org/wiki/Tracker_(desktop_search_software) Tracker]
-* [http://www.freedesktop.org/wiki/Specifications/shared-filemetadata-spec Shared file metadata spec]
 * [http://www.lesbonscomptes.com/recoll/usermanual/index.html Recoll]
 * [http://www.gnome.org/~seth/storage/ GNOME Storage] -- ambitious, and [http://en.wikipedia.org/wiki/GNOME_Storage dead]
 * [http://beagle-project.org/About Beagle] (oink)
-* [http://www.gnome.org/projects/tracker/faq.html Tracker]
+* [http://gnunet.org/doodle/ Doodle] -- suffix tree based
+* Gnome Medusa (dead?  can't find sources) suffix tree based?
+* [http://www.namazu.org/ Namazu]
+* [http://www.teardrop.fr/ Teardrop]
+* [http://sourceforge.net/projects/rlocate/ rlocate] - an implementation of the "locate" command that is always up-to-date
+* [http://swishplusplus.sourceforge.net/ Swish++] / [http://webglimpse.net/ glimpse]
 * [[Olpcfs]]
+* [http://www.freedesktop.org/wiki/Specifications/shared-filemetadata-spec Shared file metadata spec]
 ==== Comparisons ====
 * [http://www.wikinfo.org/index.php/Comparison_of_desktop_search_software Comparison of desktop search software]
 * [http://mail.gnome.org/archives/tracker-list/2007-January/msg00171.html Additional comparison]
 * [http://www.freesoftwaremagazine.com/columns/desktop_search_tools_gnu_linux_tracker_recoll_strigi_deskbar Desktop search wars]
+My own views: [http://pinot.berlios.de/ Pinot] seems like the most promising Xapian-based search backend, but information is thin.  It might require modification to expose the Xapian chronological-ordering mechanism.  Good internationalization, [http://www.ohloh.net/projects/3161/analyses/latest active development], fedora-native.  Nice RSS/Sherlock integration which goes well with Eben's long term ideas there.
+Tracker seems the most mature of the mainstream solutions, but I'm not clear that it efficiently supports the chronological support we need, and since it is based on its own sqlite database, it doesn't have native support for query completion or relevance ordering of tag suggestions.  It's nice that Recoll is based on Xapian, which has many of the features we need (and many we don't: relevance sorting is only needed for query suggestions, not for the basic search), but the codebase seems too tightly tied to its GUI (and C++/Qt).  Strigi has very nice bundle-search support, which could be used to implement zip files as the canonical container format in Sugar, but the realtime indexing support is "experimental", seems to support very few file types, has a KDE/Gnome impedance mismatch.  Uses CLucene, which is promising.
 === Object-Relational Mappers ===
 * [http://www.sqlalchemy.org/ SQL Alchemy]
 * [http://www.sqlobject.org/ SQL Object]
-* [http://jystewart.net/process/2008/02/using-the-django-orm-as-a-standalone-component/ Using Django]
+* [http://jystewart.net/process/2008/02/using-the-django-orm-as-a-standalone-component/ Using Django] ([http://docs.djangoproject.com/en/dev/topics/db/queries/ more], [http://www.mercurytide.co.uk/whitepapers/django-full-text-search/ full-text search], [http://blog.capstrat.com/tags/standalone/ standalone])
 === Database tools ===
-* [] [http://xappy.org/docs/0.5/introduction.html Xappy]
+* [http://www.xapian.org/docs/ Xapian] [http://xappy.org/docs/0.5/introduction.html Xappy]
+** [http://www.xapian.org/docs/intro_ir.html Probabilistic Information Retrieval background]
+** [http://www.xapian.org/docs/spelling.html spelling correction] and [http://www.xapian.org/docs/stemming.html stemming] support, and [http://www.xapian.org/docs/termgenerator.html details on how stemming and word-splitting are applied]
+** [http://www.xapian.org/docs/sorting.html sorting by date]
+** [http://www.xapian.org/docs/overview.html query overview]
+* [http://lucene.apache.org/ Lucene]
