Experiments with unordered paths: Difference between revisions
== Introduction ==
The [[Journal]] -- and many "[http://en.wikipedia.org/wiki/Web_2.0 Web 2.0]" applications -- are built around the idea of tag search. In discussions about extending the Journal to more traditional file management tasks -- how should mounted USB keys appear in the Journal? how should the Journal appear if mounted as a filesystem -- [[User:CScott|I]] have always taken as an article of faith that "ordered tags" would be necessary to translate the directory tree metaphor into tag search. In filesystems, a/b is not the same file as b/a; in tag sets "a b" is exactly the same search as "b a".
On this page I will collect some of my further experiments with "unordered paths", attempting to get some experience using a system structured in this fashion to inform the redesign of the Journal for [[9.1.0]].
== Talk ==
[[User:CScott]] gave a talk based on these ideas on 2008-10-15 at 1cc. See [[Journal, reloaded]] for links to the talk and other materials.
== To come: ==
* A "cd" replacement that uses tags instead of paths, implements intelligent tab-completion, and offers suggestions for how to reach places faster in the future. ([http://dev.laptop.org/git/users/cscott/tagcd draft source code])
* Links to [[User:HoboPrimate|Eduardo]]'s walkthrough of the "dynamic tag" system in Epiphany, and how that might inform the next-gen Journal
* Implementing fast tag search and completion
* Statistics and experience reports!
== Random links ==
* [http://lists.laptop.org/pipermail/sugar/2008-September/008599.html Tagged Journal Proposal], based on this work
* [http://lists.laptop.org/pipermail/sugar/2008-September/008432.html Earlier Epiphany discussion] (thanks, Eduardo!)
* [http://plg.uwaterloo.ca/~claclark/fast2005.pdf Security implications of search]
* [http://www.perl.com/pub/a/2003/02/19/engine.html?page=2 Building a vector space search engine in perl]
* [http://lucene.apache.org/java/2_3_2/fileformats.html#Per-Index Files Apache Lucene's index formats]
* [http://web.archive.org/web/20010711201252/hotwired.lycos.com/webmonkey/templates/print_template.htmlt?meta=/webmonkey/97/16/index2a_meta.html Roll your own search engine (in perl)]
* [http://www.organise-fw.org/ Organize framework]
* [http://seb.dbzteam.org/swp/pages/vtagfs.html VTagFS]: a FUSE filesystem which is similar in concept
=== Other journal-like interfaces ===
* [http://www.iola.dk/nemo/ Nemo]
* [http://live.gnome.org/PaperBox Paperbox]
=== Desktop search ===
* [http://pinot.berlios.de/ Pinot]
* [http://www.gnome.org/projects/tracker/faq.html Tracker] ([http://en.wikipedia.org/wiki/Tracker_(desktop_search_software) wikipedia article])
* [http://strigi.sourceforge.net/ Strigi]
* [http://www.lesbonscomptes.com/recoll/usermanual/index.html Recoll]
* [http://www.gnome.org/~seth/storage/ GNOME Storage] -- ambitious, and [http://en.wikipedia.org/wiki/GNOME_Storage dead]
* [http://beagle-project.org/About Beagle] (oink)
* [http://gnunet.org/doodle/ Doodle] -- suffix tree based
* Gnome Medusa (dead? can't find sources) suffix tree based?
* [http://www.namazu.org/ Namazu]
* [http://www.teardrop.fr/ Teardrop]
* [http://sourceforge.net/projects/rlocate/ rlocate] - an implementation of the "locate" command that is always up-to-date
* [http://swishplusplus.sourceforge.net/ Swish++] / [http://webglimpse.net/ glimpse]
* [[Olpcfs]]
* [http://www.freedesktop.org/wiki/Specifications/shared-filemetadata-spec Shared file metadata spec]
==== Comparisons ====
* [http://www.wikinfo.org/index.php/Comparison_of_desktop_search_software Comparison of desktop search software]
* [http://mail.gnome.org/archives/tracker-list/2007-January/msg00171.html Additional comparison]
* [http://www.freesoftwaremagazine.com/columns/desktop_search_tools_gnu_linux_tracker_recoll_strigi_deskbar Desktop search wars]
My own views: [http://pinot.berlios.de/ Pinot] seems like the most promising Xapian-based search backend, but information is thin. It might require modification to expose the Xapian chronological-ordering mechanism. It has good internationalization and [http://www.ohloh.net/projects/3161/analyses/latest active development], and it is Fedora-native. It also has nice RSS/Sherlock integration, which goes well with Eben's long-term ideas there.

Tracker seems the most mature of the mainstream solutions, but I'm not clear that it efficiently supports the chronological ordering we need, and since it is based on its own SQLite database, it doesn't have native support for query completion or relevance ordering of tag suggestions. It's nice that Recoll is based on Xapian, which has many of the features we need (and many we don't: relevance sorting is only needed for query suggestions, not for the basic search), but the codebase seems too tightly tied to its GUI (and C++/Qt). Strigi has very nice bundle-search support, which could be used to implement zip files as the canonical container format in Sugar, but its realtime indexing support is "experimental", it seems to support very few file types, and it has a KDE/GNOME impedance mismatch. It uses CLucene, which is promising.
=== Object-Relational Mappers ===
* [http://www.sqlalchemy.org/ SQL Alchemy]
* [http://www.sqlobject.org/ SQL Object]
* [http://jystewart.net/process/2008/02/using-the-django-orm-as-a-standalone-component/ Using Django] ([http://docs.djangoproject.com/en/dev/topics/db/queries/ more], [http://www.mercurytide.co.uk/whitepapers/django-full-text-search/ full-text search], [http://blog.capstrat.com/tags/standalone/ standalone])
=== Database tools ===
* [http://www.xapian.org/docs/ Xapian] [http://xappy.org/docs/0.5/introduction.html Xappy]
** [http://www.xapian.org/docs/intro_ir.html Probabilistic Information Retrieval background]
** [http://www.xapian.org/docs/spelling.html spelling correction] and [http://www.xapian.org/docs/stemming.html stemming] support, and [http://www.xapian.org/docs/termgenerator.html details on how stemming and word-splitting are applied]
** [http://www.xapian.org/docs/sorting.html sorting by date]
** [http://www.xapian.org/docs/overview.html query overview]
* [http://lucene.apache.org/ Lucene]
Latest revision as of 23:08, 15 December 2008
Introduction
The Journal -- and many "Web 2.0" applications -- are built around the idea of tag search. In discussions about extending the Journal to more traditional file management tasks -- how should mounted USB keys appear in the Journal? how should the Journal appear if mounted as a filesystem -- I have always taken as an article of faith that "ordered tags" would be necessary to translate the directory tree metaphor into tag search. In filesystems, a/b is not the same file as b/a; in tag sets "a b" is exactly the same search as "b a".
I was challenged by Eben and Eduardo, among others, who were unconvinced by my intuition that ordering was important in filesystem paths. Their intuition told them that additional context was all that was necessary -- additional tags in the search. Sure, Bach/Disc1 was a different directory from Beethoven/Disc1, but it was the "Bach" and "Beethoven" tags which were important, not the ordering. Bach/Disc1 and Disc1/Bach might be the same thing, and that's okay.
I decided to actually do the experiment. I wrote a short script which went through all the files on my laptop -- crammed to the brim with stuff from the past decade, legacy code, various organizational strategies -- and tried to prove that path component ordering was important. Surely this search would come up with some compelling examples of different directories that became identical if you ignored the order of the path components.
My first search found no ambiguities. My mind exploded.
...
Later, I found a bug in my script. Now I could find a handful of existing directories that were made ambiguous by ignoring the path ordering, but nothing compelling. Only 21 such directories among the 900,000 files present in my home directory! It turns out that repeated components are important -- x/y/x is different from x/y -- but not ordering.
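The original script is not reproduced on this page, so what follows is only a minimal sketch of the kind of check described above, not the script itself. The function names (unordered_key, find_collisions, walk_directories) are made up for illustration. It groups directories by an order-insensitive key and reports the groups with more than one member; the keep_repeats flag switches between treating the components as a multiset (x/y/x differs from x/y) and as a plain set.

```python
import os
from collections import defaultdict


def unordered_key(path, keep_repeats=True):
    """Reduce a path to a key that ignores component order."""
    parts = [p for p in path.replace(os.sep, "/").split("/") if p]
    if keep_repeats:
        # Multiset of components: x/y/x and x/y get different keys.
        return tuple(sorted(parts))
    # Plain set of components: x/y/x and x/y get the same key.
    return frozenset(parts)


def find_collisions(paths, keep_repeats=True):
    """Return groups of distinct paths that share an unordered key."""
    groups = defaultdict(set)
    for path in paths:
        groups[unordered_key(path, keep_repeats)].add(path)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


def walk_directories(root):
    """Yield every directory below root, relative to root."""
    for dirpath, _dirnames, _filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        if rel != os.curdir:
            yield rel


if __name__ == "__main__":
    home = os.path.expanduser("~")
    for group in find_collisions(walk_directories(home)):
        print(" == ".join(group))
```

Running it with keep_repeats=False as well gives a rough idea of how many additional collisions come purely from repeated components.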
Furthermore, only about 3 unique tags were necessary to reach any directory in my home. Instead of:
$ cd ~/Projects/OLPC/git/sugar-toolkit/sugar/graphics
I ought to be able to use the tags "OLPC graphics" instead -- much shorter!
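How the "about 3 tags" figure was measured is not stated here, so the following is a sketch under an explicit assumption: a tag query resolves to the shallowest directory (fewest distinct components) that carries every tag in the query, and fails if that choice is not unique. The names tags_of, resolve and shortest_query, and the sample directory list, are hypothetical and are not taken from the tagcd draft.

```python
from itertools import combinations


def tags_of(path):
    """Treat each path component as a tag."""
    return frozenset(p for p in path.split("/") if p)


def resolve(query, directories):
    """Return the unique shallowest directory carrying every query tag.

    Assumption: deeper directories that also match (for example
    subdirectories of the target) lose to the shallowest match.
    Returns None when nothing matches or the shallowest match is tied.
    """
    query = frozenset(query)
    matches = [d for d in directories if query <= tags_of(d)]
    if not matches:
        return None
    fewest = min(len(tags_of(d)) for d in matches)
    best = [d for d in matches if len(tags_of(d)) == fewest]
    return best[0] if len(best) == 1 else None


def shortest_query(target, directories):
    """Brute-force the smallest tag set that resolves to target."""
    tags = sorted(tags_of(target))
    for size in range(1, len(tags) + 1):
        for combo in combinations(tags, size):
            if resolve(combo, directories) == target:
                return set(combo)
    return None


if __name__ == "__main__":
    dirs = [
        "Projects",
        "Projects/OLPC",
        "Projects/OLPC/git",
        "Projects/OLPC/git/sugar-toolkit",
        "Projects/OLPC/git/sugar-toolkit/sugar",
        "Projects/OLPC/git/sugar-toolkit/sugar/graphics",
        "Projects/web/graphics",  # hypothetical sibling that also has "graphics"
    ]
    print(shortest_query(dirs[5], dirs))
```

The brute-force search is exponential in path depth, which is tolerable only because real paths are short; a practical "cd" replacement would want an index from tags to directories instead.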
On this page I will collect some of my further experiments with "unordered paths", attempting to get some experience using a system structured in this fashion to inform the redesign of the Journal for 9.1.0.
Talk
User:CScott gave a talk based on these ideas on 2008-10-15 at 1cc. See Journal, reloaded for links to the talk and other materials.
To come:
- A "cd" replacement that uses tags instead of paths, implements intelligent tab-completion, and offers suggestions for how to reach places faster in the future. (draft source code)
- Links to Eduardo's walkthrough of the "dynamic tag" system in Epiphany, and how that might inform the next-gen Journal
- Implementing fast tag search and completion
- What this might look like as a filesystem
- Security considerations in a world with unordered paths (User:Mstone ought to help here!)
- Statistics and experience reports!
Random links
- Tagged Journal Proposal, based on this work
- Earlier Epiphany discussion (thanks, Eduardo!)
- Security implications of search
- Building a vector space search engine in perl
- Apache Lucene's index formats
- Roll your own search engine (in perl)
- Organize framework
- VTagFS: a FUSE filesystem which is similar in concept
Other journal-like interfaces
Desktop search
- Pinot
- Tracker (wikipedia article)
- Strigi
- Recoll
- GNOME Storage -- ambitious, and dead
- Beagle (oink)
- Doodle -- suffix tree based
- Gnome Medusa (dead? can't find sources) suffix tree based?
- Namazu
- Teardrop
- rlocate - an implementation of the "locate" command that is always up-to-date
- Swish++ / glimpse
- Olpcfs
- Shared file metadata spec
Comparisons
My own views: Pinot seems like the most promising Xapian-based search backend, but information is thin. It might require modification to expose the Xapian chronological-ordering mechanism. It has good internationalization and active development, and it is Fedora-native. It also has nice RSS/Sherlock integration, which goes well with Eben's long-term ideas there.
Tracker seems the most mature of the mainstream solutions, but I'm not clear that it efficiently supports the chronological ordering we need, and since it is based on its own SQLite database, it doesn't have native support for query completion or relevance ordering of tag suggestions. It's nice that Recoll is based on Xapian, which has many of the features we need (and many we don't: relevance sorting is only needed for query suggestions, not for the basic search), but the codebase seems too tightly tied to its GUI (and C++/Qt). Strigi has very nice bundle-search support, which could be used to implement zip files as the canonical container format in Sugar, but its realtime indexing support is "experimental", it seems to support very few file types, and it has a KDE/GNOME impedance mismatch. It uses CLucene, which is promising.