
For the old uncompressed activity, see WikiBrowse (old).
WikiBrowse: a self-contained wiki server

A wiki web-server activity is being developed as a self-contained, browsable offline wiki reader. Other efforts to generate and browse slices of online wikis are being discussed in the Wikislice project.

Introduction

The Wikipedia on an iPhone (WOAI) project by Patrick Collison makes it possible to have a working, usable MediaWiki (read: Wikipedia) dump in a very small space (read: the XO's flash drive).

How it works

The wikipedia-iphone project's goal is to serve Wikipedia content out of a compressed copy of the XML dump file after indexing it. The architecture consists of C functions that pull out articles, plus several interfaces to those functions: the main interface is the iPhone app, but there is also a web server (written in Ruby with Mongrel) that runs locally and serves up pages from the compressed archive.
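
To make the serving layer concrete, here is a minimal sketch of such a local article server in Python (the project's real server was written in Ruby with Mongrel and later ported to Python). get_article(), the port number, and the default page title are placeholders, not the project's actual names or configuration.

  # A minimal sketch only -- not the project's code. get_article() stands in
  # for the C extraction functions; the port and default title are made up.
  from http.server import BaseHTTPRequestHandler, HTTPServer
  from urllib.parse import unquote

  def get_article(title):
      """Placeholder for the real lookup into the compressed dump."""
      return "<html><body><h1>%s</h1><p>article text...</p></body></html>" % title

  class WikiHandler(BaseHTTPRequestHandler):
      def do_GET(self):
          title = unquote(self.path.lstrip("/")) or "Portada"
          html = get_article(title)
          if html is None:
              self.send_error(404, "Article not in the local dump")
              return
          body = html.encode("utf-8")
          self.send_response(200)
          self.send_header("Content-Type", "text/html; charset=utf-8")
          self.send_header("Content-Length", str(len(body)))
          self.end_headers()
          self.wfile.write(body)

  if __name__ == "__main__":
      # Serve on localhost only; a browser (or the Browse activity) points here.
      HTTPServer(("127.0.0.1", 8000), WikiHandler).serve_forever()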

Wade has helped port the original project's code to Python. Cjb and Wade are working on fixing some of the unfinished aspects of the original, particularly:

  • Template rendering
  • Redlink/greenlink/bluelink rendering
  • Image thumbnail retrieval
  • Automated subselection (currently: via a minimum # of inbound links and page view data)

Link coloring is present -- green links indicate articles which exist (on a local server, or on the internet) but not in the local dump, blue links indicate locally existing pages, and red links indicate pages that do not exist.
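
As an illustration, this tiny sketch (not project code) shows one way the three link classes could be chosen; in_local_dump() and exists_remotely() are hypothetical helpers.

  # Sketch only: deciding a link's color class. Both helper predicates
  # are hypothetical stand-ins for the activity's real lookups.
  def link_class(title, in_local_dump, exists_remotely):
      if in_local_dump(title):
          return "bluelink"    # article is in the local compressed dump
      if exists_remotely(title):
          return "greenlink"   # exists on a server or the internet, not locally
      return "redlink"         # the article does not exist (yet)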

The MediaWiki data dump is stored as a .bz2 file, which is made of smaller compressed blocks (each containing multiple articles). The WOAI code, among other things, goes through and makes an index of which block each article is in. That way, when you want to read an article, your computer only uncompresses the tiny block it's in - the rest of the huge dump stays compressed (a sketch of this lookup follows the list below). This means that

  • (1) it's really fast, since you're working with tiny compressed bundles and
  • (2) it's really small, since you're only decompressing one tiny bundle at a time.
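
For illustration, here is a rough sketch of that block-level lookup in Python. It assumes a multistream-style dump plus an index whose lines look like "offset:page_id:title"; the real WOAI index format and the file names in the usage note may differ.

  # Sketch only: decompress just the one block that holds an article.
  import bz2

  def load_index(index_path):
      # Map article title -> byte offset of the compressed block it lives in.
      offsets = {}
      with open(index_path, encoding="utf-8") as f:
          for line in f:
              offset, _page_id, title = line.rstrip("\n").split(":", 2)
              offsets[title] = int(offset)
      return offsets

  def read_block(dump_path, offset):
      # Seek to the block and decompress only it; the rest stays compressed.
      decomp = bz2.BZ2Decompressor()
      data = b""
      with open(dump_path, "rb") as f:
          f.seek(offset)
          while not decomp.eof:
              chunk = f.read(64 * 1024)
              if not chunk:
                  break
              data += decomp.decompress(chunk)
      return data.decode("utf-8")

  # Hypothetical usage: pull out the XML of one block without touching
  # the rest of the multi-hundred-megabyte dump.
  # index = load_index("eswiki-index.txt")
  # block_xml = read_block("eswiki-pages-articles.xml.bz2", index["Argentina"])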

Download

The current release is here:

Wikipedia-6.xo

This is a very large activity bundle; make sure you have at least 100 MB free before downloading. It takes Sugar quite some time to decompress it.

Help wanted

People

  • Chris Ball is the person to contact if you're unsure of how to get started.
  • Wade - working on the Python port and can help people trying to implement any of the desired featureset
  • Sj - working on the featureset & tests
  • madprime - working on a Spanish portal page

Previous tasks

Porting server code from Ruby to Python
Done. (wade)
  1. Embed the wiki content in a wrapper page which contains a search box, home link, etc.
  2. Find & fix rendering bugs in the wikitext parser.
  3. Implement Green-link, Red-link, Blue-link.
  4. Expand templates before processing the XML. Currently, templates are just stripped.
Creating a Python activity wrapper
Done. (wade) -- though this needs testing for Journal/Sugar integration.
  1. Write a README on how to run your ported code and include it in the bundle.
    • It would also be super nice to contact Patrick, the original "Wikipedia on the iPhone" developer, and work with that community to integrate your code into theirs. Chris made a start on this.
    • Contact the testers who have signed up below, and give them instructions on how you'd like them to try out the code you've written, and what kind of feedback you're looking for.
Creating a Spanish-language portal page
Done. (mad)

Testing

  1. Wikitext rendering.
    • There is no formal specification for how to convert wikitext to HTML, so we are relying on a modified version of InstaView to do this for us. We have already found bugs in our limited tests, so this is one of the bigger areas that needs testing.
    • Download the activity bundle and click around its links, looking for errors in rendering. Check these errors against Wikipedia, and report them in Trac if there is a real difference.
    • Note that templates (including info boxes) are currently stripped from the pages due to an issue with how the database is processed, so please don't report missing infoboxes.
    • We should prepare an automated test suite that attempts to render all portal-page-linked articles and checks the results (a sketch follows below).
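
A rough sketch of what such a test run could look like, assuming the activity's server is reachable on localhost; the base URL and the link-extraction regex are guesses, not the activity's real configuration.

  # Sketch only: smoke-test every article linked from the portal page.
  import re
  import urllib.error
  import urllib.request

  BASE = "http://127.0.0.1:8000"   # placeholder for the activity's local server

  def fetch(path):
      try:
          with urllib.request.urlopen(BASE + path, timeout=10) as resp:
              return resp.status, resp.read().decode("utf-8", "replace")
      except urllib.error.URLError as err:
          return getattr(err, "code", None), ""

  def portal_links(portal_html):
      # Collect local hrefs; external (green) links are skipped.
      return sorted(set(re.findall(r'href="/([^"#]+)"', portal_html)))

  if __name__ == "__main__":
      status, portal = fetch("/")
      assert status == 200, "portal page failed to render"
      failures = []
      for title in portal_links(portal):
          status, html = fetch("/" + title)
          if status != 200 or not html.strip():
              failures.append(title)
      print("%d linked articles failed to render" % len(failures))
      for title in failures:
          print("  " + title)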

Testers

If you're willing to test code once the developers have it working, place your name below (make sure it links to some form of contact information).