WikiBrowse
Revision as of 18:52, 2 May 2008
Introduction
The Wikipedia on an iPhone (WOAI) project by Patrick Collison makes it possible to keep a working, usable MediaWiki (read: Wikipedia) dump in a very small space (read: the XO's flash drive).
How it works
The wikipedia-iphone project's goal is to serve Wikipedia content out of a compressed copy of the XML dump file after indexing it. Architecturally, a set of C functions pulls out articles, and several interfaces sit on top of those C functions: the main interface is the iPhone app, but there's also a web server (written in Ruby with Mongrel) that runs locally and serves up pages from the compressed archive.
I have this (the Ruby web server) working on a local machine here (temporary: http://pullcord.laptop.org:9090/) serving up the entirety of the Spanish Wikipedia text (the compressed .bz2 file that pages are served out of is 400M, and the index is 10M).
The MediaWiki data dump is stored as a .bz2 file, which is made of smaller compressed blocks (each of which contains multiple articles). The WOAI code, among other things, goes through the dump and builds an index of which block each article is in. That way, when you want to read an article, your computer only decompresses the one small block it's in, and the rest of the huge MediaWiki dump stays compressed. This means that (1) it's really fast, since each lookup touches a single small block, and (2) it needs very little space, since only one block is ever decompressed at a time. For example, the compressed text for all of Spanish Wikipedia is 400M, with a block size of 400KB.
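The block-and-index idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it compresses small groups of articles as independent bz2 chunks and records each chunk's byte offset, whereas a real .bz2 dump keeps its blocks inside one stream. The record layout and the names build_archive and read_article are made up for this example and are not WOAI's actual format or API.

```python
import bz2
import io


def build_archive(articles, per_block=2):
    """Compress articles in small independent bz2 blocks and index them.

    Returns (archive_bytes, index), where index maps each title to the
    (offset, length) of the compressed block that holds it.
    """
    archive = io.BytesIO()
    index = {}
    titles = sorted(articles)
    for i in range(0, len(titles), per_block):
        group = titles[i:i + per_block]
        # Assumed record layout: "title\x00text\x01", back to back.
        raw = "".join(f"{t}\x00{articles[t]}\x01" for t in group).encode("utf-8")
        block = bz2.compress(raw)
        offset = archive.tell()
        archive.write(block)
        for t in group:
            index[t] = (offset, len(block))
    return archive.getvalue(), index


def read_article(archive_bytes, index, title):
    """Decompress only the block containing `title` and return its text."""
    offset, length = index[title]
    raw = bz2.decompress(archive_bytes[offset:offset + length]).decode("utf-8")
    for record in raw.split("\x01"):
        if not record:
            continue
        name, _, text = record.partition("\x00")
        if name == title:
            return text
    raise KeyError(title)


articles = {
    "Madrid": "Capital de España.",
    "Lima": "Capital del Perú.",
    "Quito": "Capital de Ecuador.",
}
archive, index = build_archive(articles)
print(read_article(archive, index, "Lima"))
```

Looking up "Lima" here decompresses only the first chunk; the chunk holding "Quito" is never touched, which is the property the index exists to provide.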
What we're doing
We'd love to make this into an XO activity.
The first step is to port the code from Ruby to Python so it'll run natively on the XO; details are below under Help wanted. After that, we'd love to have a few more things done, like...
- Wrapping this as a Sugar activity (code)
- Some article selection. Since it serves files out of the .xml.bz2, we can accomplish this by choosing what goes into the .xml.bz2 (perhaps there are already tools for doing this? I don't know much about it) as long as we deal with the link-breaking we do as a result. (content, curation)
- Add a subset of images. (curation)
- Finding some way to handle images - the current code only works with text, and image links are broken. (code)
- Removing the wikitext parser from the server and rewriting it as an independent plugin/middleware/etc. architecture so that other wiki syntaxes can be supported. JavaScript, slimming down the current MediaWiki PHP parser, and Python middleware are all options. The current solution is a very simple/incomplete parser within the server code itself. (code)
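One way the parser item above could look in Python is a small registry that maps a syntax name to a rendering function, so the server only ever calls render() and never knows which markup it is handling. This is a rough sketch, not the existing parser: PARSERS, register, render and render_mediawiki are hypothetical names, the /wiki/ link prefix is an assumption, and the renderer handles only bold, italics and internal links.

```python
import html
import re

# Hypothetical registry: syntax name -> function turning markup into HTML.
PARSERS = {}


def register(syntax):
    """Decorator that files a renderer under the given syntax name."""
    def decorator(func):
        PARSERS[syntax] = func
        return func
    return decorator


@register("mediawiki")
def render_mediawiki(text):
    """Tiny, incomplete wikitext renderer: bold, italics, internal links."""
    # Escape &, < and > but leave apostrophes alone, since they are markup.
    out = html.escape(text, quote=False)
    out = re.sub(r"'''(.+?)'''", r"<b>\1</b>", out)
    out = re.sub(r"''(.+?)''", r"<i>\1</i>", out)

    def link(match):
        # Handles [[Target|label]] and [[Target]]; the target is not URL-quoted.
        target, _, label = match.group(1).partition("|")
        return f'<a href="/wiki/{target}">{label or target}</a>'

    return re.sub(r"\[\[(.+?)\]\]", link, out)


def render(text, syntax="mediawiki"):
    """Render markup with whichever parser is registered for `syntax`."""
    return PARSERS[syntax](text)


print(render("'''Lima''' is in [[Peru|Perú]]."))
```

Supporting another wiki syntax would then mean registering one more function under a new name, without touching the server code.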
Help wanted
Chris Ball is the person to contact if you're unsure of how to get started.
Porting server code from Ruby to Python
We need a Python programmer (or a small team of Python programmers working together) to start the project off by porting the server code from Ruby to Python so it'll be easier to run on the XO. Ruby's quite easy to read and you don't have to be a Ruby programmer to do this (but it helps if you know Python). The code is very simple and short (fewer than 300 lines), so this should take no more than a weekend. Here's a suggested how-to-do-it procedure.
- Read this page to get an idea of what we're trying to do.
- Read the project homepage to get an overview of what the app does. Also see the Google Code project.
- Download the source code and take a look around. Notice how most of the code is either shell scripts or C, but there's a folder of Ruby (rb) code. This is the stuff we want to port.
- (Optional but recommended): Download and install Ruby and test out the existing code so you can see the app in action. Follow the instructions in the "Getting Started" section of the README file (in the source code you just downloaded) to get a Wikipedia data file parsed and the web server running. We'd recommend using a smaller Wikipedia than the English-language one.
- Take a look at the files in the rb folder. There are four main ones to port to Python (the rest are very short "helper" files and should take just a few minutes to rewrite).
- bzipreader.rb (Ruby interface to c/bzipreader.c; supports streaming bz2 files) - probably the most difficult, since you'll have to interface your Python code with C (bzipreader.c). If someone has a tutorial or resources on how to do this, please post the link here.
- index.rb (generate an article-to-block index using bzipreader.rb)
- server.rb (Mongrel-based server for using WP dumps with a web browser) - we'd suggest using the built-in Python webserver, BaseHTTPServer, for this.
- xmlprocess.rb (generate stripped, XML-less file from a vanilla WP dump)
- Put the new files (bzipreader.py, index.py, server.py... etc) in a "py" folder and delete the "rb" one when you're done porting.
- Remember to license your work under the GPL (you must, since the original code is GPL) by putting a copy of the license in your folder (or just leaving the COPYING file from the original source in).
- Write a README on how to run your ported code and include it in the bundle.
- When you have the first hint of functional progress (and definitely when you finish)...
- let Chris Ball, the library list, and the wikireader list know.
- It would also be super nice to contact Patrick, the original "wikipedia on the iphone" developer, and work with that community to integrate your code into theirs. Chris made a start on this.
- This would also be a good time to apply for Project hosting.
- Contact the testers who have signed up below, and give them instructions on how you'd like them to try out the code you've written, and what kind of feedback you're looking for.
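As a starting point for the server.rb port suggested in the list above, here is a minimal sketch built on Python's bundled web server (the BaseHTTPServer module in Python 2, http.server in Python 3, which this example uses). It is not a translation of the Ruby code: the /wiki/<title> URL scheme and the names make_handler, serve and lookup are assumptions, and lookup stands in for whatever function ends up pulling an article out of the compressed dump.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import unquote


def make_handler(lookup):
    """Build a request handler that serves /wiki/<title> via lookup(title).

    `lookup` is a placeholder: any callable that returns the article's HTML
    as a string and raises KeyError for unknown titles.
    """
    class WikiHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            prefix = "/wiki/"
            if not self.path.startswith(prefix):
                self.send_error(404, "Not found")
                return
            title = unquote(self.path[len(prefix):])
            try:
                body = lookup(title)
            except KeyError:
                self.send_error(404, "No such article")
                return
            data = body.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

        def log_message(self, format, *args):
            pass  # keep the sketch quiet; drop this override to get request logs

    return WikiHandler


def serve(lookup, port=8000):
    """Serve articles on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), make_handler(lookup)).serve_forever()
```

With a plain dictionary of pages, serve(pages.__getitem__) would answer requests such as http://127.0.0.1:8000/wiki/Lima; in the real port the lookup would go through the article index and the bz2 reader instead.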
Testers
If you're willing to test code once the developers have it working, place your name below (make sure it links to some form of contact information).