WikiBrowse: Difference between revisions
(minor changes) |
|||
(62 intermediate revisions by 24 users not shown) | |||
Line 1: | Line 1: | ||
<div style="align:center; border:1px solid #ccc; padding:3px; margin:3px; background-color:#cef">''For the old uncompressed activity, see [[WikiBrowse (old)]].''</div> |
|||
__NOTOC__ |
|||
{{translations}} |
|||
[[Image:Olpc_logo.jpg|center]] |
|||
<center><span style="font-size:200%">Wiki server</span></center> |
|||
[[{{PAGENAME}}]] is a wiki server and compressed set of wiki pages that together act as a self-contained browsable offline wikireader. |
|||
== Introduction == |
|||
=== Releases === |
|||
The [http://collison.ie/wikipedia-iphone/ Wikipedia on an Iphone] (WOAI) project by Patrick Collison makes it possible to have a working, usable mediawiki (read: wikipedia) dump in a very small space (read: the XO's flash drive). |
|||
* See [[Activities/WikiBrowse_Spanish_(latest)]] |
|||
* See [[Activities/WikiBrowse English (latest)]] |
|||
=== |
==== Source download ==== |
||
The project's developed inside our git repository: |
|||
* '[http://dev.laptop.org/git/projects/wikiserver wikiserver]' code |
|||
Chris's initial [http://blog.printf.net/articles/2008/06/02/wikipedia-on-xo blog post] mentions how this project got started. |
|||
The wikipedia-iphone project's goal is to serve wikipedia content out of a compressed copy of the XML dump file after indexing it. The architecture is that there are C functions to pull out articles, and several interfaces to those C functions: the main interface is the iPhone app, but there's also a web server (written in Ruby with Mongrel) that runs locally and serves up pages from the compressed archive. |
|||
==== Developers ==== |
|||
I have this (the Ruby web server) working on a local machine here (temporary: http://pullcord.laptop.org:9090/) serving up the entirety of the Spanish wikipedia text (the compressed .bz2 file that pages are served out of is 400M, and the index is 10M). |
|||
* [[User:Wade|Wade Brainerd]], [[User:Cjb|Chris Ball]], [http://mad.printf.net Madeleine Ball], [[User:Bemasc|Ben Schwartz]] |
|||
* [[User:Martinlanghoff]], [[User:Godiard|Gonzalo Odiard]], Rob Ochshorn |
|||
=== Testing === |
|||
The mediawiki data dump is stored as a [http://en.wikipedia.org/wiki/Bz2 .bz2 file], which is made of smaller compressed blocks (which each contain multiple articles). The WOAI code, among other things, goes through and makes an index of which block each article is in. That way, when you want to read an article, your computer only uncompresses the tiny block it's in - the rest of the huge mediawiki dump stays compressed. This means that (1) it's really fast, since you're working with tiny compressed bundles and (2) it's really small, since you're only decompressing one tiny bundle at a time. For example, the compressed text for all of Spanish Wikipedia is 400M, with a block size of 400KB. |
|||
# Wikitext rendering. |
|||
=== What we're doing === |
|||
#* Download the activity bundle and click around links, looking for errors in rendering. Check these errors against [http://es.wikipedia.org/ Wikipedia], and report them in Trac if there is a real difference. |
|||
#* We should prepare an automated test suite that attempts to render all portal-page-linked articles and checks the results. |
|||
# Blacklisting |
|||
#* Please report any inappropriate articles or images with the link at the top of each page. |
|||
We'd love to make this into an XO activity. |
|||
If you're willing to test code once the developers have it working, place your name below (make sure it links to some form of contact information). |
|||
The first step is to port the code from Ruby to Python so it'll run natively on the XO - details below in [[#Help wanted]]. After that, we'd love to have a few more things done, like... |
|||
* [[User:Mchua|Mel Chua]] |
|||
* Wrapping this as a Sugar activity (code) |
|||
* [[User:RafaelOrtiz | Rafael Ortiz]] |
|||
* Some article selection. Since it serves files out of the .xml.bz2, we can accomplish this by choosing what goes into the .xml.bz2 (perhaps there are already tools for doing this? I don't know much about it) as long as we deal with the link-breaking we do as a result. (content, curation) |
|||
* [[User:Sj|Sj]] |
|||
* Add a subset of images. (curation) |
|||
* [[User:Googleplex12|Evan]] |
|||
* Finding some way to handle images - the current code only works with text, and image links are broken. (code) |
|||
* [[User:DyD|Ian Daniher]] |
|||
* Removing the wikitext parser from the server and rewriting it as an independent plugin/middleware/etc architecture so that other wiki syntaxes can be supported. Javascript, slimming down the current Mediawiki php parser, and Python middleware are all options. The current solution is a very simple/incomplete parser within the server code itself. (code) |
|||
== Help wanted == |
|||
== Creating a slice of wikipedia == |
|||
[[User:cjb|Chris Ball]] is the person to contact if you're unsure of how to get started. |
|||
'''NOTE: this process was updated. To have information about the new process read http://wiki.sugarlabs.org/go/Activities/Wikipedia/HowTo''' |
|||
=== Porting server code from Ruby to Python === |
|||
The current wikipedia slice was created based on the traffic statistics for Spanish wikipedia for the month of February 2008. This file was provided by [http://en.wikipedia.org/wiki/User:Henrik User:Henrik] on English Wikipedia, who has collected traffic statistics for many different wikipedias (you can browse some of this at http://stats.grok.se ). Try approaching him if you want to acquire traffic stats for another wikipedia, or approach [[User:Madeleine Price Ball|Madeleine]] or [[Profiles/cjb|Chris]] if you want to get your hands on the Spanish or English Feb 2008 traffic stats. |
|||
We need a Python programmer (or a small team of Python programmers working together) to start the project off by porting the server code from Ruby to Python so it'll be easier to run on the XO. Ruby's quite easy to read and you don't have to be a Ruby programmer to do this (but it helps if you know Python). The code is very simple and short (less than 300 lines), so this should take no more than a weekend. Here's a suggested how-to-do-it procedure. |
|||
* Parse the wikipedia xml dump to get a list of all articles and a list of all redirects ( GetPages.pl ) |
|||
# Read this page to get an idea of what we're trying to do. |
|||
* Parse the wikipedia xml dump to get a list of all links to other pages made by each page ( PageLinks.pl ) |
|||
* Recount the Feb 2008 traffic list to combine the traffic of redirects with the traffic of the main pages they point to, and output each as having the same summed traffic stat. ( GetPageCounts.pl ) |
|||
# [http://collison.ie/wikipedia-iphone/wikipedia-iphone-0.1.tar.bz2 Download the source code] and take a look around. Notice how most of the code is either shell scripts or C, but there's a folder of ruby (rb) code. This is the stuff we want to port. |
|||
* Take the top N articles. A lot of these will get trimmed in a later step. For the current slice the top 70k articles were taken here. |
|||
# (Optional but recommended): Download and install [http://ruby-lang.org Ruby] and test out the existing code so you can see the app in action. Follow the instructions in the "Getting Started" section of the README file (in the source code you just downloaded) to get a wikipedia datafile parsed and the web server running. We'd recommend using a smaller wikipedia than the English language one. |
|||
* (optional) Remove blacklisted articles. ( RemoveBlacklist.pl ) |
|||
# Take a look at the files in the rb folder. There are four main ones to port to Python (the rest are very short "helper" files and should take just a few minutes to rewrite). |
|||
* Use the output of PageLinks.pl to assess the number of incoming links for each article within the current set. Remove all "orphaned" articles (articles with zero incoming links). ( RemoveUnlinked.pl ) |
|||
## '''bzipreader.rb''' (ruby interface to c/bzipreader.c; supports streaming bz2 files) - probably the most difficult, since you'll have to interface your python code with C (bzipreader.c). If someone has a tutorial or resources on how to do this, please post the link here. |
|||
:* This step also removes many articles with certain name types: Wikipedia:, Ayuda:, Wikiproyecto:, MediaWiki:, Plantilla:, WP:, Portal:, and Categoría:. |
|||
## '''index.rb''' (generate an article-to-block index using bzipreader.rb) |
|||
:* This step should be repeated until there are no orphaned articles left. About 26k articles are removed in the first round. There are about 200 articles that become newly-orphaned if you repeat it. |
|||
## '''server.rb''' (Mongrel-based server for using WP dumps with a web browser) - we'd suggest using the built-in Python webserver, [http://docs.python.org/lib/module-BaseHTTPServer.html BaseHTTPServer], for this. |
|||
* Re-add all template articles ("Plantilla:"). We ignored them because their traffic stats do not reflect their usage. |
|||
## '''xmlprocess.rb''' (generate stripped, XML-less file from a vanilla WP dump) |
|||
# Put the new files (bzipreader.py, index.py, server.py... etc) in a "py" folder and delete the "rb" one when you're done porting. |
|||
# Remember to license your work under the GPL (you must, since the original code is GPL) by putting a copy of the license in your folder (or just leaving the COPYING file from the original source in). |
|||
=== Creating and editing a slice of wikipedia === |
|||
See [[WikiBrowse Editing]] for a technique to edit and polish a wikipedia slice. |
|||
=== How well does this work? === |
|||
There are concerns that the traffic-based method will bias too heavily toward "popular culture". The results seem to be better than other schemes tried so far (e.g. ranking by number of incoming links). The vast majority of articles on the list are there due to constant traffic rather than a local peak in popularity. With ~ 24k articles (not counting redirects) the space gained by manually removing some articles is negligible. |
|||
The top 20 articles of the Feb 2008 Spanish wikipedia traffic stats are: |
|||
1 769326 Wikipedia |
|||
2 342777 Día de San Valentín |
|||
3 285346 Célula |
|||
4 281076 México |
|||
5 273326 Filosofía |
|||
6 270628 Ciencia |
|||
7 254431 España |
|||
8 245171 Física |
|||
9 233579 Ecología |
|||
10 226516 Naruto |
|||
11 225685 Calentamiento global |
|||
12 223276 Estados Unidos |
|||
13 217996 Valor |
|||
14 211550 Sistema operativo |
|||
15 208141 Psicología |
|||
16 207869 Wiki |
|||
17 197387 Computadora |
|||
18 197126 Biología |
|||
19 194576 Comunicación |
|||
20 193129 Baloncesto |
|||
===Method used on the English Wikipedia=== |
|||
''See [http://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team/SelectionBot SelectionBot] and its [http://toolserver.org/~cbm/release-data/2008-3-7/HTML/ March 2008 results].'' |
|||
Another approach, as used on the English Wikipedia, combines '''four''' importance parameters using a logarithmic function, then adds in a factor for article quality. The four importance parameters are: |
|||
* Importance rating by WikiProject |
|||
* Number of internal links into the page |
|||
* Number of interwiki versions of the article (i.e., other language versions) |
|||
* Number of hits (as used in the previous Spanish example) |
|||
This combination of four parameters helps to smooth out the selection. Data are collected on the toolserver via a bot. Work is currently being done (July 2008) to balance the selections between different WikiProjects, and the system is expected to be optimized by August 2008. |
|||
=== Old tasks === |
|||
;Porting server code from Ruby to Python : {{done}} (wade) |
|||
# Embed the wiki content in a wrapper page which contains a search box, home link, etc. |
|||
# Find & fix rendering bugs in the wikitext parser. |
|||
# Implement Green-link, Red-link, Blue-link. |
|||
# Expand templates before processing the XML. |
|||
;Creating a python activity wrapper : {{done}} (wade) -- though this needs testing for journal/sugar integration. |
|||
# Write a README on how to run your ported code and include it in the bundle. |
|||
# When you have the first hint of functional progress (and ''definitely'' when you finish)... |
|||
#* let [[User:Cjb|Chris Ball]], the [http://lists.laptop.org/listinfo/library library] list, and the [http://lists.laptop.org/listinfo/wikireader wikireader] list know. |
|||
#* It would also be super nice to contact Patrick, the original "wikipedia on the iphone" developer, and work with that community to integrate your code into theirs. Chris [http://groups.google.com/group/wikipedia-iphone/browse_thread/thread/58fe1472eea3f117 made a start] on this. |
|||
#* This would also be a good time to apply for [[Project hosting]]. |
|||
#* Contact the testers who have signed up below, and give them instructions on how you'd like them to try out the code you've written, and what kind of feedback you're looking for. |
|||
;Creating a spanish-language portal page: {{done}} mad is working on this. |
|||
=== Testers === |
|||
== See also == |
|||
If you're willing to test code once the developers have it working, place your name below (make sure it links to some form of contact information). |
|||
* [[Wikislices]] a collection of articles pulled from Wikipedia (without the wiki web server). |
|||
* [[Wikibooks]] a set of wiki pages in PDF format. |
|||
* [http://sugarlabs.org/go/Activities/InfoSlicer InfoSlicer] is a tool to create a collection from content (such as Wikipedia pages) on the Internet. |
|||
* [[User:Mchua|Mel Chua]] |
|||
[[Category:Open projects]] |
|||
[[Category:Content]] |
|||
[[Category:Wiki texts]] |
|||
{{Activity page |
|||
|icon=Image:Activity-wikiserver.svg |
|||
|genre=General Search and Discovery |
|||
|activity group=Activities/G1G1 |
|||
|short description=Wikibrowse is a self-contained wiki server. |
|||
|long description=A wiki webserver activity is being developed, as a self-contained browsable offline wikireader. Other efforts to generate and browse slices of online wikis are being discussed in the [[Wikislice]] project. |
|||
== Introduction == |
|||
The [http://collison.ie/wikipedia-iphone/ Wikipedia on an Iphone] (WOAI) project by Patrick Collison makes it possible to have a working, usable mediawiki (read: wikipedia) dump in a very small space (read: the XO's flash drive). |
|||
=== How it works === |
|||
The wikipedia-iphone project's goal is to serve wikipedia content out of a compressed copy of the XML dump file after indexing it. The architecture is that there are C functions to pull out articles, and several interfaces to those C functions: the main interface is the iPhone app, but there's also a web server (written in Ruby with Mongrel) that runs locally and serves up pages from the compressed archive. |
|||
Link coloring is present -- green links indicate articles which exist (on a local server, or on the internet) but not in the local dump, while blue links indicate locally-existing pages. |
|||
The mediawiki data dump is stored as a [http://en.wikipedia.org/wiki/Bz2 .bz2 file], which is made of smaller compressed blocks (which each contain multiple articles). The WOAI code, among other things, goes through and makes an index of which block each article is in. That way, when you want to read an article, your computer only uncompresses the tiny block it's in - the rest of the huge mediawiki dump stays compressed. This means that |
|||
* (1) it's really fast, since you're working with tiny compressed bundles and |
|||
* (2) it's really small, since you're only decompressing one tiny bundle at a time. |
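
The block-and-index idea described above can be sketched in Python as follows. This is only an illustration, not the WOAI code: it assumes each block is an independent bz2 stream (the real project indexes the blocks inside a single .bz2 file), and the names `build_archive`, `read_article` and `ARTICLE_SEP` are hypothetical.

```python
import bz2

# Hypothetical separator between articles inside one block (an assumption
# of this sketch, not something taken from the WOAI sources).
ARTICLE_SEP = "\x00"


def build_archive(articles, articles_per_block=2):
    """Compress articles in small independent blocks and index them.

    Returns (archive_bytes, index), where index maps a title to the
    (offset, length) of the compressed block that holds it.
    """
    archive = bytearray()
    index = dict()
    titles = sorted(articles)
    for start in range(0, len(titles), articles_per_block):
        group = titles[start:start + articles_per_block]
        payload = ARTICLE_SEP.join(t + "\n" + articles[t] for t in group)
        block = bz2.compress(payload.encode("utf-8"))
        for t in group:
            # Every article in the group points at the same block.
            index[t] = (len(archive), len(block))
        archive.extend(block)
    return bytes(archive), index


def read_article(archive, index, title):
    """Decompress only the block containing `title` and return its text."""
    offset, length = index[title]
    payload = bz2.decompress(archive[offset:offset + length]).decode("utf-8")
    for entry in payload.split(ARTICLE_SEP):
        name, _, text = entry.partition("\n")
        if name == title:
            return text
    raise KeyError(title)
```

Looking up one article touches a single small block, so the rest of the archive never has to be decompressed.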
|||
=== Downloads === |
|||
This is a very large activity bundle; make sure you have at least 100MB free before downloading. It takes Sugar a couple of minutes to decompress it, during which time the UI will be frozen. |
|||
|contact person=User:Cjb |
|||
|activity source=http://dev.laptop.org/git/projects/wikiserver |
|||
|language=fracais |
|||
|bundle URL=http://dev.laptop.org/~cjb/eswiki/Wikipedia-10.xo |
|||
|activity version=10 |
|||
|releases=8.2.0 (767) |
|||
|devel status=5. Production-stable |
|||
}} |
Latest revision as of 21:08, 11 January 2012
WikiBrowse is a wiki server and compressed set of wiki pages that together act as a self-contained browsable offline wikireader.
Releases
Source download
The project's developed inside our git repository:
- 'wikiserver' code
Chris's initial blog post mentions how this project got started.
Developers
- Wade Brainerd, Chris Ball, Madeleine Ball, Ben Schwartz
- User:Martinlanghoff, Gonzalo Odiard, Rob Ochshorn
Testing
- Wikitext rendering.
- Download the activity bundle and click around links, looking for errors in rendering. Check these errors against Wikipedia, and report them in Trac if there is a real difference.
- We should prepare an automated test suite that attempts to render all portal-page-linked articles and checks the results.
- Blacklisting
- Please report any inappropriate articles or images with the link at the top of each page.
If you're willing to test code once the developers have it working, place your name below (make sure it links to some form of contact information).
Creating a slice of wikipedia
NOTE: this process was updated. To have information about the new process read http://wiki.sugarlabs.org/go/Activities/Wikipedia/HowTo
The current wikipedia slice was created based on the traffic statistics for Spanish wikipedia for the month of February 2008. This file was provided by User:Henrik on English Wikipedia, who has collected traffic statistics for many different wikipedias (you can browse some of this at http://stats.grok.se ). Try approaching him if you want to acquire traffic stats for another wikipedia, or approach Madeleine or Chris if you want to get your hands on the Spanish or English Feb 2008 traffic stats.
- Parse the wikipedia xml dump to get a list of all articles and a list of all redirects ( GetPages.pl )
- Parse the wikipedia xml dump to get a list of all links to other pages made by each page ( PageLinks.pl )
- Recount the Feb 2008 traffic list to combine the traffic of redirects with the traffic of the main pages they point to, and output each as having the same summed traffic stat. ( GetPageCounts.pl )
- Take the top N articles. A lot of these will get trimmed in a later step. For the current slice the top 70k articles were taken here.
- (optional) Remove blacklisted articles. ( RemoveBlacklist.pl )
- Use the output of PageLinks.pl to assess the number of incoming links for each article within the current set. Remove all "orphaned" articles (articles with zero incoming links). ( RemoveUnlinked.pl )
- This step also removes many articles with certain name types: Wikipedia:, Ayuda:, Wikiproyecto:, MediaWiki:, Plantilla:, WP:, Portal:, and Categoría:.
- This step should be repeated until there are no orphaned articles left. About 26k articles are removed in the first round. There are about 200 articles that become newly-orphaned if you repeat it.
- Re-add all template articles ("Plantilla:"). We ignored them because their traffic stats do not reflect their usage.
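The core of the steps above (merging redirect traffic, taking the top N, and repeatedly removing orphans) can be sketched in Python. This is a minimal illustration and not a port of the Perl scripts: the function names and the input shapes (`traffic` as title-to-hits, `redirects` as redirect-to-target, `links` as title-to-outgoing-links) are assumptions, redirect hits are simply folded into the target page, and self-links are assumed not to count as incoming links.

```python
from collections import Counter


def merge_redirect_traffic(traffic, redirects):
    """Add each redirect's hits to the article it points to."""
    merged = Counter()
    for title, hits in traffic.items():
        merged[redirects.get(title, title)] += hits
    return merged


def top_n(merged, n):
    """Keep the n most-visited titles."""
    return {title for title, _ in merged.most_common(n)}


def remove_orphans(selected, links):
    """Repeatedly drop articles that no other selected article links to."""
    selected = set(selected)
    while True:
        linked = set()
        for source in selected:
            # Assumption: a page linking to itself does not rescue it.
            linked.update(t for t in links.get(source, ()) if t != source)
        orphans = selected - linked
        if not orphans:
            return selected
        # Removing orphans can orphan further pages, hence the loop.
        selected -= orphans
```

Blacklist filtering, namespace filtering and re-adding templates would be further set operations on the result of `remove_orphans`.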
Creating and editing a slice of wikipedia
See WikiBrowse Editing for a technique to edit and polish a wikipedia slice.
How well does this work?
There are concerns that the traffic-based method will bias too heavily toward "popular culture". The results seem to be better than other schemes tried so far (e.g. ranking by number of incoming links). The vast majority of articles on the list are there due to constant traffic rather than a local peak in popularity. With ~ 24k articles (not counting redirects) the space gained by manually removing some articles is negligible.
The top 20 articles of the Feb 2008 Spanish wikipedia traffic stats are:
1 769326 Wikipedia
2 342777 Día de San Valentín
3 285346 Célula
4 281076 México
5 273326 Filosofía
6 270628 Ciencia
7 254431 España
8 245171 Física
9 233579 Ecología
10 226516 Naruto
11 225685 Calentamiento global
12 223276 Estados Unidos
13 217996 Valor
14 211550 Sistema operativo
15 208141 Psicología
16 207869 Wiki
17 197387 Computadora
18 197126 Biología
19 194576 Comunicación
20 193129 Baloncesto
Method used on the English Wikipedia
See SelectionBot and its March 2008 results.
Another approach, as used on the English Wikipedia, combines four importance parameters using a logarithmic function, then adds in a factor for article quality. The four importance parameters are:
- Importance rating by WikiProject
- Number of internal links into the page
- Number of interwiki versions of the article (i.e., other language versions)
- Number of hits (as used in the previous Spanish example)
This combination of four parameters helps to smooth out the selection. Data are collected on the toolserver via a bot. Work is currently being done (July 2008) to balance the selections between different WikiProjects, and the system is expected to be optimized by August 2008.
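As a rough illustration of how four parameters might be combined logarithmically and then adjusted for quality, here is a Python sketch. The actual SelectionBot formula and coefficients are not given on this page, so the weights, the `log10(1 + x)` form, and the names `importance_score` and `selection_score` are all assumptions made for the example.

```python
import math

# Hypothetical equal weights; the real coefficients are not stated here.
WEIGHTS = {
    "project_importance": 1.0,  # importance rating by WikiProject
    "inlinks": 1.0,             # internal links into the page
    "interwikis": 1.0,          # other-language versions
    "hits": 1.0,                # page views
}


def importance_score(params, weights=WEIGHTS):
    """Sum the weighted logarithms of the four importance parameters."""
    return sum(
        weight * math.log10(1 + params.get(name, 0))
        for name, weight in weights.items()
    )


def selection_score(params, quality):
    """Importance plus an additive article-quality factor (assumed form)."""
    return importance_score(params) + quality
```

Taking logarithms keeps one very large value, such as the hit count of a briefly popular page, from dominating the other three parameters, which is one way to read the "smoothing" described above.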
Old tasks
- Porting server code from Ruby to Python
- Done. (wade)
- Embed the wiki content in a wrapper page which contains a search box, home link, etc.
- Find & fix rendering bugs in the wikitext parser.
- Implement Green-link, Red-link, Blue-link.
- Expand templates before processing the XML.
- Creating a python activity wrapper
- Done. (wade) -- though this needs testing for journal/sugar integration.
- Write a README on how to run your ported code and include it in the bundle.
- It would also be super nice to contact Patrick, the original "wikipedia on the iphone" developer, and work with that community to integrate your code into theirs. Chris made a start on this.
- Contact the testers who have signed up below, and give them instructions on how you'd like them to try out the code you've written, and what kind of feedback you're looking for.
- Creating a spanish-language portal page
- Done. mad is working on this.
See also
- Wikislices a collection of articles pulled from Wikipedia (without the wiki web server).
- Wikibooks a set of wiki pages in PDF format.
- InfoSlicer is a tool to create a collection from content (such as Wikipedia pages) on the Internet.
Activity Summary
Icon: | Image:Activity-wikiserver.svg |
Genre: | General Search and Discovery |
Activity group: | Activities/G1G1 |
Short description: | Wikibrowse is a self-contained wiki server. |
Description: | A wiki webserver activity is being developed, as a self-contained browsable offline wikireader. Other efforts to generate and browse slices of online wikis are being discussed in the Wikislice project.
Introduction
The Wikipedia on an Iphone (WOAI) project by Patrick Collison makes it possible to have a working, usable mediawiki (read: wikipedia) dump in a very small space (read: the XO's flash drive).
How it works
The wikipedia-iphone project's goal is to serve wikipedia content out of a compressed copy of the XML dump file after indexing it. The architecture is that there are C functions to pull out articles, and several interfaces to those C functions: the main interface is the iPhone app, but there's also a web server (written in Ruby with Mongrel) that runs locally and serves up pages from the compressed archive.
Link coloring is present -- green links indicate articles which exist (on a local server, or on the internet) but not in the local dump, while blue links indicate locally-existing pages.
The mediawiki data dump is stored as a .bz2 file, which is made of smaller compressed blocks (which each contain multiple articles). The WOAI code, among other things, goes through and makes an index of which block each article is in. That way, when you want to read an article, your computer only uncompresses the tiny block it's in - the rest of the huge mediawiki dump stays compressed. This means that
- (1) it's really fast, since you're working with tiny compressed bundles and
- (2) it's really small, since you're only decompressing one tiny bundle at a time.
Downloads
This is a very large activity bundle; make sure you have at least 100MB free before downloading. It takes Sugar a couple of minutes to decompress it, during which time the UI will be frozen. |
Maintainers: | User:Cjb |
Repository URL: | http://dev.laptop.org/git/projects/wikiserver |
Available languages: | |
Available languages (codes): | |
Pootle URL: | |
Related projects: | |
Contributors: | |
URL from which to download the latest .xo bundle: | http://dev.laptop.org/~cjb/eswiki/Wikipedia-10.xo |
Last tested version number: | 10 |
The releases with which this version of the activity has been tested: | 8.2.0 (767) |
Development status: | 5. Production-stable |
Ready for testing (development has progressed to the point where testers should try it out): | |
smoke tested: | |
test plan available: | |
test plan executed: | |
developer response to testing: | |