WikiBrowse/lang-es is a wiki server and compressed set of wiki pages that together act as a self-contained browsable offline wikireader.


Source download

The project is developed in our git repository:

Chris's initial blog post mentions how this project got started.



  1. Wikitext rendering.
    • Download the activity bundle and click through links, looking for rendering errors. Compare each error against the same article on Wikipedia, and report it in Trac if there is a real difference.
    • We should prepare an automated test suite that attempts to render all portal-page-linked articles and checks the results.
  2. Blacklisting
    • Please report any inappropriate articles or images with the link at the top of each page.
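A test suite along the lines proposed in item 1 could start from something like the sketch below. It is only an illustration: `linked_titles`, `check_rendering`, and the `render` callable are hypothetical names, not part of the existing WikiBrowse code, and the "unexpanded wikitext" check is an assumed heuristic rather than a full comparison against Wikipedia.

```python
import re

# Assumed pattern: matches [[Target]] and [[Target|label]] style wiki links.
LINK_RE = re.compile(r"\[\[([^\]|#]+)")


def linked_titles(portal_wikitext):
    """Return the unique article titles linked from the portal page, in order."""
    seen, titles = set(), []
    for match in LINK_RE.finditer(portal_wikitext):
        title = match.group(1).strip()
        if title and title not in seen:
            seen.add(title)
            titles.append(title)
    return titles


def check_rendering(titles, render):
    """Render each title and return {title: problem} for every failure.

    `render` is a hypothetical callable that takes a title and returns HTML.
    """
    failures = {}
    for title in titles:
        try:
            html = render(title)
        except Exception as exc:  # any crash in the renderer counts as a failure
            failures[title] = "exception: %r" % (exc,)
            continue
        if not html or not html.strip():
            failures[title] = "empty output"
        elif "{{" in html or "[[" in html:
            # Leftover template or link markup suggests a parser bug.
            failures[title] = "unexpanded wikitext in output"
    return failures
```

The returned dictionary can be turned directly into a list of candidate Trac reports, each of which still needs a manual check against Wikipedia.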

If you're willing to test code once the developers have it working, place your name below (make sure it links to some form of contact information).

Creating a slice of wikipedia

NOTE: This process has been updated. For information about the new process, read

The current Wikipedia slice was created from the traffic statistics for the Spanish Wikipedia for February 2008. This file was provided by User:Henrik on the English Wikipedia, who has collected traffic statistics for many different Wikipedias (you can browse some of this at ). Try approaching him if you want traffic stats for another Wikipedia, or approach Madeleine or Chris if you want the Spanish or English Feb 2008 traffic stats.

  • Parse the wikipedia xml dump to get a list of all articles and a list of all redirects ( )
  • Parse the wikipedia xml dump to get a list of all links to other pages made by each page ( )
  • Recount the Feb 2008 traffic list so that the traffic of each redirect is combined with the traffic of its main page, and output each as having the same summed traffic stat. ( )
  • Take the top N articles. A lot of these will get trimmed in a later step. For the current slice, the top 70k articles were taken here.
  • (optional) Remove blacklisted articles. ( )
  • Use the output of to assess the number of incoming links for each article within the current set. Remove all "orphaned" articles (articles with zero incoming links). ( )
  • This step also removes many articles with certain name types: Wikipedia:, Ayuda:, Wikiproyecto:, MediaWiki:, Plantilla:, WP:, Portal:, and Categoría:.
  • This step should be repeated until there are no orphaned articles left. About 26k articles are removed in the first round. There are about 200 articles that become newly-orphaned if you repeat it.
  • Re-add all template articles ("Plantilla:"). We ignored them because their traffic stats do not reflect their usage.
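The steps above can be sketched in Python as follows. This is a minimal illustration, not the project's actual scripts: all function names are hypothetical, the inputs are assumed to be plain dictionaries already extracted from the XML dump and the traffic file, and redirect traffic is collapsed onto the target title rather than written out once per redirect. The blacklist step is omitted, and self-links are assumed not to count as incoming links.

```python
# Name prefixes removed during pruning, as listed in the steps above.
EXCLUDED_PREFIXES = ("Wikipedia:", "Ayuda:", "Wikiproyecto:", "MediaWiki:",
                     "Plantilla:", "WP:", "Portal:", "Categoría:")


def resolve(title, redirects):
    """Follow redirects to the final target, guarding against loops."""
    seen = set()
    while title in redirects and title not in seen:
        seen.add(title)
        title = redirects[title]
    return title


def merge_redirect_traffic(traffic, redirects):
    """Sum the hits of every redirect into its target article."""
    merged = {}
    for title, hits in traffic.items():
        target = resolve(title, redirects)
        merged[target] = merged.get(target, 0) + hits
    return merged


def top_n(merged, n):
    """Return the set of the n most-visited titles (ties broken by title)."""
    ranked = sorted(merged, key=lambda t: (-merged[t], t))
    return set(ranked[:n])


def prune(selected, links, redirects):
    """Drop excluded name types, then remove orphans until none remain."""
    kept = {t for t in selected if not t.startswith(EXCLUDED_PREFIXES)}
    while True:
        linked = set()
        for source in kept:
            for target in links.get(source, ()):
                target = resolve(target, redirects)
                if target != source:  # assumption: self-links do not count
                    linked.add(target)
        orphans = kept - linked
        if not orphans:
            return kept
        kept -= orphans


def build_slice(traffic, redirects, links, n, all_titles):
    """Run the whole pipeline and re-add every template page at the end."""
    merged = merge_redirect_traffic(traffic, redirects)
    kept = prune(top_n(merged, n), links, redirects)
    templates = {t for t in all_titles if t.startswith("Plantilla:")}
    return kept | templates
```

Here `traffic` maps titles to hit counts, `redirects` maps a redirect title to its target, and `links` maps each title to the titles it links to. The loop in `prune` repeats the orphan check until it removes nothing, which corresponds to repeating the step until no orphaned articles are left.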

Creating and editing a slice of wikipedia

See WikiBrowse Editing for a technique to edit and polish a wikipedia slice.

How well does this work?

There are concerns that the traffic-based method will bias the selection too heavily toward "popular culture". The results seem to be better than those of other schemes tried so far (e.g., ranking by number of incoming links). The vast majority of articles on the list are there due to constant traffic rather than a local peak in popularity. With about 24k articles (not counting redirects), the space gained by manually removing some articles is negligible.

The top 20 articles of the Feb 2008 Spanish wikipedia traffic stats are:

1 769326       Wikipedia
2 342777       Día de San Valentín
3 285346       Célula
4 281076       México
5 273326       Filosofía
6 270628       Ciencia
7 254431       España
8 245171       Física
9 233579       Ecología
10 226516      Naruto
11 225685      Calentamiento global
12 223276      Estados Unidos
13 217996      Valor
14 211550      Sistema operativo
15 208141      Psicología
16 207869      Wiki
17 197387      Computadora
18 197126      Biología
19 194576      Comunicación
20 193129      Baloncesto

Method used on the English Wikipedia

Another approach, used on the English Wikipedia, combines four importance parameters using a logarithmic function and then adds a factor for article quality. See SelectionBot and its March 2008 results. The four importance parameters are:

  • Importance rating by WikiProject
  • Number of internal links into the page
  • Number of interwiki versions of the article (i.e., other language versions)
  • Number of hits (as used in the previous Spanish example)

This combination of four parameters helps to smooth out the selection. Data are collected on the toolserver by a bot. As of July 2008, work is under way to balance the selections between different WikiProjects, and the system is expected to be optimized by August 2008.
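As a rough illustration of this kind of scoring, the sketch below takes the WikiProject rating as a number, log-scales the three count-based parameters, and adds a quality term. The exact formula and weights used by SelectionBot are not given here, so the function names, the equal default weights, and the choice of `log10(1 + x)` are all assumptions made for the example.

```python
import math


def importance_score(project_rating, inlinks, interwikis, hits,
                     weights=(1.0, 1.0, 1.0, 1.0)):
    """Combine the four importance parameters into one number.

    The counts are log-scaled so that a single very large value (for
    example a spike in hits) cannot dominate the result. The weights
    are hypothetical placeholders, not SelectionBot's real values.
    """
    w_rating, w_links, w_interwiki, w_hits = weights
    return (w_rating * project_rating
            + w_links * math.log10(1 + inlinks)
            + w_interwiki * math.log10(1 + interwikis)
            + w_hits * math.log10(1 + hits))


def selection_score(importance, quality):
    """Add the article-quality factor to the importance score."""
    return importance + quality
```

Because each count enters through a logarithm, multiplying an article's hits by ten raises its score by the same fixed amount whether it started at a hundred hits or a hundred thousand, which is one way the combination smooths out the selection.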

Old tasks

Porting server code from Ruby to Python 
Done. (wade)
  1. Embed the wiki content in a wrapper page which contains a search box, home link, etc.
  2. Find & fix rendering bugs in the wikitext parser.
  3. Implement Green-link, Red-link, Blue-link.
  4. Expand templates before processing the XML.
Creating a python activity wrapper 
Done. (wade) -- though this needs testing for journal/sugar integration.
  1. Write a README on how to run your ported code and include it in the bundle.
    • It would also be super nice to contact Patrick, the original "wikipedia on the iphone" developer, and work with that community to integrate your code into theirs. Chris made a start on this.
    • Contact the testers who have signed up below, and give them instructions on how you'd like them to try out the code you've written, and what kind of feedback you're looking for.
Creating a spanish-language portal page
Done. mad is working on this.

See also

  • Wikislices: a collection of articles pulled from Wikipedia (without the wiki web server).
  • Wikibooks: a set of wiki pages in PDF format.
  • InfoSlicer: a tool to create a collection from content (such as Wikipedia pages) on the Internet.

