WikiBrowse
A wiki webserver activity is being developed as a self-contained, browsable offline wiki reader. Other efforts to generate and browse slices of online wikis are being discussed in the Wikislice project.
Introduction
The Wikipedia on an iPhone (WOAI) project by Patrick Collison makes it possible to have a working, usable MediaWiki (read: wikipedia) dump in a very small space (read: the XO's flash drive).
How it works
The wikipedia-iphone project's goal is to serve wikipedia content out of a compressed copy of the XML dump file after indexing it. The architecture is that there are C functions to pull out articles, and several interfaces to those C functions: the main interface is the iPhone app, but there's also a web server (written in Ruby with Mongrel) that runs locally and serves up pages from the compressed archive.
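As a rough illustration of the web-server interface (not the project's actual code), the sketch below shows a tiny local HTTP server in Python that hands article HTML to a browser; get_article here is a hypothetical placeholder for the real lookup into the compressed archive.

    # Minimal sketch of a local article server, assuming a hypothetical
    # get_article(title) that returns rendered HTML; the real project
    # pulls pages out of the compressed dump via its C functions.
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import unquote

    def get_article(title):
        # Placeholder: the real code looks the title up in the block index
        # and decompresses only the block that contains it.
        return "<html><body><h1>%s</h1><p>(article body)</p></body></html>" % title

    class WikiHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            title = unquote(self.path.lstrip("/")) or "Portada"
            html = get_article(title)
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.end_headers()
            self.wfile.write(html.encode("utf-8"))

    if __name__ == "__main__":
        # Serve only on localhost, like the original Ruby/Mongrel server.
        HTTPServer(("127.0.0.1", 8000), WikiHandler).serve_forever()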
Wade has helped port the original project's code to Python. Cjb and Wade are working on fixing some of the unfinished aspects of the original, particularly:
- Template rendering
- Redlink/greenlink/bluelink rendering
- Image thumbnail retrieval
- Automated subselection (currently: via a minimum # of inbound links and page view data)
Link coloring is present -- green links indicate articles which exist (on a local server, or on the internet) but not in the local dump, while blue links indicate locally-existing pages.
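For illustration only, link coloring boils down to a membership test like the Python sketch below; local_titles and all_titles are assumed sets (articles in the local dump, and articles known to exist at all), not names from the actual code.

    def link_color(title, local_titles, all_titles):
        """Classify a wiki link for rendering (hypothetical helper).

        blue  - the article is in the local dump
        green - it exists elsewhere (local server or the internet) but not in the dump
        red   - it does not exist at all
        """
        if title in local_titles:
            return "blue"
        if title in all_titles:
            return "green"
        return "red"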
The mediawiki data dump is stored as a .bz2 file, which is made of smaller compressed blocks (each containing multiple articles). The WOAI code, among other things, builds an index of which block each article is in. That way, when you want to read an article, your computer only uncompresses the tiny block it's in; the rest of the huge mediawiki dump stays compressed. This means that:
- (1) it's really fast, since you only ever work with one tiny compressed block, and
- (2) it's really small, since only one tiny block is decompressed at a time.
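A rough Python sketch of the idea, assuming each compressed block is an independent bz2 stream (as in Wikipedia's multistream dumps) and that an index mapping each title to its block's byte offset and length has already been built; the index format and function names are invented for illustration, not the WOAI code's real layout.

    import bz2

    def read_block(dump_path, offset, length):
        """Decompress a single block; the rest of the dump stays compressed."""
        with open(dump_path, "rb") as f:
            f.seek(offset)
            data = f.read(length)
        return bz2.decompress(data).decode("utf-8")

    def fetch_article(dump_path, index, title):
        """Fetch one article by decompressing only the block that holds it.

        `index` maps title -> (offset, length) and is built by a one-time
        scan of the dump (hypothetical format).
        """
        offset, length = index[title]
        block = read_block(dump_path, offset, length)
        # Each block holds several <page> elements; return the one we want.
        for page in block.split("</page>"):
            if "<title>" + title + "</title>" in page:
                return page + "</page>"
        return None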
Downloads
This is a very large activity bundle; make sure you have at least 100 MB free before downloading. It takes Sugar a couple of minutes to decompress it, during which time the UI will be frozen.
Current release: v.8
The current release is faster; it takes ~5 seconds to load a long page.
Previous releases
- Wikipedia v.6 - complete; small improvements over the previous version; work continues on corner cases in rendering.
Source download
The project is developed in our Git repository.
Previous tasks
- Porting server code from Ruby to Python
- Done. (wade)
- Embed the wiki content in a wrapper page which contains a search box, home link, etc.
- Find & fix rendering bugs in the wikitext parser.
- Implement Green-link, Red-link, Blue-link.
- Expand templates before processing the XML.
- Creating a Python activity wrapper
- Done. (wade) -- though this needs testing for Journal/Sugar integration.
- Write a README on how to run your ported code and include it in the bundle.
- It would also be super nice to contact Patrick, the original "wikipedia on the iphone" developer, and work with that community to integrate your code into theirs. Chris made a start on this.
- Contact the testers who have signed up below, and give them instructions on how you'd like them to try out the code you've written, and what kind of feedback you're looking for.
- Creating a Spanish-language portal page
- Done. mad is working on this.
Testing
- Wikitext rendering.
- Download the activity bundle, click around links, and look for errors in rendering. Check these errors against Wikipedia, and report them in Trac if there is a real difference.
- We should prepare an automated test suite that attempts to render all portal-page-linked articles and checks the results.
- Blacklisting
- Please report any inappropriate articles or images with the link at the top of each page.
Testers
If you're willing to test code once the developers have it working, place your name below (make sure it links to some form of contact information).
Creating a slice of wikipedia
The current wikipedia slice was created from the traffic statistics for the Spanish Wikipedia for February 2008. This file was provided by User:Henrik on English Wikipedia, who has collected traffic statistics for many different Wikipedias (you can browse some of this at http://stats.grok.se ). Try approaching him if you want to acquire traffic stats for another Wikipedia, or approach Madeleine or Chris if you want to get your hands on the Spanish or English Feb 2008 traffic stats.
- Parse the wikipedia XML dump to get a list of all articles and a list of all redirects ( GetPages.pl )
- Parse the wikipedia XML dump to get a list of all links each page makes to other pages ( PageLinks.pl ); a parsing sketch for these two steps appears after this list.
- Recount the Feb 2008 traffic list, folding each redirect's traffic into the traffic of the page it points to, and output both with the same summed traffic stat ( GetPageCounts.pl ); a sketch appears after this list.
- Take the top N articles. A lot of these will get trimmed in a later step. For the current slice the top 70k articles was taken here.
- (optional) Remove blacklisted articles. ( RemoveBlacklist.pl )
- Use the output of PageLinks.pl to assess the number of incoming links for each article within the current set. Remove all "orphaned" articles (articles with zero incoming links). ( RemoveUnlinked.pl )
- This step also removes many articles with certain name types: Wikipedia:, Ayuda:, Wikiproyecto:, MediaWiki:, Plantilla:, WP:, Portal:, and Categoría:.
- This step should be repeated until no orphaned articles are left (a sketch of the loop appears after this list). About 26k articles are removed in the first round; roughly 200 more become newly orphaned on a second pass.
- Re-add all template articles ("Plantilla:"). We ignored them because their traffic stats do not reflect their usage.
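The two parsing steps above ( GetPages.pl and PageLinks.pl ) are Perl scripts; the Python sketch below only illustrates the kind of work they do, extracting titles, redirect targets, and outgoing [[links]] from the XML dump, and does not reproduce their exact behavior or output format.

    import bz2
    import re
    import xml.etree.ElementTree as ET

    NS = "{http://www.mediawiki.org/xml/export-0.3/}"   # namespace varies by dump version
    LINK = re.compile(r"\[\[([^|\]#]+)")                 # target of a [[wikilink]]

    def parse_dump(path):
        """Yield (title, redirect_target_or_None, linked_titles) for each page."""
        with bz2.open(path, "rb") as f:
            for _, elem in ET.iterparse(f):
                if elem.tag != NS + "page":
                    continue
                title = elem.findtext(NS + "title")
                text = elem.findtext(NS + "revision/" + NS + "text") or ""
                redirect = None
                # Spanish dumps use #REDIRECT or #REDIRECCIÓN.
                if text.lstrip().upper().startswith(("#REDIRECT", "#REDIRECC")):
                    m = LINK.search(text)
                    redirect = m.group(1).strip() if m else None
                links = {m.group(1).strip() for m in LINK.finditer(text)}
                yield title, redirect, links
                elem.clear()   # free the finished element on a huge dump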
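The redirect-merging step ( GetPageCounts.pl ) can be pictured with the Python sketch below, written purely for illustration with assumed data structures: each redirect's traffic is added to its target, and both are then reported under the same summed count.

    def merge_redirect_traffic(traffic, redirects):
        """Fold redirect traffic into target pages (illustrative only).

        traffic:   dict of title -> raw page-view count (e.g. the Feb 2008 log)
        redirects: dict of redirect title -> target title
        Returns a dict in which a redirect and its target share the same
        summed traffic stat, as described in the step above.
        """
        summed = {}
        for title, count in traffic.items():
            target = redirects.get(title, title)
            summed[target] = summed.get(target, 0) + count
        merged = dict(summed)
        for redirect, target in redirects.items():
            merged[redirect] = summed.get(target, 0)
        return merged

So if, say, a redirect "EEUU" points at "Estados Unidos" (a made-up example), both come out with the combined view count, which lets the top-N cut treat them as one topic.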
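Finally, the orphan-pruning loop ( RemoveUnlinked.pl, repeated until stable) can be sketched as below; links is an assumed dict mapping each article to the set of titles it links to, and the namespace filtering mentioned above is left out.

    def prune_orphans(articles, links):
        """Repeatedly drop articles with no incoming links from within the set.

        articles: set of titles currently in the slice
        links:    dict of title -> set of titles that page links to
        Removing one round of orphans can orphan a few more pages (about
        26k go in the first round, roughly 200 more in the second), so
        loop until the set stops shrinking.
        """
        kept = set(articles)
        while True:
            linked = set()
            for title in kept:
                linked.update(t for t in links.get(title, ()) if t in kept and t != title)
            survivors = {t for t in kept if t in linked}
            if survivors == kept:
                return kept
            kept = survivors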
How well does this work?
There are concerns that the traffic-based method is biased too heavily toward "popular culture". The results seem to be better than other schemes tried so far (e.g. ranking by number of incoming links). The vast majority of articles on the list are there due to constant traffic rather than a local peak in popularity. With ~24k articles (not counting redirects), the space gained by manually removing some articles is negligible.
The top 20 articles of the Feb 2008 Spanish wikipedia traffic stats are:
- 1. Wikipedia (769,326)
- 2. Día de San Valentín (342,777)
- 3. Célula (285,346)
- 4. México (281,076)
- 5. Filosofía (273,326)
- 6. Ciencia (270,628)
- 7. España (254,431)
- 8. Física (245,171)
- 9. Ecología (233,579)
- 10. Naruto (226,516)
- 11. Calentamiento global (225,685)
- 12. Estados Unidos (223,276)
- 13. Valor (217,996)
- 14. Sistema operativo (211,550)
- 15. Psicología (208,141)
- 16. Wiki (207,869)
- 17. Computadora (197,387)
- 18. Biología (197,126)
- 19. Comunicación (194,576)
- 20. Baloncesto (193,129)