Wiki Slice
Goal
The Wiki Slice project aims to create snapshots of wiki sites for use as offline reading material. A prototype example is at http://dev.laptop.org/pub/content/wp/en/index.html .
Code
The code will be hosted on dev.laptop.org soon. For the moment, you can pull from Julius Lucks' public git repo for the project. To clone it, run
git clone http://slyjbl.hopto.org/~olpc/git/wikipedia_olpc_scripts.git
or, if you already have a git repo, fetch it into a new branch with
git fetch http://slyjbl.hopto.org/~olpc/git/wikipedia_olpc_scripts.git master:my_new_branch
Until we get hosting on dev.laptop.org, feel free to send me (julius at (younglucks (dot (com)))) patches and I will add them to this repo.
Files
See 'Implementation' below for an explanation of these files.
old_perl_scripts/      # the old perl implementation
op.pl                  # mncharity's new perl implementation
wiki_slice.py          # stub python implementation
wiki_page.py
wiki_slice_config.y
sandbox/               # some python play-around code
site/                  # includes olpc_wrapper.html for wrapping page html in, and other necc. css files, etc.
Implementation
There is an existing perl implementation, which I don't know much about; you can have a look inside the old_perl_scripts directory. There is also a new perl proof-of-concept implementation called op.pl by mncharity, and a new outline implementation in python. The two new implementations are described below.
Workflow
Taking wikipedia as an example of a wiki that we would want to slice, the general workflow is as follows:
- Get a list of pages to be included in the slice. This can be done via a configuration file, or can be 'harvested' from a web-page.
- For each of the pages:
  - grab the wikitext of the page using the action=raw cgi parameter passed to the wiki's index.php script
  - remove unwanted templates. These are specified in a configuration file.
  - get the wiki text from expanding the templates (via the Special:ExpandTemplates page)
  - for wiki-links that refer to pages also included in the slice, turn these into local anchor tag links
  - for wiki-links that refer to pages not included in the slice, de-link them (possibly highlight them)
  - generate html. There are several methods to do this:
    - create a sandbox entry and generate the page
    - create a page preview (does not always work)
    - use the preview generated by Special:ExpandTemplates
  - save the html
  - find <img> tags and:
    - download the image
    - make the url refer to the local image
  - wrap the html inside olpc's html wrapper
  - save the html
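As a rough illustration of three of the per-page steps above (removing blacklisted templates, rewriting wiki-links, and pointing <img> tags at local files), here is a minimal Python sketch. It is not taken from op.pl or wiki_page.py. The function names, the "Title.html" file naming and the images/ directory are assumptions made for the example, and the regular expressions deliberately ignore nested templates, section anchors and namespaced links.

```python
import posixpath
import re
from urllib.parse import urlparse

# [[Target]] or [[Target|label]]; nested brackets are not handled.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")
# Captures the src attribute of an <img> tag written with double quotes.
IMG_SRC_RE = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def strip_templates(wikitext, blacklist):
    """Remove non-nested {{Name}} / {{Name|args}} calls for blacklisted names."""
    for name in blacklist:
        pattern = r"\{\{\s*" + re.escape(name) + r"\s*(?:\|[^{}]*)?\}\}"
        wikitext = re.sub(pattern, "", wikitext)
    return wikitext


def localize_links(wikitext, slice_titles):
    """Link to pages inside the slice; de-link everything else."""
    def replace(match):
        target = match.group(1).strip()
        label = match.group(2) or target
        if target in slice_titles:
            # Assumed naming scheme: one local file per page title.
            href = target.replace(" ", "_") + ".html"
            return '<a href="%s">%s</a>' % (href, label)
        return label  # page not in the slice: keep only the text
    return LINK_RE.sub(replace, wikitext)


def localize_images(html, image_dir="images"):
    """Rewrite <img src> to a local path; return (html, urls_to_download)."""
    urls = []

    def replace(match):
        url = match.group(2)
        urls.append(url)
        filename = posixpath.basename(urlparse(url).path)
        return match.group(1) + image_dir + "/" + filename + match.group(3)

    return IMG_SRC_RE.sub(replace, html), urls
```

The image step only collects the remote urls; actually downloading them (and the html generation step itself) is left out so that the sketch runs without network access.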
op.pl
The op.pl implementation uses command-line options, local file caching, and pipelining to step through the workflow process. A typical usage is:
./op --process-article article_list_file en A 92413199
where article_list_file contains a list of articles as (lang, page_name, oldid) tuples. The last three arguments on this command line (en A 92413199) form one such tuple; oldid specifies the version of the page to be grabbed. A pipeline can be seen with
./op --stage1 article_list_file en Addition 92943836 | ./op --wikitext-templates | ./op --templates-curate en | sort -u > tmp.html
or
find en_raw_files -type f|./nl0|xargs -0 cat| ./op --wikitext-templates|sort -u
Essentially there are command-line options for each stage in the pipeline, with the stages
- "working with templates, prior to template expansion"(stage1)
- "template expansion"
- "working with wikitext, which must follow template expansion"(stage2)
- "rendering of wikitext as html"
- "working with html"(stage3)
op.pl is a single perl file.
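To make the staged design concrete, here is a small Python sketch of the same idea: read (lang, page_name, oldid) tuples and push each one through an ordered list of stages. This is not a port of op.pl. The on-disk format of article_list_file is not described here, so the sketch assumes one whitespace-separated tuple per line with '#' comments, and the stage functions are placeholders that only record their names.

```python
from functools import reduce
from typing import NamedTuple


class Article(NamedTuple):
    lang: str
    page_name: str
    oldid: str


def parse_article_list(text):
    """Parse 'lang page_name oldid' lines (assumed format) into Article tuples."""
    articles = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        lang, page_name, oldid = line.split()
        articles.append(Article(lang, page_name, oldid))
    return articles


def make_stage(name):
    """Return a placeholder stage; real work on the page would go here."""
    def stage(state):
        article, history = state
        return article, history + [name]
    return stage


# Same order as the stages listed above.
STAGES = [
    make_stage("stage1: templates before expansion"),
    make_stage("template expansion"),
    make_stage("stage2: wikitext after expansion"),
    make_stage("render wikitext as html"),
    make_stage("stage3: html"),
]


def run_pipeline(article, stages=STAGES):
    """Feed one article through every stage in order, like a shell pipe."""
    return reduce(lambda state, stage: stage(state), stages, (article, []))
```

Keeping each stage as a separate function mirrors the one-option-per-stage layout, so a single stage can be rerun or swapped without touching the others.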
wiki_slice.py
The wiki_slice python implementation is a little different in spirit. Rather than implementing each step of the workflow as a command-line option, it implements each step as a method of a WikiPage object. The file wiki_page.py defines three classes:
- WikiPage - represents a single wiki page and has methods for performing the workflow above
- WikiPageContainer - a container for a group of wiki pages - it tells the wikipages to run their workflow and tells them about blacklisted templates and what other pages are in their group
- WikiSite - a container for information about the wiki site the pages are coming from. Each WikiPage owns a WikiSite object.
A sample driver script that uses these classes is in wiki_slice.py, with config information written in wiki_slice_config.py. The driver looks something like:
import wiki_page
import wiki_slice_config

# Set up the site and the page container
site = wiki_page.WikiSite('http://en.wikipedia.org/w/index.php?')
container = wiki_page.WikiPageContainer(wiki_slice_config.base_dir)
for name in wiki_slice_config.template_blacklist:
    container.add_blacklist_template(name)

# Create pages and add them to the container
page = wiki_page.WikiPage('A', 'en', '92413199', site)
container.add_page(page)
pictogram_page = wiki_page.WikiPage('pictogram', 'en', '146198358', site)
container.add_page(pictogram_page)

# Process the pages
container.process_pages()

# Output the pages
container.dump_pages()
The idea is for batch jobs to be run with scripts like these, while troubleshooting can be done interactively within the python interpreter. Caching is not implemented yet, although there are notes in wiki_page.py on one way to make WikiPage objects cache their history. We could then cache WikiPage objects with pickle, or cache some dumped representation of them.
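One possible shape for the "dumped representation" variant of caching is sketched below. It pickles a plain dict per page rather than a WikiPage object, so it does not depend on anything in wiki_page.py. The dict keys, the function names and the file naming scheme are all assumptions for illustration, and page names containing '/' would need escaping before being used in a file name.

```python
import os
import pickle


def cache_path(cache_dir, lang, name, oldid):
    """File that holds the cached state of one (lang, name, oldid) page."""
    return os.path.join(cache_dir, "%s_%s_%s.pickle" % (lang, name, oldid))


def save_page_state(cache_dir, state):
    """Pickle a plain dict describing a page; it must hold lang, name, oldid."""
    os.makedirs(cache_dir, exist_ok=True)
    path = cache_path(cache_dir, state["lang"], state["name"], state["oldid"])
    with open(path, "wb") as handle:
        pickle.dump(state, handle)
    return path


def load_page_state(cache_dir, lang, name, oldid):
    """Return the cached dict, or None if this page has not been cached yet."""
    path = cache_path(cache_dir, lang, name, oldid)
    if not os.path.exists(path):
        return None
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Because oldid pins a page to one revision, a cache entry keyed this way never goes stale; fetching a newer revision simply creates a new entry.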
Useful Links
- MediaWiki API specification - we are already using &action=raw, which is apparently part of this API. The API is HTTP-based, and we should try to use as much of it as possible.
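For reference, the two kinds of request mentioned here can be built as follows. This is only a sketch: the helper names are made up for the example, and it constructs urls without fetching them. The action=raw form goes through index.php, while the second helper uses the api.php query module with prop=revisions, which returns the revision content wrapped in JSON.

```python
from urllib.parse import urlencode


def raw_url(index_php, title, oldid=None):
    """Build an index.php?action=raw url, optionally pinned to a revision."""
    params = {"title": title, "action": "raw"}
    if oldid is not None:
        params["oldid"] = oldid
    # Accept a base url with or without the trailing '?'.
    separator = "" if index_php.endswith("?") else "?"
    return index_php + separator + urlencode(params)


def api_revision_url(api_php, revid):
    """Build an api.php query url that asks for the content of one revision."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "revids": revid,
        "format": "json",
    }
    return api_php + "?" + urlencode(params)
```

For example, raw_url('http://en.wikipedia.org/w/index.php?', 'A', '92413199') yields the url for the wikitext of revision 92413199 of page A.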