Wiki Slice

From OLPC
Revision as of 21:43, 26 July 2007 by Lucks (talk | contribs) (added useful links)

Goal

The Wiki Slice project aims to create snapshots of wiki sites for use as offline reading material. A prototype example is at http://dev.laptop.org/pub/content/wp/en/index.html .

Code

The code will be hosted on dev.laptop.org soon. For the moment, you can pull from Julius Lucks' public git repo for the project. To get a fresh copy, run

git clone http://slyjbl.hopto.org/~olpc/git/wikipedia_olpc_scripts.git

or, if you already have a git repo, fetch the code into a new branch with

git fetch http://slyjbl.hopto.org/~olpc/git/wikipedia_olpc_scripts.git master:my_new_branch

Until we get hosting on dev.laptop.org, feel free to send me (julius at (younglucks (dot (com)))) patches and I will add them to this repo.

Files

See 'Implementation' below for an explanation of these files.

old_perl_scripts/ # the old perl implementation
op.pl # mncharity's new perl implementation
wiki_slice.py # stub python implementation (sample driver script)
wiki_page.py # the WikiPage, WikiPageContainer and WikiSite classes
wiki_slice_config.py # configuration for wiki_slice.py
sandbox/ # some python play-around code
site/ # includes olpc_wrapper.html for wrapping page html in, plus other necessary css files, etc.

Implementation

Right now there is an existing perl implementation that I don't know too much about; you can have a look inside the old_perl_scripts directory. There is also a new perl proof-of-concept implementation called op.pl, by mncharity, and a new outline implementation in python. The two new implementations are described below.

Workflow

Taking wikipedia as an example of a wiki that we would want to slice, the general workflow is as follows:

  1. Get a list of pages to be included in the slice. This can be done via a configuration file, or can be 'harvested' from a web-page.
  2. For each of the pages:
    1. grab the wikitext of the page using the action=raw cgi parameter passed to the wiki's index.php script
    2. remove unwanted templates. These are specified in a configuration file.
    3. expand the remaining templates to get plain wikitext (via the Special:ExpandTemplates page)
    4. for wiki-links that refer to pages also included in the slice, turn these into local anchor tag links
    5. for wiki-links that refer to pages not included in the slice - de-link them (possibly highlight them)
    6. generate html. There are several methods to do this:
      • create a sandbox entry and generate the page
      • create a page preview (does not always work)
      • use the preview generated by Special:ExpandTemplates
    7. save the html
    8. find <img> tags and:
      1. download the image
      2. make the url refer to the local image
    9. wrap the html inside olpc's html wrapper
    10. save the html
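
Steps 2.1, 2.4 and 2.5 above can be sketched in Python. This is a minimal illustration under stated assumptions, not code from the repository: the helper names (raw_wikitext_url, localize_links) and the local file naming scheme (page name plus ".html") are hypothetical. Only the title, action=raw and oldid parameters of index.php come from MediaWiki itself. How a local anchor tag survives the later html generation step is left open here.

```python
import re
from urllib.parse import urlencode

INDEX_PHP = "http://en.wikipedia.org/w/index.php"


def raw_wikitext_url(title, oldid=None, index_php=INDEX_PHP):
    """Step 2.1: build the index.php URL that returns a page's raw wikitext."""
    params = {"title": title, "action": "raw"}
    if oldid is not None:
        params["oldid"] = oldid
    return index_php + "?" + urlencode(params)


# Matches [[Target]] and [[Target|label]]; nested links are not handled.
WIKI_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def localize_links(wikitext, pages_in_slice):
    """Steps 2.4 and 2.5: link to pages in the slice, de-link everything else."""

    def replace(match):
        target = match.group(1).strip()
        label = match.group(2) or target
        if ":" in target:
            # Simplification: leave namespaced links (Image:, Category:, ...) alone.
            return match.group(0)
        if target in pages_in_slice:
            # Assumed naming scheme for local files: "<page name>.html".
            return '<a href="%s.html">%s</a>' % (target.replace(" ", "_"), label)
        return label  # page not in the slice: keep only the visible text

    return WIKI_LINK.sub(replace, wikitext)
```

For example, raw_wikitext_url("A", "92413199") gives the URL for the revision of the page "A" used elsewhere on this page, and localize_links(text, {"Addition"}) keeps only links to "Addition".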

op.pl

The op.pl implementation uses command-line options, local file caching, and pipelining to step through the workflow process. A typical usage is:

./op --process-article article_list_file en A 92413199

where article_list_file contains a list of articles as (lang, page_name, oldid) tuples. The last three arguments on this command line form one such tuple; oldid specifies the revision of the page to be grabbed. A pipeline can be seen with

./op --stage1 article_list_file en Addition 92943836 |./op --wikitext-templates|./op --templates-curate en |sort -u > tmp.html

or

find en_raw_files -type f|./nl0|xargs -0 cat| ./op --wikitext-templates|sort -u

Essentially, there is a command-line option for each stage in the pipeline. The stages are:

  • "working with templates, prior to template expansion" (stage1)
  • "template expansion"
  • "working with wikitext, which must follow template expansion" (stage2)
  • "rendering of wikitext as html"
  • "working with html" (stage3)

op.pl is a single perl file.
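
The on-disk format of the article list file is not described here. As a sketch only, assuming one whitespace-separated (lang, page_name, oldid) tuple per line, with underscores instead of spaces in page names, such a file could be read in Python as follows. The names ArticleRef and parse_article_list are hypothetical and are not part of op.pl.

```python
from typing import Iterable, Iterator, NamedTuple


class ArticleRef(NamedTuple):
    """One (lang, page_name, oldid) tuple from the article list."""
    lang: str
    page_name: str
    oldid: str


def parse_article_list(lines: Iterable[str]) -> Iterator[ArticleRef]:
    """Yield one ArticleRef per non-blank line (assumed format, see above)."""
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError("line %d: expected 'lang page_name oldid'" % number)
        yield ArticleRef(*fields)
```

Passing an open file object works, since iterating over a file yields its lines.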

wiki_slice.py

The wiki_slice python implementation is a little different in spirit. Rather than implementing each step of the workflow as a command-line option, it implements the steps as methods of a WikiPage object. The file wiki_page.py defines three classes:

  • WikiPage - represents a single wiki page and has methods for performing the workflow above
  • WikiPageContainer - a container for a group of wiki pages. It tells the pages to run their workflow, and tells them about blacklisted templates and which other pages are in their group
  • WikiSite - a container for information about the wiki site the pages are coming from. Each WikiPage owns a WikiSite object.
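
To make the division of labour concrete, here is a stripped-down sketch of how the three classes could fit together. It is not the code in wiki_page.py: the constructor arguments and the method names add_blacklist_template, add_page and process_pages follow the sample driver script, while every attribute name and the process method are assumptions, and the workflow itself is replaced by a placeholder.

```python
class WikiSite:
    """Information about the wiki the pages come from."""

    def __init__(self, index_url):
        self.index_url = index_url  # e.g. the URL of the wiki's index.php


class WikiPage:
    """A single wiki page; the real class would implement the workflow steps."""

    def __init__(self, name, lang, oldid, site):
        self.name = name
        self.lang = lang
        self.oldid = oldid
        self.site = site
        self.blacklist = set()  # template names to remove (set by the container)
        self.siblings = set()   # names of the other pages in the slice
        self.html = None

    def process(self):
        # Placeholder for: fetch wikitext, strip templates, fix links, render.
        self.html = "<h1>%s</h1>" % self.name


class WikiPageContainer:
    """A group of pages that are sliced together."""

    def __init__(self, base_dir):
        self.base_dir = base_dir  # where output would be written
        self.pages = []
        self.blacklist = set()

    def add_blacklist_template(self, name):
        self.blacklist.add(name)

    def add_page(self, page):
        self.pages.append(page)

    def process_pages(self):
        names = {page.name for page in self.pages}
        for page in self.pages:
            # Tell each page about the blacklist and its siblings, then run it.
            page.blacklist = set(self.blacklist)
            page.siblings = names - {page.name}
            page.process()
```

The point of the container is that a page cannot decide on its own which links to keep: it needs to know which other pages are in the slice.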

A sample driver script that uses these classes is in wiki_slice.py, with config information written in wiki_slice_config.py. The driver looks something like:

    import wiki_page
    import wiki_slice_config
    
    # Set up the site and the page container
    site = wiki_page.WikiSite('http://en.wikipedia.org/w/index.php?')
    container = wiki_page.WikiPageContainer(wiki_slice_config.base_dir)
    
    for name in wiki_slice_config.template_blacklist:
        container.add_blacklist_template(name)
    
    #create pages and add to the container
    page = wiki_page.WikiPage('A','en','92413199',site)
    container.add_page(page)
    
    pictogram_page = wiki_page.WikiPage('pictogram','en','146198358',site)
    container.add_page(pictogram_page)
    
    #process the pages
    container.process_pages()
    
    #output pages
    container.dump_pages()

The idea is for batch jobs to be run with scripts like these, while troubleshooting can be done interactively in the python interpreter. Caching is not implemented yet, although there are notes in wiki_page.py on one way to make WikiPage objects cache their history. We could then cache WikiPage objects with pickle, or cache some dumped representation of them.
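
As a rough sketch of the "dumped representation" variant, the following pickles a plain dictionary of a page's state rather than the object itself. Everything here is hypothetical: CachedPage stands in for WikiPage, and the history format, the helper names and the cache file naming are assumptions rather than what the notes in wiki_page.py describe. Page names containing "/" would need escaping before being used in a file name.

```python
import os
import pickle


class CachedPage:
    """Hypothetical stand-in for WikiPage that records its workflow history."""

    def __init__(self, name, lang, oldid):
        self.name = name
        self.lang = lang
        self.oldid = oldid
        self.history = []  # (step_name, text) pairs, oldest first

    def record(self, step_name, text):
        self.history.append((step_name, text))

    def to_state(self):
        """Dump the page to plain data that pickle can always handle."""
        return {"name": self.name, "lang": self.lang,
                "oldid": self.oldid, "history": list(self.history)}

    @classmethod
    def from_state(cls, state):
        page = cls(state["name"], state["lang"], state["oldid"])
        page.history = list(state["history"])
        return page


def cache_path(cache_dir, lang, name, oldid):
    # Assumed naming scheme: one file per (lang, name, oldid) tuple.
    return os.path.join(cache_dir, "%s_%s_%s.pickle" % (lang, name, oldid))


def save_page(cache_dir, page):
    os.makedirs(cache_dir, exist_ok=True)
    with open(cache_path(cache_dir, page.lang, page.name, page.oldid), "wb") as f:
        pickle.dump(page.to_state(), f)


def load_page(cache_dir, lang, name, oldid):
    """Return the cached page, or None if it has not been cached yet."""
    path = cache_path(cache_dir, lang, name, oldid)
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return CachedPage.from_state(pickle.load(f))
```

Keying the cache on oldid means a cached entry never goes stale, since a given revision of a page does not change.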

Useful Links

  • MediaWiki API specification - we are already using &action=raw, which is apparently part of this API. The API is HTTP-based. We should try to use as much of it as possible.
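
As an illustration of what using the API could look like, the sketch below builds an api.php request for the wikitext of one specific revision. The function name revision_content_url is hypothetical. The query parameters (action=query, prop=revisions, revids, rvprop=content, format=json) are standard MediaWiki API parameters, but whether this call fits the slicing workflow is an assumption.

```python
from urllib.parse import urlencode

API_PHP = "http://en.wikipedia.org/w/api.php"


def revision_content_url(oldid, api_php=API_PHP):
    """Build an api.php URL that returns the wikitext of one revision as JSON."""
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": oldid,      # the same number op.pl and WikiPage call oldid
        "rvprop": "content",  # ask for the revision's wikitext
        "format": "json",
    }
    return api_php + "?" + urlencode(params)
```

The returned URL can be fetched with any HTTP client; the response is a JSON document rather than bare wikitext, so it needs one extra parsing step compared with action=raw.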