Wiki Slice

From OLPC

Goal

The Wiki Slice project aims to create snapshots of wiki sites for use as offline reading material. A prototype example is at http://dev.laptop.org/pub/content/wp/en/index.html .

Code

You can pull or clone from the code hosted at dev.laptop.org:

git clone git://dev.laptop.org/projects/wikislice

or, if you already have a git repository, fetch the code into a new local branch:

git fetch git://dev.laptop.org/projects/wikislice master:my_new_branch

Also see the README.txt file.

Contributing

If you want to contribute files or patches, please send an email to library at lists.laptop.org with the subject heading 'Wikislice Contribution'. Please indicate what you are adding or patching, and one of the maintainers will review and push your code. Alternatively, you can specify a public git repository to pull from. You can also chat with one of the maintainers in #olpc-content on IRC at irc.freenode.net.

Implementation

Right now there is an existing perl implementation that I don't know too much about; you can have a look at it inside the old_perl_scripts directory. There are also two newer implementations: op.pl, a perl proof-of-concept by mncharity, and an outline implementation in python. Both of the newer implementations are described below.

Workflow

Taking wikipedia as an example of a wiki that we would want to slice, the general workflow is as follows:

  1. Get a list of pages to be included in the slice. This can be done via a configuration file, or can be 'harvested' from a web-page.
  2. For each of the pages:
    1. grab the wikitext of the page using the action=raw cgi parameter passed to the wiki's index.php script
    2. remove unwanted templates. These are specified in a configuration file.
    3. expand the remaining templates and get the resulting wikitext (via the Special:ExpandTemplates page)
    4. for wiki-links that refer to pages also included in the slice, turn these into local anchor tag links
    5. for wiki-links that refer to pages not included in the slice - de-link them (possibly highlight them)
    6. generate html. There are several methods to do this:
      • create a sandbox entry and generate the page
      • create a page preview (does not always work)
      • use the preview generated by Special:ExpandTemplates
    7. save the html
    8. find <img> tags and:
      1. download the image
      2. make the url refer to the local image
    9. wrap the html inside olpc's html wrapper
    10. save the html
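
The wikitext steps above (2.1, 2.2, 2.4 and 2.5) can be sketched in python as follows. This is only an illustration, not code from the wikislice repository: the function names, the `Title.html` naming of local files, and the simple regular expressions are all assumptions, and the template pattern does not handle nested templates.

```python
import re
from urllib.parse import urlencode


def raw_url(index_php, title, oldid):
    """Build the URL that returns a page's wikitext via action=raw (step 2.1)."""
    query = urlencode({"title": title, "action": "raw", "oldid": oldid})
    # The base URL may or may not already end with '?'.
    sep = "" if index_php.endswith("?") else "?"
    return index_php + sep + query


def remove_templates(wikitext, blacklist):
    """Delete {{name}} and {{name|...}} calls for blacklisted names (step 2.2).

    Simplification: a call that contains another template is left untouched.
    """
    for name in blacklist:
        pattern = r"\{\{\s*" + re.escape(name) + r"\s*(\|[^{}]*)?\}\}"
        wikitext = re.sub(pattern, "", wikitext)
    return wikitext


# Matches [[Target]] and [[Target|label]].
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def rewrite_links(wikitext, slice_titles):
    """Localize links to pages in the slice, de-link all others (steps 2.4, 2.5)."""
    def replace(match):
        target = match.group(1).strip()
        label = match.group(2) or target
        if target in slice_titles:
            # Assumed naming scheme for local files: Title.html
            href = target.replace(" ", "_") + ".html"
            return '<a href="%s">%s</a>' % (href, label)
        return label  # page is not in the slice: keep only the text
    return LINK_RE.sub(replace, wikitext)
```

For example, `raw_url('http://en.wikipedia.org/w/index.php?', 'A', '92413199')` gives the address from which the wikitext of that revision of page A could be fetched.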

op.pl

The op.pl implementation uses command-line options, local file caching, and pipelining to step through the workflow process. A typical usage is:

./op --process-article article_list_file en A 92413199

where article_list_file contains a list of articles as (lang, page_name, oldid) tuples. The last three arguments on the command line above (en A 92413199) form one of these tuples. oldid specifies the revision of the page to be grabbed. A pipeline can be seen with

./op --stage1 article_list_file en Addition 92943836 |./op --wikitext-templates|./op --templates-curate en |sort -u > tmp.html

or

find en_raw_files -type f|./nl0|xargs -0 cat| ./op --wikitext-templates|sort -u

Essentially there are command-line options for each stage in the pipeline, with the stages

  • "working with templates, prior to template expansion"(stage1)
  • "template expansion"
  • "working with wikitext, which must follow template expansion"(stage2)
  • "rendering of wikitext as html"
  • "working with html"(stage3)

op.pl is a single perl file.
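
For readers more at home in python, the two ideas above (article tuples, and stages chained like a shell pipeline) can be sketched as below. This is not a translation of op.pl. The file format (one whitespace-separated tuple per line, with `#` comment lines) and the names `Article`, `parse_article_list` and `run_pipeline` are assumptions made for illustration.

```python
from collections import namedtuple

Article = namedtuple("Article", ["lang", "page_name", "oldid"])


def parse_article_list(text):
    """Parse lines of 'lang page_name oldid' into Article tuples.

    Assumed format: one tuple per line, fields separated by whitespace.
    Blank lines and lines starting with '#' are skipped. A line with
    fewer than three fields raises ValueError.
    """
    articles = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # lang is the first field and oldid the last, so a page name
        # containing spaces stays in one piece.
        lang, rest = line.split(None, 1)
        page_name, oldid = rest.rsplit(None, 1)
        articles.append(Article(lang, page_name, oldid))
    return articles


def run_pipeline(data, stages):
    """Feed data through each stage in order, like 'a | b | c' in a shell."""
    for stage in stages:
        data = stage(data)
    return data
```

Each stage here is any function that takes the output of the previous one, so a stage can be tested on its own or swapped out, which is the same property the command-line options give op.pl.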

wiki_slice.py

The wiki_slice python implementation is a little different in spirit. Rather than implementing each step of the workflow as a command-line option, it implements each step as a method of a WikiPage object. The file wiki_page.py defines three classes:

  • WikiPage - represents a single wiki page and has methods for performing the workflow above
  • WikiPageContainer - a container for a group of wiki pages - it tells the wikipages to run their workflow and tells them about blacklisted templates and what other pages are in their group
  • WikiSite - a container for information about the wiki site the pages are coming from. Each WikiPage holds a WikiSite object.

A sample driver script that uses these classes is in wiki_slice.py, with config information written in wiki_slice_config.py. The driver looks something like:

    import wiki_page
    import wiki_slice_config
    
    #Setup the site and the page container
    site = wiki_page.WikiSite('http://en.wikipedia.org/w/index.php?')
    container = wiki_page.WikiPageContainer(wiki_slice_config.base_dir)
    
    for name in wiki_slice_config.template_blacklist:
        container.add_blacklist_template(name)
    
    #create pages and add to the container
    page = wiki_page.WikiPage('A','en','92413199',site)
    container.add_page(page)
    
    pictogram_page = wiki_page.WikiPage('pictogram','en','146198358',site)
    container.add_page(pictogram_page)
    
    #process the pages
    container.process_pages()
    
    #output pages
    container.dump_pages()

The idea is for batch jobs to be run with scripts like this one, while troubleshooting can be done interactively within the python interpreter. Caching is not implemented yet, although there are notes in wiki_page.py on one way to make WikiPage objects cache their history. We could then cache WikiPage objects with pickle, or cache some dumped representation of them.
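
A minimal sketch of the second option (caching a dumped representation with pickle) might look like the following. It is not taken from wiki_page.py: the record layout (a plain dict), the file naming, and the function names are assumptions. Using a plain dict rather than the WikiPage object itself keeps the cache readable even if the class changes.

```python
import os
import pickle


def cache_path(cache_dir, lang, name, oldid):
    """Return the cache file for one (lang, page_name, oldid) tuple."""
    # Assumed naming scheme; '/' is replaced so the name is a single file.
    filename = "%s_%s_%s.pickle" % (lang, name.replace("/", "_"), oldid)
    return os.path.join(cache_dir, filename)


def save_record(cache_dir, record):
    """Pickle a dict with at least the keys 'lang', 'name' and 'oldid'."""
    os.makedirs(cache_dir, exist_ok=True)
    path = cache_path(cache_dir, record["lang"], record["name"], record["oldid"])
    with open(path, "wb") as f:
        pickle.dump(record, f)


def load_record(cache_dir, lang, name, oldid):
    """Return the cached dict, or None if this revision is not cached."""
    path = cache_path(cache_dir, lang, name, oldid)
    if not os.path.exists(path):
        return None
    # Only unpickle files you wrote yourself; pickle is unsafe on untrusted data.
    with open(path, "rb") as f:
        return pickle.load(f)
```

Because the oldid is part of the key, a cached record never goes stale: a different revision of the same page simply gets a different cache file.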

History

  • 2007-07 - A python-based infrastructure was begun. (wikisnap, and the spike, were perl).
  • 2007-07 - An experimental spike explored doing manipulation in wikitext, rather than just in html (as wikisnap did). It used the originating wikipedias to expand templates and to render wikitext to html.
  • 2007-07 - For Trial-2, a smaller, ~20MB wikipedia snapshot was needed. A subset of the 2006 snapshot was made, based on User:Sj/wp-small.
  • 2007-07 - projects/wikislice created. wikisnap added.
  • 2006 - wikisnap was created, and used to generate a wikipedia snapshot based on User:Sj/wp.
  • 2006? - A 10 article x 6 language demo wikipedia snapshot was done.

Useful Links
