Wiki Slice
Goal
The Wiki Slice project aims to create snapshots of wiki sites for use as offline reading material.
Code
The code will be hosted on dev.laptop soon. For the moment, you can pull from Julius Lucks' public git repo for the project. Do
git clone http://slyjbl.hopto.org/~olpc/git/wikipedia_olpc_scripts.git
or if you already have a git repo
git fetch http://slyjbl.hopto.org/~olpc/git/wikipedia_olpc_scripts.git master:my_new_branch
Files
See 'Implementation' below for an explanation of these files.
old_perl_scripts/     # the old perl implementation
op.pl                 # mncharity's new perl implementation
wiki_slice.py         # stub python implementation
wiki_page.py
wiki_slice_config.y
sandbox/              # some python play-around code
site/                 # includes olpc_wrapper.html for wrapping page html in, and other necessary css files, etc.
Implementation
Right now there is an existing perl implementation that I don't know too much about; you can have a look inside the old_perl_scripts directory. There is also a new proof-of-concept perl implementation, op.pl, by mncharity, and a new outline implementation in python. Both of the new implementations are outlined below.
Workflow
Taking wikipedia as an example of a wiki that we would want to slice, the general workflow is as follows (a rough Python sketch of the per-page steps appears after this list):
- Get a list of pages to be included in the slice. This can be done via a configuration file, or can be 'harvested' from a web page.
- For each of the pages:
  - grab the wikitext of the page using the action=raw cgi parameter passed to the wiki's index.php script
  - remove unwanted templates. These are specified in a configuration file.
  - get the wikitext from expanding the templates (via the Special:ExpandTemplates page)
  - for wiki-links that refer to pages also included in the slice, turn these into local anchor tag links
  - for wiki-links that refer to pages not included in the slice, de-link them (possibly highlight them)
  - generate html. There are several methods to do this:
    - create a sandbox entry and generate the page
    - create a page preview (does not always work)
    - use the preview generated by Special:ExpandTemplates
  - save the html
  - find <img> tags and:
    - download the image
    - make the url refer to the local image
  - wrap the html inside olpc's html wrapper
  - save the html
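To make the per-page steps concrete, here is a minimal Python sketch in the spirit of the wiki_slice.py stub (not taken from it, and not how op.pl works). The helper names, regular expressions, and paths are illustrative assumptions; template expansion via Special:ExpandTemplates, html rendering, and the olpc html wrapper are left out.

#!/usr/bin/env python
# Illustrative sketch of the per-page workflow above -- not the actual
# op.pl or wiki_slice.py logic; helper names and regexes are assumptions.
import os
import re
import urllib.parse
import urllib.request

def fetch_wikitext(lang, page_name, oldid):
    """Grab the raw wikitext of one revision via index.php?action=raw."""
    query = urllib.parse.urlencode(
        {"title": page_name, "action": "raw", "oldid": oldid})
    url = "https://%s.wikipedia.org/w/index.php?%s" % (lang, query)
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

def strip_templates(wikitext, unwanted):
    """Remove simple (non-nested) {{...}} calls whose name is unwanted."""
    for name in unwanted:
        pattern = r"\{\{\s*%s[^{}]*\}\}" % re.escape(name)
        wikitext = re.sub(pattern, "", wikitext, flags=re.IGNORECASE)
    return wikitext

def rewrite_links(wikitext, pages_in_slice):
    """Turn [[Page|label]] links into local anchors, or de-link them."""
    def repl(match):
        target = match.group(1).strip()
        label = match.group(2) or target
        if target in pages_in_slice:
            return '<a href="%s.html">%s</a>' % (target.replace(" ", "_"), label)
        return label  # page is not in the slice: de-link it
    return re.sub(r"\[\[([^\]|]+)(?:\|([^\]]*))?\]\]", repl, wikitext)

def localize_images(html, out_dir="images"):
    """Download each http <img src=...> and point the tag at the local copy."""
    os.makedirs(out_dir, exist_ok=True)
    def repl(match):
        src = match.group(1)
        local = os.path.join(out_dir,
                             os.path.basename(urllib.parse.urlparse(src).path))
        urllib.request.urlretrieve(src, local)
        return 'src="%s"' % local
    return re.sub(r'src="(https?://[^"]+)"', repl, html)

if __name__ == "__main__":
    text = fetch_wikitext("en", "Addition", "92943836")
    text = strip_templates(text, ["cleanup", "stub"])
    text = rewrite_links(text, {"Addition", "Subtraction"})
    print(text[:500])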
op.pl
The op.pl implementation uses command-line options, local file caching, and pipelining to step through the workflow process. A typical usage is:
./op --process-article article_list_file en A 92413199
where the article_list_file contains a list of articles as (lang, page_name, oldid) tuples. The last three arguments on this command line form one such tuple; oldid specifies the version of the page to be grabbed. A pipeline can be seen with:
./op --stage1 article_list_file en Addition 92943836 |./op --wikitext-templates|./op --templates-curate en |sort -u > tmp.html
or
find en_raw_files -type f|./nl0|xargs -0 cat| ./op --wikitext-templates|sort -u
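For reference, the article_list_file named in these commands holds one (lang, page_name, oldid) tuple per article. The one-tuple-per-line, whitespace-separated layout shown here is an assumption; the two entries are just the tuples used in the example commands above:

en A 92413199
en Addition 92943836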
Essentially there are command-line options for each stage in the pipeline, the stages being:
- "working with templates, prior to template expansion" (stage1)
- "template expansion"
- "working with wikitext, which must follow template expansion" (stage2)
- "rendering of wikitext as html"
- "working with html" (stage3)
op.pl is a single perl file.