Wiki Slice
Goal
The Wiki Slice project aims to create snapshots of wiki sites for use as offline reading material. A prototype example is at http://dev.laptop.org/pub/content/wp/en/index.html.
Code
You can pull or clone from the code hosted at dev.laptop.org:
git clone git://dev.laptop.org/projects/wikislice
or, if you already have a git repo:
git fetch git://dev.laptop.org/projects/wikislice master:my_new_branch
Also see the README.txt file (http://dev.laptop.org/git.do?p=projects/wikislice;a=blob;f=README.txt;hb=master).
Contributing
If you want to contribute files or patches, please send an email to library at lists.laptop.org with the subject heading 'Wikislice Contribution'. Please indicate what you are adding/patching, and one of the maintainers will review and push your code. Alternatively, you can specify a public git repo to pull from. You can also chat with one of the maintainers in #olpc-content on IRC at irc.freenode.net.
Implementation
Right now there is an older perl implementation, in the old_perl_scripts directory, that I don't know too much about. There is a new proof-of-concept implementation in perl called op.pl by mncharity, and a new outline implementation in python. Both of these are described below.
Workflow
Taking wikipedia as an example of a wiki that we would want to slice, the general workflow is as follows (a small python sketch of the first step appears after the list):
- Get a list of pages to be included in the slice. This can be done via a configuration file, or can be 'harvested' from a web page.
- For each of the pages:
  - grab the wikitext of the page using the action=raw cgi parameter passed to the wiki's index.php script
  - remove unwanted templates. These are specified in a configuration file.
  - get the wikitext that results from expanding the templates (via the Special:ExpandTemplates page)
  - for wiki-links that refer to pages also included in the slice, turn these into local anchor tag links
  - for wiki-links that refer to pages not included in the slice, de-link them (and possibly highlight them)
  - generate html. There are several methods to do this:
    - create a sandbox entry and generate the page
    - create a page preview (does not always work)
    - use the preview generated by Special:ExpandTemplates
  - save the html
  - find <img> tags and:
    - download the image
    - make the url refer to the local image
  - wrap the html inside olpc's html wrapper
  - save the html
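As a rough illustration of the first step, here is a minimal python sketch of grabbing the raw wikitext for one page revision. It is not part of the wikislice code, and fetch_wikitext is a made-up name; it just shows the action=raw request described above.

 import urllib.parse
 import urllib.request

 def fetch_wikitext(base_url, title, oldid):
     # action=raw asks index.php for the page's wikitext rather than rendered html;
     # oldid pins an exact revision so the slice is reproducible
     query = urllib.parse.urlencode({'title': title, 'action': 'raw', 'oldid': oldid})
     with urllib.request.urlopen(base_url + query) as response:
         return response.read().decode('utf-8')

 # the 'A' article revision used in the examples on this page
 wikitext = fetch_wikitext('http://en.wikipedia.org/w/index.php?', 'A', '92413199')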
op.pl
The op.pl implementation uses command-line options, local file caching, and pipelining to step through the workflow process. A typical usage is:
./op --process-article article_list_file en A 92413199
where the article_list_file contains a list of articles as (lang, page_name, oldid) tuples. The last three arguments on this line form one such tuple; oldid specifies the version of the page to be grabbed. A pipeline can be seen with
./op --stage1 article_list_file en Addition 92943836 |./op --wikitext-templates|./op --templates-curate en |sort -u > tmp.html
or
find en_raw_files -type f|./nl0|xargs -0 cat| ./op --wikitext-templates|sort -u
Essentially there are command-line options for each stage in the pipeline. The stages are:
- "working with templates, prior to template expansion"(stage1)
- "template expansion"
- "working with wikitext, which must follow template expansion"(stage2)
- "rendering of wikitext as html"
- "working with html"(stage3)
op.pl is a single perl file.
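Based on the usage above, an article_list_file presumably holds one (lang, page_name, oldid) tuple per line. The exact on-disk format is an assumption here, but it might look like:

 en A 92413199
 en Addition 92943836
 en pictogram 146198358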
wiki_slice.py
The wiki_slice python implementation is a little different in spirit. Rather than having each step of the workflow implemented as a command-line option, the steps are implemented as methods on a WikiPage object. The file wiki_page.py defines three classes:
- WikiPage - represents a single wiki page and has methods for performing the workflow above
- WikiPageContainer - a container for a group of wiki pages - it tells the wikipages to run their workflow and tells them about blacklisted templates and what other pages are in their group
- WikiSite - a container for information about the wiki site the pages are coming from. Each WikiPage owns a WikiSite object.
A sample driver script that uses these classes is in wiki_slice.py, with config information written in wiki_slice_config.py. The driver looks something like:
 import wiki_page
 import wiki_slice_config

 # Setup the site and the page container
 site = wiki_page.WikiSite('http://en.wikipedia.org/w/index.php?')
 container = wiki_page.WikiPageContainer(wiki_slice_config.base_dir)
 for name in wiki_slice_config.template_blacklist:
     container.add_blacklist_template(name)

 # create pages and add to the container
 page = wiki_page.WikiPage('A','en','92413199',site)
 container.add_page(page)
 pictogram_page = wiki_page.WikiPage('pictogram','en','146198358',site)
 container.add_page(pictogram_page)

 # process the pages
 container.process_pages()

 # output pages
 container.dump_pages()
The idea is for batch jobs to be run with scripts like these, while more exploratory troubleshooting can be done within the python interpreter. Caching is not implemented yet, although there are notes in wiki_page.py on one way to make WikiPage objects cache their history. We could then cache WikiPage objects with pickle, or cache some dumped representation of them.
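As a sketch of that pickle idea (dump_page_cache and load_page_cache are hypothetical names; no caching like this exists in wiki_page.py yet):

 import pickle

 def dump_page_cache(pages, path='page_cache.pkl'):
     # serialize already-processed WikiPage objects so a later run can skip re-fetching
     with open(path, 'wb') as f:
         pickle.dump(pages, f)

 def load_page_cache(path='page_cache.pkl'):
     # restore the cached WikiPage objects
     with open(path, 'rb') as f:
         return pickle.load(f)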
History
- 2007-07 - A python-based infrastructure was begun (wikisnap and the spike were perl).
- 2007-07 - An experimental spike explored doing manipulation in wikitext, rather than just in html (as wikisnap did), using the originating wikipedias to expand templates and to render wikitext to html.
- 2007-07 - For Trial-2, a smaller, ~20MB wikipedia snapshot was needed. A subset of the 2006 snapshot was made, based on User:Sj/wp-small.
- 2007-07 - projects/wikislice created. wikisnap added.
- 2006 - wikisnap was created, and used to generate a wikipedia snapshot based on User:Sj/wp.
- 2006? - A 10 article x 6 language demo wikipedia snapshot was done.
Useful Links
- Mediawiki API specification (http://www.mediawiki.org/wiki/API) - we are already using &action=raw, which is apparently part of this API. The API is HTTP-based, and we should try to use as much of it as possible.
  - PyWikipediaBot (http://sourceforge.net/projects/pywikipediabot/) - a python tool that uses these APIs
- http://en.wikipedia.org/wiki/User:Sj/wp
- http://en.wikipedia.org/wiki/User:Sj/wp-small