User:Godiard/WkipediaDataRebuild

Updating data in Wikipedia activities

Processing new Wikipedia data dumps

Wikipedia provides periodic XML dumps for every language. My tests were done with the Spanish dump. The file used was eswiki-20110810-pages-articles.xml.bz2, from http://dumps.wikimedia.org/eswiki/20110810/

After decompressing the file, I processed it with pages_parser.py.
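
A minimal sketch of the decompression step in Python, streaming through the bz2 module (bunzip2 works just as well; the decompressed XML is around 5.7 GB):

 import bz2
 import shutil

 # Stream-decompress the dump without loading it into memory.
 # The decompressed XML is about 5.7 GB, so check the free disk space first.
 with bz2.open('eswiki-20110810-pages-articles.xml.bz2', 'rb') as src:
     with open('eswiki-20110810-pages-articles.xml', 'wb') as dst:
         shutil.copyfileobj(src, dst)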

This parser generates the following files:

5759736764 sep  2 18:02 eswiki-20110810-pages-articles.xml
 264264855 sep  9 19:30 eswiki-20110810-pages-articles.xml.blacklisted
 609941327 sep 19 01:13 eswiki-20110810-pages-articles.xml.links
4328870460 sep  9 19:30 eswiki-20110810-pages-articles.xml.processed
  66735340 sep 18 20:50 eswiki-20110810-pages-articles.xml.redirects
  45633058 sep  9 19:30 eswiki-20110810-pages-articles.xml.templates
  24481677 sep  9 19:30 eswiki-20110810-pages-articles.xml.titles


The blacklisted file contains the content of all the pages in blacklisted namespaces ('Wikipedia:', 'MediaWiki:'). The list of blacklisted namespaces is configured in pages_parser.py.
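
As an illustration, the filter amounts to something like this (the function name is hypothetical; the real list lives in pages_parser.py and may include more namespaces):

 # Namespaces whose pages go to the .blacklisted file instead of .processed.
 BLACKLISTED_NAMESPACES = ('Wikipedia:', 'MediaWiki:')

 def is_blacklisted(title):
     """Return True if the page title belongs to a blacklisted namespace."""
     return title.startswith(BLACKLISTED_NAMESPACES)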

The links file has the same structure as the all_links file generated by PageLinks.pl (see Original process). Every page is stored on one line: the first word is the page title, followed, space separated, by all the links in the page. Spaces in the page title or in the links are replaced by "_".
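
For illustration, a line in this format can be written and read back like this (helper names are hypothetical, not taken from pages_parser.py):

 def write_links_line(out, title, links):
     """One page per line: the title first, then all its links,
     space separated, with internal spaces replaced by '_'."""
     words = [w.replace(' ', '_') for w in [title] + links]
     out.write(' '.join(words) + '\n')

 def read_links_line(line):
     """Inverse: recover the title and the list of links of a page."""
     words = line.split()
     return words[0], words[1:]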

The processed file has the same structure as the old all_pages file:

'\01\n'
'%s\n' % title
'%d\n' % len(page)
'\02\n'
'%s\n' % page
'\03\n'

processed, like all_pages, contains only the real pages, not the redirects.
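
A sketch of reading these records back, assuming the length field counts the same units (bytes or characters) that were used when the file was written:

 def read_pages(f):
     """Yield (title, text) pairs from a processed or templates file."""
     while f.readline() == '\01\n':      # start-of-record marker
         title = f.readline().rstrip('\n')
         length = int(f.readline())      # declared length of the page text
         f.readline()                    # skip the '\02' separator
         text = f.read(length)
         f.readline()                    # newline written after the text
         f.readline()                    # '\03' end-of-record marker
         yield title, text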

The redirects file has a list of all the redirects, in the same format as the old all_redirects file:

[[title]]\t[[destination]]

Right now, the Spanish Wikipedia uses two different tags to identify a redirect, '#REDIRECT' and '#REDIRECCIÓN', and 1214139 pages were identified as redirects. The tags are configured in pages_parser.py.
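
A hypothetical sketch of the detection (the real matching logic lives in pages_parser.py), assuming a redirect page starts with one of the tags followed by the destination between double brackets:

 REDIRECT_TAGS = ('#REDIRECT', '#REDIRECCIÓN')

 def redirect_target(text):
     """Return the destination of a redirect page, or None otherwise.
     A redirect looks like: #REDIRECT [[Destination]]"""
     stripped = text.lstrip()
     for tag in REDIRECT_TAGS:
         if stripped.upper().startswith(tag):
             start = stripped.find('[[')
             end = stripped.find(']]', start)
             if start == -1 or end == -1:
                 return None
             return stripped[start + 2:end]
     return None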

The templates file has the same format as the processed file.

The titles file has the same format as the old all_pages file: one line per page, containing the page title. Excluding the blacklisted pages, the templates, and the redirects, this file references 1060777 pages.

Selecting a piece

When the first version of wikiserver was prepared, the Spanish Wikipedia had between 200,000 and 300,000 articles [1]. Selecting 40,000 articles meant between 13% and 20% of the total (40,000/300,000 ≈ 13%; 40,000/200,000 = 20%). The blog post from Chris Ball [2] suggests this percentage was near 20% too.

At the moment, the Spanish Wikipedia has nearly 1,000,000 articles, so selecting 40,000 articles covers only 4%. Making a good selection is difficult, and it will become more and more difficult over time.

The selection step generated the following files:

-rw-rw-r--. 1 gonzalo gonzalo   79295775 sep 19 01:19 eswiki-20110810-pages-articles.xml.links_counted
-rw-rw-r--. 1 gonzalo gonzalo    1128569 sep 23 00:45 eswiki-20110810-pages-articles.xml.pages_selected-level-1
-rw-rw-r--. 1 gonzalo gonzalo     250428 sep 21 02:58 eswiki-20110810-pages-articles.xml.pages_selected-level-1.favorites
-rw-rw-r--. 1 gonzalo gonzalo    3869636 sep 21 06:05 eswiki-20110810-pages-articles.xml.pages_selected-level-2
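
The links_counted file presumably plays the same role as count_links in the original process: how many pages link to each page, used to rank candidates for selection. A hedged sketch of that counting, assuming a "count title" line format (the ordering that sort -nr expects):

 from collections import Counter

 # Count, for every page, how many other pages link to it.
 incoming = Counter()
 with open('eswiki-20110810-pages-articles.xml.links') as f:
     for line in f:
         words = line.split()
         incoming.update(words[1:])   # every link on this page is an incoming link elsewhere

 # Write one 'count title' line per page, most linked first.
 with open('eswiki-20110810-pages-articles.xml.links_counted', 'w') as out:
     for title, count in incoming.most_common():
         out.write('%d %s\n' % (count, title))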


[1] http://es.wikipedia.org/wiki/Wikipedia_en_espa%C3%B1ol

[2] http://blog.printf.net/articles/2008/06/02/wikipedia-on-xo

Original process

cat eswiki-20110810-pages-articles.xml | ../wikiserver/tools/GetPages.pl all_pages all_redirects
[gonzalo@aronax NEW_DATA]$ wc -l all_pages 
2328436 all_pages
[gonzalo@aronax NEW_DATA]$ wc -l all_redirects 
531754 all_redirects
NOTE: I do not get the same count, probably because it is excluding categories.
cat eswiki-20110810-pages-articles.xml | ../wikiserver/tools/PageLinks.pl all_pages all_redirects > all_links
cat eswiki-20110810-pages-articles.xml | ../wikiserver/tools/CountLinks.pl all_pages all_redirects  all_links > count_links
sort -nr count_links > count_links_sorted

Links

http://wiki.laptop.org/go/WikiBrowse

http://meta.wikimedia.org/wiki/Data_dumps

http://en.wikipedia.org/wiki/Wikipedia:Database_download

http://users.softlab.ntua.gr/~ttsiod/buildWikipediaOffline.html