User:Godiard/WkipediaDataRebuild
Updating data in Wikipedia activities
Processing new Wikipedia data dumps
Wikipedia provides periodic XML dumps for every language. My tests were done with the Spanish dump. The file used was eswiki-20110810-pages-articles.xml.bz2 from http://dumps.wikimedia.org/eswiki/20110810/
After decompressing the file, I processed it with pages_parser.py
This parser generates the following files:
5759736764 sep  2 18:02 eswiki-20110810-pages-articles.xml
 264264855 sep  9 19:30 eswiki-20110810-pages-articles.xml.blacklisted
 609941327 sep 19 01:13 eswiki-20110810-pages-articles.xml.links
4328870460 sep  9 19:30 eswiki-20110810-pages-articles.xml.processed
  66735340 sep 18 20:50 eswiki-20110810-pages-articles.xml.redirects
  45633058 sep  9 19:30 eswiki-20110810-pages-articles.xml.templates
  24481677 sep  9 19:30 eswiki-20110810-pages-articles.xml.titles
The blacklisted file has the content of all the pages in blacklisted namespaces ('Wikipedia:', 'MediaWiki:').
The list of blacklisted namespaces is configured in pages_parser.py
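A minimal sketch of what such a namespace filter might look like (the real logic lives in pages_parser.py; the names here are illustrative):

```python
# Pages in these namespaces go to the .blacklisted file.
# In the real parser the list is configured in pages_parser.py.
BLACKLISTED_NAMESPACES = ('Wikipedia:', 'MediaWiki:')

def is_blacklisted(title):
    """Return True if the page title is in a blacklisted namespace."""
    # str.startswith accepts a tuple, so one call checks all prefixes.
    return title.startswith(BLACKLISTED_NAMESPACES)
```

For example, is_blacklisted('Wikipedia:Portada') is True, while a regular article title passes through.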
The links file has the same structure as the all_links file generated by PageLinks.pl (see #Original Process). Every page is stored in one line: the first word is the page title, followed, space separated, by all the links in the page. Spaces in page titles and links are replaced by "_"
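Because spaces inside titles are stored as underscores, a line of the links file can be split safely on whitespace. A small illustrative parser (not the actual wikiserver code):

```python
def parse_links_line(line):
    """Split one line of the .links file into (title, links).

    Spaces inside titles are stored as underscores, so splitting
    the line on whitespace cannot break a title apart.
    """
    words = line.split()
    return words[0], words[1:]
```

For example, parse_links_line('Buenos_Aires Argentina La_Plata') yields the title 'Buenos_Aires' and its two outgoing links.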
The processed file has the same structure as the old all_pages file:

'\01\n'
'%s\n' % title
'%d\n' % len(page)
'\02\n'
'%s\n' % page
'\03\n'

Like all_pages, processed contains only the real pages, not the redirects.
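As an illustration, a record in this format could be written like this (write_record is a hypothetical helper, not the parser's actual code):

```python
import io

def write_record(out, title, page):
    """Append one page record in the all_pages/processed format:
    \x01, title, page length, \x02, page text, \x03,
    each field on its own line."""
    out.write('\x01\n')
    out.write('%s\n' % title)
    out.write('%d\n' % len(page))
    out.write('\x02\n')
    out.write('%s\n' % page)
    out.write('\x03\n')

# Example: write a single record to an in-memory buffer.
buf = io.StringIO()
write_record(buf, 'Argentina', 'Argentina es un pais de America del Sur.')
```

The length field lets a reader allocate or skip the page body without scanning for the \x03 terminator.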
The redirects file has a list of all the redirects, in the same format as the old all_redirects file:

<nowiki>[[title]]\t[[destination]]</nowiki>
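A sketch of how one such line could be parsed back into a pair (parse_redirect_line is a hypothetical helper for illustration):

```python
def parse_redirect_line(line):
    """Parse one line of the .redirects file,
    '[[title]]<TAB>[[destination]]', into (title, destination)."""
    title, destination = line.rstrip('\n').split('\t')
    # strip('[]') removes the surrounding [[ ]] wiki-link brackets.
    return title.strip('[]'), destination.strip('[]')
```

For example, a line mapping an abbreviation to its article comes back as a plain (title, destination) tuple.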
Right now, the Spanish Wikipedia uses two different tags to identify a redirect, '#REDIRECT' and '#REDIRECCIÓN', and 1214139 pages were identified as redirects. The tags are configured in pages_parser.py
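A minimal sketch of the redirect check, assuming a case-insensitive match on the configured tags (the actual tags live in pages_parser.py):

```python
# The redirect tags used by the Spanish Wikipedia; configured
# in pages_parser.py in the real parser.
REDIRECT_TAGS = ('#REDIRECT', '#REDIRECCIÓN')

def is_redirect(text):
    """True when the page source starts with a redirect tag.

    MediaWiki accepts the tag in any case, so the text is
    uppercased before comparing against the tag list.
    """
    return text.lstrip().upper().startswith(REDIRECT_TAGS)
```

Both '#REDIRECT [[Foo]]' and a lowercase '#redirección [[Foo]]' are detected.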
The templates file has the same format as the processed file.
The titles file has the same format as the old all_pages file: one line per page, containing the page title. Excluding the blacklisted pages, the templates and the redirects, this file references 1060777 pages.
Selecting a piece
When the first version of wikiserver was prepared, the Spanish version had between 200.000 and 300.000 articles. [1] Selecting 40.000 articles meant between 20% and 13%. The blog post from Chris Ball [2] suggests this percentage was near 20% too.
At the moment, the Spanish Wikipedia is near 1.000.000 articles, so selecting 40.000 articles is only 4%. Making a good selection is difficult, and will become more and more difficult over time.
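The percentages above follow from straightforward arithmetic:

```python
def selection_percentage(selected, total):
    """Share of the encyclopedia covered by a selection, in percent."""
    return 100.0 * selected / total

# The figures quoted above:
# 40.000 of 200.000 articles -> 20%
# 40.000 of 300.000 articles -> ~13%
# 40.000 of 1.000.000 articles -> 4%
```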
-rw-rw-r--. 1 gonzalo gonzalo 79295775 sep 19 01:19 eswiki-20110810-pages-articles.xml.links_counted
-rw-rw-r--. 1 gonzalo gonzalo 1128569 sep 23 00:45 eswiki-20110810-pages-articles.xml.pages_selected-level-1
-rw-rw-r--. 1 gonzalo gonzalo 250428 sep 21 02:58 eswiki-20110810-pages-articles.xml.pages_selected-level-1.favorites
-rw-rw-r--. 1 gonzalo gonzalo 3869636 sep 21 06:05 eswiki-20110810-pages-articles.xml.pages_selected-level-2
[1] http://es.wikipedia.org/wiki/Wikipedia_en_espa%C3%B1ol
[2] http://blog.printf.net/articles/2008/06/02/wikipedia-on-xo
Original process
cat eswiki-20110810-pages-articles.xml | ../wikiserver/tools/GetPages.pl all_pages all_redirects
[gonzalo@aronax NEW_DATA]$ wc -l all_pages
2328436 all_pages
[gonzalo@aronax NEW_DATA]$ wc -l all_redirects
531754 all_redirects
NOTE: I don't get the same counts, probably because it is excluding categories.
cat eswiki-20110810-pages-articles.xml | ../wikiserver/tools/PageLinks.pl all_pages all_redirects > all_links
cat eswiki-20110810-pages-articles.xml | ../wikiserver/tools/CountLinks.pl all_pages all_redirects all_links > count_links
sort -nr count_links > count_links_sorted
Links
http://wiki.laptop.org/go/WikiBrowse
http://meta.wikimedia.org/wiki/Data_dumps
http://en.wikipedia.org/wiki/Wikipedia:Database_download
http://users.softlab.ntua.gr/~ttsiod/buildWikipediaOffline.html