User:Godiard/WkipediaDataRebuild


Updating data in Wikipedia activities

This page is deprecated; the updated information is at http://wiki.sugarlabs.org/go/Activities/Wikipedia/HowTo

Processing new Wikipedia data dumps

Wikipedia provides periodic XML dumps for every language. My tests were done with the Spanish dump. The file used was eswiki-20110810-pages-articles.xml.bz2 from http://dumps.wikimedia.org/eswiki/20110810/

After decompressing the file, I processed it with pages_parser.py
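
For reference, a minimal sketch of the decompression step in Python (running bunzip2 on the command line does the same job); the decompressed XML is then the input for pages_parser.py:

import bz2
import shutil

DUMP = 'eswiki-20110810-pages-articles.xml'

# Stream the ~5.7 GB XML out of the .bz2 archive without holding it in memory.
with bz2.BZ2File(DUMP + '.bz2', 'rb') as compressed, open(DUMP, 'wb') as plain:
    shutil.copyfileobj(compressed, plain, 1024 * 1024)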

This parser generates the following files:

5759736764 sep  2 18:02 eswiki-20110810-pages-articles.xml
 264264855 sep  9 19:30 eswiki-20110810-pages-articles.xml.blacklisted
 609941327 sep 19 01:13 eswiki-20110810-pages-articles.xml.links
4328870460 sep  9 19:30 eswiki-20110810-pages-articles.xml.processed
  66735340 sep 18 20:50 eswiki-20110810-pages-articles.xml.redirects
  45633058 sep  9 19:30 eswiki-20110810-pages-articles.xml.templates
  24481677 sep  9 19:30 eswiki-20110810-pages-articles.xml.titles


The blacklisted file contains the content of all the pages in blacklisted namespaces ('Wikipedia:', 'MediaWiki:'). The list of blacklisted namespaces is configured in pages_parser.py
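
A minimal sketch of that kind of namespace check; the constant mirrors the namespaces listed above, and the function name is only illustrative, not the actual code in pages_parser.py:

# Namespaces whose pages end up in the .blacklisted file.
BLACKLISTED_NAMESPACES = ('Wikipedia:', 'MediaWiki:')

def is_blacklisted(title):
    # A page is blacklisted when its title starts with one of the namespaces.
    return title.startswith(BLACKLISTED_NAMESPACES)

print(is_blacklisted('Wikipedia:Portada'))   # True
print(is_blacklisted('Argentina'))           # False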

The links file has the same structure as the all_links file generated by PageLinks.pl (see Original process). Every page is stored on one line: the first word is the page title, followed, space separated, by all the links in the page. Spaces in the page name or in the links are replaced by "_".
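
This is roughly how such a line can be read back; the function name is mine, not part of the tools:

def iter_links(path):
    # One page per line: title first, then the outgoing links, all space
    # separated; "_" stands in for spaces inside names.
    with open(path, encoding='utf-8') as links_file:
        for line in links_file:
            words = line.split()
            if not words:
                continue
            title = words[0].replace('_', ' ')
            links = [link.replace('_', ' ') for link in words[1:]]
            yield title, links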

The processed file has the same structure as the old all_pages file:

'\01\n'
'%s\n' % title
'%d\n' % len(page)
'\02\n'
'%s\n' % page
'\03\n'

processed, like all_pages, contains only the real pages, not the redirects.
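
Assuming the length line stores the byte length of the page text (as the Python 2 tools of the time would write), the records can be walked as in this sketch:

def iter_processed(path):
    # Records: \01, title, length, \02, page text, \03 -- each on its own
    # line, except the page text, which may itself contain newlines.
    with open(path, 'rb') as data:
        while True:
            marker = data.readline()
            if not marker:
                break                                # end of file
            assert marker == b'\x01\n'               # record start
            title = data.readline().rstrip(b'\n')
            length = int(data.readline())            # assumed to be a byte count
            assert data.readline() == b'\x02\n'      # body start
            page = data.read(length)
            assert data.readline() == b'\n'          # newline written after the body
            assert data.readline() == b'\x03\n'      # record end
            yield title.decode('utf-8'), page.decode('utf-8')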

The redirects file is a list of all the redirects, in the same format as the old all_redirects file:

[[title]]\t[[destination]]

(the title and the destination are each wrapped in double square brackets, with no spaces in between, separated by a tab character)

Right now, the Spanish Wikipedia uses two different tags to identify a redirect, '#REDIRECT' and '#REDIRECCIÓN', and 1214139 pages were identified as redirects. The tags are configured in pages_parser.py
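
A sketch of the redirect tag check and of reading the redirects list back; the helper names are illustrative, not the actual API of the tools:

# The two redirect tags currently used in the Spanish Wikipedia.
REDIRECT_TAGS = ('#REDIRECT', '#REDIRECCIÓN')

def is_redirect(page_text):
    # True when the page text starts with one of the redirect tags.
    return page_text.lstrip().startswith(REDIRECT_TAGS)

def iter_redirects(path):
    # Each line: [[title]]<TAB>[[destination]]
    with open(path, encoding='utf-8') as redirects_file:
        for line in redirects_file:
            title, destination = line.rstrip('\n').split('\t')
            yield title.strip('[]'), destination.strip('[]')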

The templates file has the same format as the processed file.

The titles file has the same format as the old all_pages file: one line per page, containing the page title. Excluding the blacklisted pages, the templates, and the redirects, this file references 1060777 pages.

Selecting a piece

When the first version of wikiserver was prepared, the Spanish Wikipedia had between 200,000 and 300,000 articles [1]. Selecting 40,000 articles meant between 13% and 20% of the total. The blog post from Chris Ball [2] suggests this percentage was near 20% too.

In my experiments I will keep using this 'magic number' of 40,000 articles, so that the results (size, speed, etc.) can be compared in some way.

At the moment, the Spanish Wikipedia has nearly 1,000,000 articles, so selecting 40,000 articles is only about 4%. The English Wikipedia has 3,700,000 articles, so our selection would be approximately 1%. Making a good selection is difficult, and it will become more and more difficult over time.

I have tried different criteria for making the selection:

  • Do a simple ranking by the number of links pointing to each page: the script make_ranking.py creates a file eswiki-20110810-pages-articles.xml.links_counted with every page and how many links point to it, sorted. The result is not very promising.

  • Select all the pages linked from the current home page in our Wikipedia activity (130 pages), plus the pages linked from those pages (a sketch of this expansion appears after this list). It makes sense to ensure that the pages linked from our home page are included [3], and it enables discovery through navigation. We have a very good selection of pages on our home page; I used this selection, added 'Wikipedia:Artículos_destacados', and with this method and this list of pages make_selection.py selected 15788 pages (file eswiki-20110810-pages-articles.xml.pages_selected-level-1.favorites).

  • Tried selecting all the previous pages, plus the pages linked from them; this selected 219782 pages (file eswiki-20110810-pages-articles.xml.pages_selected-level-2).
  • Tried the same method (select the pages in a list, plus the pages linked from those pages) but with a bigger list. To prepare this list, I used the page view statistics from [4]: I downloaded the Spanish statistics page, processed it with stats_parser.py to get a list of pages in text format, and cleaned it manually, removing singers, actors, football players, etc., ending with a list (added to our initial list) of 544 pages. Processing it with make_selection.py selected 63347 pages.

  • Cleaned the initial list again, down to 431 pages. Processing it with make_selection.py selected 49852 pages.
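
A sketch of the "pages in a list, plus the pages linked from them" expansion, built on the .links file described above; make_selection.py may do more (normalization, redirect handling), this only shows the idea. The seed titles must use the same "_" convention as the .links file:

def expand_selection(links_path, seed_titles):
    # Level-1 selection: the seed pages themselves plus every page they link to.
    seeds = set(seed_titles)
    selected = set(seeds)
    with open(links_path, encoding='utf-8') as links_file:
        for line in links_file:
            words = line.split()
            if words and words[0] in seeds:
                selected.update(words[1:])   # pages linked from a seed page
    return selected

# Example (hypothetical seed list; titles use "_" instead of spaces):
# selected = expand_selection('eswiki-20110810-pages-articles.xml.links',
#                             ['Argentina', 'Wikipedia:Artículos_destacados'])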


[1] http://es.wikipedia.org/wiki/Wikipedia_en_espa%C3%B1ol

[2] http://blog.printf.net/articles/2008/06/02/wikipedia-on-xo

[3] http://bugs.sugarlabs.org/ticket/2905

[4] http://stats.grok.se

Original process

cat eswiki-20110810-pages-articles.xml | ../wikiserver/tools/GetPages.pl all_pages all_redirects
[gonzalo@aronax NEW_DATA]$ wc -l all_pages 
2328436 all_pages
[gonzalo@aronax NEW_DATA]$ wc -l all_redirects 
531754 all_redirects
NOTE: I do not get the same count, probably because it is excluding categories.
cat eswiki-20110810-pages-articles.xml | ../wikiserver/tools/PageLinks.pl all_pages all_redirects > all_links
cat eswiki-20110810-pages-articles.xml | ../wikiserver/tools/CountLinks.pl all_pages all_redirects  all_links > count_links
sort -nr count_links > count_links_sorted

Links

http://wiki.laptop.org/go/WikiBrowse

http://meta.wikimedia.org/wiki/Data_dumps

http://en.wikipedia.org/wiki/Wikipedia:Database_download

http://users.softlab.ntua.gr/~ttsiod/buildWikipediaOffline.html