== Updating data in Wikipedia activities ==

'''This page is deprecated; the updated information is at http://wiki.sugarlabs.org/go/Activities/Wikipedia/HowTo'''

=== Processing new Wikipedia data dumps ===
Wikipedia provides periodic XML dump files for every language. My tests were done with the Spanish dump. The file used was eswiki-20110810-pages-articles.xml.bz2, from http://dumps.wikimedia.org/eswiki/20110810/

After decompressing the file, I processed it with '''pages_parser.py'''.

This parser generates the following files:
 5759736764 sep  2 18:02 eswiki-20110810-pages-articles.xml
  264264855 sep  9 19:30 eswiki-20110810-pages-articles.xml.blacklisted
  609941327 sep 19 01:13 eswiki-20110810-pages-articles.xml.links
 4328870460 sep  9 19:30 eswiki-20110810-pages-articles.xml.processed
   66735340 sep 18 20:50 eswiki-20110810-pages-articles.xml.redirects
   45633058 sep  9 19:30 eswiki-20110810-pages-articles.xml.templates
   24481677 sep  9 19:30 eswiki-20110810-pages-articles.xml.titles
The file '''blacklisted''' contains the content of all the pages in blacklisted namespaces ('Wikipedia:', 'MediaWiki:'). The list of blacklisted namespaces is configured in '''pages_parser.py'''.
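As a rough illustration of that check, here is a minimal sketch assuming the namespace test is a simple title-prefix comparison; the names below are illustrative, not taken from pages_parser.py:

 # Minimal sketch of a namespace blacklist check; illustrative names only,
 # not the actual code in pages_parser.py.
 BLACKLISTED_NAMESPACES = ('Wikipedia:', 'MediaWiki:')
 def is_blacklisted(title):
     # A page belongs to a blacklisted namespace when its title starts
     # with one of the configured prefixes.
     return title.startswith(BLACKLISTED_NAMESPACES)
 print(is_blacklisted('Wikipedia:Artículos_destacados'))  # True
 print(is_blacklisted('Argentina'))                       # False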
The file '''links''' has the same structure as the file all_links generated by PageLinks.pl (see [[#Original process|Original process]]). Every page is stored on one line: the first word is the page title, and the rest, space separated, are all the links in the page. Spaces in the page names and in the links are replaced by "_".
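Given that format, a reader for the links file could look roughly like this (a minimal sketch under the format described above; the function name and the choice of a dictionary are my own):

 # Minimal sketch: parse the links file into {title: [link, link, ...]},
 # assuming one page per line, first token = title, rest = link targets,
 # with spaces already replaced by "_".
 def read_links(path):
     links = {}
     with open(path, encoding='utf-8') as f:
         for line in f:
             parts = line.split()
             if not parts:
                 continue
             links[parts[0]] = parts[1:]
     return links
 links = read_links('eswiki-20110810-pages-articles.xml.links')
 print(len(links), 'pages with outgoing links')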
The file '''processed''' has the same structure as the old all_pages file:

 '\01\n'
 '%s\n' % title
 '%d\n' % len(page)
 '\02\n'
 '%s\n' % page
 '\03\n'

'''processed''', like all_pages, contains only the real pages, not the redirects.
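A minimal reader sketch for that record layout could look like the following; it assumes the length field counts the same units that a text-mode read() consumes (the real files may use byte counts, which would need binary mode), and it is not code taken from the wikiserver tools:

 # Illustrative reader for the 'processed' file, assuming each record is:
 #   \x01\n  title\n  length\n  \x02\n  page\n  \x03\n
 def iter_processed(path):
     with open(path, encoding='utf-8') as f:
         while True:
             marker = f.readline()
             if not marker:
                 break                      # end of file
             assert marker == '\x01\n'
             title = f.readline().rstrip('\n')
             length = int(f.readline())     # declared length of the page body
             assert f.readline() == '\x02\n'
             page = f.read(length)          # page body, may span several lines
             f.readline()                   # newline written after the body
             assert f.readline() == '\x03\n'
             yield title, page
 for title, page in iter_processed('eswiki-20110810-pages-articles.xml.processed'):
     print(title, len(page))
     break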
The file '''redirects''' has a list of all the redirects, in the same format as the old all_redirects file:

 <nowiki>[[title]]\t[[destination]]</nowiki>

Right now the Spanish Wikipedia has two different tags to identify a redirect, '#REDIRECT' and '#REDIRECCIÓN', and 1214139 pages were identified as redirects. The tags are configured in '''pages_parser.py'''.
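A small sketch of how those two pieces (the redirect tags and the redirects file format) might be handled; the function names and the example titles are hypothetical, not taken from pages_parser.py:

 # Illustrative sketch; not the actual pages_parser.py code.
 REDIRECT_TAGS = ('#REDIRECT', '#REDIRECCIÓN')
 def is_redirect(page_text):
     # A page is treated as a redirect when its text starts with one of the
     # configured tags (whether the real parser is case-sensitive is unknown).
     return page_text.lstrip().startswith(REDIRECT_TAGS)
 def parse_redirect_line(line):
     # Lines in the redirects file look like: [[title]]\t[[destination]]
     title, destination = line.rstrip('\n').split('\t')
     return title.strip('[]'), destination.strip('[]')
 print(parse_redirect_line('[[Buenos_Ayres]]\t[[Buenos_Aires]]'))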
The file '''templates''' has the same format as the processed file.

The file '''titles''' has the same format as the old all_pages file: one line per page with the page title. Excluding the blacklisted pages, the templates and the redirects, this file references 1060777 pages.
=== Selecting a piece ===
When the first version of wikiserver was prepared, the Spanish Wikipedia had between 200,000 and 300,000 articles. [1] Selecting 40,000 articles means between 20% and 13% of the total. The blog post from Chris Ball [2] suggests this percentage was near 20% too.

In my experiments I will continue using this 'magic number' of 40,000 articles, to be able to compare size, speed, etc. with the earlier versions.

At the moment the Spanish Wikipedia has nearly 1,000,000 articles, so selecting 40,000 articles is only 4%. The English Wikipedia has 3,700,000 articles, so our selection would be approximately 1%. Doing a good selection is difficult, and it will become more and more difficult with time.
I have tried different criteria for doing the selection:
* Do a simple ranking by the number of links pointing to each page: the script '''make_ranking.py''' creates a file '''eswiki-20110810-pages-articles.xml.links_counted''' with every page and how many links point to it, sorted. The result is not very promising.
* Select all the pages linked from the current home page of our Wikipedia activity (130 pages), and the pages linked from those pages. It is logical to ensure that the pages linked from our home page are included [3], and this enables discovery through navigation. We have a very good selection of pages in our home page, so I used this selection and added 'Wikipedia:Artículos_destacados'. Using this method and this list of pages, '''make_selection.py''' selected 15788 pages (file eswiki-20110810-pages-articles.xml.pages_selected-level-1.favorites). A sketch of this one-level expansion is shown after this list.
* Tried selecting all the previous pages plus the pages linked from them; this selected 219782 pages (file eswiki-20110810-pages-articles.xml.pages_selected-level-2).
* Tried the same method (select the pages in a list, and the pages linked from those pages) but with a bigger list. To prepare this list I used the page-view statistics from [4]: I downloaded the Spanish statistics page, processed it with '''stats_parser.py''' to get a list of pages in txt format, and cleaned it manually (removing singers, actors, football players, etc.) to get a list, added to our initial list, of 544 pages. Processing with '''make_selection.py''' selected 63347 pages.
* Cleaned the list of initial pages again, down to 431 pages. Processing with '''make_selection.py''' selected 49852 pages.
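A minimal sketch of the one-level expansion used above, based on the links file format described earlier (this is not the actual make_selection.py; the function name and the seed file 'home_pages.txt' are hypothetical):

 # Illustrative sketch: select the seed pages plus every page they link to,
 # reading the links file described in the previous section.
 # Titles in the seed list must use "_" instead of spaces, as in the links file.
 def select_level_1(links_path, seeds):
     selected = set(seeds)
     with open(links_path, encoding='utf-8') as f:
         for line in f:
             parts = line.split()
             if not parts:
                 continue
             title, targets = parts[0], parts[1:]
             if title in seeds:
                 selected.update(targets)   # keep everything the seed links to
     return selected
 with open('home_pages.txt', encoding='utf-8') as f:   # hypothetical seed list
     seeds = {line.strip() for line in f if line.strip()}
 selected = select_level_1('eswiki-20110810-pages-articles.xml.links', seeds)
 print(len(selected), 'pages selected')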
[1] http://es.wikipedia.org/wiki/Wikipedia_en_espa%C3%B1ol

[2] http://blog.printf.net/articles/2008/06/02/wikipedia-on-xo

[3] http://bugs.sugarlabs.org/ticket/2905

[4] http://stats.grok.se
=== Original process ===
 cat eswiki-20110810-pages-articles.xml | ../wikiserver/tools/GetPages.pl all_pages all_redirects

 [gonzalo@aronax NEW_DATA]$ wc -l all_pages
 2328436 all_pages
 [gonzalo@aronax NEW_DATA]$ wc -l all_redirects
 531754 all_redirects

NOTE: I do not get the same count, probably because it is excluding categories.

 cat eswiki-20110810-pages-articles.xml | ../wikiserver/tools/PageLinks.pl all_pages all_redirects > all_links

 cat eswiki-20110810-pages-articles.xml | ../wikiserver/tools/CountLinks.pl all_pages all_redirects all_links > count_links

 sort -nr count_links > count_links_sorted
=== Links ===
* http://wiki.laptop.org/go/WikiBrowse
* http://meta.wikimedia.org/wiki/Data_dumps
* http://en.wikipedia.org/wiki/Wikipedia:Database_download
* http://users.softlab.ntua.gr/~ttsiod/buildWikipediaOffline.html