Talk:WikiBrowse

== Compression ==

There are better choices than bzip2. The obvious one is 7zip, which is excellent for both decompression speed and compression ability, according to this comparison chart. Note that higher compression levels mean faster decompression. AlbertCahalan 00:07, 26 May 2008 (EDT)

: We tried 7zip. It didn't do any better than bzip2 for our archive, which holds the latest revision of every article, but I was shocked to see that it produces archives 20x smaller for the archive that holds every revision of every article.

: We aren't able to interchange compression formats easily, since the code depends on an indexer that can map articles into compression blocks and then decode an individual compression block quickly. This would need to be ported per format. Cjb 01:59, 26 May 2008 (EDT)

: I took the .xo file, grabbed the compressed file out of it, and ran it through lzma -9. The .bz2 file was 83,749k and the .lzma was 70,426k. lzma -9 may be a bit hefty for the XO, as it takes a fair amount of memory on decompression; -7 would still be a gain. 121.72.128.89 03:41, 26 May 2008 (EDT) (which is lerc, who really should make an account)

: Thanks for the hard numbers - it would be great to be able to include more articles, but porting the block decompression code is a large undertaking. Basically we need the ability to seek to anywhere in an archive and decompress a few bytes. Maybe this is something LZMA already has, or maybe we can talk to the Wikipedia on iPhone guy about attempting it? [[User:Wade|wade]] 04:23, 26 May 2008 (EDT)
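: A minimal sketch of the block-indexed scheme being discussed, for anyone following along. This is not the activity's actual code (the names and on-disk layout here are invented for illustration); it just shows the essential trick: compress fixed-size blocks independently and keep an index from article to block, so a read only ever decompresses one block.

<pre>
# Sketch of block-indexed compression. Articles are packed into fixed-size
# blocks, each block is bzip2-compressed independently, and an index records
# which block an article lives in, so a reader can seek to one block and
# decompress only it. Names and layout are invented, not WikiBrowse's code.
import bz2

BLOCK_SIZE = 256 * 1024  # uncompressed bytes per block (arbitrary choice)

def build_archive(articles):
    """articles: dict of title -> text.
    Returns (blocks, index) where blocks is a list of independently
    compressed chunks and index maps title -> (block_no, offset, length)."""
    blocks, index = [], {}
    current = b""
    for title, text in articles.items():
        data = text.encode("utf-8")
        # Flush the current block if this article won't fit in it.
        if current and len(current) + len(data) > BLOCK_SIZE:
            blocks.append(bz2.compress(current))
            current = b""
        index[title] = (len(blocks), len(current), len(data))
        current += data
    if current:
        blocks.append(bz2.compress(current))
    return blocks, index

def read_article(blocks, index, title):
    """Decompress only the one block that holds the article."""
    block_no, offset, length = index[title]
    data = bz2.decompress(blocks[block_no])
    return data[offset:offset + length].decode("utf-8")

if __name__ == "__main__":
    blocks, index = build_archive({"Foo": "foo text", "Bar": "bar text"})
    print(read_article(blocks, index, "Bar"))  # -> "bar text"
</pre>

: Porting to another codec such as LZMA would mean swapping the two compress/decompress calls and regenerating the index; the seek-and-decompress structure itself stays the same, which is the per-format work described above.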
=== cjb ===
''on the Wikipedia activity'': I think our solution has promise: we have mwlib (which has become extremely featureful) hooked up to a local Python web server, serving out of a compressed archive, with link detection that colors links based on whether they're in the local snapshot, and support for a local image set. It would be great to find out if others are interested in helping to turn this from an XO-specific tool into a standard cross-platform wikislice web server.
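
As a rough illustration of the link-coloring step, here is a sketch under the assumption that pages are post-processed as HTML strings after mwlib renders them; the function and CSS class names are invented, not the activity's actual implementation.

<pre>
# Rewrite each wiki link with a CSS class saying whether the target article
# is present in the local snapshot, so a stylesheet can color it accordingly.
import re

LINK_RE = re.compile(r'<a href="/wiki/([^"#]+)')

def colorize_links(html, local_titles):
    """local_titles: set of article titles available offline."""
    def tag(match):
        title = match.group(1).replace("_", " ")
        css = "local" if title in local_titles else "missing"
        return '<a class="%s" href="/wiki/%s' % (css, match.group(1))
    return LINK_RE.sub(tag, html)

# colorize_links('<a href="/wiki/Ghana">Ghana</a>', {"Ghana"})
# -> '<a class="local" href="/wiki/Ghana">Ghana</a>'
</pre>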

=== Martin Walker ===
<pre>
Thanks for the WikiBrowse link - I recall that I looked at this a while ago, but I think it was quite basic at that time. I'm going to start watching that page a lot more closely!

Your project sounds very valuable - please let us know when you have the code ready for use! I was very interested to hear about your goal of a "dynamic" system that can get updated - that is a longer-term goal for our releases, but we haven't done anything in that area as yet.

Here are some useful URLs that relate to our work:
This one lists the WikiProjects in our assessment scheme; each project has a table listing the articles in quality order, a log tracking changes in assessments, and an overall stats table.
http://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team/Index

The Selection Bot is being actively worked on for the next month or so, as we prepare to make our final selection of around 20,000-30,000 articles for release this fall:
http://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team/SelectionBot

We did a test run of this Selection Bot back in March, and our initial results are here. We should have better results in a fortnight or so, with what we hope will be the final version of the bot code:
http://toolserver.org/~cbm/release-data/2008-3-7/HTML/
You can pull your selections from those lists if you wish, or (better) we may be able to arrange for you to access the raw data (it is on the main Toolserver); if you wish, I can get hold of the bot code. The bot could easily be adapted to select with a different weighting of quality and importance, though it needs to be run from "inside" Wikipedia.

The schools project on Wikipedia has only just started:
http://en.wikipedia.org/wiki/Wikipedia:Wikimedia_School_Team

Andrew Cates' releases have been aimed at kids, and are tailored to the UK curriculum. He's currently putting together a DVD release of around 6000 articles, using some of our selection tools but with volunteers to check the content. I mentioned his past releases on the phone; these have proven very popular:
http://en.wikipedia.org/wiki/Wikipedia:Wikipedia_CD_Selection

Linterweb has been working on offline software for Wikipedia for almost two years, and they have developed a lot of expertise, particularly in offline search engines. All their software is open source/GFDL, so you could incorporate it into your software. They published the Version 0.5 CD release, and they will be publishing our DVD release. They don't put a lot on the web, but the Kiwix site has some information (some in French):
http://www.kiwix.org/index.php/Main_Page

Please stay in touch. I wish you well with your excellent project!
Martin
</pre>
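
To make the "different weighting of quality and importance" Martin mentions concrete, here is a hypothetical sketch. The real SelectionBot's formula and data format are not reproduced here, and the numeric scales and weights are invented for illustration, though the assessment classes are the standard Version 1.0 ones.

<pre>
# Invented scales for the standard Version 1.0 assessment classes.
QUALITY = {"FA": 6, "A": 5, "GA": 4, "B": 3, "Start": 2, "Stub": 1}
IMPORTANCE = {"Top": 4, "High": 3, "Mid": 2, "Low": 1}

def score(article, w_quality=1.0, w_importance=1.0):
    """article: (title, quality_class, importance_class)."""
    _, q, i = article
    return w_quality * QUALITY.get(q, 0) + w_importance * IMPORTANCE.get(i, 0)

def select(articles, n=20000, w_quality=1.0, w_importance=1.0):
    """Pick the n highest-scoring articles. Changing the weights shifts the
    slice toward polished articles or toward important topics."""
    ranked = sorted(articles, key=lambda a: score(a, w_quality, w_importance),
                    reverse=True)
    return ranked[:n]
</pre>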

