Talk:Wikislices

From OLPC
Revision as of 17:41, 26 February 2008 by Sj (talk | contribs) (section)

Meeting notes 2/21/08

Overall goal of meeting: wiki-hacking session to improve on the tools that Zdenek and others are currently using to make & refine wikislices. Held in #olpc-content on freenode.

Wikipedia snapshots

Developing snapshots of Wikipedia at every order of magnitude from 10MB to 100GB.

snapshot tools

We need...

  • libraries for producing different styles of wikipedia snapshots (wikitext, html, txt, pdf) from categories (special:export), topics/pages (wikiosity), and index pages (wikislices)
  • libraries that can do intelligent things with metadata from history and wikimedia-commons pages
  • libraries that support no-image/thumbnail/mid-res image selection
  • libraries that recalculate blue v. red links given a wikislice
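As an illustration of the last item, here is a minimal sketch of blue v. red link recalculation. It assumes (hypothetically) that internal links in the exported html look like `<a href="/wiki/Title">`, that each article in the slice is saved as `Title.html`, and that red links are rendered as a `span` with a made-up `redlink` class. None of the tools named in these notes is known to work this way.

```python
import re

# Assumed link shape: <a href="/wiki/Title#anchor">text</a> (anchor optional).
LINK_RE = re.compile(r'<a href="/wiki/([^"#]+)(#[^"]*)?">(.*?)</a>', re.S)

def recalc_links(html, slice_titles):
    """Keep links whose target is in the slice; turn all others into red links."""
    def fix(match):
        title = match.group(1)
        anchor = match.group(2) or ""
        text = match.group(3)
        if title in slice_titles:
            # Blue link: point at the local file for that article.
            return f'<a href="{title}.html{anchor}">{text}</a>'
        # Red link: target is not part of this wikislice.
        return f'<span class="redlink">{text}</span>'
    return LINK_RE.sub(fix, html)
```

A real library would probably use an html parser rather than a regular expression; the regex only keeps the sketch short.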

Wiki format glue

We need glue code/scripts to interface between similar projects : WP WikiReaders, Wikibooks, wikipedia wikislice projects, webaroo wikislices, kiwix snapshots, schools-wikipedia snapshots, ksana snapshots, WP 1.0 revision-vetting --- at least at the level of sharing index selections and a list of "good revisions" for included articles.

Offline readers

As a recent focal point, Zvi Boshernitzan and Ben Lisbakken have both made offline wikipedia-readers using Google Gears that are pretty fast and offer some nice features in terms of letting you select a set of articles, cache them locally, and browse an index. We talked last week about how to integrate Gears more tightly into a browsing experience, with hopes of pursuing a prototype within a couple of weeks. It would be helpful to inform such a client with lists of good revisions of articles, such as those Martin Walker and Andrew Cates have developed for their own projects... and to plan for it to support offline editing as well as reading, using synchronization tools such as Mako's distributed wiki client.

What can people do?

  • wikipediaondvd - Pascal and Guillaume are wondering how they can help


Meeting minutes 2008-02-2?

care of mel

(Introductions all around, discussion of whether there is actually a meeting today)

Pascal: i am Pascal Martin from linterweb (http://wikipediaondvd.com). Guillaume is working for linterweb, and we really really need information about what you need

Mel: Well, we could certainly talk in the absence of a formal meeting, still. :) I'm curious to hear what you've been doing with wikipediaondvd. I'm reading your website and your wikipedia pages right now. I'm most curious about the processes (both community and technological) you've used to select, curate, and package your work, and whether they're easily transferable to do curations of other wikis.

Pascal: wikipediaondvd is us. not automatic selection, manual selection for the en community

Mel: Right - wikipediaondvd and the http://en.wikipedia.org/Wikipedia:Version_1.0_Editorial_Team wp page are what I'm reading. It looks like you have some very nice community processes in place for making the content selection work. Since other wiki information repositories (like UNICEF, Appropedia, and so forth) have expressed at least tentative interest in making similar wikislices as content bundles for the XO, I'm reading about what you've done for wikipediaondvd in the hopes that the same process can be used to select content from the other wikis.

Pascal: we make also wikiwix.com, a meta search engine for the wikimedia project.

Guillaume: I think the community selection process is taking place on all wikipedia. at least, there's one on fr:

Pascal: is integrated in http://es.wikipedia.org/wiki/Especial:Search?search=&fulltext=Buscar, so wikiwix is good for 12 languages, so we could do a search engine, a desktop search for olpc

Mel: (to Guillaume) That's what I'm guessing - I am looking through the material on http://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team

Guillaume: now, the problem with such a selection is that quality will greatly differ between languages. for instance, fr: is far from being as complete as en:

btw, do you have an idea of the languages you want to include in OLPC? or how much disk space will be given to wikipedia?

SJ: howdy all. cjb is going to join soon. (To Pascal:) a couple of things : I wanted to work through the new mailing list setup and getting a list of the scripts / libraries people are using, especially redlink/link-recalculation scripts and filename-conversion scripts. And rtl / ltr, all of which seem to crop up as bugs / incomplete features every time a similar library or two come[s] out

Pascal: (to SJ) we have got all on our server

SJ: (To Pascal) the three sets of scripts listed at kiwix.com?

Pascal: (To SJ) Guillaume is my new developer, to work on the olpc. I hope that this meeting could help us to start work.

SJ: welcome Guillaume !

Guillaume: thanks :)

Pascal: so he could work on desktop search and automatic selection

SJ: (To Pascal) perfect.

Pascal: that's all that we could do, we do our best :)

SJ: I also hope to have "annotate" as a system-wide service, providing a default name for the discussion page related to an item found in a search

Pascal: not sure I'm getting what you say

SJ: we are implementing versioning in our datastore

Pascal: we need to know how many times you need a selection of articles, and in which language, because in french there are only 1000 articles which are very good. so if we need help from the wp community to improve the quality of the articles, it will take a lot of time

SJ: (to Pascal) indeed. To start, I'd like to describe script-libraries at three levels.

(1) naming a selection - for instance, creating a wikipage that lists a set of pages (optional: revisions) to make into a bundle. as a poor example, see http://en.wikipedia.org/wiki/User:Sj/wp-api

(2) finding/listing selections, something like an apt-get service for collections that have been named [and made] (optional: bundle-on-demand, if the bundling takes time; or if a named bundle invokes "get latest version" rather than "get specific revision")

(3) creating collections : turning a named selection into a {wikitext|mw-xml|html|pdf|docbook|txt} collection or further an rpm/xol/ other bundle of one of the above
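To make level (1) concrete, here is a minimal sketch of reading a named selection. The text format is an assumption made for illustration only (one entry per line, `Title|revision`, with the revision optional); it is not the format used on the wp-api page linked above.

```python
from typing import NamedTuple, Optional

class Entry(NamedTuple):
    title: str
    revision: Optional[int]  # None stands for "get latest version"

def parse_selection(text):
    """Parse a hypothetical 'Title|revision' listing into Entry records."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        title, _, rev = line.partition("|")
        rev = rev.strip()
        entries.append(Entry(title.strip(), int(rev) if rev else None))
    return entries
```

Keeping the revision optional matches the distinction drawn in (2) between "get latest version" and "get specific revision".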

Pascal suggested above that making a selection is currently the 'hard' part. I am not quite in agreement /yet/ since we haven't made the obvious bundles that should already be available to all, recreated every month, &c.

for instance : http://en.wikipedia.org/wiki/Wikipedia:Good_articles has a /lot/ of articles. I should note that 3.1) above is "creating collections under constraints". For instance : under 10MB total, prioritize header text, then first image thumbnail, then first section text, then references, then second image thumbnail... so, some bundles to start with : header + first image from all articles in each section of "good articles" or "full Featured Article blurb" for each entry in http://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Articles_de_qualit%C3%A9

that's the fr:wp featured article set. (standard disclaimer: please forgive the english-centered nature of this forum, discussion, and initial set of examples. the most active users of these bundles will actually be native spanish speakers in the short term.)
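The "collections under constraints" idea described above (a size limit plus a priority order of article parts) could be sketched as a simple greedy pass. The part names and the input shape are hypothetical; only the priority order comes from the discussion.

```python
# Priority order from the discussion: header text, first thumbnail,
# first section text, references, second thumbnail. Names are made up.
PRIORITY = ["header", "thumb1", "section1", "references", "thumb2"]

def pack(articles, budget):
    """Greedily choose article parts until the byte budget is used up.

    articles: {title: {part_name: size_in_bytes}}
    Returns (list of (title, part) pairs, total bytes used).
    """
    chosen, used = [], 0
    for part in PRIORITY:  # finish one priority level before the next
        for title, parts in articles.items():
            size = parts.get(part)
            if size is not None and used + size <= budget:
                chosen.append((title, part))
                used += size
    return chosen, used
```

This takes every header before any thumbnail, so a 10MB bundle degrades gracefully: when space runs out, the lowest-priority parts are the ones left out.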

Mel: meaning that the notes from this discussion on how to make what wikislices should be translated into spanish fairly soon, once they're stable, I'm guessing?

SJ: absolutely. hmm, es:wp has lower featured article standards than fr, or better writing. 560 FAs. http://es.wikipedia.org/wiki/Wikipedia:Art%C3%ADculos_destacados. since the /average/ writing is not so good for es:wp articles, I'm guessing it is standards.

SJ: done for the moment. thoughts? actually, why don't we do a round of introductions for people here to talk about wikislicing and offline reading/editing perhaps even including bobMini and eben and sebastiansilva :) but certainly including coderanger, dirakx, ffm, guillaume, m_stone, mchua, Phil_Bordelon, pm27, and tomeu. also, would someone like to volunteer to take notes / clean up an irc transcript after?

Mel volunteers

SJ: ffm, perhaps you should start with intros, since you are one of the most active wikipedians (though there may be others hiding... the brother of an active editor dropped by on sunday and almost didn't mention it). dirakx, have you worked on es:wp much? it's actually improving pretty rapidly...

Dirakx: nope i haven't worked on that yet.

Countrymike: i'm mostly active on WikiEducator and i'm also a Custodian on Wikiversity... i've been playing with the pediapress export to pdf quite a bit on WikiEducator...

SJ: Countrymike, tell us more. should we install it? today seems to be an extension installation day.

SJ thinks the teamwiki is going to become the testbed

Sebastiansilva: extension installation day?

SJ: well, there's a backlog of extension requests (to cjb) now eben wants http://www.mediawiki.org/wiki/Extension:SpecialMultiUploadViaZip

ffm: ooh, and the reCAPTCHA, and the OGG player...

*cjb shakes fist*

cjb: (to SJ) it gives people an obvious way to take down our web server. do they have that already?  :)

FFM: (to cjb) we can make it an admin utility.

SJ: (to cjb) sort of.

FFM: eben can use the admin upload.

SJ: (to cjb) special pages can be restricted by user group

Countrymike: it seems to be working quite well lately, quite a few bugs involving templates etc have been worked out. And I've used it to create some effective handouts for face-to-face teaching... i actually think we may have met, SJ. In vancouver?

FFM: I was a highly active wikipedian, although I haven't been recently since I have been working on OLPC.

SJ: (to Countrymike) yes! I was trying to place you. are you going to the followup this year?

Countrymike: will try ... have been invited, so... will see. There has also been an interesting hack on the Pediapress collections done by a guy down here (NZ), Jim Tittsler ... he's created a way to take a "collection" and create an IMS-CP (content package) which can then be "played" back on Moodle, etc... anything that will import a Content Package. isn't there some work being done on Moodle/OLPC?

SJ: yes.

Countrymike: he's also managed to get eXe on the OLPC too ... ! but it runs pretty slow from what I understand.

SJ: yes, that's a problem. I'm excited to see that working smoothly, though. there's moodle discussion on our schoolserver mailing list, which is another option for eXe (as a service from the server) but this is a bit off-agenda :) let's follow up later

Countrymike: yep. ok.

SJ: Guillaume, pm27, still there?

Pascal: yes _sj_, it is 23 h 35 in france :(

SJ: I know :) thanks for staying up

Guillaume: so... pm27 and I are working at linterweb, we worked on kiwix and wikiwix, http://sourceforge.net/projects/kiwix and http://www.wikiwix.com

Pascal: http://wikipediaondvd.com the selection of Martin Walker

SJ: thanks. I got good response from BozMo, the author of the SOS Children's Wikipedia, who wants to share his scripts. he seems to have a good toolchain and article reviewers who are not working with martin yet

Pascal: also SJ we are working on the same platform with Moulin, SOSChild is 4000 articles. I know Bozmo

SJ: yes

FFM: has anyone tried to see how big it is w/ compression?

Pascal: ffm we are working on zero compress, only one file

SJ: to make the full package work better with XOs, we would need tools to take long articles with full images, and turn them into shorter articles with thumbnails and fewer images. it is not only a matter of compression, it is one of reasonable scripted subselection.

Pascal: but it's not really the text, it is the images which are really heavy

SJ: right. well, once you reduce the images to one thumbnail per article, the text becomes larger for the good long articles so it is both.

Pascal: yes of course

SJ: what I would like is a script that takes the SOS articles and pulls out just the header and the first image.

Guillaume: btw, do we have an idea on the good size for a package?

SJ: Guillaume, 10MB

Pascal: 500 000 articles only text on french WP = 1 CD

SJ: Guillaume, up to 100MB for a v. weighty collection. but that is awkward to manipulate, a real max.

Guillaume: ok, and I guess only one language is included in a package

SJ: not necessarily. a fine picture dictionary could be made with one sentence from each article and the first image. at any rate, I'd like to have a tool that lets me make subsets of the schools-wp as a testbed. are there currently any scripts from kiwix that could do this? do you select articles based on {title, revision}? where are those pairs stored? is there a way to update a kiwix selection by tweaking one revision of one article, and how does that work?

Pascal: http://www.wikipediaondvd.com/nav/art/a/7/d.html it's the listing of the authors of the 2000 articles. I think in 10 MB we have just enough room to respect the GNU license condition _sj_ :)

SJ: :P you could do one paragraph per article, and hotlinked images (with good alt text if not online)

Guillaume: and you can update the selection, there's a tool on the svn to do that

Pascal: http://suivi.ubuntu-fr.org/post/Documentation-telechargeable:-Ubuntu-fr-a-choisi-Kiwix Ubuntu-fr chose Kiwix for this

SJ: where does it get updated? is there a wiki list, or a way for me to see the sj version of http://www.wikipediaondvd.com/nav/art/a/7/d.html ? we have 12 people in the chan here who would like to contribute a bit, perhaps to a 100-article version of same

Pascal: Guillaume translate for me :)

Guillaume: so you want to do a manual-only selection of articles?

SJ: I'm trying to understand how they / we can start working now

Pascal: (to SJ) I think we must think about what we could do in 10 MB

Mel: I'm trying to understand what you all want to create (possibly multiple things we're talking about here) and who ought to be creating them and when our deadline is.

Pascal: it's very very little

SJ: Guillaume, ?

Guillaume: I mean... selecting only existing articles could be a problem

Pascal: (to SJ) 10 M is for the data not for the data and application ?

SJ: (to Pascal) yes

Pascal: how much for the data for the olpc ? and how much for the application ?

SJ: 10M for data bundles. if there's an application other than the browser, that depends. if it's really good and general (and how we browse all wiki content), it can be a similar size or larger. you could merge something with http://wiki.laptop.org/go/MikMik, but one of the output formats for a collection should be plain html in which case there is no need for an application

Pascal: so we could work for a desktop search _sj_ ?

SJ: hmm. I see search as separate from an article reader

Pascal: and kiwix could be the browser :) then only one application to read all data

Guillaume: if it's doing both, you're gaining in disk space :)

SJ: ahh ok so, a read-write browser

Pascal: yes _sj_

SJ: would you use xulrunner or gecko for any of this?

Pascal: yes kiwix is a xulrunner application how many size the browser it use ?

FFM: (to Pascal) what?

Guillaume: how much disk space can the browser use? but I think _sj_ already said 10 M

SJ: hm, explain. Browse takes up a fair bit of space but that's largely xulrunner and gecko, which you could take advantage of

Guillaume: yeah

SJ: the question is how much extra space it is on top of our core. more than 10M would be a lot, but start with what is quick to test, and let's get an estimate

SJ: (to Mel, Pascal, Guillaume) For bundling, I'd like to make some small collections soon (say, by this weekend). I'm mainly concerned with the toolchain used to do that. this is more urgent than getting kiwix to run on the XO. as I said, there are many people trying to work on that system. they need a shared library to improve.

Mel: (to SJ) unsure what you mean by toolchain - bot that crawls pages in a certain category and pulls their text into html files? or... I may be way off, here

SJ: mchua, right. except there are much better tools that different projects have; some extract html, some pull from an XML export

Mel: is the issue here primarily finding/modifying an existing tool we like, or creating a totally new one? (new to wikislicing, I am.)

SJ: this week, I would like to know : every script that can convert a list of 100 articles into a set of 100 html files. groups that have related tools include kiwix, moulin, sos-children, homebaked scripts by zdenek, wikiwizzy

Guillaume: well, wikiwix is doing that kind of thing as it crawls and indexes all wikipedia projects. kiwix is an offline reader, it cannot crawl, it takes html files as "input"

Pascal: (to SJ) is not a problem, but not for this week, for monday. a list of random 100 articles as html files. SJ we have got a server to do this. it is also used by moulin

Mel: who'll evaluate the list of scripts, and by what criteria? ("the community," I'm guessing, but some starter thoughts are needed by someone.) and by when? sunday?

SJ: (to Mel) I mean it will help all of these groups to have a list of such scripts since some people are rewriting them from scratch.

Pascal: stupid idea _sj_

Guillaume: wikipediaondvd has already been evaluated by the community

SJ: hmm. Pascal, how does the server operate? how can I submit lists to it?

Pascal: if we are here, it is to help. do you really want to know how the server operates?

SJ: guillaume, does wikiwix handle exports?

Pascal: is not possible yet but we could do that for monday

SJ: ok, perhaps, I'll list the classes of scripts that would be interesting/useful on wikislices. please add notes if such a script exists already. Pascal, it is perhaps just as helpful to know when scripts exist as it is to write ones to fill a specific need. so, for instance, zdenek has a "convert list of wiki pages to a directory of single-paragraph html files" script. is this a good script for the purpose? maybe. it should at least be named and published so that others can comment on whether this is useful output

(to Pascal) but if there could be wikiwix exports... that would be super. we could have a few different people collaborate on improving the export library then

Guillaume: by export, you mean getting a list of html pages corresponding to a list of wiki pages? we could work on a xul extension that would take an url list or a csv file and output an html dump. that could be used for the selection process too. with a "add the WP page I'm reading to the selection" button

Pascal: (to SJ) http://wikiwix.com/zelda/

SJ: nice. Guillaume, yes, perhaps output a zipfile of a directory of html

Pascal: (to Guillaume, in French) explain to _sj_ a bit what zelda is for

SJ: agreed about the "Add page I'm reading", the default should be to add the specific revision ID. Pascal, Guillaume, (in French) don't hesitate to write in French here

Guillaume: zelda is a FF extension that lists WP pages related to the page the user is currently viewing. it uses wikiwix to get that list, btw

SJ: (in French) super. like the googlegears proposal...

Guillaume: (in French) ah, there we go, we can continue in French then :)

SJ: (in French) how is wikiwix used to get this list? good :)

Pascal: (in French) wikiwix is our engine :)

Guillaume: wikiwix is a search engine

SJ: yes... it polls wikiwix in the background?

Guillaume: zelda is just sending the url of the current page and gets a rss feed in return

Pascal: we also put some indicator of the quality of the article

Guillaume: (to SJ) yeah

SJ: ok. *starts to write the "how to make a bundle with wikiwix" tutorial*

before it is possible :)

Guillaume: heh

SJ: (in French) "quality" : associated with a revision or an article? (we are half anglophone here, and my keyboard has no accents... but this channel should be multilingual...) on en:wp "Good articles" : per article

"Featured articles" : ditto. "WP 1.0" : hmmmm, per edition?

Dirakx: (in French) yes, this channel is multilingual. :)

Pascal: (in French) quality belongs to an article, the quality of an article, but we could tie that to revisions

SJ: (in French) ok. that would be better...

Guillaume: (in French) for vandalism?

SJ: andrew cates / bozmo spends time reviewing specific revisions. these should be tracked somehow within wp. for vandalism, for image selection

Pascal: (in French) yes, but once the article is selected it is frozen, so no more worries :)

SJ: (in French) ok. frozen how?

Pascal: (in French) we just need to make updates if the articles become better quality. by storing an html version

SJ: hmm, yes, but... the selection should be made with a list of {article, revision}, so that others can easily compile variants. it is hard for people to collaborate on a list and make variations of it when they can only see the output. they may want to make a small change and recompile that (recompile the collection with that change) so it is useful to make each step in the process something replicable

Pascal: no we could add a svn to collaborate

SJ: euh. I might want 100 export versions of each bundle

Pascal: yes

SJ: some of them are hard to get from the html. I might want the version "list of article names, with permalinks to a good revision online" or "list of articles, highlighting the last 5 editors and the timestamp of the last edit" or, for a critical reading class: "articles, with the last 5 editors and timestamps, and the last 5 diffs"

an aside : we need some cleanup scripts to deal with character encodings in URLs v. in filenames. any suggestions there?
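One possible shape for such a cleanup script, as a sketch only: percent-decode the last path segment of the article URL as UTF-8, normalize it, and replace characters that are unsafe in filenames. The set of replaced characters and the `.html` suffix are assumptions, not a convention of any project mentioned here.

```python
import unicodedata
from urllib.parse import unquote

UNSAFE = '/\\:*?"<>|'  # assumed set of characters to avoid in filenames

def url_to_filename(url):
    """Turn an article URL into a local filename, e.g. for an html dump."""
    # Take the last path segment and undo percent-encoding (UTF-8 by default).
    title = unquote(url.rsplit("/", 1)[-1])
    # Normalize so that accented letters have one consistent representation.
    title = unicodedata.normalize("NFC", title)
    safe = "".join("_" if c in UNSAFE else c for c in title)
    return safe + ".html"
```

With the fr:wp URL quoted earlier, `Wikip%C3%A9dia:Articles_de_qualit%C3%A9` becomes `Wikipédia_Articles_de_qualité.html`. Links inside the html files would need the same mapping applied so that they still resolve.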

Pascal: _sj_ yes. we just need access to the mediawiki format, and to store the articles on a mediawiki to make small changes. do you see what I mean _sj_ ? allo _sj_ ? to access the mediawiki format we need a Live feed from WMF

Atglenn: what do you mean by a live feed?

SJ: back *is not sure what the live feed would be for*

SJ: (to atglenn) you can get a live rss feed of changes to Wikipedias, to keep your own local copy up to date

FFM: or just wait for the monthly release.

Pascal: a master/slave, not this, _sj_. i will find a link

SJ: hm?

FFM: (to Pascal) wha?

SJ: mysql slave?

Pascal: yes sj

SJ: oh!

Pascal: I know that the WMF uses it

Atglenn: so you meant a live feed of rc... but you can just as well poll for rc from the various wikis using the api

SJ: atglenn, that's not practical


Atglenn: well, what is your scenario? how many of them do you need to monitor?

SJ: (to Pascal) are you really interested in having a live copy of the db? and have you asked for a feed?

Pascal: yes _sj_

SJ: Because it would be enough (for me) to note the revision ID (via the API) when saving an article. making snapshots is a good reason to have one. especially if the result is a better collection of reviewer metadata

Pascal: first, i'm not sure it is sent in mediawiki format, but i think yes

SJ: atglenn, to do this properly you would want the same interface for all articles and you would be actively updating tens of thousands of articles on a few dozen wikis

Atglenn: that's not so much to manage I think... what's the time frame over which these updates would be happening?

Pascal: _sj_ i know that wikimedia france is waiting for a live feed for a specific search, and the live feed will go to the same data center that wikiwix uses

SJ: ok. search but not for the dvd?

SJ: (to atglenn) over the course of months?

Atglenn: ok

SJ: I agree, it could be done without a feed for at least the next 6 months or so, since there is only a small reviewing community right now

Atglenn: and the idea is to maintain (let's say) daily snapshots for these projects...?

SJ: well... we are talking about also reviewing/updating a page, marking that new revision as the desired one, and having that revision in the database used later to bundle up the collection. now the firefox plugin can just note the revision ID of your new edit, and wait for the next daily or weekly dump before including it in the bundle

Atglenn: so later revisions would get silently ignored?

SJ: ignored, not necc. silently. make-bundle (protists-for-preteens, wikiwix-to-dir) notice: this bundle will contain 167 pages, 300 thumbnails, approximately 15M in all. warning : 34 pages and 5 images have new updates(review updates) (proceed) (cancel)
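The notice SJ describes could be produced by comparing the revisions pinned in a selection with the latest known ones. This sketch assumes both are already available as `{title: revision_id}` mappings (how they are fetched is left out) and that a larger revision ID means a newer revision; the function names are made up.

```python
def stale_pages(pinned, latest):
    """Return titles whose latest revision is newer than the pinned one."""
    # Pages missing from `latest` are treated as unchanged.
    return sorted(t for t, rev in pinned.items() if latest.get(t, rev) > rev)

def bundle_notice(name, pinned, latest):
    """Build the text shown before a bundle is made."""
    stale = stale_pages(pinned, latest)
    lines = [f"notice: bundle {name} will contain {len(pinned)} pages"]
    if stale:
        lines.append(f"warning: {len(stale)} pages have new updates")
    return "\n".join(lines)
```

Later revisions are reported, not silently ignored; the bundle itself is still built from the pinned revisions unless a reviewer updates them.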

Atglenn: retrieval of the correct revids (if we are talking about < 5000 pages) at the moment of the build seems the way to go then

SJ: that could work

Atglenn: even if it takes half a minute cause you are polite about spacing out export requests

SJ: of course "review updates" should do a few things at once. a) update the "reviewed" flag for the article revision (for wikis that have that turned on, if the user doing the reviewing is trusted by that wiki) b) allow the reviewer to separately note "bad revision" (in which case, do something...), "ok, don't include" (not vandalism, but not desired for this collection), "update" (replace previous revision in this collection)
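The three reviewer choices in (b) could be modelled as below. This is only a sketch: the names are hypothetical, and the "bad revision" case is left as a no-op because the discussion does not say what should happen there.

```python
from enum import Enum

class Decision(Enum):
    BAD = "bad revision"        # follow-up action not specified in the notes
    SKIP = "ok, don't include"  # fine, but not wanted for this collection
    UPDATE = "update"           # replace the pinned revision

def apply_review(pinned, title, new_rev, decision):
    """Return a new {title: revision} mapping after one review decision."""
    result = dict(pinned)  # leave the caller's mapping untouched
    if decision is Decision.UPDATE:
        result[title] = new_rev
    return result
```

Only "update" changes the collection; the other two choices keep the previously pinned revision.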

Pascal: so we do xul

SJ: nice.

Pascal: and i will send you the url next week, and second, we could improve them by using this extension ?

SJ: yes. where is the current extension? (for wikiwix) we could ask people to look at it in addition to the offline reader apps gearswiki.theidea.net (all: check it out)

Pascal: http://wikiwix.com/zelda/

SJ: ah, right! merci. and here is wikibrowse (with link problems): http://wiki.laptop.org/go/WikiBrowse. thanks all.