Talk:Wikislices
see Talk:Bundles for scripts used here
Universalism
The question of universal use of this content needs to be considered. Do we run this project entirely under OLPC? Or do we try to create logical bundles for anyone with a wikireader? Where might our ideas differ from those of other Wikipedians?
Meeting minutes 2008-02-2?
care of mel
(Introductions all around, discussion of whether there is actually a meeting today)
Pascal: I am Pascal Martin from Linterweb (http://wikipediaondvd.com). Guillaume works for Linterweb, and we really, really need information about what you need
Mel: Well, we could certainly talk in the absence of a formal meeting, still. :) I'm curious to hear what you've been doing with wikipediaondvd. I'm reading your website and your wikipedia pages right now. I'm most curious about the processes (both community and technological) you've used to select, curate, and package your work, and whether they're easily transferable to do curations of other wikis.
Pascal: wikipediaondvd is us. It is not an automatic selection; it is a manual selection by the en: community
Mel: Right - wikipediaondvd and the http://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team wp page are what I'm reading. It looks like you have some very nice community processes in place for making the content selection work. Since other wiki information repositories (like UNICEF, Appropedia, and so forth) have expressed at least tentative interest in making similar wikislices as content bundles for the XO, I'm reading about what you've done for wikipediaondvd in the hopes that the same process can be used to select content from the other wikis.
Pascal: we also make wikiwix.com, a meta search engine for the Wikimedia projects.
Guillaume: I think the community selection process is taking place on all wikipedia. at least, there's one on fr:
Pascal: it is integrated in http://es.wikipedia.org/wiki/Especial:Search?search=&fulltext=Buscar, and wikiwix works for 12 languages, so we could build a search engine, a desktop search, for OLPC
Mel: (to Guillaume) That's what I'm guessing - I am looking through the material on http://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team
Guillaume: now, the problem with such a selection is that quality will greatly differ between languages. for instance, fr: is far from being as complete as en:
btw, do you have an idea of which languages you want to include in OLPC? or how much disk space will be given to wikipedia?
SJ: howdy all. cjb is going to join soon. (To Pascal:) a couple of things: I wanted to work through the new mailing list setup and to get a list of the scripts / libraries people are using, especially redlink/link-recalculation scripts and filename-conversion scripts. And rtl/ltr handling, all of which seem to crop up as bugs / incomplete features every time a similar library or two comes out
Pascal: (to SJ) we have got all on our server
SJ: (To Pascal) the three sets of scripts listed at kiwix.com?
Pascal: (To SJ) Guillaume is my new developer, who will work on the OLPC. I hope that this meeting can help us start work.
SJ: welcome Guillaume !
Guillaume: thanks :)
Pascal: so he could work on desktop search and automatic selection
SJ: (To Pascal) perfect.
Pascal: that's all we can do; we will do our best :)
SJ: I also hope to have "annotate" as a system-wide service, providing a default name for the discussion page related to an item found in a search
Pascal: not sure I'm getting what you say
SJ: we are implementing versioning in our datastore
Pascal: we need to know how often you need a selection of articles, and in which language, because in French there are only 1000 articles which are very good. So if we need help from the WP community to improve the quality of the articles, it will take a lot of time
SJ: (to Pascal) indeed. To start, I'd like to describe script-libraries at three levels.
(1) naming a selection - for instance, creating a wikipage that lists a set of pages (optional: revisions) to make into a bundle. as a poor example, see http://en.wikipedia.org/wiki/User:Sj/wp-api
(2) finding/listing selections, something like an apt-get service for collections that have been named [and made] (optional: bundle-on-demand, if the bundling takes time; or if a named bundle invokes "get latest version" rather than "get specific revision")
(3) creating collections : turning a named selection into a {wikitext|mw-xml|html|pdf|docbook|txt} collection, or further an rpm/xol/other bundle of one of the above
Pascal suggested above that making a selection is currently the 'hard' part. I am not quite in agreement /yet/ since we haven't made the obvious bundles that should already be available to all, recreated every month, &c.
for instance : http://en.wikipedia.org/wiki/Wikipedia:Good_articles has a /lot/ of articles. I should note that (3.1) above is "creating collections under constraints". For instance: under 10MB total, prioritize header text, then first image thumbnail, then first section text, then references, then second image thumbnail... so, some bundles to start with: header + first image from all articles in each section of "good articles", or "full Featured Article blurb" for each entry in http://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Articles_de_qualit%C3%A9
that's the fr:wp featured article set. (standard disclaimer: please forgive the english-centered nature of this forum, discussion, and initial set of examples. the most active users of these bundles will actually be native spanish speakers in the short term.)
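The size-constrained prioritization SJ describes in (3.1) could be sketched roughly as follows. This is a hypothetical illustration, not an existing tool: the part names, the `PRIORITY` order, and the `plan_bundle` helper are all invented for the example.

```python
# Sketch of step (3.1): building a collection under a size constraint.
# Every article's parts are added breadth-first in priority order
# (header text first, then first thumbnail, ...) until the budget runs out.

PRIORITY = ["header", "first_thumb", "first_section", "references", "second_thumb"]

def plan_bundle(articles, budget_bytes):
    """articles: {title: {part_name: size_in_bytes}}.
    Returns ({title: [parts included]}, total bytes used)."""
    plan = {title: [] for title in articles}
    used = 0
    for part in PRIORITY:                      # breadth-first across articles
        for title, parts in articles.items():
            size = parts.get(part)
            if size is not None and used + size <= budget_bytes:
                plan[title].append(part)
                used += size
    return plan, used
```

With a 12-byte budget and two articles, both headers fit but no thumbnail does, so every article keeps at least its header before any article gets an image.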
Mel: meaning that the notes from this discussion on how to make what wikislices should be translated into spanish fairly soon, once they're stable, I'm guessing?
SJ: absolutely. hmm, es:wp has lower featured article standards than fr, or better writing. 560 FAs. http://es.wikipedia.org/wiki/Wikipedia:Art%C3%ADculos_destacados. since the /average/ writing is not so good for es:wp articles, I'm guessing it is standards.
SJ: done for the moment. thoughts? actually, why don't we do a round of introductions for people here to talk about wikislicing and offline reading/editing perhaps even including bobMini and eben and sebastiansilva :) but certainly including coderanger, dirakx, ffm, guillaume, m_stone, mchua, Phil_Bordelon, pm27, and tomeu. also, would someone like to volunteer to take notes / clean up an irc transcript after?
Mel volunteers
SJ: ffm, perhaps you should start with intros, since you are one of the most active wikipedians (though there may be others hiding... the brother of an active editor dropped by on sunday and almost didn't mention it). dirakx, have you worked on es:wp much? it's actually improving pretty rapidly...
Dirakx: nope, I haven't worked on that yet.
Countrymike: i'm mostly active on WikiEducator and i'm also a Custodian on Wikiversity... i've been playing with the pediapress export to pdf quite a bit on WikiEducator...
SJ: Countrymike, tell us more. should we install it? today seems to be an extension installation day.
SJ thinks the teamwiki is going to become the testbed
Sebastiansilva: extension installation day?
SJ: well, there's a backlog of extension requests (to cjb) now eben wants http://www.mediawiki.org/wiki/Extension:SpecialMultiUploadViaZip
ffm: ooh, and the reCAPTCHA, and the OGG player...
* cjb shakes fist
cjb: (to SJ) it gives people an obvious way to take down our web server. do they have that already? :)
FFM: (to cjb) we can make it an admin utility.
SJ: (to cjb) sort of.
FFM: eben can use the admin upload.
SJ: (to cjb) special pages can be restricted by user group
Countrymike: it seems to be working quite well lately, quite a few bugs involving templates etc have been worked out. And I've used it to create some effective handouts for face-to-face teaching... i actually think we may have met, SJ. In vancouver?
FFM: I was a highly active wikipedian, although I haven't been recently since I have been working on OLPC.
SJ: (to Countrymike) yes! I was trying to place you. are you going to the followup this year?
Countrymike: will try... have been invited, so... will see. There has also been an interesting hack on the PediaPress collections done by a guy down here (NZ), Jim Tittsler... he's created a way to take a "collection" and create an IMS-CP (content package) which can then be "played" back on Moodle, etc... anything that will import a Content Package. isn't there some work being done on Moodle/OLPC?
SJ: yes.
Countrymike: he's also managed to get eXe on the OLPC too ... ! but it runs pretty slow from what I understand.
SJ: yes, that's a problem. I'm excited to see that working smoothly, though. there's moodle discussion on our schoolserver mailing list, which is another option for eXe (as a service from the server) but this is a bit off-agenda :) let's follow up later
Countrymike: yep. ok.
SJ: Guillaume, pm27, still there?
Pascal: yes _sj_, it is 23:35 in France :(
SJ: I know :) thanks for staying up
Guillaume: so... pm27 and I are working at linterweb, we worked on kiwix and wikiwix, http://sourceforge.net/projects/kiwix and http://www.wikiwix.com
Pascal: http://wikipediaondvd.com is Martin Walker's selection
SJ: thanks. I got good response from BozMo, the author of the SOS Children's Wikipedia, who wants to share his scripts. he seems to have a good toolchain and article reviewers who are not working with martin yet
Pascal: also, SJ, we are working on the same platform as Moulin. SOS Children's is 4000 articles. I know BozMo
SJ: yes
FFM: has anyone tried to see how big it is w/ compression?
Pascal: ffm, we are working with zero compression, only one file
SJ: to make the full package work better with XOs, we would need tools to take long articles with full images, and turn them into shorter articles with thumbnails and fewer images. it is not only a matter of compression, it is one of reasonable scripted subselection.
Pascal: but it's not really the text; it is the images which are really heavy
SJ: right. well, once you reduce the images to one thumbnail per article, the text becomes larger for the good long articles so it is both.
Pascal: yes of course
SJ: what I would like is a script that takes the SOS articles and pulls out just the header and the first image.
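A minimal sketch of the "header + first image" script SJ asks for, assuming the input is a plain HTML article (as in the schools-wikipedia pages). This is an illustration built on the standard library's `html.parser`, not the actual SOS Children's toolchain:

```python
# Sketch: pull the first paragraph ("header") and first image out of an
# article's HTML. Assumes the lead text lives in the first <p> element.

from html.parser import HTMLParser

class HeaderAndFirstImage(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_first_p = False    # currently inside the first <p>
        self.done_p = False        # first <p> already consumed
        self.header = []
        self.first_img = None

    def handle_starttag(self, tag, attrs):
        if tag == "p" and not self.done_p:
            self.in_first_p = True
        elif tag == "img" and self.first_img is None:
            self.first_img = dict(attrs).get("src")

    def handle_endtag(self, tag):
        if tag == "p" and self.in_first_p:
            self.in_first_p = False
            self.done_p = True

    def handle_data(self, data):
        if self.in_first_p:
            self.header.append(data)

def extract(html):
    """Return (lead paragraph text, first image src or None)."""
    parser = HeaderAndFirstImage()
    parser.feed(html)
    return "".join(parser.header).strip(), parser.first_img
```

Real article HTML has infoboxes and navigation markup before the lead paragraph, so a production version would need to filter those out first.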
Guillaume: btw, do we have an idea on the good size for a package?
SJ: Guillaume, 10MB
Pascal: 500,000 articles, text only, on French WP = 1 CD
SJ: Guillaume, up to 100MB for a very weighty collection, but that is awkward to manipulate; a real max.
Guillaume: ok, and I guess only one language is included in a package
SJ: not necessarily. a fine picture dictionary could be made with one sentence from each article and the first image. at any rate, I'd like to have a tool that lets me make subsets of the schools-wp as a testbed. are there currently any scripts from kiwix that could do this? do you select articles based on {title, revision}? where are those pairs stored? is there a way to update a kiwix selection by tweaking one revision of one article, and how does that work?
Pascal: http://www.wikipediaondvd.com/nav/art/a/7/d.html is the listing of the editors of the 2000 articles. I think in 10 MB we have just enough to respect the GNU license conditions, _sj_ :)
SJ: :P you could do one paragraph per article, and hotlinked images (with good alt text if not online)
Guillaume: and you can update the selection; there are tools on the SVN to do that
Pascal: http://suivi.ubuntu-fr.org/post/Documentation-telechargeable:-Ubuntu-fr-a-choisi-Kiwix Ubuntu does this with Kiwix
SJ: where does it get updated? is there a wiki list, or a way for me to see the sj version of http://www.wikipediaondvd.com/nav/art/a/7/d.html ? we have 12 people in the chan here who would like to contribute a bit, perhaps to a 100-article version of same
Pascal: Guillaume, translate for me :)
Guillaume: so you want to do a manual-only selection of articles?
SJ: I'm trying to understand how they / we can start working now
Pascal: (to SJ) I think we must think about what we can do in 10 MB
Mel: I'm trying to understand what you all want to create (possibly multiple things we're talking about here) and who ought to be creating them and when our deadline is.
Pascal: it's very, very little
SJ: Guillaume, ?
Guillaume: I mean... selecting only existing articles could be a problem
Pascal: (to SJ) the 10 MB is for the data, not for the data and the application?
SJ: (to Pascal) yes
Pascal: how much for the data for the olpc ? and how much for the application ?
SJ: 10M for data bundles. if there's an application other than the browser, that depends. if it's really good and general (and how we browse all wiki content), it can be a similar size or larger. you could merge something with http://wiki.laptop.org/go/MikMik, but one of the output formats for a collection should be plain html, in which case there is no need for an application
Pascal: so we could work on a desktop search, _sj_?
SJ: hmm. I see search as separate from an article reader
Pascal: and kiwix could be the browser :) then only one application to read all the data
Guillaume: if it's doing both, you're gaining in disk space :)
SJ: ahh ok so, a read-write browser
Pascal: yes _sj_
SJ: would you use xulrunner or gecko for any of this?
Pascal: yes, kiwix is a xulrunner application. how much space does the browser use?
FFM: (to Pascal) what?
Guillaume: how much disk space can the browser use? but I think _sj_ already said 10 M
SJ: hm, explain. Browse takes up a fair bit of space but that's largely xulrunner and gecko, which you could take advantage of
Guillaume: yeah
SJ: the question is how much extra space it is on top of our core. more than 10M would be a lot, but start with what is quick to test, and let's get an estimate
SJ: (to Mel, Pascal, Guillaume) For bundling, I'd like to make some small collections soon (say, by this weekend). I'm mainly concerned with the toolchain used to do that. this is more urgent than getting kiwix to run on the XO. as I said, there are many people trying to work on that system. they need a shared library to improve.
Mel: (to SJ) unsure what you mean by toolchain - bot that crawls pages in a certain category and pulls their text into html files? or... I may be way off, here
SJ: mchua, right. except there are much better tools that different projects have; some extract html, some pull from an XML export
Mel: is the issue here primarily finding/modifying an existing tool we like, or creating a totally new one? (new to wikislicing, I am.)
SJ: this week, I would like to know : every script that can convert a list of 100 articles into a set of 100 html files. groups that have related tools include kiwix, moulin, sos-children, homebaked scripts by zdenek, wikiwizzy
Guillaume: well, wikiwix is doing that kind of thing, as it crawls and indexes all wikipedia projects. kiwix is an offline reader; it cannot crawl, it takes html files as "input"
Pascal: (to SJ) it is not a problem, but not for this week; for Monday. a list of 100 random articles as html files. SJ, we have a server to do this; it is also used by Moulin
Mel: who'll evaluate the list of scripts, and by what criteria? ("the community," I'm guessing, but some starter thoughts are needed by someone.) and by when? sunday?
SJ: (to Mel) I mean it will help all of these groups to have a list of such scripts since some people are rewriting them from scratch.
Pascal: stupid idea _sj_
Guillaume: wikipediaondvd has already been evaluated by the community
SJ: hmm. Pascal, how does the server operate? how can I submit lists to it?
Pascal: if we are here, it is for you to help us. do you really want to know how the server operates?
SJ: guillaume, does wikiwix handle exports?
Pascal: it is not possible yet, but we could do that for Monday
SJ: ok, perhaps, I'll list the classes of scripts that would be interesting/useful on wikislices. please add notes if such a script exists already. Pascal, it is perhaps just as helpful to know when scripts exist as it is to write ones to fill a specific need. so, for instance, zdenek has a "convert list of wiki pages to a directory of single-paragraph html files" script. is this a good script for the purpose? maybe. it should at least be named and published so that others can comment on whether this is useful output
(to Pascal) but if there could be wikiwix exports... that would be super. we could have a few different people collaborate on improving the export library then
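The class of script SJ describes ("convert a list of wiki pages into a directory of HTML files") might look like this minimal sketch. `fetch_html` is a placeholder for whatever backend actually supplies the page HTML (a wikiwix export, Special:Export plus rendering, zdenek's scripts, etc.); none of the names here refer to an existing tool:

```python
# Sketch of the "selection list -> directory of HTML files" script class.
# The fetching backend is injected as a callable so the same skeleton works
# with any of the export mechanisms discussed above.

import os

def write_slice(titles, fetch_html, out_dir):
    """Write one <title>.html file per page; return the filenames written."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for title in titles:
        # naive filename conversion: spaces -> underscores, '/' escaped
        fname = title.replace(" ", "_").replace("/", "%2F") + ".html"
        with open(os.path.join(out_dir, fname), "w", encoding="utf-8") as f:
            f.write(fetch_html(title))
        written.append(fname)
    return written
```

Injecting the fetcher keeps the directory-writing step replicable and testable independently of any one project's export pipeline, which is the shared-library point SJ is making.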
Guillaume: by export, you mean getting a list of html pages corresponding to a list of wiki pages? we could work on a xul extension that would take a URL list or a csv file and output an html dump. that could be used for the selection process too, with an "add the WP page I'm reading to the selection" button
Pascal: (to SJ) http://wikiwix.com/zelda/
SJ: nice. Guillaume, yes, perhaps output a zipfile of a directory of html
Pascal: (to Guillaume) explain to _sj_ a bit what zelda is for
SJ: agreed about the "Add page I'm reading"; the default should be to add the specific revision ID. Pascal, Guillaume, don't hesitate to write in French here
Guillaume: zelda is a FF extension that lists WP pages related to the page the user is currently viewing. it uses wikiwix to get that list, btw
SJ: super. like the Google Gears proposal...
Guillaume: ah, well then, we can continue in French :)
SJ: how is wikiwix used to get this list? good :)
Pascal: wikiwix is our engine :)
Guillaume: wikiwix is a search engine
SJ: yes... it polls wikiwix in the background?
Guillaume: zelda is just sending the url of the current page and gets a rss feed in return
Pascal: we also put some indicator of the quality of the article
Guillaume: (to SJ) yeah
SJ: ok. *starts to write the "how to make a bundle with wikiwix" tutorial*
before it is possible :)
Guillaume: heh
SJ: "quality": associated with a revision or with an article? (we are half anglophone here, and my keyboard has no accents... but this channel should be multilingual...) on en:wp, "Good articles": per article
"Featured articles": ditto. "WP 1.0": hmmmm, per edition?
Dirakx: yes, this channel is multilingual. :)
Pascal: quality is attached to an article, the quality of an article, but we could couple that to revisions
SJ: ok. that would be better...
Guillaume: for vandalism?
SJ: andrew cates / bozmo spends time reviewing specific revisions. these should be tracked somehow within wp. for vandalism, for selection of images
Pascal: yes, but once an article is selected it is frozen, so no more worries :)
SJ: ok. frozen how?
Pascal: we just need to do updates if the articles improve in quality, by storing an html version
SJ: hmm, yes, but... we should make the selection with a list of {article, revision} pairs, so that others can easily compile variants. it is hard for people to collaborate on a list and make variations of it when they can only see the output. they may want to make a small change and recompile the collection with that change, so it is useful to make each step in the process replicable
Pascal: no, we could add an svn to collaborate
SJ: euh. I might want 100 export versions of each bundle
Pascal: yes
SJ: some of them are hard to get from the html. I might want the version "list of article names, with permalinks to a good revision online" or "list of articles, highlighting the last 5 editors and the timestamp of the last edit" or, for a critical reading class: "articles, with the last 5 editors and timestamps, and the last 5 diffs"
an aside : we need some cleanup scripts to deal with character encodings in URLs v. in filenames. any suggestions there?
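For the URL-vs-filename encoding cleanup SJ asks about, one hedged sketch: decode the percent-encoded UTF-8 used in wiki URLs, then re-escape only the characters that are unsafe in filenames. The `UNSAFE` character set here is illustrative (chosen to cover POSIX and Windows restrictions), not a standard:

```python
# Sketch: turn a percent-encoded wiki URL title into a readable, safe filename.
# Decoded UTF-8 (accents etc.) is kept; only filesystem-hostile characters
# are re-escaped in percent form.

from urllib.parse import unquote

UNSAFE = '/\\:*?"<>|'          # characters that break common filesystems/tools

def url_to_filename(url_title):
    title = unquote(url_title)   # "qualit%C3%A9" -> "qualité" (UTF-8 decode)
    return "".join("%%%02X" % ord(c) if c in UNSAFE else c
                   for c in title) + ".html"
```

Keeping the decoded Unicode in filenames makes slices readable on-disk; an alternative policy is to keep everything percent-encoded, which is uglier but survives non-UTF-8 filesystems.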
Pascal: _sj_ yes. we just have to access the mediawiki format, and store the articles on a mediawiki to make small changes. do you see what I mean, _sj_? hello _sj_? to access the mediawiki we need a live feed from WMF
Atglenn: what do you mean by a live feed?
SJ: back *is not sure what the live feed would be for*
SJ: (to atglenn) you can get a live rss feed of changes to Wikipedias, to keep your own local copy up to date
FFM: or just wait for the monthly release.
Pascal: a master/slave... not this, _sj_; I'll find a link
SJ: hm?
FFM: (to Pascal) wha?
SJ: mysql slave?
Pascal: yes sj
SJ: oh!
Pascal: I know that the WMF uses it
Atglenn: so you meant a live feed of rc... but you can just as well poll for rc from the various wikis using the api
SJ: atglenn, that's not practical
Atglenn: well, what is your scenario? how many of them do you need to monitor?
SJ: (to Pascal) are you really interested in having a live copy of the db? and have you asked for a feed?
Pascal: yes _sj_
SJ: Because it would be enough (for me) to note the revision ID (via the API) when saving an article. making snapshots is a good reason to have one. especially if the result is a better collection of reviewer metadata
Pascal: first, I'm not sure it is sent in mediawiki format, but I think yes
SJ: atglenn, to do this properly you would want the same interface for all articles and you would be actively updating tens of thousands of articles on a few dozen wikis
Atglenn: that's not so much to manage I think... what's the time frame over which these updates would be happening?
Pascal: _sj_, I know that Wikimédia France is waiting for a live feed for specific searches, and the live feed will go on the same data center that wikiwix uses
SJ: ok. search but not for the dvd?
SJ: (to atglenn) over the course of months?
Atglenn: ok
SJ: I agree, it could be done without a feed for at least the next 6 months or so, since there is only a small reviewing community right now
Atglenn: and the idea is to maintain (let's say) daily snapshots for these projects...?
SJ: well... we are talking about also reviewing/updating a page, marking that new revision as the desired one, and having that revision in the database used later to bundle up the collection. now the firefox plugin can just note the revision ID of your new edit, and wait for the next daily or weekly dump before including it in the bundle
Atglenn: so later revisions would get silently ignored?
SJ: ignored, but not necessarily silently. make-bundle (protists-for-preteens, wikiwix-to-dir) notice: this bundle will contain 167 pages, 300 thumbnails, approximately 15M in all. warning: 34 pages and 5 images have new updates (review updates) (proceed) (cancel)
Atglenn: retrieval of the correct revids (if we are talking about < 5000 pages) at the moment of the build seems the way to go then
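Atglenn's build-time revid retrieval could be sketched with the MediaWiki API (`action=query&prop=revisions`, which accepts batched titles), spacing the requests out politely. This sketch only builds the query URLs; the actual rate-limited fetching and JSON parsing are left to the caller, and the 50-title batch size reflects the API's limit for normal users:

```python
# Sketch: build MediaWiki API query URLs to fetch current revision IDs
# for a selection of pages, batched so a <5000-page slice needs ~100 requests.

from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def revid_queries(titles, batch=50):
    """Yield one API URL per batch of up to `batch` titles."""
    for i in range(0, len(titles), batch):
        params = {
            "action": "query",
            "prop": "revisions",
            "rvprop": "ids",
            "titles": "|".join(titles[i:i + batch]),
            "format": "json",
        }
        yield API + "?" + urlencode(params)
```

A polite client would sleep between requests (as atglenn suggests, even half a minute total is acceptable at build time) and pin the returned {title: revid} pairs into the bundle manifest.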
SJ: that could work
Atglenn: even if it takes half a minute cause you are polite about spacing out export requests
SJ: of course "review updates" should do a few things at once. a) update the "reviewed" flag for the article revision (for wikis that have that turned on, if the user doing the reviewing is trusted by that wiki), b) allow the reviewer to separately note "bad revision" (in which case, do something...), "ok, don't include" (not vandalism, but not desired for this collection), or "update" (replace the previous revision in this collection)
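The reviewer actions SJ lists could be modeled as a small dispatch over a selection stored as {title: revid} pairs. The decision names and data shapes here are invented for illustration; they are not from any of the tools discussed:

```python
# Sketch: apply one reviewer decision to a pinned selection.
# selection: {title: revid}; excluded: a set recording pages kept out.

def apply_review(selection, excluded, title, decision, new_revid=None):
    """decision: 'update' (replace the pinned revision),
    'skip' (not vandalism, but not wanted in this collection),
    or 'bad' (flag the revision for follow-up, keep the old pin)."""
    if decision == "update":
        selection[title] = new_revid
    elif decision == "skip":
        selection.pop(title, None)
        excluded.add(title)
    elif decision == "bad":
        excluded.add((title, "bad"))   # flagged for vandalism patrol etc.
    else:
        raise ValueError("unknown decision: %r" % decision)
```

Keeping the excluded set separate from the selection preserves the distinction SJ draws between "bad revision" and "fine, just not for this bundle".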
Pascal: so we do xul
SJ: nice.
Pascal: and I will send you the URL next week, and second, we could improve them by using this extension?
SJ: yes. where is the current extension (for wikiwix)? we could ask people to look at it in addition to the offline reader apps. gearswiki.theidea.net (all: check it out)
Pascal: http://wikiwix.com/zelda/
SJ: ah, right! merci. and here is wikibrowse (with link problems): http://wiki.laptop.org/go/WikiBrowse. thanks all.
Meeting notes 2/21/08
Overall goal of meeting: wiki-hacking session to improve on the tools that Zdenek and others are currently using to make & refine wikislices. Held in #olpc-content on freenode.
Wikipedia snapshots
Developing snapshots of Wikipedia at every order of magnitude from 10MB to 100GB.
snapshot tools
We need...
- libraries for producing different styles of wikipedia snapshots (wikitext, html, txt, pdf) from categories (special:export), topics/pages (wikiosity), and index pages (wikislices)
- libraries that can do intelligent things with metadata from history and wikimedia-commons pages
- libraries that support no-image/thumbnail/mid-res image selection
- libraries that recalculate blue v. red links given a wikislice
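The red/blue link recalculation in the last bullet amounts to checking each link target for membership in the slice; a minimal sketch (the function name and data shapes are illustrative):

```python
# Sketch: given each page's outgoing link targets and the set of titles in
# the wikislice, split links into "blue" (inside the slice) and "red"
# (outside it; these would be unlinked or pointed back at the online wiki).

def recalculate_links(page_links, slice_titles):
    """page_links: {title: [link targets]}. Returns {title: (blue, red)}."""
    inside = set(slice_titles)
    result = {}
    for title, links in page_links.items():
        blue = [t for t in links if t in inside]
        red = [t for t in links if t not in inside]
        result[title] = (blue, red)
    return result
```

A real implementation would also normalize titles (underscores, capitalization, redirects) before the membership test, which is where most of the work in such a library actually lies.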
Wiki format glue
We need glue code/scripts to interface between similar projects : WP WikiReaders, Wikibooks, wikipedia wikislice projects, webaroo wikislices, kiwix snapshots, schools-wikipedia snapshots, ksana snapshots, WP 1.0 revision-vetting --- at least at the level of sharing index selections and a list of "good revisions" for included articles.
Offline readers
As a recent focal point, Zvi Boshernitzan and Ben Lisbakken have both made offline wikipedia-readers using Google Gears that are pretty fast and offer some nice features in terms of letting you select a set of articles, cache them locally, and browse an index. We talked last week about how to integrate Gears more tightly into a browsing experience, with hopes of pursuing a prototype within a couple of weeks. It would be helpful to inform such a client with lists of good revisions of articles, such as those Martin Walker and Andrew Cates have developed for their own projects... and to plan for it to support offline editing as well as reading, using synchronization tools such as Mako's distributed wiki client.
What can people do?
- wikipediaondvd - Pascal and Guillaume wondering how they can help