Wikislices
A '''Wikislice''' is a collection of materials gathered from a public wiki and packaged into a reusable form. In this context, it is a collection of articles pulled from Wikipedia and either stored as simplified HTML pages or stored compressed to be served by the [[WikiBrowse]] activity on the XO.

The [[Activities/G1G1]] [[activity group]] includes a chemistry collection stored as HTML; see [[Activities/Wikislice chemistry-en (8.2)]]. It also includes a large chunk of Wikipedia stored compressed; see [[Activities/WikiBrowse English-G1G1]].

See [[User:ZdenekBroz]] and the [[library grid]] for some examples.

The goal is to select well-written, structured, and cited articles from Wikipedia while excluding the rest. The entire English Wikipedia is very large and would not fit on the XO, nor are 1000+ articles on Pokemon characters important educational materials for the developing world.
== WikiProject Wikislice ==

Please visit the [http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Wikislice project about wikislices] on the English Wikipedia. For further discussion and notes, see [[Talk:Wikislices|the discussion page]] or minutes from [[Wikislice meetings|recent IRC chats]].
== Code libraries ==

* KsanaForge and their KsanaWiki project have a set of scripts that process raw XML dumps from MediaWiki. They are working on producing read-only flash drives and SD cards for distribution.

* Linterweb, developer of one of the freely available static selections of Wikipedia, has an open-source toolchain for building it; they are also working on wiki search engines (see [http://www.kiwix.org Kiwix]) and have offered to help build the local-filesystem [[search]] for the [[journal]].

* The Moulinwiki project and Renaud Gaudin have a toolchain for processing HTML output from the MediaWiki parser. They are now combining forces with Linterweb.

* [http://code.pediapress.com/wiki/wiki PediaPress] has a freely available "mwlib" library for parsing MediaWiki text.

* The "Wikipedia 1.0" team and Andrew Cates (user:BozMo on en:wp) are using their own scripts to generate and review static collections from a list of constantly changing wiki articles.
{{stub}}
== Current slices ==

We are planning on shipping a general collection of material with the XO. Collections could be added to a student's XO based on classroom assignments or simply a child's interest in a subject.

Common slice templates are topical wikislices from Wikipedia, resulting in books such as the "Solar system" wikijunior text and various [[wikipedia:Wikireader|wikireaders]]. Tools used to make wikislices are regular expression toolkits. See the scripts and tools section below.
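In practice, "regular expression toolkit" mostly means a handful of substitutions over raw wikitext. A minimal Python sketch of that style of pass (the function and patterns are illustrative only, not taken from any of the toolchains listed on this page):

<pre>
import re

def strip_wikitext(text):
    """Crudely flatten common wiki markup into plain text for a simplified page."""
    text = re.sub(r'\{\{[^{}]*\}\}', '', text)                        # simple templates
    text = re.sub(r'<ref[^>]*>.*?</ref>', '', text, flags=re.DOTALL)  # footnotes
    text = re.sub(r'\[\[(?:[^|\]]*\|)?([^\]]*)\]\]', r'\1', text)     # wiki links -> label
    text = re.sub(r"'{2,}", '', text)                                 # bold/italic quotes
    return text

print(strip_wikitext("'''Water''' is a [[chemical compound|compound]].{{citation needed}}"))
# -> Water is a compound.
</pre>

Real slicing scripts handle many more cases (tables, nested templates, image syntax), but they are built from the same kind of substitutions.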
=== Science ===

''see also [[Biology]] and [[Chemistry]]''

* An update to the [[Biology]] bundle with fungi and protists is slowly underway, as is an update clarifying the licenses of the images (all cc-by).

* Something for a bug blitz would be most helpful, drawing on the above and on related zipcodezoo and misha h's content.
==== Physics ====

A list of physics laws, terms, and principles, from Wikipedia. See [[physics (collection)]].

==== Chemistry ====

A periodic table and a collection of core chemistry terms, from Wikipedia.
=== Health ===

* In conjunction with other [[Health Content]], a bundle of [[Adapting_Hesperian_Books | Where There is No Doctor]] is underway, with a PDF-to-HTML conversion courtesy of [[user:Pascal|Pascal]].

* An index is being built [[Animal_health/Livestock | here]] for a slice related to farm animals, as part of the [[Animal_health]] content development effort.
=== How to... ===

A collection of projects and how-tos for children, indoors and outdoors, with limited raw materials; from WikiHow.
== Curation ==

=== By hand ===

Some slices are curated by hand on en:wp, at the WikiProject page.
=== By script ===

[http://www.mediawiki.org/wiki/MWDumper MWDumper] will take a wiki dump and pass it through filters to make it smaller. You can write your own filters.
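MWDumper's filters are written in Java, but the same kind of filtering can be sketched in a few lines of standalone Python. This is only a stand-in for the real tool; the file names and the title list (<code>slice-titles.txt</code>) are hypothetical:

<pre>
import xml.etree.ElementTree as ET

# Stream a MediaWiki XML dump and keep only the pages named in a title list.
keep = {line.strip() for line in open('slice-titles.txt', encoding='utf-8')}

def local(tag):
    """Strip the export-schema namespace from an element tag."""
    return tag.rsplit('}', 1)[-1]

with open('slice-pages.txt', 'w', encoding='utf-8') as out:
    for _, elem in ET.iterparse('enwiki-pages-articles.xml'):
        if local(elem.tag) != 'page':
            continue
        title = next((e.text for e in elem.iter() if local(e.tag) == 'title'), None)
        if title in keep:
            text = next((e.text or '' for e in elem.iter() if local(e.tag) == 'text'), '')
            out.write('== %s ==\n%s\n\n' % (title, text))
        elem.clear()  # release each finished <page> so memory stays flat
</pre>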
== Needs ==

# '''Naming''' and revising a selection - for instance, creating a wikipage that lists a set of pages (optional: revisions) to make into a bundle. As a poor example, see [http://en.wikipedia.org/wiki/User:Sj/wp-api this old api]. (A sketch of reading such a list page follows this list.)

# '''Finding/listing''' selections - something like an '''apt-get''' service for collections that have been named [and made] (optional: bundle-on-demand, if the bundling takes time; or if a named bundle invokes "get latest version" rather than "get specific revision")

# '''Creating collections''' - turning a named selection into a {wikitext|mw-xml|html|pdf|docbook|txt} collection, or further into an rpm/xol/other bundle of one of the above

# '''Reader tools''' - search, categorization, cookie-crumb trail, &c
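Item 1 above amounts to reading a list page and turning it into article titles. A minimal Python sketch, using a made-up selection page rather than a real fetch from the wiki:

<pre>
import re

# A "named selection": a wikipage whose bulleted [[links]] name the bundle's pages.
selection_wikitext = """
* [[Solar System]]
* [[Planet|planets]]
* [[Comet]]
"""

def titles_from_selection(wikitext):
    # Take each link target, ignoring any '|display text' part.
    return [m.group(1).strip() for m in re.finditer(r'\[\[([^|\]]+)', wikitext)]

print(titles_from_selection(selection_wikitext))
# -> ['Solar System', 'Planet', 'Comet']
</pre>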
Pascal suggests above that making a selection is currently the 'hard' part. SJ is not quite in agreement /yet/, since we haven't made the obvious bundles that should already be available to all, recreated every month, &c., from vetted sources.
=== Snapshot tools ===

To quickly create and revise slices, one needs:

* libraries for producing different styles of Wikipedia snapshots (wikitext, html, txt, pdf) from categories (Special:Export), topics/pages (wikiosity), and index pages (wikislices)

* libraries that can do intelligent things with metadata from history and Wikimedia Commons pages

* libraries that support no-image/thumbnail/mid-res image selection

* libraries that recalculate blue v. red links given a wikislice (see the sketch after this list)
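Recalculating link colours is straightforward once the slice is fixed: a link is "blue" if its target made it into the bundle and "red" otherwise, regardless of what exists on the live wiki. A small Python sketch with invented data:

<pre>
slice_titles = {'Solar System', 'Planet', 'Comet'}

links_on_page = {
    'Solar System': ['Planet', 'Comet', 'Oort cloud'],
}

for page, links in links_on_page.items():
    for target in links:
        colour = 'blue' if target in slice_titles else 'red'
        print('%s -> %s: %s link' % (page, target, colour))
# 'Oort cloud' is not in the slice, so its link should be re-rendered as red.
</pre>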
==== For Wikipedia ====

Specifically:

* Produce snapshots of Wikipedia by size (10MB, 100MB, 1GB, 10GB, and 100GB); one way to cut a selection to a size budget is sketched below.

* Produce snapshots by topic ([[Wikiosity]], categories, explicit wikislice lists)
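Cutting to a size budget can be as simple as ranking candidate articles and adding them greedily until the budget is spent. The ranking and sizes below are invented for illustration; a real selection would rank by article assessments, page views, or hand-curated lists:

<pre>
BUDGET_BYTES = 100 * 1024 * 1024          # e.g. the 100MB snapshot

ranked_articles = [                       # (title, size of rendered page in bytes)
    ('Water', 180000),
    ('Solar System', 240000),
    ('Pokemon', 95000),                   # ranked low, so included only if space remains
]

chosen, used = [], 0
for title, size in ranked_articles:
    if used + size <= BUDGET_BYTES:
        chosen.append(title)
        used += size

print('%d articles, %.1f MB' % (len(chosen), used / 1e6))
</pre>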
=== Format conversion ===

There are many existing flavors/formats for bundles of wiki subsets for offline use.

Conversion tools are needed to exchange collections between similar projects: WP WikiReaders, Wikibooks, Wikipedia wikislice projects, webaroo wikislices, kiwix snapshots, schools-wikipedia snapshots, ksana snapshots, WP 1.0 revision-vetting. If this isn't possible as a literal "conversion" from one format to another, index selections of articles + revisions should at least be shared.
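Even without full format conversion, the selection itself (which articles, at which vetted revisions) could be exchanged in a trivially simple file. The file name, format, and field names below are made up purely to illustrate the idea:

<pre>
import json

selection = [
    {'title': 'Solar System', 'revision': 123456789},
    {'title': 'Water',        'revision': 987654321},
]

# Write one JSON object per line; any of the toolchains above could read it back.
with open('solar-system-slice.index', 'w', encoding='utf-8') as f:
    for entry in selection:
        f.write(json.dumps(entry, ensure_ascii=False) + '\n')

with open('solar-system-slice.index', encoding='utf-8') as f:
    wanted = [json.loads(line) for line in f]
print(wanted[0]['title'], wanted[0]['revision'])
</pre>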
=== Offline reader ===

Wishes:

* Something that can be served as static HTML from a webserver as a last resort. Wizzy.za currently uses Andrew Cates's version for this. It has a nice portal, but a fairly rudimentary index, and no search.

*: NB: ''kiwix'' can't do this without its reader.

* Categories.

* Search. Linterweb has this, but it needs a special reader, and I want to serve search from the webserver. Some XML index perhaps, readable by a reader and by something server-side? (A toy index of that sort is sketched after this list.)
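The shared-index wish boils down to building one file that both a client-side reader and a small server-side script can load. A toy Python sketch using JSON rather than XML (the page texts are invented; a real build would walk the bundled HTML files):

<pre>
import json, re

pages = {
    'Water': 'Water is a chemical compound of hydrogen and oxygen.',
    'Comet': 'A comet is an icy small Solar System body.',
}

# Map each word to the pages that contain it.
index = {}
for title, text in pages.items():
    for word in set(re.findall(r'[a-z]+', text.lower())):
        index.setdefault(word, []).append(title)

with open('slice-index.json', 'w', encoding='utf-8') as f:
    json.dump(index, f)

# A reader or a CGI script loads the same file to answer queries:
with open('slice-index.json', encoding='utf-8') as f:
    loaded = json.load(f)
print(loaded.get('comet'))   # -> ['Comet']
</pre>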
'''Mobile Lab (Apache Web Server)''':

Originally, as a mobile lab, I made one of the 40 XO laptops in my group an Apache web server. I created a simple HTML page linking to web pages (Wikipedia and others) that I had saved on the laptop using Firefox. This allowed all 40 students to view and read pages while traveling away from an internet connection, by browsing to my mesh address.
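The same trick works with nothing but Python on the serving laptop. A minimal sketch, assuming the pages were saved under a hypothetical <code>saved-pages/</code> directory:

<pre>
import os
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Serve a folder of browser-saved pages so other laptops on the mesh can
# read them at http://<this-laptop's-address>:8000/
os.chdir('saved-pages')
HTTPServer(('', 8000), SimpleHTTPRequestHandler).serve_forever()
</pre>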
'''Google Gears''':

As a recent focal point, Zvi Boshernitzan and Ben Lisbakken have both made offline Wikipedia readers using Google Gears that are pretty fast and offer some nice features in terms of letting you select a set of articles, cache them locally, and browse an index. We talked last week about how to integrate Gears more tightly into a browsing experience, with hopes of pursuing a prototype within a couple of weeks. It would be helpful to inform such a client with lists of good revisions of articles, such as those Martin Walker and Andrew Cates have developed for their own projects... and to plan for it to support offline editing as well as reading, using synchronization tools such as Mako's distributed wiki client.
== Todo ==

* wikipediaondvd - define what scripts are used to produce current CD and DVD images

* crawl WP for pictures from a given XML dump. (picture dumps are tarred, making this hard, sez andy-r.)
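Listing which image files a slice's articles actually reference is the easy half of that crawl, and it can use the same regex style as above. A small sketch; the sample wikitext is made up:

<pre>
import re

wikitext = "[[Image:Solar sys.png|thumb|The planets]] and [[File:Comet.jpg|a comet]]"

# Image references in wikitext look like [[Image:Name.ext|...]] or [[File:Name.ext|...]]
image_names = re.findall(r'\[\[(?:Image|File):([^|\]]+)', wikitext)
print(image_names)   # -> ['Solar sys.png', 'Comet.jpg']
</pre>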
=== Current scripts ===

Scripts that are currently used to make bundles:

'''PDF and single-document exports'''

* wiki2pdf

'''HTML dumps'''

* wikiwix export (being built): takes in a '''list of wikiwix entries''', outputs ?

* wikiwix interface (being improved): allows selection of a set of articles for a collection via a Firefox plugin
'''Summaries and weight-watchers'''

* Summarize list: takes in a '''list of article titles''', outputs a '''directory of one-paragraph html files with css''' (by Zdenek, not published yet)

* Compress images: take a set of pages and images, and reduce the images according to a slider (one rung of that slider is sketched after this list)

*: no images (remove altogether) v. hotlink images (include original thumbnail, alt text when offline)

*: include first {0-10} images on a page, with metadata

*: thumbnail only v. include full image (but not extra large) v. include all image sizes (full screen and more-than-fullscreen where available)

*:: bonus: assume local resize tool v. store 3 images for large instances
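As an illustration of one rung on that slider, every bundled image can be rewritten as a small thumbnail with the Pillow imaging library. A minimal sketch; the <code>images/</code> and <code>images-small/</code> directory names are hypothetical:

<pre>
import os
from PIL import Image   # Pillow

os.makedirs('images-small', exist_ok=True)
for name in os.listdir('images'):
    img = Image.open(os.path.join('images', name)).convert('RGB')
    img.thumbnail((200, 200))                          # keeps the aspect ratio
    out_name = os.path.splitext(name)[0] + '.jpg'
    img.save(os.path.join('images-small', out_name), 'JPEG', quality=60)
</pre>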
== See also ==

* [[WikiBrowse]], a condensed set of wiki pages served to [[Browse]] by a local specialized web server

* [[Wikibooks]], a set of wiki pages in PDF format

* [http://sugarlabs.org/go/Activities/InfoSlicer InfoSlicer], a tool to create a collection from content (such as Wikipedia pages) on the Internet

[[Category:Wiki texts]]
[[Category:Content]]