Bundle project: Difference between revisions

From OLPC
(from zb's scripts. need windows utf8 preservation tips)
(..)

Revision as of 21:05, 15 February 2008

guidelines for bundling

  • scripts: Capture and publish any scripts used to create the bundle. Others who notice missing metadata or other elements will want to rebuild the bundle with different parameters. Documenting these scripts carefully is also the easiest way to make sure attribution and other licensing requirements are met.
  • licenses: Note licensing and attribution as granularly as the original creators did: every image in a collection of images, every article in a set of articles, every definition in a set of definitions. If there is a simple way to pass on the aggregate history of collaborative works, include it; otherwise include a link to the source history for the work (or a script with options for extracting history, latest author, date, and similar fields in the format of the original archive).
  • other metadata: See the #metadata section below. Capture the original URL or source, and as many of the intervening authors, uploaders, and upload dates as possible, to help identify the provenance of a work accurately.
  • Check source archives for APIs for gathering such data. Many sites, including modern MediaWiki sites, have an API that will give you most of the information you need directly, without #screenscraping.
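
As a rough illustration of the API route, here is a minimal Python sketch for a MediaWiki site. It assumes the standard action API endpoint (the Commons URL below is only an example) and the imageinfo properties url, user, timestamp and extmetadata; whether extmetadata is available depends on the extensions installed on the site. The function names build_imageinfo_query and extract_provenance are hypothetical helpers, not part of any existing bundle script.

```python
from urllib.parse import urlencode

# Assumption: the wiki exposes the standard MediaWiki action API at this path.
API_URL = "https://commons.wikimedia.org/w/api.php"


def build_imageinfo_query(titles):
    """Return a URL asking for uploader, upload date, source URL and license data."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "imageinfo",
        "iiprop": "url|user|timestamp|extmetadata",
        "titles": "|".join(titles),
    }
    return API_URL + "?" + urlencode(params)


def extract_provenance(response):
    """Flatten a decoded JSON response into one provenance record per file."""
    records = []
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        for info in page.get("imageinfo", []):
            meta = info.get("extmetadata", {})
            records.append({
                "title": page.get("title"),
                "source_url": info.get("descriptionurl"),
                "uploader": info.get("user"),
                "upload_date": info.get("timestamp"),
                # Missing fields come back as None rather than raising.
                "license": meta.get("LicenseShortName", {}).get("value"),
                "author": meta.get("Artist", {}).get("value"),
            })
    return records
```

Fetching the URL (for example with urllib.request) and decoding the JSON is left out so the sketch stays offline; the records it returns can be stored alongside each file in the bundle.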

screenscraping

...and regular expressions

extracting licenses from Wikimedia Commons

:%s/<a href="[^h][^>]*>\([^<]*\)<\/a>/\1/gc
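
For scripted use outside vim, the substitution above can be sketched in Python. This is an approximate translation, not the original script: the pattern is the same (an anchor whose href does not start with "h", so not an http or https link, replaced by its link text), but there is no per-match confirmation as with vim's c flag. The name unlink_relative is a hypothetical helper.

```python
import re

# Same pattern as the vim command, in Python regex syntax: an anchor whose
# href does not begin with "h", capturing the link text between the tags.
RELATIVE_LINK = re.compile(r'<a href="[^h][^>]*>([^<]*)</a>')


def unlink_relative(html):
    """Replace each matching anchor with its link text."""
    return RELATIVE_LINK.sub(r"\1", html)
```

Absolute links are left untouched, which keeps the links back to license pages and source URLs intact.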

 :%s/<hr\_p\{-}

.*<\/p>//c (rm gfdl-template excess)

 vim -c "%s///g" -c wq <file>

 for a in *.txt; do mv "$a" "${a%.txt}.baz"; done

 ( or mv "$a" "${a#proto-}"; )
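
The two shell renames (swap a trailing .txt for another extension, or strip a leading proto-) can also be sketched in Python for platforms without a POSIX shell. The helper names swap_suffix, strip_prefix and rename_all are hypothetical, and the refusal to overwrite existing files is an added assumption, not something the shell loop does.

```python
from pathlib import Path


def swap_suffix(name, old, new):
    """Like "${a%.txt}.baz": drop a trailing `old` if present, then append `new`."""
    stem = name[: -len(old)] if old and name.endswith(old) else name
    return stem + new


def strip_prefix(name, prefix):
    """Like "${a#proto-}": drop a leading `prefix` if present."""
    return name[len(prefix):] if name.startswith(prefix) else name


def rename_all(directory, pattern, new_name):
    """Rename every file matching `pattern`; return the new names, sorted."""
    renamed = []
    # sorted() materializes the listing before any file is renamed.
    for path in sorted(Path(directory).glob(pattern)):
        target = path.with_name(new_name(path.name))
        # Skip no-op renames and never overwrite an existing file.
        if target != path and not target.exists():
            path.rename(target)
            renamed.append(target.name)
    return sorted(renamed)
```

For example, rename_all(".", "*.txt", lambda n: swap_suffix(n, ".txt", ".baz")) mirrors the for loop above.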

topics and scope

Try to pick a topic that can be covered elegantly in a compact bundle. Most laptop bundles should be under 10 MB in size. If you think your topic can't possibly be covered this way, consider a smaller scope, the same scope with less depth, or the same topic at a different level of abstraction. Larger collections (up to 1 GB in size) can be packaged for a school library.
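
A quick way to check a bundle against these size targets is to total the files under its directory. The sketch below is only illustrative: it assumes the limits are binary megabytes and gigabytes, reads "under" as strictly less than and "up to" as inclusive, and the names bundle_size and classify are hypothetical.

```python
import os

# Assumption: the size targets are read as binary megabytes and gigabytes.
LAPTOP_LIMIT = 10 * 1024 ** 2
LIBRARY_LIMIT = 1024 ** 3


def bundle_size(root):
    """Total size in bytes of all files under `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total


def classify(size_bytes):
    """Map a size to the kind of bundle it fits: laptop, library, or too large."""
    if size_bytes < LAPTOP_LIMIT:
        return "laptop"
    if size_bytes <= LIBRARY_LIMIT:
        return "library"
    return "too large"
```

Running classify(bundle_size(path)) before publishing gives an early hint that the scope may need trimming.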