Content bundle making script
Latest revision as of 20:05, 29 October 2008
Intro
Quick hack of a script that takes an index page on the web and pulls it, and all the pages it references, into a library bundle. Useful for things like Wikislices.
When the script finishes (it is quiet and may take several minutes as it downloads files), you should have a bundlename.xol file in the directory you ran the script from, ready to go on the XO.
Caution
This script is only suitable for use from starting pages where the link topology is well understood. Even going one layer deep from a Wikipedia page may retrieve copyrighted content, making the resulting bundle unsuitable for redistribution by OLPC. This is an experimental tool, and the resulting bundles should not be deposited in the OLPC Library without careful review of potential copyright issues.
Improvements
todo
These are things I know should be fixed in the script - if you have the time and inclination, please do them.
- Make the library.info file generation less horribly hackish. Right now it just echoes all the needed information into the file, and the implementation is very brittle because the library.info format and default values are hard-coded into the script.
- make the script clean up after itself. It leaves a bunch of cruft hanging around that isn't the .xol file.
- Do some basic validation of the input arguments, e.g. URL starts with http, ftp, etc.; DEST does not already exist as a file or directory, and neither does DEST.xol; DEST does not contain a slash "/"; LANG is a supported language.
- Check return codes of key commands you execute, i.e. "if [ $? -ne 0 ]; then". In many instances a failure is a fatal error, so you'd clean up and exit with some message. For example, check the return codes of wget, mkdir and zip.
- Use 'rm -rf "$2"' with extreme caution: what if someone specified "/" as $2? As noted above, you should ensure that $2 does not already exist, etc.
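The validation, return-code and rm -rf items above could be sketched roughly as follows. This is only an illustration, not part of the script: the helper names (die, run, valid_url, valid_dest, valid_lang, check_args) are made up here, and since the page does not say which languages count as "supported", the language check only assumes a two-letter lowercase code.

```shell
#!/bin/sh
# Hypothetical helpers; none of these names exist in the original script.

# Print a message to stderr and exit with a failure status.
die() {
    echo "bundlemaker: $*" >&2
    exit 1
}

# Run a command and treat any non-zero return code as fatal.
run() {
    "$@" || die "command failed: $*"
}

# Succeed only if the URL starts with http://, https:// or ftp://.
valid_url() {
    case "$1" in
        http://*|https://*|ftp://*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeed only if the bundle name is a plain name: not empty, not "." or
# "..", no slash (so never "/"), and no leading dash.
valid_dest() {
    case "$1" in
        ""|.|..|*/*|-*) return 1 ;;
        *) return 0 ;;
    esac
}

# Assumption: a language code is exactly two lowercase letters.
valid_lang() {
    case "$1" in
        [a-z][a-z]) return 0 ;;
        *) return 1 ;;
    esac
}

# Check URL, DEST and language in turn; exit on the first problem found.
check_args() {
    valid_url "$1"    || die "URL must start with http://, https:// or ftp://"
    valid_dest "$2"   || die "bundle name must be a plain name without '/'"
    [ ! -e "$2" ]     || die "'$2' already exists"
    [ ! -e "$2.xol" ] || die "'$2.xol' already exists"
    valid_lang "$3"   || die "language must be a 2-letter code such as en"
}
```

With something like this in place, the script could call check_args "$1" "$2" "$3" right after the argument-count test, and wrap later steps as run mkdir -p ... or run zip .... Because valid_dest rejects any name containing a slash, the final rm -rf "$2" could no longer be pointed at "/".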
pending
- waiting for Project hosting. Mchua
- Test the script on several index-type sites, try out the generated .xol files on your XO, and see if they work. Post results to the talk page. Mchua
  - It works on some plain-vanilla index.html pages, but not on others.
  - It does not work on MediaWiki pages like http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Wikislice/Dinosaurs. This will be the second thing to fix.
  - This needs more work.
done
- Set the arguments $1, $2, $3 to local variables to make it more readable, e.g. URL=$1, DEST=$2, LANG=$3
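A minimal sketch of that renaming is shown below. The function name name_args and the variable name CONTENT_LANG are made up for this illustration; CONTENT_LANG is used instead of LANG because LANG is the standard locale environment variable, and reassigning it could change how wget, grep and other child commands behave.

```shell
#!/bin/sh
# Sketch only: copy the positional parameters into named variables.
# name_args and CONTENT_LANG are hypothetical names, not from the script.
name_args() {
    URL=$1            # url to download from
    DEST=$2           # folder name to download to, also name of bundle
    CONTENT_LANG=$3   # language code the content is in (en, es, pt, etc.)
}
```

The script would call name_args "$@" once after the argument-count check and then refer to "$URL", "$DEST" and "$CONTENT_LANG" instead of "$1", "$2" and "$3".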
Usage
Copy the script into a directory on your machine and make it executable. Usage is bundlemaker url bundlename languagecode
where
- url is the URL you want to turn into your index page
- bundlename is the name you want your bundle to have (the output file will be bundlename.xol)
- languagecode is the 2-letter language code the materials are written in; for instance, en, es, pt
example:
./bundlemaker http://www.nlm.nih.gov/medlineplus/healthtopics.html healthbundle en
Script
#!/bin/sh
## usage: bundlemaker <url> <bundlename> <language>
## $1 - url to download from
## $2 - folder name to download to, also name of bundle
## $3 - language code this content is in (en, es, pt, etc.)

# make sure there are 3 arguments
if [ $# -ne 3 ]; then
    # Lines in this file starting with a double hash are usage docs,
    # spit them out if the arguments look weird.
    grep '^##' "$0" | sed 's/^## //' 1>&2
    exit 127
fi

# get the files from <url>
# place them in a folder called <bundlename> in cwd
# create a log at <bundlename>-log for debugging
wget -rp -nH -l1 -o "${2}-log" -P "$2" "$1"

# create the metadata
TARGET_DIR="$2"/library
mkdir -p "${TARGET_DIR}"

# note: this is a stupid script! please fix it!
# see http://wiki.laptop.org/go/Sample_library.info_file
# ('[Library]' is quoted so the shell cannot expand it as a glob)
( echo '[Library]'
  echo name = "$2"
  echo global_name = "$2"
  echo long_name = "$2"
  echo category = media
  echo library_version = 1
  echo host_version = 1
  echo l10n = false
  echo locale = "$3" ) > "${TARGET_DIR}/library.info"

# zip and rename as bundle
zip -9 -rq "$2.xol" "$2"

# delete extra files
rm -rf "$2"
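One possible way to make the library.info step less brittle is to move it into a function that writes the file from a here-document and checks that each step succeeded. The sketch below is an assumption about how that might look, not the author's implementation: the function names die and write_library_info and the optional category argument are invented here, while the field names and default values are the ones the script already writes.

```shell
#!/bin/sh
# Hypothetical helpers; only the field names and values come from the script.

# Print a message to stderr and exit with a failure status.
die() {
    echo "bundlemaker: $*" >&2
    exit 1
}

# Write library.info for a bundle.
#   $1 - bundle name (used for name, global_name and long_name)
#   $2 - locale, e.g. en
#   $3 - directory to write into, e.g. "$DEST/library"
#   $4 - optional category, defaults to "media"
# Note: plain sh has no local variables, so these names are global.
write_library_info() {
    info_name=$1
    info_locale=$2
    info_dir=$3
    info_category=${4:-media}

    mkdir -p "$info_dir" || die "cannot create $info_dir"
    cat > "$info_dir/library.info" <<EOF || die "cannot write library.info"
[Library]
name = $info_name
global_name = $info_name
long_name = $info_name
category = $info_category
library_version = 1
host_version = 1
l10n = false
locale = $info_locale
EOF
}
```

In the script, the mkdir and the parenthesised echo block could then be replaced by a single call such as write_library_info "$2" "$3" "$2/library". The format is still hard-coded, but it lives in one place and the defaults can be overridden per call.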