Scanning

From OLPC
Revision as of 15:22, 9 September 2007 by Ssb22 (talk | contribs) (added a reference to my de-rotate script and other stuff)

Many people will be scanning books and other existing print materials to repurpose them for use on the 2B1, either in DjVu ebook format or in other image formats. To help them, this page collects tips for setting up and running a scanning workflow.

This piece of advice comes from a gentleman with considerable experience in scanning copyright-free images from old books:

I routinely scan pages of old books (and other documents) for my Web site, From Old Books (http://www.fromoldbooks.org). I use an Epson Expression 10000 XL, which, as someone else noted, isn't cheap, but it does A3/11x17/tabloid at 2800dpi. At 400dpi grayscale it can scan a regular page in a few seconds.

I've also used a Casio Exilim camera to photograph pages.

The way that it's done for archival purposes is to have a mount that holds a book and also holds a medium-format camera about four feet away. To get good resolution for OCR you'll need a camera of about 11 megapixels or more for a full page with (say) 7x10 inches of actual text. The panorama tools (Hugin, PTStitcher and friends) include software to correct for lens distortion. Phase One sells a camera mount (in Canada you can get it from Vistek [vistek.ca], together with their 40 megapixel back end for a medium-format camera). Or you could make a suitable mount yourself. The trick is that it holds the book open half-way (or less, using mirrors) so that you don't get as much page distortion. Holding the book and the camera rock steady is absolutely necessary if you are photographing text.
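The 11 megapixel figure is consistent with covering a 7x10 inch text area at the 400dpi recommended elsewhere on this page, although the link between the two numbers is an inference rather than something stated above. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
def required_megapixels(width_in, height_in, dpi):
    """Megapixels needed to cover width_in x height_in inches at the given dpi.

    Hypothetical helper for illustration; it ignores margins, so a real
    camera needs some headroom on top of this figure.
    """
    pixels_wide = width_in * dpi
    pixels_high = height_in * dpi
    return pixels_wide * pixels_high / 1_000_000


# 7 x 10 inches of text at 400dpi is 2800 x 4000 pixels.
print(required_megapixels(7, 10, 400))  # prints 11.2
```

The same function shows why larger pages quickly outgrow ordinary cameras: the pixel count scales with the page area.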

For small items like a cheque (say), use a flatbed scanner, and scan at 400dpi grayscale. Project Gutenberg's guidelines are outdated (they use 300dpi black and white as I recall) and don't get such good results. If you go much higher than 400dpi, the OCR software starts having tantrums at you and the quality may actually degrade.

The best OCR software on the market today as far as I can tell is ABBYY FineReader. I tried several, and found this had, for example, at least two orders of magnitude fewer errors than the GNU OCR package. You should expect errors, though, especially in digits.

Frankly I'd go with a scanner just because they're designed for this application, and you have less hassle. Transferring images from the camera to the computer twenty minutes after taking the photo means you need to keep a separate log of where each photo came from, or you'll muddle them up. I save images with filenames like Ball-Sussex/086-Pevensey-Castle.png so that the page number is in the filename. And the image quality with even a low-end scanner is much higher than you can get in practice with a camera without an elaborate set-up. It is also more reliable: it comes out well every time regardless of lighting, camera settings, wobbly hands, etc.
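A naming scheme like the one above is easy to generate automatically. The following is only a sketch: the function name, the three-digit zero padding and the rule for turning a caption into a hyphenated slug are assumptions read off the single example filename, not a description of Liam's actual tooling.

```python
import re


def page_filename(book, page, caption, ext="png"):
    """Build a path such as Ball-Sussex/086-Pevensey-Castle.png.

    Hypothetical helper: the page number is zero-padded to three digits
    so that files sort in page order, and runs of non-alphanumeric
    characters in the caption are collapsed to single hyphens.
    """
    slug = re.sub(r"[^A-Za-z0-9]+", "-", caption).strip("-")
    return f"{book}/{page:03d}-{slug}.{ext}"


print(page_filename("Ball-Sussex", 86, "Pevensey Castle"))
# prints Ball-Sussex/086-Pevensey-Castle.png
```

Zero padding matters because most file listings sort names as text, so an unpadded page 9 would sort after page 86.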

Having said all that, I do photograph pages sometimes to make manual transcriptions. Afterwards I do careful proof-reading against the original.

Liam

Comment from ssb22: I have great difficulty in lining up pages straight on the scanner, so I gave up and wrote a simple script to assist in "de-rotating" any pages that have been rotated. I'm sure there's commercial software out there that does it as well, but this script relies only on free software. Details on http://people.pwf.cam.ac.uk/ssb22/notes/. That page also includes some notes on how to break up a scanned page into sections (or even into individual words) so that it can be enlarged and split across multiple screens, which is useful for people with low vision (straightforward magnification is OK but it requires a lot of panning around, and it's better if the material can be logically split up and laid out again).
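To illustrate what "de-rotating" involves, here is a minimal sketch of one common way to estimate page skew, the projection-profile method. It is not ssb22's script, whose method is not described here; the function names, the angle range and the step size are all assumptions. The idea is to try a range of small rotations and keep the one that packs the dark pixels into the fewest, fullest rows, which is what straight lines of text look like.

```python
import math


def rotate(points, degrees):
    """Rotate (x, y) points about the origin by the given angle."""
    t = math.radians(degrees)
    s, c = math.sin(t), math.cos(t)
    return [(x * c - y * s, x * s + y * c) for x, y in points]


def estimate_correction(points, max_angle=5.0, step=0.5):
    """Return the rotation (degrees) that best aligns points into rows.

    points are the coordinates of dark pixels. Each candidate angle is
    scored by the sum of squared row counts after rotation; the score is
    highest when the pixels collapse onto a few dense rows.
    """
    best_angle, best_score = 0.0, -1
    n_steps = int(round(2 * max_angle / step))
    for i in range(n_steps + 1):
        angle = -max_angle + i * step
        t = math.radians(angle)
        s, c = math.sin(t), math.cos(t)
        rows = {}
        for x, y in points:
            row = round(x * s + y * c)  # y coordinate after rotating
            rows[row] = rows.get(row, 0) + 1
        score = sum(n * n for n in rows.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


# Three synthetic "text lines", skewed by 3 degrees.
text_lines = [(x, y) for y in (0, 20, 40) for x in range(200)]
skewed = rotate(text_lines, 3.0)
print(estimate_correction(skewed))  # prints -3.0
```

On a real scan the points would come from thresholding the image, and the estimated angle would then be passed to an image tool to rotate the page back.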