User:Cjl/cni-dictionary

From OLPC
Revision as of 02:10, 20 October 2012 by Cjl (talk | contribs)

The goal is to transform a 456-page image-PDF file of an Asháninka-Spanish and Spanish-Asháninka dictionary into OCR'ed digital text. That text is the starting point for any number of efforts.

In preparation, download one of the zips below (each contains 10 images) and mark it reserved on the wiki.


1) Go to: http://www.onlineocr.net/Default.aspx

2) Scroll down to the bottom of the page

3) Click "Browse" (lower right)

4) Select the next file from wherever you saved it.

5) Click "Upload"

6) Set Recognition language: to Spanish [on pulldown]

7) Set Output format: to Text Plain (txt) [on pulldown]

8) Enter 6-digit CAPTCHA number

9) Click "Recognize" (lower left), wait for it to process.

10) Scroll further down to the bottom of the page

11) Click "Download Output File"

12) Save file as Pagennn.txt

13) Repeat the process from step 3 with the next file. The pulldown settings should be preserved, but visually check to be sure.

When complete, place the resulting text files into a folder named "Batch-cni-dict-TXTnn" (where nn is the batch number) and upload it to the wiki as a .zip file.
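
For anyone who would rather script the final packaging step than build the folder and zip by hand, the following is a minimal sketch in Python. It assumes that "nnn" in Pagennn.txt is a three-digit page number and that "nn" in the folder name is a two-digit batch number; neither padding is confirmed above. The function name pack_batch and its parameters are hypothetical, chosen only for illustration.

```python
import re
import zipfile
from pathlib import Path

# Assumption: page files are named Page001.txt, Page002.txt, ... (three digits).
PAGE_NAME = re.compile(r"^Page\d{3}\.txt$")


def pack_batch(source_dir, batch_number, dest_dir="."):
    """Zip every Pagennn.txt in source_dir into Batch-cni-dict-TXTnn.zip.

    Hypothetical helper: the files are stored inside a folder named
    Batch-cni-dict-TXTnn within the archive. Returns the path of the zip.
    """
    source = Path(source_dir)
    pages = sorted(
        p for p in source.iterdir()
        if p.is_file() and PAGE_NAME.match(p.name)
    )
    if not pages:
        raise ValueError(f"no Pagennn.txt files found in {source}")

    # Assumption: the batch number is zero-padded to two digits.
    folder = f"Batch-cni-dict-TXT{batch_number:02d}"
    zip_path = Path(dest_dir) / f"{folder}.zip"

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for page in pages:
            # arcname places each file under the batch folder in the zip.
            archive.write(page, arcname=f"{folder}/{page.name}")
    return zip_path
```

Calling pack_batch("ocr-output", 3) would, under those assumptions, write Batch-cni-dict-TXT03.zip to the current directory; files that do not match the Pagennn.txt pattern are left out. Uploading the zip to the wiki remains a manual step.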