User:Cjl/cni-dictionary

From OLPC

The goal is to transform a 456-page image PDF of an Asháninka-Spanish and Spanish-Asháninka dictionary into OCR'ed digital text. That digitized text is the starting point for any number of later efforts.

Here is the full image PDF, File:Dt19.pdf. It has been broken down into individual pages, which are grouped in batches below for processing via OCR.

In preparation, download one of the zips below (each contains 10 images) and mark it as reserved on the wiki.


1) Go to: http://www.onlineocr.net/Default.aspx

2) Scroll down to the bottom of the page

3) Click "Browse" (lower right)

4) Select the next image file from wherever you saved it.

5) Click "Upload"

6) Set Recognition language: to Spanish [on pulldown]

7) Set Output format: to Text Plain (txt) [on pulldown]

8) Enter 6-digit CAPTCHA number

9) Click "Recognize" (lower left), wait for it to process.

10) Scroll further down to the bottom of the page

11) Click "Download Output File"

12) Save the file as Pagennn.txt (where nnn is the page number)

13) Repeat the process from step 3 with the next file. The pulldown settings should be preserved, but check them visually to be sure.
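
Before uploading, it can help to confirm that every page in the batch really has a text file. The following is a minimal sketch, not part of the original procedure. It assumes that nnn in Pagennn.txt is the page number zero-padded to three digits, and that you know the first page number of your batch; the function names expected_names and missing_pages are hypothetical.

```python
from pathlib import Path


def expected_names(first_page, count=10):
    """Return the text file names expected for one batch.

    Assumption: 'nnn' is the page number, zero-padded to three digits.
    """
    return [f"Page{n:03d}.txt" for n in range(first_page, first_page + count)]


def missing_pages(folder, first_page, count=10):
    """List expected text files that are absent or empty in `folder`."""
    folder = Path(folder)
    missing = []
    for name in expected_names(first_page, count):
        path = folder / name
        # An empty file usually means the OCR download went wrong.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

For example, missing_pages("ocr-output", 11) would report which of Page011.txt through Page020.txt still need to be produced (the folder name here is only an illustration).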

When complete, place the resulting text files into a folder named "Batch-cni-dict-TXTnn" (where nn is the batch number) and upload it to the wiki as a .zip file. Then mark that batch as complete.
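
The packaging step can also be done with a short script instead of by hand. This is a sketch under stated assumptions: nn is taken to be the batch number zero-padded to two digits, the text files are assumed to match Page*.txt, and the function name zip_batch is hypothetical. It only builds the .zip; uploading to the wiki and marking the batch complete are still manual.

```python
import zipfile
from pathlib import Path


def zip_batch(txt_dir, batch_number, out_dir="."):
    """Pack the Page*.txt files from `txt_dir` into Batch-cni-dict-TXTnn.zip.

    Assumption: 'nn' is the batch number, zero-padded to two digits.
    Files are stored inside a folder of the same name within the archive.
    """
    folder_name = f"Batch-cni-dict-TXT{batch_number:02d}"
    zip_path = Path(out_dir) / f"{folder_name}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for txt in sorted(Path(txt_dir).glob("Page*.txt")):
            # arcname places each file under the batch folder in the archive
            zf.write(txt, arcname=f"{folder_name}/{txt.name}")
    return zip_path
```

Calling zip_batch("ocr-output", 3) would write Batch-cni-dict-TXT03.zip to the current directory, ready to upload.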