Literacy Project/Data Processing Notes

From OLPC
Revision as of 16:34, 9 July 2012 by CScott (talk | contribs) (Even more notes)

Some notes on processing .zip files received from the field:

Prerequisites: an account on hydro, membership in the literacy group, and write access to /home/ethiopia. To check all three:

$ ssh hydro.laptop.org
cscott@hydro:/home/ethiopia$ groups
cscott literacy
cscott@hydro:/home/ethiopia$ cd /home/ethiopia/
cscott@hydro:/home/ethiopia$ touch do-i-have-write-access
cscott@hydro:/home/ethiopia$ rm do-i-have-write-access 
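
The same check can be scripted. What follows is only a minimal sketch: check_prereqs is a hypothetical helper name, not something installed on hydro, and it probes write access with a uniquely named temporary file instead of a fixed file name.

```shell
# Hypothetical helper (not part of the original notes).
# check_prereqs GROUP DIR: succeed only if the current user belongs to
# GROUP and can create a file in DIR.
check_prereqs() {
    local group="$1" dir="$2" probe
    # id -nG prints the names of every group the current user is in.
    if ! id -nG | tr ' ' '\n' | grep -qx -- "$group"; then
        echo "not a member of group: $group" >&2
        return 1
    fi
    # Same idea as the touch/rm test, but with a unique file name.
    if ! probe=$(mktemp "$dir/write-test.XXXXXX" 2>/dev/null); then
        echo "no write access to: $dir" >&2
        return 1
    fi
    rm -f -- "$probe"
}
```

On hydro this would be called as: check_prereqs literacy /home/ethiopia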

The wolonchete/ and wonchi/ directories are rsynced to worldliteracy.media.mit.edu:

cscott@hydro:/home/ethiopia$ ls wolonchete/ wonchi/
wolonchete/:
2012-03-30  results.txt            wolonchete_2012-05-27  wolonchete_2012-06-24
2012-04-06  wolonchete_2012-05-01  wolonchete_2012-06-03
2012-04-13  wolonchete_2012-05-08  wolonchete_2012-06-10
2012-04-20  wolonchete_2012-05-17  wolonchete_2012-06-17

wonchi/:
2012-02-14  2012-03-30  results.txt        wonchi_2012-05-23  wonchi_2012-06-21
2012-02-20  2012-04-06  wonchi_2012-04-26  wonchi_2012-05-31
2012-02-28  2012-04-13  wonchi_2012-05-03  wonchi_2012-06-07
2012-03-08  2012-04-20  wonchi_2012-05-10  wonchi_2012-06-14
cscott@hydro:/home/ethiopia$ 

The archive/ directory contains zip files copied from the data collection keys. There should be two files per USB key (a wolonchete .zip and a wonchi .zip). BE SURE to verify that the contents of each .zip file match its name, so you don't inadvertently overwrite something.

cscott@hydro:/home/ethiopia$ ls archive
wolonchete_2012-06-03.zip       wonchi_2012-05-31.zip
wolonchete_2012-06-10.zip       wonchi_2012-06-07.zip
wolonchete_2012-06-17.orig.zip  wonchi_2012-06-14.zip
wolonchete_2012-06-17.zip       wonchi_2012-06-21.zip
wolonchete_2012-06-24.zip

These are the two most recent files received from Ethiopia:

cscott@hydro:/home/ethiopia$ ls ~rsmith/w*.zip
/home/rsmith/wolonchete_2012-07-01.zip  /home/rsmith/wonchi_2012-06-28.zip
cscott@hydro:/home/ethiopia$ unzip -v /home/rsmith/wolonchete_2012-07-01.zip | head
Archive:  /home/rsmith/wolonchete_2012-07-01.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
       0  Stored        0   0% 2012-07-04 09:37 00000000  wolonchete_2012-07-01/
       0  Stored        0   0% 2012-07-04 09:27 00000000  wolonchete_2012-07-01/01/
       0  Stored        0   0% 1979-12-31 16:00 00000000  wolonchete_2012-07-01/01/edu.mit.media.funf.bgcollector/
    8200  Defl:N     2543  69% 2012-06-24 09:06 5209af9a  wolonchete_2012-07-01/01/edu.mit.media.funf.bgcollector/00000180_0388920440601417_7c429084-7a15-48f1-a458-957681c105fd_1340465667_mainPipeline.db
    6152  Defl:N     1415  77% 2012-06-24 09:06 5d96a57f  wolonchete_2012-07-01/01/edu.mit.media.funf.bgcollector/00000181_0388920440601417_7c429084-7a15-48f1-a458-957681c105fd_1340552241_mainPipeline.db
    6152  Defl:N      811  87% 2012-06-24 09:06 541732f9  wolonchete_2012-07-01/01/edu.mit.media.funf.bgcollector/00000182_0388920440601417_7c429084-7a15-48f1-a458-957681c105fd_1340552241_mainPipeline.db
  114696  Defl:N    16867  85% 2012-06-24 09:37 a86306dd  wolonchete_2012-07-01/01/edu.mit.media.funf.bgcollector/00000183_0388920440601417_7c429084-7a15-48f1-a458-957681c105fd_1340554005_mainPipeline.db
cscott@hydro:/home/ethiopia$ 

Note that the top-level directory name inside the archive matches the .zip file name, which is what we want. (The wonchi file should be checked the same way.)
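
Eyeballing the head of the listing only covers the first few entries. As a rough sketch, the comparison could be automated over every entry. The function names check_top_dir and zip_matches_name are made up for this illustration, and the sketch assumes the Info-ZIP unzip on hydro supports the -Z1 (zipinfo, names only) listing mode.

```shell
# Hypothetical helper: read archive entry names on stdin, one per line, and
# succeed only if they all sit under a single top-level directory EXPECTED.
check_top_dir() {
    local expected="$1" tops
    # Keep the first path component of each entry, then dedupe.
    tops=$(cut -d/ -f1 | sort -u)
    if [ "$tops" = "$expected" ]; then
        return 0
    fi
    echo "MISMATCH: expected '$expected', found: $tops" >&2
    return 1
}

# Hypothetical wrapper: compare a zip file's entries to its own file name.
zip_matches_name() {
    unzip -Z1 "$1" | check_top_dir "$(basename "$1" .zip)"
}
```

For example: zip_matches_name /home/rsmith/wonchi_2012-06-28.zip && echo OK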

Now let's unzip the files:

cscott@hydro:/home/ethiopia$ cd
cscott@hydro:~$ mkdir temp ; cd temp
cscott@hydro:~/temp$ unzip /home/rsmith/wonchi_2012-06-28.zip 
cscott@hydro:~/temp$ unzip /home/rsmith/wolonchete_2012-07-01.zip 
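
If ~/temp is left over from an earlier run, old files in it would be mixed in with the new ones. One possible way around that, sketched here with a hypothetical helper name (unpack_fresh), is to unpack into a freshly created scratch directory each time:

```shell
# Hypothetical helper (not part of the original notes).
# unpack_fresh PARENT ZIP...: create a new, empty scratch directory under
# PARENT, unzip each ZIP into it, and print the scratch directory's path.
unpack_fresh() {
    local parent="$1" scratch zip
    shift
    scratch=$(mktemp -d "$parent/unpack.XXXXXX") || return 1
    for zip in "$@"; do
        # -q: quiet; -d: extract into the scratch directory.
        unzip -q "$zip" -d "$scratch" || return 1
    done
    printf '%s\n' "$scratch"
}
```

For example: cd "$(unpack_fresh ~ /home/rsmith/wonchi_2012-06-28.zip /home/rsmith/wolonchete_2012-07-01.zip)"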

We're going to check for duplicates:

cscott@hydro:~/temp$ cp ~rsmith/bin/rdfind ~/bin
cscott@hydro:~/temp$ rdfind wolonchete_2012-07-01/ wonchi_2012-06-28/
Now scanning "wolonchete_2012-07-01", found 13727 files.
Now scanning "wonchi_2012-06-28", found 7630 files.
Now have 21357 files in total.
Removed 0 files due to nonunique device and inode.
Now removing files with zero size from list...removed 36 files
Total size is 3198679889 bytes or 3 Gib
Now sorting on size:removed 6240 files due to unique sizes from list.15081 files left.
Now eliminating candidates based on first bytes:removed 3424 files from list.11657 files left.
Now eliminating candidates based on last bytes:removed 9294 files from list.2363 files left.
Now eliminating candidates based on md5 checksum:removed 2363 files from list.0 files left.
It seems like you have 0 files that are not unique
Totally, 0 b can be reduced.
Now making results file results.txt
cscott@hydro:~/temp$ 

This gives us confidence that there are no gross errors in the data, such as the wonchi and wolonchete zips being identical. We should also repeat the rdfind run across the entire dataset (later), to make sure this delivery does not contain stale data we already have.
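
For machines without rdfind, a cruder substitute can be put together from standard tools. This sketch is not rdfind: it hashes every non-empty file instead of first narrowing candidates by size and by first and last bytes, so it will be slower on a dataset this large. The name list_duplicates is hypothetical, and the sketch assumes GNU coreutils (sha256sum, and uniq with -w and -D).

```shell
# Hypothetical helper (a plain checksum comparison, not rdfind).
# list_duplicates DIR...: print "hash  path" for every non-empty file whose
# content hash is shared with at least one other file. No output means no
# duplicates were found.
list_duplicates() {
    # sha256sum prints a 64-character hex digest first on each line, so
    # sorting groups equal digests together and uniq -w64 compares only
    # the digest. -D prints every member of each duplicate group.
    find "$@" -type f -size +0 -exec sha256sum {} + | sort | uniq -w64 -D
}
```

Across the whole dataset this would be: list_duplicates /home/ethiopia/wolonchete /home/ethiopia/wonchi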

OK, this looks good. Let's move the zip files into archive/ and the unpacked directories into the matching village directories under /home/ethiopia:

cscott@hydro:~/temp$ mv /home/rsmith/wonchi_2012-06-28.zip /home/rsmith/wolonchete_2012-07-01.zip /home/ethiopia/archive/
cscott@hydro:~/temp$ mv wolonchete_2012-07-01/ /home/ethiopia/wolonchete/
cscott@hydro:~/temp$ mv wonchi_2012-06-28/ /home/ethiopia/wonchi/
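
Since the main risk here is overwriting an earlier delivery, the moves could be guarded. The following is a sketch only, with a hypothetical helper name (move_no_clobber). It refuses to move anything if any target name already exists in the destination.

```shell
# Hypothetical helper (not part of the original notes).
# move_no_clobber DESTDIR SRC...: move each SRC into DESTDIR, but only if
# none of them would replace something already there.
move_no_clobber() {
    local dest="$1" src target
    shift
    if [ ! -d "$dest" ]; then
        echo "not a directory: $dest" >&2
        return 1
    fi
    # First pass: check every target before touching anything.
    for src in "$@"; do
        target="$dest/$(basename "$src")"
        if [ -e "$target" ]; then
            echo "refusing to overwrite: $target" >&2
            return 1
        fi
    done
    mv -- "$@" "$dest"/
}
```

For example: move_no_clobber /home/ethiopia/wonchi wonchi_2012-06-28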