Literacy Project/Data Processing Notes
Some notes on processing .zip files received from the field:
Prerequisites
You need an account on hydro, membership in the literacy group, and write access to /home/ethiopia:
$ ssh hydro.laptop.org
cscott@hydro:/home/ethiopia$ groups
cscott literacy
cscott@hydro:/home/ethiopia$ cd /home/ethiopia/
cscott@hydro:/home/ethiopia$ touch do-i-have-write-access
cscott@hydro:/home/ethiopia$ rm do-i-have-write-access
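The same two checks can be wrapped in a small function so they are easy to repeat. This is only a sketch: the name check_prereqs and its messages are made up here, and it assumes mktemp is available for the write probe.

```shell
# Sketch: succeed only if the current user is in GROUP and can write to DIR.
# check_prereqs is a hypothetical helper, not an existing script.
check_prereqs() {
    group=$1
    dir=$2
    # id -Gn prints the names of all groups the current user belongs to.
    if ! id -Gn | tr ' ' '\n' | grep -qxF -- "$group"; then
        echo "not a member of group: $group" >&2
        return 1
    fi
    # Same touch/rm probe as the manual steps, but with a unique file name.
    probe=$(mktemp "$dir/do-i-have-write-access.XXXXXX") || {
        echo "cannot write to: $dir" >&2
        return 1
    }
    rm -f -- "$probe"
}
```

On hydro this would be run as: check_prereqs literacy /home/ethiopia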
Archiving, unpacking, and checking the data
wolonchete/ and wonchi/ are rsynced to worldliteracy.media.mit.edu
cscott@hydro:/home/ethiopia$ ls wolonchete/ wonchi/
wolonchete/:
2012-03-30  results.txt            wolonchete_2012-05-27  wolonchete_2012-06-24
2012-04-06  wolonchete_2012-05-01  wolonchete_2012-06-03
2012-04-13  wolonchete_2012-05-08  wolonchete_2012-06-10
2012-04-20  wolonchete_2012-05-17  wolonchete_2012-06-17

wonchi/:
2012-02-14  2012-03-30  results.txt        wonchi_2012-05-23  wonchi_2012-06-21
2012-02-20  2012-04-06  wonchi_2012-04-26  wonchi_2012-05-31
2012-02-28  2012-04-13  wonchi_2012-05-03  wonchi_2012-06-07
2012-03-08  2012-04-20  wonchi_2012-05-10  wonchi_2012-06-14
cscott@hydro:/home/ethiopia$
Note that some of these directories carry a site-name prefix (for example wolonchete_2012-05-01) and some do not. We originally recorded the date the USB key arrived at OLPC; we later fixed that and started recording the collection dates. The prefixed directories are named with the actual collection dates. The unprefixed directories are named with our best guess at the collection date.
The archive/ directory contains zip files copied from the data collection keys. There should be two files per USB key (a wolonchete .zip and a wonchi .zip). BE SURE to verify that the contents of each .zip file match its name, so you don't inadvertently overwrite something.
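The naming rule can be written down as a pattern match. A minimal sketch, assuming the only two sites are wolonchete and wonchi and that dates are always YYYY-MM-DD; the function name and the three labels it prints are invented for illustration.

```shell
# Sketch: classify a batch directory name by the naming convention.
# classify_batch_dir and its output labels are hypothetical.
classify_batch_dir() {
    date_glob='[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]'
    case "$1" in
        wolonchete_$date_glob|wonchi_$date_glob)
            echo "collection-date" ;;   # prefixed: actual collection date
        $date_glob)
            echo "estimated-date" ;;    # unprefixed: best guess
        *)
            echo "other" ;;             # e.g. results.txt
    esac
}
```

For example, classify_batch_dir wonchi_2012-05-23 reports a collection date, while classify_batch_dir 2012-03-30 reports an estimated one.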
cscott@hydro:/home/ethiopia$ ls archive
wolonchete_2012-06-03.zip       wonchi_2012-05-31.zip
wolonchete_2012-06-10.zip       wonchi_2012-06-07.zip
wolonchete_2012-06-17.orig.zip  wonchi_2012-06-14.zip
wolonchete_2012-06-17.zip       wonchi_2012-06-21.zip
wolonchete_2012-06-24.zip
These are the two most recent files received from Ethiopia:
cscott@hydro:/home/ethiopia$ ls ~rsmith/w*.zip
/home/rsmith/wolonchete_2012-07-01.zip  /home/rsmith/wonchi_2012-06-28.zip
cscott@hydro:/home/ethiopia$ unzip -v /home/rsmith/wolonchete_2012-07-01.zip | head
Archive:  /home/rsmith/wolonchete_2012-07-01.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
       0  Stored        0   0% 2012-07-04 09:37 00000000  wolonchete_2012-07-01/
       0  Stored        0   0% 2012-07-04 09:27 00000000  wolonchete_2012-07-01/01/
       0  Stored        0   0% 1979-12-31 16:00 00000000  wolonchete_2012-07-01/01/edu.mit.media.funf.bgcollector/
    8200  Defl:N     2543  69% 2012-06-24 09:06 5209af9a  wolonchete_2012-07-01/01/edu.mit.media.funf.bgcollector/00000180_0388920440601417_7c429084-7a15-48f1-a458-957681c105fd_1340465667_mainPipeline.db
    6152  Defl:N     1415  77% 2012-06-24 09:06 5d96a57f  wolonchete_2012-07-01/01/edu.mit.media.funf.bgcollector/00000181_0388920440601417_7c429084-7a15-48f1-a458-957681c105fd_1340552241_mainPipeline.db
    6152  Defl:N      811  87% 2012-06-24 09:06 541732f9  wolonchete_2012-07-01/01/edu.mit.media.funf.bgcollector/00000182_0388920440601417_7c429084-7a15-48f1-a458-957681c105fd_1340552241_mainPipeline.db
  114696  Defl:N    16867  85% 2012-06-24 09:37 a86306dd  wolonchete_2012-07-01/01/edu.mit.media.funf.bgcollector/00000183_0388920440601417_7c429084-7a15-48f1-a458-957681c105fd_1340554005_mainPipeline.db
cscott@hydro:/home/ethiopia$
Note that the directory name matches the .zip file name, which is what we want. (We should check the wonchi file, too.) This has recently (late July) become a slight problem, as the files for wolonchete are now arriving spelled wolenchite. We've adopted wolonchete as the normalized spelling, so rename anything that needs it (after unzipping).
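The name check can be done without reading the listing by eye. The sketch below works on a list of member names, one per line, so it does not depend on any particular archive tool; both function names are made up for this example.

```shell
# Sketch: read archive member names on stdin (one per line) and print the
# distinct top-level path components, sorted.
top_level_names() {
    cut -d/ -f1 | sort -u
}

# Succeed only if every member on stdin sits under a single top-level
# directory whose name equals the .zip file's basename.
# listing_matches_zip_name is a hypothetical helper.
listing_matches_zip_name() {
    expected=$(basename "$1" .zip)
    [ "$(top_level_names)" = "$expected" ]
}
```

Assuming the Info-ZIP unzip on hydro, whose zipinfo mode (unzip -Z1) prints one member name per line, usage would look like: unzip -Z1 /home/rsmith/wonchi_2012-06-28.zip | listing_matches_zip_name /home/rsmith/wonchi_2012-06-28.zip && echo OK. A wolenchite-spelled directory inside a wolonchete-named zip would fail this check.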
Now let's unzip the files:
cscott@hydro:/home/ethiopia$ cd
cscott@hydro:~$ mkdir temp ; cd temp
cscott@hydro:~/temp$ unzip /home/rsmith/wonchi_2012-06-28.zip
cscott@hydro:~/temp$ unzip /home/rsmith/wolonchete_2012-07-01.zip
Check for the following:
- Spelling of wonchi/wolonchete
- Tablet directories should be two digits: "01" not "1"
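Both checks are mechanical, so they can be sketched as shell functions. The function names are invented here, and the spelling check assumes the unpacked directory is named SITE_DATE as in the examples above.

```shell
# Sketch: print every immediate subdirectory of BATCH_DIR whose name is not
# exactly two digits. bad_tablet_dirs is a hypothetical helper.
bad_tablet_dirs() {
    for path in "$1"/*/; do
        [ -d "$path" ] || continue      # the glob matched nothing
        name=$(basename "$path")
        case "$name" in
            [0-9][0-9]) ;;              # "01", "20": fine
            *) printf '%s\n' "$name" ;; # "1", "103", "tmp": report it
        esac
    done
}

# Succeed only if the batch directory uses a normalized site spelling.
site_spelling_ok() {
    case "$(basename "$1")" in
        wolonchete_*|wonchi_*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Run from ~/temp, bad_tablet_dirs wonchi_2012-06-28 should print nothing for a clean batch, and site_spelling_ok wolenchite_2012-07-01 would fail, flagging a directory that needs renaming.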
We're going to check for duplicates:
cscott@hydro:~/temp$ cp ~rsmith/bin/rdfind ~/bin
cscott@hydro:~/temp$ rdfind wolonchete_2012-07-01/ wonchi_2012-06-28/
Now scanning "wolonchete_2012-07-01", found 13727 files.
Now scanning "wonchi_2012-06-28", found 7630 files.
Now have 21357 files in total.
Removed 0 files due to nonunique device and inode.
Now removing files with zero size from list...removed 36 files
Total size is 3198679889 bytes or 3 Gib
Now sorting on size:removed 6240 files due to unique sizes from list.15081 files left.
Now eliminating candidates based on first bytes:removed 3424 files from list.11657 files left.
Now eliminating candidates based on last bytes:removed 9294 files from list.2363 files left.
Now eliminating candidates based on md5 checksum:removed 2363 files from list.0 files left.
It seems like you have 0 files that are not unique
Totally, 0 b can be reduced.
Now making results file results.txt
cscott@hydro:~/temp$
OK, this gives us confidence that there aren't gross errors in the data, such as the wonchi and wolonchete zips having the same contents. We should repeat the rdfind across the entire dataset (later) as well, to ensure that we don't have stale data here.
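If rdfind isn't handy, a rough cross-check is possible with standard tools. This is a sketch, not a replacement for rdfind: it hashes every non-empty file, which is much slower than rdfind's staged elimination, and it assumes GNU coreutils' sha256sum is installed. The function name list_duplicates is made up here.

```shell
# Sketch: print the paths of all non-empty files under the given directories
# whose contents are identical to at least one other file.
# list_duplicates is a hypothetical helper; it assumes sha256sum exists and
# that file names contain no newlines or backslashes.
list_duplicates() {
    find "$@" -type f -size +0 -exec sha256sum {} + |
        LC_ALL=C sort |
        awk '{
            hash = $1
            # sha256sum prints the hash, two separator characters, the name.
            name = substr($0, length(hash) + 3)
            if (hash == prev) {
                if (!printed) print prevname   # first member of the group
                print name
                printed = 1
            } else {
                printed = 0
            }
            prev = hash
            prevname = name
        }'
}
```

No output from list_duplicates wolonchete_2012-07-01/ wonchi_2012-06-28/ would correspond to rdfind's "0 files that are not unique".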
OK, this looks good. Let's move them into the ethiopia directory:
cscott@hydro:~/temp$ mv /home/rsmith/wonchi_2012-06-28.zip /home/rsmith/wolonchete_2012-07-01.zip /home/ethiopia/archive/
cscott@hydro:~/temp$ mv wolonchete_2012-07-01/ /home/ethiopia/wolonchete/
cscott@hydro:~/temp$ mv wonchi_2012-06-28/ /home/ethiopia/wonchi/
Normalize the permissions:
cscott@hydro:~/temp$ cd /home/ethiopia/archive/
cscott@hydro:/home/ethiopia/archive$ chmod 644 *
cscott@hydro:/home/ethiopia/archive$ cd ..
cscott@hydro:/home/ethiopia$ chmod -R a+X,o-w wolonchete/wolonchete_2012-07-01/ wonchi/wonchi_2012-06-28/
cscott@hydro:/home/ethiopia$ chmod -R g+sw wolonchete/wolonchete_2012-07-01/ wonchi/wonchi_2012-06-28/
cscott@hydro:/home/ethiopia$ chown -R :literacy wolonchete/wolonchete_2012-07-01/ wonchi/wonchi_2012-06-28/
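To confirm the normalization took effect, one option is to ask find for anything that still breaks the two rules that are easy to test: not world-writable, and owned by the literacy group. A minimal sketch; the function name is made up, and it does not check the setgid or group-write bits.

```shell
# Sketch: print every path under the given directories that is still
# world-writable or does not belong to GROUP. Symlinks are skipped because
# their own mode bits are not meaningful.
# unnormalized_paths is a hypothetical helper.
unnormalized_paths() {
    group=$1
    shift
    find "$@" ! -type l \( -perm -002 -o ! -group "$group" \) -print
}
```

For example: unnormalized_paths literacy wolonchete/wolonchete_2012-07-01/ wonchi/wonchi_2012-06-28/ should print nothing once the chmod and chown commands have run.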
Now we're going to run rdfind across all of the data, in order to catch more errors. As of early September, 2012, this step took about 45 minutes to inspect over 4 million files.
cscott@hydro:/home/ethiopia$ rdfind wolonchete/ wonchi/
The output will be in a results.txt file.
Pushing the data into git
Make sure you have a copy of the git repo somewhere handy:
cscott@hydro:/home/ethiopia$ cd
cscott@hydro:~$ git clone ssh://dev.laptop.org/home/literacy/git/data ethiopia-data
or
cscott@hydro:~$ git clone ~rsmith/data/.git ethiopia-data
cscott@hydro:~$ cd ethiopia-data/
cscott@hydro:~/ethiopia-data$ git remote add dev ssh://dev.laptop.org/home/literacy/git/data
Make sure your repo is up-to-date. The scripts may have changed since the last import you did:
cscott@hydro:~$ cd ~/ethiopia-data/
cscott@hydro:~/ethiopia-data$ git pull origin
Add the ethiopia-data/scripts directory to your path (or double-check that it's there and before ~/bin, if you've done this before):
cscott@hydro:~/ethiopia-data$ export PATH=$HOME/ethiopia-data/scripts:$PATH
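The "before ~/bin" part is easy to get wrong after several logins, so here is a sketch of a check. The function name is invented; it takes the path string as an argument rather than reading $PATH directly, and it only does exact string comparison (no symlink or trailing-slash handling).

```shell
# Sketch: succeed if WANTED_DIR appears in PATH_STRING before OTHER_DIR
# (or if OTHER_DIR is absent); fail if OTHER_DIR comes first or WANTED_DIR
# is missing. The body runs in a subshell so IFS and globbing changes do
# not leak. comes_first_on_path is a hypothetical helper.
comes_first_on_path() (
    IFS=:
    set -f                      # do not glob-expand path entries
    for entry in $1; do
        [ "$entry" = "$2" ] && exit 0
        [ "$entry" = "$3" ] && exit 1
    done
    exit 1                      # WANTED_DIR is not on the path at all
)
```

Usage would be: comes_first_on_path "$PATH" "$HOME/ethiopia-data/scripts" "$HOME/bin" || echo "fix your PATH first"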
You might want to be inside screen or tmux for the following, since it takes a long time:
cscott@hydro:~/ethiopia-data$ cd /home/ethiopia/
cscott@hydro:/home/ethiopia$ process_by_week.sh wolonchete/wolonchete_2012-07-01
cscott@hydro:/home/ethiopia$ process_by_week.sh wonchi/wonchi_2012-06-28/
[...]
Mon Jul 9 17:46:31 EDT 2012: Finished processing wonchi/wonchi_2012-06-28/
cscott@hydro:/home/ethiopia$
You might see a few error messages which look like:
Unable to parse file: 00000034_0388920540e09217_d4c0820a-2ddc-4c19-8adf-872c835dd6b7_1340339595_mainPipeline.db
These are harmless; usually caused by too-small/empty .db files (check this).
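One way to do that check is to list the suspiciously small .db files in a batch and compare them against the names in the error messages. This is a sketch under assumptions: the 1024-byte default threshold is a guess, not a documented limit, and the name pattern includes a trailing wildcard so it also matches originals that have been renamed with a .orig suffix.

```shell
# Sketch: print mainPipeline .db files under DIR that are smaller than
# MAX_BYTES (default 1024, an arbitrary threshold chosen for illustration).
# small_db_files is a hypothetical helper.
small_db_files() {
    dir=$1
    max_bytes=${2:-1024}
    # With the "c" suffix, find compares exact byte counts.
    find "$dir" -type f -name '*_mainPipeline.db*' -size "-${max_bytes}c" -print
}
```

For example, small_db_files wonchi/wonchi_2012-06-28/ lists candidates; if every "Unable to parse file" name shows up there, the too-small explanation holds for that batch.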
Running this command has moved the original .db files into encrypted_db/ and created, for each tablet, a merged_NN.db file and a csv_NN/ directory of .csv files (NN appears to be the tablet number; 20 in this example):
cscott@hydro:/home/ethiopia$ ls wolonchete/wolonchete_2012-07-01/20/edu.mit.media.funf.bgcollector/encrypted_db/
[...]
00000689_0380614241409357_ffce5f83-9400-4699-9d6c-300ccfd58e85_1341117255_mainPipeline.db.orig
cscott@hydro:/home/ethiopia$ ls wolonchete/wolonchete_2012-07-01/20/edu.mit.media.funf.bgcollector/merged_20.db
wolonchete/wolonchete_2012-07-01/20/edu.mit.media.funf.bgcollector/merged_20.db
cscott@hydro:/home/ethiopia$ ls wolonchete/wolonchete_2012-07-01/20/edu.mit.media.funf.bgcollector/csv_20/
BatteryProbe.csv       Matching.csv                  ScreenProbe.csv
FileMoverService.csv   NellBalloons5.csv             tinkerbook.csv
HardwareInfoProbe.csv  RecorderService.csv
LauncherApp.csv        RunningApplicationsProbe.csv
Now let's stage this data for transfer to git. (Where the names differ, the first two arguments give the site and date as the directory will appear in git, and the last argument is the directory name under /home/ethiopia.)
cscott@hydro:/home/ethiopia$ copyto-forgit.sh wonchi 2012-06-28 wonchi/wonchi_2012-06-28/
This has put the new data in $HOME/forgit:
cscott@hydro:/home/ethiopia$ cd ~/forgit/
cscott@hydro:~/forgit$ ls
wonchi
cscott@hydro:~/forgit$ ls wonchi/
2012-06-28
cscott@hydro:~/forgit$ ls wonchi/2012-06-28/
01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20
Now move this to the git dir and check it in:
cscott@hydro:~/forgit$ mv wonchi/2012-06-28 ~/ethiopia-data/wonchi/
cscott@hydro:~/forgit$ cd ~/ethiopia-data/
cscott@hydro:~/ethiopia-data$ git add wonchi/2012-06-28
cscott@hydro:~/ethiopia-data$ git commit wonchi/2012-06-28
cscott@hydro:~/ethiopia-data$ git push origin # or dev
Now repeat this for the other site (separate commits for each):
cscott@hydro:/home/ethiopia$ copyto-forgit.sh wolenchite 2012-07-01 wolonchete/wolonchete_2012-07-01/
cscott@hydro:/home/ethiopia$ cd ~/forgit/
cscott@hydro:~/forgit$ mv wolenchite/2012-07-01 ~/ethiopia-data/wolenchite/
cscott@hydro:~/forgit$ cd ~/ethiopia-data/
cscott@hydro:~/ethiopia-data$ git add wolenchite/2012-07-01
cscott@hydro:~/ethiopia-data$ git commit wolenchite/2012-07-01
cscott@hydro:~/ethiopia-data$ git push origin # or dev
Final steps
In theory we should then sync up owl and worldliteracy, and then send mail to the literacy list to announce that there is new data. The sync is something like this, run from /home/ethiopia on hydro:
rsync -a --progress wolonchete wonchi owl.laptop.org:/home/ethiopia
rsync -a --progress wolonchete wonchi worldliteracy.media.mit.edu:/home/ethiopia
You might want to run the second rsync from owl after the first is finished, rather than from hydro, since owl and worldliteracy are both at the Media Lab (ML) and have a high-bandwidth network between them.