Health portal/Step2

Revision as of 15:10, 14 May 2008 by Cjl (talk | contribs) (Working draft of MedLine Plus scraping / bundling process.)

Execute these four commands:

wget -rp -l1 -o logfile1 http://www.nlm.nih.gov/medlineplus/healthtopics.html

Downloaded: 184 files, 2.7M in 15s (176 KB/s)

Cumulative downloaded: 184 files, 2.7M

wget -rp -l1 -o logfile2 http://www.nlm.nih.gov/medlineplus/spanish/healthtopics.html

Downloaded: 168 files, 2.5M in 14s (185 KB/s)

Cumulative downloaded: 276 files, 4.9M

wget -rp -l1 -o logfile3 http://www.nlm.nih.gov/medlineplus/all_healthtopics.html

Downloaded: 1378 files, 49M in 2m 42s (310 KB/s)

Cumulative downloaded: 1552 files, 53M

wget -rp -l1 -o logfile4 http://www.nlm.nih.gov/medlineplus/spanish/all_healthtopics.html

Downloaded: 1308 files, 30M in 2m 18s (224 KB/s)

Total Downloaded: 2297 files, 73M
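A minimal sketch for collecting these figures after the fact. It assumes the logs are named logfile1 to logfile4 as in the commands above and that wget wrote the mirror into its default host-named directory, www.nlm.nih.gov. The function names are made up for illustration. Counting files on disk gives a cumulative total, which can be lower than the sum of the per-run counts if several runs fetch the same file.

```shell
#!/bin/sh
# Print the last "Downloaded:" summary line from each wget log.
summarize_logs() {
    # $@: log files written by wget -o
    for log in "$@"; do
        line=$(grep '^Downloaded:' "$log" | tail -n 1)
        printf '%s: %s\n' "$log" "$line"
    done
}

# Count the files actually present under a mirror directory.
count_mirror_files() {
    # $1: mirror directory
    find "$1" -type f | wc -l | tr -d ' '
}

# Usage: runs only if the mirror directory exists in the current directory.
if [ -d www.nlm.nih.gov ]; then
    summarize_logs logfile1 logfile2 logfile3 logfile4
    count_mirror_files www.nlm.nih.gov
    du -sh www.nlm.nih.gov
fi
```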


Good guidance here: http://www.nlm.nih.gov/medlineplus/faq/copyrightfaq.html

The homepage, the summaries on the Health Topics pages, the FAQs, and the same pages on MedlinePlus en español are all copyright-free.


Problematic areas (Copyright)

The A.D.A.M. Medical Encyclopedia includes over 4,000 articles about diseases, tests, symptoms, injuries, and surgeries. It also contains an extensive library of medical photographs and illustrations.



http://www.patient-education.com/nlm/terms/


http://www.nlm.nih.gov/medlineplus/languages/languages.html


Need to review the images grabbed: keep navigation, lose ADAM pics. Replace some with CC images or lose them altogether.
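As a starting point for that review, a rough sketch that lists every image file in the mirror. The directory name www.nlm.nih.gov is wget's default for this host. The function name and the choice of extensions are assumptions, and -iname is a GNU/BSD find extension rather than strict POSIX. Telling ADAM images apart from navigation images still has to be done by eye or by path once the listing shows where they live.

```shell
#!/bin/sh
# List image files under a mirror directory, sorted by path.
list_images() {
    # $1: mirror directory
    find "$1" -type f \( -iname '*.gif' -o -iname '*.jpg' \
        -o -iname '*.jpeg' -o -iname '*.png' \) | sort
}

# Usage: runs only if the mirror directory exists in the current directory.
if [ -d www.nlm.nih.gov ]; then
    list_images www.nlm.nih.gov
fi
```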

Need to trim some material from each page:

Tone down NIH branding while preserving attribution; lose the footer, for instance.

Need to figure out what to do with next-hop links. They go off to material that may be copyrighted or too US-specific. Big chunks at the bottom of each page can be deleted.
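To see where the next-hop links point, something like the following sketch could tally the off-site hosts referenced in the downloaded pages. It only catches absolute http/https links written as href="..." with double quotes, relies on the widely available but non-POSIX grep options -r and -o, and the function name is hypothetical.

```shell
#!/bin/sh
# Count links per external host, most frequent first.
list_offsite_hosts() {
    # $1: mirror directory, $2: host name to treat as on-site
    grep -rhoi 'href="https\{0,1\}://[^"/]*' "$1" \
        | sed 's|^.*://||' \
        | tr 'A-Z' 'a-z' \
        | grep -v -x -F "$2" \
        | sort | uniq -c | sort -rn \
        | awk '{print $1, $2}'
}

# Usage: runs only if the mirror directory exists in the current directory.
if [ -d www.nlm.nih.gov ]; then
    list_offsite_hosts www.nlm.nih.gov www.nlm.nih.gov
fi
```

Each output line is a count followed by a host name, which gives a quick picture of which destinations need a copyright or relevance check.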


The current form should also allow switching back and forth between English and Spanish.
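One possible first check, sketched under a big assumption: that an English page and its Spanish counterpart share a file name, with the Spanish copy under the spanish/ subdirectory as in the top-level URLs above. That has not been verified for individual topic pages, and the function name is made up.

```shell
#!/bin/sh
# Print the names of .html files present in both directories.
pages_in_both_languages() {
    # $1: English directory, $2: Spanish directory
    for en in "$1"/*.html; do
        name=$(basename "$en")
        if [ -f "$en" ] && [ -f "$2/$name" ]; then
            printf '%s\n' "$name"
        fi
    done
}

# Usage: runs only if the mirror directory exists in the current directory.
if [ -d www.nlm.nih.gov/medlineplus/spanish ]; then
    pages_in_both_languages www.nlm.nih.gov/medlineplus \
        www.nlm.nih.gov/medlineplus/spanish
fi
```

Pages that show up in the list are candidates for a direct language toggle; anything missing needs a manual mapping.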

It might make sense to globally restructure for easier i18n/l10n.


Get my grep and Perl skills polished for bulk global edits on whole directories of files at once; must find my copy of the Llama book.