Health portal/Step2: Difference between revisions

From OLPC
(typo)
(update for recent download of MedLinePlus)
 
(12 intermediate revisions by the same user not shown)
Line 1: Line 1:
<noinclude>{{ GoogleTrans-en | es =show | bg =show | zh-CN =show | zh-TW =show | hr =show | cs =show | da =show | nl =show | fi =show | fr =show | de =show | el =show | hi =show | it =show | ja =show | ko =show | no =show | pl =show | pt =show | ro =show | ru =show | sv =show }}</noinclude>{{Health}}
{{TOCright}}
=Downloading raw materials=
This page describes the details of a content harvesting project aimed at creating an .xol bundle of content from the U.S. National Library of Medicine's MedLinePlus webpage (at least the first layer which is explicitly free of copyright).

Such a bundle and the index pages downloaded with it present an opportunity to start an off-line browsable Health Portal. Additional material can be added to this portal by adding an HTML file to the collection and inserting a link to that file in the two index pages (one by subject, one by alphabetical listing). The point of this is to empower local management of content: objectionable content can be removed (rather than the entire bundle being refused), and locally important material can be integrated with only simple HTML authoring skills.


==Starting with the content scrape==
==Starting with the content scrape==


execute these 4 commands (requires wget tool)
execute these 4 commands (requires wget)


<pre>
<pre>
wget -rp -l1 -o logfile1 http://www.nlm.nih.gov/medlineplus/healthtopics.html
wget -rp -l1 -o logfile1 http://www.nlm.nih.gov/medlineplus/healthtopics.html
</pre>
</pre>
Downloaded: 184 files, 2.7M in 15s (176 KB/s)


Cumulative downloaded: 184 files, 2.7M
Cumulative downloaded: 185 files, 2.8M


<pre>
<pre>
wget -rp -l1 -o logfile2 http://www.nlm.nih.gov/medlineplus/spanish/healthtopics.html
wget -rp -l1 -o logfile2 http://www.nlm.nih.gov/medlineplus/spanish/healthtopics.html
</pre>
</pre>
Downloaded: 168 files, 2.5M in 14s (185 KB/s)


Cumulative downloaded: 276 files, 4.9M
Cumulative downloaded: 280 files, 5.2M


<pre>
<pre>
wget -rp -l1 -o logfile3 http://www.nlm.nih.gov/medlineplus/all_healthtopics.html
wget -rp -l1 -o logfile3 http://www.nlm.nih.gov/medlineplus/all_healthtopics.html
</pre>
</pre>
Downloaded: 1378 files, 49M in 2m 42s (310 KB/s)


Cumulative downloaded: 1552 files, 53M
Cumulative downloaded: 1587 files, 54.6M


<pre>
<pre>
wget -rp -l1 -o logfile4 http://www.nlm.nih.gov/medlineplus/spanish/all_healthtopics.html
wget -rp -l1 -o logfile4 http://www.nlm.nih.gov/medlineplus/spanish/all_healthtopics.html
</pre>
</pre>
Downloaded: 1308 files, 30M in 2m 18s (224 KB/s)

Total Downloaded: 2297 files, 73M


Cumulative downloaded: 2335 files, 75.2M


This is a deliberately redundant retrieval, many files will be retrieved more than once, but "clobbering" (overwriting) is allowed, so this doesn't produce an excess of files.
This is a deliberately redundant retrieval, many files will be retrieved more than once, but "clobbering" (overwriting) is allowed, so this doesn't produce an excess of files.


See general version (not MedLinePlus specific) of a content bundling script being developed here:
http://wiki.laptop.org/go/Content_bundle_making_script


==Consideration of Copyright issues==
==Consideration of Copyright issues==
Line 43: Line 40:
Good guidance here: http://www.nlm.nih.gov/medlineplus/faq/copyrightfaq.html
Good guidance here: http://www.nlm.nih.gov/medlineplus/faq/copyrightfaq.html


The homepage, the summaries on the Health Topics pages, the FAQs, and the same pages on MedlinePlus (en español) are all free of copyright and in the public domain by virtue of being U.S. Governemnt works. There are however, deeper link levels tha present content that is covered by copyright of various corporate sub-contractors tha do not fall under the U. S. Goverenment exemption from copyright.
The homepage, the summaries on the Health Topics pages, the FAQs, and the same pages on MedlinePlus (en español) are all free of copyright and in the public domain by virtue of being U.S. Government works. There are however, deeper link levels that present content that is covered by copyright of various corporate sub-contractors that do not fall under the U. S. Government exemption from copyright.


===Problematic areas===
===Problematic areas===


It would be really nice to have this, but it's probably not going to happen "The A.D.A.M. Medical Encyclopedia includes over 4,000 articles about diseases, tests, symptoms, injuries, and surgeries. It also contains an extensive library of medical photographs and illustrations."
It would be really nice to have this, but it's probably not going to happen "The A.D.A.M. Medical Encyclopedia includes over 4,000 articles about diseases, tests, symptoms, injuries, and surgeries. It also contains an extensive library of medical photographs and illustrations." Unfortunately, most of the images used on the health topic pages are from the non-free A.D.A.M. source, Mika may be able to help hunt down suitable replacements, see [[User:Mika#images]], which would be very nice.


Maybe figure out a way to frame links to this site (only accessible with internet connection, not off-line version). The idea would be to use something like the Google translation template links so tha the request is for the site at NIH, but via a google translation engine. This way, links are left in, people are sent to the original (in English), but alos have one-click option to get it via a Google translated page. Yes, relying on Google to translate health ideas is not "a good thing", but it is better than not having any access at all to health information.
Maybe figure out a way to frame links to this site (only accessible with internet connection, not off-line version). The idea would be to use something like the Google translation template links so that the request is for the site at NIH, but via a google translation engine. This way, links are left in, people are sent to the original (in English), but also have one-click option to get it via a Google translated page. Yes, relying on Google to translate health ideas is not "a good thing", but it is better than not having any access at all to health information.
X-Plain Tutorial (similar issues with using this content)
X-Plain Tutorial (similar issues with using this content)
http://www.patient-education.com/nlm/terms/
http://www.patient-education.com/nlm/terms/



==For follow-up==
==For follow-up==
Line 64: Line 60:




Need to review the images grabbed: keep navigation, lose ADAM pics. Replace some with CC images or lose them altogether. Try running the commands above without the -p flag set and compare filesets, then transfer over those page-element images that are still needed. Use the names of images to be excluded to hunt them down in the HTML for editing out.


====[[Health_portal/Step3a|Go to next section - Step 3a]]====
Need to trim some stuff in each downloaded health topic:
====[[Health_portal|Return to Health portal main page]]====


*Tone down NIH branding while preserving attribution, lose or modify footer for instance. Keep some attribution, preserve a link back to NIH site for original or updated content. Possibly through a "leaving bundle to WWW" page.
*Need to figure out what to do with links that are next hop. They go off to stuff that may have copyright or be too US specific. Can delete big chunks of bottom of each page. Provide opportunity for local MinHealth link insertion, "local branding" as well as locally relevant content.
*Nice feature, to look at. I think in the current form it should allow easy back and forth between English and Spanish. This provides some opportunity to increase bilingual understanding of medical terms.
*It might make sense to globally restructure for easier i18n/l10n. For instance, this currently uses a separate folder for Spanish. That is fine, but try to i18n-ize the structure so that there are one or more lang-xx folders.


[[Category:Health]]

Get my grep and Perl skills polished for bulk global edits on whole directories of files at once; must find my copy of the llama book.

See general version (not MedLinePlus specific) being developed here:
http://wiki.laptop.org/go/Content_bundle_making_script

Latest revision as of 22:09, 13 October 2008


  This page is part of the OLPC Health Project. Hardware | Software | Content | Health Jam

Downloading raw materials

Starting with the content scrape

Execute these four commands (requires wget):

wget -rp -l1 -o logfile1 http://www.nlm.nih.gov/medlineplus/healthtopics.html

Cumulative downloaded: 185 files, 2.8M

wget -rp -l1 -o logfile2 http://www.nlm.nih.gov/medlineplus/spanish/healthtopics.html

Cumulative downloaded: 280 files, 5.2M

wget -rp -l1 -o logfile3 http://www.nlm.nih.gov/medlineplus/all_healthtopics.html

Cumulative downloaded: 1587 files, 54.6M

wget -rp -l1 -o logfile4 http://www.nlm.nih.gov/medlineplus/spanish/all_healthtopics.html

Cumulative downloaded: 2335 files, 75.2M
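The cumulative figures above can be re-checked after each retrieval with a short script. This is a minimal sketch, assuming the figures refer to files on disk and that wget has placed everything under a folder named after the host (www.nlm.nih.gov); the helper name count_files is made up for illustration.

```shell
#!/bin/sh
# Sketch: report how many files a download tree holds and its size on disk.
# ASSUMPTION: wget -r stored the pages under a folder named after the host.

count_files() {
    # Count regular files below the directory given as $1.
    find "$1" -type f | wc -l
}

tree="${1:-www.nlm.nih.gov}"
if [ -d "$tree" ]; then
    echo "files: $(count_files "$tree")"
    echo "size:  $(du -sh "$tree" | cut -f1)"
else
    echo "no download tree at $tree yet"
fi
```

Running it between the four commands should show the totals growing by less than each "Downloaded" line reports, since repeated files overwrite one another.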

This is a deliberately redundant retrieval. Many files will be retrieved more than once, but "clobbering" (overwriting) is allowed, so this doesn't produce an excess of files.
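The four retrievals can also be run in one pass. The sketch below uses the same wget flags and URLs as the commands above; the DRY_RUN switch and the scrape_all function name are not part of the original procedure and are added here only so the commands can be previewed before anything is downloaded.

```shell
#!/bin/sh
# Sketch: run the four retrievals from the text in sequence.
# DRY_RUN is a hypothetical switch for illustration: when it is 1 (the
# default here) the commands are printed instead of executed.
BASE="http://www.nlm.nih.gov/medlineplus"
DRY_RUN="${DRY_RUN:-1}"

scrape_all() {
    n=1
    for page in healthtopics.html spanish/healthtopics.html \
                all_healthtopics.html spanish/all_healthtopics.html; do
        # -r recursive, -p page requisites, -l1 one level deep, -o log file
        cmd="wget -rp -l1 -o logfile$n $BASE/$page"
        if [ "$DRY_RUN" = "1" ]; then
            echo "$cmd"
        else
            $cmd
        fi
        n=$((n + 1))
    done
}

scrape_all
```

Setting DRY_RUN=0 in the environment performs the actual downloads; because the later retrievals overwrite files already fetched, the order of the four pages should not matter.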


See the general version (not MedLinePlus-specific) of a content bundling script being developed here: http://wiki.laptop.org/go/Content_bundle_making_script

Consideration of Copyright issues

Good guidance here: http://www.nlm.nih.gov/medlineplus/faq/copyrightfaq.html

The homepage, the summaries on the Health Topics pages, the FAQs, and the same pages on MedlinePlus (en español) are all free of copyright and in the public domain by virtue of being U.S. Government works. There are, however, deeper link levels presenting content that is copyrighted by various corporate sub-contractors and does not fall under the U.S. Government exemption from copyright.

Problematic areas

It would be really nice to have this, but it's probably not going to happen: "The A.D.A.M. Medical Encyclopedia includes over 4,000 articles about diseases, tests, symptoms, injuries, and surgeries. It also contains an extensive library of medical photographs and illustrations." Unfortunately, most of the images used on the health topic pages are from the non-free A.D.A.M. source. Mika may be able to help hunt down suitable replacements (see User:Mika#images), which would be very nice.

Maybe figure out a way to frame links to this site (only accessible with an internet connection, not in the off-line version). The idea would be to use something like the Google translation template links, so that the request is for the site at NIH but goes via the Google translation engine. This way, links are left in and people are sent to the original (in English), but they also have a one-click option to get it via a Google-translated page. Yes, relying on Google to translate health ideas is not "a good thing", but it is better than not having any access at all to health information.
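A link of that kind could be generated along these lines. This is only a sketch: the translate.google.com URL form (sl, tl and u parameters) is an assumption that should be checked against the GoogleTrans template, and the function name translated_link is made up for illustration.

```shell
#!/bin/sh
# Sketch: build a link that asks Google Translate to fetch an NIH page.
# ASSUMPTION: the translate.google.com/translate?sl=..&tl=..&u=.. URL form.

translated_link() {
    # $1 = target language code, $2 = URL of the original (English) page
    printf 'http://translate.google.com/translate?sl=en&tl=%s&u=%s\n' "$1" "$2"
}

translated_link es http://www.nlm.nih.gov/medlineplus/healthtopics.html
```

Page URLs that carry their own query strings would need percent-encoding before being passed in, which this sketch does not attempt.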

X-Plain Tutorial (similar issues with using this content): http://www.patient-education.com/nlm/terms/

For follow-up

Other language content linked from here should be looked at:

http://www.nlm.nih.gov/medlineplus/languages/languages.html

Amharic, Arabic, Armenian, Bengali, Bosnian, Burmese, Chamorro, Chinese, Chuukese, Croatian, Farsi, French, French Creole, German, Gujarati, Hindi, Hmong, Ilocano, Italian, Japanese, Khmer, Kirundi, Korean, Kurdish, Laotian, Marshallese, Navajo, Panjabi, Polish, Portuguese, Romanian, Russian, Samoan, Somali, Spanish, Tagalog, Thai, Tigrinya, Tongan, Turkish, Ukrainian, Urdu, Vietnamese.


Go to next section - Step 3a

Return to Health portal main page