Health portal/Step2: Difference between revisions
(Working draft of MedLine Plus scraping / bundling process.)
(update for recent download of MedLinePlus)
(32 intermediate revisions by 2 users not shown)
<noinclude>{{ GoogleTrans-en | es =show | bg =show | zh-CN =show | zh-TW =show | hr =show | cs =show | da =show | nl =show | fi =show | fr =show | de =show | el =show | hi =show | it =show | ja =show | ko =show | no =show | pl =show | pt =show | ro =show | ru =show | sv =show }}</noinclude>{{Health}}

=Downloading raw materials=

==Starting with the content scrape==

<pre>
wget -rp -l1 -o logfile1 http://www.nlm.nih.gov/medlineplus/healthtopics.html
</pre>

Downloaded: 184 files, 2.7M in 15s (176 KB/s)

Cumulative downloaded: 185 files, 2.8M

<pre>
wget -rp -l1 -o logfile2 http://www.nlm.nih.gov/medlineplus/spanish/healthtopics.html
</pre>

Downloaded: 168 files, 2.5M in 14s (185 KB/s)

Cumulative downloaded: 280 files, 5.2M

<pre>
wget -rp -l1 -o logfile3 http://www.nlm.nih.gov/medlineplus/all_healthtopics.html
</pre>

Downloaded: 1378 files, 49M in 2m 42s (310 KB/s)

Cumulative downloaded: 1587 files, 54.6M

<pre>
wget -rp -l1 -o logfile4 http://www.nlm.nih.gov/medlineplus/spanish/all_healthtopics.html
</pre>

Downloaded: 1308 files, 30M in 2m 18s (224 KB/s)

Cumulative downloaded: 2335 files, 75.2M

This retrieval is deliberately redundant: many files will be retrieved more than once, but "clobbering" (overwriting) is allowed, so it doesn't produce an excess of files.

See the general version (not MedLinePlus specific) of a content bundling script being developed here:

http://wiki.laptop.org/go/Content_bundle_making_script

==Consideration of Copyright issues==

The homepage, the summaries on the Health Topics pages, the FAQs, and the same pages on MedlinePlus (en español) are all free of copyright and in the public domain by virtue of being U.S. Government works. There are, however, deeper link levels that present content covered by the copyright of various corporate subcontractors, and that content does not fall under the U.S. Government exemption from copyright.

It would be really nice to have this, but it's probably not going to happen: "The A.D.A.M. Medical Encyclopedia includes over 4,000 articles about diseases, tests, symptoms, injuries, and surgeries. It also contains an extensive library of medical photographs and illustrations." Unfortunately, most of the images used on the health topic pages are from the non-free A.D.A.M. source. Mika may be able to help hunt down suitable replacements (see [[User:Mika#images]]), which would be very nice.

Maybe figure out a way to frame links to this site (only accessible with an internet connection, not in the off-line version). The idea would be to use something like the Google translation template links so that the request is for the site at NIH, but via the Google translation engine. This way, links are left in and people are sent to the original (in English), but also have a one-click option to get it via a Google-translated page. Yes, relying on Google to translate health ideas is not "a good thing", but it is better than not having any access at all to health information.

X-Plain Tutorial (similar issues with using this content)

http://www.patient-education.com/nlm/terms/

==For follow-up==

Other language content linked from here should be looked at:

http://www.nlm.nih.gov/medlineplus/languages/languages.html

Amharic, Arabic, Armenian, Bengali, Bosnian, Burmese, Chamorro, Chinese, Chuukese, Croatian, Farsi, French, French Creole, German, Gujarati, Hindi, Hmong, Ilocano, Italian, Japanese, Khmer, Kirundi, Korean, Kurdish, Laotian, Marshallese, Navajo, Panjabi, Polish, Portuguese, Romanian, Russian, Samoan, Somali, Spanish, Tagalog, Thai, Tigrinya, Tongan, Turkish, Ukrainian, Urdu, Vietnamese.

Need to review the images grabbed: keep navigation, lose the A.D.A.M. pics. Replace some with CC images or lose them altogether.

Need to trim some stuff in each page: tone down the NIH branding while preserving attribution (lose the footer, for instance).
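
The footer trimming could be scripted along these lines. This is only a sketch: the `<div id="footer">` marker and the `strip_footer` / `strip_all` helpers are hypothetical, the real MedlinePlus markup would have to be inspected first, and the approach assumes the opening and closing tags each sit on their own line with no nested `</div>` in between.

```shell
#!/bin/sh
# Hypothetical markers: the scraped pages may use different footer markup.
FOOTER_START='<div id="footer">'
FOOTER_END='</div>'

# Delete every line from FOOTER_START through the next FOOTER_END in one file.
strip_footer() {
    tmp=$(mktemp) || return 1
    # \|...| uses "|" as the sed address delimiter so the "/" in </div> needs no escaping.
    sed "\\|$FOOTER_START|,\\|$FOOTER_END|d" "$1" > "$tmp" && cat "$tmp" > "$1"
    rm -f "$tmp"
}

# Apply strip_footer to every .html file below a directory.
strip_all() {
    find "$1" -type f -name '*.html' | while read -r f; do
        strip_footer "$f"
    done
}
```

Running `strip_all www.nlm.nih.gov` on a copy of the scrape (not the only copy) would be the cautious way to try it, since the edit is made in place.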

Need to figure out what to do with links that are the next hop. They go off to stuff that may be under copyright or be too US-specific. Can delete big chunks of the bottom of each page.
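
To see which next-hop links actually leave the site, something like the following could produce a list to review. It is a sketch under assumptions: the `offsite_links` name is made up here, only absolute `href="http..."` links are considered (relative links are treated as on-site), and the dots in the host name are matched loosely as regex wildcards.

```shell
#!/bin/sh
# Print the distinct absolute links under a directory that point away from a host.
#   $1 = directory holding the scraped pages
#   $2 = host name to treat as "on-site"
offsite_links() {
    dir=$1
    host=$2
    grep -rhoE 'href="https?://[^"]+"' "$dir" \
        | sed -e 's/^href="//' -e 's/"$//' \
        | grep -vE "^https?://$host(/|\$)" \
        | sort -u
}
```

For example, `offsite_links www.nlm.nih.gov www.nlm.nih.gov > offsite.txt` would leave a list that can be checked by hand for copyrighted or US-specific destinations.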

In the current form, should also allow going back and forth between English and Spanish.

It might make sense to globally restructure for easier i18n/l10n.

Get my grep and Perl skills polished for bulk global edits on whole directories of files at once; must find my copy of the llama book.

====[[Health_portal/Step3a|Go to next section - Step 3a]]====

====[[Health_portal|Return to Health portal main page]]====

[[Category:Health]]

Latest revision as of 22:09, 13 October 2008
Downloading raw materials
Starting with the content scrape
Execute these 4 commands (requires wget):
wget -rp -l1 -o logfile1 http://www.nlm.nih.gov/medlineplus/healthtopics.html
Cumulative downloaded: 185 files, 2.8M
wget -rp -l1 -o logfile2 http://www.nlm.nih.gov/medlineplus/spanish/healthtopics.html
Cumulative downloaded: 280 files, 5.2M
wget -rp -l1 -o logfile3 http://www.nlm.nih.gov/medlineplus/all_healthtopics.html
Cumulative downloaded: 1587 files, 54.6M
wget -rp -l1 -o logfile4 http://www.nlm.nih.gov/medlineplus/spanish/all_healthtopics.html
Cumulative downloaded: 2335 files, 75.2M
This retrieval is deliberately redundant: many files will be retrieved more than once, but "clobbering" (overwriting) is allowed, so it doesn't produce an excess of files.
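The four commands and the cumulative totals could be wrapped in a small script, sketched here. The `RUN_SCRAPE` switch and the function names are inventions for illustration: by default the script only prints the wget commands, and it calls wget only when `RUN_SCRAPE=1` is set. The count relies on wget's default behaviour of saving a recursive download under a directory named after the host.

```shell
#!/bin/sh
BASE="http://www.nlm.nih.gov/medlineplus"

# The four start pages, in the order used above.
scrape_urls() {
    printf '%s\n' \
        "$BASE/healthtopics.html" \
        "$BASE/spanish/healthtopics.html" \
        "$BASE/all_healthtopics.html" \
        "$BASE/spanish/all_healthtopics.html"
}

# Count regular files under a directory (the "cumulative downloaded" figure).
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Print (or, with RUN_SCRAPE=1, run) one wget per start page, logging to logfile1..4.
run_scrape() {
    n=1
    scrape_urls | while read -r url; do
        if [ "${RUN_SCRAPE:-0}" = "1" ]; then
            wget -rp -l1 -o "logfile$n" "$url" < /dev/null
        else
            echo "wget -rp -l1 -o logfile$n $url"
        fi
        n=$((n + 1))
    done
}

run_scrape
```

After a real run, `count_files www.nlm.nih.gov` and `du -sh www.nlm.nih.gov` give the cumulative file count and size. Because overwritten files are counted once, the cumulative totals grow by less than the per-run "Downloaded" figures.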
See the general version (not MedLinePlus specific) of a content bundling script being developed here:
http://wiki.laptop.org/go/Content_bundle_making_script
Consideration of Copyright issues
Good guidance here: http://www.nlm.nih.gov/medlineplus/faq/copyrightfaq.html
The homepage, the summaries on the Health Topics pages, the FAQs, and the same pages on MedlinePlus (en español) are all free of copyright and in the public domain by virtue of being U.S. Government works. There are, however, deeper link levels that present content covered by the copyright of various corporate subcontractors, and that content does not fall under the U.S. Government exemption from copyright.
Problematic areas
It would be really nice to have this, but it's probably not going to happen: "The A.D.A.M. Medical Encyclopedia includes over 4,000 articles about diseases, tests, symptoms, injuries, and surgeries. It also contains an extensive library of medical photographs and illustrations." Unfortunately, most of the images used on the health topic pages are from the non-free A.D.A.M. source. Mika may be able to help hunt down suitable replacements (see User:Mika#images), which would be very nice.
Maybe figure out a way to frame links to this site (only accessible with an internet connection, not in the off-line version). The idea would be to use something like the Google translation template links so that the request is for the site at NIH, but via the Google translation engine. This way, links are left in and people are sent to the original (in English), but also have a one-click option to get it via a Google-translated page. Yes, relying on Google to translate health ideas is not "a good thing", but it is better than not having any access at all to health information.
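The one-click translated link could be generated along these lines. This is a rough sketch: the `translate.google.com/translate?sl=...&tl=...&u=...` query form is an assumption that should be checked against what the GoogleTrans-en template actually emits, the `urlencode` and `translate_url` helpers are hypothetical, and the encoder only handles ASCII URLs.

```shell
#!/bin/sh
# Percent-encode everything except RFC 3986 unreserved characters (ASCII input only).
urlencode() {
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}          # the string minus its first character
        c=${s%"$rest"}       # the first character
        case $c in
            [A-Za-z0-9._~-]) out="$out$c" ;;
            *) out="$out$(printf '%%%02X' "'$c")" ;;
        esac
        s=$rest
    done
    printf '%s\n' "$out"
}

# Build a link asking Google to translate an English page.
#   $1 = target language code (e.g. es), $2 = original NIH URL
# Assumed URL format; verify before relying on it.
translate_url() {
    printf 'https://translate.google.com/translate?sl=en&tl=%s&u=%s\n' \
        "$1" "$(urlencode "$2")"
}
```

Each original link would then stay in place, with a second link such as `translate_url es http://www.nlm.nih.gov/medlineplus/healthtopics.html` offered next to it for readers who are online.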
X-Plain Tutorial (similar issues with using this content): http://www.patient-education.com/nlm/terms/
For follow-up
Other language content linked from here should be looked at:
http://www.nlm.nih.gov/medlineplus/languages/languages.html
Amharic, Arabic, Armenian, Bengali, Bosnian, Burmese, Chamorro, Chinese, Chuukese, Croatian, Farsi, French, French Creole, German, Gujarati, Hindi, Hmong, Ilocano, Italian, Japanese, Khmer, Kirundi, Korean, Kurdish, Laotian, Marshallese, Navajo, Panjabi, Polish, Portuguese, Romanian, Russian, Samoan, Somali, Spanish, Tagalog, Thai, Tigrinya, Tongan, Turkish, Ukrainian, Urdu, Vietnamese.