Health portal/Step3b
The abdominalpain.html topic page dissected
See http://www.nlm.nih.gov/medlineplus/abdominalpain.html and use "View source" to see the original.
ACTION: Apart from the few header lines needed for valid HTML, all of the following header section will be removed.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html lang="EN"> <head> <script type="text/javascript"> imageNames = false; </script> <link rel="stylesheet" href="http://www.nlm.nih.gov/medlineplus/images/stylesheet.css" type="text/css"> <link rel="shortcut icon" href="http://www.nlm.nih.gov/medlineplus/images/favicon.ico" type="image/x-icon"> <!--[if IE]> <style type="text/css"> .alist ul { margin-top:3;} </style> <![endif]--> <style type="text/css"> @import url(http://www.nlm.nih.gov/medlineplus/images/advanced.css); @import url(http://www.nlm.nih.gov/medlineplus/images/header.css); @import url(http://www.nlm.nih.gov/medlineplus/images/menubutton.css); </style>
ACTION: This next section contains the page-specific title and the Dublin Core metadata. It would be nice to find a suitable way to preserve this metadata because it might be useful later, for instance if something like the DITA architecture being explored by Projects/Wikislice were used for content management.
<title>MedlinePlus: Abdominal Pain</title> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> <meta http-equiv="Content-Language" content="en-us"> <meta name = "description" content ="Abdominal Pain"> <!-- meta data --> <meta name="keywords" content="Abdominal Pain, Pain, Abdominal"> <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" title="The Dublin Core metadata Element Set"> <meta name="DC.Title" content="Abdominal Pain"> <meta name="DC.Title.Alternate" content="Pain, Abdominal"> <meta name="DC.Subject.MeSH" content="Abdominal Pain"> <meta name="DC.Subject.MeSH" content="Digestive System"><meta name="DC.Subject.MeSH" content="Digestive System Diseases"><meta name="DC.Subject.MeSH" content="Pathologic Processes"><meta name="DC.Subject.MeSH" content="Signs and Symptoms"> <meta name="DC.Relation.IsPartOf" content="Digestive System"><meta name="DC.Relation.IsPartOf" content="Symptoms"> <meta name="DC.Identifier.URL" content="http://www.nlm.nih.gov/medlineplus/abdominalpain.html"> <meta name="DC.Publisher" content="National Library of Medicine"> <meta name="DC.Language" content="eng"> <meta name="DC.Type" content="Text"> <meta name="DC.Date.Modified" content="2008-06-25"> <meta name="NLMDC.Date.Modified.Major" content="2008-05-13"> <meta name="DC.Date.Created" content="2003-01-07">
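One possible way to preserve this metadata is to pull the title and the Dublin Core `<meta>` tags into a simple dictionary that can be stored alongside the trimmed page. The following is a minimal Python sketch, not part of the project's actual tooling; the class name `DublinCoreCollector` and the function `extract_metadata` are hypothetical, and the choice to keep only names starting with `DC.` or `NLMDC.` is an assumption based on the tags shown above.

```python
from html.parser import HTMLParser


class DublinCoreCollector(HTMLParser):
    """Collect the <title> text and every DC.* / NLMDC.* <meta> tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.metadata = {}        # meta name -> list of content values
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = attr.get("name") or ""
            if name.startswith(("DC.", "NLMDC.")):
                # Names such as DC.Subject.MeSH repeat, so keep a list.
                self.metadata.setdefault(name, []).append(attr.get("content") or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html_text):
    """Return (title, metadata dict) for one topic page."""
    collector = DublinCoreCollector()
    collector.feed(html_text)
    return collector.title.strip(), collector.metadata
```

Because several names (for example `DC.Subject.MeSH`) occur more than once, each name maps to a list of values rather than a single string.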
ACTION: The JavaScript below can be deleted, as it does things that are specific to the MedlinePlus site as hosted at NLM.
<script type="text/javascript" src="http://www.nlm.nih.gov/medlineplus/images/static.js"></script> <script type="text/javascript" src="http://www.nlm.nih.gov/medlineplus/images/mplus_en_survey.js"></script> <link rel="alternate" type="application/rss+xml" title="What's New on MedlinePlus" href="http://www.nlm.nih.gov/medlineplus/feeds/whatsnew_en.xml">
ACTION: The tag below marks the end of the header section and should be kept. An additional global header (besides the page-specific title and metadata) will be developed, perhaps containing a small icon.
</head>
ACTION: The next part is the NLM header branding, which should be left out. Attribution and a link to the original source will be added back in below, probably just above the new footer.
<MAP NAME="nlmnih"><AREA SHAPE="rect" title="U.S. National Library of Medicine" alt="U.S. National Library of Medicine" COORDS="200,14,420,25" HREF="http://www.nlm.nih.gov" onClick="leavemplus('theURL=http%3A%2F%2Fwww%2Enlm%2Enih%2Egov','us'),openOutWin('')" target="TheNewWin"><AREA SHAPE="rect" title="National Institutes of Health" alt="National Institutes of Health" COORDS="219,35,419,45" HREF="http://www.nih.gov" onClick="leavemplus('theURL=http%3A%2F%2Fwww%2Enih%2Egov'),openOutWin('')" target="TheNewWin"></MAP>
ACTION: The start-of-body tag should be preserved.
<body>
ACTION: The IE-specific conditional markup below can probably go away.
<!--[if !IE 7]> <style> div#body{ /*width:expression;*/ width:expression(document.body.clientWidth > 1280 ? "1250px" : (document.body.clientWidth < 815 ? "800px" : "99%" ) ); } </style> <![endif]--> <div id="skipnav"> <a href="#skip" title="Skip navigation links and go to the page content">Skip navigation</a> </div>
ACTION: The HTML below produces most of the tabs (which lead to non-free content); it will be deleted.
<div id="body"> <table id="toptable"> <tr> <td><a href="http://medlineplus.gov/"><IMG SRC="http://www.nlm.nih.gov/medlineplus/images/healthtopics_banner.gif" title="MedlinePlus Trusted Health Information for You" ALT="MedlinePlus Trusted Health Information for You" BORDER="0" HEIGHT="65" WIDTH="325"></a></td> <td width="100%" height="65" background="http://www.nlm.nih.gov/medlineplus/images/2banner_slice.gif"><img src="http://www.nlm.nih.gov/medlineplus/images/2banner_slice.gif" width="100%" height="65" border="0" title="MedlinePlus Trusted Health Information for You" ALT="MedlinePlus Trusted Health Information for You"></td> <td><img src="http://www.nlm.nih.gov/medlineplus/images/medlineplus_secondary.gif" border="0" height="65" width="425" USEMAP="#nlmnih" title="MedlinePlus Trusted Health Information for You" ALT="MedlinePlus Trusted Health Information for You"></td> </tr> </table> <div id="topcontainer"> <form method="get" action="http://vsearch.nlm.nih.gov/vivisimo/cgi-bin/query-meta" title="Site Search input" target="_self" id="searchform" name="searchform"> <input type="hidden" name="v:project" value="medlineplus"> <input type="text" name="query" size="20" maxlength="250" style="height:24px;"> <input type="image" src="http://www.nlm.nih.gov/medlineplus/images/search_medplus.gif" align="absmiddle" border="0" title="Search MedlinePlus" alt="Search MedlinePlus"></form> <a href="http://apps.nlm.nih.gov/medlineplus/contact/index.cfm?lang=en&from=http%3A%2F%2Fw ww%2Enlm%2Enih%2Egov%2Fmedlineplus%2Fabdominalpain%2Ehtml"><img src="http://www.nlm.nih.gov/medlineplus/images/contactus.gif" title="Contact Us" alt="Contact Us"></a> <a href="http://www.nlm.nih.gov/medlineplus/faq/faq.html"><img src="http://www.nlm.nih.gov/medlineplus/images/faq.gif" title="FAQs" alt="FAQs"></a> <a href="http://www.nlm.nih.gov/medlineplus/sitemap.html"><img src="http://www.nlm.nih.gov/medlineplus/images/sitemap.gif" title="Site Map" alt="Site Map" ></a> <a 
href="http://www.nlm.nih.gov/medlineplus/aboutmedlineplus.html"><img src="http://www.nlm.nih.gov/medlineplus/images/about.gif" title="About MedelinePlus" alt="About MedelinePlus"></a> </div> <div style="clear:both;"></div> <div id="mainmenu"> <a href="http://www.nlm.nih.gov/medlineplus/medlineplus.html" title="Home" id="tab0" ></a> <a href="http://www.nlm.nih.gov/medlineplus/healthtopics.html" title="Health Topics" id="stab21" style="cursor:pointer;"></a> <a href="http://www.nlm.nih.gov/medlineplus/druginformation.html" title="Drugs & Supplements" id="tab20" ></a> <a href="http://www.nlm.nih.gov/medlineplus/encyclopedia.html" title="Medical Encyclopedia" id="tab33" ></a> <a href="http://www.nlm.nih.gov/medlineplus/mplusdictionary.html" title="Dictionary" id="tab1" ></a> <a href="http://www.nlm.nih.gov/medlineplus/newsbydate.html" title="News" id="tab38" ></a> <a href="http://www.nlm.nih.gov/medlineplus/directories.html" title="Directories" id="tab11" ></a> <a href="http://www.nlm.nih.gov/medlineplus/otherresources.html" title="Other Resources" id="tab22" ></a> </div> <div style="float:right;"> <a href="http://www.nlm.nih.gov/medlineplus/spanish/abdominalpain.html"><img src="http://www.nlm.nih.gov/medlineplus/images/espanol.gif" title="español" alt="español" border="0"></a> </div> <div style="clear:left;"></div> <style type="text/css"> @import url("http://www.nlm.nih.gov/medlineplus/images/consumer_health_20.css"); </style> <script language="JavaScript"> function onLoad() { if ( !window.opener ) return; if (typeof window.opener.siteEntryWin != "undefined") { if (!window.opener.siteEntryWin.closed) { if (typeof window.opener.siteEntryWin.RestartTimer != "undefined") { window.opener.siteEntryWin.RestartTimer(); } } } } </script> <p>
ACTION: This next section poses an interesting problem. In combination with the cascading style sheet, it generates the alphabetical link list. It would be nice to have such a feature, but how should it work in multiple languages? If you have any ideas, please leave them on the talk page. In any event, this block is large enough that it should probably not be reproduced hundreds of times, so it will be deleted, but some thought needs to go into replacing its function in some way.
<h5 class="menutitle">Other Health Topics:</h5> <div class="alist"> <ul> <li><a href="http://www.nlm.nih.gov/medlineplus/healthtopics_a.html">A</a></li> <li><a href="http://www.nlm.nih.gov/medlineplus/healthtopics_b.html">B</a></li> <li><a href="http://www.nlm.nih.gov/medlineplus/healthtopics_c.html">C</a></li> <li><a href="http://www.nlm.nih.gov/medlineplus/healthtopics_d.html">D</a></li> <li><a href="http://www.nlm.nih.gov/medlineplus/healthtopics_e.html">E</a></li> <li><a href="http://www.nlm.nih.gov/medlineplus/healthtopics_f.html">F</a></li> <li><a href="http://www.nlm.nih.gov/medlineplus/healthtopics_g.html">G</a></li> <li><a href="http://www.nlm.nih.gov/medlineplus/healthtopics_h.html">H</a></li> <li><a href="http://www.nlm.nih.gov/medlineplus/healthtopics_i.html">I</a></li> <li><a href="http://www.nlm.nih.gov/medlineplus/healthtopics_j.html">J</a></li> <li><a href="http://www.nlm.nih.gov/medlineplus/healthtopics_k.html">K</a></li> <li><a href="http://www.nlm.nih.gov/medlineplus/healthtopics_l.html">L</a></li> <li><a href="http://www.nlm.nih.gov/medlineplus/healthtopics_m.html">M</a></li> <li><a href="http://www.nlm.nih.gov/medlineplus/healthtopics_n.html">N</a></li> <li><a href="http://www.nlm.nih.gov/medlineplus/healthtopics_o.html">O</a></li> <li><a href="http://www.nlm.nih.gov/medlineplus/healthtopics_p.html">P</a></li> <li><a href="http://www.nlm.nih.gov/medlineplus/healthtopics_q.html">Q</a></li> <li><a href="http://www.nlm.nih.gov/medlineplus/healthtopics_r.html">R</a></li> <li><a href="http://www.nlm.nih.gov/medlineplus/healthtopics_s.html">S</a></li> <li><a href="http://www.nlm.nih.gov/medlineplus/healthtopics_t.html">T</a></li> <li><a href="http://www.nlm.nih.gov/medlineplus/healthtopics_u.html">U</a></li> <li><a href="http://www.nlm.nih.gov/medlineplus/healthtopics_v.html">V</a></li> <li><a href="http://www.nlm.nih.gov/medlineplus/healthtopics_w.html">W</a></li> <li><a href="http://www.nlm.nih.gov/medlineplus/healthtopics_xyz.html">XYZ</a></li> 
<li><a href="http://www.nlm.nih.gov/medlineplus/all_healthtopics.html">List of All Topics</a></li> </ul> </div>
ACTION: We don't need the "printer friendly" or "e-mail to a friend" features. In addition, the images associated with topics are mostly non-free, so this section gets deleted. It might be possible to find or create other images to use, but only if they add meaning; otherwise they just take up space.
<div id="sidecolumn"> <div id="topimage"> <a href="http://www.nlm.nih.gov/medlineplus/print/abdominalpain.html" onclick="openNakedWin('PrintWin');" target="PrintWin"><img src="http://www.nlm.nih.gov/medlineplus/images/print_version.gif" title="Printer-friendly version" alt="Printer-friendly version"/></a> <a href="http://www.nlm.nih.gov/cgi/medlineplus/email_request.pl?refPage=http%3A%2F%2Fwww% 2Enlm%2Enih%2Egov%2Fmedlineplus%2Fabdominalpain%2Ehtml&emailTitle=Abdominal%20Pain" onclick="openNakedWin('EmailReqWin')" target="EmailReqWin"><img src="http://www.nlm.nih.gov/medlineplus/images/email_version.gif" title="E-mail this page to a friend" alt="E-mail this page to a friend"/></a> </div> <div id="topicimage"> <img src="http://www.nlm.nih.gov/medlineplus/images/stomacheache.jpg" title="Photograph of a woman clutching her stomach in pain" alt="Photograph of a woman clutching her stomach in pain" height="230" width="222"></img> </div>
ACTION: The "Related Topics" information should be kept. These are links to other pages within the bundle. They will need to be refactored to work in the new page structure and for offline use.
<div id="tabbedBoxes"> <ul> <li class="tabbedBox"> <h3>Related Topics</h3> <ul> <li><span><a href="http://www.nlm.nih.gov/medlineplus/pain.html">Pain</a></span></li> <li><span><a href="http://www.nlm.nih.gov/medlineplus/digestivesystem.html">Digestive System</a></span></li> <li><span><a href="http://www.nlm.nih.gov/medlineplus/symptoms.html">Symptoms</a></span></li> </ul> </li>
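As an illustration of how the "Related Topics" links might be refactored for offline use, the Python sketch below rewrites absolute MedlinePlus topic URLs to bare local file names, but only when the target page is actually in the bundle. This is a sketch under stated assumptions: the function names `localize_href` and `localize_links`, the flat directory layout, and the `bundled_pages` set are all hypothetical and not taken from the project.

```python
import re
from urllib.parse import urlparse

# Assumption: bundled topic pages live in one flat directory, under the
# same file name they have on the NLM site.
MEDLINEPLUS_HOST = "www.nlm.nih.gov"
MEDLINEPLUS_PREFIX = "/medlineplus/"

HREF_RE = re.compile(r'href="([^"]*)"')


def localize_href(href, bundled_pages):
    """Map an absolute topic URL to a local file name if it is bundled."""
    parsed = urlparse(href)
    if parsed.netloc == MEDLINEPLUS_HOST and parsed.path.startswith(MEDLINEPLUS_PREFIX):
        filename = parsed.path[len(MEDLINEPLUS_PREFIX):]
        # Only top-level topic pages qualify; subdirectories such as
        # spanish/ are left untouched.
        if "/" not in filename and filename in bundled_pages:
            return filename
    return href


def localize_links(html_fragment, bundled_pages):
    """Rewrite every double-quoted href in the fragment."""
    return HREF_RE.sub(
        lambda m: 'href="%s"' % localize_href(m.group(1), bundled_pages),
        html_fragment,
    )
```

Links to pages that are not in the bundle are left as absolute URLs here; whether to drop them or mark them as online-only is a separate design decision.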
ACTION: The "Go Local" section is entirely U.S.-specific and needs to be removed.
<li class="tabbedBox"> <h3>Go Local</h3> <ul style="list-style-type:none;padding-left:11px;padding-top:10px;list-style-position: outside;"> <li> Services and providers for <strong>Abdominal Pain</strong> in the U.S. </li> <li> <div class="golocal20"> <form name="goLocal" action="http://www.nlm.nih.gov/medlineplus/golocal/topicmap_3061.html" onsubmit="return false;"> <select name="state"><option value="">Select Location</option> <option value="http://apps.nlm.nih.gov/medlineplus/local/alabama/list_location.cfm?areaid=3&ser vice_type=topic&invokedby=services&service_id=351&ntopic_id=3061">AL - Alabama</option> <option value="http://apps.nlm.nih.gov/medlineplus/local/arkansas/list_location.cfm?areaid=35&s ervice_type=topic&invokedby=services&service_id=351&ntopic_id=3061">AR - Arkansas</option> ++++++++++++++++++++++++++++++++++++++++++++++ Arizona to Vermont removed for this dissection ++++++++++++++++++++++++++++++++++++++++++++++ <option value="http://apps.nlm.nih.gov/medlineplus/local/wyoming/list_location.cfm?areaid=5&ser vice_type=topic&invokedby=services&service_id=351&ntopic_id=3061">WY - Wyoming</option> </select> <input class="img" src="http://www.nlm.nih.gov/medlineplus/images/glgo.gif" alt="Go" type="image" onclick="goLocalPage(document.forms.goLocal.state.options[state.selectedIndex].value);" > <div> <a href="http://www.nlm.nih.gov/medlineplus/golocal/topicmap_3061.html">Select from map</a> </div> </form> </li> </ul> </li> </ul> </div> </div>
ACTION: We finally get to the "Main Content" section. This is the good stuff; it seems far too short compared to the rest of the page.
<div id="maincontent"> <div id="skip"> Abdominal Pain </div> <p></p> <div id="synonyms"> Also called: Bellyache </div> <span id="tpsummary"> <p> Your abdomen extends from below your chest to your groin. Some people call it the stomach, but your abdomen contains many other important organs. Pain in the abdomen can come from any one of them. The pain may start somewhere else, such as your chest. Severe pain doesn't always mean a serious problem. Nor does mild pain mean a problem is not serious. </p> <p> Call your healthcare provider if mild pain lasts a week or more or if you have pain with other symptoms. Get medical help immediately if </p> <ul> <li>You have abdominal pain that is sudden and sharp</li> <li>You also have pain in your chest, neck or shoulder</li> <li>You're vomiting blood or have blood in your stool</li> <li>Your abdomen is stiff, hard and tender to touch</li> <li>You can't move your bowels, especially if you're also vomiting</li> </ul> </span>
ACTION: This "Start Here" section and all the sections below it, down to the footer, are too U.S.-specific and need to be removed. Ideally, local links can be placed here by the local Ministry of Health. For some topics, it may be worth the effort to research suitable links to the WHO (or, for South America, PAHO) websites.
- Sometimes there is a "Languages" or a "Games" section. This might be worth preserving in some form, but exactly how to use it needs to be worked out (keep the link, or harvest localized text from the linked page).
<p></p> <div id="_start_"> <span class="categoryname"><a name="cat51"></a>Start Here</span> <ul class="bulletlist"> <li> <span><a href="http://familydoctor.org/online/famdocen/home/tools/symptom/528.printerview.html" onclick="leavemplus('theURL=http%3A%2F%2Ffamilydoctor%2Eorg%2Fonline%2Ffamdocen%2Fhome% 2Ftools%2Fsymptom%2F528%2Eprinterview%2Ehtml','us');openOutWin(this.href);" target="TheNewWin" >Abdominal Pain, Long-Term</a></span><span class="orgs">(American Academy of Family Physicians)</span> </li> ++++++++++++++++++++++++++++++++++++++++++ A whole lot of stuff removed for this dissection ++++++++++++++++++++++++++++++++++++++++++ <li> <span><a href="http://www.mayoclinic.com/print/mittelschmerz/DS00507/DSECTION=all&METHOD=print" onclick="leavemplus('theURL=http%3A%2F%2Fwww%2Emayoclinic%2Ecom%2Fprint%2Fmittelschmerz %2FDS00507%2FDSECTION%3Dall%26METHOD%3Dprint','us');openOutWin(this.href);" target="TheNewWin" >Mittelschmerz</a></span><span class="orgs">(Mayo Foundation for Medical Education and Research)</span> </li> </ul> <a href="#skip" class="toTop">Return to top</a> </li> </ul> </div> <div class="spacer"></div>
ACTION: The footer will mostly be replaced by an attribution section preserving some of this info and by a newly designed XO Health Portal footer. The closing body and html tags will be needed.
<table class="footer" cellspacing="0" cellpadding="0"> <tr> <td colspan="2" class="tabLinks"><a href="http://medlineplus.gov">Home</a> | <a href="http://www.nlm.nih.gov/medlineplus/healthtopics.html">Health Topics</a> | <a href="http://www.nlm.nih.gov/medlineplus/druginformation.html">Drugs & Supplements</a> | <a href="http://www.nlm.nih.gov/medlineplus/encyclopedia.html">Encyclopedia</a> | <a href="http://www.nlm.nih.gov/medlineplus/mplusdictionary.html">Dictionary</a> | <a href="http://www.nlm.nih.gov/medlineplus/newsbydate.html">News</a> | <a href="http://www.nlm.nih.gov/medlineplus/directories.html">Directories</a> | <a href="http://www.nlm.nih.gov/medlineplus/otherresources.html">Other Resources</a> </td> </tr> <tr class="footBody"> <td><a href="http://www.nlm.nih.gov/medlineplus/copyright.html">Copyright</a> | <a href="http://www.nlm.nih.gov/medlineplus/privacy.html">Privacy</a> | <a href="http://www.nlm.nih.gov/medlineplus/accessibility.html">Accessibility</A> | <a href="http://www.nlm.nih.gov/medlineplus/criteria.html">Quality Guidelines</a><br> <a href="http://www.nlm.nih.gov" onclick="leavemplus('theURL=http%3A%2F%2Fwww%2Enlm%2Enih%2Egov','us');openOutWin('')" target="TheNewWin">U.S. National Library of Medicine</A>, 8600 Rockville Pike, Bethesda, MD 20894 <br> <a href="http://www.nih.gov" onClick="leavemplus('theURL=http%3A%2F%2Fwww%2Enih%2Egov','us');openOutWin('')" target="TheNewWin">National Institutes of Health</A> | <a href="http://www.hhs.gov/" onClick="leavemplus('theURL=http%3A%2F%2Fwww%2Ehhs%2Egov%2F','us');openOutWin('')" target="TheNewWin"> Department of Health & Human Services</a> </td> <td class="updated">Date last updated: 25 June 2008 <br>Topic last reviewed: 13 May 2008</td> </tr> </table> </div> </body> </html>
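To give a rough idea of how the attribution section could reuse information from the original footer, the Python sketch below pulls the two dates out of the footer text and builds a short attribution block. The function name `build_attribution`, the `attribution` div id, and the exact wording of the output are assumptions made for illustration only; the real attribution text still has to be decided.

```python
import re

# The date format ("25 June 2008") is taken from the footer shown above.
UPDATED_RE = re.compile(r"Date last updated:\s*(\d{1,2} \w+ \d{4})")
REVIEWED_RE = re.compile(r"Topic last reviewed:\s*(\d{1,2} \w+ \d{4})")


def build_attribution(footer_html, source_url):
    """Return a small HTML attribution block for one topic page."""
    parts = [
        'Source: <a href="%s">MedlinePlus</a>, '
        "U.S. National Library of Medicine." % source_url
    ]
    updated = UPDATED_RE.search(footer_html)
    reviewed = REVIEWED_RE.search(footer_html)
    # Either date may be missing on some pages, so add each one only if found.
    if updated:
        parts.append("Date last updated: %s." % updated.group(1))
    if reviewed:
        parts.append("Topic last reviewed: %s." % reviewed.group(1))
    return '<div id="attribution">%s</div>' % " ".join(parts)
```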
Automating the parsing of health topic pages
User:Karmaflux very kindly provided an exemplar Perl script for running through and parsing these health topic HTML pages. This script is now being modified to handle some additional special cases, e.g. sections such as Games or Languages that are useful but not found on every health topic page. The basic logic is to turn on accumulation of text lines when certain unique start-flag strings are recognized, and to turn accumulation off when other lines (recognizable as the end of a useful section) are seen. The script will also add back the header, footer, and housekeeping HTML (head and body tags), so that the HTML pages as downloaded from NLM are automatically converted into pages trimmed down to the essentials for the XO health portal.
The Perl source code of the parsing script will be posted here under the MIT license when it is ready in final form. Cjl 05:13, 10 July 2008 (UTC)
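The start-flag and stop-flag logic described above could be sketched roughly as follows. This is an illustrative Python sketch, not the Perl script itself; the function names `extract_sections` and `build_page`, and the particular flag strings used in the usage example, are assumptions chosen to match the dissection above.

```python
def extract_sections(lines, start_flags, stop_flags):
    """Keep only the lines between a start flag and the next stop flag.

    The line containing a start flag is kept; the line containing a
    stop flag is dropped.
    """
    kept = []
    accumulating = False
    for line in lines:
        if not accumulating and any(flag in line for flag in start_flags):
            accumulating = True
        elif accumulating and any(flag in line for flag in stop_flags):
            accumulating = False
        if accumulating:
            kept.append(line)
    return kept


def build_page(title, kept_lines):
    """Wrap the kept lines in minimal housekeeping HTML."""
    header = ["<html>", "<head>", "<title>%s</title>" % title, "</head>", "<body>"]
    footer = ["</body>", "</html>"]
    return "\n".join(header + kept_lines + footer)


if __name__ == "__main__":
    # Hypothetical flags: keep the main content, stop at "Start Here".
    page_lines = [
        '<div id="toptable">',
        '<div id="maincontent">',
        "<p>Your abdomen extends from below your chest to your groin.</p>",
        '<div id="_start_">',
        "<li>Start Here</li>",
    ]
    kept = extract_sections(page_lines, ['<div id="maincontent">'], ['<div id="_start_">'])
    print(build_page("Abdominal Pain", kept))
```

Note that cutting a page at flag lines can leave tags such as `<div>` unclosed, so a real script would also need to close or balance the elements it keeps.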