Talk:Content stamping

It seems like it might be unclear whether someone is making a review of an entire website, of just one web page, or of some set of web pages. It's sometimes (but not always) clear to humans; less so to computers.

Examples of ambiguities:

  • An interesting article that has been split across multiple pages. All the pages together form the work.
  • A blog that has timely information on a subject, e.g., current events. Future pages that don't yet exist may effectively fall under the review.
  • An entire website that has useful information.
  • A web application that can only really be used interactively; the form, not the content, is what is interesting.
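
One way to reduce the ambiguity is to make the intended scope an explicit part of the review record rather than something to be guessed afterward. A minimal sketch in Python, assuming a small invented vocabulary of scope kinds; nothing here is specified by the content stamping proposal:

    from dataclasses import dataclass
    from fnmatch import fnmatch

    @dataclass
    class Review:
        rating: int          # e.g. 1-5
        comment: str
        scope_kind: str      # "page", "prefix", or "pattern" -- assumed vocabulary
        scope_value: str     # exact URL, URL prefix, or glob pattern

        def covers(self, url):
            """Decide whether this review applies to a given URL."""
            if self.scope_kind == "page":
                return url == self.scope_value
            if self.scope_kind == "prefix":    # whole site, or a section of it
                return url.startswith(self.scope_value)
            if self.scope_kind == "pattern":   # multi-page article, future posts
                return fnmatch(url, self.scope_value)
            return False

    # A review of a multi-page article covers pages that share its pattern:
    article = Review(5, "Good overview", "pattern", "http://example.com/article/page-*")
    assert article.covers("http://example.com/article/page-2")

    # A review of an entire website covers everything under its prefix,
    # including pages that don't exist yet:
    site = Review(4, "Generally reliable", "prefix", "http://example.com/")
    assert site.covers("http://example.com/anything/here")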

-- Ian Bicking


Valuation functions and general automation seem complex and unreliable to me. In theory they could be useful, but in practice you need a lot of ratings to get something meaningful -- not because of the quality of the individual reviews (which may all be just fine), but because of the lack of a common rating standard, or even common criteria. So group A might find lots of new, interesting content, while group B is looking for a small set of content directly related to one subject area. The ratings of group A could be very helpful to group B, as they identify potentially interesting information. But the actual selection process is something group B wants to do themselves. Mixing the two groups still lets group B do all of their own selection, just over a smaller set of content that has had some basic vetting. Aggregating and weighting ratings across groups doesn't seem very useful. -- Ian Bicking
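
To make the distinction concrete, here is a minimal sketch of using another group's ratings only as a pre-filter, with the final selection still done locally; the names and the threshold are invented for illustration:

    def vetted_candidates(items, outside_ratings, min_rating=3):
        """Use another group's ratings only to shortlist items that
        have had some basic vetting; no cross-group aggregation."""
        return [item for item in items
                if outside_ratings.get(item, 0) >= min_rating]

    def select(candidates, matches_our_criteria):
        """The final selection is done entirely by our own criteria."""
        return [item for item in candidates if matches_our_criteria(item)]

    # Group A's ratings narrow the field; group B still decides for itself.
    ratings_from_group_a = {"http://example.com/a": 5, "http://example.com/b": 2}
    shortlist = vetted_candidates(ratings_from_group_a.keys(), ratings_from_group_a)
    chosen = select(shortlist, lambda url: "example.com/a" in url)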


Identifying the current revision seems a little difficult. In theory the Last-Modified header could be used to determine this, but many pages are aggregations of the specific content and general site content, and the site content is often updated. E.g., a sidebar is updated, which changes the Last-Modified of the entire page, even though the substance of the content does not change. Identifying the "real" content would be very useful, but there's no general way to do that. Only with site-specific screen scraping, some microformat (though I don't know of one currently), maybe RDF, etc., could we identify a meaningful revision or modification date for an item. However, we could build up a set of such patterns, along with something like RDF for new content that is specifically designed to be read by this system. -- Ian Bicking
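
As a sketch of one workaround: instead of trusting Last-Modified, hash only the extracted "real" content, so that sidebar churn does not register as a new revision. The extraction rule below is a placeholder; real use would need the per-site scraping patterns, microformat, or RDF discussed above:

    import hashlib
    import re

    def content_fingerprint(html, extract=None):
        """Fingerprint only the substantive content of a page, so that
        unrelated site-chrome changes do not look like a new revision."""
        if extract is None:
            # Placeholder site-specific rule: grab a main-content div.
            match = re.search(r'<div id="content">(.*?)</div>', html, re.S)
            body = match.group(1) if match else html
        else:
            body = extract(html)
        return hashlib.sha1(body.encode("utf-8")).hexdigest()

    # Two fetches differ only in the sidebar: same fingerprint, so the
    # reviewed item counts as the same revision.
    page_v1 = '<div id="sidebar">old ads</div><div id="content">Article text</div>'
    page_v2 = '<div id="sidebar">new ads</div><div id="content">Article text</div>'
    assert content_fingerprint(page_v1) == content_fingerprint(page_v2)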