Data Mining for Viability -- Junk Monkey

From OLPC
Revision as of 05:21, 24 August 2007 by 221.134.238.247 (talk) (Intern name)

 

  • Interns - If you are interested in this project, add your name to the Interested interns section below along with a brief description of why you're interested and why you'd be a good intern for this project, along with any specific ideas for execution you might have beyond the project description.
  • Mentors - If you are interested in this project, add your name to the Interested mentors section below along with a brief description of why you're interested and why you'd be a good mentor for this project, along with any specific ideas for execution you might have beyond the project description.
  • Others - If you are interested in this project in a role other than that of potential mentor or potential intern (example: you are an organization, a potential end-user/tester, may have helpful resources, or want to be notified if the project is chosen), add your name to the Other interested parties section below with contact information and details.
  • Everyone - Contribute to the project description on this page, or discuss this project on the associated talk page (click the "discussion" tab on top).

The deadline for editing this proposal or adding yourself to the list is 11:59pm EST (GMT-5) on August 6, 2007.

Junk Monkey

OLPC's goal of providing computers to underdeveloped countries will be greatly supported by thin-client software. With an internet connection, users can access enormous amounts of information, tools and software that keep them informed of worldwide events and support local community efforts.


The rise of blogs, e-newsletters and online newspapers complements the rise of thin-client software. Yet much of the news we receive on the internet carries very little obvious metadata, such as background about the author or the subject of the story.


This has bothered me, and I believe it presents a major challenge for the future of information, not least in the developing world, where blogs and online newsletters will undoubtedly flourish as computing services become cheaper and more readily available. To address this problem, I'd like to program a plug-in for Firefox that mines the body text of news websites and cross-references the data against databases.


The proposed databases can be already existing (such as PubMed, Wikipedia or Sourcewatch) or stimulated by the formulation of such a plug-in (a database of all journalists and major blog writers, for example).
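As a rough illustration of the cross-referencing idea, here is a minimal Python sketch that runs outside the browser. It assumes candidate names can be approximated as runs of capitalized words, and it uses a small in-memory dictionary as a stand-in for a real database. The names `KNOWN_ENTITIES`, `CANDIDATE` and `cross_reference`, and every record shown, are hypothetical and not part of the proposal.

```python
import re

# Hypothetical stand-in for an external database (e.g. a table of journalists).
# A real plug-in would query a service instead; these entries are made up.
KNOWN_ENTITIES = {
    "jane example": "Journalist (placeholder record)",
    "acme institute": "Think tank (placeholder record)",
}

# Assumption: a run of two or more capitalized words is a candidate name.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def cross_reference(body_text, database=KNOWN_ENTITIES):
    """Return {candidate phrase: record} for phrases found in the database."""
    matches = {}
    for phrase in CANDIDATE.findall(body_text):
        record = database.get(phrase.lower())
        if record is not None:
            matches[phrase] = record
    return matches


article = (
    "A report by Jane Example, funded by the Acme Institute, "
    "was released in New Delhi."
)
print(cross_reference(article))
```

Phrases such as "New Delhi" are picked up as candidates but dropped because the stand-in database has no record for them. A real lookup would need fuzzier matching than an exact lowercase key.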


I believe the project will be very challenging theoretically, while not incredibly difficult technically. I've tested some basic algorithms by taking body text, creating a few rules, then analyzing the text against them. The results have been very promising and I hope to explore this idea with any interested peeps.
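The "few rules, then analyze" approach could look something like the Python sketch below. The rules here are invented for illustration and are not the ones used in the tests described above. `RULES` and `analyze` are hypothetical names, and the patterns are only one guess at what a useful rule might be.

```python
import re

# Hypothetical rules: each label maps to a pattern. Illustrative only;
# these are not the rules used in the proposer's own tests.
RULES = {
    "unattributed_source": re.compile(
        r"\b(?:sources say|experts say|it is said)\b", re.IGNORECASE
    ),
    "cites_study": re.compile(r"\b(?:study|survey|report)\b", re.IGNORECASE),
}


def analyze(body_text, rules=RULES):
    """Count how many times each rule's pattern appears in the text."""
    return {
        label: len(pattern.findall(body_text))
        for label, pattern in rules.items()
    }


sample = "Experts say the drug is safe. A study disagrees, sources say."
print(analyze(sample))
```

The per-rule counts could then decide which passages are worth cross-referencing against a database.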


Examples

Theoretical examples are here on the Project's wiki.

Interested interns

Intern name

Contact information, why you'd be good for the job, any specific plans, variants, or details you would personally like to implement and why

Hemant Goyal - contact information

Vijay Majumdar

Interested mentors

Mentor name

Contact information, why you'd be good for the job, any specific plans, variants, or details you would personally like to implement and why

Other interested parties

Coogan Brennan

coogan(dot)brennan(at)columbia(dot)edu -- project proposer