Beyond keyword search for data sources on the World Wide Web
Related categories: High quality and secure programming | WWW
Matthew Michelson, C. Knoblock
Article date: 2006-01-18 16:25:30
One of the most important features of the World Wide Web is its ability to empower users with lots of information. However, much of this information is still unorganized and inaccessible beyond a simple keyword search. In this article the authors focus on annotating data sources that are unstructured and ungrammatical.
Matthew Michelson is a student at the University of Southern California. Craig A. Knoblock is a senior project leader at USC's Information Sciences Institute and a research associate professor at USC. Contacts: {michelso,knoblock}@isi.edu
One of the most important features of the World Wide Web is its ability to empower users with lots of information. However, much of this information is still unorganized and inaccessible beyond a simple keyword search. For example, consider the auction site eBay (http://www.ebay.com). A user who wants to determine the average asking price for an item, or to count how many times it is listed, has to submit a keyword search, retrieve all of the listings, filter out those that do not apply, and work out the answer item by item. A keyword search also misses every listing in which the item's name is misspelled. Clearly, keyword search is not powerful enough to yield interesting answers about the data, especially if we prefer to have programs determine these answers rather than users.

It is preferable to embed within the eBay listings a mechanism that allows them to be queried in a structured manner. Determining the average price or count of an item then becomes a simple, one-line query that even a program could perform.

This article presents one such method for allowing data sources, such as eBay, to support structured queries. We call each listing a post, and the goal is to extract from each post the embedded attributes that describe the entity. This extraction is more formally known as Information Extraction (IE). Using our eBay example, suppose the posts are about cars. The attributes would then be descriptive pieces about the car, such as the make, model or year, and performing IE on a post would let us pick out all of the relevant attributes, even if they are misspelled and regardless of where they appear in the text of the post. Once we extract the attributes, we can add tags around them and query the data source using these tags as the query schema. The addition of these tags is known as annotation.
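To make the idea concrete, here is a minimal Python sketch of extraction, annotation and a structured query over a few car posts. It is an illustration only and not the authors' method: it swaps in simple fuzzy string matching from the standard library's `difflib`. The sample posts, the `KNOWN_MAKES` and `KNOWN_MODELS` lists, the tag format, and the helper names `closest`, `extract`, `annotate` and `query` are all assumptions introduced for this example.

```python
import difflib
import re
from statistics import mean

# Hypothetical reference values and posts, invented for this sketch.
KNOWN_MAKES = ["honda", "toyota", "ford"]
KNOWN_MODELS = ["civic", "accord", "camry", "focus"]
POSTS = [
    "2001 Hunda Civic LX, runs great $4,500",
    "Toyota Camrey 1999 low miles $5,500",
    "honda civic 2001 $3,500",
]

def closest(token, candidates, cutoff=0.75):
    """Return the candidate most similar to token, or None if none is close enough."""
    matches = difflib.get_close_matches(token.lower(), candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

def extract(post):
    """Pick out make, model, year and price, wherever they appear in the post."""
    attrs = {}
    for token in re.findall(r"[A-Za-z]+", post):
        for name, known in (("make", KNOWN_MAKES), ("model", KNOWN_MODELS)):
            if name not in attrs:
                match = closest(token, known)
                if match:
                    attrs[name] = match
                    break
    # A four-digit year that is not part of a dollar amount.
    year = re.search(r"(?<!\$)\b(?:19|20)\d{2}\b", post)
    if year:
        attrs["year"] = int(year.group())
    price = re.search(r"\$(\d[\d,]*)", post)
    if price:
        attrs["price"] = int(price.group(1).replace(",", ""))
    return attrs

def annotate(post):
    """Append the extracted attributes to the post as tags (assumed tag format)."""
    tags = " ".join(f"<{k}>{v}</{k}>" for k, v in extract(post).items())
    return f"{post} {tags}"

def query(records, **conditions):
    """Return the records whose attributes equal every given condition."""
    return [r for r in records if all(r.get(k) == v for k, v in conditions.items())]

records = [extract(p) for p in POSTS]
civics = query(records, make="honda", model="civic")
average_price = mean(r["price"] for r in civics)
count = len(civics)
```

With the attributes extracted, the average price and the count each reduce to a single expression over `civics`, and the misspelled "Hunda" post is included because matching is approximate rather than exact. The similarity cutoff of 0.75 is an arbitrary choice for this sketch; a real data source would need it tuned to avoid false matches.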
The overall approach to annotation is shown in Figure 1, using an example post from eBay. The different parts of this figure are described later in the article, but it is useful to point out the examples of annotation at the bottommost black box of the figure.
Figure 1. The points of level of similitude registration
Copyright © 2006 by Software Developer's Journal. All rights reserved.















