|
|
Meta-Data Jones and the Tower of Babel - The Challenge of Large-Scale Semantic Heterogeneity
The popularity and growth of the "Information SuperHighway" (e.g., the Web) have dramatically increased the number of information sources available for use and the opportunity for important new information-intensive applications (e.g., massive data warehouses, integrated supply chain management, global risk management, in-transit visibility). Unfortunately, significant challenges in data extraction and data interpretation must be overcome before this opportunity can be realized. Data Extraction: One problem is the difficulty of easily and automatically extracting very specific data elements from Web sites for use by operational systems. New technologies, such as XML and Web Querying/Wrapping, offer possible solutions to this problem. Data Interpretation: Another serious problem is the existence of heterogeneous contexts, whereby each SOURCE of information and each potential RECEIVER of that information may operate with a different context, leading to large-scale semantic heterogeneity. A context is the collection of implicit assumptions about the context definition (i.e., meaning) and context characteristics (i.e., quality) of the information. As a simple example, whereas most US universities grade on a 4.0 scale, MIT uses a 5.0 scale, which poses a problem when comparing student GPAs. Another typical example is the extraction of price information from the Web: is the price in dollars or yen (and if dollars, US dollars or Hong Kong dollars)? Does it include taxes? Does it include shipping? And does that match the receiver's assumptions? In this paper, examples of important context challenges will be presented and the critical role of metadata, in the form of context knowledge, will be discussed.  | This keynote address to the IEEE Third Metadata conference highlights the need for strict semantic validation of Web documents, but does not explicitly address schemas. 
Instead, it focusses on the problem of data extraction from heterogeneous (legacy) sources and data interchange. |  |
|
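The price example in the abstract can be made concrete with a small sketch. The code below is an illustration only, not the mechanism proposed in the paper: it assumes a context can be written down as explicit metadata (currency, whether tax is included, whether shipping is included) and that a price is mediated from the source's context to the receiver's. The names `PriceContext`, `convert_price`, and `RATES_PER_USD`, the exchange rates, and the rule that tax applies to the goods price only are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceContext:
    """Explicit record of assumptions that are usually left implicit."""
    currency: str            # e.g. "USD", "HKD", "JPY"
    includes_tax: bool
    includes_shipping: bool


# Hypothetical exchange rates: units of each currency per one US dollar.
RATES_PER_USD = {"USD": 1.0, "HKD": 8.0, "JPY": 100.0}


def convert_price(amount, source, receiver, tax_rate, shipping):
    """Re-express `amount` from the source context in the receiver context.

    Assumptions of this sketch: `tax_rate` is a fraction applied to the
    goods price only, and `shipping` is a flat, untaxed fee given in the
    source currency.
    """
    # Step 1: reduce the quoted amount to a bare goods price.
    bare = amount
    if source.includes_shipping:
        bare -= shipping
    if source.includes_tax:
        bare /= 1 + tax_rate

    # Step 2: convert the bare price and the shipping fee between currencies.
    factor = RATES_PER_USD[receiver.currency] / RATES_PER_USD[source.currency]
    bare *= factor
    shipping_converted = shipping * factor

    # Step 3: rebuild the price under the receiver's assumptions.
    if receiver.includes_tax:
        bare *= 1 + tax_rate
    if receiver.includes_shipping:
        bare += shipping_converted
    return bare


# A Web source quotes 960 HKD including 10% tax and 80 HKD shipping;
# the receiver expects a bare price in US dollars.
web_source = PriceContext("HKD", includes_tax=True, includes_shipping=True)
receiver = PriceContext("USD", includes_tax=False, includes_shipping=False)
usd_price = convert_price(960.0, web_source, receiver, tax_rate=0.10, shipping=80.0)
```

Without the two context records, the figure 960 could not be compared with a receiver-side price at all, which is the sense in which context knowledge acts as metadata here.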