|
|
Extracting Schema from Semistructured Data
Link to this resource Semistructured data is characterized by the lack of any fixed and rigid schema, although typically the data has some implicit structure. While the lack of fixed schema makesextracting semistmctureddata fairly easy and an attractive goal, presenting and querying such data is greatly impaired. Thus, a critical problem is the discovery of the structure implicit in semistructured data and, subsequently, the recasting of the raw data in terms of this structure. In this paper, we consider a very general form of semistructured data based on labeled, directed graphs. We show that such data can be typed using the greatest fixpoint semantics of monadic datalog programs. We present an algorithm for approximate typing of semistructured data. We establish that the general problem of finding an optimal such typing is NP-hard, but present some heuristics and techniques based on clustering that allow efficient and near-optimal treatment of the problem. We also present some preliminary experimental results.  | This paper from the database community describes an attempt to derive schema from document instances (such as WWW personal home pages). This novel approach tries to derive compact rather than correct schema, ie an imperfect but useful type specification rather than a type specification which is as complex as the original set of instances. |  |
|