22nd IEEE International Conference on Data Engineering (ICDE'06), Atlanta, GA, USA, April 3-7, 2006: Integrating Unstructured Data into Relational Databases


Integrating unstructured data into relational databases

Imran R. Mansuri
imran@it.iitb.ac.in
IIT Bombay

Sunita Sarawagi
sunita@it.iitb.ac.in
IIT Bombay

    Abstract

In this paper we present a system for automatically integrating unstructured text into a multi-relational database using state-of-the-art statistical models for structure extraction and matching. We show how to extend current high-performing models, Conditional Random Fields and their semi-Markov counterparts, to effectively exploit a variety of recognition clues available in a database of entities, thereby significantly reducing the dependence on manually labeled training data. Our system is designed to load unstructured records into columns spread across multiple tables in the database while resolving the relationship of the extracted text with existing column values, and preserving the cardinality and link constraints of the database. We show how to combine the inference algorithms of statistical models with the database-imposed constraints for optimal data integration.

    1 Introduction

Database systems are islands of structure in a sea of unstructured data sources. Several real-world applications now need to create bridges for smooth integration of semi-structured sources with existing structured databases for seamless querying. Although a lot of research has gone into extracting structured data from unstructured sources in the NLP [4, 14], machine learning [8, 18] and web mining [9] communities, most of the proposed methods depend on tediously labeled unstructured data. The database is at best treated as a store for the structured data.

In this paper we attempt to bridge this gap by more centrally exploiting existing large databases of multi-relational entities to aid the extraction process on the one hand, and, on the other hand, augmenting the database itself with the information extracted from unstructured text in a way that allows multiple noisy variants of entities to co-exist and aid future extraction tasks.

    1.1 Applications

There are a number of applications where unstructured data needs to be integrated with structured databases on an ongoing basis, so that at the time of extraction a large database is available.

Consider the example of publication portals like Citeseer and Google Scholar. When integrating publication data from personal homepages, it does not make sense to perform the extraction task in isolation. Structured databases from, say, the ACM Digital Library or DBLP are readily available and should be exploited for better integration.

Another example is resume databases in HR departments of large organizations. Resumes are often stored with several structured fields like experience, education and references. When a new resume arrives in unstructured format, for example via email text, we might wish to integrate it into the existing database, and the extraction task can be made easier by consulting the database.

Another interesting example is from personal information management (PIM) systems, where the goal is to organize personal data like documents, emails, projects and people in a structured, inter-linked format. The success of such systems will depend on being able to automatically extract structure from the existing, predominantly file-based unstructured sources. Thus, for example, we should be able to automatically extract from a PowerPoint file the author of a talk and link that person to the presenter of a talk announced in an email. Again, this is a scenario where we will already have plenty of existing structured data into which we have to integrate a new unstructured file.

    1.2 Challenges and Contribution

A number of challenges need to be addressed in designing an effective solution to the data integration task. Useful clues are scattered in various ways across the structured database and the unstructured text records. Exploiting them effectively is a challenge.

First, the existing structured databases of entities are organized very differently from labeled unstructured text.

Proceedings of the 22nd International Conference on Data Engineering (ICDE'06) 0-7695-2570-9/06 $20.00 © 2006 IEEE

Most prior work has been on extracting information from unstructured text, which can be modeled as a sequence of tokens. The input labeled data is thus a collection of token sequences. In contrast, the database is a graph of entity relationships, and there is no information in the database about inter-entity sequencing. A good extraction system should exploit both the inter-entity sequencing information from the labeled unstructured text and the semantic relationship between entities in the database.

Second, there is significant format variation in the names of entities in the database and the unstructured text. The entities in a structured database provide useful pattern-level information that could help in recognizing entities in the unstructured text.

Finally, in most cases the database will be large whereas the labeled text data will be small. Features designed from the databases should be efficient to apply and should not dominate the features that capture contextual words and positional information from the limited labeled data.

We design a data integration system that builds upon state-of-the-art semi-Markov models [7, 16] for information extraction to exploit useful information in both structured data and labeled unstructured data, in spite of their format, structure and size variations. Our experiments show that our proposed model significantly improves the accuracy of data integration tasks by exploiting databases more effectively than has been done before.

Outline In Section 2, we describe our problem setup and present the overall architecture of the system. In Section 3, we present background on statistical models for extraction and matching. In Section 4, we show how we augment these models to exploit features from the database, and how we integrate data so as to meet the cardinality and link constraints in the database. In Section 5, we present integration performance on real-life datasets, and we discuss related work in Section 6.

    2 Setup and Architecture

    2.1 Relational data representation

A relational database represents a normalized set of entities with the relationships between them expressed through foreign keys. An example of such a multi-relational database is shown in Figure 1, representing bibtex entries for journal articles. The database consists of three different entities: articles, journals and authors. The foreign key from the articles table to the journals table shows how journal names have been normalized across different articles. The author name is a set-valued attribute of articles, and this relationship is stored in the Writes table.

In real life, the same physical entity is often represented by multiple names that can be thought of as noisy variants of each other. The database allows these variants to co-exist, and canonical links are placed from a noisy variant of an entry to its canonical form. In the sample database of Figure 1, we show such canonical links between the two journal entries ACM TODS and ACM Trans. Databases, and between the entries Jeffrey Ullman and J. Ullman in the Authors table. Each variant of an entity has a unique identifier, and thus the canonical link is a foreign key to this unique identifier. These foreign keys can be within the same table, as in the examples in Figure 1, or they could be across tables. This data representation allows a user to preserve an entity in a format that is different from its canonical form. For example, a user might want author names in a citation to be abbreviated although the canonical entry might contain the full name.

Figure 1. A sample normalized database for storing journal bibitems (tables Articles, Journals, Authors and Writes)

Figure 2. Integrating an unstructured citation, "R. Fagin and J. Helpern, Belief, awareness, reasoning. In AI 1988 [10] also", into the database of Figure 1

    In this paper we do not address the issue of which entryto pick as the canonical entry or how to construct a merged


canonical entry. These can be handled as part of follow-up operations once the canonical links are created.

A bit of notation to summarize the above: the database consists of a set of entities (each of which maps to a database table) E1 . . . Ee, where each entity has a set of attributes or column values (these exclude columns like primary or foreign keys). Let A1 . . . Am refer to the union of all attributes over all entities. In Figure 1, m is 4 and the attributes are title, year, journal name and author name. Each entity in the database has a unique identifier id and optionally a canonical identifier c that is a foreign key to the id of another entity, of which it is assumed to be a noisy variant.
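As a concrete, purely illustrative sketch of this representation, the multi-variant Authors table of Figure 1 can be modeled as rows carrying an id and an optional canonical link; the Python names below are assumptions for illustration, not part of the system:

```python
from dataclasses import dataclass
from typing import Optional

# Minimal sketch of the multi-variant entity store described above:
# every row has a unique id and an optional canonical link (a foreign
# key to the id of the entry it is a noisy variant of). All names here
# are illustrative.

@dataclass
class EntityRow:
    id: int
    name: str
    canonical: Optional[int] = None  # id of the canonical variant, if any

def canonical_form(rows: dict[int, EntityRow], rid: int) -> str:
    """Follow canonical links until reaching the canonical entry."""
    row = rows[rid]
    while row.canonical is not None:
        row = rows[row.canonical]
    return row.name

# The Authors table of Figure 1: "J. Ullman" is a noisy variant of
# "Jeffrey Ullman", so its canonical link points at id 4.
authors = {
    4: EntityRow(4, "Jeffrey Ullman"),
    2: EntityRow(2, "J. Ullman", canonical=4),
    3: EntityRow(3, "Ron Fagin"),
}

print(canonical_form(authors, 2))  # -> "Jeffrey Ullman"
```

Note that because the canonical link is just a foreign key to another row's id, the same code works whether the link stays within one table or crosses tables.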

    2.2 Labeled unstructured text

The unstructured data is a text record representing a single top-level entity like articles, but embedded in a background of irrelevant text. A labeled set L is provided by someone manually labeling portions of the text with the name of the entity to which it belongs. We will show in the experimental section that the size of L can be very small (5 to 10).

    2.3 The task

We start with an existing multi-entity, multi-variant database D and a small labeled set L. We use these to train models for information extraction and matching so that, given an unstructured text string x, we can perform the following:

Extract attributes from x corresponding to all attributes in the entity database. We call this the information extraction problem.

Map the extracted entities to existing entries in the database if they match; otherwise, create new entries. Assign relationships through foreign keys when attributes span multiple linked tables. We call this the matching problem.

As a result of these operations, the entry x is integrated across the multiple tables of the database with the canonical links resolved correctly. In actual deployment, a user might also wish to inspect the extracted entries and correct any errors of the automated integration step.

We continue with our sample database of Figure 1 and show how to integrate into it the noisy unstructured citation R. Fagin and J. Helpern, Belief, awareness and reasoning. In AI 1988 [10] also (shown in Figure 2). We extract from this citation two authors, a title, a journal name and a year field and integrate them with the database as shown in Figure 2. On matching with the database, we find the journal AI to be already present and the author R. Fagin to be a variant of an existing author name, Ron Fagin. This leads to the canonical link between the new entry and the existing author name.

    2.4 Available clues

The database D, with its entities and canonical links, and the labeled data L can be exploited in a variety of ways for learning how to integrate an incoming unstructured entry into the database. We go over these clues first and later show how they are exploited by the statistical models.

    2.4.1 Clues in the database

As a database gets large, in most cases the entity to be extracted from unstructured text will already exist in the database. We therefore need models that can effectively exploit this valuable piece of information. The entities in the database could be useful even when the unstructured text does not match a database entity. Different entity attributes have very distinctive patterns that can be generalized to recognize these entities in unstructured text. For example, by inspecting entries in a publications database we can detect that author names often follow a pattern of the form [A-Z]. [A-Z]. [A-Z][a-z]+.
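A minimal sketch of using such a mined pattern as a recognition clue; the regex follows the pattern quoted above, relaxed to make the second initial optional, and the surrounding code is illustrative rather than part of the system:

```python
import re

# Author names in a publications database often look like
# "R. E. Fagin": one or two initials followed by a capitalized
# surname. The regex mirrors the pattern quoted in the text,
# with the second initial made optional.
AUTHOR_PATTERN = re.compile(r"^[A-Z]\. ([A-Z]\. )?[A-Z][a-z]+$")

def looks_like_author(segment: str) -> bool:
    """Pattern-level clue: does this text segment resemble an
    author name mined from the database?"""
    return AUTHOR_PATTERN.match(segment) is not None

print(looks_like_author("J. D. Ullman"))  # True: matches the mined pattern
print(looks_like_author("ACM TODS"))      # False: a journal name does not
```

In a statistical extractor such a test would not be a hard filter but one feature among many, weighted against contextual and positional features during training.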

The schema of the database provides information on the number of entities that are allowed in each record. In Figure 1, the table structure makes it clear that an article can have at most one year, journal and title field, but it can have multiple author fields. Also, a foreign key in the database provides information on which entities are allowed to go together. For example, if we label a text segment as a city name and the segment matches the entry San Francisco in the cities column, then any text segment that is labeled as a state needs to match the entry CA. In data integration tasks, respecting cardinality and link constraints could be a requirement even if it does not improve accuracy.
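A minimal sketch of enforcing such a link constraint on one labeled record; the city-to-state table and function names are assumptions for illustration:

```python
# Link constraint: once a segment labeled "city" matches a database
# row, any segment labeled "state" must match the state that row's
# foreign key points to. The toy lookup table below is illustrative.
CITY_TO_STATE = {"San Francisco": "CA", "Atlanta": "GA"}

def link_consistent(labeled: dict[str, str]) -> bool:
    """Check the city->state foreign-key constraint on one record."""
    city, state = labeled.get("city"), labeled.get("state")
    if city in CITY_TO_STATE and state is not None:
        return CITY_TO_STATE[city] == state
    return True  # constraint not applicable to this record

print(link_consistent({"city": "San Francisco", "state": "CA"}))  # True
print(link_consistent({"city": "San Francisco", "state": "GA"}))  # False
```

Section 4 describes how checks of this kind are folded into the inference algorithm itself, rather than applied as an after-the-fact filter.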

    2.4.2 Information in the labeled unstructured data

Labeled unstructured data is the classical source of information for extraction models. Traditionally, three kinds of information have been derived from unstructured records. First, the labeled entities hold the same kind of pattern information as the database of entities, as discussed above. Second, the context in which an entity appears is a valuable clue that is present solely in the labeled unstructured data. Context is typically captured via a few words before or after the labeled entity. For example, the word In often appears before a journal name. Finally, the likely ordering of entity labels within a sequence is also useful. For example, author names usually appear before or after a title.


3 Background

In this section we review state-of-the-art statistical models for extraction and matching. These models are designed to exploit clues only in labeled unstructured data; in Section 4 we will discuss how to extend them to include clues from the database.

The problem of extracting structured entities from unstructured data is an extensively researched topic. A number of models have been proposed, ranging from the earliest rule-learning models to probabilistic approaches based on generative models like HMMs [17, 3] and conditional models like maxent taggers [14]. Almost all the probabilistic models treat extraction as a sequence labeling problem. The input unstructured text is treated as a sequence of tokens x = x1 . . . xn which needs to be assigned a corresponding sequence of labels y = y1 . . . yn from the set of labels A = {A1 . . . Am}. A promising, recently proposed model for information extraction is the Conditional Random Field (CRF). CRFs have been shown [8, 18] to outperform previous generative models like Hidden Markov Models and local conditional models like maxent taggers [14].

    3.1 Conditional Random Fields

CRFs [8] model a conditional probability distribution over the label sequence y given the input sequence x. The label assigned to the i-th word can be a function of any property of the i-th word, of the words defining its context, and of the label of the word before it...
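For reference, the first-order CRF distribution that this description corresponds to, in the standard form of [8] (here the f_k are feature functions over adjacent labels, the input and the position, and the λ_k are their learned weights), is:

```latex
P(\mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}
  \exp\!\Big(\sum_{i=1}^{n}\sum_{k} \lambda_k\, f_k(y_i, y_{i-1}, \mathbf{x}, i)\Big),
\qquad
Z(\mathbf{x}) \;=\; \sum_{\mathbf{y}'} \exp\!\Big(\sum_{i=1}^{n}\sum_{k} \lambda_k\, f_k(y'_i, y'_{i-1}, \mathbf{x}, i)\Big)
```

The normalizer Z(x) sums over all candidate label sequences, which is what makes the model globally conditioned on x rather than locally normalized like a maxent tagger.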
