crossing the structure chasm

Crossing the Structure Chasm Alon Halevy University of Washington, Seattle UBC, January 15, 2004

Author: charo

Post on 11-Feb-2016


TRANSCRIPT

  • Crossing the Structure Chasm
    Alon Halevy
    University of Washington, Seattle
    UBC, January 15, 2004

  • The Structure Chasm
    Authoring: writing text vs. creating a schema
    Querying: keywords vs. using someone else's schema
    Data sharing: easy vs. committees, standards
    But we can pose complex queries

  • Why is This a Problem?
    Databases used to be isolated and administered only by experts.
    Today's applications call for large-scale data sharing:
      Big science (bio-medicine, astrophysics, ...)
      Government agencies
      Large corporations
      The web (over 100,000 searchable data sources)
    The vision:
      Content authoring by anyone, anywhere
      Powerful database-style querying
      Use relevant data from anywhere to answer the query
      The Semantic Web
    Fundamental problem: reconciling different models of the world.

  • Outline
    Other benefits of structure:
      (Semantic) email
      Personal data management
    A tour of recent data sharing architectures:
      Data integration systems
      Peer-data management systems
    The algorithmic problems:
      Query reformulation
      Reconciling semantic heterogeneity
    What can we do with a large corpus of schemas?

  • Adding Structure to Email
    Email is often used for lightweight data management tasks:
      Organizing a PC meeting + dinner
      Arranging a balanced potluck
      Giving away opera tickets
      Announcing an event and associated reminders
    Some specialized tools/services: Outlook scheduling, evite.com
    Can we delegate some email tasks easily?

  • Semantic Email Processes
    Diagram labels: Originator, Recipients, Process Database

  • Semantic Email [Etzioni, McDowell, (Ha)Levy]
    Creating the structure? We'll help with template interfaces.
    Incorporating additional knowledge?
      "I always bring desserts"
      "I don't schedule morning meetings"
    Another data sharing challenge.
    But it's free (and cross platform): www.cs.washington.edu/research/semweb

  • Personal Data Management
    Diagram labels: HTML, Mail & calendar, Papers, Files, Presentations
    Data is organized by application
    [Semex: Sigurdsson, Nemes, H.]

  • Finding Publications
    Person: A. Halevy
    Person: Dan Suciu
    Person: Maya Rodrig
    Person: Steven Gribble
    Person: Zachary Ives
    Publication: What Can Peer-to-Peer Do for Databases, and Vice Versa

  • Following Associations (1)

  • Following Associations (2)
    Diagram labels: Publication, Bernstein
    Publications shown:
      A survey of approaches to automatic schema matching
      Corpus-based schema matching
      Database management for peer-to-peer computing: A vision
      Matching schemas by learning from others

  • Following Associations (3)
    Diagram labels: Publication, Bernstein, Cited by, Publication, Citations

  • Following Associations (4)
    Diagram labels: Cited Authors, Bernstein, Publication

  • Structure for Personal Data
    High-level concepts are given, but later:
      extend and personalize concept hierarchy,
      share (parts) of our data with others,
      incorporate external data into our view.
    Concepts are populated automatically with instances
    Need instance-level reconciliation: Alon Halevy, A. Halevy, Alon Y. Levy: same guy!
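To make the instance-level reconciliation point concrete, here is a minimal sketch in Python. The matching rule (same first initial, and one surname is a suffix of the other) is a toy heuristic invented for this illustration, not the method used in Semex; it would over-merge on real data.

```python
def name_key(full_name):
    """Reduce a name to (first initial, surname), lowercased."""
    parts = full_name.replace(".", " ").lower().split()
    return parts[0][0], parts[-1]

def same_person(a, b):
    """Toy rule (an assumption): same initial, and one surname ends with the other."""
    (init_a, sur_a), (init_b, sur_b) = name_key(a), name_key(b)
    return init_a == init_b and (sur_a.endswith(sur_b) or sur_b.endswith(sur_a))

def reconcile(names):
    """Greedy grouping: each name joins the first cluster containing a match."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if any(same_person(name, member) for member in cluster):
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

clusters = reconcile(["Alon Halevy", "A. Halevy", "Alon Y. Levy", "Dan Suciu"])
```

Under this rule the three spellings from the slide land in one cluster and "Dan Suciu" in another. A realistic reconciler would also use evidence such as co-authors, email addresses, and affiliations.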

  • Outline
    Other benefits of structure:
      (Semantic) email
      Personal data management
    A tour of recent data sharing architectures:
      Data integration systems
      Peer-data management systems
    The algorithmic problems:
      Query reformulation
      Reconciling semantic heterogeneity
    What can we do with a large corpus of schemas?

  • Data Integration
    Goal: provide a uniform interface to a set of autonomous data sources.
    First step towards data sharing.
    Many research projects (DB & AI)
      Mine: Information Manifold, Tukwila, LSD
    Recent industry:
      Startups: Nimble, Enosys, Composite, MetaMatrix
      Products from big players: BEA, IBM

  • Relational DBMS Refresher
    Schema: the template for data.

    Students:
      SSN          Name     Category
      123-45-6789  Charles  undergrad
      234-56-7890  Dan      grad

    Takes:
      SSN          CID
      123-45-6789  CSE444
      123-45-6789  CSE444
      234-56-7890  CSE142

    Courses:
      CID     Name               Quarter
      CSE444  Databases          fall
      CSE541  Operating systems  winter

    Queries:
      SELECT C.name
      FROM Students S, Takes T, Courses C
      WHERE S.name = 'Mary' and S.ssn = T.ssn and T.cid = C.cid
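The refresher query can be tried out with Python's built-in sqlite3 module. This is a small sketch that loads the rows exactly as they appear in the slide's tables; the function name courses_taken_by is a made-up helper, and the student name is passed as a parameter instead of the literal 'Mary'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students (ssn TEXT, name TEXT, category TEXT);
    CREATE TABLE Takes    (ssn TEXT, cid TEXT);
    CREATE TABLE Courses  (cid TEXT, name TEXT, quarter TEXT);
    INSERT INTO Students VALUES ('123-45-6789', 'Charles', 'undergrad'),
                                ('234-56-7890', 'Dan', 'grad');
    INSERT INTO Takes VALUES ('123-45-6789', 'CSE444'),
                             ('123-45-6789', 'CSE444'),
                             ('234-56-7890', 'CSE142');
    INSERT INTO Courses VALUES ('CSE444', 'Databases', 'fall'),
                               ('CSE541', 'Operating systems', 'winter');
""")

def courses_taken_by(student_name):
    """Run the slide's three-way join for one student name."""
    rows = conn.execute(
        "SELECT C.name FROM Students S, Takes T, Courses C "
        "WHERE S.name = ? AND S.ssn = T.ssn AND T.cid = C.cid",
        (student_name,),
    ).fetchall()
    return [name for (name,) in rows]
```

With the sample rows shown, no student is named Mary, so courses_taken_by("Mary") returns an empty list. Charles yields "Databases" twice because his Takes row appears twice, and Dan yields nothing because CSE142 has no row in Courses.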

  • Data Integration: Higher-level Abstraction
    Diagram labels: Mediated Schema, Semantic mappings

  • Mediated Schema
    Sources: OMIM, Swiss-Prot, HUGO, GO, Gene-Clinics, Entrez, Locus-Link, GEO
    Mediated schema concepts: Entity, Sequenceable Entity, Gene, Phenotype, Structured Vocabulary, Experiment, Protein, Nucleotide Sequence, Microarray Experiment
    Query: For the micro-array experiment I just ran, what are the related nucleotide sequences and for what protein do they code?
    www.biomediator.org
    Tarczy-Hornoch, Mork

  • Semantic Mappings
    Differences in:
      Names in schema
      Attribute grouping
      Coverage of databases
      Granularity and format of attributes
    Inventory Database A:
      BooksAndMusic: Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords
    Inventory Database B:
      Books: Title, ISBN, Price, DiscountPrice, Edition
      CDs: Album, ASIN, Price, DiscountPrice, Studio
      BookCategories: ISBN, Category
      CDCategories: ASIN, Category
      Artists: ASIN, ArtistName, GroupName
      Authors: ISBN, FirstName, LastName

  • Issues for Semantic Mappings
    Diagram labels: Mediated Schema, Semantic mappings
    Formalism for mappings
    Reformulation algorithms
    How will we create them?

  • Beyond Data Integration
    Mediated schema is a bottleneck for large-scale data sharing.
    It's hard to create, maintain, and agree upon.

  • Peer Data Management Systems
    Peers: UW, Stanford, DBLP, UBC, Waterloo, CiteSeer, Toronto
    Mappings specified locally
    Map to most convenient nodes
    Queries answered by traversing semantic paths.
    Piazza: [Tatarinov, H., Ives, Suciu, Mork]
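The idea of answering a query by traversing semantic paths can be sketched as a graph search over peers. The slide names the peers but not which pairs are mapped, so the edges below are purely hypothetical, and breadth-first search is just one simple way to find a chain of mappings; it is not a description of how Piazza works.

```python
from collections import deque

# Hypothetical mapping graph: an entry means "a local mapping exists
# between these two peers". The edges are invented for illustration.
MAPPINGS = {
    "UBC": ["UW"],
    "UW": ["UBC", "Stanford", "DBLP"],
    "Stanford": ["UW", "Waterloo"],
    "DBLP": ["UW", "CiteSeer"],
    "Waterloo": ["Stanford", "Toronto"],
    "CiteSeer": ["DBLP"],
    "Toronto": ["Waterloo"],
}

def semantic_path(source, target):
    """Breadth-first search for a shortest chain of mappings, or None."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbor in MAPPINGS.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None
```

A query posed at UBC that needs CiteSeer data would, in this toy graph, be rewritten hop by hop along the returned path. Each hop applies one locally specified mapping, which is why no peer needs a global mediated schema.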

  • PDMS-Related Projects
    Hyperion (Toronto)
    PeerDB (Singapore)
    Local relational models (Trento)
    Edutella (Hannover, Germany)
    Semantic Gossiping (EPFL, Lausanne)
    Raccoon (UC Irvine)
    Orchestra (Ives, U. Penn)

  • A Few Comments about Commerce
    Until 5 years ago: Data integration = Data warehousing.
    Since then:
      A wave of startups: Nimble, MetaMatrix, Calixa, Composite, Enosys
      Big guys made announcements (IBM, BEA).
      [Delay] Big guys released products.
    Success: analysts have a new buzzword, EII
      New addition to acronym soup (with EAI).
    Lessons:
      Performance was fine. Need management tools.

  • Data Integration: Before
    Diagram label: Mediated Schema

  • Data Integration: After
    Diagram labels: User Applications; Lens; File; InfoBrowser; Software Developers Kit; NIMBLE APIs; Front-End; Lens Builder; Management Tools; Integration Builder; Security Tools; Data Administrator; Concordance Developer; Integration Layer; Nimble Integration Engine; Compiler; Executor; Metadata Server; Cache; Common XML View

  • Sound Business Models
    Explosion of intranet and extranet information
    80% of corporate information is unmanaged
    By 2004, 30X more enterprise data than 1999
    The average company:
      maintains 49 distinct enterprise applications
      spends 35% of total IT budget on integration-related efforts
    Source: Gartner, 1999
    [Chart: "Enterprise Information", amount of information by year (1999 = 1): 1995: 0.07, 1997: 0.26, 1999: 1, 2001: 3.88, 2003: 15.06, 2005: 59.1]

  • Outline
    Other benefits of structure:
      (Semantic) email
      Personal data management
    A tour of recent data sharing architectures:
      Data integration systems
      Peer-data management systems
    The algorithmic problems:
      Query reformulation
      Reconciling semantic heterogeneity
    What can we do with a large corpus of schemas?

  • Languages for Schema Mapping
    Diagram label: Mediated Schema
    GAV, LAV, GLAV

  • Local-as-View (LAV)
    Mediated schema:
      Book: ISBN, Title, Genre, Year
      Author: ISBN, Name
    Sources: R1, R2, R3, R4, R5
    Source descriptions shown: Books before 1970; Humor books

  • Query Reformulation
    Mediated schema:
      Book: ISBN, Title, Genre, Year
      Author: ISBN, Name
    Sources: R1, R2, R3, R4, R5
    Source descriptions shown: Books before 1970; Humor books
    Query: Find authors of humor books
    Plan: R1 Join R5

  • Query Reformulation
    Mediated schema:
      Book: ISBN, Title, Genre, Year
      Author: ISBN, Name
    Sources: R1, R2, R3, R4, R5
    Source attributes shown: ISBN, Title, Name; ISBN, Title
    Query: Find authors of humor books before 1960
    Plan: Can't do it! (subtle reasons)

  • Query Reformulation
    Query is posed on mediated schema that contains no data.
    Sources are answers to queries (views).
    Problem: answering queries using views
      (Conceptually) Need to invert query expression.
    Traditional databases also use this:
      Can you reuse previously cached results?
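One way to picture reformulation over views is to ask, for each part of the query, which sources could supply it. The sketch below is a heavily simplified version of that first step and not any published algorithm. The slides do not say what each of R1 to R5 contains, so the three source descriptions here are assumptions chosen to mirror the two examples: a source is usable if it exports the needed attributes and every query condition is either guaranteed by the source or checkable on an exported attribute.

```python
# Each source is described as a view over the mediated schema.
# All three descriptions are assumptions made for illustration.
SOURCES = {
    "R1": {"relation": "Book", "exports": {"ISBN", "Title"},
           "guarantees": [("Genre", "=", "Humor")]},      # humor books
    "R2": {"relation": "Book", "exports": {"ISBN", "Title"},
           "guarantees": [("Year", "<", 1970)]},          # books before 1970
    "R5": {"relation": "Author", "exports": {"ISBN", "Name"},
           "guarantees": []},
}

def implied(guarantees, cond):
    """True if a guaranteed condition on the same attribute entails cond."""
    attr, op, value = cond
    for g_attr, g_op, g_value in guarantees:
        if g_attr != attr or g_op != op:
            continue
        if op == "=" and g_value == value:
            return True
        if op == "<" and g_value <= value:
            return True
    return False

def bucket(relation, needed, conditions):
    """Sources able to supply `needed` attributes of `relation` under `conditions`."""
    usable = []
    for name, src in SOURCES.items():
        if src["relation"] != relation or not needed <= src["exports"]:
            continue
        # A condition is fine if the source guarantees it, or if the
        # attribute is exported so the condition can be checked afterwards.
        if all(implied(src["guarantees"], c) or c[0] in src["exports"]
               for c in conditions):
            usable.append(name)
    return usable

humor = [("Genre", "=", "Humor")]
humor_before_1960 = humor + [("Year", "<", 1960)]
```

Under these assumptions, the humor-book query finds R1 for Book and R5 for Author, which matches the plan "R1 Join R5". For humor books before 1960, no Book source qualifies: R1 says nothing about Year, and R2 only guarantees Year < 1970 without exporting Year, so the stricter bound cannot be verified.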

  • Answering Queries Using Views
    NP-Complete for basic queries [LMSS, PODS 95].
    Results depend on:
      Query language used for sources and queries
      Open-world vs. closed-world assumption
      Allowable access patterns to the sources
    A lot of beautiful theory!

  • Theory?
    A lot of beautiful theory.
    "There is in these words the beautiful maneuverability of the abstract, rushing in to replace the intractability of the concrete."
    Milan Kundera, The Book of Laughter and Forgetting

  • Practical Query Reformulation
    A lot of nice theory.
    But also very practical algorithms:
      MiniCon [Pottinger and H., 2001]: scales to thousands of sources.
      Every commercial DBMS implements some version of answering queries using views.
    See [Halevy, 2001] for survey.

  • Reformulation in PDMS
    Can't follow all paths naively
      Pruning techniques [Tatarinov, H.]
    Can we pre-compute some paths?
      Need to compose mappings [Madhavan, H., VLDB-2003]
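Pre-computing a path amounts to composing the mappings along it. The sketch below shows only the simplest possible case, where a mapping is a plain attribute renaming stored as a dictionary; the peer and attribute names are invented. Composing the query-based mappings used in a real PDMS is considerably harder, which is what the cited work addresses.

```python
def compose(first, second):
    """Compose two attribute renamings; attributes with no image in `second` are dropped."""
    return {src: second[mid] for src, mid in first.items() if mid in second}

# Hypothetical attribute-level mappings between three peers.
ubc_to_uw = {"paper_title": "title", "writer": "author", "pages": "page_range"}
uw_to_dblp = {"title": "dblp_title", "author": "dblp_author"}

ubc_to_dblp = compose(ubc_to_uw, uw_to_dblp)
```

Here "pages" disappears from the composed mapping because the second hop has nothing to map "page_range" to. Even in this toy setting, composition can lose information, so a pre-computed path is not always equivalent to following the hops one at a time with richer mappings.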

  • Open PDMS Research Issues
    Managing large networks of mappings:
      Consistency
      Trust
    Improving networks: finding additional mappings
    Indexing: heterogeneous data across the network
    Caching: Where? What?

  • Outline
    Other benefits of structure:
      (Semantic) email
      Personal data management
    A tour of recent data sharing architectures:
      Data integration systems
      Peer-data management systems
    The algorithmic problems:
      Query reformulation
      Reconciling semantic heterogeneity
    What can we do with a large corpus of schemas?

  • Semantic Mappings
    Need mappings in every data sharing architecture.
    "Standards are great, but there are too many."
    Inventory Database A:
      BooksAndMusic: Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords
    Inventory Database B:
      Books: Title, ISBN, Price, DiscountPrice, Edition
      CDs: Album, ASIN, Price, DiscountPrice, Studio
      BookCategories: ISBN, Category
      CDCategories: ASIN, Category
      Artists: ASIN, ArtistName, GroupName
      Authors: ISBN, FirstName, LastName

  • Why is it so Hard?
    Schemas never fully capture their intended meaning:
      Schema elements are just symbols.
      We need to leverage any additional information we may have.
    "Theorem": Schema matching is AI-Complete.
      Hence, a human will always be in the loop.
      Goal is to improve designers' productivity.
      Solution must be extensible.

  • Matching Heuristics
    Multiple sources of evidence in the schemas:
      Schema element names
        BooksAndCDs/Categories ~ BookCategories/Category
      Descriptions and documentation
        ItemID: unique identifier for a book or a CD
        ISBN: unique identifier for any book
      Data types, data instances
        DateTime Integer, addresses have similar formats
      Schema structure
        All books have similar attributes
      Use domain knowledge
    All these techniques consider only the two schemas.
    In isolation, techniques are incomplete or brittle: need principled combination.
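The name-based heuristic, and its brittleness, can be seen in a few lines of Python. This is a sketch under stated assumptions: names are split on CamelCase boundaries, plurals are stripped crudely, and similarity is the Jaccard overlap of the resulting token sets. The threshold of 0.3 and the flat candidate list drawn from Inventory Database B are arbitrary choices for the example.

```python
import re

def tokens(name):
    """Split a CamelCase name into lowercase, crudely singularized tokens."""
    words = re.findall(r"[A-Z][a-z]+|[A-Z]+(?![a-z])|[a-z]+", name)
    result = set()
    for word in words:
        word = word.lower()
        if word.endswith("ies"):
            word = word[:-3] + "y"
        elif word.endswith("s"):
            word = word[:-1]
        result.add(word)
    return result

def name_similarity(a, b):
    """Jaccard overlap between the token sets of two element names."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def best_match(attr, candidates, threshold=0.3):
    """Best-scoring candidate, or None if nothing reaches the threshold."""
    score, match = max((name_similarity(attr, c), c) for c in candidates)
    return (match, score) if score >= threshold else (None, score)

# Attribute names taken from Inventory Database B on the slides.
DB_B = ["Title", "ISBN", "Price", "DiscountPrice", "Edition", "Category", "LastName"]
```

With these names, SuggestedPrice pairs with Price and Categories with Category, but ItemID finds no partner even though the slide's documentation strings suggest it corresponds to ISBN. That gap is the reason for combining names with descriptions, instances, and structure.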

  • Using Past Experience
    Matching tasks are often repetitive.
    Humans improve over time at matching.
    A matching system should improve too!
    LSD: Learns to recognize elements of mediated schema. [Doan, Domingos, H., SIGMOD-01, MLJ-03]
    Doan: 2003 ACM Distinguished Dissertation Award.
    Diagram label: data sources

  • Example: Matching Real-Estate Sources
    Mediated schema: address, price, agent-phone, description
    Schema of realestate.com: location, listed-price, phone, comments
    realestate.com data:
      location: Miami, FL; Boston, MA; ...
      listed-price: $250,000; $110,000; ...
      phone: (305) 729 0831; (617) 253 1429; ...
      comments: Fantastic house; Great location; ...
    homes.com data:
      price: $550,000; $320,000; ...
      contact-phone: (278) 345 7215; (617) 335 2315; ...
      extra-info: Beautiful yard; Great beach; ...
    Learned hypotheses:
      If "phone" occurs in the name => agent-phone
      If "fantastic" & "great" occur frequently in data values => description

  • Learning Source Descriptions
    We learn a classifier for each element of the mediated schema.
    Training examples are provided by the given mappings.
    Multi-strategy learning:
      Base learners: name, instance, description
      Combine using stacking.
    Accuracy of 70-90% in experiments.
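The multi-strategy idea can be illustrated with the real-estate data from the previous slide. The sketch below is not LSD: each "learner" is just a token-overlap score, and the two scores are combined with fixed equal weights, whereas stacking would learn the weights from training data. It assumes realestate.com has been mapped by hand to the mediated schema in the obvious way (location to address, and so on), which the slide implies but does not spell out.

```python
import re

def words(text):
    """Lowercase alphabetic and numeric tokens of a string."""
    return set(re.findall(r"[a-z]+|[0-9]+", text.lower()))

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Training data: mediated-schema element -> (source column name, sample values)
# from realestate.com. The column-to-element pairing is an assumption.
TRAINING = {
    "address": ("location", ["Miami, FL", "Boston, MA"]),
    "price": ("listed-price", ["$250,000", "$110,000"]),
    "agent-phone": ("phone", ["(305) 729 0831", "(617) 253 1429"]),
    "description": ("comments", ["Fantastic house", "Great location"]),
}

def name_learner(column, label):
    """Score from overlap between column-name tokens."""
    train_name, _ = TRAINING[label]
    return jaccard(words(column), words(train_name))

def instance_learner(values, label):
    """Score from overlap between data-value tokens."""
    _, train_values = TRAINING[label]
    return jaccard(words(" ".join(values)), words(" ".join(train_values)))

def predict(column, values, w_name=0.5, w_instance=0.5):
    """Fixed-weight combination of the base learners (a stand-in for stacking)."""
    scores = {
        label: w_name * name_learner(column, label)
        + w_instance * instance_learner(values, label)
        for label in TRAINING
    }
    return max(scores, key=scores.get)

HOMES = {
    "price": ["$550,000", "$320,000"],
    "contact-phone": ["(278) 345 7215", "(617) 335 2315"],
    "extra-info": ["Beautiful yard", "Great beach"],
}
predictions = {col: predict(col, vals) for col, vals in HOMES.items()}
```

On this tiny input the name learner decides price and contact-phone, while extra-info is matched to description only through the shared word "great" in the data values. That is the kind of hypothesis the previous slide lists, and it shows why no single learner is enough.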

  • Corpus-Based Schema Matching
    Can we use previous experience to match two new schemas?
    Can a corpus of schemas and matches be a general purpose resource?
    Information Retrieval and NLP progressed by using corpora. Can the same be done for structured data?

  • Corpus-Based Schema Matching
    Can we use previous experience to match two new schemas?
    Reuse extracted knowledge to match new schemas
    Learn general purpose knowledge
    Classifier for every corpus element

  • The Corpus vs. Other Matchers
    [Charts: recall per schema pair for the MKB, BASIC, and COMB matchers; panels: Shipping Domain, Inventory Domain]

  • Exploiting Previous Experience
    [Charts: average number of matches per schema pair found only by MKB vs. only by BASIC (extra vs. missed matches); panels: Shipping Domain, Inventory Domain]

  • Corpus Challenges
    What exactly should we learn?
    Generalizing with few training examples
    Balancing previous experience with other clues
    Size and scope of the corpus

  • Other Corpus Based Tools
    Conjecture: a corpus of schemas can be the basis for many useful tools.
    Auto-complete: I start creating a schema (or show sample data), and the tool suggests a completion.
    Formulating queries on new databases: I ask a query using my terminology, and it gets reformulated appropriately.
    Now we can cross the structure chasm.
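The auto-complete conjecture can be sketched with simple co-occurrence counting. Everything here is hypothetical: the four-schema corpus is invented, each schema is flattened to a set of attribute names, and a candidate attribute is scored by how much the schemas containing it overlap with the partial schema typed so far.

```python
from collections import Counter

# Hypothetical corpus: each schema reduced to a set of attribute names.
CORPUS = [
    {"title", "isbn", "price", "author"},
    {"title", "isbn", "publisher", "price"},
    {"album", "asin", "price", "artist"},
    {"title", "author", "edition", "isbn"},
]

def suggest(partial, k=2):
    """Rank unseen attributes by weighted co-occurrence with the partial schema."""
    votes = Counter()
    for schema in CORPUS:
        overlap = len(partial & schema)
        if overlap:
            for attr in schema - partial:
                votes[attr] += overlap
    # Sort by descending votes, then alphabetically for a stable order.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [attr for attr, _ in ranked[:k]]

suggestions = suggest({"title", "isbn"})
```

Starting a schema with title and isbn leads this toy tool to propose author and price, since those attributes appear most often alongside them in the corpus. A real tool would first have to decide that differently named attributes in different schemas mean the same thing, which is again the matching problem.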

  • Conclusion
    Vision: data authoring, querying and sharing by everyone, everywhere.
    Structure is useful in our daily tasks.
    Key challenge: reconciling semantic heterogeneity
    Diagram label: Corpus of schemas

  • Some References
    www.cs.washington.edu/homes/alon
    Piazza: ICDE03, WWW03, VLDB-03
    The Structure Chasm: CIDR-03
    Surveys on schema matching languages:
      Halevy, VLDB Journal 01
      Lenzerini, PODS 2002
    Semi-automatic schema matching: Rahm and Bernstein, VLDB Journal 01.
    Teaching integration to undergraduates: SIGMOD Record, September, 2003.

    *Almost* the same, but not quite: "What can databases do for peer-to-peer" and "What Can Peer-to-Peer Do for Databases, and Vice Versa".

    Shown query is courtesy of Hao Mei.
    For all phenotypes in one database, find all gene, locus, product triples (GeneTests curator query).
    For a given genetic disease, find all other diseases caused by the same gene (Joyce Mitchell, National Library of Medicine).
    For a set of genes and proteins, find all matches (Marianne Barrier).
    For some gene, find all homologues in other species (me).

    Notes: Phenotype can be normal (as depicted) or abnormal (gross). Gene is misrepresented as a more concrete chromosome. Vocabulary picture stolen from Unified Medical Language System.

    Here is an example of a design experiment: design schemas for managing the inventory of a store. Tower of Babel.

    Summary of most current techniques to do schema matching: combine multiple sources of evidence. Each one is noisy.

    Points:
      1) Introduce our approach.
      2) We do not manually map the schemas of all sources to the mediated schema. The goal is to manually mark up only a few sources, and be able to learn from the marked-up sources to successfully propose mappings for subsequent sources.
      3) Once the markup is done, there are many different types of information to learn from.