Assigning Schema Labels Using Ontology and Heuristics
Xuan Zhang, Rouming Jin, Gagan Agrawal


Post on 18-Jan-2016


TRANSCRIPT

Page 1: Assigning Schema Labels Using Ontology and Heuristics Xuan Zhang, Rouming Jin, Gagan Agrawal

Assigning Schema Labels Using Ontology and Heuristics

Xuan Zhang, Rouming Jin, Gagan Agrawal

Page 2

Problem Statement

• Setting: An online integration system finds a new data file…
• Question: Can it be integrated into the system on the fly? How?
• Sub-tasks:
  • Understand the data
    • Talk to data host
    • Consult field expert
  • Process the data
    • Database administrator
    • Programmer

Can we automate the process?

Page 3

On-the-fly Integration Overview

[Flowchart, boxes in order: Dataset flat file → Layout learning → Layout descriptor → Parser generation → Parser → Parsing → Raw attribute values → Value cleaning and summarization → Attribute summaries → Score calculation → Scores → Expert or clustering algorithm → Cutoff values → Labeling → Labels]

Steps 1 to 3:

1. Delimiter Identification (Ref [25], [26])
2. Wrapper Generation (Ref [32])
3. Schema Mining

Page 4

Schema Mining

• Assign meaning (labels or names) to attributes in a data set
• Challenges
  • What
    • Delimiters
    • Values
  • How
    • Top-down
    • Bottom-up

Page 5

Our Approach

• Summarize attribute values from the bottom up
• Similarity between ontology and schema
  • An attribute a with label att, a value v
  • Schema: “v is-a att”
  • Ontology: “Node(v) is a child of Node(att)”
  • E.g., protein is-a molecule type
• Common ancestor of values in ontology ~ attribute label in schema
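The common-ancestor idea can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes the ontology is a tree stored as a child-to-parent dictionary, and the names `proper_ancestors`, `common_ancestor` and the toy `parent` map (including the `"root"` node) are hypothetical.

```python
from typing import Dict, List, Optional


def proper_ancestors(node: str, parent: Dict[str, str]) -> List[str]:
    """Return the ancestors of `node`, nearest first, excluding the node itself.

    Assumes `parent` describes a tree (no cycles).
    """
    chain: List[str] = []
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def common_ancestor(values: List[str], parent: Dict[str, str]) -> Optional[str]:
    """Return the nearest ontology node that is an ancestor of every value.

    Returns None if the values share no ancestor (for example, when a value
    is missing from the ontology).
    """
    if not values:
        return None
    chains = [proper_ancestors(v, parent) for v in values]
    shared = set(chains[0]).intersection(*chains[1:])
    for node in chains[0]:  # nearest ancestor first
        if node in shared:
            return node
    return None


# Toy ontology (hypothetical): child -> parent
parent = {
    "protein": "molecule type",
    "DNA": "molecule type",
    "molecule type": "root",
}
label = common_ancestor(["protein", "DNA"], parent)  # candidate attribute label
```

Here the values "protein" and "DNA" share the nearest ancestor "molecule type", which becomes the candidate label for the attribute.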

Page 6

Real-world Complications

• Complete comprehensive ontology database
  • Selective sampling
• Error-free dataset
  • Adjustable sensitivity and fault tolerance
• Time
  • Data mining + statistical analysis

Remark: attribute label vs. attribute name

E.g., date: {creation date, last modification date}

Page 7

Outline

• Motivation
• System
• Mining algorithm
• Experiment

Page 8

Schema Mining System

• System
  • Data cleaning and summarization
  • Score function
  • Ontology database

[Flowchart, boxes in order: Value cleaning and summarization → Attribute summaries → Score calculation → Scores → Expert or clustering algorithm → Cutoff values → Labeling → Attribute labels]

Page 9

Data Summarization

• Token profile: an ordered list of N (numerical), A (alphabetic) and special characters
  • E.g., Profile(“polyA_site”) = A_A
• Token category: word, number or else
• Frequent tokens
  • Approximate frequent token mining algorithm
  • Assumption: tokens are distributed evenly
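A token profile consistent with the Profile(“polyA_site”) = A_A example can be sketched as follows. This is an assumption-laden sketch: it treats only ASCII letters and digits as A and N, collapses each run to a single symbol, and keeps every other character unchanged. The function names and the exact rule for "number" are hypothetical.

```python
import re


def token_profile(token: str) -> str:
    """Collapse letter runs to 'A' and digit runs to 'N'; keep special characters."""
    # Letters first, so the 'N' introduced below is never re-read as a letter run.
    collapsed = re.sub(r"[A-Za-z]+", "A", token)
    return re.sub(r"[0-9]+", "N", collapsed)


def token_category(token: str) -> str:
    """Classify a token as 'word', 'number' or 'else' (assumed definitions)."""
    if re.fullmatch(r"[A-Za-z]+", token):
        return "word"
    if re.fullmatch(r"[0-9]+(\.[0-9]+)?", token):
        return "number"
    return "else"


profile = token_profile("polyA_site")  # "A_A", as on the slide
```

Under the same rule, a value such as "12-JAN-2004" maps to N-A-N and "3.14" maps to N.N, which are the profiles the heuristics rely on later in the deck.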

Page 10

Template Scoring Function

• Desired property
  • Simple
  • Adjustable trade-off between sensitivity and error tolerance

[Figure: template scoring function; axis ticks from 0.0 to 1.0; labels: F_pt, B_pt, t, Temperature]

Page 11

Ontology Database

• Goal: to approximate a complete, comprehensive ontology database
• Approach
  • “Complete”: sample popular terms
  • “Comprehensive”: public ontology databases + common facts
• Result
  • 6 major categories, 386 terms

Page 12

Ontology Contents

• Sample existing databases
  • Organism name: NCBI Taxonomy
  • Cellular component: Gene Ontology
  • Publication method: NCBI Entrez Journals
• New categories
  • Biology database: popular database names
  • Molecule type: biology fact
  • Free text: common words in natural language
• Enhancement
  + taxonomy hierarchy
  + direct submission

Page 13

Outline

• Motivation
• System
• Mining algorithm
  • Using ontology
  • Using heuristics
• Experiment

Page 14

Mining With Ontology

1. Occurrence of a term:

$$
\mathrm{Occurrence}(\mathit{term}) =
\begin{cases}
\mathrm{Frequent\_Counts}[i], & \text{if } \mathit{term} = \mathrm{Frequent\_Tokens}[i] \\
\min_{i \in [0, t]} \mathrm{Frequent\_Counts}[i], & \text{if } \mathit{term} = \mathrm{Frequent\_Tokens}[0] \mid \dots \mid \mathrm{Frequent\_Tokens}[t] \\
0, & \text{else}
\end{cases}
$$

2. Strength of a term:

$$
\mathrm{Strength}(\mathit{term}) = \mathrm{Occurrence}(\mathit{term}) + \mathrm{Strength}(\mathit{child\_term})
$$
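One way to read the two definitions in code is sketched below. Several points are assumptions rather than statements from the slides: frequent tokens are stored as a token-to-count dictionary, a multi-token term is a whitespace-separated sequence of frequent tokens, and the child contribution to Strength is summed over all children of a term. The function names and the toy data are hypothetical.

```python
from typing import Dict, List


def occurrence(term: str, frequent_counts: Dict[str, int]) -> int:
    """Occurrence of an ontology term among an attribute's frequent tokens.

    A single-token term gets that token's count; a multi-token term gets the
    minimum count over its tokens; anything else gets 0.
    """
    tokens = term.split()
    if not tokens or any(tok not in frequent_counts for tok in tokens):
        return 0
    return min(frequent_counts[tok] for tok in tokens)


def strength(term: str,
             children: Dict[str, List[str]],
             frequent_counts: Dict[str, int]) -> int:
    """Occurrence of the term plus the strength of its children (assumed summed)."""
    own = occurrence(term, frequent_counts)
    below = sum(strength(child, children, frequent_counts)
                for child in children.get(term, []))
    return own + below


# Toy data (hypothetical): frequent token counts and a parent -> children map
frequent_counts = {"protein": 5, "dna": 3, "molecule": 1, "type": 2}
children = {"molecule type": ["protein", "dna"]}
total = strength("molecule type", children, frequent_counts)  # 1 + 5 + 3
```

In the toy data the term "molecule type" itself occurs once (the minimum of its two token counts), and its children add 5 and 3.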

Page 15

Mining With Ontology

• Likelihood of an attribute to be labeled with l
• Factors:
  • Relative strength of term l compared with that of other terms
  • Completeness of ontology
• Score = product of the two factors, modulated by the template scoring function

Page 16

Mining With Heuristics

• Use token profile
  • “number”: {N, N.N}
  • “date”: {N-A-N, N/N/N}
• Use frequent token counts
  • “identification”: Frequent_Counts[] = 1

Page 17

Mining With Heuristics

• Use other token information
  • “biological sequence”: length > 45, or in 10’s
• Use token sequence information
  • “people name”: length (2~3), separator (“,” or “and”), profile (not number, date)
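The profile-based and count-based rules can be sketched as below. This is a simplified illustration under stated assumptions: each attribute value is treated as a single token, a rule fires only when every value satisfies it (the slides instead describe scores with adjustable tolerance), and the function returns all matching candidate labels because the slides do not state a precedence. The "biological sequence" and "people name" rules are left out. All names are hypothetical.

```python
import re
from collections import Counter
from typing import List, Set

NUMBER_PROFILES = {"N", "N.N"}
DATE_PROFILES = {"N-A-N", "N/N/N"}


def token_profile(token: str) -> str:
    """Collapse letter runs to 'A' and digit runs to 'N'; keep special characters."""
    collapsed = re.sub(r"[A-Za-z]+", "A", token)
    return re.sub(r"[0-9]+", "N", collapsed)


def heuristic_labels(values: List[str]) -> Set[str]:
    """Return every heuristic label whose rule holds for all values."""
    labels: Set[str] = set()
    if not values:
        return labels
    profiles = {token_profile(v) for v in values}
    if profiles <= NUMBER_PROFILES:
        labels.add("number")
    if profiles <= DATE_PROFILES:
        labels.add("date")
    # "identification": every value occurs exactly once
    if max(Counter(values).values()) == 1:
        labels.add("identification")
    return labels


dates = heuristic_labels(["12-JAN-2004", "12-JAN-2004", "03/02/1999"])
ids = heuristic_labels(["P12345", "Q67890"])
```

A column of unique numbers would receive both "number" and "identification" under this sketch; resolving such ties is where a score and cutoff would come in.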

Page 18

Experimental Results

• Datasets
  • GenBank, UniProt SWISSPROT and Pfam
• Cutoff values
  • Cluster scores into groups most, middle and little by minimizing standard deviation
• Evaluation
  • Weighted Cohen’s Kappa: compare groups most, middle and little with true labels Y (yes), P (partial) and N (no)
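The cutoff and evaluation steps can be sketched as below. Both functions rest on assumptions the slides do not spell out: the clustering objective is taken to be the sum of within-group population standard deviations over all three-way splits of the sorted scores, and the kappa uses linear disagreement weights with the groups coded as ranks (0 = little / N, 1 = middle / P, 2 = most / Y). The function names are hypothetical.

```python
from itertools import combinations
from statistics import pstdev
from typing import List, Sequence, Tuple


def split_into_three(scores: Sequence[float]) -> Tuple[List[float], List[float], List[float]]:
    """Split sorted scores into (little, middle, most) with minimal summed std dev."""
    ordered = sorted(scores)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least three scores")
    best_cost = None
    best_groups = None
    # i and j are the two cut positions; every group is non-empty.
    for i, j in combinations(range(1, n), 2):
        groups = (ordered[:i], ordered[i:j], ordered[j:])
        cost = sum(pstdev(g) for g in groups)
        if best_cost is None or cost < best_cost:
            best_cost, best_groups = cost, groups
    return best_groups


def weighted_kappa(predicted: Sequence[int], actual: Sequence[int], k: int = 3) -> float:
    """Linearly weighted Cohen's kappa for ranks 0..k-1."""
    n = len(predicted)
    if n == 0 or n != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    observed = [[0] * k for _ in range(k)]
    for p, a in zip(predicted, actual):
        observed[p][a] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    disagreement = expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)
            disagreement += weight * observed[i][j]
            expected += weight * row[i] * col[j] / n
    if expected == 0:
        raise ValueError("kappa is undefined when all items fall in one category")
    return 1.0 - disagreement / expected


little, middle, most = split_into_three([0.9, 0.05, 0.5, 0.1, 0.95, 0.55])
kappa = weighted_kappa([2, 2, 1, 0], [2, 2, 1, 0])  # perfect agreement
```

The boundaries between the three returned groups play the role of the cutoff values; a kappa near 1 indicates that the group assignment agrees with the true Y/P/N labels.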

Page 19

Results: Summary

[Figure: results summary by category; legend: Very good, Good, Moderate]

Category 1: cellular component, 2: database, 3: date, 4: free text, 5: ID, 6: molecule type, 7: name, 8: number, 9: organism, 10: publication method, 11: sequence

Page 20

Results: Cellular Component (Ontology)

Page 21

Results: Biology Database (Ontology)

Page 22

Results: Free Text (Ontology)

Page 23

Results: Molecule Type (Ontology)

Page 24

Results: Organism Name (Ontology)

Page 25

Results: Publication Method (Ontology)

Page 26

Results: Date (Heuristics)

Page 27

Results: ID (Heuristics)

Page 28

Results: People Name (Heuristics)

Page 29

Results: Number (Heuristics)

Page 30

Results: Bio. Sequence (Heuristics)

Page 31

Discussion: Hits and Misses

• According to Kappa tests, good or very good
• Possible improvement
  • Better clustering method
  • Bigger ontology database
  • More involved language analysis
  • Hybrid of bottom-up and top-down approaches

Page 32

Assigning Schema Labels Using Ontology and Heuristics