TANGO Table ANalysis for Generating Ontologies Yuri A. Tijerino*, David W. Embley*, Deryle W. Lonsdale* and George Nagy** * Brigham Young University ** Rensselaer Polytechnic Institute

Post on 18-Dec-2015



List of contents
  Motivation
  Applications
  Table understanding
  Concept matching
  Ontology merging/growing
  Example
  Future direction

Motivation
  Semi-automated ontological engineering through Table Analysis for Generating Ontologies (TANGO)
  Keyword or link-analysis search is not enough to find information in tables
  Structure in tables can lead to domain knowledge, which includes concepts, relationships and constraints (ontologies)
  Tables on the web created for human use can lead to robust domain ontologies

TANGO Applications
  Extraction ontologies (generation)
  Data integration
  Semantic web
  Multiple-source query processing
  Document image analysis for documents that contain tables

Table understanding
  What is a table?
  Why table normalization?
  What is table understanding?
  What is mini-ontology generation?

Table understanding: What is a table?
  “…a two-dimensional assembly of cells used to present information…” Lopresti and Nagy
  Normalized tables (row-column format)
  Small paper (using OCR) and/or electronic tables (marked up) intended for human use

Table understanding: What is table normalization?
  Table normalization means to take any table and produce a standard row-column table with all data cells containing expanded values and type information

Country             | GDP/PPP          | GDP/PPP Per Capita | Real-Growth Rate | Inflation
Afghanistan         | $21,000,000,000  | $800               | ?                | ?
Albania             | $13,200,000,000  | $3,800             | 7.3%             | 3.0%
Algeria             | $177,000,000,000 | $5,600             | 3.8%             | 3.0%
Andorra             | $1,300,000,000   | $19,000            | 3.8%             | 4.3%
Angola              | $13,300,000,000  | $1,330             | 5.4%             | 110.0%
Antigua and Barbuda | $674,000,000     | $10,000            | 3.5%             | 0.4%
…                   | …                | …                  | …                | …

[Figure labels: Raw table; Normalized table]


Table understanding: What is table normalization?

??             | Population    | Population Growth Rate | Population Density | Birth Rate | Death Rate | Migration Rate | Life Expectancy Male | Life Expectancy Female | Infant Mortality
Afghanistan    | 25,824,882    | 3.95% | 39.88 persons/km2  | 4.19% | 1.70% | 1.46%  | 47.82 years | 46.82 years | 14.06%
Albania        | 3,364,571     | 1.05% | 122.79 persons/km2 | 2.07% | 0.74% | -0.29% | 65.92 years | 72.33 years | 4.29%
Algeria        | 31,133,486    | 2.10% | 13.07 persons/km2  | 2.70% | 0.55% | -0.05% | 68.07 years | 70.46 years | 4.38%
American Samoa | 63,786        | 2.64% | 320.53 persons/km2 | 2.65% | 0.40% | 0.39%  | 71.23 years | 79.95 years | 1.02%
Andorra        | 65,939        | 2.24% | 146.53 persons/km2 | 1.03% | 0.55% | 1.76%  | 80.55 years | 86.55 years | 0.41%
Angola         | 11,510        | 2.84% | 8.97 persons/km2   | 4.31% | 1.64% | 0.16%  | 46.08 years | 50.82 years | 12.92%
…              | …             | …     | …                  | …     | …     | …      | …           | …           | …
Western Sahara | 239,333       | 2.34% | 0.90 persons/km2   | 4.54% | 1.66% | -0.54% | 47.98 years | 50.57 years | 13.67%
World          | 5,995,544,836 | 1.30% | 14.42 persons/km2  | 2.20% | 0.90% | ?      | 61.00 years | 65.00 years | 5.60%
Yemen          | 16,942,230    | 3.34% | 32.09 persons/km2  | 4.33% | 0.99% | 0.00%  | 58.17 years | 61.88 years | 6.98%
Zambia         | 9,663,535     | 2.12% | 13.05 persons/km2  | 4.45% | 2.26% | 0.08%  | 36.72 years | 37.21 years | 9.19%
Zimbabwe       | 11,163,160    | 1.02% | 28.87 persons/km2  | 3.06% | 2.04% | ?      | 38.77 years | 38.94 years | 6.12%

Table understanding: Information useful for normalization
  Captions: in vicinity of table (above, below, etc.)
  Footnotes: on annotated column labels or data cells
  Embedded information: in rows, columns or cells (e.g., $, %, (1,000), billions, etc.)
  Links to other views of the table, possibly with new information
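The embedded information listed above ($, %, thousands separators, scale words such as billions) can be expanded into typed values. The following is a minimal Python sketch under stated assumptions, not the TANGO implementation: the function name `expand_cell`, the `SCALES` table and the type tags are all hypothetical.

```python
import re

# Hypothetical scale words that a header, caption or footnote might carry.
SCALES = {"thousands": 1_000, "millions": 1_000_000, "billions": 1_000_000_000}

NUMBER = r"-?[\d,]+(?:\.\d+)?"

def expand_cell(text, scale=1):
    """Return (value, type_tag) for one raw cell.

    `scale` is the multiplier implied by embedded information such as
    "(1,000)" or "billions" in a column label (an assumption of this sketch).
    """
    s = text.strip()
    if s in {"", "?"}:
        return None, "unknown"
    m = re.fullmatch(r"\$(" + NUMBER + ")", s)
    if m:
        return float(m.group(1).replace(",", "")) * scale, "dollar amount"
    m = re.fullmatch("(" + NUMBER + ")%", s)
    if m:
        # Percentages are not rescaled.
        return float(m.group(1).replace(",", "")), "percentage"
    if re.fullmatch(NUMBER, s):
        return float(s.replace(",", "")) * scale, "number"
    return s, "text"
```

For example, `expand_cell("177", SCALES["billions"])` expands a cell from a column labeled in billions, while `expand_cell("?")` keeps an unknown value explicit instead of dropping it.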

What is table understanding?
  Normalize table
  Take a table as an input and produce standard records in the form of attribute-value pairs as output
  Discover constraints among columns
  Understand the data values

Country             | GDP/PPP          | GDP/PPP Per Capita | Real-Growth Rate | Inflation
Afghanistan         | $21,000,000,000  | $800               | ?                | ?
Albania             | $13,200,000,000  | $3,800             | 7.3%             | 3.0%
Algeria             | $177,000,000,000 | $5,600             | 3.8%             | 3.0%
Andorra             | $1,300,000,000   | $19,000            | 3.8%             | 4.3%
Angola              | $13,300,000,000  | $1,330             | 5.4%             | 110.0%
Antigua and Barbuda | $674,000,000     | $10,000            | 3.5%             | 0.4%
…                   | …                | …                  | …                | …

Relationships discovered:
{has(Country, GDP/PPP), has(Country, GDP/PPP Per Capita), has(Country, Real-growth rate*), has(Country, Inflation*)}

Annotations on the figure:
  Left-most, primary key
  Country names (from data frame)
  Dollar amount (from data frame)
  Percentage (from data frame)

Example output record:
{<Country: Afghanistan>, <GDP/PPP: $21,000,000,000>, <GDP/PPP per capita: $800>, <Real-growth rate: ?>, <Inflation: ?>}
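The step from a normalized table to attribute-value records and has(...) relationships can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and treating the left-most column as the key is the simplification shown on the slide, not a general rule.

```python
def to_records(header, rows):
    """Pair each data cell with its column label, giving one dict per row."""
    return [dict(zip(header, row)) for row in rows]

def has_relationships(header):
    """Relate the left-most column (assumed to be the key) to every other column."""
    key = header[0]
    return [f"has({key}, {attribute})" for attribute in header[1:]]

header = ["Country", "GDP/PPP", "GDP/PPP Per Capita", "Real-Growth Rate", "Inflation"]
rows = [
    ["Afghanistan", "$21,000,000,000", "$800", "?", "?"],
    ["Albania", "$13,200,000,000", "$3,800", "7.3%", "3.0%"],
]

records = to_records(header, rows)
relationships = has_relationships(header)
```

Here `records[0]` holds the Afghanistan record as attribute-value pairs and `relationships` lists one has(Country, ...) entry per non-key column.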

Example: Creating a domain ontology
[Figure: domain ontology diagram. Concepts: Geopolitical Entity, Name, Country, City, Location, Longitude, Latitude, Time, Distances, Duration between Time zones. Relationship labels: has names; has GMT; Latitude and longitude designates location. Annotations: Has associated data frames; Includes procedural knowledge.]

Example: Table understanding to mini-ontology generation

Agglomeration         | Population | Continent         | Country
Tokyo                 | 31,139,900 | Asia              | Japan
New York-Philadelphia | 30,286,900 | The Americas      | United States of America
Mexico                | 21,233,900 | The Americas      | Mexico
Seoul                 | 19,969,100 | Asia              | Korea (South)
Sao Paulo             | 18,847,400 | The Americas      | Brazil
Jakarta               | 17,891,000 | Asia              | Indonesia
Osaka-Kobe-Kyoto      | 17,621,500 | Asia              | Japan
…                     | …          | …                 | …
Niigata               | 503,500    | Asia              | Japan
Raurkela              | 503,300    | Asia              | India
Homjel                | 502,200    | Europe            | Belarus
Zunyi                 | 501,900    | Asia              | China
Santiago              | 501,800    | The Americas      | Dominican Republic
Pingdingshan          | 501,500    | Asia              | China
Fargona               | 501,000    | Asia              | Uzbekistan
Kirov                 | 500,200    | Europe            | Russia
Newcastle             | 500,000    | Australia/Oceania | Australia

[Figure: mini-ontology generated from the table. Concepts: Agglomeration, Population, Country, Continent.]

Example: Concept matching to ontology merging
[Figure: the mini-ontology (Agglomeration, Population, Country, Continent) is merged with the domain ontology (Geopolitical Entity, Name, Country, City, Location, Longitude, Latitude, Time; relationship labels: has names; has GMT; Latitude and longitude designates location). Step labels: Merge; Results. The resulting ontology contains Geopolitical Entity, Name, City, Agglomeration, Country, Continent, Population, Location, Longitude, Latitude and Time.]

Concept matching
We use exhaustive concept matching techniques to match concepts from different mini-ontologies, including:
  Lexical and Natural Language Processing
  Value Similarity
  Value Features
  Data Frame Comparison
  Constraints

Concept Matching (Lexical & NLP)
Lexical
  Direct comparisons (substring/superstring)
  WordNet (Synonyms, Word Senses, Hypernyms/Hyponyms)
Natural Language Processing
  Phrases in column headers
  Footnotes (for columns, rows, values)
  Explanations of symbols, rows, columns
  Titles and subtitles
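The direct lexical comparison (substring/superstring) can be illustrated with a short Python sketch. The label normalization (lower-casing, treating hyphens and underscores as spaces) is an assumption of this example, and the WordNet and NLP steps are not covered.

```python
def normalize_label(label):
    """Lower-case a column label and collapse hyphens, underscores and extra spaces."""
    return " ".join(label.lower().replace("-", " ").replace("_", " ").split())

def lexical_match(a, b):
    """Classify label `a` relative to label `b`.

    Returns "equal", "substring" (a is contained in b),
    "superstring" (a contains b) or None when neither holds.
    """
    x, y = normalize_label(a), normalize_label(b)
    if x == y:
        return "equal"
    if x in y:
        return "substring"
    if y in x:
        return "superstring"
    return None
```

A call such as `lexical_match("Population", "Population Density")` reports a substring match, which is only weak evidence: the two columns share a word but describe different concepts, so the other matching techniques still need to be consulted.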

Concept Matching (Value Similarity)
  Compute overlap for string values by comparing data sets
  Compute overlap for numeric values by comparing Gaussian probability distributions
  Compute similarity of numeric values using regression

Concept Matching (Value Similarity)
[Figure: comparison of value sets A and B; labels: In A not in B; In B not in A]
A: Afghanistan, Albania, Algeria, Andorra, Yemen, Zambia, Zimbabwe
B: Afghanistan, Albania, Algeria, American Samoa, World, Yemen, Zambia, Zimbabwe
Real-world example
  Total of 193 cells in A
  Total of 267 cells in B
  77 fields in B not in A
  3 fields in A not in B
  190 total matches
  Proportion of matches with respect to A = 190/193 = 98%
  Proportion of matches with respect to B = 190/267 = 71%
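The overlap figures above follow from plain set operations. A minimal Python sketch, assuming exact string equality counts as a match (the function name and the result keys are illustrative):

```python
def value_overlap(a_values, b_values):
    """Compare two columns of string values as sets."""
    a, b = set(a_values), set(b_values)
    matches = a & b
    return {
        "matches": len(matches),
        "in_a_not_b": len(a - b),
        "in_b_not_a": len(b - a),
        # Proportion of matches with respect to each column.
        "proportion_a": len(matches) / len(a) if a else 0.0,
        "proportion_b": len(matches) / len(b) if b else 0.0,
    }

# The sample values shown on the slide.
A = ["Afghanistan", "Albania", "Algeria", "Andorra", "Yemen", "Zambia", "Zimbabwe"]
B = ["Afghanistan", "Albania", "Algeria", "American Samoa", "World",
     "Yemen", "Zambia", "Zimbabwe"]
overlap = value_overlap(A, B)
```

Reporting the proportion with respect to both columns matters because the measure is asymmetric: nearly all of A can appear in B while a sizable part of B is missing from A.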

Concept Matching (Value Similarity)
[Figure: comparison of numeric value sets A and B using a Gaussian PDF; labels: In A not in B; In B not in A]
A: 31,900,600; 30,521,550; 25,335,200; 12,300,555; 3,567,203; 2,300,531; 1,400,112
B: 31,500,900; 30,400,111; 25,500,100; 21,000,900; 7,000,000; 3,500,050; 2,300,000; 1,500,000
  Total of 170 cells in A
  Total of 240 cells in B
  50 fields in B not in A
  2 fields in A not in B
  168 total matches
  Proportion of matches with respect to A = 168/170 = 99%
  Proportion of matches with respect to B = 168/240 = 70%
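The slides do not say how the two Gaussian distributions are compared. As one possible reading, the sketch below fits a normal distribution to each numeric column and scores their overlap with the Bhattacharyya coefficient, which is 1 for identical distributions and approaches 0 as they separate. The choice of this coefficient and the function name are assumptions of this example.

```python
import math
import statistics

def gaussian_similarity(a_values, b_values):
    """Bhattacharyya coefficient of the normal distributions fitted to two columns.

    Each column needs at least two values and a non-zero variance.
    """
    mu_a, mu_b = statistics.fmean(a_values), statistics.fmean(b_values)
    var_a, var_b = statistics.variance(a_values), statistics.variance(b_values)
    if var_a == 0 or var_b == 0:
        raise ValueError("both columns need a non-zero variance")
    total = var_a + var_b
    # sqrt(2*sigma_a*sigma_b / (var_a + var_b)) * exp(-(mu_a - mu_b)^2 / (4*(var_a + var_b)))
    spread_term = math.sqrt(2 * math.sqrt(var_a * var_b) / total)
    mean_term = math.exp(-((mu_a - mu_b) ** 2) / (4 * total))
    return spread_term * mean_term

# The sample values shown on the slide.
A = [31_900_600, 30_521_550, 25_335_200, 12_300_555, 3_567_203, 2_300_531, 1_400_112]
B = [31_500_900, 30_400_111, 25_500_100, 21_000_900, 7_000_000, 3_500_050,
     2_300_000, 1_500_000]
similarity = gaussian_similarity(A, B)
```

Unlike the set overlap used for strings, this score does not require any two numbers to be equal, which suits columns such as population counts taken from different sources or years.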

Concept Matching (Value Features)
We can also compute similarities from value characteristics such as:
  Character/numeric length, ratio
  Numeric values mean, variance, standard deviation
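The value characteristics named above can be collected into a small feature summary per column. A minimal Python sketch; the feature names and the way cells are cleaned before numeric parsing are assumptions of this example.

```python
import statistics

def value_features(values):
    """Summarize a column of raw cell strings as a dict of simple features."""
    if not values:
        return {}
    chars = "".join(values)
    features = {
        "mean_length": statistics.fmean(len(v) for v in values),
        # Share of digit characters among all characters in the column.
        "numeric_ratio": sum(c.isdigit() for c in chars) / len(chars) if chars else 0.0,
    }
    numbers = []
    for v in values:
        cleaned = v.strip().replace(",", "").lstrip("$").rstrip("%")
        try:
            numbers.append(float(cleaned))
        except ValueError:
            pass  # non-numeric cell such as "?" or a country name
    if len(numbers) >= 2:
        features["mean"] = statistics.fmean(numbers)
        features["variance"] = statistics.variance(numbers)
        features["stdev"] = statistics.stdev(numbers)
    return features
```

Two columns whose feature summaries are close (similar lengths, similar share of digits, similar mean and spread) are candidates for the same concept even when no individual values coincide.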

Concept Matching (Data frames)
Snippets of real-world knowledge about data (type, length, nearby keywords, patterns [as in regexps], functional, etc.)
We have used data frames to
  Recognize data types
  Include recognizers for values (dates, times, longitude, latitude, countries, cities, etc.)
  Provide conversion routines
  Match headers, labels, footnotes and values
  Compose or split columns (e.g., addresses)
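A data frame in this sense can be pictured as a value pattern bundled with nearby keywords. The Python sketch below is a rough illustration only: the class `DataFrameSketch`, the three sample frames and their regular expressions are invented for this example and are far simpler than real data frames.

```python
import re
from dataclasses import dataclass

@dataclass
class DataFrameSketch:
    name: str
    pattern: str           # regular expression a value must fully match
    keywords: tuple = ()   # words expected in nearby headers or captions

    def recognizes(self, value):
        return re.fullmatch(self.pattern, value.strip()) is not None

    def header_hint(self, header):
        lowered = header.lower()
        return any(keyword in lowered for keyword in self.keywords)

# Hypothetical frames; real ones would also carry conversion routines.
FRAMES = [
    DataFrameSketch("percentage", r"-?\d+(\.\d+)?%", ("rate", "inflation", "percent")),
    DataFrameSketch("dollar amount", r"\$\d{1,3}(,\d{3})*(\.\d+)?", ("gdp", "price", "cost")),
    DataFrameSketch("latitude", r"\d{1,2}(\.\d+)?°?\s?[NS]", ("latitude",)),
]

def best_frame(header, values):
    """Pick the frame that recognizes the largest share of values.

    A keyword hit in the header only breaks ties. Returns None when no
    frame recognizes any value.
    """
    def score(frame):
        hits = sum(frame.recognizes(v) for v in values)
        share = hits / len(values) if values else 0.0
        return (share, frame.header_hint(header))

    top = max(FRAMES, key=score)
    return top.name if score(top)[0] > 0 else None
```

With these frames, a column headed "Inflation" holding "3.0%", "4.3%" and "?" is recognized as a percentage even though one value is unknown.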

Concept Matching (Constraints)
  Keys in tables (as well as nonkeys)
  Functional relationships
  1-1, 1-*, *-1 or *-* correspondences
  Subset/superset of value sets
  Unknown and null values
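Keys, functional relationships and correspondence types can be tested directly on table rows. A minimal Python sketch, assuming rows are dicts keyed by column label and that unknown values have already been filtered out; the function names are illustrative.

```python
def is_key(rows, column):
    """A column is a candidate key if no value repeats."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))

def is_functional(rows, src, dst):
    """True if every src value maps to exactly one dst value (src -> dst)."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[src], row[dst]) != row[dst]:
            return False
    return True

def correspondence(rows, a, b):
    """Classify the a-to-b correspondence as 1-1, *-1, 1-* or *-*."""
    forward = is_functional(rows, a, b)    # each a has one b
    backward = is_functional(rows, b, a)   # each b has one a
    return {
        (True, True): "1-1",
        (True, False): "*-1",
        (False, True): "1-*",
        (False, False): "*-*",
    }[(forward, backward)]

# A few rows from the agglomeration table.
rows = [
    {"Agglomeration": "Tokyo", "Continent": "Asia", "Country": "Japan"},
    {"Agglomeration": "Osaka-Kobe-Kyoto", "Continent": "Asia", "Country": "Japan"},
    {"Agglomeration": "Seoul", "Continent": "Asia", "Country": "Korea (South)"},
    {"Agglomeration": "Sao Paulo", "Continent": "The Americas", "Country": "Brazil"},
]
```

On these rows Agglomeration behaves as a key, and many agglomerations map to one country. Such evidence only holds for the rows observed, so a larger sample can still refute it.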

Ontology merging/growing
Direct merge (no conflicts)
  Use results of matching phase to find similar concepts in ontologies (e.g., data value similarities, data frames, NLP, etc.)
Conflict resolution
  Interactively identify evidence and counter evidence of functional relationships among mini-ontologies using constraint resolution
IDS
  Interaction with human knowledge engineer
  Issues: identify
  Default strategy: apply
  Suggestions: make
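The conflict-free case can be pictured as a union of two concept graphs after renaming matched concepts. The Python sketch below assumes a very simple representation (a set of concept names plus a set of (source, label, target) triples) and a `mapping` produced by the matching phase. Both the representation and the function name are assumptions of this example, and conflict resolution and IDS are not modeled.

```python
def direct_merge(onto_a, onto_b, mapping):
    """Merge onto_b into onto_a.

    `mapping` sends a concept name in onto_b to the matched concept name
    in onto_a. Unmapped concepts are added under their own name.
    """
    def rename(concept):
        return mapping.get(concept, concept)

    concepts = set(onto_a["concepts"]) | {rename(c) for c in onto_b["concepts"]}
    relationships = set(onto_a["relationships"]) | {
        (rename(source), label, rename(target))
        for source, label, target in onto_b["relationships"]
    }
    return {"concepts": concepts, "relationships": relationships}
```

Because both parts are sets, a concept or relationship that appears in both ontologies after renaming is stored only once, which is what makes the merge "direct" when no constraints disagree.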

Example: Another mini-ontology generation
[Figure: mini-ontology for places. Concepts: Place, Place Name, Longitude, Latitude, Elevation, USGS Quad, Area, Country, State, Mine, Reservoir, Lake, City/town.]

Example: Another mini-ontology generation
[Figure: the Place mini-ontology (Place, Place Name, Longitude, Latitude, Elevation, USGS Quad, Area, Country, State, Mine, Reservoir, Lake, City/town) shown with the previously merged ontology (Geopolitical Entity, Name, City, Agglomeration, Country, Continent, Population, Location, Longitude, Latitude, Time; relationship labels: has names; has GMT; Latitude and longitude designates location). Step label: Merge.]

Example: Concept Mapping to Ontology Merging
[Figure: merged ontology. Concepts: Place, Elevation, USGS Quad, Area, Mine, Reservoir, Lake, City/town, Country, State, Location, Longitude, Latitude, Name, Geopolitical Entity, Geopolitical Entity with population, Population, Agglomeration, Continent, Time. Relationship labels: has names; has GMT; Latitude and longitude designates location.]

Future direction
  Start with multiple tables (or URLs) and generate mini-ontologies
  Identify most suitable mini-ontologies to merge by calculating which tables have most overlap of concepts
  Generate multiple domain ontologies
  Integrate with form-based data extraction tools (smarter Web search engines)
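Choosing which mini-ontologies to merge first by concept overlap can be sketched as a pairwise comparison. The slides do not name an overlap measure, so the Jaccard index used here is an assumption, as are the function names and the sample concept sets.

```python
from itertools import combinations

def jaccard(a, b):
    """Shared concepts divided by all concepts of the two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_pair(mini_ontologies):
    """Return (score, name1, name2) for the pair with the highest concept overlap.

    `mini_ontologies` maps a name to an iterable of concept names and must
    hold at least two entries.
    """
    return max(
        (jaccard(mini_ontologies[x], mini_ontologies[y]), x, y)
        for x, y in combinations(sorted(mini_ontologies), 2)
    )

# Hypothetical concept sets for three tables.
candidates = {
    "t1": {"Country", "Population"},
    "t2": {"Country", "Population", "Continent"},
    "t3": {"Place", "Elevation"},
}
score, first, second = best_pair(candidates)
```

In practice the concept sets would be compared after concept matching rather than by raw label equality, so that differently named but equivalent concepts still count as overlap.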