Post on 19-Dec-2015

Page 1: 1 Concepts, Ontologies, and Project TANGO Deryle Lonsdale BYU Linguistics and English Language lonz@byu.edu

Concepts, Ontologies, and Project TANGO

Deryle Lonsdale
BYU Linguistics and English Language
lonz@byu.edu

Outline
NSF projects
Semantic Web
Concepts
Project TIDIE
Ontologies
Project TANGO
Tables
Ontology generation

Acknowledgements
NSF
David Embley (BYU CS), Steve Liddle (BYU Marriott School) and Yuri Tijerino
BYU Data Extraction Group members

The National Science Foundation
Federal agency, $5.5 billion budget, funds 20% of all federally supported basic research conducted by America’s colleges and universities
7 directorates
  Biological Sciences, Computer and Information Science and Engineering, Engineering, Geosciences, Mathematics and Physical Sciences, Social, Behavioral and Economic Sciences, and Education and Human Resources
200,000 scientists, engineers, educators and students at universities, laboratories and field sites
10,000 awards/year, 3 years duration (avg.)

The NSF Nifty 50 (general)
ACCELERATING, EXPANDING UNIVERSE
ANTARCTIC OZONE HOLE RESEARCH
ARABIDOPSIS—A PLANT GENOME PROJECT
BAR CODES
BLACK HOLES CONFIRMED
BUCKY BALLS
COMPUTER VISUALIZATION TECHNIQUES
DATA COMPRESSION TECHNOLOGY
DISCOVERY OF PLANETS
DOPPLER RADAR
EFFECTS OF ACID RAIN
EL NIÑO AND LA NIÑA PREDICTIONS
FIBER OPTICS
GEMINI TELESCOPES
HANTAVIRUS IDENTIFICATION
DNA FINGERPRINTING
MRI—MAGNETIC RESONANCE IMAGING
NANOTECHNOLOGY
THE NATIONAL OBSERVATORIES
OVERCOMING HEAVY METALS
OVERCOMING SALT TOXICITY
TISSUE ENGINEERING
TUMOR DETECTION
VOLCANIC ERUPTION DETECTION
YELLOW BARRELS

Language-related Nifty 50
AMERICAN SIGN LANGUAGE DICTIONARY DEVELOPMENT
COMPUTER VISUALIZATION TECHNIQUES
THE DARCI CARD
DATA COMPRESSION TECHNOLOGY
THE "EYE CHIP" OR RETINA CHIP
THE INTERNET
PERSONS WITH DISABILITIES ACCESS TO THE WEB
PROJECT LISTEN
SPEECH RECOGNITION TECHNOLOGY
vBNS—VERY HIGH SPEED BACKBONE NETWORK SYSTEM
WEB BROWSERS

Browsing the Semantic Web
[Figure; callout labels: Hypernym, Synonym, Annotation, The search query]

Browsing the Semantic Web
Ranking based on content data and structure
Grouping results by their conceptual relationships
Using lexical semantics for similarity search
[Figure; result group labels: movie, astronomy, sports]

Desirable, not (yet) possible
Word sense disambiguation
Other types of queries (e.g. services)
  What is the cheapest available round-trip flight to Cancun the day after finals this semester?
  Set up an appointment with my optometrist for next week.
  List available 4-person BYU-approved apartments in Orem for under $150/month.
  Find me a linguistics job in Tahiti.

Project TIDIE

Apr 10, 2001 – May 12, 2005

Overview of TIDIE
3-year NSF project at BYU
Total amount about $430,000
PI David Embley (BYU CS), 4 co-PIs from BYU
18 grad students, 45 publications
Demos, tools, papers, presentations at website (www.deg.byu.edu/)

Goal of TIDIE
Target-Based Independent-of-Document Information Extraction
Target-based: user specifies what to find
  Not just keyword search, but concept-based search using an ontology
Document independent
  Should work even if pages change over time, on new documents
IE: match, merge, retrieve, format information
Present in way that user can search, query results

Document-based IE

Recognition and extraction

Car | Year | Make | Model | Mileage | Price | PhoneNr
0001 | 1989 | Subaru | SW | | $1900 | (336)835-8597
0002 | 1998 | | Elantra | | | (336)526-5444
0003 | 1994 | HONDA | ACCORD EX | 100K | | (336)526-1081

Car | Feature
0001 | Auto
0001 | AC
0002 | Black
0002 | 4 door
0002 | tinted windows
0002 | Auto
0002 | pb
0002 | ps
0002 | cruise
0002 | am/fm
0002 | cassette stereo
0002 | a/c
0003 | Auto
0003 | jade green
0003 | gold

Concepts
What drives the matching process for IE
Inherent in words, numbers, phrases, text
Linguistics: lexical semantics
Denotations: entities, attributes
Location: relationships
Occurrences: constraints

Concept matching
We use exhaustive concept matching techniques to find concepts in documents, including:
  Lexical information (lexicons)
  Natural language processing (NLP) techniques
  Similarity of values
  Features of value
  Data frames
  Constraints

Lexicons

Repositories of enumerable classes of lexical information

FirstNames, LastNames, USStates, ProvoOremApts, CarMakes, Drugs, CampGroundFeats, etc.

WordNet (synonyms, word senses, hypernyms/hyponyms)

The data-frame library
Snippets of real-world knowledge about data (type, length, nearby keywords, patterns [as in regexps], functional relations, etc.)
Low-level patterns implemented as regular expressions
Match items such as email addresses, phone numbers, names, etc.

Mileage matches [8]
  constant
    { extract "\b[1-9]\d{0,2}k"; substitute "[kK]" -> "000"; },
    { extract "[1-9]\d{0,2}?,\d{3}";
      context "[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]";
      substitute "," -> ""; },
    { extract "[1-9]\d{0,2}?,\d{3}";
      context "(mileage\:\s*)[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]";
      substitute "," -> ""; },
    { extract "[1-9]\d{3,6}";
      context "[^\$\d][1-9]\d{3,6}\s*mi(\.|\b|les\b)"; },
    { extract "[1-9]\d{3,6}";
      context "(mileage\:\s*)[^\$\d][1-9]\d{3,6}\b"; };
  keyword "\bmiles\b", "\bmi\.", "\bmi\b", "\bmileage\b";
end;
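The Mileage data frame can be approximated with ordinary regular expressions. The following is a minimal sketch in Python, not the project's implementation: the rule layout and the names `MILEAGE_RULES` and `extract_mileage` are assumptions made for illustration, and only two of the rule types (the "k" shorthand and comma-grouped numbers followed by a mileage word) are covered.

```python
import re

# Hypothetical rule table: each rule has an extraction pattern and a
# (find, replace) substitution that normalizes the extracted string.
MILEAGE_RULES = [
    # "100K" -> "100000"
    {"extract": r"\b[1-9]\d{0,2}[kK]\b", "substitute": (r"[kK]", "000")},
    # "7,000 miles" -> "7000". The lookbehind rejects prices such as
    # "$11,995"; the lookahead requires "mi", "mi." or "miles" as context.
    {"extract": r"(?<![$\d,])[1-9]\d{0,2},\d{3}(?=\s*mi(?:\.|les\b|\b))",
     "substitute": (r",", "")},
]


def extract_mileage(text):
    """Return normalized mileage strings found in text, rule by rule."""
    hits = []
    for rule in MILEAGE_RULES:
        find, replacement = rule["substitute"]
        for match in re.finditer(rule["extract"], text):
            hits.append(re.sub(find, replacement, match.group(0)))
    return hits
```

Keeping the context test inside lookarounds means the matched text is exactly the value to store, which mirrors the extract/context split in the data frame.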

Isolated concepts are OK, but...
We’re also interested in the relations between concepts
This is often best done graphically
Ontology: arrangement of concepts that makes their relations and constraints explicit
Conceptual modeling: field of CS / linguistics that deals with formalizing concepts, using such information
BYU has its own well-known conceptual modeling framework (OSM)

Conceptual modeling (OSM)
[Figure: OSM diagram. Object sets: Car, Year, Make, Model, Mileage, Price, Feature, PhoneNr, Extension. Relationship sets are labeled "has" and "is for", with participation constraints 0..1, 0..* and 1..*.]

Ontologies and IE
[Figure; labels: Source, Target]

'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800 or 566-3888

Constant/keyword recognition
Descriptor/String/Position(start/end)

Year|97|2|3
Make|CHEV|5|8
Make|CHEVY|5|9
Model|Cavalier|11|18
Feature|Red|21|23
Feature|5 spd|26|30
Mileage|7,000|38|42
KEYWORD(Mileage)|miles|44|48
Price|11,995|100|105
Mileage|11,995|100|105
PhoneNr|566-3800|136|143
PhoneNr|566-3888|148|155
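The recognition step can be sketched as a scan of the ad with one pattern per descriptor, emitting Descriptor|String|start|end records. This is an illustrative Python sketch, not the project's recognizer: the patterns in `RECOGNIZERS` are hypothetical and far simpler than real data frames, and offsets are assumed to be 1-based and inclusive, which is what the first records on the slide suggest.

```python
import re

AD = ("'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles. Previous owner "
      "heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, "
      "566-3800 or 566-3888")

# Hypothetical, heavily simplified recognizers (descriptor, pattern).
RECOGNIZERS = [
    ("Year", r"(?<=')\d{2}\b"),
    ("Make", r"\b(?:CHEVY|CHEV|HONDA|Subaru)\b"),
    ("Model", r"\b(?:Cavalier|ACCORD|Elantra)\b"),
    ("Mileage", r"\b[1-9]\d{0,2},\d{3}\b"),
    ("KEYWORD(Mileage)", r"\bmiles\b|\bmileage\b"),
    ("Price", r"(?<=\$)[1-9]\d{0,2},\d{3}\b"),
    ("PhoneNr", r"\b\d{3}-\d{4}\b"),
]


def recognize(text):
    """Return 'Descriptor|String|start|end' records ordered by position."""
    records = []
    for descriptor, pattern in RECOGNIZERS:
        for match in re.finditer(pattern, text):
            # Convert Python's 0-based, end-exclusive span to 1-based, inclusive.
            records.append((match.start() + 1, match.end(), descriptor, match.group(0)))
    records.sort()
    return [f"{d}|{s}|{start}|{end}" for start, end, d, s in records]
```

As on the slide, the string 11,995 is claimed by both Price and Mileage at this stage; recognition deliberately over-generates and leaves the choice to a later step. Offsets for later parts of the ad depend on the exact whitespace of the source text.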

Year|97|2|3
Make|CHEV|5|8
Make|CHEVY|5|9
Model|Cavalier|11|18
Feature|Red|21|23
Feature|5 spd|26|30
Mileage|7,000|38|42
KEYWORD(Mileage)|miles|44|48
Price|11,995|100|105
Mileage|11,995|100|105
PhoneNr|566-3800|136|143
PhoneNr|566-3888|148|155

Database instance generator

insert into Car values(1001, "97", "CHEVY", "Cavalier", "7,000", "11,995", "566-3800")
insert into CarFeature values(1001, "Red")
insert into CarFeature values(1001, "5 spd")
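One way to get from the recognized records to the inserts is sketched below in Python with the standard `sqlite3` module. The selection heuristics in `choose` (prefer the candidate nearest a matching keyword, otherwise the earliest and longest match) are assumptions chosen so that this one example comes out as on the slide; the slide does not state the generator's actual rules, and the table layout is likewise hypothetical.

```python
import sqlite3

# (descriptor, string, start, end), copied from the records above.
RECORDS = [
    ("Year", "97", 2, 3), ("Make", "CHEV", 5, 8), ("Make", "CHEVY", 5, 9),
    ("Model", "Cavalier", 11, 18), ("Feature", "Red", 21, 23),
    ("Feature", "5 spd", 26, 30), ("Mileage", "7,000", 38, 42),
    ("KEYWORD(Mileage)", "miles", 44, 48), ("Price", "11,995", 100, 105),
    ("Mileage", "11,995", 100, 105), ("PhoneNr", "566-3800", 136, 143),
    ("PhoneNr", "566-3888", 148, 155),
]
SINGLE_VALUED = ["Year", "Make", "Model", "Mileage", "Price", "PhoneNr"]


def choose(records, attribute):
    """Pick one value for a single-valued attribute (assumed heuristics)."""
    candidates = [r for r in records if r[0] == attribute]
    if not candidates:
        return None
    keywords = [r for r in records if r[0] == f"KEYWORD({attribute})"]
    if keywords:
        # Prefer the candidate that starts closest to a keyword.
        candidates.sort(key=lambda c: min(abs(c[2] - k[2]) for k in keywords))
    else:
        # Otherwise prefer the earliest start, then the longest string.
        candidates.sort(key=lambda c: (c[2], -(c[3] - c[2])))
    return candidates[0][1]


def generate_instance(records, car_id):
    """Populate an in-memory database with one car and its features."""
    db = sqlite3.connect(":memory:")
    db.execute("create table Car (id, Year, Make, Model, Mileage, Price, PhoneNr)")
    db.execute("create table CarFeature (id, Feature)")
    row = [car_id] + [choose(records, a) for a in SINGLE_VALUED]
    db.execute("insert into Car values (?, ?, ?, ?, ?, ?, ?)", row)
    for descriptor, string, _start, _end in records:
        if descriptor == "Feature":
            db.execute("insert into CarFeature values (?, ?)", (car_id, string))
    return db
```

With these heuristics the keyword "miles" at position 44 pulls Mileage toward 7,000 rather than 11,995, and CHEVY wins over the shorter overlapping CHEV.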

Car ads extraction ontology
[Figure: diagram of the car ads extraction ontology. Object sets: CarAds, Year, Make, Model, Trim, Mileage, Price, PhoneNr, Color, Feature, Accessory, BodyType, OtherFeature, Engine, Transmission. Relationship sets are labeled "has", with participation constraints such as 0:1, 1:*, 0:*, 0:0.7:1, 0:0.78:1 and 0:0.9:1.]

Car ads ontology (textual)

Car [->object];
Car [0..1] has Year [1..*];
Car [0..1] has Make [1..*];
Car [0..1] has Model [1..*];
Car [0..1] has Mileage [1..*];
Car [0..*] has Feature [1..*];
Car [0..1] has Price [1..*];
PhoneNr [1..*] is for Car [0..*];
PhoneNr [0..1] has Extension [1..*];
Year matches [4]
  constant
    { extract "\d{2}"; context "([^\$\d]|^)[4-9]\d[^\d]"; substitute "^" -> "19"; },
    …
  …
End;
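The relationship-set declarations are regular enough to be read mechanically. The sketch below, in Python, parses them into dictionaries and lists the attributes a Car can have at most once. It is an illustration only: the function names are hypothetical, and the bracket after Car is read as the number of times one Car may take part in the relationship, an interpretation suggested by Feature being the only many-valued attribute in the earlier tables.

```python
import re

ONTOLOGY_TEXT = """
Car [0..1] has Year [1..*];
Car [0..1] has Make [1..*];
Car [0..1] has Model [1..*];
Car [0..1] has Mileage [1..*];
Car [0..*] has Feature [1..*];
Car [0..1] has Price [1..*];
PhoneNr [1..*] is for Car [0..*];
PhoneNr [0..1] has Extension [1..*];
"""

# ObjectSet [min..max] relationship-name ObjectSet [min..max];
DECLARATION = re.compile(
    r"(\w+) \[(\d+)\.\.(\d+|\*)\] (has|is for) (\w+) \[(\d+)\.\.(\d+|\*)\];"
)


def parse_relationships(text):
    """Return one dict per relationship-set declaration found in text."""
    relationships = []
    for match in DECLARATION.finditer(text):
        left, lmin, lmax, name, right, rmin, rmax = match.groups()
        relationships.append({
            "left": left, "left_card": (int(lmin), lmax),
            "name": name,
            "right": right, "right_card": (int(rmin), rmax),
        })
    return relationships


def single_valued(relationships, object_set="Car"):
    """Attributes in which object_set participates at most once."""
    return [r["right"] for r in relationships
            if r["left"] == object_set and r["left_card"][1] == "1"]
```

A list like the one `single_valued` returns is what a database instance generator needs in order to know which attributes go into the Car table and which need a separate table.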

A gene ontology

A genealogy data model

Finding jobs in linguistics

Built ontology for linguistics jobs: what defines a linguistics job

Data frames and lexicons: language names (www.ethnologue.com), subfields of linguistics (www.linguistlist.org), tools linguists use, programming languages, activities, responsibilities, country names

Documents: 3500 web pages + emails to me

Complete results reported in DLLS 2003

Sample query

Sample output

Subfield expertise sought
[Bar chart, axis 0 to 700; categories: IE/IR, Morpho, NLP, Phonetics, Phonology, Pragmatics, Speech, Syntax, Semantics, MT, TESOL/EFL, Translation]
[Bar chart, axis 0 to 800; categories: Psycho, Neuro, Historical, Typological, Acquisition, Cognition, Socioling, Lexicography, Philology, Philosophy, Anthropo]

Technical skills sought
[Bar chart, axis 0 to 700; categories: C/C++, CGI, HTML/SGML, Java/Jscript, Lisp, Perl, Prolog, SQL, Tcl, VB, XML/XSLT]
[Bar chart, axis 0 to 300; categories: Machine learning, Finite-state, Statistical, Stoch/Prob, Math, Generative, Field Methods]

Sample observations
270 don’t have linguist* (!)
Computer/computational background required for almost 1/3 (1116)
Noticeable amount of headhunting, particularly in Seattle, DC areas
Often a job title is not even listed (!)
Great need for ontologies related to linguistics
  job titles
  theoretical frameworks, subfields
  typical linguist job activities
  linguistic research/development venues

An engineering discipline?
160 linguistics jobs ending in “engineer”
Software development cycle
  research e., software design e.
  development e., software e.
  software quality e., linguistic test e., linguistic quality e.
  linguistic support e., user experience e.
  presales e., technical sales e.
Specific subfields
  web site e.
  speech e., voice recognition e., speech recognition application e., speech e., ASR tuning e., audio e.
  dialog e.
  tools e.
  AI e., NLP e.
  knowledge e., ontology e.
  linguist e., natural language e.
  staff e.
  human factors e., user interface e.

A recent ontologist job ad
Date: Thu, 28 Jul 2005 11:44:40
Subject: General Linguistics: Ontologist, Denver, USA

Job Rank: Ontologist
Specialty Areas: General Linguistics

Position Summary: Ontologist
This person will be responsible for modifying & editing Ontology structures.

Skills:
  Basic computer skills such as Internet, email, and spreadsheet programs
  In-depth knowledge of any major industry, such as Health Care, Automotive, Legal, Construction, and so forth helpful
  Superior communication skills, both oral and written. Ability to communicate effectively with reports, peers, superiors, and customers essential
  Travel &/or foreign language experience desired

Personal Characteristics:
  A healthy sense of logic, and a love for details
  A deep and abiding love of language, and of rule-governed classification systems. This person should be excited by the challenge of figuring out the precise place where a word belongs, and be delighted with the prospect of performing such tasks as the major part of their job

Position Qualifications:
  -Bachelor's degree, preferably in Linguistics, Library Science, English, or related field

Another recent ontologist ad
Position Summary: Lead Ontologist
The Lead Ontologist will be responsible for creating & designing Ontology and Ontology structures. This person will be responsible for innovation and general Ontology development as Ontology requirements change. They will serve as Team Lead on various Ontology projects, and they will assist the Director with certain aspects of management, including the development of department culture and standards. They will also serve as a liaison between the Director and the rest of the team.

Skills:
  Ability to edit & manipulate text highly desired, using tools such as Emacs and Perl. High level programming language experience and SQL also desired
  Knowledge of Ontology structures, and experience with developing and maintaining such structures
  Ability to assist with Ontology development and use problem-solving skills to overcome obstacles
  Ability to QA own Ontology work, and work of others
  Ability to lead projects from set-up through to QA
  Leadership or management experience a plus

Position Qualifications:
  -Bachelor's degree in Linguistics, Library Science, or related field
  -2-3 years experience in Ontology or related field

Application Deadline: Open until filled.

Matching request with ontology

“Tell me about cruises on San Francisco Bay. I’d like to know scheduled times, cost, and the duration of cruises on Friday of next week.”

Building a query
[Figure: the request mapped onto a query over the ontology. Selection Constants: Friday, Oct. 29th; San Francisco Bay. Projection: scheduled times, cost, duration. Other labels: Join Path; = Result ( )]

StartTime | Price | Duration | Source
10:45 am, 12:00 pm, 1:15, 2:30, 4:00 | $20.00, $16.00, $12.00 | | 1
10:00 am, 10:45 am, 11:15 am, 12:00 pm, 12:30 pm, 1:15 pm, 1:45 pm, 2:30 pm, 3:00 pm, 3:45 pm, 4:15 pm, 5:00 pm | $17.00, $16.00, $12.00 | 1 Hour | 2

Another example
Service Request
  I want to see a dermatologist next week; any day would be ok for me, at 4:00 p.m. The dermatologist must be within 20 miles from my home and must accept my insurance.
Match with Task Ontology
  Domain Ontology
  Process Ontology
Complete, Negotiate, Finalize

Service domain ontology
[Figure: service domain ontology diagram; Appointment is marked "->". Object sets: Appointment, Date, Time, Place, Address, Person, Name, Service Provider, Service Description, Duration, Cost, Insurance, Medical Service Provider, Doctor, Dermatologist, Pediatrician, Auto Service Provider, Auto Mechanic. Relationship labels: has, is at, is on, is with, is for, provides, accepts. Values shown: "IHC", "DMBA".]

Relevant mini-ontology
[Figure: the part of the service domain ontology relevant to the request; Appointment is marked "->". Object sets: Appointment, Date, Time, Place, Address, Person, Name, Dermatologist. Relationship labels: has, is at, is on, is with, is for.]

Ontologies: issues
Most successful in data-rich, narrow-domain applications
Ambiguities are problematic; context only partially eliminates them
Incompleteness: implicit information
Commonsense world pragmatics evasive
Knowledge prerequisites are steep
Major efforts in creation, maintenance
  Must be created by experts
  Experts are biased in knowledge, agreement needed
  Ontologies continually change; upkeep a massive task

Ontologies: possible solutions
Some automation is needed
Current automatic generation of ontologies is not successful, because they are extracted from free-form, unstructured text.
A more effective alternative is to extract ontologies from structured data on the web (tables, charts, etc.)
TANGO project
  Part 1: Extract tables from the web
  Part 2: Define mini-ontologies from tables
  Part 3: Merge into growing domain ontology

Project TANGO

Overview
Table ANalysis for Generating Ontologies
3-year NSF-funded project
Joint BYU/RPI project
Uses and extends TIDIE concepts, ontologies
Goal is to process tables, generate ontologies, use results for IE

Motivation

Keyword or link analysis search not enough to search for information in tables

Structure in tables can lead to domain knowledge which includes concepts, relationships and constraints (ontologies)

Tables on web created for human use can lead to robust domain ontologies

Table understanding
What is a table?
Why table normalization?
What is table understanding?
What is mini-ontology generation?

What is a table?
“…a two-dimensional assembly of cells used to present information…” (Lopresti and Nagy)
Normalized tables (row-column format)
Small paper (using OCR) and/or electronic tables (marked up) intended for human use
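For electronic tables marked up in HTML, pulling out the raw cell grid is the easy first step. The sketch below uses Python's standard `html.parser` module; the class and function names are hypothetical, and it deliberately ignores nested tables, `rowspan`/`colspan` and layout-only tables, which are exactly the cases that make real table understanding hard.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html):
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables
```

The output is only a grid of strings; deciding which cells are labels and which are values is left to the normalization and understanding steps.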

?
Olympus C-750 Ultra Zoom
Sensor Resolution: 4.2 megapixels
Optical Zoom: 10 x
Digital Zoom: 4 x
Installed Memory: 16 MB
Lens Aperture: F/8-2.8/3.7
Focal Length min: 6.3 mm
Focal Length max: 63.0 mm

?
Olympus C-750 Ultra Zoom
Sensor Resolution | 4.2 megapixels
Optical Zoom | 10 x
Digital Zoom | 4 x
Installed Memory | 16 MB
Lens Aperture | F/8-2.8/3.7
Focal Length min | 6.3 mm
Focal Length max | 63.0 mm

Digital Camera
Olympus C-750 Ultra Zoom
Sensor Resolution: 4.2 megapixels
Optical Zoom: 10 x
Digital Zoom: 4 x
Installed Memory: 16 MB
Lens Aperture: F/8-2.8/3.7
Focal Length min: 6.3 mm
Focal Length max: 63.0 mm

?
Flight # | Class | From | Time/Date | To | Time/Date | Stops
Delta 16 | Coach | JFK | 6:05 pm, 02 01 04 | CDG | 7:35 am, 03 01 04 | 0
Delta 119 | Coach | CDG | 10:20 am, 09 01 04 | JFK | 1:00 pm, 09 01 04 | 0

Airline Itinerary
Flight # | Class | From | Time/Date | To | Time/Date | Stops
Delta 16 | Coach | JFK | 6:05 pm, 02 01 04 | CDG | 7:35 am, 03 01 04 | 0
Delta 119 | Coach | CDG | 10:20 am, 09 01 04 | JFK | 1:00 pm, 09 01 04 | 0

?
Place | Bonnie Lake
County | Duchesne
State | Utah
Type | Lake
Elevation | 10,000 feet
USGS Quad | Mirror Lake
Latitude | 40.711°N
Longitude | 110.876°W

Maps
Place | Bonnie Lake
County | Duchesne
State | Utah
Type | Lake
Elevation | 10,100 feet
USGS Quad | Mirror Lake
Latitude | 40.711°N
Longitude | 110.876°W

Table normalization
Take any table, produce a standard row-column table with all data cells containing expanded values and type information

Country | GDP/PPP | GDP/PPP Per Capita | Real-Growth Rate | Inflation
Afghanistan | $21,000,000,000 | $800 | ? | ?
Albania | $13,200,000,000 | $3,800 | 7.3% | 3.0%
Algeria | $177,000,000,000 | $5,600 | 3.8% | 3.0%
Andorra | $1,300,000,000 | $19,000 | 3.8% | 4.3%
Angola | $13,300,000,000 | $1,330 | 5.4% | 110.0%
Antigua and Barbuda | $674,000,000 | $10,000 | 3.5% | 0.4%
… | … | … | … | …

[Figure labels: Raw table, Normalized table]
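Expanding a cell into a value plus type information might look like the following Python sketch. It is an assumption-laden illustration rather than TANGO's normalizer: the type names, the `multiplier` parameter (for tables whose header says something like "billions") and the function name are all hypothetical.

```python
import re


def normalize_cell(text, multiplier=1):
    """Return (value, type) for one table cell.

    multiplier is a hypothetical way to pass in scale information that a
    table states once in a header or caption instead of in every cell.
    """
    text = text.strip()
    if text in {"", "?"}:
        return (None, "unknown")
    match = re.fullmatch(r"\$(\d[\d,]*(?:\.\d+)?)", text)
    if match:
        return (float(match.group(1).replace(",", "")) * multiplier, "dollar amount")
    match = re.fullmatch(r"(-?\d+(?:\.\d+)?)%", text)
    if match:
        return (float(match.group(1)), "percentage")
    if re.fullmatch(r"-?\d[\d,]*(?:\.\d+)?", text):
        return (float(text.replace(",", "")) * multiplier, "number")
    return (text, "string")
```

For example, a raw cell "$21" in a column headed "billions" and the normalized cell "$21,000,000,000" both map to the same typed value, which is the point of normalizing before comparing tables.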

Normalizing across hyperlinks

Normalized table?

? | Population | Population Growth rate | Population Density | Birth Rate | Death Rate | Migration Rate | Life Expectancy Male | Life Expectancy Female | Infant Mortality
Afghanistan | 25,824,882 | 3.95% | 39.88 persons/km2 | 4.19% | 1.70% | 1.46% | 47.82 years | 46.82 years | 14.06%
Albania | 3,364,571 | 1.05% | 122.79 persons/km2 | 2.07% | 0.74% | -0.29% | 65.92 years | 72.33 years | 4.29%
Algeria | 31,133,486 | 2.10% | 13.07 persons/km2 | 2.70% | 0.55% | -0.05% | 68.07 years | 70.46 years | 4.38%
American Samoa | 63,786 | 2.64% | 320.53 persons/km2 | 2.65% | 0.40% | 0.39% | 71.23 years | 79.95 years | 1.02%
Andorra | 65,939 | 2.24% | 146.53 persons/km2 | 1.03% | 0.55% | 1.76% | 80.55 years | 86.55 years | 0.41%
Angola | 11,510 | 2.84% | 8.97 persons/km2 | 4.31% | 1.64% | 0.16% | 46.08 years | 50.82 years | 12.92%
… | … | … | … | … | … | … | … | … | …
Western Sahara | 239,333 | 2.34% | 0.90 persons/km2 | 4.54% | 1.66% | -0.54% | 47.98 years | 50.57 years | 13.67%
World | 5,995,544,836 | 1.30% | 14.42 persons/km2 | 2.20% | 0.90% | ? | 61.00 years | 65.00 years | 5.60%
Yemen | 16,942,230 | 3.34% | 32.09 persons/km2 | 4.33% | 0.99% | 0.00% | 58.17 years | 61.88 years | 6.98%
Zambia | 9,663,535 | 2.12% | 13.05 persons/km2 | 4.45% | 2.26% | 0.08% | 36.72 years | 37.21 years | 9.19%
Zimbabwe | 11,163,160 | 1.02% | 28.87 persons/km2 | 3.06% | 2.04% | ? | 38.77 years | 38.94 years | 6.12%

How to understand tables

Captions – in vicinity of table (above, below etc)

Footnotes – on annotated column labels or data cells

Embedded information – in rows, columns or cells {e.g., $, %, (1,000), billions, etc}

Links to other views of the table, possibly with new information

Use of normalized data
Take a table as an input and produce standard records in the form of attribute-value pairs as output
Discover constraints among columns
Understand the data values

Country | GDP/PPP | GDP/PPP Per Capita | Real-Growth Rate | Inflation
Afghanistan | $21,000,000,000 | $800 | ? | ?
Albania | $13,200,000,000 | $3,800 | 7.3% | 3.0%
Algeria | $177,000,000,000 | $5,600 | 3.8% | 3.0%
Andorra | $1,300,000,000 | $19,000 | 3.8% | 4.3%
Angola | $13,300,000,000 | $1,330 | 5.4% | 110.0%
Antigua and Barbuda | $674,000,000 | $10,000 | 3.5% | 0.4%
… | … | … | … | …

Callouts on the table:
  Left-most, primary key
  Country names (from data frame)
  Dollar amount (from data frame)
  Percentage (from data frame)

{has(Country, GDP/PPP), has(Country, GDP/PPP Per Capita), has(Country, Real-growth rate*), has(Country, Inflation*)}

{<Country: Afghanistan>, <GDP/PPP: $21,000,000,000>, <GDP/PPP per capita: $800>, <Real-growth rate: ?>, <Inflation: ?>}
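Both outputs on this slide, the attribute-value records and the has(...) relations, can be derived from a normalized table with very little code. The Python sketch below assumes, as the callout says, that the left-most column is the primary key. Reading the asterisk as "value may be missing" is an interpretation suggested by the ? cells, not something the slide states, and the function names are hypothetical.

```python
HEADER = ["Country", "GDP/PPP", "GDP/PPP Per Capita", "Real-Growth Rate", "Inflation"]
ROWS = [
    ["Afghanistan", "$21,000,000,000", "$800", "?", "?"],
    ["Albania", "$13,200,000,000", "$3,800", "7.3%", "3.0%"],
    ["Algeria", "$177,000,000,000", "$5,600", "3.8%", "3.0%"],
]


def to_records(header, rows):
    """One attribute-value record (a dict) per table row."""
    return [dict(zip(header, row)) for row in rows]


def has_relations(header, rows):
    """Relate the key column to every other column.

    A trailing '*' marks columns with at least one unknown ('?') cell.
    """
    key = header[0]  # assumption: the left-most column is the primary key
    relations = []
    for index, column in enumerate(header[1:], start=1):
        optional = any(row[index] == "?" for row in rows)
        relations.append(f"has({key}, {column}{'*' if optional else ''})")
    return relations
```

On the three sample rows, `has_relations(HEADER, ROWS)` marks Real-Growth Rate and Inflation with an asterisk because Afghanistan's cells are unknown.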

Ontology generation overview

Concepts of Interest

Concepts with Relations

Data extraction ontology

Sample Documents

Example: Creating a domain ontology
[Figure: domain ontology diagram. Object sets: Geopolitical Entity, Name, Country, City, Location, Longitude, Latitude, Time. Relationship labels: has names; Latitude and longitude designates location; Has GMT. Callouts: Has associated data frames; Includes procedural knowledge; Distances; Duration between Time zones.]

Example: Table understanding to mini-ontology generation

Agglomeration | Population | Continent | Country
Tokyo | 31,139,900 | Asia | Japan
New York-Philadelphia | 30,286,900 | The Americas | United States of America
Mexico | 21,233,900 | The Americas | Mexico
Seoul | 19,969,100 | Asia | Korea (South)
Sao Paulo | 18,847,400 | The Americas | Brazil
Jakarta | 17,891,000 | Asia | Indonesia
Osaka-Kobe-Kyoto | 17,621,500 | Asia | Japan
… | … | … | …
Niigata | 503,500 | Asia | Japan
Raurkela | 503,300 | Asia | India
Homjel | 502,200 | Europe | Belarus
Zunyi | 501,900 | Asia | China
Santiago | 501,800 | The Americas | Dominican Republic
Pingdingshan | 501,500 | Asia | China
Fargona | 501,000 | Asia | Uzbekistan
Kirov | 500,200 | Europe | Russia
Newcastle | 500,000 | Australia/Oceania | Australia

[Figure: resulting mini-ontology with object sets Agglomeration, Population, Country, Continent]
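One source of evidence for the relationships in a mini-ontology is which columns functionally determine which others in the table's data. The Python sketch below checks every ordered pair of columns on a few rows taken from the table above. It is a simplified illustration with a hypothetical function name, and a dependency that holds on sample data is only evidence, not proof: on these rows Population also appears to determine Agglomeration, purely because no two populations coincide.

```python
from itertools import permutations

HEADER = ["Agglomeration", "Population", "Continent", "Country"]
ROWS = [
    ["Tokyo", "31,139,900", "Asia", "Japan"],
    ["Osaka-Kobe-Kyoto", "17,621,500", "Asia", "Japan"],
    ["Sao Paulo", "18,847,400", "The Americas", "Brazil"],
    ["Homjel", "502,200", "Europe", "Belarus"],
    ["Kirov", "500,200", "Europe", "Russia"],
]


def functional_dependencies(header, rows):
    """Return (A, B) pairs where each value of column A maps to one value of B."""
    dependencies = []
    for a, b in permutations(range(len(header)), 2):
        seen = {}
        holds = True
        for row in rows:
            # setdefault stores the first B value seen for this A value.
            if seen.setdefault(row[a], row[b]) != row[b]:
                holds = False
                break
        if holds:
            dependencies.append((header[a], header[b]))
    return dependencies
```

Here Country determines Continent but not the other way round, which supports modeling Continent as an attribute reached through Country.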

Example: Concept matching to ontology merging
[Figure: the mini-ontology (Agglomeration, Population, Country, Continent) is merged with the domain ontology (Geopolitical Entity, Name, Country, City, Location, Longitude, Latitude, Time; "has names", "Has GMT", "Latitude and longitude designates location"). Labels: Merge, Results. The result adds Continent, Population and Agglomeration alongside Country and City.]

Ontology merging/growing

Direct merge (no conflicts)
  Use results of matching phase to find similar concepts in ontologies (e.g., data value similarities, data frames, NLP, etc.)
Conflict resolution
  Interactively identify evidence and counter-evidence of functional relationships among mini-ontologies using constraint resolution
IDS
  Interaction with human knowledge engineer
  Issues: identify
  Default strategy: apply
  Suggestions: make
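
The matching phase can be illustrated with a minimal sketch that uses only one of the listed signals, data value similarity. It assumes each concept comes with a handful of sample values and scores pairs by Jaccard overlap; the function names and the 0.5 threshold are hypothetical choices for illustration, not part of TANGO itself.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of values (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_concepts(mini, domain, threshold=0.5):
    """Pair each mini-ontology concept with its most value-similar domain concept.

    Both arguments map a concept name to sample data values. Pairs scoring
    below the (arbitrary) threshold stay unmatched, which is where a human
    knowledge engineer would be consulted.
    """
    matches = {}
    for name, values in mini.items():
        best = max(domain, key=lambda d: jaccard(values, domain[d]), default=None)
        if best is not None and jaccard(values, domain[best]) >= threshold:
            matches[name] = best
    return matches


# Sample values taken from the agglomeration table; the domain side is assumed.
mini = {
    "Agglomeration": ["Tokyo", "Seoul", "Jakarta"],
    "Country": ["Japan", "Korea (South)", "Indonesia"],
}
domain = {
    "City": ["Tokyo", "Seoul", "Jakarta", "Niigata"],
    "Country": ["Japan", "Indonesia", "Korea (South)", "China"],
}
matches = match_concepts(mini, domain)
```

Here Agglomeration is paired with City because their values overlap, even though the names differ. A fuller matcher would combine this signal with data frames and NLP evidence, as the slide notes.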


Example: Another mini-ontology generation

[Mini-ontology diagram. Object sets: Place, Place Name, Longitude, Latitude, Elevation, USGS Quad, Area, Country, State, Mine, Reservoir, Lake, City/town.]


Example: Another mini-ontology generation

[Diagram: the Place mini-ontology (Place, Place Name, Longitude, Latitude, Elevation, USGS Quad, Area, Country, State, Mine, Reservoir, Lake, City/town) is merged with the growing ontology (Geopolitical Entity, Name, Location, Longitude, Latitude, Time, Population, Continent, City, Agglomeration, Country; relationship labels "has names", "has GMT", "latitude and longitude designate location").]


Example: Concept Mapping to Ontology Merging

[Merged ontology diagram. Object sets: Place, Elevation, USGS Quad, Area, Mine, Reservoir, Lake, City/town, Country, State, Location, Longitude, Latitude, Name, Geopolitical Entity, Geopolitical Entity with population, Population, Agglomeration, Continent, Time. Relationship labels: "has names", "has GMT", "latitude and longitude designate location".]


Recognize Table Information

                               Religion
Country      Population        Albanian  Muslim  Roman     Shi’a   Sunni   other
             (July 2001 est.)  Orthodox          Catholic  Muslim  Muslim
Afghanistan  26,813,057                                    15%     84%     1%
Albania      3,510,484         20%       70%     30%


Construct Mini-Ontology

                               Religion
Country      Population        Albanian  Muslim  Roman     Shi’a   Sunni   other
             (July 2001 est.)  Orthodox          Catholic  Muslim  Muslim
Afghanistan  26,813,057                                    15%     84%     1%
Albania      3,510,484         20%       70%     30%


Discover Mappings


Merge


Review: the TANGO process

Start out with normalized table
Generate likely candidates for:
  Object Sets
  Relationship Sets
  Functional Constraints
  Inclusion Constraints/Hierarchical Structure
Get help from user when needed
Choose best candidate for the ontology


Generate concepts

Create list of candidate concepts (usually column names)


Example 1: Generate Concepts

Determine lexicalization (columns with associated values are lexical)
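
A minimal sketch of these two steps, assuming the normalized table arrives as a list of row dictionaries: column names become candidate concepts, and a concept is flagged lexical when its column carries at least one value. The function name and the empty `Notes` column are hypothetical, added only to show a non-lexical case.

```python
def generate_concepts(table):
    """Return {column name: is_lexical} for a table given as a list of row dicts."""
    # Collect column names in first-seen order; each one is a candidate concept.
    columns = []
    for row in table:
        for name in row:
            if name not in columns:
                columns.append(name)

    # A column with at least one associated value is treated as lexical.
    concepts = {}
    for name in columns:
        concepts[name] = any(row.get(name) not in (None, "") for row in table)
    return concepts


# Two rows from the agglomeration table, plus a hypothetical empty column.
table = [
    {"Agglomeration": "Tokyo", "Population": "31,139,900",
     "Continent": "Asia", "Country": "Japan", "Notes": ""},
    {"Agglomeration": "Seoul", "Population": "19,969,100",
     "Continent": "Asia", "Country": "Korea (South)", "Notes": ""},
]
concepts = generate_concepts(table)
```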


Example 1: Generate Concepts

Current ontology


Example 1: Generate Relationships

Decide relationship sets
  Exponential number of combinations
  Basic assumption: one main concept relates to all others (attributes)
  Goal: find central column of interest


Example 1: Generate Relationships

Look for mapping between one column and title of table
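
One simple way to look for such a mapping is token overlap between the table title and the column names. The sketch below assumes that approach; the plural stripping is a crude stand-in for real stemming, and both the function names and the example title are hypothetical.

```python
import re


def tokens(text):
    """Lowercase word tokens with a crude plural strip (not real stemming)."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words}


def central_column(title, columns):
    """Return the column sharing the most tokens with the title, or None."""
    if not columns:
        return None
    title_tokens = tokens(title)
    scored = [(len(tokens(c) & title_tokens), c) for c in columns]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score > 0 else None


columns = ["Agglomeration", "Population", "Continent", "Country"]
# Hypothetical title; the slides do not give the table's actual title.
center = central_column("Largest agglomerations of the world", columns)
```

When no column overlaps with the title the function returns `None`, which corresponds to the point in the process where the user would be asked for help.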


Example 1: Generate Relationships

Current ontology


Example 1: Generate Constraints

FDs and Participation Constraints
  FD definition: X → Y iff (X[i] = X[j]) → (Y[i] = Y[j]) for all row indexes i and j.
  Unless the case is solid (two or more identical values), consider only FDs from the central object to its attributes
  Use heuristics for setting exact participation (0:1, 1:*, etc.)
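
The FD definition translates almost directly into code. The following sketch checks whether one column functionally determines another over a list of row dictionaries; the function name is hypothetical, and the rows are a small sample from the agglomeration table.

```python
def holds_fd(rows, x, y):
    """True if x → y holds: rows that agree on column x also agree on column y."""
    seen = {}
    for row in rows:
        if row[x] in seen and seen[row[x]] != row[y]:
            return False  # same X value, different Y value: the FD is violated
        seen.setdefault(row[x], row[y])
    return True


rows = [
    {"Agglomeration": "Tokyo", "Country": "Japan", "Continent": "Asia"},
    {"Agglomeration": "Osaka-Kobe-Kyoto", "Country": "Japan", "Continent": "Asia"},
    {"Agglomeration": "Seoul", "Country": "Korea (South)", "Continent": "Asia"},
    {"Agglomeration": "Sao Paulo", "Country": "Brazil", "Continent": "The Americas"},
]

agglomeration_to_country = holds_fd(rows, "Agglomeration", "Country")
country_to_agglomeration = holds_fd(rows, "Country", "Agglomeration")
```

Note that Agglomeration → Country holds trivially here because no agglomeration repeats, which is why the slide asks for repeated values before trusting an FD that does not start at the central object.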


Example 1: Generate Concepts

Numerical values are usually functionally determined by column of interest and have 0:* participation constraint.


Example 1: Generate Constraints

Completed mini-ontology


Example 2: Generate Concepts

SubFamily, Group, and SubGroup are generic types

Enumerate column values as object sets because there are fewer than 5 divisions (recursively)
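
The recursive enumeration might be sketched as follows. The limit of 5 comes from the slide; the function name and the language rows are hypothetical, since the slides do not reproduce the example table's data.

```python
def enumerate_levels(rows, columns, limit=5):
    """Nest distinct column values as object sets, level by level.

    A level is enumerated only when it has fewer than `limit` divisions;
    otherwise the recursion stops and the column stays a plain attribute.
    """
    if not columns:
        return {}
    head, rest = columns[0], columns[1:]
    values = sorted({row[head] for row in rows})
    if len(values) >= limit:
        return {}
    return {
        value: enumerate_levels([r for r in rows if r[head] == value], rest, limit)
        for value in values
    }


# Hypothetical rows for illustration only.
rows = [
    {"Language": "English", "SubFamily": "Germanic", "Group": "West"},
    {"Language": "German", "SubFamily": "Germanic", "Group": "West"},
    {"Language": "Swedish", "SubFamily": "Germanic", "Group": "North"},
    {"Language": "Spanish", "SubFamily": "Italic", "Group": "Romance"},
]
hierarchy = enumerate_levels(rows, ["SubFamily", "Group"])
```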


Example 2: Generate Relationships

Found mapping of central column of interest to title (Language)
Exceptions to basic assumption
  Hierarchy (enumerated object sets)
  Transitive FDs (X → Y, Y → Z, remove X → Z)
Create ISA hierarchy from table structure
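
The transitive FD rule can be sketched as a one-pass reduction over a set of (determinant, dependent) pairs. This is an illustrative simplification with a hypothetical function name; it assumes the FDs contain no cycles, since mutually dependent columns would need extra care.

```python
def remove_transitive(fds):
    """Drop every FD X → Z for which X → Y and Y → Z are also in the set."""
    fds = set(fds)
    implied = {
        (x, z)
        for (x, y) in fds
        for (y2, z) in fds
        if y == y2 and len({x, y, z}) == 3 and (x, z) in fds
    }
    # Implied pairs are computed against the original set, so the result
    # does not depend on iteration order.
    return fds - implied


reduced = remove_transitive({("X", "Y"), ("Y", "Z"), ("X", "Z")})
```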


Example 2: Generate Relationships

Current ontology


Example 2: Generate Hierarchical Constraints

Assign members to each object set for easy calculation
Find inclusion dependencies:
  Union: all members of the parent are members of one or more children
  Intersection (less common): child members are always in both parents
  Mutual exclusion: the intersection of any two children's member sets is empty
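
Once members are assigned to each object set, the three checks reduce to ordinary set operations. The sketch below assumes object sets are represented as Python sets of member names; the function names and sample members are hypothetical.

```python
from itertools import combinations


def is_union(parent, children):
    """Every member of the parent belongs to at least one child."""
    return set(parent) <= set().union(*children)


def is_intersection(child, parents):
    """Every member of the child belongs to all of its parents."""
    return all(set(child) <= set(p) for p in parents)


def is_mutually_exclusive(children):
    """No two children share a member."""
    return all(not (set(a) & set(b)) for a, b in combinations(children, 2))


# Hypothetical members for illustration only.
germanic = {"English", "German", "Swedish"}
west, north = {"English", "German"}, {"Swedish"}

union_holds = is_union(germanic, [west, north])
exclusive = is_mutually_exclusive([west, north])
```

When union and mutual exclusion both hold, the children form a partition of the parent.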


Example 2: Generate Hierarchical Constraints

Completed mini-ontology


Future direction

Start with multiple tables (or URLs) and generate mini-ontologies
Identify most suitable mini-ontologies to merge by calculating which tables have most overlap of concepts
Generate multiple domain ontologies
Integrate with form-based data extraction tools (smarter Web search engines)
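
The overlap calculation could be sketched as below, assuming the simplest possible measure: the number of concept names two mini-ontologies share. The function name and the dictionary keys are hypothetical labels; the concept names come from the example tables in the slides.

```python
from itertools import combinations


def best_merge_pair(ontologies):
    """Return the pair of mini-ontology names with the most shared concept names."""
    pairs = combinations(sorted(ontologies), 2)

    def overlap(pair):
        first, second = pair
        return len(set(ontologies[first]) & set(ontologies[second]))

    return max(pairs, key=overlap, default=None)


ontologies = {
    "agglomerations": {"Agglomeration", "Population", "Country", "Continent"},
    "places": {"Place", "Place Name", "Longitude", "Latitude", "Elevation",
               "USGS Quad", "Area", "Country", "State"},
    "religions": {"Country", "Population", "Religion"},
}
pair = best_merge_pair(ontologies)
```

Matching on names alone misses pairs such as Agglomeration and City, so a practical version would likely reuse the value-based and data-frame evidence from the matching phase.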