Unified Models of Information Extraction and Data Mining with Application to Social Network Analysis. Andrew McCallum, Information Extraction and Synthesis Laboratory, Computer Science Department, University of Massachusetts Amherst. Joint work with David Jensen. Knowledge Discovery and Dissemination (KDD) Conference, September 2004. Intelligence Technology Innovation Center (ITIC).


Page 1:

Unified Models of Information Extraction and Data Mining

with Application to Social Network Analysis

Andrew McCallum

Information Extraction and Synthesis Laboratory

Computer Science Department

University of Massachusetts Amherst

Joint work with David Jensen

Knowledge Discovery and Dissemination (KDD) Conference

September 2004

Intelligence Technology Innovation Center (ITIC)

Page 2:

Goal:

Improve the state-of-the-art in our ability to mine actionable knowledge from unstructured text.

Page 3:

Extracting Job Openings from the Web

foodscience.com-Job2
Employer: foodscience.com
JobTitle: Ice Cream Guru
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1

Page 4:

Data Mining the Extracted Job Information

Page 5:

IE from Chinese Documents regarding Weather

Department of Terrestrial System, Chinese Academy of Sciences

200k+ documents several millennia old

- Qing Dynasty Archives
- memos
- newspaper articles
- diaries

Page 6:

Traditional Pipeline

[Diagram: pipeline components]
- Spider, Filter
- Document collection
- IE: Segment, Classify, Associate, Cluster
- Database
- Knowledge Discovery: discover patterns (entity types, links / relations, events)
- Actionable knowledge: prediction, outlier detection, decision support

Page 7:

Problem:

Combined in serial juxtaposition, IE and KD are unaware of each other’s weaknesses and opportunities.

1) KD begins from a populated DB, unaware of where the data came from, or its inherent uncertainties.

2) IE is unaware of emerging patterns and regularities in the DB.

The accuracy of both suffers, and significant mining of complex text sources is beyond reach.

[Diagram: Document collection; IE (Segment, Classify, Associate, Cluster); Database; Knowledge Discovery (discover patterns: entity types, links / relations, events); Actionable knowledge]

Page 8:

Solution:

[Diagram: Spider; Filter; Document collection; IE (Segment, Classify, Associate, Cluster); Database; Data Mining (discover patterns: entity types, links / relations, events); Actionable knowledge (prediction, outlier detection, decision support); with “Uncertainty Info” and “Emerging Patterns” passed between IE and Data Mining]

Page 9:

Research & Approach: Unified Model

[Diagram: Spider; Filter; Document collection; IE (Segment, Classify, Associate, Cluster); Probabilistic Model; Data Mining (discover patterns: entity types, links / relations, events); Actionable knowledge (prediction, outlier detection, decision support)]

Conditional Random Fields [Lafferty, McCallum, Pereira]

Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…], …

Conditionally-trained undirected graphical models

Complex Inference and Learning: Just what we researchers like to sink our teeth into!

Page 10:

Accomplishments, Discoveries & Results:

• Extracting answers, and also uncertainty/confidence.
  – Formally justified as marginalization in graphical models
  – Applications to new word discovery in Chinese word segmentation, and correction propagation in interactive IE

• Joint inference, with efficient methods
  – Multiple, cascaded label sequences (Factorial CRFs)
  – Multiple distant, but related mentions (Skip-chain CRFs)
  – Multiple co-reference decisions (Affinity Matrix CRF)
  – Integrating extraction with co-reference (Graphs & chains)

• Put it into a large-scale, working system
  – Social network analysis from Email and the Web
  – A new portal: research, people, connections.


Page 12:

Types of Uncertainty in Knowledge Discovery from Text

• Confidence that the extractor correctly obtained statements the author intended.

• Confidence that what was written is truthful
  – Author could have had misconceptions.
  – …or have been purposefully trying to mislead.

• Confidence that the emerging, discovered pattern is a reliable fact or generalization.

Page 13:

1. Labeling Sequence Data: Linear-chain CRFs [Lafferty, McCallum, Pereira 2001]

Undirected graphical model, trained to maximize conditional probability of outputs given inputs.

[Figure: finite state model and graphical model; FSM states y_{t-1}, y_t, y_{t+1}, y_{t+2}, y_{t+3}, … above observations x_{t-1}, x_t, x_{t+1}, x_{t+2}, x_{t+3}, …]

input seq: said Arden Bement NSF Director …

output seq: OTHER PERSON PERSON ORG TITLE …

$$p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \Phi_y(y_t, y_{t-1}) \, \Phi_{xy}(x_t, y_t)$$

where

$$\Phi(\cdot) = \exp\Big( \sum_k \lambda_k f_k(\cdot) \Big)$$

• Asian word segmentation [COLING’04], [ACL’04]
• IE from Research papers [HLT’04]
• Object classification in images [CVPR ’04]
• Segmenting tables in textual gov’t reports, 85% reduction in error over HMMs.
• Noun phrase, Named entity [HLT’03], [CoNLL’03]
• Protein structure prediction [ICML’04]
• IE from Bioinformatics text [Bioinformatics ’04], …
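To make the linear-chain formula concrete, here is a minimal sketch in Python. The label set comes from the slide's example; the feature weights in `transition_score` and `emission_score` are invented for illustration, and the first position is assumed to have no transition factor. It computes Z(x) with the forward recursion and checks it against brute-force enumeration.

```python
import itertools
import math

LABELS = ["OTHER", "PERSON", "ORG", "TITLE"]

def transition_score(prev, cur):
    # Hypothetical log-potential for Phi_y: staying in the same label is mildly preferred.
    return 0.5 if prev == cur else 0.0

def emission_score(word, label):
    # Hypothetical log-potential for Phi_xy: capitalization plus a tiny lexicon.
    score = 0.0
    if word[0].isupper() and label != "OTHER":
        score += 1.0
    if word[0].islower() and label == "OTHER":
        score += 1.0
    if word == "Director" and label == "TITLE":
        score += 2.0
    if word == "NSF" and label == "ORG":
        score += 2.0
    return score

def unnormalized(words, labels):
    """Product over t of Phi_y(y_t, y_{t-1}) * Phi_xy(x_t, y_t), computed in log space."""
    total = 0.0
    for t, (word, label) in enumerate(zip(words, labels)):
        total += emission_score(word, label)
        if t > 0:  # assumption: no transition factor at the first position
            total += transition_score(labels[t - 1], label)
    return math.exp(total)

def partition(words):
    """Z(x) by the forward recursion, O(T * |LABELS|^2)."""
    alpha = {y: math.exp(emission_score(words[0], y)) for y in LABELS}
    for word in words[1:]:
        alpha = {
            y: sum(alpha[p] * math.exp(transition_score(p, y)) for p in LABELS)
            * math.exp(emission_score(word, y))
            for y in LABELS
        }
    return sum(alpha.values())

def brute_force_partition(words):
    """Z(x) by enumerating every label sequence; only feasible for tiny inputs."""
    return sum(unnormalized(words, ys)
               for ys in itertools.product(LABELS, repeat=len(words)))

def prob(words, labels):
    return unnormalized(words, labels) / partition(words)

words = ["said", "Arden", "Bement", "NSF", "Director"]
p = prob(words, ["OTHER", "PERSON", "PERSON", "ORG", "TITLE"])
```

The forward recursion is what makes exact inference cheap on a chain; the enumeration is included only as a sanity check.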

Page 14:

Confidence Estimation in Linear-chain CRFs [Culotta, McCallum 2004]

Finite State Lattice

[Figure: lattice of FSM states (OTHER, TITLE, ORG, PERSON) at positions y_{t-1}, y_t, y_{t+1}, y_{t+2}, y_{t+3}, … above observations x_{t-1}, x_t, x_{t+1}, x_{t+2}, x_{t+3}, …]

input sequence: said Arden Bement NSF Director …

$$p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \Phi_y(y_t, y_{t-1}) \, \Phi_{xy}(x_t, y_t)$$

Page 15:

Confidence Estimation in Linear-chain CRFs [Culotta, McCallum 2004]

Constrained Forward-Backward

[Figure: lattice of FSM states (OTHER, TITLE, ORG, PERSON) at positions y_{t-1}, y_t, y_{t+1}, y_{t+2}, y_{t+3}, … above observations x_{t-1}, x_t, x_{t+1}, x_{t+2}, x_{t+3}, …]

input sequence: said Arden Bement NSF Director …

$$p(\text{Arden Bement} = \text{PERSON} \mid x) = \frac{1}{Z(x)} \sum_{y \in C} \prod_{t=1}^{T} \Phi_y(y_t, y_{t-1}) \, \Phi_{xy}(x_t, y_t)$$

where C is the set of label sequences in which “Arden Bement” is labeled PERSON.
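A minimal sketch of the constrained forward pass in Python may help here. It assumes the constraint simply pins given positions to given labels; the two-label model and its weights (`emit`, `trans`) are hypothetical. The confidence of a field is the constrained path mass divided by the unconstrained Z(x).

```python
import math

def forward_sum(words, labels, emit, trans, constraint=None):
    """Sum of unnormalized path scores; `constraint` maps position -> required label."""
    constraint = constraint or {}

    def allowed(t, y):
        return constraint.get(t, y) == y

    alpha = {y: math.exp(emit(words[0], y)) if allowed(0, y) else 0.0 for y in labels}
    for t in range(1, len(words)):
        alpha = {
            y: (sum(alpha[p] * math.exp(trans(p, y)) for p in labels)
                * math.exp(emit(words[t], y))) if allowed(t, y) else 0.0
            for y in labels
        }
    return sum(alpha.values())

def field_confidence(words, labels, emit, trans, constraint):
    """Constrained mass over total mass, i.e. p(constraint | x)."""
    return (forward_sum(words, labels, emit, trans, constraint)
            / forward_sum(words, labels, emit, trans))

LABELS = ["OTHER", "PERSON"]

def emit(word, label):
    # Hypothetical feature: capitalized words lean PERSON, lowercase words lean OTHER.
    return 1.0 if word[0].isupper() == (label == "PERSON") else 0.0

def trans(prev, cur):
    # Hypothetical feature: adjacent labels prefer to agree.
    return 0.5 if prev == cur else 0.0

words = ["said", "Arden", "Bement"]
conf = field_confidence(words, LABELS, emit, trans, {1: "PERSON", 2: "PERSON"})
```

Because the constrained sum reuses the same recursion with disallowed states zeroed out, the confidence costs no more than one extra forward pass.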

Page 16:

Forward-Backward Confidence Estimation improves accuracy/coverage

[Plot: accuracy/coverage curves labeled “optimal”, “our forward-backward confidence”, “traditional token-wise confidence”, and “no use of confidence”]
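One common way to build an accuracy/coverage curve like this is to rank extracted fields by confidence and measure accuracy over the top-ranked fraction. The sketch below assumes that construction; the five (confidence, correct) pairs are made up for illustration.

```python
def accuracy_coverage(predictions):
    """predictions: list of (confidence, is_correct) pairs.

    Returns (coverage, accuracy) points as the confidence threshold is
    lowered to admit one more prediction at a time.
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    curve, correct = [], 0
    for k, (_, ok) in enumerate(ranked, start=1):
        correct += int(ok)
        curve.append((k / len(ranked), correct / k))
    return curve

# Hypothetical extracted fields: (confidence score, whether the field was correct).
fields = [(0.95, True), (0.90, True), (0.60, False), (0.55, True), (0.20, False)]
curve = accuracy_coverage(fields)
```

A better confidence estimator pushes the errors toward the low-confidence end, so accuracy stays high for longer as coverage grows.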

Page 17:

Confidence Estimation Applied

• New word discovery in Chinese word segmentation [Peng, Feng, McCallum COLING 2004]
  – Improves segmentation accuracy by ~25%

• Highlighting fields for Interactive Information Extraction [Kristjansson, Culotta, Viola, McCallum AAAI 2004] Honorable Mention Award
  – After fixing least confident field, constrained Viterbi automatically reduces error by another 23%.


Page 19:

1. Jointly labeling cascaded sequences: Factorial CRFs [Sutton, Rohanimanesh, McCallum, ICML 2004]

[Figure: stacked label layers, top to bottom]
Named-entity tag
Noun-phrase boundaries
Part-of-speech
English words


Page 21:

1. Jointly labeling cascaded sequences: Factorial CRFs [Sutton, Rohanimanesh, McCallum, ICML 2004]

[Figure: stacked label layers, top to bottom]
Named-entity tag
Noun-phrase boundaries
Part-of-speech
English words

But errors cascade: must be perfect at every stage to do well.

Page 22:

1. Jointly labeling cascaded sequences: Factorial CRFs [Sutton, Rohanimanesh, McCallum, ICML 2004]

[Figure: stacked label layers, top to bottom]
Named-entity tag
Noun-phrase boundaries
Part-of-speech
English words

Joint prediction of part-of-speech and noun-phrase in newswire, matching accuracy with only 50% of the training data.

Inference: Tree reparameterization BP [Wainwright et al, 2002]

Page 23:

2. Jointly labeling distant mentions: Skip-chain CRFs [Sutton, McCallum, SRL 2004]

Senator Joe Green said today … . Green ran for …

Dependency among similar, distant mentions ignored.

Page 24:

2. Jointly labeling distant mentions: Skip-chain CRFs [Sutton, McCallum, SRL 2004]

Senator Joe Green said today … . Green ran for …

14% reduction in error on most repeated field in email seminar announcements.

Inference: Tree reparameterization BP [Wainwright et al, 2002]
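As an illustration of where skip edges might be added, the sketch below links every pair of identical capitalized tokens. That linking rule is an assumption made for this example, chosen to match the “Green … Green” sentence on the slide; the function name `skip_edges` is hypothetical.

```python
def skip_edges(tokens):
    """Return (i, j) index pairs, i < j, linking repeated capitalized tokens."""
    edges = []
    seen = {}  # token -> positions where it has already occurred
    for j, tok in enumerate(tokens):
        if tok[:1].isupper():
            for i in seen.get(tok, []):
                edges.append((i, j))
            seen.setdefault(tok, []).append(j)
    return edges

tokens = "Senator Joe Green said today . Green ran for".split()
edges = skip_edges(tokens)  # the two "Green" tokens sit at positions 2 and 6
```

Each returned pair would get an extra factor in the model, on top of the usual linear-chain factors between neighbors, so that evidence about one mention can influence the label of the other.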

Page 25:

3. Joint co-reference among all pairs: Affinity Matrix CRF [McCallum, Wellner, IJCAI WS 2003, NIPS 2004]

“Entity resolution”, “Object correspondence”

[Figure: three mentions (“. . . Mr Powell . . .”, “. . . Powell . . .”, “. . . she . . .”) joined pairwise by Y/N co-reference variables, with edge values 45, 99, and 11]

25% reduction in error on co-reference of proper nouns in newswire.

Inference: Correlation clustering graph partitioning [Bansal, Blum, Chawla, 2002]
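The partitioning step can be sketched with a simple greedy heuristic: start with every mention in its own cluster and keep merging the pair of clusters with the largest positive total affinity. This is an illustrative stand-in, not the algorithm from the cited work, and the signs on the edge weights below are assumed for the example.

```python
import itertools

def greedy_partition(nodes, weight):
    """Greedy agglomerative heuristic for partitioning a signed affinity graph.

    `weight` maps frozenset({u, v}) -> affinity (positive: likely co-referent).
    """
    clusters = [{n} for n in nodes]

    def gain(a, b):
        return sum(weight.get(frozenset((u, v)), 0.0) for u in a for v in b)

    while True:
        best = max(itertools.combinations(range(len(clusters)), 2),
                   key=lambda ij: gain(clusters[ij[0]], clusters[ij[1]]),
                   default=None)
        if best is None or gain(clusters[best[0]], clusters[best[1]]) <= 0:
            return clusters
        i, j = best  # i < j, so deleting index j leaves index i valid
        clusters[i] |= clusters[j]
        del clusters[j]

mentions = ["Mr Powell", "Powell", "she"]
# Hypothetical signed affinities, loosely modeled on the slide's figure.
affinity = {
    frozenset(("Mr Powell", "Powell")): 45.0,
    frozenset(("Powell", "she")): -99.0,
    frozenset(("Mr Powell", "she")): 11.0,
}
clusters = greedy_partition(mentions, affinity)
```

With these weights, “she” stays separate even though one of its edges is positive, because the decision is made over the whole cluster rather than pair by pair.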

Page 26:

4. Joint segmentation and co-reference

Joint IE and Coreference from Research Paper Citations

Textual citation mentions (noisy, with duplicates)

Paper database, with fields, clean, duplicates collapsed

AUTHORS               TITLE       VENUE
Cowell, Dawid…        Probab…     Springer
Montemerlo, Thrun…    FastSLAM…   AAAI…
Kjaerulff             Approxi…    Technic…

Page 27:

Laurel, B. Interface Agents: Metaphors with Character , in

The Art of Human-Computer Interface Design , T. Smith (ed) ,

Addison-Wesley , 1990 .

Brenda Laurel . Interface Agents: Metaphors with Character , in

Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .

Citation Segmentation and Coreference

Page 28:

Laurel, B. Interface Agents: Metaphors with Character , in

The Art of Human-Computer Interface Design , T. Smith (ed) ,

Addison-Wesley , 1990 .

Brenda Laurel . Interface Agents: Metaphors with Character , in

Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .

1) Segment citation fields

Citation Segmentation and Coreference

Page 29:

Laurel, B. Interface Agents: Metaphors with Character , in

The Art of Human-Computer Interface Design , T. Smith (ed) ,

Addison-Wesley , 1990 .

Brenda Laurel . Interface Agents: Metaphors with Character , in

Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .

1) Segment citation fields

2) Resolve coreferent citations

Citation Segmentation and Coreference

Y?N

Page 30:

Laurel, B. Interface Agents: Metaphors with Character , in

The Art of Human-Computer Interface Design , T. Smith (ed) ,

Addison-Wesley , 1990 .

Brenda Laurel . Interface Agents: Metaphors with Character , in

Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .

1) Segment citation fields

2) Resolve coreferent citations

3) Form canonical database record

Citation Segmentation and Coreference

AUTHOR = Brenda Laurel
TITLE = Interface Agents: Metaphors with Character
PAGES = 355-366
BOOKTITLE = The Art of Human-Computer Interface Design
EDITOR = T. Smith
PUBLISHER = Addison-Wesley
YEAR = 1990

Y?N

Resolving conflicts

Page 31:

Laurel, B. Interface Agents: Metaphors with Character , in

The Art of Human-Computer Interface Design , T. Smith (ed) ,

Addison-Wesley , 1990 .

Brenda Laurel . Interface Agents: Metaphors with Character , in

Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .

1) Segment citation fields

2) Resolve coreferent citations

3) Form canonical database record

Citation Segmentation and Coreference

AUTHOR = Brenda Laurel
TITLE = Interface Agents: Metaphors with Character
PAGES = 355-366
BOOKTITLE = The Art of Human-Computer Interface Design
EDITOR = T. Smith
PUBLISHER = Addison-Wesley
YEAR = 1990

Y?N

Perform jointly.

Page 32:

IE + Coreference Model

Observed citation (x): J Besag 1986 On the…

CRF Segmentation (s): AUT AUT YR TITL TITL

Page 33:

IE + Coreference Model

Observed citation (x): J Besag 1986 On the…

CRF Segmentation (s)

Citation mention attributes (c):
AUTHOR = “J Besag”
YEAR = “1986”
TITLE = “On the…”

Page 34:

IE + Coreference Model

Structure (x, s, c) for each citation mention:

J Besag 1986 On the…
Smyth . 2001 Data Mining…
Smyth , P Data mining…

Page 35:

IE + Coreference Model

Binary coreference variables for each pair of mentions

Citation mentions (each with x, s, c):

J Besag 1986 On the…
Smyth . 2001 Data Mining…
Smyth , P Data mining…

Page 36:

IE + Coreference Model

Binary coreference variables for each pair of mentions (values shown in the figure: y, n, n)

Citation mentions (each with x, s, c):

J Besag 1986 On the…
Smyth . 2001 Data Mining…
Smyth , P Data mining…

Page 37:

IE + Coreference Model

Coreference variables (values shown in the figure: y, n, n)

Citation mentions (each with x, s, c):

J Besag 1986 On the…
Smyth . 2001 Data Mining…
Smyth , P Data mining…

Research paper entity attribute nodes:
AUTHOR = “P Smyth”
YEAR = “2001”
TITLE = “Data Mining…”
...

Page 38:

Such a highly connected graph makes exact inference intractable, so…

Page 39:

Approximate Inference 1

• Loopy Belief Propagation

[Figure: graph over nodes v1 to v6, with messages m1(v2), m2(v3), m3(v2), m2(v1) passed between nodes]

Page 40:

Approximate Inference 1

• Loopy Belief Propagation

[Figure: graph over nodes v1 to v6, with messages m1(v2), m2(v3), m3(v2), m2(v1) passed between nodes]

• Generalized Belief Propagation

[Figure: graph over nodes v1 to v9, with messages passed between regions]

Here, a message is a conditional probability table passed among nodes. But then message size grows exponentially with the size of the overlap between regions!

Page 41:

Approximate Inference 2

• Iterated Conditional Modes (ICM) [Besag 1986]

$$v_6^{i+1} = \arg\max_{v_6^i} P(v_6^i \mid v \setminus v_6^i)$$

[Figure: graph over nodes v1 to v6; legend: marked nodes = held constant]

Page 42:

Approximate Inference 2

• Iterated Conditional Modes (ICM) [Besag 1986]

$$v_5^{j+1} = \arg\max_{v_5^j} P(v_5^j \mid v \setminus v_5^j)$$

[Figure: graph over nodes v1 to v6; legend: marked nodes = held constant]

Page 43:

Approximate Inference 2

• Iterated Conditional Modes (ICM) [Besag 1986]

$$v_4^{k+1} = \arg\max_{v_4^k} P(v_4^k \mid v \setminus v_4^k)$$

[Figure: graph over nodes v1 to v6; legend: marked nodes = held constant]

Structured inference scales well here, but it is greedy and easily falls into local minima.
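The coordinate-wise update can be sketched on a toy three-node chain. Everything below (the graph, the log-potentials `unary` and `pairwise`, the starting assignment) is hypothetical and chosen so the greedy behavior is visible: ICM stops at an assignment scoring 3.0 even though the all-ones assignment scores 4.0.

```python
DOMAIN = [0, 1]
EDGES = [("v1", "v2"), ("v2", "v3")]
NEIGHBORS = {"v1": ["v2"], "v2": ["v1", "v3"], "v3": ["v2"]}

def unary(node, value):
    # Hypothetical log-potential: only v1 has evidence, favoring value 1.
    return 2.0 if (node == "v1" and value == 1) else 0.0

def pairwise(a, b):
    # Hypothetical log-potential: neighboring nodes prefer to agree.
    return 1.0 if a == b else 0.0

def total_score(values):
    return (sum(unary(n, v) for n, v in values.items())
            + sum(pairwise(values[a], values[b]) for a, b in EDGES))

def icm(values, max_sweeps=20):
    """Set each node in turn to its best value with all other nodes held constant."""
    values = dict(values)
    for _ in range(max_sweeps):
        changed = False
        for node in values:
            def local(candidate):
                return unary(node, candidate) + sum(
                    pairwise(candidate, values[u]) for u in NEIGHBORS[node])
            best = max(DOMAIN, key=local)
            if local(best) > local(values[node]):  # move only on strict improvement
                values[node] = best
                changed = True
        if not changed:
            break
    return values

result = icm({"v1": 0, "v2": 0, "v3": 0})
# result is {"v1": 1, "v2": 0, "v3": 0}: v1 flips, but v2 is tied and never moves.
```

Each update only needs a node's neighbors, which is why the method scales, and the tie at v2 shows how a single-node move can fail to reach the better joint assignment.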

Page 44:

Approximate Inference 2

• Iterated Conditional Modes (ICM) [Besag 1986]

$$v_4^{k+1} = \arg\max_{v_4^k} P(v_4^k \mid v \setminus v_4^k)$$

[Figure: graph over nodes v1 to v6; legend: marked nodes = held constant]

• Iterated Conditional Sampling (ICS) (our name)

Instead of selecting only the argmax, take a sample of argmaxes of $P(v_4^k \mid v \setminus v_4^k)$, e.g. an N-best list (the top N values).

[Figure: graph over nodes v1 to v6]

Can use a “Generalized Version” of this, doing exact inference on a region of several nodes at once.

Here, a “message” grows only linearly with overlap region size and N!
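A small sketch of the N-best idea: rather than passing on one argmax, keep the top N candidates together with renormalized weights. The candidate segmentations and their scores below are invented, and renormalizing over the N-best list is an assumption of this sketch rather than a detail taken from the slide.

```python
import heapq
import math

def n_best(domain, score, n):
    """Top-n values under `score` (a log-potential), with weights renormalized over the list."""
    top = heapq.nlargest(n, domain, key=score)
    z = sum(math.exp(score(v)) for v in top)
    return [(v, math.exp(score(v)) / z) for v in top]

# Hypothetical segmentations of one citation with made-up log-scores.
candidates = {
    "AUT AUT YR TITL TITL": 2.0,
    "AUT YR YR TITL TITL": 1.0,
    "AUT AUT AUT TITL TITL": 0.0,
}
best_two = n_best(candidates, candidates.get, 2)
```

The list passed downstream has N entries regardless of how many joint assignments the region has, which is the linear growth the slide refers to.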

Page 45:

IE + Coreference Model

J Besag 1986 On the…
Smyth . 2001 Data Mining…
Smyth , P Data mining…

Exact inference on these linear-chain regions

From each chain pass an N-best List into coreference

Page 46:

IE + Coreference Model

J Besag 1986 On the…
Smyth . 2001 Data Mining…
Smyth , P Data mining…

Approximate inference by graph partitioning…

…integrating out uncertainty in samples of extraction

Make it scale to 1M citations with Canopies [McCallum, Nigam, Ungar 2000]
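The canopy idea is to use a cheap distance to group items into overlapping subsets, then run the expensive pairwise comparison only within each subset. Below is a minimal sketch with a loose and a tight threshold; the token-level Jaccard distance and the threshold values are assumptions for illustration, and inputs are assumed to be non-empty strings.

```python
def jaccard_distance(a, b):
    """Cheap distance: 1 - |shared tokens| / |all tokens| (case-insensitive)."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(a & b) / len(a | b)

def canopies(items, loose, tight, distance=jaccard_distance):
    """Group item indices into possibly overlapping canopies; requires tight <= loose."""
    remaining = list(range(len(items)))
    result = []
    while remaining:
        center = remaining[0]
        # Everything within the loose threshold of the center joins this canopy.
        canopy = [i for i in range(len(items))
                  if distance(items[center], items[i]) <= loose]
        result.append(canopy)
        # Items within the tight threshold can no longer start a canopy of their own.
        remaining = [i for i in remaining
                     if distance(items[center], items[i]) > tight]
    return result

citations = [
    "J Besag 1986 On the",
    "Smyth . 2001 Data Mining",
    "Smyth , P Data mining",
]
groups = canopies(citations, loose=0.7, tight=0.6)
```

Here the two Smyth citations share three of seven distinct tokens (distance about 0.57), so they land in one canopy and only that pair would need the expensive co-reference comparison.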

Page 47:

IE + Coreference Model

J Besag 1986 On the…
Smyth . 2001 Data Mining…
Smyth , P Data mining…

Coreference variables (values shown in the figure: y, n, n)

Exact (exhaustive) inference over entity attributes

Page 48:

IE + Coreference Model

J Besag 1986 On the…
Smyth . 2001 Data Mining…
Smyth , P Data mining…

Coreference variables (values shown in the figure: y, n, n)

Revisit exact inference on IE linear chain, now conditioned on entity attributes

Page 49:

Parameter Estimation

Separately for different regions:

• IE Linear-chain: Exact MAP
• Coref graph edge weights: MAP on individual edges
• Entity attribute potentials: MAP, pseudo-likelihood

In all cases: Climb MAP gradient with quasi-Newton method

Page 50:

4. Joint segmentation and co-reference [Wellner, McCallum, Peng, Hay, UAI 2004]

Extraction from and matching of research paper citations.

[Figure: graphical model with nodes labeled o, s, c, y, p; legend: Segmentation, Citation attributes, Co-reference decisions, Database field values, World Knowledge]

Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.

Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.

Inference: Variant of Iterated Conditional Modes [Besag, 1986]

• 35% reduction in co-reference error by using segmentation uncertainty.
• 6-14% reduction in segmentation error by using co-reference.


Page 52:

One Application Project:

Workplace effectiveness ~ Ability to leverage network of acquaintances

“The power of your little black book”

But filling Contacts DB by hand is tedious, and incomplete.

[Diagram: Email Inbox and WWW feed the Contacts DB automatically]

Page 53:

System Overview

[Diagram: system components]
- Email
- Person Name Extraction (CRF)
- names
- Name Coreference
- Homepage Retrieval (WWW)
- Contact Info and Person Name Extraction
- Keyword Extraction
- Social Network Analysis

Page 54:

An Example

To: “Andrew McCallum” [email protected]
Subject ...

First Name: Andrew
Middle Name: Kachites
Last Name: McCallum
JobTitle: Associate Professor
Company: University of Massachusetts
Street Address: 140 Governor’s Dr.
City: Amherst
State: MA
Zip: 01003
Company Phone: (413) 545-1323
Links: Fernando Pereira, Sam Roweis,…
Key Words: Information extraction, social network,…

Search for new people

Page 55:

Summary of Results

Example keywords extracted

Person               Keywords
William Cohen        Logic programming, Text categorization, Data integration, Rule learning
Daphne Koller        Bayesian networks, Relational models, Probabilistic models, Hidden variables
Deborah McGuinness   Semantic web, Description logics, Knowledge representation, Ontologies
Tom Mitchell         Machine learning, Cognitive states, Learning apprentice, Artificial intelligence

Contact info and name extraction performance (25 fields)

       Token Acc   Field Prec   Field Recall   Field F1
CRF    94.50       85.73        76.33          80.76
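As a quick consistency check on the extraction results for this slide, field-level F1 is the harmonic mean of field precision and field recall. The short sketch below recomputes it from the reported precision (85.73) and recall (76.33); the function name `f1` is ours.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    return 2 * precision * recall / (precision + recall)

field_f1 = round(f1(85.73, 76.33), 2)  # 80.76, matching the reported Field F1
```

The gap between token accuracy (94.50) and field F1 is expected, since a field counts as correct only if every token in it is labeled correctly.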

1. Expert Finding: When solving some task, find friends-of-friends with relevant expertise. Avoid “stove-piping” in large org’s by automatically suggesting collaborators. Given a task, automatically suggest the right team for the job. (Hiring aid!)

2. Social Network Analysis: Understand the social structure of your organization. Suggest structural changes for improved efficiency.

Page 56:

Main Application Project:

Page 57:

Main Application Project:

[Diagram: entity types and relations] Research Paper, Cites

Page 58:

Main Application Project:

[Diagram: entity types and relations] Research Paper, Cites, Person, University, Venue, Grant, Groups, Expertise

Page 59:

Main Application Project:

Status:
• Spider running. Over 1.5M PDFs in hand.
• Best-in-world published results in IE from research paper headers and references.
• First version of multi-entity co-reference running.
• First version of Web servlet interface up.
• Well-engineered: Java, servlets, SQL, Lucene, SOAP, etc.
• Public launch this Fall.

Page 60:

Software Infrastructure

MALLET: Machine Learning for Language Toolkit

• ~80k lines of Java
• Document classification, information extraction, clustering, co-reference, POS tagging, shallow parsing, relational classification, …
• New package: Graphical models and modern inference methods.
  – Variational, Tree-reparameterization, Stochastic sampling, contrastive divergence, …
• New documentation and interfaces.
• Unlike other toolkits (e.g. Weka) MALLET scales to millions of features, 100k’s training examples, as needed for NLP.

Released as Open Source Software. http://mallet.cs.umass.edu

In use at UMass, MIT, CMU, Stanford, Berkeley, UPenn, UT Austin, Purdue…

Page 61:

Publications and Contact Info

• Conditional Models of Identity Uncertainty with Application to Noun Coreference. Andrew McCallum and Ben Wellner. Neural Information Processing Systems (NIPS), 2004.

• An Integrated, Conditional Model of Information Extraction and Coreference with Application to Citation Matching. Ben Wellner, Andrew McCallum, Fuchun Peng, Michael Hay. Conference on Uncertainty in Artificial Intelligence (UAI), 2004.

• Collective Segmentation and Labeling of Distant Entities in Information Extraction. Charles Sutton and Andrew McCallum. ICML workshop on Statistical Relational Learning, 2004.

• Extracting Social Networks and Contact Information from Email and the Web. Aron Culotta, Ron Bekkerman and Andrew McCallum. Conference on Email and Spam (CEAS), 2004.

• Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data. Charles Sutton, Khashayar Rohanimanesh and Andrew McCallum. ICML, 2004.

• Interactive Information Extraction with Constrained Conditional Random Fields. Trausti Kristjansson, Aron Culotta, Paul Viola and Andrew McCallum. AAAI, 2004. (Winner of Honorable Mention Award.)

• Accurate Information Extraction from Research Papers using Conditional Random Fields. Fuchun Peng and Andrew McCallum. HLT-NAACL, 2004.

• Chinese Segmentation and New Word Detection using Conditional Random Fields. Fuchun Peng, Fangfang Feng, and Andrew McCallum. International Conference on Computational Linguistics (COLING), 2004.

• Confidence Estimation for Information Extraction. Aron Culotta and Andrew McCallum. HLT-NAACL, 2004.

http://www.cs.umass.edu/~mccallum

Page 62:

End of Talk