Statistical Relational Learning for Knowledge Extraction from the Web
Statistical Relational Learning for Knowledge Extraction from the Web

Hoifung Poon
Dept. of Computer Science & Eng.
University of Washington
“Drowning in Information, Starved for Knowledge”
Great Vision: Knowledge Extraction from the Web

Also need: knowledge representation and reasoning
Close the loop: apply knowledge to extraction
Machine reading [Etzioni et al., 2007]
Craven et al., “Learning to Construct Knowledge Bases from the World Wide Web,” Artificial Intelligence, 1999.
Machine Reading: Text → Knowledge
Rapidly Growing Interest
AAAI-07 Spring Symposium on Machine Reading
DARPA Machine Reading Program (2009–2014)
NAACL-10 Workshop on Learning By Reading
Etc.
Great Impact
Scientific inquiry and commercial applications:
Literature-based discovery, robot scientists
Question answering, semantic search
Drug design, medical diagnosis
Breach the knowledge-acquisition bottleneck for AI and natural language understanding
Automatically semantify the Web
Etc.
This Talk
Statistical relational learning offers promising solutions to machine reading
Markov logic is a leading unifying framework
A success story: USP
Unsupervised, end-to-end machine reading
Extracts five times as many correct answers as the state of the art, with the highest accuracy of 91%
USP: Question-Answer Example
Q: What does IL-2 control?
A: The DEX-mediated IkappaBalpha induction
Interestingly, the DEX-mediated IkappaBalpha induction was completely inhibited by IL-2, but not IL-4, in Th1 cells, while the reverse profile was seen in Th2 cells.
Overview
Machine reading: Challenges
Statistical relational learning
Markov logic
USP: Unsupervised Semantic Parsing
Research directions
Key Challenges
Complexity
Uncertainty
Pipeline accumulates errors
Supervision is scarce
Languages Are Structural

govern-ment-s
l-m$px-t-m (Hebrew: “according to their families”)

IL-4 induces CD11B
[Figure: syntactic parse tree: S → NP VP, VP → V NP]

Involvement of p70(S6)-kinase activation in IL-10 up-regulation in human monocytes by gp41 …
[Figure: nested event structure for the sentence above, with Theme, Cause, and Site roles linking involvement, up-regulation, IL-10, human monocyte, p70(S6)-kinase activation, and gp41]

George Walker Bush was the 43rd President of the United States. … Bush was the eldest son of President G. H. W. Bush and Barbara Bush. … In November 1977, he met Laura Welch at a barbecue.
Knowledge Is Heterogeneous
Individuals. E.g.: Socrates is a man
Types. E.g.: Man is mortal
Inference rules. E.g.: syllogism
Ontological relations
Etc.
[Figure: ISA link from HUMAN to MAMMAL; ISPART link from EYE to FACE]
Complexity
Can handle using first-order logic
Trees, graphs, dependencies, hierarchies, etc. easily expressed
Inference algorithms (satisfiability testing, theorem proving, etc.)
But … logic is brittle with uncertainty
Languages Are Ambiguous

I saw the man with the telescope
[Figure: two parse trees, attaching “with the telescope” to either the man or the seeing]

G. W. Bush … Laura Bush … Mrs. Bush …

Here in London, Frances Deek is a retired teacher …
In the Israeli town …, Karen London says …
Now London says …
London: PERSON or LOCATION?

Microsoft buys Powerset
Microsoft acquires Powerset
Powerset is acquired by Microsoft Corporation
The Redmond software giant buys Powerset
Microsoft’s purchase of Powerset, …
Which one?
Knowledge Has Uncertainty
We need to model correlations
Our information is always incomplete
Our predictions are uncertain
Uncertainty
Statistics provides the tools to handle this:
Mixture models
Hidden Markov models
Bayesian networks
Markov random fields
Maximum entropy models
Conditional random fields
Etc.
But … statistical models assume i.i.d. (independently and identically distributed) data: objects → feature vectors
Pipeline is Suboptimal
E.g., the NLP pipeline:
Tokenization → Morphology → Chunking → Syntax → …
Accumulates and propagates errors
Wanted: joint inference
Across all processing stages
Among all interdependent objects
Supervision is Scarce
Tons of text … but most is not annotated
Labeling is expensive (cf. Penn Treebank)
Need to leverage indirect supervision
Redundancy
Key source of indirect supervision
State-of-the-art systems depend on this, e.g., TextRunner [Banko et al., 2007]
But … the Web is heterogeneous: long tail
Redundancy is only present in the head regime
Overview
Machine reading: Challenges
Statistical relational learning
Markov logic
USP: Unsupervised Semantic Parsing
Research directions
Statistical Relational Learning
Burgeoning field in machine learning
Offers promising solutions for machine reading:
Unify statistical and logical approaches
Replace pipeline with joint inference
Principled framework to leverage both direct and indirect supervision
Machine Reading: A Vision

Challenge: Long tail
Challenges in Applying Statistical Relational Learning
Learning is much harder
Inference becomes a crucial issue
Greater complexity for the user
Progress to Date

Probabilistic logic [Nilsson, 1986]
Statistics and beliefs [Halpern, 1990]
Knowledge-based model construction [Wellman et al., 1992]
Stochastic logic programs [Muggleton, 1996]
Probabilistic relational models [Friedman et al., 1999]
Relational Markov networks [Taskar et al., 2002]
Markov logic [Domingos & Lowd, 2009]: the leading unifying framework
Etc.
Overview

Machine reading
Statistical relational learning
Markov logic
USP: Unsupervised Semantic Parsing
Research directions
Markov Networks

Undirected graphical models
[Figure: network over the variables Smoking, Cancer, Asthma, Cough]

Log-linear model:
P(x) = (1/Z) exp( Σ_i w_i f_i(x) )
where w_i is the weight of feature i and f_i is feature i, e.g.:
f_1(Smoking, Cancer) = 1 if ¬Smoking ∨ Cancer, 0 otherwise
w_1 = 1.5
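To make the log-linear form concrete, here is a minimal Python sketch (an illustration of mine, not code from the talk) that scores every world of the single Smoking/Cancer feature above by brute force:

```python
import itertools, math

def f1(smoking: bool, cancer: bool) -> float:
    # Feature fires when the clause Smoking => Cancer holds in the world.
    return 1.0 if (not smoking) or cancer else 0.0

w1 = 1.5
worlds = list(itertools.product([False, True], repeat=2))
unnorm = {x: math.exp(w1 * f1(*x)) for x in worlds}  # exp(w1 * f1(x))
Z = sum(unnorm.values())                             # partition function
P = {x: u / Z for x, u in unnorm.items()}
# The one world violating the clause (smokes, no cancer) gets the least mass:
print(P[(True, False)])
```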
First-Order Logic
Constants, variables, functions, predicates. E.g.: Anna, x, MotherOf(x), Friends(x,y)
Grounding: replace all variables by constants. E.g.: Friends(Anna, Bob)
World (model, interpretation): assignment of truth values to all ground predicates
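A tiny illustrative sketch of grounding (the string encoding of atoms is my assumption, not a standard representation):

```python
from itertools import product

constants = ["Anna", "Bob"]

def ground(predicate: str, arity: int) -> list:
    # Substitute every combination of constants for the variables.
    return [f"{predicate}({','.join(args)})"
            for args in product(constants, repeat=arity)]

print(ground("Friends", 2))
# ['Friends(Anna,Anna)', 'Friends(Anna,Bob)', 'Friends(Bob,Anna)', 'Friends(Bob,Bob)']
```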
Markov Logic

Intuition: soften logical constraints
Syntax: weighted first-order formulas
Semantics: feature templates for Markov networks
A Markov Logic Network (MLN) is a set of pairs (F_i, w_i), where:
F_i is a formula in first-order logic
w_i is a real number

P(x) = (1/Z) exp( Σ_i w_i n_i(x) )
where n_i(x) is the number of true groundings of F_i
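The count n_i(x) is easy to illustrate; the sketch below (mine, not Alchemy's API) counts true groundings of Smokes(x) ⇒ Cancer(x) in one world:

```python
CONSTANTS = ["Anna", "Bob"]

def n_implication(world: dict) -> int:
    # Number of true groundings of: forall x, Smokes(x) => Cancer(x).
    return sum(1 for c in CONSTANTS
               if (not world[f"Smokes({c})"]) or world[f"Cancer({c})"])

world = {"Smokes(Anna)": True,  "Cancer(Anna)": False,
         "Smokes(Bob)": False, "Cancer(Bob)": False}
print(n_implication(world))  # 1: only the Bob grounding is satisfied
```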
Example: Friends & Smokers

Smoking causes cancer.
Friends have similar smoking habits.

As weighted first-order formulas:
1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))
Example: Friends & Smokers

Two constants: Anna (A) and Bob (B)
[Figure: ground Markov network over Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)]
Probabilistic graphical models and first-order logic are special cases
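Putting the pieces together, a brute-force sketch of this ground MLN over A and B (practical only because there are just 256 worlds; real systems use the inference algorithms discussed below):

```python
import itertools, math

C = ["A", "B"]
atoms = ([f"Smokes({x})" for x in C] + [f"Cancer({x})" for x in C]
         + [f"Friends({x},{y})" for x in C for y in C])

def n1(w):  # true groundings of: Smokes(x) => Cancer(x)
    return sum((not w[f"Smokes({x})"]) or w[f"Cancer({x})"] for x in C)

def n2(w):  # true groundings of: Friends(x,y) => (Smokes(x) <=> Smokes(y))
    return sum((not w[f"Friends({x},{y})"])
               or (w[f"Smokes({x})"] == w[f"Smokes({y})"])
               for x in C for y in C)

weight = {}
for vals in itertools.product([False, True], repeat=len(atoms)):
    w = dict(zip(atoms, vals))
    weight[vals] = math.exp(1.5 * n1(w) + 1.1 * n2(w))
Z = sum(weight.values())         # partition function over all 256 worlds
print(max(weight.values()) / Z)  # probability of the most likely world
```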
MLN Algorithms: The First Three Generations

Problem            | First generation        | Second generation | Third generation
MAP inference      | Weighted satisfiability | Lazy inference    | Cutting planes
Marginal inference | Gibbs sampling          | MC-SAT            | Lifted inference
Weight learning    | Pseudo-likelihood       | Voted perceptron  | Scaled conj. gradient
Structure learning | Inductive logic progr.  | ILP + PL (etc.)   | Clustering + pathfinding
Efficient Inference

Logical or statistical inference alone is already hard
But … can do approximate inference
Sufficient to perform well in most cases
Combine ideas from both camps, e.g., MC-SAT = MCMC + SAT solver
Can also leverage sparsity in relational domains

More: Poon & Domingos, “Sound and Efficient Inference with Probabilistic and Deterministic Dependencies”, in Proc. AAAI-2006.
More: Poon, Domingos & Sumner, “A General Method for Reducing the Complexity of Relational Inference and its Application to MCMC”, in Proc. AAAI-2008.
Weight Learning
Probability model P(X)
X: observable in training data
Maximize likelihood of observed data
Regularization to prevent overfitting
Weight Learning

Gradient descent, where each step requires inference (one-step sketch below):
∂/∂w_i log P(x) = n_i(x) − E_x[n_i(x)]
n_i(x): number of times clause i is true in the data
E_x[n_i(x)]: expected number of times clause i is true according to the MLN

Use MC-SAT for inference
Can also leverage second-order information [Lowd & Domingos, 2007]
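A minimal sketch of one such gradient step, with placeholder counts standing in for what data statistics and MC-SAT inference would supply:

```python
def weight_update(w, n_data, n_expected, lr=0.1):
    # One gradient-ascent step: dlogP/dw_i = n_i(x) - E_x[n_i(x)].
    return [wi + lr * (nd - ne) for wi, nd, ne in zip(w, n_data, n_expected)]

# Placeholder counts; a real system estimates n_expected by inference.
w = weight_update([1.5, 1.1], n_data=[2.0, 3.0], n_expected=[1.6, 3.4])
print(w)  # [1.54, 1.06]: weights move to match observed and expected counts
```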
Unsupervised Learning: How?
I.I.D. learning: a sophisticated model requires more labeled data
Statistical relational learning: a sophisticated model may require less labeled data
Ambiguities vary among objects
Joint inference propagates information from unambiguous objects to ambiguous ones
One formula is worth a thousand labels
Small amount of domain knowledge → large-scale joint inference
Unsupervised Weight Learning
Probability model P(X, Z)
X: observed in training data
Z: hidden variables
E.g., clustering with mixture models (toy sketch below):
Z: cluster assignment
X: observed features
Maximize likelihood of observed data by summing out the hidden variables Z:
P(X, Z) = P(Z) P(X | Z), so P(X) = Σ_Z P(Z) P(X | Z)
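A toy sketch of summing out hidden variables in a two-cluster mixture (the numbers are illustrative placeholders, not learned values):

```python
import math

prior = {"z1": 0.6, "z2": 0.4}                                  # P(Z)
lik = {"z1": {"a": 0.9, "b": 0.1}, "z2": {"a": 0.2, "b": 0.8}}  # P(X|Z)

def marginal(x: str) -> float:
    # P(x) = sum_z P(z) * P(x|z): the hidden cluster z is summed out.
    return sum(prior[z] * lik[z][x] for z in prior)

data = ["a", "a", "b"]
print(sum(math.log(marginal(x)) for x in data))  # log-likelihood of data
```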
Unsupervised Weight Learning

Gradient descent:
∂/∂w_i log P(x) = E_{z|x}[n_i(x,z)] − E_{x,z}[n_i(x,z)]
The first expectation sums over z conditioned on the observed x; the second sums over both x and z
Use MC-SAT to compute both expectations
May also combine with contrastive estimation

More: Poon, Cherry & Toutanova, “Unsupervised Morphological Segmentation with Log-Linear Models”, in Proc. NAACL-2009. (Best Paper Award)
Markov Logic
Unified inference and learning algorithms
Can handle millions of variables, billions of features, tens of thousands of parameters
Easy-to-use software: Alchemy
Many successful applications, e.g.: information extraction, coreference resolution, semantic parsing, ontology induction
Pipeline → Joint Inference
Combine segmentation and entity resolution for information extraction
Extract complex and nested bio-events from PubMed abstracts
More: Poon & Domingos, “Joint Inference for Information Extraction”, in Proc. AAAI-2007.
More: Poon & Vanderwende, “Joint Inference for Knowledge Extraction from Biomedical Literature”, in Proc. NAACL-2010.
Unsupervised Learning: Example
Coreference resolution: Accuracy comparable to previous supervised state of the art
More: Poon & Domingos, “Joint Unsupervised Coreference Resolution with Markov Logic”, in Proc. EMNLP-2008.
Overview
Machine reading: Challenges
Statistical relational learning
Markov logic
USP: Unsupervised Semantic Parsing
Research directions
Unsupervised Semantic Parsing
USP [Poon & Domingos, EMNLP-09] (Best Paper Award)
First unsupervised approach for semantic parsing
End-to-end machine reading system: read text, answer questions
Encoded in a few Markov logic formulas
OntoUSP = USP + Ontology Induction [Poon & Domingos, ACL-10]
Semantic Parsing
Goal: Microsoft buys Powerset → BUY(MICROSOFT, POWERSET)

Challenge:
Microsoft buys Powerset
Microsoft acquires semantic search engine Powerset
Powerset is acquired by Microsoft Corporation
The Redmond software giant buys Powerset
Microsoft’s purchase of Powerset, …
Limitations of Existing Approaches
Manual grammar or supervised learning
Applicable to restricted domains only
For general text:
Not clear what predicates and objects to use
Hard to produce consistent meaning annotation
Also, often learn both syntax and semantics:
Fail to leverage advanced syntactic parsers
Make semantic parsing harder
USP: Key Idea # 1
Target predicates and objects can be learned
Viewed as clusters of syntactic or lexical variations of the same meaning

BUY(-,-): buys, acquires, ’s purchase of, … (cluster of various expressions for acquisition)
MICROSOFT: Microsoft, the Redmond software giant, … (cluster of various mentions of Microsoft)
USP: Key Idea # 2

Relational clustering: cluster relations with the same objects
USP: recursively cluster arbitrary expressions with similar subexpressions

Microsoft buys Powerset
Microsoft acquires semantic search engine Powerset
Powerset is acquired by Microsoft Corporation
The Redmond software giant buys Powerset
Microsoft’s purchase of Powerset, …

First cluster the same forms at the atom level, then repeatedly cluster forms that occur in composition with the same forms
USP: Key Idea # 3
Start directly from syntactic analyses
Focus on translating them to semantics
Leverage rapid progress in syntactic parsing
Much easier than learning both
Joint Inference in USP
Forms a canonical meaning representation by recursively clustering synonymous expressions
Text → logical form in this representation
Induces an ISA hierarchy among clusters and applies hierarchical smoothing (shrinkage)
USP: System Overview
Input: dependency trees for sentences
Converts dependency trees into quasi-logical forms (QLFs)
Starts with QLF clusters at the atom level
Recursively builds up clusters of larger forms
Output:
Probability distribution over QLF clusters and their compositions
MAP semantic parses of sentences
Generating Quasi-Logical Forms

[Figure: dependency tree of “Microsoft buys Powerset”: buys → Microsoft (nsubj), buys → Powerset (dobj)]

Convert each node into a unary atom, where n1, n2, n3 are Skolem constants:
buys(n1)
Microsoft(n2)
Powerset(n3)

Convert each edge into a binary atom:
nsubj(n1,n2)
dobj(n1,n3)
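A small sketch of this conversion, under an assumed string encoding of atoms (not USP's actual data structures):

```python
def to_qlf(tokens: dict, edges: list) -> list:
    # One unary atom per node (over a Skolem constant), one binary per edge.
    node = {i: f"n{i}" for i in tokens}
    unary = [f"{word}({node[i]})" for i, word in tokens.items()]
    binary = [f"{label}({node[h]},{node[d]})" for h, label, d in edges]
    return unary + binary

tokens = {1: "buys", 2: "Microsoft", 3: "Powerset"}
edges = [(1, "nsubj", 2), (1, "dobj", 3)]
print(to_qlf(tokens, edges))
# ['buys(n1)', 'Microsoft(n2)', 'Powerset(n3)', 'nsubj(n1,n2)', 'dobj(n1,n3)']
```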
A Semantic Parse

buys(n1), Microsoft(n2), Powerset(n3), nsubj(n1,n2), dobj(n1,n3)

Partition the QLF into subformulas
Subformula → lambda form: replace each Skolem constant that is not in a unary atom with a unique lambda variable:
λx2.nsubj(n1,x2)
λx3.dobj(n1,x3)

Core form: no lambda variable, e.g., buys(n1)
Argument form: one lambda variable, e.g., λx2.nsubj(n1,x2)
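The lambda-form step is mechanical; a sketch under the same assumed string encoding:

```python
def to_lambda_form(atom: str, skolem: str, var: str) -> str:
    # Abstract one Skolem constant into a lambda variable.
    return f"λ{var}.{atom.replace(skolem, var)}"

print(to_lambda_form("nsubj(n1,n2)", "n2", "x2"))  # λx2.nsubj(n1,x2)
print(to_lambda_form("dobj(n1,n3)", "n3", "x3"))   # λx3.dobj(n1,x3)
```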
A Semantic Parse

Assign each subformula to an object cluster:
buys(n1) → BUY
Microsoft(n2) → MICROSOFT
Powerset(n3) → POWERSET
with λx2.nsubj(n1,x2) and λx3.dobj(n1,x3) as argument forms
Object Cluster: BUY

Distribution over core forms:
buys(n1)      0.1
acquires(n1)  0.2
……

One formula in the MLN; learn a weight for each pair of cluster and core form
Object Cluster: BUY

buys(n1)      0.1
acquires(n1)  0.2
……

May contain a variable number of property clusters: BUYER, BOUGHT, PRICE, ……
Property Cluster: BUYER

Distributions over argument forms, argument clusters, and argument number (three formulas in the MLN):
Argument forms:    λx2.nsubj(n1,x2) 0.5, λx2.agent(n1,x2) 0.4, ……
Argument clusters: MICROSOFT 0.2, GOOGLE 0.1, ……
Number:            Zero 0.1, One 0.8, ……
Probabilistic Model

Exponential prior on the number of parameters
Cluster mixtures, with hierarchical smoothing:

Object Cluster: BUY
Core forms: buys 0.1, acquires 0.4, …

Property Cluster: BUYER
Argument forms: nsubj 0.5, agent 0.4, …
Argument clusters: MICROSOFT 0.2, GOOGLE 0.1, …
Number: Zero 0.1, One 0.8, …

E.g., picking MICROSOFT as the BUYER argument depends not only on BUY, but also on its ISA ancestors
Abstract Lambda Form

buys(n1) λx2.nsubj(n1,x2) λx3.dobj(n1,x3)
↓
BUYS(n1) λx2.BUYER(n1,x2) λx3.BOUGHT(n1,x3)

The final logical form is obtained via lambda reduction
Challenge: State Space Too Large
The number of potential clusterings is exponential in the number of tokens
Also, meaning units and clusters are often small
→ Use combinatorial search
Inference: Find MAP Parse

Initialize with one subformula per atom, then apply the search operator, composing subformulas via lambda reduction
[Figure: dependency tree of “IL-4 protein induces CD11B” (nn, nsubj, dobj edges); successive steps compose “IL-4” and “protein” into one meaning unit]
Learning: Greedily Maximize Posterior

Initialize with atomic clusters, then greedily apply the search operators (sketched below):
MERGE: e.g., {induces: 1.0} + {enhances: 1.0} → {induces: 0.2, enhances: 0.8}
COMPOSE: e.g., {amino: 1.0} + {acid: 1.0} → {amino acid: 1.0}
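A hedged sketch of the two operators acting on form distributions (the real system scores each candidate move by the gain in posterior probability; the mixing weights here are illustrative):

```python
def merge(c1: dict, c2: dict, w1: float, w2: float) -> dict:
    # MERGE: pool two clusters' form distributions, weighted by cluster size.
    out = {}
    for dist, w in ((c1, w1), (c2, w2)):
        for form, p in dist.items():
            out[form] = out.get(form, 0.0) + p * w / (w1 + w2)
    return out

def compose(f1: str, f2: str) -> dict:
    # COMPOSE: treat two co-occurring forms as a single multiword unit.
    return {f"{f1} {f2}": 1.0}

print(merge({"induces": 1.0}, {"enhances": 1.0}, w1=0.2, w2=0.8))
# {'induces': 0.2, 'enhances': 0.8}
print(compose("amino", "acid"))  # {'amino acid': 1.0}
```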
Operator: Abstract

[Figure: clusters INDUCE {induces 0.6, up-regulates 0.2, …} and INHIBIT {inhibits 0.4, suppresses 0.2, …}, linked by ISA edges to an abstract parent {induces 0.3, enhances 0.1, inhibits 0.2, suppresses 0.1, …}; alternative considered: MERGE with REGULATE?]
Captures substantial similarities
Experiments

Apply to machine reading: extract knowledge from text and answer questions
Evaluation: number of answers and accuracy
GENIA dataset: 1,999 PubMed abstracts
Use simple factoid questions, e.g.:
What does anti-STAT1 inhibit?
What regulates MIP-1 alpha?
Total and Correct Answers
[Figure: bar chart (0–500) of total and correct answers for KW-SYN, TextRunner, RESOLVER, DIRT, and USP]

USP extracted five times as many correct answers as TextRunner
Highest precision of 91%
Qualitative Analysis
Resolves many nontrivial variations:
Argument forms that mean the same, e.g., expression of X ↔ X expression
Active vs. passive voice, e.g., X stimulates Y ↔ Y is stimulated with X
Synonymous expressions
Etc.
Clusters And Compositions
Clusters in core forms:
investigate, examine, evaluate, analyze, study, assay
diminish, reduce, decrease, attenuate
synthesis, production, secretion, release
dramatically, substantially, significantly
……

Compositions:
amino acid, t cell, immune response, transcription factor, initiation site, binding site, …
Question-Answer Example
Q: What does IL-2 control?
A: The DEX-mediated IkappaBalpha induction
Interestingly, the DEX-mediated IkappaBalpha induction was completely inhibited by IL-2, but not IL-4, in Th1 cells, while the reverse profile was seen in Th2 cells.
Overview
Machine reading
Statistical relational learning
Markov logic
USP: Unsupervised Semantic Parsing
Research directions
Web-Scale Joint Inference
Challenge: efficiently identify what is relevant
Key: induce and leverage an ontology
Ontology: captures essential properties and abstracts away unimportant variations
Upper-level nodes → skip irrelevant branches
Wanted: combine the following:
Probabilistic ontology induction (e.g., USP)
Coarse-to-fine learning and inference [Felzenszwalb & McAllester, 2007; Petrov, Ph.D. Thesis]
Knowledge Reasoning
Most facts/rules are not explicitly stated
“Dark matter” in the natural-language universe, e.g. (toy sketch below):
kale contains calcium ∧ calcium prevents osteoporosis ⇒ kale prevents osteoporosis
Keys:
Induce generic reasoning patterns
Incorporate reasoning in extraction
Additional sources of indirect supervision
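A toy sketch of chaining extracted facts with one hand-written generic pattern (inducing such patterns automatically is the open problem the slide poses):

```python
facts = {("kale", "contains", "calcium"),
         ("calcium", "prevents", "osteoporosis")}

def infer(facts: set) -> set:
    # Apply the pattern: X contains Y and Y prevents Z => X prevents Z.
    return {(a, "prevents", c)
            for (a, r1, b) in facts
            for (b2, r2, c) in facts
            if r1 == "contains" and r2 == "prevents" and b == b2}

print(infer(facts))  # {('kale', 'prevents', 'osteoporosis')}
```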
Harness Social Computing

Bootstrap an online community around the knowledge base
Incorporate humans and end tasks in the loop:
“Tell me everything about dicer applied to synapse …”
“Your extraction from my paper is correct except for blah …”
Form a positive feedback loop
Acknowledgments
Pedro Domingos, Colin Cherry, Kristina Toutanova, Lucy Vanderwende, Oren Etzioni, Dan Weld, Matt Richardson, Parag Singla, Stanley Kok, Daniel Lowd, Marc Sumner
ARO, AFRL, ONR, DARPA, NSF
Summary
Statistical relational learning offers promising solutions for machine reading
Markov logic provides a language for this:
Syntax: weighted first-order logical formulas
Semantics: feature templates of Markov networks
Open-source software: Alchemy
A success story: USP
Three key research directions
alchemy.cs.washington.edu
alchemy.cs.washington.edu/papers/poon09