Statistical Relational Learning for Knowledge Extraction from the Web
Statistical Relational Learning for Knowledge Extraction from the Web

Hoifung Poon
Dept. of Computer Science & Eng.
University of Washington
“Drowning in Information, Starved for Knowledge”
Great Vision: Knowledge Extraction from the Web

Also need: knowledge representation and reasoning
Close the loop: apply knowledge to extraction
Machine reading [Etzioni et al., 2007]
Craven et al., “Learning to Construct Knowledge Bases from the World Wide Web,” Artificial Intelligence, 1999.
Machine Reading: Text → Knowledge
Rapidly Growing Interest
AAAI-07 Spring Symposium on Machine Reading
DARPA Machine Reading Program (2009–2014)
NAACL-10 Workshop on Learning By Reading
Etc.
Great Impact
Scientific inquiry and commercial applications:
Literature-based discovery, robot scientists
Question answering, semantic search
Drug design, medical diagnosis
Breach the knowledge-acquisition bottleneck for AI and natural language understanding
Automatically semantify the Web
Etc.
This Talk
Statistical relational learning offers promising solutions to machine reading
Markov logic is a leading unifying framework
A success story: USP
Unsupervised, end-to-end machine reading
Extracts five times as many correct answers as the state of the art, with the highest accuracy of 91%
USP: Question-Answer Example
Q: What does IL-2 control?
A: The DEX-mediated IkappaBalpha induction
Interestingly, the DEX-mediated IkappaBalpha induction was completely inhibited by IL-2, but not IL-4, in Th1 cells, while the reverse profile was seen in Th2 cells.
Overview
Machine reading: Challenges
Statistical relational learning
Markov logic
USP: Unsupervised Semantic Parsing
Research directions
Key Challenges
Complexity
Uncertainty
Pipeline accumulates errors
Supervision is scarce
Languages Are Structural

govern-ment-s
l-m$px-t-m (Hebrew: “according to their families”)

IL-4 induces CD11B
[Figure: syntactic parse tree: S → NP VP, VP → V NP]

Involvement of p70(S6)-kinase activation in IL-10 up-regulation in human monocytes by gp41 …
[Figure: nested event structure for the sentence above, with Theme, Cause, and Site roles linking involvement, up-regulation, IL-10, human monocyte, p70(S6)-kinase activation, and gp41]

George Walker Bush was the 43rd President of the United States. … Bush was the eldest son of President G. H. W. Bush and Barbara Bush. … In November 1977, he met Laura Welch at a barbecue.
Knowledge Is Heterogeneous
Individuals. E.g.: Socrates is a man
Types. E.g.: Man is mortal
Inference rules. E.g.: syllogism
Ontological relations
Etc.
[Figure: ISA link from HUMAN to MAMMAL; ISPART link from EYE to FACE]
Complexity
Can handle using first-order logic
Trees, graphs, dependencies, hierarchies, etc. easily expressed
Inference algorithms (satisfiability testing, theorem proving, etc.)
But … logic is brittle with uncertainty
Languages Are Ambiguous

I saw the man with the telescope
[Figure: two parse trees, attaching “with the telescope” to either the man or the seeing]

G. W. Bush … Laura Bush … Mrs. Bush …

Here in London, Frances Deek is a retired teacher …
In the Israeli town …, Karen London says …
Now London says …
London: PERSON or LOCATION?

Microsoft buys Powerset
Microsoft acquires Powerset
Powerset is acquired by Microsoft Corporation
The Redmond software giant buys Powerset
Microsoft’s purchase of Powerset, …
Which one?
Knowledge Has Uncertainty
We need to model correlations
Our information is always incomplete
Our predictions are uncertain
Uncertainty
Statistics provides the tools to handle this:
Mixture models
Hidden Markov models
Bayesian networks
Markov random fields
Maximum entropy models
Conditional random fields
Etc.
But … statistical models assume i.i.d. (independently and identically distributed) data: objects → feature vectors
Pipeline is Suboptimal
E.g., the NLP pipeline:
Tokenization → Morphology → Chunking → Syntax → …
Accumulates and propagates errors
Wanted: joint inference
Across all processing stages
Among all interdependent objects
Supervision is Scarce
Tons of text … but most is not annotated
Labeling is expensive (cf. Penn Treebank)
Need to leverage indirect supervision
Redundancy
Key source of indirect supervision
State-of-the-art systems depend on this, e.g., TextRunner [Banko et al., 2007]
But … the Web is heterogeneous: long tail
Redundancy is only present in the head regime
Overview
Machine reading: Challenges
Statistical relational learning
Markov logic
USP: Unsupervised Semantic Parsing
Research directions
Statistical Relational Learning
Burgeoning field in machine learning
Offers promising solutions for machine reading:
Unify statistical and logical approaches
Replace pipeline with joint inference
Principled framework to leverage both direct and indirect supervision
Machine Reading: A Vision

Challenge: Long tail
Challenges in Applying Statistical Relational Learning
Learning is much harder
Inference becomes a crucial issue
Greater complexity for the user
Progress to Date

Probabilistic logic [Nilsson, 1986]
Statistics and beliefs [Halpern, 1990]
Knowledge-based model construction [Wellman et al., 1992]
Stochastic logic programs [Muggleton, 1996]
Probabilistic relational models [Friedman et al., 1999]
Relational Markov networks [Taskar et al., 2002]
Markov logic [Domingos & Lowd, 2009]: the leading unifying framework
Etc.
Overview

Machine reading
Statistical relational learning
Markov logic
USP: Unsupervised Semantic Parsing
Research directions
Markov Networks

Undirected graphical models
[Figure: network over the variables Smoking, Cancer, Asthma, Cough]

Log-linear model:
P(x) = (1/Z) exp( Σ_i w_i f_i(x) )
where w_i is the weight of feature i and f_i is feature i, e.g.:
f_1(Smoking, Cancer) = 1 if ¬Smoking ∨ Cancer, 0 otherwise
w_1 = 1.5
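To make the log-linear form concrete, here is a minimal Python sketch (an illustration of mine, not code from the talk) that scores every world of the single Smoking/Cancer feature above by brute force:

```python
import itertools, math

def f1(smoking: bool, cancer: bool) -> float:
    # Feature fires when the clause Smoking => Cancer holds in the world.
    return 1.0 if (not smoking) or cancer else 0.0

w1 = 1.5
worlds = list(itertools.product([False, True], repeat=2))
unnorm = {x: math.exp(w1 * f1(*x)) for x in worlds}  # exp(w1 * f1(x))
Z = sum(unnorm.values())                             # partition function
P = {x: u / Z for x, u in unnorm.items()}
# The one world violating the clause (smokes, no cancer) gets the least mass:
print(P[(True, False)])
```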
First-Order Logic
Constants, variables, functions, predicates. E.g.: Anna, x, MotherOf(x), Friends(x,y)
Grounding: replace all variables by constants. E.g.: Friends(Anna, Bob)
World (model, interpretation): assignment of truth values to all ground predicates
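A tiny illustrative sketch of grounding (the string encoding of atoms is my assumption, not a standard representation):

```python
from itertools import product

constants = ["Anna", "Bob"]

def ground(predicate: str, arity: int) -> list:
    # Substitute every combination of constants for the variables.
    return [f"{predicate}({','.join(args)})"
            for args in product(constants, repeat=arity)]

print(ground("Friends", 2))
# ['Friends(Anna,Anna)', 'Friends(Anna,Bob)', 'Friends(Bob,Anna)', 'Friends(Bob,Bob)']
```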
Markov Logic

Intuition: soften logical constraints
Syntax: weighted first-order formulas
Semantics: feature templates for Markov networks
A Markov Logic Network (MLN) is a set of pairs (F_i, w_i), where:
F_i is a formula in first-order logic
w_i is a real number

P(x) = (1/Z) exp( Σ_i w_i n_i(x) )
where n_i(x) is the number of true groundings of F_i
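The count n_i(x) is easy to illustrate; the sketch below (mine, not Alchemy's API) counts true groundings of Smokes(x) ⇒ Cancer(x) in one world:

```python
CONSTANTS = ["Anna", "Bob"]

def n_implication(world: dict) -> int:
    # Number of true groundings of: forall x, Smokes(x) => Cancer(x).
    return sum(1 for c in CONSTANTS
               if (not world[f"Smokes({c})"]) or world[f"Cancer({c})"])

world = {"Smokes(Anna)": True,  "Cancer(Anna)": False,
         "Smokes(Bob)": False, "Cancer(Bob)": False}
print(n_implication(world))  # 1: only the Bob grounding is satisfied
```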
Example: Friends & Smokers

Smoking causes cancer.
Friends have similar smoking habits.

As weighted first-order formulas:
1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))
Example: Friends & Smokers

Two constants: Anna (A) and Bob (B)
[Figure: ground Markov network over Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)]
Probabilistic graphical models and first-order logic are special cases
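Putting the pieces together, a brute-force sketch of this ground MLN over A and B (practical only because there are just 256 worlds; real systems use the inference algorithms discussed below):

```python
import itertools, math

C = ["A", "B"]
atoms = ([f"Smokes({x})" for x in C] + [f"Cancer({x})" for x in C]
         + [f"Friends({x},{y})" for x in C for y in C])

def n1(w):  # true groundings of: Smokes(x) => Cancer(x)
    return sum((not w[f"Smokes({x})"]) or w[f"Cancer({x})"] for x in C)

def n2(w):  # true groundings of: Friends(x,y) => (Smokes(x) <=> Smokes(y))
    return sum((not w[f"Friends({x},{y})"])
               or (w[f"Smokes({x})"] == w[f"Smokes({y})"])
               for x in C for y in C)

weight = {}
for vals in itertools.product([False, True], repeat=len(atoms)):
    w = dict(zip(atoms, vals))
    weight[vals] = math.exp(1.5 * n1(w) + 1.1 * n2(w))
Z = sum(weight.values())         # partition function over all 256 worlds
print(max(weight.values()) / Z)  # probability of the most likely world
```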
MLN Algorithms: The First Three Generations

Problem            | First generation        | Second generation | Third generation
MAP inference      | Weighted satisfiability | Lazy inference    | Cutting planes
Marginal inference | Gibbs sampling          | MC-SAT            | Lifted inference
Weight learning    | Pseudo-likelihood       | Voted perceptron  | Scaled conj. gradient
Structure learning | Inductive logic progr.  | ILP + PL (etc.)   | Clustering + pathfinding
Efficient Inference

Logical or statistical inference alone is already hard
But … can do approximate inference
Sufficient to perform well in most cases
Combine ideas from both camps, e.g., MC-SAT = MCMC + SAT solver
Can also leverage sparsity in relational domains

More: Poon & Domingos, “Sound and Efficient Inference with Probabilistic and Deterministic Dependencies”, in Proc. AAAI-2006.
More: Poon, Domingos & Sumner, “A General Method for Reducing the Complexity of Relational Inference and its Application to MCMC”, in Proc. AAAI-2008.
Weight Learning
Probability model P(X)
X: observable in training data
Maximize likelihood of observed data
Regularization to prevent overfitting
Weight Learning

Gradient descent, where each step requires inference (one-step sketch below):
∂/∂w_i log P(x) = n_i(x) − E_x[n_i(x)]
n_i(x): number of times clause i is true in the data
E_x[n_i(x)]: expected number of times clause i is true according to the MLN

Use MC-SAT for inference
Can also leverage second-order information [Lowd & Domingos, 2007]
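A minimal sketch of one such gradient step, with placeholder counts standing in for what data statistics and MC-SAT inference would supply:

```python
def weight_update(w, n_data, n_expected, lr=0.1):
    # One gradient-ascent step: dlogP/dw_i = n_i(x) - E_x[n_i(x)].
    return [wi + lr * (nd - ne) for wi, nd, ne in zip(w, n_data, n_expected)]

# Placeholder counts; a real system estimates n_expected by inference.
w = weight_update([1.5, 1.1], n_data=[2.0, 3.0], n_expected=[1.6, 3.4])
print(w)  # [1.54, 1.06]: weights move to match observed and expected counts
```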
Unsupervised Learning: How?
I.I.D. learning: a sophisticated model requires more labeled data
Statistical relational learning: a sophisticated model may require less labeled data
Ambiguities vary among objects
Joint inference propagates information from unambiguous objects to ambiguous ones
One formula is worth a thousand labels
Small amount of domain knowledge → large-scale joint inference
Unsupervised Weight Learning
Probability model P(X, Z)
X: observed in training data
Z: hidden variables
E.g., clustering with mixture models (toy sketch below):
Z: cluster assignment
X: observed features
Maximize likelihood of observed data by summing out the hidden variables Z:
P(X, Z) = P(Z) P(X | Z), so P(X) = Σ_Z P(Z) P(X | Z)
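A toy sketch of summing out hidden variables in a two-cluster mixture (the numbers are illustrative placeholders, not learned values):

```python
import math

prior = {"z1": 0.6, "z2": 0.4}                                  # P(Z)
lik = {"z1": {"a": 0.9, "b": 0.1}, "z2": {"a": 0.2, "b": 0.8}}  # P(X|Z)

def marginal(x: str) -> float:
    # P(x) = sum_z P(z) * P(x|z): the hidden cluster z is summed out.
    return sum(prior[z] * lik[z][x] for z in prior)

data = ["a", "a", "b"]
print(sum(math.log(marginal(x)) for x in data))  # log-likelihood of data
```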
Unsupervised Weight Learning

Gradient descent:
∂/∂w_i log P(x) = E_{z|x}[n_i(x,z)] − E_{x,z}[n_i(x,z)]
The first expectation sums over z conditioned on the observed x; the second sums over both x and z
Use MC-SAT to compute both expectations
May also combine with contrastive estimation

More: Poon, Cherry & Toutanova, “Unsupervised Morphological Segmentation with Log-Linear Models”, in Proc. NAACL-2009. (Best Paper Award)
Markov Logic
Unified inference and learning algorithms
Can handle millions of variables, billions of features, tens of thousands of parameters
Easy-to-use software: Alchemy
Many successful applications, e.g.: information extraction, coreference resolution, semantic parsing, ontology induction
Pipeline → Joint Inference
Combine segmentation and entity resolution for information extraction
Extract complex and nested bio-events from PubMed abstracts
More: Poon & Domingos, “Joint Inference for Information Extraction”, in Proc. AAAI-2007.
More: Poon & Vanderwende, “Joint Inference for Knowledge Extraction from Biomedical Literature”, in Proc. NAACL-2010.
Unsupervised Learning: Example
Coreference resolution: Accuracy comparable to previous supervised state of the art
More: Poon & Domingos, “Joint Unsupervised Coreference Resolution with Markov Logic”, in Proc. EMNLP-2008.
Overview
Machine reading: Challenges
Statistical relational learning
Markov logic
USP: Unsupervised Semantic Parsing
Research directions
Unsupervised Semantic Parsing
USP [Poon & Domingos, EMNLP-09] (Best Paper Award)
First unsupervised approach for semantic parsing
End-to-end machine reading system: read text, answer questions
Encoded in a few Markov logic formulas
OntoUSP = USP + Ontology Induction [Poon & Domingos, ACL-10]
Semantic Parsing
Goal: Microsoft buys Powerset → BUY(MICROSOFT, POWERSET)

Challenge:
Microsoft buys Powerset
Microsoft acquires semantic search engine Powerset
Powerset is acquired by Microsoft Corporation
The Redmond software giant buys Powerset
Microsoft’s purchase of Powerset, …
Limitations of Existing Approaches
Manual grammar or supervised learning
Applicable to restricted domains only
For general text:
Not clear what predicates and objects to use
Hard to produce consistent meaning annotation
Also, often learn both syntax and semantics:
Fail to leverage advanced syntactic parsers
Make semantic parsing harder
USP: Key Idea # 1
Target predicates and objects can be learned
Viewed as clusters of syntactic or lexical variations of the same meaning

BUY(-,-): buys, acquires, ’s purchase of, … (cluster of various expressions for acquisition)
MICROSOFT: Microsoft, the Redmond software giant, … (cluster of various mentions of Microsoft)
USP: Key Idea # 2

Relational clustering: cluster relations with the same objects
USP: recursively cluster arbitrary expressions with similar subexpressions

Microsoft buys Powerset
Microsoft acquires semantic search engine Powerset
Powerset is acquired by Microsoft Corporation
The Redmond software giant buys Powerset
Microsoft’s purchase of Powerset, …

First cluster the same forms at the atom level, then repeatedly cluster forms that occur in composition with the same forms
USP: Key Idea # 3
Start directly from syntactic analyses
Focus on translating them to semantics
Leverage rapid progress in syntactic parsing
Much easier than learning both
Joint Inference in USP
Forms a canonical meaning representation by recursively clustering synonymous expressions
Text → logical form in this representation
Induces an ISA hierarchy among clusters and applies hierarchical smoothing (shrinkage)
USP: System Overview
Input: dependency trees for sentences
Converts dependency trees into quasi-logical forms (QLFs)
Starts with QLF clusters at the atom level
Recursively builds up clusters of larger forms
Output:
Probability distribution over QLF clusters and their compositions
MAP semantic parses of sentences
Generating Quasi-Logical Forms

[Figure: dependency tree of “Microsoft buys Powerset”: buys → Microsoft (nsubj), buys → Powerset (dobj)]

Convert each node into a unary atom, where n1, n2, n3 are Skolem constants:
buys(n1)
Microsoft(n2)
Powerset(n3)

Convert each edge into a binary atom:
nsubj(n1,n2)
dobj(n1,n3)
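A small sketch of this conversion, under an assumed string encoding of atoms (not USP's actual data structures):

```python
def to_qlf(tokens: dict, edges: list) -> list:
    # One unary atom per node (over a Skolem constant), one binary per edge.
    node = {i: f"n{i}" for i in tokens}
    unary = [f"{word}({node[i]})" for i, word in tokens.items()]
    binary = [f"{label}({node[h]},{node[d]})" for h, label, d in edges]
    return unary + binary

tokens = {1: "buys", 2: "Microsoft", 3: "Powerset"}
edges = [(1, "nsubj", 2), (1, "dobj", 3)]
print(to_qlf(tokens, edges))
# ['buys(n1)', 'Microsoft(n2)', 'Powerset(n3)', 'nsubj(n1,n2)', 'dobj(n1,n3)']
```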
A Semantic Parse

buys(n1), Microsoft(n2), Powerset(n3), nsubj(n1,n2), dobj(n1,n3)

Partition the QLF into subformulas
Subformula → lambda form: replace each Skolem constant that is not in a unary atom with a unique lambda variable:
λx2.nsubj(n1,x2)
λx3.dobj(n1,x3)

Core form: no lambda variable, e.g., buys(n1)
Argument form: one lambda variable, e.g., λx2.nsubj(n1,x2)
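The lambda-form step is mechanical; a sketch under the same assumed string encoding:

```python
def to_lambda_form(atom: str, skolem: str, var: str) -> str:
    # Abstract one Skolem constant into a lambda variable.
    return f"λ{var}.{atom.replace(skolem, var)}"

print(to_lambda_form("nsubj(n1,n2)", "n2", "x2"))  # λx2.nsubj(n1,x2)
print(to_lambda_form("dobj(n1,n3)", "n3", "x3"))   # λx3.dobj(n1,x3)
```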
A Semantic Parse

Assign each subformula to an object cluster:
buys(n1) → BUY
Microsoft(n2) → MICROSOFT
Powerset(n3) → POWERSET
with λx2.nsubj(n1,x2) and λx3.dobj(n1,x3) as argument forms
Object Cluster: BUY

Distribution over core forms:
buys(n1)      0.1
acquires(n1)  0.2
……

One formula in the MLN; learn a weight for each pair of cluster and core form
Object Cluster: BUY

buys(n1)      0.1
acquires(n1)  0.2
……

May contain a variable number of property clusters: BUYER, BOUGHT, PRICE, ……
Property Cluster: BUYER

Distributions over argument forms, argument clusters, and argument number (three formulas in the MLN):
Argument forms:    λx2.nsubj(n1,x2) 0.5, λx2.agent(n1,x2) 0.4, ……
Argument clusters: MICROSOFT 0.2, GOOGLE 0.1, ……
Number:            Zero 0.1, One 0.8, ……
Probabilistic Model

Exponential prior on the number of parameters
Cluster mixtures, with hierarchical smoothing:

Object Cluster: BUY
Core forms: buys 0.1, acquires 0.4, …

Property Cluster: BUYER
Argument forms: nsubj 0.5, agent 0.4, …
Argument clusters: MICROSOFT 0.2, GOOGLE 0.1, …
Number: Zero 0.1, One 0.8, …

E.g., picking MICROSOFT as the BUYER argument depends not only on BUY, but also on its ISA ancestors
Abstract Lambda Form

buys(n1) λx2.nsubj(n1,x2) λx3.dobj(n1,x3)
↓
BUYS(n1) λx2.BUYER(n1,x2) λx3.BOUGHT(n1,x3)

The final logical form is obtained via lambda reduction
Challenge: State Space Too Large
The number of potential clusterings is exponential in the number of tokens
Also, meaning units and clusters are often small
→ Use combinatorial search
Inference: Find MAP Parse

Initialize with one subformula per atom, then apply the search operator, composing subformulas via lambda reduction
[Figure: dependency tree of “IL-4 protein induces CD11B” (nn, nsubj, dobj edges); successive steps compose “IL-4” and “protein” into one meaning unit]
Learning: Greedily Maximize Posterior

Initialize with atomic clusters, then greedily apply the search operators (sketched below):
MERGE: e.g., {induces: 1.0} + {enhances: 1.0} → {induces: 0.2, enhances: 0.8}
COMPOSE: e.g., {amino: 1.0} + {acid: 1.0} → {amino acid: 1.0}
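A hedged sketch of the two operators acting on form distributions (the real system scores each candidate move by the gain in posterior probability; the mixing weights here are illustrative):

```python
def merge(c1: dict, c2: dict, w1: float, w2: float) -> dict:
    # MERGE: pool two clusters' form distributions, weighted by cluster size.
    out = {}
    for dist, w in ((c1, w1), (c2, w2)):
        for form, p in dist.items():
            out[form] = out.get(form, 0.0) + p * w / (w1 + w2)
    return out

def compose(f1: str, f2: str) -> dict:
    # COMPOSE: treat two co-occurring forms as a single multiword unit.
    return {f"{f1} {f2}": 1.0}

print(merge({"induces": 1.0}, {"enhances": 1.0}, w1=0.2, w2=0.8))
# {'induces': 0.2, 'enhances': 0.8}
print(compose("amino", "acid"))  # {'amino acid': 1.0}
```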
Operator: Abstract

[Figure: clusters INDUCE {induces 0.6, up-regulates 0.2, …} and INHIBIT {inhibits 0.4, suppresses 0.2, …}, linked by ISA edges to an abstract parent {induces 0.3, enhances 0.1, inhibits 0.2, suppresses 0.1, …}; alternative considered: MERGE with REGULATE?]
Captures substantial similarities
Experiments

Apply to machine reading: extract knowledge from text and answer questions
Evaluation: number of answers and accuracy
GENIA dataset: 1,999 PubMed abstracts
Use simple factoid questions, e.g.:
What does anti-STAT1 inhibit?
What regulates MIP-1 alpha?
Total and Correct Answers
[Figure: bar chart (0–500) of total and correct answers for KW-SYN, TextRunner, RESOLVER, DIRT, and USP]

USP extracted five times as many correct answers as TextRunner
Highest precision of 91%
Qualitative Analysis
Resolves many nontrivial variations:
Argument forms that mean the same, e.g., expression of X ↔ X expression
Active vs. passive voice, e.g., X stimulates Y ↔ Y is stimulated with X
Synonymous expressions
Etc.
Clusters And Compositions
Clusters in core forms:
investigate, examine, evaluate, analyze, study, assay
diminish, reduce, decrease, attenuate
synthesis, production, secretion, release
dramatically, substantially, significantly
……

Compositions:
amino acid, t cell, immune response, transcription factor, initiation site, binding site, …
Question-Answer Example
Q: What does IL-2 control?
A: The DEX-mediated IkappaBalpha induction
Interestingly, the DEX-mediated IkappaBalpha induction was completely inhibited by IL-2, but not IL-4, in Th1 cells, while the reverse profile was seen in Th2 cells.
Overview
Machine reading
Statistical relational learning
Markov logic
USP: Unsupervised Semantic Parsing
Research directions
Web-Scale Joint Inference
Challenge: efficiently identify what is relevant
Key: induce and leverage an ontology
Ontology: captures essential properties and abstracts away unimportant variations
Upper-level nodes → skip irrelevant branches
Wanted: combine the following:
Probabilistic ontology induction (e.g., USP)
Coarse-to-fine learning and inference [Felzenszwalb & McAllester, 2007; Petrov, Ph.D. Thesis]
Knowledge Reasoning
Most facts/rules are not explicitly stated
“Dark matter” in the natural-language universe, e.g. (toy sketch below):
kale contains calcium ∧ calcium prevents osteoporosis ⇒ kale prevents osteoporosis
Keys:
Induce generic reasoning patterns
Incorporate reasoning in extraction
Additional sources of indirect supervision
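A toy sketch of chaining extracted facts with one hand-written generic pattern (inducing such patterns automatically is the open problem the slide poses):

```python
facts = {("kale", "contains", "calcium"),
         ("calcium", "prevents", "osteoporosis")}

def infer(facts: set) -> set:
    # Apply the pattern: X contains Y and Y prevents Z => X prevents Z.
    return {(a, "prevents", c)
            for (a, r1, b) in facts
            for (b2, r2, c) in facts
            if r1 == "contains" and r2 == "prevents" and b == b2}

print(infer(facts))  # {('kale', 'prevents', 'osteoporosis')}
```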
Harness Social Computing

Bootstrap an online community around the knowledge base
Incorporate humans and end tasks in the loop:
“Tell me everything about dicer applied to synapse …”
“Your extraction from my paper is correct except for blah …”
Form a positive feedback loop
Acknowledgments
Pedro Domingos, Colin Cherry, Kristina Toutanova, Lucy Vanderwende, Oren Etzioni, Dan Weld, Matt Richardson, Parag Singla, Stanley Kok, Daniel Lowd, Marc Sumner
ARO, AFRL, ONR, DARPA, NSF
Summary
Statistical relational learning offers promising solutions for machine reading
Markov logic provides a language for this:
Syntax: weighted first-order logical formulas
Semantics: feature templates of Markov networks
Open-source software: Alchemy
A success story: USP
Three key research directions
alchemy.cs.washington.edu
alchemy.cs.washington.edu/papers/poon09