Part II: Graphical models
Challenges of probabilistic models
• Specifying well-defined probabilistic models with many variables is hard (for modelers)
• Representing probability distributions over those variables is hard (for computers/learners)
• Computing quantities using those distributions is hard (for computers/learners)
Representing structured distributions
Four random variables:
X1 coin toss produces heads
X2 pencil levitates
X3 friend has psychic powers
X4 friend has two-headed coin
Domain: {0,1} × {0,1} × {0,1} × {0,1}
Joint distribution: assigns a probability to each of the 16 values 0000, 0001, 0010, …, 1111
• Requires 15 numbers to specify the probability of all values x1, x2, x3, x4
  – with N binary variables, 2^N – 1 numbers
• Similar cost when computing conditional probabilities
P(x2, x3, x4 | x1 = 1) = P(x1 = 1, x2, x3, x4) / P(x1 = 1)
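The cost of the brute-force approach can be made concrete with a small sketch: a full joint table over four binary variables, conditioned by touching every matching entry. The uniform probabilities are chosen purely for illustration.

```python
from itertools import product

# Hypothetical full joint table over four binary variables, indexed by
# tuples (x1, x2, x3, x4). 2^4 = 16 entries, of which 2^4 - 1 = 15 are
# free parameters (the last is fixed by normalization). Uniform values
# are used purely for illustration.
joint = {xs: 1 / 16 for xs in product([0, 1], repeat=4)}

# Conditioning touches every entry consistent with x1 = 1:
# P(x2, x3, x4 | x1 = 1) = P(x1 = 1, x2, x3, x4) / P(x1 = 1)
p_x1 = sum(p for xs, p in joint.items() if xs[0] == 1)
conditional = {xs[1:]: p / p_x1 for xs, p in joint.items() if xs[0] == 1}

print(len(joint))                 # 16 entries in the full table
print(sum(conditional.values()))  # 1.0 (a proper distribution)
```

With N variables both the table and the conditioning sweep grow as 2^N, which is the cost the next slides aim to avoid.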
How can we use fewer numbers?
Four random variables:
X1 coin toss produces heads
X2 coin toss produces heads
X3 coin toss produces heads
X4 coin toss produces heads
Domain: {0,1} × {0,1} × {0,1} × {0,1}
Statistical independence
• Two random variables X1 and X2 are independent if
P(x1|x2) = P(x1)
– e.g. coin flips: P(x1 = H | x2 = H) = P(x1 = H) = 0.5
• Independence makes it easier to represent and work with probability distributions
• We can exploit the product rule:
P(x1, x2, x3, x4) = P(x1 | x2, x3, x4) P(x2 | x3, x4) P(x3 | x4) P(x4)
If x1, x2, x3, and x4 are all independent…
P(x1, x2, x3, x4) = P(x1) P(x2) P(x3) P(x4)
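The saving from independence can be sketched in a few lines of Python, with fair coins as in the example above:

```python
from itertools import product

# Four independent fair coin flips: one number per variable
# (here P(xi = 1) = 0.5) determines the whole joint distribution.
p = {0: 0.5, 1: 0.5}

def joint_independent(x1, x2, x3, x4):
    # P(x1, x2, x3, x4) = P(x1) P(x2) P(x3) P(x4)
    return p[x1] * p[x2] * p[x3] * p[x4]

# 4 numbers instead of 2^4 - 1 = 15, and the 16 entries still sum to 1:
total = sum(joint_independent(*xs) for xs in product([0, 1], repeat=4))
print(total)  # 1.0
```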
Expressing independence
• Statistical independence is the key to efficient probabilistic representation and computation
• This has led to the development of languages for indicating dependencies among variables
• Some of the most popular languages are based on “graphical models”
Part II: Graphical models
• Introduction to graphical models
  – representation and inference
• Causal graphical models
  – causality
  – learning about causal relationships
• Graphical models and cognitive science
  – uses of graphical models
  – an example: causal induction
Graphical models
• Express the probabilistic dependency structure among a set of variables (Pearl, 1988)
• Consist of
  – a set of nodes, corresponding to variables
  – a set of edges, indicating dependency
  – a set of functions defined on the graph that specify a probability distribution
Undirected graphical models
• Consist of
  – a set of nodes
  – a set of edges
  – a potential for each clique, multiplied together to yield the distribution over variables
• Examples
  – statistical physics: Ising model, spin glasses
  – early neural networks (e.g. Boltzmann machines)
[Undirected graph over X1, X2, X3, X4, X5]
Directed graphical models
[Directed graph over X1, X2, X3, X4, X5]
• Consist of
  – a set of nodes
  – a set of edges
  – a conditional probability distribution for each node, conditioned on its parents, multiplied together to yield the distribution over variables
• Constrained to directed acyclic graphs (DAGs)
• Called Bayesian networks or Bayes nets
Bayesian networks and Bayes
• Two different problems
  – Bayesian statistics is a method of inference
  – Bayesian networks are a form of representation
• There is no necessary connection
  – many users of Bayesian networks rely upon frequentist statistical methods
  – many Bayesian inferences cannot be easily represented using Bayesian networks
Properties of Bayesian networks
• Efficient representation and inference
  – exploiting dependency structure makes it easier to represent and compute with probabilities
• Explaining away
  – a pattern of probabilistic reasoning characteristic of Bayesian networks, especially in their early use in AI
Efficient representation and inference
Four random variables:
X1 coin toss produces heads
X2 pencil levitates
X3 friend has psychic powers
X4 friend has two-headed coin
[Graph: X4 → X1, X3 → X1, X3 → X2, with P(x4), P(x3), P(x1|x3, x4), P(x2|x3)]
The Markov assumption
Every node is conditionally independent of its non-descendants, given its parents
P(xi | xi+1, …, xk) = P(xi | Pa(Xi))
where Pa(Xi) is the set of parents of Xi
P(x1, …, xk) = Π_{i=1..k} P(xi | Pa(Xi))   (via the product rule)
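The factorization can be sketched in Python for this four-variable network. The CPT values below are made-up illustrative numbers, not anything stated on the slides:

```python
from itertools import product

# Factorization for the network X4 -> X1 <- X3 -> X2:
# P(x1, x2, x3, x4) = P(x1 | x3, x4) P(x2 | x3) P(x3) P(x4).
# CPT values are made-up numbers for illustration only.
p3 = 0.01                                  # P(x3 = 1): psychic powers
p4 = 0.1                                   # P(x4 = 1): two-headed coin
p2_given_3 = {0: 0.0, 1: 0.9}              # P(x2 = 1 | x3)
p1_given_34 = {(0, 0): 0.5, (0, 1): 1.0,   # P(x1 = 1 | x3, x4)
               (1, 0): 0.9, (1, 1): 1.0}

def bern(p, x):
    # probability mass of outcome x under a Bernoulli(p) variable
    return p if x == 1 else 1 - p

def joint(x1, x2, x3, x4):
    return (bern(p1_given_34[(x3, x4)], x1) * bern(p2_given_3[x3], x2)
            * bern(p3, x3) * bern(p4, x4))

# 1 + 1 + 2 + 4 = 8 numbers specify all 16 joint probabilities:
total = sum(joint(*xs) for xs in product([0, 1], repeat=4))
print(total)  # sums to 1 over all 16 configurations (up to floating point)
```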
Efficient representation and inference
Four random variables:
X1 coin toss produces heads
X2 pencil levitates
X3 friend has psychic powers
X4 friend has two-headed coin
[Graph: X4 → X1, X3 → X1, X3 → X2, with P(x4), P(x3), P(x1|x3, x4), P(x2|x3)]
P(x1, x2, x3, x4) = P(x1|x3, x4)P(x2|x3)P(x3)P(x4)
P(x4): 1 number, P(x3): 1, P(x2|x3): 2, P(x1|x3, x4): 4
total = 8 numbers (vs 15)
Reading a Bayesian network
• The structure of a Bayes net can be read as the generative process behind a distribution
• Gives the joint probability distribution over variables obtained by sampling each variable conditioned on its parents
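Reading the network as a generative process corresponds to ancestral sampling: visit parents before children, sampling each variable from its conditional distribution. A minimal sketch, with invented probabilities:

```python
import random

# Ancestral sampling: sample each variable conditioned on its parents,
# visiting parents before children (x3, x4 before x1, x2).
# The probabilities are invented for illustration.
random.seed(0)  # for reproducibility

def sample():
    x3 = random.random() < 0.01                     # psychic powers
    x4 = random.random() < 0.1                      # two-headed coin
    x2 = random.random() < (0.9 if x3 else 0.0)     # pencil levitates
    p_heads = 1.0 if x4 else (0.9 if x3 else 0.5)   # P(x1 = 1 | x3, x4)
    x1 = random.random() < p_heads
    return int(x1), int(x2), int(x3), int(x4)

samples = [sample() for _ in range(10000)]
# Empirical frequencies approach the marginals defined by the CPTs:
print(sum(s[3] for s in samples) / len(samples))  # close to P(x4 = 1) = 0.1
```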
Reading a Bayesian network
Four random variables:
X1 coin toss produces heads
X2 pencil levitates
X3 friend has psychic powers
X4 friend has two-headed coin
[Graph: X4 → X1, X3 → X1, X3 → X2, with P(x4), P(x3), P(x1|x3, x4), P(x2|x3)]
P(x1, x2, x3, x4) = P(x1|x3, x4)P(x2|x3)P(x3)P(x4)
Reading a Bayesian network
• The structure of a Bayes net can be read as the generative process behind a distribution
• Gives the joint probability distribution over variables obtained by sampling each variable conditioned on its parents
• Simple rules for determining whether two variables are dependent or independent
• Independence makes inference more efficient
Computing with Bayes nets
[Graph: X4 → X1, X3 → X1, X3 → X2, with P(x4), P(x3), P(x1|x3, x4), P(x2|x3)]
P(x1, x2, x3, x4) = P(x1|x3, x4)P(x2|x3)P(x3)P(x4)
P(x2, x3, x4 | x1 = 1) = P(x1 = 1, x2, x3, x4) / P(x1 = 1)
Computing with Bayes nets
[Graph: X4 → X1, X3 → X1, X3 → X2, with P(x4), P(x3), P(x1|x3, x4), P(x2|x3)]
P(x1, x2, x3, x4) = P(x1|x3, x4)P(x2|x3)P(x3)P(x4)
P(x1 = 1) = Σ_{x2, x3, x4} P(x1 = 1, x2, x3, x4)   (sum over 8 values)
Computing with Bayes nets
[Graph: X4 → X1, X3 → X1, X3 → X2, with P(x4), P(x3), P(x1|x3, x4), P(x2|x3)]
P(x1, x2, x3, x4) = P(x1|x3, x4)P(x2|x3)P(x3)P(x4)
P(x1 = 1) = Σ_{x2, x3, x4} P(x1 = 1 | x3, x4) P(x2 | x3) P(x3) P(x4)
Computing with Bayes nets
[Graph: X4 → X1, X3 → X1, X3 → X2, with P(x4), P(x3), P(x1|x3, x4), P(x2|x3)]
P(x1, x2, x3, x4) = P(x1|x3, x4)P(x2|x3)P(x3)P(x4)
P(x1 = 1) = Σ_{x3, x4} P(x1 = 1 | x3, x4) P(x3) P(x4)   (sum over 4 values)
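The saving from pushing sums inside the factorization can be checked numerically; the CPT numbers below are made up purely for illustration:

```python
from itertools import product

# Naive vs. structured computation of P(x1 = 1), with made-up CPT numbers.
p3, p4 = 0.01, 0.1
p2 = {0: 0.0, 1: 0.9}                                      # P(x2 = 1 | x3)
p1 = {(0, 0): 0.5, (0, 1): 1.0, (1, 0): 0.9, (1, 1): 1.0}  # P(x1 = 1 | x3, x4)
b = lambda p, x: p if x else 1 - p                         # Bernoulli mass

# Naive: sum the joint over all 8 settings of (x2, x3, x4).
naive = sum(p1[(x3, x4)] * b(p2[x3], x2) * b(p3, x3) * b(p4, x4)
            for x2, x3, x4 in product([0, 1], repeat=3))

# Structured: x2 sums out to 1, leaving only 4 terms over (x3, x4).
structured = sum(p1[(x3, x4)] * b(p3, x3) * b(p4, x4)
                 for x3, x4 in product([0, 1], repeat=2))

print(abs(naive - structured) < 1e-12)  # True: same answer, half the terms
```

This is the idea behind variable elimination: irrelevant variables sum out before they can multiply the work.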
Computing with Bayes nets
• Inference algorithms for Bayesian networks exploit dependency structure
• Message-passing algorithms
  – “belief propagation” passes simple messages between nodes; exact for tree-structured networks
• More general inference algorithms
  – exact: “junction tree”
  – approximate: Monte Carlo schemes (see Part IV)
Explaining away
[Graph: Rain → Grass Wet ← Sprinkler]
Assume grass will be wet if and only if it rained last night, or if the sprinklers were left on:
P(R, S, W) = P(R) P(S) P(W | R, S)
P(W = w | R, S) = 1 if R = r or S = s, and 0 if R = ¬r and S = ¬s
Explaining away
[Graph: Rain → Grass Wet ← Sprinkler]
Compute probability it rained last night, given that the grass is wet:
P(r | w) = P(w | r) P(r) / P(w)
Explaining away
[Graph: Rain → Grass Wet ← Sprinkler]
Compute probability it rained last night, given that the grass is wet:
P(r | w) = P(w | r) P(r) / Σ_{r′, s′} P(w | r′, s′) P(r′, s′)
Explaining away
[Graph: Rain → Grass Wet ← Sprinkler]
Compute probability it rained last night, given that the grass is wet:
P(r | w) = P(r) / [P(r, s) + P(r, ¬s) + P(¬r, s)]
Explaining away
[Graph: Rain → Grass Wet ← Sprinkler]
Compute probability it rained last night, given that the grass is wet:
P(r | w) = P(r) / [P(r) + P(¬r, s)]
Explaining away
[Graph: Rain → Grass Wet ← Sprinkler]
Compute probability it rained last night, given that the grass is wet:
P(r | w) = P(r) / [P(r) + P(¬r) P(s)] > P(r)
The result lies between P(r) and 1, depending on P(s).
Explaining away
[Graph: Rain → Grass Wet ← Sprinkler]
Compute probability it rained last night, given that the grass is wet and sprinklers were left on:
P(r | w, s) = P(w | r, s) P(r | s) / P(w | s)
Both P(w | r, s) and P(w | s) equal 1.
Explaining away
[Graph: Rain → Grass Wet ← Sprinkler]
Compute probability it rained last night, given that the grass is wet and sprinklers were left on:
P(r | w, s) = P(r | s) = P(r)
Explaining away
[Graph: Rain → Grass Wet ← Sprinkler]
P(r | w, s) = P(r | s) = P(r), whereas P(r | w) = P(r) / [P(r) + P(¬r) P(s)] > P(r)
“Discounting” to prior probability.
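The derivation can be checked numerically by enumerating the joint distribution; the priors for rain and sprinkler below are made-up examples:

```python
from itertools import product

# Numerical check of explaining away, with made-up priors:
# W = 1 iff R = 1 or S = 1, and R, S independent a priori.
p_r, p_s = 0.2, 0.4

def joint(r, s, w):
    p_w = 1.0 if (r or s) else 0.0                 # W = 1 iff R or S
    return ((p_r if r else 1 - p_r) * (p_s if s else 1 - p_s)
            * (p_w if w else 1 - p_w))

# P(r = 1 | w = 1): marginalize out S
p_r_given_w = (sum(joint(1, s, 1) for s in (0, 1))
               / sum(joint(r, s, 1) for r, s in product((0, 1), repeat=2)))
print(p_r_given_w)   # ≈ 0.385 > P(r) = 0.2: wet grass raises belief in rain

# P(r = 1 | w = 1, s = 1): the sprinkler explains the wet grass away
p_r_given_ws = joint(1, 1, 1) / sum(joint(r, 1, 1) for r in (0, 1))
print(p_r_given_ws)  # ≈ 0.2: discounted back to the prior P(r)
```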
Contrast w/ production system
[Graph: Rain → Grass Wet ← Sprinkler]
• Formulate IF-THEN rules:
  – IF Rain THEN Wet
  – IF Wet THEN Rain
  – IF Wet AND NOT Sprinkler THEN Rain
• Rules do not distinguish directions of inference
• Requires a combinatorial explosion of rules
Contrast w/ spreading activation
[Graph: Rain → Grass Wet ← Sprinkler]
• Excitatory links: Rain → Wet, Sprinkler → Wet
• Observing rain, Wet becomes more active.
• Observing grass wet, Rain and Sprinkler become more active.
• Observing grass wet and sprinkler, Rain cannot become less active. No explaining away!
Contrast w/ spreading activation
[Graph: Rain → Grass Wet ← Sprinkler, with an inhibitory Rain–Sprinkler link]
• Excitatory links: Rain → Wet, Sprinkler → Wet
• Inhibitory link: Rain – Sprinkler
• Observing grass wet, Rain and Sprinkler become more active.
• Observing grass wet and sprinkler, Rain becomes less active: explaining away.
Contrast w/ spreading activation
[Graph: Rain, Sprinkler, and Burst pipe all → Grass Wet]
• Each new variable requires more inhibitory connections
• Not modular
  – whether a connection exists depends on what others exist
  – big holism problem: combinatorial explosion

Contrast w/ spreading activation
[Figure: interactive activation network]
(McClelland & Rumelhart, 1981)
Graphical models
• Capture dependency structure in distributions
• Provide an efficient means of representing and reasoning with probabilities
• Allow kinds of inference that are problematic for other representations: explaining away
  – hard to capture in a production system
  – more natural than with spreading activation
Part II: Graphical models
• Introduction to graphical models
  – representation and inference
• Causal graphical models
  – causality
  – learning about causal relationships
• Graphical models and cognitive science
  – uses of graphical models
  – an example: causal induction
Causal graphical models
• Graphical models represent statistical dependencies among variables (i.e. correlations)
  – can answer questions about observations
• Causal graphical models represent causal dependencies among variables (Pearl, 2000)
  – express underlying causal structure
  – can answer questions about both observations and interventions (actions upon a variable)
Bayesian networks
Four random variables:
X1 coin toss produces heads
X2 pencil levitates
X3 friend has psychic powers
X4 friend has two-headed coin
[Graph: X4 → X1, X3 → X1, X3 → X2, with P(x4), P(x3), P(x1|x3, x4), P(x2|x3)]
Nodes: variables
Links: dependency
Each node has a conditional probability distribution
Data: observations of x1, ..., x4
Causal Bayesian networks
Four random variables:
X1 coin toss produces heads
X2 pencil levitates
X3 friend has psychic powers
X4 friend has two-headed coin
[Graph: X4 → X1, X3 → X1, X3 → X2, with P(x4), P(x3), P(x1|x3, x4), P(x2|x3)]
Nodes: variables
Links: causality
Each node has a conditional probability distribution
Data: observations of and interventions on x1, ..., x4
Interventions
Four random variables:
X1 coin toss produces heads
X2 pencil levitates
X3 friend has psychic powers
X4 friend has two-headed coin
[Graph: X4 → X1, X3 → X1, X3 → X2; the intervention “hold down pencil” sets X2 and cuts the link X3 → X2]
Cut all incoming links for the node that we intervene on
Compute probabilities with the “mutilated” Bayes net
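A minimal numeric sketch of why the mutilated network matters, using invented CPT values for the pencil example: observing the pencil is evidence about psychic powers, but intervening on it is not.

```python
# Sketch of the "mutilated" network idea: intervening on x2 (holding the
# pencil down) cuts the link x3 -> x2, so the intervention carries no
# evidence about x3. CPT numbers are made up for illustration.
p3 = 0.01                          # P(x3 = 1): friend has psychic powers
p2 = {0: 0.0, 1: 0.9}              # P(x2 = 1 | x3): pencil levitates

# Observing x2 = 0 changes our belief about x3 (Bayes' rule):
num = (1 - p2[1]) * p3
p_obs = num / (num + (1 - p2[0]) * (1 - p3))
print(p_obs)   # ≈ 0.001: observation is evidence against psychic powers

# Intervening, do(x2 = 0): the incoming link is cut, belief is unchanged.
p_int = p3
print(p_int)   # P(x3 = 1 | do(x2 = 0)) = P(x3 = 1) = 0.01
```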
Learning causal graphical models
• Strength: how strong is a relationship?
• Structure: does a relationship exist?
[Graphs: h1 = B → E ← C; h0 = B → E]
Causal structure vs. causal strength
• Strength: how strong is a relationship?
[Graphs: h1 = B → E ← C; h0 = B → E]
Causal structure vs. causal strength
• Strength: how strong is a relationship?
  – requires defining the nature of the relationship
[Graphs: h1 = B → E ← C with strengths w0, w1; h0 = B → E with strength w0]
Parameterization
• Structures: h1 = B → E ← C;  h0 = B → E
• Parameterization (generic):

C B    h1: P(E = 1 | C, B)    h0: P(E = 1 | C, B)
0 0           p00                     p0
1 0           p10                     p0
0 1           p01                     p1
1 1           p11                     p1
Parameterization
• Structures: h1 = B → E ← C;  h0 = B → E
  (w0, w1: strength parameters for B, C)
• Parameterization (linear):

C B    h1: P(E = 1 | C, B)    h0: P(E = 1 | C, B)
0 0            0                      0
1 0            w1                     0
0 1            w0                     w0
1 1         w1 + w0                   w0
Parameterization
• Structures: h1 = B → E ← C;  h0 = B → E
  (w0, w1: strength parameters for B, C)
• Parameterization (“noisy-OR”):

C B    h1: P(E = 1 | C, B)    h0: P(E = 1 | C, B)
0 0            0                      0
1 0            w1                     0
0 1            w0                     w0
1 1    w1 + w0 – w1 w0                w0
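The noisy-OR rule is easy to state as code: each present cause independently fails with probability 1 – w, and the effect occurs unless every present cause fails. The strengths w0, w1 below are example values:

```python
def noisy_or(w0, w1, b, c):
    """P(E = 1 | B = b, C = c) under the noisy-OR parameterization:
    each present cause independently fails to produce E with
    probability (1 - w); E occurs unless every present cause fails."""
    return 1 - (1 - w0) ** b * (1 - w1) ** c

# Reproduces the h1 column of the table (example strengths):
w0, w1 = 0.3, 0.5
print(noisy_or(w0, w1, 0, 0))   # 0.0
print(noisy_or(w0, w1, 0, 1))   # 0.5   (= w1)
print(noisy_or(w0, w1, 1, 0))   # ≈ 0.3  (= w0)
print(noisy_or(w0, w1, 1, 1))   # ≈ 0.65 (= w1 + w0 - w1*w0)
```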
Parameter estimation
• Maximum likelihood estimation:
maximize Π_i P(bi, ci, ei; w0, w1)
• Bayesian methods: as in Part I
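A sketch of maximum-likelihood estimation of the noisy-OR strengths by brute-force grid search, on an invented contingency table (the background cause B is assumed always present):

```python
from itertools import product

# Maximum-likelihood estimation of (w0, w1) for the noisy-OR model by
# grid search. The data are an invented contingency table, recorded as
# (b, c, e, count) with the background cause B always present.
data = [(1, 1, 1, 6), (1, 1, 0, 2),   # C present: 6 of 8 trials show E
        (1, 0, 1, 2), (1, 0, 0, 6)]   # C absent:  2 of 8 trials show E

def likelihood(w0, w1):
    L = 1.0
    for b, c, e, n in data:
        p = 1 - (1 - w0) ** b * (1 - w1) ** c   # noisy-OR P(e = 1 | b, c)
        L *= (p if e else 1 - p) ** n
    return L

grid = [i / 100 for i in range(101)]
w0_hat, w1_hat = max(product(grid, grid), key=lambda w: likelihood(*w))
print(w0_hat, w1_hat)  # ≈ 0.25 and ≈ 0.67 for these counts
```

For these counts the estimate of w1 coincides with causal power, as the later slides note for the noisy-OR parameterization.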
Causal structure vs. causal strength
• Structure: does a relationship exist?
[Graphs: h1 = B → E ← C; h0 = B → E]
Approaches to structure learning
• Constraint-based:
  – dependency from statistical tests (e.g. χ²)
  – deduce structure from dependencies
[Graph: B → E ← C]
(Pearl, 2000; Spirtes et al., 1993)
Approaches to structure learning
[Graph: B → E ← C]
• Constraint-based:
  – dependency from statistical tests (e.g. χ²)
  – deduce structure from dependencies
(Pearl, 2000; Spirtes et al., 1993)
Attempts to reduce an inductive problem to a deductive problem
Approaches to structure learning
• Constraint-based:
  – dependency from statistical tests (e.g. χ²)
  – deduce structure from dependencies
(Pearl, 2000; Spirtes et al., 1993)
• Bayesian:
  – compute posterior probability of structures, given observed data
P(h | data) ∝ P(data | h) P(h)
[Compare P(h1 | data) with P(h0 | data), for h1 = B → E ← C and h0 = B → E]
(Heckerman, 1998; Friedman, 1999)
Bayesian Occam’s Razor
[Plot: P(d | h) across all possible data sets d, for h0 (no relationship) and h1 (relationship)]
For any model h, Σ_d P(d | h) = 1
Causal graphical models
• Extend graphical models to deal with interventions as well as observations
• Respecting the direction of causality results in efficient representation and inference
• Two steps in learning causal models
  – strength: parameter estimation
  – structure: structure learning
Part II: Graphical models
• Introduction to graphical models
  – representation and inference
• Causal graphical models
  – causality
  – learning about causal relationships
• Graphical models and cognitive science
  – uses of graphical models
  – an example: causal induction
Uses of graphical models
• Understanding existing cognitive models
  – e.g., neural network models
• Representation and reasoning
  – a way to address holism in induction (cf. Fodor)
• Defining generative models
  – mixture models, language models (see Part IV)
• Modeling human causal reasoning
Human causal reasoning
• How do people reason about interventions? (Gopnik, Glymour, Sobel, Schulz, Kushnir & Danks, 2004; Lagnado & Sloman, 2004; Sloman & Lagnado, 2005; Steyvers, Tenenbaum, Wagenmakers & Blum, 2003)
• How do people learn about causal relationships?
  – parameter estimation (Shanks, 1995; Cheng, 1997)
  – constraint-based models (Glymour, 2001)
  – Bayesian structure learning (Steyvers et al., 2003; Griffiths & Tenenbaum, 2005)
Causation from contingencies
“Does C cause E?” (rate on a scale from 0 to 100)

                 E present (e+)   E absent (e-)
C present (c+)         a                b
C absent (c-)          c                d
Two models of causal judgment
• ΔP (Jenkins & Ward, 1965):
ΔP ≡ P(e+ | c+) − P(e+ | c−) = a/(a + b) − c/(c + d)
• Power PC (Cheng, 1997):
power ≡ ΔP / (1 − P(e+ | c−)) = ΔP / (d/(c + d))
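Both measures follow directly from the contingency table; a small sketch with an invented example table a = 6, b = 2, c = 2, d = 6:

```python
def delta_p(a, b, c, d):
    # ΔP = P(e+ | c+) - P(e+ | c-) = a/(a+b) - c/(c+d)
    return a / (a + b) - c / (c + d)

def causal_power(a, b, c, d):
    # power = ΔP / (1 - P(e+ | c-))   (Cheng, 1997)
    return delta_p(a, b, c, d) / (1 - c / (c + d))

# Example contingency table: a = 6, b = 2, c = 2, d = 6
print(delta_p(6, 2, 2, 6))       # 0.5
print(causal_power(6, 2, 2, 6))  # ≈ 0.667
```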
Buehner and Cheng (1997)
[Plot: mean human judgments (“People”) against the predictions of ΔP and causal power, for ΔP from 0.00 to 1.00]
Buehner and Cheng (1997)
[Plot: People vs. ΔP and Power]
Constant ΔP, changing judgments
Buehner and Cheng (1997)
[Plot: People vs. ΔP and Power]
Constant causal power, changing judgments
Buehner and Cheng (1997)
[Plot: People vs. ΔP and Power]
ΔP = 0, changing judgments
Causal structure vs. causal strength
• Strength: how strong is a relationship?
• Structure: does a relationship exist?
[Graphs: h1 = B → E ← C with strengths w0, w1; h0 = B → E with strength w0]
Causal strength
• Assume the structure h1 = B → E ← C, with strengths w0 (for B) and w1 (for C)
• ΔP and causal power are maximum likelihood estimates of the strength parameter w1, under different parameterizations for P(E | B, C):
  – linear: ΔP
  – noisy-OR: causal power
Causal structure
• Hypotheses: h1 = B → E ← C;  h0 = B → E
• Bayesian causal inference:
support = log [ P(d | h1) / P(d | h0) ]
P(d | h1) = ∫₀¹ ∫₀¹ P(d | w0, w1) p(w0, w1 | h1) dw0 dw1
P(d | h0) = ∫₀¹ P(d | w0) p(w0 | h0) dw0
The likelihood ratio (Bayes factor) P(d | h1) / P(d | h0) gives the evidence in favor of h1.
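Causal support can be approximated numerically, with midpoint Riemann sums standing in for the integrals; the counts (a, b, c, d) below are an invented example:

```python
import math

# Causal support: log Bayes factor comparing h1 (B, C -> E) against
# h0 (B -> E), with noisy-OR likelihoods and uniform priors on the
# strengths. Integrals approximated by midpoint Riemann sums;
# the counts (a, b, c, d) are an invented example.
a, b, c, d = 6, 2, 2, 6   # e+/e- counts with C present (a, b) and absent (c, d)

def lik(w0, w1):
    p1 = w0 + w1 - w0 * w1          # P(e+ | c+) under noisy-OR
    p0 = w0                         # P(e+ | c-)
    return p1**a * (1 - p1)**b * p0**c * (1 - p0)**d

n = 200
grid = [(i + 0.5) / n for i in range(n)]        # midpoints of [0, 1]
p_d_h1 = sum(lik(w0, w1) for w0 in grid for w1 in grid) / n**2
p_d_h0 = sum(lik(w0, 0.0) for w0 in grid) / n
support = math.log(p_d_h1 / p_d_h0)
print(support > 0)  # True: these data favor the B -> E <- C structure
```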
Buehner and Cheng (1997)
[Plot: model fits to human judgments – ΔP (r = 0.89), Power (r = 0.88), Support (r = 0.97)]
The importance of parameterization
• Noisy-OR incorporates mechanism assumptions (Cheng, 1997):
  – generativity: causes increase the probability of their effects
  – each cause is sufficient to produce the effect
  – causes act via independent mechanisms
• Consider other models:
  – statistical dependence: χ² test
  – generic parameterization (cf. Anderson, 1990)
[Plot: model fits to human judgments – Support (noisy-OR), χ², Support (generic)]
Generativity is essential
• Predictions result from a “ceiling effect”
  – ceiling effects only matter if you believe a cause increases the probability of an effect
[Plot: support for contingencies with P(e+|c+) = P(e+|c−) = 8/8, 6/8, 4/8, 2/8, 0/8]
Blicket detector (Dave Sobel, Alison Gopnik, and colleagues)
“See this? It’s a blicket machine. Blickets make it go.”
“Let’s put this one on the machine. Oooh, it’s a blicket!”
“Backwards blocking” (Sobel, Tenenbaum & Gopnik, 2004)
– Two objects: A and B
– Trial 1: A and B on detector – detector active
– Trial 2: A on detector – detector active
– 4-year-olds judge whether each object is a blicket
  • A: a blicket (100% say yes)
  • B: probably not a blicket (34% say yes)
[Figure: AB trial, A trial]
Possible hypotheses
[Figure: the possible causal graphs relating A, B, and E]
Bayesian inference
• Evaluating causal models in light of data:
P(hi | d) = P(d | hi) P(hi) / Σ_j P(d | hj) P(hj)
• Inferring a particular causal relation:
P(A → E | d) = Σ_{hj ∈ H} P(A → E | hj) P(hj | d)
Bayesian inference
With a uniform prior on hypotheses, and the generic parameterization:
[Bar chart: probability that A and B are blickets; values shown: 0.32, 0.32, 0.34, 0.34]
Modeling backwards blocking
Assume…
• Links can only exist from blocks to detectors
• Blocks are blickets with prior probability q
• Blickets always activate detectors, but detectors never activate on their own
  – deterministic noisy-OR, with wi = 1 and w0 = 0
Modeling backwards blocking
[The four hypotheses: h00 (no links), h01 (B → E), h10 (A → E), h11 (A → E and B → E)]

                     h00  h01  h10  h11
P(E=1 | A=0, B=0):    0    0    0    0
P(E=1 | A=1, B=0):    0    0    1    1
P(E=1 | A=0, B=1):    0    1    0    1
P(E=1 | A=1, B=1):    0    1    1    1

Priors: P(h00) = (1 – q)², P(h01) = (1 – q)q, P(h10) = q(1 – q), P(h11) = q²

Before any data: P(B → E | d) = P(h01) + P(h11) = q

After the AB trial (E = 1 with A = 1, B = 1), h00 is ruled out:
P(B → E | d) = [P(h01) + P(h11)] / [P(h01) + P(h10) + P(h11)] = q / (q + q(1 – q))
Modeling backwards blocking
[The four hypotheses, with priors P(h00) = (1 – q)², P(h01) = (1 – q)q, P(h10) = q(1 – q), P(h11) = q²]
P(E=1 | A=1, B=1): 0 under h00; 1 under h01, h10, h11
The AB trial is consistent with h01, h10, and h11.
Modeling backwards blocking
[Remaining hypotheses, with priors P(h01) = (1 – q)q, P(h10) = q(1 – q), P(h11) = q²]
P(E=1 | A=1, B=0): 0 under h01; 1 under h10, h11
P(E=1 | A=1, B=1): 1 under all three
The A trial rules out h01:
P(B → E | d) = P(h11) / [P(h10) + P(h11)] = q
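The backwards-blocking computation can be reproduced in a few lines of Python; q = 0.5 is an arbitrary example prior:

```python
# Posterior over the four hypotheses about links A -> E and B -> E,
# under the stated assumptions: deterministic detector (w = 1, no false
# alarms) and prior probability q that each block is a blicket.
q = 0.5  # arbitrary example prior

def p_e(a_link, b_link, a, b):
    # detector activates iff some block on it has a link to E
    return 1.0 if (a_link and a) or (b_link and b) else 0.0

# hypotheses keyed by (A -> E present, B -> E present)
hyps = {(0, 0): (1 - q) ** 2, (1, 0): q * (1 - q),
        (0, 1): (1 - q) * q, (1, 1): q ** 2}

def posterior_b(trials):
    # P(B -> E | trials), each trial a tuple (a, b, e)
    post = dict(hyps)
    for a, b, e in trials:
        for h in post:
            lik = p_e(h[0], h[1], a, b)
            post[h] *= lik if e else 1 - lik
    z = sum(post.values())
    return (post[(0, 1)] + post[(1, 1)]) / z

print(posterior_b([(1, 1, 1)]))             # 2/3 = q / (q + q(1 - q)) at q = 0.5
print(posterior_b([(1, 1, 1), (1, 0, 1)]))  # 0.5: back to the prior q
```

The A trial raises belief in A → E to certainty, and B's evidence is "blocked" backwards: its probability of being a blicket drops back to the prior q.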
Manipulating prior probability (Tenenbaum, Sobel, Griffiths, & Gopnik, submitted)
[Figure: judgments for A and B across the Initial, AB trial, and A trial phases]
Summary
• Graphical models provide solutions to many of the challenges of probabilistic models
  – defining structured distributions
  – representing distributions on many variables
  – efficiently computing probabilities
• Causal graphical models provide tools for defining rational models of human causal reasoning and learning