CS B351: LEARNING PROBABILISTIC MODELS
MOTIVATION
Past lectures have studied how to infer characteristics of a distribution, given a fully-specified Bayes net
Next few lectures: where does the Bayes net come from?
[Figures: three increasingly detailed Bayes nets for predicting Win? — (1) Win? depends on Strength and Opponent Strength; (2) Win? depends on offense/defense strengths, opponent offense/defense strengths, pass yds, rush yds, rush yds allowed, and score allowed; (3) the same network extended with strength of schedule, at home?, injuries?, and opponent injuries?]
AGENDA
Learning probability distributions from example data
Influence of structure on performance
Maximum likelihood estimation (MLE)
Bayesian estimation
PROBABILISTIC ESTIMATION PROBLEM
Our setting: given a set of examples drawn from the target distribution. Each example is complete (fully observable).
Goal: produce a representation of a belief state so that we can perform inferences and make predictions.
DENSITY ESTIMATION
Given a dataset D = {d[1], …, d[M]} drawn from an underlying distribution P*
Find a distribution that matches P* as "closely" as possible
High-level issues:
Usually there is not enough data to get an accurate picture of P*, which forces us to approximate.
Even if we did have P*, how do we define "closeness" (both theoretically and in practice)?
How do we maximize "closeness"?
WHAT CLASS OF PROBABILITY MODELS?
For small discrete distributions, just use a tabular representation; very efficient learning techniques exist.
For large discrete distributions or continuous ones, the choice of probability model is crucial.
Increasing complexity =>
Can represent complex distributions more accurately
Need more data to learn well (risk of overfitting)
More expensive to learn and to perform inference
TWO LEARNING PROBLEMS
Parameter learning
What entries should be put into the model's probability tables?
Structure learning
Which variables should be represented / transformed for inclusion in the model?
What direct / indirect relationships between variables should be modeled?
A more "high level" problem
Once a structure is chosen, a set of (unestimated) parameters emerges; these must then be estimated using parameter learning
LEARNING COIN FLIPS
Cherry and lime candies are in an opaque bag
Observe that c out of N draws are cherries (data)
Intuition: c/N might be a good hypothesis for the fraction of cherries in the bag (or it might not, depending on the draw!)
"Intuitive" parameter estimate: the empirical distribution P(cherry) = c/N (this will be justified more thoroughly later)
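A minimal Python sketch (not part of the slides), assuming a hypothetical true cherry fraction q_true, illustrating how the empirical estimate c/N depends on the particular draw and tightens as N grows:

```python
import random

# Sketch: draw candies from a bag whose true cherry fraction is q_true,
# and compare the empirical estimate c/N to q_true for different sample sizes.
q_true = 0.7          # assumed "true" fraction of cherries (unknown in practice)
random.seed(0)

for N in [5, 20, 100, 1000]:
    draws = [random.random() < q_true for _ in range(N)]  # True = cherry
    c = sum(draws)
    print(f"N={N:5d}  empirical P(cherry) = c/N = {c/N:.3f}")
# Small samples can land far from q_true; the estimate tightens as N grows.
```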
STRUCTURE LEARNING EXAMPLE: HISTOGRAM BUCKET SIZES
Histograms are used to estimate distributions of continuous variables or variables with many discrete values… but how fine should the buckets be?
[Figure: the same data over the range 0–200, histogrammed with progressively coarser bucket sizes]
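A small sketch (illustrative, using NumPy and made-up data) of the bucket-size trade-off: many buckets give noisy, fragmented counts, while few buckets wash out the shape of the distribution.

```python
import numpy as np

# Sketch: the same data histogrammed with different bucket (bin) counts.
rng = np.random.default_rng(0)
data = rng.normal(loc=100, scale=25, size=200)    # illustrative sample on roughly [0, 200]

for bins in [40, 10, 4]:                           # fine -> coarse bucketing
    counts, edges = np.histogram(data, bins=bins, range=(0, 200))
    # Normalizing the counts gives an estimate of the probability per bucket.
    probs = counts / counts.sum()
    print(f"{bins:2d} buckets: max per-bucket probability = {probs.max():.2f}")
```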
STRUCTURE LEARNING: INDEPENDENCE RELATIONSHIPS
Compare table P(A,B,C,D) vs P(A)P(B)P(C)P(D)
Case 1: 15 free parameters (16 entries minus the sum-to-1 constraint)
P(A, B, C, D) = p1
P(A, B, C, ¬D) = p2
…
P(¬A, ¬B, ¬C, D) = p15
P(¬A, ¬B, ¬C, ¬D) = 1 − p1 − … − p15
Case 2: 4 free parameters
P(A) = p1, P(¬A) = 1 − p1
…
P(D) = p4, P(¬D) = 1 − p4
STRUCTURE LEARNING: INDEPENDENCE RELATIONSHIPS
Compare table P(A,B,C,D) vs P(A)P(B)P(C)P(D)
P(A,B,C,D): would be able to fit ALL relationships in the data
P(A)P(B)P(C)P(D): inherently lacks the ability to accurately model dependencies, such as A and B being correlated
This leads to biased estimates: overestimates or underestimates of the true probabilities
[Figure: bar charts comparing an original joint distribution P(X,Y) with the distribution learned under the independence assumption P(X)P(Y)]
STRUCTURE LEARNING: EXPRESSIVE POWER
Making more independence assumptions always makes a probabilistic model less expressive
If the independence assumptions made by structure A are a superset of those made by structure B, then B can express any probability distribution that A can
[Figure: candidate structures over X, Y, Z, and two naive Bayes-style structures over class C with features F1, F2, …, Fk — which is more expressive?]
ARCS DO NOT NECESSARILY ENCODE CAUSALITY!
[Figure: two Bayes nets, A → B → C and C → B → A]
Two BNs that can encode the same joint probability distribution
READING OFF INDEPENDENCE RELATIONSHIPS
Given B, does the value of A affect the probability of C? Is P(C|B,A) = P(C|B)?
No, it does not: C's parent (B) is given, so C is independent of its non-descendants (A)
Independence is symmetric: C ⊥ A | B  =>  A ⊥ C | B
[Figure: chain A → B → C]
LEARNING IN THE FACE OF NOISY DATA
Ex: flip two independent coins. Dataset of 20 flips: 3 HH, 6 HT, 5 TH, 6 TT
Model 1: X and Y independent
Model 2: X → Y
Parameters estimated via the empirical distribution ("intuitive fit"):
Model 1: P(X=H) = 9/20, P(Y=H) = 8/20
Model 2: P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11
Errors in Model 2's estimates are likely to be larger!
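A short sketch (not from the slides) that computes the "intuitive fit" for both models from the 20-flip dataset above:

```python
from collections import Counter

# The dataset from the slide: 3 HH, 6 HT, 5 TH, 6 TT (20 flips of two coins X, Y).
counts = Counter({('H', 'H'): 3, ('H', 'T'): 6, ('T', 'H'): 5, ('T', 'T'): 6})
N = sum(counts.values())

# Model 1: X and Y independent.
p_x_h = sum(v for (x, _), v in counts.items() if x == 'H') / N        # 9/20
p_y_h = sum(v for (_, y), v in counts.items() if y == 'H') / N        # 8/20

# Model 2: Y depends on X (one conditional per value of X).
n_x_h = sum(v for (x, _), v in counts.items() if x == 'H')            # 9
p_y_h_given_x_h = counts[('H', 'H')] / n_x_h                          # 3/9
p_y_h_given_x_t = counts[('T', 'H')] / (N - n_x_h)                    # 5/11

print(p_x_h, p_y_h, p_y_h_given_x_h, p_y_h_given_x_t)
```

Note how Model 2's conditional entries are each estimated from only 9 or 11 flips, so they are noisier than Model 1's estimates, which use all 20.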
STRUCTURE LEARNING: FIT VS COMPLEXITY
Must trade off fit of the data vs. complexity of the model
Complex models:
More parameters to learn
More expressive
More data fragmentation = greater sensitivity to noise
Typical approaches explore multiple structures while optimizing the trade-off between fit and complexity
Need a way of measuring "complexity" (e.g., number of edges, number of parameters) and "fit"
FURTHER READING ON STRUCTURE LEARNING
Structure learning with statistical independence testing
Score-based methods (e.g., Bayesian Information Criterion)
Bayesian methods with structure priors
Cross-validated model selection (more on this later)
STATISTICAL PARAMETER LEARNING
LEARNING COIN FLIPS
Observe that c out of N draws are cherries (data)
Let the unknown fraction of cherries be q (hypothesis)
Probability of drawing a cherry is q
Assumption: draws are independent and identically distributed (i.i.d.)
Probability of drawing 2 cherries is q · q = q^2
Probability of drawing 2 limes is (1−q)^2
Probability of drawing 1 cherry and 1 lime: q · (1−q)
LIKELIHOOD FUNCTION
Likelihood of data d = {d1, …, dN} given q:
P(d|q) = ∏_j P(d_j|q) = q^c (1−q)^(N−c)
(i.i.d. assumption; gather the c cherry terms together, then the N−c lime terms)
MAXIMUM LIKELIHOOD
Likelihood of data d = {d1, …, dN} given q: P(d|q) = q^c (1−q)^(N−c)
[Plots: the likelihood P(data|q) as a function of q for datasets of 1/1, 2/2, 2/3, 2/4, 2/5, 10/20, and 50/100 cherries]
MAXIMUM LIKELIHOOD
Peaks of the likelihood function seem to hover around the fraction of cherries…
Sharpness indicates some notion of certainty…
[Plot: likelihood P(data|q) vs. q for 50/100 cherries]
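A small sketch (illustrative, not from the slides) that evaluates P(d|q) = q^c (1−q)^(N−c) on a grid of q values for several of the datasets plotted above, showing where each curve peaks and how sharply:

```python
import numpy as np

# Sketch: evaluate the likelihood on a grid of q values for several (c, N) datasets.
qs = np.linspace(0.0, 1.0, 1001)

for c, N in [(1, 1), (2, 3), (2, 5), (10, 20), (50, 100)]:
    lik = qs**c * (1.0 - qs)**(N - c)
    q_peak = qs[np.argmax(lik)]
    # Fraction of the q range within 50% of the peak -- a crude measure of sharpness.
    width = (lik >= 0.5 * lik.max()).mean()
    print(f"c/N = {c:3d}/{N:3d}: peak at q = {q_peak:.2f}, half-height width ~ {width:.2f}")
```

The peak sits at c/N, and the half-height width shrinks as N grows, matching the plots.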
MAXIMUM LIKELIHOOD
P(d|q) is the likelihood function
The quantity argmax_q P(d|q) is known as the maximum likelihood estimate (MLE)
MAXIMUM LIKELIHOOD
l(q) = log P(d|q)
     = log [ q^c (1−q)^(N−c) ]
     = log [ q^c ] + log [ (1−q)^(N−c) ]
     = c log q + (N−c) log (1−q)
Setting dl/dq (q) = 0 gives the maximum likelihood estimate:
dl/dq (q) = c/q − (N−c)/(1−q) = 0  =>  q = c/N
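A quick sanity-check sketch (assuming SciPy is available; the counts c and N are made up) that compares the closed-form MLE c/N with a numerical maximization of the log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Sketch: verify q = c/N by numerically maximizing l(q) = c log q + (N-c) log(1-q).
c, N = 7, 20

def neg_log_likelihood(q):
    return -(c * np.log(q) + (N - c) * np.log(1.0 - q))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print("numerical MLE:", round(res.x, 4), " closed form c/N:", c / N)   # both ~ 0.35
```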
OTHER MLE RESULTS
Categorical distributions (non-binary discrete variables): take the fraction of counts for each value (a histogram)
Continuous Gaussian distributions: mean = average of the data; standard deviation = standard deviation of the data
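A minimal sketch of the Gaussian case (illustrative data; note that the MLE standard deviation divides by N, which is NumPy's default ddof=0):

```python
import numpy as np

# Sketch: the MLE for a Gaussian is the sample mean and the (biased, ddof=0)
# sample standard deviation of the data.
rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=2.0, size=500)

mu_mle = data.mean()
sigma_mle = data.std(ddof=0)   # MLE divides by N, not N-1
print(mu_mle, sigma_mle)
```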
AN ALTERNATIVE APPROACH: BAYESIAN ESTIMATION
P(q|d) = (1/Z) P(d|q) P(q) is the posterior: the distribution over hypotheses given the data
P(d|q) is the likelihood
P(q) is the hypothesis prior
[Figure: plate model with parameter q and observations d[1], d[2], …, d[M]]
ASSUMPTION: UNIFORM PRIOR, BERNOULLI DISTRIBUTION
Assume P(q) is uniform
P(q|d) = (1/Z) P(d|q) = (1/Z) q^c (1−q)^(N−c)
What's P(Y|D)?
[Figure: parameter q with observations d[1], d[2], …, d[M] and next draw Y]
ASSUMPTION: UNIFORM PRIOR, BERNOULLI DISTRIBUTION
=> Z = c! (N−c)! / (N+1)!
=> P(Y|d) = (1/Z) · (c+1)! (N−c)! / (N+2)! = (c+1) / (N+2)
[Figure: parameter q with observations d[1], d[2], …, d[M] and next draw Y]
Can think of this as a "correction" using "virtual counts"
NONUNIFORM PRIORS
P(q|d) ∝ P(d|q) P(q) = q^c (1−q)^(N−c) P(q)
Define, for all q, the prior probability P(q) that I believe in hypothesis q
[Figure: a prior density P(q) over q ∈ [0, 1]]
BETA DISTRIBUTION
Beta_{a,b}(q) = g q^(a−1) (1−q)^(b−1)
a, b are hyperparameters > 0
g is a normalization constant
a = b = 1 gives the uniform distribution
POSTERIOR WITH BETA PRIOR
Posterior ∝ q^c (1−q)^(N−c) P(q) = g q^(c+a−1) (1−q)^(N−c+b−1) = Beta_{a+c, b+N−c}(q)
Prediction = posterior mean: E[q] = (c+a) / (N+a+b)
POSTERIOR WITH BETA PRIOR
What does this mean?
The prior specifies a "virtual count" of a−1 heads and b−1 tails
See heads: increment a; see tails: increment b
The effect of the prior diminishes with more data
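A tiny sketch (example counts made up) of the Beta-prior update and the predictive mean (c+a)/(N+a+b); with a = b = 1 this reduces to the uniform-prior result (c+1)/(N+2):

```python
# Sketch: Bayesian update with a Beta(a, b) prior after observing c cherries in N draws.
def beta_posterior_prediction(c, N, a=1.0, b=1.0):
    # Posterior is Beta(a + c, b + N - c); the predictive probability of "cherry"
    # is the posterior mean (c + a) / (N + a + b).
    return (c + a) / (N + a + b)

print(beta_posterior_prediction(3, 4))              # uniform prior (a=b=1): 4/6 ~ 0.667
print(beta_posterior_prediction(3, 4, a=10, b=10))  # strong prior pulls the estimate toward 0.5
print(3 / 4)                                        # MLE for comparison
```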
CHOOSING A PRIOR
Part of the design process; must be chosen according to your intuition
Uninformed belief: a = b = 1; strong belief: a, b high
EXTENSIONS OF BETA PRIORS
Parameters of multi-valued (categorical) distributions, e.g. histograms: Dirichlet prior
The mathematical derivation is more complex, but in practice it still takes the form of "virtual counts"
[Figure: histograms learned from the same data with Dirichlet priors of increasing virtual counts (0, 1, 5, 10)]
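A minimal sketch (illustrative bucket counts, not the slide's data) of Dirichlet smoothing for a categorical/histogram estimate, where each bucket receives the same virtual count before normalizing:

```python
import numpy as np

# Sketch: Dirichlet smoothing for a categorical/histogram estimate.
# Each bucket gets "virtual_count" pseudo-observations added before normalizing.
observed_counts = np.array([0, 3, 7, 0, 1])   # illustrative bucket counts

for virtual_count in [0, 1, 5, 10]:
    smoothed = observed_counts + virtual_count
    probs = smoothed / smoothed.sum()
    print(virtual_count, np.round(probs, 3))
# Larger virtual counts flatten the estimate toward uniform; with virtual_count = 0
# empty buckets get probability 0 (the MLE / empirical histogram).
```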
RECAP
Learning probabilistic models
Parameter vs. structure learning
Single-parameter learning via coin flips
Maximum likelihood
Bayesian learning with a Beta prior
MAXIMUM LIKELIHOOD FOR BN
For any BN, the ML parameters of any CPT are given by the fraction of observed values in the data, conditioned on matching parent values
Alarm example: Earthquake → Alarm ← Burglar
N = 1000 samples; E observed in 500, B observed in 200
P(E) = 0.5, P(B) = 0.2
A|E,B: 19/20   A|¬E,B: 188/200   A|E,¬B: 170/500   A|¬E,¬B: 1/380

E B P(A|E,B)
T T 0.95
F T 0.95
T F 0.34
F F 0.003
FITTING CPTS
Each ML entry P(x_i | pa_{X_i}) is given by examining the counts of (x_i, pa_{X_i}) in D and normalizing across rows of the CPT
Note that for large k = |Pa_{X_i}|, very few datapoints will share any given parent assignment pa_{X_i}: on the order of O(|D| / 2^k) for binary parents, and some parent configurations may be even rarer
Large domains |Val(X_i)| can also be a problem
This is data fragmentation
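A short sketch of this count-and-normalize procedure (the variable names follow the alarm example above; the tiny dataset is hypothetical):

```python
from collections import Counter
from itertools import product

# Sketch: ML fit of P(A | E, B) from complete data by counting and normalizing
# within each parent configuration (E, B).
data = [  # (E, B, A) tuples -- a tiny illustrative dataset, not the slide's counts
    (1, 1, 1), (1, 1, 1), (1, 0, 0), (1, 0, 1),
    (0, 1, 1), (0, 1, 0), (0, 0, 0), (0, 0, 0),
]

counts = Counter()
for e, b, a in data:
    counts[(e, b, a)] += 1

for e, b in product([0, 1], repeat=2):
    n_parent = counts[(e, b, 0)] + counts[(e, b, 1)]
    if n_parent == 0:
        print(f"E={e}, B={b}: no data (data fragmentation!)")
        continue
    print(f"E={e}, B={b}: P(A=1|E,B) = {counts[(e, b, 1)] / n_parent:.2f}  (from {n_parent} samples)")
```

With many parents or large domains, some parent configurations get very few (or zero) samples, which is exactly the fragmentation problem noted above; the Dirichlet "virtual count" correction is one common remedy.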