STATISTICAL LEARNING (FROM DATA TO DISTRIBUTIONS)

TRANSCRIPT

Page 1: Statistical Learning (From data to distributions)

STATISTICAL LEARNING (FROM DATA TO DISTRIBUTIONS)

Page 2: Statistical Learning (From data to distributions)

AGENDA
- Learning a discrete probability distribution from data
- Maximum likelihood estimation (MLE)
- Maximum a posteriori (MAP) estimation

Page 3: Statistical Learning (From data to distributions)

MOTIVATION
- Agent has made observations (data)
- Now must make sense of it (hypotheses)
  - Hypotheses alone may be important (e.g., in basic science)
  - For inference (e.g., forecasting)
  - To take sensible actions (decision making)
- A basic component of economics, social and hard sciences, engineering, …

Page 4: Statistical Learning (From data to distributions)

MACHINE LEARNING VS. STATISTICS
- Machine learning: automated statistics
- This lecture:
  - Bayesian learning
  - Maximum likelihood (ML) learning
  - Maximum a posteriori (MAP) learning
  - Learning Bayes nets (R&N 20.1-3)
- Future lectures: try to do more with even less data
  - Decision tree learning
  - Neural nets
  - Support vector machines
  - …

Page 5: Statistical Learning (From data to distributions)

PUTTING GENERAL PRINCIPLES TO PRACTICE
- Note: this lecture will cover general principles of statistical learning on toy problems
- Grounded in some of the most theoretically principled approaches to learning
- But the techniques are far more general
- Practical application to larger problems requires a bit of mathematical savvy and “elbow grease”

Page 6: Statistical Learning (From data to distributions)

BEAUTIFUL RESULTS IN MATH GIVE RISE TO SIMPLE IDEAS

OR

HOW TO JUMP THROUGH HOOPS IN ORDER TO JUSTIFY SIMPLE IDEAS …

Page 7: Statistical Learning (From data to distributions)

CANDY EXAMPLE
- Candy comes in 2 flavors, cherry and lime, with identical wrappers
- Manufacturer makes 5 indistinguishable bags
- Suppose we draw…
- What bag are we holding? What flavor will we draw next?

h1: C 100%, L 0%
h2: C 75%, L 25%
h3: C 50%, L 50%
h4: C 25%, L 75%
h5: C 0%, L 100%

Page 8: Statistical Learning (From data to distributions)

BAYESIAN LEARNING
- Main idea: compute the probability of each hypothesis, given the data
- Data d: …
- Hypotheses: h1, …, h5 (as on the previous slide)

Page 9: Statistical Learning (From data to distributions)

BAYESIAN LEARNING
- Main idea: compute the probability of each hypothesis, given the data
- Data d: …
- Hypotheses: h1, …, h5 (as on the previous slide)
- We want P(hi|d)… but all we have is P(d|hi)!

Page 10: Statistical Learning (From data to distributions)

USING BAYES’ RULE
- P(hi|d) = α P(d|hi) P(hi) is the posterior
  - (Recall, 1/α = P(d) = Σi P(d|hi) P(hi))
- P(d|hi) is the likelihood
- P(hi) is the hypothesis prior

(Hypotheses h1–h5 as before.)

Page 11: Statistical Learning (From data to distributions)

LIKELIHOOD AND PRIOR
- Likelihood is the probability of observing the data, given the hypothesis model
- Hypothesis prior is the probability of a hypothesis, before having observed any data

Page 12: Statistical Learning (From data to distributions)

COMPUTING THE POSTERIOR
- Assume draws are independent
- Let P(h1), …, P(h5) = (0.1, 0.2, 0.4, 0.2, 0.1)
- d = { 10 × lime }

P(d|h1) = 0        P(d|h1)P(h1) = 0         P(h1|d) = 0
P(d|h2) = 0.25^10  P(d|h2)P(h2) ≈ 1.9e-7    P(h2|d) ≈ 0.00
P(d|h3) = 0.5^10   P(d|h3)P(h3) ≈ 3.9e-4    P(h3|d) ≈ 0.00
P(d|h4) = 0.75^10  P(d|h4)P(h4) ≈ 0.011     P(h4|d) ≈ 0.10
P(d|h5) = 1^10     P(d|h5)P(h5) = 0.1       P(h5|d) ≈ 0.90

Sum = 1/α ≈ 0.112
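A minimal Python sketch of the posterior computation above (the hypotheses, priors, and the 10-lime data come from the slides; the code itself is only illustrative):

```python
# Posterior over the five candy-bag hypotheses after observing 10 limes.
p_cherry = [1.00, 0.75, 0.50, 0.25, 0.00]   # P(cherry) under h1..h5
priors   = [0.1, 0.2, 0.4, 0.2, 0.1]        # P(h1), ..., P(h5)
n_limes  = 10                               # d = { 10 x lime }

# Likelihood of the data under each hypothesis: P(d|hi) = P(lime|hi)^10
likelihoods = [(1.0 - pc) ** n_limes for pc in p_cherry]

# Unnormalized posterior P(d|hi) P(hi); the normalizer is 1/alpha = P(d)
unnormalized = [lik * pr for lik, pr in zip(likelihoods, priors)]
p_data = sum(unnormalized)
posterior = [u / p_data for u in unnormalized]

for i, p in enumerate(posterior, start=1):
    print(f"P(h{i}|d) = {p:.2f}")           # roughly 0, 0.00, 0.00, 0.10, 0.90
```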

Page 13: Statistical Learning (From data to distributions)

POSTERIOR HYPOTHESES

Page 14: Statistical Learning (From data to distributions)

PREDICTING THE NEXT DRAW
- P(X|d) = Σi P(X|hi, d) P(hi|d) = Σi P(X|hi) P(hi|d)
- (Graphical model: H is a parent of both D and X, so X is independent of D given H)

P(h1|d) = 0, P(h2|d) ≈ 0.00, P(h3|d) ≈ 0.00, P(h4|d) ≈ 0.10, P(h5|d) ≈ 0.90
P(X|h1) = 0, P(X|h2) = 0.25, P(X|h3) = 0.5, P(X|h4) = 0.75, P(X|h5) = 1

Probability that the next candy drawn is a lime: P(X|d) ≈ 0.975
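A short continuation for the predictive distribution (again only a sketch; it recomputes the posterior from the previous block and then averages P(lime|hi) under it):

```python
# Bayesian prediction: P(X = lime | d) = sum_i P(lime|hi) P(hi|d).
p_cherry = [1.00, 0.75, 0.50, 0.25, 0.00]
priors   = [0.1, 0.2, 0.4, 0.2, 0.1]
n_limes  = 10

unnormalized = [(1.0 - pc) ** n_limes * pr for pc, pr in zip(p_cherry, priors)]
posterior = [u / sum(unnormalized) for u in unnormalized]

p_next_lime = sum((1.0 - pc) * ph for pc, ph in zip(p_cherry, posterior))
print(f"P(next candy is lime | d) = {p_next_lime:.3f}")
# Prints roughly 0.973; the slide's 0.975 presumably comes from the rounded posteriors 0.10 and 0.90.
```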

Page 15: Statistical Learning (From data to distributions)

P(NEXT CANDY IS LIME | D)

Page 16: Statistical Learning (From data to distributions)

PROPERTIES OF BAYESIAN LEARNING
- If exactly one hypothesis is correct, then the posterior probability of the correct hypothesis will tend toward 1 as more data is observed
- The effect of the prior distribution decreases as more data is observed

Page 17: Statistical Learning (From data to distributions)

HYPOTHESIS SPACES OFTEN INTRACTABLE
- To learn a probability distribution, a hypothesis would have to be a joint probability table over state variables
- 2^n entries => hypothesis space is (2^n - 1)-dimensional!
- 2^(2^n) deterministic hypotheses
- 6 boolean variables => over 10^22 hypotheses
- And what the heck would a prior be?

Page 18: Statistical Learning (From data to distributions)

LEARNING COIN FLIPS
- Let the unknown fraction of cherries be θ (hypothesis)
- Probability of drawing a cherry is θ
- Suppose draws are independent and identically distributed (i.i.d.)
- Observe that c out of N draws are cherries (data)

Page 19: Statistical Learning (From data to distributions)

LEARNING COIN FLIPS
- Let the unknown fraction of cherries be θ (hypothesis)
- Intuition: c/N might be a good hypothesis (or it might not, depending on the draw!)

Page 20: Statistical Learning (From data to distributions)

MAXIMUM LIKELIHOOD
- Likelihood of data d = {d1, …, dN} given θ:
  P(d|θ) = Πj P(dj|θ) = θ^c (1-θ)^(N-c)
- i.i.d. assumption
- Gather c cherry terms together, then N-c lime terms

Page 21: Statistical Learning (From data to distributions)

MAXIMUM LIKELIHOOD
- Likelihood of data d = {d1, …, dN} given θ: P(d|θ) = θ^c (1-θ)^(N-c)

[Plot: P(data|θ) vs. θ for 1/1 cherry]

Page 22: Statistical Learning (From data to distributions)

MAXIMUM LIKELIHOOD
- Likelihood of data d = {d1, …, dN} given θ: P(d|θ) = θ^c (1-θ)^(N-c)

[Plot: P(data|θ) vs. θ for 2/2 cherry]

Page 23: Statistical Learning (From data to distributions)

MAXIMUM LIKELIHOOD
- Likelihood of data d = {d1, …, dN} given θ: P(d|θ) = θ^c (1-θ)^(N-c)

[Plot: P(data|θ) vs. θ for 2/3 cherry]

Page 24: Statistical Learning (From data to distributions)

MAXIMUM LIKELIHOOD
- Likelihood of data d = {d1, …, dN} given θ: P(d|θ) = θ^c (1-θ)^(N-c)

[Plot: P(data|θ) vs. θ for 2/4 cherry]

Page 25: Statistical Learning (From data to distributions)

MAXIMUM LIKELIHOOD
- Likelihood of data d = {d1, …, dN} given θ: P(d|θ) = θ^c (1-θ)^(N-c)

[Plot: P(data|θ) vs. θ for 10/20 cherry]

Page 26: Statistical Learning (From data to distributions)

MAXIMUM LIKELIHOOD
- Likelihood of data d = {d1, …, dN} given θ: P(d|θ) = θ^c (1-θ)^(N-c)

[Plot: P(data|θ) vs. θ for 2/5 cherry]

Page 27: Statistical Learning (From data to distributions)

MAXIMUM LIKELIHOOD
- Likelihood of data d = {d1, …, dN} given θ: P(d|θ) = θ^c (1-θ)^(N-c)

[Plot: P(data|θ) vs. θ for 50/100 cherry]

Page 28: Statistical Learning (From data to distributions)

MAXIMUM LIKELIHOOD
- Peaks of the likelihood function seem to hover around the fraction of cherries…
- Sharpness indicates some notion of certainty…

[Plot: P(data|θ) vs. θ for 50/100 cherry]

Page 29: Statistical Learning (From data to distributions)

MAXIMUM LIKELIHOOD
- Let P(d|θ) be the likelihood function
- The quantity argmaxθ P(d|θ) is known as the maximum likelihood estimate (MLE)

Page 30: Statistical Learning (From data to distributions)

MAXIMUM LIKELIHOOD
- The maximum of P(d|θ) is obtained at the same place that the maximum of log P(d|θ) is obtained
  - Log is a monotonically increasing function
- So the MLE is the same as maximizing log likelihood… but:
  - Multiplications turn into additions
  - We don’t have to deal with such tiny numbers

Page 31: Statistical Learning (From data to distributions)

MAXIMUM LIKELIHOOD
- The maximum of P(d|θ) is obtained at the same place that the maximum of log P(d|θ) is obtained
  - Log is a monotonically increasing function

l(θ) = log P(d|θ) = log [ θ^c (1-θ)^(N-c) ]

Page 32: Statistical Learning (From data to distributions)

MAXIMUM LIKELIHOOD
- The maximum of P(d|θ) is obtained at the same place that the maximum of log P(d|θ) is obtained
  - Log is a monotonically increasing function

l(θ) = log P(d|θ) = log [ θ^c (1-θ)^(N-c) ]
     = log [ θ^c ] + log [ (1-θ)^(N-c) ]

Page 33: Statistical Learning (From data to distributions)

MAXIMUM LIKELIHOOD
- The maximum of P(d|θ) is obtained at the same place that the maximum of log P(d|θ) is obtained
  - Log is a monotonically increasing function

l(θ) = log P(d|θ) = log [ θ^c (1-θ)^(N-c) ]
     = log [ θ^c ] + log [ (1-θ)^(N-c) ]
     = c log θ + (N-c) log (1-θ)

Page 34: Statistical Learning (From data to distributions)

MAXIMUM LIKELIHOOD
- The maximum of P(d|θ) is obtained at the same place that the maximum of log P(d|θ) is obtained
  - Log is a monotonically increasing function

l(θ) = log P(d|θ) = c log θ + (N-c) log (1-θ)
- At a maximum of a function, its derivative is 0
- So, dl/dθ(θ) = 0 at the maximum likelihood estimate

Page 35: Statistical Learning (From data to distributions)

MAXIMUM LIKELIHOOD
- The maximum of P(d|θ) is obtained at the same place that the maximum of log P(d|θ) is obtained
  - Log is a monotonically increasing function

l(θ) = log P(d|θ) = c log θ + (N-c) log (1-θ)
- At a maximum of a function, its derivative is 0
- So, dl/dθ(θ) = 0 at the maximum likelihood estimate
  => 0 = c/θ - (N-c)/(1-θ)
  => θ = c/N
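A quick numerical check of this derivation, as a sketch (the values of c and N are arbitrary illustrative choices):

```python
import numpy as np

# Maximize l(theta) = c log(theta) + (N - c) log(1 - theta) over a fine grid
# and compare against the closed-form MLE theta = c/N.
c, N = 2, 5
thetas = np.linspace(0.001, 0.999, 9999)        # stay away from log(0)
loglik = c * np.log(thetas) + (N - c) * np.log(1.0 - thetas)

theta_grid_max = thetas[np.argmax(loglik)]
print(theta_grid_max, c / N)                    # both roughly 0.4
```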

Page 36: Statistical Learning (From data to distributions)

MAXIMUM LIKELIHOOD FOR BN
- For any BN, the ML parameters of any CPT can be derived by the fraction of observed values in the data, conditioned on matched parent values

(BN: Earthquake -> Alarm <- Burglar)

N = 1000
E: 500, B: 200  =>  P(E) = 0.5, P(B) = 0.2
A|E,B: 19/20   A|B: 188/200   A|E: 170/500   A|~E,~B: 1/380

E B P(A|E,B)
T T 0.95
F T 0.94
T F 0.34
F F 0.003
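A sketch of this recipe in code. The dataset below is synthetic (sampled from a CPT like the one on the slide), so the printed fractions only approximate the slide's counts; the point is that each ML parameter is just a conditional frequency:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
E = rng.random(n) < 0.5
B = rng.random(n) < 0.2
p_alarm = np.where(E & B, 0.95, np.where(B, 0.94, np.where(E, 0.34, 0.003)))
A = rng.random(n) < p_alarm

# ML estimates: fractions of observed values, conditioned on matched parent values
print("P(E) =", E.mean(), " P(B) =", B.mean())
for e in (True, False):
    for b in (True, False):
        match = (E == e) & (B == b)
        print(f"P(A | E={e}, B={b}) = {A[match].mean():.3f}  "
              f"({A[match].sum()}/{match.sum()})")
```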

Page 37: Statistical Learning (From data to distributions)

OTHER MLE RESULTS
- Multi-valued variables: take the fraction of counts for each value
- Continuous Gaussian distributions: take the average value as the mean, and the standard deviation of the data as the standard deviation
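For the Gaussian case, a brief sketch (the data values are made up; note that numpy's default std uses the 1/N convention, which is exactly the ML estimate):

```python
import numpy as np

data = np.array([4.9, 5.3, 5.1, 4.7, 5.0, 5.2])   # illustrative measurements
mu_ml = data.mean()                                # ML mean = sample average
sigma_ml = data.std()                              # ML std: (1/N)-weighted deviations
print(mu_ml, sigma_ml)
```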

Page 38: Statistical Learning (From data to distributions)

MAXIMUM LIKELIHOOD PROPERTIES
- As the number of data points approaches infinity, the MLE approaches the true value
- With little data, MLEs can vary wildly

Page 39: Statistical Learning (From data to distributions)

MAXIMUM LIKELIHOOD IN CANDY BAG EXAMPLE
- h_ML = argmax_hi P(d|hi)
- With no data h_ML is undefined; after observing limes, h_ML = h5

[Plot: P(X|d) vs. P(X|h_ML) as a function of the number of samples]

Page 40: Statistical Learning (From data to distributions)

BACK TO COIN FLIPS
- The MLE is easy to compute… but what about those small sample sizes?
- Motivation:
  - You hand me a coin from your pocket
  - 1 flip, turns up tails
  - What’s the MLE?

Page 41: Statistical Learning (From data to distributions)

MAXIMUM A POSTERIORI ESTIMATION
- Maximum a posteriori (MAP) estimation
- Idea: use the hypothesis prior to get a better initial estimate than ML, without resorting to full Bayesian estimation
  - “Most coins I’ve seen have been fair coins, so I won’t let the first few tails sway my estimate much”
  - “Now that I’ve seen 100 tails in a row, I’m pretty sure it’s not a fair coin anymore”

Page 42: Statistical Learning (From data to distributions)

MAXIMUM A POSTERIORI
- P(θ|d) is the posterior probability of the hypothesis, given the data
- argmaxθ P(θ|d) is known as the maximum a posteriori (MAP) estimate
- Posterior of hypothesis θ given data d = {d1, …, dN}:
  - P(θ|d) = α P(d|θ) P(θ)
  - The max over θ is not affected by α
  - So the MAP estimate is argmaxθ P(d|θ) P(θ)

Page 43: Statistical Learning (From data to distributions)

MAXIMUM A POSTERIORI
- h_MAP = argmax_hi P(hi|d)
- As more limes are observed, h_MAP moves from h3 to h4 to h5

[Plot: P(X|d) vs. P(X|h_MAP) as a function of the number of samples]

Page 44: Statistical Learning (From data to distributions)

ADVANTAGES OF MAP AND MLE OVER BAYESIAN ESTIMATION
- Involves an optimization rather than a large summation
  - Local search techniques
- For some types of distributions, there are closed-form solutions that are easily computed

Page 45: Statistical Learning (From data to distributions)

BACK TO COIN FLIPS
- Need some prior distribution P(θ)
- P(θ|d) ∝ P(d|θ) P(θ) = θ^c (1-θ)^(N-c) P(θ)
- Define, for all θ, the probability that I believe in θ

[Plot: a prior density P(θ) over θ ∈ [0, 1]]

Page 46: Statistical Learning (From data to distributions)

MAP ESTIMATE
- Could maximize θ^c (1-θ)^(N-c) P(θ) using some optimization
- Turns out for some families of P(θ), the MAP estimate is easy to compute
  - Beta distributions

[Plot: example Beta prior densities P(θ) over θ ∈ [0, 1]]

Page 47: Statistical Learning (From data to distributions)

BETA DISTRIBUTION
- Betaα,β(θ) = γ θ^(α-1) (1-θ)^(β-1)
- α, β are hyperparameters > 0
- γ is a normalization constant
- α = β = 1 is the uniform distribution

Page 48: Statistical Learning (From data to distributions)

POSTERIOR WITH BETA PRIOR
- Posterior ∝ θ^c (1-θ)^(N-c) P(θ) = γ θ^(c+α-1) (1-θ)^(N-c+β-1)
- MAP estimate: θ = (c+α-1)/(N+α+β-2)
- Posterior is also a Beta distribution!

Page 49: Statistical Learning (From data to distributions)

POSTERIOR WITH BETA PRIOR
- What does this mean? The prior specifies a “virtual count” of a = α-1 heads, b = β-1 tails
- See heads, increment a; see tails, increment b
- MAP estimate is a/(a+b)
- Effect of the prior diminishes with more data
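A small sketch tying the two views together, using the coin-from-your-pocket example from a few slides back (the Beta(10, 10) prior is an illustrative choice, not from the slides):

```python
# MAP estimate under a Beta(alpha, beta) prior, computed two equivalent ways.
def map_closed_form(c, N, alpha, beta):
    # Formula from the previous slide
    return (c + alpha - 1) / (N + alpha + beta - 2)

def map_virtual_counts(c, N, alpha, beta):
    a = (alpha - 1) + c          # virtual heads plus observed heads
    b = (beta - 1) + (N - c)     # virtual tails plus observed tails
    return a / (a + b)

c, N = 0, 1                      # one flip, turns up tails (0 heads)
alpha = beta = 10.0              # a prior that believes the coin is roughly fair
print(map_closed_form(c, N, alpha, beta))     # about 0.474: barely swayed by one tail
print(map_virtual_counts(c, N, alpha, beta))  # identical
```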

Page 50: Statistical Learning (From data to distributions)

MAP FOR BN
- For any BN, the MAP parameters of any CPT can be derived by the fraction of observed values in the data plus the prior virtual counts

(BN: Earthquake -> Alarm <- Burglar)

N = 1000 + 1000 virtual
E: 500+100, B: 200+200  =>  P(E) = 0.5 -> 0.3, P(B) = 0.2 -> 0.2
A|E,B: (19+100)/(20+100)   A|B: (188+180)/(200+200)   A|E: (170+100)/(500+400)   A|~E,~B: (1+3)/(380+300)

E B P(A|E,B): ML -> MAP
T T 0.95 -> 0.99
F T 0.94 -> 0.92
T F 0.34 -> 0.30
F F 0.003 -> 0.006
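The same virtual-count arithmetic for the CPT entries on this slide, as a sketch (the parent-value labels follow the table layout above):

```python
# (observed true, observed total, virtual true, virtual total) for each CPT row
rows = {
    "P(A |  E,  B)": (19, 20, 100, 100),
    "P(A | ~E,  B)": (188, 200, 180, 200),
    "P(A |  E, ~B)": (170, 500, 100, 400),
    "P(A | ~E, ~B)": (1, 380, 3, 300),
}

for name, (n_true, n_total, v_true, v_total) in rows.items():
    ml = n_true / n_total
    map_est = (n_true + v_true) / (n_total + v_total)
    print(f"{name}: ML = {ml:.3f} -> MAP = {map_est:.3f}")
# Reproduces the slide's shift, e.g. 0.950 -> 0.992 and 0.003 -> 0.006.
```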

Page 51: Statistical Learning (From data to distributions)

EXTENSIONS OF BETA PRIORS
- Multiple discrete values: Dirichlet prior
  - Mathematical expression more complex, but in practice still takes the form of “virtual counts”
- Mean, standard deviation for Gaussian distributions: Gamma prior
- Conjugate priors preserve the representation of prior and posterior distributions, but do not necessarily exist for general distributions

Page 52: Statistical Learning (From data to distributions)

RECAP
- Bayesian learning: infer the entire distribution over hypotheses
  - Robust, but expensive for large problems
- Maximum likelihood (ML): optimize the likelihood
  - Good for large datasets, not robust for small ones
- Maximum a posteriori (MAP): weight the likelihood by a hypothesis prior during optimization
  - A compromise solution for small datasets
  - Like Bayesian learning, how to specify the prior?

Page 53: Statistical Learning (From data to distributions)

APPLICATION TO STRUCTURE LEARNING
- So far we’ve assumed a BN structure, which assumes we have enough domain knowledge to declare strict independence relationships
- Problem: how to learn a sparse probabilistic model whose structure is unknown?

Page 54: Statistical Learning (From data to distributions)

BAYESIAN STRUCTURE LEARNING
- But what if the Bayesian network has unknown structure?
- Are Earthquake and Burglar independent?
- P(E,B) = P(E)P(B)?

(Earthquake ?— Burglar, with a data table of observed (B, E) samples: (0,0), (1,1), (0,1), (1,0), …)

Page 55: Statistical Learning (From data to distributions)

BAYESIAN STRUCTURE LEARNING
- Hypothesis 1: dependent
  - P(E,B) as a probability table has 3 free parameters
- Hypothesis 2: independent
  - Individual tables P(E), P(B) have 2 total free parameters
- Maximum likelihood will always prefer hypothesis 1!

(Same (B, E) data table as on the previous slide.)

Page 56: Statistical Learning (From data to distributions)

BAYESIAN STRUCTURE LEARNING
- Occam’s razor: prefer simpler explanations over complex ones
- Penalize free parameters in probability tables

(Same (B, E) data table as before.)

Page 57: Statistical Learning (From data to distributions)

ITERATIVE GREEDY ALGORITHM
- Fit a simple model (ML), compute likelihood
- Repeat:

Page 58: Statistical Learning (From data to distributions)

ITERATIVE GREEDY ALGORITHM
- Fit a simple model (ML), compute likelihood
- Repeat:
  - Pick a candidate edge to add to the BN

Page 59: Statistical Learning (From data to distributions)

ITERATIVE GREEDY ALGORITHM
- Fit a simple model (ML), compute likelihood
- Repeat:
  - Pick a candidate edge to add to the BN
  - Compute new ML probabilities

Page 60: Statistical Learning (From data to distributions)

ITERATIVE GREEDY ALGORITHM
- Fit a simple model (ML), compute likelihood
- Repeat:
  - Pick a candidate edge to add to the BN
  - Compute new ML probabilities
  - If the new likelihood exceeds the old likelihood + threshold, add the edge
  - Otherwise, repeat with a different edge
- (Usually log likelihood)
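A sketch of this greedy loop for fully observed binary data. Everything below (function names, data layout, the threshold value) is illustrative rather than from the slides: the structure is a parent map, the ML parameters are conditional frequencies, and the score is the ML log-likelihood, with the threshold acting as the Occam penalty on extra edges.

```python
import itertools
import math
import random
from collections import defaultdict

def node_loglik(data, child, parents):
    """ML log-likelihood contribution of one node: group rows by the values of
    its parents, then score each group with its empirical P(child | parents)."""
    groups = defaultdict(lambda: [0, 0])              # parent config -> [#child=0, #child=1]
    for row in data:
        groups[tuple(row[p] for p in parents)][row[child]] += 1
    ll = 0.0
    for n0, n1 in groups.values():
        for k in (n0, n1):
            if k > 0:
                ll += k * math.log(k / (n0 + n1))     # k * log(ML probability)
    return ll

def total_loglik(data, parent_map):
    return sum(node_loglik(data, v, ps) for v, ps in parent_map.items())

def creates_cycle(parent_map, child, new_parent):
    """Adding new_parent -> child is cyclic iff child is already an ancestor of new_parent."""
    stack, seen = [new_parent], set()
    while stack:
        v = stack.pop()
        if v == child:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(parent_map[v])
    return False

def greedy_structure_learning(data, n_vars, threshold=3.0):
    parent_map = {v: [] for v in range(n_vars)}       # start with no edges
    score = total_loglik(data, parent_map)
    improved = True
    while improved:
        improved = False
        for parent, child in itertools.permutations(range(n_vars), 2):
            if parent in parent_map[child] or creates_cycle(parent_map, child, parent):
                continue
            candidate = {v: list(ps) for v, ps in parent_map.items()}
            candidate[child].append(parent)
            new_score = total_loglik(data, candidate)
            if new_score > score + threshold:         # keep only clearly useful edges
                parent_map, score, improved = candidate, new_score, True
                break
    return parent_map

# Tiny usage example: variable 1 depends on variable 0, variable 2 is independent noise.
random.seed(0)
data = []
for _ in range(500):
    e = int(random.random() < 0.5)
    b = int(random.random() < (0.8 if e else 0.2))
    c = int(random.random() < 0.5)
    data.append((e, b, c))
print(greedy_structure_learning(data, 3))             # typically links 0 and 1, leaves 2 alone
```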

Page 61: Statistical Learning (From data to distributions)

OTHER TOPICS IN LEARNING DISTRIBUTIONS
- Learning continuous distributions using kernel density estimation / nearest-neighbor smoothing
- Expectation Maximization for hidden variables
  - Learning Bayes nets with hidden variables
  - Learning HMMs from observations
  - Learning Gaussian Mixture Models of continuous data

Page 62: Statistical Learning (From data to distributions)

NEXT TIME
- Introduction to machine learning
- R&N 18.1-2