# logistic regression - city university of new...

Post on 28-Mar-2018


Today

Logistic Regression: Maximum Entropy Formulation

Decision Trees Redux: Now using Information Theory

Graphical Models: Representing conditional dependence graphically

Logistic Regression Optimization

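Take the gradient in terms of w

$$E(w) = -\ln p(t \mid w) = -\sum_{n=0}^{N-1} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\}$$

$$\nabla_w E = \sum_{n=0}^{N-1} \frac{\partial E}{\partial y_n} \frac{\partial y_n}{\partial a_n} \nabla_w a_n$$

Where $y_n = p(c_0 \mid x_n) = \sigma(a_n)$

$a_n = w^T x_n$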

Optimization

We know the gradient of the error function, but how do we find the minimum value?

$$\nabla_w E = \sum_{n=0}^{N-1} (y_n - t_n) x_n$$

Setting to zero is nontrivial

Numerical approximation
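One common numerical approach is batch gradient descent on E(w). The sketch below is only an illustration under stated assumptions: the toy data, the step size, and the iteration count are all made up, and the helper names (`sigmoid`, `gradient`, `fit`) are not from the slides.

```python
import math

def sigmoid(a):
    """Logistic sigmoid, y = sigma(a)."""
    return 1.0 / (1.0 + math.exp(-a))

def gradient(w, X, t):
    """Gradient of E(w): the sum over n of (y_n - t_n) * x_n."""
    grad = [0.0] * len(w)
    for x_n, t_n in zip(X, t):
        a_n = sum(w_j * x_j for w_j, x_j in zip(w, x_n))  # a_n = w^T x_n
        y_n = sigmoid(a_n)
        for j, x_j in enumerate(x_n):
            grad[j] += (y_n - t_n) * x_j
    return grad

def fit(X, t, step=0.1, iterations=2000):
    """Batch gradient descent; step and iterations are illustrative choices."""
    w = [0.0] * len(X[0])
    for _ in range(iterations):
        g = gradient(w, X, t)
        w = [w_j - step * g_j for w_j, g_j in zip(w, g)]
    return w

# Toy data (assumed): the first component is a constant bias feature.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
t = [0, 0, 1, 1]
w = fit(X, t)
predictions = [1 if sigmoid(sum(a * b for a, b in zip(w, x))) > 0.5 else 0 for x in X]
```

Each step moves w against the gradient, so E(w) decreases as long as the step size is small enough for the data at hand.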

Entropy

Measure of uncertainty, or Measure of Information

High uncertainty equals high entropy.

Rare events are more informative than common events.

$$H(x) = -\sum_{x} p(x) \log_2 p(x)$$

Examples of Entropy

Uniform distributions have higher entropy.
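As a quick numerical check of the entropy definition, here is a minimal sketch that compares a uniform distribution with a skewed one and a deterministic one. The three example distributions are assumptions chosen for illustration.

```python
import math

def entropy(p):
    """Shannon entropy in bits: H = -sum p(x) log2 p(x), taking 0 log 0 as 0."""
    return -sum(p_x * math.log2(p_x) for p_x in p if p_x > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
certain = [1.0, 0.0, 0.0, 0.0]

h_uniform = entropy(uniform)  # four equally likely outcomes: 2 bits
h_skewed = entropy(skewed)    # somewhere between the two extremes
h_certain = entropy(certain)  # no uncertainty: 0 bits
```

The uniform case attains the largest value possible for four outcomes, and entropy falls as the probability mass concentrates on one outcome.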

Maximum Entropy

Logistic Regression is also known as Maximum Entropy.

Entropy is concave, so maximizing it is a convex optimization problem.

Convergence expectation.

Constrain this optimization to enforce good classification.

Increase maximum likelihood of the data while making the distribution of weights most even.

Include as many useful features as possible.

Maximum Entropy with Constraints

From Klein and Manning Tutorial

Optimization formulation

If we let the weights represent likelihoods of value for each feature:

$$\max H(w; x, t) = -\sum_{w} w \log_2 w$$

$$\text{s.t. } w^T x = t \quad \text{and} \quad \|w\|_2 = 1$$

For each feature i

Solving MaxEnt formulation

Convex optimization with a concave objective function and linear constraints.

Lagrange Multipliers

$$L(p, \lambda) = -\sum_{w} w \log_2 w + \sum_{i=1}^{N} \lambda_i (w_i^T x_i - t) + \lambda_0 (\|w\|_2 - 1)$$

Dual representation of the maximum likelihood estimation of Logistic Regression

Decision Trees

Nested if-statements for classification

Each Decision Tree Node contains a feature and a split point.

Challenges:

Determine which feature and split point to use

Determine which branches are worth including at all (Pruning)
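To make the "nested if-statements" view concrete, here is a small hand-written tree. Everything in it is hypothetical: the features (`color`, `height`), the split points, and the class labels `"h"` and `"w"` are invented for illustration and are not taken from the lecture's data.

```python
def classify(color, height):
    """Hypothetical tree: features, thresholds and labels are made up."""
    if color == "blue":          # node: feature = color
        if height < 60:          # node: feature = height, split point = 60
            return "w"
        return "h"
    elif color == "brown":
        return "w"               # leaf: no further split
    else:                        # green
        if height < 70:          # a different split point on the same feature
            return "w"
        return "h"
```

Learning a tree amounts to choosing these features and split points automatically, and pruning decides whether an inner `if` is worth keeping.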

Decision Trees

[Figure: decision tree with root node "color" and branches blue, brown, green]

Ranking Branches

Last time, we used classification accuracy to measure value of a branch.

[Figure: split on height]

Ranking Branches

Measure Decrease in Entropy of the class distribution following the split

[Figure: split on height]

InfoGain Criterion

Calculate the decrease in Entropy across a split point.

This represents the amount of information contained in the split.

This is relatively indifferent to the position on the decision tree.

More applicable to N-way classification.

Accuracy represents the mode of the distribution.

Entropy can be reduced while leaving the mode unaffected.
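The last two points can be illustrated with a small sketch. The label lists below are assumed toy data, chosen so that a split leaves majority-vote accuracy unchanged while still reducing entropy. The function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of the empirical class distribution of `labels`."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    remainder = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - remainder

def accuracy(nodes):
    """Fraction correct when every node predicts its own majority class."""
    total = sum(len(node) for node in nodes)
    return sum(max(Counter(node).values()) for node in nodes) / total

parent = ["a"] * 8 + ["b"] * 4
left = ["a"] * 4                  # pure node
right = ["a"] * 4 + ["b"] * 4     # evenly mixed node
gain = info_gain(parent, [left, right])
```

Here `accuracy([parent])` and `accuracy([left, right])` are equal, so an accuracy criterion sees no value in the split, while `gain` is positive because one child has become pure.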

Graphical Models and Conditional Independence

More generally about probabilities, but used in classification and clustering.

Both Linear Regression and Logistic Regression use probabilistic models.

Graphical Models allow us to structure and visualize probabilistic models and the relationships between variables.

(Joint) Probability Tables

Represent multinomial joint probabilities between K variables as K-dimensional tables

Assuming D binary variables, how big is this table?

What if we had multinomials with M entries?

p(x) = p(flu?, achiness?, headache?, . . . , temperature?)

Probability Models

What if the variables are independent?

p(x) = p(flu?, achiness?, headache?, . . . , temperature?)

If x and y are independent:

p(x, y) = p(x)p(y)

The original distribution can be factored:

p(x) = p(flu?)p(achiness?)p(headache?) . . . p(temperature?)

How big is this table, if each variable is binary?
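For reference, the table sizes asked about above can be worked out directly. This is a sketch that counts table entries (not free parameters), with D the number of variables and the value D = 10 picked only as an example:

$$
\begin{aligned}
\text{full joint, } D \text{ binary variables:} \quad & 2^{D} \text{ entries} \\
\text{full joint, } D \text{ variables with } M \text{ values each:} \quad & M^{D} \text{ entries} \\
\text{fully independent binary variables:} \quad & D \text{ tables of } 2 \text{ entries} = 2D \text{ entries} \\
\text{e.g. } D = 10: \quad & 2^{10} = 1024 \text{ versus } 2 \cdot 10 = 20
\end{aligned}
$$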

Conditional Independence

Independence assumptions are convenient (Naïve Bayes), but rarely true.

More often some groups of variables are dependent, but others are independent.

Still others are conditionally independent.

Conditional Independence

If two variables are conditionally independent:

E.g. y = flu?, x = achiness?, z = headache?

$$p(x, z \mid y) = p(x \mid y) p(z \mid y)$$

$$p(x, z) \neq p(x) p(z)$$

$$x \perp z \mid y$$

Factorization of a joint

Assume $x \perp z \mid y$

How do you factorize $p(x, y, z)$?

$$p(x, y, z) = p(x, z \mid y) p(y) = p(x \mid y) p(z \mid y) p(y)$$

Factorization of a joint

What if there is no conditional independence?

How do you factorize $p(x, y, z)$?

$$p(x, y, z) = p(x, z \mid y) p(y) = p(x \mid y, z) p(z \mid y) p(y)$$

Structure of Graphical Models

Graphical models allow us to represent dependence relationships between variables visually.

Graphical models are directed acyclic graphs (DAG).

Nodes: random variables

Edges: Dependence relationship

No Edge: Independent variables

Direction of the edge: indicates a parent-child relationship

Parent: Source, Trigger

Child: Destination, Response

Example Graphical Models

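Parents of a node i are denoted $\pi_i$.

Factorization of the joint in a graphical model:

$$p(x_0, \ldots, x_{n-1}) = \prod_{i=0}^{n-1} p(x_i \mid \pi_i)$$

[Figure: two graphs over nodes x and y. With no edge, p(x, y) = p(x)p(y); with an edge from y to x, p(x, y) = p(x|y)p(y)]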

Basic Graphical Models

Independent Variables

Observations

When we observe a variable (fix its value from data), we color the node grey.

Observing a variable allows us to condition on it, e.g. p(x, z|y).

Given an observation we can generate pdfs for the other variables.

[Figure: nodes x, y, z, shown twice]

Example Graphical Models

X = cloudy? Y = raining? Z = wet ground?

Markov Chain

[Figure: x → y → z]

$$p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x) p(y \mid x) p(z \mid y)$$

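Example Graphical Models

Markov Chain

Are x and z conditionally independent given y? That is, does the following hold?

$$p(x, z \mid y) = p(x \mid y) p(z \mid y)$$

[Figure: x → y → z]

$$p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x) p(y \mid x) p(z \mid y)$$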

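Example Graphical Models

Markov Chain

[Figure: x → y → z]

$$p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x) p(y \mid x) p(z \mid y)$$

$$p(x, z \mid y) = p(x \mid z, y) p(z \mid y)$$

$$p(x \mid z, y) = \frac{p(x, y, z)}{p(y, z)} = \frac{p(x) p(y \mid x) p(z \mid y)}{p(y) p(z \mid y)} = \frac{p(x) p(y \mid x)}{p(y)} = \frac{p(x, y)}{p(y)} = p(x \mid y)$$

$$p(x, z \mid y) = p(x \mid y) p(z \mid y) \quad \Rightarrow \quad x \perp z \mid y$$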
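The same conclusion can be checked numerically. The sketch below builds the joint for a chain x → y → z from assumed, made-up conditional probability tables, then compares p(x, z|y) with p(x|y)p(z|y) for every assignment. The table values and the helper names are illustrative only.

```python
from itertools import product

# Hypothetical CPTs for the chain x -> y -> z (cloudy -> raining -> wet ground).
p_x = {1: 0.4, 0: 0.6}
p_y_given_x = {1: {1: 0.7, 0: 0.3}, 0: {1: 0.1, 0: 0.9}}  # indexed [x][y]
p_z_given_y = {1: {1: 0.9, 0: 0.1}, 0: {1: 0.2, 0: 0.8}}  # indexed [y][z]

joint = {(x, y, z): p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]
         for x, y, z in product((0, 1), repeat=3)}

def marginal(**fixed):
    """Sum the joint over all assignments consistent with the fixed values."""
    return sum(p for (x, y, z), p in joint.items()
               if all({"x": x, "y": y, "z": z}[k] == v for k, v in fixed.items()))

def cond_indep_given_y():
    """Check p(x, z | y) == p(x | y) p(z | y) for all eight assignments."""
    for x, y, z in product((0, 1), repeat=3):
        p_y = marginal(y=y)
        lhs = joint[(x, y, z)] / p_y
        rhs = (marginal(x=x, y=y) / p_y) * (marginal(y=y, z=z) / p_y)
        if abs(lhs - rhs) > 1e-12:
            return False
    return True

chain_is_cond_indep = cond_indep_given_y()
marginal_gap = abs(marginal(x=1, z=1) - marginal(x=1) * marginal(z=1))
```

With these tables `chain_is_cond_indep` is true, while `marginal_gap` is clearly nonzero: x and z are dependent until y is observed.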

One Trigger Two Responses

X = achiness? Y = flu? Z = fever?

[Figure: x ← y → z]

$$p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x \mid y) p(y) p(z \mid y)$$

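Example Graphical Models

Are x and z conditionally independent given y? That is, does the following hold?

$$p(x, z \mid y) = p(x \mid y) p(z \mid y)$$

[Figure: x ← y → z]

$$p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x \mid y) p(y) p(z \mid y)$$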

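Example Graphical Models

[Figure: x ← y → z]

$$p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x \mid y) p(y) p(z \mid y)$$

$$p(x, z \mid y) = p(x \mid z, y) p(z \mid y)$$

$$p(x \mid z, y) = \frac{p(x, y, z)}{p(y, z)} = \frac{p(x \mid y) p(y) p(z \mid y)}{p(y) p(z \mid y)} = p(x \mid y)$$

$$p(x, z \mid y) = p(x \mid y) p(z \mid y) \quad \Rightarrow \quad x \perp z \mid y$$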

Two Triggers One Response

X = rain? Y = wet sidewalk? Z = spilled coffee?

[Figure: x → y ← z]

$$p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x) p(y \mid x, z) p(z)$$

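Example Graphical Models

Are x and z conditionally independent given y? That is, does the following hold?

$$p(x, z \mid y) = p(x \mid y) p(z \mid y)$$

[Figure: x → y ← z]

$$p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x) p(y \mid x, z) p(z)$$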

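Example Graphical Models

[Figure: x → y ← z]

$$p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x) p(y \mid x, z) p(z)$$

$$p(x, z \mid y) = p(x \mid z, y) p(z \mid y)$$

$$p(x \mid z, y) = \frac{p(x, y, z)}{p(y, z)} = \frac{p(x) p(y \mid x, z) p(z)}{p(y \mid z) p(z)} = \frac{p(x) p(y \mid x, z)}{p(y \mid z)}$$

This still depends on z in general, so $p(x \mid z, y) \neq p(x \mid y)$ and the joint conditional does not factor:

$$p(x, z \mid y) \neq p(x \mid y) p(z \mid y) \quad \Rightarrow \quad x \not\perp z \mid y$$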
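A numerical sketch makes the two-triggers case concrete. The probability tables below are assumed values invented for the rain / wet sidewalk / spilled coffee example, and the helper names are illustrative.

```python
from itertools import product

# Hypothetical CPTs for x -> y <- z (rain -> wet sidewalk <- spilled coffee).
p_x = {1: 0.2, 0: 0.8}
p_z = {1: 0.1, 0: 0.9}
p_y1_given_xz = {(0, 0): 0.01, (0, 1): 0.8, (1, 0): 0.9, (1, 1): 0.99}  # p(y=1 | x, z)

def joint(x, y, z):
    """p(x, y, z) = p(x) p(y | x, z) p(z)."""
    p_y1 = p_y1_given_xz[(x, z)]
    return p_x[x] * p_z[z] * (p_y1 if y == 1 else 1.0 - p_y1)

def prob(x=None, y=None, z=None):
    """Marginal probability of a partial assignment (None means summed out)."""
    return sum(joint(a, b, c) for a, b, c in product((0, 1), repeat=3)
               if x in (None, a) and y in (None, b) and z in (None, c))

# Before observing y, the two triggers are independent.
marginal_gap = abs(prob(x=1, z=1) - prob(x=1) * prob(z=1))

# After observing a wet sidewalk, learning about the coffee changes belief in rain.
p_rain_given_wet = prob(x=1, y=1) / prob(y=1)
p_rain_given_wet_and_coffee = prob(x=1, y=1, z=1) / prob(y=1, z=1)
```

With these numbers `marginal_gap` is zero up to rounding, yet the probability of rain drops once the spilled coffee is known to explain the wet sidewalk, so x and z are dependent given y.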

Factorization

[Figure: directed graph over nodes x0, x1, x2, x3, x4, x5]

$$p(x_0, x_1, x_2, x_3, x_4, x_5) = \; ?$$

Factorization

[Figure: directed graph over nodes x0, x1, x2, x3, x4, x5]

$$p(x_0, x_1, x_2, x_3, x_4, x_5) = p(x_0) p(x_1 \mid x_0) p(x_2 \mid x_0) p(x_3 \mid x_1) p(x_4 \mid x_2) p(x_5 \mid x_1, x_4)$$

How Large are the probability tables?

$$p(x_0, x_1, x_2, x_3, x_4, x_5) = p(x_0) p(x_1 \mid x_0) p(x_2 \mid x_0) p(x_3 \mid x_1) p(x_4 \mid x_2) p(x_5 \mid x_1, x_4)$$
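The table sizes can be read off the factorization: a factor p(x_i | parents) over variables with K values each needs K to the power (1 + number of parents) entries. A minimal sketch, assuming all six variables are binary (the slides do not state the cardinalities) and using an illustrative `parents` dictionary that mirrors the factorization above:

```python
# Parent sets read off the factorization above.
parents = {
    "x0": [],
    "x1": ["x0"],
    "x2": ["x0"],
    "x3": ["x1"],
    "x4": ["x2"],
    "x5": ["x1", "x4"],
}

def table_entries(parents, values_per_variable=2):
    """Entries in each table p(x_i | parents): K ** (1 + number of parents)."""
    return {node: values_per_variable ** (1 + len(pa)) for node, pa in parents.items()}

sizes = table_entries(parents)
factored_total = sum(sizes.values())   # 2 + 4 + 4 + 4 + 4 + 8 = 26
full_joint_total = 2 ** len(parents)   # one 6-dimensional table: 2 ** 6 = 64
```

The saving grows quickly with the number of variables, since the full joint is exponential in the variable count while each factor is exponential only in its own number of parents.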

Model Parameters as Nodes

Treating model parameters as a random variable, we can include these in a graphical model

Multivariate Bernoulli

[Figure: each variable x0, x1, x2 has its own parameter node (indexed 0, 1, 2)]

Model Parameters as Nodes

Treating model parameters as a random variable, we can include these in a graphical model

Multinomial

[Figure: nodes x0, x1, x2]

Naïve Bayes Classification

Observed variables xi are independent given the class variable y.

The distribution can be optimized using maximum likelihood on each variable separately.

Can easily combine various types of distributions.

[Figure: class node y with children x0, x1, x2]

$$p(y \mid x_0, x_1, x_2) \propto p(x_0, x_1, x_2 \mid y) p(y)$$

$$p(y \mid x_0, x_1, x_2) \propto p(x_0 \mid y) p(x_1 \mid y) p(x_2 \mid y) p(y)$$
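The proportionality above becomes a posterior once the scores are normalized over the classes. Here is a minimal sketch with three binary features. The class names, the prior, and all likelihood values are assumptions made up for illustration.

```python
def naive_bayes_posterior(prior, likelihoods, observation):
    """Return p(y | x) by normalizing p(y) * prod_i p(x_i | y) over classes."""
    scores = {}
    for y, p_y in prior.items():
        score = p_y
        for i, x_i in enumerate(observation):
            score *= likelihoods[y][i][x_i]  # p(x_i | y)
        scores[y] = score
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

# Hypothetical model: one table p(x_i | y) per feature and class.
prior = {"flu": 0.1, "no flu": 0.9}
likelihoods = {
    "flu":    [{True: 0.8, False: 0.2}, {True: 0.7, False: 0.3}, {True: 0.9, False: 0.1}],
    "no flu": [{True: 0.1, False: 0.9}, {True: 0.2, False: 0.8}, {True: 0.05, False: 0.95}],
}
posterior = naive_bayes_posterior(prior, likelihoods, (True, True, True))
```

Because each feature has its own table per class, a different distribution type could be swapped in for any single feature without touching the others.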

Graphical Models

Graphical representation of dependency relationships

Directed Acyclic Graphs

Nodes as random variables

Edges define dependency relations

What can we do with Graphical Models?

Learn parameters to fit data

Understand independence relationships between variables

Perform inference (marginals and conditionals)

Compute likelihoods for classification

Plate Notation

To indicate a repeated variable, draw a plate around it.

[Figure: node y with children x0, x1, ..., xn, redrawn as y with a single child xi inside a plate labeled n]

Completely observed Graphical Model

Suppose we have observations for every node.

| Flu | Fever | Sinus | Ache | Swell | Head |
|-----|-------|-------|------|-------|------|
| Y   | L     | Y     | Y    | Y     | N    |
| N   | M     | N     | N    | N     | N    |
| Y   | H     | N     | N    | Y     | Y    |
| Y   | M     | Y     | N    | N     | Y    |

In the simplest (least general) graph, assume each variable is independent. Train 6 separate models.

[Figure: nodes Fl, Fe, Si, Ac, Sw, He]


Completely observed Graphical Model

Observations for every node

In the second simplest graph (the most general), assume no independence. Build a 6-dimensional table. (Divide by total count.)

[Figure: nodes Fl, Fe, Si, Ac, Sw, He]

Maximum Likelihood

Consider this Graphical Model:

[Figure: directed graph over nodes x0, x1, x2, x3, x4, x5]

Each node has a conditional probability table, $\theta_i$.

Given the tables, we can construct the pdf:

$$p(x \mid \theta) = \prod_{i=0}^{M-1} p(x_i \mid \pi_i, \theta_i)$$

We have M variables in x, and N data points, X.

Use Maximum (log) Likelihood to find the best settings of $\theta$:

$$
\begin{aligned}
\theta^{*} &= \arg\max_{\theta} \ln p(X \mid \theta) \\
&= \arg\max_{\theta} \sum_{n=0}^{N-1} \ln p(X_n \mid \theta) \\
&= \arg\max_{\theta} \sum_{n=0}^{N-1} \ln \prod_{i=0}^{M-1} p(x_{in} \mid \theta_i) \\
&= \arg\max_{\theta} \sum_{n=0}^{N-1} \sum_{i=0}^{M-1} \ln p(x_{in} \mid \theta_i)
\end{aligned}
$$


Count functions

Count the number of times something appears in the data.

First, Kronecker's delta function:

$$\delta(x_n, x_m) = \begin{cases} 1 & \text{if } x_n = x_m \\ 0 & \text{otherwise} \end{cases}$$

Counts: the number of times something appears in the data

$$m(x_i) = \sum_{n=0}^{N-1} \delta(x_i, x_{in})$$

$$m(X) = \sum_{n=0}^{N-1} \delta(X, X_n)$$

$$N = \sum_{x_1} m(x_1) = \sum_{x_1} \left( \sum_{x_2} m(x_1, x_2) \right) = \sum_{x_1} \left( \sum_{x_2} \left( \sum_{x_3} m(x_1, x_2, x_3) \right) \right) = \ldots$$


Maximum Likelihood

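Maximum likelihood CPTs

$$
\begin{aligned}
l(\theta) &= \sum_{n=0}^{N-1} \ln p(X_n \mid \theta) \\
&= \sum_{n=0}^{N-1} \ln \prod_{X} p(X \mid \theta)^{\delta(x_n, X)} \\
&= \sum_{n=0}^{N-1} \sum_{X} \delta(x_n, X) \ln p(X \mid \theta) \\
&= \sum_{X} m(X) \ln p(X \mid \theta) \\
&= \sum_{X} m(X) \ln \prod_{i=0}^{M-1} p(x_i \mid \pi_i, \theta_i) \\
&= \sum_{X} \sum_{i=0}^{M-1} m(X) \ln p(x_i \mid \pi_i, \theta_i) \\
&= \sum_{i=0}^{M-1} \sum_{x_i, \pi_i} \sum_{X \setminus x_i \setminus \pi_i} m(X) \ln p(x_i \mid \pi_i, \theta_i) \\
&= \sum_{i=0}^{M-1} \sum_{x_i, \pi_i} m(x_i, \pi_i) \ln p(x_i \mid \pi_i, \theta_i)
\end{aligned}
$$

Define a function:

$$\theta(x_i, \pi_i) = p(x_i \mid \pi_i, \theta_i)$$

Constraint:

$$\sum_{x_i} \theta(x_i, \pi_i) = 1$$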


Maximum Likelihood

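Use Lagrange Multipliers

$$l(\theta) = \sum_{i=0}^{M-1} \sum_{x_i} \sum_{\pi_i} m(x_i, \pi_i) \ln \theta(x_i, \pi_i) - \sum_{i=0}^{M-1} \sum_{\pi_i} \lambda_{\pi_i} \left( \sum_{x_i} \theta(x_i, \pi_i) - 1 \right)$$

$$\frac{\partial l(\theta)}{\partial \theta(x_i, \pi_i)} = \frac{m(x_i, \pi_i)}{\theta(x_i, \pi_i)} - \lambda_{\pi_i} = 0$$

$$\theta(x_i, \pi_i) = \frac{m(x_i, \pi_i)}{\lambda_{\pi_i}}$$

$$\sum_{x_i} \frac{m(x_i, \pi_i)}{\lambda_{\pi_i}} = 1 \quad \text{(the constraint)}$$

$$\lambda_{\pi_i} = \sum_{x_i} m(x_i, \pi_i) = m(\pi_i)$$

$$\theta(x_i, \pi_i) = \frac{m(x_i, \pi_i)}{m(\pi_i)} \quad \text{counts!}$$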

Maximum A Posteriori Training

Bayesians would never do that, the thetas need a prior.

$$\theta(x_i, \pi_i) = \frac{m(x_i, \pi_i) + \epsilon}{m(\pi_i) + \epsilon |x_i|}$$
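Both estimators reduce to counting, which the sketch below illustrates: with a pseudo-count of zero it returns the maximum likelihood ratio of counts, and with a positive pseudo-count it returns the smoothed estimate. The function name `estimate_cpt`, the parameter name `epsilon`, and the choice of "ache" as a child of "flu" are assumptions for illustration. The four rows reuse the Flu and Ache columns of the observation table from earlier in the lecture.

```python
from collections import Counter

def estimate_cpt(data, child, parents, values, epsilon=0.0):
    """Estimate theta(x_i, pi_i) = (m(x_i, pi_i) + eps) / (m(pi_i) + eps * |x_i|)."""
    joint_counts = Counter()   # m(x_i, pi_i)
    parent_counts = Counter()  # m(pi_i)
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint_counts[(row[child], pa)] += 1
        parent_counts[pa] += 1
    cpt = {}
    for pa in parent_counts:
        for v in values:
            numerator = joint_counts[(v, pa)] + epsilon
            denominator = parent_counts[pa] + epsilon * len(values)
            cpt[(v, pa)] = numerator / denominator
    return cpt

data = [
    {"flu": "Y", "ache": "Y"},
    {"flu": "N", "ache": "N"},
    {"flu": "Y", "ache": "N"},
    {"flu": "Y", "ache": "N"},
]
ml = estimate_cpt(data, "ache", ["flu"], ["Y", "N"])                     # plain counts
smoothed = estimate_cpt(data, "ache", ["flu"], ["Y", "N"], epsilon=1.0)  # with a prior
```

Note how the maximum likelihood table assigns probability zero to aches without flu, since that combination never appears in four rows, while the smoothed table does not.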

Conditional Dependence Test

Can check conditional independence in a graphical model.

Is achiness (x3) independent of the flu (x0) given fever (x1)?

Is achiness (x3) independent of sinus infections (x2) given fever (x1)?

$$p(x) = p(x_0) p(x_1 \mid x_0) p(x_2 \mid x_0) p(x_3 \mid x_1) p(x_4 \mid x_2) p(x_5 \mid x_1, x_4)$$

$$p(x_3 \mid x_0, x_1, x_2) = \frac{p(x_0, x_1, x_2, x_3)}{p(x_0, x_1, x_2)} = \frac{p(x_0) p(x_1 \mid x_0) p(x_2 \mid x_0) p(x_3 \mid x_1)}{p(x_0) p(x_1 \mid x_0) p(x_2 \mid x_0)} = p(x_3 \mid x_1)$$

$$x_3 \perp x_0, x_2 \mid x_1$$

D-Separation and Bayes Ball

Intuition: nodes are separated, or blocked, by sets of nodes.

E.g. nodes x1 and x2 block the path from x0 to x5, so x0 is conditionally independent of x5 given x1 and x2:

$$x_0 \perp x_5 \mid x_1, x_2$$

[Figure: directed graph over nodes x0, x1, x2, x3, x4, x5]

Bayes Ball Algorithm

To test $x_a \perp x_b \mid x_c$:

Shade nodes xc.

Place a ball at each node in xa.

Bounce balls around the graph according to rules.

If no balls reach xb, then xa and xb are conditionally independent given xc.
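The bouncing rules themselves are not reproduced in this transcript, so the sketch below does not implement Bayes Ball. It swaps in a different, standard test that gives the same answers for directed acyclic graphs: restrict the graph to the ancestors of the nodes involved, moralize it (connect co-parents and drop edge directions), remove the observed nodes, and check reachability. The function names and the `parents` dictionary are illustrative, and the edges are read off the lecture's factorization of p(x0, ..., x5).

```python
def ancestors(parents, nodes):
    """Return `nodes` together with all of their ancestors in the DAG."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen

def d_separated(parents, a, b, given):
    """True if every node in `a` is d-separated from every node in `b` by `given`."""
    keep = ancestors(parents, set(a) | set(b) | set(given))
    # Moralize: link each kept node to its parents, and co-parents to each other.
    neighbors = {node: set() for node in keep}
    for child in keep:
        pas = list(parents[child])
        for p in pas:
            neighbors[child].add(p)
            neighbors[p].add(child)
        for i, p in enumerate(pas):
            for q in pas[i + 1:]:
                neighbors[p].add(q)
                neighbors[q].add(p)
    # Search from `a` without entering observed nodes; reaching `b` means dependence.
    blocked = set(given)
    seen, stack = set(), [n for n in a if n not in blocked]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(n for n in neighbors[node] if n not in blocked)
    return not (seen & set(b))

parents = {"x0": [], "x1": ["x0"], "x2": ["x0"], "x3": ["x1"],
           "x4": ["x2"], "x5": ["x1", "x4"]}
x0_x5_given_x1_x2 = d_separated(parents, ["x0"], ["x5"], ["x1", "x2"])
x1_x4_given_x0 = d_separated(parents, ["x1"], ["x4"], ["x0"])
x1_x4_given_x0_x5 = d_separated(parents, ["x1"], ["x4"], ["x0", "x5"])
```

The last two queries show the effect of a common child: x1 and x4 are separated by x0 alone, but observing x5 as well reconnects them.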

Ten rules of Bayes Ball Theorem

Bayes Ball Example - I

$x_0 \perp x_4 \mid x_2$?

[Figure: directed graph over nodes x0, x1, x2, x3, x4, x5]

Bayes Ball Example - II

$x_0 \perp x_5 \mid x_1, x_2$?

[Figure: directed graph over nodes x0, x1, x2, x3, x4, x5]

Undirected Graphs

What if we allow undirected graphs?

What do they correspond to?

Not Cause/Effect, or Trigger/Response, but general dependence.

Example: Image pixels, each pixel is a Bernoulli.

$P(x_{11}, \ldots, x_{1M}, \ldots, x_{M1}, \ldots, x_{MM})$

Bright pixels have bright neighbors.

No parents, just probabilities.

Grid models are called Markov Random Fields.

Undirected Graphs

Undirected separability is easy.

To check conditional independence of A and B given C, check the graph reachability of A and B without going through nodes in C.

[Figure: undirected graph with nodes A, B, C, D]
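The reachability check can be sketched as a breadth-first search that refuses to step onto nodes in C. The edges of the lecture's figure are not recoverable from this transcript, so the graph below is a hypothetical one, and the function name `separated` is illustrative.

```python
from collections import deque

def separated(graph, a, b, c):
    """True if every path from set `a` to set `b` passes through a node in `c`."""
    blocked = set(c)
    queue = deque(n for n in a if n not in blocked)
    seen = set(queue)
    while queue:
        node = queue.popleft()
        if node in b:
            return False  # reached b without crossing c
        for nxt in graph[node]:
            if nxt not in blocked and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True

# Hypothetical undirected graph (adjacency sets): A - C - B, and C - D.
graph = {"A": {"C"}, "B": {"C"}, "C": {"A", "B", "D"}, "D": {"C"}}
a_b_given_c = separated(graph, {"A"}, {"B"}, {"C"})
a_b_given_d = separated(graph, {"A"}, {"B"}, {"D"})
```

In this example graph, C lies on the only path between A and B, so conditioning on C separates them, while conditioning on D leaves the path open.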

Next Time

More fun with Graphical Models

Read Chapter 8.1, 8.2
