Logistic Regression


  • Today

    Logistic Regression: Maximum Entropy Formulation

    Decision Trees Redux: Now using Information Theory

    Graphical Models: Representing conditional dependence graphically

    1

  • Logistic Regression Optimization

    Take the gradient in terms of w

    2

    E(w) = -\ln p(\mathbf{t}|w) = -\sum_{n=0}^{N-1} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}

    \nabla_w E = \sum_{n=0}^{N-1} \frac{\partial E}{\partial y_n} \frac{\partial y_n}{\partial a_n} \frac{\partial a_n}{\partial w}

    where y_n = p(c_0|x_n) = \sigma(a_n) and a_n = w^T x_n

  • Optimization

    We know the gradient of the error function, but how do we find the optimal value?

    Setting the gradient to zero is nontrivial, so we use numerical approximation.

    3

    \nabla_w E = \sum_{n=0}^{N-1} (y_n - t_n) x_n
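    As a concrete illustration of descending this gradient, here is a minimal Python sketch (the learning rate, iteration count, and synthetic data are assumptions, not from the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, t, lr=0.1, iters=1000):
    """Minimize E(w) by batch gradient descent.

    X : (N, D) design matrix, t : (N,) targets in {0, 1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        y = sigmoid(X @ w)          # y_n = sigma(w^T x_n)
        grad = X.T @ (y - t)        # sum_n (y_n - t_n) x_n
        w -= lr * grad              # step against the gradient of E
    return w

# Tiny synthetic example (assumed data, for illustration only)
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 1))])
t = (X[:, 1] > 0).astype(float)
w_hat = fit_logistic(X, t)
```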

  • Entropy

    A measure of uncertainty, or a measure of information.

    High uncertainty equals high entropy. Rare events are more informative than common events.

    4

    H(x) = -\sum_{x} p(x) \log_2 p(x)
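    A small Python sketch of this definition (the example distributions are assumptions for illustration):

```python
import numpy as np

def entropy(p):
    """H(x) = -sum_x p(x) log2 p(x), ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz))

print(entropy([0.5, 0.5]))        # 1.0 bit: maximal uncertainty for 2 outcomes
print(entropy([0.25] * 4))        # 2.0 bits: uniform over 4 outcomes
print(entropy([0.99, 0.01]))      # ~0.08 bits: nearly certain, low entropy
```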

  • Examples of Entropy

    Uniform distributions have higher entropy.

    5

  • Maximum Entropy

    Logistic Regression is also known as Maximum Entropy.

    Entropy is convex, so convergence is expected.

    Constrain this optimization to enforce good classification:

    Increase the maximum likelihood of the data while making the distribution of weights as even as possible.

    Include as many useful features as possible.

    6

  • Maximum Entropy with Constraints

    From the Klein and Manning tutorial

    7

  • Optimization formulation

    Let the weights represent the likelihood of each feature's value.

    8

    \max_w H(w; x, t) = -\sum_x w \log_2 w

    \text{s.t. } w^T x = t \text{ for each feature } i, \quad \|w\|_2 = 1

  • Solving MaxEnt formulation

    Convex optimization with a concave objective function and linear constraints.

    Lagrange Multipliers

    9

    \max_w H(w; x, t) = -\sum_x w \log_2 w

    \text{s.t. } w^T x = t \text{ for each feature } i, \quad \|w\|_2 = 1

    L(p, \lambda) = -\sum_w w \log_2 w + \sum_{i=1}^{N} \lambda_i (w_i^T x_i - t) + \lambda_0 (\|w\|_2 - 1)

    This is the dual representation of the maximum likelihood estimation of Logistic Regression.

  • Decision Trees

    Nested if-statements for classification.

    Each Decision Tree node contains a feature and a split point.

    Challenges:

    Determine which feature and split point to use.

    Determine which branches are worth including at all (pruning).

    10

  • Decision Trees

    11

    [Figure: a decision tree splitting on color (blue, brown, green), with leaf labels h and w]

  • Ranking Branches

    Last time, we used classification accuracy to measure the value of a branch.

    12

    [Figure: example split on the height feature]

  • Ranking Branches

    Measure the decrease in entropy of the class distribution following the split.

    13

    [Figure: example split on the height feature]

  • InfoGain Criterion

    Calculate the decrease in entropy across a split point. This represents the amount of information contained in the split.

    This is relatively indifferent to the position in the decision tree.

    More applicable to N-way classification.

    Accuracy represents the mode of the distribution; entropy can be reduced while leaving the mode unaffected.

    14
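    A small Python sketch of the InfoGain criterion (the helper names, toy labels, and split are assumptions for illustration):

```python
import numpy as np

def entropy(labels):
    """Entropy of the empirical class distribution, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(labels, left_mask):
    """Decrease in entropy from splitting labels into left/right groups."""
    left, right = labels[left_mask], labels[~left_mask]
    w_left, w_right = len(left) / len(labels), len(right) / len(labels)
    return entropy(labels) - (w_left * entropy(left) + w_right * entropy(right))

# Example: a split that isolates most of one class
y = np.array([0, 0, 0, 1, 1, 1, 1, 0])
split = np.array([True, True, True, False, False, False, False, False])
print(info_gain(y, split))
```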

  • Graphical Models and Conditional Independence

    Graphical models are about probabilities more generally, but are used in classification and clustering.

    Both Linear Regression and Logistic Regression use probabilistic models.

    Graphical models allow us to structure and visualize probabilistic models and the relationships between variables.

    15

  • (Joint) Probability Tables

    Represent multinomial joint probabilities between K variables as K-dimensional tables

    Assuming D binary variables, how big is this table?

    What if we had multinomials with M entries?

    16

    p(x) = p(flu?, achiness?, headache?, . . . , temperature?)

  • Probability Models

    What if the variables are independent?

    If x and y are independent:

    The original distribution can be factored

    How big is this table, if each variable is binary?

    17

    p(x) = p(flu?, achiness?, headache?, . . . , temperature?)

    p(x, y) = p(x)p(y)

    p(x) = p(flu?)p(achiness?)p(headache?) . . . p(temperature?)
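    To make the size questions above concrete, a quick sketch comparing a full joint table with a fully factored one (the entry-counting convention is standard background, not from the slides):

```python
def table_sizes(D, M=2):
    """Entries needed for a full joint vs. a fully factored model
    over D variables, each taking M values."""
    full_joint = M ** D          # one entry per joint configuration
    factored = D * M             # one small table per independent variable
    return full_joint, factored

print(table_sizes(10))           # (1024, 20) for 10 binary variables
print(table_sizes(10, M=5))      # (9765625, 50) for 10 five-valued variables
```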

  • Conditional Independence

    Independence assumptions are convenient (Naïve Bayes), but rarely true.

    More often some groups of variables are dependent, but others are independent.

    Still others are conditionally independent.

    18

  • Conditional Independence

    If two variables are conditionally independent:

    E.g. y = flu?, x = achiness?, z = headache?

    19

    p(x, z|y) = p(x|y)\,p(z|y) \quad \text{but} \quad p(x, z) \neq p(x)\,p(z)

    x \perp z \mid y

  • Factorization of a joint

    Assume

    How do you factorize:

    20

    x \perp z \mid y

    p(x, y, z)

    p(x, y, z) = p(x, z|y)\,p(y) = p(x|y)\,p(z|y)\,p(y)

  • Factorization of a joint

    What if there is no conditional independence?

    How do you factorize:

    21

    p(x, y, z)

    p(x, y, z) = p(x, z|y)\,p(y) = p(x|y, z)\,p(z|y)\,p(y)

  • Structure of Graphical Models

    Graphical models allow us to represent dependence relationships between variables visually.

    Graphical models are directed acyclic graphs (DAGs).

    Nodes: random variables

    Edges: dependence relationships

    No edge: independent variables

    Direction of the edge indicates a parent-child relationship.

    Parent: source, trigger. Child: destination, response.

    22

  • Example Graphical Models

    Parents of a node i are denoted \pi_i

    Factorization of the joint in a graphical model:

    23

    p(x, y) = p(x)\,p(y) \qquad p(x, y) = p(x|y)\,p(y)

    p(x_0, \ldots, x_{n-1}) = \prod_{i=0}^{n-1} p(x_i \mid \pi_i)

    [Figure: two two-node graphs over x and y, one with no edge and one with an edge between them]
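    A minimal sketch of this factorization in Python, assuming each node stores a conditional probability table keyed by its parents' values (the two-node graph and the numbers are made up for illustration):

```python
# Joint probability as a product of per-node conditionals: p(x) = prod_i p(x_i | pi_i)
# Each CPT maps (value, parent_values_tuple) -> probability. Toy numbers, for illustration.
cpts = {
    "x": {(0, ()): 0.7, (1, ()): 0.3},                      # p(x), no parents
    "y": {(0, (0,)): 0.9, (1, (0,)): 0.1,
          (0, (1,)): 0.2, (1, (1,)): 0.8},                  # p(y | x)
}
parents = {"x": (), "y": ("x",)}

def joint(assignment):
    """p(assignment) = product over nodes of p(node value | parent values)."""
    prob = 1.0
    for node, cpt in cpts.items():
        pa_vals = tuple(assignment[p] for p in parents[node])
        prob *= cpt[(assignment[node], pa_vals)]
    return prob

print(joint({"x": 1, "y": 1}))   # 0.3 * 0.8 = 0.24
```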

  • Basic Graphical Models

    Independent Variables

    Observations

    When we observe a variable (fix its value from data), we color the node grey.

    Observing a variable allows us to condition on it, e.g. p(x, z|y).

    Given an observation we can generate pdfs for the other variables.

    24

    [Figure: three independent nodes x, y, z; the same graph with y shaded as observed]

  • Example Graphical Models

    X = cloudy?  Y = raining?  Z = wet ground?

    Markov Chain

    25

    [Figure: chain x → y → z]

    p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x)\,p(y|x)\,p(z|y)

  • Example Graphical Models

    Markov Chain

    Are x and z conditionally independent given y?

    26

    [Figure: chain x → y → z]

    p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x)\,p(y|x)\,p(z|y)

    p(x, z|y) \stackrel{?}{=} p(x|y)\,p(z|y)

  • Example Graphical Models

    Markov Chain

    27

    [Figure: chain x → y → z]

    p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x)\,p(y|x)\,p(z|y)

    p(x, z|y) = p(x|z, y)\,p(z|y)

    p(x|z, y) = \frac{p(x, y, z)}{p(y, z)} = \frac{p(x)\,p(y|x)\,p(z|y)}{p(y)\,p(z|y)} = \frac{p(x)\,p(y|x)}{p(y)} = \frac{p(x, y)}{p(y)} = p(x|y)

    p(x, z|y) = p(x|y)\,p(z|y) \;\Rightarrow\; x \perp z \mid y

  • One Trigger Two Responses

    X = achiness? Y = flu? Z = fever?

    28

    [Figure: y → x and y → z (one trigger, two responses)]

    p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x|y)\,p(y)\,p(z|y)

  • Example Graphical Models

    Are x and z conditionally independent given y?

    29

    p(x, z|y) \stackrel{?}{=} p(x|y)\,p(z|y)

    [Figure: y → x and y → z (one trigger, two responses)]

    p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x|y)\,p(y)\,p(z|y)

  • Example Graphical Models

    30

    [Figure: y → x and y → z (one trigger, two responses)]

    p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x|y)\,p(y)\,p(z|y)

    p(x, z|y) = p(x|z, y)\,p(z|y)

    p(x|z, y) = \frac{p(x, y, z)}{p(y, z)} = \frac{p(x|y)\,p(y)\,p(z|y)}{p(y)\,p(z|y)} = p(x|y)

    p(x, z|y) = p(x|y)\,p(z|y) \;\Rightarrow\; x \perp z \mid y

  • Two Triggers One Response

    X = rain? Y = wet sidewalk? Z = spilled coffee?

    31

    [Figure: x → y ← z (two triggers, one response)]

    p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x)\,p(y|x, z)\,p(z)

  • Example Graphical Models

    Are x and z conditionally independent given y?

    32

    p(x, z|y) \stackrel{?}{=} p(x|y)\,p(z|y)

    [Figure: x → y ← z (two triggers, one response)]

    p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x)\,p(y|x, z)\,p(z)

  • Example Graphical Models

    33

    [Figure: x → y ← z (two triggers, one response)]

    p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x)\,p(y|x, z)\,p(z)

    p(x, z|y) = p(x|z, y)\,p(z|y)

    p(x|z, y) = \frac{p(x, y, z)}{p(y, z)} = \frac{p(x)\,p(y|x, z)\,p(z)}{p(y|x, z)\,p(z)} = p(x)

    p(x, z|y) = p(x)\,p(z|y) \neq p(x|y)\,p(z|y) \;\Rightarrow\; x \not\perp z \mid y

  • Factorization

    34

    [Figure: DAG with edges x0 → x1, x0 → x2, x1 → x3, x2 → x4, x1 → x5, x4 → x5]

    p(x_0, x_1, x_2, x_3, x_4, x_5) = \,?

  • Factorization

    35

    [Figure: DAG with edges x0 → x1, x0 → x2, x1 → x3, x2 → x4, x1 → x5, x4 → x5]

    p(x_0, x_1, x_2, x_3, x_4, x_5) = p(x_0)\,p(x_1|x_0)\,p(x_2|x_0)\,p(x_3|x_1)\,p(x_4|x_2)\,p(x_5|x_1, x_4)

  • How Large are the probability tables?

    36

    p(x_0, x_1, x_2, x_3, x_4, x_5) = p(x_0)\,p(x_1|x_0)\,p(x_2|x_0)\,p(x_3|x_1)\,p(x_4|x_2)\,p(x_5|x_1, x_4)
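    A worked count for this factorization, assuming all six variables are binary (the free-parameter convention is standard background, not stated on the slide): p(x0) needs 1 parameter; p(x1|x0), p(x2|x0), p(x3|x1), and p(x4|x2) need 2 each; p(x5|x1, x4) needs 4; 13 in total, versus 2^6 - 1 = 63 for the full joint. A small sketch:

```python
# Free parameters per CPT for binary variables: (2 - 1) * 2**(number of parents)
num_parents = {"x0": 0, "x1": 1, "x2": 1, "x3": 1, "x4": 1, "x5": 2}

factored = sum(2 ** k for k in num_parents.values())    # 1+2+2+2+2+4 = 13
full_joint = 2 ** len(num_parents) - 1                  # 63
print(factored, full_joint)
```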

  • Model Parameters as Nodes

    Treating model parameters as a random variable, we can include these in a graphical model

    Multivariate Bernoulli

    37

    [Figure: parameter nodes θ_0, θ_1, θ_2, each the parent of x_0, x_1, x_2 respectively]

  • Model Parameters as Nodes

    Treating model parameters as a random variable, we can include these in a graphical model

    Multinomial

    38

    [Figure: a shared parameter node θ as the parent of x_0, x_1, x_2]

  • Naïve Bayes Classification

    Observed variables xi are independent given the class variable y

    The distribution can be optimized using maximum likelihood on each variable separately.

    Can easily combine various types of distributions

    39

    [Figure: class node y as the parent of x_0, x_1, x_2]

    p(y|x_0, x_1, x_2) \propto p(x_0, x_1, x_2|y)\,p(y)

    p(y|x_0, x_1, x_2) \propto p(x_0|y)\,p(x_1|y)\,p(x_2|y)\,p(y)
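    A minimal sketch of this classifier for binary features, trained by maximum likelihood on each variable separately (the toy data and the add-one smoothing constant are assumptions for illustration):

```python
import numpy as np

def train_nb(X, y, alpha=1.0):
    """Per-class priors and per-feature Bernoulli likelihoods p(x_i = 1 | y)."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    likelihoods = {c: (X[y == c].sum(axis=0) + alpha) /
                      ((y == c).sum() + 2 * alpha) for c in classes}
    return priors, likelihoods

def predict_nb(x, priors, likelihoods):
    """argmax_y p(y) * prod_i p(x_i | y), computed in log space."""
    scores = {}
    for c, prior in priors.items():
        p = likelihoods[c]
        scores[c] = np.log(prior) + np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    return max(scores, key=scores.get)

X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 0], [0, 1, 0]])
y = np.array([1, 1, 0, 0])
priors, likes = train_nb(X, y)
print(predict_nb(np.array([1, 0, 1]), priors, likes))   # expected: class 1
```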

  • Graphical Models

    Graphical representation of dependency relationships

    Directed Acyclic Graphs

    Nodes as random variables

    Edges define dependency relations

    What can we do with Graphical Models?

    Learn parameters to fit data

    Understand independence relationships between variables

    Perform inference (marginals and conditionals)

    Compute likelihoods for classification

    40

  • Plate Notation

    To indicate a repeated variable, draw a plate around it.

    41

    [Figure: y with children x_0, x_1, ..., x_n, and the equivalent plate notation with x_i inside a plate repeated n times]

  • Completely observed Graphical Model

    Observations for every node

    Simplest (least general) graph, assume each independent

    42

    Suppose we have observations for every node.

    Flu  Fever  Sinus  Ache  Swell  Head
    Y    L      Y      Y     Y      N
    N    M      N      N     N      N
    Y    H      N      N     Y      Y
    Y    M      Y      N     N      Y

    In the simplest (least general) graph, assume each variable is independent. Train 6 separate models.

    [Figure: six disconnected nodes Fl, Fe, Si, Ac, Sw, He]

  • Completely observed Graphical Model

    Observations for every node

    Second simplest graph, assume complete dependence

    43

    Suppose we have observations for every node.

    Flu  Fever  Sinus  Ache  Swell  Head
    Y    L      Y      Y     Y      N
    N    M      N      N     N      N
    Y    H      N      N     Y      Y
    Y    M      Y      N     N      Y

    In the second simplest (most general) graph, assume no independence. Build a 6-dimensional table. (Divide by the total count.)

    [Figure: six fully connected nodes Fl, Fe, Si, Ac, Sw, He]
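    A minimal sketch of the "train 6 separate models" case, estimating each variable's distribution by counting (the records are transcribed from the table above; the dictionary layout is an implementation choice):

```python
from collections import Counter

columns = ["Flu", "Fever", "Sinus", "Ache", "Swell", "Head"]
records = [
    ("Y", "L", "Y", "Y", "Y", "N"),
    ("N", "M", "N", "N", "N", "N"),
    ("Y", "H", "N", "N", "Y", "Y"),
    ("Y", "M", "Y", "N", "N", "Y"),
]

# One independent model per column: p(value) = count(value) / N
models = {}
for j, name in enumerate(columns):
    counts = Counter(row[j] for row in records)
    total = sum(counts.values())
    models[name] = {v: c / total for v, c in counts.items()}

print(models["Flu"])    # {'Y': 0.75, 'N': 0.25}
print(models["Fever"])  # {'L': 0.25, 'M': 0.5, 'H': 0.25}
```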

  • Maximum Likelihood

    Each node has a conditional probability table \theta_i.

    Given the tables, we can construct the pdf.

    Use Maximum Likelihood to find the best settings of \theta.

    44

    [Figure: a DAG over x_0 ... x_5]

    p(x|\theta) = \prod_{i=0}^{M-1} p(x_i \mid \pi_i, \theta_i)

    We have M variables in x, and N data points X.

  • Maximum Likelihood

    \hat{\theta} = \arg\max_\theta \ln p(X|\theta)

    = \arg\max_\theta \sum_{n=0}^{N-1} \ln p(X_n|\theta)

    = \arg\max_\theta \sum_{n=0}^{N-1} \ln \prod_{i=0}^{M-1} p(x_{i,n}|\theta_i)

    = \arg\max_\theta \sum_{n=0}^{N-1} \sum_{i=0}^{M-1} \ln p(x_{i,n}|\theta_i)

    45

  • Count functions

    Count the number of times something appears in the data.

    46

    First, Kronecker's delta function:

    \delta(x_n, x_m) = \begin{cases} 1 & \text{if } x_n = x_m \\ 0 & \text{otherwise} \end{cases}

    Counts: the number of times something appears in the data.

    m(x_i) = \sum_{n=0}^{N-1} \delta(x_i, x_{i,n})

    m(X) = \sum_{n=0}^{N-1} \delta(X, X_n)

    N = \sum_{x_1} m(x_1) = \sum_{x_1} \left( \sum_{x_2} m(x_1, x_2) \right) = \sum_{x_1} \left( \sum_{x_2} \left( \sum_{x_3} m(x_1, x_2, x_3) \right) \right) = \ldots

  • Maximum Likelihood

    47

    l(\theta) = \sum_{n=0}^{N-1} \ln p(X_n|\theta)

    = \sum_{n=0}^{N-1} \ln \prod_{X} p(X|\theta)^{\delta(x_n, X)}

    = \sum_{n=0}^{N-1} \sum_{X} \delta(x_n, X) \ln p(X|\theta)

    = \sum_{X} m(X) \ln p(X|\theta)

    = \sum_{X} m(X) \ln \prod_{i=0}^{M-1} p(x_i \mid \pi_i, \theta_i)

    = \sum_{X} \sum_{i=0}^{M-1} m(X) \ln p(x_i \mid \pi_i, \theta_i)

    = \sum_{i=0}^{M-1} \sum_{x_i, \pi_i} \sum_{X \setminus x_i \setminus \pi_i} m(X) \ln p(x_i \mid \pi_i, \theta_i)

    = \sum_{i=0}^{M-1} \sum_{x_i, \pi_i} m(x_i, \pi_i) \ln p(x_i \mid \pi_i, \theta_i)

    Define a function: \theta(x_i, \pi_i) = p(x_i \mid \pi_i, \theta_i)

    Constraint: \sum_{x_i} \theta(x_i, \pi_i) = 1

  • Maximum Likelihood

    Use Lagrange Multipliers

    48

    l(\theta) = \sum_{i=0}^{M-1} \sum_{x_i} \sum_{\pi_i} m(x_i, \pi_i) \ln \theta(x_i, \pi_i) - \sum_{i=0}^{M-1} \sum_{\pi_i} \lambda_{\pi_i} \left( \sum_{x_i} \theta(x_i, \pi_i) - 1 \right)

    \frac{\partial l(\theta)}{\partial \theta(x_i, \pi_i)} = \frac{m(x_i, \pi_i)}{\theta(x_i, \pi_i)} - \lambda_{\pi_i} = 0

    \theta(x_i, \pi_i) = \frac{m(x_i, \pi_i)}{\lambda_{\pi_i}}

    \sum_{x_i} \frac{m(x_i, \pi_i)}{\lambda_{\pi_i}} = 1 \quad \text{(the constraint)}

    \lambda_{\pi_i} = \sum_{x_i} m(x_i, \pi_i) = m(\pi_i)

    \theta(x_i, \pi_i) = \frac{m(x_i, \pi_i)}{m(\pi_i)} \quad \text{(just counts!)}
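    A minimal sketch of this result, estimating a CPT entry as m(x_i, π_i) / m(π_i) by counting (the records reuse the Fever and Ache columns of the earlier table; treating Fever as the sole parent of Ache is an assumed example structure, not one given on the slides):

```python
from collections import Counter

# Records as (Fever, Ache) pairs, taken from the observed table above.
data = [("L", "Y"), ("M", "N"), ("H", "N"), ("M", "N")]

pair_counts = Counter(data)                       # m(x_i, pi_i)
parent_counts = Counter(f for f, _ in data)       # m(pi_i)

def theta(ache, fever):
    """Maximum-likelihood CPT entry p(Ache = ache | Fever = fever)."""
    return pair_counts[(fever, ache)] / parent_counts[fever]

print(theta("Y", "L"))   # 1.0: the single Fever=L record has Ache=Y
print(theta("N", "M"))   # 1.0: both Fever=M records have Ache=N
```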

  • Maximum A Posteriori Training

    A Bayesian would never do that; the thetas need a prior. Adding a prior pseudo-count \alpha gives:

    49

    \theta(x_i, \pi_i) = \frac{m(x_i, \pi_i) + \alpha}{m(\pi_i) + \alpha |x_i|}

  • Conditional Dependence Test

    We can check conditional independence in a graphical model.

    Is achiness (x3) independent of the flu (x0) given fever (x1)?

    Is achiness (x3) independent of sinus infections (x2) given fever (x1)?

    50

    p(x) = p(x_0)\,p(x_1|x_0)\,p(x_2|x_0)\,p(x_3|x_1)\,p(x_4|x_2)\,p(x_5|x_1, x_4)

    p(x_3|x_0, x_1, x_2) = \frac{p(x_0, x_1, x_2, x_3)}{p(x_0, x_1, x_2)} = \frac{p(x_0)\,p(x_1|x_0)\,p(x_2|x_0)\,p(x_3|x_1)}{p(x_0)\,p(x_1|x_0)\,p(x_2|x_0)} = p(x_3|x_1)

    x_3 \perp x_0, x_2 \mid x_1

  • D-Separation and Bayes Ball

    Intuition: nodes are separated, or blocked, by sets of nodes.

    Example: nodes x1 and x2 block the path from x0 to x5, so x0 is conditionally independent of x5 given x1 and x2.

    51

    [Figure: a DAG over x_0 ... x_5]

    x_0 \perp x_5 \mid x_1, x_2

  • Bayes Ball Algorithm

    Shade the nodes in x_c.

    Place a ball at each node in x_a.

    Bounce the balls around the graph according to the rules.

    If no balls reach x_b, then conditional independence holds:

    52

    x_a \perp x_b \mid x_c

  • Ten rules of Bayes Ball Theorem

    53

  • Bayes Ball Example

    x_0 \perp x_4 \mid x_2 \,?

    [Figure: a DAG over x_0 ... x_5 with x_2 shaded as observed]

    54

  • Bayes Ball Example

    x_0 \perp x_5 \mid x_1, x_2 \,?

    [Figure: a DAG over x_0 ... x_5 with x_1 and x_2 shaded as observed]

    55

  • Undirected Graphs

    What if we allow undirected graphs? What do they correspond to?

    Not cause/effect or trigger/response, but general dependence.

    Example: image pixels, where each pixel is a Bernoulli variable.

    P(x_{1,1}, \ldots, x_{1,M}, \ldots, x_{M,1}, \ldots, x_{M,M})

    Bright pixels have bright neighbors.

    No parents, just probabilities.

    Grid models are called Markov Random Fields.

    56

  • Undirected Graphs

    Undirected separability is easy.

    To check conditional independence of A and B given C, check the graph reachability of A and B without going through nodes in C.

    57

    [Figure: an undirected graph over node groups A, B, C, D, with C separating A from B]
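    A minimal sketch of this reachability check as a breadth-first search that is not allowed to pass through C (the example graph is an assumption for illustration):

```python
from collections import deque

def separated(graph, a, b, c):
    """True if every path from a to b passes through a node in c
    (i.e., a and b are conditionally independent given c in an undirected model)."""
    blocked, frontier, seen = set(c), deque([a]), {a}
    while frontier:
        node = frontier.popleft()
        if node == b:
            return False                     # reached b without touching c
        for nbr in graph[node]:
            if nbr not in seen and nbr not in blocked:
                seen.add(nbr)
                frontier.append(nbr)
    return True

# Toy chain A - C - B plus a side node D attached to A
graph = {"A": {"C", "D"}, "C": {"A", "B"}, "B": {"C"}, "D": {"A"}}
print(separated(graph, "A", "B", {"C"}))    # True: C blocks the only path
print(separated(graph, "A", "B", set()))    # False: A reaches B through C
```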

  • Next Time

    More fun with Graphical Models

    Read Chapter 8.1, 8.2

    58