Logistic Regression


  • Today

    Logistic Regression: Maximum Entropy Formulation

    Decision Trees Redux: Now using Information Theory

    Graphical Models: Representing conditional dependence graphically

    1

  • Logistic Regression Optimization

    Take the gradient in terms of w

    2

    E(w) = -\ln p(\mathbf{t}|w) = -\sum_{n=0}^{N-1} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}

    \nabla_w E = \sum_{n=0}^{N-1} \frac{\partial E}{\partial y_n} \frac{\partial y_n}{\partial a_n} \frac{\partial a_n}{\partial w}

    where y_n = p(c_0|x_n) = \sigma(a_n) and a_n = w^T x_n

  • Optimization

    We know the gradient of the error function, but how do we find the optimal value?

    Setting the gradient to zero is nontrivial, so we use numerical approximation.

    3

    \nabla_w E = \sum_{n=0}^{N-1} (y_n - t_n) x_n
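    As a concrete illustration of descending this gradient, here is a minimal Python sketch (the learning rate, iteration count, and synthetic data are assumptions, not from the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, t, lr=0.1, iters=1000):
    """Minimize E(w) by batch gradient descent.

    X : (N, D) design matrix, t : (N,) targets in {0, 1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        y = sigmoid(X @ w)          # y_n = sigma(w^T x_n)
        grad = X.T @ (y - t)        # sum_n (y_n - t_n) x_n
        w -= lr * grad              # step against the gradient of E
    return w

# Tiny synthetic example (assumed data, for illustration only)
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 1))])
t = (X[:, 1] > 0).astype(float)
w_hat = fit_logistic(X, t)
```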

  • Entropy

    A measure of uncertainty, or a measure of information.

    High uncertainty equals high entropy. Rare events are more informative than common events.

    4

    H(x) = -\sum_{x} p(x) \log_2 p(x)
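    A small Python sketch of this definition (the example distributions are assumptions for illustration):

```python
import numpy as np

def entropy(p):
    """H(x) = -sum_x p(x) log2 p(x), ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz))

print(entropy([0.5, 0.5]))        # 1.0 bit: maximal uncertainty for 2 outcomes
print(entropy([0.25] * 4))        # 2.0 bits: uniform over 4 outcomes
print(entropy([0.99, 0.01]))      # ~0.08 bits: nearly certain, low entropy
```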

  • Examples of Entropy

    Uniform distributions have higher entropy.

    5

  • Maximum Entropy

    Logistic Regression is also known as Maximum Entropy.

    Entropy is convex, so convergence is expected.

    Constrain this optimization to enforce good classification:

    Increase the maximum likelihood of the data while making the distribution of weights as even as possible.

    Include as many useful features as possible.

    6

  • Maximum Entropy with Constraints

    From the Klein and Manning tutorial

    7

  • Optimization formulation

    Let the weights represent the likelihood of each feature's value.

    8

    \max_w H(w; x, t) = -\sum_x w \log_2 w

    \text{s.t. } w^T x = t \text{ for each feature } i, \quad \|w\|_2 = 1

  • Solving MaxEnt formulation

    Convex optimization with a concave objective function and linear constraints.

    Lagrange Multipliers

    9

    \max_w H(w; x, t) = -\sum_x w \log_2 w

    \text{s.t. } w^T x = t \text{ for each feature } i, \quad \|w\|_2 = 1

    L(p, \lambda) = -\sum_w w \log_2 w + \sum_{i=1}^{N} \lambda_i (w_i^T x_i - t) + \lambda_0 (\|w\|_2 - 1)

    This is the dual representation of the maximum likelihood estimation of Logistic Regression.

  • Decision Trees

    Nested if-statements for classification.

    Each Decision Tree node contains a feature and a split point.

    Challenges:

    Determine which feature and split point to use.

    Determine which branches are worth including at all (pruning).

    10

  • Decision Trees

    11

    [Figure: a decision tree splitting on color (blue, brown, green), with leaf labels h and w]

  • Ranking Branches

    Last time, we used classification accuracy to measure the value of a branch.

    12

    [Figure: example split on the height feature]

  • Ranking Branches

    Measure the decrease in entropy of the class distribution following the split.

    13

    [Figure: example split on the height feature]

  • InfoGain Criterion

    Calculate the decrease in entropy across a split point. This represents the amount of information contained in the split.

    This is relatively indifferent to the position in the decision tree.

    More applicable to N-way classification.

    Accuracy represents the mode of the distribution; entropy can be reduced while leaving the mode unaffected.

    14
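    A small Python sketch of the InfoGain criterion (the helper names, toy labels, and split are assumptions for illustration):

```python
import numpy as np

def entropy(labels):
    """Entropy of the empirical class distribution, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(labels, left_mask):
    """Decrease in entropy from splitting labels into left/right groups."""
    left, right = labels[left_mask], labels[~left_mask]
    w_left, w_right = len(left) / len(labels), len(right) / len(labels)
    return entropy(labels) - (w_left * entropy(left) + w_right * entropy(right))

# Example: a split that isolates most of one class
y = np.array([0, 0, 0, 1, 1, 1, 1, 0])
split = np.array([True, True, True, False, False, False, False, False])
print(info_gain(y, split))
```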

  • Graphical Models and Conditional Independence

    Graphical models are about probabilities more generally, but are used in classification and clustering.

    Both Linear Regression and Logistic Regression use probabilistic models.

    Graphical models allow us to structure and visualize probabilistic models and the relationships between variables.

    15

  • (Joint) Probability Tables

    Represent multinomial joint probabilities between K variables as K-dimensional tables

    Assuming D binary variables, how big is this table?

    What if we had multinomials with M entries?

    16

    p(x) = p(flu?, achiness?, headache?, . . . , temperature?)

  • Probability Models

    What if the variables are independent?

    If x and y are independent:

    The original distribution can be factored

    How big is this table, if each variable is binary?

    17

    p(x) = p(flu?, achiness?, headache?, . . . , temperature?)

    p(x, y) = p(x)p(y)

    p(x) = p(flu?)p(achiness?)p(headache?) . . . p(temperature?)
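    To make the size questions above concrete, a quick sketch comparing a full joint table with a fully factored one (the entry-counting convention is standard background, not from the slides):

```python
def table_sizes(D, M=2):
    """Entries needed for a full joint vs. a fully factored model
    over D variables, each taking M values."""
    full_joint = M ** D          # one entry per joint configuration
    factored = D * M             # one small table per independent variable
    return full_joint, factored

print(table_sizes(10))           # (1024, 20) for 10 binary variables
print(table_sizes(10, M=5))      # (9765625, 50) for 10 five-valued variables
```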

  • Conditional Independence

    Independence assumptions are convenient (Naïve Bayes), but rarely true.

    More often some groups of variables are dependent, but others are independent.

    Still others are conditionally independent.

    18

  • Conditional Independence

    If two variables are conditionally independent:

    E.g. y = flu?, x = achiness?, z = headache?

    19

    p(x, z|y) = p(x|y)\,p(z|y) \quad \text{but} \quad p(x, z) \neq p(x)\,p(z)

    x \perp z \mid y

  • Factorization of a joint

    Assume

    How do you factorize:

    20

    x \perp z \mid y

    p(x, y, z)

    p(x, y, z) = p(x, z|y)\,p(y) = p(x|y)\,p(z|y)\,p(y)

  • Factorization of a joint

    What if there is no conditional independence?

    How do you factorize:

    21

    p(x, y, z)

    p(x, y, z) = p(x, z|y)\,p(y) = p(x|y, z)\,p(z|y)\,p(y)

  • Structure of Graphical Models

    Graphical models allow us to represent dependence relationships between variables visually.

    Graphical models are directed acyclic graphs (DAGs).

    Nodes: random variables

    Edges: dependence relationships

    No edge: independent variables

    Direction of the edge indicates a parent-child relationship.

    Parent: source, trigger. Child: destination, response.

    22

  • Example Graphical Models

    Parents of a node i are denoted \pi_i

    Factorization of the joint in a graphical model:

    23

    p(x, y) = p(x)\,p(y) \qquad p(x, y) = p(x|y)\,p(y)

    p(x_0, \ldots, x_{n-1}) = \prod_{i=0}^{n-1} p(x_i \mid \pi_i)

    [Figure: two two-node graphs over x and y, one with no edge and one with an edge between them]
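    A minimal sketch of this factorization in Python, assuming each node stores a conditional probability table keyed by its parents' values (the two-node graph and the numbers are made up for illustration):

```python
# Joint probability as a product of per-node conditionals: p(x) = prod_i p(x_i | pi_i)
# Each CPT maps (value, parent_values_tuple) -> probability. Toy numbers, for illustration.
cpts = {
    "x": {(0, ()): 0.7, (1, ()): 0.3},                      # p(x), no parents
    "y": {(0, (0,)): 0.9, (1, (0,)): 0.1,
          (0, (1,)): 0.2, (1, (1,)): 0.8},                  # p(y | x)
}
parents = {"x": (), "y": ("x",)}

def joint(assignment):
    """p(assignment) = product over nodes of p(node value | parent values)."""
    prob = 1.0
    for node, cpt in cpts.items():
        pa_vals = tuple(assignment[p] for p in parents[node])
        prob *= cpt[(assignment[node], pa_vals)]
    return prob

print(joint({"x": 1, "y": 1}))   # 0.3 * 0.8 = 0.24
```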

  • Basic Graphical Models

    Independent Variables

    Observations

    When we observe a variable (fix its value from data), we color the node grey.

    Observing a variable allows us to condition on it, e.g. p(x, z|y).

    Given an observation we can generate pdfs for the other variables.

    24

    [Figure: three independent nodes x, y, z; the same graph with y shaded as observed]

  • Example Graphical Models

    X = cloudy?  Y = raining?  Z = wet ground?

    Markov Chain

    25

    [Figure: chain x → y → z]

    p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x)\,p(y|x)\,p(z|y)

  • Example Graphical Models

    Markov Chain

    Are x and z conditionally independent given y?

    26

    [Figure: chain x → y → z]

    p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x)\,p(y|x)\,p(z|y)

    p(x, z|y) \stackrel{?}{=} p(x|y)\,p(z|y)

  • Example Graphical Models

    Markov Chain

    27

    [Figure: chain x → y → z]

    p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x)\,p(y|x)\,p(z|y)

    p(x, z|y) = p(x|z, y)\,p(z|y)

    p(x|z, y) = \frac{p(x, y, z)}{p(y, z)} = \frac{p(x)\,p(y|x)\,p(z|y)}{p(y)\,p(z|y)} = \frac{p(x)\,p(y|x)}{p(y)} = \frac{p(x, y)}{p(y)} = p(x|y)

    p(x, z|y) = p(x|y)\,p(z|y) \;\Rightarrow\; x \perp z \mid y

  • One Trigger Two Responses

    X = achiness? Y = flu? Z = fever?

    28

    [Figure: y → x and y → z (one trigger, two responses)]

    p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x|y)\,p(y)\,p(z|y)

  • Example Graphical Models

    Are x and z conditionally independent given y?

    29

    p(x, z|y) \stackrel{?}{=} p(x|y)\,p(z|y)

    [Figure: y → x and y → z (one trigger, two responses)]

    p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x|y)\,p(y)\,p(z|y)

  • Example Graphical Models

    30

    [Figure: y → x and y → z (one trigger, two responses)]

    p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x|y)\,p(y)\,p(z|y)

    p(x, z|y) = p(x|z, y)\,p(z|y)

    p(x|z, y) = \frac{p(x, y, z)}{p(y, z)} = \frac{p(x|y)\,p(y)\,p(z|y)}{p(y)\,p(z|y)} = p(x|y)

    p(x, z|y) = p(x|y)\,p(z|y) \;\Rightarrow\; x \perp z \mid y

  • Two Triggers One Response

    X = rain? Y = wet sidewalk? Z = spilled coffee?

    31

    [Figure: x → y ← z (two triggers, one response)]

    p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x)\,p(y|x, z)\,p(z)

  • Example Graphical Models

    Are x and z conditionally independent given y?

    32

    p(x, z|y) \stackrel{?}{=} p(x|y)\,p(z|y)

    [Figure: x → y ← z (two triggers, one response)]

    p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x)\,p(y|x, z)\,p(z)

  • Example Graphical Models

    33

    [Figure: x → y ← z (two triggers, one response)]

    p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x)\,p(y|x, z)\,p(z)

    p(x, z|y) = p(x|z, y)\,p(z|y)

    p(x|z, y) = \frac{p(x, y, z)}{p(y, z)} = \frac{p(x)\,p(y|x, z)\,p(z)}{p(y|x, z)\,p(z)} = p(x)

    p(x, z|y) = p(x)\,p(z|y) \neq p(x|y)\,p(z|y) \;\Rightarrow\; x \not\perp z \mid y

  • Factorization

    34

    [Figure: DAG with edges x0 → x1, x0 → x2, x1 → x3, x2 → x4, x1 → x5, x4 → x5]

    p(x_0, x_1, x_2, x_3, x_4, x_5) = \,?

  • Factorization

    35

    [Figure: DAG with edges x0 → x1, x0 → x2, x1 → x3, x2 → x4, x1 → x5, x4 → x5]

    p(x_0, x_1, x_2, x_3, x_4, x_5) = p(x_0)\,p(x_1|x_0)\,p(x_2|x_0)\,p(x_3|x_1)\,p(x_4|x_2)\,p(x_5|x_1, x_4)

  • How Large are the probability tables?

    36

    p(x_0, x_1, x_2, x_3, x_4, x_5) = p(x_0)\,p(x_1|x_0)\,p(x_2|x_0)\,p(x_3|x_1)\,p(x_4|x_2)\,p(x_5|x_1, x_4)
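    A worked count for this factorization, assuming all six variables are binary (the free-parameter convention is standard background, not stated on the slide): p(x0) needs 1 parameter; p(x1|x0), p(x2|x0), p(x3|x1), and p(x4|x2) need 2 each; p(x5|x1, x4) needs 4; 13 in total, versus 2^6 - 1 = 63 for the full joint. A small sketch:

```python
# Free parameters per CPT for binary variables: (2 - 1) * 2**(number of parents)
num_parents = {"x0": 0, "x1": 1, "x2": 1, "x3": 1, "x4": 1, "x5": 2}

factored = sum(2 ** k for k in num_parents.values())    # 1+2+2+2+2+4 = 13
full_joint = 2 ** len(num_parents) - 1                  # 63
print(factored, full_joint)
```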

  • Model Parameters as Nodes

    Treating model parameters as a random variable, we can include these in a graphical model

    Multivariate Bernoulli

    37

    [Figure: parameter nodes θ_0, θ_1, θ_2, each the parent of x_0, x_1, x_2 respectively]

  • Model Parameters as Nodes

    Treating model parameters as a random variable, we can include these in a graphical model

    Multinomial

    38

    [Figure: a shared parameter node θ as the parent of x_0, x_1, x_2]

  • Naïve Bayes Classification

    Observed variables xi are independent given the class variable y

    The distribution can be optimized using maximum likelihood on each variable separately.

    Can easily combine various types of distributions

    39

    [Figure: class node y as the parent of x_0, x_1, x_2]

    p(y|x_0, x_1, x_2) \propto p(x_0, x_1, x_2|y)\,p(y)

    p(y|x_0, x_1, x_2) \propto p(x_0|y)\,p(x_1|y)\,p(x_2|y)\,p(y)
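    A minimal sketch of this classifier for binary features, trained by maximum likelihood on each variable separately (the toy data and the add-one smoothing constant are assumptions for illustration):

```python
import numpy as np

def train_nb(X, y, alpha=1.0):
    """Per-class priors and per-feature Bernoulli likelihoods p(x_i = 1 | y)."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    likelihoods = {c: (X[y == c].sum(axis=0) + alpha) /
                      ((y == c).sum() + 2 * alpha) for c in classes}
    return priors, likelihoods

def predict_nb(x, priors, likelihoods):
    """argmax_y p(y) * prod_i p(x_i | y), computed in log space."""
    scores = {}
    for c, prior in priors.items():
        p = likelihoods[c]
        scores[c] = np.log(prior) + np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    return max(scores, key=scores.get)

X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 0], [0, 1, 0]])
y = np.array([1, 1, 0, 0])
priors, likes = train_nb(X, y)
print(predict_nb(np.array([1, 0, 1]), priors, likes))   # expected: class 1
```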

  • Graphical Models

    Graphical representation of dependency relationships

    Directed Acyclic Graphs

    Nodes as random variables

    Edges define dependency relations

    What can we do with Graphical Models?

    Learn parameters to fit data

    Understand independence relationships between variables

    Perform inference (marginals and conditionals)

    Compute likelihoods for classification

    40

  • Plate Notation

    To indicate a repeated variable, draw a plate around it.

    41

    [Figure: y with children x_0, x_1, ..., x_n, and the equivalent plate notation with x_i inside a plate repeated n times]

  • Completely observed Graphical Model

    Observations for every node

    Simplest (least general) graph, assume each independent

    42

    Suppose we have observations for every node.

    Flu  Fever  Sinus  Ache  Swell  Head
    Y    L      Y      Y     Y      N
    N    M      N      N     N      N
    Y    H      N      N     Y      Y
    Y    M      Y      N     N      Y

    In the simplest (least general) graph, assume each variable is independent. Train 6 separate models.

    [Figure: six disconnected nodes Fl, Fe, Si, Ac, Sw, He]

  • Completely observed Graphical Model

    Observations for every node

    Second simplest graph, assume complete dependence

    43

    Suppose we have observations for every node.

    Flu  Fever  Sinus  Ache  Swell  Head
    Y    L      Y      Y     Y      N
    N    M      N      N     N      N
    Y    H      N      N     Y      Y
    Y    M      Y      N     N      Y

    In the second simplest (most general) graph, assume no independence. Build a 6-dimensional table. (Divide by the total count.)

    [Figure: six fully connected nodes Fl, Fe, Si, Ac, Sw, He]
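    A minimal sketch of the "train 6 separate models" case, estimating each variable's distribution by counting (the records are transcribed from the table above; the dictionary layout is an implementation choice):

```python
from collections import Counter

columns = ["Flu", "Fever", "Sinus", "Ache", "Swell", "Head"]
records = [
    ("Y", "L", "Y", "Y", "Y", "N"),
    ("N", "M", "N", "N", "N", "N"),
    ("Y", "H", "N", "N", "Y", "Y"),
    ("Y", "M", "Y", "N", "N", "Y"),
]

# One independent model per column: p(value) = count(value) / N
models = {}
for j, name in enumerate(columns):
    counts = Counter(row[j] for row in records)
    total = sum(counts.values())
    models[name] = {v: c / total for v, c in counts.items()}

print(models["Flu"])    # {'Y': 0.75, 'N': 0.25}
print(models["Fever"])  # {'L': 0.25, 'M': 0.5, 'H': 0.25}
```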

  • Maximum Likelihood

    Each node has a conditional probability table \theta_i.

    Given the tables, we can construct the pdf.

    Use Maximum Likelihood to find the best settings of \theta.

    44

    [Figure: a DAG over x_0 ... x_5]

    p(x|\theta) = \prod_{i=0}^{M-1} p(x_i \mid \pi_i, \theta_i)

    We have M variables in x, and N data points X.

  • Maximum Likelihood

    \hat{\theta} = \arg\max_\theta \ln p(X|\theta)

    = \arg\max_\theta \sum_{n=0}^{N-1} \ln p(X_n|\theta)

    = \arg\max_\theta \sum_{n=0}^{N-1} \ln \prod_{i=0}^{M-1} p(x_{i,n}|\theta_i)

    = \arg\max_\theta \sum_{n=0}^{N-1} \sum_{i=0}^{M-1} \ln p(x_{i,n}|\theta_i)

    45

  • Count functions

    Count the number of times something appears in the data.

    46

    First, Kronecker's delta function:

    \delta(x_n, x_m) = \begin{cases} 1 & \text{if } x_n = x_m \\ 0 & \text{otherwise} \end{cases}

    Counts: the number of times something appears in the data.

    m(x_i) = \sum_{n=0}^{N-1} \delta(x_i, x_{i,n})

    m(X) = \sum_{n=0}^{N-1} \delta(X, X_n)

    N = \sum_{x_1} m(x_1) = \sum_{x_1} \left( \sum_{x_2} m(x_1, x_2) \right) = \sum_{x_1} \left( \sum_{x_2} \left( \sum_{x_3} m(x_1, x_2, x_3) \right) \right) = \ldots

  • Maximum Likelihood

    47

    l(\theta) = \sum_{n=0}^{N-1} \ln p(X_n|\theta)

    = \sum_{n=0}^{N-1} \ln \prod_{X} p(X|\theta)^{\delta(x_n, X)}

    = \sum_{n=0}^{N-1} \sum_{X} \delta(x_n, X) \ln p(X|\theta)

    = \sum_{X} m(X) \ln p(X|\theta)

    = \sum_{X} m(X) \ln \prod_{i=0}^{M-1} p(x_i \mid \pi_i, \theta_i)

    = \sum_{X} \sum_{i=0}^{M-1} m(X) \ln p(x_i \mid \pi_i, \theta_i)

    = \sum_{i=0}^{M-1} \sum_{x_i, \pi_i} \sum_{X \setminus x_i \setminus \pi_i} m(X) \ln p(x_i \mid \pi_i, \theta_i)

    = \sum_{i=0}^{M-1} \sum_{x_i, \pi_i} m(x_i, \pi_i) \ln p(x_i \mid \pi_i, \theta_i)

    Define a function: \theta(x_i, \pi_i) = p(x_i \mid \pi_i, \theta_i)

    Constraint: \sum_{x_i} \theta(x_i, \pi_i) = 1

  • Maximum Likelihood

    Use Lagrange Multipliers

    48

    l(\theta) = \sum_{i=0}^{M-1} \sum_{x_i} \sum_{\pi_i} m(x_i, \pi_i) \ln \theta(x_i, \pi_i) - \sum_{i=0}^{M-1} \sum_{\pi_i} \lambda_{\pi_i} \left( \sum_{x_i} \theta(x_i, \pi_i) - 1 \right)

    \frac{\partial l(\theta)}{\partial \theta(x_i, \pi_i)} = \frac{m(x_i, \pi_i)}{\theta(x_i, \pi_i)} - \lambda_{\pi_i} = 0

    \theta(x_i, \pi_i) = \frac{m(x_i, \pi_i)}{\lambda_{\pi_i}}

    \sum_{x_i} \frac{m(x_i, \pi_i)}{\lambda_{\pi_i}} = 1 \quad \text{(the constraint)}

    \lambda_{\pi_i} = \sum_{x_i} m(x_i, \pi_i) = m(\pi_i)

    \theta(x_i, \pi_i) = \frac{m(x_i, \pi_i)}{m(\pi_i)} \quad \text{(just counts!)}
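    A minimal sketch of this result, estimating a CPT entry as m(x_i, π_i) / m(π_i) by counting (the records reuse the Fever and Ache columns of the earlier table; treating Fever as the sole parent of Ache is an assumed example structure, not one given on the slides):

```python
from collections import Counter

# Records as (Fever, Ache) pairs, taken from the observed table above.
data = [("L", "Y"), ("M", "N"), ("H", "N"), ("M", "N")]

pair_counts = Counter(data)                       # m(x_i, pi_i)
parent_counts = Counter(f for f, _ in data)       # m(pi_i)

def theta(ache, fever):
    """Maximum-likelihood CPT entry p(Ache = ache | Fever = fever)."""
    return pair_counts[(fever, ache)] / parent_counts[fever]

print(theta("Y", "L"))   # 1.0: the single Fever=L record has Ache=Y
print(theta("N", "M"))   # 1.0: both Fever=M records have Ache=N
```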

  • Maximum A Posteriori Training

    A Bayesian would never do that; the thetas need a prior. Adding a prior pseudo-count \alpha gives:

    49

    \theta(x_i, \pi_i) = \frac{m(x_i, \pi_i) + \alpha}{m(\pi_i) + \alpha |x_i|}

  • Conditional Dependence Test

    We can check conditional independence in a graphical model.

    Is achiness (x3) independent of the flu (x0) given fever (x1)?

    Is achiness (x3) independent of sinus infections (x2) given fever (x1)?

    50

    p(x) = p(x_0)\,p(x_1|x_0)\,p(x_2|x_0)\,p(x_3|x_1)\,p(x_4|x_2)\,p(x_5|x_1, x_4)

    p(x_3|x_0, x_1, x_2) = \frac{p(x_0, x_1, x_2, x_3)}{p(x_0, x_1, x_2)} = \frac{p(x_0)\,p(x_1|x_0)\,p(x_2|x_0)\,p(x_3|x_1)}{p(x_0)\,p(x_1|x_0)\,p(x_2|x_0)} = p(x_3|x_1)

    x_3 \perp x_0, x_2 \mid x_1

  • D-Separation and Bayes Ball

    Intuition: nodes are separated, or blocked, by sets of nodes.

    Example: nodes x1 and x2 block the path from x0 to x5, so x0 is conditionally independent of x5 given x1 and x2.

    51

    [Figure: a DAG over x_0 ... x_5]

    x_0 \perp x_5 \mid x_1, x_2

  • Bayes Ball Algorithm

    Shade the nodes in x_c.

    Place a ball at each node in x_a.

    Bounce the balls around the graph according to the rules.

    If no balls reach x_b, then conditional independence holds:

    52

    x_a \perp x_b \mid x_c

  • Ten rules of Bayes Ball Theorem

    53

  • Bayes Ball Example

    x_0 \perp x_4 \mid x_2 \,?

    [Figure: a DAG over x_0 ... x_5 with x_2 shaded as observed]

    54

  • Bayes Ball Example

    x_0 \perp x_5 \mid x_1, x_2 \,?

    [Figure: a DAG over x_0 ... x_5 with x_1 and x_2 shaded as observed]

    55

  • Undirected Graphs

    What if we allow undirected graphs? What do they correspond to?

    Not cause/effect or trigger/response, but general dependence.

    Example: image pixels, where each pixel is a Bernoulli variable.

    P(x_{1,1}, \ldots, x_{1,M}, \ldots, x_{M,1}, \ldots, x_{M,M})

    Bright pixels have bright neighbors.

    No parents, just probabilities.

    Grid models are called Markov Random Fields.

    56

  • Undirected Graphs

    Undirected separability is easy.

    To check conditional independence of A and B given C, check the graph reachability of A and B without going through nodes in C.

    57

    [Figure: an undirected graph over node groups A, B, C, D, with C separating A from B]
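    A minimal sketch of this reachability check as a breadth-first search that is not allowed to pass through C (the example graph is an assumption for illustration):

```python
from collections import deque

def separated(graph, a, b, c):
    """True if every path from a to b passes through a node in c
    (i.e., a and b are conditionally independent given c in an undirected model)."""
    blocked, frontier, seen = set(c), deque([a]), {a}
    while frontier:
        node = frontier.popleft()
        if node == b:
            return False                     # reached b without touching c
        for nbr in graph[node]:
            if nbr not in seen and nbr not in blocked:
                seen.add(nbr)
                frontier.append(nbr)
    return True

# Toy chain A - C - B plus a side node D attached to A
graph = {"A": {"C", "D"}, "C": {"A", "B"}, "B": {"C"}, "D": {"A"}}
print(separated(graph, "A", "B", {"C"}))    # True: C blocks the only path
print(separated(graph, "A", "B", set()))    # False: A reaches B through C
```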

  • Next Time

    More fun with Graphical Models

    Read Chapter 8.1, 8.2

    58