Andrew Rosenberg - Lecture 1.2: Probability and Statistics, CSC 84020 - Machine Learning



Lecture 1.2: Probability and Statistics
CSC 84020 - Machine Learning

    Andrew Rosenberg

    January 29, 2009


    Today

    Probability and Statistics


    Background

    What exposure have you had to probability and statistics?

Conditional probabilities? Bayes rule? The difference between a posterior, a conditional, and a prior?


Artificial Intelligence

Classical Artificial Intelligence

Expert Systems, Theorem Provers, Shakey, Chess

    Largely characterised by determinism.


Artificial Intelligence

Modern Artificial Intelligence

Fingerprint ID, Internet Search, Vision (facial ID, etc.), Speech Recognition, Asimo, Jeopardy (http://www.research.ibm.com/deepqa/)

    Statistical modeling to generalize from data.


    Natural Intelligence?

    Brief Tangent

Is there a role for probability and statistics in Natural Intelligence?


    Caveats about Probability and Statistics

    Black Swans and The Long Tail.


    Black Swans

In the 17th century, all observed swans were white.

Therefore, based on evidence, it was deemed impossible for a swan to be anything other than white.


In the early 18th century, black swans were discovered in Western Australia.

Black Swans are rare, sometimes unpredictable events that have extreme impact.

Almost all statistical models underestimate the likelihood of unseen events.


    The Long Tail

    Many events follow an exponential distribution.

These distributions typically have a very long tail; that is, a long region with relatively low probability mass.

Often, interesting events occur in the long tail, but it is difficult to accurately model the behavior in this region of the distribution.


    Probability Theory

Example: Boxes and Fruit.

    Two boxes: 1 red, 1 blue

In the red box there are 2 apples and 6 oranges. In the blue box there are 3 apples and 1 orange.


    Boxes and Fruit

Suppose we draw from the Red Box 40% of the time and the Blue Box 60% of the time.

We are equally likely to draw any piece of fruit once the box is selected.

The identity of the box is a random variable, B. The identity of the fruit is a random variable, F.

B can take one of two values: r (red box) or b (blue box). F can take one of two values: a (apple) or o (orange).

We want to answer questions like "What is the total probability of picking an apple?" and "Given that I chose an orange, what is the probability that it was drawn from the blue box?"

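As a rough preview (not part of the original slides), both questions can be estimated by brute-force simulation before we develop the formal machinery. The box prior and per-box fruit counts below come from the setup above; the code structure and names are just one possible sketch.

```python
import random

random.seed(0)

# Setup from the slides: box prior and fruit counts per box.
p_box = {"red": 0.4, "blue": 0.6}
boxes = {"red": {"apple": 2, "orange": 6}, "blue": {"apple": 3, "orange": 1}}

def draw():
    """Pick a box according to the prior, then a fruit uniformly from that box."""
    box = random.choices(list(p_box), weights=list(p_box.values()))[0]
    fruit = random.choices(list(boxes[box]), weights=list(boxes[box].values()))[0]
    return box, fruit

samples = [draw() for _ in range(100_000)]

p_apple = sum(f == "apple" for _, f in samples) / len(samples)
orange_draws = [b for b, f in samples if f == "orange"]
p_blue_given_orange = sum(b == "blue" for b in orange_draws) / len(orange_draws)

print(f"P(apple)         ~ {p_apple:.3f}")              # exact answer: 0.55
print(f"P(blue | orange) ~ {p_blue_given_orange:.3f}")  # exact answer: 1/3
```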

    Some basics

The probability of an event is the fraction of times that the event occurs out of some number of trials, as the number of trials approaches infinity. Probabilities lie in the range [0, 1].

Mutually exclusive events are events that cannot occur simultaneously. The sum of the probabilities of a set of mutually exclusive, exhaustive events must equal 1.

If two events are independent, p(X, Y) = p(X) p(Y) and p(X|Y) = p(X).


    Joint Probability

    Joint probability table of the example.

            F = o   F = a   total
B = blue      1       3       4
B = red       6       2       8
total         7       5      12

Let n_ij be the number of times event i and event j simultaneously occur; for example, selecting an orange from the blue box.

p(X = x_i, Y = y_j) = n_ij / N

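A minimal sketch (not from the slides) of this definition: build the joint probabilities p(X = x_i, Y = y_j) = n_ij / N from the counts in the table above.

```python
# Counts n_ij from the table above: (box, fruit) -> number of co-occurrences.
counts = {
    ("blue", "orange"): 1, ("blue", "apple"): 3,
    ("red", "orange"): 6, ("red", "apple"): 2,
}
N = sum(counts.values())  # total number of trials, 12

joint = {pair: n / N for pair, n in counts.items()}
for (box, fruit), p in sorted(joint.items()):
    print(f"p(B={box}, F={fruit}) = {p:.3f}")
```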

    Joint Probability

A more general representation of a joint probability table: let n_ij be the number of times event i and event j simultaneously occur (for example, selecting an orange from the blue box). Then

r_j = Σ_i n_ij        (row total for Y = y_j)
c_i = Σ_j n_ij        (column total for X = x_i)
N = Σ_i Σ_j n_ij      (total number of trials)

p(X = x_i, Y = y_j) = n_ij / N


    Marginalization

Now consider the probability of X irrespective of Y.

p(X = x_i) = c_i / N

The number of instances in column i is the sum of the instances in each cell of that column:

c_i = Σ_{j=1}^{L} n_ij

    Therefore, we can marginalize or sum over Y:

p(X = x_i) = Σ_{j=1}^{L} p(X = x_i, Y = y_j)

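Continuing with the same hypothetical count table, a short sketch of marginalization: summing the joint distribution over the box recovers the marginal distribution of the fruit.

```python
from collections import defaultdict

counts = {("blue", "orange"): 1, ("blue", "apple"): 3,
          ("red", "orange"): 6, ("red", "apple"): 2}
N = sum(counts.values())
joint = {pair: n / N for pair, n in counts.items()}

# Sum rule: p(F = f) = sum over boxes b of p(B = b, F = f)
p_fruit = defaultdict(float)
for (box, fruit), p in joint.items():
    p_fruit[fruit] += p

print(dict(p_fruit))  # {'orange': 7/12 ~ 0.583, 'apple': 5/12 ~ 0.417}
```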

    Conditional Probability

Now consider only instances where X = x_i. The fraction of these instances where Y = y_j is written p(Y = y_j | X = x_i). This is a conditional probability: the probability of y given x.

p(Y = y_j | X = x_i) = n_ij / c_i

    Also,

p(X = x_i, Y = y_j) = n_ij / N
                    = (n_ij / c_i)(c_i / N)
                    = p(Y = y_j | X = x_i) p(X = x_i)

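The same counts also give the conditionals via p(Y = y_j | X = x_i) = n_ij / c_i; a small sketch, not part of the slides.

```python
counts = {("blue", "orange"): 1, ("blue", "apple"): 3,
          ("red", "orange"): 6, ("red", "apple"): 2}

# c_i: total number of draws from each box.
box_totals = {}
for (box, _fruit), n in counts.items():
    box_totals[box] = box_totals.get(box, 0) + n

# p(F = fruit | B = box) = n_ij / c_i
for (box, fruit), n in sorted(counts.items()):
    print(f"p(F={fruit} | B={box}) = {n}/{box_totals[box]} = {n / box_totals[box]:.3f}")
```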

    Sum and Product Rules

In general we will use p(X) to refer to a distribution over a random variable, and p(x_i) to refer to the distribution evaluated at a particular value.

Sum Rule

p(X) = Σ_Y p(X, Y)

    Product Rule

p(X, Y) = p(Y|X) p(X)


    Bayes Theorem

p(Y|X) = p(X|Y) p(Y) / p(X)

    The denominator can be viewed as a normalization term:

p(X) = Σ_Y p(X|Y) p(Y)


    Return to Boxes and Fruit

Now we can return to the question "If an orange was chosen, which box did it come from?", or define the distribution p(B | F = o).

p(B = r | F = o) = p(F = o | B = r) p(B = r) / p(F = o)
                 = (3/4)(4/10) / (9/20)
                 = (3/4)(4/10)(20/9)
                 = 2/3

p(B = b | F = o) = 1 - 2/3 = 1/3

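The arithmetic can be checked mechanically; here is a minimal sketch using exact fractions, with the prior p(B = r) = 4/10 and the per-box fruit proportions from the earlier slides.

```python
from fractions import Fraction

p_box = {"r": Fraction(4, 10), "b": Fraction(6, 10)}         # prior p(B)
p_orange_given = {"r": Fraction(6, 8), "b": Fraction(1, 4)}  # p(F=o | B)

# Normalizer: p(F=o) = sum_B p(F=o | B) p(B)
p_orange = sum(p_orange_given[b] * p_box[b] for b in p_box)

# Bayes theorem: p(B | F=o) = p(F=o | B) p(B) / p(F=o)
posterior = {b: p_orange_given[b] * p_box[b] / p_orange for b in p_box}

print(p_orange)   # 9/20
print(posterior)  # {'r': Fraction(2, 3), 'b': Fraction(1, 3)}
```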

    Interpretation of Bayes Rule

p(B|F) = p(F|B) p(B) / p(F)

p(B) is called the prior of B. This is information we have before observing anything about the fruit that was drawn.

p(B|F) is called the posterior probability, or simply the posterior. This is the distribution of B after observing F. In our example, the prior probability of B = r was 4/10, but the posterior was 2/3.

The probability that the box was red increased after observation of F.


    Continuous Probabilities

So far we have been dealing with discrete probabilities, where X can take one of M discrete values. What if X could take continuous values?

    (Enter calculus.)

The probability of a real-valued random variable falling within (x, x + δx) is p(x) δx as δx → 0. p(x) is the probability density, or probability density function, over x. Thus the probability that x will lie in an interval (a, b) is given by:

p(x ∈ (a, b)) = ∫_a^b p(x) dx

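As a quick sketch of this interval probability (not from the slides), the integral can be approximated numerically; the density used here, a standard normal, is just an assumed example.

```python
import math

def p(x):
    """Assumed example density: standard normal, N(x | 0, 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def prob_interval(a, b, steps=10_000):
    """Approximate the integral of p over (a, b) with the midpoint rule."""
    width = (b - a) / steps
    return sum(p(a + (i + 0.5) * width) for i in range(steps)) * width

print(prob_interval(-1.0, 1.0))    # ~0.683
print(prob_interval(-10.0, 10.0))  # ~1.0: a density integrates to 1
```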

    Graphical Example of continuous probabilities.

    Continuous Probability Identities


For a probability density p(x):

p(x) ≥ 0

∫ p(x) dx = 1

Sum Rule

p(x) = ∫ p(x, y) dy

Product Rule

p(x, y) = p(y|x) p(x)

    Expected Values


Given a random variable x characterized by a distribution p(x), what is the expected value of x?


The expectation of x:

E[x] = Σ_x p(x) x

or

E[x] = ∫ p(x) x dx

    Expected Values Example 1


    What is the expected value when rolling one die?

x    p(x)
1    1/6
2    1/6
3    1/6
4    1/6
5    1/6
6    1/6


E[x] = Σ_x p(x) x
     = 1·(1/6) + 2·(1/6) + 3·(1/6) + 4·(1/6) + 5·(1/6) + 6·(1/6)
     = 21/6
     = 3.5


The expectation can also be approximated from N samples by the sample mean:

E[x] ≈ (1/N) Σ_{i=1}^{N} x_i

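Both formulas are easy to check for the die; a minimal sketch (assuming a fair die) computing the exact sum and the sample-mean approximation.

```python
import random

# Exact expectation: each face has probability 1/6.
faces = list(range(1, 7))
exact = sum(x / 6 for x in faces)

# Sample-mean approximation: E[x] ~ (1/N) * sum_i x_i
random.seed(0)
N = 100_000
rolls = [random.choice(faces) for _ in range(N)]
approx = sum(rolls) / N

print(exact)   # 3.5
print(approx)  # close to 3.5
```
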
    Expected Values Example 2


    What is the expected value when rolling two dice?


x     p(x)
2     1/36
3     2/36
4     3/36
5     4/36
6     5/36
7     6/36
8     5/36
9     4/36
10    3/36
11    2/36
12    1/36


E[x] = Σ_x p(x) x
     = 2·(1/36) + 3·(2/36) + 4·(3/36) + 5·(4/36) + 6·(5/36) + 7·(6/36)
       + 8·(5/36) + 9·(4/36) + 10·(3/36) + 11·(2/36) + 12·(1/36)
     = 252/36
     = 7

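The table and the value of 7 can be reproduced by enumerating all 36 equally likely outcomes of the two dice; a small sketch, not part of the slides.

```python
from collections import Counter
from itertools import product

# All 36 equally likely ordered outcomes of two fair dice.
totals = Counter(a + b for a, b in product(range(1, 7), repeat=2))

expectation = sum(total * count / 36 for total, count in totals.items())

for total in sorted(totals):
    print(f"p({total}) = {totals[total]}/36")
print("E[x] =", expectation)  # 7.0
```
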
Distribution of Dice values

Distribution of values of one die (figure)

Distribution of values of two dice (figure)

Distribution of values of three dice (figure)

Distribution of values of four dice (figure)

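The distributions shown in these figures are easy to recompute; here is a sketch (not part of the slides) that builds the distribution of the sum of n fair dice by repeated convolution and prints a crude text histogram.

```python
from collections import defaultdict

def dice_sum_distribution(n):
    """Distribution of the sum of n fair six-sided dice."""
    dist = {0: 1.0}
    for _ in range(n):
        new = defaultdict(float)
        for total, p in dist.items():
            for face in range(1, 7):
                new[total + face] += p / 6
        dist = dict(new)
    return dist

for n in (1, 2, 3, 4):
    print(f"-- sum of {n} dice --")
    dist = dice_sum_distribution(n)
    for total in sorted(dist):
        print(f"{total:3d} {'#' * round(dist[total] * 120)}")
```
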
    Multinomial Distribution


If a variable x can take 1-of-K states, we can represent this variable as being drawn from a multinomial distribution.

We say the probability of x being a member of state k is μ_k, an element of a vector μ, where

Σ_{k=1}^{K} μ_k = 1

p(x | μ) = Π_{k=1}^{K} μ_k^{x_k}

    Expected Value of a Multinomial Distribution


Expectation

E[x | μ] = Σ_x p(x | μ) x = (μ_1, μ_2, ..., μ_K)^T

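A small sketch of the 1-of-K representation, the product form of p(x | μ), and the expectation above; the parameter vector μ is an arbitrary example.

```python
import math

mu = [0.2, 0.5, 0.3]  # example parameters; they sum to 1

def pmf(x, mu):
    """p(x | mu) = prod_k mu_k ** x_k, for a 1-of-K (one-hot) vector x."""
    return math.prod(m ** xk for m, xk in zip(mu, x))

print(pmf([0, 1, 0], mu))  # probability of state 2, i.e. mu_2 = 0.5

# E[x | mu] = sum_x p(x | mu) x, where the sum runs over the K one-hot vectors.
K = len(mu)
one_hot = [[1 if i == k else 0 for i in range(K)] for k in range(K)]
expectation = [sum(pmf(x, mu) * x[k] for x in one_hot) for k in range(K)]
print(expectation)         # [0.2, 0.5, 0.3], i.e. mu itself
```
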
    Gaussian Distribution


As the number of dice increases, the distribution of their sum approaches a Gaussian Distribution, or Normal Distribution.

One dimensional:

N(x | μ, σ²) = (1 / √(2πσ²)) exp( -(1/(2σ²)) (x - μ)² )

D-dimensional:

N(x | μ, Σ) = (1 / ((2π)^(D/2) |Σ|^(1/2))) exp( -(1/2) (x - μ)^T Σ^(-1) (x - μ) )

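A sketch of the one-dimensional density (arbitrary example values of μ and σ²), with a crude numerical check that it integrates to 1.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """N(x | mu, sigma^2) = 1/sqrt(2*pi*sigma^2) * exp(-(x - mu)^2 / (2*sigma^2))"""
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

mu, sigma2 = 1.0, 2.0
print(gaussian_pdf(mu, mu, sigma2))  # density at the mean, ~0.282

# Midpoint-rule check that the density integrates to ~1 over a wide interval.
step = 0.001
total = sum(gaussian_pdf(mu - 20 + (i + 0.5) * step, mu, sigma2) * step
            for i in range(int(40 / step)))
print(total)  # ~1.0
```
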
    Gaussian Example

Image from Wikipedia.

    Expectation of a Gaussian


E[x | μ, σ²] = ∫ N(x | μ, σ²) x dx
             = ∫ (1 / √(2πσ²)) exp( -(1/(2σ²)) (x - μ)² ) x dx

or

E[x | μ, Σ] = ∫ N(x | μ, Σ) x dx
            = ∫ (1 / ((2π)^(D/2) |Σ|^(1/2))) exp( -(1/2) (x - μ)^T Σ^(-1) (x - μ) ) x dx

We'll need some calculus for this, so next time.

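Before doing the calculus, the answer can at least be checked numerically: a sketch approximating ∫ N(x | μ, σ²) x dx on a grid for example values of μ and σ²; it comes out at μ.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

mu, sigma2 = 3.0, 4.0
step = 0.001
lo, hi = mu - 40.0, mu + 40.0  # wide enough to capture essentially all of the mass

# E[x] ~ sum_i N(x_i | mu, sigma^2) * x_i * step   (midpoint rule)
xs = (lo + (i + 0.5) * step for i in range(int((hi - lo) / step)))
expectation = sum(gaussian_pdf(x, mu, sigma2) * x * step for x in xs)
print(expectation)  # ~3.0, i.e. the mean mu
```
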
    Variances


The variance of x describes how much variability there is around the mean, E[x].

var[f] = E[(f(x) - E[f(x)])²]

var[f] = E[f(x)²] - E[f(x)]²

    Covariance


The covariance of two random variables, x and y, expresses to what extent the two vary together.

cov[x, y] = E_{x,y}[(x - E[x])(y - E[y])]
          = E_{x,y}[xy] - E[x] E[y]

If two variables are independent, their covariance equals zero. (Know how to prove this.)

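A sketch checking both identities on random samples (the distributions are arbitrary examples): the two variance formulas agree, and the covariance of two independently drawn variables is close to zero.

```python
import random

random.seed(0)
N = 200_000
xs = [random.gauss(2.0, 1.5) for _ in range(N)]    # example distribution for x
ys = [random.uniform(0.0, 1.0) for _ in range(N)]  # drawn independently of x

def mean(vals):
    return sum(vals) / len(vals)

mx, my = mean(xs), mean(ys)

# var[x] = E[(x - E[x])^2]  and  var[x] = E[x^2] - E[x]^2
var_a = mean([(x - mx) ** 2 for x in xs])
var_b = mean([x * x for x in xs]) - mx ** 2

# cov[x, y] = E[xy] - E[x]E[y]; near zero here because x and y are independent
cov = mean([x * y for x, y in zip(xs, ys)]) - mx * my

print(var_a, var_b)  # both ~2.25 (= 1.5^2)
print(cov)           # ~0
```
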
How does Machine Learning use Probabilities?


The expectation of a function is the guess. The covariance is the confidence in this guess.

These are simple operations...

But how can we find the best estimate of p(x)?

    Bye


    Next

Linear Algebra and Vector Calculus
