Andrew Rosenberg - Lecture 1.2: Probability and Statistics, CSC 84020 - Machine Learning



Lecture 1.2: Probability and Statistics
CSC 84020 - Machine Learning

    Andrew Rosenberg

    January 29, 2009


    Today

    Probability and Statistics


    Background

    What exposure have you had to probability and statistics?

Conditional probabilities? Bayes rule? The difference between a posterior, a conditional, and a prior?


Artificial Intelligence

Classical Artificial Intelligence

Expert Systems, Theorem Provers, Shakey, Chess

    Largely characterised by determinism.


Artificial Intelligence

Modern Artificial Intelligence

Fingerprint ID, Internet Search, Vision (facial ID, etc.), Speech Recognition, Asimo, Jeopardy (http://www.research.ibm.com/deepqa/)

    Statistical modeling to generalize from data.


    Natural Intelligence?

    Brief Tangent

Is there a role for probability and statistics in Natural Intelligence?


    Caveats about Probability and Statistics

    Black Swans and The Long Tail.


    Black Swans

In the 17th century, all observed swans were white.

Therefore, based on evidence, it was deemed impossible for a swan to be anything other than white.


In the early 18th century, black swans were discovered in Western Australia.

Black Swans are rare, sometimes unpredictable events that have extreme impact.

Almost all statistical models underestimate the likelihood of unseen events.


    The Long Tail

    Many events follow an exponential distribution.

These distributions typically have a very long tail; that is, a long region with relatively low probability mass.

Often, interesting events occur in the long tail, but it is difficult to accurately model the behavior in this region of the distribution.


    Probability Theory

Example: Boxes and Fruit.

    Two boxes: 1 red, 1 blue

In the red box there are 2 apples and 6 oranges. In the blue box there are 3 apples and 1 orange.


    Boxes and Fruit

Suppose we draw from the Red Box 40% of the time and the Blue Box 60% of the time.

We are equally likely to draw any piece of fruit once the box is selected.

The identity of the box is a random variable, B. The identity of the fruit is a random variable, F.

B can take one of two values: r (red box) or b (blue box). F can take one of two values: a (apple) or o (orange).

We want to answer questions like "What is the total probability of picking an apple?" and "Given that I chose an orange, what is the probability that it was drawn from the blue box?"

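As a rough preview (not part of the original slides), both questions can be estimated by brute-force simulation before we develop the formal machinery. The box prior and per-box fruit counts below come from the setup above; the code structure and names are just one possible sketch.

```python
import random

random.seed(0)

# Setup from the slides: box prior and fruit counts per box.
p_box = {"red": 0.4, "blue": 0.6}
boxes = {"red": {"apple": 2, "orange": 6}, "blue": {"apple": 3, "orange": 1}}

def draw():
    """Pick a box according to the prior, then a fruit uniformly from that box."""
    box = random.choices(list(p_box), weights=list(p_box.values()))[0]
    fruit = random.choices(list(boxes[box]), weights=list(boxes[box].values()))[0]
    return box, fruit

samples = [draw() for _ in range(100_000)]

p_apple = sum(f == "apple" for _, f in samples) / len(samples)
orange_draws = [b for b, f in samples if f == "orange"]
p_blue_given_orange = sum(b == "blue" for b in orange_draws) / len(orange_draws)

print(f"P(apple)         ~ {p_apple:.3f}")              # exact answer: 0.55
print(f"P(blue | orange) ~ {p_blue_given_orange:.3f}")  # exact answer: 1/3
```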

    Some basics

The probability of an event is the fraction of times that the event occurs out of some number of trials, as the number of trials approaches infinity. Probabilities lie in the range [0, 1].

Mutually exclusive events are events that cannot occur simultaneously. The sum of the probabilities of a set of mutually exclusive, exhaustive events must equal 1.

If two events are independent, p(X, Y) = p(X) p(Y) and p(X|Y) = p(X).


    Joint Probability

    Joint probability table of the example.

            F = o   F = a   total
B = blue      1       3       4
B = red       6       2       8
total         7       5      12

Let n_ij be the number of times event i and event j simultaneously occur; for example, selecting an orange from the blue box.

p(X = x_i, Y = y_j) = n_ij / N

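A minimal sketch (not from the slides) of this definition: build the joint probabilities p(X = x_i, Y = y_j) = n_ij / N from the counts in the table above.

```python
# Counts n_ij from the table above: (box, fruit) -> number of co-occurrences.
counts = {
    ("blue", "orange"): 1, ("blue", "apple"): 3,
    ("red", "orange"): 6, ("red", "apple"): 2,
}
N = sum(counts.values())  # total number of trials, 12

joint = {pair: n / N for pair, n in counts.items()}
for (box, fruit), p in sorted(joint.items()):
    print(f"p(B={box}, F={fruit}) = {p:.3f}")
```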

    Joint Probability

A more general representation of a joint probability table: let n_ij be the number of times event i and event j simultaneously occur (for example, selecting an orange from the blue box). Then

r_j = Σ_i n_ij        (row total for Y = y_j)
c_i = Σ_j n_ij        (column total for X = x_i)
N = Σ_i Σ_j n_ij      (total number of trials)

p(X = x_i, Y = y_j) = n_ij / N


    Marginalization

Now consider the probability of X irrespective of Y.

p(X = x_i) = c_i / N

The number of instances in column i is the sum of the instances in each cell of that column:

c_i = Σ_{j=1}^{L} n_ij

    Therefore, we can marginalize or sum over Y:

p(X = x_i) = Σ_{j=1}^{L} p(X = x_i, Y = y_j)

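Continuing with the same hypothetical count table, a short sketch of marginalization: summing the joint distribution over the box recovers the marginal distribution of the fruit.

```python
from collections import defaultdict

counts = {("blue", "orange"): 1, ("blue", "apple"): 3,
          ("red", "orange"): 6, ("red", "apple"): 2}
N = sum(counts.values())
joint = {pair: n / N for pair, n in counts.items()}

# Sum rule: p(F = f) = sum over boxes b of p(B = b, F = f)
p_fruit = defaultdict(float)
for (box, fruit), p in joint.items():
    p_fruit[fruit] += p

print(dict(p_fruit))  # {'orange': 7/12 ~ 0.583, 'apple': 5/12 ~ 0.417}
```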

    Conditional Probability

Now consider only instances where X = x_i. The fraction of these instances where Y = y_j is written p(Y = y_j | X = x_i). This is a conditional probability: the probability of y given x.

p(Y = y_j | X = x_i) = n_ij / c_i

    Also,

p(X = x_i, Y = y_j) = n_ij / N
                    = (n_ij / c_i)(c_i / N)
                    = p(Y = y_j | X = x_i) p(X = x_i)

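The same counts also give the conditionals via p(Y = y_j | X = x_i) = n_ij / c_i; a small sketch, not part of the slides.

```python
counts = {("blue", "orange"): 1, ("blue", "apple"): 3,
          ("red", "orange"): 6, ("red", "apple"): 2}

# c_i: total number of draws from each box.
box_totals = {}
for (box, _fruit), n in counts.items():
    box_totals[box] = box_totals.get(box, 0) + n

# p(F = fruit | B = box) = n_ij / c_i
for (box, fruit), n in sorted(counts.items()):
    print(f"p(F={fruit} | B={box}) = {n}/{box_totals[box]} = {n / box_totals[box]:.3f}")
```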

    Sum and Product Rules

In general we will use p(X) to refer to a distribution over a random variable, and p(x_i) to refer to the distribution evaluated at a particular value.

Sum Rule

p(X) = Σ_Y p(X, Y)

    Product Rule

p(X, Y) = p(Y|X) p(X)


    Bayes Theorem

p(Y|X) = p(X|Y) p(Y) / p(X)

    The denominator can be viewed as a normalization term:

p(X) = Σ_Y p(X|Y) p(Y)


    Return to Boxes and Fruit

Now we can return to the question "If an orange was chosen, which box did it come from?", or define the distribution p(B | F = o).

p(B = r | F = o) = p(F = o | B = r) p(B = r) / p(F = o)
                 = (3/4)(4/10) / (9/20)
                 = (3/4)(4/10)(20/9)
                 = 2/3

p(B = b | F = o) = 1 - 2/3 = 1/3

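The arithmetic can be checked mechanically; here is a minimal sketch using exact fractions, with the prior p(B = r) = 4/10 and the per-box fruit proportions from the earlier slides.

```python
from fractions import Fraction

p_box = {"r": Fraction(4, 10), "b": Fraction(6, 10)}         # prior p(B)
p_orange_given = {"r": Fraction(6, 8), "b": Fraction(1, 4)}  # p(F=o | B)

# Normalizer: p(F=o) = sum_B p(F=o | B) p(B)
p_orange = sum(p_orange_given[b] * p_box[b] for b in p_box)

# Bayes theorem: p(B | F=o) = p(F=o | B) p(B) / p(F=o)
posterior = {b: p_orange_given[b] * p_box[b] / p_orange for b in p_box}

print(p_orange)   # 9/20
print(posterior)  # {'r': Fraction(2, 3), 'b': Fraction(1, 3)}
```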

    Interpretation of Bayes Rule

p(B|F) = p(F|B) p(B) / p(F)

p(B) is called the prior of B. This is information we have before observing anything about the fruit that was drawn.

p(B|F) is called the posterior probability, or simply the posterior. This is the distribution of B after observing F. In our example, the prior probability of B = r was 4/10, but the posterior was 2/3.

The probability that the box was red increased after observation of F.


    Continuous Probabilities

So far we have been dealing with discrete probabilities, where X can take one of M discrete values. What if X could take continuous values?

    (Enter calculus.)

The probability of a real-valued random variable falling within (x, x + δx) is p(x) δx as δx → 0. p(x) is the probability density, or probability density function, over x. Thus the probability that x will lie in an interval (a, b) is given by:

p(x ∈ (a, b)) = ∫_a^b p(x) dx

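As a quick sketch of this interval probability (not from the slides), the integral can be approximated numerically; the density used here, a standard normal, is just an assumed example.

```python
import math

def p(x):
    """Assumed example density: standard normal, N(x | 0, 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def prob_interval(a, b, steps=10_000):
    """Approximate the integral of p over (a, b) with the midpoint rule."""
    width = (b - a) / steps
    return sum(p(a + (i + 0.5) * width) for i in range(steps)) * width

print(prob_interval(-1.0, 1.0))    # ~0.683
print(prob_interval(-10.0, 10.0))  # ~1.0: a density integrates to 1
```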

    Graphical Example of continuous probabilities.

    Continuous Probability Identities


For a probability density p(x):

p(x) ≥ 0

∫ p(x) dx = 1

Sum Rule

p(x) = ∫ p(x, y) dy

Product Rule

p(x, y) = p(y|x) p(x)

    Expected Values


Given a random variable x characterized by a distribution p(x), what is the expected value of x?


The expectation of x:

E[x] = Σ_x p(x) x

or

E[x] = ∫ p(x) x dx

    Expected Values Example 1


    What is the expected value when rolling one die?

x    p(x)
1    1/6
2    1/6
3    1/6
4    1/6
5    1/6
6    1/6


E[x] = Σ_x p(x) x
     = 1·(1/6) + 2·(1/6) + 3·(1/6) + 4·(1/6) + 5·(1/6) + 6·(1/6)
     = 21/6
     = 3.5


The expectation can also be approximated from N samples by the sample mean:

E[x] ≈ (1/N) Σ_{i=1}^{N} x_i

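Both formulas are easy to check for the die; a minimal sketch (assuming a fair die) computing the exact sum and the sample-mean approximation.

```python
import random

# Exact expectation: each face has probability 1/6.
faces = list(range(1, 7))
exact = sum(x / 6 for x in faces)

# Sample-mean approximation: E[x] ~ (1/N) * sum_i x_i
random.seed(0)
N = 100_000
rolls = [random.choice(faces) for _ in range(N)]
approx = sum(rolls) / N

print(exact)   # 3.5
print(approx)  # close to 3.5
```
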
    Expected Values Example 2


    What is the expected value when rolling two dice?


x     p(x)
2     1/36
3     2/36
4     3/36
5     4/36
6     5/36
7     6/36
8     5/36
9     4/36
10    3/36
11    2/36
12    1/36


E[x] = Σ_x p(x) x
     = 2·(1/36) + 3·(2/36) + 4·(3/36) + 5·(4/36) + 6·(5/36) + 7·(6/36)
       + 8·(5/36) + 9·(4/36) + 10·(3/36) + 11·(2/36) + 12·(1/36)
     = 252/36
     = 7

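The table and the value of 7 can be reproduced by enumerating all 36 equally likely outcomes of the two dice; a small sketch, not part of the slides.

```python
from collections import Counter
from itertools import product

# All 36 equally likely ordered outcomes of two fair dice.
totals = Counter(a + b for a, b in product(range(1, 7), repeat=2))

expectation = sum(total * count / 36 for total, count in totals.items())

for total in sorted(totals):
    print(f"p({total}) = {totals[total]}/36")
print("E[x] =", expectation)  # 7.0
```
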
Distribution of Dice values

Distribution of values of one die (figure)

Distribution of values of two dice (figure)

Distribution of values of three dice (figure)

Distribution of values of four dice (figure)

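The distributions shown in these figures are easy to recompute; here is a sketch (not part of the slides) that builds the distribution of the sum of n fair dice by repeated convolution and prints a crude text histogram.

```python
from collections import defaultdict

def dice_sum_distribution(n):
    """Distribution of the sum of n fair six-sided dice."""
    dist = {0: 1.0}
    for _ in range(n):
        new = defaultdict(float)
        for total, p in dist.items():
            for face in range(1, 7):
                new[total + face] += p / 6
        dist = dict(new)
    return dist

for n in (1, 2, 3, 4):
    print(f"-- sum of {n} dice --")
    dist = dice_sum_distribution(n)
    for total in sorted(dist):
        print(f"{total:3d} {'#' * round(dist[total] * 120)}")
```
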
    Multinomial Distribution


If a variable x can take 1-of-K states, we can represent this variable as being drawn from a multinomial distribution.

We say the probability of x being a member of state k is μ_k, an element of a vector μ, where

Σ_{k=1}^{K} μ_k = 1

p(x | μ) = Π_{k=1}^{K} μ_k^{x_k}

    Expected Value of a Multinomial Distribution


Expectation

E[x | μ] = Σ_x p(x | μ) x = (μ_1, μ_2, ..., μ_K)^T

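A small sketch of the 1-of-K representation, the product form of p(x | μ), and the expectation above; the parameter vector μ is an arbitrary example.

```python
import math

mu = [0.2, 0.5, 0.3]  # example parameters; they sum to 1

def pmf(x, mu):
    """p(x | mu) = prod_k mu_k ** x_k, for a 1-of-K (one-hot) vector x."""
    return math.prod(m ** xk for m, xk in zip(mu, x))

print(pmf([0, 1, 0], mu))  # probability of state 2, i.e. mu_2 = 0.5

# E[x | mu] = sum_x p(x | mu) x, where the sum runs over the K one-hot vectors.
K = len(mu)
one_hot = [[1 if i == k else 0 for i in range(K)] for k in range(K)]
expectation = [sum(pmf(x, mu) * x[k] for x in one_hot) for k in range(K)]
print(expectation)         # [0.2, 0.5, 0.3], i.e. mu itself
```
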
    Gaussian Distribution


As the number of dice increases, the distribution of their sum approaches a Gaussian Distribution, or Normal Distribution.

One dimensional:

N(x | μ, σ²) = (1 / √(2πσ²)) exp( -(1/(2σ²)) (x - μ)² )

D-dimensional:

N(x | μ, Σ) = (1 / ((2π)^(D/2) |Σ|^(1/2))) exp( -(1/2) (x - μ)^T Σ^(-1) (x - μ) )

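A sketch of the one-dimensional density (arbitrary example values of μ and σ²), with a crude numerical check that it integrates to 1.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """N(x | mu, sigma^2) = 1/sqrt(2*pi*sigma^2) * exp(-(x - mu)^2 / (2*sigma^2))"""
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

mu, sigma2 = 1.0, 2.0
print(gaussian_pdf(mu, mu, sigma2))  # density at the mean, ~0.282

# Midpoint-rule check that the density integrates to ~1 over a wide interval.
step = 0.001
total = sum(gaussian_pdf(mu - 20 + (i + 0.5) * step, mu, sigma2) * step
            for i in range(int(40 / step)))
print(total)  # ~1.0
```
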
    Gaussian Example

Image from Wikipedia.

    Expectation of a Gaussian


E[x | μ, σ²] = ∫ N(x | μ, σ²) x dx
             = ∫ (1 / √(2πσ²)) exp( -(1/(2σ²)) (x - μ)² ) x dx

or

E[x | μ, Σ] = ∫ N(x | μ, Σ) x dx
            = ∫ (1 / ((2π)^(D/2) |Σ|^(1/2))) exp( -(1/2) (x - μ)^T Σ^(-1) (x - μ) ) x dx

We'll need some calculus for this, so next time.

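Before doing the calculus, the answer can at least be checked numerically: a sketch approximating ∫ N(x | μ, σ²) x dx on a grid for example values of μ and σ²; it comes out at μ.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

mu, sigma2 = 3.0, 4.0
step = 0.001
lo, hi = mu - 40.0, mu + 40.0  # wide enough to capture essentially all of the mass

# E[x] ~ sum_i N(x_i | mu, sigma^2) * x_i * step   (midpoint rule)
xs = (lo + (i + 0.5) * step for i in range(int((hi - lo) / step)))
expectation = sum(gaussian_pdf(x, mu, sigma2) * x * step for x in xs)
print(expectation)  # ~3.0, i.e. the mean mu
```
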
    Variances


The variance of x describes how much variability there is around the mean, E[x].

var[f] = E[(f(x) - E[f(x)])²]

var[f] = E[f(x)²] - E[f(x)]²

    Covariance


The covariance of two random variables, x and y, expresses to what extent the two vary together.

cov[x, y] = E_{x,y}[(x - E[x])(y - E[y])]
          = E_{x,y}[xy] - E[x] E[y]

If two variables are independent, their covariance equals zero. (Know how to prove this.)

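A sketch checking both identities on random samples (the distributions are arbitrary examples): the two variance formulas agree, and the covariance of two independently drawn variables is close to zero.

```python
import random

random.seed(0)
N = 200_000
xs = [random.gauss(2.0, 1.5) for _ in range(N)]    # example distribution for x
ys = [random.uniform(0.0, 1.0) for _ in range(N)]  # drawn independently of x

def mean(vals):
    return sum(vals) / len(vals)

mx, my = mean(xs), mean(ys)

# var[x] = E[(x - E[x])^2]  and  var[x] = E[x^2] - E[x]^2
var_a = mean([(x - mx) ** 2 for x in xs])
var_b = mean([x * x for x in xs]) - mx ** 2

# cov[x, y] = E[xy] - E[x]E[y]; near zero here because x and y are independent
cov = mean([x * y for x, y in zip(xs, ys)]) - mx * my

print(var_a, var_b)  # both ~2.25 (= 1.5^2)
print(cov)           # ~0
```
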
How does Machine Learning use Probabilities?


The expectation of a function is the guess. The covariance is the confidence in this guess.

These are simple operations...

But how can we find the best estimate of p(x)?

    Bye


    Next

Linear Algebra and Vector Calculus
