
Post on 23-Mar-2020


Statistical analysis using matlab

HY 439

Presented by: George Fortetsanakis


Roadmap

• Probability distributions

• Statistical estimation

• Fitting data to probability distributions


Continuous distributions

• Continuous random variable X takes values in subset of real numbers D⊆R

• X corresponds to measurement of some property, e.g., length, weight

• Not possible to talk about the probability of X taking a specific value:

$P(X = x) = 0$

• Instead talk about the probability of X lying in a given interval:

$P(x_1 \le X \le x_2) = P(X \in [x_1, x_2])$

$P(X \le x) = P(X \in (-\infty, x])$


Probability density function (pdf)

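• Continuous function p(x) defined for each x∈D

• Probability of X lying in interval I⊆D computed by integral:

$P(X \in I) = \int_{x \in I} p(x)\,dx$

• Examples:

$P(x_1 \le X \le x_2) = P(X \in [x_1, x_2]) = \int_{x_1}^{x_2} p(x)\,dx$

$P(X \le x) = P(X \in (-\infty, x]) = \int_{-\infty}^{x} p(u)\,du$

• Important property:

$P(X \in D) = \int_{x \in D} p(x)\,dx = 1$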


Cumulative distribution function (cdf)

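• For each x∈D defines the probability $P(X \le x)$:

$F(x) = P(X \le x) = P(X \in (-\infty, x]) = \int_{-\infty}^{x} p(u)\,du$

Important properties:

$F(-\infty) = 0$

$F(+\infty) = 1$

$P(x_1 \le X \le x_2) = F(x_2) - F(x_1)$

Complementary cumulative distribution function (ccdf)

$G(x) = P(X > x) = 1 - P(X \le x) = 1 - F(x)$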


Exponential distribution

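Probability density function

$p(x) = \lambda e^{-\lambda x}, \quad x \ge 0$

Cumulative distribution function

$F(x) = 1 - e^{-\lambda x}, \quad x \ge 0$

Memoryless property:

$P(T > \tau + t \mid T > \tau) = P(T > t)$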
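
The memoryless property follows in one line from the survival function of the exponential distribution. A short derivation, assuming the rate parameterization with rate λ (so that P(T > t) = e^{-λt} for t ≥ 0) and writing τ for the time that has already elapsed:

```latex
\begin{aligned}
P(T > \tau + t \mid T > \tau)
  &= \frac{P(T > \tau + t)}{P(T > \tau)}
   = \frac{e^{-\lambda(\tau + t)}}{e^{-\lambda \tau}}
   = e^{-\lambda t}
   = P(T > t)
\end{aligned}
```

The first step uses the fact that the event T > τ + t already implies T > τ.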


Poisson process

Random process that describes the timestamps of various events

• Telephone call arrivals

• Packet arrivals on a router

Time between two consecutive arrivals follows exponential distribution

Time intervals t1, t2, t3, … are drawn from exponential distribution

[Figure: timeline of Arrival 1 to Arrival 7, with inter-arrival times t1, t2, …, t6 between consecutive arrivals]
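
The construction above can be sketched in a few lines of matlab: draw exponential inter-arrival times and accumulate them into timestamps. The rate `lambda` and the number of arrivals `k` are assumed values chosen for illustration, and the samples are drawn by inverse-transform sampling so that no toolbox is needed.

```matlab
% Sketch: simulate the arrival timestamps of a Poisson process.
lambda = 2;        % assumed arrival rate (events per unit time)
k      = 1000;     % assumed number of arrivals to generate

% If U ~ Uniform(0,1), then -log(U)/lambda follows an exponential
% distribution with rate lambda (inverse-transform sampling).
t = -log(rand(k, 1)) / lambda;          % inter-arrival times t1, t2, ...

arrivals = cumsum(t);                   % timestamp of each arrival

% Sanity check: the average gap should be close to 1/lambda.
mean_gap = mean(diff([0; arrivals]));
```

With the Statistics Toolbox available, `exprnd(1/lambda, k, 1)` could replace the inverse-transform line.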


Roadmap

• Probability distributions

• Statistical estimation

• Fitting data to probability distributions


Basic statistics

Suppose a set of measurements x = [x1 x2 … xn]

• Estimation of mean value: (matlab m=mean(x);)

$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$

• Estimation of standard deviation: (matlab s=std(x);)

$\hat{\sigma} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \hat{\mu})^2}$
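
As a quick check, the two estimators can be computed by hand and compared with the built-in functions. The data vector below is made up for illustration; the standard deviation uses the n-1 normalization, which is the default of `std`.

```matlab
% Sketch: sample mean and standard deviation computed by hand.
x = [2 4 4 4 5 5 7 9];                 % hypothetical measurements
n = numel(x);

mu_hat    = sum(x) / n;                            % sample mean
sigma_hat = sqrt(sum((x - mu_hat).^2) / (n - 1));  % sample std (n-1 normalization)

% Differences from the built-ins should be at rounding level.
d_mean = abs(mu_hat - mean(x));
d_std  = abs(sigma_hat - std(x));
```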


Estimate pdf

• Suppose dataset x = [x1 x2 … xk]

• Can we estimate the pdf that values in x follow?

Produce histogram


Step 1

• Divide sampling space into a number of bins

• Measure the number of samples in each bin

[Figure: samples plotted on the x axis from -4 to 4 and grouped into four bins of width 2, containing 3, 5, 6 and 2 samples]


Step 2

• E = total area under histogram plot = 2*3 + 2*5 + 2*6 +2*2 = 32

• Normalize y axis by dividing by E

[Figure: two histograms over x from -4 to 4. First: frequency per bin (3, 5, 6, 2). Second: normalized values P(x) per bin (3/32, 5/32, 6/32, 2/32).]
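
The arithmetic of this step can be reproduced directly. The sketch below uses the bin counts and the bin width from the example; the variable names are chosen here for illustration.

```matlab
% Sketch: normalize histogram counts so that the total area equals 1.
counts = [3 5 6 2];            % samples per bin, from the example
width  = 2;                    % each bin spans 2 units on the x axis

E       = sum(counts * width); % total area under the histogram
density = counts / E;          % normalized bar heights

area = sum(density * width);   % area under the normalized histogram
```

Here `E` is 32 and `density` is [3 5 6 2]/32, so the normalized bars integrate to 1 as a pdf should.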


Matlab code

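function produce_histogram(x, bins)
% PRODUCE_HISTOGRAM Plot a normalized histogram as an estimate of the pdf.
% Input parameters:
%   x    = [x1; x2; … xn]: a column vector containing the data x1, x2, …, xn.
%   bins = [b1; b2; … bk]: a vector that divides the sampling space into bins
%          centered around the points b1, b2, …, bk.

figure;                      % Create a new figure
[f, y] = hist(x, bins);      % Count the data points that fall in each bin
% trapz(y, f) approximates the area E under the histogram; for equally
% spaced bins it is close to sum(f)*(bin width) when the outermost bins
% hold few samples.
bar(y, f / trapz(y, f), 1);  % Plot the normalized histogram
xlabel('x');                 % Label the x axis
ylabel('p(x)');              % Label the y axis
end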


Histogram examples

[Figure: histograms for 1000 samples and 10000 samples, each shown with bin spacing 0.1 and bin spacing 0.05]


Empirical cdf

How can we estimate the cdf that values in x follow?

Use matlab function ecdf(x)

Empirical cdf estimated with 300 samples from normal distribution
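
To make clear what an empirical cdf computes, it can also be built by hand: sort the samples and assign to the i-th smallest value the fraction i/n. This is a minimal sketch that ignores ties and uses synthetic normal samples as in the figure; it is not meant to reproduce every detail of `ecdf`.

```matlab
% Sketch: empirical cdf computed by hand.
x = randn(300, 1);        % 300 samples from a standard normal distribution

xs = sort(x);             % sample values in increasing order
n  = numel(xs);
F  = (1:n)' / n;          % F(i) = fraction of samples <= xs(i)
```

Calling `stairs(xs, F)` draws the resulting step function.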


Percentiles

• Values of variable below which a certain percentage of observations fall

• 80th percentile is the value below which 80% of observations fall.

[Figure: plot with the 80th percentile marked]


Estimate percentiles

Percentiles in matlab: p = prctile(x, y);

• y takes values in interval [0 100]

• 80th percentile: p = prctile(x, 80);

Median: the 50th percentile

• med = prctile(x, 50); or

• med = median(x);

Why is median different from the mean?

• Suppose dataset x = [1 100 100]: mean = 201/3 = 67, median = 100. The single small value pulls the mean down but leaves the median unchanged.
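
The example can be run as is. The second dataset `x2` is an extra illustration added here: making the outlier more extreme moves the mean further but does not move the median.

```matlab
% Sketch: the mean is sensitive to outliers, the median is not.
x   = [1 100 100];
m   = mean(x);            % 201/3 = 67
med = median(x);          % 100
p50 = prctile(x, 50);     % 50th percentile, same value as the median here

x2   = [-1000 100 100];   % hypothetical dataset with a more extreme outlier
m2   = mean(x2);          % moves far away from 100
med2 = median(x2);        % still 100
```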


Roadmap

• Elements of probability theory

• Probability distributions

• Statistical estimation

• Fitting data to probability distributions


Problem definition

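Dataset D = {x1, x2, …, xk} collected from an experiment

Families of distributions:

$S = \{P_1(x \mid \theta_1),\ P_2(x \mid \theta_2),\ \ldots,\ P_N(x \mid \theta_N)\}$

• Gaussian: $\theta_i = (\mu, \sigma)$

• Exponential: $\theta_i = \lambda$

• Generalized pareto: $\theta_i$ consists of three parameters (shape, scale, location)

Which family of distributions better describes the dataset D?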


Step 1: Maximum likelihood estimation

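• For each family i determine parameter $\theta_i^*$ that better fits the data

• Maximize likelihood of obtaining the data with respect to $\theta_i$

$$
\begin{aligned}
\theta_i^* &= \arg\max_{\theta_i}\ p(D \mid \theta_i) && \text{(likelihood function)} \\
&= \arg\max_{\theta_i}\ p(x_1, x_2, \ldots, x_k \mid \theta_i) \\
&= \arg\max_{\theta_i}\ \prod_{j=1}^{k} p(x_j \mid \theta_i) && \text{(due to independence of samples)} \\
&= \arg\max_{\theta_i}\ \sum_{j=1}^{k} \ln p(x_j \mid \theta_i)
\end{aligned}
$$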


Example: exponential distribution

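• Probability density function

$p(x \mid \lambda) = \lambda e^{-\lambda x}$

• Define the log-likelihood function

$l(\lambda) = \sum_{i=1}^{k} \ln(\lambda e^{-\lambda x_i}) = \sum_{i=1}^{k} \ln(\lambda) - \sum_{i=1}^{k} \lambda x_i = k \ln(\lambda) - \lambda \sum_{i=1}^{k} x_i$

• Set derivative equal to 0 to find maximum

$\frac{dl(\lambda)}{d\lambda} = 0 \Rightarrow \frac{k}{\lambda} - \sum_{i=1}^{k} x_i = 0 \Rightarrow \lambda^* = \frac{k}{\sum_{i=1}^{k} x_i}$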
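
A small numerical check of this result: generate synthetic exponential samples, apply the closed-form estimate k divided by the sum of the samples, and confirm that the log-likelihood is lower at nearby rates. The rate `lambda_true` and the sample size are assumptions made for the illustration.

```matlab
% Sketch: maximum likelihood estimate of the rate of an exponential distribution.
lambda_true = 0.5;                        % assumed rate used to create synthetic data
k = 20000;                                % assumed number of samples
x = -log(rand(k, 1)) / lambda_true;       % exponential samples (inverse transform)

lambda_hat = k / sum(x);                  % closed-form MLE, equal to 1/mean(x)

% Log-likelihood l(lambda) = k*ln(lambda) - lambda*sum(x)
loglik = @(lam) k * log(lam) - lam * sum(x);

l_best  = loglik(lambda_hat);
l_lower = loglik(0.9 * lambda_hat);       % smaller than l_best
l_upper = loglik(1.1 * lambda_hat);       % smaller than l_best
```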


Reformulated question

After MLE: instead of families we have specific distributions

$P_1(x \mid \theta_1^*),\ P_2(x \mid \theta_2^*),\ \ldots,\ P_N(x \mid \theta_N^*)$

Which distribution better describes the data?

Choose most appropriate distribution based on:

• Q-Q plots

• Kullback–Leibler divergence


Method of Q-Q plots

Checks how well a probability distribution describes the data

Algorithm

1. Draw random datasets Y0, Y1, Y2, …, YM from distribution $P_i(x \mid \theta_i^*)$

2. Compute percentiles of these datasets at predefined set of points

3. Compute percentiles of experimental dataset D at the same points

4. Plot percentiles of Y0 against percentiles of each of Y1, Y2, …, YM

5. Plot percentiles of Y0 against percentiles of dataset D

If plot of step 5 is in the area defined by plots in step 4, the distribution $P_i(x \mid \theta_i^*)$ describes the data well
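
The five steps can be sketched in matlab as follows. This is an illustration under stated assumptions, not the course's reference code: the fitted distribution is taken to be exponential with a hypothetical rate `lambda`, the experimental dataset D is replaced by synthetic data, and the percentile grid `pts`, the dataset size `n` and the number of datasets `M` are arbitrary choices.

```matlab
% Sketch: Q-Q envelope test for a fitted exponential distribution.
lambda = 2;                 % hypothetical fitted rate (stands in for theta_i*)
n   = 500;                  % assumed number of samples per dataset
M   = 100;                  % assumed number of reference datasets Y1..YM
pts = 5:5:95;               % assumed percentile points

draw = @() -log(rand(n, 1)) / lambda;          % one dataset from the fitted distribution
pct  = @(v) reshape(prctile(v, pts), 1, []);   % percentiles as a row vector

D = draw();                 % placeholder for the experimental dataset

q0 = pct(draw());           % steps 1-2: percentiles of Y0
qY = zeros(M, numel(pts));
for m = 1:M
    qY(m, :) = pct(draw()); % steps 1-2: percentiles of Y1..YM
end
qD = pct(D);                % step 3: percentiles of D

env_lo = min(qY, [], 1);    % step 4: lower edge of the envelope
env_hi = max(qY, [], 1);    % step 4: upper edge of the envelope

% Step 5: does the curve of D stay inside the envelope at every point?
inside = all(qD >= env_lo & qD <= env_hi);
```

Since every curve is plotted against the same x values `q0`, the envelope test reduces to a pointwise comparison. The picture itself can be drawn with `plot(q0, env_lo, 'r--', q0, env_hi, 'r--', q0, qD, 'b-')`.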


Plot percentiles of Y0 vs. percentiles of Y1


Plot percentiles of Y0 vs. percentiles of Y2


Plot percentiles of Y0 vs. percentiles of Y100


Construct envelope


Plot percentiles of Y0 vs. percentiles of D

Good fitting: The blue curve of original percentiles lies in the envelope


Plot percentiles of Y0 vs. percentiles of D

Bad fitting: The blue curve of original percentiles lies outside the envelope


Method of Kullback–Leibler divergence

Non-symmetric measure of difference between distributions P and Q

Discrete distributions

$D_{KL}(P \| Q) = \sum_{i=1}^{N} p(i) \log \frac{p(i)}{q(i)}$

Continuous distributions

$D_{KL}(P \| Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)}\,dx$


Algorithm

1. Discretize the empirical pdf of the dataset D

2. Discretize all distributions $P_1(x \mid \theta_1^*),\ P_2(x \mid \theta_2^*),\ \ldots,\ P_N(x \mid \theta_N^*)$

3. Compute KL divergence of theoretical distributions with dataset D

4. Choose the distribution with the lowest KL divergence

[Figure: the histogram over x from -4 to 4 (bins with 3, 5, 6 and 2 samples) becomes a discrete distribution on the bin centers -3, -1, 1, 3 with probabilities 3/16, 5/16, 6/16, 2/16]
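
A minimal sketch of these steps, using the discretized dataset from the example (probabilities 3/16, 5/16, 6/16, 2/16 on four bins). The two candidate distributions are assumptions made for illustration: a Gaussian with hypothetical fitted parameters `mu` and `sigma`, discretized over the same bin edges, and a uniform distribution over the bins.

```matlab
% Sketch: choose between candidate distributions by KL divergence.
edges = -4:2:4;                    % bin edges from the example
p = [3 5 6 2] / 16;                % discretized empirical pdf of D

% Candidate 1: Gaussian with hypothetical fitted parameters.
mu = 0;  sigma = 1.8;
Phi = @(x) 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))));  % Gaussian cdf
q1 = diff(Phi(edges));             % probability mass of each bin
q1 = q1 / sum(q1);                 % renormalize over the four bins

% Candidate 2: uniform over the same four bins.
q2 = ones(1, 4) / 4;

% Discrete KL divergence; assumes every p(i) and q(i) is positive.
kl = @(p, q) sum(p .* log(p ./ q));

d1 = kl(p, q1);
d2 = kl(p, q2);
[~, best] = min([d1 d2]);          % index of the candidate with the lowest divergence
```

Bins where the empirical probability is zero would need special handling (their term is taken as zero), which this sketch leaves out.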


Online material

http://www.csd.uoc.gr/~hy439/schedule09.html

• Tutorials Statistics


Cross correlation

xcorr(x, y): estimates the cross correlation between two time series x and y

$R_{xy}(m) = E[x_{n+m}\, y_n] = E[x_n\, y_{n-m}]$

The larger the absolute value of the cross correlation, the larger the correlation of the two variables

[Figure: two panels. White noise: no correlation. Output of IIR filter: some correlation.]
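
The contrast between white noise and a filtered signal can be reproduced without `xcorr`, which belongs to the Signal Processing Toolbox. The sketch below is an assumption about what the two panels illustrate: it estimates the normalized correlation of each signal with a copy of itself shifted by one sample. The filter coefficients and the helper `lagcorr` are hypothetical choices made here.

```matlab
% Sketch: correlation at lag 1 for white noise and for the output of an IIR filter.
n = 5000;
w = randn(n, 1);                    % white noise
y = filter(1, [1 -0.9], w);         % IIR filter: y(n) = w(n) + 0.9*y(n-1)

% Normalized estimate of E[x(n+m) x(n)] for a zero-mean signal x.
lagcorr = @(x, m) mean(x(1+m:end) .* x(1:end-m)) / mean(x .^ 2);

r_w = lagcorr(w, 1);                % close to 0: no correlation
r_y = lagcorr(y, 1);                % close to 0.9: some correlation
```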