Sept-Oct 2006 Statistics for astronomers (L. Lindegren, Lund Observatory) Lecture 3, p. 1

Treatment and analysis of data – Applied statistics

Lecture 3: Sampling and descriptive statistics

Topics covered:

Parameters and statistics

Sample mean and sample standard deviation

Order statistics and quantiles

Confidence intervals and confidence levels

Error bars and box plots

Histograms

Cumulative and percentile plots

Probability plots


Population, parameters, sample and statistics

Sample space (in probability theory) ≡ population (in statistics)

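A (random) sample is drawn from the population.

The population is described by certain parameters such as μ and σ; the data are summarized by statistics such as m and s.

[Diagram: sampling (data collection) leads from the population to the data; data analysis yields the statistics; inference leads from the statistics back to the parameters.]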


Parameters and statistics

A parameter is a quantity that describes a population (e.g. the population mean μ and population standard deviation σ).

Data are obtained by sampling the population (e.g., x1, x2, ..., xn).

Any function of the data is called a statistic. Examples of statistics:

n - the number of data points

min(x1, x2, ..., xn) - the smallest data value

x1 + n^(1/3) - not a very useful statistic

m = (x1 + x2 + ... + xn)/n - the sample mean

s = [ Σi (xi – m)^2 / (n – 1) ]^(1/2) - the sample standard deviation


Descriptive statistics

Simple calculations on the data allow us to condense them into a form that is useful, e.g. in order to

• summarize results in a way that is quickly grasped

• assess the quality of the data

• compare different sets of data

• explore what kind of information the data may contain

• support a statement (make a conclusion more convincing)

When the data represent a more or less unknown distribution, the most important statistics may be

some measure of location, such as the sample mean or median

some measure of scale (or scatter, or precision), such as the sample standard deviation or interquartile range

This is often supported by graphics which give much more complete information on distributions. (A graph is also a statistic.)


Sample mean and sample standard deviation


Important comment

Be careful to distinguish between:

• the sample standard deviation

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - m)^2}$$

which measures the dispersion among the values x1, x2, ..., xn around the sample mean value m, and

• the standard deviation of the sample mean, which is usually estimated as

$$D[m] = \frac{s}{\sqrt{n}} = \sqrt{\frac{1}{n(n-1)} \sum_{i=1}^{n} (x_i - m)^2}$$

and which may be quoted as the standard error (1σ uncertainty) of m.

E.g.: "the mean value and dispersion of the data are 12.3 ± 2.5" is ambiguous!
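
The distinction between the sample standard deviation s and the standard error D[m] of the mean can be illustrated with a minimal Python sketch. The function names and the data values below are hypothetical and chosen only for illustration; they are not part of the lecture.

```python
import math

def sample_mean(xs):
    """Sample mean m = (x1 + ... + xn) / n."""
    return sum(xs) / len(xs)

def sample_std(xs):
    """Sample standard deviation s, with the (n - 1) denominator."""
    m = sample_mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

def std_error_of_mean(xs):
    """Estimated standard deviation of the sample mean, D[m] = s / sqrt(n)."""
    return sample_std(xs) / math.sqrt(len(xs))

# Hypothetical data, for illustration only.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

m = sample_mean(data)
s = sample_std(data)
se = std_error_of_mean(data)

# Quote both numbers with explicit labels to avoid the ambiguity noted above.
print(f"mean = {m:.2f}, dispersion s = {s:.2f}, standard error of mean = {se:.2f}")
```

Note that s stays roughly constant as more data are collected, whereas D[m] shrinks as 1/√n, which is why the two must be labelled explicitly.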


Alternative measures of location and scale

The sample mean and sample standard deviation are very sensitive to outliers or strongly deviating points. In manual data analysis one can often cope interactively with these cases, but for automatic analysis it is better to use a more robust method.

In such cases, or when the distribution is known or suspected to be non-gaussian, there are many other useful measures of location and scale.

Instead of the sample mean m we may use the sample median xmed (see below).

Instead of the sample standard deviation (= RMS deviation from the sample mean), we may use the mean absolute deviation from the mean:

$$\mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} |x_i - m|$$

Often the sample median is used instead of the sample mean when calculating the MAD. In fact, for any fixed sample the median minimizes the MAD, so it is logical to use the median and MAD together.
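
As a rough illustration of the MAD and of the statement that the median minimizes it, here is a small Python sketch. The helper name `mean_abs_dev` and the data (five consistent values plus one outlier) are assumptions made for the example only.

```python
import statistics

def mean_abs_dev(xs, center):
    """Mean absolute deviation of xs about an arbitrary center value."""
    return sum(abs(x - center) for x in xs) / len(xs)

# Hypothetical data: five consistent values and one strongly deviating point.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 25.0]

m = statistics.mean(data)
med = statistics.median(data)

mad_about_mean = mean_abs_dev(data, m)
mad_about_median = mean_abs_dev(data, med)

print(f"mean   = {m:.3f}, MAD about mean   = {mad_about_mean:.3f}")
print(f"median = {med:.3f}, MAD about median = {mad_about_median:.3f}")

# Scan a grid of candidate centers: none should give a smaller MAD
# than the median does.
candidates = [9.0 + 0.01 * k for k in range(1700)]
best = min(mean_abs_dev(data, c) for c in candidates)
print(f"smallest MAD on the grid = {best:.3f}")
```

The outlier pulls the mean well away from the bulk of the data, while the median barely moves.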


Order statistics


Sample quantiles


Quantiles for the normal (Gaussian) distribution

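[Figure: normal distribution, frequency versus value, with the value axis marked from −3σ to +3σ]

68% of the area is within ±1σ

32% of the area is outside ±1σ

4.6% of the area is outside ±2σ

0.3% of the area is outside ±3σ

[Figure: normal distribution, frequency versus value, with percentiles marked against standard deviations]

Percentile    Standard deviations
0.5           −2.57
5             −1.65
25            −0.67
50             0
75            +0.67
95            +1.65
99.5          +2.57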

Confidence intervals and levels – normal case (1)

Alternatively, the precision can be specified as a confidence interval, with an associated confidence level (CL):

x = 3.7 ± 2.5 (90% CL) or 1.2 < x < 6.2 (90% CL)

x > 1.2 (95% CL) [one-sided confidence interval]


Confidence intervals and levels – normal case (2)

Confidence level    Two-sided confidence interval (for normal distribution)
50%                 [ −0.67σ, +0.67σ ]
68%                 [ −1.00σ, +1.00σ ]
90%                 [ −1.65σ, +1.65σ ]
95%                 [ −1.96σ, +1.96σ ]
99%                 [ −2.58σ, +2.58σ ]
99.9%               [ −3.29σ, +3.29σ ]

Caution: older astronomical literature (< 1960) often uses “probable error” (p.e.), which corresponds to 50% CL or ±0.67σ.

Thus: (standard error) ≈ 1.5 × (probable error), since 1/0.67 ≈ 1.5.
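
The half-widths in the table can be reproduced from the inverse standard normal cdf. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name `two_sided_half_width` is a hypothetical helper, and 68.27% is used in place of the rounded 68% so that the result comes out at 1.00σ.

```python
from statistics import NormalDist

def two_sided_half_width(cl):
    """Half-width (in units of sigma) of the central interval that
    contains a fraction cl of a normal distribution."""
    # A central interval with level cl leaves (1 - cl)/2 in each tail.
    return NormalDist().inv_cdf(0.5 + cl / 2.0)

for cl in (0.50, 0.6827, 0.90, 0.95, 0.99, 0.999):
    k = two_sided_half_width(cl)
    print(f"CL = {100 * cl:6.2f}%   interval = [-{k:.2f} sigma, +{k:.2f} sigma]")

# Conversion factor from "probable error" (50% CL) to standard error.
pe_to_se = 1.0 / two_sided_half_width(0.50)
print(f"standard error = {pe_to_se:.2f} x probable error")
```

A one-sided limit at confidence level CL would use `inv_cdf(cl)` instead, since all of the excluded probability then lies in one tail.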


Deviations from the normal distribution

Actual errors rarely follow the normal distribution:

• usually points beyond ±3σ are much more frequent than expected for a normal distribution (0.3%)

• the distribution is often skew, especially in the tails

• sometimes the distribution is completely different, e.g. exponential

Although the standard deviation is applicable to many non-normal cases, it could be misleading without further specification of the distribution.

For instance, given only the information

x = 3.7 ± 1.5 (s.e.)

one might conclude that x > 8.2 is very unlikely (0.15%). However, if x has a log-normal distribution, the probability is in fact 2 – 3%.


Quantiles (fractiles), percentiles, quartiles, etc

Other names for quantiles at certain q-values:

Q(0.5) = median (or 50th percentile)

Q(0.25) = lower quartile (or 25th percentile)

Q(0.75) = upper quartile (or 75th percentile)

Q(0.1) = first decile, Q(0.2) = second decile, etc [not so often used]

The interquartile range IQR = Q(0.75) – Q(0.25) is sometimes used as a measure of precision (equal to 1.35σ for a normal distribution).

Half the "intersextile range" (not a standard term), [Q(5/6) – Q(1/6)]/2, equals 0.97σ for a normal distribution and is useful as a robust assessment of the dispersion.

NOTE: The terms quantile, fractile, and percentile are used almost synonymously in the literature, while median, quartile, decile etc have very specific meanings.
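
The two normalisation constants quoted above (1.35σ for the IQR and 0.97σ for half the intersextile range) can be checked numerically, and then used to turn sample quantiles into robust estimates of σ. This is only a sketch: the helper name `robust_sigma_estimates`, the simulated sample and the artificial outlier are assumptions, and `statistics.quantiles` uses its own interpolation convention, which may differ slightly from the quantile definition used in the lecture.

```python
import random
from statistics import NormalDist, quantiles, stdev

z = NormalDist()  # standard normal distribution

# Width of the quantile ranges in units of sigma for a normal distribution.
IQR_PER_SIGMA = z.inv_cdf(0.75) - z.inv_cdf(0.25)
HALF_SEXTILE_PER_SIGMA = (z.inv_cdf(5 / 6) - z.inv_cdf(1 / 6)) / 2

def robust_sigma_estimates(data):
    """Estimate sigma from the IQR and from half the intersextile range."""
    q1, _, q3 = quantiles(data, n=4)   # quartiles
    sextiles = quantiles(data, n=6)    # Q(1/6), ..., Q(5/6)
    iqr = q3 - q1
    half_sextile = (sextiles[-1] - sextiles[0]) / 2
    return iqr / IQR_PER_SIGMA, half_sextile / HALF_SEXTILE_PER_SIGMA

# Simulated standard-normal sample plus one gross outlier (illustration only).
random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(1000)]
sample_with_outlier = sample + [1000.0]

print(f"IQR / sigma = {IQR_PER_SIGMA:.3f}")
print(f"half intersextile range / sigma = {HALF_SEXTILE_PER_SIGMA:.3f}")
print("sample standard deviation:", round(stdev(sample_with_outlier), 2))
print("robust estimates:", robust_sigma_estimates(sample_with_outlier))
```

A single wild point inflates the sample standard deviation enormously, while both quantile-based estimates stay close to the true value of 1.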


Error bars and box plots

Error bars usually indicate ±1σ (i.e. the confidence interval at 68% CL). If not, the exact meaning must definitely be stated in the figure caption.

Box plots (or box-whisker plots) mark the following features:

• the median

• the lower quartile and the upper quartile

• the lowest "non-outlier" and the highest "non-outlier"

• the "outliers" (>1.5×IQR from median), plotted as individual points


Histograms

One-dimensional sample distributions are often shown as histograms. A histogram displays the number of data points per bin, versus the position of the bin (or the density of data points, if unequal bin sizes are used).

E.g., define the sequence x0, x1, ..., xn which are the boundaries of n bins. Equal bins of size Δx are obtained as xi = x0 + i Δx, i = 1, 2, ..., n.

Let hi be the number of data points with xi–1 ≤ x < xi. (Note the position of <.) In the histogram, hi (or sometimes hi /Δxi ) is plotted as a bar from xi–1 to xi.

Things to consider when constructing a histogram:

Which bin size to use? - compromise between resolution and noise. In any case, be careful to specify the bin size if it is not clear from the graph!

Where to start (x0)? - often arbitrary!

What to do with points outside the interval from x0 to xn (if any)?

A difficulty with histograms is that they look radically different depending on the choices you make!
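
The binning rule above, with half-open bins xi–1 ≤ x < xi, can be sketched in a few lines of Python. The function name `histogram`, its return convention (counts plus the number of points falling outside the range) and the data are assumptions for illustration; note that the list index is 0-based, so `counts[0]` corresponds to h1 in the notation above.

```python
import math

def histogram(data, x0, dx, nbins):
    """Count data points in nbins equal bins of width dx starting at x0.

    Bin k (0-based) covers x0 + k*dx <= x < x0 + (k+1)*dx.
    Returns (counts, number_of_points_outside_the_range).
    """
    counts = [0] * nbins
    outside = 0
    for x in data:
        k = math.floor((x - x0) / dx)
        if 0 <= k < nbins:
            counts[k] += 1
        else:
            outside += 1  # report these rather than silently dropping them
    return counts, outside

# Hypothetical data, including values exactly on bin boundaries.
data = [0.5, 1.9, 2.0, 3.7, 4.0, 7.9, 8.0, -0.1]

counts_a, outside_a = histogram(data, x0=0.0, dx=2.0, nbins=4)
counts_b, outside_b = histogram(data, x0=-1.0, dx=2.0, nbins=4)

print("x0 =  0:", counts_a, "outside:", outside_a)
print("x0 = -1:", counts_b, "outside:", outside_b)
```

Even for this tiny data set, shifting x0 by half a bin changes the counts, which is the difficulty with histograms mentioned above.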


Different histograms of the same data... (1)

These histograms (of the same 200 points) differ only in the choice of starting value x0

[Figure: four histograms, each with bin size = 2]


Different histograms of the same data... (2)

These histograms (of the same 200 points) differ in bin size as well. It is better to make the bins too narrow than too wide: the eye can smooth out the noise but cannot recover lost resolution! Note that the uncertainty of any histogram value hi is of order ±√hi.

[Figure: four histograms, with bin sizes 2, 0.5, 1 and 1]


Cumulative plots

An alternative to the histogram is to plot the cumulative fraction, analogous to the cumulative distribution function (cdf):

Theoretical distributions            Empirical data
cumulative distribution function  ⇔  cumulative fraction
probability density function      ⇔  histogram

The cumulative fraction is a step function that increments by 1/n for each data point, starting from 0 and ending at 1.
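
A minimal sketch of the cumulative fraction as a step function is given below. The name `cumulative_fraction` is a hypothetical helper, and the convention that the step is counted at the data point itself (fraction of points ≤ x) is an assumption; the lecture does not specify it.

```python
from bisect import bisect_right

def cumulative_fraction(data):
    """Return a function F(x) giving the fraction of data points <= x."""
    xs = sorted(data)
    n = len(xs)

    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(xs, x) / n

    return F

# Hypothetical data with one repeated value.
F = cumulative_fraction([3.0, 1.0, 2.0, 2.0])

for x in (0.0, 1.0, 2.0, 2.5, 3.0):
    print(f"F({x}) = {F(x)}")
```

Each distinct data point raises F by 1/n; the repeated value 2.0 produces a double step of 2/n. Unlike a histogram, no bin size or starting value has to be chosen.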


Cumulative plot, example

Cumulative fraction plot for the same 200 data points as in the histograms. The two modes can be seen as the steeper parts of the curve around 10 and 15.

n = 200


Cumulative plots, some more examples (1)

You can transform the scale of data values to emphasize important intervals.

For example, for strictly positive data it often makes sense to use a logarithmic scale (this and following examples from bardeen.physics.csbsju.edu/stats/).


Cumulative plots, some more examples (2)

Cumulative plots are excellent to compare two samples: do they have the same distribution? (Cf. K-S test.) Works also for samples of unequal size.

The two samples B1 and B2 are clearly drawn from different populations. This is also evident from the box plot, but not from the mean/dispersion plot (right).

[Figure: cumulative plots of the samples B1 and B2]


Percentile plots

The ragged appearance of the cumulative plot can be disturbing to the eye, especially for small n.

It may then be better to use a percentile plot (red line), which simply connects the n points with x(i) as abscissa and p = i/(n+1) as ordinate. This is actually a better estimate of the cumulative distribution function than the cumulative fraction plot.
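
The points of a percentile plot follow directly from the sorted data. In the sketch below, `percentile_points` is a hypothetical helper name and the three data values are made up for illustration.

```python
def percentile_points(data):
    """Return the (x(i), p_i) pairs of a percentile plot, p_i = i/(n+1)."""
    xs = sorted(data)  # order statistics x(1) <= x(2) <= ... <= x(n)
    n = len(xs)
    return [(x, i / (n + 1)) for i, x in enumerate(xs, start=1)]

# Hypothetical data.
points = percentile_points([5.0, 1.0, 3.0])

for x, p in points:
    print(f"x = {x:4.1f}   p = {p:.2f}")
```

With p = i/(n+1) the ordinates never reach 0 or 1, so neither the smallest nor the largest data point is treated as the edge of the distribution. Connecting the points with straight lines gives the percentile plot.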


Percentile plot, example

Percentile plot for the same 200 data points as in the histograms and as in the cumulative fraction plot (slide 20).

n = 200


Transformed percentiles...

Sometimes it's useful to transform the percentile scale to bring out more clearly the important parts of the distribution.

In this example (a sample drawn from the χ² distribution with 3 degrees of freedom) we are concerned about the tail of large values, which is difficult to see in the standard percentile plot (left). By plotting 1 – p instead of p and using a logarithmic scale, the tail is emphasized.

[Figure: two panels, each with n = 200: standard percentile plot (left) and 1 – p on a logarithmic scale]


Probability plots

As n → ∞ the percentile plot converges to the cdf F(x).

To see if the data follow a given distribution F(x), we could make a percentile plot with F–1(i/(n+1)) on the y-axis instead of i/(n+1). If the data follow F(x) we should then get (approximately) a straight line. This is a probability plot.

The nice thing about probability plots is that any linear transformation axi+b of the data will just shift and change the slope of the curve, but a straight line (for example) remains straight.

The most common type of this plot is the normal probability plot, using the standard normal cdf

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp\left(-\frac{t^2}{2}\right) \mathrm{d}t$$

The abscissae are x(i) and the ordinates are Φ–1(i/(n+1)) for i = 1, 2, ..., n.
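
The coordinates of a normal probability plot can be sketched as follows, using `statistics.NormalDist().inv_cdf` for Φ–1. The helper names, the simulated sample (50 values with mean 2 and s.d. 5) and the use of a correlation coefficient as a crude measure of straightness are assumptions made for this illustration; the lecture itself judges straightness by eye.

```python
import math
import random
from statistics import NormalDist

def normal_probability_points(data):
    """Return (x(i), inverse normal cdf of i/(n+1)) for i = 1..n."""
    xs = sorted(data)
    n = len(xs)
    z = NormalDist()
    return [(x, z.inv_cdf(i / (n + 1))) for i, x in enumerate(xs, start=1)]

def pearson_r(points):
    """Linear correlation coefficient of the plotted points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)

# Simulated gaussian sample (illustration only).
random.seed(1)
sample = [random.gauss(2.0, 5.0) for _ in range(50)]

r_original = pearson_r(normal_probability_points(sample))

# A linear transformation a*x + b (a > 0) shifts and rescales the
# abscissae but leaves the straightness of the plot unchanged.
transformed = [3.0 * x - 7.0 for x in sample]
r_transformed = pearson_r(normal_probability_points(transformed))

print(f"r (original)    = {r_original:.4f}")
print(f"r (transformed) = {r_transformed:.4f}")
```

A value of r close to 1 corresponds to a nearly straight plot, but it is no substitute for looking at the plot, where curvature in the tails is easy to see.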


The inverse standard normal cdf

To make normal probability plots you need to be able to compute the inverse standard normal cdf Φ–1(p) for any 0 < p < 1. Routines for this are available in most numerical/statistical packages (and can be found e.g. in Numerical Recipes).

If not readily available, use the following approximation, which is always good enough for probability plots (maximum error is 0.003; Abramowitz & Stegun, Handbook of Mathematical Functions):

$$\Phi^{-1}(p) = \begin{cases} \dfrac{2.30753 + 0.27061\,t}{1 + 0.99229\,t + 0.04481\,t^2} - t & \text{if } 0 < p \le 0.5 \\[2ex] -\Phi^{-1}(1 - p) & \text{if } 0.5 < p < 1 \end{cases}$$

where

$$t = \sqrt{-2 \ln p}$$

The values Φ–1(p) are sometimes called the normal scores.
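
The approximation can be coded directly and compared with a library routine. In this sketch the function name `inv_norm_cdf_approx` is hypothetical, and `statistics.NormalDist().inv_cdf` from the Python standard library is used as the reference value.

```python
import math
from statistics import NormalDist

def inv_norm_cdf_approx(p):
    """Rational approximation to the inverse standard normal cdf
    (coefficients as quoted from Abramowitz & Stegun)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must satisfy 0 < p < 1")
    if p > 0.5:
        # Use the symmetry of the normal distribution for the upper half.
        return -inv_norm_cdf_approx(1.0 - p)
    t = math.sqrt(-2.0 * math.log(p))
    return (2.30753 + 0.27061 * t) / (1.0 + 0.99229 * t + 0.04481 * t * t) - t

# Compare with the library routine on a grid of p values.
exact = NormalDist()
grid = [k / 1000.0 for k in range(1, 1000)]
max_error = max(abs(inv_norm_cdf_approx(p) - exact.inv_cdf(p)) for p in grid)

print(f"maximum absolute error on the grid: {max_error:.4f}")
```

An error of a few thousandths of a standard deviation is far below what can be seen on a plot, which is why this accuracy is sufficient for probability plots.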


Percentile vs. probability plot (1)

Percentile plot for 50 random numbers from a normal distribution with mean = 2 and s.d. = 5.

Note that you can use the percentile plot to estimate quantiles, e.g. the median and the first/last sextiles.

[Figure: percentile plot, n = 50, with the following quantiles read off]

p = 1/6: 1st sextile ≈ -3

p = 0.5: median ≈ 1

p = 5/6: 5th sextile ≈ 7


Percentile vs. probability plot (2)

Normal probability plot for the same 50 random numbers.

The approximately straight relationship suggests that the data are indeed gaussian.

The median and the quantiles corresponding to ±1σ for the normal distribution are easily found.

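[Figure: normal probability plot, n = 50, with the following values read off]

ordinate -1: -1σ ≈ -3

ordinate 0: median ≈ 1

ordinate +1: +1σ ≈ 7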


Normal probability plots, expected variation (n = 20)


Normal probability plots, expected variation (n = 200)


Normal probability plot for a non-normal sample

Normal probability plot for the bimodal sample earlier plotted in the histograms (slides 17-18).

n = 200

bin size = 0.5



Typical normal probability plot for a sample that is nearly gaussian, but with some outliers

n = 200



Normal probability plot for a sample drawn from the Cauchy distribution with location α = 2 and scale β = 5.

n = 200


Cauchy probability plot for the Cauchy sample

Cauchy probability plot for the same sample as in the previous slide. The inverse cdf for the standard Cauchy distribution is F–1( p) = tan [( p – 0.5) π].

n = 200

Probability plots may not be very useful for extreme distributions like Cauchy!
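
For completeness, the coordinates of a Cauchy probability plot can be sketched with the inverse cdf F–1(p) = tan[(p – 0.5)π] quoted above. The helper names and the simulated sample (location 2, scale 5, generated by inverting the cdf) are assumptions for illustration.

```python
import math
import random

def cauchy_inv_cdf(p):
    """Inverse cdf of the standard Cauchy distribution."""
    return math.tan((p - 0.5) * math.pi)

def cauchy_probability_points(data):
    """Return (x(i), inverse Cauchy cdf of i/(n+1)) for i = 1..n."""
    xs = sorted(data)
    n = len(xs)
    return [(x, cauchy_inv_cdf(i / (n + 1))) for i, x in enumerate(xs, start=1)]

# Simulated Cauchy sample with location 2 and scale 5 (illustration only).
random.seed(2)
alpha, beta = 2.0, 5.0
sample = [alpha + beta * cauchy_inv_cdf(random.random()) for _ in range(200)]

points = cauchy_probability_points(sample)

# The extreme ordinates are very large, so a few points dominate the plot.
print("smallest and largest ordinate:", points[0][1], points[-1][1])
print("ordinates at the quartiles:", cauchy_inv_cdf(0.25), cauchy_inv_cdf(0.75))
```

For n = 200 the extreme ordinates are of order ±64 while half of the points lie between -1 and +1, which squeezes most of the sample into a small part of the plot. This is one way to see why such plots are less useful for very heavy-tailed distributions.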


Related to probability plots...

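[Figure: log-log plot with axis ticks at 10^0, 10^1, 10^2 and at 10^-2, 10^-1, 10^0. Title: 20th Century's 100 largest disasters worldwide. Curves: US power outages (10M of customers, 1985-1997), Natural ($100B), Technological ($10B). Reference line: slope = -1 (α = 1).]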


A histogram plot from Hipparcos data analysis

Normalised differences between the FAST and NDAC parallax estimates for successive solutions (12, 18, 30, 37 months of data). n = 40,000 - 100,000.

ESA SP-1200, Vol. 3, Fig. 16.28


The same data in a normal probability plot

Normalised differences between the FAST and NDAC parallax estimates for successive solutions (12, 18, 30, 37 months of data). n = 40,000 - 100,000.

ESA SP-1200, Vol. 3, Fig. 16.29

Real data are sometimes surprisingly Gaussian!
