Transcript
Page 1: 161.120 Introductory Statistics  Week 2 Lecture slides

• Graphical Displays of Univariate Data: Dot Plots, Stem-and-leaf Plots and Histograms
  – Text sections 2.4 and 2.5
  – CAST sections 2.2 and 2.4

• Describing Centre and Spread
  – Text sections 2.6 and 2.7
  – CAST sections 2.5 and 2.6

• Transformation & Discrete Data
  – CAST sections 2.7 and 2.8

Page 2: 161.120 Introductory Statistics  Week 2 Lecture slides

Dot Plots

• A graphical display of a batch of numbers
• Each value is shown as a dot against a numerical axis
• Problem of overlapping
  – Jittering dots (used in CAST)
    • randomly move the dots perpendicularly to the axis in order to separate them somewhat
  – Stacking dots
    • group values into classes, then vertically stack the dots in each class
    • the heights of the stacks show the density for each class
• The loss of detailed information in a stacked dot plot is rarely important

Page 3: 161.120 Introductory Statistics  Week 2 Lecture slides

Stem and leaf Plots

Basically a stacked dot plot using digits instead of dots and a slightly different layout

• The 'axis' is drawn vertically

• A value is printed on the axis for each stack, giving the most significant digits that are common for all values on that stack. This is called the stem for the stack.

• The digits representing the values are called the leaf digits and are drawn in a row to the right of the stems

Page 4: 161.120 Introductory Statistics  Week 2 Lecture slides

• Decimal points are not shown in the stems or the leaves
  – The stem '12' and leaf '3' could represent 12300 or 1230 or 123 or 12.3 or 1.23 or 0.123, etc., so you need to provide a key or state the units of the stem

• Distribution of values is shown by the ‘canopy’ of leaves

• Sometimes the distribution is not shown well
  – Can change the value of the leaves
  – Or split the stems

Page 5: 161.120 Introductory Statistics  Week 2 Lecture slides

Example 2.8 Big Music Collection About how many CDs do you own?

Stem is ‘100s’ and leaf unit is ‘10s’. Final digit is truncated. Numbers ranged from 0 to about 450, with 450 being a clear outlier and most values ranging from 0 to 99. The shape is skewed right.
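
The CD counts behind this plot are not reproduced in the transcript, so the following Python sketch uses made-up values (the list `cd_counts` and the function name `stem_and_leaf` are illustrative assumptions, and non-negative values are assumed). It shows one possible way to build a display with stem unit 100 and leaf unit 10, truncating the final digit as described above.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=10):
    """Build stem-and-leaf rows; digits smaller than leaf_unit are truncated."""
    rows = defaultdict(list)
    for v in sorted(values):
        scaled = int(v // leaf_unit)       # drop (truncate) the final digit(s)
        stem, leaf = divmod(scaled, 10)    # the last kept digit is the leaf
        rows[stem].append(str(leaf))
    # List every stem between the smallest and largest so gaps stay visible.
    return [f"{stem} | {''.join(rows.get(stem, []))}"
            for stem in range(min(rows), max(rows) + 1)]

# Hypothetical CD counts (NOT the survey data from Example 2.8).
cd_counts = [0, 12, 25, 30, 45, 60, 99, 100, 150, 220, 450]
for row in stem_and_leaf(cd_counts):
    print(row)
```

With these made-up values, the empty 3 stem and the single leaf on the 4 stem show how a value such as 450 stands apart from the rest of the display.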

Page 6: 161.120 Introductory Statistics  Week 2 Lecture slides

Outlier: a data point that is not consistent with the bulk of the data.

Outliers and How to Handle Them

• Look for them via graphs.

• Can have big influence on conclusions.

• Can cause complications in some statistical analyses.

• Cannot discard without justification.

Page 7: 161.120 Introductory Statistics  Week 2 Lecture slides

Possible Reasons for Outliers and Reasonable Actions

• Mistake made while taking measurement or entering it into computer. If verified, should be discarded/corrected.

• Individual in question belongs to a different group than bulk of individuals measured. Values may be discarded if summary is desired and reported for the majority group only.

• Outlier is legitimate data value and represents natural variability for the group and variable(s) measured. Values may not be discarded — they provide important information about location and spread.

Page 8: 161.120 Introductory Statistics  Week 2 Lecture slides

Example 2.7 Tiny Boatsmen

Weights (in pounds) of 18 men on two crew teams:

Cambridge: 188.5, 183.0, 194.5, 185.0, 214.0, 203.5, 186.0, 178.5, 109.0

Oxford: 186.0, 184.5, 204.0, 184.5, 195.5, 202.5, 174.0, 183.0, 109.5

Note: the last weight in each list is unusually small. These two men are the coxswains for their teams, while the others are rowers.

Page 9: 161.120 Introductory Statistics  Week 2 Lecture slides

Clusters

• If a dot plot or stem and leaf plot separates into two or more groups of values (clusters), this suggests that the 'individuals' from which the data were recorded may similarly be split into two or more groups.

• Clusters may correspond to males and females, different varieties of plants,…

• Detecting the cause of differences between the groups may lead to valuable insights into the data.

Page 10: 161.120 Introductory Statistics  Week 2 Lecture slides

Histograms

• Directly displays the 'canopy' shape, without separately displaying the individual values.

• Are particularly useful displays for large data sets

• Area equals relative frequency
  – Each value must contribute the same area to the histogram
  – Equal width classes
    • height of the rectangles equals the frequency of the class
    • vertical axis labeled ‘frequency’
  – Mixed class widths
    • vertical axis labeled ‘density’
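
A minimal Python sketch of the mixed-class-width case, assuming density is defined as relative frequency divided by class width (the function name, the data and the class boundaries are illustrative, not taken from the lecture). Under that assumption, each rectangle's area equals the relative frequency of its class, so every value contributes the same area.

```python
def histogram_densities(values, boundaries):
    """Density per class = (count / n) / class width, so area = relative frequency."""
    n = len(values)
    densities = []
    for left, right in zip(boundaries, boundaries[1:]):
        # Each class is the half-open interval [left, right);
        # values outside the boundaries are simply not counted.
        count = sum(1 for v in values if left <= v < right)
        densities.append(count / n / (right - left))
    return densities

# Hypothetical data with classes of width 5, 5 and 10.
values = [1, 2, 3, 4, 6, 8, 12, 18]
boundaries = [0, 5, 10, 20]
densities = histogram_densities(values, boundaries)

# Area of each rectangle = density * width = relative frequency of the class.
areas = [d * (r - l) for d, l, r in zip(densities, boundaries, boundaries[1:])]
print(densities, areas)
```

The last class holds the same number of values as the middle class but is twice as wide, so its rectangle is half as tall.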

Page 11: 161.120 Introductory Statistics  Week 2 Lecture slides

Choice of histogram classes

• Histogram classes should be chosen to give an outline that is as smooth as possible

– Too narrow leads to jagged histogram

– Too wide leads to 'blocky' histogram and detail is lost

• Adjusting the class width and the starting position for the first class can give a surprising amount of variability in histogram shape for small data sets. As a result, you must be wary of over-interpreting features such as clusters or skewness in such histograms.

Page 12: 161.120 Introductory Statistics  Week 2 Lecture slides

Interpreting Histograms, Stemplots, and Dotplots

• Values are centered around 20 cm.
• Two possible low outliers.
• Apart from outliers, spans range from about 16 to 23 cm.

Page 13: 161.120 Introductory Statistics  Week 2 Lecture slides

Five-Number Summaries

• Find extremes (high, low), the median, and the quartiles (medians of lower and upper halves of the values).

• Quick overview of the data values.

• Information about the center, spread, and shape of data.

Page 14: 161.120 Introductory Statistics  Week 2 Lecture slides

Notation and Finding the Quartiles

Split the ordered values into the half that is below the median and the half that is above the median.

Q1 = lower quartile = median of data values that are below the median

Q3 = upper quartile = median of data values that are above the median

Page 15: 161.120 Introductory Statistics  Week 2 Lecture slides

Example 2.10 Fastest Speeds (cont)

Ordered Data (in rows of 10 values) for the 87 males:

 55  60  80  80  80  80  85  85  85  85
 90  90  90  90  90  92  94  95  95  95
 95  95  95 100 100 100 100 100 100 100
100 100 101 102 105 105 105 105 105 105
105 105 109 110 110 110 110 110 110 110
110 110 110 110 110 112 115 115 115 115
115 115 120 120 120 120 120 120 120 120
120 120 124 125 125 125 125 125 125 130
130 140 140 140 140 145 150

• Median = the (87+1)/2 = 44th value in the list = 110 mph
• Q1 = median of the 43 values below the median = the (43+1)/2 = 22nd value from the start of the list = 95 mph
• Q3 = median of the 43 values above the median = the (43+1)/2 = 22nd value from the end of the list = 120 mph
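
The quartile rule used in this example can be sketched in Python as follows. The function names are illustrative assumptions; the method is the one stated on the slides (quartiles are the medians of the values below and above the median position), which can differ slightly from the percentile rules used by some software packages.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # middle value when n is odd
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of the middle two

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'median of each half' rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]             # values below the median position
    upper = ordered[half + (n % 2):]   # values above the median position
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

# The 87 fastest speeds (mph) for males from Example 2.10.
speeds = [
    55, 60, 80, 80, 80, 80, 85, 85, 85, 85,
    90, 90, 90, 90, 90, 92, 94, 95, 95, 95,
    95, 95, 95, 100, 100, 100, 100, 100, 100, 100,
    100, 100, 101, 102, 105, 105, 105, 105, 105, 105,
    105, 105, 109, 110, 110, 110, 110, 110, 110, 110,
    110, 110, 110, 110, 110, 112, 115, 115, 115, 115,
    115, 115, 120, 120, 120, 120, 120, 120, 120, 120,
    120, 120, 124, 125, 125, 125, 125, 125, 125, 130,
    130, 140, 140, 140, 140, 145, 150,
]
print(five_number_summary(speeds))  # (55, 95, 110, 120, 150)
```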

Page 16: 161.120 Introductory Statistics  Week 2 Lecture slides

Percentiles

The kth percentile is a number that has k% of the data values at or below it and (100 – k)% of the data values at or above it.

• Lower quartile = 25th percentile• Median = 50th percentile• Upper quartile = 75th percentile

Page 17: 161.120 Introductory Statistics  Week 2 Lecture slides

Median, quartiles and area

• The data set is split into quarters by the median and quartiles.
• Histogram area is proportional to relative frequency, therefore the median and quartiles split the histogram into four equal areas.

Page 18: 161.120 Introductory Statistics  Week 2 Lecture slides

Basic Box plot

Page 19: 161.120 Introductory Statistics  Week 2 Lecture slides

What does a box plot tell you about the distribution?

• Centre
  – The vertical line inside the box (the median) gives an indication of the centre of the distribution.

• Spread
  – The width of the box (the interquartile range) gives an indication of the spread of values in the distribution.
  • IQR = UQ - LQ

• Shape
  – High density corresponds to adjacent box plot values being close together. In particular, if the extreme and quartile on one side are closer to the median than the extreme and quartile on the other side, this shows that the distribution is skew.

Page 20: 161.120 Introductory Statistics  Week 2 Lecture slides
Page 21: 161.120 Introductory Statistics  Week 2 Lecture slides

Box plot: Clusters & Outliers

• Clusters
  – Box plots cannot show clusters in a data set
  – Before using a box plot, check that clusters do not exist by using a dot plot, stem and leaf plot or histogram

• Outliers
  – The basic box plot does not clearly show an outlier
  – Any values more than 1.5 times the IQR from the box are considered to be outliers and are displayed with a separate cross
  – The 'whiskers' that are drawn to the sides of the central box extend only as far as the most extreme values that are not classified as outliers.
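
The 1.5 × IQR rule can be sketched in Python as below. The function names and the short list of values are illustrative assumptions; the quartiles are the ones reported for the males in Example 2.10 (Q1 = 95 mph, Q3 = 120 mph).

```python
def boxplot_fences(q1, q3, k=1.5):
    """Limits beyond which a value is flagged as an outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def whiskers_and_outliers(values, q1, q3):
    """Whisker ends are the most extreme values that are not outliers."""
    low, high = boxplot_fences(q1, q3)
    inside = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return (min(inside), max(inside)), outliers

# A few of the male speeds, including both extremes of the data set.
sample = [55, 60, 95, 110, 120, 150]
print(boxplot_fences(95, 120))                 # (57.5, 157.5)
print(whiskers_and_outliers(sample, 95, 120))  # ((60, 150), [55])
```

Under this rule the slowest speed, 55 mph, lies below the lower limit of 57.5 mph and would be drawn as a separate point, while the maximum of 150 mph is still within the upper limit.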

Page 22: 161.120 Introductory Statistics  Week 2 Lecture slides

Example 2.10 Fastest Speeds Ever Driven

Five-Number Summary for 87 males

• Median = 110 mph measures the center of the data
• Two extremes describe spread over 100% of data
  Range = 150 – 55 = 95 mph
• Two quartiles describe spread over middle 50% of data
  Interquartile Range = 120 – 95 = 25 mph

Page 23: 161.120 Introductory Statistics  Week 2 Lecture slides

[Figure: Boxplot of Males Fastest Speeds. Vertical axis: Fastest Speeds, with ticks at 50, 75, 100, 125 and 150.]

Page 24: 161.120 Introductory Statistics  Week 2 Lecture slides

Comparing two or more groups

• Box plots are particularly useful for comparing different groups of values

• Rice yields in 1996

Page 25: 161.120 Introductory Statistics  Week 2 Lecture slides

Picturing Location and Spread with Boxplots

Boxplots for right handspans of males and females.

• Box covers the middle 50% of the data
• Line within box marks the median value
• Possible outliers are marked with asterisk
• Apart from outliers, lines extending from box reach to min and max values.

Page 26: 161.120 Introductory Statistics  Week 2 Lecture slides

2.5 Pictures for Quantitative Data

• Histograms: similar to bar graphs, used for any number of data values.

• Stem-and-leaf plots and dotplots: present all individual values, useful for small to moderate sized data sets.

• Boxplot or box-and-whisker plot: useful summary for comparing two or more groups.

Page 27: 161.120 Introductory Statistics  Week 2 Lecture slides

2.6 Numerical Summaries of Quantitative Data

Notation for Raw Data:
n = number of individuals in a data set
x1, x2, x3, …, xn represent individual raw data values

Example: A data set consists of handspan values in centimeters for six females; the values are 21, 19, 20, 20, 22, and 19.

Then, n = 6
x1 = 21, x2 = 19, x3 = 20, x4 = 20, x5 = 22, and x6 = 19

Page 28: 161.120 Introductory Statistics  Week 2 Lecture slides

Describing the Location of a Data Set

• Mean: the numerical average

• Median: the middle value (if n odd) or the average of the middle two values (n even)

Symmetric: mean = median
Skewed Left: mean < median
Skewed Right: mean > median

Page 29: 161.120 Introductory Statistics  Week 2 Lecture slides

Determining the Mean and Median

The Mean

$\bar{x} = \dfrac{\sum x_i}{n}$

where $\sum x_i$ means “add together all the values”

The Median

If n is odd: M = middle of ordered values. Count (n + 1)/2 down from top of ordered list.

If n is even: M = average of middle two ordered values. Average values that are (n/2) and (n/2) + 1 down from top of ordered list.

Page 30: 161.120 Introductory Statistics  Week 2 Lecture slides

Example 2.9 Will “Normal” Rainfall Get Rid of Those Odors?

Data: Average rainfall (inches) for Davis, California for 47 years

Mean = 18.69 inches
Median = 16.72 inches

In 1997-98, a company with odor problem blamed it on excessive rain. That year rainfall was 29.69 inches. More rain occurred in 4 other years.

Page 31: 161.120 Introductory Statistics  Week 2 Lecture slides

The Influence of Outliers on the Mean and Median

Larger influence on mean than median.
High outliers will increase the mean. Low outliers will decrease the mean.

If ages at death are: 70, 72, 74, 76, and 78
then mean = median = 74 years.

If ages at death are: 35, 72, 74, 76, and 78
then median = 74 but mean = 67 years.
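
The two sets of ages can be checked with a short Python sketch using the standard library `statistics` module (the variable names are illustrative).

```python
from statistics import mean, median

ages = [70, 72, 74, 76, 78]
ages_with_low_outlier = [35, 72, 74, 76, 78]   # 70 replaced by 35

# The median only depends on the middle of the ordered list,
# while the mean uses every value and is pulled towards the outlier.
print(mean(ages), median(ages))                                    # 74 74
print(mean(ages_with_low_outlier), median(ages_with_low_outlier))  # 67 74
```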

Page 32: 161.120 Introductory Statistics  Week 2 Lecture slides

2.7 Bell-Shaped Distributions of Numbers

Many measurements follow a predictable pattern:
• Most individuals are clumped around the center
• The greater the distance a value is from the center, the fewer individuals have that value.

Variables that follow such a pattern are said to be “bell-shaped”. A special case is called a normal distribution or normal curve.

Page 33: 161.120 Introductory Statistics  Week 2 Lecture slides

Example 2.11 Bell-Shaped British Women’s Heights

Data: representative sample of 199 married British couples. The figure shows a histogram of the wives’ heights with a normal curve superimposed. The mean height = 1602 millimeters.

Page 34: 161.120 Introductory Statistics  Week 2 Lecture slides

Describing Spread with Standard Deviation

Standard deviation measures variability by summarizing how far individual data values are from the mean.

Think of the standard deviation as roughly the average distance values fall from the mean.

Page 35: 161.120 Introductory Statistics  Week 2 Lecture slides

Describing Spread with Standard Deviation

Both sets have same mean of 100.

Set 1: all values are equal to the mean so there is no variability at all.

Set 2: one value equals the mean and other four values are 10 points away from the mean, so the average distance away from the mean is about 10.

Page 36: 161.120 Introductory Statistics  Week 2 Lecture slides

Calculating the Standard Deviation

Formula for the (sample) standard deviation:

$s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$

The value of $s^2$ is called the (sample) variance. An equivalent formula, easier to compute, is:

$s = \sqrt{\dfrac{\sum x_i^2 - n\bar{x}^2}{n - 1}}$

Page 37: 161.120 Introductory Statistics  Week 2 Lecture slides

Calculating the Standard Deviation

Step 1: Calculate $\bar{x}$, the sample mean.

Step 2: For each observation, calculate the difference between the data value and the mean.

Step 3: Square each difference in step 2.

Step 4: Sum the squared differences in step 3, and then divide this sum by n – 1.

Step 5: Take the square root of the value in step 4.
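
The five steps can be sketched in Python as follows. The function name is an illustrative assumption, and because the actual values of the two example sets are not reproduced in this transcript, `set_2` is a hypothetical set chosen to match the earlier description (mean 100, one value at the mean, four values 10 points away).

```python
import math

def sample_standard_deviation(values):
    n = len(values)
    x_bar = sum(values) / n                      # Step 1: sample mean
    deviations = [x - x_bar for x in values]     # Step 2: value minus mean
    squared = [d ** 2 for d in deviations]       # Step 3: square each difference
    variance = sum(squared) / (n - 1)            # Step 4: sum, divide by n - 1
    return math.sqrt(variance)                   # Step 5: square root

set_1 = [100, 100, 100, 100, 100]   # hypothetical: all values equal the mean
set_2 = [90, 90, 100, 110, 110]     # hypothetical: four values 10 from the mean

print(sample_standard_deviation(set_1))  # 0.0
print(sample_standard_deviation(set_2))  # 10.0
```

For `set_2` the squared differences sum to 400, so the variance is 400 / 4 = 100 and the standard deviation is 10, in line with the "roughly the average distance from the mean" interpretation.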

Page 38: 161.120 Introductory Statistics  Week 2 Lecture slides

Interpreting the Standard Deviation for Bell-Shaped Curves:

The Empirical Rule

For any bell-shaped curve, approximately

• 68% of the values fall within 1 standard deviation of the mean in either direction

• 95% of the values fall within 2 standard deviations of the mean in either direction

• 99.7% of the values fall within 3 standard deviations of the mean in either direction

Page 39: 161.120 Introductory Statistics  Week 2 Lecture slides

The Empirical Rule, the Standard Deviation, and the Range

• Empirical Rule => the range from the minimum to the maximum data values equals about 4 to 6 standard deviations for data with an approximate bell shape.

• You can get a rough idea of the value of the standard deviation by dividing the range by 6.

$s \approx \dfrac{\text{Range}}{6}$

Page 40: 161.120 Introductory Statistics  Week 2 Lecture slides

Example 2.11 Women’s Heights (cont)

Mean height for the 199 British women is 1602 mm and standard deviation is 62.4 mm.

• 68% of the 199 heights would fall in the range 1602 ± 62.4, or 1539.6 to 1664.4 mm

• 95% of the heights would fall in the interval 1602 ± 2(62.4), or 1477.2 to 1726.8 mm

• 99.7% of the heights would fall in the interval 1602 ± 3(62.4), or 1414.8 to 1789.2 mm
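
The three intervals can be reproduced with a short Python sketch (the function name is an illustrative assumption; the mean and standard deviation are the ones given in this example).

```python
def empirical_rule_intervals(mean, sd):
    """Intervals mean ± k*sd for k = 1, 2, 3
    (about 68%, 95% and 99.7% of values for bell-shaped data)."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}

intervals = empirical_rule_intervals(1602, 62.4)
for k, (low, high) in intervals.items():
    print(f"within {k} SD: {low:.1f} to {high:.1f} mm")
```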

Page 41: 161.120 Introductory Statistics  Week 2 Lecture slides

Example 2.11 Women’s Heights (cont)

Summary of the actual results:

Note: The minimum height = 1410 mm and the maximum height = 1760 mm, for a range of 1760 – 1410 = 350 mm. So an estimate of the standard deviation is:

$s \approx \dfrac{\text{Range}}{6} = \dfrac{350}{6} = 58.3 \text{ mm}$

Page 42: 161.120 Introductory Statistics  Week 2 Lecture slides

Standardized z-Scores

Standardized score or z-score:

$z = \dfrac{\text{Observed value} - \text{Mean}}{\text{Standard deviation}}$

Example: Mean resting pulse rate for adult men is 70 beats per minute (bpm), standard deviation is 8 bpm. The standardized score for a resting pulse rate of 80:

$z = \dfrac{80 - 70}{8} = 1.25$

A pulse rate of 80 is 1.25 standard deviations above the mean pulse rate for adult men.
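
As a small Python sketch of the same calculation (the function name is an illustrative assumption):

```python
def z_score(observed, mean, sd):
    """Number of standard deviations the observed value lies from the mean."""
    return (observed - mean) / sd

print(z_score(80, 70, 8))   # 1.25  (above the mean)
print(z_score(62, 70, 8))   # -1.0  (a pulse of 62 bpm is 1 SD below the mean)
```

A negative z-score simply indicates a value below the mean.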

Page 43: 161.120 Introductory Statistics  Week 2 Lecture slides

The Empirical Rule Restated

For bell-shaped data,
• About 68% of the values have z-scores between –1 and +1.
• About 95% of the values have z-scores between –2 and +2.
• About 99.7% of the values have z-scores between –3 and +3.

Page 44: 161.120 Introductory Statistics  Week 2 Lecture slides

Transformations

• Sometimes it is convenient to express numbers on a different scale

Americans easily recognise that 90° Fahrenheit is a hot day.

We understand temperatures better on the Celsius scale.

• No gain or loss of information (usually)

• Graphical and numerical summaries are affected.

Transformations can help us understand a data set

Page 45: 161.120 Introductory Statistics  Week 2 Lecture slides

Linear transformations

new value = a + b × old value
  – imperial to metric measurements: grams = 28.3495 × ounces
  – temperature: Fahrenheit = 32 + 1.8 × Celsius

• Relative positions of the points do not change so we neither gain nor lose information.

Page 46: 161.120 Introductory Statistics  Week 2 Lecture slides

Linear transformations

• Affect the centre and spread of the data

• Shape remains unchanged

• Graphical displays: only the numbers labeling the axis change

• Do not help you to understand the distribution of values in the data
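
A Python sketch of these points, using hypothetical Celsius temperatures (the data and function names are illustrative assumptions). For new value = a + b × old value, the mean becomes a + b × mean and the standard deviation becomes |b| × standard deviation, while the z-scores, and hence the shape, stay the same.

```python
from statistics import mean, stdev

def linear_transform(values, a, b):
    """Apply new value = a + b * old value to every data value."""
    return [a + b * x for x in values]

def z_scores(values):
    m, s = mean(values), stdev(values)
    return [(x - m) / s for x in values]

celsius = [10, 15, 20, 25, 30]                        # hypothetical temperatures
fahrenheit = linear_transform(celsius, a=32, b=1.8)   # Fahrenheit = 32 + 1.8 * Celsius

print(mean(celsius), mean(fahrenheit))     # centre changes
print(stdev(celsius), stdev(fahrenheit))   # spread is multiplied by 1.8
print(z_scores(celsius))
print(z_scores(fahrenheit))                # same relative positions
```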

Page 47: 161.120 Introductory Statistics  Week 2 Lecture slides

Nonlinear transformations

Examples:

– The wavelength of radiation (in metres) may alternatively be recorded as a frequency (in cycles per second) -- a reciprocal relationship.

– A medical researcher might record the mean time between seizures for acute epileptic patients, or the rate of seizures per year -- another reciprocal relationship.

– The Richter scale transforms the measured intensity of earthquakes to a logarithmic scale.

Nonlinear transformations…

• Change the relative distances between data values

• Change the shape of a distribution

Page 48: 161.120 Introductory Statistics  Week 2 Lecture slides

Logarithmic transformations

• The most commonly used nonlinear transformation replaces each value by its logarithm

new value = log10(old value)

• base-10 logarithms easier to interpret and used in CAST, but natural logarithms (base e) have a similar effect.

• logarithms can be found only for positive numbers.

• log10(1) = 0, log10(10) = 1, log10(100) = 2, log10(1000) = 3,

• log10(0.1) = –1, log10(0.01) = –2, etc.

• Spreads out low values in a distribution and compresses high values.

• Useful for skew data with a long tail towards the high values.
  – It will spread out a dense cluster of low values and may detect clustering or outliers that would not be visible in graphical displays of the original data.
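
A minimal Python sketch of the log transformation, using hypothetical right-skewed values (the data and the function name are illustrative assumptions). It shows how gaps between large values shrink relative to gaps between small values.

```python
import math

def log10_transform(values):
    """Replace each value by its base-10 logarithm (positive values only)."""
    if any(v <= 0 for v in values):
        raise ValueError("logarithms can be found only for positive numbers")
    return [math.log10(v) for v in values]

skewed = [1, 2, 5, 10, 100, 1000]   # hypothetical data with a long upper tail
logged = log10_transform(skewed)

# Gap between the two smallest and the two largest values, before and after.
print(skewed[1] - skewed[0], skewed[-1] - skewed[-2])   # 1 and 900
print(logged[1] - logged[0], logged[-1] - logged[-2])   # about 0.30 and 1.0
```

On the original scale the top gap is 900 times the bottom gap; on the log scale it is only about 3.3 times as large.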

Page 49: 161.120 Introductory Statistics  Week 2 Lecture slides

A family of nonlinear transformations

Power transformations: raise each value in the data set to a power p

• p < 1 increases the spread in the lower tail of data values and decreases the spread in the upper tail.

• p > 1 expands the upper tail of the data values and compresses the lower tail. (Rarely helpful)

Page 50: 161.120 Introductory Statistics  Week 2 Lecture slides

Discrete Data Displays

• Large counts
  – The distribution of values can be summarised with the same methods as continuous data.

• Moderate counts
  – Most of the earlier displays can still be used, but
    • Stacked dot plots are better than jittered dot plots
      – No information is lost by stacking since there can be a column of crosses for each distinct value.
    • Histogram class boundaries should end in '.5' to ensure that data values do not occur on the boundary of two classes.
    • Since the median, quartiles and extremes are always whole numbers (or occasionally half-way between two whole numbers), box plots do not give a very effective comparison of groups.

• Small counts
  – A bar chart is a better representation of the data than a histogram

