objectives - dept. of statistics, texas a&m university

58
Objectives Describing distributions with numbers p Measures of center: mean, median OS3 Section 1.6.2 p Mean versus median p Measure of spread: standard deviation and the IQR OS3 Section 1.6.4 p The Boxplot p Changing the unit of measurement p The z-transform

Upload: others

Post on 22-Feb-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Objectives - Dept. of Statistics, Texas A&M University

ObjectivesDescribing distributions with numbers

p Measures of center: mean, median OS3 Section 1.6.2

p Mean versus median

p Measure of spread: standard deviation and the IQR OS3 Section 1.6.4

p The Boxplot

p Changing the unit of measurement

p The z-transform

Page 2: Objectives - Dept. of Statistics, Texas A&M University

Topics:Summarystatisticsp Learning Targets:

p Know what a mean and median is.p Know what a Quartile, IQR and standard deviation is.p Know where to roughly place the mean and standard deviation on a

histogram.p Understand the effect outliers will have an all of the above.p Understand how linear transformations of data will effect mean, standard

deviations, and quartiles. p Know the z-transform and how it can be used to make comparisons

between different data sets. p Know that the mean and standard deviation of z-transformed data is

zero and one.

Page 3: Objectives - Dept. of Statistics, Texas A&M University

NumericalsummariesofDatap Summarizing the data with a few numbers can simplify

comparisons between samples.

p Two simple ways to describe the data is:p A number which describes its center.p A number which describes its spread.

p These numbers will not give a true description of the distribution of the data. p For example, using these numbers we cannot identify

whether it is unimodal or bimodal.

Page 4: Objectives - Dept. of Statistics, Texas A&M University

Measuresofcenter

Page 5: Objectives - Dept. of Statistics, Texas A&M University

The sample mean or arithmetic average

To calculate the average, or sample mean, add all values, then divide by the number of cases. It is the “center of mass” of the histogram. We often denote it with the symbol (say “x bar”).

Sum of heights is 1598.3 and n = 25.Dividing by 25 gives 63.9 inches.

58.2 64.059.5 64.560.7 64.160.9 64.861.9 65.261.9 65.762.2 66.262.2 66.762.4 67.162.9 67.863.9 68.963.1 69.663.9

Measureofcenter1:themean

x = 1598.325

= 63.9

x

Page 6: Objectives - Dept. of Statistics, Texas A&M University

Measureofcenter2:themedianThe median (M) is the midpoint of a distribution—the number such that half of the observations are smaller and half are larger.

1. Sort observations by size.n = number of observations

______________________________

1 1 0.62 2 1.23 3 1.64 4 1.95 5 1.56 6 2.17 7 2.38 8 2.39 9 2.5

10 10 2.811 11 2.912 3.313 3.414 1 3.615 2 3.716 3 3.817 4 3.918 5 4.119 6 4.220 7 4.521 8 4.722 9 4.923 10 5.324 11 5.6

n = 24 èn/2 = 12

Median = (3.3+3.4) /2 = 3.35

2.b. If n is even, the median is the mean of the two middle observations.

1 1 0.62 2 1.23 3 1.64 4 1.95 5 1.56 6 2.17 7 2.38 8 2.39 9 2.5

10 10 2.811 11 2.912 12 3.313 3.414 1 3.615 2 3.716 3 3.817 4 3.918 5 4.119 6 4.220 7 4.521 8 4.722 9 4.923 10 5.324 11 5.625 12 6.1

ç n = 25 (n+1)/2 = 26/2 = 13 Median = 3.4

2.a. If n is odd, the median is observation (n+1)/2 down the list

Page 7: Objectives - Dept. of Statistics, Texas A&M University

The mean and median for a symmetric

distribution are the same

Comparing the mean and the median

If the distribution is symmetrical then the mean and median are the same. Mean is center of

balance

Median splits the area in half

Page 8: Objectives - Dept. of Statistics, Texas A&M University

Left skewed Right skewedMean

MedianMean

Median

Comparing the mean and the median

The median is a measure of center that is resistant to skewness and outliers. The mean is not. If there is a skew in the data the mean tends to be pulled towards the side of the skew (long tail).

Page 9: Objectives - Dept. of Statistics, Texas A&M University

Example:MeanandMedianq Comparing means and medians. Statcrunch: Applets -> Mean/SD

vs. Median/IQR (you can load your own data by using Data Table) and press compute (move the balls around and see what happens to the Mean and Median).

Here we plot the reviews of the Smart tracker. We see that a few bad reviews can push down the average but has no influence on the median.It is useful to state both both the average and median in a report.

Page 10: Objectives - Dept. of Statistics, Texas A&M University

q You can do the same thing in using your own distribution Applets -> Sampling Distributions (press compute) -> customize the plot by left clicking over the plot and tracing the plot you want. Observe how the mean and median change.

q Place the cursor over the distribution to change it. Watch how the mean and median change too.

Page 11: Objectives - Dept. of Statistics, Texas A&M University

QuestionTimep One of the 5 star ratings turns out to be misclassified, it

is reclassified as a 1 star rating? What happens to the mean and the median?

(A) The mean decreases and the median stays the same.

(B) The mean increases and the median stays the same.

(C)The mean stays the same and the median decreases.

Page 12: Objectives - Dept. of Statistics, Texas A&M University

Statcrunch:Summarystatistics

q Load the data into Statcrunch.

q To obtain the summary statistics:

o Go to Stat -> Summary Stat -> Columns, o A drop down menu appears. In select column(s) choose which variables

you want to summarize (you can choose several by highlighting and pressing tab at the same time).

o Click Compute!

q You will obtain several different summary statistics, which look like this:

Page 13: Objectives - Dept. of Statistics, Texas A&M University

Measuresofspread

Page 14: Objectives - Dept. of Statistics, Texas A&M University

M = median = 3.4

Q1= first quartile = 2.2

Q3= third quartile = 4.35

1 1 0.62 2 1.23 3 1.64 4 1.95 5 1.56 6 2.17 7 2.38 1 2.39 2 2.5

10 3 2.811 4 2.912 5 3.313 3.414 1 3.615 2 3.716 3 3.817 4 3.918 5 4.119 6 4.220 7 4.521 1 4.722 2 4.923 3 5.324 4 5.625 5 6.1

Measureofspread1:thequartiles

The first quartile, Q1, is the value in the

sample that has 25% of the data at or

below it (ó it is the median of the lower

half of the sorted data, excluding M).

The third quartile, Q3, is the value in the

sample that has 75% of the data at or

below it (ó it is the median of the upper

half of the sorted data, excluding M).

The IQR (interquartile range) is the

difference between the third and first

quartile. It tells is where 50% of lie.

Page 15: Objectives - Dept. of Statistics, Texas A&M University

M = median = 3.4

Q3= third quartile = 4.35

Q1= first quartile = 2.2

25 6 6.124 5 5.623 4 5.322 3 4.921 2 4.720 1 4.519 6 4.218 5 4.117 4 3.916 3 3.815 2 3.714 1 3.613 3.412 6 3.311 5 2.910 4 2.89 3 2.58 2 2.37 1 2.36 6 2.15 5 1.54 4 1.93 3 1.62 2 1.21 1 0.6

Largest = max = 6.1

Smallest = min = 0.6

Disease X0

1

2

3

4

5

6

7

Year

s un

til d

eath

Five-numbersummaryandboxplotBOXPLOT

Interquartile rangeQ3 – Q1

4.35 − 2.2 = 2.15

Page 16: Objectives - Dept. of Statistics, Texas A&M University
Page 17: Objectives - Dept. of Statistics, Texas A&M University

Boxplotandhistogramsp Boxplots are zoomed out versions of the histogram which

give a summary of the distribution using the quartiles.

The histogram and boxplot of a symmetric distribution.

Page 18: Objectives - Dept. of Statistics, Texas A&M University

.

The histogram and boxplot of a right skewed distribution.

Page 19: Objectives - Dept. of Statistics, Texas A&M University

p For the Old Faithful Data the Boxplot is not very informative.

p It does not inform about multiple modes.

Page 20: Objectives - Dept. of Statistics, Texas A&M University

QuestionTime

Page 21: Objectives - Dept. of Statistics, Texas A&M University

QuestionTime

Page 22: Objectives - Dept. of Statistics, Texas A&M University

The most common measure of spread is the standard deviation which is the average spread about the mean:

deviation value x= -1. First calculate the deviations.

2s s=

3. Finally, take the square root to get the standard deviation s.

Measureofspread2:thestandarddeviation

Mean� 1 s.d.

x2 sum of squared deviations

1s

n=

-

2. Then calculate the variance s2.

Page 23: Objectives - Dept. of Statistics, Texas A&M University

The standard deviation “s” is used to describe the variation around the mean. Like the mean, it is not resistant to skew or outliers.

deviation value x= -1. First calculate the deviations.

2s s=

3. Finally, take the square root to get the standard deviation s.

Measureofspread2:Acalculation

2 sum of squared deviations1

sn

=-

2. Then calculate the variance s2.

Numerical Example:

q We observe: 1,3,4,5,8

q The sample mean is 4.2

q Deviation of data from mean is -3.2 -1.2 -0.2 0.8 3.8

q Squared deviation is 10.24 1.44 0.04 0.64 14.44

q Average of squares is: 6.7.

q 6.7 is too large it almost covers the entire data set. This is because we squared the deviation. We standardize by taking the root s = 2.58 = √6.7

Page 24: Objectives - Dept. of Statistics, Texas A&M University

Inpictures

(a) We observe: 1,3,4,5,8 (b) The sample mean is 4.2 (c) Deviation of data from mean is -3.2 -1.2 -0.2 0.8 3.8 (d) Squared deviation is 10.24 1.44 0.04 0.64 14.44 (e) Average of squares is: 6.7. (f) The standard deviation is s = 2.58 = √6.7.

Page 25: Objectives - Dept. of Statistics, Texas A&M University
Page 26: Objectives - Dept. of Statistics, Texas A&M University

The standard deviation “s” is invariant to shifts. We add 10 the previous sample, the mean shifts by 10 but the standard deviation is the same.

(a) We observe: 10,13,14,15,18 (b) The sample mean is 14.2 (c) Deviation of data from mean is -3.2 -1.2 -0.2 0.8 3.8 (d) Squared deviation is 10.24 1.44 0.04 0.64 14.44 (e) Average of squares is: 6.7. (f) The standard deviation is s = 2.58 = √6.7.

Invarianceofstandarddeviationtoshift

Page 27: Objectives - Dept. of Statistics, Texas A&M University
Page 28: Objectives - Dept. of Statistics, Texas A&M University

QuestionTimep You observe the sample 1,1,16,8,9,10,14,19.p The average (sample mean) is 9.75 and the standard

deviation is 6.54p What happens to the average and standard deviation if

we subtract 5 from each observation?A. The mean is 4.75 and the standard deviation is 1.54B. The mean is 9.75 and the standard deviation is 6.54C.The mean is 4.75 and the standard deviation is 6.54

Page 29: Objectives - Dept. of Statistics, Texas A&M University

QuestionTimep You observe the sample 1,1,16,8,9,10,14,19.p The average (sample mean) is 9.75 and the standard

deviation is 6.54p What happens to the average and standard deviation if

we subtract 5 from each observation?A. The mean is 4.75 and the standard deviation is 1.54B. The mean is 9.75 and the standard deviation is 6.54C.The mean is 4.75 and the standard deviation is 6.54

Page 30: Objectives - Dept. of Statistics, Texas A&M University

QuestionTimep You observe the sample 1,1,16,8,9,10,14,19.p The average (sample mean) is 9.75 and the standard

deviation is 6.54p What happens to the average and standard deviation if

we subtract 5 from each observation?A. The mean is 4.75 and the standard deviation is 1.54B. The mean is 9.75 and the standard deviation is 6.54C.The mean is 4.75 and the standard deviation is 6.54

Page 31: Objectives - Dept. of Statistics, Texas A&M University

Heightsofstudents:meanandstd.dev.p Place the mean and standard deviation on the histogram

belowp Compare the std. dev. with the half the range.

Page 32: Objectives - Dept. of Statistics, Texas A&M University

Comparingheightsp This is a summary of a sample of heights taken in class.

We group them into groups of 5.

p We make a histogram of the raw data (20 observations) and the averages (in this case just 4 of them). Compare them.

Page 33: Objectives - Dept. of Statistics, Texas A&M University

Histogramofheightsandaverages

List what is the same and different:

Same:

Different:

Page 34: Objectives - Dept. of Statistics, Texas A&M University

Example1:IQRvsStd deviation

p In Statcrunch go to Applets -> Mean/SD vs. Median/IQR, tick the boxes standard deviation and IQR and press computer. Move the balls around and compare the IQR with the standard deviation.

Observe how an outlier can change the standard deviation (just like the mean), but the IQR is not effected.

Page 35: Objectives - Dept. of Statistics, Texas A&M University

Example2:IQRvsStd.deviationp Here we illustrate further differences between the two measures of

spread.

q If all the values in the data set take the same value, then IQR = 0 and standard deviation = 0. The standard deviation is only zero if all the values are the same.

q If most of the values in the data set are the same (but not all), in this case the standard deviation cannot be zero but the IQR = 0.

Page 36: Objectives - Dept. of Statistics, Texas A&M University

Properties oftheStandardDeviation

p Usually, s > 0. s = 0 only when all observations have the same value and there is no spread. Eg. The data is 1, 1, 1, 1, 1. The sample mean is 1. The difference about the mean is 0, 0, 0, 0, 0. Thus s = 0.

p s has the same units of measurement as the original observations.

p s measures spread about the mean.

Page 37: Objectives - Dept. of Statistics, Texas A&M University

p Rule of Thumb `Most’ of the data is within two standard deviations of the mean:

The more standard deviations it is from the mean, the more `extreme’ it is. We calculate the number of standard deviations using the z-transform (we do this in the next few slides).

p s is not resistant to outliers or skewness. That is, a few extreme values can change s considerably.

2 and 2 .x s x s- +

Page 38: Objectives - Dept. of Statistics, Texas A&M University

Disease X:

Symmetric distribution…

253.391.48

nxs

===

Multiple myeloma:253.343.17

nxs

===

… and a right-skewed distribution

Visualizing mean and std. dev.

x2x s- 2x s+

x 2x s+x s- x s+

x s- x s+

3x s+

Data mostly within 1 st. dev. of the mean and nearly all within 2 st. dev.

Data to the left of the mean are more bunched together than data to the right of the mean.

Page 39: Objectives - Dept. of Statistics, Texas A&M University

QuestionTime

Page 40: Objectives - Dept. of Statistics, Texas A&M University

ChangingtheunitofmeasurementVariables can be recorded in different units of measurement. Most often, one measurement unit is a linear transformation of another measurement unit: xnew = a + bxold.

Temperatures can be expressed in degrees Fahrenheit or degrees Celsius.TempFahrenheit = 32 + 1.8* TempCelsius è a + bx.

Linear transformations do not change the basic shape of a distribution (skew, symmetry, multimodal). But they do change the measure of centerand spread:

p Multiplying each observation by a positive number b multiplies both measures of center (mean, median) and spread (s, IQR) by b.

p Adding the same number a (positive or negative) to each observation adds ato measures of center (mean, median) and to quartiles but it does not change measures of spread (s, IQR).

Page 41: Objectives - Dept. of Statistics, Texas A&M University

Thenewmeanandstandarddeviationinalineartransformation

p Consider the transformation Fahrenheit = 32+1.8�Celsius.p The new mean after transformation is = 32 +1.8�old mean.p The standard deviation after transformation is

= 1.8�old standard deviation.q In general if the transformation is Y = a + bX

p The new mean after transformation is = a +b�old mean.p The standard deviation after transformation is

= |b|�old standard deviation (remember to use the positive, of b, the standard deviation can never be negative).

Page 42: Objectives - Dept. of Statistics, Texas A&M University

Example1q 5 students are 19, 19, 20, 21, 22 years old. Their mean age

is 20.2 years and their standard deviation is 1.3

qQuestion: In 10 years time what will be their mean and standard deviation?

qAnswer:

a = 10 and b =1. The mean age will be 20.2+10 = 30.2 but the amount of variation remains the same, 1.3 � 1 = 1.3 (the data has just shifted to the right).

Page 43: Objectives - Dept. of Statistics, Texas A&M University

Example2

q 5 children are 0.5, 1.5, 2, 3.2, 3.8 years old. The mean age is 2.2 years and their standard deviation is 1.32.qQuestion: Suppose we convert years into months, what

is their age and standard deviation?qAnswer:

a = 0 and b = 12. The mean age in months is 2.2 � 12 = 26.4 and the standard deviation is 1.32 � 12 = 15.84.Because the units have changed (from years to months), it `looks’ like the mean and std. dev. have increased. Of course. this is not the case, it is simply that a different unit of measurement has been used.

Page 44: Objectives - Dept. of Statistics, Texas A&M University

QuestionTimep Suppose we observe the data 1, 3, 4, 5, 6. It has the sample mean

3.8 and sample standard deviation 1.92. p Question: The data is transformed with the transformation y = -2x

i.e. -12, -10, -8, -6, -2 p What is the new sample mean and standard deviation?

A. The new mean is |2|�3.8 = 7.6 and the new standard deviation is |2|�1.92 = 3.84

B. The new mean is -2�3.8 = -7.6 and the new standard deviation is |2|�1.92 = 3.84

C. The new mean is |2|�3.8 = 7.6 and the new standard deviation is-2�1.92 = -3.84

Page 45: Objectives - Dept. of Statistics, Texas A&M University

Thez-transform/z-score

This is a “relative” distance, that takes into account the variation of the distribution.

OS3: Section 3.1 (page 130)

Page 46: Objectives - Dept. of Statistics, Texas A&M University

Comparingtestscoresp Below is the distribution of test scores for two exams

p The mean score in both exams is roughly the same (about 33.5).p A student scores 28 in both exams. Which exam did the student do

worse in?

Page 47: Objectives - Dept. of Statistics, Texas A&M University

Relativedistance/z-score/z-transformp In both exams the student scores 5.5 points less than the

mean. p -5.5 (we use a negative since it is less than the average) does

not adequately describe the performance of the student. p A better measure would be one which also took into account

the spread of grades. p The larger the “spread” the closer 28 point is to the mean

33.5.p One distance measure which takes into account spread is

called the z-score or z-transform.

Page 48: Objectives - Dept. of Statistics, Texas A&M University

p Exam 1

p Exam 2

z-transform =data - mean

st.dev

Score 1 =28� 34

10.34= �0.58

Score 1 =28� 33.2

5.88= �0.88

Page 49: Objectives - Dept. of Statistics, Texas A&M University

p Compare the z-transform with the distribution

Page 50: Objectives - Dept. of Statistics, Texas A&M University

z-transform =data - mean

st.dev

q Despite the student getting the same score in both exams and the mean score in both exams being the same, the student performed relatively worse in Exam 2 as compared with Exam 1.

q By comparing z-transforms we can determine which observations are more “unusual”.

q Determining “unusual” is very important in statistics.

q A student scores 45 in both exams. In which exam did the student do better relative to the class?

Page 51: Objectives - Dept. of Statistics, Texas A&M University

q Example: Often we want to compare temperatures in different regions of the World. For example, the temperatures between Manchester (UK) and College Station (USA). It does not make much sense comparing their raw temperature. College Station is a lot warmer during the summer months. On Sept 15th in Manchester it was 14C whereas in CS it was 87F (30C).

q But we can make a comparison against what is `normal’ in that part of the world.

q For example, during September on average it is 18 degrees Celsius in Manchester with standard deviation 3C, in College station on average it is 90 (32C) Fahrenheit with standard deviation 10F.

Weather

Page 52: Objectives - Dept. of Statistics, Texas A&M University

q Which place experienced the more “unusual” weather on Sept 15th?q Manchester is 4C less than the mean of 18C.q College Station is 3F less that the mean of 90F.q But these numbers do not take into account the regional

variability of the temperatures nor do they account for different units of measurements.

Weather

Page 53: Objectives - Dept. of Statistics, Texas A&M University

q The z-transform of the Manchester temperature is

q The z-transform of the College Station temperature is

Thez-transformoftemperaturesz-transform =

data - mean

st.dev

Manchester =14� 18

3= �1.33

CS =87� 90

10= �0.33

Page 54: Objectives - Dept. of Statistics, Texas A&M University

Make a plot (centered about zero) with both the transforms -1.33, -0.33

q The z-transforms show that the weather in Manchester on Sept 15th was more “unusually” cold than the weather in College Station.

q The z-transform is said to standardize data as it is free of units of measurements. It is centered about zero and has standard deviation one.

Thez-transformoftemperatures

Page 55: Objectives - Dept. of Statistics, Texas A&M University

p The z-transform measures the number of standard deviations the data is from the mean.

p If the z-transform is negative it is to the left of the mean. p If the z-transform is positive it is to the right of the mean. p The larger the z-transform the further it is from the mean;

the more extreme the data is relative to the mean.

Thez-transformandthestandarddeviation

z-transform =data - mean

st.dev

Page 56: Objectives - Dept. of Statistics, Texas A&M University

QuestionTimep The mean heights of students is 68 inches and the

standard deviation is 4 inches. p A student is 62 inches, how many standard deviation are

they from the mean?A. The z-transform is -1.5, they are 1.5 standard

deviations to the LEFT of the mean.B. The z-transform is -1.5, they are 1.5 standard

deviations to the RIGHT of the mean.C.The z-transform is -6, they are 6 standard deviations

to the LEFT of the mean.

Page 57: Objectives - Dept. of Statistics, Texas A&M University

QuestionTimep Male heights have a mean 68 inches and standard

deviation 4 inches. p Female heights have a mean 65 inches and a standard

deviation 2 inches. p Question Suppose Jane is 68 inches tall and Peter is 72

inches tall. Who is more “extreme/unusual”, relative to their gender?A. Jane B. Peter

Page 58: Objectives - Dept. of Statistics, Texas A&M University

AccompanyingproblemsassociatedwiththisChapter

p HW 2 and 3