161.120 introductory statistics week 3 lecture slides

161.120 Introductory Statistics Week 3 Lecture slides Exploring Bivariate Data: – Scatterplots Text section 5.1 CAST sections 3.1 and 3.2 Least Squares & Nonlinear relationships Text section 5.2 CAST sections 3.3 and 3.4 – Correlation Text section 5.3 and 5.4 CAST section 3.5 Multivariate Data CAST section 3.6

Upload: mckenzie-reed

Post on 04-Jan-2016

Page 1: 161.120 Introductory Statistics  Week 3 Lecture slides

161.120 Introductory Statistics Week 3 Lecture slides

• Exploring Bivariate Data: – Scatterplots

• Text section 5.1• CAST sections 3.1 and 3.2

– Least Squares & Nonlinear relationships• Text section 5.2• CAST sections 3.3 and 3.4

– Correlation• Text section 5.3 and 5.4• CAST section 3.5

• Multivariate Data– CAST section 3.6

Page 2: 161.120 Introductory Statistics  Week 3 Lecture slides

• Univariate Data

– Single measurement from each individual

– Cannot associate variation in that measurement with other characteristics of the individuals

• all variation is unexplained.

• Bivariate Data

– Two measurements from each individual

– May be able to associate variation in one measurement with changes in the other measurement

• can explain some of the variation.

– Examples are ...

Blood pressure and weight of males in their 50s

Carbohydrate content and moisture content of corn

– Our aim with such data is to find information about the relationship between the variables.

Page 3: 161.120 Introductory Statistics  Week 3 Lecture slides

Three Tools we will use …

• Scatterplot, a two-dimensional graph of data values

• Correlation, a statistic that measures the strength and direction of a linear relationship

• Regression equation, an equation that describes the average relationship between a response and explanatory variable

Page 4: 161.120 Introductory Statistics  Week 3 Lecture slides

Scatterplots

• The relationship between two variables cannot be determined from examination of the two variables in isolation.

Page 5: 161.120 Introductory Statistics  Week 3 Lecture slides

5.1 Looking for Patterns with Scatterplots

Questions to Ask about a Scatterplot

• What is the average pattern? Does it look like a straight line or is it curved?

• What is the direction of the pattern?

• How much do individual points vary from the average pattern?

• Are there any unusual data points?

Page 6: 161.120 Introductory Statistics  Week 3 Lecture slides

Positive/Negative Association

• Two variables have a positive association when the values of one variable tend to increase as the values of the other variable increase.

• Two variables have a negative association when the values of one variable tend to decrease as the values of the other variable increase.

Page 7: 161.120 Introductory Statistics  Week 3 Lecture slides

Example 5.1 Height and Handspan

Data shown are the first 12 observations of a data set that includes the heights (in inches) and fully stretched handspans (in centimeters) of 167 college students.

Data: Height (in.) Span (cm)

71 23.5 69 22.0 66 18.5 64 20.5 71 21.0 72 24.0 67 19.5 65 20.5 76 24.5 67 20.0 70 23.0 62 17.0

and so on, for n = 167 observations.

Page 8: 161.120 Introductory Statistics  Week 3 Lecture slides

Example 5.1 Height and Handspan

Taller people tend to have greater handspan measurements than shorter people do.

When two variables tend to increase together, we say that they have a positive association.

The handspan and height measurements may have a linear relationship.

Page 9: 161.120 Introductory Statistics  Week 3 Lecture slides

Example 5.2 Driver Age and Maximum Legibility Distance of Highway Signs

• A research firm determined the maximum distance at which each of 30 drivers could read a newly designed sign.

• The 30 participants in the study ranged in age from 18 to 82 years old.

• We want to examine the relationship between age and the sign legibility distance.

Page 10: 161.120 Introductory Statistics  Week 3 Lecture slides

Example 5.2 Driver Age and Maximum Legibility Distance of Highway Signs

• We see a negative association with a linear pattern.

• We will use a straight-line equation to model this relationship.

Page 11: 161.120 Introductory Statistics  Week 3 Lecture slides

Example 5.3 The Development of Musical Preferences

• The 108 participants in the study ranged in age from 16 to 86 years old.

• We want to examine the relationship between song-specific age (age in the year the song was popular) and musical preference (positive score => above average, negative score => below average).

Page 12: 161.120 Introductory Statistics  Week 3 Lecture slides

Example 5.3 The Development of Musical Preferences

• Popular music preferences are acquired in late adolescence and early adulthood.

• The association is nonlinear.

Page 13: 161.120 Introductory Statistics  Week 3 Lecture slides

Groups and Outliers

• Use different plotting symbols or colors to represent different subgroups.

• Look for outliers: points that have an unusual combination of data values.

Page 14: 161.120 Introductory Statistics  Week 3 Lecture slides

• In both univariate and bivariate data sets, outliers or clusters must be very distinct before we should conclude that they are real, in the absence of further external information confirming that the individuals are distinct.

• Particularly in small data sets, outliers, clusters and other patterns may arise by chance, without being associated with any real features in the individuals.

• Be careful not to overinterpret features in a scatterplot unless they are well defined, especially if the sample size is small.

Page 15: 161.120 Introductory Statistics  Week 3 Lecture slides

5.2 Describing Linear Patterns with a Regression Line

Two purposes of the regression line:

• to estimate the average value of y at any specified value of x

• to predict the value of y for an individual, given that individual’s x value

When the best equation for describing the relationship between x and y is a straight line, the equation is called the regression line.

Page 16: 161.120 Introductory Statistics  Week 3 Lecture slides

Example 5.1 Height and Handspan (cont)

Regression equation: Handspan = -3 + 0.35 Height

Estimate the average handspan for people 60 inches tall: Average handspan = -3 + 0.35(60) = 18 cm.

Predict the handspan for someone who is 60 inches tall: Predicted handspan = -3 + 0.35(60) = 18 cm.

Page 17: 161.120 Introductory Statistics  Week 3 Lecture slides

Example 5.1 Height and Handspan (cont)

Regression equation: Handspan = -3 + 0.35 Height

Slope = 0.35 => Handspan increases by 0.35 cm, on average, for each increase of 1 inch in height.

In a statistical relationship, there is variation from the average pattern.
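The arithmetic for this example can be checked with a short sketch. The function name `predicted_handspan` is made up for illustration; only the equation Handspan = -3 + 0.35 Height comes from the slides.

```python
def predicted_handspan(height_in):
    """Handspan in cm from the slide's regression line: -3 + 0.35 * height (inches)."""
    return -3 + 0.35 * height_in


# Estimated average (and predicted individual) handspan at 60 inches: about 18 cm.
estimate_60 = predicted_handspan(60)

# The slope is the change in predicted handspan for a 1-inch increase in height.
slope_check = predicted_handspan(61) - predicted_handspan(60)

print(round(estimate_60, 2), round(slope_check, 2))
```

The same number serves as both the estimated average and the individual prediction. The difference lies in interpretation: individuals vary around the average.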

Page 18: 161.120 Introductory Statistics  Week 3 Lecture slides

The Equation for the Regression Line

$\hat{y} = b_0 + b_1 x$

$\hat{y}$ is spoken as “y-hat,” and it is also referred to either as predicted y or estimated y.

$b_0$ is the intercept of the straight line. The intercept is the value of y when x = 0.

$b_1$ is the slope of the straight line. The slope tells us how much of an increase (or decrease) there is for the y variable when the x variable increases by one unit. The sign of the slope tells us whether y increases or decreases when x increases.

Page 19: 161.120 Introductory Statistics  Week 3 Lecture slides

Example 5.2 Driver Age and Maximum Legibility Distance of Highway Signs (cont)

Regression equation: Distance = 577 - 3 Age

Estimate the average distance for 20-year-old drivers: Average distance = 577 – 3(20) = 517 ft.

Predict the legibility distance for a 20-year-old driver: Predicted distance = 577 – 3(20) = 517 ft.

Slope of –3 tells us that, on average, the legibility distance decreases 3 feet when age increases by one year.

Page 20: 161.120 Introductory Statistics  Week 3 Lecture slides

Extrapolation

• Usually a bad idea to use a regression equation to predict values far outside the range where the original data fell.

• No guarantee that the relationship will continue beyond the range for which we have observed data.

Page 21: 161.120 Introductory Statistics  Week 3 Lecture slides

Prediction Errors and Residuals

• Prediction Error = difference between the observed value of y and the predicted value $\hat{y}$.

• Residual = $y - \hat{y}$

Page 22: 161.120 Introductory Statistics  Week 3 Lecture slides

Example 5.2 Driver Age and Maximum Legibility Distance of Highway Signs (cont)

Regression equation: $\hat{y} = 577 - 3x$

x = Age    y = Distance    $\hat{y} = 577 - 3x$     Residual
18         510             577 – 3(18) = 523        510 – 523 = –13
20         590             577 – 3(20) = 517        590 – 517 = 73
22         516             577 – 3(22) = 511        516 – 511 = 5

Can compute the residual for all 30 observations.

Positive residual => observed value higher than predicted.

Negative residual => observed value lower than predicted.
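The three residuals shown for this example can be recomputed with a minimal sketch. The variable and function names are illustrative; the three (age, distance) pairs and the equation 577 – 3x are the ones given on the slide.

```python
# (age, observed legibility distance in feet) for the three drivers on the slide
observations = [(18, 510), (20, 590), (22, 516)]


def predicted_distance(age):
    """Predicted distance in feet from the slide's line: 577 - 3 * age."""
    return 577 - 3 * age


residuals = []
for age, distance in observations:
    y_hat = predicted_distance(age)
    residual = distance - y_hat  # observed minus predicted
    residuals.append(residual)
    print(age, distance, y_hat, residual)
```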

Page 23: 161.120 Introductory Statistics  Week 3 Lecture slides

Least Squares Line and Formulas

• Least Squares Regression Line: minimizes the sum of squared prediction errors.

• SSE = Sum of squared prediction errors.

• Formulas for Slope and Intercept:

$b_1 = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$

$b_0 = \bar{y} - b_1 \bar{x}$
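As a rough sketch, the slope and intercept formulas can be applied to the 12 height and handspan observations listed in Example 5.1. The function name `least_squares` is made up for illustration. Because these are only the first 12 of the 167 observations, the fitted values need not match the slide's equation Handspan = -3 + 0.35 Height.

```python
# First 12 observations from Example 5.1 (height in inches, handspan in cm)
heights = [71, 69, 66, 64, 71, 72, 67, 65, 76, 67, 70, 62]
spans = [23.5, 22.0, 18.5, 20.5, 21.0, 24.0, 19.5, 20.5, 24.5, 20.0, 23.0, 17.0]


def least_squares(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # b1 = sum of (x - x_bar)(y - y_bar) divided by sum of (x - x_bar)^2
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    # b0 = y_bar - b1 * x_bar
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sse(x, y, b0, b1):
    """Sum of squared prediction errors for the line b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


b0, b1 = least_squares(heights, spans)
print(round(b0, 3), round(b1, 3), round(sse(heights, spans, b0, b1), 3))
```

Changing either coefficient slightly and recomputing `sse` gives a larger value, which is what "least squares" means.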

Page 24: 161.120 Introductory Statistics  Week 3 Lecture slides

Linear model

• Only appropriate when the cloud of crosses in a scatterplot of the data is regularly spread around a straight line.

• If the crosses are scattered round a curve, the relationship is called nonlinear and other models must be used.

• Outliers should be investigated

• Detecting problems with the model

– Plot residuals against X to look for problems in the model

Page 25: 161.120 Introductory Statistics  Week 3 Lecture slides

Nonlinear Relationships

• If the relationship between Y and X is nonlinear, a linear model will give poor predictions and must be avoided.

• Transformation of one or both variables

– often possible to linearise the relationship and therefore use least squares to fit a linear model to the transformed variables

– For many data sets, a logarithmic transformation works, but a more general power transformation is sometimes needed to linearise the relationship

• Adding a quadratic term

– An alternative solution to the problem of curvature is to extend the simple linear model with the addition of a quadratic term
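The idea of linearising by transformation can be sketched with made-up data. The values below are hypothetical, generated from y = 2·exp(0.5x) purely for illustration, and are not from the course. Taking logs of y turns the curve into a straight line, so the least squares formulas apply to (x, log y).

```python
import math

# Hypothetical data following an exact exponential curve y = 2 * exp(0.5 * x)
x = [1, 2, 3, 4, 5, 6]
y = [2.0 * math.exp(0.5 * xi) for xi in x]

# Logarithmic transformation of the response: log(y) = log(2) + 0.5 * x
log_y = [math.log(yi) for yi in y]


def least_squares(xs, ys):
    """Return (b0, b1) for the least squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b1 = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys)) / sum(
        (a - x_bar) ** 2 for a in xs
    )
    return y_bar - b1 * x_bar, b1


# Fit a straight line to the transformed variables
b0, b1 = least_squares(x, log_y)

# Back-transform to predict y on the original scale
predicted_y_at_4 = math.exp(b0 + b1 * 4)
print(round(b0, 4), round(b1, 4), round(predicted_y_at_4, 4))
```

With real data the points would scatter around the line after transformation, and a residual plot would show whether the transformation was adequate.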

Page 26: 161.120 Introductory Statistics  Week 3 Lecture slides

5.3 Measuring Strength and Direction with Correlation

• The strength of the relationship is determined by the closeness of the points to a straight line.

• The direction is determined by whether one variable generally increases or generally decreases when the other variable increases.

Correlation r indicates the strength and the direction of a straight-line relationship.

Page 27: 161.120 Introductory Statistics  Week 3 Lecture slides

Interpretation of r and a Formula

• r is always between –1 and +1

• magnitude indicates the strength

• r = –1 or +1 indicates a perfect linear relationship

• sign indicates the direction

• r = 0 indicates a slope of 0 so knowing x does not change the predicted value of y

• Formula for correlation:

$r = \dfrac{1}{n-1} \sum_i \left(\dfrac{x_i - \bar{x}}{s_x}\right)\left(\dfrac{y_i - \bar{y}}{s_y}\right)$
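A minimal sketch of the correlation formula, as the sum of products of standardised x and y values divided by n – 1, applied to the 12 height and handspan observations listed in Example 5.1. The function name `correlation` is illustrative. Because this is a 12-point subset, the result need not equal the value reported for the full sample of 167.

```python
import statistics

# First 12 observations from Example 5.1 (height in inches, handspan in cm)
heights = [71, 69, 66, 64, 71, 72, 67, 65, 76, 67, 70, 62]
spans = [23.5, 22.0, 18.5, 20.5, 21.0, 24.0, 19.5, 20.5, 24.5, 20.0, 23.0, 17.0]


def correlation(x, y):
    """r = 1/(n-1) * sum of ((x - x_bar)/s_x) * ((y - y_bar)/s_y)."""
    n = len(x)
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    s_x, s_y = statistics.stdev(x), statistics.stdev(y)  # sample standard deviations
    products = (((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y))
    return sum(products) / (n - 1)


r = correlation(heights, spans)

# Reversing the direction of one variable flips the sign of r but not its magnitude.
r_flipped = correlation(heights, [-s for s in spans])
print(round(r, 3), round(r_flipped, 3))
```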

Page 28: 161.120 Introductory Statistics  Week 3 Lecture slides
Page 29: 161.120 Introductory Statistics  Week 3 Lecture slides

Example 5.1 Height and Handspan (cont)

Regression equation: Handspan = -3 + 0.35 Height

Correlation r = +0.74 => a somewhat strong positive linear relationship.

Page 30: 161.120 Introductory Statistics  Week 3 Lecture slides

Example 5.2 Driver Age and Maximum Legibility Distance of Highway Signs (cont)

Regression equation: Distance = 577 - 3 Age

Correlation r = -0.8 => a somewhat strong negative linear association.

Page 31: 161.120 Introductory Statistics  Week 3 Lecture slides

Example 5.6 Left and Right Handspans

If you know the span of a person’s right hand, can you accurately predict his/her left handspan?

Correlation r = +0.95 => a very strong positive linear relationship.

Page 32: 161.120 Introductory Statistics  Week 3 Lecture slides

Example 5.7 Verbal SAT and GPA

Grade point averages (GPAs) and verbal SAT scores for a sample of 100 university students.

Correlation r = 0.485 => a moderately strong positive linear relationship.

Page 33: 161.120 Introductory Statistics  Week 3 Lecture slides

Example 5.8 Age and Hours of TV Viewing

Relationship between age and hours of daily television viewing for 1913 survey respondents.

Correlation r = 0.12 => a weak connection.

Note: a few claimed to watch more than 20 hours/day!

Page 34: 161.120 Introductory Statistics  Week 3 Lecture slides

Example 5.9 Hours of Sleep and Hours of Study

Relationship between reported hours of sleep the previous 24 hours and the reported hours of study during the same period for a sample of 116 college students.

Correlation r = –0.36 => a not too strong negative association.

Page 35: 161.120 Introductory Statistics  Week 3 Lecture slides

Correlation Coefficient r

• Only describes the strength of linear relationships

– a good description of the strength of a relationship provided the crosses in a scatterplot of the data are not scattered round a curve.

• r may seriously underestimate the strength of a nonlinear relationship.

• A scatterplot should always be examined to help assess whether there are features in the data that the correlation coefficient cannot describe.

• Nonlinear relationships

– Transform the variables to linearise the relationship before evaluating r

Page 36: 161.120 Introductory Statistics  Week 3 Lecture slides

5.4 Why the Answers May Not Make Sense

• Allowing outliers to overly influence the results

• Combining groups inappropriately

• Using correlation and a straight-line equation to describe curvilinear data

Page 37: 161.120 Introductory Statistics  Week 3 Lecture slides

Example 5.4 Height and Foot Length (cont)

Regression equation

uncorrected data: 15.4 + 0.13 height

corrected data: -3.2 + 0.42 height

Correlation

uncorrected data: r = 0.28

corrected data: r = 0.69

Three outliers were data entry errors.
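How a single data-entry error can distort r is sketched below with made-up numbers. These are hypothetical values, not the height and foot-length data from this example; one value is entered as 2.7 instead of 27 to mimic a misplaced decimal point.

```python
import statistics


def correlation(x, y):
    """Sample correlation: average product of standardised values, divisor n - 1."""
    n = len(x)
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    s_x, s_y = statistics.stdev(x), statistics.stdev(y)
    return sum(
        ((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y)
    ) / (n - 1)


# Hypothetical heights (inches) and foot lengths (cm), invented for illustration
heights = [60, 62, 64, 66, 68, 70, 72]
foot_correct = [22, 23, 24, 25, 26, 27, 28]
foot_with_typo = [22, 23, 24, 25, 26, 2.7, 28]  # 27 mistyped as 2.7

r_correct = correlation(heights, foot_correct)
r_with_typo = correlation(heights, foot_with_typo)
print(round(r_correct, 2), round(r_with_typo, 2))
```

In a data set this small one bad point is enough to change the sign of r, which is why a scatterplot should be checked for outliers before trusting the summary.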

Page 38: 161.120 Introductory Statistics  Week 3 Lecture slides

Example 5.10 Earthquakes in US

Correlation

all data: r = 0.73

w/o SF: r = –0.96

San Francisco earthquake of 1906.

Page 39: 161.120 Introductory Statistics  Week 3 Lecture slides

Example 5.11 Height and Lead Feet

Scatterplot of all data: College student heights and responses to the question “What is the fastest you have ever driven a car?”

Scatterplot by gender: Combining two groups led to illegitimate correlation.

Page 40: 161.120 Introductory Statistics  Week 3 Lecture slides

Example 5.12 Don’t Predict without a Plot

Correlation: r = 0.96

Regression Line: population = –2218 + 1.218(Year)

Poor Prediction for Year 2005 = –2218 + 1.218(2005), about 224 million, which is less than the 1990 population.

Population of US (in millions) for each census year between 1790 and 1990.
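The extrapolated figure quoted for 2005 can be reproduced with a short sketch. The function name is made up for illustration; the coefficients are those of the regression line given on this slide.

```python
def predicted_population(year):
    """Predicted US population in millions from the slide's line: -2218 + 1.218 * year."""
    return -2218 + 1.218 * year


# 2005 lies outside the 1790 to 1990 range of the census data, so this is an extrapolation.
prediction_2005 = predicted_population(2005)
print(round(prediction_2005, 1))  # about 224 (million), the figure quoted on the slide
```

A correlation as high as 0.96 does not make the straight line trustworthy here; the slide's point is that a plot would reveal the curvature the line ignores.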

Page 41: 161.120 Introductory Statistics  Week 3 Lecture slides

Multivariate Data

• Problem: How to display relationship between more than two variables?

• An array of scatterplots (matrix plot) of all pairs of variables is often informative

– especially if the scatterplots are dynamically linked (brushing).