# 161.120 Introductory Statistics Week 3 Lecture Slides

Post on 04-Jan-2016



Exploring Bivariate Data

- Scatterplots: Text section 5.1; CAST sections 3.1 and 3.2
- Least Squares & Nonlinear relationships: Text section 5.2; CAST sections 3.3 and 3.4
- Correlation: Text sections 5.3 and 5.4; CAST section 3.5
- Multivariate Data: CAST section 3.6

Univariate Data

- Single measurement from each individual.
- Cannot associate variation in that measurement with other characteristics of the individuals, so all variation is unexplained.

Bivariate Data

- Two measurements from each individual.
- May be able to associate variation in one measurement with changes in the other measurement, so we can explain some of the variation.
- Examples: blood pressure and weight of males in their 50s; carbohydrate content and moisture content of corn.

Our aim with such data is to find information about the relationship between the variables.

Three Tools we will use

- Scatterplot: a two-dimensional graph of data values.
- Correlation: a statistic that measures the strength and direction of a linear relationship.
- Regression equation: an equation that describes the average relationship between a response and an explanatory variable.

Scatterplots

The relationship between two variables cannot be determined from examination of the two variables in isolation.

5.1 Looking for Patterns with Scatterplots

Questions to Ask about a Scatterplot

- What is the average pattern? Does it look like a straight line or is it curved?
- What is the direction of the pattern?
- How much do individual points vary from the average pattern?
- Are there any unusual data points?

Positive/Negative Association

- Two variables have a positive association when the values of one variable tend to increase as the values of the other variable increase.
- Two variables have a negative association when the values of one variable tend to decrease as the values of the other variable increase.

Example 5.1 Height and Handspan

Data shown are the first 12 observations of a data set that includes the heights (in inches) and fully stretched handspans (in centimeters) of 167 college students.

Taller people tend to have greater handspan measurements than shorter people do. When two variables tend to increase together, we say that they have a positive association. The handspan and height measurements may have a linear relationship.

Example 5.2 Driver Age and Maximum Legibility Distance of Highway Signs

A research firm determined the maximum distance at which each of 30 drivers could read a newly designed sign. The 30 participants in the study ranged in age from 18 to 82 years old. We want to examine the relationship between age and the sign legibility distance.

We see a negative association with a linear pattern. We will use a straight-line equation to model this relationship.

Example 5.3 The Development of Musical Preferences

The 108 participants in the study ranged in age from 16 to 86 years old. We want to examine the relationship between song-specific age (age in the year the song was popular) and musical preference (positive score => above average, negative score => below average).

Popular music preferences are acquired in late adolescence and early adulthood. The association is nonlinear.

Groups and Outliers

Use different plotting symbols or colors to represent different subgroups.

Look for outliers: points that have an unusual combination of data values.

In both univariate and bivariate data sets, outliers or clusters must be very distinct before we should conclude that they are real, in the absence of further external information confirming that the individuals are distinct.

Particularly in small data sets, outliers, clusters and other patterns may arise by chance, without being associated with any real features in the individuals.

Be careful not to overinterpret features in a scatterplot unless they are well defined, especially if the sample size is small.
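The warning about small samples can be checked with a short simulation. The sketch below is an illustration only, not part of the original slides: it generates x and y independently (so there is no real relationship), uses the correlation r as a convenient summary of how strong a pattern looks, and counts how often |r| exceeds 0.5. The function names, the cutoff of 0.5, the sample sizes and the seed are all assumptions chosen for the demonstration.

```python
import math
import random


def correlation(xs, ys):
    """Pearson correlation r of two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def share_of_strong_chance_patterns(n, trials=2000, cutoff=0.5, seed=1):
    """Fraction of simulated samples of size n with |r| above cutoff.

    x and y are generated independently, so any apparent
    association in a sample arises purely by chance.
    """
    rng = random.Random(seed)
    strong = 0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        if abs(correlation(xs, ys)) > cutoff:
            strong += 1
    return strong / trials


small = share_of_strong_chance_patterns(n=5)
large = share_of_strong_chance_patterns(n=100)
print(f"n = 5:   {small:.1%} of samples have |r| > 0.5")
print(f"n = 100: {large:.1%} of samples have |r| > 0.5")
```

With only five points, a substantial share of purely random samples shows an apparently strong pattern, while with 100 points this essentially never happens.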

5.2 Describing Linear Patterns with a Regression Line

Two purposes of the regression line:

- to estimate the average value of y at any specified value of x
- to predict the value of y for an individual, given that individual's x value

When the best equation for describing the relationship between x and y is a straight line, the equation is called the regression line.

Example 5.1 Height and Handspan (cont)

Regression equation: Handspan = -3 + 0.35 Height

Estimate the average handspan for people 60 inches tall: Average handspan = -3 + 0.35(60) = 18 cm.

Predict the handspan for someone who is 60 inches tall: Predicted handspan = -3 + 0.35(60) = 18 cm.
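The arithmetic in this example can be reproduced with a few lines of Python. This is a minimal sketch: the function name `predict_handspan_cm` is made up for illustration, and the intercept and slope are the values quoted on the slide.

```python
def predict_handspan_cm(height_in, intercept=-3.0, slope=0.35):
    """Evaluate the slide's regression equation Handspan = -3 + 0.35 Height.

    height_in is a height in inches; the result is a handspan in cm.
    """
    return intercept + slope * height_in


# The same number serves both purposes of the regression line:
# the estimated average handspan of all people 60 inches tall, and
# the predicted handspan of one individual who is 60 inches tall.
handspan_at_60 = predict_handspan_cm(60)
print(round(handspan_at_60, 2))
```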

Example 5.1 Height and Handspan (cont)

Regression equation: Handspan = -3 + 0.35 Height

- Slope = 0.35 => Handspan increases by 0.35 cm, on average, for each increase of 1 inch in height.
- In a statistical relationship, there is variation from the average pattern.

The Equation for the Regression Line

$\hat{y} = b_0 + b_1 x$

- $\hat{y}$ is spoken as "y-hat", and it is also referred to either as predicted y or estimated y.
- $b_0$ is the intercept of the straight line. The intercept is the value of $\hat{y}$ when x = 0.
- $b_1$ is the slope of the straight line. The slope tells us how much of an increase (or decrease) there is for the y variable when the x variable increases by one unit. The sign of the slope tells us whether y increases or decreases when x increases.
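The interpretations of the intercept and slope follow directly from the straight-line form $\hat{y} = b_0 + b_1 x$. As a short worked derivation, writing $\hat{y}(x)$ for the predicted value at $x$ (a notation introduced here only for this illustration):

```latex
\begin{aligned}
\hat{y}(0) &= b_0 + b_1 \cdot 0 = b_0 \\
\hat{y}(x+1) - \hat{y}(x) &= \bigl(b_0 + b_1 (x+1)\bigr) - \bigl(b_0 + b_1 x\bigr) = b_1
\end{aligned}
```

So a one-unit increase in x changes the predicted value by exactly $b_1$, whatever the starting value of x.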

Example 5.2 Driver Age and Maximum Legibility Distance of Highway Signs (cont)

Regression equation: Distance = 577 - 3 Age

Estimate the average distance for 20-year-old drivers: Average distance = 577 - 3(20) = 517 ft.

Predict the legibility distance for a 20-year-old driver: Predicted distance = 577 - 3(20) = 517 ft.

The slope of -3 tells us that, on average, the legibility distance decreases by 3 feet when age increases by one year.

Extrapolation

It is usually a bad idea to use a regression equation to predict values far outside the range where the original data fell.

There is no guarantee that the relationship will continue beyond the range for which we have observed data.
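One way to make this caution concrete is to have a prediction function warn whenever it is asked to extrapolate. The sketch below is hypothetical code, not something from the slides: it uses the sign-legibility equation Distance = 577 - 3 Age and the age range of 18 to 82 reported for that study, and the function name and warning text are assumptions.

```python
import warnings

# Ages observed in the sign-legibility study (from the slides): 18 to 82.
OBSERVED_AGE_RANGE = (18, 82)


def predict_distance_ft(age, observed_range=OBSERVED_AGE_RANGE):
    """Predict legibility distance with Distance = 577 - 3 Age.

    Emits a warning when age is outside the observed range,
    because the prediction is then an extrapolation.
    """
    low, high = observed_range
    if not low <= age <= high:
        warnings.warn(
            f"age {age} is outside the observed range {low}-{high}; "
            "the prediction is an extrapolation",
            stacklevel=2,
        )
    return 577 - 3 * age


print(predict_distance_ft(20))   # inside the observed range
print(predict_distance_ft(200))  # far outside: the line gives a negative distance
```

The second call shows why extrapolation is risky: the equation happily returns a negative distance, which is impossible.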

Prediction Errors and Residuals

Prediction error = difference between the observed value of y and the predicted value $\hat{y}$.

Residual = $y - \hat{y}$

Example 5.2 Driver Age and Maximum Legibility Distance of Highway Signs (cont)

Regression equation: $\hat{y} = 577 - 3x$

- Can compute the residual for all 30 observations.
- Positive residual => observed value higher than predicted.
- Negative residual => observed value lower than predicted.
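A residual is simply the observed value minus the predicted value. The sketch below illustrates the calculation for the sign-legibility equation (predicted distance = 577 - 3 × age). The three (age, distance) pairs are invented for illustration and are not the study's data.

```python
def predicted_distance_ft(age):
    """Predicted legibility distance from the slide's equation 577 - 3 * age."""
    return 577 - 3 * age


# Hypothetical (age, observed distance in ft) pairs, NOT the study's data.
observations = [(20, 530), (45, 430), (70, 367)]

residuals = []
for age, observed in observations:
    residual = observed - predicted_distance_ft(age)  # observed minus predicted
    residuals.append(residual)
    if residual > 0:
        note = "observed higher than predicted"
    elif residual < 0:
        note = "observed lower than predicted"
    else:
        note = "observed equals predicted"
    print(f"age {age}: residual = {residual:+d} ft ({note})")
```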

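Least Squares Line and Formulas

- Least Squares Regression Line: minimizes the sum of squared prediction errors.
- SSE = sum of squared prediction errors = $\sum (y_i - \hat{y}_i)^2$.

Formulas for Slope and Intercept:

$b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$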
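The least squares line can be computed directly from the standard closed-form solution: the slope is the sum of products of deviations from the means divided by the sum of squared x deviations, and the intercept makes the line pass through the point of means. The following is a minimal sketch with made-up data; the function names and the data are assumptions for illustration.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx             # slope
    b0 = y_bar - b1 * x_bar    # intercept: line passes through (x_bar, y_bar)
    return b0, b1


def sse(xs, ys, b0, b1):
    """Sum of squared prediction errors for the line b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, chosen only to keep the arithmetic easy to follow.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(xs, ys)
best_sse = sse(xs, ys, b0, b1)
print(f"y-hat = {b0:.2f} + {b1:.2f} x, SSE = {best_sse:.2f}")

# Any other line has a larger SSE, e.g. one with a shifted intercept.
print(f"SSE with intercept shifted by 0.1: {sse(xs, ys, b0 + 0.1, b1):.2f}")
```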

Linear model

- Only appropriate when the cloud of crosses in a scatterplot of the data is regularly spread around a straight line.
- If the crosses are scattered around a curve, the relationship is called nonlinear and other models must be used.
- Outliers should be investigated.

Detecting problems with the model

- Plot residuals against X to look for problems in the model.
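To see what a problem looks like in the residuals, the sketch below fits a straight line to deliberately curved data (y = x², invented for this illustration) and lists the residuals in order of x. Instead of drawing a plot it prints the values, which is enough to reveal the pattern; the helper name `fit_line` is an assumption.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b1 * x_bar, b1


# Hypothetical curved data: y is exactly x squared.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

for x, r in zip(xs, residuals):
    print(f"x = {x}: residual = {r:+.1f}")
```

The residuals are positive at both ends and negative in the middle. A systematic pattern like this, rather than a random scatter around zero, suggests that a straight line is the wrong model.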

Nonlinear Relationships

If the relationship between Y and X is nonlinear, a linear model will give poor predictions and must be avoided.

Transformation of one or both variables

- It is often possible to linearise the relationship and therefore use least squares to fit a linear model to the transformed variables.
- For many data sets, a logarithmic transformation works, but a more general power transformation is sometimes needed to linearise the relationship.

Adding a quadratic term

- An alternative solution to the problem of curvature is to extend the simple linear model with the addition of a quadratic term.
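As an illustration of the transformation approach, the sketch below takes data that grow exponentially (invented for this example as y = 2·e^(0.5x)), applies a logarithmic transformation to y, and fits a straight line to (x, ln y) by least squares. The data, the helper name `fit_line` and the choice to transform only y are assumptions; real data would not lie exactly on the curve.

```python
import math


def fit_line(xs, ys):
    """Least squares intercept and slope for y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b1 * x_bar, b1


# Hypothetical data following an exponential curve: y = 2 * exp(0.5 * x).
xs = [0, 1, 2, 3, 4, 5]
ys = [2 * math.exp(0.5 * x) for x in xs]

# On the log scale the relationship is a straight line:
# ln(y) = ln(2) + 0.5 * x.
log_ys = [math.log(y) for y in ys]
b0, b1 = fit_line(xs, log_ys)
print(f"ln(y)-hat = {b0:.3f} + {b1:.3f} x")


def predict_y(x):
    """Back-transform a prediction from the log scale to the original scale."""
    return math.exp(b0 + b1 * x)


print(f"predicted y at x = 2.5: {predict_y(2.5):.2f}")
```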

5.3 Measuring Strength and Direction with Correlation

- The strength of the relationship is determined by the closeness of the points to a straight line.
- The direction is determined by whether one variable generally increases or generally decreases when the other variable increases.
- Correlation r indicates the strength and the direction of a straight-line relationship.

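Interpretation of r and a Formula

- r is always between -1 and +1
- magnitude indicates the strength
- r = -1 or +1 indicates a perfect linear relationship
- sign indicates the direction
- r = 0 indicates a slope of 0, so knowing x does not change the predicted value of y

Formula for correlation (one standard form, where $s_x$ and $s_y$ are the sample standard deviations of x and y):

$r = \dfrac{1}{n-1} \sum \left(\dfrac{x_i - \bar{x}}{s_x}\right)\left(\dfrac{y_i - \bar{y}}{s_y}\right)$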
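The properties listed here can be checked numerically. The sketch below computes r as the average product of standardised x and y values (dividing by n - 1), which is one standard way of writing the correlation formula; the data sets and the function name are invented for illustration.

```python
import statistics


def correlation(xs, ys):
    """Correlation r as the sum of products of z-scores divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample SDs
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data sets.
xs = [1, 2, 3, 4, 5]
perfect_up = [10 + 2 * x for x in xs]     # exactly on a rising line
perfect_down = [10 - 2 * x for x in xs]   # exactly on a falling line
scattered = [2, 4, 5, 4, 5]               # rising, with variation

r_up = correlation(xs, perfect_up)
r_down = correlation(xs, perfect_down)
r_scattered = correlation(xs, scattered)

print(f"perfect rising line:  r = {r_up:+.2f}")
print(f"perfect falling line: r = {r_down:+.2f}")
print(f"scattered points:     r = {r_scattered:+.2f}")
```

The sign of r matches the direction of each pattern, and only the points that lie exactly on a line reach -1 or +1.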

Example 5.1 Height and Handspan (cont)

Regression equation: Handspan = -3 + 0.35 Height

Correlation r = +0.74 => a somewhat strong positive linear relationship.

Example 5.2 Driver Age and Maximum Legibility Distance of Highway Signs (cont)

Regression equation: Distance = 577 - 3 Age

Correlation r = -0.8 => a somewhat strong negative linear association.

Example 5.6 Left and Right Handspans

If you know the span of a person's right hand, can you accurately predict his/her left handspan?

Correlation r = +0.95 => a very strong positive linear relationship.

Example 5.7 Verbal SAT and GPA

Grade point averages (GPAs) and verbal SAT scores for a sample of 100 university students.

Correlation r = 0.485 => a moderately strong positive linear relationship.

Example 5.8 Age and Hours of TV Viewing

Relationship between age and hours of daily television viewing for 1913 survey respondents.

Correlation r = 0.12 => a weak connection.

Note: a few claimed to watch more than 20 hours/day!

Example 5.9 Hours of Sleep and Hours of Study

Relationship between reported hours of