1 Data Analysis Linear Regression Ernesto A. Diaz Department of Mathematics Redwood High School

Upload: nathaniel-floyd

Post on 18-Jan-2016


Page 1

Data Analysis

Linear Regression

Ernesto A. Diaz

Department of Mathematics

Redwood High School

Page 2

Let us pause for a few moments…

What are we working on in this chapter?

Page 3

Problem Statement

If we have a scatter plot that seems “linear”, can we find an equation that generates similar data? How accurate will it be?

Page 4

Regression

One important branch of inferential statistics, called regression analysis, is used to compare quantities or variables, to discover relationships that exist between them, and to formulate those relationships in useful ways.

Page 5

Linear Regression

Once a scatter diagram has been produced, we can draw a curve that best fits the pattern exhibited by the sample points.

The best-fitting curve for the sample points is called an estimated regression curve. If the points in the scatter diagram seem to lie approximately along a straight line, the relationship is assumed to be linear, and the line that best fits the data points is called the estimated regression line.

Page 6

Linear Regression

Linear regression is the process of determining the linear relationship between two variables.

If we assume that the best-fitting curve is a line, then the equation of that line will take the form

y = ax + b,

where a is the slope of the line and b is the y-coordinate of the y-intercept. To identify the estimated regression line, we must find the values of the “regression coefficients” a and b.

Page 7

Regression, 1st approach

Page 8

2nd Approach: Med-Med Line

Page 9

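How do we evaluate accuracy?

Root Mean Square Error (RMS)

$$\mathrm{RMS} = \sqrt{\frac{\sum \left(y - \hat{y}\right)^2}{n}}$$

Sum of Squares of Residuals (SSres)

$$SS_{\mathrm{RES}} = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2$$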
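The two accuracy measures on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names `ss_res` and `rms_error` and the three-point data set are assumptions chosen only to make the arithmetic easy to follow.

```python
import math


def ss_res(y_values, predicted):
    """Sum of squares of residuals: the sum of (y_i - y_hat_i)^2."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_values, predicted))


def rms_error(y_values, predicted):
    """Root mean square error: sqrt(SSres / n)."""
    return math.sqrt(ss_res(y_values, predicted) / len(y_values))


# Hypothetical observed values and the values some candidate line predicts.
observed = [2.0, 4.0, 7.0]
predicted = [1.0, 4.0, 5.0]

# Residuals are 1, 0 and 2, so SSres = 1 + 0 + 4 = 5.
print(ss_res(observed, predicted))
# RMS is the square root of 5/3.
print(rms_error(observed, predicted))
```

A smaller RMS (or SSres) means the candidate line sits closer to the data, which is how two competing lines for the same data can be compared.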

Page 10

3rd Approach: Least Squares

For each x-value in the data set, the corresponding y-value usually differs from the value it would have if the data point were exactly on the line. These differences are shown in the figure by vertical line segments. The most common procedure is to choose the line where the sum of the squares of all these differences is minimized. This is called the method of least squares, and the resulting line is called the least squares line.

Page 11

Linear Regression

Linear regression is the process of determining the linear relationship between two variables.

The line of best fit (regression line or the least squares line) is the line such that the sum of the squares of the vertical distances from the line to the data points (on a scatter diagram) is a minimum.

Page 12

Linear Regression Formulas

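The least squares line (regression line) that provides the best fit to the data points (x1, y1), (x2, y2), …, (xn, yn) has the equation

$$y = mx + b,$$

where

$$m = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2} \quad\text{and}\quad b = \frac{\sum y - m\sum x}{n}.$$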
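As an aside that is not in the original slides, here is a short sketch of where these formulas come from. It assumes the usual calculus argument: write the sum of squared vertical differences as a function S(m, b) and set both partial derivatives to zero.

$$S(m, b) = \sum_{i=1}^{n} \left(y_i - m x_i - b\right)^2$$

$$\frac{\partial S}{\partial m} = -2\sum x_i\left(y_i - m x_i - b\right) = 0 \;\Longrightarrow\; \sum xy = m\sum x^2 + b\sum x$$

$$\frac{\partial S}{\partial b} = -2\sum \left(y_i - m x_i - b\right) = 0 \;\Longrightarrow\; \sum y = m\sum x + nb$$

The second equation gives $b = \dfrac{\sum y - m\sum x}{n}$. Substituting this into the first equation and multiplying through by n gives

$$n\sum xy = m\left(n\sum x^2 - \left(\sum x\right)^2\right) + \left(\sum x\right)\left(\sum y\right),$$

which rearranges to the slope formula for m.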

Page 13

Med-Med vs. Least Squares

The Median-Median Line is sometimes called the resistant line because it is not strongly influenced by one or two “bad” data points.

The Least Squares Line uses every point in its calculation, so it is affected by outliers.

Page 14

Example 1: Regression

Suppose that we wish to get an idea of how the number of hours preparing for a final exam relates to the score on the exam. Data is collected and shown below.

Hours 1 2 3 4 5 6 7 8 9 10

Score 50 62 62 74 70 86 78 90 96 94

Page 15

Linear Regression

The first step in analyzing these data is to graph the results as shown in the scatter diagram on the next slide.

Page 16

Scatter Diagram

[Figure: scatter diagram of Exam Score versus Hours Studying]

Page 17

Linear Regression

If we let x denote hours studying and y denote exam score in the data of the previous slide and assume that the best-fitting curve is a line, then the equation of that line will take the form

y = mx + b,

where m is the slope of the line and b is the y-coordinate of the y-intercept. To identify the estimated regression line, we must find the values of the “regression coefficients” m and b.

Page 18

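Example 1: Computing a Least Squares Line

Solution

$$m = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2} = \frac{10(4592) - (55)(762)}{10(385) - (55)^2} \approx 4.86$$

$$b = \frac{\sum y - m\sum x}{n} = \frac{762 - (4.86)(55)}{10} \approx 49.47$$

The equation is y = 4.86x + 49.47.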
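The sums used in this solution can be checked with a short Python sketch. It uses the hours and scores from Example 1; the variable names are illustrative and not taken from the slides.

```python
# Data from Example 1: hours of preparation and exam scores.
hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
scores = [50, 62, 62, 74, 70, 86, 78, 90, 96, 94]

n = len(hours)
sum_x = sum(hours)                                   # 55
sum_y = sum(scores)                                  # 762
sum_xy = sum(x * y for x, y in zip(hours, scores))   # 4592
sum_x2 = sum(x * x for x in hours)                   # 385

# Least squares slope and intercept, following the slide's formulas.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / n

print(round(m, 2), round(b, 2))  # 4.86 49.47
```

Here b is computed from the unrounded slope; for this data set that gives the same two-decimal result as the slide, which uses the rounded value 4.86.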

Page 19

Estimated Regression Line

[Figure: scatter diagram of Exam Score versus Hours Studying with the estimated regression line drawn through the points]

Page 20

Example: Med-Med vs. Best Fit

Using Dobbie, find the estimated regression line using both methods.

Hours 1 2 3 4 5 6 7 8 9 10

Score 50 62 62 74 70 86 78 90 96 94
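Both lines for this data can be sketched in Python. The slides do not spell out the Med-Med procedure, so the version below is an assumption: it follows the common textbook construction (sort by x, split into three groups, take the median point of each group, use the outer two for the slope, and average the three intercepts). The function names are illustrative, and the grouping does not try to keep tied x-values in the same group.

```python
from statistics import median


def median_median_line(xs, ys):
    """Median-median (resistant) line, common three-group construction."""
    points = sorted(zip(xs, ys))
    n = len(points)
    if n < 3:
        raise ValueError("need at least three points")
    base, extra = divmod(n, 3)
    # One leftover point goes to the middle group, two go to the outer groups.
    outer = base + (1 if extra == 2 else 0)
    groups = [points[:outer], points[outer:n - outer], points[n - outer:]]
    # Median x and median y of each group.
    meds = [(median([p[0] for p in g]), median([p[1] for p in g])) for g in groups]
    (x1, y1), (x2, y2), (x3, y3) = meds
    m = (y3 - y1) / (x3 - x1)
    # Average the intercepts of the lines of slope m through the three median points.
    b = ((y1 - m * x1) + (y2 - m * x2) + (y3 - m * x3)) / 3
    return m, b


def least_squares_line(xs, ys):
    """Least squares slope and intercept from the sums in the slide formulas."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    m = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    return m, (sy - m * sx) / n


hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
scores = [50, 62, 62, 74, 70, 86, 78, 90, 96, 94]

mm_slope, mm_intercept = median_median_line(hours, scores)
ls_slope, ls_intercept = least_squares_line(hours, scores)
print(round(mm_slope, 2), round(mm_intercept, 2))  # 4.57 52.19
print(round(ls_slope, 2), round(ls_intercept, 2))  # 4.86 49.47

# Hypothetical "bad" point: replace the last score (94) with 20 and refit.
bad_scores = scores[:-1] + [20]
mm_shift = abs(median_median_line(hours, bad_scores)[0] - mm_slope)
ls_shift = abs(least_squares_line(hours, bad_scores)[0] - ls_slope)
print(mm_shift < ls_shift)  # True
```

The last comparison illustrates the resistance claim from the Med-Med vs. Least Squares slide: with one corrupted score, the Med-Med slope moves far less than the least squares slope.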

Page 21

Example 2: Predicting from a Regression Line

Use the result from the previous example to predict the exam score for a student that studied 6.5 hours.

I) Med-Med: Use the equation

$$\hat{y} = 4.57x + 52.19$$

and replace x with 6.5:

$$\hat{y} = 4.57(6.5) + 52.19 \approx 81.90$$

Based on the given data, the student should make about an 82%.

II) Best Fit: Use the equation

$$\hat{y} = 4.86x + 49.47$$

and replace x with 6.5:

$$\hat{y} = 4.86(6.5) + 49.47 \approx 81.06$$

Based on the given data, the student should make about an 81%.

Page 22

Copyright © 2005 Pearson Education, Inc.

13.8 Linear Correlation and Regression

Page 23

Linear Correlation

Linear correlation is used to determine whether there is a relationship between two quantities and, if so, how strong the relationship is.

The linear correlation coefficient, r, is a unitless measure that describes the strength of the linear relationship between two variables.

If the value is positive, as one variable increases, the other increases.

If the value is negative, as one variable increases, the other decreases.

The variable, r, will always be a value between –1 and 1 inclusive.

Page 24

Scatter Diagrams

A visual aid used with correlation is the scatter diagram, a plot of points (bivariate data).

The independent variable, x, generally is a quantity that can be controlled. The dependent variable, y, is the other variable.

The value of r is a measure of how far a set of points varies from a straight line. The greater the spread, the weaker the correlation and the closer the r value is to 0.

Page 25

Correlation

Page 26

Correlation

Page 27

Linear Correlation Coefficient

The formula to calculate the correlation coefficient (r) is as follows:

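$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$$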

Page 28

Example: Words Per Minute versus Mistakes

There are five applicants applying for a job as a medical transcriptionist. The following shows the results of the applicants when asked to type a chart. Determine the correlation coefficient between the words per minute typed and the number of mistakes.

Applicant   Words per Minute   Mistakes
Ellen       24                 8
George      67                 11
Phillip     53                 12
Kendra      41                 10
Nancy       34                 9

Page 29

Solution

We will call the words typed per minute, x, and the mistakes, y. List the values of x and y and calculate the necessary sums.

WPM (x)     Mistakes (y)   x²             y²          xy
24          8              576            64          192
67          11             4489           121         737
53          12             2809           144         636
41          10             1681           100         410
34          9              1156           81          306
Σx = 219    Σy = 50        Σx² = 10,711   Σy² = 510   Σxy = 2,281

Page 30

Solution continued

The n in the formula represents the number of pieces of data. Here n = 5.

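$$\begin{aligned}
r &= \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} \\
  &= \frac{5(2281) - (219)(50)}{\sqrt{5(10{,}711) - (219)^2}\,\sqrt{5(510) - (50)^2}} \\
  &= \frac{11{,}405 - 10{,}950}{\sqrt{5(10{,}711) - 47{,}961}\,\sqrt{5(510) - 2500}} \\
  &= \frac{455}{\sqrt{53{,}555 - 47{,}961}\,\sqrt{2550 - 2500}} \\
  &= \frac{455}{\sqrt{5594}\,\sqrt{50}} \approx 0.86
\end{aligned}$$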

Page 31

Solution continued

Since 0.86 is fairly close to 1, there is a fairly strong positive correlation.

This result implies that the more words typed per minute, the more mistakes made.
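The correlation coefficient from this example can be reproduced with a short Python sketch of the slide's formula. The function name `correlation_coefficient` is an assumption made for illustration; the data are the five applicants' words per minute and mistakes.

```python
import math


def correlation_coefficient(xs, ys):
    """Linear correlation coefficient r, computed from the sums in the slide formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when x or y has no variation")
    return numerator / denominator


# Ellen, George, Phillip, Kendra, Nancy
wpm = [24, 67, 53, 41, 34]
mistakes = [8, 11, 12, 10, 9]

r = correlation_coefficient(wpm, mistakes)
print(round(r, 2))  # 0.86
```

With only five applicants, the value describes this sample and should be read as a description of these data rather than a general rule.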

Page 32

Linear Regression

Linear regression is the process of determining the linear relationship between two variables.

The line of best fit (line of regression or the least squares line) is the line such that the sum of the squares of the vertical distances from the line to the data points is a minimum.

Page 33

The Line of Best Fit

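Equation:

$$y = mx + b,$$

where

$$m = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2} \quad\text{and}\quad b = \frac{\sum y - m\sum x}{n}.$$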

Page 34

Example

Use the data in the previous example to find the equation of the line that relates the number of words per minute and the number of mistakes made while typing a chart.

Graph the equation of the line of best fit on a scatter diagram that illustrates the set of bivariate points.

Page 35

Solution

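From the previous results, we know that

$$m = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2} = \frac{5(2{,}281) - (219)(50)}{5(10{,}711) - (219)^2} = \frac{455}{5594} \approx 0.081$$

Now we find the y-intercept, b.

$$b = \frac{\sum y - m\sum x}{n} = \frac{50 - 0.081(219)}{5} = \frac{32.261}{5} \approx 6.452$$

Therefore the line of best fit is y = 0.081x + 6.452.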

Page 36

Solution continued

To graph y = 0.081x + 6.452, plot at least two points and draw the graph.

x     y
10    7.26
20    8.07
30    8.88

Page 37

Solution continued