multivariate data
![Page 1: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/1.jpg)
Multivariate data
![Page 2: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/2.jpg)
Graphical Techniques
• The scatter plot
• The two dimensional Histogram
![Page 3: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/3.jpg)
Some Scatter Patterns
[Figure: scatter plots illustrating common scatter patterns]
Circular
• No relationship between X and Y
• Unable to predict Y from X
Ellipsoidal
• Positive relationship between X and Y
• Increases in X correspond to increases in Y (but not always)
• Major axis of the ellipse has positive slope
[Figure: scatter plots of ellipsoidal patterns]
![Page 4: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/4.jpg)
Ellipsoidal
• Negative relationship between X and Y
• Increases in X correspond to decreases in Y (but not always)
• Major axis of the ellipse has negative slope
[Figure: scatter plots of ellipsoidal patterns]
![Page 5: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/5.jpg)
Non-Linear Patterns
[Figure: two scatter plots showing non-linear patterns]
![Page 6: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/6.jpg)
Measures of strength of a relationship (Correlation)
• Pearson’s correlation coefficient (r)
• Spearman’s rank correlation coefficient (rho, ρ)
![Page 7: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/7.jpg)
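Pearson’s correlation coefficient is defined as below:

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$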
![Page 8: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/8.jpg)
where:

$$S_{xx} = \sum_{i=1}^{n}(x_i-\bar{x})^2$$

$$S_{yy} = \sum_{i=1}^{n}(y_i-\bar{y})^2$$

$$S_{xy} = \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$$
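As a minimal sketch, the definition of r can be coded directly from the three sums of squares and products. The function name `pearson_r` and the four sample points are illustrative assumptions, not part of the slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed directly from the definitions of Sxx, Syy and Sxy."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data (not from the slides): points on a line with positive slope.
r_line = pearson_r([1, 2, 3, 4], [3, 5, 7, 9])
print(r_line)  # 1.0
```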
![Page 9: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/9.jpg)
Properties of Pearson’s correlation coefficient r
1. The value of r is always between –1 and +1.
2. If the relationship between X and Y is positive, then r will be positive.
3. If the relationship between X and Y is negative, then r will be negative.
4. If there is no relationship between X and Y, then r will be zero.
5. The value of r will be +1 if the points (xi, yi) lie on a straight line with positive slope.
6. The value of r will be –1 if the points (xi, yi) lie on a straight line with negative slope.
![Page 10: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/10.jpg)
[Figure: scatter plot with r = 1]
![Page 11: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/11.jpg)
[Figure: scatter plot with r = 0.95]
![Page 12: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/12.jpg)
[Figure: scatter plot with r = 0.7]
![Page 13: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/13.jpg)
[Figure: scatter plot with r = 0.4]
![Page 14: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/14.jpg)
[Figure: scatter plot with r = 0]
![Page 15: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/15.jpg)
[Figure: scatter plot with r = -0.4]
![Page 16: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/16.jpg)
[Figure: scatter plot with r = -0.7]
![Page 17: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/17.jpg)
[Figure: scatter plot with r = -0.8]
![Page 18: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/18.jpg)
[Figure: scatter plot with r = -0.95]
![Page 19: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/19.jpg)
[Figure: scatter plot with r = -1]
![Page 20: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/20.jpg)
Computing formulae for the statistics:

$$S_{xx} = \sum_{i=1}^{n}(x_i-\bar{x})^2$$

$$S_{yy} = \sum_{i=1}^{n}(y_i-\bar{y})^2$$

$$S_{xy} = \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$$
![Page 21: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/21.jpg)
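$$S_{xx} = \sum_{i=1}^{n}(x_i-\bar{x})^2 = \sum_{i=1}^{n}x_i^2 - \frac{\left(\sum_{i=1}^{n}x_i\right)^2}{n}$$

$$S_{yy} = \sum_{i=1}^{n}(y_i-\bar{y})^2 = \sum_{i=1}^{n}y_i^2 - \frac{\left(\sum_{i=1}^{n}y_i\right)^2}{n}$$

$$S_{xy} = \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^{n}x_i y_i - \frac{\left(\sum_{i=1}^{n}x_i\right)\left(\sum_{i=1}^{n}y_i\right)}{n}$$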
![Page 22: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/22.jpg)
Spearman’s rank correlation coefficient (rho)
![Page 23: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/23.jpg)
Spearman’s rank correlation coefficient (rho)
Spearman’s rank correlation coefficient is computed as follows:
• Arrange the observations on X in increasing order and assign them the ranks 1, 2, 3, …, n.
• Arrange the observations on Y in increasing order and assign them the ranks 1, 2, 3, …, n.
• For any case (i) let (xi, yi) denote the observations on X and Y and let (ri, si) denote the ranks on X and Y.
![Page 24: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/24.jpg)
• If the variables X and Y are strongly positively correlated, the ranks on X should generally agree with the ranks on Y. (The largest X should be paired with the largest Y, the smallest X with the smallest Y.)
• If the variables X and Y are strongly negatively correlated, the ranks on X should be in the reverse order to the ranks on Y. (The largest X should be paired with the smallest Y, the smallest X with the largest Y.)
• If the variables X and Y are uncorrelated, the ranks on X should be randomly distributed with respect to the ranks on Y.
![Page 25: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/25.jpg)
Spearman’s rank correlation coefficient is defined as follows:
For each case let di = ri – si = difference in the two ranks.
Then Spearman’s rank correlation coefficient (ρ) is defined as follows:

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)}$$
![Page 26: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/26.jpg)
Properties of Spearman’s rank correlation coefficient ρ
1. The value of ρ is always between –1 and +1.
2. If the relationship between X and Y is positive, then ρ will be positive.
3. If the relationship between X and Y is negative, then ρ will be negative.
4. If there is no relationship between X and Y, then ρ will be zero.
5. The value of ρ will be +1 if the ranks of X completely agree with the ranks of Y.
6. The value of ρ will be –1 if the ranks of X are in reverse order to the ranks of Y.
![Page 27: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/27.jpg)
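Example

| Case i | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| xi | 25.0 | 33.9 | 16.7 | 37.4 | 24.6 | 17.3 | 40.2 |
| yi | 24.3 | 38.7 | 13.4 | 32.1 | 28.0 | 12.5 | 44.9 |

Ranking the X’s and the Y’s we get:

| Case i | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| ri | 4 | 5 | 1 | 6 | 3 | 2 | 7 |
| si | 3 | 6 | 2 | 5 | 4 | 1 | 7 |

Computing the differences in ranks gives us:

| Case i | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| di | 1 | -1 | -1 | 1 | -1 | 1 | 0 |

$$\sum_{i=1}^{n} d_i^2 = 6$$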
![Page 28: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/28.jpg)
$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)} = 1 - \frac{6(6)}{7(7^2-1)} = 1 - \frac{36}{7(48)} = 1 - \frac{3}{28} = \frac{25}{28} = 0.893$$
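The ranking and the ρ calculation for this example can be reproduced with a short sketch. The helper names `ranks` and `spearman_rho` are illustrative assumptions, and the ranking helper assumes no tied values, which holds for this data set.

```python
def ranks(values):
    """Ranks 1..n in increasing order; assumes no ties, as in this example."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman's rho from the rank differences d_i = r_i - s_i."""
    n = len(xs)
    d_sq = sum((r - s) ** 2 for r, s in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_sq / (n * (n ** 2 - 1))

x = [25.0, 33.9, 16.7, 37.4, 24.6, 17.3, 40.2]
y = [24.3, 38.7, 13.4, 32.1, 28.0, 12.5, 44.9]
rho = spearman_rho(x, y)
print(ranks(x), ranks(y))
print(round(rho, 3))  # 0.893
```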
![Page 29: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/29.jpg)
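Computing Pearson’s correlation coefficient, r, for the same problem:

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$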
![Page 30: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/30.jpg)
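$$S_{xx} = \sum_{i=1}^{n}(x_i-\bar{x})^2 = \sum_{i=1}^{n}x_i^2 - \frac{\left(\sum_{i=1}^{n}x_i\right)^2}{n}$$

$$S_{yy} = \sum_{i=1}^{n}(y_i-\bar{y})^2 = \sum_{i=1}^{n}y_i^2 - \frac{\left(\sum_{i=1}^{n}y_i\right)^2}{n}$$

$$S_{xy} = \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^{n}x_i y_i - \frac{\left(\sum_{i=1}^{n}x_i\right)\left(\sum_{i=1}^{n}y_i\right)}{n}$$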
![Page 31: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/31.jpg)
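To compute $S_{xx}$, $S_{yy}$ and $S_{xy}$, first compute

$$A = \sum_{i=1}^{n} x_i = 195.1, \qquad B = \sum_{i=1}^{n} y_i = 193.9$$

$$C = \sum_{i=1}^{n} x_i^2 = 5972.35, \qquad D = \sum_{i=1}^{n} y_i^2 = 6254.41, \qquad E = \sum_{i=1}^{n} x_i y_i = 6053.78$$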
![Page 32: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/32.jpg)
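Then

$$S_{xx} = C - \frac{A^2}{n} = 5972.35 - \frac{(195.1)^2}{7} = 534.63$$

$$S_{yy} = D - \frac{B^2}{n} = 6254.41 - \frac{(193.9)^2}{7} = 883.38$$

$$S_{xy} = E - \frac{AB}{n} = 6053.78 - \frac{(195.1)(193.9)}{7} = 649.51$$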
![Page 33: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/33.jpg)
and

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{649.51}{\sqrt{(534.63)(883.38)}} = 0.945$$

Compare with $\rho = 0.893$.
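The same arithmetic can be checked with a short sketch that uses the computing formulae on the seven (xi, yi) pairs from the example. The variable names follow the slide's A to E labels; everything else is an illustrative choice.

```python
import math

x = [25.0, 33.9, 16.7, 37.4, 24.6, 17.3, 40.2]
y = [24.3, 38.7, 13.4, 32.1, 28.0, 12.5, 44.9]
n = len(x)

# The five basic sums, labelled as on the slide.
A = sum(x)
B = sum(y)
C = sum(v * v for v in x)
D = sum(v * v for v in y)
E = sum(u * v for u, v in zip(x, y))

# Computing formulae for the sums of squares and products.
s_xx = C - A ** 2 / n
s_yy = D - B ** 2 / n
s_xy = E - A * B / n

r = s_xy / math.sqrt(s_xx * s_yy)
print(round(s_xx, 2), round(s_yy, 2), round(s_xy, 2), round(r, 3))
```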
![Page 34: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/34.jpg)
Comments: Spearman’s rank correlation coefficient ρ and Pearson’s correlation coefficient r

1. The value of ρ can also be computed from:

$$\rho = \frac{\sum_{i=1}^{n}(r_i-\bar{r})(s_i-\bar{s})}{\sqrt{\sum_{i=1}^{n}(r_i-\bar{r})^2\,\sum_{i=1}^{n}(s_i-\bar{s})^2}}$$

2. Spearman’s ρ is Pearson’s r computed from the ranks.
![Page 35: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/35.jpg)
3. Spearman’s ρ is less sensitive to extreme observations (outliers).
4. The value of Pearson’s r is much more sensitive to extreme outliers.
This is similar to the comparison between the median and the mean, and between the pseudo-standard deviation and the standard deviation. The mean and standard deviation are more sensitive to outliers than the median and pseudo-standard deviation.
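A quick sketch can confirm the comment that Spearman’s ρ is Pearson’s r computed from the ranks. It uses the ranks (ri, si) from the worked example; the variable names are illustrative, and the equality of the two formulas relies on there being no tied ranks.

```python
import math

# Ranks from the worked example (no ties).
r = [4, 5, 1, 6, 3, 2, 7]
s = [3, 6, 2, 5, 4, 1, 7]
n = len(r)

# Spearman's rho from the rank differences.
rho_from_d = 1 - 6 * sum((a - b) ** 2 for a, b in zip(r, s)) / (n * (n ** 2 - 1))

# Pearson's formula applied to the ranks themselves.
r_bar = sum(r) / n
s_bar = sum(s) / n
num = sum((a - r_bar) * (b - s_bar) for a, b in zip(r, s))
den = math.sqrt(sum((a - r_bar) ** 2 for a in r) * sum((b - s_bar) ** 2 for b in s))
rho_from_ranks = num / den

print(round(rho_from_d, 3), round(rho_from_ranks, 3))  # 0.893 0.893
```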
![Page 36: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/36.jpg)
Simple Linear Regression
Fitting straight lines to data
![Page 37: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/37.jpg)
The Least Squares Line The Regression Line
• When data is correlated it falls roughly about a straight line.
[Figure: scatter plot of correlated data falling roughly about a straight line]
![Page 38: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/38.jpg)
In this situation one wants to:
• Find the equation of the straight line through the data that yields the best fit.

The equation of any straight line is of the form:

Y = a + bX

b = the slope of the line
a = the intercept of the line
![Page 39: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/39.jpg)
[Figure: straight line with intercept a, Run = x2 − x1 and Rise = y2 − y1]

$$b = \frac{\text{Rise}}{\text{Run}} = \frac{y_2-y_1}{x_2-x_1}$$
![Page 40: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/40.jpg)
• a is the value of Y when X is zero
• b is the rate that Y increases per unit increase in X.
• For a straight line this rate is constant.
• For non linear curves the rate that Y increases per unit increase in X varies with X.
![Page 41: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/41.jpg)
Linear
![Page 42: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/42.jpg)
Non-linear
![Page 43: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/43.jpg)
Example: In the following example both blood pressure and age were measured for each female subject. Subjects were grouped into age classes and the median blood pressure measurement was computed for each age class. The data are summarized below:

| Age Class | 30-40 | 40-50 | 50-60 | 60-70 | 70-80 |
|---|---|---|---|---|---|
| Midpoint Age (X) | 35 | 45 | 55 | 65 | 75 |
| Median BP (Y) | 114 | 124 | 143 | 158 | 166 |
![Page 44: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/44.jpg)
Graph:
[Figure: median blood pressure (Y) plotted against midpoint age (X) with the line Y = 65.1 + 1.38 X]
![Page 45: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/45.jpg)
Interpretation of the slope and intercept
1. Intercept – value of Y at X = 0.
– Predicted blood pressure of a newborn (65.1).
– This interpretation remains valid only if linearity is true down to X = 0.

2. Slope – rate of increase in Y per unit increase in X.
– Blood pressure increases 1.38 units each year.
![Page 46: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/46.jpg)
The Least Squares Line
Fitting the best straight line to “linear” data
![Page 47: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/47.jpg)
Reasons for fitting a straight line to data
1. It provides a precise description of the relationship between Y and X.
2. The interpretation of the parameters of the line (slope and intercept) leads to an improved understanding of the phenomenon that is under study.
3. The equation of the line is useful for prediction of the dependent variable (Y) from the independent variable (X).
![Page 48: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/48.jpg)
Assume that we have collected data on two variables X and Y. Let
(x1, y1) (x2, y2) (x3, y3) … (xn, yn)
denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population).
![Page 49: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/49.jpg)
Let

Y = a + bX

denote an arbitrary equation of a straight line, where a and b are known values. This equation can be used to predict, for each value of X, the value of Y.

For example, if X = xi (as for the ith case) then the predicted value of Y is:

$$\hat{y}_i = a + b x_i$$
![Page 50: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/50.jpg)
For example, if

Y = a + bX = 25.2 + 2.0 X

is the equation of the straight line, and if X = xi = 20 (for the ith case), then the predicted value of Y is:

$$\hat{y}_i = a + b x_i = 25.2 + 2.0(20) = 65.2$$
![Page 51: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/51.jpg)
If the actual value of Y is yi = 70.0 for case i, then the difference

$$y_i - \hat{y}_i = 70 - 65.2 = 4.8$$

is the error in the prediction for case i.

$$r_i = y_i - \hat{y}_i = y_i - a - b x_i$$

is also called the residual for case i.
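The prediction and residual for this single case can be checked in a few lines. The numbers are those given on the slides; the variable names are illustrative.

```python
# Line Y = a + bX with the values used on the slide.
a, b = 25.2, 2.0

# Case i: observed x and y.
x_i, y_i = 20.0, 70.0

y_hat = a + b * x_i        # predicted value of Y for case i
residual = y_i - y_hat     # error in the prediction for case i

print(round(y_hat, 1), round(residual, 1))  # 65.2 4.8
```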
![Page 52: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/52.jpg)
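If the residual

$$r_i = y_i - \hat{y}_i = y_i - a - b x_i$$

can be computed for each case in the sample,

$$r_1 = y_1 - \hat{y}_1,\quad r_2 = y_2 - \hat{y}_2,\quad \ldots,\quad r_n = y_n - \hat{y}_n,$$

then the residual sum of squares (RSS)

$$RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - a - b x_i)^2$$

is a measure of the “goodness of fit” of the line Y = a + bX to the data.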
![Page 53: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/53.jpg)
[Figure: the line Y = a + bX plotted on X and Y axes, with points (x1, y1), (x2, y2), (x3, y3), (x4, y4) and their residuals r1, r2, r3, r4]
![Page 54: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/54.jpg)
The optimal choice of a and b will result in the residual sum of squares

$$RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - a - b x_i)^2$$

attaining a minimum.

If this is the case then the line

Y = a + bX

is called the Least Squares Line.
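As a rough illustration of comparing candidate lines by their RSS, the sketch below uses the blood pressure example from the slides (midpoint age X, median BP Y). The function name `rss` and the alternative line Y = 60 + 1.5X are hypothetical choices made only for the comparison.

```python
def rss(a, b, xs, ys):
    """Residual sum of squares of the line Y = a + bX for the data (xs, ys)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Blood pressure example from the slides.
age = [35, 45, 55, 65, 75]
bp = [114, 124, 143, 158, 166]

rss_fitted = rss(65.1, 1.38, age, bp)  # line reported in the slides
rss_other = rss(60.0, 1.5, age, bp)    # hypothetical alternative line

print(rss_fitted < rss_other)  # True
```

The smaller RSS for Y = 65.1 + 1.38X is consistent with it being the better-fitting of the two lines.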
![Page 55: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/55.jpg)
[Figure: data with the line Y = 10 + (0.5)X; R.S.S = 3389.9]
![Page 56: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/56.jpg)
[Figure: data with the line Y = 15 + (0.5)X; R.S.S = 1861.9]
![Page 57: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/57.jpg)
[Figure: data with the line Y = 20 + (0.5)X; R.S.S = 833.9]
![Page 58: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/58.jpg)
[Figure: data with the line Y = 20 + (1)X; R.S.S = 883.1]
![Page 59: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/59.jpg)
[Figure: data with the line Y = 20 + (0.7)X; R.S.S = 303.98]
![Page 60: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/60.jpg)
[Figure: data with the line Y = 26.46 + (0.55)X; R.S.S = 225.74]
![Page 61: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/61.jpg)
The equation for the least squares line
Let

$$S_{xx} = \sum_{i=1}^{n}(x_i-\bar{x})^2$$

$$S_{yy} = \sum_{i=1}^{n}(y_i-\bar{y})^2$$

$$S_{xy} = \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$$
![Page 62: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/62.jpg)
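Computing Formulae:

$$S_{xx} = \sum_{i=1}^{n}(x_i-\bar{x})^2 = \sum_{i=1}^{n}x_i^2 - \frac{\left(\sum_{i=1}^{n}x_i\right)^2}{n}$$

$$S_{yy} = \sum_{i=1}^{n}(y_i-\bar{y})^2 = \sum_{i=1}^{n}y_i^2 - \frac{\left(\sum_{i=1}^{n}y_i\right)^2}{n}$$

$$S_{xy} = \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^{n}x_i y_i - \frac{\left(\sum_{i=1}^{n}x_i\right)\left(\sum_{i=1}^{n}y_i\right)}{n}$$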
![Page 63: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/63.jpg)
Then the slope of the least squares line can be shown to be:

$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$$
![Page 64: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/64.jpg)
and the intercept of the least squares line can be shown to be:

$$a = \bar{y} - b\,\bar{x} = \bar{y} - \frac{S_{xy}}{S_{xx}}\,\bar{x}$$
![Page 65: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/65.jpg)
The following data show the per capita consumption of cigarettes per month (X) in various countries in 1930, and the death rates from lung cancer for men in 1950.

TABLE: Per capita consumption of cigarettes per month (Xi) in n = 11 countries in 1930, and the death rates, Yi (per 100,000), from lung cancer for men in 1950.

| Country (i) | Xi | Yi |
|---|---|---|
| Australia | 48 | 18 |
| Canada | 50 | 15 |
| Denmark | 38 | 17 |
| Finland | 110 | 35 |
| Great Britain | 110 | 46 |
| Holland | 49 | 24 |
| Iceland | 23 | 6 |
| Norway | 25 | 9 |
| Sweden | 30 | 11 |
| Switzerland | 51 | 25 |
| USA | 130 | 20 |
![Page 66: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/66.jpg)
[Figure: scatter plot of death rates from lung cancer (1950) against per capita consumption of cigarettes, with points labelled by country: Iceland, Norway, Sweden, Denmark, Canada, Australia, Holland, Switzerland, Great Britain, Finland, USA]
![Page 67: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/67.jpg)
![Page 68: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/68.jpg)
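Fitting the Least Squares Line

$$\sum_{i=1}^{n} x_i = 664, \qquad \sum_{i=1}^{n} y_i = 226$$

$$\sum_{i=1}^{n} x_i^2 = 54{,}404, \qquad \sum_{i=1}^{n} y_i^2 = 6{,}018, \qquad \sum_{i=1}^{n} x_i y_i = 16{,}914$$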
![Page 69: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/69.jpg)
Fitting the Least Squares Line

First compute the following three quantities:

$$S_{xx} = 54404 - \frac{(664)^2}{11} = 14322.55$$

$$S_{yy} = 6018 - \frac{(226)^2}{11} = 1374.73$$

$$S_{xy} = 16914 - \frac{(664)(226)}{11} = 3271.82$$
![Page 70: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/70.jpg)
Computing Estimate of Slope and Intercept

$$b = \frac{S_{xy}}{S_{xx}} = \frac{3271.82}{14322.55} = 0.228$$

$$a = \bar{y} - b\,\bar{x} = \frac{226}{11} - (0.228)\left(\frac{664}{11}\right) = 6.756$$
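The slope and intercept for the cigarette data can be recomputed from the table with a short sketch using the computing formulae. The variable names are illustrative; the data are the 11 (Xi, Yi) pairs from the table, and the result agrees with the fitted line Y = 6.756 + (0.228)X shown on the plot.

```python
# Per capita cigarette consumption (1930) and lung cancer death rate (1950),
# in the table's country order (Australia ... USA).
x = [48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130]
y = [18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20]
n = len(x)

s_xx = sum(v * v for v in x) - sum(x) ** 2 / n
s_xy = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n

b = s_xy / s_xx                      # slope
a = sum(y) / n - b * sum(x) / n      # intercept

print(round(b, 3), round(a, 3))  # 0.228 6.756
```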
![Page 71: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/71.jpg)
[Figure: scatter plot of death rates from lung cancer (1950) against per capita consumption of cigarettes, with points labelled by country and the fitted line Y = 6.756 + (0.228)X]
![Page 72: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/72.jpg)
Interpretation of the slope and intercept
1. Intercept – value of Y at X = 0.
– Predicted death rate from lung cancer (6.756) for men in 1950 in countries with no smoking in 1930 (X = 0).

2. Slope – rate of increase in Y per unit increase in X.
– Death rate from lung cancer for men in 1950 increases 0.228 units for each increase of 1 cigarette per capita consumption in 1930.
![Page 73: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/73.jpg)
Example: In the following example both blood pressure and age were measured for each female subject. Subjects were grouped into age classes and the median blood pressure measurement was computed for each age class. The data are summarized below:

| Age Class | 30-40 | 40-50 | 50-60 | 60-70 | 70-80 |
|---|---|---|---|---|---|
| Midpoint Age (X) | 35 | 45 | 55 | 65 | 75 |
| Median BP (Y) | 114 | 124 | 143 | 158 | 166 |
![Page 74: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/74.jpg)
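Fitting the Least Squares Line

$$\sum_{i=1}^{n} x_i = 275, \qquad \sum_{i=1}^{n} y_i = 705$$

$$\sum_{i=1}^{n} x_i^2 = 16{,}125, \qquad \sum_{i=1}^{n} y_i^2 = 101{,}341, \qquad \sum_{i=1}^{n} x_i y_i = 40{,}155$$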
![Page 75: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/75.jpg)
Fitting the Least Squares Line

First compute the following three quantities:

$$S_{xx} = 16125 - \frac{(275)^2}{5} = 1000$$

$$S_{yy} = 101341 - \frac{(705)^2}{5} = 1936$$

$$S_{xy} = 40155 - \frac{(275)(705)}{5} = 1380$$
![Page 76: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/76.jpg)
Computing Estimate of Slope and Intercept

$$b = \frac{S_{xy}}{S_{xx}} = \frac{1380}{1000} = 1.38$$

$$a = \bar{y} - b\,\bar{x} = \frac{705}{5} - (1.380)\left(\frac{275}{5}\right) = 65.1$$
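A small reusable sketch of the same calculation, applied to the blood pressure data, is given below. The function name `least_squares` is an illustrative assumption; it simply wraps the computing formulae for Sxx and Sxy.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least squares line Y = a + bX."""
    n = len(xs)
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    b = s_xy / s_xx
    a = sum(ys) / n - b * sum(xs) / n
    return a, b

# Midpoint age (X) and median blood pressure (Y) from the slides.
age = [35, 45, 55, 65, 75]
bp = [114, 124, 143, 158, 166]

a, b = least_squares(age, bp)
print(round(a, 1), round(b, 2))  # 65.1 1.38
```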
![Page 77: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/77.jpg)
Graph:
[Figure: median blood pressure (Y) plotted against midpoint age (X) with the line Y = 65.1 + 1.38 X]
![Page 78: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/78.jpg)
Relationship between correlation and Linear Regression
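1. Pearson’s correlation.
• Takes values between –1 and +1

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$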
![Page 79: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/79.jpg)
2. Least squares line Y = a + bX
– Minimises the Residual Sum of Squares:

$$RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - a - b x_i)^2$$

– The Sum of Squares that measures the variability in Y that is unexplained by X.
– This can also be denoted by $SS_{\text{Unexplained}}$.
![Page 80: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/80.jpg)
Some other Sums of Squares:
– The Sum of Squares that measures the total variability in Y (ignoring X).

$$SS_{\text{Total}} = \sum_{i=1}^{n}(y_i-\bar{y})^2$$
![Page 81: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/81.jpg)
– The Sum of Squares that measures the total variability in Y that is explained by X.

$$SS_{\text{Explained}} = \sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2$$
![Page 82: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/82.jpg)
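It can be shown:

(Total variability in Y) = (variability in Y explained by X) + (variability in Y unexplained by X)

$$\sum_{i=1}^{n}(y_i-\bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2 + \sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$

$$SS_{\text{Total}} = SS_{\text{Explained}} + SS_{\text{Unexplained}}$$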
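The decomposition can be checked numerically. The sketch below fits the least squares line to the blood pressure data from the slides and compares the three sums of squares; the variable names are illustrative.

```python
# Midpoint age (X) and median blood pressure (Y) from the slides.
age = [35, 45, 55, 65, 75]
bp = [114, 124, 143, 158, 166]
n = len(age)

x_bar = sum(age) / n
y_bar = sum(bp) / n
s_xx = sum((x - x_bar) ** 2 for x in age)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, bp))

# Least squares line and its fitted values.
b = s_xy / s_xx
a = y_bar - b * x_bar
fitted = [a + b * x for x in age]

ss_total = sum((y - y_bar) ** 2 for y in bp)
ss_explained = sum((f - y_bar) ** 2 for f in fitted)
ss_unexplained = sum((y - f) ** 2 for y, f in zip(bp, fitted))

print(round(ss_total, 1), round(ss_explained, 1), round(ss_unexplained, 1))
```

The identity holds exactly only for the least squares line; for any other choice of a and b the two parts need not add up to the total.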
![Page 83: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/83.jpg)
It can also be shown:

$$r^2 = \frac{\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}$$

= proportion of variability in Y explained by X.

= the coefficient of determination
![Page 84: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/84.jpg)
Further:

$$1 - r^2 = \frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}$$

= proportion of variability in Y that is unexplained by X.
![Page 85: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/85.jpg)
Web sites demonstrating statistical principles using Java applets:
These can be found at the link: http://www.csustan.edu/ppa/llg/stat_demos.htm
![Page 86: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/86.jpg)
Example
TABLE: Per capita consumption of cigarettes per month (Xi) in n = 11 countries in 1930, and the death rates, Yi (per 100,000), from lung cancer for men in 1950.

| Country (i) | Xi | Yi |
|---|---|---|
| Australia | 48 | 18 |
| Canada | 50 | 15 |
| Denmark | 38 | 17 |
| Finland | 110 | 35 |
| Great Britain | 110 | 46 |
| Holland | 49 | 24 |
| Iceland | 23 | 6 |
| Norway | 25 | 9 |
| Sweden | 30 | 11 |
| Switzerland | 51 | 25 |
| USA | 130 | 20 |
![Page 87: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/87.jpg)
Fitting the Least Squares Line

First compute the following three quantities:

$$S_{xx} = 54404 - \frac{(664)^2}{11} = 14322.55$$

$$S_{yy} = 6018 - \frac{(226)^2}{11} = 1374.73$$

$$S_{xy} = 16914 - \frac{(664)(226)}{11} = 3271.82$$
![Page 88: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/88.jpg)
Computing Estimate of Slope and Intercept

$$b = \frac{S_{xy}}{S_{xx}} = \frac{3271.82}{14322.55} = 0.228$$

$$a = \bar{y} - b\,\bar{x} = \frac{226}{11} - (0.228)\left(\frac{664}{11}\right) = 6.756$$
![Page 89: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/89.jpg)
Computing r and r²

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{3271.82}{\sqrt{(14322.55)(1374.73)}} = 0.737$$

$$r^2 = (0.737)^2 = 0.544$$

54.4% of the variability in Y (death rate due to lung cancer, 1950) is explained by X (per capita cigarette smoking in 1930).
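The values of r and r² for the cigarette data can be recomputed from the table with a brief sketch. The variable names are illustrative; the data are the 11 (Xi, Yi) pairs listed in the table.

```python
import math

# Per capita cigarette consumption (1930) and lung cancer death rate (1950).
x = [48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130]
y = [18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20]
n = len(x)

s_xx = sum(v * v for v in x) - sum(x) ** 2 / n
s_yy = sum(v * v for v in y) - sum(y) ** 2 / n
s_xy = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n

r = s_xy / math.sqrt(s_xx * s_yy)
r_squared = r ** 2   # coefficient of determination

print(round(r, 3), round(r_squared, 3))  # 0.737 0.544
```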
![Page 90: Multivariate data](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56813a27550346895da208f2/html5/thumbnails/90.jpg)
[Figure: scatter plot of death rates from lung cancer (1950) against per capita consumption of cigarettes, with points labelled by country and the fitted line Y = 6.756 + (0.228)X]