# [Wiley Series in Probability and Statistics] Methods of Multivariate Analysis (Rencher/Methods) || Characterizing and Displaying Multivariate Data

Post on 08-Dec-2016


CHAPTER 3

CHARACTERIZING AND DISPLAYING MULTIVARIATE DATA

We review some univariate and bivariate procedures in Sections 3.1, 3.2, and 3.3 and then extend them to vectors of higher dimension in the remainder of the chapter.

3.1 MEAN AND VARIANCE OF A UNIVARIATE RANDOM VARIABLE

Informally, a random variable may be defined as a variable whose value depends on the outcome of a chance experiment. Generally, we will consider only continuous random variables. Some types of multivariate data are only approximations to this ideal, such as test scores or a seven-point semantic differential (Likert) scale consisting of ordered responses ranging from "strongly disagree" to "strongly agree." Special techniques have been developed for such data, but in many cases, the usual methods designed for continuous data work almost as well.

The density function f(y) indicates the relative frequency of occurrence of the random variable y. (We do not use Y to denote the random variable for reasons given at the beginning of Section 3.6.) Thus if $f(y_1) > f(y_2)$, then points in the neighborhood of $y_1$ are more likely to occur than points in the neighborhood of $y_2$.

Methods of Multivariate Analysis, Third Edition. By Alvin C. Rencher and William F. Christensen. Copyright 2012 John Wiley & Sons, Inc.


The population mean of a random variable y is defined (informally) as the mean of all possible values of y and is denoted by $\mu$. The mean is also referred to as the expected value of y or E(y). If the density f(y) is known, the mean can sometimes be found using methods of calculus, but we will not use these techniques in this text.

If f(y) is unknown, the population mean will ordinarily remain unknown unless it has been established from extensive past experience with a stable population. If a large random sample from the population represented by f(y) is available, it is highly probable that the mean of the sample is close to $\mu$.

The sample mean of a random sample of n observations $y_1, y_2, \ldots, y_n$ is given by the ordinary arithmetic average

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i. \tag{3.1}$$

Generally, $\bar{y}$ will never be equal to $\mu$; by this we mean that the probability is zero that a sample will ever arise in which $\bar{y}$ is exactly equal to $\mu$. However, $\bar{y}$ is considered a good estimator for $\mu$ because $E(\bar{y}) = \mu$ and $\mathrm{var}(\bar{y}) = \sigma^2/n$, where $\sigma^2$ is the variance of y. In other words, $\bar{y}$ is an unbiased estimator of $\mu$ and has a smaller variance than a single observation y. The variance $\sigma^2$ is defined below. The notation $E(\bar{y})$ indicates the mean of all possible values of $\bar{y}$; that is, conceptually, every possible sample is obtained from the population, the mean of each is found, and the average of all these sample means is calculated.
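The two properties of the sample mean stated above, unbiasedness and variance $\sigma^2/n$, can be checked empirically. The following is a minimal simulation sketch and not part of the original text: the normal population, the values of `mu`, `sigma`, and `n`, the number of replications, and the seed are all illustrative assumptions.

```python
import random
import statistics

def simulate_sample_means(mu, sigma, n, reps, seed=0):
    """Draw `reps` random samples of size n and return the list of sample means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(reps)
    ]

# Assumed population for illustration: normal with mu = 70 and sigma = 4.
mu, sigma, n = 70.0, 4.0, 25
means = simulate_sample_means(mu, sigma, n, reps=20000)

print("average of the sample means: ", statistics.fmean(means))     # close to mu
print("variance of the sample means:", statistics.variance(means))  # close to sigma**2 / n
print("sigma^2 / n:                 ", sigma ** 2 / n)
```

With many replications, the average of the simulated sample means should settle near `mu`, and their variance near `sigma**2 / n`, which is much smaller than the variance `sigma**2` of a single observation.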

If every y in the population is multiplied by a constant a, the expected value is also multiplied by a:

$$E(ay) = aE(y) = a\mu. \tag{3.2}$$

The sample mean has a similar property. If $z_i = ay_i$ for $i = 1, 2, \ldots, n$, then

$$\bar{z} = a\bar{y}. \tag{3.3}$$

The variance of the population is defined as $\mathrm{var}(y) = \sigma^2 = E(y - \mu)^2$. This is the average squared deviation from the mean and is thus an indication of the extent to which the values of y are spread or scattered. It can be shown that $\sigma^2 = E(y^2) - \mu^2$.
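The "it can be shown" step for the population variance takes only a few lines. The derivation below is a sketch that assumes nothing beyond the linearity of expectation and the fact that $\mu = E(y)$ is a constant:

```latex
\begin{aligned}
\sigma^2 &= E(y - \mu)^2 \\
         &= E(y^2 - 2\mu y + \mu^2) \\
         &= E(y^2) - 2\mu E(y) + \mu^2 \\
         &= E(y^2) - 2\mu^2 + \mu^2 \\
         &= E(y^2) - \mu^2 .
\end{aligned}
```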

The sample variance is defined as

$$s^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n - 1}, \tag{3.4}$$

which can be shown to be equal to

$$s^2 = \frac{\sum_{i=1}^{n} y_i^2 - n\bar{y}^2}{n - 1}. \tag{3.5}$$

The sample variance $s^2$ is generally never equal to the population variance $\sigma^2$ (the probability of such an occurrence is zero), but it is an unbiased estimator for $\sigma^2$; that is, $E(s^2) = \sigma^2$. Again the notation $E(s^2)$ indicates the mean of all possible sample variances. The square root of either the population variance or the sample variance is called the standard deviation.
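The two forms of the sample variance, the deviation form (3.4) and the sum-of-squares form (3.5), can be compared numerically. The sketch below uses the heights from Table 3.1; the function names are illustrative choices, not notation from the text.

```python
import math

# Heights of the 20 college-age males in Table 3.1.
heights = [69, 74, 68, 70, 72, 67, 66, 70, 76, 68,
           72, 79, 74, 67, 66, 71, 74, 75, 75, 76]

def sample_variance_deviations(y):
    """Sample variance from squared deviations about the mean, as in (3.4)."""
    n = len(y)
    ybar = sum(y) / n
    return sum((yi - ybar) ** 2 for yi in y) / (n - 1)

def sample_variance_shortcut(y):
    """Sample variance from the raw sum of squares, as in (3.5)."""
    n = len(y)
    ybar = sum(y) / n
    return (sum(yi ** 2 for yi in y) - n * ybar ** 2) / (n - 1)

s2_a = sample_variance_deviations(heights)
s2_b = sample_variance_shortcut(heights)
print(round(s2_a, 3), round(s2_b, 3))  # the two forms agree
print(round(math.sqrt(s2_a), 4))       # sample standard deviation
```

Both forms give 14.576 for these data, with standard deviation 3.8179, the values used later in Example 3.2.2. The shortcut form (3.5) subtracts two large numbers, so in floating-point work on data with a large mean the deviation form (3.4) is usually the safer choice.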


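Table 3.1 Height and Weight for a Sample of 20 College-Age Males

| Person | Height, x | Weight, y | Person | Height, x | Weight, y |
|---|---|---|---|---|---|
| 1 | 69 | 153 | 11 | 72 | 140 |
| 2 | 74 | 175 | 12 | 79 | 265 |
| 3 | 68 | 155 | 13 | 74 | 185 |
| 4 | 70 | 135 | 14 | 67 | 112 |
| 5 | 72 | 172 | 15 | 66 | 140 |
| 6 | 67 | 150 | 16 | 71 | 150 |
| 7 | 66 | 115 | 17 | 74 | 165 |
| 8 | 70 | 137 | 18 | 75 | 185 |
| 9 | 76 | 200 | 19 | 75 | 210 |
| 10 | 68 | 130 | 20 | 76 | 220 |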

If each y is multiplied by a constant a, the population variance is multiplied by $a^2$, or $\mathrm{var}(ay) = a^2\sigma^2$. Similarly, if $z_i = ay_i$, $i = 1, 2, \ldots, n$, then the sample variance of z is given by

$$s_z^2 = a^2 s^2. \tag{3.6}$$
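The scaling rules for the sample mean (3.3) and sample variance (3.6) can be illustrated with a change of units. The sketch below assumes, for illustration only, that the heights in Table 3.1 are in inches and converts them to centimeters with a = 2.54.

```python
import statistics

# Heights from Table 3.1, assumed here to be in inches.
heights_in = [69, 74, 68, 70, 72, 67, 66, 70, 76, 68,
              72, 79, 74, 67, 66, 71, 74, 75, 75, 76]

a = 2.54  # centimeters per inch
heights_cm = [a * y for y in heights_in]

ybar = statistics.fmean(heights_in)
s2 = statistics.variance(heights_in)    # divisor n - 1
zbar = statistics.fmean(heights_cm)
s2z = statistics.variance(heights_cm)

print(zbar, a * ybar)     # z-bar equals a * y-bar, as in (3.3)
print(s2z, a ** 2 * s2)   # s_z^2 equals a^2 * s^2, as in (3.6)
```

The mean scales by a while the variance scales by a squared, so the standard deviation scales by |a|.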

3.2 COVARIANCE AND CORRELATION OF BIVARIATE RANDOM VARIABLES

3.2.1 Covariance

If two variables x and y are measured on each research unit (object or subject), we have a bivariate random variable (x, y). Often x and y will tend to covary; if one is above its mean, the other is more likely to be above its mean, and vice versa. For example, height and weight were observed for a sample of 20 college-age males. The data are given in Table 3.1.

The values of height x and weight y from Table 3.1 are both plotted in the vertical direction in Figure 3.1. The tendency for x and y to stay on the same side of the mean is clear in Figure 3.1. This illustrates positive covariance. With negative covariance the points would tend to deviate simultaneously to opposite sides of the mean.

The population covariance is defined as $\mathrm{cov}(x, y) = \sigma_{xy} = E[(x - \mu_x)(y - \mu_y)]$, where $\mu_x$ and $\mu_y$ are the means of x and y, respectively. Thus if x and y are usually both above their means or both below their means, the product $(x - \mu_x)(y - \mu_y)$ will typically be positive and the average value of the product will be positive. Conversely, if x and y tend to fall on opposite sides of their respective means, the product will usually be negative and the average product will be negative. It can be shown that $\sigma_{xy} = E(xy) - \mu_x\mu_y$.
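The shortcut form of the population covariance follows by expanding the product and taking expectations term by term. The derivation below is a sketch that uses only the linearity of expectation and the definitions $\mu_x = E(x)$ and $\mu_y = E(y)$:

```latex
\begin{aligned}
\sigma_{xy} &= E[(x - \mu_x)(y - \mu_y)] \\
            &= E(xy - \mu_x y - \mu_y x + \mu_x\mu_y) \\
            &= E(xy) - \mu_x E(y) - \mu_y E(x) + \mu_x\mu_y \\
            &= E(xy) - \mu_x\mu_y - \mu_x\mu_y + \mu_x\mu_y \\
            &= E(xy) - \mu_x\mu_y .
\end{aligned}
```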


Figure 3.1 Two variables with a tendency to covary.

If the two random variables x and y in a bivariate random variable are added or multiplied, a new random variable is obtained. The mean of x + y or xy is as follows:

$$E(x + y) = E(x) + E(y), \tag{3.7}$$

$$E(xy) = E(x)E(y) \quad \text{if } x, y \text{ independent.} \tag{3.8}$$

Formally, x and y are independent if their joint density factors into the product of their individual densities: f(x, y) = g(x)h(y). Informally, x and y are independent if the random behavior of either of the variables is not affected by the behavior of the other. Note that (3.7) is true whether or not x and y are independent, but (3.8) holds only for x and y independently distributed.
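The difference between (3.7) and (3.8) can be seen in a small exact calculation. The sketch below uses fair dice, a discrete example chosen for illustration only (the text itself concentrates on continuous variables), and compares an independent pair with a fully dependent pair.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

def expect(g, outcomes):
    """Expected value of g(x, y) over a list of equally likely (x, y) outcomes."""
    outcomes = list(outcomes)
    return sum(Fraction(g(x, y)) for x, y in outcomes) / len(outcomes)

# Independent case: x and y are two separate fair dice (36 equally likely pairs).
pairs = list(product(faces, faces))
E_x = expect(lambda x, y: x, pairs)
E_y = expect(lambda x, y: y, pairs)
E_sum = expect(lambda x, y: x + y, pairs)
E_prod = expect(lambda x, y: x * y, pairs)
print(E_sum == E_x + E_y)    # True: (3.7)
print(E_prod == E_x * E_y)   # True: (3.8) holds under independence

# Dependent case: y is the same die as x (6 equally likely pairs).
same = [(x, x) for x in faces]
E_x_dep = expect(lambda x, y: x, same)
E_sum_dep = expect(lambda x, y: x + y, same)
E_prod_dep = expect(lambda x, y: x * y, same)
print(E_sum_dep == 2 * E_x_dep)     # True: (3.7) still holds
print(E_prod_dep == E_x_dep ** 2)   # False: (3.8) fails without independence
```

Exact fractions are used so that the equalities are tested without floating-point rounding.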

The notion of independence of x and y is more general than that of zero covariance. The covariance $\sigma_{xy}$ measures linear relationship only, whereas if two random variables are independent, they are not related either linearly or nonlinearly. Independence implies $\sigma_{xy} = 0$, but $\sigma_{xy} = 0$ does not imply independence. It is easy to show that if x and y are independent, then $\sigma_{xy} = 0$:

$$\sigma_{xy} = E(xy) - \mu_x\mu_y = E(x)E(y) - \mu_x\mu_y \;\; [\text{by (3.8)}] \;\; = \mu_x\mu_y - \mu_x\mu_y = 0.$$

One way to demonstrate that the converse is not true is to construct examples of bivariate x and y that have zero covariance and yet are related in a nonlinear way (the relationship will have zero slope). This is illustrated in Figure 3.2.
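A concrete instance of zero covariance with dependence is easy to construct. The sketch below assumes a toy population, chosen only for illustration, in which x takes five symmetric values with equal probability and y is the square of x.

```python
from fractions import Fraction

# Assumed toy population: each x value is equally likely, and y = x**2.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]

def mean(values):
    """Exact mean of equally likely values."""
    return Fraction(sum(values), len(values))

mu_x, mu_y = mean(xs), mean(ys)
E_xy = mean([x * y for x, y in zip(xs, ys)])
sigma_xy = E_xy - mu_x * mu_y   # population covariance E(xy) - mu_x * mu_y
print(sigma_xy)                 # 0, although y is completely determined by x

# Dependence: the joint probability does not factor into the marginals.
p_x2 = Fraction(xs.count(2), len(xs))    # P(x = 2)
p_y4 = Fraction(ys.count(4), len(ys))    # P(y = 4)
p_joint = Fraction(sum(1 for x, y in zip(xs, ys) if x == 2 and y == 4), len(xs))
print(p_joint == p_x2 * p_y4)            # False
```

The relationship is a parabola that is symmetric about the mean of x, so its positive and negative linear contributions cancel and the covariance is exactly zero.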


Figure 3.2 A sample from a population where x and y have zero covariance and yet are dependent.

If x and y have a bivariate normal distribution (see Chapter 4), then zero covariance implies independence. This is because (1) the covariance measures only linear relationships and (2) in the bivariate normal case, the mean of y given x (or x given y) is a straight line.

The sample covariance is defined as

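$$s_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}. \tag{3.9}$$

It can be shown that

$$s_{xy} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{n - 1}. \tag{3.10}$$

Note that $s_{xy}$ is essentially never equal to $\sigma_{xy}$ (for continuous data); that is, the probability is zero that $s_{xy}$ will equal $\sigma_{xy}$. It is true, however, that $s_{xy}$ is an unbiased estimator for $\sigma_{xy}$; that is, $E(s_{xy}) = \sigma_{xy}$.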

Since $s_{xy} \neq \sigma_{xy}$ in any given sample, this is also true when $\sigma_{xy} = 0$. Thus when the population covariance is zero, no random sample from the population will have zero covariance. The only way a sample from a continuous bivariate distribution will have zero covariance is for the experimenter to choose the values of x and y so that $s_{xy} = 0$. (Such a sample would not be a random sample.) One way to achieve this is to place the values in the form of a grid. This is illustrated in Figure 3.3.
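The grid construction can be verified directly. The sketch below assumes a small design chosen for illustration, every combination of four x levels and three y levels, and computes the sample covariance from its definition.

```python
from itertools import product

def sample_covariance(x, y):
    """Sample covariance with divisor n - 1, computed from deviations about the means."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

# Assumed design: all 12 combinations of the chosen x and y levels.
grid = list(product([1, 2, 3, 4], [10, 20, 30]))
x = [point[0] for point in grid]
y = [point[1] for point in grid]

print(sample_covariance(x, y))  # 0.0
```

On a full grid the sum of cross products factors into the sum of the x deviations times the sum of the y deviations, and each of those sums is zero.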

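Figure 3.3 A sample of (x, y) values with zero covariance.

The sample covariance measures only linear relationships. If the points in a bivariate sample follow a curved trend, as, for example, in Figure 3.2, the sample covariance will not measure the strength of the relationship. To see that $s_{xy}$ measures only linear relationships, note that the slope of a simple linear regression line is

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2}. \tag{3.11}$$

Thus $s_{xy}$ is proportional to the slope, which reflects only the linear relationship between y and x.

Two sets of numbers $a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_n$ are orthogonal if $\sum_{i=1}^{n} a_i b_i = 0$. This is true for the centered variables $x_i - \bar{x}$ and $y_i - \bar{y}$ when the sample covariance is zero, that is, $\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = 0$.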
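The link between the regression slope and the sample covariance can be checked on the Table 3.1 data. The sketch below assumes the standard least-squares slope for a simple linear regression of y on x and compares it with the ratio of the sample covariance to the sample variance of x; the variable names are illustrative.

```python
# Height (x) and weight (y) from Table 3.1.
heights = [69, 74, 68, 70, 72, 67, 66, 70, 76, 68,
           72, 79, 74, 67, 66, 71, 74, 75, 75, 76]
weights = [153, 175, 155, 135, 172, 150, 115, 137, 200, 130,
           140, 265, 185, 112, 140, 150, 165, 185, 210, 220]

n = len(heights)
xbar = sum(heights) / n
ybar = sum(weights) / n

cross_products = sum((x - xbar) * (y - ybar) for x, y in zip(heights, weights))
squares_x = sum((x - xbar) ** 2 for x in heights)

slope_from_sums = cross_products / squares_x   # least-squares slope
sxy = cross_products / (n - 1)                 # sample covariance
sx2 = squares_x / (n - 1)                      # sample variance of x
slope_from_cov = sxy / sx2                     # same slope written as s_xy / s_x^2

print(round(slope_from_sums, 4), round(slope_from_cov, 4))
```

The divisor n - 1 cancels in the ratio, which is why the slope can be written either way.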

EXAMPLE 3.2.1

To obtain the sample covariance for the height and weight data in Table 3.1, we first calculate $\bar{x}$, $\bar{y}$, and $\sum_i x_i y_i$, where x is height and y is weight:

$$\bar{x} = \frac{69 + 74 + \cdots + 76}{20} = 71.45,$$

$$\bar{y} = \frac{153 + 175 + \cdots + 220}{20} = 164.7,$$

$$\sum_{i=1}^{20} x_i y_i = (69)(153) + (74)(175) + \cdots + (76)(220) = 237{,}805.$$


Now, by (3.10), we have

$$s_{xy} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{n - 1} = \frac{237{,}805 - (20)(71.45)(164.7)}{19} = 128.88.$$

By itself, the sample covariance 128.88 is not very meaningful. We are not sure if this represents a small, moderate, or large amount of relationship between y and x. A method of standardizing the covariance is given in the next section.

□
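The arithmetic of Example 3.2.1 can be reproduced in a few lines. The sketch below computes the sample covariance of the Table 3.1 data both from deviations about the means and from the shortcut formula (3.10); the variable names are illustrative.

```python
# Height (x) and weight (y) from Table 3.1.
heights = [69, 74, 68, 70, 72, 67, 66, 70, 76, 68,
           72, 79, 74, 67, 66, 71, 74, 75, 75, 76]
weights = [153, 175, 155, 135, 172, 150, 115, 137, 200, 130,
           140, 265, 185, 112, 140, 150, 165, 185, 210, 220]

n = len(heights)
xbar = sum(heights) / n
ybar = sum(weights) / n
sum_xy = sum(x * y for x, y in zip(heights, weights))

# Deviation form: average cross product of deviations, divisor n - 1.
sxy_deviation = sum((x - xbar) * (y - ybar) for x, y in zip(heights, weights)) / (n - 1)
# Shortcut form (3.10): raw sum of cross products minus n * xbar * ybar.
sxy_shortcut = (sum_xy - n * xbar * ybar) / (n - 1)

print(xbar, ybar, sum_xy)
print(round(sxy_deviation, 2), round(sxy_shortcut, 2))
```

The script reproduces the means 71.45 and 164.7, the cross-product sum 237,805, and the covariance 128.88 from the example.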

3.2.2 Correlation

Since the covariance depends on the scale of measurement of x and y, it is difficult to compare covariances between different pairs of variables. For example, if we change a measurement from inches to centimeters, the covariance will change. To find a measure of linear relationship that is invariant to changes of scale, we can standardize the covariance by dividing by the standard deviations of the two variables. This standardized covariance is called a correlation. The population correlation of two random variables x and y is

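$$\rho_{xy} = \mathrm{corr}(x, y) = \frac{\sigma_{xy}}{\sigma_x\sigma_y} = \frac{E[(x - \mu_x)(y - \mu_y)]}{\sqrt{E(x - \mu_x)^2}\,\sqrt{E(y - \mu_y)^2}}, \tag{3.12}$$

and the sample correlation is

$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}. \tag{3.13}$$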

Either of these correlations will range between $-1$ and 1.

The sample correlation $r_{xy}$ is related to the cosine of the angle between two vectors. Let $\theta$ be the angle between vectors a and b in Figure 3.4. The vector from the terminal point of a to the terminal point of b can be represented as $\mathbf{c} = \mathbf{b} - \mathbf{a}$. Then the law of cosines can be stated in vector form as

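$$\begin{aligned}
\cos\theta &= \frac{\mathbf{a}'\mathbf{a} + \mathbf{b}'\mathbf{b} - (\mathbf{b} - \mathbf{a})'(\mathbf{b} - \mathbf{a})}{2\sqrt{(\mathbf{a}'\mathbf{a})(\mathbf{b}'\mathbf{b})}} \\
&= \frac{\mathbf{a}'\mathbf{a} + \mathbf{b}'\mathbf{b} - (\mathbf{b}'\mathbf{b} + \mathbf{a}'\mathbf{a} - 2\mathbf{a}'\mathbf{b})}{2\sqrt{(\mathbf{a}'\mathbf{a})(\mathbf{b}'\mathbf{b})}} \\
&= \frac{\mathbf{a}'\mathbf{b}}{\sqrt{(\mathbf{a}'\mathbf{a})(\mathbf{b}'\mathbf{b})}}.
\end{aligned} \tag{3.14}$$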

Since $\cos(90^\circ) = 0$, we see from (3.14) that $\mathbf{a}'\mathbf{b} = 0$ when $\theta = 90^\circ$. Thus a and b are perpendicular when $\mathbf{a}'\mathbf{b} = 0$. By (2.99), two vectors a and b, such that $\mathbf{a}'\mathbf{b} = 0$, are also said to be orthogonal. Hence orthogonal vectors are perpendicular in a geometric sense.

Figure 3.4 Vectors a and b in 3-space.

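To express the correlation in the form given in (3.14), let the n observation vectors $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ in two dimensions be represented as two vectors $\mathbf{x}' = (x_1, x_2, \ldots, x_n)$ and $\mathbf{y}' = (y_1, y_2, \ldots, y_n)$ in n dimensions, and let x and y be centered as $\mathbf{x} - \bar{x}\mathbf{j}$ and $\mathbf{y} - \bar{y}\mathbf{j}$. Then the cosine of the angle $\theta$ between them [see (3.14)] is equal to the sample correlation between x and y:

$$\begin{aligned}
\cos\theta &= \frac{(\mathbf{x} - \bar{x}\mathbf{j})'(\mathbf{y} - \bar{y}\mathbf{j})}{\sqrt{[(\mathbf{x} - \bar{x}\mathbf{j})'(\mathbf{x} - \bar{x}\mathbf{j})][(\mathbf{y} - \bar{y}\mathbf{j})'(\mathbf{y} - \bar{y}\mathbf{j})]}} \\
&= \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}} \\
&= r_{xy}.
\end{aligned} \tag{3.15}$$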

Thus if the angle $\theta$ between the two centered vectors $\mathbf{x} - \bar{x}\mathbf{j}$ and $\mathbf{y} - \bar{y}\mathbf{j}$ is small so that $\cos\theta$ is near 1, $r_{xy}$ will be close to 1. If the two vectors are perpendicular, $\cos\theta$ and $r_{xy}$ will be zero. If the two vectors have nearly opposite directions, $r_{xy}$ will be close to $-1$.
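The geometric reading of the correlation can be tried on the Table 3.1 data. The sketch below centers the two data vectors, applies the cosine formula (3.14) to them, and converts the result to an angle; the helper names `centered` and `dot` are illustrative.

```python
import math

# Height (x) and weight (y) from Table 3.1.
heights = [69, 74, 68, 70, 72, 67, 66, 70, 76, 68,
           72, 79, 74, 67, 66, 71, 74, 75, 75, 76]
weights = [153, 175, 155, 135, 172, 150, 115, 137, 200, 130,
           140, 265, 185, 112, 140, 150, 165, 185, 210, 220]

def centered(v):
    """Subtract the mean from every element of v."""
    vbar = sum(v) / len(v)
    return [vi - vbar for vi in v]

def dot(a, b):
    """Inner product a'b of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

xc, yc = centered(heights), centered(weights)
cos_theta = dot(xc, yc) / math.sqrt(dot(xc, xc) * dot(yc, yc))  # (3.14) on centered vectors
theta_deg = math.degrees(math.acos(cos_theta))

print(round(cos_theta, 3))  # equals the sample correlation r_xy
print(round(theta_deg, 1))  # angle between the centered vectors, in degrees
```

The cosine comes out at about 0.889, so the two centered 20-dimensional vectors point in fairly similar directions.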

EXAMPLE 3.2.2

To obtain the correlation for the height and weight data of Table 3.1, we first calculate the sample variance of x:

$$s_x^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1} = \frac{102{,}379 - (20)(71.45)^2}{19} = 14.576.$$

Then $s_x = \sqrt{14.576} = 3.8179$ and, similarly, $s_y = 37.964$. By (3.13), we have

$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{128.88}{(3.8179)(37.964)} = 0.889.$$

□

Figure 3.5 Bivariate scatterplot of the data in Figure 3.1.
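The correlation for the Table 3.1 data, and its invariance to a change of scale, can be checked together. The sketch below assumes for illustration that the heights are in inches and rescales them to centimeters; the function names are illustrative.

```python
import math

# Height (x) and weight (y) from Table 3.1.
heights = [69, 74, 68, 70, 72, 67, 66, 70, 76, 68,
           72, 79, 74, 67, 66, 71, 74, 75, 75, 76]
weights = [153, 175, 155, 135, 172, 150, 115, 137, 200, 130,
           140, 265, 185, 112, 140, 150, 165, 185, 210, 220]

def sample_cov(x, y):
    """Sample covariance with divisor n - 1."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

def sample_corr(x, y):
    """Sample correlation: covariance divided by the two standard deviations."""
    return sample_cov(x, y) / math.sqrt(sample_cov(x, x) * sample_cov(y, y))

# Assumed unit change for illustration: inches to centimeters.
heights_cm = [2.54 * h for h in heights]

print(round(sample_cov(heights, weights), 2))      # covariance in the original units
print(round(sample_cov(heights_cm, weights), 2))   # covariance is multiplied by 2.54
print(round(sample_corr(heights, weights), 3))     # correlation in the original units
print(round(sample_corr(heights_cm, weights), 3))  # correlation is unchanged
```

The covariance changes with the unit of measurement, while the correlation stays at about 0.889 because the scale factor cancels between numerator and denominator.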

3.3 SCATTERPLOTS OF BIVARIATE SAMPLES

Figures 3.2 and 3.3 are examples of scatterplots of bivariate samples. In Figure 3.1, the two variables x and y were plotted separately for the data in Table 3.1. Figure 3.5 shows a bivariate scatterplot of the same data.

If the origin is shifted to $(\bar{x}, \bar{y})$, as indicated by the dashed lines, then the first and third quadrants contain most of the points. Scatterplots for correlated data typically show a substantial positive or negative slope.
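The quadrant statement can be checked by counting. The sketch below shifts the origin to the sample means of the Table 3.1 data and tallies the points in each quadrant; the function `quadrant` is an illustrative helper.

```python
from collections import Counter

# Height (x) and weight (y) from Table 3.1.
heights = [69, 74, 68, 70, 72, 67, 66, 70, 76, 68,
           72, 79, 74, 67, 66, 71, 74, 75, 75, 76]
weights = [153, 175, 155, 135, 172, 150, 115, 137, 200, 130,
           140, 265, 185, 112, 140, 150, 165, 185, 210, 220]

def quadrant(x, y, x0, y0):
    """Quadrant (1 to 4) of (x, y) about an origin at (x0, y0); 0 if the point lies on an axis."""
    if x == x0 or y == y0:
        return 0
    if x > x0:
        return 1 if y > y0 else 4
    return 2 if y > y0 else 3

xbar = sum(heights) / len(heights)
ybar = sum(weights) / len(weights)
counts = Counter(quadrant(x, y, xbar, ybar) for x, y in zip(heights, weights))

for q in (1, 2, 3, 4):
    print(f"quadrant {q}: {counts[q]} points")
```

For these data 9 points fall in the first quadrant and 10 in the third, leaving a single point in the fourth and none in the second, which is the pattern expected under positive correlation.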

A hypothetical sample of the uncorrelated variables height and IQ is shown in Figure 3.6. We could change the shape of the swarm of points by altering the scale on either axis. But because of the independence assumed for these variables, each quadrant is likely to have as many points as any other quadrant. A tall person is as likely to have a high IQ as a low IQ. A person of low IQ is as likely to be short as to be tall.

Figure 3.6 A sample of data from a population where x and y are uncorrelated.

3.4 GRAPHICAL DISPLAYS FOR MULTIVARIATE SAMPLES

It is a relatively simple procedure to plot bivariate samples as in Section 3.3. The position of a point shows at once the value of both variables. However, for three or more variables it is a challenge to show graphically the values of all the variables in an observation vector y. On a two-dimensional plot, the value of a third variable could be indicated by color or intensity or size of the plotted point. Four dimensions might be represented by starting with a two-dimensional scatterplot and adding two additional dimensions as line segments at right angles, as in Figure 3.7. The "corner point" represents $y_1$ and $y_2$, whereas $y_3$ and $y_4$ are given by the lengths of the two line segments.

We will now describe various methods proposed for representing p dimensions in a plot of an observation vector, where p > 2.

Profiles represent each point by p vertical bars, with the heights of the bars depicting the values of the variables. Sometimes the profile is outlined by a polygonal line rather than bars.


Figure 3.7 Four-dimensional plot.

Stars portray the value of each (normalized) variable as a point along a ray from the center to the outside of a circle. The points on the rays are usually joined to form a polygon.

Glyphs (Anderson 1960) are circles of fixed size with rays whose lengths represent the values of the variables. Anderson suggested using only three lengths of rays, thus rounding the variable values to three levels.

Faces (Chernoff 1973) depict each variable as a feature on a face, such as length of nose, size of eyes, shap...
