Partial least squares and compositional data: problems and alternatives




ELSEVIER Chemometrics and Intelligent Laboratory Systems 30 (1995) 159-172

Chemometrics and intelligent laboratory systems

Partial least squares and compositional data: problems and alternatives

John Hinkle, William Rayens
Department of Statistics, University of Kentucky, Lexington, KY 40506, USA

Received 16 December 1994; accepted 23 June 1995

Abstract

It is still widely unknown in chemometrics that the statistical analysis of compositional data requires fundamentally different tools than a similar analysis of unconstrained data. This article examines the problems that potentially occur when one performs a partial least squares (PLS) analysis on compositional data and suggests logcontrast partial least squares (LCPLS) as an alternative.

Keywords: Partial least squares; Compositional data

1. Introduction

Many multivariate data sets of interest to scientists are compositional or ‘closed’ data sets, consisting essentially of relative proportions. For instance, in petrology the geochemical composition of rocks is often studied by classifying each rock according to the relative percentage by weight of chemical oxides. Some early examples of this type of research can be found in Thompson et al. [1], who studied samples of rocks that were collected from the Eocene Lavas of the Isle of Skye, Scotland. Carr [2] performed a similar study comparing the geochemical concentration of Permian and Post-Permian igneous rocks in the Southern Sydney Basin. Further, Love and Woronow [3] were interested in determining the effects on geochemical composition when different treatments for the destruction of organic material were used. And, typical of subsequent studies in sedimentology, Coakley and Rust [4] investigated sediment samples that had been taken at different depths from an arctic lake and then classified according to their relative amounts of sand, silt, and clay.

In analytical chemistry compositional data often result from preprocessing. In chromatography, for instance, the area under the peak in a chromatogram is routinely scaled by the total area under all peaks so as to compensate for variation in the amount of sample used to generate that chromatogram. Other chemical and geochemical examples can be found in [5-12].

Unfortunately, compositional data can rarely be analyzed with the usual multivariate statistical methods. The reasons for this are many, some almost philosophical, and even statisticians have been slow to appreciate the extent of the problems. The attending difficulties go well beyond the obvious objections to multivariate normality which often accompany such methods. In fact, one must realize that some of the most elementary concepts, such as covariance, correlation, and association, have to be carefully rethought, and sometimes redefined, when dealing with compositional data. Since these concepts are fundamental to the mechanics of and intuition behind almost all multivariate procedures, it is essential that one proceed with care. Some authors have recognized this need for caution, particularly with principal components analysis (PCA) and principal components regression (PCR), while other techniques - notably partial least squares (PLS) - have escaped careful scrutiny. The primary goal of this article is to look at some of the subtle problems associated with PLS when applied to compositional data and to offer an alternative. A review of relevant topics in compositional data analysis precedes this study.

2. Summary conclusions

Partial least squares is seen in Sections 4 and 5 to produce potentially misleading results if applied directly to closed data. It should be emphasized that it is not clear how serious or prevalent such problems may be in the chemometrics literature, since the uses of PLS vary greatly and the associated interpretations are sometimes highly subjective. However, it is clear that PLS and other structure-seeking tools based on the usual idea of covariance are based on a potentially defective construct when applied to closed data. Following some of the suggestions of Aitchison [13], an alternate version of PLS - called logcontrast partial least squares (LCPLS) - is developed from statistical first principles. It is shown that this alternate methodology can be implemented by first applying a simple transformation to the data and then proceeding to use one of the many existing PLS algorithms. In view of how easy this new methodology is to implement, it is perhaps wise to compare the results of both ordinary PLS and LCPLS when analyzing compositional data. If such a comparison results in definite differences in prediction ability or subjective insight, then there are sound statistical reasons to believe that either the data are curvilinear in the original variable space, or the standard covariance construct has proved inadequate, or both. In this case, LCPLS will be more interpretable and relevant, inasmuch as it was developed to circumvent these problems.

3. Overview of compositional data

3.1. Notation

Suppose the composition has G components. If G - 1 of the components of that composition are known, then the last component is known. (For the remainder of this paper, g = G - 1.) Thus, a composition with G components is referred to as a g-dimensional composition. The sample space for a g-dimensional composition is the g-dimensional simplex, defined by

$\mathcal{S}^g = \{(x_1, \dots, x_G):\ x_1 > 0, \dots, x_G > 0;\ x_1 + \cdots + x_G = 1\}$

Compositional data often originate by normalizing data whose sample space is the positive orthant. The G-dimensional positive orthant is defined by

$\mathcal{W}^G = \{(w_1, \dots, w_G):\ w_1 > 0, \dots, w_G > 0\}$

Suppose $w = (w_1, \dots, w_G)$ is a G-dimensional vector in $\mathcal{W}^G$. A g-dimensional composition $x = (x_1, \dots, x_G)$ can be found by letting

$x = \left( \frac{w_1}{\sum_{j=1}^{G} w_j}, \dots, \frac{w_G}{\sum_{j=1}^{G} w_j} \right)$
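As an aside for readers who want to experiment, this closure operation is a one-liner in Python with numpy; the helper name `closure` is ours, not from the paper:

```python
import numpy as np

def closure(w):
    """Map a basis w in the positive orthant to the composition
    (w_1 / sum(w), ..., w_G / sum(w)) on the simplex."""
    w = np.asarray(w, dtype=float)
    if np.any(w <= 0):
        raise ValueError("all parts of the basis must be positive")
    return w / w.sum()

# e.g. raw weights -> relative proportions (approximately [0.52, 0.42, 0.06])
x = closure([10.4, 8.4, 1.2])
```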


The vector w is known as a basis of the composition x. The usual covariance matrix

$\Sigma_{\mathrm{crude}} = [\mathrm{cov}(x_i, x_j)]_{G \times G}$

is called the crude covariance matrix associated with x.

3.2. Negative bias

It is fairly well known that the crude covariance matrix suffers from a negative bias. That is, each row of $\Sigma_{\mathrm{crude}}$ will sum to 0, forcing at least one covariance term in that row to be negative. An immediate implication is that these covariances, and the corresponding correlations, are not free to range as usual. Hence, it is no longer clear what a ‘small’ correlation means, and the common practice of attributing ‘no association’ to a zero correlation becomes suspect. Pearson [14] was the first to warn that when the random variables of interest are ratios, a correlation of zero may not mean ‘no association’. In general, Pearson pointed out:

If $u = f_1(x, y)$ and $v = f_2(z, y)$ are two functions of the three variables x, y, z, and these variables be selected at random so that there exists no correlation between x, y; y, z; or z, x, there will be found to exist correlation between u and v.

Pearson terms the correlation that exists between u and v spurious correlation, which is also referred to as null correlation. Although slightly more complicated, compositional data exhibit the same problems. Beginning in the late 1940s, the pioneers of compositional data analysis recognized this and were interested in how spurious correlation could be corrected. Chayes [15] used a nonrigorous, though reasonable, argument to derive an ‘average’ correlation one could expect from compositional data and suggested using Fisher’s z-transformation to test if the observed correlation is different from this designated ‘null correlation’ value. Mosimann [16], on the other hand, argued that the appropriate measure of null correlation was the correlation structure of the Dirichlet distribution. His attending intuition was based on the idea that spurious correlation might be summarized by the correlation that persists in a composition despite being derived from a basis of independent components.

A result due to Lukacs [17] helped to convince Mosimann that a basis of independent gammas with the same scale parameter was the ‘correct’ choice of basis. Mosimann showed that the resulting composition had a Dirichlet distribution and suggested, as did Chayes, using Fisher’s z-transformation to compare an observed correlation to the appropriate entry in the Dirichlet covariance structure. Darroch [18] derived the same result as Mosimann by means of a different argument, while Chayes and Kruskal [19] started with an uncorrelated basis and employed a Taylor expansion to approximate the correlation structure of the resulting composition.

All of the above-mentioned tests have been critiqued elsewhere. The primary problem in starting with an uncorrelated or independent basis and then studying the correlation that ensues in the composition is that a correlated basis might produce the same compositional correlation structure (e.g. Kork [20]). Also, Aitchison [21] points out the following:

(1) The distribution of the test statistics is unknown. Although it is suggested to use Fisher’s z-transformation, this is only valid if there is bivariate normality. It is obvious that compositional data are not normally distributed.

(2) These tests are carried out separately for each possible correlation. Thus, those tests are subject to the same criticisms that are often leveled at testing all pairwise t-tests without a preliminary overall F-test. No such overall test has been proposed.

In short, the concept of ‘null correlation’ did not lead to an infallible mechanism whereby one could adjust the crude covariance structure. Still, the point remained that the crude structure was an inadequate construct for summarizing variability in compositional data. Failure to appreciate this point has led to some conflicting advice in the literature. For example, Johansson et al. [6] studied the effect of closure on PCA by comparing the variance explained by the first principal component for a single data set in both closed and open format. For that


Fig. 1. Ternary diagram of aphyric Skye lava data.

particular data set these two values were very similar, so it was concluded that closure had no effect. However, Butler [5] would argue that the summary variance associated with the closed data should be reduced, as suggested above. In fact, Butler [5] performed such a study on 28 chemical analyses of volcanic rocks from Gough Island [22]. With the adjustments suggested by Chayes, he found decided differences in the summary variances between open and closed arrays.
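The negative bias described in this subsection is easy to demonstrate numerically. A minimal Python/numpy sketch on simulated data (our own illustration, not from any of the studies cited): even when the basis consists of independent gamma components, closing the data forces each row of the crude covariance matrix to sum to zero, so every row must contain a negative covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Basis of four independent gamma components, then closure to compositions.
w = rng.gamma(shape=5.0, size=(200, 4))
x = w / w.sum(axis=1, keepdims=True)

crude = np.cov(x, rowvar=False)       # crude covariance matrix of closed data

row_sums = crude.sum(axis=1)          # each row sums to zero (up to rounding)
has_negative = crude.min(axis=1) < 0  # hence each row holds a negative covariance
```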

3.3. Curvature

Unfortunately, an uninterpretable crude covariance structure is not the only problem. As Le Maitre [22,23], Reyment [7,24], and Aitchison [25] have noted, compositional data often exhibit curvature when plotted in variable space, as shown in the aphyric Skye lava data analyzed by Le Maitre (Fig. 1 and Table 1). Hence, structure-seeking methods such as PCA, PCR, and PLS will be at a disadvantage since the subspaces they produce are linear. For example, a standard principal component analysis on the Skye lava data using $\Sigma_{\mathrm{crude}}$ would produce principal directions as shown in Fig. 2, which completely miss the essential one-dimensional curvature in the data. One might also surmise from Fig. 2 that the associated component scores may be misleading. Further, one expects pairwise principal component plots to exhibit a ‘random’ or elliptical scatter, owing to their orthogonality. However, if these components are derived from curvilinear compositional data, they often reflect that curvature, challenging one’s idea of ‘uncorrelatedness’ and suggesting that the statistical constructs used in the analysis may be inadequate. The most successful attempt at providing adequate constructs is due to Aitchison [13,21,25-30]. His theory revolves around the use of transformations appropriate for mapping the original data from the confines of the simplex to an unconstrained g-dimensional space, where standard multivariate analyses can be performed. Although Rayens and Srinivasan [31,32] provided useful extensions, Aitchison’s basic theory remains the most encompassing to date. Two concepts fundamental to that theory and essential to the development of LCPLS are discussed in the next section.


Table 1
AFM compositions of 23 Skye lavas from Aitchison [13]

Specimen   A (%)   F (%)   M (%)
S1          52      42       6
S2          52      44       4
S3          47      48       5
S4          45      49       6
S5          40      50      10
S6          37      54       9
S7          27      58      15
S8          27      54      19
S9          23      59      18
S10         22      59      19
S11         21      60      19
S12         25      53      22
S13         24      54      22
S14         22      55      23
S15         22      56      22
S16         20      58      22
S17         16      62      22
S18         17      57      26
S19         14      54      32
S20         13      55      32
S21         13      52      35
S22         14      47      39
S23         24      56      20

A: Na2O + K2O; F: Fe2O3; M: MgO.

Fig. 2. First PCA axis of aphyric Skye lava data.


3.4. Compositional covariance structure

Aitchison [13] argues that when analyzing compositional data one should adopt a new concept of correlation based on the following definition.

Definition 1. The covariance structure of a G-part composition x is the set of all

$\sigma_{ij,kl} = \mathrm{cov}\{\log(x_i/x_k), \log(x_j/x_l)\}:\quad i, j, k, l = 1, \dots, G$ (1)

This definition follows from an argument that any compositional analysis should be based only on the relative values of the compositional parts and not the absolute values of the individual parts. Although Eq. (1) seems to suggest that there are overwhelmingly many covariances to estimate, it is not hard to show that only Gg/2 need to be specified (as with the crude structure), from which all the others can then be determined.

There are a variety of ways in which this general covariance structure can be represented by matrix constructs. Two that will be presented herein are based on the following transformations from $\mathcal{S}^g$.

Definition 2. The logratio transformation of a G-part composition is the g-dimensional vector y given by

$y_i = \log(x_i/x_G),\quad i = 1, \dots, g$

The covariance matrix of y, $\Sigma = [\sigma_{ij}]$, is termed the logratio covariance matrix.

Definition 3. The centered logratio transformation of a G-part composition is the G-dimensional vector z given by

$z_i = \log(x_i/g(x)),\quad i = 1, \dots, G$ (2)

The covariance matrix of z is termed the centered logratio covariance matrix,

$\Gamma = [\gamma_{ij}] = \left[\mathrm{cov}\{\log(x_i/g(x)), \log(x_j/g(x))\}\right]:\quad i, j = 1, \dots, G$

where g(x) is the geometric mean of the G components of x. It is not difficult to see that the entire covariance structure can be recovered from these matrix versions:

$\sigma_{ij,kl} = \sigma_{ij} + \sigma_{kl} - \sigma_{il} - \sigma_{jk} = \gamma_{ij} + \gamma_{kl} - \gamma_{il} - \gamma_{jk}$

Both matrix forms have strengths and weaknesses. For instance, $\Sigma$ has full rank, but depends on the choice of divisor, while $\Gamma$ is $G \times G$ with rank g, but the divisor is not component specific. A full discussion of these and other covariance structures, including strong motivation for their appropriateness, can be found in Aitchison [13].
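The recovery identity above is easy to check numerically. A Python/numpy sketch on simulated compositions (the helper `clr` is our own shorthand for the centered logratio transformation of Definition 3):

```python
import numpy as np

def clr(x):
    """Centered logratio transform: z_i = log(x_i / g(x)) for each row."""
    logx = np.log(x)
    return logx - logx.mean(axis=1, keepdims=True)  # subtracts log g(x)

rng = np.random.default_rng(1)
w = rng.gamma(shape=3.0, size=(500, 4))
x = w / w.sum(axis=1, keepdims=True)     # 4-part compositions

z = clr(x)
gamma = np.cov(z, rowvar=False)          # centered logratio covariance, G x G

# Recover sigma_{ij,kl} = cov(log(x_i/x_k), log(x_j/x_l)) from gamma:
i, j, k, l = 0, 1, 2, 3
direct = np.cov(np.log(x[:, i] / x[:, k]), np.log(x[:, j] / x[:, l]))[0, 1]
from_gamma = gamma[i, j] + gamma[k, l] - gamma[i, l] - gamma[j, k]
assert np.isclose(direct, from_gamma)
```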

3.5. Linear combinations and logcontrasts

If the covariance structure of a G-part composition adequately describes one’s idea of variability in the composition, then interest in studying the variability of the factor space, that is, linear combinations of the composition, should be based on a similar construct.

Definition 4. A logcontrast for a G-part composition x is any loglinear combination

$a'\log(x) = \sum_{i=1}^{G} a_i \log(x_i) \quad \text{where} \quad \sum_{i=1}^{G} a_i = 0$ (3)

Aitchison has made strong arguments suggesting that a logcontrast is to the simplex as a linear combination is to unconstrained Euclidean space. His reasoning is connected to the logratio transformations, which are inherent in the defined covariance structure mentioned above. For clarification, note that any linear combination of the logratio transformed composition can be written as:

$a'y = \sum_{i=1}^{g} a_i \log(x_i/x_G) = \sum_{i=1}^{g} a_i \log(x_i) + a_G \log(x_G) = \sum_{i=1}^{G} a_i \log(x_i)$ (4)

where $a_G = -\sum_{i=1}^{g} a_i$. Thus a linear combination of the variates in y can be viewed as a logcontrast. Moreover, the logcontrast does not depend on which part of the composition is used as the divisor in the definition of y. Also, for the centered logratio transformation z, we have, for contrasts with $a'1 = 0$,

$a'z = \sum_{i=1}^{G} a_i \log(x_i) - \log\{g(x)\} \sum_{i=1}^{G} a_i = a'\log(x)$ (5)

Thus a logcontrast is the same as a contrast in the centered logratio transformed composition. To the extent that Aitchison’s arguments are convincing, the implications are immediate for the standard exploratory, structure-seeking techniques used in chemometrics. For example, even with a method as elementary as PCA it is now suggested that one should be seeking to maximize var[a' log(x)] over all unit-length contrasts a, rather than maximizing var[a'x] over all unit-length a in $R^G$. It is relatively easy to show that the solutions to this new maximization problem are provided by the principal components of $\Gamma$ corresponding to the G - 1 nonzero eigenvalues. The associated weighting vectors can then be mapped back to the original simplex, if desired. This method of logcontrast principal components (LCPCA) has immediate benefits. Recall, with the Skye lava data ordinary (crude) principal components failed to adequately capture the apparent one-dimensional variability. The logcontrast principal components (Fig. 3) do a much better job. The reader is referred to Aitchison [25] for a detailed discussion of additional merits of LCPCA. It is also important to note that Aitchison offers empirical evidence to suggest that the logarithmic function is sufficiently linear to model compositional data that do not embody any curvature. At the very least, the nonlinear nature of the logarithm function allows a potentially more accurate modelling of nonlinear data patterns.
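The LCPCA construction just described amounts to taking eigenvectors of the centered logratio covariance matrix for its G - 1 nonzero eigenvalues. A Python/numpy sketch on simulated compositions (our own illustration, not the Skye lava analysis):

```python
import numpy as np

rng = np.random.default_rng(2)
w = rng.gamma(shape=4.0, size=(300, 3))
x = w / w.sum(axis=1, keepdims=True)         # 3-part compositions

logx = np.log(x)
z = logx - logx.mean(axis=1, keepdims=True)  # centered logratio transform
gamma = np.cov(z, rowvar=False)              # Gamma: G x G with rank g

# LCPCA: eigenvectors of Gamma; eigh returns eigenvalues in ascending order,
# so the last column is the leading logcontrast principal direction.
evals, evecs = np.linalg.eigh(gamma)
a = evecs[:, -1]                             # unit-length weighting vector

assert np.isclose(evals[0], 0.0, atol=1e-10) # the forced zero eigenvalue
assert np.isclose(a.sum(), 0.0, atol=1e-8)   # the weights form a contrast
assert np.isclose(np.var(logx @ a, ddof=1), evals[-1])  # var of logcontrast scores
```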

The intuition behind this article is now clear. PLS exists for much the same reason that PCA does: to find linear subspaces that allow a meaningful reduction in dimension from that of the original problem. Therefore, it is suggested that any afflictions present in performing a PCA with the crude covariance structure will translate to a PLS analysis. In the sections which follow, these afflictions are assessed and logcontrast PLS (LCPLS) is introduced and demonstrated.

4. Logcontrast partial least squares

Partial least squares (PLS) was introduced by Wold [33] and is typically used in chemometrics as a modeling alternative to ordinary least squares (OLS) when the predictor matrix is poorly conditioned. Since the technique has been widely discussed [34-36], no detailed review will be presented here. The motivation, simply, is to approximate the estimation space provided by the original predictor matrix with one of lower dimension, defined in terms of linear combinations of the original predictors that maximize the squared covariance with the response. There is a variety of perspectives on this optimization problem and several arguably equivalent algorithms have emerged in the literature. One particular perspective which highlights the susceptibility of PLS to the problems associated with a crude covariance structure, along with some necessary notation, is reviewed below.


Fig. 3. First LCPCA axis of aphyric Skye lava data.

4.1. Partial least squares

Suppose one’s data are represented by the set of observations $\{y_i, x_i\}_{i=1}^{n}$, where $y_i$ is termed the response and the p-dimensional column vector $x_i$ the independent variables or predictors. Using the covariance structure of $\{y_i, x_i\}$, given by

$S_{p \times p} = \mathrm{cov}(x) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})'$

and

$s_{p \times 1} = \mathrm{cov}(y, x) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})$

PLS produces factors $\{t_i\}_{i=1}^{n}$ of uncorrelated variables of the independent data given by

$t_i = W_A' x_i$ (6)

where $t_i$ is an A-dimensional vector and the $(p \times A)$ matrix $W_A = [w_1, \dots, w_A]$ is called the weight matrix. Here, A is the number of components needed to adequately model the data $\{y_i, x_i\}_{i=1}^{n}$ based on some minimization or stopping rule. The conditions that ensure uniqueness of the weight matrix are inherent in the following definition.

Definition 5. For the data $\{y_i, x_i\}_{i=1}^{n}$, the PLS factors of $x_i$ are given in the matrix equation in (6) with

$w_k = \arg\max\{\mathrm{cov}^2(y, w'x):\ w'w = 1,\ \{w'Sw_j = 0\}_{j=1}^{k-1}\},\quad \text{for } k = 1, \dots, A$ (7)

As mentioned above, this is not the only definition of PLS that has appeared in the literature. It is, however, the same as presented in [12,37,38], and it has the intuitive advantage of being cosmetically very similar to the definitions of PCA and, for multivariate response, canonical correlation. Stone and Brooks [37] proved that the resulting factors are equivalent to those factors produced from PLS as defined by Helland [34]. In turn, Helland showed that his non-algorithmic form of PLS is equivalent to the original algorithmic form (NIPALS) due to H. Wold. Hinkle and Rayens [39] extended this definition to a multivariate response and showed that the other modes of PLS that result are equivalent to those derived from the other definitions.

The following theorem highlights the roles of S and s in the solutions to (7). A proof can be found in the Appendix.

Theorem 1. For the PLS factors given in Definition 5, the vector solutions of (7) are

$w_{k+1} = \frac{H_k s}{\|H_k s\|}$ (8)

for $0 \le k \le A - 1$, where $H_0 = I$ and $H_k = I - SW_k(W_k' S^2 W_k)^{-1} W_k' S$ for $k \ge 1$.

Notice that the derivation of each PLS component depends explicitly on the information in the crude covariance structure provided by S and s. Hence, it is clear that these components are susceptible to the same problems as PCA components when PLS is applied to compositional data. Following Aitchison’s contention that linear combinations of x should be replaced with logcontrasts in x, the following alternate definition is suggested for compositional data.
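To make the recursion of Theorem 1 concrete, here is a Python/numpy sketch (our own code, not the authors'; `pls_weights` is a hypothetical helper name). It builds the weight matrix column by column as $w_{k+1} \propto H_k s$, and by construction the weights satisfy the constraints $w'Sw_j = 0$ of Definition 5:

```python
import numpy as np

def pls_weights(X, y, A):
    """PLS weight vectors via the Theorem 1 recursion: w_{k+1} prop. to H_k s,
    with H_0 = I and H_k = I - S W_k (W_k' S^2 W_k)^{-1} W_k' S."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n, p = Xc.shape
    S = (Xc.T @ Xc) / n              # crude covariance of the predictors
    s = (Xc.T @ yc) / n              # cov(y, x)
    W = np.empty((p, 0))
    for k in range(A):
        if k == 0:
            h = s                    # H_0 = I
        else:
            SW = S @ W               # apply H_k s without forming H_k
            h = s - SW @ np.linalg.solve(SW.T @ SW, SW.T @ s)
        W = np.column_stack([W, h / np.linalg.norm(h)])
    return W

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 6))
y = X @ np.array([1.0, 0, 0, 2.0, 0, -1.0]) + rng.normal(size=50)
W = pls_weights(X, y, 3)

Xc = X - X.mean(axis=0)
S = (Xc.T @ Xc) / 50
assert abs(W[:, 0] @ S @ W[:, 1]) < 1e-8   # S-orthogonal weight vectors
```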

Definition 6. For the observational data $\{y_i, x_i\}_{i=1}^{n}$, where $x_i$ is a p-part composition, the logcontrast PLS (LCPLS) factors of $x_i$ are given by

$t_i = W_A' \log x_i$

where $W_A = [w_1, \dots, w_A]$ is the matrix of weightings with

$w_k = \arg\max\{\mathrm{cov}^2(y, w'\log x):\ w'w = 1,\ w'1 = 0,\ \{w'S^* w_j = 0\}_{j=1}^{k-1}\}$ (9)

where $S^* = \mathrm{cov}(\log x)$.

Note this definition is similar to Definition 5, but with the added constraint that the weighting vectors sum to zero. Owing to this added constraint, it is not immediately clear that an existing PLS algorithm will be appropriate. However, using the properties of the centered logratio transformation z, it is possible to show that to implement LCPLS, one only needs to replace S with $\Gamma = \sum_{i=1}^{n}(z_i - \bar{z})(z_i - \bar{z})'$ and s with $\gamma = \sum_{i=1}^{n}(z_i - \bar{z})(y_i - \bar{y})$ and employ an algorithm such as suggested by Theorem 1. A formal statement of this result follows, with the attending proof relegated to the Appendix.

Theorem 2. The logcontrast PLS factors defined in Definition 6 can be formed by constructing the PLS factors (Definition 5) of the centered logratio transformation of the composition $x_i$.
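Theorem 2 reduces LCPLS to ordinary PLS after a one-line preprocessing step. A Python/numpy sketch (our own illustrative code; `lcpls_weights` is a hypothetical name), which also confirms numerically that the resulting weight vectors automatically satisfy the logcontrast constraint $w'1 = 0$:

```python
import numpy as np

def lcpls_weights(X, y, A):
    """LCPLS via Theorem 2: the Theorem 1 recursion applied to the
    centered logratio transform z of the compositions in X."""
    logX = np.log(X)
    Z = logX - logX.mean(axis=1, keepdims=True)  # centered logratio rows
    Zc = Z - Z.mean(axis=0)
    yc = y - y.mean()
    G = Zc.T @ Zc                                # Gamma (unnormalized)
    g = Zc.T @ yc                                # gamma
    W = np.empty((Zc.shape[1], 0))
    for k in range(A):
        if k == 0:
            h = g
        else:
            GW = G @ W
            h = g - GW @ np.linalg.solve(GW.T @ GW, GW.T @ g)
        W = np.column_stack([W, h / np.linalg.norm(h)])
    return W

rng = np.random.default_rng(4)
w = rng.gamma(shape=4.0, size=(100, 4))
X = w / w.sum(axis=1, keepdims=True)             # 4-part compositions
y = np.log(X[:, 0] / X[:, 1]) + 0.1 * rng.normal(size=100)

W = lcpls_weights(X, y, 2)
assert np.allclose(W.sum(axis=0), 0.0, atol=1e-8)  # each column is a logcontrast
```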

5. Examples of logcontrast PLS

5.1. Example 1

In Kvalheim et al. [40] the proportions of phenanthrene and four monomethylphenanthrenes, extracted by gas chromatography from 15 coal samples, were recorded along with organic maturity, measured as the vitrinite reflectance of the sample. The primary purpose of that study was to use PCR to suggest factors representing the relative abundances of phenanthrene and the monomethylphenanthrenes (MP) that are, in turn, highly related to maturity. To accomplish this goal, OLS was used to regress maturity on two extracted PCR factors to produce


Table 2
Table of weights resulting from PLS and LCPLS on coal data

Independent variable      PLS w1    PLS w2    LCPLS w1   LCPLS w2
phenanthrene             -0.3149    0.6988    -0.1887     0.2493
3-methylphenanthrene      0.4136    0.0318     0.4800    -0.2507
2-methylphenanthrene      0.6484    0.1348     0.5775     0.3397
9-methylphenanthrene     -0.4987   -0.1935    -0.5419     0.4235
1-methylphenanthrene     -0.2463   -0.6745    -0.3269    -0.7618
Variance percentage         58%       30%        74%         8%

a new ‘rotated’ factor that is maximally correlated with maturity. The weights on the original variables were recovered and these weights suggested that maturity was best explained by its linear relationship with either the ratio [2-MP]/([1-MP] + [9-MP]) or ([3-MP] + [2-MP])/([1-MP] + [9-MP]).

Certainly the regression context suggests that PLS would be more appropriate for this task than PCR, and since the independent variables are closed, LCPLS may be even more appropriate than PLS. In fact, the emphasis on relating the relative abundances of phenanthrene and the monomethylphenanthrenes to maturity suggests even more strongly that logcontrasts are more relevant to the problem than are linear combinations.

For comparison, a two-factor PLS and a two-factor LCPLS analysis were performed on these data and the resulting weight vectors are recorded in Table 2. The percentages in the bottom row of the table represent the percent of total variability in the independent variables that has been explained by the corresponding factor. It should be reiterated that some statisticians would argue that the variance summaries under PLS are overstated (see e.g. Butler [5]), owing to their association with the crude covariance structure of the data. Regardless, the differences are striking, indicating that LCPLS is doing a much better job with one component than PLS and a comparable job with two.

With the original goal of Kvalheim et al. [40] in mind, maturity was regressed on the relevant factors and, in each case, a rotated factor was obtained. The implied weights on the original variables were recovered and are displayed in Table 3. The results of PLS agree with those given in the original study, suggesting that maturity is arguably ‘most’ related to either [2-MP]/([1-MP] + [9-MP]) or ([3-MP] + [2-MP])/([1-MP] + [9-MP]). However, the results offered by LCPLS are decidedly different, suggesting that maturity should be highly correlated with log([2-MP]/[1-MP]). In light of the methodology developed herein, some resolution of these differences would be necessary. Since the data are closed and since LCPLS explicitly uses ratios of the composition to build factors related to the response, it is certainly more tempting to believe the LCPLS results than those of PCR or PLS.

Table 3
Table of regression parameters resulting from regressing vitrinite reflectance on the PLS and LCPLS factors of the coal data

Independent variable      PLS loading   LCPLS loading
phenanthrene                 0.0075       -0.1594
3-methylphenanthrene         0.0418        0.7620
2-methylphenanthrene         0.0806        1.5132
9-methylphenanthrene        -0.0675       -0.7295
1-methylphenanthrene        -0.0687       -1.3862
r2                             90%           94%


Fig. 4. Ternary diagram of arctic lake data.

Fig. 5. PLS axis of arctic lake data.


Fig. 6. LCPLS axis of arctic lake data.

It should be noted that the regression on the LCPLS factors included an intercept. Since the logratio transformation removed the constraints of the composition, a more general model was possible. Discussion of the implications for and appropriateness of particular model specifications when compositions and transformations of compositions are involved is given in Hinkle and Rayens [39].

5.2. Example 2

Fig. 4 shows a ternary diagram of data adapted from Coakley and Rust [4]. Each point represents the composition of sand, silt, and clay at different depths in an Arctic lake. From the plot we see that a select group of samples exhibits low sand content and that the ratio of sand to clay is more variable than that of silt to clay, producing the slightly curved scatter plot. Thus a logratio transformation should both ‘straighten’ out this curvature and reflect the above-mentioned ratio relationships in the corresponding covariance structure.

Fig. 5 shows a plot of the first PLS axis based on the crude covariance structure of the observed data. Fig. 6 shows the first LCPLS axis computed using the centered logratio covariance. This curved direction in the logcontrast factor space was selected as having relevance to observed depth of the sample and at the same time to reflect the ratio relationships in the sampled compositions. It arguably fits the data better than the corresponding PLS axis, and any analysis based on the weights suggested by the curved fit will likely be more appropriate than a similar analysis based on the linear fit.

Acknowledgements

During the course of this research Professor Rayens was supported by NSF grant ATM-9108177.


Appendix A

A.1. Proof of Theorem 1

Suppose that we have the first $k$ solution vectors, $W_k = [w_1, \dots, w_k]$, of (7). By the Lagrange multiplier technique, let

$$\phi_{k+1}(w) = \operatorname{cov}^2(y, w'x) - \lambda(w'w - 1) - 2w'SW_k\theta$$

where $\lambda$ and $\theta = [\theta_1, \dots, \theta_k]'$ are Lagrange multipliers corresponding to the constraints in (7). Since $\operatorname{cov}(y, w'x) = w's$, setting the vector of partial derivatives of $\phi_{k+1}$ with respect to the elements of $w$ equal to zero gives

$$\frac{\partial \phi_{k+1}(w)}{\partial w} = ss'w - \lambda w - SW_k\theta = 0 \quad (10)$$

Premultiplication of (10) by $w'$ and solving for $\lambda$ gives

$$\lambda = w'ss'w \quad (11)$$

and premultiplication by $W_k'S$ and solving for $\theta$ gives

$$\theta = [W_k'S^2W_k]^{-1}W_k'S\,ss'w \quad (12)$$

Using (11) and (12) to simplify (10) results in the eigenvector problem

$$H_k ss'w = \lambda w$$

where $H_k = I - SW_k[W_k'S^2W_k]^{-1}W_k'S$. Since the matrix on the left is of rank one, there is only one nontrivial solution; hence $w_{k+1} \propto H_k s$.
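The recursion that closes this proof can be checked numerically. Below is a minimal Python sketch (assuming NumPy; the data are simulated and all variable names are our own) that builds the first $A$ weight vectors as $w_{k+1} \propto H_k s$ and verifies the property the Lagrange constraints enforce, namely that successive factors are mutually uncorrelated, so that $W'SW$ is diagonal:

```python
import numpy as np

# Simulated (x, y) data -- illustrative only, not from the paper
rng = np.random.default_rng(0)
n, p, A = 200, 5, 3
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

S = np.cov(X, rowvar=False)   # covariance of x (S in the proof)
s = np.cov(X.T, y)[:p, p]     # s = cov(x, y)

# Build the weight vectors by the recursion w_{k+1} proportional to H_k s,
# with H_k = I - S W_k [W_k' S^2 W_k]^{-1} W_k' S
W = np.zeros((p, 0))
for k in range(A):
    if k == 0:
        Hs = s
    else:
        SW = S @ W                        # SW.T @ SW equals W' S^2 W
        Hs = s - SW @ np.linalg.solve(SW.T @ SW, SW.T @ s)
    W = np.column_stack([W, Hs / np.linalg.norm(Hs)])

# The constraint w'SW_k = 0 makes the factors t = Xw uncorrelated,
# so W'SW should be diagonal
G = W.T @ S @ W
print(np.allclose(G, np.diag(np.diag(G))))  # True
```

The check mirrors the derivation step by step: premultiplying $H_k s$ by $W_k'S$ gives zero, which is exactly the off-diagonal pattern the script tests.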

A.2. Proof of Theorem 2

The logratio transformation $z_i$ is defined in Definition 3. To compute the $A$ PLS factors of $z_i$, given by

$$t_i = W'z_i, \quad i = 1, \dots, n,$$

we will use the covariance structures of the data $\{y_i, z_i\}_{i=1}^n$. These are

$$\Gamma = \sum_{i=1}^n (z_i - \bar{z})(z_i - \bar{z})' \quad \text{and} \quad \gamma = \sum_{i=1}^n (z_i - \bar{z})(y_i - \bar{y})$$

From Theorem 1 the weight vectors defining the factors are given by

$$w_{k+1} = \frac{H_k\gamma}{\|H_k\gamma\|}, \quad k = 0, \dots, A-1 \quad (13)$$

where $H_k = I - \Gamma W_k[W_k'\Gamma^2 W_k]^{-1}W_k'\Gamma$. Now, to see how computing the above weights and factors of $z_i$ is equivalent to doing LCPLS, notice from Eq. (2) that $1'z_i = 0$. This implies

$$0 = \operatorname{cov}(y, 1'z) = 1'\operatorname{cov}(y, z) = 1'\gamma$$

and, from (13), $1'w_l = 0$ for $l = 1, \dots, A$, since $1'\Gamma = 0$. Thus the weight vectors resulting from standard PLS on $\{y_i, z_i\}_{i=1}^n$ are contrasts; that is, they each sum to zero. Using this result and Eq. (5), we have the following relations:

$$\max_{a'a = 1,\; a'1 = 0} \operatorname{cov}^2(y, a'\log x) = \max_{a'a = 1,\; a'1 = 0} \operatorname{cov}^2(y, a'z) \quad (14)$$


and

$$\max_{b'b = 1} \operatorname{cov}^2(y, b'z) \ge \max_{a'a = 1,\; a'1 = 0} \operatorname{cov}^2(y, a'z) \quad (15)$$

But the maximizing vectors of the left side of (15), subject to the PLS constraints, are simply the weight vectors given by (13). Thus (15) is an equality, and hence the weights and factors of LCPLS are exactly the weights and factors computed above.
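The equivalence argued above can likewise be illustrated numerically: applying the Theorem 1 recursion to the centered-logratio covariances $\Gamma$ and $\gamma$ yields weight vectors that are automatically contrasts, exactly as the proof claims. A minimal sketch, assuming NumPy and hypothetical logistic-normal compositions (not the coal or arctic lake data):

```python
import numpy as np

# Hypothetical compositions via a logistic-normal construction
# (illustrative only, not data from the paper)
rng = np.random.default_rng(1)
n, p, A = 150, 4, 2
u = rng.normal(size=(n, p))
x = np.exp(u) / np.exp(u).sum(axis=1, keepdims=True)
y = np.log(x[:, 0] / x[:, 1]) + 0.1 * rng.normal(size=n)

# Centered logratio transform and its covariance structures Gamma, gamma
z = np.log(x) - np.log(x).mean(axis=1, keepdims=True)
Gamma = np.cov(z, rowvar=False)
gamma = np.cov(z.T, y)[:p, p]

# Theorem 1 recursion applied to (Gamma, gamma), as in Eq. (13)
W = np.zeros((p, 0))
for k in range(A):
    if k == 0:
        Hg = gamma
    else:
        GW = Gamma @ W
        Hg = gamma - GW @ np.linalg.solve(GW.T @ GW, GW.T @ gamma)
    W = np.column_stack([W, Hg / np.linalg.norm(Hg)])

# Because 1'Gamma = 0 and 1'gamma = 0, every weight vector is a contrast
print(np.allclose(W.sum(axis=0), 0.0))  # True
```

No explicit sum-to-zero constraint is imposed anywhere in the loop; the contrast property falls out of the covariance structure alone, which is the content of the theorem.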

References

[1] R.N. Thompson, J. Esson and A.C. Dunham, J. Petrol., 13 (1972) 219.
[2] P.F. Carr, Math. Geol., 13 (1981) 193.
[3] K.M. Love and A. Woronow, Chem. Geol., 93 (1991) 291.
[4] J.P. Coakley and B.R. Rust, J. Sediment. Petrol., 38 (1968) 1290.
[5] J. Butler, Math. Geol., 8 (1976) 25.
[6] E. Johansson, S. Wold and K. Sjodin, Anal. Chem., 56 (1984) 1685.
[7] R. Reyment, Chemom. Intell. Lab. Syst., 2 (1987) 79.
[8] V. Pawlowsky, Math. Geol., 24 (1989) 27.
[9] W. Windig, Chemom. Intell. Lab. Syst., 4 (1988) 201.
[10] R. Pell, M. Seasholtz and B. Kowalski, J. Chemom., 6 (1992) 52.
[11] N. Kettaneh-Wold, Chemom. Intell. Lab. Syst., 14 (1992) 57.
[12] H. Martens and T. Naes, Multivariate Calibration, Wiley, New York, 1989.
[13] J. Aitchison, The Statistical Analysis of Compositional Data, Chapman and Hall, New York, 1986.
[14] K. Pearson, Proc. R. Soc., 60 (1897) 489.
[15] F. Chayes, J. Geophys. Res., 65 (1960) 4185.
[16] J.E. Mosimann, Biometrika, 49 (1962) 65.
[17] E. Lukacs, Ann. Math. Stat., 26 (1955) 319.
[18] J.N. Darroch, Math. Geol., 1 (1969) 221.
[19] F. Chayes and W. Kruskal, J. Geol., 74 (1966) 692.
[20] J.O. Kork, Math. Geol., 9 (1977) 543.
[21] J. Aitchison, in C. Taillie, G.P. Patil and B. Baldessari (Eds.), Statistical Distributions in Scientific Work, Reidel, Dordrecht, 1981, pp. 147-156.
[22] R. Le Maitre, Geol. Soc. Am. Bull., 73 (1962) 1309.
[23] R. Le Maitre, J. Petrol., 9 (1968) 220.
[24] R. Reyment, Chemom. Intell. Lab. Syst., 3 (1988) 254.
[25] J. Aitchison, Biometrika, 70 (1983) 57.
[26] J. Aitchison, Math. Geol., 13 (1981) 175.
[27] J. Aitchison, J. R. Stat. Soc. Ser. B, 44 (1982) 139.
[28] J. Aitchison, Math. Geol., 16 (1984) 531.
[29] J. Aitchison, J. R. Stat. Soc. Ser. B, 47 (1985) 136.
[30] J. Aitchison, in J.M. Bernardo, M.H. De Groot, D.V. Lindley and A.F.M. Smith (Eds.), Bayesian Statistics 2, Elsevier, Amsterdam, 1985, pp. 15-32.
[31] W.S. Rayens and C. Srinivasan, J. Chemom., 5 (1991) 227.
[32] W.S. Rayens and C. Srinivasan, J. Chemom., 5 (1991) 361.
[33] H. Wold, in P.R. Krishnaiah (Ed.), Multivariate Analysis, Academic Press, New York, 1966, pp. 391-420.
[34] I. Helland, Commun. Stat. Simul. Comput., 17 (1988) 581.
[35] A. Lorber, L. Wangen and B. Kowalski, J. Chemom., 1 (1987) 19.
[36] I. Frank and J. Friedman, Technometrics, 35 (1993) 109.
[37] M. Stone and R. Brooks, J. R. Stat. Soc. Ser. B, 52 (1990) 237.
[38] A. Hoskuldsson, J. Chemom., 2 (1988) 211.
[39] J. Hinkle and W. Rayens, University of Kentucky Technical Report, 1994.
[40] O. Kvalheim, A. Christy, N. Telnaes and A. Bjorseth, Geochim. Cosmochim. Acta, 51 (1987) 1883.