Atmospheric Environment Vol. 27A, No. 13, pp. 1967-1974, 1993. 0004-6981/93 $6.00+0.00. Printed in Great Britain. © 1993 Pergamon Press Ltd

THE USE OF MONTE CARLO METHODS IN FACTOR ANALYSIS

P. KUIK, M. BLAAUW, J. E. SLOOF and H. TH. WOLTERBEEK

Delft University of Technology, Interfaculty Reactor Institute, Department of Radiochemistry, Mekelweg 15, 2629 JB Delft, The Netherlands

(First received 30 November 1992 and in final form 2 April 1993)

Abstract: Monte Carlo techniques are introduced in target transformation factor analysis (TTFA), in combination with the concept of the principal factor model, in order to account for local variances in the data set and to estimate the uncertainties in the obtained source profiles. The new method is validated using several types of artificial data sets. It was found that application of the Monte Carlo method leads to a significant improvement of the accuracy of the derived source profiles in comparison with standard TTFA. From the introduction of (known) error sources to the artificial data sets it was found that the source-profile reproduction quality is optimal if the magnitudes of the Monte Carlo variations are chosen equal to the magnitudes of the introduced errors.

Key word index: Factor analysis, target transformation, Monte Carlo methods.

INTRODUCTION

During the last few decades an increasing number of environmental pollution studies have employed multivariate statistical methods such as factor analysis to identify possible sources of pollution, to resolve the elemental composition of the sources and to determine the contribution of each source to the total pollution level (see, for example, Alpert and Hopke, 1980; Hopke, 1988; Sloof and Wolterbeek, 1991). Contrary to the chemical element balance method (CBM), which requires both the number and composition of the sources to be known in advance, factor analysis is the most appropriate choice to obtain the desired information in cases where no a priori information about these source properties is available. In an earlier publication, Sloof and Wolterbeek (1991) reported on the application of target transformation factor analysis (TTFA) in the analysis of large data sets of atmospheric pollution data. Using TTFA, several sources of air pollution could be successfully identified. However, some questions concerning the validity and reliability of the factor model remained unanswered. In particular the following problems required further investigation:

• validation of the factor analysis method in terms of its ability to produce the correct source profiles;
• determination of the uncertainties in the obtained solution that arise from uncertainties in the data set;
• choice of the number of factors to be used.

The present paper describes a new approach to TTFA which may contribute to the solution of the problems mentioned above. Essential topics in this study are the use of Monte Carlo techniques and the application of the principal factor model in factor analysis. By using this model, the obtained factor solutions take into account the unexplained local variances (noise) in the data set.

Up to now, most publications on factor analysis did not explicitly account for the uncertainties in the data set, neither by using local variances in a principal factor analysis, nor in the estimation of the resulting uncertainties in the factor solution. Hopke (1988) and Roscoe and Hopke (1981a) discussed two methods, among which the so-called jack-knifing method, to obtain estimates for the uncertainties in the factor loadings. However, although nearly as computationally intensive as the Monte Carlo approach, the jack-knifing method does not use any knowledge (if present) about the individual uncertainties in the data set. Instead, it estimates the uncertainties in the loadings by subsequently eliminating a sample from the data set, whereafter means and standard deviations of the obtained parameters are determined. It is doubtful whether the standard deviations thus obtained can be considered to be reasonable estimates of the true uncertainties.

Validation of the standard TTFA method by using artificial test-data sets has been described by several authors (e.g. Hwang et al., 1984; Hopke, 1988), yielding rather reasonable results. Following the same methods, it is interesting to study possible effects of the application of the Monte Carlo method on the accuracy with which the source profiles can be reproduced.

In the following sections the factor analysis method is introduced and a survey of the mathematical aspects of factor analysis as well as the basic calculational procedures is given, both to orient the reader and to introduce notation conventions. Thereafter, the Monte Carlo approach and its computational aspects are presented.



The validation of the new factor analysis approach is studied by performing factor analysis on simple, artificial data sets which were generated by a limited number of sources of known composition.

THE FACTOR MODEL

The mathematical factor analysis concept used in the present study is essentially based on the classical factor model, which is extensively treated by Harman (1976). We consider a data set of elemental concentrations determined at N sampling sites for a total of n elements. The concentration of the jth element (j = 1 ... n) at the ith sampling site (i = 1 ... N) is denoted by x_{ji}. It is convenient to transform the data set to the standardized variables z_{ji} by

    z_{ji} = (x_{ji} - \bar{x}_j) / \sigma_j    (1)

with the mean of element j given by

    \bar{x}_j = \sum_{i=1}^{N} x_{ji} / N    (2)

and its standard deviation by

    \sigma_j = \left( \sum_{i=1}^{N} (x_{ji} - \bar{x}_j)^2 / (N - 1) \right)^{1/2}.    (3)

As a result, the z_{ji} have a mean of zero and a variance of unity. In the factor model the z_{ji} are assumed to be a linear sum of m common factors (in our particular case to be interpreted as emission sources), with m ≤ n, which account for the correlations between the variables, and a unique contribution which is specific for each individual sampling site

    z_{ji} = \sum_{k=1}^{m} a_{jk} f_{ki} + d_j u_{ji}.    (4)

The coefficients a_{jk}, representing the correlation of element j with factor k, are indicative of the relative elemental composition of the factor k. These coefficients are often called the loadings of the factors. The coefficients f_{ki} are usually called the values of the factors. They represent the contribution of factor k to sample i. The product d_j u_{ji} represents the residual error for variable j in sample i, which is not accounted for by the m common factors. For a given sample i the z_{ji} can be conceived as a column vector with the standardized element concentrations as its components. Equation (4) can then be written as

    (z_{1i}, z_{2i}, ..., z_{ni})^T = f_{1i} \bar{a}_1 + f_{2i} \bar{a}_2 + ... + f_{mi} \bar{a}_m + (d_1 u_{1i}, d_2 u_{2i}, ..., d_n u_{ni})^T    (5)

where the m column vectors

    \bar{a}_k = (a_{1k}, a_{2k}, ..., a_{nk})^T    (k = 1 ... m)    (6)

contain the loadings (source composition) of the corresponding factor k.

Similar to the standardized variables z_{ji}, the f_{ki} and the u_{ji} have variances of unity, giving the following relations

    h_j^2 = \sum_{k=1}^{m} a_{jk}^2,    h_j^2 + d_j^2 = 1    (j = 1 ... n).    (7)

Equation (7) shows how the total variance of each variable j (which is normalized to unity) is split into two distinct parts. The communality h_j^2 represents the fraction of the variance of variable j which is to be explained by the factor model, while the uniqueness d_j^2 represents the remaining unexplained fraction of the variance. Because this unexplained variance is reflected in measured variations on a local scale (i.e. multiple measurements within a single sampling site), the uniqueness is referred to as the "local variance" in the present paper. The coefficients u_{ji} indicate how each individual sample i contributes to the uniqueness. For the complete data set, equation (4) can be expressed in matrix form

    Z = AF + DU    (8)

where Z is an n × N matrix with components z_{ji}; A an n × m matrix with components a_{jk}; F an m × N matrix with components f_{ki}; D a diagonal n × n matrix with components d_j on the diagonal; and U an n × N matrix with components u_{ji}.

The uniqueness can be determined in advance from the uncertainties in the individual elemental concentrations. From equations (2) and (3) it follows that the total amount of variance of element j, denoted by V_{TOT,j}, is given by

    V_{TOT,j} = \sigma_j^2 = \sum_{i=1}^{N} (x_{ji} - \bar{x}_j)^2 / (N - 1).    (9)

The amount of variance due to local variations in the data set can be expressed by

    V_{LOC,j} = \sum_{i=1}^{N} (\Delta x_{ji})^2 / N    (10)

where \Delta x_{ji} is the total, absolute uncertainty in the concentration x_{ji} of element j at sampling site i.
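The standardization of equations (1)-(3) can be sketched in a few lines. The sketch below is an illustrative NumPy translation (not the authors' FORTRAN-77 code); it assumes the data set is held as an n × N array with one row per element and one column per sampling site.

```python
import numpy as np

def standardize(X):
    """Transform an n x N concentration matrix X (rows: elements j,
    columns: sampling sites i) to the standardized variables z_ji of
    equations (1)-(3): zero mean and unit variance per row."""
    xbar = X.mean(axis=1, keepdims=True)          # equation (2)
    sigma = X.std(axis=1, ddof=1, keepdims=True)  # equation (3), (N - 1) norm
    Z = (X - xbar) / sigma                        # equation (1)
    return Z, xbar.ravel(), sigma.ravel()
```

The returned means and standard deviations are kept because they are needed later to transform the factor loadings back to the concentration domain.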


The uniqueness d_j^2 is thus given by

    d_j^2 = V_{LOC,j} / V_{TOT,j} \approx \sum_{i=1}^{N} (\Delta x_{ji})^2 / \sum_{i=1}^{N} (x_{ji} - \bar{x}_j)^2    (11)

in which use has been made of the approximation (N - 1) ≈ N for sufficiently large N.

The major goals of factor analysis are (1) to determine m, the number of factors needed to describe the experimental data satisfactorily, and (2) to determine the matrix of factor loadings A. Assuming the existence of m common factors, a first direct solution for A is obtained from the principal factor method, which mainly consists of the diagonalization of the n × n matrix C of reduced correlations between variables (here: elements), with components

    c_{jk} = \frac{1}{N} \sum_{i=1}^{N} z_{ji} z_{ki} - \delta_{jk} d_j^2    (12)

where \delta_{jk} = 1 for j = k and \delta_{jk} = 0 for j ≠ k.

If d_j^2 is set to zero for all j, the analysis is called a principal components analysis. In that case the common factors must account for all variance in the data set. When no a priori information on the uniqueness is available, principal components analysis is in fact the only way to obtain a first solution. If, however, the uniqueness can be determined in advance from the uncertainties in the individual elemental concentrations as indicated in equation (11), then the principal factor method is the appropriate choice.

If Q is the n × m matrix of eigenvectors of C, it can be shown that a possible solution for A is given by

    a_{jk} = q_{jk} (\lambda_k)^{1/2}    (13)

satisfying the relationship

    C = Q \Lambda Q^T = A A^T    (14)

where \Lambda is the diagonal matrix of eigenvalues \lambda_k of C. It should be noted that if A is a solution of equation (14), then every matrix A' = AT, with T an orthogonal transformation satisfying the relationship T^T = T^{-1}, will be a solution also.

In order to be useful as source profile vectors, the factor loadings of the direct solution A must be transformed to a new solution B = AR, with R the transformation matrix. The transformation must satisfy the following conditions:

• the factor loadings should not contain negative element concentrations (obviously a source cannot emit negative amounts of any element);
• the factor loadings must still explain the original correlations in the data set.

This transformation procedure is commonly called "target transformation". It was originally introduced by Weiner et al. (1970) and was further employed by several authors, e.g. Alpert and Hopke (1980), Roscoe and Hopke (1981b). With this transformation one can specify a target matrix which should more or less resemble the desired solution. The algorithm then finds a transformation R which provides a least-squares fit between the target matrix and the transformed matrix of factor loadings. The type of transformation considered in the present study is the orthogonal Procrustes rotation (Schönemann, 1966). By modifying the obtained rotated solution according to the conditions mentioned above (in particular the condition of non-negative contributions) and using this modified rotated solution as a new target matrix for the next calculation, one obtains an iterative process. The iteration ends if the differences between the target matrix and the matrix of the rotated solution have become "small" (the relative difference for each matrix element should be smaller than, for example, 10^{-4}). In practice the process is stopped after a fixed number (100) of iterations, which was found to be sufficient in all cases. Several types of target matrices have been considered, among which:

• a forced-positive direct solution (negative matrix coefficients set to zero);
• a "simple" target matrix consisting of unit vectors with only one element set to unity and all others to zero. The unrotated direct solution A is used to guide the choice of the elements to be set to unity (selecting elements with maximum correlation).

It was found that various types of target matrices yield essentially identical solutions, even in the case of real, complicated data sets. This indicates that the solution thus found can be considered to be reasonably correct. Further investigations concerning the correctness of the solution can only be made by using artificial test-data sets, as will be discussed later on, or by validation using independent knowledge on source profiles (e.g. Roscoe et al., 1984). All calculations presented here have been performed with the forced-positive direct-solution target matrix.

Because of the normalization defined by equation (1), the factor loadings are obtained in terms of correlations and as such they have only a mathematical meaning. After target transformation, the obtained final solution B must be transformed back to the concentration domain by multiplying row j of B with the standard deviation \sigma_j. Then the components b_{jk} represent the concentration of element j in factor k. Each column of B is then scaled in order to sum up to a total of 1,000,000:

    \sum_{j=1}^{n} b_{jk} = 1{,}000{,}000    (k = 1 ... m).    (15)

Alternatively, the columns of B can be scaled to a value of 100.0 for the so-called pilot element of the factor in order to enhance interpretability. The pilot element is the element with the largest loading in the correlation domain (before back-transformation to the original basis). Hence it is the most characteristic element in the factor.

In the final step of the analysis the matrix of factor values F is determined using a linear least-squares method.
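The two computational kernels described above — the principal factor diagonalization of equations (12)-(13) and the orthogonal Procrustes rotation — can be sketched as follows. This is an illustrative NumPy translation, not the IMSL-based routines (DPRINC, DFOPCS) used by the authors; the SVD route to the Procrustes solution is the standard construction of Schönemann (1966).

```python
import numpy as np

def principal_factor_solution(Z, d2, m):
    """First direct solution A, equations (12)-(13): diagonalize the
    reduced correlation matrix C and keep the m leading eigenpairs.
    Z is the n x N standardized data set, d2 the vector of
    uniquenesses d_j^2; d2 = 0 gives principal components analysis."""
    n, N = Z.shape
    C = Z @ Z.T / N - np.diag(d2)             # equation (12)
    lam, Q = np.linalg.eigh(C)                # eigenvalues, ascending
    idx = np.argsort(lam)[::-1][:m]           # m largest eigenvalues
    return Q[:, idx] * np.sqrt(np.clip(lam[idx], 0.0, None))  # equation (13)

def procrustes_rotate(A, target):
    """Orthogonal Procrustes rotation: apply the orthogonal R that
    minimizes ||A R - target||_F, obtained from the SVD of A^T target."""
    U, _, Vt = np.linalg.svd(A.T @ target)
    return A @ (U @ Vt)
```

Iterating `procrustes_rotate`, clipping negative loadings to zero, and feeding the result back in as the next target reproduces the iterative target-transformation loop described in the text.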


Due to the standardization given in equation (1), for each sampling site i the vector of factor values \bar{f}_i, with components f_{1i} ... f_{mi}, is obtained from

    \bar{f}_i = (B^T W B)^{-1} B^T W (\bar{x}_i - \bar{x}_{av}).    (16)

Here W is a diagonal weight matrix with the inverse of the squared uncertainties in the elemental concentrations as diagonal elements, \bar{x}_i is the vector of elemental concentrations and \bar{x}_{av} is the vector of concentration averages

    \bar{x}_i = (x_{1i}, x_{2i}, ..., x_{ni})^T;    \bar{x}_{av} = (\bar{x}_1, \bar{x}_2, ..., \bar{x}_n)^T.    (17)

Consequently, the resulting f-values can be positive as well as negative. If, however, it turns out that also the vector \bar{x}_{av} can be fitted adequately by equation (16), then one should obtain proper positive f-values by substituting \bar{x}_i for (\bar{x}_i - \bar{x}_{av}) in equation (16). Exclusively positive solutions for \bar{f}_i are then obtained by an iterative least-squares method in which parameters that tend to become negative are forced to zero.

The required number of factors m is generally determined by one of the following methods:

• Observation of a significant drop in magnitude between the mth and (m+1)th eigenvalue of the correlation matrix C. If there are m factors, then the rank of the matrix of factor loadings A is m and consequently the rank of the n × n matrix C is also m (see equation (14)). Hence it follows that C can have at most m positive eigenvalues, whereas the remaining (n - m) eigenvalues have small positive and negative values summing up to zero.

• Consider the matrix of reproduced correlations C' = AA^T as a function of m and define a chi-square \chi^2(m) which indicates how well C' approaches the matrix of observed correlations C

    \chi^2(m) = \frac{1}{V} \sum_{j=1}^{n} \sum_{k=1}^{n} (c_{jk} - c'_{jk})^2    (18)

with V the number of degrees of freedom, given by

    V = \frac{1}{2} (n - m)^2 + n - m.    (19)

It is expected that \chi^2(m) reaches a minimum for the optimal value of m, or at least remains constant for larger values of m.

It must be stressed, however, that there exists no statistical or practical test which in all cases automatically predicts the correct number of factors with predefined probability. A certain amount of the modeller's judgement remains necessary in choosing the number of factors.

THE MONTE CARLO APPROACH

The main purpose of the Monte Carlo approach as it is used in our factor analysis process is to determine the uncertainties in the obtained factor loadings and to gain more insight into the stability of the solution. The basic idea is to generate a large number (typically 500 in our case) of modified data sets X' with concentrations x'_{ji}, in which all element concentrations are slightly altered in a random way, and subsequently perform factor analysis on these data sets. The magnitudes of the normally distributed random deviations are chosen in accordance with the uncertainties in the original element concentrations:

    x'_{ji} = x_{ji} + u \cdot \Delta x_{ji}    (20)

in which \Delta x_{ji} is the total absolute uncertainty in the original concentration x_{ji} and u = N(0; 1) a normally distributed random deviate with a mean of zero and a standard deviation of unity. The estimation of the uncertainties \Delta x_{ji}, which is of crucial importance to the outcome of the Monte Carlo calculations, requires a thorough evaluation of all possible sources of error in the data set under study, for which hardly any general rules can be given since they are very specific for each particular type of data set.

Each modified data set X' is subjected to factor analysis as described in the previous section, resulting in a new matrix of rotated factor loadings B' and a new matrix of factor values F' for each modified data set. The local variances required for the principal factor analysis are determined in advance from the uncertainties in the original data set, using equation (11). The obtained set of local variances is used for each modified data set. In order to properly study the behaviour of the parameters b'_{jk}, the m columns of the matrix B' have to be identified with corresponding columns of the original matrix B, and possible column permutations in the matrix B' (which have been found to occur frequently) have to be corrected. It is important to note that a solution with a deviating column arrangement need not be essentially "different" from the original solution. In most cases, each column (source profile) of B' can be uniquely identified with a corresponding column of the original solution. Hence, both columns can be considered to represent the same type of source profile. If every column of B' can thus be uniquely identified with a column of B, then both matrices represent the same set of source profiles and are therefore equivalent.

The column identification is accomplished by the method of geometrical correspondence: two vectors coincide if the angle between them is zero. This angle is given by

    \cos(\varphi_{jk}) = \frac{\sum_{i=1}^{n} b'_{ij} b_{ik}}{\left( \sum_{i=1}^{n} b'^2_{ij} \sum_{i=1}^{n} b^2_{ik} \right)^{1/2}}.    (21)

Column vector j of B' is compared with all column vectors k = 1 ... m of B. The particular k resulting in the smallest angle is chosen to be identified with j. In order to avoid scaling problems, the matrices B and B' are compared in the correlation domain, before back-transformation to the concentration domain.
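A minimal sketch of one Monte Carlo step (equation (20)) and of the geometrical-correspondence column matching (equation (21)) is given below. The NumPy functions, their names, and the handling of identification conflicts by returning `None` are illustrative choices of ours, not the authors' implementation.

```python
import numpy as np

def perturb(X, dX, rng):
    """One Monte Carlo step, equation (20): add a normally distributed
    deviate u = N(0; 1), scaled by the absolute uncertainty dX, to
    every concentration in X."""
    return X + rng.standard_normal(X.shape) * dX

def match_columns(B_new, B_ref):
    """Geometrical correspondence, equation (21): identify each column
    of B_new with the B_ref column subtending the smallest angle,
    i.e. the largest cosine. Returns the permutation, or None when two
    columns claim the same reference column (identification conflict)."""
    cosines = (B_new / np.linalg.norm(B_new, axis=0)).T \
              @ (B_ref / np.linalg.norm(B_ref, axis=0))
    perm = np.argmax(cosines, axis=1)
    return None if len(set(perm)) != len(perm) else perm
```

In a full run, each perturbed data set would be factor-analysed, its loading matrix permuted back into the reference column order with `match_columns`, and the per-loading means and standard deviations accumulated over all accepted steps.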


In a minor number of cases it has been found that two different columns j and j' of the matrix B' tended to be identified with the same column k of the matrix B, giving rise to an identification conflict. This was reason to exclude these results from further calculations, but the occurrence of such identification conflicts was recorded for statistical reasons. After correction of the column permutations, the matrix B' is transformed back to the concentration domain and a new matrix of factor values F' can be calculated.

A final solution for B' is obtained by calculating the mean and standard deviation of each parameter b'_{jk} after completion of all Monte Carlo steps. Moreover, during the Monte Carlo process, the number of zero values is recorded in order to enable the calculation of a probability for zero values P_{jk}(0) for each of the parameters. Alternatively, P_{jk}(0) could be calculated from the obtained values for the mean and standard deviation, assuming a Normal (Gaussian) distribution function. The averaged loadings are considered to be significant if they are non-zero for more than 95% of the generated data sets. Likewise, means and standard deviations can be determined for each of the parameters f_{ki} of F'. However, in practice only the solution for B' is obtained from the Monte Carlo process, while the final solution for the F matrix is obtained by fitting the original data set X with the averaged solution B' from the Monte Carlo process.

It was found that the absolute minimum number of Monte Carlo steps required for reasonable reliability was of the order of 200, but in order to be on the safe side, 500 steps was chosen as a standard. Using 500 steps, the reproducibility of the results was found to be very good. However, when using new, different data sets, it might be necessary to re-investigate these assumptions.

Software implementation

All calculations are performed by software developed in FORTRAN-77, running in batch mode on a Digital VAX-3100 workstation of the institute's central computer system. Several subroutines from the IMSL mathematical and statistical software library (IMSL, 1987) have been used, such as DPRINC (first direct factor solution), DFOPCS (orthogonal Procrustes rotation) and DRNNOR (generation of normally distributed random numbers). Routines for data handling and normalization, constructing the correlation matrix, performing simple statistical calculations and performing factor identification are all user-written. Although realistic data sets may contain up to 60 element concentrations for each sampling site, the number of variables (here: element concentrations) had to be limited to a maximum of 20 for practical reasons (i.e. available CPU-time, memory limitations). These 20 elements should be selected by their relevance to the problem to be studied (e.g. environmental pollution).

VALIDATION TESTS

Application of the factor analysis procedures as described in the previous sections raises the question as to how well the given solution for the loadings represents the "true" sources that are thought to constitute the data set. This question is studied by performing factor analysis on simple, artificial data sets that are generated by a limited number of sources of known composition, following the model

    x_{ji} = \sum_{k=1}^{m} a_{jk} f_{ki}    (22)

in which the a_{jk} are the known source compositions and the f_{ki} are uniformly distributed random numbers between 0.1 and 1000 representing the factor values. In our particular case we chose n = 10 as the number of variables (elemental concentrations), m = 2 as the number of factors (sources) and N = 200 as the number of sampling points. The source compositions are generated following the approach of Hopke (1988), who performed similar validation tests on his TTFA algorithm. The general form of the source compositions is given by

    \bar{a}_1 = (A_1 (1-F), A_2 F, A_3 (1-F), ..., A_{10} F)^T;    \bar{a}_2 = (A_1 F, A_2 (1-F), A_3 F, ..., A_{10} (1-F))^T.    (23)

Here the A_j, with j = 1 ... 10, are uniformly distributed random numbers between 0 and 1, and F is a parameter which determines the degree of collinearity between the two sources. For F = 0 the sources are fully orthogonal, while for F = 0.5 both sources are identical (completely collinear). In the following discussion the A_j F components of the profiles will be referred to as "small" values, while the A_j (1-F) components will be referred to as "large" values.

For each sampling point the source profiles can be perturbed by adding a normally distributed error to each loading a_{jk} (profile variation), with the standard deviation of the normal distribution being a given percentage of the loading. In addition, the generated concentrations x_{ji} can be perturbed by a normally distributed "analytical error". The standard deviation of this normal distribution is equal to a given percentage of the concentration x_{ji}. Finally, each generated profile is scaled so that it sums up to a total of 1,000,000. In order to be able to perform principal factor analysis, uncertainties \Delta x_{ji} have to be deduced for each generated concentration x_{ji}. These uncertainties are calculated using

    \Delta x_{ji}^2 = (\sigma_a x_{ji})^2 + \sum_{k=1}^{m} (\sigma_p a_{jk} f_{ki})^2    (24)

where \sigma_a is the relative standard deviation of the analytical error and \sigma_p is the relative standard deviation of the profile variation.
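The generation of one artificial two-source data set following equations (22)-(24) can be sketched as below. This is a NumPy illustration under stated assumptions: the function name and argument defaults are ours, the alternation of large and small components over odd and even elements follows equation (23), and the two sources (m = 2) are built in directly.

```python
import numpy as np

def make_test_data(rng, F=0.2, n=10, N=200, sig_a=0.0, sig_p=0.0):
    """Generate an artificial two-source data set, equations (22)-(24):
    collinearity parameter F, uniform factor values in [0.1, 1000],
    optional profile variation sig_p and analytical error sig_a
    (both given as relative standard deviations)."""
    A_j = rng.uniform(0.0, 1.0, n)
    odd = np.arange(n) % 2 == 0                     # elements j = 1, 3, ... (1-based)
    a1 = np.where(odd, A_j * (1.0 - F), A_j * F)    # equation (23)
    a2 = np.where(odd, A_j * F, A_j * (1.0 - F))
    A = np.column_stack([a1, a2])
    A *= 1_000_000 / A.sum(axis=0)                  # scale profiles to 1,000,000
    f = rng.uniform(0.1, 1000.0, (2, N))            # factor values f_ki
    # per-sample profile variation: a'_jk = a_jk (1 + sig_p u)
    Ap = A[:, :, None] * (1.0 + sig_p * rng.standard_normal((n, 2, N)))
    X = np.einsum('jki,ki->ji', Ap, f)              # equation (22)
    X *= 1.0 + sig_a * rng.standard_normal((n, N))  # analytical error
    dX = np.sqrt((sig_a * X) ** 2
                 + np.sum((sig_p * A[:, :, None] * f[None, :, :]) ** 2, axis=1))  # eq. (24)
    return X, dX, A
```

With F = 0 the two generated profiles are exactly orthogonal, and with sig_a = sig_p = 0 the data set is error-free, which reproduces the two limiting cases used in the validation tests.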


The first type of data set that has been investigated was generated by orthogonal sources (F = 0). In general it was found that TTFA applied to these types of data sets almost perfectly reproduces the sources that constitute the data set. The second type of data set used in testing the TTFA algorithm was generated by two non-orthogonal sources, obtained with F = 0.2. Hopke (1988) has discussed the crucial role of profile variation in the ability of his TTFA algorithm to reproduce non-orthogonal profiles, as well as the negative effects of the addition of analytical error to the generated concentrations. We have checked whether our TTFA algorithm exhibits a similar behaviour. In order to quantify the quality of the source-profile reproduction, we first introduce a chi-square-like indicator function Q^2 which expresses the reproduction quality as a single numerical value.

Two different indicator functions have been considered. The first one is obtained by adding the squared relative deviations of the loadings

    Q_k^2 = \sum_{j=1}^{n} \left( \frac{a'_{jk} - a_{jk}}{a_{jk}} \right)^2    (25)

in which the a_{jk} denote the original loadings and the a'_{jk} the reproduced loadings, both scaled to column totals of 1,000,000. The second indicator function is given by

    Q_k^2 = \sum_{j=1}^{n} \log^2 (a'_{jk} / a_{jk}).    (26)

For small deviations |a'_{jk} - a_{jk}| ≪ a_{jk} both versions are nearly equal, and in the case of a perfect reproduction we obtain Q^2 = 0. However, the function defined by equation (26) complies better with our intuitive way of looking at differences between source profiles, because it judges reproduction quality by considering the ratios (a'_{jk}/a_{jk}). This implies, for example, that a value a'_{jk} = 2 a_{jk} is considered equally "bad" as a value a'_{jk} = 0.5 a_{jk}. The indicator function defined by equation (25) underestimates the influence of negative deviations (a'_{jk} < a_{jk}), because the relative deviation can never become smaller than -1, while it can become much larger than +1. Therefore we have chosen equation (26) as the indicator function for the reproduction quality. For practical reasons, very small (or zero) values (< 10^{-6}) of the ratio (a'_{jk}/a_{jk}) are forced to an (arbitrarily chosen) value of 10^{-6}, while very large values (> 10^{+6}) are forced to 10^{+6}.

In general we found improvement in the quality of the profile reproduction when profile variations of 1-5% were used in the generation of the data set. A typical case showing this behaviour is presented in Table 1. The numbering of the profiles in this table is consistent with the functional model given by equation (23), i.e. "profile 1" is associated with \bar{a}_1 and "profile 2" with \bar{a}_2. The introduction of 3% of profile variation in data set B reduces the value of Q^2 of the badly reproduced profile (profile 2) by nearly a factor of 2 when compared with the error-free data set A, while the reproduction quality of the well-reproduced profile (profile 1) remains fairly unchanged. The results published by Hopke (1988) for the so-called R-mode case show an analogous behaviour when they are subjected to the indicator function defined by equation (26).

Contrary to Hopke, however, we found that the introduction of 1-5% of analytical error (instead of profile variation) generally resulted in similar improvements of the profile reproduction. The combined introduction of both error sources (as in data set C in Table 1) resulted in about the same improvement as was found for each error source alone. It has been observed that the introduction of either type of error source (in the case of Table 1 the addition of analytical error) sometimes leads to "swapping" of the well-reproduced profile and the badly reproduced profile, as can be seen from the changes in Q^2 in Table 1. Other investigated test-data sets showed the swapping phenomenon after the introduction of profile variation. From these results it can be concluded that there exists a "well-reproduced" and a "badly reproduced" version of each profile.

For all investigated test-data sets it was found that the final solution, obtained after completion of a large number of Monte Carlo iterations of the data set, appears to be by far more accurate than the single solution. This is illustrated by the results presented in Table 2. From the decrease in the Q^2-values it can be seen that both profiles are reproduced substantially better by the Monte Carlo solution than by the single solution. Basically, the introduction of random variations of the data set by the Monte Carlo process can be seen as the introduction of additional analytical error. Detailed investigations of the individual solutions that contribute to the average Monte Carlo

From numerous tests we have found that for the solution have shown that these (single) solutions can so-called "single" solutions (i.e. solutions obtained be divided into two distinct classes, of which the directly, without using the Monte Carlo process), gen- solutions for data sets B and C in Table 1 are typical erally only one of the source profiles is reproduced representatives. Thus, for each of the profiles, the reasonably well, whereas the other (badly reproduced) Monte Carlo solution is just the average of a large source profile more or less tends to the corresponding number of "well-reproduced" and "badly reproduced" orthogonal case with F ~ 0 . There seems to be a tend- versions of the original profile (due to the swapping ency to increase the contrast between the alternating phenomenon mentioned above). Because the "well- "small" (AjF) and "large" (A~(1- F)) values in the reproduced" version has a decreased contrast between badly reproduced profile (tending to a 0, 1, 0, alternating values in the profile and the "'badly repro- 1 ... pattern, which means F ~0) and to decrease the duced" version has an increased contrast, the average contrast in the other profile (tending to a 1, 1, 1, Monte Carlo solution is likely to be more accurate 1 ... pattern, which means F,~0.5). than a single "well reproduced" version. The relative
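For concreteness, the chosen indicator of equation (26), including the scaling to column totals of 1,000,000 and the clamping of extreme ratios, can be sketched as follows. This is a minimal illustration, not the authors' code; the array layout and the base-10 logarithm are our assumptions, since the paper does not state the logarithm base.

```python
import numpy as np

def q2_indicator(a_orig, a_repr):
    """Reproduction-quality indicator, equation (26) (sketch).

    a_orig, a_repr : arrays of shape (m, p), one column per source
    profile.  Columns are first scaled to totals of 1,000,000 as in
    the paper.  Ratios a'_jk/a_jk below 1e-6 (including zeros) or
    above 1e+6 are forced to those bounds.  Base-10 logarithm is an
    assumption; the paper does not specify the base.
    """
    a_orig = 1e6 * a_orig / a_orig.sum(axis=0)
    a_repr = 1e6 * a_repr / a_repr.sum(axis=0)
    # zero original loadings give ratio 0, which the clip then
    # forces to the lower bound 1e-6, as described in the text
    safe = np.where(a_orig > 0, a_orig, np.inf)
    ratio = np.clip(a_repr / safe, 1e-6, 1e6)
    return float(np.sum(np.log10(ratio) ** 2))
```

A perfect reproduction gives Q2 = 0, and a doubled loading contributes the same penalty as a halved one, which is the symmetry argued for in the text.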


Table 1. Single-solution results obtained from factor analysis of three artificial data sets generated by two non-orthogonal factors with parameter F = 0.2. Data set A was generated without profile variation and analytical error, data set B with 3.0% profile variation, and data set C with 3.0% profile variation and 2.5% analytical error. Factor values f_ki are identical in each data set

Original    Data set A        Data set B        Data set C
profile     single solution   single solution   single solution

(a) Profile 1

 22,740      47,439            46,320                 0
319,900     267,683           268,248           364,006
 28,990      60,576            60,485               519
 57,910      48,451            48,867            66,934
 59,930     125,237           124,362             3,219
  5,790       4,845             4,864             6,732
 15,820      33,024            32,988               945
324,100     271,228           272,315           371,578
  2,860       5,974             5,957               158
162,000     135,544           135,595           185,910

Q2 for reproduction
              0.541             0.530            43.764

(b) Profile 2

123,100     174,471           171,619           108,625
108,300         150             2,206           135,938
157,000     222,384           221,679           141,470
 19,600           0                 0            24,516
324,500     459,648           457,904           289,175
  1,960           0                 8             2,429
 85,630     121,290           119,631            76,403
109,700           3             5,190           140,111
 15,490      21,951            21,728            13,973
 54,820         102                36            67,359

Q2 for reproduction
             92.697            56.512             0.059
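Data sets such as A-C can be generated along the following lines. This is a schematic stand-in, not the paper's functional model of equation (23): the alternating "small" (A_j F) and "large" (A_j(1 − F)) profile entries, the factor values, and the two noise sources (relative profile variation and analytical error) are modelled with hypothetical distributions and sizes of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data_set(F=0.2, n_samples=50, m=10,
                  profile_var=0.0, anal_err=0.0):
    """Schematic two-source test-data generator (our sketch, not the
    paper's equation (23)).  Profile 1 alternates large/small entries
    A_j*(1-F), A_j*F; profile 2 is the complement, so F = 0 yields
    orthogonal sources.  profile_var and anal_err are relative
    standard deviations (e.g. 0.03 for 3%)."""
    A = rng.uniform(1.0, 10.0, m)            # base amplitudes A_j
    alt = np.arange(m) % 2                   # 0, 1, 0, 1, ...
    p1 = A * np.where(alt == 0, 1 - F, F)    # profile 1
    p2 = A * np.where(alt == 0, F, 1 - F)    # profile 2
    profiles = np.column_stack([p1, p2])     # shape (m, 2)
    scores = rng.uniform(0.5, 2.0, (2, n_samples))
    X = np.empty((m, n_samples))
    for i in range(n_samples):
        # per-sample profile variation: jitter each profile entry
        pv = profiles * (1 + profile_var * rng.standard_normal((m, 2)))
        X[:, i] = pv @ scores[:, i]
    # analytical (measurement) error on the generated concentrations
    X *= 1 + anal_err * rng.standard_normal(X.shape)
    return X, profiles
```

With F = 0 the two profiles have disjoint non-zero entries and are exactly orthogonal, the first case discussed above.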

Table 2. Results obtained from factor analysis of an artificial data set generated by two non-orthogonal factors with parameter F = 0.2 and 3.0% profile variation (see text). Monte Carlo results have been obtained after 500 iterations. For the Monte Carlo solution the obtained relative uncertainties in the loadings are also given

Original    Single      Monte Carlo   Relative
profile     solution    solution      uncertainty

(a) Profile 1

 22,740      46,320      27,089       1.035
319,900     268,248     309,031       0.061
 28,990      60,485      35,838       1.018
 57,910      48,867      56,368       0.059
 59,930     124,362      73,280       1.024
  5,790       4,864       5,608       0.060
 15,820      32,988      19,702       1.003
324,100     272,315     313,162       0.062
  2,860       5,957       3,544       1.011
162,000     135,595     156,377       0.060

Q2 for reproduction
              0.530       0.041

(b) Profile 2

123,100     171,619     130,500       0.058
108,300       2,206      89,764       0.938
157,000     221,679     168,731       0.060
 19,600           0      16,072       0.952
324,500     457,904     348,151       0.059
  1,960           8       1,605       0.948
 85,630     119,631      91,126       0.062
109,700       5,190      92,930       0.918
 15,490      21,728      16,545       0.061
 54,820          36      44,576       0.952

Q2 for reproduction
             56.512       0.039
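The Monte Carlo procedure behind the averaged solution and the relative uncertainties in Table 2 can be sketched as follows. The TTFA step itself is abstracted into a caller-supplied `solve` routine (a hypothetical name); the perturbation follows equation (27), x'_ji = x_ji + η_s · u · Δx_ji with u = N(0, 1), where η_s = 1 is the standard case.

```python
import numpy as np

rng = np.random.default_rng(1)

def monte_carlo_fa(X, dX, solve, n_iter=500, eta_s=1.0):
    """Monte Carlo-assisted factor analysis (sketch).

    X     : data matrix; dX: its absolute uncertainties (same shape)
    solve : caller-supplied routine returning the loadings for one
            (perturbed) data set -- the TTFA step is not reproduced here
    eta_s : scaling of the perturbations, equation (27); eta_s = 1
            matches the stated uncertainties, eta_s = 0 would give
            the single solution.
    Returns the element-wise mean loadings and their relative
    uncertainties (std/|mean|), as reported in Table 2."""
    solutions = []
    for _ in range(n_iter):
        u = rng.standard_normal(X.shape)     # u ~ N(0, 1)
        X_prime = X + eta_s * u * dX         # equation (27)
        solutions.append(solve(X_prime))
    A = np.stack(solutions)                  # (n_iter, m, p)
    mean = A.mean(axis=0)
    rel_unc = A.std(axis=0, ddof=1) / np.abs(mean)
    return mean, rel_unc
```

Note that this sketch simply averages the raw single solutions; it does not explicitly handle the profile-swapping phenomenon discussed above, which in the paper is precisely what makes the averaged solution more accurate than any single solution.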


The relative uncertainties in the Monte Carlo solution, which are also given in Table 2, provide additional support for this concept. It is seen that these uncertainties are extremely large (about 100%) for the "small" (A_j F) values in the profiles, and rather small (about 6%) for the "large" (A_j(1 − F)) values.

In the standard Monte Carlo procedure the magnitudes of the random variations that are imposed on the data set are chosen in accordance with the uncertainties in the data set, see equation (20). Alternatively, one may wish to manipulate the magnitudes of the Monte Carlo variations without modifying the uncertainties in the data set. This can be achieved by using a scaling parameter η_s for the Monte Carlo variations. Equation (20) then becomes

    x'_ji = x_ji + η_s · u · Δx_ji                                 (27)

with u = N(0, 1). The "standard" case is obtained for η_s = 1, while η_s = 0 indicates the single solution (no Monte Carlo calculations at all). We have studied the quality of the factor reproduction as a function of the scaling parameter η_s for several data sets, with profile variations ranging from 1 up to 10%. Figure 1 shows the typical behaviour of the total Q2 (the sum of the individual Q2-values for each profile) as a function of η_s for three data sets, with 1, 3 and 10% profile variation and F = 0.2. Each data point represents the average result of at least three separate Monte Carlo runs of 500 steps. Figure 1 shows that Q2 decreases substantially as η_s is increased from zero. The curves for the data sets with 1 and 3% profile variation even go through a minimum for η_s around 1.0, which indicates that the reproduction quality is indeed optimal when the magnitudes of the Monte Carlo variations are equal to the true uncertainties in the data set. The curves for the data sets with profile variations of 5% (not shown in Fig. 1) and 10% show a more nearly monotonic decrease of Q2, levelling off to a more or less constant value for η_s > 2.0. Thus a further increase of the magnitude of the variations beyond this value does not result in additional gain in reproduction quality. For a fixed η_s in the range 1.0-2.0 it is seen that an increase of the amount of profile variation results in a slight increase of Q2.

The conclusion that can be drawn from these validation tests is that Monte Carlo techniques dramatically improve the reliability of the factor solution, in particular in cases of non-orthogonal factors. The single solution generally resolves source profiles that tend too much to a "quasi-orthogonal" case. The presence of noise in the data set (either profile variation or analytical error) gives rise to some improvement in the profile reproduction quality for the single solution. For the Monte Carlo solution it seems that the various possible quasi-orthogonal single solutions more or less "average out" to a more accurate final solution. For optimal reproduction quality, the magnitudes of the Monte Carlo variations should be chosen to be at least as large as the true uncertainties in the data set. The present results give enough confidence in the accuracy of the obtained solutions to allow application of the Monte Carlo-assisted factor analysis method to realistic sets of environmental pollution data.

[Fig. 1. Indicator function Q2 for source-profile reproduction as a function of the scaling parameter η_s of the Monte Carlo process, for three different test-data sets generated with 1, 3 and 10% profile variation, respectively. Plot (Q2 on a logarithmic scale vs η_s) not reproduced.]

REFERENCES

Alpert D. J. and Hopke P. K. (1980) A quantitative determination of sources in the Boston urban aerosol. Atmospheric Environment 14, 1137-1146.
Harman H. H. (1976) Modern Factor Analysis, 3rd edition (revised). University of Chicago Press, Chicago.
Hopke P. K. (1988) Target transformation factor analysis as an aerosol mass apportionment method: a review and sensitivity study. Atmospheric Environment 22, 1777-1792.
Hwang C. S., Severin K. G. and Hopke P. K. (1984) Comparison of R-mode and Q-mode factor analysis for aerosol mass apportionment. Atmospheric Environment 18, 345-352.
IMSL (1987) STAT/Library manual version 1.0. International Mathematical and Statistical Library, Inc., Houston, Texas.
Roscoe B. A. and Hopke P. K. (1981a) Error estimates for factor loadings and scores obtained with target transformation factor analysis. Analytica Chim. Acta 132, 89-97.
Roscoe B. A. and Hopke P. K. (1981b) Comparison of weighted and unweighted target transformation rotations in factor analysis. Comput. Chem. 5, 1-7.
Roscoe B. A., Chen C. Y. and Hopke P. K. (1984) Comparison of the target transformation factor analysis of coal composition data with X-ray diffraction analysis. Analytica Chim. Acta 160, 121-134.
Schönemann P. H. (1966) A generalized solution of the orthogonal Procrustes problem. Psychometrika 31, 1-10.
Sloof J. E. and Wolterbeek H. Th. (1991) Patterns in trace elements in lichens. Water Air Soil Pollut. 57-58, 785-795.
Weiner P. H., Malinowski E. R. and Levinstone A. R. (1970) Factor analysis of solvent shifts in proton magnetic resonance. J. phys. Chem. 74, 4537-4542.