The use of Monte Carlo methods in factor analysis
Atmospheric Environment Vol. 27A, No. 13, pp. 1967-1974, 1993. 0004-6981/93 $6.00+0.00 Printed in Great Britain. © 1993 Pergamon Press Ltd
THE USE OF MONTE CARLO METHODS IN FACTOR ANALYSIS
P. KUIK, M. BLAAUW, J. E. SLOOF and H. TH. WOLTERBEEK
Delft University of Technology, Interfaculty Reactor Institute, Department of Radiochemistry, Mekelweg 15, 2629 JB Delft, The Netherlands
(First received 30 November 1992 and in final form 2 April 1993)
Abstract--Monte Carlo techniques are introduced in target transformation factor analysis (TTFA), in combination with the concept of the principal factor model, in order to account for local variances in the data set and to estimate the uncertainties in the obtained source profiles. The new method is validated using several types of artificial data sets. It was found that application of the Monte Carlo method leads to a significant improvement of the accuracy of the derived source profiles in comparison with standard TTFA. From the introduction of (known) error sources to the artificial data sets it was found that the source-profile reproduction quality is optimal if the magnitudes of the Monte Carlo variations are chosen equal to the magnitudes of the introduced errors.
Key word index: Factor analysis, target transformation, Monte Carlo methods.
INTRODUCTION

During the last few decades an increasing number of environmental pollution studies have employed multivariate statistical methods such as factor analysis to identify possible sources of pollution, to resolve the elemental composition of the sources and to determine the contribution of each source to the total pollution level (see, for example, Alpert and Hopke, 1980; Hopke, 1988; Sloof and Wolterbeek, 1991). Contrary to the chemical element balance method (CBM), which requires both the number and composition of the sources to be known in advance, factor analysis is the most appropriate choice to obtain the desired information in cases where no a priori information about these source properties is available. In an earlier publication, Sloof and Wolterbeek (1991) have reported on the application of target transformation factor analysis (TTFA) in the analysis of large data sets of atmospheric pollution data. Using TTFA, several sources of air pollution could be successfully identified. However, some questions concerning the validity and reliability of the factor model remained unanswered. In particular the following problems required further investigation:

- validation of the factor analysis method in terms of its ability to produce the correct source profiles;
- determination of the uncertainties in the obtained solution that arise from uncertainties in the data set;
- choice of the number of factors to be used.

The present paper describes a new approach to TTFA which may contribute to the solution of the problems mentioned above. Essential topics in this study are the use of Monte Carlo techniques and the application of the principal factor model in factor analysis. By using this model, the obtained factor solutions take into account the unexplained local variances (noise) in the data set.

Up to now, most publications on factor analysis did not explicitly account for the uncertainties in the data set, neither by using local variances in a principal factor analysis, nor in the estimation of the resulting uncertainties in the factor solution. Hopke (1988) and Roscoe and Hopke (1981a) discussed two methods, among which the so-called jack-knifing method, to obtain estimates for the uncertainties in the factor loadings. However, although nearly as computationally intensive as the Monte Carlo approach, the jack-knifing method does not use any knowledge (if present) about the individual uncertainties in the data set. Instead, it estimates the uncertainties in the loadings by subsequently eliminating a sample from the data set, whereafter means and standard deviations of the obtained parameters are determined. It is doubtful whether the standard deviations thus obtained can be considered to be reasonable estimates of the true uncertainties.

Validation of the standard TTFA method by using artificial test-data sets has been described by several authors (e.g. Hwang et al., 1984; Hopke, 1988), yielding rather reasonable results. Following the same methods, it is interesting to study possible effects of the application of the Monte Carlo method on the accuracy with which the source profiles can be reproduced.

In the following sections the factor analysis method is introduced and a survey of the mathematical aspects of factor analysis as well as the basic calculational procedures is given, both to orient the reader and to introduce notation conventions. Thereafter, the Monte Carlo approach and its computational
aspects are presented. The validation of the new factor analysis approach is studied by performing factor analysis on simple, artificial data sets which were generated by a limited number of sources of known composition.

THE FACTOR MODEL

The mathematical factor analysis concept used in the present study is essentially based on the classical factor model, which is extensively treated by Harman (1976). We consider a data set of elemental concentrations determined at $N$ sampling sites for a total of $n$ elements. The concentration of the $j$th element ($j = 1 \ldots n$) at the $i$th sampling site ($i = 1 \ldots N$) is denoted by $x_{ji}$. It is convenient to transform the data set to the standardized variables $z_{ji}$ by

$$z_{ji} = (x_{ji} - \bar{x}_j)/\sigma_j \qquad (1)$$

with the mean of element $j$ given by

$$\bar{x}_j = \sum_{i=1}^{N} x_{ji}/N \qquad (2)$$

and its standard deviation by

$$\sigma_j = \left( \sum_{i=1}^{N} (x_{ji} - \bar{x}_j)^2 / (N-1) \right)^{1/2}. \qquad (3)$$

As a result, the $z_{ji}$ have a mean of zero and a variance of unity. In the factor model the $z_{ji}$ are assumed to be a linear sum of $m$ common factors (in our particular case to be interpreted as emission sources), with $m \le n$, which account for the correlations between the variables, and a unique contribution which is specific for each individual sampling site

$$z_{ji} = \sum_{k=1}^{m} a_{jk} f_{ki} + d_j u_{ji}. \qquad (4)$$

The coefficients $a_{jk}$, representing the correlation of element $j$ with factor $k$, are indicative for the relative elemental composition of the factor $k$. These coefficients are often called the loadings of the factors. The $m$ coefficients $f_{ki}$ are usually called the values of the factors. They represent the contribution of factor $k$ to sample $i$. The product $d_j u_{ji}$ represents the residual error for variable $j$ in sample $i$, which is not accounted for by the $m$ common factors. For a given sample $i$ the $z_{ji}$ can be conceived as a column vector with the standardized element concentrations as its components. Equation (4) can then be written as

$$\begin{pmatrix} z_{1i} \\ \vdots \\ z_{ni} \end{pmatrix} = f_{1i} \begin{pmatrix} a_{11} \\ \vdots \\ a_{n1} \end{pmatrix} + f_{2i} \begin{pmatrix} a_{12} \\ \vdots \\ a_{n2} \end{pmatrix} + \cdots + f_{mi} \begin{pmatrix} a_{1m} \\ \vdots \\ a_{nm} \end{pmatrix} + \begin{pmatrix} d_1 u_{1i} \\ \vdots \\ d_n u_{ni} \end{pmatrix} \qquad (5)$$

where the $m$ column vectors

$$\mathbf{a}_k = \begin{pmatrix} a_{1k} \\ a_{2k} \\ \vdots \\ a_{nk} \end{pmatrix} \quad (k = 1 \ldots m) \qquad (6)$$

contain the loadings (source composition) of the corresponding factor $k$.

Similar to the standardized variables $z_{ji}$, the $f_{ki}$ and the $u_{ji}$ have variances of unity, giving the following relations

$$h_j^2 = \sum_{k=1}^{m} a_{jk}^2, \qquad h_j^2 + d_j^2 = 1 \quad (j = 1 \ldots n). \qquad (7)$$

Equation (7) shows how the total variance of each variable $j$ (which is normalized to unity) is split into two distinct parts. The communality $h_j^2$ represents the fraction of the variance of variable $j$ which is to be explained by the factor model, while the uniqueness $d_j^2$ represents the remaining unexplained fraction of the variance. Because this unexplained variance is reflected in measured variations on a local scale (i.e. multiple measurements within a single sampling site), the uniqueness is referred to as the "local variance" in the present paper. The $n$ coefficients $u_{ji}$ indicate how each individual sample $i$ contributes to the uniqueness. For the complete data set, equation (4) can be expressed in matrix form

$$Z = AF + DU \qquad (8)$$

where $Z$ is an $n \times N$ matrix with components $z_{ji}$; $A$ an $n \times m$ matrix with components $a_{jk}$; $F$ an $m \times N$ matrix with components $f_{ki}$; $D$ a diagonal $n \times n$ matrix with components $d_j$ on the diagonal; and $U$ an $n \times N$ matrix with components $u_{ji}$.

The uniqueness can be determined in advance from the uncertainties in the individual elemental concentrations. From equations (2) and (3) it follows that the total amount of variance of element $j$, denoted by $V_{TOT,j}$, is given by

$$V_{TOT,j} = \sigma_j^2 = \sum_{i=1}^{N} (x_{ji} - \bar{x}_j)^2 / (N-1). \qquad (9)$$

The amount of variance due to local variations in the data set can be expressed by

$$V_{LOC,j} = \sum_{i=1}^{N} (\Delta x_{ji})^2 / N \qquad (10)$$

where $\Delta x_{ji}$ is the total, absolute uncertainty in the concentration $x_{ji}$ of element $j$ at sampling site $i$.
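As an illustration of equations (1)-(3), (9) and (10), the following Python/NumPy sketch standardizes a small synthetic data set and computes the total and local variance of each element. The array names (`X`, `dX`), the function names, the lognormal test data and the assumed 10% relative uncertainty are hypothetical choices made for this example only; they are not taken from the data sets used in this study.

```python
import numpy as np

def standardize(X):
    """Equations (1)-(3). X has shape (n elements, N sampling sites)."""
    mean = X.mean(axis=1, keepdims=True)            # eq. (2)
    sigma = X.std(axis=1, ddof=1, keepdims=True)    # eq. (3), divisor N - 1
    return (X - mean) / sigma, mean, sigma          # eq. (1)

def total_and_local_variance(X, dX):
    """Equations (9) and (10). dX holds the absolute uncertainties of X."""
    v_tot = X.var(axis=1, ddof=1)                   # eq. (9), divisor N - 1
    v_loc = (dX ** 2).mean(axis=1)                  # eq. (10), divisor N
    return v_tot, v_loc

# Hypothetical data: 4 elements measured at 50 sampling sites.
rng = np.random.default_rng(0)
X = rng.lognormal(mean=1.0, sigma=0.5, size=(4, 50))
dX = 0.1 * X                                        # assumed 10% uncertainty

Z, mean, sigma = standardize(X)
v_tot, v_loc = total_and_local_variance(X, dX)
```

Each row of `Z` then has zero mean and unit variance, and `v_tot` equals the squared standard deviations `sigma**2`, as stated in equation (9).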
The uniqueness $d_j^2$ is then given by

$$d_j^2 = V_{LOC,j}/V_{TOT,j} \approx \sum_{i=1}^{N} (\Delta x_{ji})^2 \Big/ \sum_{i=1}^{N} (x_{ji} - \bar{x}_j)^2 \qquad (11)$$

in which use has been made of the approximation $(N-1) \approx N$ for sufficiently large $N$.

The major goals of factor analysis are (1) to determine $m$, the number of factors needed to describe the experimental data satisfactorily, and (2) to determine the matrix of factor loadings $A$. Assuming the existence of $m$ common factors, a first direct solution for $A$ is obtained from the principal factor method, which mainly consists of the diagonalization of the $n \times n$ matrix $C$ of reduced correlations between variables (here: elements), with components

$$C_{jk} = \frac{1}{N} \sum_{i=1}^{N} z_{ji} z_{ki} - \delta_{jk} d_j^2 \qquad (12)$$

where $\delta_{jk} = 1$ for $j = k$ and $\delta_{jk} = 0$ for $j \ne k$.

If $d_j^2$ is set to zero for all $j$, the analysis is called a principal components analysis. In that case the common factors must account for all variance in the data set. When no a priori information on the uniqueness is available, principal components analysis is in fact the only way to obtain a first solution. If, however, the uniqueness can be determined in advance from the uncertainties in the individual elemental concentrations as indicated in equation (11), then the principal factor method is the appropriate choice.

If $Q$ is the $n \times m$ matrix of eigenvectors of $C$, it can be shown that a possible solution of $A$ is given by

$$a_{jk} = q_{jk} (\lambda_k)^{1/2} \qquad (13)$$

satisfying the relationship

$$C = Q \Lambda Q^T = A A^T \qquad (14)$$

where $\Lambda$ is the diagonal matrix of eigenvalues $\lambda_k$ of $C$. It should be noted that if $A$ is a solution of equation (14), then every matrix $A' = AT$ with $T$ an orthogonal transformation satisfying the relationship $T^T = T^{-1}$ will be a solution also.

In order to be useful as source profile vectors, the factor loadings of the direct solution $A$ must be transformed to a new solution $B = AR$, with $R$ the transformation matrix. The transformation must satisfy the following conditions:

- the factor loadings should not contain negative element concentrations (obviously a source cannot emit negative amounts of any element);
- the factor loadings must still explain the original correlations in the data set.

This transformation procedure is commonly called "target transformation". It was originally introduced by Weiner et al. (1970) and was further employed by several authors, e.g. Alpert and Hopke (1980), Roscoe and Hopke (1981b). With this transformation one can specify a target matrix which should more or less resemble the desired solution. The algorithm then finds a transformation $R$ which provides a least-squares fit between the target matrix and the transformed matrix of factor loadings. The type of transformation considered in the present study is the orthogonal Procrustes rotation (Schönemann, 1966). By modifying the obtained rotated solution according to the conditions mentioned above (in particular the condition of non-negative contributions) and using this modified rotated solution as a new target matrix for the next calculation, one obtains an iterative process. The iteration ends if the differences between the target matrix and the matrix of the rotated solution have become "small" (the relative difference for each matrix element should be smaller than, for example, 10^-* [exponent illegible in the source]). In practice the process is stopped after a fixed number (100) of iterations, which was found to be sufficient in all cases. Several types of target matrices have been considered, among which:

- a forced-positive direct solution (negative matrix coefficients set to zero);
- a "simple" target matrix consisting of unit vectors with only one element set to unity and all others to zero. The unrotated direct solution $A$ is used to guide the choice of the elements to be set to unity (selecting elements with maximum correlation).

It was found that various types of target matrices yield essentially identical solutions, even in the case of real, complicated data sets. This indicates that the solution thus found can be considered to be reasonably correct. Further investigations concerning the correctness of the solution can only be made by using artificial test-data sets, as will be discussed later on, or by validation using independent knowledge on source profiles (e.g. Roscoe et al., 1984). All calculations presented here have been performed with the forced-positive direct-solution target matrix.

Because of the normalization defined by equation (1), the factor loadings are obtained in terms of correlations and as such they have only a mathematical meaning. After target transformation, the obtained final solution $B$ must be transformed back to the concentration domain by multiplying row $j$ of $B$ with the standard deviation $\sigma_j$. Then the components $b_{jk}$ represent the concentration of element $j$ in factor $k$. Each column of $B$ is then scaled in order to sum up to a total of 1,000,000:

$$\sum_{j=1}^{n} b_{jk} = 1{,}000{,}000 \quad (k = 1 \ldots m). \qquad (15)$$

Alternatively, the columns of $B$ can be scaled to a value of 100.0 for the so-called pilot element of the factor in order to enhance interpretability. The pilot element is the element with the largest loading in the correlation domain (before back-transformation to the original basis). Hence it is the most characteristic element in the factor.

In the final step of the analysis the matrix of factor values $F$ is determined using a linear least-squares method.
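A minimal Python/NumPy sketch of the principal factor extraction, the iterative target transformation and the scaling of equation (15) is given here. It is an illustration under stated assumptions, not the implementation used in this study: the function names, the synthetic two-source data, the 5% relative uncertainty, the clipping of slightly negative eigenvalues and the use of a singular value decomposition to solve the orthogonal Procrustes problem are all choices made for this example.

```python
import numpy as np

def principal_factor_loadings(Z, d2, m):
    """Direct solution A (n x m) following equations (12)-(14)."""
    N = Z.shape[1]
    C = Z @ Z.T / N - np.diag(d2)           # reduced correlation matrix, eq. (12)
    eigval, eigvec = np.linalg.eigh(C)      # C is symmetric
    top = np.argsort(eigval)[::-1][:m]      # indices of the m largest eigenvalues
    lam = np.clip(eigval[top], 0.0, None)   # assumption: clip tiny negative values
    return eigvec[:, top] * np.sqrt(lam)    # a_jk = q_jk * sqrt(lambda_k), eq. (13)

def procrustes_rotation(A, target):
    """Orthogonal R minimizing the Frobenius norm of (A R - target)."""
    U, _, Vt = np.linalg.svd(A.T @ target)
    return U @ Vt

def target_transform(A, n_iter=100):
    """Iterate Procrustes rotations towards a forced-positive target."""
    target = np.clip(A, 0.0, None)          # forced-positive direct solution
    for _ in range(n_iter):                 # fixed number of iterations
        R = procrustes_rotation(A, target)
        B = A @ R
        target = np.clip(B, 0.0, None)      # negative coefficients set to zero
    return B, R

def to_concentration_profiles(B, sigma, total=1_000_000):
    """Multiply row j by sigma_j, then scale each column as in eq. (15)."""
    B_conc = B * sigma[:, None]
    return B_conc / B_conc.sum(axis=0) * total

# Hypothetical test data: two sources, five elements, 200 sampling sites.
rng = np.random.default_rng(1)
profiles = np.array([[5.0, 0.5], [3.0, 0.2], [0.3, 4.0], [0.1, 2.0], [1.0, 1.0]])
contributions = rng.uniform(0.5, 2.0, size=(2, 200))
X = profiles @ contributions
X = X * (1.0 + 0.05 * rng.standard_normal(X.shape))   # 5% measurement noise
dX = 0.05 * X                                          # assumed uncertainties

centered = X - X.mean(axis=1, keepdims=True)
sigma = X.std(axis=1, ddof=1)
Z = centered / sigma[:, None]                          # eq. (1)
d2 = (dX ** 2).sum(axis=1) / (centered ** 2).sum(axis=1)   # eq. (11)

A = principal_factor_loadings(Z, d2, m=2)
B, R = target_transform(A)
source_profiles = to_concentration_profiles(B, sigma)
```

Because `R` is orthogonal, `B @ B.T` equals `A @ A.T`, so the rotated loadings reproduce the same reduced correlations as the direct solution, which is the second of the two conditions on the transformation.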
Because of the standardization given in equation (1), the vector of factor values $\mathbf{f}_i$ for each sampling site $i$, with components $f_{1i} \ldots f_{mi}$, is obtained from

$$\mathbf{f}_i = (B^T W B)^{-1} B^T W (\mathbf{x}_i - \mathbf{x}_{av}). \qquad (16)$$

Here $W$ is a diagonal weight matrix with the inverse of the squared uncertainties in the elemental concentrations as diagonal elements, $\mathbf{x}_i$ is the vector of elemental concentrations and $\mathbf{x}_{av}$ is the vector of concentration averages

[...] the uncertainties in the obtained factor loadings and to gain more insight in the stability of the solution. The basic idea is to generate a large number (typically 500 in our case) of modified data sets $X'$ with concentrations $x'_{ji}$, in which all element concentrations are slightly altered in a random way, and subsequently perform factor analysis on these data sets. The magnitudes of the normally distributed random deviations are chosen in accordance with the uncertainties in the