Atmospheric Environment Vol. 27A, No. 13, pp. 1967–1974, 1993. 0004-6981/93 $6.00+0.00. Printed in Great Britain. © 1993 Pergamon Press Ltd
THE USE OF MONTE CARLO METHODS IN FACTOR ANALYSIS
P. KUIK, M. BLAAUW, J. E. SLOOF and H. TH. WOLTERBEEK
Delft University of Technology, Interfaculty Reactor Institute, Department of Radiochemistry, Mekelweg 15, 2629 JB Delft, The Netherlands
(First received 30 November 1992 and in final form 2 April 1993)
Abstract--Monte Carlo techniques are introduced in target transformation factor analysis (TTFA), in combination with the concept of the principal factor model, in order to account for local variances in the data set and to estimate the uncertainties in the obtained source profiles. The new method is validated using several types of artificial data sets. It was found that application of the Monte Carlo method leads to a significant improvement of the accuracy of the derived source profiles in comparison with standard TTFA. From the introduction of (known) error sources to the artificial data sets it was found that the source-profile reproduction quality is optimal if the magnitudes of the Monte Carlo variations are chosen equal to the magnitudes of the introduced errors.
Key word index: Factor analysis, target transformation, Monte Carlo methods.
INTRODUCTION

During the last few decades an increasing number of environmental pollution studies have employed multivariate statistical methods such as factor analysis to identify possible sources of pollution, to resolve the elemental composition of the sources and to determine the contribution of each source to the total pollution level (see, for example, Alpert and Hopke, 1980; Hopke, 1988; Sloof and Wolterbeek, 1991). Contrary to the chemical element balance method (CBM), which requires both the number and composition of the sources to be known in advance, factor analysis is the most appropriate choice to obtain the desired information in cases where no a priori information about these source properties is available. In an earlier publication, Sloof and Wolterbeek (1991) reported on the application of target transformation factor analysis (TTFA) in the analysis of large data sets of atmospheric pollution data. Using TTFA, several sources of air pollution could be successfully identified. However, some questions concerning the validity and reliability of the factor model remained unanswered. In particular, the following problems required further investigation:

- validation of the factor analysis method in terms of its ability to produce the correct source profiles
- determination of the uncertainties in the obtained solution that arise from uncertainties in the data set
- choice of the number of factors to be used.

The present paper describes a new approach to TTFA which may contribute to the solution of the problems mentioned above. Essential topics in this study are the use of Monte Carlo techniques and the application of the principal factor model in factor analysis. By using this model, the obtained factor solutions take into account the unexplained local variances (noise) in the data set.

Up to now, most publications on factor analysis did not explicitly account for the uncertainties in the data set, neither by using local variances in a principal factor analysis, nor in the estimation of the resulting uncertainties in the factor solution. Hopke (1988) and Roscoe and Hopke (1981a) discussed two methods, among which the so-called jack-knifing method, to obtain estimates for the uncertainties in the factor loadings. However, although nearly as computationally intensive as the Monte Carlo approach, the jack-knifing method does not use any knowledge (if present) about the individual uncertainties in the data set. Instead, it estimates the uncertainties in the loadings by successively eliminating a sample from the data set, after which means and standard deviations of the obtained parameters are determined. It is doubtful whether the standard deviations thus obtained can be considered reasonable estimates of the true uncertainties.

Validation of the standard TTFA method by using artificial test-data sets has been described by several authors (e.g. Hwang et al., 1984; Hopke, 1988), yielding rather reasonable results. Following the same methods, it is interesting to study possible effects of the application of the Monte Carlo method on the accuracy with which the source profiles can be reproduced.

In the following sections the factor analysis method is introduced and a survey of the mathematical aspects of factor analysis as well as the basic calculational procedures is given, both to orient the reader and to introduce notation conventions. Thereafter, the Monte Carlo approach and its computational
aspects are presented. The validation of the new factor analysis approach is studied by performing factor analysis on simple, artificial data sets which were generated by a limited number of sources of known composition.

THE FACTOR MODEL

The mathematical factor analysis concept used in the present study is essentially based on the classical factor model, which is extensively treated by Harman (1976). We consider a data set of elemental concentrations determined at N sampling sites for a total of n elements. The concentration of the jth element (j = 1 ... n) at the ith sampling site (i = 1 ... N) is denoted by x_ji. It is convenient to transform the data set to the standardized variables z_ji by

z_ji = (x_ji − x̄_j)/σ_j    (1)

with the mean of element j given by

x̄_j = Σ_{i=1}^{N} x_ji / N    (2)

and its standard deviation by

σ_j = ( Σ_{i=1}^{N} (x_ji − x̄_j)^2 / (N − 1) )^{1/2}.    (3)

As a result, the z_ji have a mean of zero and a variance of unity. In the factor model the z_ji are assumed to be a linear sum of m common factors (in our particular case to be interpreted as emission sources), with m ≤ n, which account for the correlations between the variables, and a unique contribution which is specific for each individual sampling site

z_ji = Σ_{k=1}^{m} a_jk f_ki + d_j u_ji.    (4)

The coefficients a_jk, representing the correlation of element j with factor k, are indicative of the relative elemental composition of the factor k. These coefficients are often called the loadings of the factors. The coefficients f_ki are usually called the values of the factors. They represent the contribution of factor k to sample i. The product d_j u_ji represents the residual error for variable j in sample i, which is not accounted for by the m common factors. For a given sample i the z_ji can be conceived as a column vector with the standardized element concentrations as its components. Equation (4) can then be written as

(z_1i, ..., z_ni)^T = f_1i (a_11, ..., a_n1)^T + f_2i (a_12, ..., a_n2)^T + ... + f_mi (a_1m, ..., a_nm)^T + (d_1 u_1i, ..., d_n u_ni)^T    (5)

where the m column vectors

a_k = (a_1k, a_2k, ..., a_nk)^T    (k = 1 ... m)    (6)

contain the loadings (source composition) of the corresponding factor k.

Similar to the standardized variables z_ji, the f_ki and the u_ji have variances of unity, giving the following relations

h_j^2 = Σ_{k=1}^{m} a_jk^2,    h_j^2 + d_j^2 = 1    (j = 1 ... n).    (7)

Equation (7) shows how the total variance of each variable j (which is normalized to unity) is split into two distinct parts. The communality h_j^2 represents the fraction of the variance of variable j which is to be explained by the factor model, while the uniqueness d_j^2 represents the remaining unexplained fraction of the variance. Because this unexplained variance is reflected in measured variations on a local scale (i.e. multiple measurements within a single sampling site), the uniqueness is referred to as the "local variance" in the present paper. The N coefficients u_ji indicate how each individual sample i contributes to the uniqueness. For the complete data set, equation (4) can be expressed in matrix form

Z = AF + DU    (8)

where Z is an n × N matrix with components z_ji; A an n × m matrix with components a_jk; F an m × N matrix with components f_ki; D a diagonal n × n matrix with components d_j on the diagonal; and U an n × N matrix with components u_ji.

The uniqueness can be determined in advance from the uncertainties in the individual elemental concentrations. From equations (2) and (3) it follows that the total amount of variance of element j, denoted by V_TOT,j, is given by

V_TOT,j = σ_j^2 = Σ_{i=1}^{N} (x_ji − x̄_j)^2 / (N − 1).    (9)

The amount of variance due to local variations in the data set can be expressed by

V_LOC,j = Σ_{i=1}^{N} (Δx_ji)^2 / N    (10)

where Δx_ji is the total, absolute uncertainty in the concentration x_ji of element j at sampling site i.
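Equations (1)–(3), (9) and (10) can be illustrated with a short numerical sketch. The example below is not part of the original paper: the synthetic concentration matrix and the assumed 10% measurement uncertainty are placeholders chosen only to show the computations.

```python
import numpy as np

rng = np.random.default_rng(0)

n, N = 4, 50                                         # n elements, N sampling sites
x = rng.lognormal(mean=1.0, sigma=0.5, size=(n, N))  # concentrations x_ji (synthetic)
dx = 0.10 * x                                        # assumed absolute uncertainties Δx_ji

# Equations (2) and (3): mean and standard deviation of each element j
xbar = x.mean(axis=1)                   # x̄_j
sigma = x.std(axis=1, ddof=1)           # σ_j, with N − 1 in the denominator

# Equation (1): standardized variables z_ji (zero mean, unit variance per element)
z = (x - xbar[:, None]) / sigma[:, None]

# Equation (9): total variance V_TOT,j = σ_j²
v_tot = sigma**2

# Equation (10): local variance V_LOC,j from the individual uncertainties
v_loc = (dx**2).sum(axis=1) / N

# Ratio of local to total variance, an a priori estimate of the uniqueness d_j²
d2 = v_loc / v_tot

assert np.allclose(z.mean(axis=1), 0.0)
assert np.allclose(z.var(axis=1, ddof=1), 1.0)
```

The two assertions confirm the property stated after equation (3): the standardized z_ji have zero mean and unit variance for every element j.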
The uniqueness d_j^2 is then given by

d_j^2 = V_LOC,j / V_TOT,j ≈ Σ_{i=1}^{N} (Δx_ji)^2 / Σ_{i=1}^{N} (x_ji − x̄_j)^2    (11)

in which use has been made of the approximation (N − 1) ≈ N for sufficiently large N.

The major goals of factor analysis are (1) to determine m, the number of factors needed to describe the experimental data satisfactorily, and (2) to determine the matrix of factor loadings A. Assuming the existence of m common factors, a first direct solution for

finds a transformation R which provides a least-squares fit between the target matrix and the transformed matrix of factor loadings. The type of transformation considered in the present study is the orthogonal Procrustes rotation (Schönemann, 1966). By modifying the obtained rotated solution according to the conditions mentioned above (in particular the condition of non-negative contributions) and using this modified rotated solution as a new target matrix for the next calculation, one obtains an iterative process. The iteration ends if the difference