3 example concentrations of substances at different factory sites.pdf
TRANSCRIPT
-
8/12/2019 3 Example Concentrations of substances at different factory sites.pdf
Practical Statistics for Environmental and Biological Scientists
John Townend
JOHN WILEY & SONS, LTD
Copyright 2002 by John Wiley & Sons Ltd, Baffins Lane, Chichester, West Sussex PO19 1UD, England
National 01243 779777
International (+44) 1243 779777
e-mail (for orders and customer service enquiries): [email protected]
Visit our Home Page on http://www.wiley.co.uk or http://www.wiley.com
All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright Licensing Agency, 90 Tottenham Court Road, London, UK W1P 9HE, without the permission in writing of the publisher. John Townend has asserted his right under the Copyright, Designs and Patents Act 1988, to be identified as the author of this work.
Other Wiley Editorial Offices
John Wiley & Sons, Inc., 605 Third Avenue, New York, NY 10158-0012, USA
Wiley-VCH Verlag GmbH, Pappelallee 3, D-69469 Weinheim, Germany
Jacaranda Wiley Ltd, 33 Park Road, Milton, Queensland 4064, Australia
John Wiley & Sons (Asia) Pte Ltd, 2 Clementi Loop #02-01, Jin Xing Distripark, Singapore 129809
John Wiley & Sons (Canada) Ltd, 22 Worcester Road, Rexdale, Ontario M9W 1L1, Canada
Library of Congress Cataloging-in-Publication Data
Townend, John.
Practical statistics for environmental and biological scientists / John Townend.
p. cm.
Includes bibliographical references and index.
ISBN 0-471-49664-2 (cased) ISBN 0-471-49665-0 (pbk.)
1. Mathematical statistics. I. Title.
QA276.12.T68 2001
519.5--dc21 2001046623
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library
ISBN 0-471-49664-2 (cased) ISBN 0-471-49665-0 (limp)
Typeset in 10.5/13pt Times by Vision Typesetting, Manchester
Printed and bound in Great Britain by TJ International Ltd., Padstow, Cornwall
This book is printed on acid-free paper responsibly manufactured from sustainable forestry, in which at least two trees are planted for each one used for paper production.
Contents
Preface
PART 1 STATISTICS
1 Introduction
1.1 Do you need statistics?
1.2 What is statistics?
1.3 Some important lesson
1.4 Statistics is getting easie
1.5 Integrity in statistics
1.6 About this book
2 A Brief Tutorial o
2.1 Introduction
2.2 Variability
2.3 Samples and population
2.4 Summary statistics
2.5 The basis of statistical tes
2.6 Limitations of statistic
3 Before You Start
3.1 Introduction
3.2 What statistical metho
3.3 Surveys and experiment
3.4 Designing experiments a
3.5 Summary
4 Designing an Experi
4.1 Introduction
4.2 Sample size
4.3 Sampling
4.4 Experimental design
4.5 Further reading
5 Exploratory Data A
5.1 Introduction
5.2 Column graphs
5.3 Line graphs
5.4 Scatter graphs
208 14 Principal Component Analysis
An example of when we might use PCA is:
A consultant has been asked for advice about remediating the sites of 20 chemical factories abandoned during a war in a developing country. No records are available about what was manufactured at the different sites. He decides to start by carrying out chemical analyses on soil samples from each of the sites and measuring the concentration of 11 different substances (Subst1 to Subst11) in each:
Subst1 to Subst3 are common inorganics
Subst4 to Subst7 are common organics
Subst8 to Subst10 are associated with the textile industry
Subst11 is associated with the manufacture of explosives
At this stage he does not know how many different types of chemical factory there are, or which factories produced what.
The data for this example are given in Appendix B.
Calculating principal components without a computer is not a practical option, so it is important to make sure you have a statistical package that will do this. For most statistical packages the data should be arranged in columns with each variable, i.e. each type of measurement (length, height, etc.), in a separate column. Each row should then contain the measurements made on one individual.
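As a rough illustration of this layout, the sketch below holds a small table in Python with NumPy (an assumption; any statistical package will have its own worksheet). The variable names and numbers are invented for illustration and are not the Appendix B data.

```python
import numpy as np

# Hypothetical measurements: one row per individual, one column per variable.
variables = ["length", "height", "mass"]
data = np.array([
    [12.1, 30.5, 4.2],   # individual 1
    [11.4, 28.9, 3.9],   # individual 2
    [13.0, 33.2, 4.8],   # individual 3
    [12.6, 31.0, 4.4],   # individual 4
])

n_individuals, n_variables = data.shape
column_means = data.mean(axis=0)   # one mean per variable (column)
print(n_individuals, "individuals,", n_variables, "variables")
```

Working down a column (axis=0) summarizes one type of measurement across all individuals, which is the arrangement the analysis expects.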
There are a few decisions you need to make before carrying out the analysis, but unless you have good reason to do otherwise, you can follow the standard settings given below.
Should you use the covariance or correlation matrix?
Using the covariance matrix will apply more weighting to some variables than others, depending on the magnitude of the actual numbers (e.g. the diameter of a person's head in millimetres would be given more weighting than their height in metres, simply because the numbers are bigger). Using the correlation matrix applies equal weighting to all the different types of measurement in the analysis.
If this means nothing to you, don't worry, just use the correlation matrix.
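The head diameter and height example can be tried out with a minimal sketch like the one below. It assumes Python with NumPy and uses simulated measurements (the means and spreads are invented), so treat it as an illustration of the weighting effect rather than a recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: head diameter in millimetres and height in metres for 50 people.
head_mm = rng.normal(150.0, 8.0, size=50)
height_m = rng.normal(1.70, 0.10, size=50)
data = np.column_stack([head_mm, height_m])

def first_component(matrix):
    """Return the eigenvector with the largest eigenvalue of a symmetric matrix."""
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)   # eigenvalues in ascending order
    return eigenvectors[:, -1]

pc1_cov = first_component(np.cov(data, rowvar=False))
pc1_corr = first_component(np.corrcoef(data, rowvar=False))

print("covariance PC1 loadings: ", np.round(np.abs(pc1_cov), 3))
print("correlation PC1 loadings:", np.round(np.abs(pc1_corr), 3))
```

With the covariance matrix the first component is almost entirely head diameter, purely because millimetres give bigger numbers. With the correlation matrix the two variables carry equal weight.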
How many principal components should you calculate?
Principal components are the scoring systems referred to above. They are produced in order of importance, so the first principal component is the scoring system that reflects the most variable characteristic of the sample. The second and subsequent principal components reflect gradually less and less variable characteristics. In practice you are unlikely to learn anything useful from more than the first three or four components. However, I would usually err on the side of caution and calculate around six components so that I don't miss anything interesting.
The maximum number of principal components you can calculate is equal to the number of variables - the number of types of measurement - in your dataset. So if you have only four variables you can only calculate up to four principal components. In effect, the program really always calculates all possible components and the decision you are making is just how many to display. Displaying more components will not affect any of the other components. The disadvantage of displaying a lot of components is only the amount of printout you will get. There can be a lot and most of it will tell you nothing useful, so don't choose more than six components unless you have good reason.
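The point that all components are always calculated, and that you only choose how many to display, can be seen in a small sketch. This assumes Python with NumPy and uses random numbers as stand-in data; `n_display` is a hypothetical name for the setting your package offers.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(30, 4))          # hypothetical: 30 individuals, 4 variables

corr = np.corrcoef(data, rowvar=False)   # 4 x 4 correlation matrix
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]    # put the largest eigenvalue first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

n_display = 2                            # how many components to show
shown = eigenvectors[:, :n_display]      # the first two columns, unchanged
print(len(eigenvalues), "components calculated,", shown.shape[1], "displayed")
```

Four variables give exactly four components, and picking out the first two does not alter them in any way.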
What results should you store?
You are likely to want to store the scores for each of the principal components. You can use these scores to draw scatter graphs as described below.
You will need to specify which variables to include in your analysis, i.e. which columns of measurements. These variables will usually be all of those you have measured, but they do not have to be. Once you have specified which variables to include, and set up the other options as described, the results should be produced in a form similar to the example below (Section 14.4).
14.4 Interpreting the results
Eigenvalues
The different scoring systems produced are called principal components and each one has an eigenvalue. These are actually variances; we can think of them as saying, if we measure the population using this scoring system then there will be this much variation among the scores. Notice that they appear in order, with PC1 having the largest eigenvalue, PC2 the next largest, and so on.
            PC1     PC2     PC3     PC4     PC5     PC6
Eigenvalue  5.3603  2.8867  1.1247  0.6124  0.4396  0.2611
Proportion  0.487   0.262   0.102   0.056   0.040   0.024
Cumulative  0.487   0.750   0.852   0.908   0.948   0.971

            PC7     PC8     PC9     PC10    PC11
Eigenvalue  0.1792  0.0601  0.0461  0.0261  0.0037
Proportion  0.016   0.005   0.004   0.002   0.000
Cumulative  0.988   0.993   0.997   1.000   1.000
The lines to concentrate on here are not so much the eigenvalues themselves but the proportions and cumulative proportions of total variance they account for. We started out with 11 types of measurement for each factory (concentrations of Subst1 to Subst11). From these results, we can see that we could describe 48.7% (0.487) of the total variation between factories by giving just one score for each factory (its score on PC1), rather than all 11 types of measurement. We could describe 75.0% (0.750) of the total variation between factories by giving scores using both of the first two scoring systems (PC1 and PC2), and 85.2% (0.852) of the total variation using three scores (PC1, PC2 and PC3), i.e. three types of measurement rather than 11.
Although the technical meaning of eigenvalues might seem a bit obscure, a more down to earth interpretation is to say that, since 85% of the total variation between sites is concentrated in the first three principal components, we should be most interested in whatever types of measurement these three scoring systems are made up from.
In this example we have obtained a fortunate result where most of the variation is concentrated in the first few components. If the variation turns out to be spread fairly evenly over a lot of components, it can be difficult to interpret the results and it might be better to pursue some other type of analysis.
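The proportion and cumulative lines can be reproduced from the eigenvalues alone. The sketch below, assuming Python with NumPy, divides each eigenvalue printed in the table by their total (11, the number of variables, because the correlation matrix was used) and then accumulates.

```python
import numpy as np

# Eigenvalues for PC1 to PC11 as printed in the table above.
eigenvalues = np.array([5.3603, 2.8867, 1.1247, 0.6124, 0.4396, 0.2611,
                        0.1792, 0.0601, 0.0461, 0.0261, 0.0037])

proportion = eigenvalues / eigenvalues.sum()   # share of total variance per component
cumulative = np.cumsum(proportion)             # running total of those shares

for i, (p, c) in enumerate(zip(proportion, cumulative), start=1):
    print(f"PC{i:<2}  proportion {p:.3f}  cumulative {c:.3f}")
```

The first three lines of output give 0.487, 0.750 and 0.852 in the cumulative column, matching the figures quoted in the text.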
Scree plots
Scree plots are simply plots of the amount of variance (or eigenvalues) attributable to each of the different principal components, in order. The scree plot for the above results is shown in Figure 14.1.
Some people find it easier to decide how many components to focus on by using a scree plot. In this example there is little variance explained by the fourth and subsequent components, so I will concentrate on interpreting the first three. Unless there is a clear break in the slope of the curve, the choice of how many components to concentrate on is very subjective.
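If your package does not draw scree plots, a crude text version is easy to sketch. The function below is a hypothetical helper written in plain Python (it is not part of any statistical package); it prints one bar per component, scaled to the largest eigenvalue from the example.

```python
# Eigenvalues for PC1 to PC11 from the chemical factories example.
eigenvalues = [5.3603, 2.8867, 1.1247, 0.6124, 0.4396, 0.2611,
               0.1792, 0.0601, 0.0461, 0.0261, 0.0037]

def text_scree(values, width=40):
    """Return one line per component with a bar proportional to its eigenvalue."""
    largest = max(values)
    lines = []
    for number, value in enumerate(values, start=1):
        bar = "#" * round(width * value / largest)
        lines.append(f"PC{number:<2} {value:6.4f} {bar}")
    return lines

print("\n".join(text_scree(eigenvalues)))
```

The bars shrink sharply over the first three components and are very short from the fourth onwards, which is the break in slope referred to above.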
Loadings
In the terminology I have used up to now, the loadings are the scoring systems themselves, also known as eigenvectors. The loadings for the first six principal components are as follows:
[Scree plot: variance (eigenvalue) on the vertical axis against principal component number, 1 to 11, on the horizontal axis]
Figure 14.1 A scree plot for the principal components calculated for the chemical factories example
          PC1     PC2     PC3     PC4     PC5     PC6
Subst1    0.407  -0.033  -0.053  -0.066   0.036   0.512
Subst2    0.413  -0.038  -0.051   0.103  -0.274   0.088
Subst3    0.384  -0.106  -0.190  -0.322  -0.182   0.328
Subst4   -0.323  -0.003  -0.318  -0.004  -0.858   0.020
Subst5   -0.343   0.057   0.061  -0.680   0.037  -0.072
Subst6   -0.334   0.187  -0.004   0.588   0.038   0.409
Subst7   -0.375   0.119  -0.124  -0.236   0.200   0.623
Subst8   -0.093  -0.561   0.057  -0.033   0.052   0.153
Subst9   -0.130  -0.555  -0.003   0.066   0.024  -0.021
Subst10  -0.128  -0.557  -0.023   0.103   0.036   0.033
Subst11   0.014  -0.006  -0.913   0.046   0.327  -0.192
The loadings tell us how each scoring system is related to each of the original variables. Values close to zero indicate little relationship between scores on this system and the original variables. In this example we can see that scores on PC3 are closely related to Subst11 concentration. The loading is negative (-0.913) so we can say that factories with high scores on PC3 will have low concentrations of Subst11, and factories with low scores on PC3 (i.e. negative scores) will have high Subst11 concentrations.
Note that some programs produce tables of correlation coefficients rather than eigenvectors, but these may also be referred to as loadings. The interpretation of correlation coefficients is broadly similar, i.e. values in the table close to +1 or -1 indicate that a particular variable has a strong influence on the scores on a particular principal component. However, this type of loading cannot be used directly to calculate the scores as described below. You should therefore check which type of loadings your statistical package produces. We must now try to interpret the different principal components.
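To see how the two kinds of loading are related, here is a sketch in Python with NumPy on simulated data (the data and variable names are invented). It relies on a standard result for an analysis based on the correlation matrix: the correlation between a variable and a component's scores equals the eigenvector loading multiplied by the square root of that component's eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: 40 individuals, 3 variables, the first two correlated.
base = rng.normal(size=(40, 1))
data = np.hstack([base + 0.5 * rng.normal(size=(40, 1)),
                  base + 0.5 * rng.normal(size=(40, 1)),
                  rng.normal(size=(40, 1))])

z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)   # standardized variables
eigenvalues, eigenvectors = np.linalg.eigh(np.corrcoef(data, rowvar=False))
order = np.argsort(eigenvalues)[::-1]                        # largest eigenvalue first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

scores = z @ eigenvectors                                    # eigenvectors give the scores
correlation_loadings = eigenvectors * np.sqrt(eigenvalues)   # column j scaled by sqrt(eigenvalue j)

# Direct check: correlation of the first variable with the PC1 scores.
direct = np.corrcoef(z[:, 0], scores[:, 0])[0, 1]
print(round(float(direct), 3), round(float(correlation_loadings[0, 0]), 3))
```

The two printed numbers agree. Each eigenvector has squared loadings that sum to 1, which is a quick way of telling which type of loading a package has printed.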
The first principal component (PC1)
Variables with loadings close to zero can be ignored. Unfortunately, it is rather subjective which ones are close enough to zero to ignore, but our aim here is to find a meaningful interpretation rather than a mathematical description of the data. This scoring system seems to be dominated by the first seven loadings and is approximately
0.41 × [Subst1] + 0.41 × [Subst2] + 0.38 × [Subst3] - 0.32 × [Subst4] - 0.34 × [Subst5] - 0.33 × [Subst6] - 0.38 × [Subst7]
where [Subst1] is the standardized concentration of Subst1 etc. (Box 14.1). The concentrations of Subst8 to Subst11 do not greatly affect the score on this component, so we can ignore them. PC1 therefore contrasts Subst1 to Subst3 (inorganics), which have positive coefficients in the scoring system, with Subst4 to Subst7 (organics), which have negative coefficients. The higher the concentration of inorganic substances found at the site, the higher the score; the higher the
Box 14.1 Standardizing measurements
Opting to use the correlation matrix in the analysis has the effect of making all the types of measurement equally important, regardless of the magnitude of the actual values. This is achieved by replacing each measurement in the analysis by a 'standardized' measurement:
standardized measurement = (measurement - mean of that type of measurement) / (standard deviation of that type of measurement)
The different types of measurement are more correctly called variables. In the factories example, Subst1 concentration is one variable, Subst2 concentration is another variable, and so on. For Subst1 concentration, we have a mean of 4.48 and a standard deviation of 1.018 (Appendix B). The Subst1 concentration for factory 1 is 3.3, so
standardized Subst1 concentration for factory 1 = (3.3 - 4.48) / 1.018 = -1.16
A standardized value can be calculated for each of the measurements in the original dataset in a similar way.
After standardizing, each variable will have a standard deviation of 1 and a mean of 0. This procedure for standardizing variables is not unique to PCA. It is used in many other statistical methods also.
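The worked value in this box can be checked with a couple of lines of Python. The function name `standardize` is just a label chosen for this sketch; the numbers are the Subst1 mean, standard deviation and factory 1 concentration quoted above.

```python
def standardize(measurement, mean, standard_deviation):
    """Subtract the mean of the variable and divide by its standard deviation."""
    return (measurement - mean) / standard_deviation

# Subst1 for factory 1: concentration 3.3, mean 4.48, standard deviation 1.018.
z = standardize(3.3, 4.48, 1.018)
print(round(z, 2))   # -1.16
```

A negative value simply means that factory 1 has a lower Subst1 concentration than the average factory.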
concentration of organic substances, the lower the score. Hence PC1 can be considered to be a measure of how inorganic or organic are the residues at the site.
The second principal component (PC2)
Ignoring variables with loadings close to zero, this scoring system is approximately
- 0.56 × [Subst8] - 0.55 × [Subst9] - 0.56 × [Subst10]
The concentrations of Subst1 to Subst7 and the concentration of Subst11 do not greatly affect the score on this component. Subst8, Subst9 and Subst10 were the chemicals associated with textile manufacture. Hence PC2 might be considered to be a measure of how likely factories are to have been associated with the textile industry. Since the loadings all happen to be negative values, the lower the score, the more likely a factory is to have been involved with the textile industry.
The third principal component (PC3)
Ignoring variables with loadings close to zero, this scoring system is approximately
- 0.32 × [Subst4] - 0.91 × [Subst11]
The concentrations of Subst1 to Subst3 and the concentrations of Subst5 to Subst10 do not greatly affect the score on this component. Without more information about the chemistry involved, it is not obvious what the combination of Subst4 and Subst11 signifies, but this scoring system is dominated by the concentration of Subst11. Hence PC3 can be considered to be a measure of how likely a factory is to have been involved in arms manufacture. Since the loading for Subst11 is negative, a low score indicates higher likelihood of being involved with arms manufacture.
One might attempt to interpret the other principal components in a similar way but in general this becomes progressively more difficult as we get to higher numbered components. Also, as we are interested in differentiating between types of factory, we should concentrate our efforts on the components that include most of the variance.
One other common scenario is to obtain a component where all the loadings are approximately the same, either all positive or all negative. If we had obtained such a component in this example, we could interpret it as a general index of pollution due to chemical manufacture. Increases in any of the substances would have had a similar effect on this score.
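Deciding which loadings are "close to zero" is subjective, but the idea can be sketched in Python using the loadings for the first three components from the table above. The cut-off of 0.3 and the helper name `dominant` are assumptions made for this illustration, not a rule.

```python
substances = [f"Subst{i}" for i in range(1, 12)]

# Loadings for PC1 to PC3, read from the loadings table in the example.
loadings = {
    "PC1": [0.407, 0.413, 0.384, -0.323, -0.343, -0.334, -0.375,
            -0.093, -0.130, -0.128, 0.014],
    "PC2": [-0.033, -0.038, -0.106, -0.003, 0.057, 0.187, 0.119,
            -0.561, -0.555, -0.557, -0.006],
    "PC3": [-0.053, -0.051, -0.190, -0.318, 0.061, -0.004, -0.124,
            0.057, -0.003, -0.023, -0.913],
}

def dominant(component, threshold=0.3):
    """List the variables whose loading is at least `threshold` in absolute size."""
    return [name for name, value in zip(substances, loadings[component])
            if abs(value) >= threshold]

for component in loadings:
    print(component, dominant(component))
```

With this cut-off the function picks out Subst1 to Subst7 for PC1, Subst8 to Subst10 for PC2, and Subst4 and Subst11 for PC3, the same variables used in the interpretations above. A different cut-off could change the picture, so it is worth trying more than one.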
Scores
Box 14.2 explains how the scores are derived from the loadings. The scores are therefore 'measurements' of how each factory rates on the different scoring systems given by the principal components:
Factory   PC1     PC2     PC3     PC4     PC5     PC6
 1       -1.922   2.062   0.566   0.594  -0.864  -0.285
 2        2.620   1.230   0.592   0.433  -1.003  -0.340
 3       -2.917   1.225  -1.507  -0.516   0.093   0.231
 4        2.165  -2.030   0.793   0.957   0.074   0.484
 5        1.794  -2.002  -1.228  -0.186   0.301  -0.150
 6        1.386   0.771  -1.371  -0.754   0.146   0.302
 7       -1.196   2.785   1.112   0.917   1.212   0.456
 8       -2.020  -0.771  -1.302   0.600   0.321  -0.412
 9        2.661   0.763  -1.434   0.199  -0.085  -0.378
10       -2.310  -0.268   0.485  -1.212  -0.914  -0.747
11        1.982   0.720  -1.294   0.310   0.461   0.245
12       -1.405   0.032   1.087   0.965   0.319  -0.595
13       -2.358  -1.476   0.765   0.181  -0.138  -0.315
14        1.815  -2.220   0.531   0.709  -0.898   0.141
15       -2.184  -0.855  -1.437   0.579   0.071  -0.275
16       -1.653   1.363   1.005  -0.342   0.861   0.092
17        2.970   1.807   0.930  -1.739   0.316  -0.530
18        2.621  -2.762   1.161  -0.602   1.005  -0.102
19        1.713   1.895   0.157   0.001  -0.988   1.111
20       -3.760  -2.270   0.388  -1.094  -0.290   1.068
We can use these to gain information about particular factories. We can see that factories 2, 4, 9, 17 and 18 have particularly high scores on PC1, i.e. high concentrations of inorganic chemicals compared to organics. Factories 7 and 18 have the highest and lowest scores respectively for likelihood of being involved with the textile industry (PC2), and so on. Most statistical packages will save the scores for you when you carry out the analysis. These scores are equivalent to measurements made on the individuals. They can be analysed like other types of measurement using tests such as t-tests, ANOVA or regression.
Scatter graphs
Plotting scatter graphs of the scores given by different principal components
Box 14.2 Calculating scores on principal components
This explanation assumes you have used the correlation matrix in the analysis, as most people do. If you have used the covariance matrix, the procedure for calculating the scores is the same except that you use the actual measurements instead of the standardized measurements (see below). The procedure described also assumes that the loadings given by your statistical package are eigenvectors and not correlation coefficients.
Each factory has a score on each principal component. These can be calculated from the loadings and the standardized substance concentrations for that factory (Box 14.1). For factory 1, the score on PC1 is calculated as follows:
 0.407 × standardized Subst1 concentration for factory 1
 0.413 × standardized Subst2 concentration for factory 1
 0.384 × standardized Subst3 concentration for factory 1
-0.323 × standardized Subst4 concentration for factory 1
 ...
 0.014 × standardized Subst11 concentration for factory 1

 0.407 × (-1.16) = -0.472
 0.413 × (-0.47) = -0.196
 0.384 × (-1.16) = -0.444
-0.323 × 1.12    = -0.360
 ...
 0.014 × (-0.76) = -0.010
 Total           = -1.922
So the score for factory 1 on PC1 is -1.922. Scores for other factories, and on other principal components, can be obtained in the same way. In practice, most statistical programs will calculate the scores for you.
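The same calculation can be sketched for a whole dataset at once. The example below assumes Python with NumPy, a correlation-matrix analysis and eigenvector loadings, as in this box. It uses random stand-in data because the Appendix B values are not reproduced here, and `pca_scores` is a hypothetical helper name.

```python
import numpy as np

def pca_scores(data):
    """Correlation-matrix PCA: return eigenvalues, eigenvector loadings and scores."""
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)   # standardize each column
    eigenvalues, eigenvectors = np.linalg.eigh(np.corrcoef(data, rowvar=False))
    order = np.argsort(eigenvalues)[::-1]                        # largest eigenvalue first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    scores = z @ eigenvectors   # each score = sum of loading x standardized measurement
    return eigenvalues, eigenvectors, scores

rng = np.random.default_rng(3)
data = rng.normal(size=(20, 4))   # hypothetical: 20 individuals, 4 variables
eigenvalues, loadings, scores = pca_scores(data)

# Score of the first individual on PC1, written out term by term as in the box.
z1 = (data[0] - data.mean(axis=0)) / data.std(axis=0, ddof=1)
by_hand = sum(loading * value for loading, value in zip(loadings[:, 0], z1))
print(round(float(by_hand), 3), round(float(scores[0, 0]), 3))
```

The term-by-term total matches the score from the matrix calculation, and the variance of each column of scores equals the corresponding eigenvalue. Note that the sign of a whole component is arbitrary, so another program may report all the loadings and scores for a component with the signs reversed.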
may help to reveal groupings of individuals that are not obvious from the original data. To understand this, let us first consider plotting some measurements not involving PCA. Suppose a potato grower buys a load of seed potatoes from a new supplier one year. At the end of the year there seems to be a lot more variation in his crop than usual and he wonders if the seed potatoes he bought were actually not all of the same quality. He takes a sample of 40 plants at random and measures three variables on each: the dry mass of the above ground part, the dry mass of potatoes produced, and the starch concentration in the potatoes.
He starts by looking at frequency distributions of the above ground mass, potato mass, and starch concentration (Figure 14.2). There are no clear groupings here. Next he tries plotting a scatter graph of above ground mass vs. potato mass (Figure 14.3). Again, there are no obvious groups here. Finally he tries starch concentration vs. potato mass (Figure 14.4). This time we can see there are apparently two different groups of plants. This seems to suggest that there were actually two different lots of seed potatoes mixed together. He might question his supplier about this.
Of course, if there really are no distinct groupings, whichever variables we plot against each other will not show this kind of pattern. Yet why was it that plotting above ground mass vs. potato mass did not reveal these groupings but plotting starch concentration vs. potato mass did? This was largely because above ground mass and potato mass were highly correlated, i.e. the plot in Figure 14.3 is almost a straight line. On the other hand, when measurements are fairly uncorrelated, as potato mass and starch concentration were in this example, any groupings that exist are more likely to stand out.
We could draw scatter graphs for different pairs of substance concentration measurements for the factories example in a similar way, but with 11 types of measurement there would be 55 graphs to draw. Also we can only make use of two types of measurement at a time. However, an alternative is to use our new measurements here, the scores on each of the principal components, and plot scatter graphs of them.
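The figure of 55 graphs is simply the number of different pairs that can be chosen from 11 variables. A short Python check using the standard library:

```python
from itertools import combinations

substances = [f"Subst{i}" for i in range(1, 12)]   # Subst1 to Subst11
pairs = list(combinations(substances, 2))           # every pair of different variables
print(len(pairs))   # 55
```

By comparison, the first three principal components give only three pairs to plot.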
These have the advantage that each type of score is making use of information about a number of substance concentrations rather than just one. Also, because of the way they are derived, the different types of score have the useful property that they are all uncorrelated with each other; just what we are looking for to reveal groupings in the data. Because most of the variation between factories is expressed in their scores on PC1, PC2 and PC3, we will concentrate on these three principal components (Figure 14.5).
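The claim that the different types of score are uncorrelated can be checked on any dataset. The sketch below assumes Python with NumPy and uses three deliberately correlated simulated variables (invented for illustration).

```python
import numpy as np

rng = np.random.default_rng(4)
shared = rng.normal(size=(50, 1))
data = shared + 0.3 * rng.normal(size=(50, 3))   # hypothetical: three strongly correlated variables

z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
variable_corr = np.corrcoef(data, rowvar=False)   # correlations between the original variables
eigenvalues, eigenvectors = np.linalg.eigh(variable_corr)
scores = z @ eigenvectors                          # scores on the three principal components

score_corr = np.corrcoef(scores, rowvar=False)    # correlations between the scores
print(np.round(variable_corr, 2))
print(np.round(np.abs(score_corr), 2))
```

The first matrix has large off-diagonal values, while the second is the identity matrix to within rounding: the scores carry the same information as the original variables but with the correlations removed.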
Various groupings of factories emerge. In this case the computer program has labelled the points with the number of each factory. Otherwise we could identify which factory was which by reading the scores off the graph and checking back on the table of scores. Factories at the bottom of Figure 14.5(a) are most likely to be related to the textile industry. How far to the right they appear indicates the amount of inorganic substances relative to the amount of organic substances found.
Figure 14.5(b) suggests there are four groups of factories. Particularly concerning are those at the bottom with highly negative scores on PC3. This indicates a possible involvement with arms manufacture. These can be divided into those where relatively large amounts of inorganic chemicals were found (5, 6, 9 and 11) and those where relatively large amounts of organic chemicals were found (3, 8 and 15), which might tell us something about the types of weapons. We should note, however, that factory 5 was also among those quite likely to be involved with the textile trade, so we need to consider whether Subst11 has other uses too.
This is a fictitious example, so we cannot analyse the chemistry in too much detail. However, we can see that in a situation like this, PCA could be a useful tool to identify groups of individuals in a complex dataset, which would help us to plan a more detailed sampling strategy for further studies.
[Three scatter graphs with points labelled by factory number: (a) PC2 against PC1; (b) PC3 against PC1; (c) PC3 against PC2]
Figure 14.5 Scatter graphs of scores on principal components 1, 2 and 3. The numbers next to individual points are the numbers assigned to different factories in the original dataset
14.5 Further reading
In the example above we identified several scoring systems, such as PC1, which was interpreted as the degree of organic/inorganic pollution at the site. However, our main interest was in using these scoring systems to study differences between factories, and groupings of factories. If your main aim is to look for underlying factors (similar to the organic/inorganic scores in the example), that