lec set 1 data analysis
TRANSCRIPT
-
8/13/2019 Lec Set 1 Data Analysis
1/55
ES 670 Environmental Statistics
Data Analysis
-
Mean and Median
It is impossible to conduct chemical analysis free of errors or uncertainty. Data of unknown quality is worthless.
Thus, replicates (a set of measurements) are required instead of a single measurement.
The central value of the set is a more reliable measure than individual estimates. The mean or the median is used.
Variation in replicates provides a measure of uncertainty. Standard deviation, variance, and coefficient of variation are some ways of expressing it.
-
Mean and Median
Mean: $\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}$
Median: the middle result when replicate data are arranged in order from smallest to largest.
E.g. median of a set of 5 measurements: 9, 2, 7, 11, 14
Arranged in ascending order: 2, 7, 9, 11, 14
The median is 9. Its rank is $\frac{N+1}{2} = 3$.
If N is even, the median is the average of the two middle measurements.
The median is less sensitive to extreme values than the mean.
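As a minimal sketch of the point about sensitivity to extreme values, the slide's five measurements can be run through Python's standard `statistics` module. The second list, in which 14 is replaced by 140, is a hypothetical gross error added only for illustration.

```python
from statistics import mean, median

# Replicate measurements from the slide example
data = [9, 2, 7, 11, 14]
x_bar = mean(data)      # arithmetic mean of the replicates
x_med = median(data)    # middle value of the sorted replicates

# Hypothetical gross error: the largest reading is recorded as 140
with_outlier = [9, 2, 7, 11, 140]
x_bar_outlier = mean(with_outlier)
x_med_outlier = median(with_outlier)

print("mean:", x_bar, "median:", x_med)
print("with outlier -> mean:", x_bar_outlier, "median:", x_med_outlier)
```

The median stays at the same value in both lists, while the mean is pulled strongly toward the outlier.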
-
Mean and Median
For data that are symmetrically distributed about the mean, the mean and median are equal.
For a skewed distribution, the mean shifts in the direction of the skew.
-
Precision vs. Accuracy
Precision: describes the reproducibility of measurements and can be determined by repeating the experiment.
It is indicated by the standard deviation, variance, or coefficient of variation.
It relates to the deviation from the mean: $d_i = x_i - \bar{x}$
-
Precision vs. Accuracy
Accuracy: indicates the closeness of a measurement to its true or accepted value.
It is expressed by the error (absolute or relative).
It can never be determined exactly, since the true value is not known exactly.
Absolute error: $E = x_i - x_t$
Relative error: $E_r = \frac{x_i - x_t}{x_t} \times 100\%$
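The two error definitions can be sketched in a few lines. The measured value (19.8) and accepted value (20.0) below are hypothetical numbers chosen only to show the sign convention: a negative error means the result is low.

```python
def absolute_error(measured, true_value):
    """E = x_i - x_t, in the units of the measurement."""
    return measured - true_value

def relative_error_percent(measured, true_value):
    """E_r = (x_i - x_t) / x_t * 100, in percent."""
    return (measured - true_value) / true_value * 100.0

# Hypothetical example: a result of 19.8 against an accepted value of 20.0
e_abs = absolute_error(19.8, 20.0)
e_rel = relative_error_percent(19.8, 20.0)
print(f"absolute error = {e_abs:.2f}, relative error = {e_rel:.1f}%")
```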
-
Types of Errors in Experimental Data
Random / indeterminate errors: reflect precision.
Systematic / determinate errors: reflect a bias. They cause a series of measurements to be all high or all low.
Gross errors: produce outliers, which may be very high or very low.
-
Types of Errors in Experimental Data
Systematic errors: the cause can be assigned, and it affects all data in the same way.
Sources:
Instrument errors (e.g. glassware marking error)
Method errors (e.g. non-ideal behaviour of reagents)
Personal errors (e.g. error in detecting a colour change)
-
Precision Vs Accuracy in Measurements
[Figure: absolute error, $x_i - x_t$, in the micro-Kjeldahl determination of nitrogen for two compounds (1 and 2) by four different analysts. Analysts 1 and 2 analysed compound 1; analysts 3 and 4 analysed compound 2. The mean error of each analyst is marked as $\bar{x}_1 - x_t$ to $\bar{x}_4 - x_t$.]
-
Effect of Systematic Errors & their Detection
Constant error: the magnitude of the error does not depend on the size of the quantity measured. It is more serious when the quantity measured is small.
E.g. excess reagent required to cause a colour change in a titration
Proportional error: the error increases or decreases in proportion to the sample size.
E.g. presence of interfering compounds in the sample
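A short sketch of why a constant error matters more for small quantities. All numbers are hypothetical: a constant excess of 0.10 mL of titrant, a proportional error of 1% of the volume, and two titration volumes of 5 mL and 50 mL.

```python
def relative_error_percent(error, true_value):
    """Relative error in percent for a given absolute error."""
    return error / true_value * 100.0

constant_error_ml = 0.10      # hypothetical excess titrant, same for every titration
proportional_fraction = 0.01  # hypothetical error equal to 1% of the volume

results = {}
for volume_ml in (5.0, 50.0):
    rel_constant = relative_error_percent(constant_error_ml, volume_ml)
    rel_proportional = relative_error_percent(proportional_fraction * volume_ml, volume_ml)
    results[volume_ml] = (rel_constant, rel_proportional)
    print(f"{volume_ml:5.1f} mL: constant -> {rel_constant:.1f}%, "
          f"proportional -> {rel_proportional:.1f}%")
```

The relative effect of the constant error shrinks tenfold when the volume grows tenfold, whereas the proportional error stays at the same percentage.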
-
Effect of Systematic Error and its Detection
Systematic instrumental errors can be determined by calibration.
Personal errors can be minimized by self-discipline.
Bias in analytical methods can be minimized by using standard reference materials, or by using an alternative reliable analytical method.
Blank determination can reveal error due to interfering contaminants, e.g. a titration end-point correction can be made with blanks.
-
Random Errors in Analysis
Indeterminate: caused by uncontrollable variables. The variables that contribute to these errors cannot be identified or measured. They cause a random scatter in the data.
An empirical observation is that, for most experimental data, the distribution of replicates approaches that of a Gaussian curve (normal distribution).
Theoretically, this distribution arises from a large number of individual error components.
-
Calibration of a 10 mL Pipette
The exact volume of water delivered by a 10 mL pipette was measured gravimetrically 50 times. Mass was converted to volume using density values at the measured temperatures.
The data collected were rearranged to obtain a frequency distribution. The data series was divided into 0.003 mL groups, and the number and % of observations in each group were determined.
-
Calibration of a 10 mL pipette
The frequency distribution was plotted as a bar graph (histogram).
Range: 9.969 to 9.995 mL
Mean: 9.982 mL
Median: 9.982 mL
Spread: 0.025 mL
SD: 0.0056 mL
-
Calibration of a 10 mL pipette
As the number of measurements increases, the histogram approaches a continuous curve with a Gaussian (normal) distribution.
The Gaussian has the same mean and the same standard deviation (SD) as the histogram. Therefore it has the same precision and the same area under the curve as the histogram.
Sources of random uncertainty: usually reading the 10 mL mark, drainage time, angle of holding the pipette, temperature fluctuations affecting viscosity and the performance of the balance, and vibration or drafts affecting the balance.
-
Reproducibility / Repeatability
Both terms relate to precision.
An analyst makes 5 replicate measurements in quick succession using the same reagents and glassware. Such measurements reflect repeatability (within-run precision).
The same analyst takes the same readings on 5 different occasions. The data are then subject to differences in reagents, glassware, and lab conditions, and reflect reproducibility (between-run precision).
-
Reproducibility / Repeatability
Error estimates based on sequentially repeated observations may give a false sense of security regarding precision.
More emphasis needs to be put on reproducibility. It highlights the difference in observations when replicate experiments are performed in random sequence.
-
Normality / Randomness / Independence
Most statistical procedures are based on normality, randomness, and independence.
Normality: the measurement error comes from a normal distribution. Due to the central limit effect, many additive component errors lead to a normal-like distribution.
This is not a very restrictive criterion.
If errors are not normally distributed, transformations are available to make the errors normal-like.
Most tests are robust to deviations from normality.
-
Normality / Randomness / Independence
Random: observations are drawn from a population such that every element of the population has an equal chance of being drawn.
Randomization of sampling can ensure that observations are independent.
Example of non-randomness in measurements: if, in 10 replicate measurements, all early measurements are high compared to the late measurements, there is a non-randomness associated with the measurement.
-
Normality / Randomness / Independence
Randomness is indicated by a plot of measurement error vs. order of observation.
It is good to check for randomness with respect to each identifiable factor that can affect the measurement.
-
Normality / Randomness / Independence
Independence: implies that the simple multiplicative law of probability holds.
The probability of the joint occurrence of two events is given by the product of the probabilities of the individual occurrences.
Lack of independence can seriously distort the variance and the results of statistical tests.
-
Statistical treatment of Random Error
Sample vs. Population
Sample: the finite number of experimental observations.
Population: the infinite number of possible observations that could in principle be made given infinite time.
The statistical laws are derived on the basis of a population. When applied to smaller samples, these laws may need to be modified.
-
Population Mean $\mu$ & Sample Mean $\bar{x}$
If the number of observations is small, the two are not the same.
The population mean $\mu$ is the true mean of the population.
The sample mean $\bar{x}$ is an estimator of the population mean.
The difference between the two decreases as N increases.
If a measurement has no systematic error, $\mu$ equals the true value: $\mu = x_t$
-
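Population and Sample mean
The population standard deviation $\sigma$ (a measure of precision):
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
$z = \frac{x - \mu}{\sigma}$: deviation from the mean expressed in units of standard deviation
[Figure: normal error curves labelled A and B.]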
-
Characteristics of the normal error curve
The mean occurs at the central point of maximum frequency.
There is a symmetrical distribution of positive and negative deviations about the mean.
There is an exponential decrease in frequency as the magnitude of the deviations increases, i.e. small random uncertainties are observed more often than larger ones.
Areas under a Gaussian curve
68.3% of the area lies within $\pm 1\sigma$ of the mean
95.4% of the area lies within $\pm 2\sigma$ of the mean
99.7% of the area lies within $\pm 3\sigma$ of the mean
Therefore the standard deviation is a very useful prediction tool.
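The areas quoted for the Gaussian curve can be checked numerically. The sketch below uses the standard relation between the normal distribution and the error function, area within $\pm k\sigma$ = erf$(k/\sqrt{2})$; the function name `area_within` is only illustrative.

```python
import math

def area_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

# Percentage of the area within 1, 2 and 3 standard deviations
areas = {k: round(100 * area_within(k), 1) for k in (1, 2, 3)}
for k, pct in areas.items():
    print(f"within +/- {k} sigma: {pct}%")
```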
-
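The Sample Standard Deviation
It applies to small data sets.
$s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}$
$\bar{x}$ replaces $\mu$.
Denominator = N - 1 = degrees of freedom.
If the denominator were N, s would be less than $\sigma$. Using N - 1 therefore prevents a negative bias.
Simplified formula:
$s = \sqrt{\frac{\sum_{i=1}^{N} x_i^2 - \frac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}}{N - 1}}$
Use of the simplified formula may give rise to large round-off errors.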
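The round-off warning can be illustrated with a small sketch that compares the definition of the variance (deviations from the mean) with the simplified sum-of-squares form. The data are hypothetical: three large values that differ only in the first decimal place, so the true variance is 0.01. The square root is deliberately not taken, because the simplified form can even come out negative in floating-point arithmetic.

```python
def variance_two_pass(xs):
    """Sample variance from deviations about the mean (the defining formula)."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

def variance_simplified(xs):
    """Sample variance from the simplified formula using sums of x and x squared."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

# Hypothetical readings: large magnitude, small spread (true variance = 0.01)
data = [100000000.1, 100000000.2, 100000000.3]

var_defining = variance_two_pass(data)
var_simplified = variance_simplified(data)
print("defining formula:  ", var_defining)
print("simplified formula:", var_simplified)
```

The simplified form subtracts two nearly equal numbers of order $10^{16}$, so the small difference that carries the variance is lost; the defining formula does not have this problem.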
-
Alternative Measures of Precision
Variance: $s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}$
Relative standard deviation: $RSD = \frac{s}{\bar{x}} \times 100\%$
Coefficient of variation: $CV = \frac{s}{\bar{x}} \times 100\%$
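A brief sketch of these measures using Python's `statistics` module. The three readings are the contaminant concentrations (in %) used in the confidence-limit example of these slides; the variable names are illustrative.

```python
from statistics import mean, stdev, variance

readings = [0.084, 0.089, 0.079]   # contaminant concentration, %

s2 = variance(readings)            # sample variance (N - 1 in the denominator)
s = stdev(readings)                # sample standard deviation
cv_percent = s / mean(readings) * 100.0   # coefficient of variation, %

print(f"variance = {s2:.2e}, s = {s:.4f}, CV = {cv_percent:.1f}%")
```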
-
Alternative Measures of Precision
Spread or range: the difference between the largest and smallest value in a set of replicates.
Standard deviation of computed results: obtained by propagation of errors.
E.g. $y = c_1 x + c_2$, where y is computed and x is measured. What is $s_y$?
To obtain s for y we would need to know $s_x$, $s_{c_1}$, $s_{c_2}$, i.e. the SD of each of the variables.
-
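Error Propagation Formulas
Type of calculation | Example | Std. dev. of y
Addition or subtraction | $y = a + b - c$ | $s_y = \sqrt{s_a^2 + s_b^2 + s_c^2}$
Multiplication or division | $y = \frac{a \times b}{c}$ | $\frac{s_y}{y} = \sqrt{\left(\frac{s_a}{a}\right)^2 + \left(\frac{s_b}{b}\right)^2 + \left(\frac{s_c}{c}\right)^2}$
Exponentiation | $y = a^x$ | $\frac{s_y}{y} = x \frac{s_a}{a}$
Logarithm | $y = \log_{10} a$ | $s_y = 0.434 \frac{s_a}{a}$
Antilogarithm | $y = \mathrm{antilog}_{10}\, a$ | $\frac{s_y}{y} = 2.303\, s_a$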
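The first two rows of the propagation rules can be sketched as small helper functions. The values and standard deviations of a, b and c below are hypothetical, and the function names are illustrative only.

```python
import math

def sd_of_sum(*sds):
    """Absolute SD of y = a + b - c: root sum of squares of the absolute SDs."""
    return math.sqrt(sum(s ** 2 for s in sds))

def relative_sd_of_product(*value_sd_pairs):
    """Relative SD (s_y / y) of y = a * b / c: root sum of squares of the relative SDs."""
    return math.sqrt(sum((sd / value) ** 2 for value, sd in value_sd_pairs))

# Hypothetical quantities as (value, standard deviation)
a, b, c = (0.50, 0.02), (4.10, 0.03), (1.97, 0.05)

# Addition / subtraction: y = a + b - c
y_sum = a[0] + b[0] - c[0]
s_sum = sd_of_sum(a[1], b[1], c[1])

# Multiplication / division: y = a * b / c
y_prod = a[0] * b[0] / c[0]
rel_prod = relative_sd_of_product(a, b, c)
s_prod = rel_prod * y_prod

print(f"sum:     y = {y_sum:.2f}, s_y = {s_sum:.3f}")
print(f"product: y = {y_prod:.3f}, s_y = {s_prod:.3f}")
```

For sums the absolute standard deviations combine, whereas for products and quotients the relative standard deviations combine, so the largest contributor is different in the two cases.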
-
Calibration curve
y is plotted as a function of known x for a series of standards.
x: independent variable
y: dependent variable
The best-fit line is obtained by regression analysis using the method of least squares.
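A minimal least-squares sketch for a straight-line calibration, written out from the usual formulas slope = $S_{xy}/S_{xx}$ and intercept = $\bar{y} - \text{slope} \times \bar{x}$. The standards and instrument signals are hypothetical, as is the unknown's signal of 0.50.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical calibration standards: concentration (x) and instrument signal (y)
conc = [0.0, 2.0, 4.0, 6.0, 8.0]
signal = [0.01, 0.21, 0.40, 0.61, 0.80]

slope, intercept = least_squares_line(conc, signal)

# Use the fitted line in reverse to estimate the concentration of an unknown
unknown_signal = 0.50
unknown_conc = (unknown_signal - intercept) / slope
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}, unknown = {unknown_conc:.2f}")
```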
-
Standard Error of a Mean
The standard deviation s refers to the probable error for a single measurement.
Now consider the distribution of a set of mean values $\bar{x}_1, \bar{x}_2, \ldots$, each computed from N measurements: less scatter is observed as N increases.
The standard deviation of the mean is called the standard error, $s_m$:
$s_m = \frac{s}{\sqrt{N}}$
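As a quick sketch, the standard error can be computed for the pipette calibration quoted earlier in these slides (SD of 0.0056 mL from 50 measurements). The function name is illustrative.

```python
import math

def standard_error(s, n):
    """Standard error of the mean: s_m = s / sqrt(N)."""
    return s / math.sqrt(n)

# Pipette calibration from the slides: s = 0.0056 mL, N = 50
s_m = standard_error(0.0056, 50)
print(f"s_m = {s_m:.5f} mL")

# Quadrupling the number of measurements halves the standard error
ratio = standard_error(0.0056, 200) / s_m
print(f"ratio for 4x more data = {ratio:.2f}")
```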
-
Reliability of s as a measure of precision
The reliability of s increases as N increases.
s can be determined a priori using a large number of replicates, particularly for simple measurements, e.g. pH measurement or chromatographic measurement.
For more complicated experiments, data from a series of samples accumulated over time can be used to get a pooled estimate of s.
This is a better estimate than that from a single subset.
-
Pooled Std Deviation
$s_{pooled} = \sqrt{\frac{\sum_{i=1}^{N_1} (x_i - \bar{x}_1)^2 + \sum_{i=1}^{N_2} (x_i - \bar{x}_2)^2 + \sum_{i=1}^{N_3} (x_i - \bar{x}_3)^2 + \cdots}{N_1 + N_2 + N_3 + N_4 + \cdots + N_t - t}}$
$N_1$ = number of data in set 1
t = number of data sets
It assumes the same source of random errors in all subsets.
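The pooled estimate can be sketched as follows: the squared deviations of each subset are taken about that subset's own mean, summed over all subsets, and divided by the total number of data minus the number of subsets. The three subsets below are hypothetical readings.

```python
import math
from statistics import mean

def pooled_sd(data_sets):
    """Pooled standard deviation over several subsets of replicate data."""
    sum_sq = 0.0
    n_total = 0
    for xs in data_sets:
        x_bar = mean(xs)                          # each subset uses its own mean
        sum_sq += sum((x - x_bar) ** 2 for x in xs)
        n_total += len(xs)
    degrees_of_freedom = n_total - len(data_sets)  # one d.f. lost per subset
    return math.sqrt(sum_sq / degrees_of_freedom)

# Hypothetical replicate sets collected on three occasions
sets = [
    [10.1, 10.3, 10.2],
    [9.8, 10.0],
    [10.4, 10.6, 10.5, 10.3],
]
s_pooled = pooled_sd(sets)
print(f"pooled s = {s_pooled:.3f}")
```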
-
Example: Hardness Measurement
Measurement conducted by 13 students using the EDTA titrimetric method.
To draw a frequency distribution of the deviation from the true value on a scale of frequency versus z.
Is the distribution a normal distribution?
Are there any outliers?
What are the assumptions?
Is there any bias in the measurement?
-
Analysis of Hardness Measurement
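Sr No. | True conc. $h_t$ (mg/L) | Measured conc. $h$ (mg/L) | Error $h - h_t$ | Bin lower limit | Bin upper limit | Mid-point $x$ | $z$ = (x - mean)/sd | Relative freq.
1 | 110 | 116 | 6.00 | -12 | -10.1 | -11.05 | -3.53 | 0.000
2 | 90 | 95 | 5.00 | -10 | -8.1 | -9.05 | -2.98 | 0.000
3 | 80 | 86 | 6.00 | -8 | -6.1 | -7.05 | -2.43 | 0.000
4 | 80 | 82.64 | 2.64 | -6 | -4.1 | -5.05 | -1.88 | 0.081
5 | 100 | 105.32 | 5.32 | -4 | -2.1 | -3.05 | -1.32 | 0.081
6 | 100 | 97.2 | -2.80 | -2 | -0.1 | -1.05 | -0.77 | 0.000
7 | 100 | 100 | 0.00 | 0 | 1.9 | 0.95 | -0.22 | 0.314
8 | 100 | 100 | 0.00 | 2 | 3.9 | 2.95 | 0.34 | 0.152
9 | 100 | 100 | 0.00 | 4 | 5.9 | 4.95 | 0.89 | 0.233
10 | 60 | 64 | 4.00 | 6 | 7.9 | 6.95 | 1.44 | 0.152
11 | 110 | 112 | 2.00 | 8 | 9.9 | 8.95 | 1.99 | 0.000
12 | 90 | 90.4 | 0.40 | 10 | 11.9 | 10.95 | 2.55 | 0.000
13 | 90 | 84 | -6.00 | 12 | 13.9 | 12.95 | 3.10 | 0.000
No. of observations: 13
Mean ($h - h_t$): 1.74
Std. dev. ($h - h_t$): 3.62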
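The summary statistics of the hardness example can be reproduced from the 13 pairs of true and measured concentrations in the table. This is a sketch using the `statistics` module; the z scores here are computed for the individual errors, which is an illustrative extension of the bin mid-point z values in the table.

```python
from statistics import mean, stdev

# True and measured hardness (mg/L) for the 13 students, from the table
true_conc = [110, 90, 80, 80, 100, 100, 100, 100, 100, 60, 110, 90, 90]
measured = [116, 95, 86, 82.64, 105.32, 97.2, 100, 100, 100, 64, 112, 90.4, 84]

errors = [h - ht for h, ht in zip(measured, true_conc)]   # h - h_t
err_mean = mean(errors)
err_sd = stdev(errors)      # sample standard deviation (N - 1)

# Each error expressed in units of standard deviation
z_scores = [(e - err_mean) / err_sd for e in errors]

print(f"mean error = {err_mean:.2f} mg/L, sd = {err_sd:.2f} mg/L")
for sr_no, z in enumerate(z_scores, start=1):
    print(f"student {sr_no:2d}: z = {z:+.2f}")
```

A positive mean error suggests a possible bias toward high results, which is one of the questions the example asks.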
-
Relative Frequency Distribution
[Figure: relative frequency (0.0 to 0.4) of the error in hardness measurement plotted against z (-4 to 4).]
-
Student's t-Distribution
W.S. Gosset experimentally determined the Student's t distribution in 1908.
For a distribution of sample means, the definition of z for large samples is
$z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}$
If s is substituted for $\sigma$ in z, the resulting quantity is the t statistic:
$t = \frac{\bar{x} - \mu}{s_{\bar{x}}}$
-
Characteristics of t-Distribution
Mound shaped; symmetrical about t = 0.
It is more variable than z.
z varies only due to $\bar{x}$.
The variability in t is due to two independent random quantities: $\bar{x}$ and s.
The t-distribution depends on the sample size, n.
The variability in t decreases as n increases, since s approaches $\sigma$. When $n = \infty$, t = z.
d.f. = n - 1
Degrees of freedom = number of squared deviations available for estimating $\sigma^2$
-
t-Distribution
-
Confidence Limits
Confidence limits define a confidence interval: a region around the experimentally determined mean within which the population mean lies with a given degree of probability.
The size of the interval
is derived from the sample standard deviation (s)
is also affected by how closely s (sample std. dev.) approaches $\sigma$ (population std. dev.)
As s approaches $\sigma$ (as N increases), the confidence limits get narrower.
-
Confidence Interval: Large Sample Size
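Definition of the z statistic: $z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}$, where $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}$
Confidence limit when s is a good approximation of $\sigma$: $\mu = \bar{x} \pm \frac{z\sigma}{\sqrt{N}}$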
-
Confidence Interval and Sample Size
No. of measurements | Relative size of confidence interval
1 | 1.0
2 | 0.71
3 | 0.58
4 | 0.5
5 | 0.45
6 | 0.41
10 | 0.32
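The relative sizes in this table follow from the $1/\sqrt{N}$ dependence of the confidence interval when $\sigma$ is known. A short sketch that regenerates them (the function name is illustrative):

```python
import math

def relative_ci_size(n):
    """Width of the confidence interval for N measurements relative to N = 1."""
    return 1.0 / math.sqrt(n)

sizes = {n: round(relative_ci_size(n), 2) for n in (1, 2, 3, 4, 5, 6, 10)}
for n, size in sizes.items():
    print(f"N = {n:2d}: relative size = {size}")
```

Going from 1 to 4 measurements halves the interval, but a further halving needs 16 measurements, so the benefit of extra replicates diminishes quickly.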
-
Confidence Interval: Small Sample Size
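Confidence limit when $\sigma$ is not known:
CL for $\mu = \bar{x} \pm \frac{ts}{\sqrt{N}}$
Definition of the t statistic: $t = \frac{\bar{x} - \mu}{s_{\bar{x}}}$, where $s_{\bar{x}} = \frac{s}{\sqrt{N}}$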
-
t-Statistics
t values are available in tabulated form.
t > z
If s is based on 3 measurements, d.f. = 2. The value of t for a 95% CI is 4.3, compared to a z value of 1.96.
t values depend on the degrees of freedom (d.f.) as well as on the confidence level.
$t \to z$ as d.f. $\to \infty$
-
Depiction of Confidence Interval based on Normal Error Curve
[Figure: normal error curve. y axis: relative frequency; x axis: $z = \frac{x - \mu}{\sigma}$, with z = 0.67, 1.29 and 1.64 marked.]
95 times out of 100 the true mean will be within $\pm 1.96\sigma$.
-
Confidence levels for various values of z
Confidence level (%) | z
50 | 0.67
68 | 1.00
80 | 1.29
90 | 1.64
95 | 1.96
96 | 2.00
99 | 2.58
99.7 | 3.00
99.9 | 3.29
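Two-sided z values such as those in this table can be obtained from the inverse of the normal cumulative distribution. The sketch below uses `statistics.NormalDist` (Python 3.8 or later); the helper name is illustrative. Tabulated values are rounded, so a computed value may differ from a table entry in the last digit.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """Two-sided z value for a confidence level given as a fraction, e.g. 0.95."""
    # Half of the excluded probability lies in each tail of the curve
    return NormalDist().inv_cdf((1.0 + level) / 2.0)

z_values = {pct: round(z_for_confidence(pct / 100.0), 2) for pct in (50, 90, 95, 99)}
for pct, z in z_values.items():
    print(f"{pct}% confidence: z = {z}")
```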
-
Confidence Limits based on t & z statistics
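Conc. of a contaminant in water (expressed in %): 0.084, 0.089, 0.079
To determine the 95% CL when no additional knowledge of the precision is available:
$\sum x_i = 0.252$, $\sum x_i^2 = 0.021218$, $\bar{x} = \frac{\sum x_i}{3} = 0.084$
$s = \sqrt{\frac{\sum x_i^2 - \frac{(\sum x_i)^2}{N}}{N - 1}} = \sqrt{\frac{0.021218 - \frac{(0.252)^2}{3}}{2}} = 0.005\%$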
Confidence Limits based on t & z statistics
t = 4.3 for d.f. = 2 and 95% confidence
$95\%\ CL = \bar{x} \pm \frac{ts}{\sqrt{N}} = 0.084 \pm \frac{4.3 \times 0.005}{\sqrt{3}} = 0.084 \pm 0.012\%$
-
Confidence Limits based on t & z statistics
To determine the 95% CL if it is known from previous experiments that $\sigma = 0.005\%$:
Now the z statistic can be used.
z = 1.96 for 95% confidence
$95\%\ CL = \bar{x} \pm \frac{z\sigma}{\sqrt{N}} = 0.084 \pm \frac{1.96 \times 0.005}{\sqrt{3}} = 0.084 \pm 0.006\%$
A sure knowledge of $\sigma$ decreases the confidence interval significantly.
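The two worked confidence limits for the contaminant data can be retraced in code. This is a sketch: the t value of 4.3 (d.f. = 2, 95%) and z = 1.96 are taken from the slides rather than computed, and $\sigma$ = 0.005% is the value assumed known from previous experiments.

```python
import math
from statistics import mean, stdev

readings = [0.084, 0.089, 0.079]   # contaminant concentration, %
n = len(readings)
x_bar = mean(readings)
s = stdev(readings)                # sample standard deviation, d.f. = n - 1

# Case 1: no prior knowledge of precision, so use s and the tabulated t value
t_95 = 4.3                         # t for d.f. = 2 at 95% confidence (from the slides)
half_width_t = t_95 * s / math.sqrt(n)

# Case 2: sigma known from previous experiments, so use z
z_95 = 1.96
sigma = 0.005
half_width_z = z_95 * sigma / math.sqrt(n)

print(f"t-based 95% CL: {x_bar:.3f} +/- {half_width_t:.3f} %")
print(f"z-based 95% CL: {x_bar:.3f} +/- {half_width_z:.3f} %")
```

With only three measurements, knowing $\sigma$ in advance roughly halves the width of the interval.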
-
Quality Assurance & Control
There must be unequivocal evidence to prove that the data from chemical measurements are reliable. Quality assurance studies provide such evidence.
Quality assessment involves evaluation of the accuracy and precision of methods of measurement.
E.g. instruments need to be calibrated frequently with standard samples to ensure accuracy and precision.
Quality assurance of manufactured products is also very important, e.g. fluoride levels in toothpaste are regulated.
Control charts can be used to monitor quality.
-
Quality Assurance & Control: Example
The accuracy and precision of a balance can be monitored by periodically weighing standard weights.
Determine whether measurements made on subsequent days are within certain limits of the standard.
Upper and lower control limits: $UCL = \mu + \frac{3\sigma}{\sqrt{N}}$, $LCL = \mu - \frac{3\sigma}{\sqrt{N}}$
$\mu$ = population mean
$\sigma$ = population standard deviation
For a normal error curve, the measurements are expected to lie in this range 99.7% of the time.
-
Quality Assurance & Control: Example
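[Figure: control chart of the mass of the standard weight (about 20 g) against sample day (5 to 20), with UCL and LCL marked.]
The balance is almost out of control on day 17.
$\mu$ = 20.000 g
$\sigma$ = 0.00012 g
For the mean of 5 measurements: $\sigma_{\bar{x}} = \frac{0.00012}{\sqrt{5}}$
$\frac{3\sigma}{\sqrt{N}} = 0.00016$ g
UCL = 20.00016 g
LCL = 19.99984 g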
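The control limits for the balance example can be computed directly from $\mu$, $\sigma$ and N as quoted in the slides. A minimal sketch, with an illustrative function name and a hypothetical daily mean used to show the check:

```python
import math

def control_limits(mu, sigma, n):
    """Return (LCL, UCL) = mu -/+ 3 * sigma / sqrt(N) for means of N measurements."""
    half_width = 3.0 * sigma / math.sqrt(n)
    return mu - half_width, mu + half_width

# Balance example: mu = 20.000 g, sigma = 0.00012 g, means of 5 weighings
lcl, ucl = control_limits(20.000, 0.00012, 5)
print(f"LCL = {lcl:.5f} g, UCL = {ucl:.5f} g")

# Hypothetical daily mean checked against the limits
daily_mean = 20.00015
in_control = lcl <= daily_mean <= ucl
print("in control" if in_control else "out of control")
```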
-
The Q Test for detection of Gross Errors
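A rationale for excluding outlying results that differ excessively from the average.
$Q_{exp} = \frac{|x_q - x_n|}{w}$, where $x_q$ is the questionable result, $x_n$ is its nearest neighbour, and w is the spread of the data set.
$Q_{exp}$ is compared with $Q_{critical}$.
If $Q_{exp} > Q_{critical}$, the questionable result can be rejected with the specified confidence level.
[Figure: six results $x_1$ to $x_6$ on a number line, with $x_6$ questionable: $d = x_6 - x_5$, $w = x_6 - x_1$.]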
-
Qcrit @ Specified Confidence Level
No. of observations | 90% | 95% | 99%
3 | 0.941 | 0.970 | 0.994
4 | 0.765 | 0.829 | 0.926
5 | 0.642 | 0.710 | 0.821
6 | 0.560 | 0.625 | 0.740
7 | 0.507 | 0.568 | 0.680
8 | 0.468 | 0.526 | 0.634
9 | 0.437 | 0.493 | 0.598
10 | 0.412 | 0.466 | 0.568
Assumption: the distribution of the population data is normal.
A cautious approach to the rejection of outliers is wise.
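A sketch of the Q test using the critical values tabulated above. The five replicate results are hypothetical, and the function tests only the more extreme of the two end values, which is an assumption about how the questionable result is chosen. It expects 3 to 10 results that are not all identical.

```python
# Critical Q values: number of observations -> (90%, 95%, 99%)
Q_CRIT = {
    3: (0.941, 0.970, 0.994),
    4: (0.765, 0.829, 0.926),
    5: (0.642, 0.710, 0.821),
    6: (0.560, 0.625, 0.740),
    7: (0.507, 0.568, 0.680),
    8: (0.468, 0.526, 0.634),
    9: (0.437, 0.493, 0.598),
    10: (0.412, 0.466, 0.568),
}
CONFIDENCE_COLUMN = {90: 0, 95: 1, 99: 2}

def q_test(values, confidence=95):
    """Return (suspect, Q_exp, reject) for the more extreme end value."""
    xs = sorted(values)
    w = xs[-1] - xs[0]                  # spread of the whole set
    gap_low = xs[1] - xs[0]             # gap between smallest value and its neighbour
    gap_high = xs[-1] - xs[-2]          # gap between largest value and its neighbour
    if gap_high >= gap_low:
        suspect, d = xs[-1], gap_high
    else:
        suspect, d = xs[0], gap_low
    q_exp = d / w
    q_crit = Q_CRIT[len(xs)][CONFIDENCE_COLUMN[confidence]]
    return suspect, q_exp, q_exp > q_crit

# Hypothetical replicate results with one suspiciously high value
results = [55.95, 56.00, 56.04, 56.08, 56.23]
suspect, q_exp, reject = q_test(results, confidence=95)
print(f"suspect = {suspect}, Q_exp = {q_exp:.2f}, reject = {reject}")
```

In this hypothetical set the suspect value is retained at 95% confidence, which is the situation where reporting the median may be considered.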
-
Recommendation for treatment of Outliers
Re-examine all data and observations relating to the outlying result. Maintain a lab notebook with all observations and data.
Estimate the precision of the procedure to ensure that the outlying result is actually questionable.
Repeat the analysis. Check for agreement between the new data and the original set.
Apply the Q test to decide if the data should be retained or rejected on statistical grounds.
If the Q test indicates retention, consider reporting the median instead of the mean.