8/13/2019 Lec Set 1 Data Analysis

1/55

ES 670 Environmental Statistics

Data Analysis


2/55

Mean and Median

It is impossible to conduct a chemical analysis free of errors / uncertainty. Data of unknown quality is worthless.

Thus replicates, i.e. a set of measurements, are required instead of a single measurement.

The central value of the set is a more reliable measure than individual estimates. The mean or the median is used.

Variation among replicates provides a measure of uncertainty. Standard deviation, variance and coefficient of variation are some ways of expressing it.


3/55

Mean and Median

Mean: $\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}$

Median: Middle result when replicate data are arranged in order from smallest to largest.

Eg. Median in a set of 5 measurements: 9, 2, 7, 11, 14. Arrange in ascending order: 2, 7, 9, 11, 14

Median is 9. Its rank is $\frac{N+1}{2} = 3$

If N is even: median is the average of the two middle measurements

The median is less sensitive to extreme values than the mean.
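A minimal Python sketch of the example above, using the standard-library `statistics` module. The list `with_outlier` is a hypothetical variation (14 replaced by 140) added only to illustrate how the two central values react to an extreme result.

```python
from statistics import mean, median

# The five replicate measurements from the slide.
replicates = [9, 2, 7, 11, 14]

print(sorted(replicates))   # [2, 7, 9, 11, 14]
print(median(replicates))   # 9   (rank (N + 1) / 2 = 3 in the sorted list)
print(mean(replicates))     # 8.6

# Hypothetical gross error: 14 is mis-recorded as 140.
with_outlier = [9, 2, 7, 11, 140]
print(median(with_outlier))  # 9    (unchanged)
print(mean(with_outlier))    # 33.8 (pulled towards the extreme value)

# Even N: the median is the average of the two middle values.
print(median([2, 7, 9, 11]))  # 8.0
```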


4/55

Mean and Median

For data that is symmetrically distributed about the mean, the mean and median are equal.

For a skewed distribution, the mean shifts in the direction of the skew.


5/55

Precision vs. Accuracy

Precision: Describes the reproducibility of measurements and can be determined by repeating the experiment

It is indicated by standard deviation, variance, coefficient of variation

It relates to the deviation from the mean: $d_i = |x_i - \bar{x}|$


6/55

Precision vs. Accuracy

Accuracy: Indicates closeness of the measurement to its true/accepted value

It is expressed by the error (absolute/relative)

It can never be determined exactly since the true value is not known exactly

Absolute Error: $E = x_i - x_t$

Relative Error: $E_r = \frac{x_i - x_t}{x_t} \times 100\%$
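The two error definitions can be sketched in Python as below. The function names and the numbers in the usage lines (a measured 19.8 against an accepted 20.0) are hypothetical, chosen only for illustration.

```python
def absolute_error(x_i, x_t):
    """E = x_i - x_t; the sign shows whether the result is high or low."""
    return x_i - x_t


def relative_error_percent(x_i, x_t):
    """E_r = (x_i - x_t) / x_t * 100 %."""
    return (x_i - x_t) / x_t * 100.0


# Hypothetical measurement of 19.8 against an accepted value of 20.0.
print(round(absolute_error(19.8, 20.0), 3))          # -0.2
print(round(relative_error_percent(19.8, 20.0), 3))  # -1.0
```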


7/55

Types of Errors in Experimental Data

Random / Indeterminate Errors: reflect precision

Systematic / Determinate Errors: reflect a bias. They cause a series of measurements to be all high / all low.

Gross Error: an outlier; may be very high or very low.


8/55

Types of Errors in Experimental Data

Systematic errors: the cause may be assigned, and it affects all data in the same way.

Sources:

Instrument errors (eg. glassware marking error)

Method errors (eg. non-ideal behaviour of reagents)

Personal errors (eg. error in detecting colour change)


9/55

Precision Vs Accuracy in Measurements

[Figure: Absolute error $(x_i - x_t)$ in the micro-Kjeldahl determination of nitrogen for two compounds (1 and 2) by 4 different analysts. Rows: Analyst 1 (Cmpd 1), Analyst 2 (Cmpd 1), Analyst 3 (Cmpd 2), Analyst 4 (Cmpd 2); each row marks the analyst's mean error $(\bar{x}_1 - x_t)$ to $(\bar{x}_4 - x_t)$.]


10/55

Effect of Systematic Errors & its

Detection

Constant Error: Magnitude of the error does not depend on the size of the quantity measured; it is more serious when the quantity measured is small

Eg. Excess reagent required to cause a colour change in titration

Proportional Error: Errors increase/decrease in proportion to sample size

Eg. Presence of interfering compounds in the sample


11/55

Effect of Systematic Error and its

detection

Systematic instrumental errors can be determined by calibration

Personal errors can be minimized by self discipline

Bias in analytical methods can be minimized by using standard reference materials, or by using an alternative reliable analytical method

Blank determination can reveal error due to interfering contaminants, eg. titration end point correction can be done with blanks


12/55

Random Errors in Analysis

Indeterminate: caused by uncontrollable variables. The variables that contribute to these errors cannot be identified or measured. They cause a random scatter in the data

An empirical observation is that for most experimental data the distribution of replicates approaches that of a Gaussian curve / Normal Distribution

Theoretically, this distribution occurs due to a large number of individual error components


13/55

Calibration of a 10 mL Pipette

The exact volume of water delivered by a 10 mL pipette was measured gravimetrically 50 times. Mass was converted to volume using density values at the measured temperatures

The data collected was rearranged in order to obtain a frequency distribution. The data series was divided into 0.003 mL groups, and the number and % of observations in each group was determined


14/55

Calibration of a 10 mL pipette

The frequency distribution was plotted as a bar graph (histogram)

Range: 9.969 to 9.995 mL

Mean: 9.982 mL

Median: 9.982 mL

Spread: 0.025 mL

SD: 0.0056 mL


15/55

Calibration of a 10 mL pipette

As the number of measurements increases, the histogram would approach a continuous curve which has a Gaussian/Normal distribution

The Gaussian would have the same mean and the same std. deviation (SD). Therefore, it has the same precision & the same area under the curve as the histogram

Sources of random uncertainties: visual reading of the 10 mL mark, drainage time / angle of holding the pipette, temperature fluctuations affecting viscosity & performance of the balance, vibration / draft affecting the balance


16/55

Reproducibility / Repeatability

Both terms relate to precision

An analyst makes 5 replicate measurements in quick succession using the same reagents & glassware. Such measurements reflect repeatability = within-run precision

The same analyst takes the same readings on 5 different occasions. The data would be subject to differences in reagents, glassware and lab conditions; it would now reflect reproducibility = between-run precision


17/55

Reproducibility / Repeatability

Error estimates based on sequentially repeated observations may give a false sense of security regarding precision

More emphasis needs to be put on reproducibility. It highlights the difference in observations when replicate experiments are performed in random sequence


18/55

Normality / Randomness / Independence

Most statistical procedures are based on Normality, Randomness & Independence

Normality: The measurement error comes from a normal distribution. Due to the central limit effect, many additive component errors lead to a normal-like distribution

This is not a very restrictive criterion

If errors are not normally distributed, transformations are available to make the errors normal-like

Most tests are robust to deviations from normality


19/55


Random: Observations are drawn from a population such that every element of the population has an equal chance of being drawn

Randomization of sampling can ensure that observations are independent

Example of non-randomness in measurements: if in 10 replicate measurements all early-time measurements are high compared to late-time measurements, there is a non-randomness associated with the measurement


20/55


Randomness is indicated by a plot of measurement error vs. order of observation

It is good to check for randomness with respect to each identifiable factor that can affect the measurement


21/55


Independence: It implies that the simple multiplicative law of probability works

The probability of joint occurrence of two events is given by the product of the probabilities of the individual occurrences: $P(A \text{ and } B) = P(A)\,P(B)$

Lack of independence can seriously distort the variance & the results of statistical tests


22/55

Statistical treatment of Random Error

Sample vs. Population

Sample: The finite number of experimental observations

Population: The infinite number of possible observations that could in principle be made given infinite time

The statistical laws are derived on the basis of a population; when applied to smaller samples, these laws may need to be modified


23/55

Population Mean $\mu$ & sample mean $\bar{x}$

If the no. of observations is small, the two are not the same

The population mean $\mu$ is the true mean of the population

The sample mean $\bar{x}$ is an estimator of the population mean

If a measurement has no systematic error, $\mu$ = the true value: $\mu = x_t$

The difference between the two decreases as N increases


24/55

Population and Sample mean

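[Figure: normal error curves for two populations A and B, with abscissa $z = \frac{x - \mu}{\sigma}$]

The population standard deviation $\sigma$ (a measure of precision): $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$

$z = \frac{x - \mu}{\sigma}$: deviation from the mean expressed in units of standard deviation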


25/55

Characteristics of the normal error curve

The mean occurs at the central point of maximum frequency

There is a symmetrical distribution of positive & negative deviations about the mean

There is an exponential decrease in frequency as the magnitude of the deviations increases, i.e., small random uncertainties are observed more often than larger ones

Areas under a Gaussian curve

68.3% of the area lies within $\pm 1\sigma$ of the mean

95.4% of the area lies within $\pm 2\sigma$ of the mean

99.7% of the area lies within $\pm 3\sigma$ of the mean

Therefore the standard deviation is a very useful prediction tool


26/55

The Sample Standard Deviation

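It applies to small data sets

$\bar{x}$ replaces $\mu$

Denominator = N-1 = degrees of freedom

If the denominator were N, s would on average be less than $\sigma$. Using N-1 therefore prevents a negative bias

$s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N-1}}$

Simplified formula: $s = \sqrt{\frac{\sum_{i=1}^{N} x_i^2 - \frac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}}{N-1}}$

Use of the simplified formula may give rise to large round-off errors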
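The round-off warning can be illustrated with a short Python sketch. Both function names and both data sets are assumptions made for this illustration: `small` reuses the five values from the median example, and `large` is a hypothetical set with a big common offset, where the simplified formula subtracts two nearly equal numbers of the order of 4e18.

```python
import math


def s_two_pass(data):
    """Definition: deviations from the mean, divided by N - 1."""
    n = len(data)
    x_bar = sum(data) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in data) / (n - 1))


def s_simplified(data):
    """Simplified formula: (sum of x^2 - (sum of x)^2 / N) / (N - 1)."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(x * x for x in data)
    numerator = sum_x2 - sum_x ** 2 / n
    # Round-off can even push the numerator below zero; clamp it so that
    # the square root is always defined.
    return math.sqrt(max(numerator, 0.0) / (n - 1))


small = [2.0, 7.0, 9.0, 11.0, 14.0]
large = [1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0]  # hypothetical data

print(s_two_pass(small), s_simplified(small))  # the two agree closely
print(s_two_pass(large), s_simplified(large))  # the simplified form drifts
```

For `large` the deviations from the mean are exactly -6, -3, 3 and 6, so the two-pass result is $\sqrt{30} \approx 5.48$; the simplified formula cannot resolve a difference of 90 between numbers of that size in double precision.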


27/55

Alternative Measures of Precision

Variance: $s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N-1}$

Relative Standard Deviation: $RSD = \frac{s}{\bar{x}} \times 100\%$

Coefficient of variation: $CV = \frac{s}{\bar{x}} \times 100\%$
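A short Python sketch of these measures, using the standard-library `statistics` module. The data are the three contaminant results (in %) that this lecture uses in its confidence-limit example; the variable names are assumptions.

```python
from statistics import mean, stdev, variance

data = [0.084, 0.089, 0.079]  # replicate results, in %

s = stdev(data)             # sample standard deviation (N - 1 denominator)
s2 = variance(data)         # sample variance, s squared
cv = s / mean(data) * 100   # coefficient of variation, in % of the mean

print(round(s, 4))    # 0.005
print(round(cv, 2))   # 5.95
```

Note that `statistics.stdev` and `statistics.variance` use the N-1 denominator, whereas `pstdev` and `pvariance` use N.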


28/55

Alternative Measures of Precision

Spread or Range

Difference between the largest & smallest value in a set of replicates

Standard Deviation of computed results

Obtained by propagation of errors

$y = c_1 x + c_2$: y is computed, x is measured

To obtain $s_y$ we would need to know $s_x, s_{c_1}, s_{c_2}$, i.e. the SD of each of the variables


29/55

Error Propagation Formulas

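Type of Calc | Example | Std. Dev. of y
Addition or Subtraction | $y = a + b - c$ | $s_y = \sqrt{s_a^2 + s_b^2 + s_c^2}$
Multiplication or Division | $y = \frac{a \cdot b}{c}$ | $\frac{s_y}{y} = \sqrt{\left(\frac{s_a}{a}\right)^2 + \left(\frac{s_b}{b}\right)^2 + \left(\frac{s_c}{c}\right)^2}$
Exponential | $y = a^x$ | $\frac{s_y}{y} = x \frac{s_a}{a}$
Logarithm | $y = \log a$ | $s_y = 0.434 \frac{s_a}{a}$
Antilogarithm | $y = \text{antilog}\, a$ | $\frac{s_y}{y} = 2.303\, s_a$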
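The propagation rules above can be sketched as small Python helpers. All function names and the numbers in the usage lines are hypothetical; the rules assume the errors in a, b and c are random and independent.

```python
import math


def sd_add_sub(*sds):
    """y = a + b - c: absolute SDs combine in quadrature."""
    return math.sqrt(sum(s ** 2 for s in sds))


def rel_sd_mul_div(*value_sd_pairs):
    """y = a * b / c: relative SDs combine in quadrature.

    Each argument is a (value, sd) pair; the result is s_y / y.
    """
    return math.sqrt(sum((sd / value) ** 2 for value, sd in value_sd_pairs))


def rel_sd_power(x, a, s_a):
    """y = a ** x (x exact): s_y / y = x * s_a / a."""
    return x * s_a / a


def sd_log10(a, s_a):
    """y = log10(a): s_y = 0.434 * s_a / a."""
    return 0.434 * s_a / a


def rel_sd_antilog10(s_a):
    """y = 10 ** a: s_y / y = 2.303 * s_a."""
    return 2.303 * s_a


# Hypothetical sum with s_a = 0.03 and s_b = 0.04.
print(round(sd_add_sub(0.03, 0.04), 4))                      # 0.05

# Hypothetical product of a = 10.0 (s = 0.3) and b = 20.0 (s = 0.8).
print(round(rel_sd_mul_div((10.0, 0.3), (20.0, 0.8)), 4))    # 0.05
```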


30/55

Calibration curve

y is plotted as a function of known x for a series of standards

x: independent variable

y: dependent variable

Best fit line obtained by regression analysis using the method of least squares
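A minimal sketch of an unweighted least-squares straight line in pure Python. The function name and the standards in the usage lines are hypothetical; the sketch assumes that all the error is in y and that x is known exactly.

```python
def least_squares_line(x, y):
    """Return (slope, intercept) of the least-squares line y = m * x + b."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical standards (known x) and instrument responses (y).
conc = [0.0, 1.0, 2.0, 3.0]
signal = [1.0, 3.0, 5.0, 7.0]

m, b = least_squares_line(conc, signal)
print(m, b)  # 2.0 1.0

# An unknown is then read off the calibration line: x = (y - b) / m.
print((4.0 - b) / m)  # 1.5
```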


31/55

Standard Error of a mean

Std. deviation s refers to the probable error for a single measurement

Now if the distribution in the set of mean values is observed, less scatter will be observed as N increases

The standard deviation of the mean is denoted as the standard error $s_m$

$s_m = \frac{s}{\sqrt{N}}$


32/55

Reliability of s as a measure of precision

Reliability of s increases as N increases

s can be determined a priori using a large number of replicates, particularly for simple measurements, eg. pH measurement, chromatograph measurement

For more complicated experiments, data from a series of samples accumulated over time can be used to get a pooled estimate of s

This is a better estimate than that from a single subset


33/55

Pooled Std Deviation

$s_{pooled} = \sqrt{\frac{\sum_{i=1}^{N_1} (x_i - \bar{x}_1)^2 + \sum_{i=1}^{N_2} (x_i - \bar{x}_2)^2 + \sum_{i=1}^{N_3} (x_i - \bar{x}_3)^2 + \cdots}{N_1 + N_2 + N_3 + \cdots - t}}$

$N_1$ = # of data in set 1

t = # of data sets

It assumes the same source of random errors in all sub-sets
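The pooled estimate can be sketched in Python as below. The function name `pooled_sd` and the two small data sets are hypothetical.

```python
import math


def pooled_sd(*data_sets):
    """Pooled standard deviation over several sets of replicates.

    Squared deviations are taken from each set's own mean, and one
    degree of freedom is lost per set (denominator = total N - t).
    """
    sum_sq = 0.0
    n_total = 0
    for data in data_sets:
        x_bar = sum(data) / len(data)
        sum_sq += sum((x - x_bar) ** 2 for x in data)
        n_total += len(data)
    t = len(data_sets)
    return math.sqrt(sum_sq / (n_total - t))


# Two hypothetical sets: squared deviations sum to 2 and 8, d.f. = 6 - 2 = 4.
print(round(pooled_sd([1.0, 2.0, 3.0], [4.0, 6.0, 8.0]), 4))  # 1.5811
```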


34/55

Example: Hardness Measurement

Measurement conducted by 13 students using the EDTA titrimetric method

To draw a frequency distribution of the deviation from the true value on a scale of frequency versus z

Is the distribution a normal distribution?

Are there any outliers?

What are the assumptions?

Is there any bias in the measurement?


35/55

Analysis of Hardness Measurement

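Error analysis

Sr No. | True Conc. $h_t$ (mg/L) | Measured Conc. $h$ (mg/L) | Error $h - h_t$ (mg/L)
1 | 110 | 116 | 6.00
2 | 90 | 95 | 5.00
3 | 80 | 86 | 6.00
4 | 80 | 82.64 | 2.64
5 | 100 | 105.32 | 5.32
6 | 100 | 97.2 | -2.80
7 | 100 | 100 | 0.00
8 | 100 | 100 | 0.00
9 | 100 | 100 | 0.00
10 | 60 | 64 | 4.00
11 | 110 | 112 | 2.00
12 | 90 | 90.4 | 0.40
13 | 90 | 84 | -6.00

No. of observations: 13

Mean $(h - h_t)$: 1.74

Stdev $(h - h_t)$: 3.62

Frequency distribution of the error

Lower limit | Upper limit | Mid-point $x$ | $z$ = (x - mean)/sd | Relative Freq
-12 | -10.1 | -11.05 | -3.53 | 0.000
-10 | -8.1 | -9.05 | -2.98 | 0.000
-8 | -6.1 | -7.05 | -2.43 | 0.000
-6 | -4.1 | -5.05 | -1.88 | 0.081
-4 | -2.1 | -3.05 | -1.32 | 0.081
-2 | -0.1 | -1.05 | -0.77 | 0.000
0 | 1.9 | 0.95 | -0.22 | 0.314
2 | 3.9 | 2.95 | 0.34 | 0.152
4 | 5.9 | 4.95 | 0.89 | 0.233
6 | 7.9 | 6.95 | 1.44 | 0.152
8 | 9.9 | 8.95 | 1.99 | 0.000
10 | 11.9 | 10.95 | 2.55 | 0.000
12 | 13.9 | 12.95 | 3.10 | 0.000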
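The summary statistics of the hardness example can be re-derived from the tabulated true and measured concentrations with a few lines of Python. The variable names are assumptions; the values are those in the table.

```python
from statistics import mean, stdev

# True and measured hardness (mg/L) for students 1 to 13.
true_conc = [110, 90, 80, 80, 100, 100, 100, 100, 100, 60, 110, 90, 90]
measured = [116, 95, 86, 82.64, 105.32, 97.2, 100, 100, 100, 64, 112, 90.4, 84]

errors = [h - ht for h, ht in zip(measured, true_conc)]

mean_error = mean(errors)
sd_error = stdev(errors)   # sample standard deviation (N - 1 denominator)

print(len(errors))            # 13
print(round(mean_error, 2))   # 1.74
print(round(sd_error, 2))     # 3.62

# z value of a class mid-point, e.g. the class from 12 to 13.9.
print(round((12.95 - mean_error) / sd_error, 2))   # 3.1
```

A positive mean error of this size relative to its standard deviation is what prompts the question about bias in the measurement.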


36/55

Relative Frequency Distribution

[Figure: Error in Hardness Measurement. y axis: relative frequency (0.0 to 0.4); x axis: z (-4 to 4)]


37/55

Student's t-Distribution

W.S. Gosset experimentally determined the Student's t distribution in 1908

For a distribution of sample means, the definition of z for large samples: $z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}$

If s is substituted for $\sigma$ in z, the resulting quantity would be the t statistic: $t = \frac{\bar{x} - \mu}{s_{\bar{x}}}$


38/55

Characteristics of t-Distribution

Mound shaped; symmetrical about t = 0

It is more variable than z

z varies only due to $\bar{x}$

The variability in t is due to two random quantities which are independent: $\bar{x}$ and s

The t-distribution depends on sample size, n

The variability in t decreases as n increases since s approaches $\sigma$. When $n = \infty$, t = z

d.f. = n-1

Degrees of freedom = Number of squared deviations available for estimating $\sigma^2$


39/55

t-Distribution


40/55

Confidence Limits

Confidence limits define a confidence interval: a region around the experimentally determined mean within which the population mean lies with a given degree of probability

The size of the interval

is derived from the sample standard deviation (s)

is also affected by how closely s (sample std. dev) approaches $\sigma$ (population std. dev)

As s approaches $\sigma$ (as N increases) the confidence limits get narrower


41/55

Confidence Interval: Large Sample Size

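Definition: z statistic

$z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}$, where $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}$

Confidence Limit when s is a good approximation of $\sigma$

CL for $\mu$ = $\bar{x} \pm \frac{z\sigma}{\sqrt{N}}$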


42/55

Confidence Interval and Sample Size

No. of Measurements | Relative size of Confidence Interval
1 | 1.0
2 | 0.71
3 | 0.58
4 | 0.5
5 | 0.45
6 | 0.41
10 | 0.32
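The relative sizes in this table follow from the $1/\sqrt{N}$ factor in the confidence limit, as this short Python sketch shows. The function name is an assumption.

```python
import math


def relative_ci_size(n):
    """Width of the confidence interval for the mean of n measurements,
    relative to the width for a single measurement (same sigma and z)."""
    return 1.0 / math.sqrt(n)


for n in (1, 2, 3, 4, 5, 6, 10):
    print(n, round(relative_ci_size(n), 2))
```

Halving the interval requires four times as many measurements, so the gain per additional replicate falls off quickly.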


43/55

Confidence Interval: Small Sample Size

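Confidence Limit when $\sigma$ is not known

CL for $\mu$ = $\bar{x} \pm \frac{ts}{\sqrt{N}}$

Define t statistic

$t = \frac{\bar{x} - \mu}{s_{\bar{x}}}$, where $s_{\bar{x}} = \frac{s}{\sqrt{N}}$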


44/55

t-Statistics

Tabulated t values are available.

t > z

If s is based on 3 measurements, d.f. = 2

Value of t for 95% CI = 4.3, as compared to the z value of 1.96

t values depend on the degrees of freedom (d.f.) in addition to the confidence level

$t \to z$ as d.f. $\to \infty$


45/55

Depiction of Confidence Interval based

on Normal Error Curve

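[Figure: normal error curve. y axis: relative frequency; x axis: $z = \frac{x - \mu}{\sigma}$, with intervals marked at $z = \pm 0.67$, $\pm 1.29$ and $\pm 1.64$]

95 times out of 100 the true mean will be within $\pm 1.96\sigma$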


46/55

Confidence levels for various values of z

Confidence Level (%) | z
50 | 0.67
68 | 1.00
80 | 1.29
90 | 1.64
95 | 1.96
96 | 2.00
99 | 2.58
99.7 | 3.00
99.9 | 3.29
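The link between z and the confidence level can be checked numerically: for a normal error curve, the fraction of the area within $\pm z$ standard deviations is $\mathrm{erf}(z/\sqrt{2})$. The sketch below uses only the standard-library `math.erf`; the function name is an assumption.

```python
import math


def confidence_level_percent(z):
    """Percentage of the area under the normal curve within +/- z sigma."""
    return math.erf(z / math.sqrt(2)) * 100.0


for z in (0.67, 1.00, 1.29, 1.64, 1.96, 2.00, 2.58, 3.00, 3.29):
    print(z, round(confidence_level_percent(z), 1))
```

Tabulated levels are rounded, so small differences from the computed areas are expected.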


47/55

Confidence Limits based on t & z statistics

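Conc. of a contaminant in water (expressed in %)

0.084 0.089 0.079

To determine the 95% CL when no additional knowledge on precision is available

$\sum x_i = 0.252$

$\sum x_i^2 = 0.021218$

$\bar{x} = \frac{\sum x_i}{3} = 0.084$

$s = \sqrt{\frac{\sum x_i^2 - \frac{(\sum x_i)^2}{N}}{N-1}} = \sqrt{\frac{0.021218 - \frac{(0.252)^2}{3}}{3-1}} = 0.005\%$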


48/55


t = 4.3 for d.f. = 2 and 95% confidence

$95\%\ CL = \bar{x} \pm \frac{ts}{\sqrt{N}} = 0.084 \pm \frac{4.3 \times 0.005}{\sqrt{3}} = 0.084 \pm 0.012\%$


49/55


To determine the 95% CL if from previous experiments it is known that $\sigma = 0.005\%$

Now the z statistic can be used

z = 1.96 for 95% confidence

$95\%\ CL = \bar{x} \pm \frac{z\sigma}{\sqrt{N}} = 0.084 \pm \frac{1.96 \times 0.005}{\sqrt{3}} = 0.084 \pm 0.006\%$

A sure knowledge of $\sigma$ decreases the confidence interval significantly.
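The t-based and z-based limits for the three contaminant results can be reproduced with a short Python sketch. The values t = 4.3, z = 1.96 and $\sigma$ = 0.005% are taken from the slides and hard-coded; the variable names are assumptions.

```python
import math
from statistics import mean, stdev

data = [0.084, 0.089, 0.079]   # contaminant concentration, in %
n = len(data)

x_bar = mean(data)
s = stdev(data)                # sample SD with N - 1 = 2 degrees of freedom

t_95 = 4.3      # tabulated t for d.f. = 2 at 95% confidence
z_95 = 1.96     # z at 95% confidence
sigma = 0.005   # population SD, taken as known from previous experiments

half_width_t = t_95 * s / math.sqrt(n)      # sigma unknown: use t and s
half_width_z = z_95 * sigma / math.sqrt(n)  # sigma known: use z and sigma

print(round(x_bar, 3), round(half_width_t, 3))  # 0.084 0.012
print(round(x_bar, 3), round(half_width_z, 3))  # 0.084 0.006
```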


50/55

Quality Assurance & Control

There must be unequivocal evidence to prove that the data from chemical measurements is reliable. Quality assurance studies provide such evidence

Quality assessment involves evaluation of the accuracy & precision of methods of measurement

Eg. Instruments need to be calibrated frequently with standard samples to ensure accuracy & precision

Quality assurance of manufactured products is also very important. Eg. Fluoride levels in toothpaste are regulated

Control charts can be used to monitor quality


51/55

Quality Assurance & Control: Example

The accuracy and precision of a balance can be monitored by periodically determining standard weights

Determine if measurements made on subsequent days are within certain limits of the standard

Upper and lower control limits: UCL = $\mu + \frac{3\sigma}{\sqrt{N}}$, LCL = $\mu - \frac{3\sigma}{\sqrt{N}}$

$\mu$ = Population mean

$\sigma$ = Population standard deviation

For a normal error curve, the measurements are expected to lie in this range 99.7% of the time


52/55

Quality Assurance & Control: Example

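[Figure: control chart. y axis: mass of std. wt. (g), centred on 20 g, with UCL and LCL marked; x axis: sample (day), 5 to 20]

Balance is almost out of control on day 17

$\mu$ = 20.000 g

$\sigma$ = 0.00012 g

for the mean of 5 measurements, $\sigma_{\bar{x}} = 0.00012/\sqrt{5}$

$3\sigma/\sqrt{N}$ = 0.00016 g

UCL = 20.00016 g

LCL = 19.99984 g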
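The control limits for the balance example follow directly from $\mu \pm 3\sigma/\sqrt{N}$. A minimal Python sketch, with a hypothetical function name and the slide's values $\mu$ = 20.000 g, $\sigma$ = 0.00012 g and N = 5:

```python
import math


def control_limits(mu, sigma, n):
    """Return (LCL, UCL) = mu -/+ 3 * sigma / sqrt(n) for means of n readings."""
    half_width = 3.0 * sigma / math.sqrt(n)
    return mu - half_width, mu + half_width


lcl, ucl = control_limits(mu=20.000, sigma=0.00012, n=5)

print(f"3*sigma/sqrt(N) = {3.0 * 0.00012 / math.sqrt(5):.5f} g")  # 0.00016 g
print(f"UCL = {ucl:.5f} g")  # 20.00016 g
print(f"LCL = {lcl:.5f} g")  # 19.99984 g
```

A daily mean is flagged when it falls outside these two limits, which are symmetric about $\mu$.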


53/55

The Q Test for detection of Gross Errors

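A rationale for excluding outlying results that differ excessively from the average

$Q_{exp} = \frac{|x_q - x_n|}{w}$, where $x_q$ is the questionable result, $x_n$ its nearest neighbour and $w$ the spread of the whole set

$Q_{exp}$ is compared with $Q_{critical}$

If $Q_{exp} > Q_{critical}$, the questionable result can be rejected with the specified confidence level

[Figure: six results $x_1 \ldots x_6$ on the x axis, with $d = x_6 - x_5$ and $w = x_6 - x_1$ marked]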


54/55

Qcrit @ Specified Confidence Level

No. of observations | 90% | 95% | 99%
3 | 0.941 | 0.970 | 0.994
4 | 0.765 | 0.829 | 0.926
5 | 0.642 | 0.710 | 0.821
6 | 0.560 | 0.625 | 0.740
7 | 0.507 | 0.568 | 0.680
8 | 0.468 | 0.526 | 0.634
9 | 0.437 | 0.493 | 0.598
10 | 0.412 | 0.466 | 0.568

Assumption: The distribution of population data is normal

A cautious approach to rejection of outliers is wise
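A minimal Python sketch of the Q test using the critical values tabulated above. The function name `q_test`, the dictionary layout and the five results in the usage lines are hypothetical; the sketch tests whichever end of the sorted data has the larger gap to its nearest neighbour.

```python
# Critical Q values from the table: {N: {confidence level: Q_crit}}.
Q_CRIT = {
    3: {"90%": 0.941, "95%": 0.970, "99%": 0.994},
    4: {"90%": 0.765, "95%": 0.829, "99%": 0.926},
    5: {"90%": 0.642, "95%": 0.710, "99%": 0.821},
    6: {"90%": 0.560, "95%": 0.625, "99%": 0.740},
    7: {"90%": 0.507, "95%": 0.568, "99%": 0.680},
    8: {"90%": 0.468, "95%": 0.526, "99%": 0.634},
    9: {"90%": 0.437, "95%": 0.493, "99%": 0.598},
    10: {"90%": 0.412, "95%": 0.466, "99%": 0.568},
}


def q_test(values, confidence="95%"):
    """Return (Q_exp, reject) for the most extreme value of 3 to 10 results."""
    data = sorted(values)
    w = data[-1] - data[0]             # spread of the whole set
    gap_low = data[1] - data[0]        # gap if the lowest value is suspect
    gap_high = data[-1] - data[-2]     # gap if the highest value is suspect
    q_exp = max(gap_low, gap_high) / w
    return q_exp, q_exp > Q_CRIT[len(data)][confidence]


# Hypothetical replicate results; 56.23 looks high.
q_exp, reject = q_test([55.95, 56.00, 56.04, 56.08, 56.23])
print(round(q_exp, 3), reject)  # 0.536 False
```

Here $Q_{exp}$ is below the 95% critical value for five observations (0.710), so the suspect result is retained. The sketch does not handle the degenerate case where all values are equal (w = 0).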


55/55

Recommendation for treatment of Outliers

Re-examine all data & observations relating to the outlying result; maintain a lab notebook with all observations & data

Estimate the precision of the procedure to ensure that the outlying result is actually questionable

Repeat the analysis. Check for agreement between the new data and the original set

Apply the Q test to decide if data should be retained or rejected on statistical grounds

If the Q test indicates retention, consider reporting the median instead of the mean
