lec set 1 data analysis

Upload: smarika-kulshrestha

Post on 04-Jun-2018


  • 8/13/2019 Lec Set 1 Data Analysis

    1/55

    ES 670 Environmental Statistics

    Data Analysis


    Mean and Median

    It is impossible to conduct a chemical analysis free of errors or uncertainty. Data of unknown quality is worthless.

    Replicates (a set of repeated measurements) are therefore required instead of a single measurement.

    The central value of the set is a more reliable measure than any individual estimate. The mean or the median is used.

    Variation among replicates provides a measure of uncertainty. Standard deviation, variance and coefficient of variation are some ways of expressing it.


    Mean and Median

    Mean: $\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}$

    Median: the middle result when replicate data are arranged in order from smallest to largest.

    E.g. median of a set of 5 measurements: 9, 2, 7, 11, 14. Arranged in ascending order: 2, 7, 9, 11, 14.

    The median is 9. Its rank is $\frac{N + 1}{2} = 3$.

    If N is even, the median is the average of the two middle measurements.

    The median is less sensitive to extreme values than the mean.
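    The definitions above can be checked with a short Python sketch. It uses the five measurements from the slide; the four-point set and the outlier value 140 are hypothetical and only illustrate the even-N rule and the sensitivity of the mean to extreme values.

```python
from statistics import mean, median

replicates = [9, 2, 7, 11, 14]       # the five measurements from the slide

x_bar = mean(replicates)             # sum of the values divided by N
x_med = median(replicates)           # middle value after sorting

# Even N: the median is the average of the two middle values.
even_set = [2, 7, 9, 11]             # hypothetical four-point set
even_med = median(even_set)

# Hypothetical gross outlier: the mean shifts strongly, the median does not.
with_outlier = [9, 2, 7, 11, 140]
outlier_mean = mean(with_outlier)
outlier_med = median(with_outlier)

print("mean:", x_bar, "median:", x_med, "median (even N):", even_med)
print("with outlier -> mean:", outlier_mean, "median:", outlier_med)
```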


    Mean and Median

    For data that are symmetrically distributed about the mean, the mean and the median are equal.

    For a skewed distribution, the mean shifts in the direction of the skew.


    Precision vs. Accuracy

    Precision: describes the reproducibility of measurements and can be determined by repeating the experiment.

    It is indicated by the standard deviation, variance or coefficient of variation.

    It relates to the deviation from the mean:

    $d_i = x_i - \bar{x}$


    Precision vs. Accuracy

    Accuracy: indicates the closeness of a measurement to its true or accepted value.

    It is expressed by the error (absolute or relative).

    It can never be determined exactly, since the true value is not known exactly.

    Absolute error: $E = x_i - x_t$

    Relative error: $E_r = \frac{x_i - x_t}{x_t} \times 100\%$
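    As a minimal sketch of the two error definitions, the following Python snippet evaluates them for one measurement. The accepted value (20.00) and the measured value (19.80) are hypothetical numbers chosen for illustration, not data from the lecture.

```python
def absolute_error(x_i, x_t):
    """E = x_i - x_t; the sign shows whether the result is high or low."""
    return x_i - x_t

def relative_error_percent(x_i, x_t):
    """E_r = (x_i - x_t) / x_t * 100 %."""
    return (x_i - x_t) / x_t * 100.0

x_t = 20.00   # hypothetical accepted value
x_i = 19.80   # hypothetical measured value

E = absolute_error(x_i, x_t)
E_r = relative_error_percent(x_i, x_t)
print(f"E = {E:+.2f}, E_r = {E_r:+.1f} %")
```

    The negative sign indicates that the measured value lies below the accepted value.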


    Types of Errors in Experimental Data

    Random / indeterminate errors: reflect precision.

    Systematic / determinate errors: reflect a bias. They cause a series of measurements to be all high or all low.

    Gross errors: produce outliers, which may be very high or very low.


    Types of Errors in Experimental Data

    Systematic errors: the cause can be assigned, and it affects all data in the same way.

    Sources:

    Instrument errors (e.g. glassware marking error)

    Method errors (e.g. non-ideal behaviour of reagents)

    Personal errors (e.g. error in detecting a colour change)


    Precision Vs Accuracy in Measurements

    [Figure: absolute error $\bar{x}_k - x_t$ in the micro-Kjeldahl determination of nitrogen for two compounds (1 and 2) by four different analysts. Analysts 1 and 2 analysed compound 1; analysts 3 and 4 analysed compound 2.]


    Effect of Systematic Errors & their Detection

    Constant error: the magnitude of the error does not depend on the size of the quantity measured. It is more serious when the quantity measured is small.

    E.g. excess reagent required to cause the colour change in a titration.

    Proportional error: the error increases or decreases in proportion to the sample size.

    E.g. presence of interfering compounds in the sample.
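    The difference between the two kinds of systematic error shows up most clearly in the relative error. The sketch below assumes a hypothetical constant error of 0.10 mL of excess titrant and a hypothetical proportional bias of 0.5 %; neither number comes from the lecture.

```python
def relative_error_percent(absolute_error, true_value):
    """Relative error in percent for a given absolute error."""
    return absolute_error / true_value * 100.0

CONSTANT_ERROR_ML = 0.10       # hypothetical fixed excess of titrant, mL
PROPORTIONAL_FRACTION = 0.005  # hypothetical 0.5 % bias from an interferent

for true_volume in (5.0, 10.0, 50.0):   # hypothetical titration volumes, mL
    const_rel = relative_error_percent(CONSTANT_ERROR_ML, true_volume)
    prop_abs = PROPORTIONAL_FRACTION * true_volume
    prop_rel = relative_error_percent(prop_abs, true_volume)
    print(f"{true_volume:5.1f} mL: constant error -> {const_rel:.2f} %, "
          f"proportional error -> {prop_rel:.2f} %")
```

    The constant error gives a relative error that grows as the measured quantity shrinks, while the proportional error gives the same relative error at every size.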


    Effect of Systematic Errors & their Detection

    Systematic instrumental errors can be determined by calibration.

    Personal errors can be minimized by self-discipline.

    Bias in analytical methods can be minimized by using standard reference materials, or by using an alternative reliable analytical method.

    Blank determinations can reveal errors due to interfering contaminants, e.g. a titration end-point correction can be made with blanks.


    Random Errors in Analysis

    Indeterminate errors are caused by uncontrollable variables. The variables that contribute to these errors cannot be identified or measured. They cause a random scatter in the data.

    An empirical observation is that, for most experimental data, the distribution of replicates approaches a Gaussian curve (normal distribution).

    Theoretically, this distribution arises from a large number of individual error components.
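    The claim that many small error components add up to a Gaussian-like scatter can be illustrated with a small simulation. This is only a sketch under stated assumptions: each simulated measurement error is the sum of 12 independent components drawn uniformly from -1 to +1 (an arbitrary choice), and the seed is fixed so the run is repeatable.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

def simulated_error(n_components=12):
    """One measurement error as the sum of many small independent errors."""
    return sum(random.uniform(-1.0, 1.0) for _ in range(n_components))

errors = [simulated_error() for _ in range(20000)]
centre = statistics.mean(errors)
sd = statistics.pstdev(errors)

# Fractions of simulated errors within 1 and 2 standard deviations of the mean.
within_1sd = sum(abs(e - centre) <= sd for e in errors) / len(errors)
within_2sd = sum(abs(e - centre) <= 2 * sd for e in errors) / len(errors)

print(f"within 1 SD: {within_1sd:.3f}, within 2 SD: {within_2sd:.3f}")
```

    Although no single component is normally distributed, the fractions come out close to the values expected for a normal distribution (roughly 0.68 and 0.95).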


    Calibration of a 10 mL Pipette

    The exact volume of water delivered by a 10 mL pipette was measured gravimetrically 50 times. Mass was converted to volume using density values at the measured temperatures.

    The data collected were rearranged to obtain a frequency distribution. The data series was divided into 0.003 mL groups, and the number and percentage of observations in each group were determined.


    Calibration of a 10 mL Pipette

    The frequency distribution was plotted as a bar graph (histogram).

    Range: 9.969 to 9.995 mL

    Mean: 9.982 mL

    Median: 9.982 mL

    Spread: 0.025 mL

    SD: 0.0056 mL


    Calibration of a 10 mL pipette

    As the number of measurements increases, the histogram approaches a continuous curve with a Gaussian (normal) distribution.

    The Gaussian has the same mean and the same standard deviation (SD), and therefore the same precision, and the same area under the curve as the histogram.

    Sources of random uncertainty are usually: reading the 10 mL mark, drainage time and the angle at which the pipette is held, temperature fluctuations affecting viscosity and the performance of the balance, and vibration or drafts affecting the balance.


    Reproducibility / Repeatability

    Both terms relate to precision

    An analyst makes 5 replicate measurements in quick succession using the same reagents and glassware. Such measurements reflect repeatability (within-run precision).

    The same analyst takes the same readings on 5 different occasions. The data are now subject to differences in reagents, glassware and lab conditions, so they reflect reproducibility (between-run precision).


    Reproducibility / Repeatability

    Error estimates based on sequentially repeated observations may give a false sense of security regarding precision.

    More emphasis needs to be put on reproducibility. It highlights the differences between observations when replicate experiments are performed in random sequence.


    Normality / Randomness / Independence

    Most statistical procedures are based on normality, randomness and independence.

    Normality: the measurement error comes from a normal distribution. Because of the central limit effect, many additive component errors lead to a normal-like distribution.

    This is not a very restrictive criterion.

    If errors are not normally distributed, transformations are available to make the errors normal-like.

    Most tests are robust to deviations from normality.


    Normality / Randomness / Independence

    Randomness: observations are drawn from a population such that every element of the population has an equal chance of being drawn.

    Randomization of sampling can ensure that observations are independent.

    Example of non-randomness in measurements: if, in 10 replicate measurements, all the early measurements are high compared with the late measurements, there is a non-randomness associated with the measurement.


    Normality / Randomness / Independence

    Randomness is indicated by a plot of measurement error vs. order of observation.

    It is good to check for randomness with respect to each identifiable factor that can affect the measurement.


    Normality / Randomness / Independence

    Independence: implies that the simple multiplicative law of probability holds.

    The probability of the joint occurrence of two events is given by the product of the probabilities of the individual occurrences: $P(A \cap B) = P(A)\,P(B)$.

    Lack of independence can seriously distort the variance and the results of statistical tests.


    Statistical treatment of Random Error

    Sample vs. Population

    Sample: the finite number of experimental observations.

    Population: the infinite number of possible observations that could in principle be made given infinite time.

    The statistical laws are derived for a population. When applied to small samples, these laws may need to be modified.


    Population Mean & Sample Mean

    Population mean: $\mu$. Sample mean: $\bar{x}$.

    If the number of observations is small, the two are not the same.

    The population mean $\mu$ is the true mean of the population.

    The sample mean $\bar{x}$ is an estimator of the population mean.

    If a measurement has no systematic error, $\mu = x_t$, the true value.

    The difference between $\bar{x}$ and $\mu$ decreases as N increases.


    Population and Sample mean

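    [Figure: normal error curves labelled A and B; axis labels $x - \mu$ and $z$]

    The population standard deviation (a measure of precision):

    $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$

    $z = \frac{x - \mu}{\sigma}$: deviation from the mean expressed in units of standard deviation.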


    Characteristics of the normal error curve

    The mean occurs at the central point of maximum frequency.

    There is a symmetrical distribution of positive and negative deviations about the mean.

    There is an exponential decrease in frequency as the magnitude of the deviations increases, i.e. small random uncertainties are observed more often than larger ones.

    Areas under a Gaussian curve

    68.3% of the area lies within $\pm 1\sigma$ of the mean

    95.5% of the area lies within $\pm 2\sigma$ of the mean

    99.7% of the area lies within $\pm 3\sigma$ of the mean

    Therefore the standard deviation is a very useful prediction tool.
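    The areas quoted above follow from the Gaussian integral, which can be evaluated with the error function. The sketch below uses `math.erf` from the Python standard library; the helper name `area_within` is made up for this illustration.

```python
import math

def area_within(k):
    """Fraction of a normal distribution lying within +/- k standard deviations."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(f"within +/- {k} sigma: {100.0 * area_within(k):.2f} %")
```

    The value for two standard deviations comes out at about 95.45 %, which the slide quotes as 95.5 %.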


    The Sample Standard Deviation

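    It applies to small data sets.

    $s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}$

    $\bar{x}$ replaces $\mu$.

    Denominator = N - 1 = degrees of freedom.

    If the denominator were N, s would be less than $\sigma$. Using N - 1 therefore prevents a negative bias.

    Simplified formula:

    $s = \sqrt{\frac{\sum_{i=1}^{N} x_i^2 - \frac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}}{N - 1}}$

    Use of the simplified formula may give rise to large round-off errors.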
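    Both forms of the sample standard deviation can be written in a few lines of Python. This is a sketch rather than a reference implementation; the function names are made up, and the three data values are the contaminant concentrations used in the confidence-limit example later in these notes.

```python
import math

def sample_sd(data):
    """Definition form: squared deviations from the sample mean over N - 1."""
    n = len(data)
    x_bar = sum(data) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in data) / (n - 1))

def sample_sd_simplified(data):
    """Simplified form: uses only the sum of x and the sum of x squared."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(x * x for x in data)
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

data = [0.084, 0.089, 0.079]
s_definition = sample_sd(data)
s_simplified = sample_sd_simplified(data)
print(f"definition: {s_definition:.4f}, simplified: {s_simplified:.4f}")
```

    The simplified form subtracts two nearly equal numbers, so when the values are large compared with their scatter the sums must carry many digits, which is where the round-off errors mentioned above come from.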


    Alternative Measures of Precision

    Variance: $s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}$

    Relative standard deviation: $RSD = \frac{s}{\bar{x}} \times 100\%$

    Coefficient of variation: $CV = \frac{s}{\bar{x}} \times 100\%$


    Alternative Measures of Precision

    Spread or range

    The difference between the largest and smallest values in a set of replicates.

    Standard deviation of computed results

    Obtained by propagation of errors.

    $y = c_1 x + c_2$, where y is computed and x is measured.

    To obtain $s_y$ we would need to know $s_x$, $s_{c_1}$ and $s_{c_2}$, i.e. the SD of each of the variables.


    Error Propagation Formulas

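    Type of calculation           Example                           Std. dev. of y

    Addition or subtraction       $y = a + b - c$                   $s_y = \sqrt{s_a^2 + s_b^2 + s_c^2}$

    Multiplication or division    $y = \frac{a \times b}{c}$        $\frac{s_y}{y} = \sqrt{\left(\frac{s_a}{a}\right)^2 + \left(\frac{s_b}{b}\right)^2 + \left(\frac{s_c}{c}\right)^2}$

    Exponentiation                $y = a^x$                         $\frac{s_y}{y} = x \frac{s_a}{a}$

    Logarithm                     $y = \log_{10} a$                 $s_y = 0.434 \frac{s_a}{a}$

    Antilogarithm                 $y = \mathrm{antilog}_{10}\, a$   $\frac{s_y}{y} = 2.303\, s_a$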
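    A short sketch of the first two rows of the table, in Python. The function names and all numerical values (a, b, c and their standard deviations) are hypothetical and chosen only to show how the formulas are applied.

```python
import math

def sd_of_sum(*sds):
    """y = a + b - c: absolute standard deviations add in quadrature."""
    return math.sqrt(sum(s ** 2 for s in sds))

def relative_sd_of_product(*value_sd_pairs):
    """y = a * b / c: relative standard deviations add in quadrature."""
    return math.sqrt(sum((s / v) ** 2 for v, s in value_sd_pairs))

# Hypothetical measured values and their standard deviations
a, s_a = 0.50, 0.02
b, s_b = 4.10, 0.03
c, s_c = 1.97, 0.05

y_sum = a + b - c
s_y_sum = sd_of_sum(s_a, s_b, s_c)

y_prod = a * b / c
s_y_prod = y_prod * relative_sd_of_product((a, s_a), (b, s_b), (c, s_c))

print(f"a + b - c = {y_sum:.2f} +/- {s_y_sum:.2f}")
print(f"a * b / c = {y_prod:.3f} +/- {s_y_prod:.3f}")
```

    Note that the sign of each term (added or subtracted, multiplied or divided) does not matter: the uncertainties always accumulate.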


    Calibration curve

    y is plotted as a function of known x for a series of standards.

    x: independent variable

    y: dependent variable

    The best-fit line is obtained by regression analysis using the method of least squares.
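    A minimal sketch of a least-squares calibration line in plain Python follows. The standard concentrations, the instrument signals and the unknown's signal are all hypothetical values; the function name is also made up for this illustration.

```python
def least_squares_line(x, y):
    """Return (slope, intercept) of the line minimising the squared y residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept

conc = [0.0, 2.0, 4.0, 6.0, 8.0]          # hypothetical standards (x, known)
signal = [0.01, 0.21, 0.39, 0.62, 0.80]   # hypothetical responses (y, measured)

slope, intercept = least_squares_line(conc, signal)

# Read an unknown off the calibration line by inverting y = slope * x + intercept.
unknown_signal = 0.50                      # hypothetical response of a sample
unknown_conc = (unknown_signal - intercept) / slope
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}, unknown = {unknown_conc:.2f}")
```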


    Standard Error of a Mean

    The standard deviation s refers to the probable error of a single measurement.

    If the distribution of a set of mean values ($\bar{x}_1, \bar{x}_2, \ldots$), each computed from N measurements, is observed, less scatter will be observed as N increases.

    The standard deviation of the mean is called the standard error, $s_m$:

    $s_m = \frac{s}{\sqrt{N}}$
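    The square-root dependence on N is easy to tabulate. In this sketch the single-measurement standard deviation of 0.005 is a hypothetical value, and `standard_error` is a helper name made up for the example.

```python
import math

def standard_error(s, n):
    """Standard deviation of the mean of n measurements: s / sqrt(n)."""
    return s / math.sqrt(n)

s = 0.005   # hypothetical standard deviation of a single measurement
for n in (1, 4, 16, 100):
    print(f"N = {n:3d}: s_m = {standard_error(s, n):.5f}")
```

    Quadrupling the number of measurements only halves the standard error, so the gain from additional replicates diminishes quickly.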


    Reliability of s as a measure of precision

    The reliability of s increases as N increases.

    For simple measurements (e.g. pH or chromatographic measurements), s can be determined a priori using a large number of replicates.

    For more complicated experiments, data from a series of samples accumulated over time can be used to get a pooled estimate of s.

    This is a better estimate than that from a single subset.


    Pooled Std Deviation

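    $s_{pooled} = \sqrt{\frac{\sum_{i=1}^{N_1} (x_i - \bar{x}_1)^2 + \sum_{i=1}^{N_2} (x_i - \bar{x}_2)^2 + \sum_{i=1}^{N_3} (x_i - \bar{x}_3)^2 + \cdots}{N_1 + N_2 + N_3 + N_4 + \cdots - N_t}}$

    $N_1$ = number of data in set 1

    $N_t$ = number of data sets

    It assumes the same sources of random error in all subsets.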
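    The pooled estimate can be sketched in Python as follows. Each subset contributes its squared deviations from its own mean, and one degree of freedom is lost per subset. The function name and the three small data sets are hypothetical.

```python
import math

def pooled_sd(*data_sets):
    """Pooled standard deviation over several subsets of replicate data."""
    sum_sq = 0.0
    n_total = 0
    for data in data_sets:
        x_bar = sum(data) / len(data)                  # mean of this subset
        sum_sq += sum((x - x_bar) ** 2 for x in data)  # its squared deviations
        n_total += len(data)
    degrees_of_freedom = n_total - len(data_sets)      # one lost per subset
    return math.sqrt(sum_sq / degrees_of_freedom)

# Hypothetical replicate results for three different samples
set_1 = [1.08, 1.12, 1.10]
set_2 = [2.51, 2.47, 2.50, 2.52]
set_3 = [0.73, 0.70]

s_pooled = pooled_sd(set_1, set_2, set_3)
print(f"pooled s = {s_pooled:.4f}")
```

    The subsets may have different means, because only the scatter within each subset enters the estimate.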


    Example: Hardness Measurement

    Measurements conducted by 13 students using the EDTA titrimetric method.

    Task: draw a frequency distribution of the deviation from the true value, on a scale of frequency versus z.

    Is the distribution a normal distribution?

    Are there any outliers?

    What are the assumptions?

    Is there any bias in the measurement?


    Analysis of Hardness Measurement

    No. of observations: 13

    Mean (h - ht): 1.74

    Stdev (h - ht): 3.62

    Error analysis (ht = true conc., h = measured conc., both in mg/L)

    Sr No.   ht     h        h - ht
    1        110    116       6.00
    2         90     95       5.00
    3         80     86       6.00
    4         80     82.64    2.64
    5        100    105.32    5.32
    6        100     97.2    -2.80
    7        100    100       0.00
    8        100    100       0.00
    9        100    100       0.00
    10        60     64       4.00
    11       110    112       2.00
    12        90     90.4     0.40
    13        90     84      -6.00

    Frequency distribution of the error x = h - ht, with z = (x - mean)/sd

    Lower limit   Upper limit   Mid-point x   z       Relative freq.
    -12           -10.1         -11.05        -3.53   0.000
    -10            -8.1          -9.05        -2.98   0.000
     -8            -6.1          -7.05        -2.43   0.000
     -6            -4.1          -5.05        -1.88   0.081
     -4            -2.1          -3.05        -1.32   0.081
     -2            -0.1          -1.05        -0.77   0.000
      0             1.9           0.95        -0.22   0.314
      2             3.9           2.95         0.34   0.152
      4             5.9           4.95         0.89   0.233
      6             7.9           6.95         1.44   0.152
      8             9.9           8.95         1.99   0.000
     10            11.9          10.95         2.55   0.000
     12            13.9          12.95         3.10   0.000


    Relative Frequency Distribution

    [Figure: "Error in Hardness Measurement"; relative frequency (0.0 to 0.4) plotted against z (-4 to 4)]


    Student's t-Distribution

    W.S. Gosset experimentally determined the Student's t distribution in 1908.

    For a distribution of sample means, the definition of z for large samples is

    $z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}$

    If s is substituted for $\sigma$ in z, the resulting quantity is the t statistic:

    $t = \frac{\bar{x} - \mu}{s_{\bar{x}}}$


    Characteristics of t-Distribution

    Mound-shaped; symmetrical about t = 0.

    It is more variable than z.

    z varies only due to $\bar{x}$.

    The variability in t is due to two independent random quantities: $\bar{x}$ and s.

    The t-distribution depends on the sample size, n.

    The variability in t decreases as n increases, since s approaches $\sigma$. When $n = \infty$, t = z.

    d.f. = n - 1

    Degrees of freedom = number of squared deviations available for estimating $\sigma^2$.


    t-Distribution


    Confidence Limits

    Confidence limits define a confidence interval: a region around the experimentally determined mean within which the population mean lies with a given degree of probability.

    The size of the interval

    is derived from the sample standard deviation (s)

    is also affected by how closely s (sample std. dev.) approaches $\sigma$ (population std. dev.)

    As s approaches $\sigma$ (as N increases), the confidence limits get narrower.


    Confidence Interval: Large Sample Size

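    Definition of the z statistic:

    $z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}$, where $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}$

    Confidence limit when s is a good approximation of $\sigma$:

    CL for $\mu = \bar{x} \pm \frac{z \sigma}{\sqrt{N}}$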


    Confidence Interval and Sample Size

    No. of Measurements   Relative size of Confidence Interval
    1                     1.0
    2                     0.71
    3                     0.58
    4                     0.5
    5                     0.45
    6                     0.41
    10                    0.32
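    The tabulated values follow from the $1/\sqrt{N}$ dependence of the confidence interval on the number of measurements averaged, with the single-measurement interval taken as 1. A minimal Python sketch (the helper name is made up for illustration):

```python
import math

def relative_ci_size(n):
    """Confidence interval of a mean of n measurements, relative to n = 1."""
    return 1.0 / math.sqrt(n)

for n in (1, 2, 3, 4, 5, 6, 10):
    print(f"N = {n:2d}: relative size = {relative_ci_size(n):.2f}")
```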


    Confidence Interval: Small Sample Size

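    Confidence limit when $\sigma$ is not known:

    CL for $\mu = \bar{x} \pm \frac{t s}{\sqrt{N}}$

    Define the t statistic:

    $t = \frac{\bar{x} - \mu}{s_{\bar{x}}}$, where $s_{\bar{x}} = \frac{s}{\sqrt{N}}$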


    t-Statistics

    t values are available in tabulated form.

    t > z

    If s is based on 3 measurements, d.f. = 2. The value of t for a 95% CI is 4.3, compared with a z value of 1.96.

    t values depend on the degrees of freedom (d.f.) as well as on the confidence level.

    $t \to z$ as d.f. $\to \infty$


    Depiction of Confidence Interval based on Normal Error Curve

    [Figure: normal error curve; y axis: relative frequency; x axis: $z = \frac{x - \mu}{\sigma}$; marked z values include 0.67, 1.29 and 1.64]

    95 times out of 100 the true mean will be within $\pm 1.96\sigma$.


    Confidence levels for various values of z

    Confidence level (%)   z
    50                     0.67
    68                     1.00
    80                     1.29
    90                     1.64
    95                     1.96
    96                     2.00
    99                     2.58
    99.7                   3.00
    99.9                   3.29


    Confidence Limits based on t & z statistics

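    Conc. of a contaminant in water (expressed in %):

    0.084, 0.089, 0.079

    To determine the 95% CL when no additional knowledge of the precision is available:

    $\sum x_i = 0.252$, $\sum x_i^2 = 0.021218$

    $\bar{x} = \frac{\sum x_i}{3} = 0.084\%$

    $s = \sqrt{\frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{N}}{N - 1}} = \sqrt{\frac{0.021218 - \frac{(0.252)^2}{3}}{2}} = 0.005\%$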


    Confidence Limits based on t & z statistics

    t = 4.3 for d.f. = 2 and 95% confidence

    $95\%\ CL = \bar{x} \pm \frac{t s}{\sqrt{N}} = 0.084 \pm \frac{4.3 \times 0.005}{\sqrt{3}} = 0.084 \pm 0.012\%$


    Confidence Limits based on t & z statistics

    To determine the 95% CL if it is known from previous experiments that $\sigma = 0.005\%$:

    Now the z statistic can be used.

    z = 1.96 for 95% confidence

    $95\%\ CL = \bar{x} \pm \frac{z \sigma}{\sqrt{N}} = 0.084 \pm \frac{1.96 \times 0.005}{\sqrt{3}} = 0.084 \pm 0.006\%$

    A sure knowledge of $\sigma$ decreases the confidence interval significantly.
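    The two confidence-limit calculations for the contaminant data can be reproduced with a short Python sketch. The data, t = 4.3 (d.f. = 2), z = 1.96 and the known population standard deviation of 0.005 % are taken from the slides; the variable names are made up for this illustration.

```python
import math

data = [0.084, 0.089, 0.079]     # contaminant concentration, %
n = len(data)
x_bar = sum(data) / n
s = math.sqrt(sum((x - x_bar) ** 2 for x in data) / (n - 1))

# Case 1: precision known only from these 3 results -> t statistic
t_95_df2 = 4.3                   # tabulated t for 95 % confidence, d.f. = 2
half_width_t = t_95_df2 * s / math.sqrt(n)

# Case 2: sigma known from previous experiments -> z statistic
sigma = 0.005                    # %, known population standard deviation
z_95 = 1.96
half_width_z = z_95 * sigma / math.sqrt(n)

print(f"t-based 95% CL: {x_bar:.3f} +/- {half_width_t:.3f} %")
print(f"z-based 95% CL: {x_bar:.3f} +/- {half_width_z:.3f} %")
```

    With the same mean and the same numerical value of the standard deviation, the t-based interval is about twice as wide, because s from three results is itself uncertain.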


    Quality Assurance & Control

    There must be unequivocal evidence that the data from chemical measurements are reliable. Quality assurance studies provide such evidence.

    Quality assessment involves evaluation of the accuracy and precision of methods of measurement.

    E.g. instruments need to be calibrated frequently with standard samples to ensure accuracy and precision.

    Quality assurance of manufactured products is also very important, e.g. fluoride levels in toothpaste are regulated.

    Control charts can be used to monitor quality.


    Quality Assurance & Control: Example

    The accuracy and precision of a balance can be monitored by periodically weighing standard weights.

    Determine whether measurements made on subsequent days are within certain limits of the standard.

    Upper and lower control limits:

    $UCL = \mu + \frac{3\sigma}{\sqrt{N}}$, $LCL = \mu - \frac{3\sigma}{\sqrt{N}}$

    $\mu$ = population mean

    $\sigma$ = population standard deviation

    For a normal error curve, the measurements are expected to lie in this range 99.7% of the time.


    Quality Assurance & Control: Example

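    [Figure: control chart; y axis: mass of standard weight (around 20 g) with UCL and LCL lines; x axis: sample (day), 5 to 20]

    The balance is almost out of control on day 17.

    $\mu$ = 20.000 g

    $\sigma$ = 0.00012 g

    For the mean of 5 measurements:

    $\sigma_{\bar{x}} = 0.00012/\sqrt{5} = 0.000054$ g

    $3\sigma/\sqrt{N} = 0.00016$ g

    UCL = 20.00016 g

    LCL = 19.99984 g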
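    The control limits for the balance example can be recomputed from the stated mean, standard deviation and N = 5 with a few lines of Python. The function name and the daily means used to demonstrate the check are hypothetical.

```python
import math

def control_limits(mu, sigma, n):
    """Return (LCL, UCL) = mu -/+ 3 * sigma / sqrt(n) for means of n measurements."""
    half_width = 3.0 * sigma / math.sqrt(n)
    return mu - half_width, mu + half_width

mu = 20.000       # g, mass of the standard weight
sigma = 0.00012   # g, population standard deviation of a single weighing
lcl, ucl = control_limits(mu, sigma, 5)
print(f"LCL = {lcl:.5f} g, UCL = {ucl:.5f} g")

# Hypothetical daily means checked against the limits
for day, mass in [(1, 20.00004), (2, 19.99991), (3, 20.00019)]:
    status = "in control" if lcl <= mass <= ucl else "OUT of control"
    print(f"day {day}: {mass:.5f} g -> {status}")
```

    Here $3\sigma/\sqrt{N}$ evaluates to about 0.00016 g, so the limits lie symmetrically about 20.000 g.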


    The Q Test for detection of Gross Errors

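    A rationale for excluding outlying results that differ excessively from the average.

    $Q_{exp} = \frac{|x_q - x_n|}{w}$

    where $x_q$ is the questionable result, $x_n$ is its nearest neighbour and w is the spread of the whole set.

    $Q_{exp}$ is compared with $Q_{critical}$.

    If $Q_{exp} > Q_{critical}$, the questionable result can be rejected with the specified confidence level.

    [Figure: six results $x_1$ to $x_6$ on the x axis, with $d = x_6 - x_5$ and $w = x_6 - x_1$]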


    Qcrit @ Specified Confidence Level

    No. of observations   90%     95%     99%
    3                     0.941   0.970   0.994
    4                     0.765   0.829   0.926
    5                     0.642   0.710   0.821
    6                     0.560   0.625   0.740
    7                     0.507   0.568   0.680
    8                     0.468   0.526   0.634
    9                     0.437   0.493   0.598
    10                    0.412   0.466   0.568

    Assumption: the distribution of the population data is normal.

    A cautious approach to the rejection of outliers is wise.
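    A minimal sketch of the Q test in Python, using the 95 % critical values from the table above. The function name and the five replicate results are hypothetical; the sketch tests only the single most extreme value, which is taken to be whichever end of the sorted data has the larger gap to its neighbour.

```python
# 95 % critical values of Q, indexed by the number of observations
Q_CRIT_95 = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625,
             7: 0.568, 8: 0.526, 9: 0.493, 10: 0.466}

def q_test(data, q_crit_table=Q_CRIT_95):
    """Return (suspect value, Q_exp, reject?) for the most extreme result."""
    values = sorted(data)
    spread = values[-1] - values[0]            # w
    if spread == 0:
        raise ValueError("all results are identical; nothing to test")
    gap_low = values[1] - values[0]            # gap if the smallest is suspect
    gap_high = values[-1] - values[-2]         # gap if the largest is suspect
    if gap_high >= gap_low:
        suspect, gap = values[-1], gap_high
    else:
        suspect, gap = values[0], gap_low
    q_exp = gap / spread
    return suspect, q_exp, q_exp > q_crit_table[len(values)]

results = [55.95, 56.00, 56.04, 56.08, 56.23]  # hypothetical replicate results
suspect, q_exp, reject = q_test(results)
print(f"suspect = {suspect}, Q_exp = {q_exp:.3f}, reject = {reject}")
```

    For this hypothetical set, Q_exp is below the 95 % critical value for five observations, so the high result would be retained.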


    Recommendation for treatment of Outliers

    Re-examine all data and observations relating to the outlying result. Maintain a lab notebook with all observations and data.

    Estimate the precision of the procedure to ensure that the outlying result is actually questionable.

    Repeat the analysis. Check for agreement between the new data and the original set.

    Apply the Q test to decide whether the data should be retained or rejected on statistical grounds.

    If the Q test indicates retention, consider reporting the median instead of the mean.