Data, Statistics, and Environmental Regulation:

Avoiding Pitfalls

Mike Aucott, Ph.D.

Division of Science, Research & Technology

maucott@dep.state.nj.us

292-7530

Value of statistics

Some common problems

Non-detects

Central tendency

Uncertainty

Significance

Some suggestions for dealing with common problems

Some examples based on real data

Once data are properly collected and quantified, statistics come into play.

Statistics can help make sense of a jumble of data, can help identify meaningful inferences, and can prevent blunders that could lead to serious consequences.

But, they can be misused.

“The statistics on sanity are that one out of every four Americans is suffering from some form of mental illness. Think of your three best friends. If they're okay, then it's you.”

Rita Mae Brown

Identification of some potential problems can help avoid misuse and help harness the full power of statistics.

The first problem we often face is dealing with non-detects (NDs); i.e., values that are below the minimum detection limit (MDL).

What to do with these data?

We typically use substitution methods. There are more sophisticated statistical approaches, but they are more complicated.

With the substitution approach, NDs are typically replaced with the MDL itself, or with 1/2 the MDL. Sometimes, the NDs are discarded (not recommended) or, worse, replaced with zero.
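As a rough illustration of the substitution approach, the Python sketch below replaces each ND with a chosen fraction of the MDL. The function name `substitute_nds`, the use of `None` to mark an ND, and the sample readings are assumptions made for this example; they are not part of the original presentation.

```python
# Sketch of the substitution approach for non-detects (NDs).
# Assumptions: NDs are recorded as None and all samples share one MDL.

def substitute_nds(values, mdl, fraction):
    """Replace each ND (None) with fraction * mdl; keep detected values as-is."""
    return [fraction * mdl if v is None else v for v in values]

# Hypothetical readings, for illustration only.
readings = [3.0, None, 2.0, None]
mdl = 1.0

as_zero = substitute_nds(readings, mdl, 0.0)  # ND = 0
as_half = substitute_nds(readings, mdl, 0.5)  # ND = 1/2 MDL
as_mdl = substitute_nds(readings, mdl, 1.0)   # ND = MDL

print(as_zero)  # [3.0, 0.0, 2.0, 0.0]
print(as_half)  # [3.0, 0.5, 2.0, 0.5]
print(as_mdl)   # [3.0, 1.0, 2.0, 1.0]
```

Keeping the substitution fraction as a parameter makes it easy to rerun an analysis under each rule and see whether the conclusion changes.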

Example of a problem with non-detects, based on typical data, e.g. discharges from different facilities to a waste-stream.

To quantify flows, or in some other way to characterize the entire group, requires replacement of NDs with a number.

facility   concentration
1          15
4          10
5          5
7          5
8          3
9          3
10         3
11         3
12         3
13         2
14         2
15         2
16         2
17         2
18         2
19         2
20         ND
21         ND
22         ND
23         ND
24         ND
25         ND
26         ND
27         ND
28         ND
29         ND
30         ND
31         ND
32         ND

                           converted concentration value
facility   concentration   ND = 0   ND = 1/2 MDL   ND = MDL
1          15              15       15             15
4          10              10       10             10
5          5               5        5              5
7          5               5        5              5
8          3               3        3              3
9          3               3        3              3
10         3               3        3              3
11         3               3        3              3
12         3               3        3              3
13         2               2        2              2
14         2               2        2              2
15         2               2        2              2
16         2               2        2              2
17         2               2        2              2
18         2               2        2              2
19         2               2        2              2
20         ND              0        0.5            1
21         ND              0        0.5            1
22         ND              0        0.5            1
23         ND              0        0.5            1
24         ND              0        0.5            1
25         ND              0        0.5            1
26         ND              0        0.5            1
27         ND              0        0.5            1
28         ND              0        0.5            1
29         ND              0        0.5            1
30         ND              0        0.5            1
31         ND              0        0.5            1
32         ND              0        0.5            1

Problems arise when a large proportion, or a large weighted proportion, of the data are below the detection limit

                  converted concentration value      mass discharged, using converted values
facility   flow   ND = 0   ND = 1/2 MDL   ND = MDL   ND = 0   ND = 1/2 MDL   ND = MDL
1          1      15       15             15         15       15             15
4          1      10       10             10         10       10             10
5          2      5        5              5          10       10             10
7          2      5        5              5          10       10             10
8          3      3        3              3          9        9              9
9          5      3        3              3          15       15             15
10         5      3        3              3          15       15             15
11         1      3        3              3          3        3              3
12         5      3        3              3          15       15             15
13         15     2        2              2          30       30             30
14         10     2        2              2          20       20             20
15         10     2        2              2          20       20             20
16         15     2        2              2          30       30             30
17         12     2        2              2          24       24             24
18         10     2        2              2          20       20             20
19         2      2        2              2          4        4              4
20         50     0        0.5            1          0        25             50
21         20     0        0.5            1          0        10             20
22         30     0        0.5            1          0        15             30
23         20     0        0.5            1          0        10             20
24         10     0        0.5            1          0        5              10
25         5      0        0.5            1          0        2.5            5
26         8      0        0.5            1          0        4              8
27         10     0        0.5            1          0        5              10
28         10     0        0.5            1          0        5              10
29         12     0        0.5            1          0        6              12
30         10     0        0.5            1          0        5              10
31         10     0        0.5            1          0        5              10
32         10     0        0.5            1          0        5              10

total mass                                           250      353            455
mean discharge                                       9        12             16
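The totals in the mass-discharge table can be checked with a short Python sketch. The flows and concentrations are copied from the table; the variable names, the `None` marker for NDs, and the value `MDL = 1.0` (inferred from the "ND = MDL" column, where each ND converts to 1) are assumptions of this example.

```python
# Each tuple is (facility, flow, concentration); None marks a non-detect.
MDL = 1.0  # assumed: in the table, ND = MDL converts to a concentration of 1

facilities = [
    (1, 1, 15), (4, 1, 10), (5, 2, 5), (7, 2, 5), (8, 3, 3), (9, 5, 3),
    (10, 5, 3), (11, 1, 3), (12, 5, 3), (13, 15, 2), (14, 10, 2),
    (15, 10, 2), (16, 15, 2), (17, 12, 2), (18, 10, 2), (19, 2, 2),
    (20, 50, None), (21, 20, None), (22, 30, None), (23, 20, None),
    (24, 10, None), (25, 5, None), (26, 8, None), (27, 10, None),
    (28, 10, None), (29, 12, None), (30, 10, None), (31, 10, None),
    (32, 10, None),
]

def total_mass(rows, nd_fraction):
    """Sum flow * concentration, substituting nd_fraction * MDL for each ND."""
    total = 0.0
    for _, flow, conc in rows:
        if conc is None:
            conc = nd_fraction * MDL
        total += flow * conc
    return total

for label, fraction in [("ND = 0", 0.0), ("ND = 1/2 MDL", 0.5), ("ND = MDL", 1.0)]:
    total = total_mass(facilities, fraction)
    print(label, "total mass:", total, "mean discharge:", total / len(facilities))

# Share of the total flow that comes from facilities reporting ND.
nd_flow = sum(flow for _, flow, conc in facilities if conc is None)
all_flow = sum(flow for _, flow, _ in facilities)
print("ND share of flow:", nd_flow / all_flow)
```

The three totals come out as 250, 352.5, and 455, matching the table after rounding. The last line shows why the choice matters here: 13 of the 29 facilities report ND, but they carry about two-thirds of the flow, so they are a large weighted proportion of the data.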

If the way that non-detects are treated influences the conclusions that are drawn from the data in an important way, seek expert advice on the best way to treat the values that are below the detection limit.

Assuming that the issue of non-detects is adequately dealt with, another frequently-faced challenge is describing or summarizing the data, i.e., determining the central tendency of the data.

Here again there are some pitfalls.

Most widely-used statistical methods assume a normal distribution of data (also called “Gaussian” or “bell-shaped” distribution) [2]. Unfortunately, environmental data often do not have a normal distribution.

But there are relatively robust methods to assess the central tendency of non-normally distributed data. And there are some pitfalls that should be avoided.

[2] There are other statistical methods that don’t assume normality, e.g., methods based on the distribution of ranks and resampling methods. These are not discussed here.

NJ0000795 12.00   NJ0002500  0.25   NJ0003476  4.08   NJ0004235  2.00   NJ0020028  3.45
NJ0020079  4.20   NJ0020141  0.08   NJ0020184  2.89   NJ0020206  3.49   NJ0020281  8.30
NJ0020290  4.14   NJ0020371  0.86   NJ0020389  1.33   NJ0020427  5.38   NJ0020532  0.52
NJ0020591  1.47   NJ0020605  2.13   NJ0020672  2.50   NJ0020711  2.60   NJ0020737  8.83
NJ0020761  2.31   NJ0020915  2.53   NJ0020923  1.26   NJ0021016  2.35   NJ0021083  5.47
NJ0021091  2.17   NJ0021105  1.03   NJ0021113  4.35   NJ0021172  0.50   NJ0021326  3.27
NJ0021334  3.16   NJ0021342  1.85   NJ0021369  9.82   NJ0021571  0.50   NJ0021598  2.96
NJ0021601  1.58   NJ0021610  1.56   NJ0021636  0.61   NJ0021687 20.72   NJ0021709  0.52
NJ0021717  1.32   NJ0021865  3.10   NJ0021890  1.80   NJ0021962 15.00   NJ0022021  6.79
NJ0022047  1.62   NJ0022063  0.56   NJ0022101  1.72   NJ0022110  3.40   NJ0022144  1.90
NJ0022250  2.01   NJ0022276  2.14   NJ0022284  1.08   NJ0022306 22.00   NJ0022314  4.95
NJ0022349  2.60   NJ0022390  1.87   NJ0022489  2.16   NJ0022497  2.49   NJ0022519  3.86
NJ0022586  3.14   NJ0022675  1.53   NJ0022683  1.65   NJ0022764  3.15   NJ0022772  4.80
NJ0022781  3.80   NJ0022845  2.40   NJ0022918  2.81   NJ0022985  2.47   NJ0023001  0.62
NJ0023132  4.17   NJ0023311  2.40   NJ0023361  2.33   NJ0023493  0.93   NJ0023507  0.83
NJ0023540  0.49   NJ0023566  3.43   NJ0023663  2.76   NJ0023698  2.00   NJ0023701  2.72
NJ0023728  2.83   NJ0023736  2.00   NJ0023787  1.36   NJ0023809  0.77   NJ0023841  1.08
NJ0023949  1.50   NJ0024007  4.79   NJ0024015  0.86   NJ0024023  3.29   NJ0024031  2.47
NJ0024040  1.40   NJ0024104  2.97   NJ0024163  4.50   NJ0024414  1.50   NJ0024449  0.44
NJ0024457  4.17   NJ0024465  1.59   NJ0024473  0.60   NJ0024490  5.54   NJ0024511 10.95
NJ0024520  5.20   NJ0024562  4.28   NJ0024635  0.76   NJ0024643  3.17   NJ0024651  5.00
NJ0024660  0.10   NJ0024678  2.16   NJ0024686  0.16   NJ0024708  2.14   NJ0024716  2.90
NJ0024741  4.29   NJ0024759  0.33   NJ0024783  4.45   NJ0024791  8.34   NJ0024813  1.50
NJ0024821  1.89   NJ0024856  1.80   NJ0024864  2.38   NJ0024872  4.17   NJ0024902  3.69
NJ0024911  2.71   NJ0024929  2.98   NJ0024937  4.32   NJ0024953  4.97   NJ0024970  3.20
NJ0024996  4.32   NJ0025038  4.35   NJ0025160  0.53   NJ0025178  1.31   NJ0025241  1.31
NJ0025321    na   NJ0025330  5.79   NJ0025356  2.67   NJ0025364  8.80   NJ0025411  9.00
NJ0025496  1.63   NJ0025518  3.10   NJ0026018  2.04   NJ0026174  1.72   NJ0026182  0.73
NJ0026263  3.30   NJ0026301  0.63   NJ0026387  0.80   NJ0026514  1.75   NJ0026689  0.55
NJ0026719  2.83   NJ0026727  5.90   NJ0026735  0.64   NJ0026751  3.20   NJ0026824  1.50
NJ0026832  7.23   NJ0026867  0.71   NJ0026905 40.00   NJ0027006  2.94   NJ0027049  5.92
NJ0027057  5.45   NJ0027065  3.64   NJ0027073 10.00   NJ0027081  1.38   NJ0027201  6.40
NJ0027464  0.87   NJ0027481  1.04   NJ0027545  0.64   NJ0027561  1.72   NJ0027596  0.75
NJ0027669  1.67   NJ0027677  3.40   NJ0027685  1.11   NJ0027715  0.32   NJ0027758  2.17
NJ0027774  1.85   NJ0027821  5.61   NJ0027961  4.92   NJ0028002  4.99   NJ0028142  2.78
NJ0028304  0.62   NJ0028452  1.04   NJ0028479  1.80   NJ0028487  0.65   NJ0028541 31.00
NJ0028592    na   NJ0028894  6.36   NJ0029084  1.92   NJ0029203  1.12   NJ0029386  6.16
NJ0029408  3.49   NJ0029432  4.40   NJ0029467  1.61   NJ0029475  5.33   NJ0029831  3.09
NJ0029912  3.00   NJ0030333  5.65   NJ0031046  4.67   NJ0031119  1.80   NJ0031267  0.13
NJ0031585  1.94   NJ0031674  1.70   NJ0031810  3.00   NJ0031992  0.88   NJ0032395  3.85
NJ0033189  0.60   NJ0033995  1.94   NJ0034282  1.07   NJ0034339  1.80   NJ0035084  0.15
.....cont’d....

A typical environmental data set....

[Figure: histogram, "Distribution: actual values". x-axis: range values, 0 to 40 in steps of 2 (> than number to left, < or = to number shown); y-axis: number within range, 0 to 160.]

And its histogram, showing non-normal distribution....

[Figure: histogram, "Distribution: log of actual values". x-axis: range values, -0.5 to 2.5 in steps of 0.5 (> than number to left, < or = to number shown); y-axis: number within range, 0 to 160.]

Non-normal distributions often look normal if the log of each value is used instead of the actual value......

One can convert the values of “log-normal” data to the logarithms of the values, and then perform statistical tests using common statistical methods.

However, with environmental data, use of the logs of the actual data is often questionable, and can lead to conclusions that are not conservative and may be bogus. This is especially apparent in materials accounting contexts. (“There is no law of conservation of logarithms of mass.” [1])

[1] Parkhurst, David F., 1998, Arithmetic Versus Geometric Means for Environmental Concentration Data, Environ. Sci. Technol., 32, 92A-98A.

Another example of environmental data...

What is the central tendency?

Site A

sample   pg/m3   log10 of pg/m3 value
1        4.32    0.63
2        3.63    0.56
3        57.84   1.76
4        4.43    0.65
5        3.68    0.57
6        3.08    0.49
7        8.43    0.93
8        10.25   1.01
9        6.59    0.82
10       72.39   1.86
11       29.57   1.47
12       4.92    0.69
13       15.27   1.18
14       7.42    0.87
15       85.38   1.93
16       18.34   1.26
17       15.12   1.18
18       6.59    0.82
19       2.81    0.45
20       8.77    0.94
21       4.98    0.70
22       7.65    0.88
23       11.36   1.06
24       9.88    0.99

[Figure: histogram, "Distribution: Concentration of mercury bound to PM 2.5, Location A". x-axis: concentration range, pg/m3, 0 to 90 in steps of 5; y-axis: number of samples in range, 0 to 9.]

Site A      pg/m3   log10 of pg/m3 value
mean        16.78   0.99
median      8.04    0.90
geo mean    9.72

The geometric mean is 10 raised to the mean of the logs (0.99 is the rounded mean of the logs), which gives 9.72.

Three measures of central tendency; which is best?

It depends. If you want to know what sort of value you are most likely to encounter on any given occasion, median or geometric mean may be best. But........
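The three measures for Site A can be recomputed from the 24 sample values listed earlier. This is a minimal Python sketch using only the standard library; the variable names are assumptions of the example, and the geometric mean is computed as 10 raised to the mean of the base-10 logs, following the slide.

```python
import math
import statistics

# Site A concentrations (pg/m3), copied from the sample table.
site_a = [
    4.32, 3.63, 57.84, 4.43, 3.68, 3.08, 8.43, 10.25, 6.59, 72.39, 29.57,
    4.92, 15.27, 7.42, 85.38, 18.34, 15.12, 6.59, 2.81, 8.77, 4.98, 7.65,
    11.36, 9.88,
]

mean = statistics.mean(site_a)
median = statistics.median(site_a)

# Geometric mean: average the logs, then transform back.
mean_of_logs = statistics.mean([math.log10(v) for v in site_a])
geo_mean = 10 ** mean_of_logs

print(f"mean     {mean:.2f}")      # 16.78
print(f"median   {median:.2f}")    # 8.04
print(f"geo mean {geo_mean:.2f}")  # 9.72
```

The three highest readings (57.84, 72.39, and 85.38) pull the mean well above the median and geometric mean, which is why the choice of measure matters for skewed data like these.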

Stream system with five tributaries, all with equal flows; varying concentrations (C) of a pollutant. What is the concentration of the pollutant at point P?

[Diagram: five tributaries with concentrations 1, 100, 1, 1, and 10 joining upstream of point P.]

C at point P = ?

Mean = 113/5 = 22.6

Median = 1

Geometric mean = $10^{(2+1+0+0+0)/5} = 10^{0.6} = 4.0$

Because the flows are equal, a mass balance gives the arithmetic mean, 22.6, as the concentration at P.

In a materials accounting context, the median and geometric mean are non-conservative and can lead to bogus conclusions regarding the central tendency.
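The stream example can be worked as a mass balance in a few lines of Python. The unit flow of 1.0 per tributary is an assumption of this sketch (the slide only says the flows are equal), and the variable names are illustrative.

```python
import math
import statistics

# Pollutant concentrations in the five tributaries (from the slide).
concentrations = [100, 10, 1, 1, 1]
# Assumed: equal flows of 1.0 (arbitrary units) in every tributary.
flows = [1.0] * len(concentrations)

# Mass balance: pollutant mass entering P divided by the water flow entering P.
mass_in = sum(f * c for f, c in zip(flows, concentrations))
c_at_p = mass_in / sum(flows)

median = statistics.median(concentrations)
geo_mean = 10 ** statistics.mean([math.log10(c) for c in concentrations])

print("C at P (mass balance):", c_at_p)       # 22.6
print("median:", median)                      # 1
print("geometric mean:", round(geo_mean, 1))  # 4.0
```

Only the arithmetic mean agrees with the mass balance. With unequal flows the same calculation becomes a flow-weighted mean, which the median and geometric mean still would not reproduce.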

[Figure: bar chart, "Median concentration, Hg bound to PM 2.5, Five sites". x-axis: site (A to E); y-axis: pg/m3, 0 to 40; a hypothetical criterion is marked.]

[Figure: bar chart, "Geometric mean concentration, Hg bound to PM 2.5, Five sites". x-axis: site (A to E); y-axis: pg/m3, 0 to 40; a hypothetical criterion is marked.]

[Figure: bar chart, "Mean concentration, Hg bound to PM 2.5, Five sites". x-axis: site (A to E); y-axis: pg/m3, 0 to 40; a hypothetical criterion is marked.]

There are more sophisticated ways to estimate the central tendency, but they require additional expertise. In a materials accounting context, i.e., where you care about the mass of something, the simplest and most conservative approach is to use the mean (arithmetic average).

Unless you have a very good reason, do not use the median or geometric mean to describe or summarize non-normally distributed environmental data. Both the median and geometric mean undervalue high readings.

Once we’ve estimated the central tendency, how confident are we about its accuracy? How much uncertainty is associated with the estimate?

Estimating the uncertainty is important; it may be the most important tool of statistics.

With a mean value, uncertainty is typically expressed as the range within which we can be 95% (or sometimes 90% or 99%) confident that the actual mean of the entire population lies.

With non-normally distributed data, especially if the number of samples is relatively small, a simple, relatively robust approach to determine the confidence interval is to use the formula:

$$\mu = \bar{x} \pm t_{\alpha/2} \, \frac{s}{\sqrt{n}}$$

where $\mu$ is the population mean, $\bar{x}$ is the sample mean, $t_{\alpha/2}$ is the critical value of the t distribution with $n-1$ degrees of freedom, $s$ is the standard deviation of the sample, and $n$ is the number of samples.

For Site A, the mean is 16.8 pg/m3, and the 95% confidence interval of the mean is ±9.5 pg/m3.

So, the true mean, which can be expected to emerge if enough samples are taken, may be as low as 7.3 pg/m3 or as high as 26.3 pg/m3.

The same method can be applied to the other sites’ data. This can have important implications relative to standards or criteria.
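The Site A interval can be reproduced from the 24 sample values with the t-based formula. In this Python sketch the critical value 2.069 (two-sided 95%, 23 degrees of freedom) is taken from a standard t table and hard-coded to avoid a dependency; the constant and variable names are assumptions of the example.

```python
import math
import statistics

# Site A concentrations (pg/m3), copied from the sample table.
site_a = [
    4.32, 3.63, 57.84, 4.43, 3.68, 3.08, 8.43, 10.25, 6.59, 72.39, 29.57,
    4.92, 15.27, 7.42, 85.38, 18.34, 15.12, 6.59, 2.81, 8.77, 4.98, 7.65,
    11.36, 9.88,
]

# Assumed from a t table: two-sided 95% critical value for n - 1 = 23 df.
T_CRIT_95_DF23 = 2.069

n = len(site_a)
mean = statistics.mean(site_a)
s = statistics.stdev(site_a)  # sample standard deviation (n - 1 denominator)

half_width = T_CRIT_95_DF23 * s / math.sqrt(n)
low, high = mean - half_width, mean + half_width

print(f"mean {mean:.1f} +/- {half_width:.1f} pg/m3")
print(f"95% confidence interval: {low:.1f} to {high:.1f} pg/m3")
```

For a different site, the critical value must be looked up again for that site's own number of samples, since it depends on the degrees of freedom.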

[Figure: bar chart, "Mean concentration, Hg bound to PM 2.5, Five sites". x-axis: site (A to E); y-axis: pg/m3, 0 to 40; a hypothetical criterion is marked.]

Confidence intervals also apply to the assessment of differences and trends. The human brain excels at finding patterns, even from random data. Statistics can prevent us from thinking we see a pattern in what is actually random variation.

The term "significant" is used to denote that some result or conclusion is not likely due to chance. Are differences or trends significant? In statistical terms, significant means there’s only a small probability (usually <5%, sometimes <1% or <10%) that a result this strong would arise from random variability alone.

Sometimes a little more data can clarify whether an apparent trend is in fact significant.

[Figure: chart of NJ GHG emissions by year; x-axis: 1986 to 1998.]

NJ GHG Emissions; 1985 through 1999; trend not significant

The apparent positive trend is not significant at the 95% confidence level because there’s at least a 5% chance the true slope could be zero or negative
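One common way to make this check concrete is to fit a least-squares line and ask whether the confidence interval on its slope excludes zero. The Python sketch below is an illustration under stated assumptions, not the method used for the NJ figures: the function names are made up for this example, the caller supplies the two-sided critical t value for n - 2 degrees of freedom from a t table, and the two short series are hypothetical numbers, not NJ emissions data.

```python
import math

def slope_confidence_interval(xs, ys, t_crit):
    """Return (slope, low, high) for the least-squares slope of ys on xs.

    t_crit is the two-sided critical t value for len(xs) - 2 degrees of
    freedom (for example, 2.160 for 95% confidence with 15 data points).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares around the fitted line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    half_width = t_crit * se_slope
    return slope, slope - half_width, slope + half_width

def trend_is_significant(low, high):
    """A trend is significant here if the slope interval excludes zero."""
    return low > 0 or high < 0

# Hypothetical series, for illustration only (5 points, so 3 df; t = 3.182).
years = [0, 1, 2, 3, 4]
steady_rise = [0.0, 1.1, 1.9, 3.2, 3.8]
no_trend = [1.0, -1.0, 1.0, -1.0, 1.0]

for name, series in [("steady_rise", steady_rise), ("no_trend", no_trend)]:
    slope, low, high = slope_confidence_interval(years, series, 3.182)
    print(name, round(slope, 2), trend_is_significant(low, high))
```

This also shows why one more year of data can change the verdict: an additional point changes the fitted slope and its standard error, and lowers the critical t value slightly, so an interval that straddled zero may no longer do so.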

[Figure: chart of NJ GHG emissions by year; x-axis: 1986 to 2000.]

NJ GHG Emissions; 1985 through 2000; significant trend

Statistics can be misused! Important potential pitfalls:

* inappropriate handling of values below detection limit
* inappropriate use of median and geometric mean
* failure to consider uncertainty
* failure to consider significance

But used properly, statistics will

* help make sense of a jumble of data
* help identify meaningful inferences, and
* prevent unwarranted conclusions.