Mechanisms for Data Quality and Validation in Citizen Science
DESCRIPTION
Presentation for a paper on ways to improve data quality in citizen science, delivered by Nathan Prestopnik at a workshop on citizen science at eScience 2011.

TRANSCRIPT

Mechanisms for Data Quality and Validation in Citizen Science
A. Wiggins, G. Newman, R. Stevenson & K. Crowston
Presented by Nathan Prestopnik
Motivation
Data quality and validation are a primary concern for most citizen science projects
More contributors = more opportunities for error
There has been no review of appropriate data quality and validation mechanisms
Diverse projects face similar challenges
Contributors' skills and scale of participation are important considerations in ensuring quality
Methods
Survey
Questionnaire with 70 items, all optional
63 completed questionnaires representing 62 projects
Mostly small-to-medium sized projects in US, Canada, UK; most focus on monitoring and observation
Inductive development of framework
Based on survey results and authors' direct experience with citizen science projects
Survey: Resources
FTEs: 0 – 50+
Average: 2.4; Median: 1
Often small fractions of several individuals' time
Annual budgets: $125 - $1,000,000
Average: $105,000; Median: $35,000; Mode: $20,000
Up to 5 different funding sources, usually grants, in-kind contributions (staff time), & private donations
Age/duration: -1 to 100 years
Average age: 13 years; Median: 9 years; Mode: 2 years
Survey: Methods Used
Method                                               n    Percentage
Expert review                                        46   77%
Photo submissions                                    24   40%
Paper data sheets submitted along with online entry  20   33%
Replication/rating by multiple participants          14   23%
QA/QC training program                               13   22%
Automatic filtering of unusual reports               11   18%
Uniform equipment                                     9   15%
Validation planned but not yet implemented            5    8%
Replication/rating by the same participant            2    3%
Rating of established control items                   2    3%
None                                                  2    3%
Not sure/don't know                                   2    3%
Survey: Combining Methods
Methods                                              n    Percentage
Single method                                        10   17%
Multiple methods, up to 5 (average 2.5)              45   75%
Expert review + Automatic filtering                  11   18%
Expert review + Paper data sheets                    10   17%
Expert review + Photos                               14   23%
Expert review + Photos + Paper data sheets            6   10%
Expert review + Replication (multiple participants)  10   17%
Survey: Resources & Methods
Number of validation methods and staff are positively correlated (r = 0.11)
More staffing = more supervisory capacity
Number of validation methods and budget are negatively correlated (r = -0.15)
If larger budgets mean more contributors, this constrains scalability of multiple methods
Larger projects may use fewer but more sophisticated mechanisms
Suggests that human-supervised methods don't scale
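For concreteness, the correlation statistic cited above can be computed directly from per-project counts; a minimal Python sketch, using illustrative placeholder numbers rather than the survey's actual data:

```python
# Minimal sketch: Pearson correlation between number of validation
# methods and staffing (FTEs). The numbers below are illustrative
# placeholders, not the survey's actual per-project data.
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

methods_per_project = [1, 3, 2, 5, 2, 4, 1, 3]      # validation methods used
ftes_per_project = [0.5, 2, 1, 6, 1, 3, 0.2, 2]     # full-time equivalents

print(f"r = {pearson_r(methods_per_project, ftes_per_project):.2f}")
```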
Survey: Other Validation Options
"Please describe any additional validation methods used in your project"
Several projects rely on personal knowledge of contributing individuals for data quality
Not scientifically robust, but understandably relevant
Most comments referred to details of expert review
Reinforces the perceived value of expertise
Reporting interface and associated error-checking is often overlooked, but provides important initial data verification
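As an illustration of that kind of front-line error-checking, here is a minimal sketch of validating an observation record at entry time; the field names, species list, and allowed ranges are hypothetical, not from any particular project:

```python
# Minimal sketch of reporting-interface error-checking: reject or flag
# implausible values before they enter the dataset. Field names and
# allowed ranges are hypothetical, not from any specific project.
from datetime import date

KNOWN_SPECIES = {"monarch butterfly", "eastern bluebird", "red maple"}

def check_observation(record: dict) -> list[str]:
    """Return a list of validation problems; empty means the record passes."""
    problems = []
    if record.get("species", "").lower() not in KNOWN_SPECIES:
        problems.append("unrecognized species name")
    if not (-90 <= record.get("lat", 999) <= 90):
        problems.append("latitude out of range")
    if not (-180 <= record.get("lon", 999) <= 180):
        problems.append("longitude out of range")
    if record.get("count", 0) <= 0:
        problems.append("count must be a positive integer")
    if record.get("date", date.today()) > date.today():
        problems.append("observation date is in the future")
    return problems

print(check_observation(
    {"species": "Monarch Butterfly", "lat": 43.0, "lon": -76.1,
     "count": 3, "date": date(2011, 6, 15)}))  # -> [] (record passes)
```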
Choosing Mechanisms
Data characteristics to consider when choosing mechanisms to ensure quality
Accuracy and precision: taxonomic, spatial, temporal, etc.
Error prevention: malfeasance (gaming the system), inexperience, data entry errors, etc.
Evaluate assumptions about error and accuracy
Where does error originate? How do mechanisms address this? At what step in the research process?
How transparent are data review and its outcomes?
How much data will be reviewed? In how much detail?
Mechanisms: Protocols
Mechanism                                   Process  Type/Detail
QA project plans                            Before   SOP in some areas
Repeated samples/tasks                      During   By multiple participants, single participant, or experts (calibration)
Tasks involving control items               During   Contributions compared to known states (see sketch below)
Uniform/calibrated equipment                During   Used for measurements; cost/scale tradeoff; who pays?
Paper data sheets + online entry*           During   Extended details, verifying data entry accuracy
Digital vouchers*                           During   Photos, audio, specimens/archives
Data triangulation, normalization, mining*  After    Corroboration from other data sources; statistical & computer science methods
Data documentation*                         After    Provide metadata about processes
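As a concrete reading of the control-items row, a minimal sketch of scoring contributions against tasks with known states; task IDs and answers are hypothetical:

```python
# Minimal sketch of the control-item mechanism: a few tasks have known
# answers, and a contributor's agreement with those answers estimates
# their accuracy. Task IDs and answers are hypothetical.
CONTROL_ANSWERS = {"task-07": "oak", "task-19": "maple", "task-23": "birch"}

def control_accuracy(contributions: dict[str, str]):
    """Fraction of control tasks this contributor answered correctly."""
    scored = [tid for tid in contributions if tid in CONTROL_ANSWERS]
    if not scored:
        return None  # contributor saw no control items
    correct = sum(contributions[tid] == CONTROL_ANSWERS[tid] for tid in scored)
    return correct / len(scored)

# A contributor who classified four tasks, three of them controls:
print(control_accuracy(
    {"task-07": "oak", "task-19": "elm", "task-23": "birch", "task-41": "pine"}))
# -> 0.666... (2 of 3 control items correct)
```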
Mechanisms: Participants
Mechanism                                      Process         Types/Details
Participant training                           Before, During  Initial; Ongoing; Formal QA/QC
Participant testing                            Before, During  Following training; Pre/test-retest
Rating participant performance                 During, After   Unknown to participant; Known to participant
Filtering of unusual reports                   During, After   Automatically (see sketch below); Manually
Contacting participants about unusual reports  After           May alienate/educate contributors
Automatic recognition                          After           Techniques for image/text processing
Expert review                                  After           By professionals, experienced contributors, or multiple parties
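Automatic filtering of unusual reports can take many forms; one minimal sketch, assuming simple per-site numeric measurements and a z-score threshold (illustrative data and threshold, not a method from the paper):

```python
# Minimal sketch of automatically filtering unusual reports: flag any
# report whose measured value falls far outside the historical
# distribution for that site. Data and threshold are illustrative.
from statistics import mean, stdev

def flag_unusual(history: list[float], new_value: float, z_max: float = 3.0) -> bool:
    """Flag new_value if it lies more than z_max standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return new_value != mu
    return abs(new_value - mu) / sigma > z_max

site_history = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2]  # e.g., water temperature (deg C)
print(flag_unusual(site_history, 12.3))  # False: plausible, accept
print(flag_unusual(site_history, 31.0))  # True: route to manual review
```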
Discussion
Need to pay more attention to the way data are created: not just protocols, but also qualities of data like accuracy and precision
Clear need for quality/validation mechanisms for analysis, not only for data collection/processing
Data mining techniques
Spatio-temporal modeling (see sketch below)
Scalability of validation may be limited
May need to plan different quality management techniques based on expected/actual project growth
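As a hedged illustration of the spatio-temporal point, a minimal plausibility check with entirely hypothetical species ranges and seasons (not real reference data):

```python
# Minimal sketch of a spatio-temporal plausibility check: flag a species
# observation reported outside its expected region or season. The range
# box and months below are hypothetical, not real species data.
EXPECTED = {
    # species: (lat_min, lat_max, lon_min, lon_max, active_months)
    "monarch butterfly": (25.0, 50.0, -105.0, -70.0, {5, 6, 7, 8, 9, 10}),
}

def plausibility_flags(species: str, lat: float, lon: float, month: int) -> list[str]:
    """Return reasons an observation looks spatio-temporally unusual."""
    if species not in EXPECTED:
        return ["no reference range for species"]
    lat_min, lat_max, lon_min, lon_max, months = EXPECTED[species]
    flags = []
    if not (lat_min <= lat <= lat_max and lon_min <= lon <= lon_max):
        flags.append("outside expected geographic range")
    if month not in months:
        flags.append("outside expected season")
    return flags

print(plausibility_flags("monarch butterfly", 43.0, -76.1, 7))  # -> []
print(plausibility_flags("monarch butterfly", 43.0, -76.1, 1))  # -> ['outside expected season']
```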
Future Work
Most projects worry more about contributor expertise than appropriate analysis methods
Resources are needed to support suitable analysis approaches and tools
Comparative evaluation of the efficacy of the data quality and validation mechanisms identified
Develop a QA/QC planning and evaluation tool
Develop examples of appropriate data documentation for citizen science projects
Necessary for peer review, data re-use
Thanks!
Nate Prestopnik
DataONE working group on Public Participation in Scientific Research
US NSF grants 09-43049 & 11-11107