Mechanisms for Data Quality and Validation in Citizen Science
DESCRIPTION
Presentation for a paper on ways to improve data quality in citizen science, delivered by Nathan Prestopnik at a workshop on citizen science at eScience 2011.

TRANSCRIPT

Mechanisms for Data Quality and Validation in Citizen Science
A. Wiggins, G. Newman, R. Stevenson & K. Crowston
Presented by Nathan Prestopnik
Motivation
Data quality and validation are a primary concern for most citizen science projects
More contributors = more opportunities for error
There has been no review of appropriate data quality and validation mechanisms
Diverse projects face similar challenges
Contributors' skills and scale of participation are important considerations in ensuring quality
Methods
Survey
Questionnaire with 70 items, all optional
63 completed questionnaires representing 62 projects
Mostly small-to-medium sized projects in US, Canada, UK; most focus on monitoring and observation
Inductive development of framework
Based on survey results and authors' direct experience with citizen science projects
Survey: Resources
FTEs: 0 – 50+
Average: 2.4; Median: 1
Often small fractions of several individuals' time
Annual budgets: $125 - $1,000,000
Average: $105,000; Median: $35,000; Mode: $20,000
Up to 5 different funding sources, usually grants, in-kind contributions (staff time), & private donations
Age/duration: -1 to 100 years
Average age: 13 years; Median: 9 years; Mode: 2 years
Survey: Methods Used
Method                                               n    Percentage
Expert review                                        46   77%
Photo submissions                                    24   40%
Paper data sheets submitted along with online entry  20   33%
Replication/rating by multiple participants          14   23%
QA/QC training program                               13   22%
Automatic filtering of unusual reports               11   18%
Uniform equipment                                     9   15%
Validation planned but not yet implemented            5    8%
Replication/rating by the same participant            2    3%
Rating of established control items                   2    3%
None                                                  2    3%
Not sure/don't know                                   2    3%
Survey: Combining Methods
Methods                                              n    Percentage
Single method                                        10   17%
Multiple methods, up to 5 (average 2.5)              45   75%
Expert review + Automatic filtering                  11   18%
Expert review + Paper data sheets                    10   17%
Expert review + Photos                               14   23%
Expert review + Photos + Paper data sheets            6   10%
Expert review + Replication (multiple participants)  10   17%
Survey: Resources & Methods
Number of validation methods and staff are positively correlated (r = 0.11)
More staffing = more supervisory capacity
Number of validation methods and budget are negatively correlated (r = -0.15)
If larger budgets mean more contributors, this constrains scalability of multiple methods
Larger projects may use fewer but more sophisticated mechanisms
Suggests that human-supervised methods don't scale
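For concreteness, the correlation statistic cited above can be computed directly from per-project counts; a minimal Python sketch, using illustrative placeholder numbers rather than the survey's actual data:

```python
# Minimal sketch: Pearson correlation between number of validation
# methods and staffing (FTEs). The numbers below are illustrative
# placeholders, not the survey's actual per-project data.
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

methods_per_project = [1, 3, 2, 5, 2, 4, 1, 3]      # validation methods used
ftes_per_project = [0.5, 2, 1, 6, 1, 3, 0.2, 2]     # full-time equivalents

print(f"r = {pearson_r(methods_per_project, ftes_per_project):.2f}")
```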
Survey: Other Validation Options
"Please describe any additional validation methods used in your project"
Several projects rely on personal knowledge of contributing individuals for data quality
Not scientifically robust, but understandably relevant
Most comments referred to details of expert review
Reinforces the perceived value of expertise
Reporting interface and associated error-checking is often overlooked, but provides important initial data verification
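As an illustration of that kind of front-line error-checking, here is a minimal sketch of validating an observation record at entry time; the field names, species list, and allowed ranges are hypothetical, not from any particular project:

```python
# Minimal sketch of reporting-interface error-checking: reject or flag
# implausible values before they enter the dataset. Field names and
# allowed ranges are hypothetical, not from any specific project.
from datetime import date

KNOWN_SPECIES = {"monarch butterfly", "eastern bluebird", "red maple"}

def check_observation(record: dict) -> list[str]:
    """Return a list of validation problems; empty means the record passes."""
    problems = []
    if record.get("species", "").lower() not in KNOWN_SPECIES:
        problems.append("unrecognized species name")
    if not (-90 <= record.get("lat", 999) <= 90):
        problems.append("latitude out of range")
    if not (-180 <= record.get("lon", 999) <= 180):
        problems.append("longitude out of range")
    if record.get("count", 0) <= 0:
        problems.append("count must be a positive integer")
    if record.get("date", date.today()) > date.today():
        problems.append("observation date is in the future")
    return problems

print(check_observation(
    {"species": "Monarch Butterfly", "lat": 43.0, "lon": -76.1,
     "count": 3, "date": date(2011, 6, 15)}))  # -> [] (record passes)
```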
Choosing Mechanisms
Data characteristics to consider when choosing mechanisms to ensure quality
Accuracy and precision: taxonomic, spatial, temporal, etc.
Error prevention: malfeasance (gaming the system), inexperience, data entry errors, etc.
Evaluate assumptions about error and accuracy
Where does error originate? How do mechanisms address this? At what step in the research process?
How transparent are data review and its outcomes?
How much data will be reviewed? In how much detail?
Mechanisms: Protocols
Mechanism                                   Process  Type/Detail
QA project plans                            Before   SOP in some areas
Repeated samples/tasks                      During   By multiple participants, single participant, or experts (calibration)
Tasks involving control items               During   Contributions compared to known states (see sketch below)
Uniform/calibrated equipment                During   Used for measurements; cost/scale tradeoff; who pays?
Paper data sheets + online entry*           During   Extended details, verifying data entry accuracy
Digital vouchers*                           During   Photos, audio, specimens/archives
Data triangulation, normalization, mining*  After    Corroboration from other data sources; statistical & computer science methods
Data documentation*                         After    Provide metadata about processes
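As a concrete reading of the control-items row, a minimal sketch of scoring contributions against tasks with known states; task IDs and answers are hypothetical:

```python
# Minimal sketch of the control-item mechanism: a few tasks have known
# answers, and a contributor's agreement with those answers estimates
# their accuracy. Task IDs and answers are hypothetical.
CONTROL_ANSWERS = {"task-07": "oak", "task-19": "maple", "task-23": "birch"}

def control_accuracy(contributions: dict[str, str]):
    """Fraction of control tasks this contributor answered correctly."""
    scored = [tid for tid in contributions if tid in CONTROL_ANSWERS]
    if not scored:
        return None  # contributor saw no control items
    correct = sum(contributions[tid] == CONTROL_ANSWERS[tid] for tid in scored)
    return correct / len(scored)

# A contributor who classified four tasks, three of them controls:
print(control_accuracy(
    {"task-07": "oak", "task-19": "elm", "task-23": "birch", "task-41": "pine"}))
# -> 0.666... (2 of 3 control items correct)
```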
Mechanisms: Participants
Mechanism                                      Process         Types/Details
Participant training                           Before, During  Initial; Ongoing; Formal QA/QC
Participant testing                            Before, During  Following training; Pre/test-retest
Rating participant performance                 During, After   Unknown to participant; Known to participant
Filtering of unusual reports                   During, After   Automatically (see sketch below); Manually
Contacting participants about unusual reports  After           May alienate/educate contributors
Automatic recognition                          After           Techniques for image/text processing
Expert review                                  After           By professionals, experienced contributors, or multiple parties
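Automatic filtering of unusual reports can take many forms; one minimal sketch, assuming simple per-site numeric measurements and a z-score threshold (illustrative data and threshold, not a method from the paper):

```python
# Minimal sketch of automatically filtering unusual reports: flag any
# report whose measured value falls far outside the historical
# distribution for that site. Data and threshold are illustrative.
from statistics import mean, stdev

def flag_unusual(history: list[float], new_value: float, z_max: float = 3.0) -> bool:
    """Flag new_value if it lies more than z_max standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return new_value != mu
    return abs(new_value - mu) / sigma > z_max

site_history = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2]  # e.g., water temperature (deg C)
print(flag_unusual(site_history, 12.3))  # False: plausible, accept
print(flag_unusual(site_history, 31.0))  # True: route to manual review
```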
Discussion
Need to pay more attention to the way data are created: not just protocols, but also qualities of data like accuracy and precision
Clear need for quality/validation mechanisms for analysis, not only for data collection/processing
Data mining techniques
Spatio-temporal modeling (see sketch below)
Scalability of validation may be limited
May need to plan different quality management techniques based on expected/actual project growth
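As a hedged illustration of the spatio-temporal point, a minimal plausibility check with entirely hypothetical species ranges and seasons (not real reference data):

```python
# Minimal sketch of a spatio-temporal plausibility check: flag a species
# observation reported outside its expected region or season. The range
# box and months below are hypothetical, not real species data.
EXPECTED = {
    # species: (lat_min, lat_max, lon_min, lon_max, active_months)
    "monarch butterfly": (25.0, 50.0, -105.0, -70.0, {5, 6, 7, 8, 9, 10}),
}

def plausibility_flags(species: str, lat: float, lon: float, month: int) -> list[str]:
    """Return reasons an observation looks spatio-temporally unusual."""
    if species not in EXPECTED:
        return ["no reference range for species"]
    lat_min, lat_max, lon_min, lon_max, months = EXPECTED[species]
    flags = []
    if not (lat_min <= lat <= lat_max and lon_min <= lon <= lon_max):
        flags.append("outside expected geographic range")
    if month not in months:
        flags.append("outside expected season")
    return flags

print(plausibility_flags("monarch butterfly", 43.0, -76.1, 7))  # -> []
print(plausibility_flags("monarch butterfly", 43.0, -76.1, 1))  # -> ['outside expected season']
```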
Future Work
Most projects worry more about contributor expertise than appropriate analysis methods
Resources are needed to support suitable analysis approaches and tools
Comparative evaluation of the efficacy of the data quality and validation mechanisms identified
Develop a QA/QC planning and evaluation tool
Develop examples of appropriate data documentation for citizen science projects
Necessary for peer review, data re-use
Thanks!
Nate Prestopnik
DataONE working group on Public Participation in Scientific Research
US NSF grants 09-43049 & 11-11107