Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions
Christian Gendreau, David Shorthouse & Peter Desmet


Post on 24-Feb-2016


TRANSCRIPT

Page 1: Christian  Gendreau , David  Shorthouse & Peter  Desmet

Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions

Christian Gendreau, David Shorthouse & Peter Desmet

Page 2

Game plan
• Introduction to Canadensys
• Data quality @ Canadensys
• Canadensys processing solutions
• Numbers from Canadensys
• Hopes and expectations

Page 3

A Network
Of people and collections

Page 4

Canadensys Headquarters
Université de Montréal Biodiversity Centre

Page 5

data.canadensys.net/vascan

Page 6

data.canadensys.net/ipt

Page 7

data.canadensys.net/explorer

Page 8

Data quality related activities
From an aggregator perspective

Page 9

During data entry
• Help to avoid typographical errors
• Help to convert verbatim data

Actor: data entry person

Page 10

Before publication

Actor: data publisher

• Detect file character encoding issues
• Detect duplicate or missing IDs

Previous activity: Data entry
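The two pre-publication checks on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a Canadensys tool: the function names `is_valid_utf8` and `find_id_problems` are hypothetical, records are assumed to be dictionaries of strings, and the ID column is assumed to be the Darwin Core `occurrenceID` term.

```python
from collections import Counter


def is_valid_utf8(raw):
    """Return True if the raw bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_id_problems(records, id_field="occurrenceID"):
    """Return (row indexes with a missing ID, sorted list of duplicated IDs)."""
    missing = []
    counts = Counter()
    for index, record in enumerate(records):
        # Treat an absent key, None, or a blank string as a missing ID.
        value = (record.get(id_field) or "").strip()
        if value:
            counts[value] += 1
        else:
            missing.append(index)
    duplicates = sorted(value for value, count in counts.items() if count > 1)
    return missing, duplicates
```

A file saved as Latin-1 but declared as UTF-8 typically fails the first check as soon as it contains an accented character, which is a common source of garbled names such as "QuÃ©bec" downstream.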

Page 11

During aggregation
• Process data: validation, cleaning
• Produce structured reports: quality control

Actor: data aggregator

Previous activity: Before publication

Page 12

After aggregation
• Allow and facilitate community feedback
• Help data publishers to integrate corrections

Actor: users and community

Previous activity: Aggregation

Page 13

Canadensys tools
during data entry

data.canadensys.net/tools

Page 14

Why do we process data?
• Enrich our Explorer, http://data.canadensys.net
• Provide structured reports to data providers
• Help identify records that need re-examination
• Help to improve data entry procedures

Page 15

Data processing

Page 16

Processing solutions
Narwhals to the rescue

Narwhal image: public domain

Page 17

The narwhal-processor approach
• Single-field processing to allow complex processing (combined fields)
• Processors with a common interface ease integration and usage
• Collaboration

https://github.com/Canadensys/narwhal-processor
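The idea of single-field processors behind a common interface can be illustrated with a small Python sketch. It is not the narwhal-processor API: every class, method, and field name below (`ProcessingResult`, `FieldProcessor`, `LookupProcessor`, `process_record`) is hypothetical, and the lookup tables are tiny stand-ins for real reference lists.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


@dataclass
class ProcessingResult:
    verbatim: str                 # value exactly as provided
    processed: Optional[str]      # cleaned value, or None if processing failed
    error: Optional[str] = None   # short message for the quality report


class FieldProcessor(Protocol):
    """Common interface: one verbatim field value in, one result out."""

    def process(self, verbatim: str) -> ProcessingResult:
        ...


class LookupProcessor:
    """Maps known spellings of a value to a standard code (case-insensitive)."""

    def __init__(self, lookup: Dict[str, str]):
        self._lookup = {key.lower(): value for key, value in lookup.items()}

    def process(self, verbatim: str) -> ProcessingResult:
        processed = self._lookup.get(verbatim.strip().lower())
        if processed is None:
            return ProcessingResult(verbatim, None, "no match")
        return ProcessingResult(verbatim, processed)


def process_record(record, processors):
    """Run each field through its processor; unregistered fields are skipped."""
    return {
        field: processors[field].process(value)
        for field, value in record.items()
        if field in processors
    }


processors = {
    "country": LookupProcessor({"USA": "US", "Canada": "CA"}),
    "stateProvince": LookupProcessor({"Qué": "CA-QC", "Quebec": "CA-QC"}),
}
results = process_record(
    {"country": "USA", "stateProvince": "Qué", "locality": "Montréal"},
    processors,
)
```

Because every processor returns the same result type, a combined-field check (for example, confirming that a province belongs to the stated country) can be layered on top of the per-field results without changing the individual processors.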

Page 18

Data usability
before processing

[Bar chart; y-axis: % of non-null clean verbatim data, 0% to 100%]
• country text: 92%
• state/province text: 60%
• coordinates: 96%
• dates: 44%

Page 19

Data usability
after processing

• 7% of provided country text

Example: USA → ISO 3166-2:US, United States

Page 20

Data usability
after processing

• 7% of provided country text
• 16% of provided state/province text

Example: Qué → ISO 3166-2 CA-QC, Quebec

Page 21

Data usability
after processing

• 7% of provided country text
• 16% of provided state/province text
• 4% of provided coordinates

Example: 45° 32' 25" N, 129° 40' 31" W → 45.5402778, -129.6752778
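The coordinate example on this slide is a degrees-minutes-seconds to decimal-degrees conversion: decimal = degrees + minutes/60 + seconds/3600, negated for the southern and western hemispheres. A minimal Python sketch follows; the function names are hypothetical, and the pattern only accepts the exact `D° M' S" H` layout shown on the slide.

```python
import re

# Matches e.g. 45° 32' 25" N (degrees, minutes, seconds, hemisphere letter).
_DMS = re.compile(r"""(\d+)\s*°\s*(\d+)\s*'\s*(\d+(?:\.\d+)?)\s*"\s*([NSEW])""")


def dms_to_decimal(text):
    """Convert one degrees-minutes-seconds value to signed decimal degrees."""
    match = _DMS.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised coordinate: {text!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    if hemisphere in ("S", "W"):
        value = -value
    return round(value, 7)


def parse_coordinate_pair(text):
    """Split 'latitude, longitude' and convert both parts.

    No check is made that the first part is N/S and the second E/W.
    """
    latitude_text, longitude_text = text.split(",")
    return dms_to_decimal(latitude_text), dms_to_decimal(longitude_text)


print(parse_coordinate_pair('45° 32\' 25" N, 129° 40\' 31" W'))
# (45.5402778, -129.6752778)
```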

Page 22

Data usability
after processing

• 7% of provided country text
• 16% of provided state/province text
• 4% of provided coordinates
• 42% of provided dates

Example: 2008 VI 13 → 2008-06-13
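The date example uses a Roman-numeral month, a convention common on specimen labels. A minimal Python sketch of that one conversion, assuming the `year month day` order shown on the slide (the function name is hypothetical, and real collection data would need many more patterns):

```python
from datetime import date

_ROMAN_MONTHS = {
    "I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
    "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12,
}


def roman_month_date_to_iso(text):
    """Convert 'YYYY <Roman month> DD' to an ISO 8601 date string."""
    parts = text.split()
    if len(parts) != 3:
        raise ValueError(f"expected three parts: {text!r}")
    year_text, month_text, day_text = parts
    month = _ROMAN_MONTHS.get(month_text.upper())
    if month is None:
        raise ValueError(f"unrecognised month: {month_text!r}")
    # date() rejects impossible dates such as 31 June.
    return date(int(year_text), month, int(day_text)).isoformat()


print(roman_month_date_to_iso("2008 VI 13"))
# 2008-06-13
```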

Page 23

Data usability
including processed data

[Bar chart; y-axis: % of non-null provided, 0% to 100%]
• country text: 92% clean verbatim, 7% processed
• state/province text: 60% clean verbatim, 16% processed
• coordinates: 96% clean verbatim, 4% processed
• dates: 44% clean verbatim, 42% processed
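The figures from the two usability charts can be combined with simple arithmetic. The sketch below assumes that the "clean verbatim" and "processed" shares are measured against the same base (non-null provided values) and do not overlap; the slides do not state the totals explicitly, so treat the sums as a reading of the chart rather than reported results.

```python
# Percentages read from the charts (share of non-null provided values).
clean_verbatim = {
    "country text": 92,
    "state/province text": 60,
    "coordinates": 96,
    "dates": 44,
}
recovered_by_processing = {
    "country text": 7,
    "state/province text": 16,
    "coordinates": 4,
    "dates": 42,
}

# Usable share after processing = already clean + recovered by processing.
usable = {
    field: clean_verbatim[field] + recovered_by_processing[field]
    for field in clean_verbatim
}

for field, percent in usable.items():
    print(f"{field}: {percent}% usable, {100 - percent}% still unusable")
```

Under these assumptions, dates gain the most from processing (44% to 86%), while state/province text remains the field with the largest unusable share.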

Page 24

Projects With Data Quality Tools
• Atlas of Living Australia
• GBIF Norway, GBIF Spain, National Biodiversity Network, BioVeL …
• GBIF libraries
• Most nodes have their own data quality routine

Page 25

Hopes and expectations

Page 26

We do not want to
• Maintain taxonomic authority files
• Maintain country, province and city lists

Page 27

We prefer to
• Efficiently use specialized resources/services
• Provide reports and quality indices

Page 28

Help from Semantic Web
• Data in other languages (French, Spanish, …) should not be flagged as errors
• Misspellings should be shared as a common resource (e.g. SKOS)
• Understand historical data (e.g. collected in the USSR in 1980)

Page 29

Reporting and log
• DarwinCore annotations for processed data
• Shared vocabulary for structured reports and quality indices

Page 30

Summary
• Tools available for sharing
• Use, review, contribute
• Opportunity for broad coordination and increased efficiencies

Page 31

Thanks

Anne Bruneau, Institut de recherche en biologie végétale and Département de Sciences Biologiques, Université de Montréal

Page 32

Contact
http://www.canadensys.net
http://github.com/Canadensys
@Canadensys

Gulo gulo, Larry Master (www.masterimages.org)