Better Data, Better Science!
[ Better Science through Better Data Management ]
Todd D. O’Brien
NOAA - NMFS - COPEPOD
“BETTER DATA” is …
• Easily Accessible
• Well Documented
• Integrated / Interlinked
• The Best Quality possible
Oops! (When Data Management Fails)
WHY QC?
• To find errors in the data …
– To detect instrument failure or sampling problems
– To detect phenomena of scientific interest
    • Natural physical or biological events
    • Something “new”
• To find errors in the data … that were not present in the original data?!
– Data pathway errors
    • human error
    • computer error
WHAT TO QC?
• Individual values (the measurements)?
• Profile of multiple values?
• Cruise of multiple profiles?
• Project of multiple cruises?
• Region or Ocean of multiple Projects?
• Entire World of multiple Regions?
What software, tools, and skills are available?
Station  Lon        Lat       Time   SPEED
1        -69.30732  39.86233   7:00
2        -68.93825  38.70241   8:00   29.21
3        -68.54282  37.30523   9:00   34.85
4        -67.96285  35.5917   10:00   43.42
5        -66.56567  33.1664   11:00   67.18
6        -66.11751  32.45462  12:00   20.19
7        -67.54106  34.58994  13:00   61.59
8        -65.03667  30.87291  14:00  107.57
9        -64.11399  30.84654  15:00   22.15
10       -63.56039  31.37378  16:00   18.35
11       -65.64299  34.53722  18:00   45.45
12       -67.35653  38.46515  19:00  102.85
13       -60.89783  38.14881  19:15  620.78
14       -67.67287  39.41418  20:00  220.55
15       -68.25284  40.38957  21:00   27.23
[Figure: map of the station positions (longitude -75 to -60, latitude 30 to 45)]
[Figure: histogram with bins from 0 to 12 in steps of 0.2 plus a "More" bin, a frequency axis from 0 to 1000, and a cumulative percentage axis from 0% to 100%]
Let’s get started …
QC OF THE “WHAT & HOW”
• Need to first understand the methods, variables, and units of the data before trying to QC the data
– Are all labels clear and unambiguous?
– Are methods provided (or a reference)?
– What are the value units?
QC OF THE “WHEN & WHERE”
• Primary Data:
– First, check the master ship record
– Then check PI files
• Simple Range Checks
– Time (0-23? 1-24?)
    • What is the time zone?
– Lat +/- 90, Lon +/- 180
    • Are hemisphere signs present (E/W) or described?
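The simple range checks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: positions are signed decimal degrees (south and west negative), hours follow the 0-23 convention, and the function name `check_position_time` is hypothetical. A file that uses 1-24 hours or 0-360 longitudes would need different limits.

```python
def check_position_time(lat, lon, hour):
    """Return a list of range-check problems for one station record.

    Assumes signed decimal degrees and a 0-23 hour convention.
    """
    problems = []
    if not -90.0 <= lat <= 90.0:
        problems.append(f"latitude {lat} outside -90..90")
    if not -180.0 <= lon <= 180.0:
        problems.append(f"longitude {lon} outside -180..180")
    if not 0 <= hour <= 23:
        problems.append(f"hour {hour} outside 0..23")
    return problems


# A plausible record passes; an implausible one reports every failed check.
ok_record = check_position_time(39.86233, -69.30732, 7)
bad_record = check_position_time(95.0, 200.0, 24)
```

An empty list means the record passed all three checks. It does not mean the position is correct, only that it is possible.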
• Map the Cruise Track
– sorted by station sequence
– sorted by sampling time
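Before drawing the two maps, it can help to check whether the two orderings even agree. The sketch below is one possible way to do that. The function name `order_mismatches` and the `(station, time_in_hours)` record layout are assumptions made for illustration.

```python
def order_mismatches(records):
    """Compare station-sequence order with sampling-time order.

    records: list of (station_number, time_in_hours) tuples.
    Returns (station_by_sequence, station_by_time) pairs for every
    position where the two orderings disagree.
    """
    by_station = [s for s, _ in sorted(records, key=lambda r: r[0])]
    by_time = [s for s, _ in sorted(records, key=lambda r: r[1])]
    return [(a, b) for a, b in zip(by_station, by_time) if a != b]


# Illustrative values only: stations 2 and 3 have swapped times.
consistent = order_mismatches([(1, 7.0), (2, 8.0), (3, 9.0)])
swapped = order_mismatches([(1, 7.0), (2, 9.0), (3, 8.0)])
```

Any mismatch points at either a station number or a time that deserves a second look against the ship record.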
• Calculate ship speed (distance/time) between stations
Station  Lon        Lat       Time   SPEED
1        -69.30732  39.86233   7:00
2        -68.93825  38.70241   8:00   29.21
3        -68.54282  37.30523   9:00   34.85
4        -67.96285  35.5917   10:00   43.42
5        -66.56567  33.1664   11:00   67.18
6        -66.11751  32.45462  12:00   20.19
7        -67.54106  34.58994  13:00   61.59
8        -65.03667  30.87291  14:00  107.57
9        -64.11399  30.84654  15:00   22.15
10       -63.56039  31.37378  16:00   18.35
11       -65.64299  34.53722  18:00   45.45
12       -67.35653  38.46515  19:00  102.85
13       -60.89783  38.14881  19:15  620.78
14       -67.67287  39.41418  20:00  220.55
15       -68.25284  40.38957  21:00   27.23
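The speed check can be sketched as follows. The SPEED values in the table appear consistent with a flat (planar) distance in decimal degrees divided by the elapsed time in days. That reading is inferred from the numbers and is not stated in the slides. The helper names and the "4 times the median" threshold are assumptions chosen for illustration; a real check would more likely use great-circle distance and a realistic maximum ship speed.

```python
import math
from statistics import median

# (station, lon, lat, time) copied from the example table.
STATIONS = [
    (1, -69.30732, 39.86233, "7:00"),
    (2, -68.93825, 38.70241, "8:00"),
    (3, -68.54282, 37.30523, "9:00"),
    (4, -67.96285, 35.5917, "10:00"),
    (5, -66.56567, 33.1664, "11:00"),
    (6, -66.11751, 32.45462, "12:00"),
    (7, -67.54106, 34.58994, "13:00"),
    (8, -65.03667, 30.87291, "14:00"),
    (9, -64.11399, 30.84654, "15:00"),
    (10, -63.56039, 31.37378, "16:00"),
    (11, -65.64299, 34.53722, "18:00"),
    (12, -67.35653, 38.46515, "19:00"),
    (13, -60.89783, 38.14881, "19:15"),
    (14, -67.67287, 39.41418, "20:00"),
    (15, -68.25284, 40.38957, "21:00"),
]


def to_days(hhmm):
    """Convert an 'H:MM' clock time to a fraction of a day."""
    hours, minutes = hhmm.split(":")
    return (int(hours) + int(minutes) / 60.0) / 24.0


def leg_speeds(stations):
    """Speed arriving at each station, in planar degrees per day.

    Assumes all times fall on the same day and strictly increase.
    """
    speeds = {}
    for prev, cur in zip(stations, stations[1:]):
        distance = math.hypot(cur[1] - prev[1], cur[2] - prev[2])
        elapsed = to_days(cur[3]) - to_days(prev[3])
        speeds[cur[0]] = distance / elapsed
    return speeds


def flag_fast_legs(speeds, factor=4.0):
    """Stations whose arrival speed exceeds factor times the median speed."""
    limit = factor * median(speeds.values())
    return sorted(s for s, v in speeds.items() if v > limit)


SPEEDS = leg_speeds(STATIONS)
SUSPECT = flag_fast_legs(SPEEDS)
```

With these inputs the legs into and out of station 13 stand out, which suggests that the position or time recorded for station 13 is the value to check first.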
QC OF THE “HOW MUCH”
• First, look at the background environment
– Check for depth inversions
– Check for density inversions
– Look at T vs. S plot
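Both inversion checks reduce to the same test: down a profile, each value should not be smaller than the one above it. A minimal sketch follows. The function name `find_inversions` is hypothetical and the sample numbers are made up for illustration. The density values are taken as already computed, since the slides do not say which equation of state is used.

```python
def find_inversions(values):
    """Return indices where a value is smaller than the one before it.

    Applies to any quantity expected to increase down the profile,
    such as depth or density.
    """
    return [i for i in range(1, len(values)) if values[i] < values[i - 1]]


# Illustrative values only (not from any real cast).
depths = [0, 10, 20, 15, 30]
densities = [1025.0, 1025.4, 1025.3, 1026.0]

depth_inversions = find_inversions(depths)
density_inversions = find_inversions(densities)
```

A flagged index is a prompt to look at the record, not proof of an error, because small density inversions can be real features.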
• Look at the variable vs. depth
• Check against basic value ranges
• Check for excessive gradients (spikes) between values at adjacent depths
[Figure: profile plot of Measurement (0 to 15) vs. Depth (0 to 160)]
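The value-range and spike checks can be sketched as two small functions. The names `range_flags` and `spike_flags`, the limits, and the sample profile are all assumptions for illustration. Suitable ranges and gradient limits depend on the variable, region, and depth.

```python
def range_flags(values, lo, hi):
    """Indices of values that fall outside the closed range [lo, hi]."""
    return [i for i, v in enumerate(values) if not lo <= v <= hi]


def spike_flags(depths, values, max_gradient):
    """Index pairs (i, i + 1) whose gradient exceeds max_gradient.

    The gradient is the absolute change in value per unit depth
    between adjacent samples. Pairs at the same depth are skipped.
    """
    flagged = []
    for i in range(len(values) - 1):
        depth_step = abs(depths[i + 1] - depths[i])
        if depth_step == 0:
            continue
        gradient = abs(values[i + 1] - values[i]) / depth_step
        if gradient > max_gradient:
            flagged.append((i, i + 1))
    return flagged


# Illustrative profile only: the value at 40 m is a deliberate spike.
depths = [0, 20, 40, 60, 80]
values = [5.0, 5.5, 14.0, 6.0, 6.2]

out_of_range = range_flags(values, 0.0, 10.0)
spikes = spike_flags(depths, values, 0.2)
```

A single bad value usually produces two flagged pairs, one on each side of it, which helps separate an isolated spike from a real, sustained gradient.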
Expert / Specialist Data Centers
• Can provide guidance on
– Metadata (standards, minimum requirements)
– Data Formats (format suggestions / review)
– Tools and Methods
• May have advanced visualization or QC methods available for your data.
Empirical Comparisons with Historical Observations (ECHO)
Expert / Specialist Data Centers (just a few examples)
• CCHDO - CLIVAR Carbon & Hydrographic Data Office
• BCO-DMO - Biological and Chemical Oceanography Data Management Office
• BODC - British Oceanographic Data Centre
• COPEPOD - Coastal & Oceanic Plankton Ecology, Production & Observation Database
The Conclusions
Some Conclusions
• Each additional layer of QC and examination may highlight issues that were previously undetected.
• Each instance of transfer or reformatting the data has a chance of introducing new errors (or data loss).
• The comprehensiveness of the co-stored metadata will determine the extent to which the data are still usable/understandable 10+ years after the project.
“BETTER DATA” is …
• Easily Accessible
• Well Documented
• Integrated / Interlinked
• The Best Quality possible