Slide 1

Towards a Generic Approach to Validation: the ValiDat Foundation Project

Statistisches Bundesamt (Federal Statistical Office of Germany)

SDE – Budapest

© Statistisches Bundesamt, Gruppe C3 – IT-Unterstützung des Geschäftsprozesses (Group C3, IT support of the business process)

Slide 2 (14/09/15). ValiDat - Foundation: A bit of history

• Once upon a time (actually now) ..
• .. 28 Member States in the European Statistical System (ESS) sent data to Eurostat
• .. and got a lot of problems:
  • The gatekeeper at the central fortress refused entry and sent the data back: they did not comply with the rules!
  • The poor member states tried again and again ..
• .. until finally some brave knights decided to fight the dragon of uncertainty and non-standards

Slide 3. ValiDat - Foundation: A bit of history

More prosaic:
• Already in 2009, Eurostat launched a project to harmonize data validation policies in the ESS
• In 2013 that project became the ESS Vision Implementation Project "Validation"
• In 2014, member states became more actively involved through a so-called ESSnet project
• Italy, the Netherlands, Germany and Lithuania started working on methodological and technological questions (as well as questions of standardization)
• In a truly interdisciplinary approach, the team engages methodologists and technicians from several fields

Slide 4. ValiDat - Foundation: What are we talking about?

• Right from the start, we had serious problems agreeing on a common definition of validation
• Specifically, the relationship between data validation and editing was not clear to some of us ..
• .. but we had wise guys from the methodological side in our team:

  Data validation is an activity verifying whether a combination of values is a member of a set of acceptable combinations

• As we see it: data validation is part one of the data editing process

Slide 5. ValiDat - Foundation

if employment status == "old-age pensioner" and age < 35 then error "Too young!"

0.5 < turnover(curMonth)/turnover(prevMonth) < 2

WENN ANZAHL VON Familie[ALLE].Person[MIT Alter < 18] > 0 DANN ... ENDE
(German rule syntax: IF COUNT OF Family[ALL].Person[WITH age < 18] > 0 THEN ... END)

IF maritalstate=married THEN
  Age>15 "Too young to be married"
ENDIF

profit <= 0.6*revenue
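The rule snippets on this slide are written in several different notations. As a minimal sketch of what they have in common, the Python below expresses four of them as predicates that return True when a record is acceptable. The field names (`employment_status`, `marital_state`, `turnover_cur`, `turnover_prev`) and the `validate` helper are assumptions made for illustration; they are not part of the slides or of any ESS tool.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]
Rule = Tuple[str, Callable[[Record], bool]]

# Each predicate returns True when the record is acceptable.
PERSON_RULES: List[Rule] = [
    # "old-age pensioner" and age < 35 is an error
    ("pensioner_age",
     lambda r: not (r["employment_status"] == "old-age pensioner" and r["age"] < 35)),
    # married persons must be older than 15
    ("married_age",
     lambda r: r["marital_state"] != "married" or r["age"] > 15),
]

BUSINESS_RULES: List[Rule] = [
    # profit <= 0.6 * revenue
    ("profit_share", lambda r: r["profit"] <= 0.6 * r["revenue"]),
    # 0.5 < turnover(curMonth) / turnover(prevMonth) < 2
    # (assumes the previous month's turnover is non-zero)
    ("turnover_ratio",
     lambda r: 0.5 < r["turnover_cur"] / r["turnover_prev"] < 2),
]


def validate(record: Record, rules: List[Rule]) -> List[str]:
    """Return the names of all rules that the record violates."""
    return [name for name, is_ok in rules if not is_ok(record)]


person = {"employment_status": "old-age pensioner", "age": 30,
          "marital_state": "married"}
business = {"profit": 70.0, "revenue": 100.0,
            "turnover_cur": 120.0, "turnover_prev": 100.0}

person_violations = validate(person, PERSON_RULES)
business_violations = validate(business, BUSINESS_RULES)
```

Writing every rule as "record in, True or False out" mirrors the definition given earlier: a rule decides whether a combination of values belongs to the set of acceptable combinations.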

Slide 6. ValiDat - Foundation: Is there a business case?

• When we started our survey on data validation in the ESS, we were not completely aware of the scale of the "problem":
• Effort: The member states estimated that data validation (and editing) in the five sample domains makes up 40 to 60% of the total effort
• Relevance: We did not ask about the impact of data validation on data quality (non-sampling errors), but assume that it is generally of paramount importance

Slide 7. ValiDat - Foundation: Business case - implications

• If validation has such a high impact on data quality and consumes so many resources, then it should be
  • well understood,
  • fairly widely standardized,
  • and automated as far as possible
• Sequence: Understanding is a) the methodological foundation of b) standardization, which in turn will be the basis for c) technical innovation (and process enhancements)

Slide 8. ValiDat - Foundation: The Base Line: Methodology

• A central part of the methodological work of the ESSnet project is writing a "handbook", i.e. compiling the work of others and making it available (pragmatically) to a general audience of statisticians
• Why are we doing validation (remember the business case!)?
  • Enhance data quality dimensions:
    • Directly (like accuracy, coherence and comparability)
    • Indirectly (timeliness), as restrictions

Slide 9. ValiDat - Foundation: The Base Line: Methodology

• Content of handbook:
  • What
  • Why
  • How
  • When

Table of contents

1 Introduction
2 Data validation
  2.1 What is data validation
  2.2 Why data validation. Relationship between validation and quality
  2.3 How to do data validation: validation levels and validation rules
    2.3.1 Validation levels from a business perspective
    2.3.2 Validation rules
  2.4 Generic framework for validation levels and validation rules
    2.4.1 Validation levels based on decomposition of metadata
    2.4.2 A formal typology of data validation functions
    2.4.3 Validation levels
    2.4.4 Relation between validation levels from a business and a formal perspective
    2.4.5 Applications and examples
3 Data validation as a process
  3.1 Data validation in a statistical production process (GSBPM)
  3.2 The informative objects of data validation (GSIM)
4 The data validation process life cycle
  4.1 Design phase
  4.2 Implementation phase
  4.3 Execution phase
  4.4 Review phase
References
Appendix 2.3.2: List of validation rules

Slide 10. ValiDat - Foundation: The Base Line: Methodology – What?

• The handbook provides classification schemes for validation rules:
  • Levels
  • Pragmatic typology
  • Formal typology
• All have their merits and help communicate about validation

[Table: the "Class" and "Example function" columns were not preserved in the transcript]

Description of input | Description of example
Single data point | Univariate comparison with constant
Multivariate (in-record) | Linear restriction
Multi-element (single variable) | Condition on aggregate of single variable
Multi-element multivariate | Condition on ratio of aggregates of two variables
Multi-measurement | Condition on difference between current and previous observation
Multi-measurement multivariate | Condition on ratio of sums of two currently and previously observed observations
Multi-measurement multi-element | Condition on ratio of current and previously observed aggregate
Multi-measurement multi-element, multivariate | Condition on difference between ratios of previous and currently observed aggregates
Multi-universe multi-element multivariate | Condition on ratio of aggregates over different variables of different object types
Multi-universe multi-measurement multi-element multi-time | Condition on difference between ratios of aggregates of different object types measured at different times

Typology dimension | Types of checks
1 | Identity checks; range checks (bounds fixed, or bounds depending on entries in other fields)
2 | Simple checks, based directly on the entry of a target field; more "complex" checks, combining more than one field by functions (like sums, differences, ratios)
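To make the first rows of the formal typology more concrete, here is a hedged sketch in Python with one function per class of input. The variable names (`profit`, `revenue`), the thresholds and the function names are all illustrative assumptions; the handbook's own example functions did not survive in this transcript.

```python
from typing import Dict, List

Record = Dict[str, float]


def single_point_ok(age: float) -> bool:
    # Single data point: univariate comparison with a constant.
    return age >= 0


def in_record_ok(rec: Record) -> bool:
    # Multivariate (in-record): linear restriction across fields of one record.
    return rec["profit"] <= 0.6 * rec["revenue"]


def aggregate_ok(records: List[Record], max_total: float) -> bool:
    # Multi-element (single variable): condition on an aggregate of one variable.
    return sum(r["revenue"] for r in records) <= max_total


def ratio_of_aggregates_ok(records: List[Record], low: float, high: float) -> bool:
    # Multi-element multivariate: condition on the ratio of aggregates
    # of two variables.
    total_profit = sum(r["profit"] for r in records)
    total_revenue = sum(r["revenue"] for r in records)
    return total_revenue > 0 and low <= total_profit / total_revenue <= high


def change_ok(current: float, previous: float, max_abs_change: float) -> bool:
    # Multi-measurement: condition on the difference between the current
    # and the previous observation.
    return abs(current - previous) <= max_abs_change


records = [{"profit": 10.0, "revenue": 100.0},
           {"profit": 50.0, "revenue": 100.0}]
```

The further down the table a class sits, the more data a single check needs: one value, one record, one data set, or several data sets from different periods or object types.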

 

Slide 11. ValiDat - Foundation: The Base Line: Methodology – What?

• Levels and rule types are building blocks to discuss other important concepts like:
  • Structural vs. content-based validation
  • Simple vs. complex rule types
  • Soft vs. hard checks
  • Micro data vs. macro data validation
• They can be used as a framework for metrics, languages and technologies
• One of the results of our survey was that there is no common understanding of these methodological issues yet!
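One of the distinctions listed above, soft vs. hard checks, can be sketched in a few lines. The sketch assumes the common reading that a failed hard check is an error that must be resolved, while a failed soft check only flags a value for review; the slides do not define the terms, and the class names, rule names and thresholds below are made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Severity(Enum):
    HARD = "error"    # assumed meaning: the record must be corrected
    SOFT = "warning"  # assumed meaning: the record should be reviewed


@dataclass(frozen=True)
class Check:
    name: str
    severity: Severity
    passes: Callable[[Dict[str, float]], bool]


CHECKS: List[Check] = [
    Check("age_non_negative", Severity.HARD, lambda r: r["age"] >= 0),
    Check("age_plausible", Severity.SOFT, lambda r: r["age"] <= 110),
]


def run_checks(record: Dict[str, float]) -> Dict[Severity, List[str]]:
    """Group the names of failed checks by severity."""
    findings: Dict[Severity, List[str]] = {s: [] for s in Severity}
    for check in CHECKS:
        if not check.passes(record):
            findings[check.severity].append(check.name)
    return findings


suspicious = run_checks({"age": 130})
invalid = run_checks({"age": -1})
```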

Slide 12. ValiDat - Foundation: The Base Line: Methodology – When?

[Figure not preserved: a diagram with four positions marked "HERE"]

Slide 13. ValiDat - Foundation: The Base Line: Methodology – How?

• Validation Life Cycle

Simon et al. 2015

Slide 14. ValiDat - Foundation: The Base Line: Methodology – How?

• How do we know that we have struck the right balance between
  • improving data quality
  • and keeping costs acceptable?
• Our solution: use metrics!
  • Analyse the internal consistency of validation rule sets
  • Analyse the value of validation rules on observed data
  • Analyse validation rule sets in comparison to observed and expected data
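As a rough illustration of what such metrics could look like, the sketch below computes two simple ones: the share of records failing each rule (one way to look at the value of a rule on observed data) and pairs of rules that fail on exactly the same records (a hint at redundancy within a rule set). These two measures, the function names and the sample data are assumptions chosen for illustration; the slides do not specify which metrics the project uses.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]
Rule = Callable[[Record], bool]  # True means the record passes


def failure_rates(rules: Dict[str, Rule], data: List[Record]) -> Dict[str, float]:
    """Share of records that fail each rule (empty dict for empty data)."""
    if not data:
        return {}
    return {name: sum(not rule(r) for r in data) / len(data)
            for name, rule in rules.items()}


def redundant_pairs(rules: Dict[str, Rule],
                    data: List[Record]) -> List[Tuple[str, str]]:
    """Pairs of rules that fail on exactly the same, non-empty set of records."""
    failed = {name: frozenset(i for i, r in enumerate(data) if not rule(r))
              for name, rule in rules.items()}
    names = sorted(failed)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if failed[a] and failed[a] == failed[b]]


rules: Dict[str, Rule] = {
    "age_non_negative": lambda r: r["age"] >= 0,
    "age_not_below_zero": lambda r: not r["age"] < 0,  # same condition, reworded
    "income_non_negative": lambda r: r["income"] >= 0,
}
data = [
    {"age": 30, "income": 1000},
    {"age": -1, "income": 500},
    {"age": 45, "income": -20},
    {"age": 20, "income": 0},
]

rates = failure_rates(rules, data)
duplicates = redundant_pairs(rules, data)
```

Identical failure sets on one data set only suggest redundancy; showing that two rules are logically equivalent needs an analysis of the rules themselves, not just of observed data.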

Slide 15. ValiDat - Foundation: Language

• The future validation language has two main goals:
  • It should provide an unambiguous communication channel for specialists (humans!)
  • It should feed different IT systems with the necessary specific information about a particular survey
• These might be conflicting aims!
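The two goals can be illustrated with a toy example: one rule object that renders both as text for human readers and as JSON for an IT system. This is only a sketch under stated assumptions; `RangeRule` and its methods are hypothetical names invented here and have nothing to do with the syntax of any actual validation language.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass(frozen=True)
class RangeRule:
    """A hypothetical declarative rule: lower <= variable <= upper."""
    variable: str
    lower: float
    upper: float

    def to_text(self) -> str:
        # Rendering for specialists reading the rule.
        return f"{self.lower} <= {self.variable} <= {self.upper}"

    def to_json(self) -> str:
        # Rendering for machine-to-machine exchange.
        return json.dumps(asdict(self), sort_keys=True)

    def check(self, record: Dict[str, float]) -> bool:
        # Execution of the same rule against one record.
        return self.lower <= record[self.variable] <= self.upper


rule = RangeRule("age", 0, 120)
text_form = rule.to_text()
json_form = rule.to_json()
```

Keeping a single definition and deriving both renderings from it is one way to reduce the tension between the two aims, at the price of restricting what a rule can express.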

Slide 16. ValiDat - Foundation: Language: A new Sta(nda)r(d) is born

• VTL (Validation and Transformation Language) has been specified by the SDMX community

Slide 17. ValiDat - Foundation: Language: A new Sta(nda)r(d) is born

• The language is currently under review in our project
• Different aspects:
  • Correctness and coherence
  • Completeness
  • Usability (by human users)
  • Feasibility (for machine-to-machine communication)
• Evaluation will be publicly available soon ..

Slide 18. ValiDat - Foundation: From Language to Tools and Services

• Proof of Concept being launched
  • From real world to ..
  • .. VTL to ..
  • .. national systems
• CBS (flexible R solution)
• Destatis (mighty IT-infrastructural approach)

Slide 19. ValiDat - Foundation: Tools and services

• Heterogeneity: Tools being used for validation are varied. A large number of general-purpose and specialized tools are applied in the NSIs of the ESS (compare survey results)
• Costs: Variety causes costs!
• Common infrastructure: As a solution to secure homogeneous (high-quality) output at reasonable (and lower than now) costs
• Preconditions: Methodology and language have to be standardized first
• Is it worth it?
• We believe: Yes!

Slide 20. ValiDat - Foundation: Your input is needed!

• In this presentation we covered a lot of ground without going into detail
• We would appreciate it if you could give us feedback on the methodological findings as stated in the handbook (publicly available by the end of this year!)
• And we would be very happy if you (or colleagues from your institution) would discuss the implications and proposals made in our ESSnet at a workshop in Wiesbaden, November 10-11!
• .. and help fight the Data Ping-Pong!

Slide 21

Köszönöm szépen (Thank you very much)