datavalidationinfrastructure: the validate package · package markvanderlooandedwindejonge...

26
Basic concepts and workflow Validation features Outlook Data validation infrastructure: the validate package Mark van der Loo and Edwin de Jonge Statistics Netherlands Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

Upload: others

Post on 24-Sep-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Datavalidationinfrastructure: the validate package · package MarkvanderLooandEdwindeJonge StatisticsNetherlands MarkvanderLooandEdwindeJonge Datavalidationinfrastructure: thevalidate

Basic concepts and workflowValidation features

Outlook

Data validation infrastructure: the validatepackage

Mark van der Loo and Edwin de Jonge

Statistics Netherlands

Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

Page 2: Datavalidationinfrastructure: the validate package · package MarkvanderLooandEdwindeJonge StatisticsNetherlands MarkvanderLooandEdwindeJonge Datavalidationinfrastructure: thevalidate

Basic concepts and workflowValidation features

Outlook

Validate

GoalTo make checking your data against domain knowledge andtechnical demands as easy as possible.

Content of this talkI Basic concepts and workflowI Examples of possibilities and syntaxI Outlook

Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

Page 3: Datavalidationinfrastructure: the validate package · package MarkvanderLooandEdwindeJonge StatisticsNetherlands MarkvanderLooandEdwindeJonge Datavalidationinfrastructure: thevalidate

Basic concepts and workflowValidation features

Outlook

Basic concepts and workflow

Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

Page 4: Datavalidationinfrastructure: the validate package · package MarkvanderLooandEdwindeJonge StatisticsNetherlands MarkvanderLooandEdwindeJonge Datavalidationinfrastructure: thevalidate

Basic concepts and workflowValidation features

Outlook

Data Validation

IntuitivelyCheck whether data holds up to (multivariate) relations that youexpect from domain knowledge.

Validation rulesThese relations can often be expressed as a set of simple rules.

Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

Page 5: Datavalidationinfrastructure: the validate package · package MarkvanderLooandEdwindeJonge StatisticsNetherlands MarkvanderLooandEdwindeJonge Datavalidationinfrastructure: thevalidate

Basic concepts and workflowValidation features

Outlook

Basic concepts of the validate package

Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

Page 6: Datavalidationinfrastructure: the validate package · package MarkvanderLooandEdwindeJonge StatisticsNetherlands MarkvanderLooandEdwindeJonge Datavalidationinfrastructure: thevalidate

Basic concepts and workflowValidation features

Outlook

Example: retailers data

library(validate)data(retailers)

dat <- retailers[4:6]head(dat)

## turnover other.rev total.rev## 1 NA NA 1130## 2 1607 NA 1607## 3 6886 -33 6919## 4 3861 13 3874## 5 NA 37 5602## 6 25 NA 25

Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

Page 7: Datavalidationinfrastructure: the validate package · package MarkvanderLooandEdwindeJonge StatisticsNetherlands MarkvanderLooandEdwindeJonge Datavalidationinfrastructure: thevalidate

Basic concepts and workflowValidation features

Outlook

Basic workflow

# define rulesv <- validator(turnover >= 0, other.rev >= 0

, turnover + other.rev == total.rev)# confront with datacf <- confront(dat, v)# analyze resultssummary(cf)

## rule items passes fails nNA error warning## 1 V1 60 56 0 4 FALSE FALSE## 2 V2 60 23 1 36 FALSE FALSE## 3 V3 60 19 4 37 FALSE FALSE## expression## 1 turnover >= 0## 2 other.rev >= 0## 3 abs(turnover + other.rev - total.rev) < 1e-08

Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

Page 8: Datavalidationinfrastructure: the validate package · package MarkvanderLooandEdwindeJonge StatisticsNetherlands MarkvanderLooandEdwindeJonge Datavalidationinfrastructure: thevalidate

Basic concepts and workflowValidation features

Outlook

Plot the validation

barplot(cf, main="retailers")

Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

Page 9: Datavalidationinfrastructure: the validate package · package MarkvanderLooandEdwindeJonge StatisticsNetherlands MarkvanderLooandEdwindeJonge Datavalidationinfrastructure: thevalidate

Basic concepts and workflowValidation features

Outlook

Get all results

# A value For each item (record) and each rulehead(values(cf))

## V1 V2 V3## [1,] NA NA NA## [2,] TRUE NA NA## [3,] TRUE FALSE FALSE## [4,] TRUE TRUE TRUE## [5,] NA TRUE NA## [6,] TRUE NA NA

Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

Page 10: Datavalidationinfrastructure: the validate package · package MarkvanderLooandEdwindeJonge StatisticsNetherlands MarkvanderLooandEdwindeJonge Datavalidationinfrastructure: thevalidate

Basic concepts and workflowValidation features

Outlook

Shortcut using check_that()

cf <- check_that(dat, turnover >= 0, other.rev >= 0, turnover + other.rev == total.rev)

Or, using the magrittr ‘not-a-pipe’ operator:

dat %>%check_that(turnover >= 0

, other.rev >= 0, turnover + other.rev == total.rev) %>%

summary()

Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

Page 11: Datavalidationinfrastructure: the validate package · package MarkvanderLooandEdwindeJonge StatisticsNetherlands MarkvanderLooandEdwindeJonge Datavalidationinfrastructure: thevalidate

Basic concepts and workflowValidation features

Outlook

Read rules from file

### myrules.txt

# inequalitiesturnover >= 0other.rev >= 0

# balance ruleturnover + other.rev == total.rev

v <- validator(.file="myrules.txt")

Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

Page 12: Datavalidationinfrastructure: the validate package · package MarkvanderLooandEdwindeJonge StatisticsNetherlands MarkvanderLooandEdwindeJonge Datavalidationinfrastructure: thevalidate

Basic concepts and workflowValidation features

Outlook

Validation features

Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

Page 13: Datavalidationinfrastructure: the validate package · package MarkvanderLooandEdwindeJonge StatisticsNetherlands MarkvanderLooandEdwindeJonge Datavalidationinfrastructure: thevalidate

Basic concepts and workflowValidation features

Outlook

Validating aggregates

RuleMean turnover must be at least 20% of mean totalrevenue

Syntax

mean(total.rev,na.rm=TRUE) /mean(turnover,na.rm=TRUE) >= 0.2

Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

Page 14: Datavalidationinfrastructure: the validate package · package MarkvanderLooandEdwindeJonge StatisticsNetherlands MarkvanderLooandEdwindeJonge Datavalidationinfrastructure: thevalidate

Basic concepts and workflowValidation features

Outlook

Rules that express conditions

RuleIf the total revenue is larger than zero, the turnover mustbe larger than zero.

Syntax

# executed for each rowif (total.rev > 0) turnover > 0

Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

Page 15: Datavalidationinfrastructure: the validate package · package MarkvanderLooandEdwindeJonge StatisticsNetherlands MarkvanderLooandEdwindeJonge Datavalidationinfrastructure: thevalidate

Basic concepts and workflowValidation features

Outlook

Functional dependencies

RuleTwo records with the same zip code, must have the samecity and street name.

zip → city + street

zip ~ city + street

Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

Page 16: Datavalidationinfrastructure: the validate package · package MarkvanderLooandEdwindeJonge StatisticsNetherlands MarkvanderLooandEdwindeJonge Datavalidationinfrastructure: thevalidate

Basic concepts and workflowValidation features

Outlook

Using transient variables

RuleThe turnover must be between 0.1 and 10 times its median

Syntax

# transient variable with the := operatormed := median(turnover,na.rm=TRUE)turnover > 0.1*medturnover < 10*med

Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

Page 17: Datavalidationinfrastructure: the validate package · package MarkvanderLooandEdwindeJonge StatisticsNetherlands MarkvanderLooandEdwindeJonge Datavalidationinfrastructure: thevalidate

Basic concepts and workflowValidation features

Outlook

Don’t repeat yourself

RuleTurnover, other revenue, and total revenue must bebetween 0 and 2000.

Syntax

G := var_group(turnover, other.rev, total.rev)

G >= 0G <= 2000

Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

Page 18: Datavalidationinfrastructure: the validate package · package MarkvanderLooandEdwindeJonge StatisticsNetherlands MarkvanderLooandEdwindeJonge Datavalidationinfrastructure: thevalidate

Basic concepts and workflowValidation features

Outlook

Validating types

RuleTurnover is a numeric variable

Syntax

# any is.-function is valid.is.numeric(turnover)

Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

Page 19: Datavalidationinfrastructure: the validate package · package MarkvanderLooandEdwindeJonge StatisticsNetherlands MarkvanderLooandEdwindeJonge Datavalidationinfrastructure: thevalidate

Basic concepts and workflowValidation features

Outlook

Validating metadata

RulesI The variable foo must be present.I The number of rows must be at least 20

Syntax

# use the "." to access the dataset as a whole"foo" %in% names(.)nrow(.) >= 20

Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

Page 20: Datavalidationinfrastructure: the validate package · package MarkvanderLooandEdwindeJonge StatisticsNetherlands MarkvanderLooandEdwindeJonge Datavalidationinfrastructure: thevalidate

Basic concepts and workflowValidation features

Outlook

Referencing other datasets

RuleThe mean turnover of this year must not be more than 1.1times last years mean.

Syntax

# use the "$" operator to reference other datasetsv <- validator(

mean(turnover, na.rm=TRUE) <mean(lastyear$turnover,na.rm=TRUE))

cf <- confront(dat, v, ref=list(lastyear=dat_lastyear))

Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

Page 21: Datavalidationinfrastructure: the validate package · package MarkvanderLooandEdwindeJonge StatisticsNetherlands MarkvanderLooandEdwindeJonge Datavalidationinfrastructure: thevalidate

Basic concepts and workflowValidation features

Outlook

Other features

I Rules (validator objects)I Select from validator objects using []I Extract or set rule metadata (label, description, timestamp, . . .)I Get affected variable names, rule linkageI Summarize rulesI Read/write to yaml format

I ConfrontI Control behaviour on NAI Raise errors, warningsI Set machine rounding limit

Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

Page 22: Datavalidationinfrastructure: the validate package · package MarkvanderLooandEdwindeJonge StatisticsNetherlands MarkvanderLooandEdwindeJonge Datavalidationinfrastructure: thevalidate

Basic concepts and workflowValidation features

Outlook

Outlook

Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

Page 23: Datavalidationinfrastructure: the validate package · package MarkvanderLooandEdwindeJonge StatisticsNetherlands MarkvanderLooandEdwindeJonge Datavalidationinfrastructure: thevalidate

Basic concepts and workflowValidation features

Outlook

In the works / ideas

I More analyses of rulesI More programmabilityI More (interactive) visualisationsI Roxygen-like metadata specificationI More support for reportingI . . .

We’d ♥ to hear your comments, suggestions, bugreports

Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

Page 24: Datavalidationinfrastructure: the validate package · package MarkvanderLooandEdwindeJonge StatisticsNetherlands MarkvanderLooandEdwindeJonge Datavalidationinfrastructure: thevalidate

Basic concepts and workflowValidation features

Outlook

Validate is just the beginning!

See github.com/data-cleaning

Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

Page 25: Datavalidationinfrastructure: the validate package · package MarkvanderLooandEdwindeJonge StatisticsNetherlands MarkvanderLooandEdwindeJonge Datavalidationinfrastructure: thevalidate

Basic concepts and workflowValidation features

Outlook

References

I Di Zio et al (2015) Methodology for data validation. ESSNeton validation, deliverable. [pdf]

I Van der Loo (2015) A formal typology of data validationfunctions. in United Nations Economic Comission for EuropeWork Session on Statistical Data Editing , Budapest. [pdf]

I Van der Loo, M. and E. de Jonge (2016). Statistical DataCleaning with Applications in R, Wiley (in preparation).

Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package

Page 26: Datavalidationinfrastructure: the validate package · package MarkvanderLooandEdwindeJonge StatisticsNetherlands MarkvanderLooandEdwindeJonge Datavalidationinfrastructure: thevalidate

Basic concepts and workflowValidation features

Outlook

Contact, links

Code, bugreportsI cran.r-project.org/package=validateI github/data-cleaning/validate

This talkslideshare.com/markvanderloo

[email protected] @[email protected] @edwindjonge

Mark van der Loo and Edwin de Jonge Data validation infrastructure: the validate package