An information model for a metadata-driven
editing and imputation system
Rok Platinovsek
UNECE Work Session on Statistical Data Editing, April 24-26 2017
Contents
• Banff parameters
• Metadata information model
• Data organization
Banff
Banff procedures
• Edit Specification and Analysis
• Edit Summary Statistics Tables
• Error Localization
• Deterministic Imputation
• Donor Imputation
• Imputation Estimators
• Prorating
• Mass Imputation
• Outlier Detection
Purpose categories shown in the diagram: Amendment, Review, Selection
Banff processor
The metadata information model
[Diagram of the metadata objects: Process flow, Function, Method, Scalar parameter, Variable list, Weighted variable list, Expression, Edit set, Estimator set, Edit, Estimator, Algorithm]
E&I metadata information model
Process flow
• Topmost object
• Several process flows, e.g. production and testing
Function
• Description of E&I activity without reference to implementation
• Purpose attribute (review | selection | amendment)
Method
• Central information object in the model
• Implementation attribute, e.g., “Banff donor imputation”
• Parameter set depends on implementation
Driver table
E&I process for the egg-laying statistic (production)
Function | Method name | Purpose | Implementation
Verify the set of edits | | prep |
 | Verify the set of edits | | BANFF verifyedits
Edit summary tables | | review |
 | Edit summary tables | | BANFF editstats
Identify outliers | | selection |
 | Identify outliers; historic method | | BANFF outlier
 | Identify outliers; current method | | BANFF outlier
Identify inconsistent observations and select fields for imputation | | selection |
 | Identify inconsistent observations and select fields for imputation | | BANFF errorloc
Impute missing values and fields identified by error localization | | amendment |
 | Deterministic imputation | | BANFF deterministic
 | Donor imputation within areas | | BANFF donorimputation
 | Donor imputation (unrestricted) | | BANFF donorimputation
 | Estimator imputation (negative values not accepted) | | BANFF estimatorimputation
 | Estimator imputation for QR_PROF (negative values accepted) | | BANFF estimatorimputation
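As a rough illustration of how a driver table like this could be held in code, the sketch below stores one record per method, carrying the function name and purpose along with it. The class and helper names (DriverRow, methods_for_purpose, execution_plan) are hypothetical, and treating table order as execution order is an assumption, not something the slides state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriverRow:
    """One method row of the driver table, with its parent function's attributes."""
    function: str
    purpose: str          # prep | review | selection | amendment
    method: str
    implementation: str


IMPUTE = "Impute missing values and fields identified by error localization"
ERRLOC = "Identify inconsistent observations and select fields for imputation"

# Rows transcribed from the driver table on the slide.
DRIVER_TABLE = [
    DriverRow("Verify the set of edits", "prep", "Verify the set of edits", "BANFF verifyedits"),
    DriverRow("Edit summary tables", "review", "Edit summary tables", "BANFF editstats"),
    DriverRow("Identify outliers", "selection", "Identify outliers; historic method", "BANFF outlier"),
    DriverRow("Identify outliers", "selection", "Identify outliers; current method", "BANFF outlier"),
    DriverRow(ERRLOC, "selection", ERRLOC, "BANFF errorloc"),
    DriverRow(IMPUTE, "amendment", "Deterministic imputation", "BANFF deterministic"),
    DriverRow(IMPUTE, "amendment", "Donor imputation within areas", "BANFF donorimputation"),
    DriverRow(IMPUTE, "amendment", "Donor imputation (unrestricted)", "BANFF donorimputation"),
    DriverRow(IMPUTE, "amendment",
              "Estimator imputation (negative values not accepted)", "BANFF estimatorimputation"),
    DriverRow(IMPUTE, "amendment",
              "Estimator imputation for QR_PROF (negative values accepted)", "BANFF estimatorimputation"),
]


def methods_for_purpose(rows, purpose):
    """Return the method names whose parent function has the given purpose."""
    return [row.method for row in rows if row.purpose == purpose]


def execution_plan(rows):
    """List the methods in table order (assumed here to be the execution order)."""
    return [
        f"{step}. [{row.purpose}] {row.method} -> {row.implementation}"
        for step, row in enumerate(rows, start=1)
    ]


if __name__ == "__main__":
    for line in execution_plan(DRIVER_TABLE):
        print(line)
```

Keeping the implementation as a plain attribute mirrors the point made on the Method slide: the function describes the activity, and only the method ties it to a concrete procedure.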
Method parameters
Scalar parameter
• Name-value pair
• E.g. name="mindonors" value="10"
• Information model extensible
Variable list
• Ordered vector of variable names
• Each variable marked up via unique variable-ID
Expression
• SAS expression, used for various purposes
• E.g. "strat=1" used to select a subset to which E&I action is applied
• Contains a variable list of variables used in the expression
Edit & edit set
• The same edit can appear in several edit sets
• Several methods can use the same edit set
• An edit is expressed as a SAS expression
Example: edit rule definition
Edit 1 (edit rule)
• Expression: HEN_LT20 + HEN_GE20 + HEN_OTH = HEN_TOT
• Variable list: HEN_LT20, HEN_GE20, HEN_OTH, HEN_TOT
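The relationship between an expression and its variable list can be sketched in a few lines. The example below is a Python stand-in, not SAS: it pulls identifiers out of the expression text with a regular expression and checks a simple balance edit of the form "A + B + ... = TOTAL" against one record. The function names are hypothetical, the identifier pattern would also pick up SAS keywords in more complex expressions, and the record values are made up for illustration.

```python
import re

EDIT_RULE = "HEN_LT20 + HEN_GE20 + HEN_OTH = HEN_TOT"


def variable_list(expression):
    """Return the distinct variable names in order of first appearance."""
    names = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", expression)
    return list(dict.fromkeys(names))  # dict preserves insertion order


def satisfies(record, expression=EDIT_RULE):
    """Check a balance edit 'A + B + ... = TOTAL' against one record.

    Only this simple additive form is handled; a real system would hand
    the SAS expression to SAS rather than parse it here.
    """
    left, right = expression.split("=")
    terms = [term.strip() for term in left.split("+")]
    return sum(record[term] for term in terms) == record[right.strip()]


if __name__ == "__main__":
    print(variable_list(EDIT_RULE))
    # Illustrative record, not taken from the presentation.
    print(satisfies({"HEN_LT20": 10, "HEN_GE20": 20, "HEN_OTH": 5, "HEN_TOT": 35}))
```

Storing the variable list next to the expression, as the model does, avoids having to parse the expression every time the system needs to know which variables an edit touches.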
Example: donor imputation
Method: Donor imputation (unrestricted)
• Edit set "Production edits": Edit 1, Edit 2, Edit 3, … (each an edit rule)
• Edit set "Post-imp edits": Post-imp edit 1, Post-imp edit 2 (each an edit rule)
• Var list: STRAT, HEN_TOT, QR_REV
• Scalar parameter: name="acceptnegative", value="yes"
• Scalar parameter: name="mindonors", value="10"
• …
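One way to picture how these objects fit together is as plain data classes. The sketch below is only an illustration of the donor imputation example: the class and field names are hypothetical, the edit rule texts are left empty because the slide does not show them, and the implementation string is taken from the driver table.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Edit:
    name: str
    expression: Optional[str] = None  # SAS expression; not shown on the slide


@dataclass
class EditSet:
    name: str
    edits: List[Edit] = field(default_factory=list)


@dataclass
class Method:
    name: str
    implementation: str
    edit_sets: List[EditSet] = field(default_factory=list)
    variable_list: List[str] = field(default_factory=list)       # ordered
    scalar_parameters: Dict[str, str] = field(default_factory=dict)


# Edits are separate objects, so the same edit can sit in several edit sets.
edit_1, edit_2, edit_3 = Edit("Edit 1"), Edit("Edit 2"), Edit("Edit 3")
post_1, post_2 = Edit("Post-imp edit 1"), Edit("Post-imp edit 2")

production_edits = EditSet("Production edits", [edit_1, edit_2, edit_3])
post_imp_edits = EditSet("Post-imp edits", [post_1, post_2])

donor_imputation = Method(
    name="Donor imputation (unrestricted)",
    implementation="BANFF donorimputation",
    edit_sets=[production_edits, post_imp_edits],
    variable_list=["STRAT", "HEN_TOT", "QR_REV"],
    scalar_parameters={"acceptnegative": "yes", "mindonors": "10"},
)

if __name__ == "__main__":
    for edit_set in donor_imputation.edit_sets:
        print(edit_set.name, [edit.name for edit in edit_set.edits])
```

Parameter values are kept as strings here, matching the name-value pairs on the slide; converting them to numbers would be up to whatever code calls the implementation.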
Metadata versioning
• Once parameters are used in production, they should be retained indefinitely
• => Need a versioning mechanism
• Example:
Initial version (open-ended validity):
<methodparameters>
<scalarparameter vstart="2017-01-11" vend="" name="mindonors" value="10"/>
</methodparameters>
After mindonors is changed from 10 to 15 on 2017-04-25:
<methodparameters>
<scalarparameter vstart="2017-01-11" vend="2017-04-25" name="mindonors" value="10"/>
<scalarparameter vstart="2017-04-25" vend="" name="mindonors" value="15"/>
</methodparameters>
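To make the versioning mechanism concrete, here is a minimal sketch of how the parameter value valid on a given date could be looked up from such XML. It uses Python's standard xml.etree.ElementTree parser; the function name parameter_value is hypothetical, and treating the validity interval as half-open (vstart inclusive, vend exclusive, empty vend meaning "still valid") is an assumption the slides do not spell out.

```python
import xml.etree.ElementTree as ET
from datetime import date

METADATA = """
<methodparameters>
  <scalarparameter vstart="2017-01-11" vend="2017-04-25" name="mindonors" value="10"/>
  <scalarparameter vstart="2017-04-25" vend="" name="mindonors" value="15"/>
</methodparameters>
"""


def parameter_value(xml_text, name, on):
    """Return the value of scalar parameter `name` valid on date `on`, or None."""
    root = ET.fromstring(xml_text)
    for node in root.iter("scalarparameter"):
        if node.get("name") != name:
            continue
        start = date.fromisoformat(node.get("vstart"))
        end_text = node.get("vend")
        end = date.fromisoformat(end_text) if end_text else None
        # Assumed half-open interval: vstart <= on < vend.
        if start <= on and (end is None or on < end):
            return node.get("value")
    return None


if __name__ == "__main__":
    print(parameter_value(METADATA, "mindonors", date(2017, 3, 1)))
    print(parameter_value(METADATA, "mindonors", date(2017, 5, 1)))
```

Because old versions are closed rather than overwritten, a lookup with the timestamp stored in the audit trail recovers the parameter set that was in force when an E&I action ran.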
Data organization
Data organization principles
• E&I process fully described in the metadata
• Audit trail info required for traceability and reproducibility:
a) Mark the field that was reviewed, selected or amended
b) Identify the method via reference to metadata
c) Timestamp (there may be multiple parameter set versions)
A naive data organization model
edt_status | edt_mref | edt_time | id | year | class | var1 | var2 | edt_n_var1
original | | 2016-03-01 09:31 | 001 | 2015 | 2 | 45 | 150 | .
original | | 2017-02-15 15:01 | 001 | 2016 | 2 | 51 | 156 | .
original | | 2016-03-01 09:31 | 002 | 2015 | 9 | 12 | 99 | .
original | | 2017-02-15 15:01 | 002 | 2016 | 9 | 60 | 110 | .
selection/banff_errorloc | ref5 | 2017-02-16 10:03 | 002 | 2016 | 9 | . | . | 1
amendment/banff_estimator | ref7 | 2017-02-15 10:23 | 002 | 2016 | 9 | 13 | 110 | 1
• edt_status is either "original" or names the E&I method that was applied
• edt_mref links the row to the method's metadata
• An indicator variable (prefix edt_n_) is added for each original variable that is subject to E&I actions
Demands on the data organization model
a) Data pertaining to different production cycles can be extracted in a standardized way. The extracted data have a standard structure.
b) Any editing version can be extracted in a standardized way.
c) Indicators like the imputation rate can be calculated in a standardized way.
d) Traceability and reproducibility of E&I actions via audit trail.
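Demand (c) can be illustrated with the rows of the naive data organization model. The sketch below computes an unweighted imputation rate for var1, defined here (as an assumption) as the share of units in a production cycle that have an amendment row with edt_n_var1 = 1. The Row type and the imputation_rate function are hypothetical names, and the class and var2 columns are omitted for brevity.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    edt_status: str
    edt_mref: Optional[str]
    edt_time: str
    unit_id: str
    year: int
    var1: Optional[int]
    edt_n_var1: Optional[int]  # 1 if var1 was subject to the E&I action


# Rows transcribed from the example table (missing values as None).
ROWS = [
    Row("original", None, "2016-03-01 09:31", "001", 2015, 45, None),
    Row("original", None, "2017-02-15 15:01", "001", 2016, 51, None),
    Row("original", None, "2016-03-01 09:31", "002", 2015, 12, None),
    Row("original", None, "2017-02-15 15:01", "002", 2016, 60, None),
    Row("selection/banff_errorloc", "ref5", "2017-02-16 10:03", "002", 2016, None, 1),
    Row("amendment/banff_estimator", "ref7", "2017-02-15 10:23", "002", 2016, 13, 1),
]


def imputation_rate(rows, year):
    """Share of units in `year` whose var1 was changed by an amendment method."""
    units = {r.unit_id for r in rows if r.year == year and r.edt_status == "original"}
    amended = {
        r.unit_id
        for r in rows
        if r.year == year
        and r.edt_status.startswith("amendment/")
        and r.edt_n_var1 == 1
    }
    return len(amended & units) / len(units) if units else 0.0


if __name__ == "__main__":
    print(imputation_rate(ROWS, 2016))
    print(imputation_rate(ROWS, 2015))
```

The same function works for any production cycle because the status and indicator columns have the same meaning in every year, which is the point of demands (a) and (c).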
Conclusions
• Information model for a Banff-based E&I system with full audit trail
• Metadata information model: 12 metadata objects that fully specify the E&I process
• Data organization principles
• Extendable to non-Banff implementations with minimal or no changes
Thank you!
Rok Platinovsek rok.platinovsek@stat.fi
UNECE Work Session on Statistical Data Editing, April 24-26 2017