Metadata Models in Survey Computing
Some Results of MetaNet – WG 2
METIS 2004, Geneva
W. Grossmann
University of Vienna
10.02.2004 METIS 2004 Geneva 2
Contents
- Metanet
- Requirements for Models
- Key Features of the Model
- Implications for Terminology
METANET 1
A network of excellence funded by EUROSTAT, 2000/01 – 2003, with 5 work groups
- WG 1: Methodology and Tools
- WG 2: Harmonisation of Metadata – Structure and Definitions
- WG 3: Best Practice for Migration
- WG 4: Adoption Issues
- WG 5: Terminology (ad hoc)
METANET 2
Within WG 2 two different approaches:
- Terminology Model (cf. WP 12)
- Unified Metadata Architecture for Statistics (UMAS model)
METANET 3
Intention of the UMAS model
- Statistics deals with different kinds of data, e.g. surveys, registers, classifications, …
- These data show a dynamic defined by statistical processing activities
- Define a model which supports, besides the description of data, the description of the statistical dynamic
Requirement Analysis 1 Method
Requirement analysis is based on
- Examination of a number of activities in survey processing
- Documentation of these activities inside statistical systems, in particular the proposals of Banca d’Italia, DDI, OECD, SCB-DOK, SDDS, Statistics Netherlands (Input-Throughput-Output model)
- General methods for documentation, e.g. Dublin Core, Facet Classifications, ISO-Standards
Requirement Analysis 2 Example A: Sampling
- Terminology view
  - Sampling is the process of selecting a number of cases from all the cases in a particular group or universe
- Operational view
  - Input: Sampling Frame
  - Output: Sample
Requirement Analysis 3 Example A: Sampling
Details of operational view
- What is an appropriate definition for the sampling frame given the problem (e.g. coverage)?
- What kind of additional information should be available for the sampling frame (e.g. auxiliary variables)?
- How can we obtain an appropriate representation of the desired sampling frame (e.g. merging existing frames, selecting from existing frames)?
- Who is responsible for the frame in the future?
Requirement Analysis 4 Example A: Sampling
- Which sampling technique is appropriate for our problem? (Note that there are some relations between the structure of the sampling frame and the possible sampling techniques.)
- In which form is the output (i.e. the sample) represented in the system?
- Who is responsible for the sampling procedure?
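The operational view of sampling (input: sampling frame, output: sample) and the questions about technique, representation and responsibility can be sketched in Python. This is an illustration only: the names `SamplingFrame`, `Sample` and `draw_simple_random_sample` are hypothetical, the frame contents are made up, and simple random sampling without replacement merely stands in for whatever technique fits the frame.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingFrame:
    """Hypothetical frame: unit identifiers plus administrative metadata."""
    name: str
    unit_ids: tuple
    responsible: str  # who maintains the frame


@dataclass(frozen=True)
class Sample:
    """Hypothetical sample: selected units plus a record of how they were drawn."""
    frame_name: str
    unit_ids: tuple
    technique: str
    responsible: str  # who carried out the sampling procedure


def draw_simple_random_sample(frame, n, responsible, seed=None):
    """Draw n units without replacement (the technique is an assumption)."""
    if n > len(frame.unit_ids):
        raise ValueError("sample size exceeds frame size")
    rng = random.Random(seed)
    selected = tuple(rng.sample(frame.unit_ids, n))
    return Sample(frame.name, selected, "simple random sampling", responsible)


# Made-up frame of 100 units.
frame = SamplingFrame("household register", tuple(range(1, 101)), "register unit")
sample = draw_simple_random_sample(frame, 10, "survey unit", seed=1)
print(len(sample.unit_ids), sample.technique)
```

Keeping the technique and the responsible party on the output object is one possible way to make the sample self-describing inside the system.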
Requirement Analysis 5 Example B: Editing
- Terminology view
  - Editing is the process of detecting and adjusting individual errors in data records resulting from data collection and capture…
- Operational view
  - Input: A variable together with a set of admissible values for the variable within a specific context
  - Output: A summary statement about quality of the variable or a listing of errors for each case
Requirement Analysis 6 Example B: Editing
Details of operational view
- Context may be defined in various ways
  - By subject matter considerations, e.g. there is only one person in a household who can claim to be head of household
  - By more technical reasons, e.g. use of 1000€ as measurement unit for annual income
  - By purely technical reasons, e.g. “f” for female and “m” for male
Requirement Analysis 7 Example B: Editing
- Context defines rules for the admissible values of the variable
  - Within one data set
  - Within one infological model (e.g. person-household)
  - Within a time series
- Rules may be formulated
  - As strong constraints, i.e. logical conditions on combinations of values
  - As soft constraints, i.e. statistical conditions on combinations of values
- Rules have to be processed in algorithmic form and maintained by an administrative procedure
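As a rough illustration of rules "in algorithmic form", the following Python sketch applies one strong constraint (a code list for sex) and one soft constraint (income unusually large relative to the median) and returns a listing of errors per case. The records, the function names and the factor of 5 are assumptions made for this example, not part of the UMAS model.

```python
from statistics import median

# Made-up records; income is measured in units of 1000 EUR.
records = [
    {"id": 1, "sex": "f", "income_k_eur": 32},
    {"id": 2, "sex": "m", "income_k_eur": 28},
    {"id": 3, "sex": "x", "income_k_eur": 30},   # violates the code list
    {"id": 4, "sex": "f", "income_k_eur": 31},
    {"id": 5, "sex": "m", "income_k_eur": 29},
    {"id": 6, "sex": "f", "income_k_eur": 400},  # statistically unusual
]


def hard_rule_sex(record):
    """Strong constraint: a logical condition on the admissible values."""
    return record["sex"] in {"f", "m"}


def make_soft_rule_income(data, max_ratio=5.0):
    """Soft constraint: income should not exceed max_ratio times the median.

    The threshold is an arbitrary assumption for this sketch.
    """
    med = median(r["income_k_eur"] for r in data)
    return lambda record: record["income_k_eur"] <= max_ratio * med


def edit(data):
    """Return a listing of (case id, message) for every violated rule."""
    soft_rule = make_soft_rule_income(data)
    errors = []
    for r in data:
        if not hard_rule_sex(r):
            errors.append((r["id"], "hard: sex not in code list"))
        if not soft_rule(r):
            errors.append((r["id"], "soft: income unusual"))
    return errors


error_listing = edit(records)
print(error_listing)
```

Separating rule definitions from the loop that applies them is one way to let the rules be maintained by an administrative procedure without touching the processing code.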
Requirement Analysis 8 Example C: Weighting
- Terminology view
  - Weight is the importance of an object in relation to a set of objects to which it belongs; …
- Operational view
  - Input: A statistical dataset together with appropriate information
  - Output: Statistical dataset augmented by the weight information
Requirement Analysis 9 Example C: Weighting
Details of operational view
- Which subject matter problem should be solved by weighting (e.g. representation of strata, post-stratification, …)?
- Which procedure should be used for weighting (e.g. base weights, calibration weights, …)?
- In which form are the data and the additional information about the population available (e.g. population data as summary table or as register with auxiliary variables)?
Requirement Analysis 10 Example C: Weighting
- How can we access and combine the different data?
- Who is responsible for the different datasets?
- How is the output represented (e.g. as weight for the dataset, as weights for the sampling procedure, as summary table)?
- Are we interested in reuse of the procedure for new data sets (e.g. the same weighting procedure within a series)?
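One concrete reading of the weighting example is post-stratification with the population information given as a summary table. The sketch below is a minimal Python illustration under that assumption; the data, the stratum labels and the function name `add_poststratification_weights` are invented for the example.

```python
from collections import Counter

# Made-up sample: each record carries the stratum it belongs to.
sample = [
    {"id": 1, "stratum": "urban"},
    {"id": 2, "stratum": "urban"},
    {"id": 3, "stratum": "urban"},
    {"id": 4, "stratum": "rural"},
]

# Population information available as a summary table (made-up counts).
population_counts = {"urban": 600, "rural": 400}


def add_poststratification_weights(data, pop_counts):
    """Return new records with weight = population count / sample count per stratum."""
    sample_counts = Counter(r["stratum"] for r in data)
    return [
        {**r, "weight": pop_counts[r["stratum"]] / sample_counts[r["stratum"]]}
        for r in data
    ]


# Output: the input dataset augmented by the weight information.
weighted = add_poststratification_weights(sample, population_counts)
for row in weighted:
    print(row)
```

Because the function takes the population table as a parameter, the same procedure could be reused for later datasets in a series.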
Requirement Analysis 11 Example D: Analytical Units
- Terminology view
  - Analytical units represent real or artificially constructed units for which statistics are compiled
- Operational view
  - Input: Two or more statistical units
  - Output: A new statistical unit
Requirement Analysis 12 Example D: Analytical Units
Details of operational view
- What is the conceptual definition of the statistical units?
- How are the conceptual definitions captured by operational characteristics (e.g. auxiliary variables)?
- How can we access and manipulate the operational characteristics in order to produce the new analytical unit?
- How is the new analytical unit embedded into an existing administrative framework?
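To make "two or more statistical units in, a new statistical unit out" tangible, the following Python sketch builds a household unit from person units by using a shared identifier as the operational characteristic. The person-household pairing, the field names and the aggregation by sum are assumptions chosen for illustration.

```python
from collections import defaultdict

# Made-up person units; household_id is the operational characteristic
# that captures the conceptual link between persons and households.
persons = [
    {"person_id": 1, "household_id": "H1", "income_k_eur": 30},
    {"person_id": 2, "household_id": "H1", "income_k_eur": 20},
    {"person_id": 3, "household_id": "H2", "income_k_eur": 45},
]


def build_households(person_records):
    """Derive one household unit per household_id from the person units."""
    groups = defaultdict(list)
    for p in person_records:
        groups[p["household_id"]].append(p)
    return [
        {
            "household_id": hid,
            "size": len(members),
            "income_k_eur": sum(m["income_k_eur"] for m in members),
        }
        for hid, members in sorted(groups.items())
    ]


households = build_households(persons)
print(households)
```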
Requirement Analysis 13 Summary
The examples show that
- Models should be based on terminology but are more than terminology
- Models have to consider different types of “statistical” objects
- For these objects we have to know the concepts represented as data, together with the relations between the concepts
- We have to know the statistical meaning of the objects, together with their statistical relations
Requirement Analysis 14 Summary
- We have to take into account the specific format of realisation of the objects as physical datasets
- We must include statements about responsibility, access rights and other administrative details
- We need a flexible coupling mechanism for the objects according to processing needs
- We have to develop a description formalism for statistical processing
- We have to take into account information requirements of external users
Key Features of the Model 1
To meet the different aspects of the requirement analysis, a model with four different facets was designed, resembling the idea of facet classifications used by librarians and archivists
Key Features of the Model 2
“Structure Facet”
- The objects of interest, so called “statistical categories”: statistical unit, statistical population, statistical variables, statistical values, together with a number of related objects like classifications, statistical datasets,
Key Features of the Model 3
- “statistical domains” for coupling objects according to processing needs (basically a system of catalogues for the other objects)
- Each instance of the structure has a twofold representation inside a system
  - As data (“Category-Instance data”)
  - As description (“Category-Instance model”, i.e. metadata)
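The twofold representation can be pictured as one object that holds both the description and the data of an instance. The Python sketch below is an assumption about how this might look; the class names follow the slide's terms, while the fields and the `invalid_values` check are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CategoryInstanceModel:
    """Metadata side: the description of an instance (fields are assumptions)."""
    category: str  # e.g. "statistical variable"
    name: str
    definition: str
    admissible_values: tuple = ()


@dataclass
class CategoryInstanceData:
    """Data side: the values actually stored for the instance."""
    values: list = field(default_factory=list)


@dataclass
class CategoryInstance:
    """One instance of the structure with its twofold representation."""
    model: CategoryInstanceModel
    data: CategoryInstanceData

    def invalid_values(self):
        """Use the description to check the data (only if a code list is given)."""
        allowed = self.model.admissible_values
        return [v for v in self.data.values if allowed and v not in allowed]


sex = CategoryInstance(
    CategoryInstanceModel("statistical variable", "sex", "Sex of the person", ("f", "m")),
    CategoryInstanceData(["f", "m", "x", "f"]),
)
print(sex.invalid_values())
```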
Key Features of the Model 4
“View Facet” describes the instances
- “Conceptual point of view”
  - Subject matter definition
- “Statistical point of view”
  - The statistical properties of the instances necessary for processing
- “Data management point of view”
  - All information necessary for machine supported storage and manipulation
- “Administrative point of view”
  - Management and bookkeeping of the structures
Key Features of the Model 5
“Stage Facet” describes processing at the data as well as at the metadata level
[Figure: stages DEFINITION and DESIGN; PRODUCTION; PROCESSING; DISSEMINATION and EXCHANGE]
- Metadata level: PRODUCTION { CM1, CM2, ... }; PROCESSING { TM (CM1, CM2, ... ) }
- Data level: PRODUCTION { CID1, CID2, ... }; PROCESSING { TD (CID1, CID2, ... ) }
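Reading CM as a Category-Instance model and CID as Category-Instance data (an interpretation of the figure, not something it states), processing amounts to applying a transformation TD to the data and a matching transformation TM to the metadata. The Python sketch below illustrates this with a simple merge of two datasets; the function names and dictionary layout are assumptions.

```python
def td_merge(cid1, cid2):
    """TD: a transformation at the data level (here: concatenate the records)."""
    return cid1 + cid2


def tm_merge(cm1, cm2):
    """TM: the matching transformation at the metadata level."""
    if cm1["variables"] != cm2["variables"]:
        raise ValueError("datasets describe different variables")
    return {
        "name": f"{cm1['name']}+{cm2['name']}",
        "variables": cm1["variables"],
        "derived_from": [cm1["name"], cm2["name"]],  # keeps the processing history
    }


# Made-up metadata (CM) and data (CID) for two datasets.
cm1 = {"name": "survey_2002", "variables": ["sex", "income"]}
cm2 = {"name": "survey_2003", "variables": ["sex", "income"]}
cid1 = [("f", 30), ("m", 28)]
cid2 = [("f", 31)]

merged_model = tm_merge(cm1, cm2)  # TM(CM1, CM2)
merged_data = td_merge(cid1, cid2)  # TD(CID1, CID2)
print(merged_model["name"], len(merged_data))
```

Running both transformations side by side keeps the description in step with the data it describes.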
Key Features of the Model 6
- “Production blueprint”
  - Keeps the information on how the instance is set up inside the system according to the four different views of the view facet
- “Processing blueprint”
  - Describes the processing activities for the instances according to the four different views of the view facet
Key Features of the Model 7
“Function facet”
- All aspects of communication and usage of meta-information by humans inside the system as well as in connection with dissemination and exchange
  - Who is involved in communication?
  - What information is communicated?
  - How is the information communicated?
Implications for Terminology 1
- Statistical information systems use terminology from different sources
  - Statistics, Computer Science, Economy, Social Sciences, …
- We can at best collect terminology and bring it into an order according to some model
- This ordering defines a “statistical ontology” which is above terminology and has to use a few commonly agreed terms
Implications for Terminology 2
Activities of METANET WG 4 showed that there is rather large agreement between statisticians about the main terms for important structures:
- Statistical Unit
- Statistical Variable
- Statistical Values
Implications for Terminology 3
- Based on such agreement we can assign each terminology item a specification according to the ontology
- UMAS proposes the following classification
  - What is the source of the terminology (e.g. statistics, general standards, application area, …)?
  - To which structure does the term apply?
  - For which view is the term used?
  - In which processing stage is the term used?
  - Which function aspects does the term cover?
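The five classification questions suggest a simple record per terminology item. The Python sketch below shows one hypothetical shape for such a record; the class name `TermSpecification`, the helper `terms_for_view` and the classification given to each example term are illustrative assumptions, not values proposed by UMAS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermSpecification:
    """One terminology item with its specification along the five questions."""
    term: str
    source: str     # where the term comes from
    structure: str  # structure the term applies to
    view: str       # view in which the term is used
    stage: str      # processing stage in which the term is used
    function: str   # function aspect the term covers


# Example classifications are made up for this sketch.
specs = [
    TermSpecification("sampling frame", "statistics", "statistical population",
                      "conceptual", "definition and design", "exchange"),
    TermSpecification("record length", "computer science", "statistical dataset",
                      "data management", "production", "internal usage"),
]


def terms_for_view(items, view):
    """List the terms that are used in the given view."""
    return [s.term for s in items if s.view == view]


print(terms_for_view(specs, "conceptual"))
```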
Implications for Terminology 4
MCV (SDMX) proposes the following classification
- Administration (close relation to administrative view)
- Concepts, Definitions, Standards (close relation to conceptual view and structure)
- Data Collection, manipulating/accounting convention (close relation to stage facet)
- Quality and performance metadata (close relation to function facet in connection with dissemination and exchange)
Summary
- Statistical processing activities define metadata requirements usually not considered in traditional data modelling
- The processing activities require a model which supports flexible coupling of entities according to processing needs
- Besides terminology, we also need a specification of the terminology in the context of a statistical ontology
Thank you!