
Linking the DAMES & e-Stat Nodes

Paul Lambert, 26 Feb 2010, Bristol, e-Stat review meeting

DAMES is the ‘Data Management through e-Social Science’ research Node, www.dames.org.uk


1. Some background on DAMES

2. First thoughts on linking DAMES and e-Stat

3. Some proposals on usability / services


1) Data Management through e-Social Science

DAMES – www.dames.org.uk

ESRC Node funded 2008-2011

Aim: Useful social science provisions
• Specialist data topics – occupations; educational qualifications; ethnicity; social care; health
• Mainstream packages and accessible resources

Aim: To exploit/engage with existing DM resources
• In social science – e.g. ESDS, CESSDA
• In e-Science – e.g. OGSA-DAI; OMII


To us ‘Data management’ means…

‘the tasks associated with linking related data resources, with coding and re-coding data in a consistent manner, and with accessing related data resources and combining them within the process of analysis’ […DAMES Node..]

Usually performed by social scientists themselves
• Pre-analysis tasks (though often revised/updated)
• Inputs also from data providers

Usually a substantial component of the work process
• But may not be explicitly rewarded (and sometimes penalised)

Differentiate from archiving / controlling data itself


Some components…

Manipulating data
• Recoding categories / ‘operationalising’ variables

Linking data
• Linking related data (e.g. longitudinal studies)
• Combining / enhancing data (e.g. linking micro- and macro-data)

Secure access to data
• Linking data with different levels of access permission
• Detailed access to micro-data cf. access restrictions

Harmonisation standards
• Approaches to linking ‘concepts’ and ‘measures’ (‘indicators’)
• Recommendations on particular ‘variable constructions’

Cleaning data
• ‘Missing values’; implausible responses; extreme values


Example – recoding data

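Count: Highest educational qualification (rows) by educ4 (columns)

| Highest educational qualification | -9.00 | 1.00 Degree | 2.00 Diploma | 3.00 Higher school or vocational | 4.00 School level or below | Total |
|---|---|---|---|---|---|---|
| -9 Missing or wild | 323 | 0 | 0 | 0 | 0 | 323 |
| -7 Proxy respondent | 982 | 0 | 0 | 0 | 0 | 982 |
| 1 Higher Degree | 0 | 425 | 0 | 0 | 0 | 425 |
| 2 First Degree | 0 | 1597 | 0 | 0 | 0 | 1597 |
| 3 Teaching QF | 0 | 0 | 340 | 0 | 0 | 340 |
| 4 Other Higher QF | 0 | 0 | 3434 | 0 | 0 | 3434 |
| 5 Nursing QF | 0 | 0 | 161 | 0 | 0 | 161 |
| 6 GCE A Levels | 0 | 0 | 0 | 1811 | 0 | 1811 |
| 7 GCE O Levels or Equiv | 0 | 0 | 0 | 0 | 2518 | 2518 |
| 8 Commercial QF, No O Levels | 0 | 0 | 0 | 331 | 0 | 331 |
| 9 CSE Grade 2-5, Scot Grade 4-5 | 0 | 0 | 0 | 0 | 421 | 421 |
| 10 Apprenticeship | 0 | 0 | 0 | 257 | 0 | 257 |
| 11 Other QF | 102 | 0 | 0 | 0 | 0 | 102 |
| 12 No QF | 0 | 0 | 0 | 0 | 2787 | 2787 |
| 13 Still At School No QF | 138 | 0 | 0 | 0 | 0 | 138 |
| Total | 1545 | 2022 | 3935 | 2399 | 5726 | 15627 |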
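The recode summarised in the crosstab above can be sketched in a few lines of Python. This is only an illustration, not the syntax used to produce the slide: the mapping and the counts are read off the table, while the names `COUNTS`, `EDUC4`, `recode_educ4` and `tabulate_educ4` are hypothetical.

```python
# Frequencies of the original 15-category qualification variable,
# copied from the "Total" column of the crosstab.
COUNTS = {
    -9: 323, -7: 982, 1: 425, 2: 1597, 3: 340, 4: 3434, 5: 161,
    6: 1811, 7: 2518, 8: 331, 9: 421, 10: 257, 11: 102, 12: 2787, 13: 138,
}

# Recode read off the crosstab: original code -> educ4 code
# (-9 = missing, 1 = Degree, 2 = Diploma,
#  3 = Higher school or vocational, 4 = School level or below).
EDUC4 = {
    -9: -9, -7: -9, 11: -9, 13: -9,
    1: 1, 2: 1,
    3: 2, 4: 2, 5: 2,
    6: 3, 8: 3, 10: 3,
    7: 4, 9: 4, 12: 4,
}


def recode_educ4(code):
    """Return the educ4 category for one original qualification code."""
    if code not in EDUC4:
        # Fail loudly on unexpected codes instead of silently dropping them.
        raise ValueError(f"no recode defined for qualification code {code!r}")
    return EDUC4[code]


def tabulate_educ4(counts):
    """Sum the original frequencies within each educ4 category."""
    totals = {}
    for code, n in counts.items():
        target = recode_educ4(code)
        totals[target] = totals.get(target, 0) + n
    return totals


educ4_totals = tabulate_educ4(COUNTS)
print(educ4_totals)
```

Keeping the mapping in one explicit table, and checking the recoded totals against the original frequencies, is one way of leaving the kind of paper trail the slides argue for.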


Example – Linking data

Linking via ‘ojbsoc00’: c1-5 = original data / c6 = derived from data / c7 = derived from www.camsis.stir.ac.uk


Matching files (‘deterministic’)

Complex data (complex research) is distributed across different files. In surveys, use key linking variables for...

One-to-one matching
SPSS: match files /file="file1.sav" /file="file2.sav" /by=pid.
Stata: merge pid using file2.dta
(Stata 11 onwards: merge 1:1 pid using file2.dta)

One-to-many matching (‘table distribution’)
SPSS: match files /file="file1.sav" /table="file2.sav" /by=pid.
Stata: merge pid using file2.dta
(Stata 11 onwards: merge m:1 pid using file2.dta)

Many-to-one matching (‘aggregation’)
SPSS: aggregate outfile="file3.sav" /break=pid /meaninc=mean(income).
Stata: collapse (mean) meaninc=income, by(pid)

Many-to-many matches

Related cases matching
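For readers outside SPSS and Stata, the first three kinds of deterministic matching can be sketched in plain Python. This is a minimal illustration under stated assumptions: the toy records, the `hid` household key and the function names `match_files` and `aggregate_mean` are all hypothetical and not part of any DAMES tool.

```python
from collections import defaultdict
from statistics import mean


def match_files(cases, table, by):
    """Attach the matching `table` row to every case (a left join on `by`).

    With unique keys on both sides this is one-to-one matching; when several
    cases share a key it is one-to-many 'table distribution'.
    """
    lookup = {}
    for row in table:
        if row[by] in lookup:
            raise ValueError(f"duplicate key {row[by]!r} in table file")
        lookup[row[by]] = row
    # Cases with no match are kept unchanged.
    return [{**case, **lookup.get(case[by], {})} for case in cases]


def aggregate_mean(cases, by, source, target):
    """Many-to-one 'aggregation': one output row per key, holding a mean."""
    groups = defaultdict(list)
    for case in cases:
        groups[case[by]].append(case[source])
    return [{by: key, target: mean(values)} for key, values in groups.items()]


# Toy data, invented purely for illustration.
people = [
    {"pid": 1, "hid": 10, "income": 100},
    {"pid": 2, "hid": 10, "income": 300},
    {"pid": 3, "hid": 20, "income": 200},
]
jobs = [{"pid": 1, "soc": 2311}, {"pid": 2, "soc": 4150}, {"pid": 3, "soc": 9233}]
households = [{"hid": 10, "region": "North"}, {"hid": 20, "region": "South"}]

one_to_one = match_files(people, jobs, by="pid")
one_to_many = match_files(people, households, by="hid")
household_income = aggregate_mean(people, by="hid", source="income", target="meaninc")
print(one_to_many)
print(household_income)
```

Refusing duplicate keys in the table file is a deliberate choice here: a silent duplicate is a common way for a one-to-many match to turn into an unintended many-to-many one.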


A bit of focus…

I tend to emphasise two data management activities:

1) Variable constructions
o Coding and re-coding values

2) Linking datasets
o Internal and external linkages


..plus the centrality of keeping clear records of DM activities

• Reproducible (for self)
• Replicable (for all)
• Paper trail for whole lifecycle
• Cf. Dale 2006; Freese 2007

In survey research, this means using clearly annotated syntax files (e.g. SPSS/Stata)

Syntax Examples: www.longitudinal.stir.ac.uk

Principal DAMES services (current status)

• GESDE specialist data environments (prototypes): occupations, educational qualifications, ethnicity
• Data curation tool (prototype)
• Data fusion tool (prototype)
• Secure data demonstrator for e-Health research (complete)
• Micro-simulation model for social care data (prototype)
• Training workshops and events (in progress)


GESDE – Grid Enabled Specialist Data Environments


GEODE – Occupational data

Data curation tool


The curation tool obtains metadata and supports the storage and organisation of data resources in a more generic way

Data fusion tool


2. Linking DAMES and e-Stat

High-level vision is to ingrain data management functionality and uptake within e-Stat modelling capabilities

- Using/adapting DAMES contributions
- DAMES services for data linking
- DAMES resources for recoding variables
- Making replication central to the data story

Data and variables

DAMES does not in general provide routes to new/alternative microdata, but to relevant supplementary data (e.g. aggregate data)

Anything on educational qualifications, occupations, ethnicity is of particular interest

Generic tools for merging micro-data

Generic tools for other variable processes


Data oriented review

• Applied research perspective
• Range of data resources
• Accessing and documenting data resource options


The implementation for e-Stat

This is mostly a blank space…
…and we’ve not hitherto used Python

• Data curation tool and GEODE/GEEDE use iRODS
• GEMDE uses a bespoke SQL database
• Data fusion tool uses R (and some Stata) scripts accessed via a Liferay portal


3. A pitch for specific e-Stat facilities

..harvest the best of data analysis packages from applied data perspective

• Replication in ‘human readable syntax’
• Something like Stata’s ‘est store’ for multiple model comparisons
• Fluency in data oriented options
• Training resources in data

Est store demo here
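As a rough idea of what an ‘est store’-like facility could look like in Python, the sketch below keeps named sets of estimates and lays them out side by side. It is an assumption-laden illustration, not an e-Stat or Stata API: `EstimateStore`, `fit_simple_ols` and the toy data are all hypothetical.

```python
class EstimateStore:
    """Keep named sets of coefficients, loosely in the spirit of 'est store'."""

    def __init__(self):
        self._models = {}

    def store(self, name, coefficients):
        # Copy, so later changes by the caller do not alter the stored model.
        self._models[name] = dict(coefficients)

    def table(self):
        """Return rows of [term, model1, model2, ...]; None marks an absent term."""
        names = list(self._models)
        terms = []
        for coefficients in self._models.values():
            for term in coefficients:
                if term not in terms:
                    terms.append(term)
        rows = [["term"] + names]
        for term in terms:
            rows.append([term] + [self._models[n].get(term) for n in names])
        return rows


def fit_simple_ols(x, y):
    """Closed-form least squares for y = _cons + b*x (one predictor only)."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return {"_cons": y_bar - slope * x_bar, "x": slope}


# Toy data, invented purely for illustration.
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]

store = EstimateStore()
store.store("m1", {"_cons": sum(y) / len(y)})  # intercept-only model
store.store("m2", fit_simple_ols(x, y))        # adds the predictor x
comparison = store.table()
for row in comparison:
    print(row)
```

The point of the design is that the comparison table is built from stored results rather than from re-running models, so the same named estimates can be reported in several places consistently.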


Appendix items


[Figure: interface sketch. Panels: Data file specification; Variable manipulation & analysis; Model 1; Graphics; Text interface (invoked manually or in response to manipulating graphs). Data files shown: BHPS wave A individuals; BHPS wave B individuals; Wave C; Analytical file. Variables shown: Gender; Current job RGSC; Spouse CAMSIS; Spouse SOC; Age (yrs); Age bands.]

DAMES most common commands:

-> usedataset{UKDA_5151}
-> usedatafile{individuals wave A}
-> matchdata{individuals wave A;individuals wave B; link variable=pid; format=wide}

Commands invoking other packages:

-> SPSS{match files file="aindresp.sav" /file="bindresp.sav" /by=pid}
-> SPSS{fre var=ajbrgsc}
-> Stata{recode ageb 16/30=1 31/50=2 *=.}
-> R{..}
-> Stata{do $path2\part1_analysis.do}
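The `-> command{...}` notation in this appendix is only a proposal, but a text interface for it would need to split each line into a command name and a body. A minimal sketch, assuming one command per line and a body that runs to the final closing brace; `parse_command` and `split_arguments` are hypothetical names, and nothing here dispatches to SPSS, Stata or R.

```python
import re

# "-> name{body}": the name is a single word, the body is everything up to
# the last closing brace on the line.
COMMAND = re.compile(r"^->\s*(?P<name>\w+)\{(?P<body>.*)\}\s*$", re.DOTALL)


def parse_command(line):
    """Split one '-> name{body}' line into (name, body)."""
    match = COMMAND.match(line.strip())
    if match is None:
        raise ValueError(f"not a recognised command line: {line!r}")
    return match.group("name"), match.group("body").strip()


def split_arguments(body):
    """Split a semicolon-separated body, as in the matchdata example."""
    return [part.strip() for part in body.split(";") if part.strip()]


name, body = parse_command(
    "-> matchdata{individuals wave A;individuals wave B; link variable=pid; format=wide}"
)
arguments = split_arguments(body)
print(name, arguments)
print(parse_command("-> Stata{recode ageb 16/30=1 31/50=2 *=.}"))
```

Keeping the body as an uninterpreted string for the `SPSS{...}`, `Stata{...}` and `R{...}` forms would let the native syntax pass through unchanged, which fits the emphasis on human readable, replicable syntax.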


‘The significance of data management for social survey research’

(see http://www.esds.ac.uk/news/eventdetail.asp?id=2151)

The data manipulations described above are a major component of the social survey research workload

Pre-release manipulations performed by distributors / archivists
• Coding measures into standard categories
• Dealing with missing records

Post-release manipulations performed by researchers
• Re-coding measures into simple categories

We do have existing tools, facilities and expert experience to help us…but we don’t make a good job of using them efficiently or consistently

So the ‘significance’ of DM is about how much better research might be if we did things more effectively…


Some provocative examples for the UK…

Social mobility is increasing, not decreasing!
− Popularity of controversial findings associated with Blanden et al (2004)
− Contradicted by wider ranging datasets and/or better measures of stratification position
− DM: researchers ought to be able to more easily access wider data and better variables

Degrees, MScs and PhDs are getting easier!
− {or at least, more people are getting such qualifications}
− Correlates with measures of education are changing over time
− DM: facility in identifying qualification categories & standardising their relative value within age/cohort/gender distributions isn’t, but should, and could, be widespread

‘Black-Caribbeans’ are not disappearing!
− As the 1948-70 immigrant cohort ages, the ‘Black-Caribbean’ group is decreasingly prominent due to return migration and social integration of immigrant descendants
− Data collectors under pressure to measure large groups only
− DM: It ought to remain easy to access and analyse survey data on Black-Caribbeans, such as by merging survey data sources and/or linking with suitable summary measures
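One possible reading of ‘standardising relative value within cohort distributions’ is a within-group z-score, sketched below. The slides do not specify a method, so this is an assumption; the function name `standardise_within`, the numeric qualification scores and the cohort labels are all invented for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardise_within(records, group, value, target):
    """Add a z-score of `value`, computed separately within each `group`."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[group]].append(record[value])
    stats = {g: (mean(v), pstdev(v)) for g, v in grouped.items()}

    result = []
    for record in records:
        centre, spread = stats[record[group]]
        # A group with no variation gets 0.0 rather than a division by zero.
        z = (record[value] - centre) / spread if spread > 0 else 0.0
        result.append({**record, target: z})
    return result


# Toy data: the same score (3) sits high in the older cohort and low in the
# younger one. All values are hypothetical.
people = [
    {"pid": 1, "cohort": "1950s", "qual_score": 1},
    {"pid": 2, "cohort": "1950s", "qual_score": 3},
    {"pid": 3, "cohort": "1980s", "qual_score": 3},
    {"pid": 4, "cohort": "1980s", "qual_score": 5},
]

standardised = standardise_within(people, "cohort", "qual_score", "qual_z")
for row in standardised:
    print(row)
```

In this toy case the same nominal qualification score is above average in one cohort and below average in the other, which is the kind of shift in relative value the second example points to.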


Comment – growing interest in data management..?

Historically, references covering DM were few and far between
• Dale, A., Arber, S., & Procter, M. (1988). Doing Secondary Analysis. London: Unwin Hyman Ltd.

Recently, there’s been a small burst of relevant references
• Levesque, R., & SPSS Inc. (2008). Programming and Data Management for SPSS Statistics 17.0. Chicago, IL: SPSS Inc.
• Long, J. S. (2009). The Workflow of Data Analysis Using Stata. Boca Raton: CRC Press.
• Treiman, D. J. (2009). Quantitative Data Analysis: Doing Social Research to Test Ideas. New York: Jossey Bass.
• http://www.esds.ac.uk/support/onlineguides.asp
• http://www.longitudinal.stir.ac.uk/

..and growing interest re. ‘documentation for replication’
• Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2), 143-158.
• Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology? Sociological Methods and Research, 36(2), 153-172.


E-Science and Data Management

E-Science isn’t essential to good DM, but it has capacity to improve and support conduct of DM…

1) Concern with standards setting in communication and enhancement of data

2) Linking distributed/heterogeneous/dynamic data: coordinating disparate resources; interrogating live resources

3) Contribution of metadata tools/standards for variable harmonisation and standardisation

4) Linking data subject to different security levels

5) The workflow nature of many DM tasks

