Linking the DAMES & e-Stat Nodes
Paul Lambert, 26 Feb 2010, Bristol, e-Stat review meeting
DAMES is the ‘Data Management through e-Social Science’ research Node, www.dames.org.uk
1. Some background on DAMES
2. First thoughts on linking DAMES and e-Stat
3. Some proposals on usability / services
1) Data Management through e-Social Science
DAMES – www.dames.org.uk
ESRC Node funded 2008-2011
Aim: Useful social science provisions
• Specialist data topics – occupations; education qualifications; ethnicity; social care; health
• Mainstream packages and accessible resources
Aim: To exploit/engage with existing DM resources
• In social science – e.g. ESDS, CESSDA
• In e-Science – e.g. OGSA-DAI; OMII
To us ‘Data management’ means…
‘the tasks associated with linking related data resources, with coding and re-coding data in a consistent manner, and with accessing related data resources and combining them within the process of analysis’ […DAMES Node..]
Usually performed by social scientists themselves
• Pre-analysis tasks (though often revised/updated)
• Inputs also from data providers
Usually a substantial component of the work process
• But may not be explicitly rewarded (and sometimes penalised)
Differentiate from archiving / controlling data itself
Some components…
Manipulating data
• Recoding categories / ‘operationalising’ variables
Linking data
• Linking related data (e.g. longitudinal studies)
• Combining / enhancing data (e.g. linking micro- and macro-data)
Secure access to data
• Linking data with different levels of access permission
• Detailed access to micro-data cf. access restrictions
Harmonisation standards
• Approaches to linking ‘concepts’ and ‘measures’ (‘indicators’)
• Recommendations on particular ‘variable constructions’
Cleaning data
• ‘Missing values’; implausible responses; extreme values
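The cleaning component can be sketched in a few lines of Python. This is an illustration only, not DAMES code: it assumes that the negative codes -9 and -7 (as in the recoding example) mark missing values, and the variable, the responses and the plausible range are invented for the example.

```python
# Sketch only: the missing codes, the 'age' responses and the accepted
# range are assumptions made for illustration.
MISSING_CODES = {-9, -7}

def clean_value(value, low, high):
    """Return (cleaned_value, flag) for one numeric survey response."""
    if value is None or value in MISSING_CODES:
        return None, "missing"
    if value < low or value > high:
        return None, "implausible"   # outside the range accepted as valid
    return value, "ok"

def clean_column(values, low, high):
    """Clean a list of responses and count how each one was classified."""
    cleaned = []
    counts = {"ok": 0, "missing": 0, "implausible": 0}
    for v in values:
        new_value, flag = clean_value(v, low, high)
        cleaned.append(new_value)
        counts[flag] += 1
    return cleaned, counts

ages = [34, -9, 51, 160, -7, 18]   # hypothetical responses
cleaned_ages, report = clean_column(ages, low=16, high=110)
print(cleaned_ages)
print(report)
```

Keeping the counts alongside the cleaned values gives a small record of what the cleaning step did, which can be kept with the analysis.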
Example – recoding data
Count (rows: Highest educational qualification; columns: educ4)
educ4 categories: -9.00 | 1.00 Degree | 2.00 Diploma | 3.00 Higher school or vocational | 4.00 School level or below

Highest educational qualification   -9.00   1.00   2.00   3.00   4.00  Total
-9 Missing or wild                    323      0      0      0      0    323
-7 Proxy respondent                   982      0      0      0      0    982
1 Higher Degree                         0    425      0      0      0    425
2 First Degree                          0   1597      0      0      0   1597
3 Teaching QF                           0      0    340      0      0    340
4 Other Higher QF                       0      0   3434      0      0   3434
5 Nursing QF                            0      0    161      0      0    161
6 GCE A Levels                          0      0      0   1811      0   1811
7 GCE O Levels or Equiv                 0      0      0      0   2518   2518
8 Commercial QF, No O Levels            0      0      0    331      0    331
9 CSE Grade 2-5, Scot Grade 4-5         0      0      0      0    421    421
10 Apprenticeship                       0      0      0    257      0    257
11 Other QF                           102      0      0      0      0    102
12 No QF                                0      0      0      0   2787   2787
13 Still At School No QF              138      0      0      0      0    138
Total                                1545   2022   3935   2399   5726  15627
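The recode shown in the cross-tabulation above can be written down as an explicit mapping. The sketch below is in Python rather than SPSS or Stata, and the list of respondents' codes is invented; only the code-to-category mapping is read off the table.

```python
from collections import Counter

# Mapping read off the cross-tabulation: original code -> educ4 category.
EDUC4 = {
    -9: -9, -7: -9, 11: -9, 13: -9,   # treated as missing in educ4
    1: 1, 2: 1,                       # 1 = Degree
    3: 2, 4: 2, 5: 2,                 # 2 = Diploma
    6: 3, 8: 3, 10: 3,                # 3 = Higher school or vocational
    7: 4, 9: 4, 12: 4,                # 4 = School level or below
}

def recode_educ4(code):
    """Recode one original qualification code; unknown codes raise KeyError."""
    return EDUC4[code]

# Hypothetical original codes for eight respondents (not real data)
original = [2, 7, 12, 4, -9, 6, 1, 10]
educ4 = [recode_educ4(c) for c in original]

# Cross-tabulate old against new codes to check the recode, as the slide does
crosstab = Counter(zip(original, educ4))
for (old, new), n in sorted(crosstab.items()):
    print(old, new, n)
```

Raising an error on an unlisted code, rather than silently passing it through, is one way to make sure every category has been given a deliberate destination.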
Example – Linking data
Linking via ‘ojbsoc00’: c1-5 = original data / c6 = derived from data / c7 = derived from www.camsis.stir.ac.uk
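Linking an external occupational score to survey records via an occupation code might look roughly like this in Python. The occupation codes, scores and respondents are invented for illustration (they are not CAMSIS values), and the function name is hypothetical.

```python
# Hypothetical lookup table: occupation code -> scale score.
# Codes and scores are invented for illustration, not CAMSIS values.
score_lookup = {1111: 60.0, 2222: 75.5, 3333: 48.2}

# Hypothetical micro-data: one record per respondent
people = [
    {"pid": 1, "ojbsoc00": 2222},
    {"pid": 2, "ojbsoc00": 3333},
    {"pid": 3, "ojbsoc00": 9999},   # code that is absent from the lookup
]

def add_score(records, lookup, key="ojbsoc00", new_var="score"):
    """Attach lookup values to each record; unmatched codes get None."""
    linked, unmatched = [], []
    for rec in records:
        value = lookup.get(rec[key])
        if value is None:
            unmatched.append(rec[key])
        linked.append({**rec, new_var: value})   # original records untouched
    return linked, unmatched

linked, unmatched = add_score(people, score_lookup)
print(linked)
print("codes with no match:", unmatched)
```

Reporting the unmatched codes matters in practice, because occupational lookups rarely cover every code found in a survey file.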
Matching files (‘deterministic’)
Complex data (complex research) is distributed across different files. In surveys, use key linking variables for...
One-to-one matching
SPSS: match files /file="file1.sav" /file="file2.sav" /by=pid.
Stata: merge pid using file2.dta
One-to-many matching (‘table distribution’)
SPSS: match files /file="file1.sav" /table="file2.sav" /by=pid.
Stata: merge pid using file2.dta
Many-to-one matching (‘aggregation’)
SPSS: aggregate outfile="file3.sav" /break=pid /meaninc=mean(income).
Stata: collapse (mean) meaninc=income, by(pid)
Many-to-Many matches
Related cases matching
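For readers working outside SPSS or Stata, the one-to-one and many-to-one cases can be sketched in plain Python. This is a minimal illustration with invented records; the `_merge` indicator imitates the variable that Stata's merge creates, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def match_one_to_one(file1, file2, by="pid"):
    """One-to-one match on a key. '_merge' records where each case was
    found: 1 = file1 only, 2 = file2 only, 3 = both files."""
    index1 = {r[by]: r for r in file1}
    index2 = {r[by]: r for r in file2}
    out = []
    for key in sorted(index1.keys() | index2.keys()):
        in1, in2 = key in index1, key in index2
        row = {**index1.get(key, {}), **index2.get(key, {})}
        row["_merge"] = 3 if (in1 and in2) else (1 if in1 else 2)
        out.append(row)
    return out

def aggregate_mean(records, var, by="pid", new_var="meaninc"):
    """Many-to-one: collapse repeated records to one mean per key."""
    groups = defaultdict(list)
    for r in records:
        groups[r[by]].append(r[var])
    return [{by: k, new_var: mean(v)} for k, v in sorted(groups.items())]

# Invented example records
file1 = [{"pid": 1, "sex": 1}, {"pid": 2, "sex": 2}]
file2 = [{"pid": 2, "age": 40}, {"pid": 3, "age": 25}]
matched = match_one_to_one(file1, file2)

spells = [{"pid": 1, "income": 100}, {"pid": 1, "income": 300},
          {"pid": 2, "income": 50}]
collapsed = aggregate_mean(spells, "income")
print(matched)
print(collapsed)
```

Checking the match indicator before analysis is a simple guard against cases being silently dropped or left incomplete.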
A bit of focus…
I tend to emphasise two data management activities:
1) Variable constructions
o Coding and re-coding values
2) Linking datasets
o Internal and external linkages
..plus the centrality of keeping clear records of DM activities
• Reproducible (for self)
• Replicable (for all)
• Paper trail for whole lifecycle
Cf. Dale 2006; Freese 2007
In survey research, this means using clearly annotated syntax files (e.g. SPSS/Stata)
Syntax Examples: www.longitudinal.stir.ac.uk
Principal DAMES services (current status)
GESDE specialist data environments (prototypes)
Occupations, educational qualifications, ethnicity
Data curation tool (prototype)
Data fusion tool (prototype)
Secure data demonstrator for e-Health research (complete)
Micro-simulation model for social care data (prototype)
Training workshops and events (in progress)
GESDE – Grid Enabled Specialist Data Environments
GEODE – Occupational data
Data curation tool
The curation tool obtains metadata and supports the storage and organisation of data resources in a more generic way
Data fusion tool
2. Linking DAMES and e-Stat
High level vision is to ingrain data management functionality and uptake within e-Stat modelling capabilities
- Using/adapting DAMES contributions
- DAMES services for data linking
- DAMES resources for recoding variables
- Making replication central to the data story
Data and variables
DAMES does not in general provide routes to new/alternative microdata, but to relevant supplementary data (e.g. aggregate data)
Anything on educational qualifications, occupations, ethnicity is of particular interest
Generic tools for merging micro-data
Generic tools for other variable processes
Data oriented review
Applied research perspective
Range of data resources
Accessing and documenting data resource options
The implementation for e-Stat
This is mostly a blank space…
…and we’ve not hitherto used Python
Data curation tool and GEODE/GEEDE use iRODS
GEMDE uses a bespoke SQL database
Data fusion tool uses R (and some Stata) scripts accessed via a Liferay portal
3. A pitch for specific e-Stat facilities
..harvest the best of data analysis packages from an applied data perspective
• Replication in ‘human readable syntax’
• Something like Stata’s ‘est store’ for multiple model comparisons
• Fluency in data oriented options
• Training resources in data
Est store demo here
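As a rough indication of what an ‘est store’-like facility could look like in Python, the sketch below keeps named sets of estimates and lays them out side by side, in the spirit of Stata's estimates store and estimates table. It is not an e-Stat design: the class, the function names and the data are all hypothetical.

```python
class ModelStore:
    """Keep named sets of estimates so models can be compared side by side."""

    def __init__(self):
        self._models = {}

    def store(self, name, estimates):
        self._models[name] = dict(estimates)

    def table(self):
        """Return (model names, rows); a row is (term, one value per model),
        with None where a model does not include the term."""
        terms = []
        for est in self._models.values():
            for term in est:
                if term not in terms:
                    terms.append(term)
        names = list(self._models)
        rows = [(t, [self._models[n].get(t) for n in names]) for t in terms]
        return names, rows

def ols_one_predictor(x, y, name="x"):
    """Closed-form least squares for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return {"_cons": my - b * mx, name: b}

# Invented data: an intercept-only model and a one-predictor model
age = [1, 2, 3, 4]
income = [2, 4, 6, 8]
models = ModelStore()
models.store("m1", {"_cons": sum(income) / len(income)})
models.store("m2", ols_one_predictor(age, income, name="age"))

names, rows = models.table()
print("term", *names)
for term, values in rows:
    print(term, *["" if v is None else round(v, 3) for v in values])
```

The point of the design is that each model is stored under a name at the time it is fitted, so the comparison table can be rebuilt from the script alone.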
Appendix items
Data file specification Variable manipulation & analysis
DAMES most common commands:
Commands invoking other packages
-> usedataset{UKDA_5151}
-> usedatafile{individuals wave A}
-> matchdata{individuals wave A; individuals wave B; link variable=pid; format=wide}
-> SPSS{match files file="aindresp.sav" /file="bindresp.sav" /by=pid}
-> SPSS{fre var=ajbrgsc}
-> Stata{recode ageb 16/30=1 31/50=2 *=.}
-> R{..}
-> Stata{do $path2\part1_analysis.do}
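What a matchdata command with format=wide might do can be sketched as follows. This is a guess at the behaviour for illustration, not the DAMES implementation: it assumes each wave's variables gain the wave letter as a prefix (as in ajbrgsc), and the function name and records are invented.

```python
def match_wide(waves, link="pid"):
    """Link several waves into one wide record per case.

    `waves` maps a wave prefix (e.g. "a") to a list of records whose
    variable names lack the prefix; output variables gain the prefix.
    """
    wide = {}
    for prefix, records in waves.items():
        for rec in records:
            row = wide.setdefault(rec[link], {link: rec[link]})
            for var, value in rec.items():
                if var != link:
                    row[prefix + var] = value
    return [wide[k] for k in sorted(wide)]

# Invented records for two waves; pid 2 and pid 3 each appear in one wave only
waves = {
    "a": [{"pid": 1, "jbrgsc": 3}, {"pid": 2, "jbrgsc": 5}],
    "b": [{"pid": 1, "jbrgsc": 3}, {"pid": 3, "jbrgsc": 2}],
}
wide = match_wide(waves)
print(wide)
```

Cases absent from a wave simply lack that wave's variables here; a fuller version would probably fill them with an explicit missing code.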
[Diagram labels: Model 1; Graphics; Text interface (invoked manually or in response to manipulating graphs). Data files: BHPS wave A individuals; BHPS wave B individuals; Wave C; Analytical file. Variables: Gender; Current job RGSC; Spouse CAMSIS; Age (yrs); Age bands; Spouse SOC.]
‘The significance of data management for social survey research’
(see http://www.esds.ac.uk/news/eventdetail.asp?id=2151)
The data manipulations described above are a major component of the social survey research workload
Pre-release manipulations performed by distributors / archivists
• Coding measures into standard categories
• Dealing with missing records
Post-release manipulations performed by researchers
• Re-coding measures into simple categories
We do have existing tools, facilities and expert experience to help us… but we don’t make a good job of using them efficiently or consistently
So the ‘significance’ of DM is about how much better research might be if we did things more effectively…
Some provocative examples for the UK…
Social mobility is increasing, not decreasing!
− Popularity of controversial findings associated with Blanden et al (2004)
− Contradicted by wider ranging datasets and/or better measures of stratification position
− DM: researchers ought to be able to more easily access wider data and better variables
Degrees, MScs and PhDs are getting easier!
− {or at least, more people are getting such qualifications}
− Correlates with measures of education are changing over time
− DM: facility in identifying qualification categories & standardising their relative value within age/cohort/gender distributions isn’t, but should, and could, be widespread
‘Black-Caribbeans’ are not disappearing!
− As the 1948-70 immigrant cohort ages, the ‘Black-Caribbean’ group is decreasingly prominent due to return migration and social integration of immigrant descendants
− Data collectors under pressure to measure large groups only
− DM: It ought to remain easy to access and analyse survey data on Black-Caribbeans, such as by merging survey data sources and/or linking with suitable summary measures
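One simple reading of ‘standardising relative value within cohort distributions’, from the qualifications example, is to score each person by the share of their own cohort holding a lower qualification. The Python sketch below takes that reading as an assumption; the cohorts, qualification levels and function name are invented.

```python
from collections import defaultdict

def relative_position(records, level="qual", group="cohort"):
    """Add 'relpos': the proportion of the same group with a lower level."""
    groups = defaultdict(list)
    for r in records:
        groups[r[group]].append(r[level])
    out = []
    for r in records:
        peers = groups[r[group]]
        below = sum(1 for v in peers if v < r[level])
        out.append({**r, "relpos": below / len(peers)})
    return out

# Invented data: level 2 = degree, level 1 = no degree.
# One in four holds a degree in the older cohort, three in four in the younger.
records = (
    [{"cohort": "1940s", "qual": 1}] * 3 + [{"cohort": "1940s", "qual": 2}]
    + [{"cohort": "1970s", "qual": 1}] + [{"cohort": "1970s", "qual": 2}] * 3
)
scored = relative_position(records)
for r in scored:
    print(r)
```

In this toy data a degree places its holder above 75% of the older cohort but only 25% of the younger one, which is the kind of shift in relative value the slide refers to.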
Comment – growing interest in data management..?
Historically, references covering DM were few and far between
• Dale, A., Arber, S., & Procter, M. (1988). Doing Secondary Analysis. London: Unwin Hyman Ltd.
Recently, there’s been a small burst of relevant references
• Levesque, R., & SPSS Inc. (2008). Programming and Data Management for SPSS Statistics 17.0. Chicago, IL: SPSS Inc.
• Long, J. S. (2009). The Workflow of Data Analysis Using Stata. Boca Raton: CRC Press.
• Treiman, D. J. (2009). Quantitative Data Analysis: Doing Social Research to Test Ideas. New York: Jossey Bass.
• http://www.esds.ac.uk/support/onlineguides.asp
• http://www.longitudinal.stir.ac.uk/
..and growing interest re. ‘documentation for replication’
• Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2), 143-158.
• Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology? Sociological Methods and Research, 36(2), 2007.
E-Science and Data Management
E-Science isn’t essential to good DM, but it has the capacity to improve and support the conduct of DM…
1) Concern with standards setting in communication and enhancement of data
2) Linking distributed/heterogeneous/dynamic data
• Coordinating disparate resources; interrogating live resources
3) Contribution of metadata tools/standards for variable harmonisation and standardisation
4) Linking data subject to different security levels
5) The workflow nature of many DM tasks