Training Workshop, 24-25 November 2010, Univ. Stirling
Organised by the ESRC Node ‘Data Management through e-Social Science’ (www.dames.org.uk).
Data Management, Documentation and Workflows
for Social Survey Research
‘Data Management through e-Social Science’
DAMES – www.dames.org.uk
ESRC Node funded 2008-2011
Aim: Useful social science provisions
• Specialist data topics – occupations; education qualifications; ethnicity; social care; health
• Programme of case studies and provisions – more later
1. Data Management, Documentation and Workflows for Social Survey Research
Paul Lambert, 24-25 November 2010
Presented to ‘Documentation and Workflows for Social Survey Research’, a workshop organised by the ESRC ‘Data Management through e-Social Science’ research Node (www.dames.org.uk).
Data management, documentation and workflows..
• Defining data management, documentation and workflows in survey research
• Documentation for replication
• Further comments and principles in effective social survey research
a) ‘Data management’ means…
‘the tasks associated with linking related data resources, with coding and re-coding data in a consistent manner, and with accessing related data resources and combining them within the process of analysis’ […DAMES Node..]
• Usually performed by social scientists themselves
• Most overt in quantitative survey data analysis
o ‘variable constructions’, ‘data manipulations’
o navigating abundance of data – thousands of variables
• Usually a substantial component of the work process
• Here we differentiate from archiving / controlling data itself
Some components…
• Manipulating data
o Recoding categories / ‘operationalising’ variables
• Linking data
o Linking related data (e.g. longitudinal studies)
o Combining / enhancing data (e.g. linking micro- and macro-data)
• Secure access to data
o Linking data with different levels of access permission
o Detailed access to micro-data cf. access restrictions
• Harmonisation standards
o Approaches to linking ‘concepts’ and ‘measures’ (‘indicators’)
o Recommendations on particular ‘variable constructions’
• Cleaning data
o ‘missing values’; implausible responses; extreme values
Example – recoding data
Count: Highest educational qualification (rows) by educ4 (columns)

Highest educational qualification | -9.00 | 1.00 Degree | 2.00 Diploma | 3.00 Higher school or vocational | 4.00 School level or below | Total
-9 Missing or wild | 323 | 0 | 0 | 0 | 0 | 323
-7 Proxy respondent | 982 | 0 | 0 | 0 | 0 | 982
1 Higher Degree | 0 | 425 | 0 | 0 | 0 | 425
2 First Degree | 0 | 1597 | 0 | 0 | 0 | 1597
3 Teaching QF | 0 | 0 | 340 | 0 | 0 | 340
4 Other Higher QF | 0 | 0 | 3434 | 0 | 0 | 3434
5 Nursing QF | 0 | 0 | 161 | 0 | 0 | 161
6 GCE A Levels | 0 | 0 | 0 | 1811 | 0 | 1811
7 GCE O Levels or Equiv | 0 | 0 | 0 | 0 | 2518 | 2518
8 Commercial QF, No O Levels | 0 | 0 | 0 | 331 | 0 | 331
9 CSE Grade 2-5, Scot Grade 4-5 | 0 | 0 | 0 | 0 | 421 | 421
10 Apprenticeship | 0 | 0 | 0 | 257 | 0 | 257
11 Other QF | 102 | 0 | 0 | 0 | 0 | 102
12 No QF | 0 | 0 | 0 | 0 | 2787 | 2787
13 Still At School No QF | 138 | 0 | 0 | 0 | 0 | 138
Total | 1545 | 2022 | 3935 | 2399 | 5726 | 15627
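The recode behind this cross-tabulation can be written down as an explicit lookup table, which doubles as documentation. What follows is a minimal Python sketch, not the syntax used in the workshop: the names `RECODE_EDUC4`, `SOURCE_COUNTS`, `recode_educ4` and `recoded_counts` are illustrative, the category-to-group assignment is read off the table, and the counts are the table's row totals rather than the underlying micro-data.

```python
from collections import Counter

# Source code of 'highest educational qualification' -> educ4 group,
# as read off the cross-tabulation (all names here are illustrative).
RECODE_EDUC4 = {
    -9: -9, -7: -9,      # missing or wild, proxy respondent -> -9
    1: 1, 2: 1,          # higher degree, first degree -> 1 Degree
    3: 2, 4: 2, 5: 2,    # teaching, other higher, nursing -> 2 Diploma
    6: 3, 8: 3, 10: 3,   # A levels, commercial, apprenticeship -> 3
    7: 4, 9: 4, 12: 4,   # O levels, CSE, no qualification -> 4
    11: -9, 13: -9,      # other QF, still at school -> -9
}

# Frequency of each source code, taken from the table's Total column.
SOURCE_COUNTS = {
    -9: 323, -7: 982, 1: 425, 2: 1597, 3: 340, 4: 3434, 5: 161,
    6: 1811, 7: 2518, 8: 331, 9: 421, 10: 257, 11: 102, 12: 2787, 13: 138,
}


def recode_educ4(code):
    """Return the educ4 group for a source code; fail loudly if unknown."""
    try:
        return RECODE_EDUC4[code]
    except KeyError:
        raise ValueError(f"unexpected qualification code: {code!r}")


def recoded_counts(source_counts):
    """Add up source-code frequencies within each educ4 group."""
    totals = Counter()
    for code, n in source_counts.items():
        totals[recode_educ4(code)] += n
    return dict(totals)


educ4_counts = recoded_counts(SOURCE_COUNTS)
print(educ4_counts)
```

Comparing `educ4_counts` with the column totals of the table is the kind of check that a cross-tabulation of old against new categories provides after any recode.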
Example – Linking data
Linking via ‘ojbsoc00’: c1-5 = original data / c6 = derived from data / c7 = derived from www.camsis.stir.ac.uk
‘The significance of data management for social survey research’
The data manipulations described above are a major component of the social survey research workload
• Pre-release manipulations performed by distributors / archivists
o Coding measures into standard categories; dealing with missing records
• Post-release manipulations performed by researchers
o Re-coding measures into simple categories
o All serious researchers perform extended post-release management (and have the scars to show for it)
We do have existing tools, facilities and expert experience to help us… but we don’t make a good job of using them efficiently or consistently
So the ‘significance’ of DM is about how much better research might be if we did things more effectively…
Some provocative examples for the UK…
Social mobility is increasing, not decreasing
− Popularity of controversial findings associated with Blanden et al (2004)
− Contradicted by wider ranging datasets and/or better measures of stratification position
− DM: researchers ought to be able to more easily access wider data and better variables
Degrees, MScs and PhDs are getting easier
− {or at least, more people are getting such qualifications}
− Correlates with measures of education are changing over time
− DM: facility in identifying qualification categories & standardising their relative value within age/cohort/gender distributions isn’t, but should, and could, be widespread
‘Black-Caribbeans’ are not disappearing
− As the 1948-70 immigrant cohort ages, the ‘Black-Caribbean’ group is decreasingly prominent due to return migration and social integration of immigrant descendants
− Data collectors under pressure to measure large groups only
− DM: It ought to remain easy to access and analyse survey data on Black-Caribbeans, such as by merging survey data sources and/or linking with suitable summary measures
b) ‘Documentation’ refers to…
Here we mean the ‘paper trail’ in the conduct of secondary survey research
For scientists, this is the log book / journal / laboratory notebook
Image of Alexander Graham Bell’s 1876 notebook, taken from: http://sandacom.wordpress.com/2010/03/11/the-face-rings-a-bell/
Thoughts on documentation
Our tasks (organising and analysing electronic data) don’t seem to lend themselves to easy documentation
• Paper or electronic file summaries
• Different formats of data
• Rapid updates over time
Effective documentation is possible, but it requires some effort (Long, 2009)
..good levels of documentation are not engrained in the social sciences!
“…Little or nothing is systematically archived from these electronic sources. How many of us routinely keep copies of our old word-processing files once they are no longer of current relevance for research or teaching activities. We have been reminded…of the insecurity and non-survival of departmental and professional files stored in broom cupboards, but how many electronic files even get into that cupboard in the first place?” Scott (2005: 142)
c) ‘Workflows’
In general, a collection of processes which all link with or contribute to a wider project
The study of workflows involves the systematic organisation of those processes
• Storing the elements of the processes
• Noting inter-dependencies
• Depicting the overall process (e.g. graphically)
• Modelling the overall process
{Not in the dictionary… a made-up word to mean what we like!}
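One way to make ‘noting inter-dependencies’ and ‘modelling the overall process’ concrete is to record each step with the steps it depends on, and let software work out a valid running order. The Python sketch below uses the standard-library `graphlib` module (Python 3.9 or later). The step names and the helper `execution_order` are hypothetical and do not come from any DAMES tool.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (names are hypothetical).
workflow = {
    "load_wave_a": set(),
    "load_wave_b": set(),
    "match_waves": {"load_wave_a", "load_wave_b"},
    "recode_variables": {"match_waves"},
    "run_models": {"recode_variables"},
}


def execution_order(steps):
    """Return one valid order in which the steps can be run."""
    return list(TopologicalSorter(steps).static_order())


order = execution_order(workflow)
for position, step in enumerate(order, start=1):
    needs = ", ".join(sorted(workflow[step])) or "nothing"
    print(f"{position}. {step} (depends on: {needs})")
```

If two steps depend on each other, `static_order()` raises `graphlib.CycleError`, so a circular workflow is reported rather than silently run in the wrong order.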
The idea of workflows
Workflow modelling has an exciting future..
Workflow documentation
o MyExperiment [http://www.myexperiment.org/]
o Social survey analysis [Dale, 2006; Freese, 2007; Long, 2009]
At present…
• Tool development in process
• Depositing workflows might impose constraints/burdens
Data file specification; variable manipulation & analysis
DAMES most common commands, and commands invoking other packages:
-> usedataset{UKDA_5151}
-> usedatafile{individuals wave A}
-> matchdata{individuals wave A;individuals wave B; link variable=pid; format=wide}
-> SPSS{match files file="aindresp.sav" /file="bindresp.sav" /by=pid}
-> SPSS{fre var=ajbrgsc}
-> Stata{recode ageb 16/30=1 31/50=2 *=.}
-> R{..}
-> Stata{do $path2\part1_analysis.do}
[Figure: Model 1, with a graphic and a text interface; commands are invoked manually or in response to manipulating graphs. Labels shown: BHPS wave A individuals; BHPS wave B individuals; Wave C; analytical file; gender; current job RGSC; spouse CAMSIS; age (yrs); age bands; spouse SOC; syntax file image.]
Example of using MS Excel for workflow documentation
A bit of focus…
Most of the DAMES applications aim to facilitate one of two data management activities and their documentation:
1) Variable constructions
o Coding and re-coding values
2) Linking datasets
o Internal and external linkages
A bit more focus…
The current workshop is concerned with research practices and facilities for social survey data management
To raise for discussion important topics associated with data management
To illustrate effective means of achieving good practice during data management
o Software perspectives – e.g. Treiman 2009; Long 2009; Levesque 2010; Sarantakos 2007
o A focus on ‘Stata’
Why did Stata suddenly come into this?
[Figure: data management requirements, ranging from specific tasks to generic approaches. Labels shown: bespoke database software; governance models; e-Social Science; researchers’ database software (SPSS, Stata, etc).]
We see Stata emerging as effective for specific tasks and compatible with generic approaches
Data management, documentation and workflows..
• Defining data management, documentation and workflows in survey research
• Documentation for replication
• Further comments and principles in effective social survey research
‘Documentation for replication’
..as a reasonable expectation for scientific research that is cumulative and based upon empirical observation…
• Steuer, M. (2003). The Scientific Study of Society. Boston: Kluwer Academic.
• Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2), 143-158.
• Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology? Sociological Methods and Research, 36(2), 153-171.
…See our first lab session on using software effectively for documentation for replication…
What needs replication?
• Your own analysis (in response to comments, revisions, requests for access)
• Others’ analysis
o To build upon – cumulative science
o To critique / cross-examine
• In secondary survey research
o Complex data is often updated (new related records; revised and re-released; re-weighted or re-standardised; new levels of access/linkage)
o New analysis feasible – variable operationalisations; new statistical methods
J. Scott Long (2009)
Long, J. S. (2009). The Workflow of Data Analysis Using Stata. Boca Raton: CRC Press.
• 1-5: Programming in Stata
• 6: Cleaning your data
• 7: Analysing data and presenting results
• 8: Protecting your work
Treiman (2009)
Treiman, D. J. (2009). Quantitative Data Analysis: Doing Social Research to Test Ideas. New York: Jossey Bass.
Good professional practice =
• Suitable choice of analytical methods to test ideas
• Documentation of choices and data operations
How to approach documentation for replication in social survey research?
Made easy by secondary access to datasets and standardised software
1) Using software effectively
• See our ‘software session 1’
2) Careful syntactical documentation
3) Workflow perspectives / tools
4) Metadata standards
Keep clear records of your DM activities!
• Reproducible (for self)
• Replicable (for all)
• Paper trail for whole lifecycle
• Cf. Dale 2006; Freese 2007
In survey research, this means using clearly annotated syntax files (e.g. SPSS/Stata)
Syntax Examples: www.longitudinal.stir.ac.uk
Stata syntax example (‘do file’)
Syntax documentation
Long (2009) is highly prescriptive {may not be wholly attainable}
Key issues:
1. Textual level command specification
2. Organisation of syntax files
**Master files and subfiles (and macros)**
3. Setting consistent paths to source data
4. Reasonable level of manual annotation of files
5. Use a text editor!!
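The ‘master file and subfiles’ layout, with paths set once, carries over to any scripting language. The Python sketch below is an illustration under assumed names, not a rendering of Long's (2009) templates: the folder names, the two sub-steps and the toy `pid,age` rows are all invented for the example.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def make_paths(project_root):
    """Define every location once, relative to a single project root."""
    root = Path(project_root)
    return {name: root / name for name in ("source_data", "work", "logs")}


def prepare_data(paths):
    """Sub-file 1 (hypothetical): derive a working file (toy rows only)."""
    target = paths["work"] / "analytical_file.txt"
    target.write_text("pid,age\n1,34\n2,51\n")
    return f"wrote {target.name}"


def analyse(paths):
    """Sub-file 2 (hypothetical): read the working file and summarise it."""
    lines = (paths["work"] / "analytical_file.txt").read_text().splitlines()
    return f"analysed {len(lines) - 1} cases"


# The master file lists the sub-files in the order they must run.
STEPS = [prepare_data, analyse]


def run_master(project_root):
    """Create the folders, run each step in order and keep a log."""
    paths = make_paths(project_root)
    for folder in paths.values():
        folder.mkdir(parents=True, exist_ok=True)
    log_lines = [f"run started {datetime.now():%Y-%m-%d %H:%M}"]
    for step in STEPS:
        log_lines.append(f"{step.__name__}: {step(paths)}")
    (paths["logs"] / "master.log").write_text("\n".join(log_lines) + "\n")
    return log_lines


with tempfile.TemporaryDirectory() as tmp:
    log = run_master(tmp)
    print("\n".join(log))
```

Because only `make_paths` knows where files live, moving the project to another machine means changing one argument, which is the point of setting consistent paths.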
The idea of workflows
Workflow modelling has an exciting future..
Workflow documentation
o MyExperiment [http://www.myexperiment.org/]
o Social survey analysis [Dale, 2006; Freese, 2007; Long, 2009]
At present…
• Waiting for tool development
• Depositing workflows might impose constraints/burdens
Metadata documents for documentation for replication
Metadata documents can/should be stored / distributed / disseminated
Main relevant types of metadata documents:
a) Annotated syntax files
b) Handwritten workbooks
c) Codebooks and data file metadata
Annotated syntax files
Storage:
• Supply authorship details, conditions of access, origins and context of data, software version
• ‘Robustify’ your programme (generic locations; ‘capture drop’)
Dissemination:
• Available from authors archive
• Repec – http://ideas.repec.org/ (Economics)
• GEODE/DAMES – www.dames.org.uk (Occupations, Education)
• UKDA/ESDS and related data providers (monitored)
• Personal webpages – e.g. www.camsis.stir.ac.uk/downloads/data/other/casoc_isco.do
Handwritten workbooks
Key here is that they must be published..
• Technical papers
• Websites
• ….
• An emerging payoff - citation indexing!
o Croxford, L. (2004). Construction of Social Class Variables. Edinburgh: Working Paper 4 of the ESRC research project on Education and Youth Transitions in England, Wales and Scotland, 1984-2002, Centre for Educational Sociology, University of Edinburgh, and http://www.ces.ed.ac.uk/eyt/EYT_papers/WP04.pdf.
“Because claims in published papers that additional materials are “available from author” usually prove false, at least after a few months, the California Center for Population Research at UCLA recently implemented a mechanism by which additional materials, for example, -do- and -log- files, can be attached to papers posted in its Population Working Paper archive. Other research centers are to be encouraged to do the same” (Treiman, 2009: 404)
E-Science and workflow documentation tools..
…seek to capture the full record of the work process and all files relevant for documentation (e.g. http://www.myexperiment.org/)
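Codebooks and data file metadata
Codebook
* Open a plain-text log to hold the codebook output
log using data_file_name_codebook.log, replace text
disp "DateTime: $S_DATE $S_TIME"
* Dataset notes and checksum
notes
datasignature
* Variable summaries (compact, then full), file description and value labels
codebook, compact
codebook
describe
labelbook, detail
log close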
See UKDA: data_dictionary.rtf
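The same idea (a dated, plain-text summary of every variable plus a checksum of the file) can be sketched outside Stata. The Python example below is an illustration under assumptions, not an equivalent of Stata's `codebook` or `datasignature`: the function names, the choice of `-9` and `-7` as missing codes, and the four toy rows are all invented for the example.

```python
import csv
import hashlib
import io
from datetime import datetime

# Codes treated as missing (an assumption made for this sketch only).
MISSING_CODES = {"", "-9", "-7"}


def data_signature(text):
    """Short checksum of the raw file contents, to detect later changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def codebook(text):
    """Summarise each variable: cases, missing cases, distinct valid values."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    summary = {}
    for name in reader.fieldnames:
        values = [row[name] for row in rows]
        valid = [v for v in values if v not in MISSING_CODES]
        summary[name] = {
            "n": len(values),
            "missing": len(values) - len(valid),
            "distinct": len(set(valid)),
        }
    return summary


# Toy data standing in for a survey file (invented values).
raw = "pid,educ4\n1,1\n2,4\n3,-9\n4,4\n"

summary = codebook(raw)
print(f"DateTime: {datetime.now():%d %b %Y %H:%M}")
print(f"Signature: {data_signature(raw)}")
for name, stats in summary.items():
    print(name, stats)
```

Storing the signature alongside the summary lets a later reader confirm that the codebook describes the same version of the file they are holding.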
Metadata standards
• Formal standards for recording data exist
o Most widely used is the ‘DDI’ (Data Documentation Initiative, http://www.icpsr.umich.edu/DDI/)
o XML format, typewritten or software derived, can be read by software / browsers
o Includes options for variable labels, recodes, text descriptions
See UKDA, study_information.htm; NESSTAR
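To show what ‘XML format, software derived’ can look like, the Python sketch below writes variable and category labels to XML and reads them back with the standard library. The element names (`codeBook`, `dataDscr`, `var`, `labl`, `catgry`, `catValu`) are loosely modelled on DDI Codebook naming, but the output is not validated against any DDI schema, and a real DDI document needs namespaces and further elements. The function names and the `educ4` variable label are illustrative.

```python
import xml.etree.ElementTree as ET


def variable_element(name, label, categories):
    """Build one <var> element holding a label and its category labels."""
    var = ET.Element("var", {"name": name})
    ET.SubElement(var, "labl").text = label
    for value, category_label in categories.items():
        category = ET.SubElement(var, "catgry")
        ET.SubElement(category, "catValu").text = str(value)
        ET.SubElement(category, "labl").text = category_label
    return var


def build_codebook(variables):
    """Wrap the variable descriptions in a minimal document skeleton."""
    root = ET.Element("codeBook")
    data_description = ET.SubElement(root, "dataDscr")
    for spec in variables:
        data_description.append(variable_element(**spec))
    return root


# Category labels follow the educ4 recode example; the variable label is
# an illustrative wording.
educ4 = {
    "name": "educ4",
    "label": "Highest educational qualification (4 groups)",
    "categories": {
        1: "Degree",
        2: "Diploma",
        3: "Higher school or vocational",
        4: "School level or below",
    },
}

xml_text = ET.tostring(build_codebook([educ4]), encoding="unicode")
print(xml_text)

# Reading the file back recovers the value labels for use by other software.
parsed = ET.fromstring(xml_text)
labels = {c.findtext("catValu"): c.findtext("labl") for c in parsed.iter("catgry")}
print(labels)
```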
NESSTAR
Summary: Documentation and workflows
Achieving good documentation is facilitated by effective workflows
o File locations / stamps / transferability
o Variable metadata
o Structured logs of all operations – syntax programs
Data management, documentation and workflows..
• Defining data management, documentation and workflows in survey research
• Documentation for replication
• Further comments and principles in effective social survey research
Data management components of the survey research process
4 good habits and principles
3 Challenges
(a) Good habit: Keep clear records of your DM activities
• Reproducible (for self)
• Replicable (for all)
• Paper trail for whole lifecycle
• Cf. Dale 2006; Freese 2007
In survey research, this means using clearly annotated syntax files (e.g. SPSS/Stata)
Syntax Examples: www.longitudinal.stir.ac.uk
Stata syntax example (‘do file’)
Software and handling variables – our view
Stata is the superior package for secondary survey data analysis:
o Advanced data management and data analysis functionality
o Supports easy evaluation of alternative measures (e.g. est store)
o Culture of transparency of programming/data manipulation
o Cf. Scott Long (2009)
o But: Not available to all users
(b) Principle: Use existing standards and previous research
Variable operationalisations
• Use recognised recodes / standard classifications
o NSI harmonisation standards (e.g. ONS)
o Cross-national standards [Hoffmeyer-Zlotnik & Wolf 2003; Harkness et al. 2003; Jowell et al. 2007]
o Research reviews [e.g. Shaw et al. 2007]
o Common vs. best practices (e.g. dichotomisations)
• Use reproducible recodes / classifications (paper trail)
Other data file manipulations
• Missing data treatments
• Matching data files (finding the right data)
(c) Principle: Do something, not nothing
We currently put much more effort into data collection and data analysis, and neglect data manipulation
Survey research – the influence of ‘what was on the archive version’
…In my experience, a common reason why people didn’t do more DM was because they were frightened to…
(d) Principle: Learn how to match files (‘deterministic’)
Complex data (complex research) is distributed across different files. In surveys, use key linking variables for...
• One-to-one matching
SPSS: match files /file="file1.sav" /file="file2.sav" /by=pid.
Stata: merge pid using file2.dta
• One-to-many matching (‘table distribution’)
SPSS: match files /file="file1.sav" /table="file2.sav" /by=pid.
Stata: merge pid using file2.dta
• Many-to-one matching (‘aggregation’)
SPSS: aggregate outfile="file3.sav" /meaninc=mean(income) /break=pid.
Stata: collapse (mean) meaninc=income, by(pid)
• Many-to-Many matches
• Related cases matching
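The first three kinds of match listed above can be sketched in plain Python to show the logic that the SPSS and Stata commands perform. This is an illustration under assumptions, not a substitute for those commands: the function names, the household identifier `hid` and all data values are invented. Cases with no partner in the second file are kept and simply gain no new variables.

```python
from collections import defaultdict
from statistics import mean


def index_unique(rows, key):
    """Index rows by the linking variable, refusing duplicate key values."""
    index = {}
    for row in rows:
        if row[key] in index:
            raise ValueError(f"duplicate {key}={row[key]!r}")
        index[row[key]] = row
    return index


def match_one_to_one(left, right, key):
    """One-to-one: each key value occurs at most once in both files."""
    index_unique(left, key)  # run only to check the left file for duplicates
    lookup = index_unique(right, key)
    return [{**row, **lookup.get(row[key], {})} for row in left]


def distribute_table(cases, table, key):
    """One-to-many: copy each table row onto every case sharing its key."""
    lookup = index_unique(table, key)
    return [{**row, **lookup.get(row[key], {})} for row in cases]


def aggregate_mean(rows, key, value, new_name):
    """Many-to-one: collapse repeated rows to one mean per key value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return [{key: k, new_name: mean(v)} for k, v in groups.items()]


# Toy files (invented values).
wave_a = [{"pid": 1, "age_a": 34}, {"pid": 2, "age_a": 51}]
wave_b = [{"pid": 1, "age_b": 35}, {"pid": 2, "age_b": 52}]
persons = [{"pid": 1, "hid": 10}, {"pid": 2, "hid": 10}]
households = [{"hid": 10, "region": "Scotland"}]
spells = [
    {"pid": 1, "income": 100},
    {"pid": 1, "income": 300},
    {"pid": 2, "income": 200},
]

wide = match_one_to_one(wave_a, wave_b, "pid")
with_region = distribute_table(persons, households, "hid")
mean_income = aggregate_mean(spells, "pid", "income", "meaninc")
print(wide, with_region, mean_income, sep="\n")
```

Refusing duplicate keys in `index_unique` is a deliberate choice: a match that was meant to be one-to-one but silently picks one of several candidate rows is a common source of undocumented error.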
Some challenges for data management..
(e) Agreeing about variable constructions
Unresolved debates about optimal measures and variables
Esp. in comparative research such as across time, between countries
In DAMES, we have particular interests in comparability for:
• Longitudinal comparability (http://www.longitudinal.stir.ac.uk/variables/)
• Scaling / scoring categories to achieve ‘meaning equivalence’ or ‘specific measures’
Some challenges for data management..
(f) Worrying about data security
DM activities could challenge data security
• Inspecting individual cases
• Multiple copies of related data files
• Ability to link with other datasets
• ‘Hands-on’ model of data review
New and exciting data resources
• have more individual information
• are more likely to be released with stringent conditions
• may jeopardize traditional DM approaches
Some routes to secure data
• Secure ‘portals’ for direct access to remote data
• Secure settings (e.g. safe labs)
• Data anonymisation and attenuation
• Emphasis on users’ responsibility rather than the data provider
Some challenges for data management..
(g) Incentivising documentation / replicability
There is little to press researchers to better document DM, but much to press them not to
• Make DM and its documentation easier?
• Reward documentation (e.g. citations)?
Data management, documentation and workflows..
• Defining data management, documentation and workflows in survey research
• Documentation for replication
• Further comments and principles in effective social survey research
References
Blanden, J., Goodman, A., Gregg, P., & Machin, S. (2004). Changes in generational mobility in Britain. In M. Corak (Ed.), Generational Income Mobility in North America and Europe. Cambridge: Cambridge University Press.
Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2), 143-158.
Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology? Sociological Methods and Research, 36(2), 153-171.
Harkness, J., van de Vijver, F. J. R., & Mohler, P. P. (Eds.). (2003). Cross-Cultural Survey Methods. New York: Wiley.
Hoffmeyer-Zlotnik, J. H. P., & Wolf, C. (Eds.). (2003). Advances in Cross-national Comparison: A European Working Book for Demographic and Socio-economic Variables. Berlin: Kluwer Academic / Plenum Publishers.
Jowell, R., Roberts, C., Fitzgerald, R., & Eva, G. (2007). Measuring Attitudes Cross-Nationally. London: Sage.
Levesque, R., & SPSS Inc. (2010). Programming and Data Management for SPSS 18.0: A Guide for PASW Statistics and SAS users. Chicago, Il.: SPSS Inc.
Long, J. S. (2009). The Workflow of Data Analysis Using Stata. Boca Raton: CRC Press.
Sarantakos, S. (2007). A Tool Kit for Quantitative Data Analysis Using SPSS. London: Palgrave MacMillan.
Scott, J. (2005). Some principal concerns in the shaping of Sociology. In A. H. Halsey & W. G. Runciman (Eds.), British Sociology: Seen from without and within (pp. 136-144). London: The British Academy.
Shaw, M., Galobardes, B., Lawlor, D. A., Lynch, J., Wheeler, B., & Davey Smith, G. (2007). The Handbook of Inequality and Socioeconomic Position: Concepts and Measures. Bristol: Policy Press.
Treiman, D. J. (2009). Quantitative Data Analysis: Doing Social Research to Test Ideas. New York: Jossey Bass.
University of Essex, & Institute for Social and Economic Research. (2009). British Household Panel Survey: Waves 1-17, 1991-2008 [computer file], 5th Edition. Colchester, Essex: UK Data Archive [distributor], March 2009, SN 5151.