stephenson - data curation for quantitative social science research
TRANSCRIPT
LIBBIE STEPHENSON, DATA ARCHIVIST (RETIRED)
UCLA SOCIAL SCIENCE DATA ARCHIVE
https://dataverse.harvard.edu/dataverse/ssda_ucla
Data Curation for Quantitative Social Science Research:
A Case Study
NISO Virtual Conference: Data Curation – Cultivating Past Research Data for Future Consumption
August 31, 2016
DISCLAIMER
I am retired from UCLA so my comments reflect my own experience and expertise. They do not necessarily reflect the ideas, opinions or practices of anyone at UCLA.
These materials are free for you to use, but please cite accordingly.
NISO - AUGUST 31, 2016
OVERVIEW
About the Archive
About the data we manage
What we are trying to do
What we actually do
Some illustrations
ABOUT THE ARCHIVE
Operating since 1964 -- before email, PCs, the Internet, laptops, and smartphones; manage survey/quantitative data stored on media from punch cards to the cloud
Staff have library science degrees; statistical and technical expertise; quantitative social science background
Serve all UCLA quantitative researchers: Provide reference, cataloging/metadata, long term archiving; support in data rescue, management, security.
SURVEY/QUANTITATIVE
RESEARCH
Carried out in the U.S. since the 1940s -- post WW2
1960s-70s -- ICPSR & academic archives
1970s -- growth of data-oriented professional associations (IASSIST, APDU, IFDO, CESSDA)
Focused on society and social norms
Predict outcomes; test assumptions; study change over time; run experiments
Note: in any discipline we also need to understand the work flow of the research and the way individuals approach their work.
CURATION GOALS
Researcher driven philosophy of open access, data sharing, reuse
Collaborative, multi-unit or multi-institutional
Ensure data conservation and long term usability, as well as discovery and access
Processes and work flows support disaster planning
Use of best and trusted digital repository policies, models, practices, and work flows
Reflect values of accountability and integrity
POLICIES SUPPORT PRACTICE
Foundational, essential to a strong data curation infrastructure.
Encompasses what is acquired/collected, curation levels and scope, ensures long term usability, drives processes and work flows
Social Science Data Archive policy
TOOL : Policy-making for Research Data in Repositories by Ann Green, Stuart Macdonald and Robin Rice.
OUR STEPS IN CURATION
Initial contact
Data Quality Review and Appraisal
Ingest: verification, metadata, physical storage
Access
Preservation
INITIAL CONTACT
Data Curation Profile
Data Management Plan
Guide to Social Science Data Preparation and Archiving
APPRAISAL
Archival Collection Policy
Also depends on:
Resources to process
Long term resources
Fitness, usefulness
Data Deposit Form signatures and completeness; commitment to share data; privacy and confidentiality
DATA QUALITY REVIEW
Use of statistical packages, emulator, Adobe Pro, Excel, Colectica, Text editor
Verify deposit package, checksums, and frequencies; compare data to documentation
Completeness of codebook, question text, sampling, weighting, recodes, methods
Disclosure analysis, check for personal identifiers and assess privacy/confidentiality of respondents
Documentation converted to PDF/A
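Note: part of the check for personal identifiers can be sketched in code. The following is a minimal sketch, assuming the deposit can be read as lines of text. The patterns, the function name scan_for_identifiers, and the sample rows are all hypothetical; this is not the Archive's actual procedure, and pattern matching does not replace a manual disclosure review.

```python
import re

# Hypothetical patterns for direct identifiers; a real review would use a
# longer list and always include a manual read of the file.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def scan_for_identifiers(lines):
    """Return (line_number, kind, matched_text) for each suspicious match."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for kind, pattern in PATTERNS.items():
            for match in pattern.finditer(line):
                hits.append((number, kind, match.group()))
    return hits


# Made-up rows for illustration only.
sample = [
    "1001,34,F,no comment",
    "1002,51,M,call me at 310-555-0199",
    "1003,29,F,jdoe@example.org",
]
for hit in scan_for_identifiers(sample):
    print(hit)
```

Each hit gives the reviewer a line to inspect by hand; an empty result does not by itself show that the file is safe to release.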
EXAMPLE: WHAT KIND OF DATA?
CODEBOOK DOCUMENTS THE COLUMNS
5002 01 01 302000 001 101 10004B121068965
Each item is called a variable. We refer to the numeric content of each item as a value.
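Note: the way a codebook maps column positions to variables can be sketched as follows. This is an illustration only. The variable names V1 to V4 and their column positions are invented, because the real codebook for this record is not reproduced in the transcript, and the spacing of the record is taken as it appears above.

```python
# Hypothetical codebook layout: variable name -> (start column, end column),
# 1-based and inclusive, the way printed codebooks usually give positions.
# These names and positions are invented for illustration.
LAYOUT = {
    "V1": (1, 4),
    "V2": (6, 7),
    "V3": (9, 10),
    "V4": (12, 17),
}


def parse_record(record, layout):
    """Slice one fixed-width record into named variables."""
    values = {}
    for name, (start, end) in layout.items():
        # Convert 1-based inclusive columns to a Python slice.
        values[name] = record[start - 1:end]
    return values


record = "5002 01 01 302000 001 101 10004B121068965"
print(parse_record(record, LAYOUT))
```

Values are kept as strings so that leading zeros such as "01" survive; converting to numbers is a separate decision that depends on the codebook.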
COMPARE FREQS TO CODEBOOK
Figure labels: VARIABLE, VALUES, VALUE LABELS
RUN MARGINALS/FREQUENCIES
Sex of Respondent
Value label (valid cases) | Frequency | Percent | Valid Percent | Cumulative Percent
MALE | 856 | 45.1 | 45.1 | 45.1
FEMALE | 1041 | 54.9 | 54.9 | 100.0
Total | 1897 | 100.0 | 100.0 |

What is your race - ethnicity
Value label (valid cases) | Frequency | Percent | Valid Percent | Cumulative Percent
White | 618 | 32.6 | 32.6 | 32.6
Hispanic | 475 | 25.0 | 25.0 | 57.6
Black | 474 | 25.0 | 25.0 | 82.6
Asian or Pacific Islander | 282 | 14.9 | 14.9 | 97.5
Native American or Alaskan native | 17 | 0.9 | 0.9 | 98.4
Identifies more than one of the above groups | 20 | 1.1 | 1.1 | 99.4
DON'T KNOW | 2 | 0.1 | 0.1 | 99.5
REFUSED | 9 | 0.5 | 0.5 | 100.0
Total | 1897 | 100.0 | 100.0 |
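Note: comparing frequencies to the codebook can be sketched as below. The counts 856 and 1041 come from the slide; the numeric codes 1 and 2, the name CODEBOOK_SEX, and the function frequencies are assumptions made for illustration. In practice this check is run in a statistical package.

```python
from collections import Counter

# Codebook entry for one variable: value -> value label.
# The numeric codes 1 and 2 are assumed; the slide shows only the labels.
CODEBOOK_SEX = {1: "MALE", 2: "FEMALE"}


def frequencies(values, codebook):
    """Return rows of (label, count, percent) plus any undocumented codes."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    for code in sorted(counts):
        label = codebook.get(code, "UNDOCUMENTED CODE")
        rows.append((label, counts[code], round(100 * counts[code] / total, 1)))
    undocumented = sorted(code for code in counts if code not in codebook)
    return rows, undocumented


# Rebuild a column with the same counts as the slide: 856 and 1041 cases.
sex_values = [1] * 856 + [2] * 1041
rows, undocumented = frequencies(sex_values, CODEBOOK_SEX)
for row in rows:
    print(row)
print("Codes missing from codebook:", undocumented)
```

A code that appears in the data but not in the codebook, or a documented code that never appears, is a prompt to go back to the depositor before ingest.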
INGEST – PHYSICAL FORMATS
Virus check, run check sums, address versioning, fixity, file naming conventions
Convert files to archival formats if required
Back copies to external media
Copy datasets to Dataverse; Safe Archive tool
Use of secure file transfer client
SQL/PHP scripts for local holdings file
Compression software (7-zip)
Address disaster plan and file access (public and local); Security requirements; LOCKSS
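Note: running checksums at ingest and re-running them later for fixity can be sketched as follows. The slides do not say which checksum algorithm or tool is used, so SHA-256, the function names, and the file name study0001_v1.dat are all assumptions for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's relative path to its checksum at ingest time."""
    folder = Path(folder)
    return {
        str(p.relative_to(folder)): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def verify(folder, manifest):
    """Return the files whose checksum changed or that went missing."""
    current = build_manifest(folder)
    return sorted(name for name, digest in manifest.items()
                  if current.get(name) != digest)


# Demonstration on a throwaway folder with a made-up file name.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "study0001_v1.dat"
    data_file.write_text("5002 01 01\n")
    manifest = build_manifest(tmp)
    unchanged = verify(tmp, manifest)
    data_file.write_text("5002 01 02\n")  # simulate silent corruption
    changed = verify(tmp, manifest)
print(unchanged, changed)
```

Storing the manifest alongside each copy (local, external media, repository) lets any copy be checked against the others after a transfer or a media refresh.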
INGEST– BIBLIOGRAPHIC METADATA
Bibliographic metadata enables search and discovery:
Establish bibliographic-level identity for unique items
Bibliographic record to WorldCat/Voyager
Add record to holdings database (SQL)
Create Dataverse record; Assign persistent identifier
Produce and review with investigator
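Note: adding a record to a holdings database can be sketched as below. The Archive's actual schema and scripts (SQL/PHP) are not shown in the slides, so this uses Python's built-in sqlite3 module with a hypothetical table, hypothetical column names, and placeholder values, including the identifier doi:10.0000/EXAMPLE.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE holdings (
        study_id      TEXT PRIMARY KEY,   -- local identifier (hypothetical)
        title         TEXT NOT NULL,
        investigator  TEXT NOT NULL,
        persistent_id TEXT UNIQUE,        -- identifier assigned at deposit
        date_ingested TEXT NOT NULL
    )
    """
)

# Placeholder values; none of these describe a real study.
record = ("SSDA-0001", "Example Survey", "Example Investigator",
          "doi:10.0000/EXAMPLE", "2016-08-31")
conn.execute("INSERT INTO holdings VALUES (?, ?, ?, ?, ?)", record)
conn.commit()

# Look the study up by its persistent identifier.
found = conn.execute(
    "SELECT study_id, title FROM holdings WHERE persistent_id = ?",
    ("doi:10.0000/EXAMPLE",),
).fetchone()
print(found)
```

The UNIQUE constraint on the persistent identifier is one simple way to enforce a single bibliographic-level identity per item.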
WHAT ELSE DO WE NEED TO KNOW ABOUT THE DATA?
Description of the study
Citation
Funding source
Methodology
Sampling
Publications
EXAMPLE - DATAVERSE
Links to tools to manage collections
Navigate to and search for studies
Studies can be downloaded or analyzed online
VARIABLE LEVEL SEARCH CAPABILITIES
Enables searching across many studies at once.
Enables searching shared catalogs of multiple archives
TOOLS: Colectica Repository and NESSTAR
Requires local or remote hosting of software.
Can share the metadata files for repurposing.
DATA DOCUMENTATION INITIATIVE
Document, Discover, and Interoperate
“International standard for describing data that result from observational methods in the social, behavioral, economic, and health sciences”
“Facilitates interpretation and understanding -- both by humans and computers”
INGEST-VARIABLE LEVEL METADATA
Descriptive metadata of detailed information about the data enables understandability and reuse:
Create variable-level metadata, using Colectica or NESSTAR to produce standardized metadata records
Create DDI record; full DDI codebook
Migrate DDI to Colectica Repository
Produce and review with investigator
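Note: the shape of a variable-level record can be sketched as below. Colectica and NESSTAR produce the real DDI; this sketch only uses Python's standard library to build a simplified fragment with DDI Codebook element names (var, labl, qstn, qstnLit, catgry, catValu). It omits the namespace and other required elements, so it is not schema-valid DDI. The variable name SEX and the codes 1 and 2 are assumptions.

```python
import xml.etree.ElementTree as ET


def variable_to_ddi(name, label, question, categories):
    """Build a DDI-Codebook-style <var> element for one variable."""
    var = ET.Element("var", {"name": name})
    ET.SubElement(var, "labl").text = label
    qstn = ET.SubElement(var, "qstn")
    ET.SubElement(qstn, "qstnLit").text = question
    for value, value_label in categories:
        catgry = ET.SubElement(var, "catgry")
        ET.SubElement(catgry, "catValu").text = str(value)
        ET.SubElement(catgry, "labl").text = value_label
    return var


code_book = ET.Element("codeBook")
data_dscr = ET.SubElement(code_book, "dataDscr")
data_dscr.append(
    variable_to_ddi(
        name="SEX",  # hypothetical variable name
        label="Sex of Respondent",
        question="Sex of Respondent",  # real question text not in the slides
        categories=[(1, "MALE"), (2, "FEMALE")],  # codes 1 and 2 are assumed
    )
)
xml_text = ET.tostring(code_book, encoding="unicode")
print(xml_text)
```

Because the variable name, label, question text, and value labels sit in separate elements, the same record can feed a codebook, a search index, or a statistical setup file.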
NESSTAR
EXAMPLE - IMPORTING DATA
Use the Data tab to import files from SPSS or STATA formats.
Figure labels: Label, Question text, Numeric values
Variable Details include variable name, label, description or question text, and types of coding.
EXAMPLE DDI FROM COLECTICA
DDI fields are in red; used to create documentation; can be repurposed
PRESERVATION AND CURATION
Continuous monitoring of file formats; migrate to new formats when there is a new operating system, a new version of statistical software, a new mode of file transfer, or a code change
Monitoring of database function; software updates or redesigns
Monitoring of servers, external media health; replace as needed
Data forensics; check sums; validation; authentication; version control; format migration; refresh media; record preservation metadata -- DDI
Review disaster plan and collection policy at regular intervals
Review new or revised regulations for intellectual property; security; data producers/distributors; funding agencies
Review with the original depositor their data management plans and any changes in access or user permissions
Focus is on functional-level preservation and long term usability through use of DDI and continuous review.
UNCOMFORTABLE TRUTHS
Data management in institutions requires high level administrative participation; new, sustained funding; and differently trained staff
Data management planning is not a static event but a continuous process to ensure long-term, independently understandable, informed reuse of research
There is an urgent need for standards, tools, and best practice models for many different file formats and disciplines
NEXT STEPS FOR PRACTITIONERS
“Crucial metadata about data are not always being captured or created and linked to data in repositories. Storage and persistence of data submissions isn't enough. We need data archivists and librarians to commit to partnering with researchers to curate data -- to review incoming data for usability, confidentiality, and completeness of descriptive information.”
Ann Green (2016), email communication. Used with permission.
ANY QUESTIONS?
THANK YOU!
Social Science Data Archive, UCLA
Box 951484 Los Angeles, CA 90095-1484 310-825-0716
LINKS
Social Science Data Archive dataverse.harvard.edu/dataverse/ssda_ucla
Data Seal of Approval www.datasealofapproval.org/en/
National Digital Stewardship Alliance ndsa.org/activities/levels-of-digital-preservation/
Open Archival Information System www.oclc.org/research/publications/library/2000/lavoie-oais.html
Social Science Data Archive Policy data-archive.library.ucla.edu/SSDA_collectionAndArchivingPolicy.pdf?_ga=1.3255478.786669706.1378228281
Data Curation Profile datacurationprofiles.org/
Data Management Planning at ICPSR www.icpsr.umich.edu/icpsrweb/content/datamanagement/dmp/index.html
ICPSR Guide to Data Preparation www.icpsr.umich.edu/icpsrweb/content/deposit/guide/
Colectica www.colectica.com/
NESSTAR www.nesstar.com/index.html
DDI www.ddialliance.org/
Dataverse dataverse.org/