ICPSR Data Exploration Tools
DESCRIPTION
Part I of a workshop conducted by ICPSR. This deck describes data exploration tools.
TRANSCRIPT
ICPSR AT 50: Facilitating Research and Data Sharing
Part I: Data Exploration
IASSIST, Vancouver, BC
May 31, 2011
Welcome to Vancouver!
Our Agenda
• Data Exploration
  – A Continuing Quest to Ease your Search
  – Social Science Variables Database
  – Bibliography of Data-related Literature
• Data Sharing
  – 2010 US Census Data
  – Public Data Collections
• Data Management
  – Data Management Plans
  – Computing & Data Sharing in Secure Environments
  – Managing Restricted Contracts
Managing the Clock
• Intro and Data Exploration (9:30-10:30)
  – Break
• Data Sharing (10:45-11:30)
  – Break
• Data Management (11:45-12:30)
  – Escape!
Disclaimer: Times are approximate!
What is ICPSR? - Then and Now -
• One of the world’s oldest and largest social science data archives, est. 1962
• Data distributed on punch cards, then reel-to-reel tape, now:
  – Data available on demand
  – Over 7,000 studies with over 65,000 data sets
• Membership organization among 21 universities, now:
  – Currently about 700 members world-wide
  – Federal funding of public collections
What We Do – It’s About Data!
• Seek research data and pertinent documents from researchers (PIs, research agencies, government)
• Process and preserve the data and documents
• Disseminate data
• Provide education, training, & instructional resources
Why People Use ICPSR
• Write articles, papers, or theses using real research data
• Conduct secondary research to support findings of current research or to generate new findings
• Use as intro material in grant proposals
• Preserve/disseminate primary research data
  – Fulfill data management plan (grant) requirements
• Study or teach quantitative methods
Data Exploration
The Challenge – Hoards of Data & Metadata
How does one make sense of:
• 7,000 studies
• 65,000 datasets
• 550,000 files
• Millions of variables
• 60,000 bibliographic citations
Data Exploration - Integrated Search -
Better Search for Better Results
[Diagram: Search Results, with three components: Docs, subjects, PIs, etc. | SSVD (the variables) | Data-related biblio]
Integrating ICPSR’s Search
“Sponsored by SOLR/Lucene”
• In 2009, an improved search engine
• Later, construction of full-text search
• Faceted search to narrow large result sets
Search Terms: teen drug use
Reviewing the Study Home Page
The Search Continues: Automatic Search Updates
• Receive automatic updates on the study or series
• And updates on your query
Data Exploration
The Social Science Variables Database
The Social Science Variables Database (SSVD)
Sanda Ionescu, Documentation Specialist
The Social Science Variables Database at ICPSR
• Enables ICPSR users to search variables across datasets
• Assists in:
  – Data discovery
  – Comparison / harmonization projects
  – Data harvesting
  – Data analysis
  – Question mining for designing new research
The Social Science Variables Database at ICPSR
Tool for teaching
– Research Methods:
  – Concept operationalization
  – Effect of question wording, context, and answer categories on variable distributions
– Substantive classes:
  – Cultural / social changes reflected in different question wordings, or elicited answers (longitudinal or time series data)
The Social Science Variables Database at ICPSR
• Officially launched Spring 2009.
• Pre-launch: two to three years’ preparation period
  – Gather variable-level documentation; apply/refine selection criteria, quality checks
  – Build database to host variable descriptions
  – Initial upload: 3,500 files describing data from about 1,300 studies.
The Social Science Variables Database at ICPSR
• Variables documented using the Data Documentation Initiative (DDI) specification
• DDI: a standard for documenting social science data, written in XML
  – Easy to parse / process
  – Allows fine-grained searches
  – Flexible display in a variety of formats
  – Highly shareable, promotes interoperability
  – Ideal archival format (ASCII, not software dependent)
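To illustrate the "easy to parse" point, here is a minimal Python sketch that reads one variable description from XML. The fragment is a simplified, hypothetical example loosely modeled on DDI Codebook element names (var, labl, qstn/qstnLit, catgry/catValu); the variable and its values are invented, and real DDI files carry namespaces and many more elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified DDI-style fragment (invented for illustration).
DDI_FRAGMENT = """
<dataDscr>
  <var name="DRUGUSE">
    <labl>Ever used drugs</labl>
    <qstn><qstnLit>Have you ever used any illegal drugs?</qstnLit></qstn>
    <catgry><catValu>1</catValu><labl>Yes</labl></catgry>
    <catgry><catValu>2</catValu><labl>No</labl></catgry>
  </var>
</dataDscr>
"""

def read_variables(xml_text):
    """Return one dict per <var> with its name, label, question, and categories."""
    root = ET.fromstring(xml_text)
    variables = []
    for var in root.iter("var"):
        # Map each category value to its label, e.g. "1" -> "Yes".
        categories = {
            cat.findtext("catValu"): cat.findtext("labl")
            for cat in var.findall("catgry")
        }
        variables.append({
            "name": var.get("name"),
            "label": var.findtext("labl"),
            "question": var.findtext("qstn/qstnLit"),
            "categories": categories,
        })
    return variables

variables = read_variables(DDI_FRAGMENT)
print(variables[0]["name"], "-", variables[0]["label"])
```

Because each piece of the description sits in its own element, a search or display tool can pick out just the label, the question text, or the category labels.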
The Social Science Variables Database at ICPSR
DDI variable descriptions
• Generated through an automated process used archive-wide to produce ICPSR’s archival and distribution information packages
• Include question text if available in the source documentation
The Social Science Variables Database at ICPSR
Relational database
• Built in Oracle as a separate entity, with links to studies’ and series’ descriptions (also stored in Oracle)
• Compatible with both DDI 2 and 3 (input and output)
• Oracle Text searches used in Beta-testing phase
  – Slow retrieval
  – Limited to 500 results
The Social Science Variables Database at ICPSR
• Search: autumn 2009 switched to Solr/Lucene:
  • Easy indexing
  • Faster searches, unlimited hits
  • Facets/Filters imported from Study Descriptions (also DDI compatible)
    – Series
    – Study
    – Time Period
    – Geography
• Storage: XML files are being indexed and searched directly – no longer uploaded in the database
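The idea behind facets and filters can be sketched in a few lines of plain Python. This is not Solr code and not ICPSR's implementation; the records are invented, and only the facet names (series, study, time period, geography) come from the slide.

```python
from collections import Counter

# Invented variable records for illustration; the field names mirror the
# facets listed on the slide (series, study, time period, geography).
RECORDS = [
    {"variable": "DRUGUSE", "series": "Series A", "study": "Study 1",
     "time_period": "1995", "geography": "United States"},
    {"variable": "ALCUSE", "series": "Series A", "study": "Study 2",
     "time_period": "2005", "geography": "United States"},
    {"variable": "SMOKE", "series": "Series B", "study": "Study 3",
     "time_period": "2005", "geography": "Canada"},
]

def facet_counts(records, field):
    """Count how many records fall under each value of one facet field."""
    return Counter(record[field] for record in records)

def apply_filters(records, **filters):
    """Keep only the records whose fields equal every selected facet value."""
    return [
        record for record in records
        if all(record.get(field) == value for field, value in filters.items())
    ]

print(facet_counts(RECORDS, "series"))
narrowed = apply_filters(RECORDS, series="Series A", time_period="2005")
print([record["variable"] for record in narrowed])
```

The counts are what a results page would show next to each facet value; selecting a value narrows the result set.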
The Social Science Variables Database at ICPSR
• Current content:
  – 2,602 studies (48 percent of ICPSR holdings with data and setups)
  – 6,493 datasets
  – Approx. 1.7 million variables
• Continues to grow by including
  – All new releases, if suitable
  – Retrofits as made available by small-scale projects
The Social Science Variables Database at ICPSR
• DDI fields searched:
  – Variable name
  – Variable label
  – Question text sequence
  – Descriptive text
  – Category label
• Variable notes – not indexed / searched, but they are displayed
The Social Science Variables Database at ICPSR
The Public Search Features:
• Stemming
• “Phrase searches”
• Fielded searches (treated as a default Boolean “and”; the Boolean operators “or” and “not” are ignored)
  – Variable label
  – Question text
  – Value labels
http://www.icpsr.umich.edu/icpsrweb/ICPSR/
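The search behavior listed above (stemming, quoted phrases, an implied "and", with "or" and "not" ignored) can be sketched as follows. This is a toy illustration, not the Solr/Lucene configuration ICPSR uses: the stemmer only strips a trailing "s", and the function names are hypothetical.

```python
import re

# Operator words are dropped, so every remaining term is required ("and").
IGNORED_OPERATORS = {"and", "or", "not"}

def parse_query(query):
    """Split a query into quoted phrases and single terms, dropping operator words."""
    phrases = [p.lower() for p in re.findall(r'"([^"]+)"', query)]
    remainder = re.sub(r'"[^"]+"', " ", query)
    terms = [t.lower() for t in remainder.split()
             if t.lower() not in IGNORED_OPERATORS]
    return phrases, terms

def toy_stem(word):
    """Very crude stemmer: strips one trailing 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def matches(text, query):
    """True if the text contains every phrase and the stem of every term."""
    phrases, terms = parse_query(query)
    lowered = text.lower()
    stems = {toy_stem(word) for word in re.findall(r"[a-z0-9]+", lowered)}
    return (all(phrase in lowered for phrase in phrases)
            and all(toy_stem(term) in stems for term in terms))

label = "Frequency of drug use among teens"
print(matches(label, 'teen "drug use"'))
print(matches("Alcohol use", "alcohol or tobacco"))
```

In the second query the word "or" is discarded, so both "alcohol" and "tobacco" are required and a label mentioning only alcohol does not match.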
The Social Science Variables Database at ICPSR
Projected improvements/additional features:
• Enable selection of multiple filters
• Enable users to toggle on/off stemming
• Enable searching “within” results (adding new query to a result set)
• Show / hide response categories on result page
• Create interface for selecting results and exporting selection in a particular format
• From individual variable display, enable navigation to previous or next variable (to show context)
The Social Science Variables Database at ICPSR
Usage data (source: Google Analytics)
Data Exploration
The Bibliography of Data-related Literature
ICPSR Bibliography of Data-related Literature
Elizabeth Moss, Assistant Librarian, ICPSR
ICPSR Bibliography of Data-related Literature
What we will cover:
• What it is and how to access it
• How and why we developed it
• Main features
• How instructors find it useful
• You are a good source
What it is and how to access it
It’s really a searchable database . . .
. . . containing 60,000 citations of known published and unpublished works resulting from analyses of data archived at ICPSR
. . . that can generate study bibliographies associating each study with the literature about it
. . . Now included in the integrated search on the ICPSR Web site
How and why we developed it
• Brainchild of Richard Rockwell, former ICPSR director
• Funded by a grant from the National Science Foundation in 2000 to build the collection and create a way to access it
• ICPSR membership and federally-funded archives continue to support it
How and why we developed it
What’s in the collection?
• Resources using data in the ICPSR holdings as the primary data source
• Resources using ICPSR data in a comparison with the primary dataset investigated
• Resources "about" an ICPSR dataset or study series.
How and why we developed it
http://www.icpsr.umich.edu/icpsrweb/ICPSR/citations/methodology.jsp
How and why we developed it
Demonstrate impact of data for funding
Main features
http://www.icpsr.umich.edu/icpsrweb/ICPSR/citations/index.jsp
Main features
Search features:
• Searches the full text of the elements of citations, e.g., title, author, journal
• Boolean “and” is assumed, and phrase searching uses quotation marks:
  adolescents and “mental health” (this works)
• No Boolean “or” or “not”:
  Havens or “Havens, Jennifer” (this doesn’t work; the “or” becomes “and”)
Main features
Linking from the search results:
• To full text for journals
  – Directly via DOI
  – Using OpenURL via Google Scholar and WorldCat
• To full text of reports and other resources via PDF or HTML links
• To the detailed, fielded publication record
Main features
Internal and external linking from the detailed citation record:
• To the related study(s)
• To other citation records of publications by the same author
• To other articles in the same journal (but outside the search)
• To full text options
Main features
Exporting citations:
• From search results: Up to 500 records in RIS format, exports directly to EndNote
• From individual detailed record: Export the citation in RIS format
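For readers unfamiliar with RIS, the sketch below writes a journal-article citation as an RIS record using the common tags TY, AU, TI, JO, PY, and ER. It is an illustration only, not ICPSR's export code; the citation is invented, the function names are hypothetical, and the 500-record cap simply mirrors the limit mentioned above.

```python
MAX_EXPORT = 500  # export cap stated on the slide

def to_ris(citation):
    """Format one journal-article citation dict as an RIS record."""
    lines = ["TY  - JOUR"]                      # record type: journal article
    for author in citation["authors"]:
        lines.append(f"AU  - {author}")         # one AU line per author
    lines.append(f"TI  - {citation['title']}")
    lines.append(f"JO  - {citation['journal']}")
    lines.append(f"PY  - {citation['year']}")
    lines.append("ER  - ")                      # end-of-record marker
    return "\n".join(lines)

def export_ris(citations, limit=MAX_EXPORT):
    """Join up to `limit` citations into one RIS document."""
    return "\n".join(to_ris(citation) for citation in citations[:limit])

# Invented citation, for illustration only.
example = {
    "authors": ["Doe, Jane", "Roe, Richard"],
    "title": "An Example Article",
    "journal": "Journal of Examples",
    "year": 2010,
}
print(export_ris([example]))
```

Each line is a two-letter tag, two spaces, a hyphen, a space, and the value, which is why reference managers such as EndNote can import the file directly.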
Main features
Filtering and sorting features:
• Filter search results by author, pub type, journal, pub. year
• Coming soon—pub year range filter (similar to that in study search)
• Sort search results by relevance, pub date (oldest or newest), title, recency
Browse from main Bibliography page:
• By author name (no authority control)
  Juster, F. (2)
  Juster, F. Thomas (22)
  Juster, F.T. (1)
• By journal title name (authority control)
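The Juster example shows what the absence of authority control means in practice: one author appears under three headings. As a rough illustration, the sketch below groups name variants by surname and first initial. This is a crude, hypothetical heuristic, not how ICPSR or any authority file works, and it would wrongly merge different authors who share a surname and initial.

```python
from collections import defaultdict

# Name variants and record counts as listed on the slide.
AUTHOR_COUNTS = {
    "Juster, F.": 2,
    "Juster, F. Thomas": 22,
    "Juster, F.T.": 1,
}

def author_key(name):
    """Reduce 'Surname, Given names' to a (surname, first initial) key."""
    surname, _, given = name.partition(",")
    given = given.strip()
    initial = given[0].upper() if given else ""
    return (surname.strip().lower(), initial)

def merge_variants(counts):
    """Sum record counts across name variants that share the same key."""
    merged = defaultdict(int)
    for name, count in counts.items():
        merged[author_key(name)] += count
    return dict(merged)

print(merge_variants(AUTHOR_COUNTS))
```

With the slide's three variants, all 25 records end up under a single key, which is the grouping an authority-controlled author list would provide.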
Main features
Link from individual study pages:
• to the dynamically-generated study bibliography
• to series collections, when applicable
Link from series description pages:
• to series bibliographies from the series page
How instructors find it useful
Senior seminar classes
• Profs choose dataset and ask students to think of a research question
• Bibliography allows students to see the wide variety of topics available for a single dataset
How instructors find it useful
Research proposal design
• Good for finding studies that examine what a student wants to propose
• Does the data they would want already exist?
• If so, are there survey questions they could replicate?
• Authors’ suggestions for future research
How instructors find it useful
Undergraduate introduction
• Research papers—Good starting point for finding literature on a particular topic
• Finding data—Starting with the Bibliography can be more intuitive
How instructors find it useful
From the ICPSR blog:
“I can't say enough about how much I like the Bibliography of Data-related Literature. I find that students prefer to use this to identify key writings about data obtained from ICPSR. Students are sometimes really overwhelmed by trying to do literature searches in the many article databases subscribed to by the Library and they don't find what they need by using Google Scholar. So, I direct them to the Bibliography first to identify authors and subject terms. They can then use these to carry out successful searches in article databases.”
How instructors find it useful
From the ICPSR blog:
“As a companion to the Bibliography I also use the instructional tool: Exploring Data Through Research Literature (EDRL). I think Rachel Barlow did a fantastic job on this. I have adapted pieces of EDRL for use in class presentations with great success. If you are in a library and you are involved in information literacy activities, this is a great tool.”
The EDRL – an Online Module
You are a good source
Get credit for your work AND let us know about that of others:
• Send a citation via the Web form
• Or send them in an email to [email protected]
• If you have a large library, we can take EndNote XML imports, or even RIS-format imports
You are a good source
A final request:
• When you write articles, reports, papers, and presentations that analyze or significantly discuss data, CITE the data
• Encourage others to do it, too
• Here’s how and why
Let’s Take a Break
Return at 10:45