big data case studies

29
The HathiTrust Research Center's Extracted Features Dataset: An Opportunity for "Distant" Reading of Millions of Books from the World's Great Research Libraries Sayan Bhattacharyya ([email protected] ) Peter Organisciak ([email protected] ) Graduate School of Library and Information Science, UIUC, Urbana- Champaign Loretta Auvil and Boris Capitanu Ted Underwood (Illinois Informatics Institute, UIUC) (Department of English, UIUC, Urbana- Champaign)

Upload: uiresearchpark

Post on 14-Jan-2017

412 views

Category:

Data & Analytics


0 download

TRANSCRIPT

Page 1: Big Data Case Studies

The HathiTrust Research Center's Extracted Features Dataset: An Opportunity for "Distant" Reading of Millions of Books from the World's Great Research LibrariesSayan Bhattacharyya ([email protected])Peter Organisciak ([email protected] )

Graduate School of Library and Information Science, UIUC, Urbana-Champaign

Loretta Auvil and Boris Capitanu Ted Underwood(Illinois Informatics Institute, UIUC) (Department of English, UIUC, Urbana- Champaign)

Page 2: Big Data Case Studies

WHATisIT?

Page 3: Big Data Case Studies

a dataset of page-level extracted features for scanned books in the HathiTrust Digital Library

Page 4: Big Data Case Studies

Raw Text▼

Translation into features (we drop you off here)

▼Algorithmic Use

Page 5: Big Data Case Studies

"that": { "DT": 1, "IN": 4, "WDT": 1 }, "the": { "DT": 5 }, "there": { "EX": 2 }, "thin": { "JJ": 1 }, "this": { "DT": 1 }, "though": { "IN": 1 }, "thought": { "VBD": 1 }, "three": { "CD": 1 }, "through": { "IN": 1 }, "to": { "TO": 2 }, "to-day": { "NN": 1 }, "twisted": { "VBN": 1 }, "twitched": { "VBD": 1 },

(Verb, past tense)

(Determiner,Preposition or subordinating conjunction,Wh-determiner)

Page 6: Big Data Case Studies
Page 7: Big Data Case Studies

CONTACT: [email protected] - @POrg DATA: https://sharc.hathitrust.org/features

Page 8: Big Data Case Studies
Page 9: Big Data Case Studies

WHYshould youCARE?

Page 10: Big Data Case Studies

1) It’s huge2) It’s accessible3) You can do cool

stuff with it

Page 11: Big Data Case Studies

1) It’s Huge

Page 12: Big Data Case Studies

The HathiTrust Digital Library (HTDL)• Approximately 14 million books• from the world’s great research libraries:• a large chunk of mankind’s historical textual record of

culture

•1.8 billion pages•610 billion words•Approximately 4.8 million of the 14 million books are in the public domain

• Current applications are set up to work with these 4.8 million books

Page 13: Big Data Case Studies
Page 14: Big Data Case Studies

2) It’s Accessible

Page 15: Big Data Case Studies

One of the largest archives of pre-digital human creation, downloadablehttps://sharc.hathitrust.org/features

Page 16: Big Data Case Studies

Can’t share 10 million in-copyright works, but…

Page 17: Big Data Case Studies

3) can do useful stuff with it

Page 18: Big Data Case Studies

Large corpora allow for •historical•cultural•linguistic

insights

Page 19: Big Data Case Studies

Co-occurrence tablesby David Mimno (Computer Science Dept.,Cornell University)

Available for use at:http://mimno.infosci.cornell.edu/wordsim/nearest.html

Explanatory article at:http://www.mimno.org/articles/wordsim/

Page 20: Big Data Case Studies

Bookwormhttp://bookworm.htrc.illinois.eduFaceted visualization of trends over 4.8 million books

CONTACT: [email protected] - @POrg DATA: https://sharc.hathitrust.org/features

Page 21: Big Data Case Studies

The HathiTrust+Bookworm project

Team Members:• Current:

J. Stephen Downie, University of Illinois, Urbana-Champaign Erez Lieberman Aiden, Baylor College of Medicine Benjamin Schmidt, Northeastern University Robert McDonald, Indiana University Loretta Auvil, University of Illinois, Urbana-Champaign Peter Organisciak, University of Illinois, Urbana-ChampaignMuhammad Shamim, Baylor College of MedicineSayan Bhattacharyya, University of Illinois, Urbana-Champaign Leena Unnikrishnan, Indiana University

• Past:Colleen Fallaw, University of Illinois, Urbana-ChampaignMatt Nicklay, Baylor College of Medicine

Funded by an NEH Implementation Grant (2014-2016)

Page 22: Big Data Case Studies

Hooking up Extracted Unigrams with Bookworm: Advantages?First advantage: Good metadata!•HTDL has good and detailed metadata• metadata was meticulously created by librarians from

contributing libraries

allows for highly faceted queries:

Page 23: Big Data Case Studies

Hooking up HTDL with Bookworm: Advantages?Second advantage: HTDL’s workset functionality (contd.)

HT+Bookworm queryon HTDL collection Workset

HathiTrust Research Center’s Workset Builder

HT+Bookworm queryon workset

Workset creation and refinement workflow

Page 24: Big Data Case Studies

How scientific inquiry meets humanistic inquiryin culturomics as performed by HT+Bookworm

•Scientific inquiry concerns:•Generalization across entities•Discovery of patterns across entities

•Humanistic inquiry concerns:•Close engagement with specific entities• Attending to singular instances among entities

Page 25: Big Data Case Studies

How scientific inquiry meets humanistic inquiryin culturomics as performed by HT+Bookworm

(contd.):

Three use cases•How: • change in social context across time correlates with change in preponderance of one word-concept over another

• occurrences of related word-concepts in multiple languages/places compare with each other

•metaphorical associations of a word vary across time

Page 26: Big Data Case Studies

How change in social context across time correlates with

change in preponderance of one word-concept over another

Search for lady in ( Language: English )Search for woman in ( Language: English )

Page 27: Big Data Case Studies

How change in social context across time correlates with

change in preponderance of one word-concept over another(contd.)

Search for lady in ( Language: English ) AND ( Publication Country: United States )Search for woman in ( Language: English ) AND ( Publication Country: United States )Search for lady in ( Language: English ) AND ( Publication Country: United Kingdom )Search for woman in ( Language: English ) AND ( Publication Country: United Kingdom )

Page 28: Big Data Case Studies

How metaphorical associations of a word vary across time

Streamgraph plot of depression with Library of Congress categories, between 1850 and 1923 : http://bit.ly/1QatsPs

Page 29: Big Data Case Studies

DATASEThttps://sharc.hathitrust.org/features

ACKNOWLEDGEMENTSBoris Capitanu Ted Underwood

Loretta AuvilColleen Fallaw J. Stephen Downie

Benjamin Schmidt (Bookworm)

Special thanks to the National Center for Supercomputing Applications (NCSA)

National Enfowment for the Humanities (NEH)

Contact:Peter Organisciak

[email protected]

Sayan Bhattacharyya

[email protected]