ALA 2010 -- Jeremy York

HATHI TRUST: A Shared Digital Repository. Delivering Data For New Generations of Research: Strategies and Challenges. Jeremy York, NISO/BISG Forum, ALA 2010

Upload: bisg

Post on 28-Nov-2014


Page 1: ALA 2010 -- Jeremy York

HATHI TRUST
A Shared Digital Repository

Delivering Data For New Generations of Research

Strategies and Challenges

Jeremy York

NISO/BISG Forum
ALA 2010

Page 2: ALA 2010 -- Jeremy York

Introduction

• Digital Repository
– Initial focus on digitized book and journal content
– "Light" archive
• Collections and Collaboration
– Comprehensive collection
– Shared strategies
– Local services
– Public Good

Page 3: ALA 2010 -- Jeremy York

Content Distribution

Public Domain: 19% (1,177,667)
In Copyright: 81%
Total: 6,173,575

* As of June 15, 2010
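
The two counts on this slide are enough to check the split shown in the pie chart. A minimal sketch of the arithmetic, assuming the in-copyright count is simply the total minus the public-domain count (the slide does not state that figure directly; the variable names are illustrative):

```python
# Figures taken from the slide (as of June 15, 2010).
total_items = 6_173_575
public_domain_items = 1_177_667

# Assumption: everything that is not public domain is in copyright.
in_copyright_items = total_items - public_domain_items

public_domain_share = public_domain_items / total_items
in_copyright_share = in_copyright_items / total_items

print(f"Public domain: {public_domain_items:,} ({public_domain_share:.0%})")
print(f"In copyright:  {in_copyright_items:,} ({in_copyright_share:.0%})")
```

The public-domain count works out to roughly 19% of the total, which leaves about 81% in copyright.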

Page 4: ALA 2010 -- Jeremy York

Language Distribution (1)

The top 10 languages make up ~86% of all content

English: 48%
German: 8%
French: 7%
Russian: 5%
Spanish: 4%
Chinese: 4%
Japanese: 4%
Italian: 3%
Arabic: 2%
Polish: 1%
Remaining Languages: 14%

* As of June 15, 2010

Page 5: ALA 2010 -- Jeremy York

Language Distribution (2)

The next 40 languages make up ~13% of total

[Pie chart. Legible labels:]
Hindi 6%, Portuguese 6%, Hebrew 6%, Indonesian 6%
Dutch 5%
Urdu 4%, Swedish 4%, Turkish 4%, Unknown 4%
Czech 3%, Thai 3%, Undetermined 3%, Persian 3%
Vietnamese 2%, Ukrainian 2%, Hungarian 2%, Korean 2%, Bengali 2%, Norwegian 2%
Armenian 1%, Greek 1%, Panjabi 1%, Malay 1%, Catalan 1%, Malayalam 1%, Finnish 1%
Labels without a legible value: Latin, Danish, Croatian, Tamil, Sanskrit, Bulgarian, Slovak, Serbian, Romanian, Ancient Greek, Yiddish, Slovenian, Multiple

* As of June 15, 2010

Page 6: ALA 2010 -- Jeremy York

Originating Institution

University of Michigan: 65%
University of California: 25%
University of Wisconsin: 6%
Indiana University: 3%
University of Minnesota: 1%
Penn State University: 0%

* As of June 15, 2010

Page 7: ALA 2010 -- Jeremy York

Content over time

[Chart: share of content (0% to 100%) by originating institution over time, starting Sep-04. Series: Michigan, Wisconsin, Indiana, California, Penn State, Minnesota.]

* As of June 15, 2010

Page 8: ALA 2010 -- Jeremy York

Content Growth

Page 9: ALA 2010 -- Jeremy York
Page 10: ALA 2010 -- Jeremy York

Data Distribution & APIs

• OAI-PMH
• Metadata files
• Bibliographic API
• Data API
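
As an illustration of the first item, OAI-PMH requests are plain HTTP queries with a `verb` argument, and long result lists are paged with a `resumptionToken`. The sketch below builds such requests and parses a response. The endpoint URL and the sample response are invented for illustration and are not HathiTrust's; only the protocol verb, argument names, and XML namespace come from the OAI-PMH specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://repository.example.org/oai"

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_request(verb, **arguments):
    """Return an OAI-PMH request URL for the given verb and arguments."""
    query = {"verb": verb, **arguments}
    return f"{BASE_URL}?{urlencode(query)}"


def parse_list_identifiers(xml_text):
    """Return (identifiers, resumption_token) from a ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    identifiers = [
        element.text
        for element in root.findall(".//oai:header/oai:identifier", OAI_NS)
    ]
    token_element = root.find(".//oai:resumptionToken", OAI_NS)
    has_token = token_element is not None and token_element.text
    return identifiers, (token_element.text if has_token else None)


# Invented sample response so the sketch runs without network access.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example.org:item-1</identifier></header>
    <header><identifier>oai:example.org:item-2</identifier></header>
    <resumptionToken>page-2</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""

first_url = build_request("ListIdentifiers", metadataPrefix="oai_dc")
identifiers, token = parse_list_identifiers(SAMPLE_RESPONSE)
# A follow-up request carries only the verb and the resumption token.
next_url = build_request("ListIdentifiers", resumptionToken=token)

print(first_url)
print(identifiers, token)
print(next_url)
```

A harvester would repeat the follow-up request until a response arrives with no resumption token.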

Page 11: ALA 2010 -- Jeremy York

Extended Services

• Community Development Environment
• Non-Google Ingest
• Non-Book/Non-Journal Ingest
• Computational Research

Page 12: ALA 2010 -- Jeremy York

Strategies for Computational Research

• Data distribution
• Protocol-based access
• Research Center

Page 13: ALA 2010 -- Jeremy York
Page 14: ALA 2010 -- Jeremy York

SEASR Architecture

[Architecture diagram. Labels:]
• Visualizations
• Apps, Services, Plugins, Web Apps
• User Interfaces
• Components
• Meandre Data-Intensive Flows
• …r Tools
• Repositories
• Meandre Workbench
• Meandre Infrastructure
• Visualization, Analytics, Data
• Component Repository, Component Discovery
• Developer
• Data, Analysis, Components, Flows
• Virtualization Infrastructure
• Cloud Computing

Page 15: ALA 2010 -- Jeremy York

SEASR @ Work – Tag Cloud

• Count tokens
• Filter options supported
• Stem words
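
The three steps on this slide (count, filter, stem) can be sketched in a few lines. This is not SEASR code: the stop list is a tiny illustrative one, and `naive_stem` is a crude suffix stripper standing in for whatever stemmer the actual flow uses.

```python
import re
from collections import Counter

# Small illustrative stop list; a real flow would use a fuller one.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def naive_stem(token):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tag_cloud_counts(text, stop_words=STOP_WORDS, stem=True):
    """Count tokens after stop-word filtering and optional stemming."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in stop_words]
    if stem:
        kept = [naive_stem(t) for t in kept]
    return Counter(kept)


counts = tag_cloud_counts(
    "The reader reads books and is reading a book in the library"
)
print(counts.most_common(3))
```

The resulting counts would drive the font size of each term in the tag cloud.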

Page 16: ALA 2010 -- Jeremy York

SEASR @ Work – Entity Mash-up

• Entity Extraction with OpenNLP or Stanford NER
• Locations viewed on Google Map
• Dates viewed on Simile Timeline

Page 17: ALA 2010 -- Jeremy York

SEASR @ Work – Entities To Network

• Identify entities
• Define relationships between entities within same sentence
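
A minimal sketch of the same-sentence rule: every pair of entities that appears together in a sentence becomes an edge, weighted by how often that happens. The fixed entity list is a hypothetical stand-in for a real entity-extraction step, and the sentence splitter is deliberately rough.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical entity list standing in for the output of an NER step.
KNOWN_ENTITIES = ["Waverley", "Edinburgh", "Scott", "Ivanhoe"]


def split_sentences(text):
    """Very rough sentence splitter on '.', '!' and '?' (for brevity only)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def cooccurrence_edges(text, entities=KNOWN_ENTITIES):
    """Count how often each pair of entities appears in the same sentence."""
    edges = Counter()
    for sentence in split_sentences(text):
        present = sorted(e for e in entities if e in sentence)
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges


sample = (
    "Scott published Waverley in Edinburgh. "
    "Scott later wrote Ivanhoe. "
    "Waverley made Scott famous."
)
edges = cooccurrence_edges(sample)
for (left, right), weight in sorted(edges.items()):
    print(left, "--", right, weight)
```

The weighted edge list is the input a graph layout or network visualization would need.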

Page 18: ALA 2010 -- Jeremy York

SEASR @ Work – Text Clustering

• Clustering of Text by token counts
• Filtering options for stop words, Part of Speech
• Dendrogram Visualization
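
To make "clustering by token counts" concrete, here is a small pure-Python sketch of agglomerative clustering over token-count vectors. The slide does not say which algorithm or distance measure SEASR uses; single linkage with Euclidean distance is an assumption chosen for brevity, and stop-word and part-of-speech filtering are left out. The recorded merge sequence is the information a dendrogram draws.

```python
import re
from collections import Counter
from math import sqrt


def token_counts(text):
    """Return a token -> count mapping for one document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def distance(a, b):
    """Euclidean distance between two token-count vectors."""
    vocabulary = set(a) | set(b)
    return sqrt(sum((a[w] - b[w]) ** 2 for w in vocabulary))


def agglomerate(documents):
    """Single-linkage clustering; returns merges as (cluster_a, cluster_b, distance)."""
    vectors = {name: token_counts(text) for name, text in documents.items()}
    clusters = [(name,) for name in documents]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(
                    distance(vectors[x], vectors[y])
                    for x in clusters[i]
                    for y in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


docs = {
    "a": "ship sea storm ship",
    "b": "ship sea storm",
    "c": "castle knight sword",
}
merges = agglomerate(docs)
for left, right, d in merges:
    print(left, right, round(d, 2))
```

Documents "a" and "b" share most of their vocabulary, so they merge first; "c" joins last at a larger distance.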

Page 19: ALA 2010 -- Jeremy York

SEASR @ Work – Audio Analysis

• NEMA: Executes a SEASR flow for each run
– Loads audio data
– Extracts features for every 10 sec moving window of audio
– Loads and applies the models
– Sends results back to the WebUI
• NESTER: Annotation of Audio via Spectral Analysis
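
The "10 sec moving window" step can be pictured as sliding a fixed-length window along the sample stream and computing features for each position. A minimal sketch follows. The 5-second hop between windows, the toy feature, and the synthetic signal are all assumptions made for illustration; the slide specifies only the 10-second window length.

```python
def moving_windows(samples, sample_rate, window_seconds=10, hop_seconds=5):
    """Yield (start_time_in_seconds, window) for each full window of audio."""
    window_size = int(window_seconds * sample_rate)
    hop_size = int(hop_seconds * sample_rate)  # hop length is an assumption
    for start in range(0, len(samples) - window_size + 1, hop_size):
        yield start / sample_rate, samples[start:start + window_size]


def mean_absolute_amplitude(window):
    """Toy feature standing in for real spectral features."""
    return sum(abs(x) for x in window) / len(window)


# 30 seconds of synthetic "audio" at a deliberately tiny sample rate.
sample_rate = 4
samples = [((i % 8) - 4) / 4 for i in range(30 * sample_rate)]

features = [
    (start, mean_absolute_amplitude(window))
    for start, window in moving_windows(samples, sample_rate)
]
for start, value in features:
    print(f"{start:>5.1f}s  {value:.3f}")
```

Each (start time, feature) pair is what a trained model would then be applied to in the next step of the flow.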

Page 20: ALA 2010 -- Jeremy York

SEASR @ Work – Zotero

• Plugin to Firefox
• Zotero manages the collection
• Launch SEASR Analytics
– Citation Analysis uses the JUNG network importance algorithms to rank the authors in the citation network that is exported as RDF data from Zotero to SEASR
– Zotero Export to Fedora through SEASR
– Saves results from SEASR Analytics to a Collection
• Launch MONK Processing
– MONK DB Ingestion Workflow

Page 21: ALA 2010 -- Jeremy York

SEASR @ Work – Emotion Tracking

Goal: a visualization of this type to track emotions across a text document (leveraging flare.prefuse.org)

Page 22: ALA 2010 -- Jeremy York

Sentiment Analysis: Visualization

Page 23: ALA 2010 -- Jeremy York

Person Extraction: Scott's Waverley, Ivanhoe, and The Heart of Midlothian

Page 24: ALA 2010 -- Jeremy York

Location Extraction

Top: Walter Scott's Waverley
Bottom: Maria Edgeworth's Castle Rackrent

Page 25: ALA 2010 -- Jeremy York

Thank you!

hathitrust‐[email protected]@umich.edu