cni2012


Upload: brian-tingle

Post on 09-Dec-2014


DESCRIPTION

#cni12f SNAC slides

TRANSCRIPT

Page 2: Cni2012

2012-11-04 - SLIDE

12/11/12

Building an Archival Identity Management Network: Transforming Archival Practice and Historical Research

Daniel Pitti* and Brian Tingle**

* Institute for Advanced Technology in the Humanities

** California Digital Library

Thanks to Ray R. Larson of the University of California, Berkeley, School of Information for many of the slides here

Page 3: Cni2012

Funding and People

• Funding and Timeline
– National Endowment for the Humanities: May 2010-April 2012
– Andrew W. Mellon Foundation: May 2012-April 2014

• People
– Daniel Pitti (PI) and Worthy Martin (Institute for Advanced Technology in the Humanities, University of Virginia)
– Adrian Turner and Brian Tingle (California Digital Library, University of California)
– Ray Larson (School of Information, University of California, Berkeley)

Page 4: Cni2012

The Source Data

• EAD-encoded finding aids (guides to archival records)
– 150K
– Primarily from U.S. sources, but also U.K. and France
• Archival authority records (360K)
– National Archives and Records Administration
– State Archive of New York
– Smithsonian Institution
– British Library
– National Archives (France) & BnF
• WorldCat Archival Descriptions: 2M

Page 5: Cni2012

Library and Museum Authority Records

• Getty Vocabulary Program: Union List of Artist Names (293K personal and corporate names)
• Virtual International Authority File (16M+ cluster records)
– Contributed from around the world by national libraries and others

Page 6: Cni2012


Page 7: Cni2012

Methods and Processing

• Extract EAC-CPF records from existing EAD-encoded archival descriptions
– Extracting both creators and referenced CPF names
• Match EAC-CPF records against one another and against existing authority records (ULAN, VIAF, LCNAF)
– Enhance EAC-CPF by normalizing entries, adding alternative entries, titles (VIAF), and historical data (ULAN)
• Create a prototype historical resource and access system
– Historical data and social-professional networks
– Links to archive, library, and museum resources (by and about)

Page 8: Cni2012

Example EAD Record (Hub)

<EAD>
  <EADHEADER LANGENCODING = "ISO 639">
    <EADID>GB 0133 TAB</EADID>
    <FILEDESC>
      <TITLESTMT>
        <TITLEPROPER>Tabley Muniments</TITLEPROPER>
      </TITLESTMT>
      <PUBLICATIONSTMT>
        <PUBLISHER>John Rylands University Library of Manchester</PUBLISHER>
        <ADDRESS>
          <ADDRESSLINE>150 Deansgate</ADDRESSLINE>
          <ADDRESSLINE>Manchester</ADDRESSLINE>
          <ADDRESSLINE>... (Parts removed )…
  </FRONTMATTER>

  <ARCHDESC LEVEL = "FONDS" LANGMATERIAL = "English">
    <DID>
      <REPOSITORY>University of Manchester, John Rylands University Library of Manchester</REPOSITORY>
      <UNITID ENCODINGANALOG = "ISADG3.1.1." COUNTRYCODE = "GB" REPOSITORYCODE = "0133">GB 0133 TAB</UNITID>
      <UNITTITLE LABEL = "Title" ENCODINGANALOG = "ISADG3.1.2.">Tabley Muniments</UNITTITLE>
      <UNITDATE LABEL = "Dates of Creation" ENCODINGANALOG = "ISADG3.1.3.">19th century</UNITDATE>
      <PHYSDESC LABEL = "Extent" ENCODINGANALOG = "ISADG3.1.5.">
        <EXTENT>1.24 cu.m</EXTENT>
      </PHYSDESC>
      <ORIGINATION LABEL = "Creator" ENCODINGANALOG = "ISADG3.2.1.">
        <FAMNAME SOURCE = "NCARULES">Warren, family, of Tabley, Cheshire</FAMNAME>
        <PERSNAME SOURCE = "NCARULES">Warren, John Byrne Leicester, 1835-1895, 3rd Baron de Tabley, poet</PERSNAME>
      </ORIGINATION>
    </DID>

Page 9: Cni2012

Example EAD Record (Hub)

    <BIOGHIST ENCODINGANALOG = "ISADG3.2.2.">
      <HEAD>Administrative/Biographical History</HEAD>
      <P>The poet John Byrne Leicester Warren, later 3rd and last Baron de Tabley, of Tabley near Knutsford, Cheshire, was born in 1835, the son of the 2nd Baron de Tabley (1811-1887), and his wife, Catherina. His mother was Italian, the daughter of the count de Soglio, and Warren spent much of his early childhood with her in Italy and Greece. He was educated at Eton and Christ Church, Oxford. At Oxford he published a volume of poetry. Originally he published under the pseudonyms George F. Preston (1859-1862) and William Lancaster (1863-1868), but latterly under his own name.</P>
      <P>His early verse included <TITLE>Praeterita</TITLE> (1863), <TITLE>Eclogues and Monodramas</TITLE> (1864), <TITLE>Studies in Verse</TITLE> (1865), <TITLE>Philocletes</TITLE> (1866), and <TITLE>Orestes</TITLE> (1868). His early work was Tennysonian in style, but he was later to be influenced by both Browning and Swinburne. In 1873 he produced …. (some data removed)…

Page 10: Cni2012

Example EAD Record (Hub)

    <SCOPECONTENT ENCODINGANALOG = "ISADG3.3.1.">
      <HEAD>Scope and Content</HEAD>
      <P>The collection consists mainly of the personal papers of the 3rd Baron de Tabley. The papers reflect his interests in literature, politics, botany and numismatics and include correspondence with numerous prominent later Victorian figures. Attention should also be drawn to de Tabley’s extensive and important collection of armorial bookplates.</P>
      <P>Correspondents include Sir Mountstuart Grant Duff, Edmund Gosse, Lord Houghton, A.C.Benson, and Robert Bridges. There are volumes of Tabley's essays and verse, as well as a considerable number of notebooks and loose manuscripts of verse and other writings. There are various bundles and boxes relating to &quot;Coins&quot;, &quot;Botany&quot;, &quot;Poetry&quot;, &quot;Literary&quot;, &quot;Financial&quot; and bookplates.</P>
    </SCOPECONTENT>
    <ADD>
      <OTHERFINDAID ENCODINGANALOG = "ISADG3.4.6.">
        <P>Preliminary survey list.</P>
      </OTHERFINDAID>
      <RELATEDMATERIAL ENCODINGANALOG = "ISADG3.5.3.">
        <P>There is correspondence with the 3rd Baron de Tabley among the Edward Freeman Papers, held at JRULM. The Library also has custody of the important Tabley Book Collection.</P>
      </RELATEDMATERIAL>
      <SEPARATEDMATERIAL>
        <P>The family and estate papers of the Leicester-Warren Family of Tabley are held by Cheshire Record Office. Some of these papers were originally in the custody of the John Rylands University Library of Manchester.</P>
      </SEPARATEDMATERIAL>
    </ADD>

Page 11: Cni2012

Example EAD Record (Hub)

    <CONTROLACCESS>
      <HEAD>Index terms</HEAD>
      <GEOGNAME SOURCE = "NCARULES"><EMPH ALTRENDER = "a">Tabley Inferior</EMPH> <EMPH ALTRENDER = "a-">Cheshire SJ7378</EMPH></GEOGNAME>
      <PERSNAME SOURCE = "NCARULES"><EMPH ALTRENDER = "surname">Benson</EMPH> <EMPH ALTRENDER = "forename">Arthur Christopher</EMPH> <EMPH ALTRENDER = "dates">1862-1923</EMPH></PERSNAME>
      <PERSNAME SOURCE = "NCARULES"><EMPH ALTRENDER = "surname">Bridges</EMPH> <EMPH ALTRENDER = "forename">Robert Seymour</EMPH> <EMPH ALTRENDER = "dates">1844-1930</EMPH></PERSNAME>
      <PERSNAME SOURCE = "NCARULES"><EMPH ALTRENDER = "surname">Duff</EMPH> <EMPH ALTRENDER = "title">Sir</EMPH><EMPH ALTRENDER = "forename">Mountstuart Elphinstone Grant</EMPH> <EMPH ALTRENDER = "dates">1829-1906</EMPH> <EMPH ALTRENDER = "epithet">Knight</EMPH></PERSNAME>
      <PERSNAME SOURCE = "NCARULES"><EMPH ALTRENDER = "surname">Gosse</EMPH><EMPH ALTRENDER = "title">Sir</EMPH><EMPH ALTRENDER = "forename">Edmund William</EMPH> <EMPH ALTRENDER = "dates">1849-1928</EMPH> <EMPH ALTRENDER = "epithet">Knight</EMPH></PERSNAME>
      <PERSNAME SOURCE = "NCARULES"><EMPH ALTRENDER = "surname">Milnes</EMPH> <EMPH ALTRENDER = "forename">Richard Monckton</EMPH> <EMPH ALTRENDER = "dates">1809-1885</EMPH> <EMPH ALTRENDER = "epithet">1st Baron Houghton</EMPH></PERSNAME>
      <SUBJECT SOURCE = "LCSH"><EMPH ALTRENDER = "a">Bookplates</EMPH></SUBJECT>
      <SUBJECT SOURCE = "LCSH"><EMPH ALTRENDER = "a">Botany</EMPH></SUBJECT>
      <SUBJECT SOURCE = "LCSH"><EMPH ALTRENDER = "a">Numismatics</EMPH></SUBJECT>
      <SUBJECT SOURCE = "LCSH"><EMPH ALTRENDER = "a-">Poetry</EMPH> <EMPH ALTRENDER = "a">Modern</EMPH> <EMPH ALTRENDER = "y">19th century</EMPH></SUBJECT>
    </CONTROLACCESS>
  </ARCHDESC>
</EAD>

Page 12: Cni2012

2010-2012 Extraction Results

• Source data: 30,000 finding aids
• EAC-CPF records extracted
– LoC: 43,702 from 1,159 finding aids
– OAC: 91,811 from ~15,400
– NWDA: 22,609 from 5,160
– VH: 15,175 from 8,390
– Total 173,297
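
The extraction step behind these counts (pulling creators and referenced names out of an EAD finding aid) might look roughly like the sketch below. It is not the project's extraction code: the sample is an abbreviated, well-formed excerpt of the Tabley Muniments record, and the function name `extract_names` and the role labels "creator" and "referenced" are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated, well-formed excerpt of the Tabley Muniments example record.
SAMPLE = """
<EAD>
  <ARCHDESC LEVEL="FONDS">
    <DID>
      <ORIGINATION LABEL="Creator">
        <FAMNAME SOURCE="NCARULES">Warren, family, of Tabley, Cheshire</FAMNAME>
        <PERSNAME SOURCE="NCARULES">Warren, John Byrne Leicester, 1835-1895, 3rd Baron de Tabley, poet</PERSNAME>
      </ORIGINATION>
    </DID>
    <CONTROLACCESS>
      <PERSNAME SOURCE="NCARULES"><EMPH ALTRENDER="surname">Benson</EMPH> <EMPH ALTRENDER="forename">Arthur Christopher</EMPH> <EMPH ALTRENDER="dates">1862-1923</EMPH></PERSNAME>
      <SUBJECT SOURCE="LCSH"><EMPH ALTRENDER="a">Botany</EMPH></SUBJECT>
    </CONTROLACCESS>
  </ARCHDESC>
</EAD>
"""

# EAD elements that name corporate bodies, persons, and families (CPF).
NAME_TAGS = {"CORPNAME", "PERSNAME", "FAMNAME"}


def extract_names(ead_xml):
    """Return (role, tag, name) tuples for creators and referenced names."""
    root = ET.fromstring(ead_xml)
    results = []
    # Role labels are hypothetical: names under ORIGINATION are creators,
    # names under CONTROLACCESS are merely referenced.
    for section, role in (("ORIGINATION", "creator"), ("CONTROLACCESS", "referenced")):
        for container in root.iter(section):
            for element in container.iter():
                if element.tag in NAME_TAGS:
                    # Flatten nested EMPH parts and collapse whitespace.
                    text = " ".join("".join(element.itertext()).split())
                    results.append((role, element.tag, text))
    return results


for row in extract_names(SAMPLE):
    print(row)
```

Each tuple would become the seed of one EAC-CPF record; subjects and place names are skipped because they are not CPF entities.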

Page 13: Cni2012

Phase II preliminary results

• Unmerged SIA Henry Correspondence
– 32,988 names
• Unmerged WorldCat MARC
– 4,548,270 names

Page 14: Cni2012

Methods and Processing

• Extract EAC-CPF records from existing EAD-encoded archival descriptions
– Extracting both creators and referenced CPF names
• Match EAC-CPF records against one another and against existing authority records (ULAN, VIAF, LCNAF)
– Enhance EAC-CPF by normalizing entries, adding alternative entries, titles (VIAF), and historical data (ULAN)
• Create a prototype historical resource and access system
– Historical data and social-professional networks
– Links to archive, library, and museum resources (by and about)

Page 15: Cni2012

The Problem

• Proliferation of the forms of names
– Different names for the same person
– Different people with the same names
• Examples
– From Books in Print (semi-controlled but not consistent)
– ERIC author index (not controlled)

Page 16: Cni2012


Goethe

…etc…

Page 17: Cni2012


John Muir

Page 18: Cni2012

Library and Archive Authority Control

• Library (or bibliographic) authority control is almost exclusively about the control of names
• Archival identity control involves biographical-historical description of the CPF entity
– Descriptions based on controlled vocabularies, for example, occupations, place of birth and death
– But also biographical-historical description
  • Prose
  • Chronological list
• Archival authority control provides context for understanding records, the context of their creation, the provenance

Page 19: Cni2012

Merging EAC-CPF Records

[Diagram, boxes and arrows: EAC Repository; VIAF Repository; LCNAF Repository; ULAN Repository; Cheshire Search; Connect exactly matching records; Connect records using name authority information; Repository of connected EAC Records (MongoDB); Merge; Repository of merged EAC Records]

Page 20: Cni2012

Merging EAC-CPF Records

[Diagram, boxes and arrows: EAC Repository; VIAF Repository; Cheshire Search; Connect exactly matching records; Connect records using name authority information; Repository of connected EAC Records (MongoDB); Merge; Repository of merged EAC Records]

Page 21: Cni2012

Connect Exact Matches

• The EAC-CPF records provide the names without having to parse texts, etc.
• Allows us to use some simple methods like exact matching
– Assume identical name entries mean the same person/corporate body/family
– Enter the full names and record IDs into a database and flag IDs with the same names for merging
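
As a rough illustration of the exact-match step, the sketch below groups record IDs by identical name entry and flags every name that occurs more than once. An in-memory dictionary stands in for the database, and the record IDs and the function name `flag_exact_matches` are hypothetical.

```python
from collections import defaultdict


def flag_exact_matches(records):
    """Group record IDs by identical name entry.

    `records` is an iterable of (record_id, name_entry) pairs. Only names
    that occur in more than one record are returned, as merge candidates.
    """
    ids_by_name = defaultdict(list)
    for record_id, name in records:
        ids_by_name[name].append(record_id)
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


# Hypothetical record IDs; the name strings are character-for-character forms.
records = [
    ("loc-001", "Taylor, Zachary, 1784-1850."),
    ("oac-417", "Taylor, Zachary, 1784-1850."),
    ("nwda-22", "Zachary Taylor."),
]

print(flag_exact_matches(records))
```

Only the two identical entries are flagged; the uninverted form is missed because the comparison is purely string equality.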

Page 22: Cni2012

But…

• Exact merging assumes that archives are following LC cataloging practice in their EAD records
– There are some problems with this assumption

Page 23: Cni2012

Some failures for merging…

• Different abbreviations:
– A. & G. Carisch & C.
– A. & G. Carisch & Co.
• And spacing issues:
– A. C. Peters & Bro.
– A. C. Peters & Brother.
– A. C. Peters. (??)
– A. C.Peters & Bro.
• Completeness and alternate rules
– Tabb, John B. (John Banister), 1845-1909.
– Tabb, John Banister, 1845-1909.
• Also differing transliterations for non-Latin scripts

Page 24: Cni2012

More…

• Variant romanizations (and spacing):
– M. P. Belaieff.
– M. P. Belaïeff.
– M. P. Bieliaev.
– M.P. Belaïeff.
– M.P.Belaïeff.
• Initials vs. names:
– Zabolotskii, N.A.
– Zabolotskii, Nikolai Alekseevich, 1903-1958.
– Zabolotskii.

Page 25: Cni2012

More…

• Inverted order vs. uninverted
– Taylor, Zachary, 1784-1850.
– Zachary Taylor.
• Various combinations:
– Tchaikovsky, Peter I.
– Tchaikovsky, Pëtr Il.
– Tchaikovsky, Piotr Ilyich.
– Tchaikovsky, Pyotr Il.
– Tchaikovsky, Pyotr Ilyich.

Page 26: Cni2012

Merging EAC-CPF Records

[Diagram, boxes and arrows: EAC Repository; VIAF Repository; Cheshire Search; Connect exactly matching records; Connect records using name authority information; Repository of connected EAC Records (MongoDB); Merge; Repository of merged EAC Records]

Page 27: Cni2012

Search Authority Files

• For each name, formulate a search of the VIAF database using the Cheshire system (SGML/XML retrieval system with probabilistic and Boolean matching)
– Search both the “authoritative” and “non-authoritative” forms
– Consider any name matching a non-authoritative form to be a candidate match for the authoritative form
– Flag EAC records that match the same authority record as potential matches
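
The flagging logic can be sketched as follows. This is a simplification under stated assumptions: a plain dictionary lookup stands in for the Cheshire search of VIAF, the authority records and all IDs are invented for illustration, and the choice of which form is "authoritative" is arbitrary here.

```python
from collections import defaultdict

# Hypothetical authority records: one authoritative form plus variants.
AUTHORITY = {
    "auth-1": {
        "authoritative": "Taylor, Zachary, 1784-1850.",
        "variants": ["Zachary Taylor."],
    },
    "auth-2": {
        "authoritative": "Tchaikovsky, Pyotr Ilyich.",
        "variants": ["Tchaikovsky, Piotr Ilyich.", "Tchaikovsky, Peter I."],
    },
}


def build_index(authority):
    """Map every authoritative and non-authoritative form to its record ID."""
    index = {}
    for auth_id, record in authority.items():
        for form in [record["authoritative"], *record["variants"]]:
            index[form] = auth_id
    return index


def flag_by_authority(eac_records, authority):
    """Flag EAC record IDs whose names resolve to the same authority record."""
    index = build_index(authority)
    hits = defaultdict(list)
    for record_id, name in eac_records:
        auth_id = index.get(name)  # exact lookup stands in for a real search
        if auth_id is not None:
            hits[auth_id].append(record_id)
    return {auth_id: ids for auth_id, ids in hits.items() if len(ids) > 1}


eac_records = [
    ("loc-001", "Taylor, Zachary, 1784-1850."),
    ("nwda-22", "Zachary Taylor."),
    ("oac-900", "Tchaikovsky, Peter I."),
]
print(flag_by_authority(eac_records, AUTHORITY))
```

Unlike exact matching, the inverted and uninverted Taylor entries are now flagged together because both resolve to the same authority record.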

Page 28: Cni2012


Shingle Language Model for names

Name: Einstein Albert

Shingle sequence: ein, ins, nst, ste, tei, ein … , ert

Probability that the sequence (ins, nst, ste) follows ein is very high for the name einstein

Krishna Janakiraman and Sean Marimpietri - Biograph

NGRAM or Shingle Matching
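
A minimal sketch of shingle matching follows. Note the swap: the slide describes a language model over shingle sequences, whereas this example uses plain Jaccard overlap of character-trigram sets, which is simpler and should not be read as the Biograph method. The function names and the per-word shingling are assumptions.

```python
def shingles(name, n=3):
    """Return the set of character n-grams of each word in a name."""
    grams = set()
    for token in name.lower().split():
        if len(token) < n:
            grams.add(token)  # keep very short tokens whole
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return grams


def jaccard(a, b):
    """Jaccard similarity of two names' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


print(sorted(shingles("Einstein")))
print(jaccard("Einstein Albert", "Albert Einstein"))   # word order is ignored
print(jaccard("Einstein Albert", "Ainshtain Albert"))  # partial overlap
```

Because shingles are taken per word and compared as sets, inverted and uninverted forms score identically, while variant romanizations score somewhere between 0 and 1 and need a threshold.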

Page 29: Cni2012

Shingle Language Model for names

Name 1: Einstein Albert
Name 2: Ainshtain Albert
Name 3: Albert Einstein

[Diagram: shingle sequences for the three names, e.g. ein, ins, nst, ste, tei for Einstein; ain, ins, nsh, sht, hta, tai for Ainshtain; alb, lbe, ert for Albert]

Krishna Janakiraman and Sean Marimpietri - Biograph

Page 30: Cni2012

Merging EAC-CPF Records

[Diagram, boxes and arrows: EAC Repository; VIAF Repository; Cheshire Search; Connect exactly matching records; Connect records using name authority information; Repository of connected EAC Records (MongoDB); Merge; Repository of merged EAC Records]

Page 31: Cni2012

Merge Flagged Records

• For all of the exact matches and authority matches
– Use the authoritative form of the name
– Combine data from each match into a single EAC-CPF record
– Retain all source record IDs and information
• Finally, output the merged EAC-CPF records
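
The merge step could be sketched like this, with plain dictionaries standing in for EAC-CPF XML records. The field names (`alternative_names`, `source_ids`, `relations`), the record IDs, and the relation strings are all hypothetical.

```python
def merge_flagged(records, authoritative_name):
    """Combine flagged records into one, keeping every source record ID."""
    merged = {
        "name": authoritative_name,
        "alternative_names": [],
        "source_ids": [],
        "relations": [],
    }
    for record in records:
        merged["source_ids"].append(record["id"])
        # Non-authoritative forms survive as alternative entries.
        name = record["name"]
        if name != authoritative_name and name not in merged["alternative_names"]:
            merged["alternative_names"].append(name)
        # Union the relations without duplicating them.
        for relation in record.get("relations", []):
            if relation not in merged["relations"]:
                merged["relations"].append(relation)
    return merged


flagged = [
    {"id": "loc-010", "name": "Tabb, John B. (John Banister), 1845-1909.",
     "relations": ["creatorOf:loc-findingaid-1"]},
    {"id": "oac-220", "name": "Tabb, John Banister, 1845-1909.",
     "relations": ["referencedIn:oac-findingaid-7", "creatorOf:loc-findingaid-1"]},
]
print(merge_flagged(flagged, "Tabb, John B. (John Banister), 1845-1909."))
```

Keeping the source IDs is what makes a merge reversible if a later review finds that two records were wrongly combined.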

Page 32: Cni2012

Inputs to SNAC merging

• LoC: 43,702 EAC-CPF records derived from 1,159 finding aids
• OAC: 91,814 EAC-CPF records derived from ~15,400 finding aids
• NWDA: 24,952 EAC-CPF records derived from 5,568 finding aids
• VH: 15,175 EAC-CPF records
• Total: 175,688 input EAC records for merging
• Result: 128,781 “unique” names

Page 33: Cni2012

Another view of the numbers…

• 95,624 Person names merged from 125,555 Person records
• 31,287 Institutions merged from 47,189 Institution records
• 1,980 Families merged from 2,899 Family records

Page 34: Cni2012

Merging Conclusions

• There will not be a single merging method, but a staged set of approaches that will allow us to go from the simplest exact matches, to (we hope) reliably identifying various variant forms of a name, etc. when corroborated by contextual (date, etc.) information

Page 35: Cni2012

Next

• Developing an updateable database of merged EAC data (dumping Mongo for PostgreSQL)
– Will permit incremental addition of new data and support editing and “forced” merges
• Process the 2M WorldCat archival descriptions
• Process the 150,000 finding aids
• Convert several hundred thousand archival authority records into EAC-CPF and run them through the match/merge process

Page 36: Cni2012

Methods and Processing

• Extract EAC-CPF records from existing EAD-encoded archival descriptions
– Extracting both creators and referenced CPF names
• Match EAC-CPF records against one another and against existing authority records (ULAN, VIAF, LCNAF)
– Enhance EAC-CPF by normalizing entries, adding alternative entries, titles (VIAF), and historical data (ULAN)
• Create a prototype historical resource and access system
– Historical data and social-professional networks
– Links to archive, library, and museum resources (by and about)

Page 37: Cni2012


Outline

• User Persona

• Search and Display

• Network graph visualization

• Linked Data / RDF

• Future Plans

Page 38: Cni2012


Meet the target users

• Randy: Graduate student working on a PhD that involves biographies and the study of diplomatic families and networks. Sometimes he comes to the site looking for information on specific people; other times he is looking for information on a specific subject or event. He also TAs an undergraduate history class and sometimes has to help students find topics for papers.

• Connie: Works at an institution that contributed records to the project. Will be asking how this site would be useful to the institution's users. Wants to understand how the institution's records were used and what the added value is.

• Quincy: Library School Student working to QA record matching.

• Adele: Person doing authority work during collection processing.

• Lenny: Likes linked data, and wants to be able to mine programmatically the links that have been established.

Personas are fictional characters created to represent the different user types within a targeted demographic, attitude and/or behavior set that might use a site, brand or product in a similar way. http://en.wikipedia.org/wiki/Persona_(marketing)

Page 39: Cni2012


Outline

• User Persona

• Search and Display

• Network graph visualization

• Linked Data / RDF

• Future Plans

Page 40: Cni2012
Page 41: Cni2012

Page 42: Cni2012

Page 43: Cni2012

Page 44: Cni2012
Page 45: Cni2012
Page 46: Cni2012
Page 47: Cni2012
Page 48: Cni2012
Page 49: Cni2012
Page 50: Cni2012

Advanced limits match EAC sections

Page 51: Cni2012
Page 52: Cni2012
Page 53: Cni2012
Page 54: Cni2012
Page 55: Cni2012
Page 56: Cni2012
Page 57: Cni2012
Page 58: Cni2012
Page 59: Cni2012
Page 60: Cni2012
Page 61: Cni2012
Page 62: Cni2012
Page 63: Cni2012
Page 64: Cni2012
Page 65: Cni2012
Page 66: Cni2012
Page 67: Cni2012
Page 68: Cni2012


Outline

• User Persona

• Search and Display

• Network graph visualization

• Context widget (needs new name)

• Linked Data / RDF

• Future Plans

Page 69: Cni2012


Tinkerpop graph database stack

• Simple "property graph" model

• "JDBC for graph databases" [SNAC is using Neo4J for the graphDB]

• XPath-like "gremlin" language for graph queries

• REST interfaces with "Rexster"

• For me, this was 10 to 100 times easier than using RDF
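
To make the "property graph" idea concrete: vertices and edges each carry arbitrary key/value properties, and edges are directed and labeled. The sketch below is plain Python, not the TinkerPop, Gremlin, or Neo4j API; the class, its methods, the vertex IDs, and the edge label `correspondedWith` are all assumptions for illustration, using people named in the example finding aid.

```python
class PropertyGraph:
    """Toy property graph: vertices and labeled, directed edges with properties."""

    def __init__(self):
        self.vertices = {}  # vertex id -> dict of properties
        self.edges = []     # list of {"out", "label", "in", "props"}

    def add_vertex(self, vertex_id, **props):
        self.vertices[vertex_id] = props

    def add_edge(self, out_vertex, label, in_vertex, **props):
        self.edges.append(
            {"out": out_vertex, "label": label, "in": in_vertex, "props": props}
        )

    def out(self, vertex_id, label=None):
        """Follow outgoing edges from a vertex, optionally filtered by label."""
        return [
            edge["in"]
            for edge in self.edges
            if edge["out"] == vertex_id and (label is None or edge["label"] == label)
        ]


graph = PropertyGraph()
graph.add_vertex("warren", name="Warren, John Byrne Leicester, 1835-1895", type="person")
graph.add_vertex("gosse", name="Gosse, Edmund William, 1849-1928", type="person")
graph.add_vertex("bridges", name="Bridges, Robert Seymour, 1844-1930", type="person")
graph.add_edge("warren", "correspondedWith", "gosse", source="GB 0133 TAB")
graph.add_edge("warren", "correspondedWith", "bridges", source="GB 0133 TAB")

for vertex_id in graph.out("warren", "correspondedWith"):
    print(graph.vertices[vertex_id]["name"])
```

A traversal here is just "start at a vertex, follow edges with a given label", which is the kind of query a path-style graph language expresses in one line.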

Page 70: Cni2012
Page 71: Cni2012
Page 72: Cni2012
Page 73: Cni2012
Page 74: Cni2012
Page 75: Cni2012
Page 76: Cni2012
Page 77: Cni2012
Page 78: Cni2012
Page 79: Cni2012


Outline

• User Persona

• Search and Display

• Network graph visualization

• Linked Data / RDF

• Future Plans

Page 80: Cni2012


What is Linked Open Data?

• W3C Semantic Web Technology Stack

• Web of atomized data, not a web of documents

• RDF; OWL ontologies; SPARQL queries; triple/quad/quint stores

• httpRange-14; content negotiation; CURIE

• No restrictions on data use; free and easy license

• Lenny wants it, but does Randy?

Page 81: Cni2012


What is Linked Open Data?

• Getting to the good stuff

• Blue underlined text

• Pulling in data from multiple sources, in an intelligent way, into a "document"

• Understand and discover relationships

• Open access for research, education, private study and other fair use

Page 82: Cni2012

RDFa owl:sameAs

Page 83: Cni2012

HTML 5 microdata in chron list

Page 84: Cni2012

Thanks Ed Summers!

RDF of the social graph

Page 85: Cni2012
Page 86: Cni2012
Page 87: Cni2012
Page 89: Cni2012
Page 91: Cni2012

My opinion on the use cases for W3C RDF tech

• Good for publishing data

• Good for controlled vocabularies

• Data models?

• Most people with open source RDF-store type systems do the real stuff with Solr

• Consider a graph database

Page 92: Cni2012
Page 93: Cni2012


Outline

• User Persona

• Search and Display

• Linked Data / RDF

• Network graph visualization

• Future Plans

Page 94: Cni2012


Future Plans

• Conduct assessment activities involving members of target audiences to establish mental model of users for design work

• Scale interface to millions of names

• Visualizations useful and integrated (network and geospatial)

• Stable URLs between batches for linked data

• Social and personalization features (gateway to crowdsourcing)

• Integration with local systems (such as with the context widget)

Page 95: Cni2012


• Photo attribution http://www.flickr.com/photos/dsevilla/139656712/in/photostream/

• http://xtf.cdlib.org/

• http://code.google.com/p/eac-graph-load/source/browse/README.txt

• http://tinkerpop.com/

• http://thejit.org/

• https://github.com/tingletech/snac-related-widget