Publications, Identity, and Disambiguation
NIH Workshop on Identifiers and Disambiguation in Scholarly Work
Denise Beaubien Bennett
Gainesville, FL
March 18, 2010
"Until George W. Bush became President, the first President Bush never used his middle initials," George H.W. Bush's chief of staff, Jean Becker, says. "But once his son became President, the elder Bush began to realize that it was necessary, to help identify which President Bush was being referred to."
• How confident are we that all mentions of plain “George Bush” refer to Senior?
• Remember that George H.W. Bush had several roles: CIA Director, Ambassador to China, Vice President
Automated disambiguation
• Scopus
• Web of Science
• CiteSeer
• DBLP author search engine – query interpreted as set of prefixes (implicit truncation) of name parts
• Author-ity
• improving recall and precision over time!
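The prefix interpretation described for the DBLP author search can be sketched roughly as follows. This is only an illustration, not DBLP's actual code: the function name `prefix_match`, the case-insensitive and order-independent matching, and the sample names are all assumptions.

```python
def prefix_match(query: str, author_name: str) -> bool:
    """Return True if every query token is a prefix of some name part.

    Assumption: matching is case-insensitive and ignores token order;
    the real search engine may behave differently.
    """
    name_parts = author_name.lower().split()
    return all(
        any(part.startswith(token) for part in name_parts)
        for token in query.lower().split()
    )


# Hypothetical index of author names
authors = ["Lei Zhang", "Li Zhang", "Liang Zhao"]

# "li zh" is implicitly truncated, so it also matches "Liang Zhao"
hits = [name for name in authors if prefix_match("li zh", name)]
```

Implicit truncation raises recall (short queries still find the author) at some cost in precision, since unrelated names that share the prefixes are returned too.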
Scopus – snapshot from 2007
2007 – one solid cluster, 6 ambiguous outliers
Scopus in 2010: improving
2010 – one solid cluster, 3 ambiguous outlier names
Web of Science
• Their example shows incompleteness of disambiguation; continue searching all name variations, with and without the apostrophe
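Searching every apostrophe variation by hand is tedious, so the variants could be generated. A minimal sketch, assuming only three forms matter (apostrophe kept, dropped, or replaced by a space); the function name `apostrophe_variants` is hypothetical and not part of any Web of Science tool.

```python
def apostrophe_variants(surname: str) -> list[str]:
    """Generate search variants of a surname that contains an apostrophe."""
    # Normalize a curly apostrophe to a straight one first
    normalized = surname.replace("\u2019", "'")
    if "'" not in normalized:
        return [normalized]
    return [
        normalized,                    # O'Brien
        normalized.replace("'", ""),   # OBrien
        normalized.replace("'", " "),  # O Brien
    ]


variants = apostrophe_variants("O'Brien")
```

Each variant would then be submitted as a separate author search and the result sets merged.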
WoS Distinct Author Sets – clustering is improving
DIY disambiguation
Web of Science
CiteSeer – disambiguated (but not perfect)
unclustered items are mostly typos
alternate name resolves to preferred name
Author-ity clusters
Author-ity pairwise ranking
Author-ity ranking results
Author-ity ranking – the bottom
super-high probability through 130.
less than 50% with title far off topic
Voluntary Profiles
Author (or proxy) created and maintained
• Compliance challenges with ingestion and updating
• Usually include numbers
• COS Expertise - 480,000 profiles
• ResearcherID (to be used by ORCID)
• RePEc Author Service in IDEAS
COS Community of Science
useful tools
18 months ago
ResearcherID
author-controlled profile
ResearcherID - features
value added from WoS – only works on cites in WoS
ResearcherID dups
keywords helpful when present
RePEc Author Service
• Relies on authors to maintain their profiles and identify articles as written by them
• 23,000+ registered authors and 7000+ registered non-authors
from 2007: dups & funnies
disambiguated index is much cleaner in 2010
they track lost and deceased authors
In development
• Cooperative Identities Hub
• ISNI
• ORCID
Manual checking
• no guarantee of perfection
• scalability
• MathSciNet
• Mathematics Genealogy Project
• ACM
MathSciNet clusters all papers but preserves name on piece
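One way to picture "cluster all papers but preserve the name on the piece" is a record that keeps an internal author identifier separate from the name as printed. The class and field names below are hypothetical and do not reflect MathSciNet's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    title: str
    name_as_published: str  # kept exactly as printed on the piece


@dataclass
class AuthorCluster:
    author_id: int           # hypothetical internal identifier
    preferred_name: str
    papers: list[Paper] = field(default_factory=list)

    def name_variants(self) -> set[str]:
        """All name forms under which this author has published."""
        return {paper.name_as_published for paper in self.papers}


# Illustrative data only
cluster = AuthorCluster(author_id=1, preferred_name="Zhang, Lei")
cluster.papers.append(Paper("Paper A", "Zhang, L."))
cluster.papers.append(Paper("Paper B", "Zhang, Lei"))
```

Because the published form is never overwritten, a wrong merge can be undone later without losing what appeared on the paper.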
However…
• Even the small, discipline-specific database of MathSciNet cannot corral all the duplicate names.
– only half of the entries disambiguated for: Zhang, Lei; Zhang, Li
• Red herring: how many people only author one paper in their career???
– about 46% in Medline (sec. 3.5)
Many people, same name
MGP -
…
ACM – discloses the weighting
ACM Digital Library – not quite yet
After we disambiguate, we can:
• Link / cluster records within the silo – highlighting the preferred version
• Link headings (or records) across silos
• Analyze / repackage / mashup the data
Linking within a silo
• more examples -- inspiration from outside the university/research world
Linking in Community-maintained IMDB
others born the same day or year or place
links to people, films, etc. credit!
Community-maintained - MusicBrainz
members & years
Community-maintained - MusicBrainz
please – no “eyes” no “pears” no hyphen
Linking across silos
• VIAF – Virtual International Authority File
• Getty ULAN – Union List of Artist Names
• Names Project - UK individuals and institutions – for benefit of institutional and subject repositories
• BKN People – using Bibliographic Ontology (BIBO) to aggregate author silos
• rely on local silos for maintenance
VIAF – linking across files
authority record in BNF (France) matches these other files
Getty Union List of Artist Names
• ULAN
• Used mostly by museums
• Merges multiple authority files
• Displays all options and sources
• Guides to preferred name
name variations
preferred among options
relationships
sources
Names project (UK)
Names Project (UK)
BKN People: uses BIBO
BKN People: uses BIBO
Analyzing / repackaging the data
– discover outliers through analysis
• what’s wrong with this picture?
– run the outliers by human checkers
– use the analyzed results to refine the disambiguation
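The analyze-then-check loop above can be sketched with one simple signal. The example assumes publication year is the feature being analyzed and that a gap of more than 30 years from the cluster median is suspicious; both the threshold and the function name `flag_outliers` are illustrative choices, not taken from any of the systems shown.

```python
from statistics import median


def flag_outliers(pub_years: dict[str, int], max_gap: int = 30) -> list[str]:
    """Return record IDs whose year is more than max_gap from the cluster median.

    Assumption: the 30-year default is arbitrary and would need tuning
    per discipline.
    """
    mid = median(pub_years.values())
    return sorted(
        record_id
        for record_id, year in pub_years.items()
        if abs(year - mid) > max_gap
    )


# Hypothetical cluster of records attributed to one author
cluster = {"rec1": 1998, "rec2": 2001, "rec3": 2005, "rec4": 1931}

# Records to hand to a human checker
needs_review = flag_outliers(cluster)
```

Records the human checker confirms as wrong would be split out of the cluster, and the confirmed cases can inform the next round of automated disambiguation.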
WorldCat Identities
more than birth/death dates
the fun stuff
Anne O’Tate (Author-ity) analyze by address
note the fractions of addresses
Anne O’Tate (Author-ity) analyze by topic
neat clustering, compared to “Topics” with 324 results
analyze – author’s impact within silo
IDEAS / RePEc
MathSciNet collaboration distance
the Kevin Bacon of Math
How close are these authors?
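Collaboration distance is the length of the shortest chain of coauthorships between two people, which a breadth-first search can compute once authors are disambiguated. The sketch below is not MathSciNet's implementation; the graph representation (a dict of coauthor sets) and the single-letter author labels are assumptions for illustration.

```python
from collections import deque


def collaboration_distance(coauthors: dict[str, set[str]], start: str, goal: str):
    """Shortest coauthorship path length between two authors, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        author, dist = queue.popleft()
        for other in coauthors.get(author, set()):
            if other == goal:
                return dist + 1
            if other not in seen:
                seen.add(other)
                queue.append((other, dist + 1))
    return None  # no chain of coauthors connects the two


# Hypothetical coauthor graph: A wrote with B, B wrote with C, D has no coauthors
graph = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B"},
    "D": set(),
}
```

The result is only as good as the disambiguation underneath it: two people merged under one name create false shortcuts, and one person split in two creates false gaps.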
DBLP Vis – coauthor intensity
see # papers with coauthor when mouse-over a year
DBLP Vis – coauthor timecolor
see fatter boxes on graph when mouse-over a year
Features to help disambiguate
• affiliation (how many addresses/year?)
• email address
• coauthors
• keywords from source or all metadata
• dates - degree years, expected range
• web page – URL and other data
• caution - what fuzziness/distance is acceptable? differences by disciplines?
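A rough sketch of how several of these features might be combined into a single same-author score for a pair of records. The weights, field names, and the function `pair_score` are invented for illustration; this is not the probabilistic model used by Author-ity or any other system mentioned here.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap of two sets, 0.0 when both are empty."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def same_value(a, b) -> bool:
    """True only when both values are present and equal."""
    return a is not None and a == b


def pair_score(rec_a: dict, rec_b: dict) -> float:
    """Combine feature agreement into a rough score in [0, 1].

    Assumption: the weights below are arbitrary and would differ by discipline.
    """
    score = 0.4 * jaccard(rec_a["coauthors"], rec_b["coauthors"])
    score += 0.2 * jaccard(rec_a["keywords"], rec_b["keywords"])
    score += 0.2 * same_value(rec_a["affiliation"], rec_b["affiliation"])
    score += 0.2 * same_value(rec_a["email"], rec_b["email"])
    return score


# Hypothetical records sharing a name
rec_1 = {"coauthors": {"Smith J", "Lee K"}, "keywords": {"genomics"},
         "affiliation": "Univ X", "email": "a@x.example"}
rec_2 = {"coauthors": {"Smith J"}, "keywords": {"genomics", "proteomics"},
         "affiliation": "Univ X", "email": None}

score = pair_score(rec_1, rec_2)
```

The cutoff above which two records are treated as the same author is exactly the "acceptable fuzziness" question raised in the caution, and a missing email is scored as no evidence rather than as a mismatch.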
Use with care: one author, many interests
For contemplation and discussion
Assigning numbers
• Centralized numbering system – governance issues, unpalatable to some
• Individual small silo numbering – can be highly accurate
• Record linking across files – easily accomplished
• Getting started -- authors could include number(s) with all contact info
Trustworthiness
• Am I in control of all of my publications?
• If I’m logged in (to ResearcherID, via my university account, etc.) and I indicate “these items are mine,” should you trust my accuracy?
• Have I captured all of my items?
– variants on my name
– items I forgot
– items credited without my awareness
Issues to explore
• Ingestion vs. maintenance – very different problems
– author compliance needed?
• De-duplication (within and across silos)
• Management and cooperation for updating
• Scalability
• Automated vs. manual techniques
• Optimizing computational performance
• Long tail of one-hit authors (how much attention?)
Researchers, projects, products, models
• Great review (by the Author-ity folks): Smalheiser NR, Torvik VI. (2009) Author name disambiguation.
Databases and those who created or tinkered with them
• MathSciNet
• ULAN
• DBLP - Han
• CiteSeer – Giles, Han
• IMDB – Malin
• ANAC – Levy sheet music
• Medline – Torvik and Smalheiser
• D-Dupe - Getoor
• rexa.info – McCallum
• VIAF - Hickey