Using the Memento Framework to Assess Content Drift in Scholarly Communication
TRANSCRIPT
Memento to Assess Content Drift in Scholarly Communication
@mart1nkle1n
IIPC WAC, 06/16/2017, London, UK
Using the Memento Framework
to Assess Content Drift
in Scholarly Communication
Acknowledgements:
Shawn Jones, Harihar Shankar (LANL)
Richard Tobin, Claire Grover (University of Edinburgh)
Andy Jackson (British Library)
Martin Klein @mart1nkle1n
Herbert Van de Sompel @hvdsomp
Research Library
Los Alamos National Laboratory
Link Rot
Content Drift
http://dl00.org
2000
http://dl00.org
2004
http://dl00.org
2005
http://dl00.org
2008
Content Drift
(in legal documents)
Content Drift
(in scholarly articles)
Referenced in
http://dx.doi.org/10.1016/j.nuclphysa.2009.05.110
published on August 15th 2009
May 8th 2009 vs. August 27th 2009
Referenced in
http://arxiv.org/abs/astro-ph/9707064
published on July 4th 1997
June 7th 1997 vs. today
arXiv Corpus
[Figure: articles and URI references per year, 1997 to 2011]
http://hiberlink.org/
Definition:
• Link Rot + Content Drift = Reference Rot
Observation:
• Scholarly articles reference web at large resources
• Links to these resources are subject to Reference Rot
Problem:
• Threat to integrity of the web-based scholarly record
• Resources do not have the same sense of fixity as, e.g., journal articles
• Resources’ custodianship is different in terms of long-term archiving, integrity, and access
http://dx.doi.org/10.1371/journal.pone.0115253
Focus: Content Drift
http://dx.doi.org/10.1371/journal.pone.0167475
Study Dataset
• 3.5 million articles from arXiv, Elsevier, PMC
• Published between Jan 1997 and Dec 2012
• Converted from PDF to XML
• Extraction of URIs to web at large resources (>1 million)
• Keep track of articles’ publication date
Novel Approach to Assess Content Drift
Step 1: Find Mementos
• ~ 1 million URI references
• ~ 650k Memento Pre/Post pairs discovered via Memento
https://mementoweb.org
https://tools.ietf.org/html/rfc7089
t-1    t    t+1
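A minimal sketch of the Pre/Post selection in Step 1, not the study's actual code. It assumes the Memento datetimes for a URI have already been collected (for example from a TimeMap) and that t is the article's publication date. The function names `accept_datetime` and `pre_post_pair` are hypothetical. The dates in the usage lines are the ones from the nuclphysa example slide earlier in the deck.

```python
from bisect import bisect_left
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime(dt):
    """Format an aware datetime as an RFC 7089 Accept-Datetime header value."""
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)


def pre_post_pair(memento_datetimes, t):
    """Return (pre, post): the closest Memento before t and the closest at or after t.

    Either element is None when no Memento exists on that side of t.
    """
    ordered = sorted(memento_datetimes)
    i = bisect_left(ordered, t)
    pre = ordered[i - 1] if i > 0 else None
    post = ordered[i] if i < len(ordered) else None
    return pre, post


# Publication date and capture dates taken from the example slide.
published = datetime(2009, 8, 15, tzinfo=timezone.utc)
captures = [
    datetime(2009, 8, 27, tzinfo=timezone.utc),
    datetime(2009, 5, 8, tzinfo=timezone.utc),
]
pre, post = pre_post_pair(captures, published)
print(accept_datetime(published))
print(pre.date(), post.date())
```

A URI reference would only yield a Pre/Post pair when both values are present. Treating a capture made exactly at t as Post is a choice of this sketch.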
Step 2: Select Representative Mementos
• Apply content similarity measures
• How similar is representative?
Content Similarity Measures
• Compute normalized scores (values between 0...100) for:
• Simhash
• Jaccard
• Sørensen-Dice
• Cosine
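The four measures can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: text is tokenized by lowercasing and splitting on whitespace, Jaccard and Sørensen-Dice work on token sets, cosine works on term counts, and the Simhash fingerprint is 64 bits built from SHA-256 token hashes. All function names are hypothetical. Every score is scaled to 0...100.

```python
import hashlib
import math
from collections import Counter


def tokens(text):
    """Assumed tokenization: lowercase, split on whitespace."""
    return text.lower().split()


def jaccard(a, b):
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 100.0
    return 100.0 * len(sa & sb) / len(sa | sb)


def sorensen_dice(a, b):
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 100.0
    return 100.0 * 2 * len(sa & sb) / (len(sa) + len(sb))


def cosine(a, b):
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    if not ca and not cb:
        return 100.0
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values()) * sum(v * v for v in cb.values()))
    return 100.0 * dot / norm if norm else 0.0


def simhash(text, bits=64):
    """Weighted bit vote over per-token hashes; returns an integer fingerprint."""
    weights = [0] * bits
    for tok, count in Counter(tokens(text)).items():
        h = int.from_bytes(hashlib.sha256(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            weights[i] += count if (h >> i) & 1 else -count
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


def simhash_similarity(a, b, bits=64):
    """Share of fingerprint bits that agree, scaled to 0...100."""
    distance = bin(simhash(a, bits) ^ simhash(b, bits)).count("1")
    return 100.0 * (bits - distance) / bits


pre_text = "call for papers digital libraries 2000"
post_text = "call for papers digital libraries 2000"
print(jaccard(pre_text, post_text), sorensen_dice(pre_text, post_text),
      cosine(pre_text, post_text), simhash_similarity(pre_text, post_text))
```

Identical inputs score 100 on all four measures, which is the condition the next slide uses to call a Memento pair representative.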
Representative Mementos
• Idea
• If perfect score in all 4 similarity measures, Memento Pre and Post are the same
• Sanity check needed
• Via HTTP headers: ETag and Last-Modified
• If same for Pre and Post Memento: HTTP-same
• Sanity check passed!
• 98.88% of Memento pairs that are HTTP-same have a perfect score in all 4 similarity measures
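The two checks on this slide can be sketched as below. The slide does not say whether both headers must match or whether one is enough; this sketch assumes both ETag and Last-Modified must be present and equal. The function names and the header dictionaries are hypothetical.

```python
def is_representative(scores):
    """True when every similarity measure reports a perfect score of 100."""
    return len(scores) > 0 and all(score == 100 for score in scores.values())


def http_same(pre_headers, post_headers):
    """True when ETag and Last-Modified are present and equal in both responses.

    Assumption of this sketch: both headers are required. Header names are
    compared case-insensitively, as HTTP field names are case-insensitive.
    """
    pre = {name.lower(): value for name, value in pre_headers.items()}
    post = {name.lower(): value for name, value in post_headers.items()}
    for name in ("etag", "last-modified"):
        if name not in pre or name not in post or pre[name] != post[name]:
            return False
    return True


# Hypothetical header values for a Pre/Post pair.
pre_headers = {"ETag": '"abc123"', "Last-Modified": "Fri, 08 May 2009 10:00:00 GMT"}
post_headers = {"etag": '"abc123"', "last-modified": "Fri, 08 May 2009 10:00:00 GMT"}
scores = {"simhash": 100, "jaccard": 100, "sorensen_dice": 100, "cosine": 100}
print(http_same(pre_headers, post_headers), is_representative(scores))
```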
Step 2: Select Representative Mementos
• ~ 313k referenced URIs have representative Mementos
Representative Mementos in arXiv
[Figure panels: arXiv, Elsevier, PMC]
Step 3: Dereference Live Web Version of URI
• 241k out of 313k URIs have a live web version
Step 4: Representative Memento vs. Live Version
• Apply content similarity measures
• Bin results into 6 clusters
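A sketch of the binning in Step 4. The slides give neither the bin boundaries nor how the four scores are combined, so both are assumptions here: the aggregate is the plain mean of the four scores, and `HYPOTHETICAL_EDGES` splits 0...100 into six bins with a perfect score in a bin of its own. All names are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

# Assumed boundaries, NOT from the slides: bins are
# [0,20), [20,40), [40,60), [60,80), [80,100), and exactly 100.
HYPOTHETICAL_EDGES = [20, 40, 60, 80, 100]


def aggregate(scores):
    """Assumed aggregation: arithmetic mean of the individual similarity scores."""
    return sum(scores) / len(scores)


def bin_index(score, edges=HYPOTHETICAL_EDGES):
    """Index (0..len(edges)) of the bin that holds the given score."""
    return bisect_right(edges, score)


def histogram(score_sets, edges=HYPOTHETICAL_EDGES):
    """Count how many URIs fall into each bin."""
    return Counter(bin_index(aggregate(scores), edges) for scores in score_sets)


# Made-up scores for three URIs (representative Memento vs. live version).
example = [
    [100, 100, 100, 100],  # unchanged
    [50, 50, 50, 50],      # substantially changed
    [100, 100, 100, 96],   # slightly changed
]
print(histogram(example))
```

Passing the edges as a parameter keeps the bin boundaries easy to swap for the ones actually used in the study.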
Aggregate Similarity Score
Good: 23.7% of URIs have *not* drifted!
Bad: 3/4 URIs *have* drifted!
Content Drift & Link Rot Over Time - arXiv
[Figure panels: arXiv, Elsevier, PMC]
Take-Aways
1. Scholarly articles increasingly contain URI references to web at
large resources.
2. Such resources are subject to reference rot (link rot + content drift).
3. Custodians of these resources are typically not overly concerned
with archiving of their content and longevity of the scholarly record.
4. Spoiler: Authors, publishers, web archives, and other parties can
help tackle this problem (see my lightning talk + poster on Robust
Links).
Using the Memento Framework
to Assess Content Drift
in Scholarly Communication
Martin Klein @mart1nkle1n
Herbert Van de Sompel @hvdsomp
Research Library
Los Alamos National Laboratory