![Page 1: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/1.jpg)
Link Rot + Content Drift
STM_2015
1 December 2015
Funded by the Andrew W. Mellon Foundation
Peter Burnhill EDINA, University of Edinburgh
for Hiberlink Team at University of Edinburgh & LANL Research Library
![Page 2: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/2.jpg)
Link Rot + Content Drift = Reference Rot
![Page 3: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/3.jpg)
Link Rot
‘Link Rot’ is known to be scary
![Page 4: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/4.jpg)
Content Drift: when the content at the end of a URI has changed
Screenshots of http://dl00.org in 2000, 2004, 2005 and 2008
(a) Over time, the same URI pointed to different (often unrelated) web pages
(b) Dynamic content, such as values on a webpage, changes over time
In both cases the content is not the same as when first seen and referenced by the author
![Page 5: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/5.jpg)
= Reference Rot
“when links to web resources
no longer point to what they once did”
This is a Threat to the Integrity of the Scholarly Record.
![Page 6: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/6.jpg)
This is a report of the ‘Hiberlink’ Investigation
Project: 2 years, March 2013 to June 2015
Funder: Andrew W. Mellon Foundation
Partners: University of Edinburgh (EDINA & Language Technology Group, School of Informatics); Los Alamos National Laboratory (Research Library)
1. Defined Reference Rot & the Threat it posed
2. Generated large-scale evidence to measure the extent of Reference Rot and the way in which it exists & undermines the Scholarly Record
3. Then envisaged a potential & practical Remedy
![Page 7: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/7.jpg)
2. Generating Evidence via Large-scale Enquiry
c. 400,000 articles across three large corpora: ArXiv, PMC & Elsevier
This involved
• converting PDFs into XML
• locating references in the body of the text as well as under ‘References’
• extracting each and every URL
• then using a ‘white list’ of publisher websites to sort each URL into one of two groups:
• Scholarly Web
• Web-at-large
=> over a million web-at-large references
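The extraction and white-list steps above can be sketched in a few lines of Python. This is only an illustration, not the Hiberlink pipeline: the host list `SCHOLARLY_HOSTS`, the regular expression and the function names are all assumptions made for the example, and it starts from plain text rather than from XML converted out of PDF.

```python
import re
from urllib.parse import urlsplit

# Hypothetical white list of publisher / scholarly hosts (illustrative only).
SCHOLARLY_HOSTS = {"doi.org", "dx.doi.org", "arxiv.org", "sciencedirect.com"}

# Simple pattern for http(s) URLs; real reference text needs more care.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text):
    """Return every http(s) URL in the text, trimming trailing punctuation."""
    return [m.group(0).rstrip(".,;") for m in URL_PATTERN.finditer(text)]


def is_scholarly(url, white_list=SCHOLARLY_HOSTS):
    """True if the URL's host is on the white list (or a subdomain of it)."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in white_list)


def split_references(text):
    """Separate URLs into (scholarly, web_at_large) lists."""
    scholarly, web_at_large = [], []
    for url in extract_urls(text):
        (scholarly if is_scholarly(url) else web_at_large).append(url)
    return scholarly, web_at_large
```

Only the second list, the web-at-large references, is the subject of the questions asked in the rest of the study.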
![Page 8: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/8.jpg)
Focus on References to the Wild Web in Scholarly Communication

| | => Scholarly | => To Web at large |
|---|---|---|
| Link Rot | DOI, HTTP version of DOI | ‘Web today, gone tomorrow’ |
| Content Decay | Has ‘fixity’ | Need to add fixity to the dynamic |
| Archiving | CLOCKSS, Portico, LOCKSS, etc, as per Keepers Registry … http://thekeepers.org | Focus for Hiberlink |
![Page 9: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/9.jpg)
Scholarly Articles increasingly link to
Web Resources, not just back to other Articles
PMC corpus Elsevier corpus
![Page 10: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/10.jpg)
We then began to ask questions of those URIs
① Is the URI still on the ‘Live Web’?
• Allowed up to a maximum of 50 redirects
② Is there a ‘Memento’ of that content in the ‘Archived Web’?
• Internet Archive, archive.is (archive.today), Archive-It, BL Web Archive, UK National Archives Web Archive & Icelandic National Archive
③ Does what exists at the end of that URI correspond to the
content that the author intended?
Memento: a prior version, what the Original Resource was like at some time in the past.
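Question ① can be illustrated with a small redirect follower that gives up after 50 hops, the cap stated above. This is a sketch under assumptions, not the project's crawler: `fetch` is a hypothetical callable supplied by the caller that returns a status code and an optional `Location` value, so the logic can be shown without depending on any particular HTTP library.

```python
from urllib.parse import urljoin

MAX_REDIRECTS = 50  # cap on redirects, as stated on the slide

REDIRECT_CODES = (301, 302, 303, 307, 308)


def resolve_live(uri, fetch, max_redirects=MAX_REDIRECTS):
    """Follow redirects starting at `uri`.

    `fetch` is a hypothetical callable: fetch(uri) -> (status, location_or_None).
    Returns (final_uri, status), or (last_uri, None) if the cap is exceeded.
    """
    current = uri
    for _ in range(max_redirects + 1):
        status, location = fetch(current)
        if status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the current URI.
            current = urljoin(current, location)
            continue
        return current, status
    return current, None


def is_live(uri, fetch):
    """Treat a URI as on the live web if it ends in a 2xx response."""
    _, status = resolve_live(uri, fetch)
    return status is not None and 200 <= status < 300
```

Note that a 2xx answer only settles question ①. It says nothing about question ③, since a drifted page also returns a success code.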
![Page 11: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/11.jpg)
Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One in Five
Articles Suffers from Reference Rot. PLoS ONE 9(12): e115253. doi:10.1371/journal.pone.0115253
https://doi.org/10.1371/journal.pone.0115253
PMC corpus
Results: Most Referenced URIs at risk, Many are Lost,
within 14 days of publication date!
3 in 4 are at risk of loss
A fifth are lost forever!
![Page 12: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/12.jpg)
This is also true for what is commercially published:
within 14 days of publication date!
Elsevier corpus
3 in 4 are at risk of loss
A third are lost forever!
Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One in Five
Articles Suffers from Reference Rot. PLoS ONE 9(12): e115253. doi:10.1371/journal.pone.0115253
https://doi.org/10.1371/journal.pone.0115253
![Page 13: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/13.jpg)
Content Drift (UK Web Archive, BL)
Andy Jackson (2015) Ten years of the UK web archive: what have we saved?
http://netpreserve.org/sites/default/files/attachments/2015_IIPC-GA_Slides_03_Jackson.pptx
![Page 14: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/14.jpg)
=> Content of Citations Rot over Time!!
![Page 15: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/15.jpg)
… meaning rotten references for the reader
![Page 16: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/16.jpg)
Rot in References means a Defective Article!
… and sale of rotten goods
![Page 17: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/17.jpg)
Remedy for fish: Quick Freeze & Archive
![Page 18: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/18.jpg)
Remedy for References: Snapshots of Web Content
When best to intervene within 3 workflows:
① Preparation -> Study -> Compose -> Submission
② Publication -> Editing -> (Revision) -> Acceptance -> Issue
③ Access platform/post-publication -> Reader Access -> Use
![Page 19: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/19.jpg)
Clearly best at the earliest moment of capture
![Page 20: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/20.jpg)
… when the Authors are trawling for content
![Page 21: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/21.jpg)
… for what an Author regards as significant
![Page 22: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/22.jpg)
… or needs to provide as evidence
![Page 23: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/23.jpg)
3. Potential & Practical Remedy: Take a Snapshot & put in safe place until needed
a) Use web-scale archives that support on-demand creation of snapshots of URIs:
– archive.today; Internet Archive; perma.cc; webcitation.org
b) Embed this action in the software used in basic workflows:

| Activity | Actor | Snapshot Quality |
|---|---|---|
| 1. Preparation | Author/reference tool | best |
| 2. Submission/Issue | Editor/manuscript system | good |
| 3. Access (post-publication) | Aggregator/publisher platform | better late than not |
| 4. Shelving | Librarian/IR, journal archive | better than nothing |
![Page 24: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/24.jpg)
‘Best’: help authors do the right thing!
① Preparation -> Study -> Compose -> Submission
a) We put focus on note-taking software: e.g. EndNote, Mendeley, Reference Manager, RefMe, Zotero
b) We developed a plug-in for Zotero [open source]
• So good things could happen under the hood!
![Page 25: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/25.jpg)
Actionable Metadata: Use a ‘Robust Link’
• Robust links are modified <a> HTML elements
a) Take a simple URI, to an article in the New Yorker magazine (say)
• http://www.newyorker.com/magazine/2015/01/26/cobweb
b) Augment the Link with Archive URI and Datetime
• http://web.archive.org/web/20150219094636/http://www.newyorker.com/magazine/2015/01/26/cobweb
• Archive timestamp: 2015-02-19T09:46:36
=> & so construct & cite the Robust Link:
• <a href="http://www.newyorker.com/magazine/2015/01/26/cobweb" data-versionurl="http://web.archive.org/web/20150219094636/http://www.newyorker.com/magazine/2015/01/26/cobweb" data-versiondate="2015-02-19T09:46:36">Cobweb</a>
Herbert Van de Sompel et al. (2015) Robust Links - Link Decorations http://robustlinks.mementoweb.org/spec/
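The construction on this slide is mechanical, so a short Python sketch may help. It uses only the two attributes shown above (`data-versionurl` and `data-versiondate`). The function names are made up for the example, and the Robust Links specification remains the authority on the exact attribute rules.

```python
from datetime import datetime
from html import escape


def wayback_timestamp_to_iso(stamp):
    """Convert a 14-digit archive timestamp (YYYYMMDDhhmmss) to ISO 8601."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").isoformat()


def robust_link(original_uri, version_uri, version_date, text):
    """Build an <a> element carrying the original URI, snapshot URI and datetime."""
    return '<a href="{}" data-versionurl="{}" data-versiondate="{}">{}</a>'.format(
        escape(original_uri, quote=True),
        escape(version_uri, quote=True),
        escape(version_date, quote=True),
        escape(text),
    )


original = "http://www.newyorker.com/magazine/2015/01/26/cobweb"
snapshot = "http://web.archive.org/web/20150219094636/" + original
link = robust_link(original, snapshot, wayback_timestamp_to_iso("20150219094636"), "Cobweb")
```

The `href` still points at the live page, so nothing changes for a reader while the original is intact. The extra attributes only matter once it has rotted or drifted.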
![Page 26: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/26.jpg)
What should we expect of the Publisher?
Beyond the assurance that
the fish / references / articles
sold are not rotten
![Page 27: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/27.jpg)
How Can Publishers Stop/Halt Reference Rot?
• You are better placed to know that!
• But here are some Hiberlink suggestions …
At the Point of Ingest
![Page 28: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/28.jpg)
How Can Publishers Stop/Halt Reference Rot?
• You are better placed to know that!
• But here are some Hiberlink suggestions …
At the Point of Sale & Use
![Page 29: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/29.jpg)
Recommended Actions on Reference Rot

| Activity | Responsibility | Snapshot Quality |
|---|---|---|
| 2. Submission/Issue | Editor/System | good |
| 3. Access | Platform | better late than not |
② Publication -> Editing -> (Revision) -> Acceptance -> Issue
• Accept Robust Links in Cited References!
• Batch archive snapshots & use Robust Links
③ Access/Post-Publication -> Reader Access -> Use
• Employ ‘Link Decoration’ & Robust Links for
references in past publications
![Page 30: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/30.jpg)
Link Decoration: JavaScript + Memento API
Demo - http://robustlinks.mementoweb.org/demo/uri_references_js.html
robustlinks.js - https://github.com/mementoweb/robustlinks
![Page 31: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/31.jpg)
Creating Robust Links ‘post hoc’ …
If no ‘date accessed’, then use article’s Publication Date,
or *better still*, the Acceptance Date
– Express the Date in an actionable manner (‘datePublished’ or ‘dateModified’ Schema.org properties) in HTML pages that contain URI references
– Tailor robustlinks.js to exclude links to articles
– Inject robustlinks.js in HTML pages that contain URI references
… enables Users to follow that Link into Web Archives using Memento and/or WayBack Machine
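The date rule above (date accessed if known, otherwise the Acceptance Date, otherwise the Publication Date) can be written as a simple fallback. The sketch below also builds a Wayback Machine lookup URI in the `web.archive.org/web/<timestamp>/<uri>` form seen in the Robust Link example. The function names are assumptions for illustration, and padding the date to midnight is a choice made here, not something the slides prescribe.

```python
from datetime import date


def reference_date(accessed=None, accepted=None, published=None):
    """Pick the best available date: accessed, else accepted, else published."""
    for candidate in (accessed, accepted, published):
        if candidate is not None:
            return candidate
    raise ValueError("no date available for this reference")


def wayback_lookup_uri(uri, when):
    """Build a Wayback Machine URI asking for a capture near the given date."""
    # Pad the date to a 14-digit timestamp (midnight); an assumption of this sketch.
    stamp = when.strftime("%Y%m%d") + "000000"
    return "http://web.archive.org/web/{}/{}".format(stamp, uri)
```

For an article accepted on 19 February 2015 and published later, `reference_date` returns the acceptance date, which is closer to the moment the author last looked at the cited page.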
![Page 32: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/32.jpg)
Be Proactive: Referenced Web Content for new articles
When ingesting new content into the platform:
– Parse for URI references
– Separate references to web-at-large from publisher sites
– Create snapshots in web archives of those URIs
– Then use Link Decorations in HTML to convey:
• original URI + snapshot URI + snapshot Date/Time
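The ingest steps above can be strung together as in the sketch below. Everything named here is an assumption for illustration: `PUBLISHER_HOSTS` is a made-up white list, `snapshot` is a hypothetical callable standing in for whatever web archive is used (it returns a snapshot URI and datetime, or `None` on failure), and the regular expression only handles the simplest form of `<a href="…">` tag.

```python
import re
from html import escape
from urllib.parse import urlsplit

# Hypothetical list of publisher hosts whose links are left undecorated.
PUBLISHER_HOSTS = {"doi.org", "dx.doi.org"}

# Matches only plain <a href="..."> tags with no other attributes.
PLAIN_ANCHOR = re.compile(r'<a href="([^"]+)">')


def decorate_references(html, snapshot, publisher_hosts=PUBLISHER_HOSTS):
    """Add snapshot URI and datetime to every web-at-large link in the HTML.

    `snapshot` is a hypothetical callable: snapshot(uri) -> (version_uri, iso_datetime)
    or None if no archived copy could be created.
    """
    def replace(match):
        uri = match.group(1)
        host = (urlsplit(uri).hostname or "").lower()
        if host in publisher_hosts:
            return match.group(0)  # scholarly link: leave untouched
        result = snapshot(uri)
        if result is None:
            return match.group(0)  # archiving failed: keep the plain link
        version_uri, version_date = result
        return '<a href="{}" data-versionurl="{}" data-versiondate="{}">'.format(
            uri, escape(version_uri, quote=True), escape(version_date, quote=True)
        )

    return PLAIN_ANCHOR.sub(replace, html)
```

Passing the archiving step in as a callable keeps the decoration logic independent of any one archive, which suits the idea of snapshots held by different organisations.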
![Page 33: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/33.jpg)
Our remedy was to write a plugin for OJS
1. A parser converts .pdf to .html & extracts URIs
2. It triggers archiving of content for each reference
• Author & Editor need to work together to determine which archival copy is used
3. It creates an HTML version that includes the Robust Link for each cited reference.
The algorithm for the OJS plugin should generalise to other submission systems
![Page 34: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/34.jpg)
We have also sketched HiberActive Infrastructure to act as middleware between existing software & web archives
Diagram: Publishing platform, HiberActive, External archival service (e.g. Internet Archive)
• Asynchronous (returns Robust Link)
• Distributed (archived with different organisations)
• Lightweight (leveraging HTTP & what already exists)
![Page 35: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/35.jpg)
Summary: Hiberlink Outcomes & Next Steps
1. Defined the Threat of Reference Rot
2. Quantified the extent and way in which it exists &
undermines the Scholarly Record
3. Pointed to potential & practical Remedy
4. Tell the world about these achievements
5. Engage with others
• to build infrastructure
• to prompt adoption (copying) of prototypes by 3rd parties, such as reference managers, editorial systems, publication systems, archival systems
![Page 36: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/36.jpg)
=> Published References have Robust Links to
what the author intended
![Page 37: Link Rot + Content Drift - STM · Link Rot + Content Drift = Reference Rot STM_2015 1 December 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill EDINA, University of Edinburgh](https://reader033.vdocuments.mx/reader033/viewer/2022042515/5f52c5c66135ed28e93cbbea/html5/thumbnails/37.jpg)
Thank you,
Questions welcome
& any interest in working together
http://hiberlink.org #hiberlink