Comparing Published Scientific Journal Articles
to Their Pre-print Versions
Martin Klein (@mart1nkle1n), Peter Broadwell (@peterbroadwell)
with Sharon E. Farb (@farbthink) and Todd Grappone (@liber8er)
{martinklein,broadwell,farb,grappone}@library.ucla.edu
University of California Los Angeles
Comparing Published Scientific Journal Articles to Their Pre-print Versions
@mart1nkle1n #jcdl2016, Newark, NJ, 06/21/2016
Scientific Output in Numbers
Global STM publishing market > $25 billion
• 55% of this from USA
• 28% from Europe, Middle East
• Journals core part of scholarly communication process
• English language journal revenue: ~ $10 billion
• ~ 70% of that out of libraries’ budget
• > 28k scholarly peer-reviewed journals (+3.5% p.a.)
• ~ 2.5 million articles per year (+3% p.a.)
• 21% of research papers from USA
“STM Report: An Overview of Scientific and Scholarly Publishing”, Mark Ware and Michael Mabe, March 2015
University of California Publication Impact
“Research Performance of the UC System,” Elsevier, March 2015
Open Access by Disciplines
“Open Access to the Scientific Journal Literature: Situation 2009”, Björk B-C et al., 2010
http://dx.doi.org/10.1371/journal.pone.0011273
Open Access Rate Overall
2010: “Open Access to the Scientific Journal Literature: Situation 2009”, Björk B-C et al. (http://dx.doi.org/10.1371/journal.pone.0011273)
• 20.4% OA rate
2015: “Open Access and Sources of Full-Text Articles in Google Scholar in Different Subject Fields”, Jamali and Nabavi (http://dx.doi.org/10.1007/s11192-015-1642-2)
• 61.1% OA rate
Pre-print v. Final Published
arXiv.org
• Average annual operating cost for 2013 - 2017: $826,000
Final Published
• English language STM journals: $10 billion in 2013
Sources:
http://arxiv.org/help/support/faq#3D
“STM Report: An Overview of Scientific and Scholarly Publishing”, Mark Ware and Michael Mabe, March 2015
Role of Publisher
• Entrepreneur
• Copyediting
• Tagging
• Marketer
• Distributor
• E-Host
Value of Publisher
“Once you’ve gone through the peer review process, if you look at the article that is actually published in a journal, it looks radically different [to the one submitted due to] that process of transformation, the copy-editing, the database linking, the data visualisation tools, making sure that the metadata for the article is all right, so when people come to [Elsevier database] ScienceDirect or type a search into Google, they can actually find what they are looking for on their platforms.”
Gemma Hersh
http://www.thebookseller.com/news/elsevier-defends-its-value-after-open-access-disputes-328037
Working Assumptions
1. If the publishers’ argument is valid, the text of a pre-print paper should vary significantly from its corresponding post-print version.
2. By applying standard similarity measures, we should be able to detect and quantify such differences.
Assembling a pre-print corpus
Source: arXiv.org
• 1.1 million publication records
• Metadata (typical DC, including DOI) obtained via OAI-PMH interface
• PDF versions of articles available via Amazon’s S3 service (using “requester pays” option)
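A minimal sketch of the metadata harvesting step, not the code used for the study. It builds an OAI-PMH ListRecords request and pulls the title and identifiers out of one Dublin Core record. The endpoint URL, the function names, and the sample record are assumptions made for illustration; check arXiv's current OAI-PMH documentation before harvesting.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed endpoint, not taken from the slides.
ARXIV_OAI_BASE = "http://export.arxiv.org/oai2"

# Standard Dublin Core element namespace.
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Made-up record in the oai_dc shape, for illustration only.
SAMPLE_RECORD = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An Example Title</dc:title>
  <dc:identifier>http://arxiv.org/abs/0000.00000</dc:identifier>
  <dc:identifier>doi:10.1000/example</dc:identifier>
</oai_dc:dc>"""


def list_records_url(metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it replaces metadataPrefix on follow-up requests.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return ARXIV_OAI_BASE + "?" + urlencode(params)


def parse_dc_record(xml_text):
    """Return the title and all identifiers of one Dublin Core record."""
    root = ET.fromstring(xml_text)
    title = root.findtext(".//dc:title", default="", namespaces=DC_NS)
    identifiers = [
        el.text for el in root.iterfind(".//dc:identifier", DC_NS) if el.text
    ]
    return {"title": title.strip(), "identifiers": identifiers}


if __name__ == "__main__":
    print(list_records_url())
    print(parse_dc_record(SAMPLE_RECORD))
```

A full harvest would repeat the request with each returned resumption token until none is left. Downloading the PDFs from S3 is a separate step and is not shown here.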
Finding a matching post-print corpus
1. Extract DOIs from arXiv metadata
• 44.5% of articles have a DOI
2. CrossRef’s Metadata Search API
• Match by DOI
• Download article & metadata in XML/PDF
Results in:
• 11,017 full text articles
• Majority published by Elsevier between 2003 and 2015
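The matching step can be sketched as follows, assuming each metadata record is a dictionary with a free-text "doi" field. The record layout, the regular expression, and the function names are illustrative assumptions, and no CrossRef call is made here: the sketch covers only DOI extraction, DOI coverage, and pairing by DOI.

```python
import re

# Common DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def extract_doi(text):
    """Return the first DOI-like string in text, lowercased, or None."""
    match = DOI_PATTERN.search(text or "")
    if not match:
        return None
    # Drop trailing punctuation and lowercase (DOIs are case-insensitive).
    return match.group(0).rstrip(".,;)").lower()


def doi_coverage(records):
    """Fraction of metadata records that carry a DOI."""
    if not records:
        return 0.0
    with_doi = sum(1 for rec in records if extract_doi(rec.get("doi", "")))
    return with_doi / len(records)


def match_by_doi(preprints, published):
    """Pair pre-print and published records that share a normalized DOI."""
    published_by_doi = {}
    for rec in published:
        doi = extract_doi(rec.get("doi", ""))
        if doi:
            published_by_doi[doi] = rec
    pairs = []
    for rec in preprints:
        doi = extract_doi(rec.get("doi", ""))
        if doi in published_by_doi:
            pairs.append((rec, published_by_doi[doi]))
    return pairs


if __name__ == "__main__":
    preprints = [
        {"id": "preprint-a", "doi": "doi:10.1016/J.PHYSLETB.2006.10.068"},
        {"id": "preprint-b", "doi": ""},
    ]
    published = [
        {"journal": "Physics Letters B",
         "doi": "http://dx.doi.org/10.1016/j.physletb.2006.10.068"},
    ]
    print(doi_coverage(preprints))
    print(match_by_doi(preprints, published))
```

Normalizing both sides to a bare lowercase DOI lets records match even when one source stores a "doi:" prefix and the other a resolver URL.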
Text Comparison Methods
1. Length ratio
2. Levenshtein ratio
3. Cosine similarity
4. Jaccard coefficient
5. Sorensen similarity
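The five measures above can be sketched in plain Python as follows. The slides do not give exact definitions, so the choices here are assumptions: length and Levenshtein ratios work on characters, the other three on lowercased word tokens, and every score is scaled so that 1 means most similar.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercased word tokens; a simple tokenizer chosen for illustration."""
    return re.findall(r"\w+", text.lower())


def length_ratio(a, b):
    """Shorter length divided by longer length, in characters."""
    if not a and not b:
        return 1.0
    return min(len(a), len(b)) / max(len(a), len(b))


def levenshtein_distance(a, b):
    """Edit distance via the classic two-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_ratio(a, b):
    """1 minus the edit distance normalized by the longer string's length."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / max(len(a), len(b))


def cosine_similarity(a, b):
    """Cosine of the angle between the two word-count vectors."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Shared distinct words divided by all distinct words."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def sorensen(a, b):
    """Twice the shared distinct words divided by the sum of both set sizes."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))


if __name__ == "__main__":
    pre = "Comparing published articles to their pre-print versions"
    post = "Comparing published journal articles to their pre-print versions"
    for measure in (length_ratio, levenshtein_ratio, cosine_similarity,
                    jaccard, sorensen):
        print(measure.__name__, round(measure(pre, post), 3))
```

The pure-Python Levenshtein distance is quadratic in text length, so full article bodies would likely call for an optimized library or a chunked comparison.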
Comparison of Sections
“Analyzing News Events in Non-Traditional Digital Library Collections”, M. Klein, P. Broadwell, 2015
http://dx.doi.org/10.1145/2756406.2756948
Title Comparison
[Figure: title comparison. Axes: Similarity (1 = most similar); Papers; % of all papers]
Explore our findings at http://sologlo.library.ucla.edu/prepost
Abstract Comparison
[Figure: abstract comparison. Axes: Similarity (1 = most similar); Papers; % of all papers]
Explore our findings at http://sologlo.library.ucla.edu/prepost
Physics Letters B, DOI 10.1016/j.physletb.2006.10.068
Body Comparison
[Figure: body comparison. Axes: Similarity (1 = most similar); Papers; % of all papers]
Explore our findings at http://sologlo.library.ucla.edu/prepost
Publication Dates
[Figure: publication dates. Axes: Number of days; Papers]
Assembling a pre-print corpus
Source: arXiv.org
• 1.1 million publication records
• Metadata (typical DC, including DOI) obtained via OAI-PMH interface
• PDF versions of articles available via Amazon’s S3 service (using “requester pays” option)
• *Latest version used if multiple available*
• 35% of all arXiv papers have > 1 version
• 58% of our matched papers have > 1 version
• Repeat experiment with *earliest version*
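Choosing between the latest and the earliest version can be sketched as below. The sketch assumes that each version of a paper is identified by a string ending in "v" plus a number, and that an identifier without such a suffix counts as version 1. The identifiers in the example are made up, and the function names are hypothetical.

```python
import re

# Trailing version marker such as "v1" or "v12".
VERSION_SUFFIX = re.compile(r"v(\d+)$")


def version_number(identifier):
    """Numeric version of an identifier; 1 if no suffix is present."""
    match = VERSION_SUFFIX.search(identifier)
    return int(match.group(1)) if match else 1


def pick_version(identifiers, which="latest"):
    """Return the latest or earliest identifier from a paper's versions."""
    if which not in ("latest", "earliest"):
        raise ValueError("which must be 'latest' or 'earliest'")
    if not identifiers:
        raise ValueError("no versions given")
    chooser = max if which == "latest" else min
    # Compare numerically so that v10 sorts after v9.
    return chooser(identifiers, key=version_number)


def share_with_multiple_versions(papers):
    """Fraction of papers (base id -> list of versions) with > 1 version."""
    if not papers:
        return 0.0
    multi = sum(1 for versions in papers.values() if len(versions) > 1)
    return multi / len(papers)


if __name__ == "__main__":
    papers = {
        "0000.00001": ["0000.00001v1", "0000.00001v10", "0000.00001v9"],
        "0000.00002": ["0000.00002v1"],
    }
    print(pick_version(papers["0000.00001"], "latest"))
    print(pick_version(papers["0000.00001"], "earliest"))
    print(share_with_multiple_versions(papers))
```

Rerunning the comparison with `which="earliest"` would then correspond to the repeated experiment mentioned in the last bullet.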
Publication Dates of Earliest Versions
[Figure: publication dates of earliest versions. Axes: Number of days; Papers]
Title Deltas
[Figure: title deltas. Axes: Papers; % of all papers]
Abstract Deltas
[Figure: abstract deltas. Axes: Papers; % of all papers]
Body Deltas
[Figure: body deltas. Axes: Papers; % of all papers]
Discussion & Future Work
• Single corpus experiment
• Pre-print/final published matches based on:
  • DOIs
  • CrossRef API results
  • UCLA serial subscriptions (majority Elsevier publications)
• Expand to other disciplines/publishers
• Overlay with ISI Impact factor and usage statistics
• Refine extraction/comparison of authors and references
• Operate at scale