Comparing Published Scientific Journal Articles to Their Pre-print Versions



Page 1: Comparing Published Scientific Journal Articles  to Their Pre-print Versions

Comparing Published Scientific Journal Articles to Their Pre-print Versions

Martin Klein, Peter Broadwell
@mart1nkle1n, @peterbroadwell

with Sharon E. Farb and Todd Grappone
@farbthink, @liber8er

{martinklein,broadwell,farb,grappone}@library.ucla.edu
University of California Los Angeles

Page 2: Comparing Published Scientific Journal Articles  to Their Pre-print Versions

Comparing Published Scientific Journal Articles to Their Pre-print Versions

@mart1nkle1n #jcdl2016, Newark, NJ, 06/21/2016

Scientific Output in Numbers

Global STM publishing market > $25 billion
• 55% of this from USA
• 28% from Europe, Middle East

• Journals core part of scholarly communication process
• English language journal revenue: ~ $10 billion
  • ~ 70% of that out of libraries’ budget

• > 28k scholarly peer-reviewed journals (+3.5% p.a.)
• ~ 2.5 million articles per year (+3% p.a.)
• 21% of research papers from USA

“STM Report: An Overview of Scientific and Scholarly Publishing”, Mark Ware and Michael Mabe, March 2015

Page 3: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


University of California Publication Impact

“Research Performance of the UC System,” Elsevier, March 2015

Page 4: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


Open Access by Disciplines

“Open Access to the Scientific Journal Literature: Situation 2009”, Björk B-C et al. 2010
http://dx.doi.org/10.1371/journal.pone.0011273

Page 8: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


Open Access Rate Overall

2010: “Open Access to the Scientific Journal Literature: Situation 2009”, Björk B-C et al. (http://dx.doi.org/10.1371/journal.pone.0011273)

20.4% OA rate

2015: “Open Access and Sources of Full-Text Articles in Google Scholar in Different Subject Fields”, Jamali and Nabavi (http://dx.doi.org/10.1007/s11192-015-1642-2)

61.1% OA rate

Page 9: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


Pre-print v. Final Published

arXiv.org
• Average annual operating cost for 2013 - 2017: $826,000

Final Published
• English language STM journals: $10 billion in 2013

http://arxiv.org/help/support/faq#3D
“STM Report: An Overview of Scientific and Scholarly Publishing”, Mark Ware and Michael Mabe, March 2015

Page 10: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


Role of Publisher

• Entrepreneur
• Copyediting
• Tagging
• Marketer
• Distributor
• E-Host

Page 11: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


Value of Publisher

“Once you’ve gone through the peer review process, if you look at the article that is actually published in a journal, it looks radically different [to the one submitted due to] that process of transformation, the copy-editing, the database linking, the data visualisation tools, making sure that the metadata for the article is all right, so when people come to [Elsevier database] ScienceDirect or type a search into Google, they can actually find what they are looking for on their platforms.”

Gemma Hersh
http://www.thebookseller.com/news/elsevier-defends-its-value-after-open-access-disputes-328037

Page 12: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


Working Assumptions

1. If the publishers’ argument is valid, the text of a pre-print paper should vary significantly from its corresponding post-print version.

2. By applying standard similarity measures, we should be able to detect and quantify such differences.

Page 13: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


Assembling a pre-print corpus

Source: arXiv.org
• 1.1 million publication records
• Metadata (typical DC, including DOI) obtained via OAI-PMH interface
• PDF versions of articles available via Amazon’s S3 service (using “requester pays” option)
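The metadata harvesting step could look roughly like the sketch below. It is not the authors' code: the endpoint URL, the function names, and the sample record are assumptions made for illustration, and the sample values are invented. It builds an OAI-PMH ListRecords request for Dublin Core metadata and parses one response page, including the resumption token used to fetch the next page.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumption: arXiv's public OAI-PMH endpoint; verify against arXiv's docs.
OAI_BASE = "http://export.arxiv.org/oai2"
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(resumption_token=None):
    """Build a ListRecords request URL for Dublin Core (oai_dc) metadata."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return OAI_BASE + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (records, resumption_token) for one ListRecords response page."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(OAI_NS + "record"):
        records.append({
            "title": [e.text for e in record.iter(DC_NS + "title")],
            "identifier": [e.text for e in record.iter(DC_NS + "identifier")],
        })
    token = root.find(".//" + OAI_NS + "resumptionToken")
    has_token = token is not None and token.text
    return records, (token.text if has_token else None)


# Invented sample response, used only to exercise the parser offline.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example title</dc:title>
          <dc:identifier>http://arxiv.org/abs/0000.00000</dc:identifier>
          <dc:identifier>doi:10.1234/example</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>token-1</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, next_token = parse_records(SAMPLE)
print(records[0]["identifier"], next_token)
```

A full harvest would call `list_records_url(next_token)` repeatedly until no token is returned. The PDF download from S3 is left out here because it needs AWS credentials and a third-party client.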

Page 14: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


Finding a matching post-print corpus

1. Extract DOIs from arXiv metadata
• 44.5% of articles have DOI

2. CrossRef’s Metadata Search API
• Match by DOI
• Download article & metadata in XML/PDF

Results in:
• 11,017 full text articles
• Majority published by Elsevier between 2003 and 2015
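The two matching steps above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: the DOI regular expression is a permissive guess, the helper names are hypothetical, and the lookup URL targets Crossref's public REST API rather than the Metadata Search API named on the slide.

```python
import re
from urllib.parse import quote

# Assumption: a permissive DOI pattern; real-world DOIs can be messier.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")

# Assumption: Crossref REST API "works" route, swapped in for illustration.
CROSSREF_WORKS = "https://api.crossref.org/works/"


def extract_doi(text):
    """Return the first DOI found in a metadata string, or None."""
    match = DOI_PATTERN.search(text or "")
    if match is None:
        return None
    # Strip trailing punctuation and lowercase (DOIs are case-insensitive).
    return match.group(0).rstrip(".,;)").lower()


def record_doi(identifiers):
    """Return the first DOI among a record's identifier strings, or None."""
    for value in identifiers:
        doi = extract_doi(value)
        if doi:
            return doi
    return None


def doi_coverage(records):
    """Fraction of records (lists of identifier strings) that carry a DOI."""
    if not records:
        return 0.0
    with_doi = sum(1 for identifiers in records if record_doi(identifiers))
    return with_doi / len(records)


def crossref_url(doi):
    """Build a lookup URL for one DOI."""
    return CROSSREF_WORKS + quote(doi, safe="/")


print(crossref_url(extract_doi("doi:10.1016/j.physletb.2006.10.068")))
```

`doi_coverage` corresponds to the share reported in step 1; the downloaded full text still depends on what the harvesting institution is licensed to access.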

Page 15: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


Text Comparison Methods

1. Length ratio
2. Levenshtein ratio
3. Cosine similarity
4. Jaccard coefficient
5. Sorensen similarity
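For readers unfamiliar with these measures, the sketch below gives one common textbook definition of each. The slides do not specify tokenization or normalization, so the whitespace tokenizer, the character-level Levenshtein ratio (1 minus distance over the longer length), and the set-based Jaccard and Sorensen variants are all assumptions and may differ from what the study used.

```python
import math
from collections import Counter


def tokens(text):
    """Lowercased whitespace tokens (an assumed preprocessing step)."""
    return text.lower().split()


def length_ratio(a, b):
    """Shorter length over longer length, in characters."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else min(len(a), len(b)) / longest


def levenshtein_distance(a, b):
    """Character edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_ratio(a, b):
    """1 minus the edit distance normalized by the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def cosine_similarity(a, b):
    """Cosine of the angle between term-frequency vectors."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Shared distinct tokens over all distinct tokens."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return 1.0 if not (sa or sb) else len(sa & sb) / len(sa | sb)


def sorensen(a, b):
    """Twice the shared distinct tokens over the sum of both set sizes."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return 1.0 if not (sa or sb) else 2 * len(sa & sb) / (len(sa) + len(sb))


pre = "comparing articles to their pre-print versions"
post = "comparing published articles to their pre-print versions"
for measure in (length_ratio, levenshtein_ratio, cosine_similarity, jaccard, sorensen):
    print(measure.__name__, round(measure(pre, post), 3))
```

All five return values between 0 and 1, with 1 meaning the two texts are identical under that measure, which matches the "1 = most similar" axis used in the comparison charts.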

Page 16: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


Comparison of Sections

“Analyzing News Events in Non-Traditional Digital Library Collections”, M. Klein, P. Broadwell, 2015
http://dx.doi.org/10.1145/2756406.2756948

Page 17: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


Comparison of Sections

Page 18: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


Title Comparison

[Figure. Axes: Similarity (1 = most similar); Papers; % of all papers]

Explore our findings at http://sologlo.library.ucla.edu/prepost

Page 19: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


Comparison of Sections

Page 20: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


Abstract Comparison

[Figure. Axes: Similarity (1 = most similar); Papers; % of all papers]

Explore our findings at http://sologlo.library.ucla.edu/prepost

Page 21: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


10.1016/j.physletb.2006.10.068
Physics Letters B

Page 22: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


Comparison of Sections

Page 23: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


Body Comparison

[Figure. Axes: Similarity (1 = most similar); Papers; % of all papers]

Explore our findings at http://sologlo.library.ucla.edu/prepost

Page 24: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


Publication Dates

[Figure. Axes: Number of days; Papers]

Page 25: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


Assembling a pre-print corpus

Source: arXiv.org
• 1.1 million publication records
• Metadata (typical DC, including DOI) obtained via OAI-PMH interface
• PDF versions of articles available via Amazon’s S3 service (using “requester pays” option)
• *Latest version used if multiple available*

• 35% of all arXiv papers have > 1 version
• 58% of our matched papers have > 1 version
• Repeat experiment with *earliest version*
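Choosing between the latest and the earliest pre-print version can be sketched as below. The sketch assumes version-suffixed arXiv identifiers of the form `<id>vN`; the helper names and the placeholder identifiers are hypothetical, and the slides do not say how versions were actually resolved.

```python
import re

# Assumption: identifiers end in "v" plus a version number, e.g. "1234.5678v2".
VERSIONED_ID = re.compile(r"^(?P<base>.+?)v(?P<version>\d+)$")


def split_version(identifier):
    """Split a versioned identifier into (base id, integer version)."""
    match = VERSIONED_ID.match(identifier)
    if match is None:
        raise ValueError("no version suffix in %r" % identifier)
    return match.group("base"), int(match.group("version"))


def earliest_and_latest(identifiers):
    """Map each base id to its (earliest, latest) versioned identifier."""
    grouped = {}
    for identifier in identifiers:
        base, version = split_version(identifier)
        grouped.setdefault(base, []).append((version, identifier))
    # Comparing on the integer version keeps v10 after v2.
    return {base: (min(found)[1], max(found)[1])
            for base, found in grouped.items()}


# Placeholder identifiers, not real papers from the corpus.
available = ["1234.5678v1", "1234.5678v10", "1234.5678v2", "hep-th/0000000v1"]
print(earliest_and_latest(available))
```

Running the comparison once with the second element of each pair and once with the first mirrors the two experiments described on this slide.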

Page 26: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


Publication Dates of Earliest Versions

[Figure. Axes: Number of days; Papers]

Page 27: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


Title Deltas

[Figure. Axes: Papers; % of all papers]

Page 28: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


Title Deltas

[Figure. Axes: Papers; % of all papers]

Page 29: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


Title Deltas

[Figure. Axes: Papers; % of all papers]

Page 30: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


Abstract Deltas

[Figure. Axes: Papers; % of all papers]

Page 31: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


Body Deltas

[Figure. Axes: Papers; % of all papers]

Page 32: Comparing Published Scientific Journal Articles  to Their Pre-print Versions


Discussion & Future Work

• Single corpus experiment
• Pre-print/final published matches based on:
  • DOIs
  • CrossRef API results
  • UCLA serial subscriptions (majority Elsevier publications)

• Expand to other disciplines/publishers
• Overlay with ISI Impact factor and usage statistics
• Refine extraction/comparison of authors and references
• Operate at scale
