exploring coverage and distribution of scholarly identifiers on the web

1

b

b

www.know-center.at

Exploring Coverage and Distribution of

Scholarly Identifiers on the Web14th International Symposium of Information Science

Zadar, 21 May 2015

Peter Kraker, Asura Enkhbayar & Elisabeth Lex

2Know-Center GmbH • Research Center for Data-Driven Business and Big Data Analytics

The Emerging Eco-system of Scholarly Services on the Web


The Emerging Eco-system of Scholarly Services on the Web

5

Research Questions

• How are scholarly identifiers distributed in crowd-

sourced systems on the Web?

• Does the provision of different identifiers have an

influence on findability of scientific publications in other

bibliographic and bibliometric sources?

• Who are the top providers of identifiers?

• Which identifier combinations are the most common?

Know-Center GmbH • Research Center for Data-Driven Business and Big Data Analytics

6

Indentifiers on the Scholarly Web

Article level

Publication levelAuthor level


7

Data & Method

Data Sources: arXiv, CrossRef

and Mendeley

arXiv discipline of quantitative

biology (q-bio), 1992-2014

(n=14,195 metadata records)Data collection: 17/11/2014

arXiv discipline of astrophysics

(astro-ph), 2009-2014

(n=81,814 metdata records)

Data collection: 04/03/2015


arxiv-ID, DOI

DOI

• arXiv-ID

• DOI

• Scopus-ID

• Pubmed-ID

• ISSN

8

Distribution of arXiv Articles Over Time: q-bio


9

Distribution of arXiv Articles Over Time: astro-ph


10

Results of the Data Collection Process


36.1%

50.0%

81.5%

29.8%

76.8%

86.1%

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

docs with DOI onarXiv

docs with DOI docs found onMendeley

q-bio (1992-2014; n=14,195) astro-ph (2009-2014; n=81,814)

Change in the

metadata lookup

11

Findability on Mendeley: q-bio (1992-2014)


Total documents

14,195

Documents

with DOI & arXiv-ID

7,100 (50.02%)

Documents

with arXiv-ID

7,095 (49.98%)

Documents retrieved

on Mendeley

with DOI

6,492 (91.44%)

Documents retrieved

on Mendeley

with arXiv-ID

4,896 (69.01%)

Documents retrieved

on Mendeley

with arXiv-ID

5,414 (76.25%)

12

Findability on Mendeley: astro-ph (2009-2014)


Total documents

81,814

Documents

with DOI & arXiv-ID

62,800 (76.8%)

Documents

with arXiv-ID

19,014 (23.2%)

Documents retrieved

on Mendeley

with DOI

54,093 (86.1%)

Documents retrieved

on Mendeley

with arXiv-ID

12,812 (67.4%)

Documents retrieved

on Mendeley

with arXiv-ID

45,586 (72.6%)

13

Findability on Mendeley & DOI Coverage: q-bio


14

Findability on Mendeley & DOI Coverage: astro-ph


0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

2009 2010 2011 2012 2013 2014

findability

DOI availability

15

Identifier Frequency and Mean Readership on

Mendeley


arxiv doi scopus pmid issn

q-bio (n=11,570)

frequency10,351

(89.5%)

8,321

(71.9%)

8,409

(72.7%)

5,477

(47.3%)

8,119

(70.2%)

mean readership 20.4 25.4 25.4 32.4 25.9

median readership 9 13 13 19 14

astro-ph (n=70,459)

frequency65,719

(93.3%)

61,667

(87.5%)

58,780

(83.4%)

1,454

(2.1%)

60,985

(86.6%)

mean readership 7.2 7.3 7.5 20.6 7.3

median readership 5 5 13 5 6StdDev: 33.4


Distribution of Disciplines on Mendeley (March 2011)

[Kraker et al.

2012]

17

Identifier Combination Frequency on Mendeley


39.2%

22.8%

18.2%

4.4%

4.1%

2.7%

0.0%

0.0%

0.0%

8.5%

1.3%

10.2%

74.5%

0.0%

0.0%

4.5%

2.6%

1.7%

1.5%

3.7%

0% 20% 40% 60% 80% 100%

arxiv-doi-issn-pmid-scopus

arxiv

arxiv-doi-issn-scopus

doi-issn-pmid-scopus

arxiv-scopus

doi-issn-scopus

doi-arxiv-issn

doi-issn

doi-arxiv

Other

astro-ph (2009-2014; n=81,814) q-bio (1992-2014; n=14,195)

18

Conclusions

• As expected, crowd-sourced systems show big differences

in identifier coverage and distribution crowd-sourced

data needs to be amended automatically

• When retrieving arXiv articles from Mendeley, we were able

to obtain more articles using the DOI than the arXiv-ID

• BUT: a single arXiv-ID is the second most popular identifier

combination on Mendeley. This suggests that pre-prints are

being read – if at a lower level

• There is, however, a certain time lag concerning the

adoption of articles from arXiv on Mendeley


19

Conclusions

• The distribution of identifiers in a collection of papers

gives hints at the nature of papers in this collection

• As with citations, field normalization will be essential for

cross-comparison of altmetrics scores


20

Limitations & Future Work

• Choice of arXiv as primary data source introduces a bias

• We only looked at two disciplines from natural sciences

• Using a random sample of 381 articles from Web of Science,

Zahedi et al. (2014) report that they were able to retrieve only

47.7% of articles on Mendeley using the DOI or the title.

Expand this study to further disciplines and fields using

more data sources

Look more deeply into the reasons for disciplinary and

tool-related differences


21

Open Source Crawling Framework


https://github.com/Bubblbu/crawling-framework

https://github.com/Bubblbu/crawling-framework

22

b

SAVE THE DATE: i-KNOW Conference

Special Track on Science 2.0 & Open Science

21-23 October 2015, Graz, Austria

Submission Deadline (extended): 22 June 2015 (Abstracts due: June 8)

exploring coverage and distribution of scholarly identifiers on the web

Education

center gmbh research

datadriven business

big data analytics arxivid

data method data sources

data collection process

arxiv docs

mendeley doi coverage

mendeley qbio