exploring coverage and distribution of scholarly identifiers on the web
TRANSCRIPT
1
b
b
www.know-center.at
Exploring Coverage and Distribution of
Scholarly Identifiers on the Web14th International Symposium of Information Science
Zadar, 21 May 2015
Peter Kraker, Asura Enkhbayar & Elisabeth Lex
2Know-Center GmbH • Research Center for Data-Driven Business and Big Data Analytics
The Emerging Eco-system of Scholarly Services on the Web
3Know-Center GmbH • Research Center for Data-Driven Business and Big Data Analytics
The Emerging Eco-system of Scholarly Services on the Web
5
Research Questions
• How are scholarly identifiers distributed in crowd-
sourced systems on the Web?
• Does the provision of different identifiers have an
influence on findability of scientific publications in other
bibliographic and bibliometric sources?
• Who are the top providers of identifiers?
• Which identifier combinations are the most common?
Know-Center GmbH • Research Center for Data-Driven Business and Big Data Analytics
6
Indentifiers on the Scholarly Web
Article level
Publication levelAuthor level
Know-Center GmbH • Research Center for Data-Driven Business and Big Data Analytics
7
Data & Method
Data Sources: arXiv, CrossRef
and Mendeley
arXiv discipline of quantitative
biology (q-bio), 1992-2014
(n=14,195 metadata records)Data collection: 17/11/2014
arXiv discipline of astrophysics
(astro-ph), 2009-2014
(n=81,814 metdata records)
Data collection: 04/03/2015
Know-Center GmbH • Research Center for Data-Driven Business and Big Data Analytics
arxiv-ID, DOI
DOI
• arXiv-ID
• DOI
• Scopus-ID
• Pubmed-ID
• ISSN
8
Distribution of arXiv Articles Over Time: q-bio
Know-Center GmbH • Research Center for Data-Driven Business and Big Data Analytics
9
Distribution of arXiv Articles Over Time: astro-ph
Know-Center GmbH • Research Center for Data-Driven Business and Big Data Analytics
10
Results of the Data Collection Process
Know-Center GmbH • Research Center for Data-Driven Business and Big Data Analytics
36.1%
50.0%
81.5%
29.8%
76.8%
86.1%
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
docs with DOI onarXiv
docs with DOI docs found onMendeley
q-bio (1992-2014; n=14,195) astro-ph (2009-2014; n=81,814)
Change in the
metadata lookup
11
Findability on Mendeley: q-bio (1992-2014)
11Know-Center GmbH • Research Center for Data-Driven Business and Big Data Analytics
Total documents
14,195
Documents
with DOI & arXiv-ID
7,100 (50.02%)
Documents
with arXiv-ID
7,095 (49.98%)
Documents retrieved
on Mendeley
with DOI
6,492 (91.44%)
Documents retrieved
on Mendeley
with arXiv-ID
4,896 (69.01%)
Documents retrieved
on Mendeley
with arXiv-ID
5,414 (76.25%)
12
Findability on Mendeley: astro-ph (2009-2014)
12Know-Center GmbH • Research Center for Data-Driven Business and Big Data Analytics
Total documents
81,814
Documents
with DOI & arXiv-ID
62,800 (76.8%)
Documents
with arXiv-ID
19,014 (23.2%)
Documents retrieved
on Mendeley
with DOI
54,093 (86.1%)
Documents retrieved
on Mendeley
with arXiv-ID
12,812 (67.4%)
Documents retrieved
on Mendeley
with arXiv-ID
45,586 (72.6%)
13
Findability on Mendeley & DOI Coverage: q-bio
Know-Center GmbH • Research Center for Data-Driven Business and Big Data Analytics
14
Findability on Mendeley & DOI Coverage: astro-ph
Know-Center GmbH • Research Center for Data-Driven Business and Big Data Analytics
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
2009 2010 2011 2012 2013 2014
findability
DOI availability
15
Identifier Frequency and Mean Readership on
Mendeley
Know-Center GmbH • Research Center for Data-Driven Business and Big Data Analytics
arxiv doi scopus pmid issn
q-bio (n=11,570)
frequency10,351
(89.5%)
8,321
(71.9%)
8,409
(72.7%)
5,477
(47.3%)
8,119
(70.2%)
mean readership 20.4 25.4 25.4 32.4 25.9
median readership 9 13 13 19 14
astro-ph (n=70,459)
frequency65,719
(93.3%)
61,667
(87.5%)
58,780
(83.4%)
1,454
(2.1%)
60,985
(86.6%)
mean readership 7.2 7.3 7.5 20.6 7.3
median readership 5 5 13 5 6StdDev: 33.4
16Know-Center GmbH • Research Center for Data-Driven Business and Big Data Analytics
Distribution of Disciplines on Mendeley (March 2011)
[Kraker et al.
2012]
17
Identifier Combination Frequency on Mendeley
Know-Center GmbH • Research Center for Data-Driven Business and Big Data Analytics
39.2%
22.8%
18.2%
4.4%
4.1%
2.7%
0.0%
0.0%
0.0%
8.5%
1.3%
10.2%
74.5%
0.0%
0.0%
4.5%
2.6%
1.7%
1.5%
3.7%
0% 20% 40% 60% 80% 100%
arxiv-doi-issn-pmid-scopus
arxiv
arxiv-doi-issn-scopus
doi-issn-pmid-scopus
arxiv-scopus
doi-issn-scopus
doi-arxiv-issn
doi-issn
doi-arxiv
Other
astro-ph (2009-2014; n=81,814) q-bio (1992-2014; n=14,195)
18
Conclusions
• As expected, crowd-sourced systems show big differences
in identifier coverage and distribution crowd-sourced
data needs to be amended automatically
• When retrieving arXiv articles from Mendeley, we were able
to obtain more articles using the DOI than the arXiv-ID
• BUT: a single arXiv-ID is the second most popular identifier
combination on Mendeley. This suggests that pre-prints are
being read – if at a lower level
• There is, however, a certain time lag concerning the
adoption of articles from arXiv on Mendeley
Know-Center GmbH • Research Center for Data-Driven Business and Big Data Analytics
19
Conclusions
• The distribution of identifiers in a collection of papers
gives hints at the nature of papers in this collection
• As with citations, field normalization will be essential for
cross-comparison of altmetrics scores
Know-Center GmbH • Research Center for Data-Driven Business and Big Data Analytics
20
Limitations & Future Work
• Choice of arXiv as primary data source introduces a bias
• We only looked at two disciplines from natural sciences
• Using a random sample of 381 articles from Web of Science,
Zahedi et al. (2014) report that they were able to retrieve only
47.7% of articles on Mendeley using the DOI or the title.
Expand this study to further disciplines and fields using
more data sources
Look more deeply into the reasons for disciplinary and
tool-related differences
Know-Center GmbH • Research Center for Data-Driven Business and Big Data Analytics
21
Open Source Crawling Framework
Know-Center GmbH • Research Center for Data-Driven Business and Big Data Analytics
https://github.com/Bubblbu/crawling-framework