![Page 1: Bibliographic Information Sources for Computer Science ... · Databases & Information Systems Group 2 •Prof. Dr.-Ing. Ralf Schenkel 2003-2013 Max-Planck-Institut für Informatik,](https://reader034.vdocuments.mx/reader034/viewer/2022042921/5f68b5747096a059ed2541cf/html5/thumbnails/1.jpg)
Bibliographic Information
Sources for Computer Science
with a Focus on Citations
Ralf Schenkel
Databases & Information Systems Group
• Prof. Dr.-Ing. Ralf Schenkel
  2003-2013 Max-Planck-Institut für Informatik, Saarbrücken
  2013-2016 Universität Passau
  2016- Universität Trier
• Dr. Michael Ley
• Christin Kreutz
• Tobias Zeimetz
• Christopher Michels
• Lorik Dumani
Dblp core team
• Dr. Michael Ley
• Dr. Marcel R. Ackermann
• Oliver Hoffmann
• Dr. Florian Reitz
• Dr. Michael Wagner
• Stefanie von Keutz
Dblp Overview
~4.2 million publications, ~2.1 million authors, ~400,000 new publications per year
Recent activities:
• affiliation information
• links to other authority providers, esp. ORCID and WikiData
Preview: Affiliations in dblp
Other systems: Semantic Scholar
In high school
Other Systems: MS Academic
?
Other Systems: Google Scholar
Other Systems
Common weakness:
Data quality issues due to automatic information collection
Advantage of dblp:
Manual data curation (with limits)
How is publication data added to dblp?
?
Dblp Data Ingestion Pipeline
[Diagram components: publishers, Web, Source Monitoring, Data Harvesting (HTML), extracted meta data, Data Quality Control]
Data Quality Control:
• Selection
• Correction
• Author disambiguation
Outline
• Meta Data Harvesting
• Author Disambiguation
• Existing Metadata Collections
• Citations
Harvesting is much more difficult now
Need to interact with Web site, parsing static HTML not enough
Harvesting is much more difficult now
Successful harvesting needs to support JavaScript
Monitoring & harvesting: OXPath
Extension of XPath by University of Oxford (Georg Gottlob et al.)
• Actions: fill in forms, click buttons
• Extraction: specify what should be harvested
• Transformation: specify target XML format
• Iteration: loops, e.g., for paginated content
Michels, C., Fayzrakhmanov, R.R., Ley, M., Sallinger, E., Schenkel, R.: OXPath-based data acquisition for dblp. In: 2017 ACM/IEEE Joint Conference on Digital Libraries, 2017
Example: Navigating Google Scholar
Advantages of OXPath
• More powerful than plain XPath: actions, extraction, transformation, iteration
• Possible to extract from several pages in one query
• Somewhat robust to changes in layout
Now in production use at dblp
Outline
• Meta Data Harvesting
• Author Disambiguation
• Existing Metadata Collections
• Citations
Author Disambiguation: Homonyms
Multiple persons with the same name in the same profile
Hard problem for an algorithm (even for a human), may use
• paper titles/topics
• common coauthors
• publication years
• publication venues
• …
Affiliations would be useful, but are usually not available
Author disambiguation: Homonyms
Heuristic approach: Coauthor Graph
• Nodes are authors of publications
• Edge between authors iff joint publication
Example:
• Paper 1: Authors A, B, C
• Paper 2: Authors A, D, E
• Paper 3: Authors A, F, G
• Paper 4: Authors B, D
How many authors could the profile for author A represent?
• Remove A from coauthor graph
• Every connected subgraph with at least one coauthor of A represents a coauthor community
• In the example, we may potentially have two different persons with name A
[Figure: coauthor graph of the example with nodes A, B, C, D, E, F, G, and the same graph after removing A]
http://dblp.uni-trier.de/faq/How+does+dblp+detect+coauthor+communities.html
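The coauthor-community heuristic from this slide can be sketched in a few lines of Python. This is a minimal illustration, not dblp's implementation: the function name `coauthor_communities` and the input format (each paper as a list of author names) are assumptions made for the example.

```python
from collections import defaultdict

def coauthor_communities(papers, author):
    """Split the coauthors of `author` into connected communities.

    papers: list of author-name lists (hypothetical input format).
    Returns a list of sorted communities, each containing at least
    one direct coauthor of `author`.
    """
    adjacency = defaultdict(set)   # coauthor graph without `author`
    direct_coauthors = set()
    for authors in papers:
        others = [a for a in authors if a != author]
        if author in authors:
            direct_coauthors.update(others)
        for a in others:
            adjacency[a].update(o for o in others if o != a)

    seen, communities = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        if component & direct_coauthors:
            communities.append(sorted(component))
    return communities

papers = [["A", "B", "C"], ["A", "D", "E"], ["A", "F", "G"], ["B", "D"]]
print(coauthor_communities(papers, "A"))
```

For the example papers this yields two communities, matching the slide's conclusion that the profile for A may represent two different persons.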
How many authors does a profile represent?
Author Disambiguation: Synonyms
The same person with different names in different profiles
Identify pairs of candidate profiles such that
• Small name difference
• Common coauthors
• Common venues
• Common topics
• …
+ manual corrections
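Two of the criteria above (small name difference, common coauthors) can be combined into a simple candidate generator. The sketch below is only an illustration under stated assumptions: the thresholds, the function names, and the profile format (name mapped to a set of coauthor names) are hypothetical, and string similarity from Python's `difflib` stands in for whatever name comparison dblp actually uses.

```python
from difflib import SequenceMatcher
from itertools import combinations

def name_similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def synonym_candidates(profiles, min_name_sim=0.7, min_shared=1):
    """Return (name1, name2, shared coauthors) for candidate synonym pairs.

    profiles: dict mapping profile name -> set of coauthor names
    (hypothetical format). Thresholds are illustrative assumptions.
    """
    candidates = []
    for a, b in combinations(sorted(profiles), 2):
        shared = profiles[a] & profiles[b]
        if name_similarity(a, b) >= min_name_sim and len(shared) >= min_shared:
            candidates.append((a, b, sorted(shared)))
    return candidates

profiles = {
    "Xun Xu": {"A. Smith", "B. Lee"},
    "Xun W. Xu": {"B. Lee"},
    "Yan Chen": {"B. Lee"},
}
print(synonym_candidates(profiles))
```

Candidates produced this way would still need the manual check mentioned on the slide; venues and topics could be added as further filters in the same style.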
Last resort…
Observation: Additional meta data can improve the quality of the detection of synonyms and homonyms.
Example: ORCID
• Provides persistent digital identifier for authors
• Includes additional author-provided meta data about publications, affiliations, …
• API & dumps
Data often incomplete or not fully correct
Before 1990?
ORCID for Homonym Detection
ORCIDs of authors with this name who claimed at least one publication in this profile
After import of 625,000 ORCIDs: 1,000 candidates for homonyms
Top candidate: 10 persons in one profile
BUT:
ORCID for Synonym Detection
Profiles with common ORCID include papers from the same author (but maybe other papers as well due to homonyms)
After import of 625,000 ORCIDs: 4,500 candidates for synonyms
Top candidate: 6 profiles with same ORCID
X. Xu, X. W. Xu, X. William Xu, Xun Xu, Xun W. Xu, Xun William Xu
Technical University of Catalonia, Barcelona
Universidad Carlos III de Madrid
BUT:
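Both ORCID-based checks reduce to grouping claims in two directions. The following is a minimal sketch, assuming the imported data is available as (profile, ORCID) pairs; the function name, the input format, and the ORCID values in the example are hypothetical.

```python
from collections import defaultdict

def orcid_candidates(claims):
    """Derive homonym and synonym candidates from ORCID claims.

    claims: iterable of (profile, orcid) pairs, meaning that the ORCID
    record claims at least one publication listed in that profile
    (hypothetical input format).
    Returns (homonyms, synonyms):
      homonyms: profile -> ORCIDs, for profiles claimed by several ORCIDs
      synonyms: ORCID -> profiles, for ORCIDs spread over several profiles
    """
    orcids_by_profile = defaultdict(set)
    profiles_by_orcid = defaultdict(set)
    for profile, orcid in claims:
        orcids_by_profile[profile].add(orcid)
        profiles_by_orcid[orcid].add(profile)

    homonyms = {p: sorted(o) for p, o in orcids_by_profile.items() if len(o) > 1}
    synonyms = {o: sorted(p) for o, p in profiles_by_orcid.items() if len(p) > 1}
    return homonyms, synonyms

# Made-up identifiers, for illustration only.
claims = [
    ("Wei Wang", "0000-0000-0000-0001"),
    ("Wei Wang", "0000-0000-0000-0002"),
    ("Xun Xu", "0000-0000-0000-0003"),
    ("Xun W. Xu", "0000-0000-0000-0003"),
]
homonyms, synonyms = orcid_candidates(claims)
print(homonyms)
print(synonyms)
```

As the slides note, the results are only candidates: ORCID data is often incomplete or not fully correct, so each hit needs verification.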
Outline
• Meta Data Harvesting
• Author Disambiguation
• Existing Metadata Collections
• Citations
Useful information not (always) in dblp
• Author affiliations
• Keywords
• Topics
• Abstracts
• Full texts
• Incoming and outgoing citations
• Performance indicators
• …
Benefits: better disambiguation, better search, better result ranking, better conference selection
Sources for Bibliographic Metadata
• Dblp.org
• Semantic Scholar
• AMiner Open Academic Graph (includes Microsoft Academic Graph)
• Springer SciGraph
• CrossRef
• OpenCitations
• …
Overview: properties of sources
| | Semantic Scholar | OAG-AMiner | OAG-Microsoft Academic | Springer SciGraph | CrossRef | OpenCitations |
|---|---|---|---|---|---|---|
| coverage | CS | universal | universal | Springer | universal | universal |
| # publs | 7.2 million | 154 million | 166 million | ~12 million | 96 million | ~300,000 |
| in dblp | 1.45 million | 3.46 million | 3.57 million | ? | ? | ? |
| access | dump | dump | dump | API, dump | API | API, dump |
| size | 20 GB | 39 GB | 103 GB | ~200 GB | - | 3.5 GB |
| date | Oct 2017 | Mar 2017 | Jun 2017 | Nov 2017 | live | Dec 2017 |
Keywords
Topics
Abstracts partial
Full-texts
Citations planned partial
DOIs
Author aff. email partial
Funding partial
SpringerNature SciGraph
• Linked Open Data with rich ontology
• funders, research projects, conferences, affiliations and publications from SpringerNature and partners
• extension to citations, patents, clinical trials and usage numbers planned
• CC BY 4.0 license (NC for abstracts)
http://www.springernature.com/gp/researchers/scigraph
OpenCitations
• Initiative for Open Citations (I4OC): collaboration between scholarly publishers and researchers to promote the unrestricted availability of scholarly citation data
• As of January 2018, 50% of publications at CrossRef with open references
• OpenCitations: publishes open citations from CrossRef as RDF-based collection, using SPAR ontology
• COCI, the OpenCitations Index of Crossref open DOI-to-DOI references:
– 316,243,802 citations
– 45,145,889 bibliographic resources
http://opencitations.net/index/coci
CrossRef Example
"reference":[{"key":"38_CR1","unstructured":"Alexander, K., Cyganiak, R., Hausenblas, M., Zhao, J.: Describing linked datasets - on the design and usage of void. In: LinkedData on the Web Workshop (LDOW 2009), in Conjunction with WWW 2009 (2009)"},{"key":"38_CR2","unstructured":"Buil-Aranda, C., Corcho, O., Arenas, M.: Semantics and Optimization of the SPARQL 1.1 Federation Extension. In: Antoniou, G., Grobelnik, M., Simperl, E., Parsia, B., Plexousakis, D., De Leenheer, P., Pan, J. (eds.) ESWC 2011. LNCS, vol.\u00a06644, pp. 1\u201315. Springer, Heidelberg (2011)","DOI":"10.1007\/978-3-642-21064-8_1","doi-asserted-by":"crossref"}, …]
http://api.crossref.org/works/10.1007/978-3-642-25073-6_38
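The reference list shown above is part of the `message` object returned by the Crossref works endpoint. As a minimal sketch, the Python below fetches a work and separates references that already carry a DOI from those that only have an unstructured string. The helper names `fetch_work` and `split_references` are assumptions for this example, and the HTTPS variant of the URL is used.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

def fetch_work(doi):
    """Fetch Crossref metadata for a DOI (requires network access)."""
    url = "https://api.crossref.org/works/" + quote(doi)
    with urlopen(url) as response:
        return json.load(response)["message"]

def split_references(work):
    """Split a work's references into those with and without a DOI."""
    references = work.get("reference", [])
    with_doi = [r for r in references if "DOI" in r]
    without_doi = [r for r in references if "DOI" not in r]
    return with_doi, without_doi

# Abbreviated offline sample in the shape of the slide's example.
sample_work = {
    "reference": [
        {"key": "38_CR1", "unstructured": "Alexander, K., ...: Describing linked datasets ..."},
        {"key": "38_CR2", "unstructured": "Buil-Aranda, C., ...",
         "DOI": "10.1007/978-3-642-21064-8_1", "doi-asserted-by": "crossref"},
    ]
}
with_doi, without_doi = split_references(sample_work)
print([r["key"] for r in with_doi], [r["key"] for r in without_doi])
```

To run against the live API, call `split_references(fetch_work("10.1007/978-3-642-25073-6_38"))`. References without a DOI are the ones that need a citation parser and textual matching.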
CrossRef Example
https://doi.org/10.1007/978-3-642-16898-7
Problems of these Collections
• Update Frequency
• Data Quality
• Completeness / Coverage / Sparsity
Data Quality: Automatic Extraction
Strange names, not linked to a profile
@article{liang12keep,
title={How to keep a knowledge base synchronized with its encyclopedia source},
author={Liang12, Jiaqing and Zhang, Sheng and Xiao134, Yanghua}
}
No info on venue, year, …
Requires data cleaning
Data Quality: What is a Publication?
Frequent problem: conference paper + followup journal paper
TOIS?
cites TOIS
cites ICTIR
cites TOIS
Towards Quantifying Coverage:
Mapping papers & citations to dblp
Preprocessing: Index all dblp entries in Lucene
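[Diagram: mapping a record to dblp]
• Input record: Authors, Title, Venue, Year, DOI, …
• Lookup in DOI Index and Title Index
• Ranked candidates with scores, e.g., key1 (14.1), key2 (12.7), key3 (11.5), …
• Post-filter by author overlap, venue similarity, temporal proximity, …
• Output: dblp key or no match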
Mapping quality in general very good
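The mapping steps (DOI lookup, title search, post-filter) can be sketched without Lucene. The Python below is an illustration only: token-set Jaccard similarity stands in for Lucene's title scoring, the thresholds and the record and entry formats are assumptions, and the post-filter covers only author overlap and temporal proximity.

```python
import re

def title_tokens(title):
    """Lowercased alphanumeric tokens of a title."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))

def map_to_dblp(record, entries, min_title_score=0.8, max_year_gap=1):
    """Return the dblp key matching `record`, or None for no match.

    entries: dict dblp key -> {"title", "authors", "year", "doi"}
    (hypothetical format). The DOI index is rebuilt per call for
    brevity; a real system would build it once.
    """
    # Step 1: exact DOI lookup.
    doi_index = {e["doi"].lower(): key for key, e in entries.items() if e.get("doi")}
    doi = (record.get("doi") or "").lower()
    if doi in doi_index:
        return doi_index[doi]

    # Step 2: rank candidates by title similarity (Jaccard stand-in).
    query = title_tokens(record.get("title", ""))
    candidates = []
    for key, entry in entries.items():
        target = title_tokens(entry["title"])
        if query and target:
            score = len(query & target) / len(query | target)
            if score >= min_title_score:
                candidates.append((score, key))

    # Step 3: post-filter by author overlap and temporal proximity.
    year = record.get("year")
    for score, key in sorted(candidates, reverse=True):
        entry = entries[key]
        shares_author = bool(set(record.get("authors", [])) & set(entry["authors"]))
        close_in_time = year is None or abs(year - entry["year"]) <= max_year_gap
        if shares_author and close_in_time:
            return key
    return None

entries = {
    "conf/esws/ArandaAC11": {
        "title": "Semantics and Optimization of the SPARQL 1.1 Federation Extension",
        "authors": ["Buil-Aranda", "Corcho", "Arenas"],  # surnames only, simplified
        "year": 2011,
        "doi": "10.1007/978-3-642-21064-8_1",
    }
}
print(map_to_dblp({"doi": "10.1007/978-3-642-21064-8_1"}, entries))
```

Venue similarity is left out to keep the sketch short; it would be one more condition in step 3.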
Coverage of dblp and Overlap
[Venn diagram: dblp publications found in Semantic Scholar, Microsoft Academic, and AMiner; not drawn to scale. Region sizes shown: 1.38 million, 1.97 million, 0.02 million, 0.04 million, 0.18 million]
0.26 million from dblp missing
Overlap of dblp and CrossRef
DOI-based match in February 2018
• 4 million publications in dblp
• 3.2 million with DOI
• 3.1 million found in CrossRef
• 600,000 with citations (~15%)
– 16 million citation instances
– 4 million mapped based on DOI
– ~1 million mapped based on reference string (using a Cermine parser)
Main Observation:
• All collections are too incomplete or too static to be useful in production.
• Initiative for Open Citations has effect, but still limited for computer science
Outline
• Meta Data Harvesting
• Author Disambiguation
• Existing Metadata Collections
• Citations
Scientific Challenge:
Make bibliometric measures aware of incompleteness and possible errors
Provide confidence intervals for bibliometric measures
Possible uses of citations in dblp
• Estimate importance of conferences (to decide if and when a conference should be added, or if it is "fake science")
• Identify publication venues where coverage in dblp is incomplete (and missing part is important)
• Identify important new publication venues
DIY-Extraction from PDFs
• ScienceParse by Allen Institute for AI
• Reads (OCR'ed) PDF as input
• Yields
– Abstract
– Authors with Emails
– Full text with (some) structure
– Citations with (some) structure
https://github.com/allenai/science-parse
Meta data
"name" : "204.pdf",
"metadata" : {
"source" : "CRF",
"title" : "The Odyssey Approach for Optimizing Federated SPARQL Queries",
"authors" : [ "Gabriela Montoya", "Hala Skaf-Molli", "Katja Hose" ],
"emails" : [ "[email protected]", "[email protected]", "[email protected]" ],
"sections" : [ {
"heading" : "1 Introduction",
"text" : "Federated SPARQL query engines [1, 4, 7, …"
Grobid 0.5.1 output
Citations
"references" : [ {
"title" : "ANAPSID: An Adaptive Query Processing Engine for SPARQL Endpoints",
"author" : [ "M. Acosta", "M. Vidal", "T. Lampo", "J. Castillo", "E. Ruckhaus"],
"venue" : "In ISWC’11,",
"citeRegEx" : "1",
"shortCiteRegEx" : "1",
"year" : 2011
}, {
"title" : "Describing Linked Datasets",
"author" : [ "K. Alexander", "R. Cyganiak", "M. Hausenblas", "J. Zhao" ],
"venue" : "In LDOW’09,",
"citeRegEx" : "2",
"shortCiteRegEx" : "2",
"year" : 2009
}, …
Grobid 0.5.1 output
Citation Contexts
"referenceMentions" : [ {
"referenceID" : 0,
"context" : "Federated SPARQL query engines [1, 4, 7, 14, 17] answer SPARQL queries over a federation of SPARQL endpoints.",
"startOffset" : 31,
"endOffset" : 48
}, {
"referenceID" : 0,
"context" : "With limited access to statistics, however, most federated query engines rely on heuristics [1, 17] to reduce the huge space of possible plans or on dynamic programming (DP) [5, 7] to produce optimal plans.",
"startOffset" : 92,
"endOffset" : 99
}, …
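In the output above, `startOffset` and `endOffset` appear to be character positions within `context`, with the end offset exclusive; this reading is inferred from the two examples, not from the tool's documentation. Under that assumption, a short Python sketch recovers the citation marker for each mention (the function name `cited_spans` is hypothetical).

```python
def cited_spans(reference_mentions):
    """Return (referenceID, marker text) pairs for each mention.

    Assumes offsets are character positions in "context" and that
    endOffset is exclusive.
    """
    spans = []
    for mention in reference_mentions:
        marker = mention["context"][mention["startOffset"]:mention["endOffset"]]
        spans.append((mention["referenceID"], marker))
    return spans

mentions = [
    {
        "referenceID": 0,
        "context": "Federated SPARQL query engines [1, 4, 7, 14, 17] answer SPARQL "
                   "queries over a federation of SPARQL endpoints.",
        "startOffset": 31,
        "endOffset": 48,
    }
]
print(cited_spans(mentions))
```

The text surrounding the marker is the citation context that can later be attached to the cited dblp entry.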
Fulltext Data Set
• Manual collection of available PDFs from DVDs, Web sites, …
– (Almost) complete set of publications for DB/IR: SIGMOD, SIGIR, (P)VLDB, EDBT, TOIS, TODS, IR, TKDE, CACM, …
– ACL Anthology
– WWW, IJCAI, ISWC, CoRR, …
• Semi-automatic mapping to dblp keys
• Extraction of full text, citations, citation contexts
~170,000 documents available (plus ~30,000 waiting to be processed)
Evaluating Mapping Quality for Citations
96 papers from PVLDB Volume 10, converted with ScienceParse
• 3084 manually annotated citations
• 2700 with well-defined match in dblp
Results (with the best parameter setting, no systematic evaluation):
• Recall: ~96%
• Precision: ~97.5%
A lot worse on old, OCR'ed publications until ~2000 (finding citations & segmentation fails, OCR errors, …)
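As a reminder of how the two figures are defined, the sketch below computes precision and recall for a citation-to-dblp mapping. It assumes the gold standard contains only citations with a well-defined dblp match. The citation ids and dblp keys are hypothetical toy data, not taken from the evaluation.

```python
def precision_recall(gold, predicted):
    """Compare predicted dblp keys against manually annotated ones.

    gold:      dict citation id -> correct dblp key (well-defined matches only)
    predicted: dict citation id -> dblp key proposed by the mapper
               (citations the mapper could not map are simply absent)
    """
    correct = sum(1 for cid, key in predicted.items() if gold.get(cid) == key)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: four annotated citations, three mapped, two correctly.
gold = {"c1": "key/A", "c2": "key/B", "c3": "key/C", "c4": "key/D"}
predicted = {"c1": "key/A", "c2": "key/B", "c3": "key/X"}

p, r = precision_recall(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f}")
```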
Evaluating Mapping Quality for Citations
Try to re-find references from Crossref already mapped to DOIs
{"key":"38_CR2","unstructured":"Buil-Aranda, C., Corcho, O., Arenas, M.: Semantics and Optimization of the SPARQL 1.1 Federation Extension. In: Antoniou, G., Grobelnik, M., Simperl, E., Parsia, B., Plexousakis, D., De Leenheer, P., Pan, J. (eds.) ESWC 2011. LNCS, vol.\u00a06644, pp. 1\u201315. Springer, Heidelberg (2011)","DOI":"10.1007\/978-3-642-21064-8_1","doi-asserted-by":"crossref"}
• DOI match: conf/esws/ArandaAC11
• Parse with citation parser & textual match: journals/ws/ArandaACP13
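The comparison on this slide can be sketched as follows: the DOI asserted by Crossref gives the expected dblp key, and the key returned by the parse-and-match pipeline is judged against it. The `doi_to_dblp` lookup and the `classify` helper are hypothetical stand-ins for illustration; the record is the one above with the `unstructured` field left out for brevity.

```python
import json

# Crossref reference from the slide, without the long "unstructured" field.
record = json.loads(
    r'{"key":"38_CR2","DOI":"10.1007\/978-3-642-21064-8_1",'
    r'"doi-asserted-by":"crossref"}'
)

# Hypothetical DOI -> dblp key lookup; the single entry is the DOI match
# shown on the slide.
doi_to_dblp = {"10.1007/978-3-642-21064-8_1": "conf/esws/ArandaAC11"}


def classify(expected_key, found_key):
    """Judge the key found by textual matching against the DOI-based key."""
    if found_key is None:
        return "not matched"
    return "correct" if found_key == expected_key else "incorrect"


expected = doi_to_dblp[record["DOI"]]
# The textual match on the slide returned a different dblp record.
result = classify(expected, "journals/ws/ArandaACP13")
print(expected, result)
```

In this example the two routes disagree, so the textual match counts as an incorrect match.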
Evaluating Mapping Quality for Citations
• 2,516 articles from SIGIR, ECIR, SIGMOD, ISWC, TOIS, IR Journal, VLDB Journal from 2013 or newer, with 29,932 matching citations in Crossref
• Citation parsing with Cermine: 22,282 correctly matched (74.4%), 160 incorrect matches, 7,283 not matched
• Citation parsing with Grobid: 27,993 correctly matched (93.5%), 317 incorrect matches, 1,125 not matched
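The shares above follow from dividing the correctly matched citations by the 29,932 Crossref citations. The sketch below recomputes them and, as an additional reading not stated on the slide, the precision among the citations that were matched at all. The three counts listed per tool add up to slightly less than 29,932 (29,725 and 29,435); the slide does not explain the difference, so the total is used as the denominator here.

```python
TOTAL_CITATIONS = 29932  # matching citations in Crossref, from the slide

# (correctly matched, incorrect matches, not matched) per citation parser
counts = {
    "Cermine": (22282, 160, 7283),
    "Grobid": (27993, 317, 1125),
}


def summarize(correct, incorrect, unmatched, total=TOTAL_CITATIONS):
    return {
        # share of all Crossref citations that were re-found correctly
        "share_correct": correct / total,
        # precision among citations for which some match was returned
        "precision_of_matches": correct / (correct + incorrect),
        # sum of the three counts listed on the slide
        "listed": correct + incorrect + unmatched,
    }


summary = {name: summarize(*c) for name, c in counts.items()}
for name, s in summary.items():
    print(f"{name}: {s['share_correct']:.1%} correct, "
          f"{s['precision_of_matches']:.1%} precision, {s['listed']} listed")
```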
Experiment on CoRR Jan-Jun 2017
| venue | matched | not matched | overall | missing | found |
|---|---|---|---|---|---|
| cvpr | 5120 | 47 | 5167 | 0.91% | 99.09% |
| advances in neural information processing systems | 4205 | 66 | 4271 | 1.55% | 98.45% |
| nips | 2795 | 60 | 2855 | 2.10% | 97.90% |
| ieee conference on computer vision and pattern recognition | 2806 | 43 | 2849 | 1.51% | 98.49% |
| corr | 2327 | 45 | 2372 | 1.90% | 98.10% |
| ieee transactions on information theory | 2004 | 71 | 2075 | 3.42% | 96.58% |
| ieee trans. inf. theory | 2005 | 61 | 2066 | 2.95% | 97.05% |
| iccv | 1807 | 27 | 1834 | 1.47% | 98.53% |
| eccv | 1809 | 14 | 1823 | 0.77% | 99.23% |
| journal of machine learning research | 1519 | 281 | 1800 | 15.61% | 84.39% |
| icml | 1714 | 70 | 1784 | 3.92% | 96.08% |
| phd thesis | 577 | 1160 | 1737 | 66.78% | 33.22% |
| ieee transactions on pattern analysis and machine intelligence | 1553 | 69 | 1622 | 4.25% | 95.75% |
| (no venue name) | 464 | 1146 | 1610 | 71.18% | 28.82% |
| international conference on machine learning | 1486 | 46 | 1532 | 3.00% | 97.00% |
| ieee trans. wireless commun | 1328 | 62 | 1390 | 4.46% | 95.54% |
| technical report | 235 | 1035 | 1270 | 81.50% | 18.50% |
| ieee | 868 | 350 | 1218 | 28.74% | 71.26% |
| ieee trans. signal process | 1049 | 45 | 1094 | 4.11% | 95.89% |
| neural computation | 1046 | 40 | 1086 | 3.68% | 96.32% |
| ieee transactions on signal processing | 949 | 73 | 1022 | 7.14% | 92.86% |
| ieee transactions on automatic control | 806 | 197 | 1003 | 19.64% | 80.36% |
| ieee transactions on image processing | 944 | 44 | 988 | 4.45% | 95.55% |

Most frequently extracted venues (after some normalization)
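Several venues still appear under more than one spelling after normalization (for example the full and abbreviated NIPS and IEEE Transactions on Information Theory names). A minimal sketch of merging such rows and recomputing the missing rate is given below. The `ALIASES` table is a hypothetical hand-made mapping, not the normalization used for the experiment; the counts are taken from the slide.

```python
from collections import defaultdict

# (venue, matched, not matched) rows copied from the slide
rows = [
    ("cvpr", 5120, 47),
    ("advances in neural information processing systems", 4205, 66),
    ("nips", 2795, 60),
    ("ieee transactions on information theory", 2004, 71),
    ("ieee trans. inf. theory", 2005, 61),
]

# Hypothetical alias table mapping spelling variants to one canonical name.
ALIASES = {
    "advances in neural information processing systems": "nips",
    "ieee trans. inf. theory": "ieee transactions on information theory",
}


def merge(rows, aliases):
    """Sum matched / not matched counts per canonical venue name."""
    totals = defaultdict(lambda: [0, 0])
    for venue, matched, not_matched in rows:
        canonical = aliases.get(venue, venue)
        totals[canonical][0] += matched
        totals[canonical][1] += not_matched
    return {venue: tuple(t) for venue, t in totals.items()}


def missing_rate(matched, not_matched):
    """Share of extracted citations for this venue not found in dblp."""
    return not_matched / (matched + not_matched)


merged = merge(rows, ALIASES)
for venue, (matched, not_matched) in merged.items():
    print(f"{venue}: {missing_rate(matched, not_matched):.2%} missing")
```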
Experiment on CoRR Jan-Jun 2017
Venues with significant holes in dblp

| venue | found | not found |
|---|---|---|
| phd thesis | 577 | 1160 |
| (no venue name) | 464 | 1146 |
| technical report | 235 | 1035 |
| science | 32 | 578 |
| springer science & business media | 104 | 520 |
| national academy of sciences | 87 | 506 |
| cambridge university press | 207 | 463 |
| plos one | 215 | 431 |
| nature | 342 | 427 |
| physical review e | 113 | 370 |
| journal of the american statistical association | 17 | 360 |
| ieee | 868 | 350 |
| the annals of statistics | 36 | 326 |
| springer | 195 | 320 |
| journal of machine learning research | 1519 | 281 |
| ieee transactions on power systems | 34 | 280 |
| physical review letters | 54 | 268 |
| crc press | 36 | 203 |
| ieee transactions on automatic control | 806 | 197 |
| master's thesis | 17 | 185 |
| ieee signal processing magazine | 240 | 179 |

Mismatches (no dblp coverage)
Experiment on CoRR Jan-Jun 2017
Venues that could not be matched to dblp

| venue | not found |
|---|---|
| the annals of mathematical statistics | 168 |
| psychological review | 152 |
| journal of the royal statistical society. series b | 139 |
| journal of personality and social psychology | 96 |
| journal of statistical software | 85 |
| american journal of sociology | 69 |
| behavior research methods | 68 |
| econometrica: journal of the econometric society | 64 |
| biglearn | 54 |
| wiley online library | 53 |
| naval research logistics quarterly | 49 |
| cognitive psychology | 46 |
| the journal of physiology | 45 |
| annual review of sociology | 45 |
| journal of marketing research | 44 |
| monthly weather review | 44 |
| mathematische annalen | 43 |
| problemy peredachi informatsii | 42 |
| biometrics | 40 |

Categories marked on the slide: Math, Sociology, Psychology, Other Sciences
biglearn: missing NIPS workshop (no longer available)
Conclusion
• Open metadata is becoming more important and more widely available
• Quality and scope of the available metadata are still unclear
• Bibliometric measures must take this uncertainty into account
Future Work for dblp
• Integrate with more data providers (currently ORCID and WikiData)
• Connect to bibliographic data providers from other domains
• Develop model for conference series and events
• Include references to published data (e.g., DataCite)
Future Work for Research
• Collect more extensive metadata for conferences
– Organizers
– Members of the program committee
– Reviewers
– Keynote speakers
– …
• Exploit this information for better estimation of the reputation of scientists (and of conferences)
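One possible shape for such conference metadata is sketched below as a Python dataclass. The class name, field names, and the `roles_of` helper are purely illustrative assumptions and do not describe an existing dblp schema; the example values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ConferenceEvent:
    """Hypothetical record for one edition of a conference series."""
    series: str
    year: int
    organizers: List[str] = field(default_factory=list)
    pc_members: List[str] = field(default_factory=list)
    reviewers: List[str] = field(default_factory=list)
    keynote_speakers: List[str] = field(default_factory=list)

    def roles_of(self, person: str) -> List[str]:
        """List the roles a person held at this event, in alphabetical order."""
        roles = {
            "organizers": self.organizers,
            "pc_members": self.pc_members,
            "reviewers": self.reviewers,
            "keynote_speakers": self.keynote_speakers,
        }
        return sorted(name for name, people in roles.items() if person in people)


# Placeholder data for illustration only.
event = ConferenceEvent(
    series="EXAMPLE",
    year=2018,
    organizers=["Person A"],
    pc_members=["Person A", "Person B"],
)
print(event.roles_of("Person A"))
```

Role lists like these, collected over many events, could feed into the reputation estimates mentioned above.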