CiteGraph: A Citation Network System for MEDLINE Articles and Analysis
Qing Zhang1,2, Hong Yu1,3
1University of Massachusetts Medical School, Worcester, MA, USA
2University of Wisconsin Milwaukee, Milwaukee, WI, USA
3VA Central Massachusetts, Leeds, MA, USA
CiteGraph, MedInfo 2013
Outline
Introduction
Background
Method
Evaluation
Analysis
Introduction
Citation networks are important for
• Information retrieval
• Journal Impact Factor, H-index
Co-authorship networks are important
Few citation networks are available for research
We built CiteGraph
Background
Citation network analysis
• Power law distribution in citation networks
• Article ranking, HITS and PageRank
• Community structure of physics fields
• Citation network tool for a given legal issue, using a legal document citation network
Co-authorship network analysis
• Research collaboration patterns
• Author authority: Erdős number
Literature search
• CiteSeerX, Google Scholar
The CiteGraph Data
Citation Network Example
Challenges
(1) Yu, H and Lee M. 2006. Accessing Bioscience Images from Abstract Sentences. Bioinformatics. Vol 22 No. 14, pages e547–e556.
(2) Hong Yu and Minsuk Lee. Accessing Bioscience Images from Abstract Sentences. Bioinformatics. Vol 22 No. 14, pages e547–e556. 2006.
(3) Yu H, Lee H. 2006. Accessing Bioscience Images from Abstract Sentences. Bioinformatics: 22 (14), e547–e556.
Methods
Mapping between articles
Mapping articles to the PubMed ID
Author name disambiguation
Methods
If two of the following three matching results are true, we consider the two entities (for example, the citation and the article) to be matched.
Title matching
• The set of tokens contained in one title field is a subset of the tokens in the other,
• or the number of tokens common to both fields is more than 80% of the size of the larger of the two fields.
Author list matching
• The two lists of surnames have a one-to-one mapping
• The surnames in one entity (citation) are fully contained in the surname set of the second (article).
Journal name matching
• Remove stop words such as “of”
• If the number of common initials in the journal titles is greater than 80% of the tokens in the longer journal name, the names are considered equivalent.
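The three matching rules can be sketched in Python as follows. This is an illustrative sketch, not the CiteGraph implementation: the function names, the tokenizer, the stop-word list beyond “of”, and the reading of "two of the following" as "at least two" are all assumptions. The author rule is implemented as surname containment, which also covers the one-to-one case.

```python
import re
from collections import Counter

# Assumed stop-word list; the slides only name "of" as an example.
STOP_WORDS = {"of", "the", "and", "for", "in", "on"}

def tokens(text):
    """Lowercase a field and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def titles_match(a, b, threshold=0.8):
    """Subset test, or shared tokens exceed 80% of the larger token set."""
    ta, tb = set(tokens(a)), set(tokens(b))
    if not ta or not tb:
        return False
    if ta <= tb or tb <= ta:
        return True
    return len(ta & tb) > threshold * max(len(ta), len(tb))

def authors_match(citation_surnames, article_surnames):
    """Every surname in the citation appears in the article's surname set."""
    cited = {s.lower() for s in citation_surnames}
    known = {s.lower() for s in article_surnames}
    return bool(cited) and cited <= known

def journals_match(a, b, threshold=0.8):
    """Compare word initials after dropping stop words."""
    ia = Counter(t[0] for t in tokens(a) if t not in STOP_WORDS)
    ib = Counter(t[0] for t in tokens(b) if t not in STOP_WORDS)
    if not ia or not ib:
        return False
    common = sum((ia & ib).values())
    longer = max(sum(ia.values()), sum(ib.values()))
    return common > threshold * longer

def entities_match(citation, article):
    """Declare a match when at least two of the three rules agree (assumption)."""
    votes = [
        titles_match(citation["title"], article["title"]),
        authors_match(citation["surnames"], article["surnames"]),
        journals_match(citation["journal"], article["journal"]),
    ]
    return sum(votes) >= 2
```

For example, a citation with surnames Yu and Lee, the title "Accessing Bioscience Images from Abstract Sentences", and the journal "Bioinformatics" would match an article record carrying the same fields in different capitalization, since all three rules hold.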
Evaluation Results
Task              Precision  Recall  F1    Inter-Annotator Agreement (Kappa)
Citation Mapping  1          0.96    0.98  1
PMID Mapping      0.99       0.99    0.99  1
• Seven annotators were invited to annotate the citation mapping and PMID mapping results
• Each annotator was presented with 20 matching results for each task
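The reported F1 values are consistent with the standard harmonic mean of precision and recall. A minimal check, assuming F1 = 2PR / (P + R) (the slides do not state the formula, and the function name is illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the evaluation table.
for task, p, r in [("Citation Mapping", 1.0, 0.96), ("PMID Mapping", 0.99, 0.99)]:
    print(f"{task}: F1 = {f1_score(p, r):.2f}")
```

Rounded to two decimals, this gives 0.98 for citation mapping and 0.99 for PMID mapping, matching the table.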
The CiteGraph Statistics
1.65 M articles
6.35 M citations
1.37 M authors
The CiteGraph Statistics
log y = 1.06 - 2.45 * log x (p < 0.05, t-test)
Livak KJ, Schmittgen TD. Analysis of relative gene expression data using real-time quantitative PCR and the 2(-Delta Delta C(T)) Method. Methods. 2001 Dec;25(4):402-8.
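A linear relation between log y and log x is the signature of a power law, y = c * x^k, with k as the slope. The sketch below shows how such a line can be fitted by ordinary least squares in log-log space. It is only an illustration: the data are synthetic points generated from the reported coefficients, the base-10 logarithm is an assumption (the slide does not give the base), and `fit_loglog` is a hypothetical helper name.

```python
import math

def fit_loglog(xs, ys):
    """Least-squares fit of log10(y) = intercept + slope * log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Synthetic points lying exactly on the reported line (not CiteGraph data).
xs = [1, 2, 5, 10, 20, 50, 100]
ys = [10 ** (1.06 - 2.45 * math.log10(x)) for x in xs]

intercept, slope = fit_loglog(xs, ys)
print(f"log y = {intercept:.2f} + ({slope:.2f}) * log x")
```

On real degree counts the points scatter around the line, so the fitted coefficients come with an error that a significance test, such as the t-test mentioned on the slide, would assess.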
The CiteGraph Statistics
Largest connected component: 1.27 million authors (92.7%)
Second largest connected component: 35 authors
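Component sizes like these can be computed from a co-authorship edge list with a union-find structure. The following is a minimal sketch on toy data, not the procedure used for CiteGraph; `component_sizes` and the author labels are hypothetical, and authors with no co-authors would need to be added separately because they never appear in an edge.

```python
from collections import Counter

def component_sizes(edges):
    """Return connected-component sizes, largest first, for an undirected edge list."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v  # merge the two components

    sizes = Counter(find(node) for node in list(parent))
    return sorted(sizes.values(), reverse=True)

# Toy co-authorship edges: A-B-C form one component, D-E another.
toy_edges = [("A", "B"), ("B", "C"), ("D", "E")]
print(component_sizes(toy_edges))
```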
The CiteGraph Statistics
Co-authorship spans range from 1 to 35 years, but 83.7% of author pairs appear together only once.
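One way to compute a year span per author pair is sketched below. The definition is an assumption chosen to fit the reported minimum of 1: the span is the last co-publication year minus the first, plus one, so a pair that publishes together in a single year has a span of 1. The input layout and the name `coauthor_year_spans` are hypothetical.

```python
from itertools import combinations

def coauthor_year_spans(papers):
    """Map each unordered author pair to its co-authorship year span.

    `papers` is an iterable of (year, author_list) tuples (assumed layout).
    """
    first_last = {}
    for year, authors in papers:
        # Sorting makes (a, b) and (b, a) the same dictionary key.
        for pair in combinations(sorted(set(authors)), 2):
            lo, hi = first_last.get(pair, (year, year))
            first_last[pair] = (min(lo, year), max(hi, year))
    return {pair: hi - lo + 1 for pair, (lo, hi) in first_last.items()}

# Toy records: authors A and B publish together in 2001 and again in 2006.
toy_papers = [(2001, ["A", "B"]), (2006, ["B", "A", "C"])]
spans = coauthor_year_spans(toy_papers)
print(spans)
```

Here the pair (A, B) has a span of 6 years, while the pairs involving C have a span of 1.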
The CiteGraph Statistics
Measure                  Mean   Median  Std    Max  Min
# of Co-authors          11     6       14     671  0
Co-authorship Year Span  1.521  1       1.576  35   1
* The largest component is excluded when calculating the statistics in the table. Its size is 1.27 million authors (92.7% of all authors).
Trends
Conclusion
We created citation and co-authorship networks from full-text biomedical literature
Our networks are accurate and large in scale, and they can benefit the biomedical text mining community:
• Article ranking
• Research collaboration recommendation
• Social network analysis
The network database is available for download upon request
Acknowledgement
National Institutes of Health 1R01GM095476 to Hong Yu
A start-up fund from University of Massachusetts Medical School to Hong Yu
National Center for Advancing Translational Sciences of the National Institutes of Health under award number UL1TR000161.