©2012 Thomson Reuters
I. Ponomarev, D. Williams, B. Lawton, D. Cross, Y. Seger, J. Schnell, L. Haak
Breakthrough Paper Indicator: Early Detection and Measurement of Ground-Breaking Research
11th International Conference on CRIS Prague, June 9th, 2012
Acknowledgements
• Initial phase of this study was supported by US National Institutes of Health contract #HHSN272200900231U
• Laurel Haak is currently the Executive Director of ORCID, Washington, DC
What is it all about?
• A CRIS aims to assist users in recording, reporting, and decision-making concerning the research process, and in assessing results or transferring technology.
• Thursday workshop “Future Trends in Research Information Management”: How do we recognize advances in technologies early enough?
• We have enough data to discover precursors of outstanding research and emergent technologies at an early stage.
Overview
1. Breakthrough Paper Indicator as a multidimensional vector
2. Citation analysis. Identifying breakthrough thresholds: subject categories, ranking.
3. Identifying time-dependent patterns of citation curves
4. Predicting breakthroughs: models and parameter ranges
5. Statistical validation of results
6. Conclusions
7. Future Directions: quality of citations
Intuitive Definition of Breakthrough Papers
Research activity expands knowledge borders
Gradual knowledge creation: Frequent, moderate extension of knowledge borders
Breakthroughs: Rare events which reshape science
[Figure: knowledge borders extended by gradual progress and by breakthroughs]
Recognition
• Breakthroughs are highly cited
• Recognition by field experts
• Interviews, media coverage
• New key words, coined words
• Citation in systematic reviews, textbooks, policy statements
• Informal citations (author name in titles, abstracts, etc.)
Increased activity
• Increased number of publications in a new research direction
• Conferences, workshops
• More frequent use of related key words
• Increased funding support
Innovation and application
• Patents and applications
• Emerging technologies
Long Term Goal: Breakthrough Paper Indicator as a multidimensional vector
This talk: First 3 dimensions
Citation analysis as a proxy of BP
Data Sets
#1: Web of Science (WoS) one-year publication sets, 1999-2005 (n = 1.67M)
– Very large sample sizes provide good statistics
– Selection of subject categories and highly cited papers
– Used as test data sets to develop a model
#2: Known Breakthrough Papers identified by subject matter experts
– Used for validation of our models
#3: Chemistry and biology subsets of 2005 data in #1
– Used for validation of our models

Year    Total Publications
1999    264,683
2000    273,414
2001    295,693
2002    313,679
2003    330,479
2004    350,980
2005    375,372
Breakthrough Papers Are… highly cited
• Five years of citation data are enough to distinguish the majority of BPs from all other papers.
• Comparison of the 2005 WoS data set (~375k papers) and known BPs (11 papers)
• 73% of known breakthrough papers have more than 100 citations
[Figure: percentage of publications (y-axis) versus actual number of citations at 60 months since publication (x-axis)]
What is an Appropriate Citation Threshold?
[Figure: “Number of Breakthroughs Varies by Citation Threshold”; total breakthrough publications at threshold, plotted for citation thresholds from 0 to 500; legend lists the number of publications per subject category]
Ranking reduces the number of candidate breakthroughs
Thresholds should be dependent on the subject categories
BPs in 2005 by Journal Subject Category
Journal Subject Category        Total Journals   Total Publications   60-month citations for top 0.1%   Total Breakthroughs
Multidisciplinary                           39                6,203                             1,023                     6
Computer Science                            43                3,149                               341                     3
Materials Science                           28                5,682                               322                     5
Mathematics                                 21                3,904                               308                     3
Physics                                     36               17,643                               306                    17
Molecular Biology & Genetics               420               58,354                               296                    58
Biology & Biochemistry                     532               77,324                               282                    78
Clinical Medicine                        1,788              204,483                               277                   204
Chemistry                                  169               51,725                               262                    51
Immunology                                 134               18,555                               258                    18
Engineering                                155               20,110                               242                    20
Geosciences                                 11                  714                               222                     1
Neuroscience & Behavior                    284               33,545                               189                    33
Microbiology                               101               17,396                               176                    17
Environment/Ecology                         75               11,799                               156                    11
Plant & Animal Science                     217               23,137                               136                    23
Psychiatry/Psychology                      315               18,836                               136                    18
Agricultural Sciences                       62                8,749                               125                     8
Pharmacology & Toxicology                  222               26,961                               122                    26
Social Sciences, general                   449               25,209                               119                    25
Economics & Business                        12                  187                                71                     1
Space Science                                6                   78                                40                     1
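A minimal sketch of how a per-category threshold like the one in this table could be computed. The function name, the input layout (pairs of category and 60-month citation count), and the rounding rule are assumptions for illustration, not the authors' implementation. Rounding down with a minimum of one paper per category reproduces the "Total Breakthroughs" column for all but one row of the table.

```python
from collections import defaultdict


def breakthrough_thresholds(papers, top_one_in=1000):
    """Return {category: (citation_threshold, n_breakthroughs)}.

    papers: iterable of (category, citations_at_60_months) pairs (hypothetical layout).
    top_one_in: 1000 selects the top 0.1% of papers in each category.
    """
    by_category = defaultdict(list)
    for category, citations in papers:
        by_category[category].append(citations)

    thresholds = {}
    for category, counts in by_category.items():
        counts.sort(reverse=True)
        # Assumed rounding rule: round down, but keep at least one paper.
        n_top = max(1, len(counts) // top_one_in)
        # The threshold is the citation count of the last paper inside the top slice.
        thresholds[category] = (counts[n_top - 1], n_top)
    return thresholds


if __name__ == "__main__":
    # Synthetic data: category "A" has 3,000 papers, category "B" has 500.
    demo = [("A", c) for c in range(1, 3001)] + [("B", c) for c in range(1, 501)]
    for name, (threshold, n_top) in sorted(breakthrough_thresholds(demo).items()):
        print(name, threshold, n_top)
```

Ranking inside each category, rather than applying one global citation count, is what makes a small field such as Space Science comparable with a large one such as Clinical Medicine.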
Time Dependence of BP Citation Threshold
[Figure: BP threshold (# of citations) versus year, 2006 to 2010, with the projected value of the BP threshold]
The dependence of the threshold on previous years helps to forecast a future value of the BP citation threshold. For the majority of subject categories this dependence is monotonic or independent of time.
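One simple way to project a threshold from previous years is a straight-line trend. The sketch below assumes a least-squares line is adequate, which the slide suggests only loosely (monotonic or time-independent behavior). The function name and the threshold values in the example are hypothetical.

```python
def project_threshold(years, thresholds, target_year):
    """Extrapolate a citation threshold to target_year with a least-squares line."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(thresholds) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, thresholds))
    slope = sxy / sxx
    # Evaluate the line relative to the mean point to limit rounding error.
    return mean_y + slope * (target_year - mean_x)


if __name__ == "__main__":
    # Hypothetical threshold values, not taken from the talk.
    years = [2006, 2007, 2008, 2009, 2010]
    thresholds = [250, 255, 260, 265, 270]
    print(project_threshold(years, thresholds, 2011))
```

A flat series gives a slope of zero, so the same function covers the time-independent case.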
Findings: Breakthrough Papers…
• Should be selected within the same subject category
• Should be selected from a large set of papers published in the same time range (e.g., during the same year or month)
• Should be in the top 0.1% of highly cited papers
• Threshold does not change with time for the majority of research fields
Why is Time Dependent Analysis of Citation Patterns Important?
Question: If three papers have the same citation count after 12 months, which of the three (A, B, C) has the best chance of becoming a breakthrough?
Answer: C
[Figure: citation curves A, B, and C plotted against the breakthrough threshold]
Typical Time Dependence
Observations (statistically tested on thousands of publications):
1. There is an initial period of slow citation growth (information digesting), defined by a cutoff time.
2. The cutoff time is publication specific. In the vast majority of cases it is […] months.
3. After the cutoff there are three different types of behavior: A, B, C.
[Figure: citation curves labeled Long growth, Aging, and Plateau]
What is the initial time period needed for prediction with a certain amount of reliability?
[Figure: two histograms of frequency versus cumulative citations]
Observations
• Before six months, citation patterns have not yet developed.
• 12 to 24 months provide sufficient time to identify key citation patterns (keeping ~90% of BPs)
Linear and Non-linear models
Goal: using data for the first n months, predict whether a paper has breakthrough potential (i.e., will be above the threshold at 60 months)
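The goal above can be sketched as follows. The talk does not give the functional forms of its models, so a straight line and a pure exponential fitted by least squares are assumptions standing in for the "linear" and "non-linear" fits. All function names are hypothetical.

```python
import math


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_linear(months, cumulative, horizon=60):
    """Extrapolate cumulative citations to `horizon` months with c(t) = a + b*t."""
    slope, intercept = least_squares_line(months, cumulative)
    return intercept + slope * horizon


def predict_exponential(months, cumulative, horizon=60):
    """Extrapolate with c(t) = a * exp(b*t), fitted on log(c); needs counts > 0."""
    if any(c <= 0 for c in cumulative):
        raise ValueError("exponential fit requires positive citation counts")
    slope, intercept = least_squares_line(months, [math.log(c) for c in cumulative])
    return math.exp(intercept + slope * horizon)


def has_breakthrough_potential(months, cumulative, threshold,
                               model=predict_linear, horizon=60):
    """True if the extrapolated count reaches the category threshold at `horizon`."""
    return model(months, cumulative, horizon) >= threshold


if __name__ == "__main__":
    months = list(range(1, 13))              # first 12 months of data
    cumulative = [3 * t for t in months]     # synthetic: 3 citations per month
    print(predict_linear(months, cumulative))
    print(has_breakthrough_potential(months, cumulative, threshold=150))
```

The threshold passed to `has_breakthrough_potential` would be the subject-category value, such as the 60-month citation count for the top 0.1% of papers.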
Parameter Ranges for Statistical Analysis
• Time Range:
– 6 months or more
• Minimum Citation Count:
– Only consider publications with >4 citations at 6 months (this eliminated 98% of papers, including 40% of BPs)
Assessing the Models: Precision and Recall
Recall: are we capturing all of the actual breakthroughs?
Recall = number of actual breakthroughs detected / total number of actual breakthroughs in data set
Precision: what percentage of our predictions are actually breakthroughs?
Precision = number of actual breakthroughs detected / total number of predicted breakthroughs
[Figure: timeline marked 0, 1 year, and 5 years with “assess” and “forecast” periods; groups labeled Predicted actual BPs, Predicted false BPs, and Missing BPs]
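The two ratios defined above can be written as a short function. This is an illustrative sketch: the function name and the paper identifiers in the example are hypothetical.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for predicted versus actual breakthrough papers.

    predicted: identifiers of papers the model flags as breakthroughs.
    actual: identifiers of papers that are above the threshold at 60 months.
    """
    predicted, actual = set(predicted), set(actual)
    detected = predicted & actual  # predicted actual BPs
    precision = len(detected) / len(predicted) if predicted else 0.0
    recall = len(detected) / len(actual) if actual else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical identifiers: 5 predictions, 4 of them correct, 8 actual BPs.
    predicted = {"p1", "p2", "p3", "p4", "x1"}
    actual = {"p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"}
    print(precision_recall(predicted, actual))
```

In the terms of the figure, `predicted - actual` gives the predicted false BPs and `actual - predicted` gives the missing BPs.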
Assessing the Models: Chemistry
794 Chemistry publications from 2005 had > 5 citations at 6 months.
30 were breakthroughs (0.1% highest ranked chemistry papers by 60 month citation count)

Linear Fitting
  Months of Data                    6     12     24     36     60
1 # Papers Predicted                1      5     13     22     27
2 # Correctly Predicted             1      4     12     20     26
3 # Incorrectly Predicted           0      1      1      2      1
4 Precision (Row #2/Row #1)      100%    80%    92%    91%    96%
5 Recall (Row #2/30 Actual)        3%    13%    40%    67%    87%

Non-linear Fitting
  Months of Data                    6     12     24     36     60
1 # Papers Predicted              266     51     34     32     29
2 # Correctly Predicted            12     19     23     25     28
3 # Incorrectly Predicted         254     32     11      7      1
4 Precision (Row #2/Row #1)        5%    37%    68%    78%    97%
5 Recall (Row #2/30 Actual)       40%    63%    77%    83%    93%
Intermediate Conclusions
• Optimal breakthrough threshold value of cumulative citations varies by subject category.
• Number of breakthrough papers (above threshold) is determined by the top 0.1% of highly cited publications, and is dependent on the number of publications in the research area (Clinical medicine > Mathematics).
• We observed three typical citation patterns over time.
• We developed a method that predicts potential BP candidates and estimates the probability of reaching the threshold (based on time since publication).
• Two fitting models were tested (linear and non-linear). The linear model gives better results at earlier stages (less than 12 months). The exponential model predicts better for longer monitoring times.
Proposed BP indicator approaches
1. Monitoring should start more than 6 months after publication (12 months or longer is more reliable).
2. Instantaneous list of BP candidates:
• Using at least 6 months of citation data, we can provide a candidate list of BP papers in each journal subject category
• The number of papers in the list may be variable (top 0.1% of ranked papers) rather than a fixed number.
3. Long-term updated list of BP candidates:
• 24 or more months of monitoring
• Selected from a larger pool of publications with 24 < t_pub < 60, based on 24 months of citation data.
Future directions: Quality of Citations
1. Subject interdisciplinarity (how many disciplines?)
2. Geographical diversity (how many different countries?)
3. Prestige vs Popularity (how many in high impact journals?)
• The number of citations is an approximate proxy for scientific impact. Some papers can receive a high citation count but will not be recognized by the community as ground-breaking research
• Supplementary information is necessary to distinguish true path-finding discovery from incremental knowledge generation
Future work: Interdisciplinarity
Meaning: D is the effective number of subject categories (SCs)
About N = 260 different subject categories in WoS
Rao C. R. (1982); Stirling A. (1998); Porter A. L., et al. (2006); Rafols I., Meyer M. (2010)
Future work: Interdisciplinarity
$$D_q = \left( \sum_{i=1}^{N} p_i^{\,q} \right)^{\frac{1}{1-q}}$$
$$D_0 = N$$ - Richness
$$D_1 = \exp\left( -\sum_{i=1}^{N} p_i \log p_i \right)$$ - exp(Shannon Entropy)
$$D_2 = \frac{1}{\sum_{i=1}^{N} p_i^{2}}$$ - Inverse Simpson Index
$$I = 1 - \sum_{i,j}^{N} S_{ij}\, p_i p_j$$ - Rao-Stirling quadratic diversity
$p_i$, $p_j$: proportions of the $i$-th and $j$-th subject categories
$S_{ij}$: similarity matrix
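A small sketch of the four measures named on this slide: richness, exp(Shannon entropy), the inverse Simpson index, and Rao-Stirling quadratic diversity. The Rao-Stirling form used here, 1 minus the sum of S_ij * p_i * p_j over all pairs, is an assumption about the slide's notation. The function name and the inputs are hypothetical.

```python
import math


def diversity_measures(proportions, similarity):
    """Compute diversity measures for one paper's subject-category profile.

    proportions: list of p_i, the share of each subject category (sums to 1).
    similarity: square matrix S with S[i][j] in [0, 1] and S[i][i] == 1.
    """
    present = [p for p in proportions if p > 0]
    shannon_entropy = -sum(p * math.log(p) for p in present)
    n = len(proportions)
    weighted_similarity = sum(
        similarity[i][j] * proportions[i] * proportions[j]
        for i in range(n)
        for j in range(n)
    )
    return {
        "richness": len(present),                               # D_0
        "exp_shannon": math.exp(shannon_entropy),               # D_1
        "inverse_simpson": 1.0 / sum(p * p for p in present),   # D_2
        "rao_stirling": 1.0 - weighted_similarity,              # I (assumed form)
    }


if __name__ == "__main__":
    # Two equally weighted categories that are completely dissimilar.
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(diversity_measures([0.5, 0.5], identity))
```

For equally weighted categories the first three measures all equal the number of categories, which is the sense in which D is an effective number of subject categories.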
Future work: Interdisciplinarity
• All diversity measures are highly correlated with each other
• One needs to measure the difference between the diversity indices of references and of citations
[Figure: distribution of normalized frequencies for citations and references, for all similar publications in the “Distel” similarity set (left) and for citations and references of the “Distel” paper only (right)]