Breakthrough Paper Indicator: Early Detection and Measurement of Ground-Breaking Research

I. Ponomarev, D. Williams, B. Lawton, D. Cross, Y. Seger, J. Schnell, L. Haak

11th International Conference on CRIS, Prague, June 9th, 2012

[email protected]

©2012 Thomson Reuters


Acknowledgements

• The initial phase of this study was supported by US National Institutes of Health contract #HHSN272200900231U

• Laure Haak is currently the Executive Director of ORCID, Washington, DC


What is it all about?

• A CRIS aims at assisting users in recording, reporting, and decision-making concerning the research process, and in assessing results or transferring technology.

• Thursday workshop “Future Trends in Research Information Management”:

How do we recognize advances in technologies early enough?

• We have enough data to discover precursors of outstanding research and emergent technologies at an early stage.


Overview

1. Breakthrough Paper Indicator as a multidimensional vector

2. Citation analysis. Identifying breakthrough thresholds: subject categories, ranking.

3. Identifying time-dependent patterns of citation curves

4. Predicting breakthroughs: models and parameter ranges

5. Statistical validation of results

6. Conclusions

7. Future Directions: quality of citations


Intuitive Definition of Breakthrough Papers

Research activity expands knowledge borders

Gradual progress: gradual knowledge creation; frequent, moderate extension of knowledge borders

Breakthroughs: rare events which reshape science

[Figure: knowledge borders expanding through gradual progress and breakthroughs]

Recognition

• Breakthroughs are highly cited

• Recognition by field experts

• Interviews, media coverage

• New key words, coined words

• Citation in systematic reviews, textbooks, policy statements

• Informal citations (author name in titles, abstracts, etc.)

Increased activity

• Increased number of publications in a new research direction

• Conferences, workshops

• More frequent use of related key words

• Increased funding support

Innovation and application

• Patents and applications

• Emerging technologies


Long-Term Goal: Breakthrough Paper Indicator as a multidimensional vector

This talk: first 3 dimensions

Citation analysis as a proxy of BP


Data Sets

#1: Web of Science (WoS) 1-year publication sets, 1999-2005 (n = 1.67M)

– Very large sample sizes provide good statistics

– Selection of subject categories and highly cited papers

– Used as test data sets to develop a model

#2: Known Breakthrough Papers identified by subject matter experts

– Used for validation of our models

#3: Chemistry and biology subsets of 2005 data in #1.

– Used for validation of our models

Year  Total Publications
1999  264,683
2000  273,414
2001  295,693
2002  313,679
2003  330,479
2004  350,980
2005  375,372


Breakthrough Papers Are… highly cited

• Five years is enough to distinguish the majority of BPs from all other papers.

• Comparison of the 2005 WoS data set (~375k papers) and known BPs (11 papers)

• 73% of known breakthrough papers have more than 100 citations

[Figure: percentage of publications versus actual number of citations at 60 months since publication]


What is an Appropriate Citation Threshold?

[Figure: “Number of Breakthroughs Varies by Citation Threshold”. Axes: citation threshold (0 to 500) and total breakthrough publications at threshold; series are labeled by # of publications.]

Ranking reduces the number of candidate breakthroughs

Thresholds should be dependent on the subject categories


BPs in 2005 by Journal Subject Category

Journal Subject Category      Total Journals  Total Publications  60-Month Citations for Top 0.1%  Total Breakthroughs
Multidisciplinary             39              6,203               1,023                            6
Computer Science              43              3,149               341                              3
Materials Science             28              5,682               322                              5
Mathematics                   21              3,904               308                              3
Physics                       36              17,643              306                              17
Molecular Biology & Genetics  420             58,354              296                              58
Biology & Biochemistry        532             77,324              282                              78
Clinical Medicine             1,788           204,483             277                              204
Chemistry                     169             51,725              262                              51
Immunology                    134             18,555              258                              18
Engineering                   155             20,110              242                              20
Geosciences                   11              714                 222                              1
Neuroscience & Behavior       284             33,545              189                              33
Microbiology                  101             17,396              176                              17
Environment/Ecology           75              11,799              156                              11
Plant & Animal Science        217             23,137              136                              23
Psychiatry/Psychology         315             18,836              136                              18
Agricultural Sciences         62              8,749               125                              8
Pharmacology & Toxicology     222             26,961              122                              26
Social Sciences, general      449             25,209              119                              25
Economics & Business          12              187                 71                               1
Space Science                 6               78                  40                               1
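As a rough illustration of how a per-category top-0.1% threshold could be derived, the sketch below ranks papers within each subject category by 60-month citation count and reads off the count of the lowest-ranked paper still inside the top slice. The function name `breakthrough_thresholds`, the input layout, and the rule of rounding the cutoff rank down with a minimum of one paper per category are assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict


def breakthrough_thresholds(papers, one_in=1000):
    """Return {category: (n_breakthroughs, citation_threshold)}.

    `papers` is an iterable of (category, citations_at_60_months) pairs.
    The top slice holds len(category) // one_in papers, with at least one
    paper per category (an assumption made for this sketch).
    """
    by_category = defaultdict(list)
    for category, citations in papers:
        by_category[category].append(citations)

    result = {}
    for category, counts in by_category.items():
        counts.sort(reverse=True)  # most cited first
        n_top = max(1, len(counts) // one_in)
        # Threshold = citations of the lowest-ranked paper still in the top slice.
        result[category] = (n_top, counts[n_top - 1])
    return result
```

Because the ranking is done per category, a small field still yields at least one candidate, while a large field yields proportionally more.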


Time Dependence of BP Citation Threshold

The threshold's dependence on previous years helps to forecast a future value of the BP citation threshold. For the majority of subject categories this dependence is monotonic or independent of time.

[Figure: BP threshold (# of citations) by year, 2006 to 2010, with the projected value of the BP threshold marked]
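One simple way to project a threshold from earlier years is a straight-line trend. The talk does not state which extrapolation was used, so the ordinary least-squares line below, the function name `forecast_threshold`, and any input values are assumptions for illustration only.

```python
def forecast_threshold(years, thresholds, target_year):
    """Extrapolate a BP citation threshold with an ordinary least-squares line.

    `years` and `thresholds` are equal-length sequences of past observations.
    A linear trend is an assumption of this sketch.
    """
    n = len(years)
    if n < 2 or n != len(thresholds):
        raise ValueError("need at least two (year, threshold) pairs")
    mean_x = sum(years) / n
    mean_y = sum(thresholds) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, thresholds))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope * target_year + intercept
```

A category whose threshold is independent of time produces a slope near zero, so the forecast simply repeats the historical value.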


Findings: Breakthrough Papers…

• Should be selected within the same subject category

• Should be selected from a large set of papers published in the same time range (e.g., during the same year or month)

• Should be in the top 0.1% of highly cited papers

• Threshold does not change with time for the majority of research fields


Why is Time Dependent Analysis of Citation Patterns Important?

Question: If 3 papers have the same citation count after 12 months, which one of the three (A, B, C) has the best chance to become a breakthrough?

[Figure: citation curves of papers A, B, and C relative to the breakthrough threshold]

Answer: C


Typical Time Dependence

Observations (statistically tested on thousands of publications):

1. There is an initial period of slow citation growth (information digesting), defined by a cutoff time.

2. The cutoff time is publication specific. In the vast majority of cases [...] months.

3. After the cutoff there are 3 different types of behavior: A, B, C.

[Figure: typical citation curves, labeled Long growth, Aging, and Plateau]


What is the initial time period needed for prediction with a certain amount of reliability?

[Figure: two panels showing frequency versus cumulative citations]


Observations

• Citation patterns have not yet developed before six months.

• 12 to 24 months provide sufficient time to identify key citation patterns (keeping ~90% of BPs)


Linear and Non-linear Models

Goal: using data for the first n months, predict whether a paper has breakthrough potential (i.e., will be above the threshold at 60 months)
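The goal above can be sketched as fitting a curve to the first n months of cumulative citations and extrapolating it to 60 months. The transcript does not give the model equations, so the functional forms here are assumptions: a straight line c(t) = a + b·t for the linear model and c(t) = A·exp(k·t), fitted by least squares on log counts, for the non-linear model. All function names are hypothetical.

```python
import math


def _least_squares(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def predict_linear(months, cumulative, horizon=60):
    """Extrapolate cumulative citations with c(t) = a + b*t (assumed form)."""
    slope, intercept = _least_squares(months, cumulative)
    return intercept + slope * horizon


def predict_exponential(months, cumulative, horizon=60):
    """Extrapolate with c(t) = A*exp(k*t) (assumed form); counts must be > 0."""
    slope, intercept = _least_squares(months, [math.log(c) for c in cumulative])
    return math.exp(intercept + slope * horizon)


def is_candidate(months, cumulative, threshold, model=predict_linear, horizon=60):
    """Flag a paper whose extrapolated count reaches the category threshold."""
    return model(months, cumulative, horizon) >= threshold
```

Passing `model=predict_exponential` to `is_candidate` switches the extrapolation while keeping the same threshold test, which makes it easy to compare the two fits on the same papers.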


Parameter Ranges for Statistical Analysis

• Time Range:

– 6 months or more

• Minimum Citation Count:

– Only consider publications with >4 citations at 6 months (this eliminated 98% of papers, including 40% of BPs)


Assessing the Models: Precision and Recall

Recall: are we capturing all of the actual breakthroughs?

recall = number of actual breakthroughs detected / total number of actual breakthroughs in data set

Precision: what percentage of our predictions are actually breakthroughs?

precision = number of actual breakthroughs detected / total number of predicted breakthroughs

[Figure: timeline with marks at 0, 1 year, and 5 years, labeled “forecast” and “assess”; regions labeled Predicted actual BPs, Predicted false BPs, and Missing BPs]
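The two ratios defined above can be computed directly from the set of predicted papers and the set of actual breakthroughs. This is a minimal sketch; the function name `precision_recall` and the use of paper identifiers as set elements are assumptions.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for predicted vs. actual breakthrough papers.

    `predicted` and `actual` are iterables of paper identifiers.
    """
    predicted, actual = set(predicted), set(actual)
    detected = len(predicted & actual)  # "Predicted actual BPs"
    precision = detected / len(predicted) if predicted else 0.0
    recall = detected / len(actual) if actual else 0.0
    return precision, recall
```

In the figure's terms, papers in `predicted` but not in `actual` are the "Predicted false BPs", and papers in `actual` but not in `predicted` are the "Missing BPs".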


Assessing the Models: Chemistry

794 Chemistry publications from 2005 had > 5 citations at 6 months.

30 were breakthroughs (0.1% highest-ranked chemistry papers by 60-month citation count)

Linear Fitting

Months of Data                  6     12    24    36    60
1  # Papers Predicted           1     5     13    22    27
2  # Correctly Predicted        1     4     12    20    26
3  # Incorrectly Predicted      0     1     1     2     1
4  Precision (Row 2 / Row 1)    100%  80%   92%   91%   96%
5  Recall (Row 2 / 30 Actual)   3%    13%   40%   67%   87%

Non-linear Fitting

Months of Data                  6     12    24    36    60
1  # Papers Predicted           266   51    34    32    29
2  # Correctly Predicted        12    19    23    25    28
3  # Incorrectly Predicted      254   32    11    7     1
4  Precision (Row 2 / Row 1)    5%    37%   68%   78%   97%
5  Recall (Row 2 / 30 Actual)   40%   63%   77%   83%   93%


Intermediate Conclusions

• The optimal breakthrough threshold value of cumulative citations varies by subject category.

• The number of breakthrough papers (above threshold) is determined by the top 0.1% of highly cited publications, and depends on the number of publications in the research area (Clinical medicine > Mathematics).

• We observed three typical citation patterns over time.

• We developed a method that predicts potential BP candidates and estimates the probability of reaching the threshold (based on time since publication).

• Two fitting models were tested (linear and non-linear). The linear model gives better results at earlier stages (less than 12 months). The exponential model predicts better for longer monitoring times.


Proposed BP indicator approaches

1. Time since publication before monitoring starts should be longer than 6 months (12 months or longer is more reliable).

2. Instantaneous list of BP candidates:

• Using at least 6 months of citation data, we can provide a candidate list of BPs in each journal subject category

• The number of papers in the list may be variable (top 0.1% of ranked papers) rather than a fixed number.

3. Long-term updated list of BP candidates:

• 24 or more months of monitoring

• Selected from a larger pool of publications (24 < t_pub < 60), based on 24 months of citation data.


Future directions: Quality of Citations

1. Subject interdisciplinarity (how many disciplines?)

2. Geographical diversity (how many different countries?)

3. Prestige vs Popularity (how many in high impact journals?)

• The number of citations is an approximate proxy for scientific impact. Some papers receive a high citation count but are not recognized by the community as ground-breaking research.

• Supplementary information is necessary to distinguish true path-finding discovery from incremental knowledge generation


Future work: Interdisciplinarity

Meaning: D – effective number of subject categories (SCs)

About N = 260 different subject categories in WoS

Rao C. R. (1982)

Stirling A. (1998)

Porter A. L., et al. (2006)

Rafols I., M. Meyer (2010)


Future work: Interdisciplinarity

$p_i$, $p_j$: proportions of the $i$-th and $j$-th subject categories; $s_{ij}$: similarity matrix

$D^{(q)} = \left( \sum_{i=1}^{N} p_i^{q} \right)^{1/(1-q)}$

$q = 0$: $D^{(0)} = N$ - Richness

$q = 1$: $D^{(1)} = \exp\left( -\sum_{i=1}^{N} p_i \log p_i \right) = \exp(S)$ - exp(Shannon Entropy)

$q = 2$: $D^{(2)} = 1 / \sum_{i=1}^{N} p_i^{2}$ - Inverse Simpson Index

$I = 1 - \sum_{i,j} s_{ij} p_i p_j$ - Rao-Stirling quadratic diversity
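The four measures named on this slide (richness, exp(Shannon entropy), inverse Simpson index, and Rao-Stirling quadratic diversity) can be computed from a vector of subject-category proportions. The sketch below uses the standard textbook forms, which are assumed to match the slide's formulas; the function names `hill_number` and `rao_stirling` are hypothetical.

```python
import math


def hill_number(p, q):
    """Effective number of subject categories of order q.

    q = 0 gives richness, q = 1 gives exp(Shannon entropy),
    q = 2 gives the inverse Simpson index.
    """
    p = [x for x in p if x > 0]  # categories with zero share do not count
    if q == 1:
        # Limit of the general formula as q -> 1.
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))


def rao_stirling(p, s):
    """Quadratic diversity 1 - sum_ij s[i][j] * p[i] * p[j].

    `s` is a similarity matrix with s[i][i] == 1 (assumed form).
    """
    n = len(p)
    return 1.0 - sum(s[i][j] * p[i] * p[j] for i in range(n) for j in range(n))
```

For citations spread evenly over four categories, all three Hill numbers equal 4, and `rao_stirling` falls from 0.75 to 0 as the categories go from fully dissimilar to identical.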


Future work: Interdisciplinarity

• All diversity measures are highly correlated with each other

• One needs to measure the difference between the diversity indices of references and citations

Distribution of normalized frequencies for citations and references for all similar publications in the “Distel” similarity set [left], and for citations and references of the “Distel” paper only [right]


Future work

b) Geographical diversity of citations


Future work

c) Prestige vs Popularity


Thank you!