Towards Evidence-Based Discovery

Catherine Blake, School of Information and Library Science, University of North Carolina at Chapel Hill, http://www.ils.unc.edu/~cablake


Page 1: Towards Evidence-Based Discovery

Towards Evidence-Based Discovery

Catherine Blake

School of Information and Library Science

University of North Carolina at Chapel Hill

http://www.ils.unc.edu/~cablake

Page 2: Towards Evidence-Based Discovery

Motivation
• Relentless increase in electronically available text
  – Life Sciences
    • 17 millionth entry added in April 2007
    • 5,200 journals indexed
    • 12,000 new articles each week!
  – Chemistry: more than 110,000 articles in 1 year alone
• Consequences:
  – Hundreds of thousands of relevant articles
  – Implicit connections between literature go unnoticed

Shift from Retrieval to Synthesis

Page 3: Towards Evidence-Based Discovery


Information Overload

“One of the diseases of this age is the multiplicity of books; they doth so overcharge the world that it is not able to digest the abundance of idle matter that is every day hatched and brought forth into the world”

- Barnaby Rich, 1613

Page 4: Towards Evidence-Based Discovery

Evidence-Based Discovery

“If I have seen further than others, it is by standing upon the shoulders of giants.”
- Sir Isaac Newton

“We can't solve problems using the same kind of thinking we used when we created them.”
- Albert Einstein

Goal: Facilitate Discovery from Text
• Facilitate: to make easy or easier (1)
• Discovery: a productive insight (1)

(1) American Heritage Dictionary

Page 5: Towards Evidence-Based Discovery

[Diagram: research overview. Labels: Education; Discovery Science; Evidence-based Practice; Natural Language Processing; Human Discovery and Synthesis; Human-assisted Discovery and Synthesis; Heterogeneous Literature; Core; Chemistry; Breast Cancer; Genomics; Synthesis and Discovery Work Practices; News; DocSouth]

Page 6: Towards Evidence-Based Discovery

Outline
• Motivation
• Case Studies
  – METIS
    • Human synthesis
    • Natural language processing
  – Claim Jumping through Scientific Literature
• Next Steps
• Summary

Page 7: Towards Evidence-Based Discovery

Systematic Review Process

– Formulate the problem
– Locate and select studies
– Assess quality of studies
– Collect data
– Analyze and present results
– Interpret results
– Improve and update review

28 months from initial idea to publication

Increased demand due to evidence-based medicine

Page 8: Towards Evidence-Based Discovery

Manual Synthesis

[Diagram labels: Iteration; Collaboration; Analysis; Extraction; Context Information; Hypothesis Projection; Retrieval; Corpus; MEDLINE; Embase; Verification; Facts]

Select, Extract, Analyze, Verify

Guesswork guided by scientifically trained intuition (Rescher, 1978)

Page 9: Towards Evidence-Based Discovery

Context Information

• Study Information
  – e.g. date, location, ...
• Population Information
  – e.g. gender, age, ...
• Risk Factor or Intervention
  – e.g. duration of exposure, confounders
• Disease
  – e.g. stage, confounders

Loosely coupled to review focus

Tightly coupled to review focus

Page 10: Towards Evidence-Based Discovery

Collaborative Information Synthesis

[Diagram labels: Iteration; Collaboration; External Data; Analysis; Extraction; Context Information; Hypothesis Projection; Retrieval; Corpus; MEDLINE; Embase; Verification; Facts]

Page 11: Towards Evidence-Based Discovery

Key: Estimate Missing Information

What are people with Breast Cancer exposed to?

What are people in a similar population exposed to?

Are these rates significantly different?

Studies with Breast Cancer patients
Facts for each study
• number of patients
• age of patients
• geographic location
• risk-factor exposure …

Database of risk factors (BRFSS)
Codebook
• question asked
• age, gender
• % responses

T. Tengs & N. D. Osgood (2001) “The link between smoking and Impotence: Two Decades of Evidence”, Preventive Medicine, 32:447-52
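The third question on this slide ("Are these rates significantly different?") can be illustrated with a two-proportion z-test. This is a minimal sketch only: the slide does not say which test METIS applies, so the choice of test, the function name `two_proportion_z`, and all counts in the usage note are assumptions made for illustration.

```python
import math


def two_proportion_z(exposed_a, n_a, exposed_b, n_b):
    """Two-sided z-test for a difference between two exposure rates.

    Group A might be patients pooled from the published studies and group B
    a comparable population drawn from an external survey (hypothetical setup).
    Returns (z statistic, two-sided p-value).
    """
    p_a = exposed_a / n_a
    p_b = exposed_b / n_b
    # Pooled proportion under the null hypothesis of equal rates.
    pooled = (exposed_a + exposed_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("rates are all 0 or all 1; the test is undefined")
    z = (p_a - p_b) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

With invented counts, `two_proportion_z(60, 100, 40, 100)` gives a z statistic near 2.83, which would suggest the two exposure rates differ at the conventional 0.05 level.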

Page 12: Towards Evidence-Based Discovery

More than Automated Meta-Analysis

[Diagram labels: Systematic Review; Information Synthesis; External database. Key: Entire study; Main topic; Secondary Information]

• Traditional analysis
  – same study design
  – medicine = RCT
  – epidemiology = cohort
• Information Synthesis
  – any study that includes required information
  – augment missing information

Page 13: Towards Evidence-Based Discovery

[Diagram: research overview, same labels as on Page 5; section focus: Natural Language Processing]

Page 14: Towards Evidence-Based Discovery

METIS Information Extractor

• Semantic Grammar
• Features: words, numbers, and semantic types in the Unified Medical Language System (UMLS)
• Information extracted:
  – risk factor exposure (tobacco and alcohol)
  – gender
  – age (min, max, mean)
  – start and end dates
  – number of subjects with medical condition
  – geographical location

{term;’age’} {term:’of’} {number;10<n2<110}{term;’to’}{number;10<n2<110}

The age of breast cancer subjects ranged between 20 to 64 years old.

{semantic type: neoplastic process, or disease}
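The rule above can be approximated with a regular expression. This is a rough sketch, not the METIS grammar itself: it ignores the UMLS semantic-type slot, and the names `AGE_RANGE` and `extract_age_range` are hypothetical. It keeps only the idea of an "age" term followed by two numbers joined by "to", each constrained to lie strictly between 10 and 110.

```python
import re

# Hypothetical pattern loosely mirroring the slide's rule:
# an "age" term, then a number, a connector such as "to", and a second number.
AGE_RANGE = re.compile(
    r"\bage[sd]?\b.*?\b(\d{1,3})\s*(?:to|-|and)\s*(\d{1,3})\b",
    re.IGNORECASE,
)


def extract_age_range(sentence):
    """Return (min_age, max_age) if a plausible age range is found, else None."""
    match = AGE_RANGE.search(sentence)
    if match is None:
        return None
    low, high = int(match.group(1)), int(match.group(2))
    # Numeric constraint from the slide: both values strictly between 10 and 110.
    if 10 < low < 110 and 10 < high < 110 and low <= high:
        return low, high
    return None


sentence = "The age of breast cancer subjects ranged between 20 to 64 years old."
age_range = extract_age_range(sentence)  # (20, 64)
```

A plain regular expression cannot check that the words between "age" and the numbers name a disease; that is what the semantic-type feature in the original rule appears to be for.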

Page 15: Towards Evidence-Based Discovery

METIS Info Extractor – Evaluation

• Diverse text corpus
  – epidemiology, surgery, biology, ...
  – cohort studies, case-control trials, ...
• Evaluation
  – Metrics (precision, recall)
  – Annotators (developer, domain expert, expert annotator, novice)
  – Primary topic (breast cancer, impotence)
  – Secondary information (tobacco and alcohol consumption)
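The two metrics named above can be computed by comparing the facts a system extracts with the facts an annotator marks as correct. The sketch below uses the standard set-based definitions; the function name and the example facts are invented for illustration and are not taken from the METIS evaluation.

```python
def precision_recall(extracted, gold):
    """Compute precision and recall of extracted facts against a gold standard.

    precision = correct extractions / all extractions
    recall    = correct extractions / all gold-standard facts
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical facts for a single article.
system_facts = {"age:20-64", "gender:female", "location:Ohio"}
annotator_facts = {"age:20-64", "gender:female", "subjects:120", "year:1998"}
precision, recall = precision_recall(system_facts, annotator_facts)
```

Here two of the three extracted facts are correct (precision 2/3) and two of the four annotated facts are found (recall 1/2).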

Page 16: Towards Evidence-Based Discovery

METIS Info Extractor – Recall

[Chart: recall (y-axis, 0.0 to 1.0) by rank (x-axis, 1 to 5); series: Development, Domain Expert, Expert Annotator, Novice Annotator]

Page 17: Towards Evidence-Based Discovery

METIS Info Extractor – Precision

[Chart: precision (y-axis, 0.0 to 1.0) by rank (x-axis, 1 to 5); series: Development, Domain Expert, Expert Annotator, Novice Annotator]

Page 18: Towards Evidence-Based Discovery

Verify information extracted

Electronic version of article

Converted Article

METIS Verifier

Page 19: Towards Evidence-Based Discovery

METIS Verifier

Page 20: Towards Evidence-Based Discovery

METIS Analyzer

• Meta-Analysis
  – Developed for agricultural application
  – Requires empirical studies with a quantitative outcome
  – Unit of study is an article, not a person
  – Result: a unitless metric called an effect size
• Two common meta-analysis techniques
  – Fixed-effects model
  – Random-effects model

Evaluation: Compared generated effect size with examples in textbooks and published articles.

Result: Same effect size
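The two techniques named above can be sketched with the textbook inverse-variance formulas. This is an illustrative sketch under assumptions, not the METIS Analyzer: it assumes each article contributes one effect size with a known variance, and it uses the DerSimonian-Laird estimate of between-study variance for the random-effects case, which the slides do not specify.

```python
def fixed_effect(effects, variances):
    """Inverse-variance pooled effect size and its variance."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return pooled, 1.0 / sum(weights)


def random_effects(effects, variances):
    """DerSimonian-Laird random-effects pooled effect, variance, and tau^2."""
    weights = [1.0 / v for v in variances]
    pooled_fixed, _ = fixed_effect(effects, variances)
    # Cochran's Q measures how much the studies disagree with the fixed estimate.
    q = sum(w * (y - pooled_fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = sum(weights) - sum(w * w for w in weights) / sum(weights)
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    adjusted = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(adjusted, effects)) / sum(adjusted)
    return pooled, 1.0 / sum(adjusted), tau2


# Hypothetical effect sizes and variances for two articles.
fixed = fixed_effect([0.2, 0.4], [0.01, 0.01])
random_ = random_effects([0.2, 0.4], [0.01, 0.01])
```

When the studies agree closely, the between-study variance is estimated as zero and the two models return the same pooled effect; when they disagree, the random-effects estimate carries a larger variance.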

Page 21: Towards Evidence-Based Discovery

Synthetic Estimate Evaluation

[Chart: Tobacco Consumption. Control rate (y-axis, 0 to 1) by article identifier (1 to 4, plus average); series: Actual, Estimated]

[Chart: Alcohol Consumption. Control rate (y-axis, 0 to 1) by article identifier (1 to 4, plus average); series: Actual, Estimated]

Page 22: Towards Evidence-Based Discovery
Page 23: Towards Evidence-Based Discovery
Page 24: Towards Evidence-Based Discovery

Outline

• Motivation
• Case Studies
  – METIS
  – Claim Jumping
    • Human discovery
    • Natural language processing
    • Human-assisted discovery and synthesis
• Next Steps
• Summary

Page 25: Towards Evidence-Based Discovery

[Diagram: research overview, same labels as on Page 5; section focus: Human Discovery and Synthesis]

Page 26: Towards Evidence-Based Discovery

Human Discovery

• Day-to-day activities of scientists reflect
  – the complex socio-technical environments in which successful creativity tools will eventually be embedded
  – the human cognitive processing surrounding creativity
• Unit of analysis: a paper or grant proposal

How do chemists transform an idea into a publication?

How do chemists arrive at their research question?

Page 27: Towards Evidence-Based Discovery

Approach
• Recruitment
  – experienced scientists (7-45 yrs)
  – local chemists and chemical engineers
  – response rate 84% (21/25)
• Semi-structured interviews
• Critical incident technique
  1. seminal paper in their field
  2. recent paper authored by the participant
  3. paper authored by the participant that they were particularly proud of

Page 28: Towards Evidence-Based Discovery

Interview Questions
• Discovery Questions
  – What is your definition of discovery?
  – What evidence convinced you that the paper addressed the initial research questions?
  – What factors limited the adoption and deployment of the discovery?
  – How did you arrive at the research question?
  – What, if any, existing evidence prompted the study/experiment?
  – Were there any alternative explanations?
• Information Usage Questions
  – Other than the scientific literature, what information resources do you draw from to aid in your research processes?
  – How many articles did you read last month that related to each of those projects?
  – Is that typical of how many articles you read in a month for research projects?
  – Do you read articles for another purpose? If so, what?
  – How many hours do you spend reading journal articles for research projects?
  – Which journals do you typically read and draw from?
  – How would you characterize the journals that you read: are they only within your domain, or do you read journals that would be considered non-traditional in your research?
  – If you only have a few minutes to read an article, what parts would you read?
  – What do you do with the article once you have read it?

Page 29: Towards Evidence-Based Discovery

Chemists and Chemical Engineers

• Compared with other scientists, chemists and chemical engineers
  – read more (Brown, 1999)
  – have more personal subscriptions to journals (Noble & Coughlin, 1997)
  – spend more time reading (Tenopir & King, 2003)
  – visit the library more often (Brown, 1999)
• Consequences
  – information is disseminated quickly
  – information has a relatively short lifespan

Page 30: Towards Evidence-Based Discovery

Human Discovery Findings

• Discovery definition
  – Novelty
  – Build on existing ideas
  – Simplicity
  – Balance theory and experimentation
  – Practical application
• Hypothesis generation
  – Discussion
  – Combine expertise
  – Previous experiments
  – Read literature
• Hypothesis validation
  – Iterative
  – Tightly coupled

Page 31: Towards Evidence-Based Discovery

[Diagram: research overview, same labels as on Page 5; section focus: Natural Language Processing]

Page 32: Towards Evidence-Based Discovery

Causal Relationships

• Newspaper genre
  – Causal relationships (Khoo, Chan, & Niu, 1998)
• Biomedical genre
  – Causes and treats (Price & Delcambre, 2005)
  – Causal knowledge (Khoo, Chan, & Niu, 2000)
• Universal Grammar
  – Causatives (Comrie, 1974, 1981)
  – Action verbs (Thomson, 1987)

Page 33: Towards Evidence-Based Discovery

Claim Definition

• “To assert in the face of possible contradiction”
• Example sentence reporting a claim
  – “This study showed that Tamoxifen reduces the breast cancer risk”
• Example Claim Framework
  – Tamoxifen (agent)
  – reduces (change)
  – breast cancer risk (object)
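The example claim can be held in a small data structure. This is a hypothetical representation, not the annotation format used in the study: the class names `Facet` and `Claim` are invented, and the optional `direction` and `modifier` fields follow the facet names (for example Agent Direction, Agent Modifier) that appear in the annotation statistics later in the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Facet:
    """One facet of a claim: the core text plus optional direction and modifier."""
    text: str
    direction: Optional[str] = None
    modifier: Optional[str] = None


@dataclass(frozen=True)
class Claim:
    """A claim with the three facets named on the slide."""
    agent: Facet
    change: Facet
    obj: Facet  # "object" on the slide; renamed to avoid shadowing the builtin

    def summary(self) -> str:
        """Join the three facet texts into a readable one-line claim."""
        return f"{self.agent.text} {self.change.text} {self.obj.text}"


# The example sentence from the slide, encoded as a claim.
claim = Claim(
    agent=Facet("Tamoxifen"),
    change=Facet("reduces"),
    obj=Facet("breast cancer risk"),
)
```

Calling `claim.summary()` reproduces the core of the example sentence, "Tamoxifen reduces breast cancer risk".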


Page 34: Towards Evidence-Based Discovery

The Claim Framework

• Goal
  – go beyond genes and proteins
  – differentiate between different levels of confidence in the claim
  – consider claims made in the full text
• Working hypothesis
  – literature will report findings using constructs within the Claim Framework
  – human annotators will agree on facets

Page 35: Towards Evidence-Based Discovery

Preliminary Results

• 29 articles from TREC Genomics
  – Total number of sentences: 5535
  – Sentences with >=1 claim: 1250 (22.6%)
  – Total number of claims: 3228
  – Average claims per sentence: 2.51
  – Claims that did not fit in the Framework: 31
• Per article
  – Average number of sentences: 191
  – Average number of sentences with >=1 claim: 43

Page 36: Towards Evidence-Based Discovery

Distribution of Claim Categories

Category      Total    (%)   Pilot    (%)    Main    (%)
Explicit       2489  77.11     332  83.42    2157  76.63
Implicit         87   2.70       3   0.75      84   2.98
Observation     298   9.23      24   6.03     274   9.73
Correlation     174   5.39      12   3.02     162   5.75
Comparison      165   5.11      27   6.85     138   4.9
Total          3228 100        398 100       2830 100

Page 37: Towards Evidence-Based Discovery

All Documents

Annotation         Total    (%)   Words  (Avg)
Agent               2894  89.65    5221   1.80
Agent Direction      285   8.83     291   1.02
Agent Modifier      1246  38.60    4448   3.57
Object              3197  99.04    6849   2.14
Object Direction     271   8.40     283   1.04
Object Modifier     1561  48.36    5383   3.44
Change              1897  58.77    1953   1.03
Change Direction    1337  41.42    1358   1.02
Change Modifier     1147  35.53    1618   1.41
Claim Basis          165   5.11     394   2.39
Claim Basis Dir.      42   1.30      43   1.02
Claim Basis Mod.      86   2.66     266   3.09
Total               3228          28107   8.70

Page 38: Towards Evidence-Based Discovery

Inter Annotator Agreement

Information Facet    Kappa   Agreement
Agent                0.71    substantial
Object               0.77    substantial
Change               0.57    moderate
Change+ChangeDir     0.88    almost perfect
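The agreement labels in this table match the commonly used Landis and Koch bands for kappa. The sketch below computes Cohen's kappa for two annotators and maps a value to those bands. It is an assumption that the study used Cohen's kappa for a pair of annotators; the slide only says "Kappa", and the function names are illustrative.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, from each annotator's label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def landis_koch(kappa):
    """Map a kappa value to the Landis and Koch agreement bands."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"),
             (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


labels = [landis_koch(k) for k in (0.71, 0.77, 0.57, 0.88)]
```

Applied to the four kappa values in the table, `landis_koch` returns the same labels the slide reports.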

Page 39: Towards Evidence-Based Discovery

Location of Claims

Section        Sentences with claim   Total sentences   % section   % claim
Abstract                         98               309       31.72      7.84
Introduction                    357               979       36.47     28.56
Method                            6              1121        0.54      0.48
Result                          293              1829       16.02     23.44
Discussion                      539              1406       38.34     43.12
Total                          1250              5535       22.58    100.00

Page 40: Towards Evidence-Based Discovery

[Diagram: research overview, same labels as on Page 5; section focus: Human-assisted Discovery and Synthesis]

Page 41: Towards Evidence-Based Discovery

User Study

• Timothy S. Carey, MD, MPH: Sarah Graham Kenan Professor of Medicine; Director, Cecil G Sheps Center for Health Services Research
• Ila Cote, PhD, DABT: Acting Division Director, US Environmental Protection Agency, National Center for Environmental Assessment
• Michael T Crimmins, PhD: Mary Ann Smith Distinguished Professor of Chemistry, UNC; Department Chair, Department of Chemistry
• Paul Jones: Clinical Associate Professor, School of Information and Library Science; Director of ibiblio.org
• Rudy L Juliano, PhD: Boshamer Distinguished Professor of Pharmacology; Principal Investigator, Carolina Center of Cancer Nanotechnology Excellence
• Steven W. Matson, PhD: Professor and Chair, Department of Biology
• Robert C Millikan, DVM, PhD: Barbara Sorenson Hulka Distinguished Professor, Department of Epidemiology, School of Public Health
• Rosa Perelmuter, PhD: Director, Moore Undergraduate Research Apprentice Program; Professor of Spanish and Assistant Dean, Academic Advising Program
• Jan F. Prins, PhD: Professor of Computer Science and Chairman, Department of Computer Science
• Alexander Tropsha, PhD: Professor and Chair; Director, Laboratory for Molecular Modeling
• Suzanne West, PhD: Researcher, Health, Social and Economics Research, RTI International

Page 42: Towards Evidence-Based Discovery

[Diagram: research overview, same labels as on Page 5]

Page 43: Towards Evidence-Based Discovery

Closing Comments
• Accelerate synthesis
  – Breast cancer study without METIS would take >13 years
  – Without synthetic estimate = systematic review
• Accelerate discovery
  – Connections between literature
  – Speculative and orthogonal views
• Human discovery and synthesis
  – As important if not more so than automation

“Tap the vast reservoir of human knowledge”
- Louis Round Wilson, 1929

Page 44: Towards Evidence-Based Discovery

Acknowledgements

METIS
• Funded in part by
  – California Breast Cancer Research Program
  – University of California, Irvine
• Thanks to user groups
  – Particularly to Dr. Adams and Dr. Tengs
• Academic mentoring
  – Primary Advisor: Dr. Wanda Pratt
  – Medical Mentor: Dr. Catherine Carpenter
  – Co-Advisors: Dr. Dennis Kibler and Dr. Michael Pazzani
  – Committee Member: Dr. Paul Dourish

Claim Jumping
• Funded in part by
  – Faculty fellowship from the Renaissance Computing Institute
  – UNC Faculty Award
• Thanks to collaborators
  – Nassib Nassar and Mats Rynge (RENCI)
  – Amol Bapat and Ryan Jones (SILS)

Chemists and Chemical Engineers Study
• Funded in part by
  – NSF Center for Environmentally Responsible Solvents and Processes

Page 45: Towards Evidence-Based Discovery

Questions and Comments Welcome

Catherine Blake

School of Information and Library Science

University of North Carolina at Chapel Hill

http://www.ils.unc.edu/~cablake

Page 46: Towards Evidence-Based Discovery

Publication Bias

• Studies that find a correlation between a risk factor and disease are more likely to be published (Easterbrook et al., 1991; Ingelfinger et al., 1994)
• METIS provides a new way to explore this bias

Bias introduced by authors, editors, funding, ...