Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

Post on 02-Jul-2015

DESCRIPTION

Presentation on "Microtask crowdsourcing for annotating diseases in PubMed abstracts" at ASHG14 session on "Cloudy with a chance of big data".

TRANSCRIPT

Microtask crowdsourcing for annotating diseases in PubMed abstracts

Andrew Su, Ph.D.

@andrewsu

asu@scripps.edu

http://sulab.org

October 20, 2014

ASHG

Slides: slideshare.net/andrewsu

Potential conflicts of interest

• Novartis

• Assay Depot

• Avera Health

[Figure: Genome-scale profiling of Condition A and Condition B, leading to candidate genes/proteins. Methods shown: RNA-seq, exome seq, whole genome seq, proteomics, genotyping, copy-number analysis, ChIP-seq, methylation, functional genomics]

Candidate genes/proteins

• Related diseases
• Related drugs
• Related pathways

Databases are fragmented and incomplete

Disease links for Apolipoprotein E, by database:

• KEGG (4)
• OMIM (6)
• PharmGKB (10)
• HuGE Navigator (517)

[Figure: overlap of Apolipoprotein E disease links among the four databases]

[Figure: Number of new PubMed-indexed articles, 1983 to 2013; y-axis from 0 to 1,200,000]

http://www.flickr.com/photos/portland_mike/6140660504/

Harnessing the crowd…

… to organize information

http://www.flickr.com/photos/45697441@N00/6629580443

Information extraction for a Network of BioThings

1. Find mentions of high-level concepts in text
2. Map mentions to specific terms in ontologies
3. Identify relationships between concepts

Concept types: Genes/proteins, Diseases, Drugs, Pathways

The NCBI Disease corpus

• 793 PubMed abstracts
• 12 expert annotators (2 annotate each abstract)
• 6,900 “disease” mentions

Doğan, Rezarta, and Zhiyong Lu. Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.

Question: Can a group of non-scientists collectively perform concept recognition in biomedical texts?

Experimental design

Task: Identify the disease mentions in the PubMed abstracts from the NCBI disease corpus

– 5 non-scientists annotate each abstract
– The details:
  • Recruit workers using Amazon Mechanical Turk
  • Pay $0.066 per Human Intelligence Task (HIT)
  • HIT = annotate one abstract from PubMed

Instructions to workers

• Highlight all diseases and disease abbreviations
  – “...are associated with Huntington disease ( HD )... HD patients received...”
  – “The Wiskott-Aldrich syndrome ( WAS ) , an X-linked immunodeficiency…”
• Highlight the longest span of text specific to a disease
  – “... contains the insulin-dependent diabetes mellitus locus …”
• Highlight disease conjunctions as single, long spans
  – “... a significant fraction of familial breast and ovarian cancer , but undergoes…”
• Highlight symptoms (physical results of having a disease)
  – “XFE progeroid syndrome can cause dwarfism, cachexia, and microcephaly. Patients often display learning disabilities, hearing loss, and visual impairment.”

Aggregation function based on simple voting

[Figure: the example abstract below, shown in four panels with the text highlighted by at least K workers: 1 or more votes (K=1), K=2, K=3, K=4]

This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.
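The vote-threshold aggregation named on this slide can be sketched roughly as follows. This is a minimal illustration, not the lab's actual implementation: the function name `aggregate_by_votes`, the representation of a highlight as a `(start, end)` character-offset pair, and the worker data are all hypothetical. It also assumes that two highlights count as the same vote only when their offsets match exactly, since the slides do not say how partially overlapping highlights were handled.

```python
from collections import Counter


def aggregate_by_votes(worker_spans, k):
    """Keep every span highlighted by at least k workers.

    worker_spans: one collection of (start, end) offsets per worker.
    Assumption: only exactly matching spans are counted as agreeing.
    """
    votes = Counter()
    for spans in worker_spans:
        # set() ensures one worker cannot vote twice for the same span
        votes.update(set(spans))
    return {span for span, n in votes.items() if n >= k}


# Hypothetical highlights from five workers on one abstract
workers = [
    {(10, 18), (40, 62)},
    {(10, 18), (40, 62)},
    {(10, 18)},
    {(10, 18), (70, 75)},
    {(40, 62)},
]

for k in range(1, 5):
    print(k, sorted(aggregate_by_votes(workers, k)))
```

Raising K keeps only the spans that more workers agree on, so the output shrinks as K grows; lowering K keeps more spans at the risk of including ones that only a single worker marked.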

Comparison to gold standard

[Figure: precision vs. recall; F score = 0.81]

[Figure: F score (y-axis, 0 to 0.9) for different values of N and vote threshold k]

N = 3, 6, 9, 12, 15, 18
Max F = 0.69, 0.79, 0.82, 0.85, 0.85, 0.85


F = 0.76 – score of single Ph.D. annotator

F = 0.87 – agreement between multiple Ph.D. annotators
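For readers unfamiliar with the metric, the comparison against the gold standard can be sketched as below. This is an illustrative sketch under stated assumptions: it takes "F score" to mean the balanced F1 (the harmonic mean of precision and recall) and counts a crowd span as correct only when it matches a gold-standard span exactly; the slides specify neither choice. The function name and the example spans are hypothetical.

```python
def precision_recall_f(predicted, gold):
    """Return (precision, recall, F1) for two collections of spans.

    Assumption: a predicted span is a true positive only if the
    identical span appears in the gold standard.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans: two of three crowd spans appear in the gold standard
gold = {(10, 18), (40, 62), (90, 99)}
crowd = {(10, 18), (40, 62), (70, 75)}

p, r, f = precision_recall_f(crowd, gold)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Under these assumptions, a stricter vote threshold tends to raise precision and lower recall, which is why the F score is useful for picking a threshold that balances the two.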

Crowd-based biocuration

• 7 days
• 17 workers
• $192.90

Professional biocuration

• Many months
• 12 experts
• $150,000+

In aggregate, our worker ensemble is faster, cheaper and as accurate as a single expert annotator for disease concept recognition.

Information extraction for a Network of BioThings

1. Find mentions of high-level concepts in text
2. Map mentions to specific terms in ontologies
3. Identify relationships between concepts

Concept types: Genes/proteins, Diseases, Drugs, Pathways

Vision-based Citizen Science

• Galaxy Zoo (galaxy classification; 110M+ classifications, 300k+ volunteers)
• Foldit (protein folding; 350k+ players)
• Eterna (RNA folding; 80k players)
• Eyewire (3D neuron structure determination; 130k volunteers)
• Phylo (multiple sequence alignment; 30k+ players, 285k alignments)
• …

Language-based Citizen Science

http://mark2cure.org

Funding and Support

(BioGPS: GM83924, Gene Wiki: GM089820, DA036134)

The Su Lab

Chunlei Wu

Ben Good

Salvatore Loguercio

Max Nanis

Louis Gioia

Ramya Gamini

Greg Stupp

Ginger Tsueng

Erick Scott

Vyshakh Babji

Karthik Gangavarapu

Adam Mark

Key Alumni

Katie Fisch

Tobias Meissner

Key Collaborators

Andra Waagmeester

Lynn Schriml

Peter Robinson

Contact

http://sulab.org

asu@scripps.edu

@andrewsu

+Andrew Su

We are recruiting programmers, postdocs, and awesome people of all kinds!

bit.ly/SuLabJobs

We are hosting a hackathon Nov 7-9 for the Network of BioThings

bit.ly/hackNoB
