Piwowar AMIA 2008: Identifying data sharing in biomedical literature

Identifying data sharing in the biomedical literature Heather Piwowar and Wendy Chapman Department of Biomedical Informatics, U of Pittsburgh

Upload: heather-piwowar

Post on 01-Nov-2014


DESCRIPTION

Many policies and projects now encourage investigators to share their raw research data with other scientists. Unfortunately, it is difficult to measure the effectiveness of these initiatives because data can be shared in such a variety of mechanisms and locations. We propose a novel approach to finding shared datasets: using NLP techniques to identify declarations of dataset sharing within the full text of primary research articles. Using regular expression patterns and machine learning algorithms on open access biomedical literature, our system was able to identify 61% of articles with shared datasets with 80% precision. A simpler version of our classifier achieved higher recall (86%), though lower precision (49%). We believe our results demonstrate the feasibility of this approach and hope to inspire further study of dataset retrieval techniques and policy evaluation.

TRANSCRIPT

Page 1: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Identifying data sharing in the biomedical literature

Heather Piwowar and Wendy Chapman

Department of Biomedical Informatics, U of Pittsburgh

Page 2: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Our full paper, visualized as a “Wordle” (font size ~ word frequency; location and orientation are random)

Page 3: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Created at IBM’s data sharing and visualization site Many Eyes

Page 4: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Our aim:

Identify research articles for which the authors have shared their datasets

For this research:

sharing = submitted to centralized databases

Page 5: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature
Page 6: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature
Page 7: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Links between article and data are important

Page 8: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

The data provides detail for the results of the article

Page 9: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

The article provides detail for the data

Page 10: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Specialized searching methods help us find articles OR data...

but what about when we want articles WITH data?

Page 11: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

How can we find articles that have shared their datasets?

Page 12: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Sometimes the links are easy to discover

Page 13: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

1. Through database citations:

When authors upload data to a database, they have the opportunity to cite the paper that describes the data collection

Page 14: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature
Page 15: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature
Page 16: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Unfortunately, the citation is often left blank because the data is submitted before the paper is published

Page 17: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

2. Through hyperlink URLs in the text

Authors often reference their datasets within their paper with a website URL

Page 18: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature
Page 19: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

But the meaning of the hyperlinks is ambiguous. Sometimes they point to datasets that have been accessed, rather than submitted.

Page 20: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature


Page 21: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

And often the text contains no hyperlinks at all:

Page 22: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

3. Through text mining

Page 23: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

What if we could extract phrases like

“data of the experiment can be accessed at”

Page 24: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

full-text phrases containing “... accessed”

Page 25: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

“can be accessed” suggests data is shared

Page 26: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

BUT “was/were accessed” suggests data reuse!

Page 27: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

full-text phrases containing “... downloaded”

Page 28: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

“was/were downloaded” suggests data reuse

Page 29: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

while “can be downloaded” suggests data sharing
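
The contrast between the two kinds of phrase can be sketched in a few lines of Python. The two patterns below are assumptions built only from the example phrases on these slides ("can be accessed/downloaded" versus "was/were accessed/downloaded"); they are not the authors' actual patterns.

```python
import re

# Assumed cue patterns, taken only from the phrases quoted on the slides.
SHARING_CUE = re.compile(r"\bcan be (accessed|downloaded)\b", re.IGNORECASE)
REUSE_CUE = re.compile(r"\b(was|were) (accessed|downloaded)\b", re.IGNORECASE)


def classify_phrase(text):
    """Return 'sharing', 'reuse', 'both', or 'unknown' for a text excerpt."""
    has_sharing = bool(SHARING_CUE.search(text))
    has_reuse = bool(REUSE_CUE.search(text))
    if has_sharing and has_reuse:
        return "both"
    if has_sharing:
        return "sharing"
    if has_reuse:
        return "reuse"
    return "unknown"
```

The same verb ("accessed", "downloaded") lands in different classes depending on the auxiliary in front of it, which is why the verb alone is not a usable signal.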

Page 30: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Our aim:

Identify research articles for which the authors have shared their raw datasets.

Proposed approach:

Develop a system to identify statements of shared data from an article’s full text.

Page 31: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Materials:

Full text from a subset of the open access literature

Database submission citations from five databases:

• Genbank

• Protein Data Bank

• Gene Expression Omnibus

• ArrayExpress

• Stanford Microarray Database

Page 32: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Our Gold standard:

An article was considered to have a “shared dataset” if the article was cited within the primary submission field of a database entry

(+ a small amount of manual screening to find additional positives based on full text)

Page 33: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Approach:

For those articles that mention database names,

• Extract a 300-character window around every mention of a database name

• Apply various mining algorithms to decide if there is evidence that the authors deposited data from this study in the database
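
A minimal sketch of the window-extraction step, in Python. The slide does not say whether the 300 characters are a total or a per-side width, nor how database names were matched; the sketch assumes a 300-character total window centered on a case-insensitive match of the full database name. The function name `extract_windows` is hypothetical.

```python
import re

# Database names as listed on the Materials slide.
DATABASE_NAMES = [
    "Genbank",
    "Protein Data Bank",
    "Gene Expression Omnibus",
    "ArrayExpress",
    "Stanford Microarray Database",
]


def extract_windows(full_text, names=DATABASE_NAMES, width=300):
    """Return (database name, excerpt) pairs, one per mention in the text.

    Assumption: the window is `width` characters in total, centered on
    the mention and clipped at the start and end of the article text.
    """
    half = width // 2
    windows = []
    for name in names:
        for match in re.finditer(re.escape(name), full_text, re.IGNORECASE):
            center = (match.start() + match.end()) // 2
            start = max(0, center - half)
            end = min(len(full_text), center + half)
            windows.append((name, full_text[start:end]))
    return windows
```

Each excerpt can then be passed to whichever classification method is being evaluated.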

Page 34: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Results:

• queried 24 000 articles across 27 journals

• 25% of all open access articles mentioned one of the database names (50% Genbank)

• development set of 4434 articles

• training set of 2000

• test set of 1028

Page 35: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

True positives:

23% of the articles that mentioned a database were cited from within a database submission field

= evidence that article shared its data!

Page 36: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Three simple methods for identifying sharing

Does the excerpt surrounding the database name contain:

1. the word “accession”

2. an accession number

3. a URL
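
The three simple checks might look like the following Python sketch. The accession-number pattern is an assumption: it covers only GEO-style identifiers (GSE, GSM, GDS or GPL followed by digits), whereas each database has its own identifier format and the slides do not give the patterns that were used.

```python
import re

ACCESSION_WORD = re.compile(r"\baccession\b", re.IGNORECASE)
# Assumption: GEO-style accession numbers only, for illustration.
GEO_ACCESSION = re.compile(r"\bG(SE|SM|DS|PL)\d+\b")
URL = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def simple_features(excerpt):
    """Report which of the three simple cues appear in an excerpt."""
    return {
        "accession_word": bool(ACCESSION_WORD.search(excerpt)),
        "accession_number": bool(GEO_ACCESSION.search(excerpt)),
        "url": bool(URL.search(excerpt)),
    }
```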

Page 37: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Two complex methods:

4. A manually-derived regular expression to match lexical cues that suggest sharing

5. An automatically-derived bag of words decision tree

Page 38: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Snippet of manually-developed regular expression

(we|have|has|is|are|was|were|be|been)

+

(accessioned|added|archived|assigned|deposited|entered|imported|included|inserted|loaded|lodged|placed|posted|provided|registered|reported to|stored|submitted|uploaded to)
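
A sketch of how the two word lists on this slide could be combined into one regular expression in Python. The word lists come from the slide; joining them with whitespace and word boundaries is an assumption, and the full expression used in the study contains more than this snippet.

```python
import re

# Word lists as shown on the slide.
SUBJECT_OR_AUXILIARY = ["we", "have", "has", "is", "are", "was", "were", "be", "been"]
DEPOSIT_VERBS = [
    "accessioned", "added", "archived", "assigned", "deposited", "entered",
    "imported", "included", "inserted", "loaded", "lodged", "placed",
    "posted", "provided", "registered", "reported to", "stored",
    "submitted", "uploaded to",
]

# Assumption: the two parts are separated by whitespace.
DEPOSIT_PATTERN = re.compile(
    r"\b(" + "|".join(SUBJECT_OR_AUXILIARY) + r")\s+("
    + "|".join(DEPOSIT_VERBS) + r")\b",
    re.IGNORECASE,
)
```

For example, `DEPOSIT_PATTERN.search("The data have been deposited in GEO")` matches "been deposited", while "Sequences were downloaded from Genbank" produces no match because "downloaded" is not in the verb list.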

Page 39: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

How accurately were these methods able to identify papers with evidence of public database submissions?

Page 40: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Recall: % of papers cited in database submission fields that were found by our methods

Page 41: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Recall: % of papers cited in database submission fields that were found by our methods

Best method for recall depends on database

Page 42: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Recall: % of papers cited in database submission fields that were found by our methods

“accession” good for some, <url> for others

Page 43: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Recall: % of papers cited in database submission fields that were found by our methods

lexical regular expressions do well overall

Page 44: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Precision: % of papers found by our methods that were cited in database submissions fields
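
Recall and precision as defined on these slides can be computed from two sets of article identifiers. A small Python sketch, with hypothetical names (`predicted` for the papers a method flags, `gold` for the papers cited in database submission fields):

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for sets of article identifiers.

    precision = flagged papers that are in the gold standard / all flagged
    recall    = flagged papers that are in the gold standard / all gold
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

With four flagged papers of which two are in a gold standard of three, precision is 2/4 and recall is 2/3.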

Page 45: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Precision: % of papers found by our methods that were cited in database submissions fields

lexical regular expressions do well overall, bag-of-words does even better

Page 46: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Precision: % of papers found by our methods that were cited in database submissions fields

Precision of simple patterns depends on database

Page 47: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Precision: % of papers found by our methods that were cited in database submissions fields

Simple patterns do poorly on the most popular databases (those with the most statements of reuse?)

Page 48: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Precision vs. Recall plot of all methods for each database.

Diverse!

Page 49: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Relative strength of methods for this task across databases

(Figure labels: <url>, bag of words, “accession”, <accession>, <lexical patterns>)

Page 50: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Limitations:

• bias due to manual screening of negatives

• database-centric classifier

• approach requires computational access to literature full text!

Page 51: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Impact:

• A recent version that runs in PubMed Central:

• could increase GEO article links by 2.6%

• by 5.5% annually once all NIH-funded articles are in PMC

• doubling the recall (to 80%) would double these estimates

• 40 links already added by GEO staff!

Page 52: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Ongoing work:

1. Continue focusing on methods that use existing full-text query interfaces, like PubMed Central

2. Use this tool to evaluate the patterns and prevalence of biomedical research data sharing and reuse

Page 53: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Thanks to

the Dept of Biomedical Informatics at the U of Pittsburgh,

the NLM for funding through training grant 5 T15 LM007059-22,

and everyone who publishes “gold” open access, thereby facilitating reuse of article full text for studies like this.

My shared data: www.dbmi.pitt.edu/piwowar

Share your research data too!

Page 54: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature
Page 55: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Our manual filter for additional positive classifications identified more cases in some databases than others: we reclassified 19% of [article,database] cases from ArrayExpress as positive despite an omitted literature link, compared to 11%, 7%, 2%, and 1% for GEO, Genbank, PDB, and SMD respectively (see Table 2 for raw number of cases). The most common situations included: the database entry listed a citation for another paper by the same authors, the entry listed an erroneous PubMed ID, the entry included a citation without a PubMed ID, or the entry had a blank citation field.

Page 56: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Usage?

• scientists looking for datasets for reuse

• curators looking for primary citations

• researchers studying data sharing behaviour

Page 57: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Regular expression

• Precise one +

• r"(\b(accession.{0,20}(for|at).{0,100}(is|are)))",

• r"(\b(raw|original|our|complete|detailed).{0,20}data)",

• r"(\b(we|have|is|was|were|is|are|be|have|has|been).(exported|gave|given|listed|provided|reported))"

• ]) + ")"
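
The fragment on this slide appears to build a broader pattern by OR-ing the precise pattern with a few extra alternatives. A hedged Python sketch of that construction follows. `PRECISE` is a placeholder assumption (one alternative borrowed from the precise expression, not the whole thing), the names `EXTRA` and `BROAD` are hypothetical, and the duplicated words in the third alternative have been dropped.

```python
import re

# Placeholder assumption: a single alternative standing in for the full
# precise pattern, which this slide does not spell out.
PRECISE = r"(\b(is|are|will.be|made).{0,20}(available|accessible))"

# Extra alternatives as listed on the slide (raw strings so \b stays a
# word boundary rather than a backspace character).
EXTRA = [
    r"(\b(accession.{0,20}(for|at).{0,100}(is|are)))",
    r"(\b(raw|original|our|complete|detailed).{0,20}data)",
    r"(\b(we|have|is|was|were|are|be|has|been).(exported|gave|given|listed|provided|reported))",
]

# OR everything together into one broader expression.
BROAD = re.compile("(" + PRECISE + "|" + "|".join(EXTRA) + ")", re.IGNORECASE)
```

Adding alternatives in this way can only widen what is matched, which fits a variant tuned for recall rather than precision.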

Page 58: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Precise Regular expression

• (we|have|has|is|are|was|were|be|been)

(accessioned|added|archived|assigned|deposited|entered|imported|included|inserted|loaded|lodged|placed|posted|provided|registered|reported.to|stored|submitted|uploaded.to)

(is|are|will.be|made).{0,20}(available|accessible)

(be).(accessed|browsed|downloaded|found|obtained|queried|retrieved|searched|viewed)

(through|under|as).{0,20}accession

(given|new|received|assigned).{0,20}(accession)

(data.{0,20}availability|for public distribution|for.{0,20}release upon publication|for the.{0,20}data.{0,20}generated|from this study have.{0,20}accession|data.{0,10}from this study|access to.{0,20}data.

Page 59: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Stopwords are important!
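
One plausible reading of this slide is that the words separating sharing from reuse ("can be" versus "was/were") are exactly the ones a standard stopword filter would discard. The Python sketch below illustrates that reading; the stopword list is a tiny made-up example, not the one used in the study.

```python
import re

# Assumption: a small illustrative stopword list.
STOPWORDS = {"can", "be", "was", "were", "the", "from", "at"}


def bag_of_words(text, remove_stopwords):
    """Return the set of lowercase word tokens in a text excerpt."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return set(tokens)


sharing = "data can be downloaded"
reuse = "data were downloaded"

# With stopwords removed, the two phrases become indistinguishable.
same_without = bag_of_words(sharing, True) == bag_of_words(reuse, True)
# With stopwords kept, a classifier can still tell them apart.
same_with = bag_of_words(sharing, False) == bag_of_words(reuse, False)
```

Here `same_without` is `True` and `same_with` is `False`.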

Page 60: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Recall

Page 61: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Precision

Page 62: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

• queried 24 000 articles across 27 journals

• 25% mentioned one of the database names

• development set of 4434

• training set of 2000

• test set of 1028

Evaluation

Page 63: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

Research data

http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png; http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif;

http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441

Page 64: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature


Page 65: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature

PAST MEDICAL HISTORY: Past medical history showed she had superficial phlebitis times two in the past, had non-insulin dependent diabetes mellitus for four years. She had been hypothyroid for three years.

HISTORY OF PRESENT ILLNESS: The patient is a 58-year-old female, …


Page 66: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature


Page 67: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature


Page 68: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature


Page 69: Piwowar AMIA 2008:  Identifying data sharing in biomedical literature