Gene Wiki at Phenotype RCN annual meeting

Post on 10-May-2015

The Gene Wiki: Synthesizing knowledge about human genes with Wikipedia

Benjamin Good

Feb. 26, 2013

http://www.slideshare.net/goodb

“Knowledge about human genes”

1) There is a lot

2) It is scattered

Biological knowledge is growing rapidly

• More than 22 million articles indexed in PubMed

• Growing at about 1 million/year and rising

Scattered genomic knowledge is a problem

• Scientists faced with new and unfamiliar genes on a daily basis

• Public faced with unfamiliar genes on a daily basis

Hits: IFITM3, TFE3, BEX1, ST8SIA1, TFEB, BEX2, SKP1A, ....

GNF Robotics

Knowledge synthesis

“the pulling together of ideas or information to develop a common framework for understanding”

Knowledge synthesis in biology, aka biocuration

• The production of structured data

Unstructured → Structured

Gene Ontology

“Tool for the unification of biology”[1]

[1] Nature Genetics. 2000 May;25(1):25-9.

A shared, controlled vocabulary for describing gene function

Molecular Function, Biological Process, Cellular Component

> 10,550 Citations in Google Scholar

Gene Ontology Annotation Database (‘GOA’)

• Records gene function using gene ontology terms

• Expert synthesis of the knowledge from thousands of articles

33k articles become 31 gene annotations

31 function annotations for human gene

Gene Ontology Curators

Great!

BUT

GO annotation is not complete

Many genes are not thoroughly annotated

[Figure: GO annotation counts (y-axis) for genes, sorted by decreasing counts (x-axis). Series: Biological Process only; + Electronic annotation (IEA). Data: NCBI, February 2013]

1 million articles per year....

Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.

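The Long Tail is a prolific source of content

[Figure: content produced (y-axis) by contributors, sorted (x-axis), divided into a Short Head and a Long Tail]

News reporting: Newspapers (short head), Blogs (long tail)

Video: TV/Hollywood (short head), YouTube (long tail)

Product reviews: Consumer reports (short head), Amazon reviews (long tail)

Food reviews: Food critics (short head), Yelp (long tail)

Gene annotation: bio-curators (short head), ???????????? (long tail)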

Wikipedia successfully harnesses the long tail

• Within top 10 most visited websites

• 14 million+ registered users

http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008

[Table: Articles, Words (millions), and Words/article, compared for Wikipedia and Britannica Online]

Wikipedia is reasonably accurate

“We can harness the Long Tail of scientists to directly participate in the gene annotation process.”

-Andrew Su

The Gene Wiki Hypothesis

Goal of the Gene Wiki project

• Enable the creation of a collaboratively written, continuously updated, high quality review article for every human gene.

Filtering, extracting, and summarizing PubMed

Success depends on a positive feedback loop

[Figure: feedback loop between value of service, number of users, and number of contributors]

Protein structure

Symbols and identifiers

Tissue expression pattern

Gene Ontology annotations

Links to structured databases

Gene summary

Protein interactions

Linked references

Gene “stubs” seed community contributions

A review article for every gene is powerful

Hyperlinks to related concepts

References to the literature

68 editors, 543 edits (as of July 2010)

The Gene Wiki project – 2010 stats

Value of service: 10,300 articles, 1.2 million words, 67 MB text (about 1,000 PLoS Biology research articles)

Number of users: 55 million page views

Number of contributors: 3,500 editors, 17,000 edits

Monthly growth of words in Gene Wiki articles, page views per month and edits per month between 1 September 2009 and 1 September 2011.

Good B M et al. Nucl. Acids Res. 2012;40:D1255-D1261

© The Author(s) 2011. Published by Oxford University Press.

Why is it working?

Google loves Wikipedia

• ...

• 1.86 million results from Google

• courses

• products

• databases

The Gene Wiki hitches a ride on Wikipedia

CC photo by ff137 on flickr

Take home messages

• Where possible, try to hitch a ride

• Success depends on a positive feedback loop (value, users, contributors)

But still, many genes lack structured annotation…

[Figure, repeated: GO annotation counts for genes, sorted by decreasing counts. Series: Biological Process only; + Electronic annotation (IEA). Data: NCBI, February 2013]

Can we generate structured annotations from the text of the gene wiki?

Great for people to read

?

Great for building software for people to use

Filtering, extracting, and summarizing PubMed

Document- and concept-centric text mining

[Figure labels: Documents; Concepts; Subject, Predicate, Object]

Simple text mining for gene annotations

Wikilink → GO exact match: GO:0006897

Gene Wiki mapping → NCBI Entrez Gene: 334

Candidate assertion: NCBI Entrez Gene: 334, GO:0006897

Good, BMC Genomics, 2011.
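The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the published pipeline: the article title, the example wikitext, and the one-entry lookup tables are hypothetical, and only the identifiers GO:0006897 and Entrez Gene 334 come from the slide.

```python
import re

# Toy GO lookup: lower-cased term name -> GO identifier.
# A real run would load the full ontology; this one entry is illustrative.
GO_TERMS = {
    "endocytosis": "GO:0006897",
}

# Hypothetical mapping from a Gene Wiki article title to its Entrez Gene ID.
ARTICLE_TO_GENE = {
    "Example gene article": 334,
}

# Matches [[Target]], [[Target|label]] and [[Target#section]]; captures Target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def wikilink_targets(wikitext):
    """Return the link targets of all wikilinks in the text, in order."""
    return [m.group(1).strip() for m in WIKILINK.finditer(wikitext)]


def candidate_assertions(title, wikitext):
    """Pair the article's gene with every wikilink that exactly names a GO term."""
    gene_id = ARTICLE_TO_GENE.get(title)
    if gene_id is None:
        return []
    assertions = []
    for target in wikilink_targets(wikitext):
        go_id = GO_TERMS.get(target.lower())
        if go_id is not None:
            assertions.append((gene_id, go_id))
    return assertions


text = "The protein is involved in [[endocytosis]] and binds [[Kinesin|kinesin]] motors."
print(candidate_assertions("Example gene article", text))
```

Each result is only a candidate: the link shows that the concept is mentioned in the gene's article, not that the sentence supports the annotation.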

Finding concepts

• NCBO Annotator Web Service

– Gene Ontology

– Human Disease Ontology

• Annotator service selected for:

– Speed, easy API, precision

Clement Jonquet, Nigam H Shah, Mark A Musen, (2009) The Open Biomedical Annotator. AMIA Summit on Translational Bioinformatics. 56-60 http://bioportal.bioontology.org/annotator

Mining workflow

Gene Wiki Articles (10,271) → Filtering, cleanup → Extract concepts (NCBO)

11,022 matched gene ontology terms

2,983 matched disease ontology terms

Results

[Figure: results for DO and GO, compared to current dbs and by manual evaluation on a random sample]

GO problems

• False match (e.g., “Olfactory receptors ... are responsible for the transduction of odorant signals.” The system incorrectly identifies ‘transduction’ (GO:0009293), defined as the transfer of genetic information to a bacterium from a bacteriophage or between bacterial or yeast cells mediated by a phage vector.)

• No support in sentence (e.g., “The protein is composed ... including 10 sialic acid residues, which are attached to the protein during posttranslational modification in the Golgi apparatus.” Such sentences may lead to incorrect annotations of ‘Golgi apparatus’ and ‘Posttranslational modification’.)
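The false-match problem is easy to reproduce with a plain dictionary matcher. The sketch below is a stand-in for the NCBO Annotator, not a description of how that service works internally: the two-entry label table and the longest-match rule are assumptions made for illustration, with GO:0007165 (signal transduction) added for contrast.

```python
import re

# Toy label table. GO:0009293 is the term named on the slide;
# GO:0007165 is included here only to contrast the two senses.
GO_LABELS = {
    "transduction": "GO:0009293",
    "signal transduction": "GO:0007165",
}


def exact_matches(sentence, labels=GO_LABELS):
    """Find labels occurring as whole phrases, preferring longer labels.

    A shorter label that overlaps an already matched longer one is skipped.
    """
    text = sentence.lower()
    covered = []  # (start, end) spans already claimed
    hits = []
    for label in sorted(labels, key=len, reverse=True):
        pattern = r"\b" + re.escape(label) + r"\b"
        for m in re.finditer(pattern, text):
            if any(m.start() < end and start < m.end() for start, end in covered):
                continue
            covered.append((m.start(), m.end()))
            hits.append((label, labels[label]))
    return hits


olfactory = ("Olfactory receptors ... are responsible for the "
             "transduction of odorant signals.")
print(exact_matches(olfactory))
print(exact_matches("This kinase acts in signal transduction."))
```

The olfactory sentence never contains the phrase "signal transduction", so the matcher falls back to the bare word and returns the phage-related term. String matching alone cannot tell which sense the author meant.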

Applications

• Enrichment analysis

– even with false positives, text-mined annotations can improve statistical analyses that are tolerant to noise.

• GeneWiki+
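To see why enrichment analysis tolerates noisy annotations, here is a small worked sketch. The slides do not say which statistical test was used; a one-sided hypergeometric test is assumed here because it is a common choice. All gene names and annotation sets are made up, and the mined set deliberately includes one false positive (g15).

```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn from `population`,
    of which `annotated` carry the term of interest."""
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / comb(population, sample)


# Hypothetical background of 20 genes and a 5-gene hit list from an experiment.
background = {"g%d" % i for i in range(20)}
hit_list = {"g0", "g1", "g2", "g3", "g4"}

# Hypothetical annotations for one term, e.g. "muscle contraction".
curated = {"g0", "g1"}
mined = {"g2", "g3", "g15"}  # g15 stands in for a text-mining false positive
combined = curated | mined

p_curated = hypergeom_pvalue(len(background), len(curated),
                             len(hit_list), len(hit_list & curated))
p_combined = hypergeom_pvalue(len(background), len(combined),
                              len(hit_list), len(hit_list & combined))

print("curated only:   %.4f" % p_curated)
print("curated + mined: %.4f" % p_combined)
```

In this toy case the extra true annotations outweigh the single false one, so the combined set gives a smaller p-value. With a different mix of true and false additions the result could go the other way.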

Gene Wiki+ for integrative queries

http://genewikiplus.org

mwsync

Good, J Biomed Semantics, 2012.

Dynamic queries across genes, diseases, SNPs

Good, J Biomed Semantics, 2012.

{{#ask: [[Category:Human_proteins]] [[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] [[HasSNP:: <q>[[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] </q>]]}}

OMIM, PharmGKB

Good, J Biomed Semantics, 2012.

Text mining take home

• Approach depends on corpus

– concept-centric text has advantages

• Depends a lot on the ontology

– (same text, same algorithm, completely different results)

• Approach depends on purpose

– high false positive rates are common but may be acceptable (e.g. enrichment analysis)

Can we skip text mining?

http://fiehnlab.ucdavis.edu/projects/Rice_metabolome/

Wikidata

Provide a database of the world’s knowledge that anyone can edit

- Denny Vrandečić

Wikidata

Reelin (Q414043)

is a (Property:P31): Protein (Q8054), Glycoprotein (Q187126)

regulates (Property:P128): Neural development (Q1345738)

Interacts with (Property:P129): VLDL receptor (Q1979313), Amyloid precursor protein (Q423510)

http://www.wikidata.org/wiki/Q414043

Wikidata

Q414043

Property:P31: Q8054, Q187126

Property:P128: Q1345738

Property:P129: Q1979313, Q423510

http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
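A request like the one above returns JSON that can be reduced to subject, property, object triples. The sketch below builds the same URL (with `format=json` added) and parses a hand-written sample shaped like a `wbgetentities` response. The sample is trimmed and illustrative, not a captured response, and the pairing of properties with items is one reading of the diagram on the slide.

```python
import json
from urllib.parse import urlencode

API = "http://wikidata.org/w/api.php"


def entity_url(qid, language="en"):
    """Build a wbgetentities request URL for one item."""
    params = {"action": "wbgetentities", "ids": qid,
              "languages": language, "format": "json"}
    return API + "?" + urlencode(params)


# Hand-written, trimmed sample in the shape of a wbgetentities response.
SAMPLE_JSON = """
{"entities": {"Q414043": {
  "id": "Q414043",
  "labels": {"en": {"language": "en", "value": "Reelin"}},
  "claims": {
    "P31": [
      {"mainsnak": {"snaktype": "value", "property": "P31",
        "datavalue": {"type": "wikibase-entityid",
                      "value": {"entity-type": "item", "numeric-id": 8054}}}},
      {"mainsnak": {"snaktype": "value", "property": "P31",
        "datavalue": {"type": "wikibase-entityid",
                      "value": {"entity-type": "item", "numeric-id": 187126}}}}
    ],
    "P128": [
      {"mainsnak": {"snaktype": "value", "property": "P128",
        "datavalue": {"type": "wikibase-entityid",
                      "value": {"entity-type": "item", "numeric-id": 1345738}}}}
    ]
  }}}}
"""


def item_statements(response, qid):
    """Return (subject, property, object) triples for item-valued claims."""
    claims = response["entities"][qid].get("claims", {})
    triples = []
    for prop, claim_list in claims.items():
        for claim in claim_list:
            snak = claim["mainsnak"]
            if snak.get("snaktype") != "value":
                continue  # skip "novalue" / "somevalue" snaks
            value = snak["datavalue"]["value"]
            if isinstance(value, dict) and "numeric-id" in value:
                triples.append((qid, prop, "Q%d" % value["numeric-id"]))
    return triples


print(entity_url("Q414043"))
for triple in item_statements(json.loads(SAMPLE_JSON), "Q414043"):
    print(triple)
```

Because the statements are already structured, no text mining is needed to recover them. A live call would fetch `entity_url(...)` and pass the decoded JSON to `item_statements`.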

“We can harness the Long Tail of scientists to directly participate in the gene annotation process.”

-Andrew Su

Gene Wiki acknowledgements

“The Gene Wiki in 2011: community intelligence applied to human gene annotation” Nucleic Acids Research 2012

“Mining the Gene Wiki for Functional Genomic Knowledge” BMC Genomics 2011

“Linking genes to diseases with a SNPedia-Gene Wiki mashup” Journal of Biomedical Semantics 2012

“Building a biomedical semantic network in Wikipedia with Semantic Wiki Links” Database: The Journal of Biological Databases and Curation 2012

“A gene wiki for community annotation of gene function” PLoS Biology 2008

“The Gene Wiki: community intelligence applied to human gene annotation” Nucleic Acids Research 2009

http://wordle.com

Many Wikipedia editors

WP:MCB Project

Funding and Support

NIH / NIGMS (Gene Wiki: GM089820)

bgood@scripps.edu, @bgood, i9606.blogspot.com, slideshare/goodb

My sister Erin has a PhD in linguistics, lives in Raleigh and is looking for work in research or teaching.

Help her out!

Gene Wiki content improves enrichment analysis

[Figure: p-value (PubMed only) against p-value (PubMed + GW), with regions labeled “More significant: PubMed + GW” and “More significant: PubMed only”; highlighted term: Muscle contraction]

Good, BMC Genomics, 2011.
