Challenges in Biocuration Philippe Lamesch, PhD Carnegie Institution of Washington Stanford CA

Upload: elizabeth-mcfarland

Post on 27-Mar-2015


Challenges in Biocuration

Philippe Lamesch, PhD

Carnegie Institution of Washington

Stanford CA


Introduction

Matt Duffin: The Library

• PubMed contains 18,792,257 entries
• 50,000 papers indexed per month


• In Feb 2009:
  – 67,406,898 interactive PubMed searches were done
  – 92,216,786 entries were viewed


> 900 complete genomes to date

from www.genomesonline.org


Model Organism Databases (MODs)


1440 Databases in NAR 2009

Nucleotide Sequence Databases
  International Nucleotide Sequence Database Collaboration: 3
  Coding and non-coding DNA: 42
  Gene structure, introns and exons, splice sites: 25
  Transcriptional regulator sites and transcription factors: 63
RNA sequence databases: 72
Protein sequence databases
  General Sequence Databases: 15
  Protein properties: 16
  Protein localization and targeting: 23
  Protein sequence motifs and active sites: 25
  Protein domain databases; protein classification: 38
  Databases of individual protein families: 73
Structure Databases
  Small molecules: 18
  Carbohydrates: 9
  Nucleic acid structure: 15
  Protein structure: 84
Immunological databases: 27
Plant Databases: 104
Genomics Databases (non-vertebrate): 2
  Genome annotation terms, ontologies and nomenclature: 2
  Taxonomy and identification: 11
  General genomics databases: 12
  Viral genome databases: 28
  Prokaryotic genome databases: 68
  Unicellular eukaryotes genome databases: 19
  Fungal genome databases: 31
  Invertebrate genome databases: 54
Metabolic and Signaling Pathways
  Enzymes and enzyme nomenclature: 13
  Metabolic pathways: 23
  Protein-protein interactions: 77
  Signalling pathways: 6
Human and other Vertebrate Genomes
  Model organisms, comparative genomics: 68
  Human genome databases, maps and viewers: 16
  Human ORFs: 28
Human Genes and Diseases
  General human genetics databases: 15
  General polymorphism databases: 32
  Cancer gene databases: 25
  Gene-, system- or disease-specific databases: 56
Microarray Data and other Gene Expression Databases: 67
Proteomics Resources: 20
Other Molecular Biology Databases
  Drugs and drug design: 22
  Molecular probes and primers: 10
Organelle databases
  General: 8
  Mitochondrial genes and proteins: 16

http://www.oxfordjournals.org/nar/database/cap/


Responsibilities of a MOD curator

• Gene function curation
• Gene structure annotation
• Integration of new data types into database
• Implementation of new tools on the website
• Improve website
• Community support
• Grant writing
• Community outreach


Responsibilities of a MOD curator

Gene function curation

Gene structure curation


Functional genome annotation


What is functional annotation?

It’s defined as the process of collecting information about a gene’s biological identity:

• molecular function (transcription factor)
• biological roles (trichome development)
• subcellular localization (nucleus)
• mutant phenotype
• expression domain
• interaction with other genes and gene products


A long way to go: Human functional genome annotation

Number of human genes in UniProt: 20,331

Timeline of manual functional GO annotation of human genes

57% of all human genes manually annotated


A long way to go: Arabidopsis functional genome annotation

Number of Arabidopsis genes in TAIR9: 33,518 genes

26% of Arabidopsis genes manually annotated
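
To make the size of the remaining work concrete, the two percentages quoted for human and Arabidopsis can be turned into approximate gene counts. This is only a rough sketch: the totals and percentages come from the slides, while the derived counts are rounded estimates and the function name is made up for illustration.

```python
# Totals and annotated fractions as quoted on the slides.
GENOMES = {
    "human (UniProt)": (20_331, 0.57),
    "Arabidopsis (TAIR9)": (33_518, 0.26),
}


def annotation_gap(total_genes, fraction_annotated):
    """Return (approx. manually annotated genes, approx. genes still to annotate)."""
    annotated = round(total_genes * fraction_annotated)
    return annotated, total_genes - annotated


for name, (total, fraction) in GENOMES.items():
    annotated, remaining = annotation_gap(total, fraction)
    print(f"{name}: ~{annotated:,} annotated, ~{remaining:,} remaining")
```

Under these figures, roughly 8,700 human genes and roughly 24,800 Arabidopsis genes would still lack a manual annotation.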


Functional annotation: step-by-step


Prioritizing journals

[Figure: number of articles per year, 1999 to 2006; y-axis "# of articles", 0 to 2500]

• 200 papers/month for 2.5 curators
• High priority journal list was established


Too much data, not enough curators


Prioritizing journals

Based on Journal High Priority list

• CELL
• CURRENT BIOLOGY
• DEVELOPMENT
• GENES AND DEVELOPMENT
• NATURE
• NATURE CELL BIOLOGY
• NATURE GENETICS
• NUCLEIC ACIDS RESEARCH
• PLoS biology
• PNAS
• SCIENCE
• THE EMBO JOURNAL
• THE PLANT CELL
• THE PLANT JOURNAL
• TRENDS IN PLANT SCIENCE

Gene based

• prioritize papers with unannotated genes
• prioritize papers with novel genes
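
The two prioritization criteria (journal list and gene status) could be combined into a simple triage score. The sketch below is an assumption about how such a ranking might look, not a description of TAIR's actual procedure: the `Paper` class, the equal weights, and the gene sets passed in are all hypothetical.

```python
from dataclasses import dataclass, field

# Journal names taken from the high priority list on the slide.
HIGH_PRIORITY_JOURNALS = {
    "CELL", "CURRENT BIOLOGY", "DEVELOPMENT", "GENES AND DEVELOPMENT",
    "NATURE", "NATURE CELL BIOLOGY", "NATURE GENETICS",
    "NUCLEIC ACIDS RESEARCH", "PLOS BIOLOGY", "PNAS", "SCIENCE",
    "THE EMBO JOURNAL", "THE PLANT CELL", "THE PLANT JOURNAL",
    "TRENDS IN PLANT SCIENCE",
}


@dataclass
class Paper:
    title: str
    journal: str
    genes: list = field(default_factory=list)


def priority_score(paper, annotated_genes, known_genes):
    """Hypothetical score: one point per criterion that the paper meets."""
    score = 0
    if paper.journal.upper() in HIGH_PRIORITY_JOURNALS:
        score += 1  # journal is on the high priority list
    if any(g in known_genes and g not in annotated_genes for g in paper.genes):
        score += 1  # mentions a known gene that has no annotation yet
    if any(g not in known_genes for g in paper.genes):
        score += 1  # mentions a gene the database has never seen
    return score


def rank_papers(papers, annotated_genes, known_genes):
    """Sort papers so that the highest-scoring ones are curated first."""
    return sorted(
        papers,
        key=lambda p: priority_score(p, annotated_genes, known_genes),
        reverse=True,
    )
```

With 200 papers a month and 2.5 curators, even a crude ordering like this decides which papers are read at all, so the choice of weights matters more than the code.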


Functional annotation: step-by-step


Identifying the gene/organism of interest can be hard

• Nomenclature standards and collaborative efforts strive to give orthologous genes the same symbol.
  Example: BRCA1 exists in > 12 species

• Same symbol for different genes within a species.
  Example: PAP1 in A. thaliana
    Purple Acid Phosphatase I
    Phosphatidic Acid Phosphatase I
    Production of anthocyanin pigment I
    Phytochrome Associated Protein I

• Gene duplicates sharing a root symbol term
  Example: wnt8 in Zebrafish: wnt8a and wnt8b
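
The PAP1 case can be detected mechanically once symbols are stored alongside their full names. The following is a minimal sketch under the assumption that a database keeps (organism, symbol, full name) records. The PAP1 names come from the slide, while the zebrafish entry uses a placeholder name and the function is hypothetical.

```python
from collections import defaultdict

# (organism, symbol, full name) records; the PAP1 names are from the slide.
RECORDS = [
    ("A. thaliana", "PAP1", "Purple Acid Phosphatase I"),
    ("A. thaliana", "PAP1", "Phosphatidic Acid Phosphatase I"),
    ("A. thaliana", "PAP1", "Production of anthocyanin pigment I"),
    ("A. thaliana", "PAP1", "Phytochrome Associated Protein I"),
    ("Zebrafish", "wnt8a", "placeholder name for wnt8a"),
]


def ambiguous_symbols(records):
    """Return the symbols that map to more than one full name in a species."""
    names_by_symbol = defaultdict(set)
    for organism, symbol, full_name in records:
        names_by_symbol[(organism, symbol)].add(full_name)
    return {
        key: sorted(names)
        for key, names in names_by_symbol.items()
        if len(names) > 1
    }


for (organism, symbol), names in ambiguous_symbols(RECORDS).items():
    print(f"{symbol} in {organism} is ambiguous: {len(names)} different genes")
```

A symbol flagged this way cannot be resolved from the text alone, which is why a sequence identifier in the paper is so valuable to the curator.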


Identifying the gene/organism of interest can be hard

Solution
• Submit sequence identifiers and other useful information clarifying which genes are discussed in the publication
• Authors need to be aware of the nomenclature process
• Publishers and reviewers need to be more stringent in checking that gene names in the paper are approved and that the necessary sequence identifiers are provided


Identifying relevant data

• Goal: identify every novel experimental result, add it to the appropriate section in the database, and connect it to already existing data
• Distinguish experimentally supported from speculative assertions
• Gather experimental results, not censor them!



Example of a gene page at TAIR

Computational description


Example of a gene page at TAIR

Summary


Example of a gene page at TAIR

GO annotations


What is a Gene Ontology annotation?

An annotation is a statement that a gene product …
… has a particular molecular function
… is involved in a particular biological process
… is located in a certain cellular component
… as determined by a particular method
… as described in a particular reference

Annotations have four key components: gene product, term, method, and reference.

Adapted from Harold J Drabkin, The Jackson Laboratory


Smith et al. (2006) determined by an enzyme assay that ABC2 has protein kinase activity.

Reference: Smith et al. (2006)
Method: enzyme assay
Gene product: ABC2
Term: protein kinase activity
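
The four components map naturally onto a small record type. The sketch below is one possible representation, not the format any particular database uses; the class name and field names are assumptions, and the example values are the ones from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GOAnnotation:
    """One annotation: a gene product, a term, a method, and a reference."""
    gene_product: str
    term: str
    method: str
    reference: str

    def as_sentence(self):
        """Render the annotation back into the sentence it was curated from."""
        return (
            f"{self.reference} determined by {self.method} "
            f"that {self.gene_product} has {self.term}."
        )


example = GOAnnotation(
    gene_product="ABC2",
    term="protein kinase activity",
    method="an enzyme assay",
    reference="Smith et al. (2006)",
)
print(example.as_sentence())
```

Keeping the method and the reference as separate fields is what lets a user later filter annotations by the kind of evidence behind them.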


Same name, different concept

Cell


Different name, same concept

• glucose biosynthesis
• glucose synthesis
• glucose formation
• glucose anabolism
• gluconeogenesis

noncarbohydrate precursors (pyruvate, amino acids and glycerol) → glucose


The solution: Controlled vocabularies

• A standardized, restricted set of defined terms designed to reduce ambiguity in describing a concept.

e.g.
• glucose biosynthesis
• glucose synthesis
• glucose formation
• glucose anabolism
• gluconeogenesis
= Gluconeogenesis

• Applicable to many organisms, thus allowing cross-species comparisons.
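
In practice a controlled vocabulary behaves like a lookup from free-text synonyms to one preferred term. A minimal sketch, assuming a hand-built synonym table (the five phrases are the ones listed on the slide; the function name is hypothetical):

```python
# Synonyms from the slide, all resolving to the single controlled term.
SYNONYMS = {
    "glucose biosynthesis": "gluconeogenesis",
    "glucose synthesis": "gluconeogenesis",
    "glucose formation": "gluconeogenesis",
    "glucose anabolism": "gluconeogenesis",
    "gluconeogenesis": "gluconeogenesis",
}


def to_controlled_term(phrase):
    """Return the controlled term for a phrase, or None if it is unknown."""
    return SYNONYMS.get(phrase.strip().lower())


print(to_controlled_term("Glucose formation"))
print(to_controlled_term("glycolysis"))
```

Returning `None` for an unknown phrase, rather than guessing, leaves the decision with the curator.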


Gene structure annotation

• Arabidopsis genome sequenced almost 9 years ago
• High quality sequence with few gaps
• TIGR did initial genome annotation
• TAIR took over responsibility in 2005
• Current stats:
  27,379 protein coding genes
  4,827 pseudogenes or transposable elements
  1,312 ncRNAs
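
As a quick consistency check, the three categories in the current stats can be summed and compared with the TAIR9 gene count quoted earlier in the talk (33,518). The numbers are from the slides; the dictionary and variable names are only illustrative.

```python
# Category counts as given in the "Current stats" bullet.
TAIR_GENE_COUNTS = {
    "protein coding genes": 27_379,
    "pseudogenes or transposable elements": 4_827,
    "ncRNAs": 1_312,
}

total = sum(TAIR_GENE_COUNTS.values())
print(f"total: {total:,}")

# Share of each category in the total gene count.
for category, count in TAIR_GENE_COUNTS.items():
    print(f"{category}: {count / total:.1%}")
```

The sum matches the TAIR9 total, so the three categories appear to account for every gene in the release.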


Gene structure annotation in Arabidopsis

NEW: 282 genes; 1056 exons
UPDATED: 1254 models; 1144 exons

NEW: 1291 genes; 683 exons
UPDATED: 3811 models; 4007 exons

NEW: 681 genes; 828 exons
UPDATED: 10,792 models; 14,050 exons

TAIR6


Gene structure annotation in Arabidopsis

Novel genes


Gene structure annotation in Worm

> 600 C. elegans gene models added since 2004

> 6000 gene model structure updates from 2003-2009


[Figure: number of gene structure updates per release interval (WS158-168, WS168-178, WS178-188, WS188-198, WS198-2005); y-axis 0 to 2000]


Gene structure annotation in Human

The Consensus CDS (CCDS) project is a collaborative effort to identify a core set of human and mouse protein coding regions that are consistently annotated and of high quality.

Collaborators:
• European Bioinformatics Institute (EBI)
• National Center for Biotechnology Information (NCBI)
• Wellcome Trust Sanger Institute (WTSI)
• University of California, Santa Cruz (UCSC)


Gene structure annotation in Human

1.18 splice-variants/gene identified by the CCDS project


Gene structure annotation of model organisms: Remaining challenges

• Updating exon-intron structures of existing gene models
• Identifying all splice-variants of known loci
• Annotating specific gene types:
  Small genes
  Pseudogenes
  Transposable element genes
  RNA coding genes
  Antisense genes
  Genes within the UTR of other genes
  …


How do MOD curators annotate genomes?

Experimental & Computational Evidence

Automatic pipeline

Manual annotation

Genome annotation


Automated pipeline at TAIR

Program to Assemble Spliced Alignments (PASA)

NCBI → Clustered transcripts → Resulting gene model

Comparison: resulting gene model vs. previous gene model

Based on a set of rules, a decision is made
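
The comparison step can be pictured as a function that takes the previous model, the model derived from clustered transcripts, and the amount of supporting evidence, and returns a decision. The slides do not state the rules, so everything below is an assumed toy rule set: the exon representation, the `min_support` threshold, and the three outcomes are hypothetical and are not TAIR's or PASA's actual logic.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end) genomic coordinates, inclusive


def span(exons: List[Exon]) -> Tuple[int, int]:
    """Return the leftmost start and rightmost end covered by a gene model."""
    return min(start for start, _ in exons), max(end for _, end in exons)


def decide(previous: List[Exon], candidate: List[Exon],
           supporting_transcripts: int, min_support: int = 2) -> str:
    """Toy decision rules (assumed for illustration only)."""
    if sorted(previous) == sorted(candidate):
        return "keep"  # the transcripts confirm the existing model
    if supporting_transcripts < min_support:
        return "manual review"  # too little evidence for an automatic change
    prev_start, prev_end = span(previous)
    cand_start, cand_end = span(candidate)
    overlaps = cand_start <= prev_end and prev_start <= cand_end
    # Only update automatically when both models describe the same locus.
    return "update" if overlaps else "manual review"


previous_model = [(100, 200), (300, 400)]
print(decide(previous_model, [(100, 200), (320, 400)], supporting_transcripts=5))
```

Anything the rules cannot settle falls through to "manual review", which mirrors the split between the automatic pipeline and manual annotation described in the talk.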


Manual annotation at different MODs

Genome editing tool + Evidence set + Set of annotation rules

Genome editing tool:
• Apollo (Arabidopsis, Fly)
• Aceview (Worm)
• Zmap/Otterlace (Human)
• Artemis (Pathogen Project)
• …

Evidence set:
• Nucleotide sequence
• Short peptides
• Protein similarity
• Alternative predictions
• …

Set of annotation rules:
• Exon size
• Intron size
• Number of UTRs
• Coding/Non-coding ratio
• Splice-junctions
• …
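
Annotation rules of this kind can be expressed as checks that a curator's editing tool runs against each gene model. The sketch below covers only exon size and intron size; the thresholds are invented for illustration and would differ between organisms and databases.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end) genomic coordinates, inclusive

# Hypothetical thresholds, chosen only to make the example concrete.
MIN_EXON_SIZE = 3
MIN_INTRON_SIZE = 20
MAX_INTRON_SIZE = 10_000


def rule_violations(exons: List[Exon]) -> List[str]:
    """Return a human-readable message for every rule the model breaks."""
    problems = []
    ordered = sorted(exons)
    for start, end in ordered:
        if end - start + 1 < MIN_EXON_SIZE:
            problems.append(f"exon {start}-{end} is shorter than {MIN_EXON_SIZE} bp")
    # An intron is the gap between two neighbouring exons.
    for (_, left_end), (right_start, _) in zip(ordered, ordered[1:]):
        intron_size = right_start - left_end - 1
        if not MIN_INTRON_SIZE <= intron_size <= MAX_INTRON_SIZE:
            problems.append(
                f"intron of {intron_size} bp is outside "
                f"{MIN_INTRON_SIZE}-{MAX_INTRON_SIZE} bp"
            )
    return problems


print(rule_violations([(100, 200), (205, 400)]))
```

A model that breaks a rule is not necessarily wrong; the message simply tells the curator where to look at the evidence more closely.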


Manual annotation at TAIR: Apollo

[Screenshot labels: ESTs; cDNAs; Radish sequence alignments; Eugene prediction; dicot sequence alignments; monocot sequence alignments; Aceview gene predictions; Short MS peptide; 2 gene isoforms]


Recent genome annotation projects at TAIR

• Comparing TAIR models to those of 4 alternative prediction tools
• Integrating newly published large-scale datasets into the annotation:
  Short MS peptide sequences (Baerenfaller et al, Castellana et al)
  Short single-exon genes (Hanada et al)
  Transposable elements (Quesneville et al)
• Development of a ‘Gene Confidence’ Ranking
• Improve pseudogene annotation


Gene confidence ranking


Other responsibilities of gene structure curator

• Analyse large datasets submitted by community
• Represent data in a useful manner
• Update the genome assembly based on newly found indels/contaminations
• Generate downloadable datasets for users
• Implement new tools
• Do community outreach at conferences and in schools


Too much data, not enough curators

• More papers are published than curators can read

• Many databases have 2 or 3 curators to analyze tens of thousands of genes

• For many newly sequenced genomes no database exists to annotate genes


Involving the community in scientific curation

• Have publishers become more involved (Plant Physiology)
• Direct data submission from user to database
• Designate experts for specific genes/families
• Wikis
• Get community and students involved in annotating genomes
• New tools such as Biolit and the Microsoft plugin to mark up the original publication
• Other new ways of disseminating data (see Scivee)


Involving the community in genome annotation:
Direct submission by the community

Submit data in standardized format using MOD submission forms

Requires a lot of work from community
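
A standardized submission is only useful if it can be checked automatically on arrival. The sketch below shows what such a check might look like for an Arabidopsis submission. The required field names are hypothetical and do not reflect any real MOD form; the locus pattern follows the AGI convention (for example AT1G01010).

```python
import re

# AGI locus identifiers: chromosome 1-5, C (chloroplast) or M (mitochondrion).
AGI_LOCUS = re.compile(r"^AT[1-5CM]G\d{5}$")

# Hypothetical required fields, mirroring the four annotation components.
REQUIRED_FIELDS = ("locus", "term", "method", "reference")


def validate_submission(record):
    """Return a list of problems; an empty list means the record is accepted."""
    errors = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if not record.get(name)
    ]
    locus = record.get("locus", "")
    if locus and not AGI_LOCUS.match(locus.upper()):
        errors.append(f"not an AGI locus identifier: {locus}")
    return errors


print(validate_submission({"locus": "PAP1", "term": "x", "method": "y"}))
```

Rejecting a bare symbol such as PAP1 at submission time moves the disambiguation work to the person best placed to do it, which is also why this route demands effort from the community.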

Involving the community in genome annotation:
Partnership between journals and databases

TAIR: collaboration with Plant Physiology

Involving the community in genome annotation:
Direct editing of the database by registered experts

Involving the community in genome annotation:
Wikis

Involving the community in genome annotation:
Gene structure annotation in the classroom

Involving the community in genome annotation:
New tools: Microsoft Ontology Add-in

Involving the community in genome annotation:
New tools: Biolit

Involving the community in genome annotation:
New tools: Scivee and ‘Pubcasts’

The Constituents are Changing


Acknowledgments

PIs
Eva Huala
Sue Rhee

Curators
David Swarbreck
Donghui Li
Tanya Berardini
Kate Dreher
Peifen Zhang

TAIR Tech Team:
Vanessa Kirkuo
Chris Wilks
Tom Meyer
Cindy Lee
Raymond Chetty
Bob Muller

All my colleagues from other MODs


Establish semantic consistency

• Start to provide semantic enrichment of the literature in a way that is consistent.
• Given that authors are the experts on their own work, they should annotate their own data.
• Work with Microsoft to create a plugin that builds semantic consistency into the authoring process, a bit like a spellchecker. Every word typed is checked against the ontology; a common term is either changed to the systematic name or tagged with the systematic name while the common term is kept.
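
The spellchecker analogy can be sketched as a pass over the author's text that tags every recognised common term with its systematic name while leaving the author's wording in place. This is an illustration of the idea only, not the Microsoft plugin: the lookup table and the bracket notation are assumptions.

```python
import re

# Hypothetical lookup from common terms to systematic (ontology) names.
COMMON_TO_SYSTEMATIC = {
    "glucose synthesis": "gluconeogenesis",
    "glucose formation": "gluconeogenesis",
}


def tag_terms(text):
    """Append the systematic name in brackets after each known common term."""
    def tag(match):
        common = match.group(0)
        return f"{common} [{COMMON_TO_SYSTEMATIC[common.lower()]}]"

    # Try longer phrases first so that they win over shorter overlapping ones.
    alternatives = "|".join(
        re.escape(term)
        for term in sorted(COMMON_TO_SYSTEMATIC, key=len, reverse=True)
    )
    return re.sub(rf"\b(?:{alternatives})\b", tag, text, flags=re.IGNORECASE)


print(tag_terms("We measured glucose synthesis in leaves."))
```

Tagging rather than replacing keeps the text readable for the author while giving the curator, and any text-mining tool, an unambiguous handle.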


A curator is not a reviewer

• While we do not control the quality of the data, thorough annotation and user-friendly database tools are the keys to making the database useful.


Availability of web servers
