Functional Annotation of Proteins via the CAFA Challenge Lee Tien Duncan Renfrow-Symon Shilpa Nadimpalli Mengfei Cao COMP150PBT | Fall 2010


Page 1

Functional Annotation of Proteins via the CAFA Challenge

Lee Tien, Duncan Renfrow-Symon, Shilpa Nadimpalli, Mengfei Cao

COMP150PBT | Fall 2010

Page 2

What’s the problem?

1. Huge bottleneck = finding a protein’s function when given a protein sequence
2. Incomplete, inaccurate, or inconsistent annotations are difficult to work with and can propagate
3. No good way to measure the accuracy of an annotation predictor

Page 3

What is the CAFA Challenge?

Page 4

What are Gene Ontology (GO) terms?

• GO = controlled vocabulary of “gene ontologies”
• Cover three domains:
  ▫ Cellular component
  ▫ Molecular function
  ▫ Biological process
• Hierarchy:
  ▫ Broad/general (e.g. “catalytic activity”)
  ▫ Specific (e.g. “leukotriene-C4-synthase activity”)
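Because GO terms are hierarchical, annotating a protein with a specific term also implies the broader terms above it. The following is a minimal Python sketch of that propagation. The `PARENTS` map is a hypothetical, heavily simplified toy (the real ontology is a directed acyclic graph, and the intermediate terms between the two example terms are omitted here).

```python
# Toy parent map: child term -> list of broader parent terms.
# Hypothetical and simplified; the real GO graph contains intermediate
# terms between these examples and allows multiple parents per term.
PARENTS = {
    "leukotriene-C4-synthase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "molecular function": [],
}


def ancestors(term, parents=PARENTS):
    """Return every broader term reachable from `term` (excluding itself)."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(terms, parents=PARENTS):
    """Expand a set of annotated terms with all of their ancestors."""
    expanded = set(terms)
    for term in terms:
        expanded |= ancestors(term, parents)
    return expanded
```

For example, `propagate({"leukotriene-C4-synthase activity"})` would also contain “catalytic activity”, so a prediction of the specific term can be scored against the general one.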

Page 5

Outline of Our Approach

[Flow diagram: CAFA targets (FASTA sequences) → BLAST, PFAM, SMURF?, Betawrap Pro?, other secondary structure predictor? → GO ids for each CAFA target]
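The pipeline starts from CAFA targets supplied as FASTA sequences, so the first step in any implementation is reading those records. A minimal sketch in Python, assuming standard FASTA input where each record starts with a `>` header line; the function name `read_fasta` is illustrative and not taken from the project.

```python
def read_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines.

    The identifier is the first whitespace-delimited token after '>'.
    Sequence lines belonging to one record are concatenated.
    """
    identifier, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between or inside records
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            fields = line[1:].split()
            identifier, chunks = (fields[0] if fields else ""), []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)
```

Since the function accepts any iterable of lines, an open file handle can be passed directly, e.g. `dict(read_fasta(open(path)))` for a hypothetical `path` to the target file.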

Page 6

Pfam: Protein Family Database

• Collection of protein families represented by:
  ▫ Multiple sequence alignments
  ▫ Hidden Markov Models
• Two sections of Pfam:
  ▫ A: high-quality, manually-curated
  ▫ B: large, automatically-generated

[Figures: sample multiple sequence alignment; sample hidden Markov model]

Page 7

BLAST: Basic Local Alignment Search Tool

• Goal: find homologous (i.e. derived from a common ancestor) sequences from a database
• Various BLAST programs:
  ▫ blastp = query: protein, database: protein
  ▫ blastn = query: nucleotide, database: nucleotide
  ▫ blastx = query: translated nucleotide, database: protein
  ▫ tblastn = query: protein, database: translated nucleotide
  ▫ tblastx = query: translated nucleotide, database: translated nucleotide
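The program list above amounts to a lookup from (query type, database type) to a program name. A small Python sketch of that lookup; the names `BLAST_PROGRAMS` and `choose_blast_program` are illustrative, not part of BLAST itself.

```python
# (query type, database type) -> BLAST program, as listed on the slide.
# "translated nucleotide" means the nucleotide sequence is translated
# to protein before the comparison is made.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("translated nucleotide", "protein"): "blastx",
    ("protein", "translated nucleotide"): "tblastn",
    ("translated nucleotide", "translated nucleotide"): "tblastx",
}


def choose_blast_program(query, database):
    """Return the BLAST program for the given query and database types."""
    try:
        return BLAST_PROGRAMS[(query, database)]
    except KeyError:
        raise ValueError(
            f"no BLAST program for query={query!r}, database={database!r}"
        ) from None
```

CAFA targets are protein sequences, so a search against protein sequences from PDB would presumably use `choose_blast_program("protein", "protein")`, i.e. `blastp`.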

Page 8

SMURF: Structural Motifs Using Random Fields

• Determines whether a protein sequence contains one of the following supersecondary structures:
  ▫ 6-bladed propeller
  ▫ 7-bladed propeller
  ▫ 8-bladed propeller
  ▫ Double blades (e.g. 6-6, 6-7, 6-8…)
• Developed at Tufts!
• Some propeller functions:
  ▫ Often WD40 repeat: protein-protein interaction
  ▫ Signaling, transcription, cell cycle

[Images: a Smurf; a 7-bladed propeller]

Page 9

Final Database Structure

INPUT
• cafa_targets: cafa_id, uniprot_id, gi_access_id

RESULTS
• blast_results: cafa_id, pdb_id, refseq_id, e_value_score
• pfam_results: cafa_id, pfam_id
• smurf_results: cafa_id, template_id, p_value_score

MAPPING
• pdb_id → go_id
• refseq_id → uniprot_id
• uniprot_id → go_id
• pfam_id → go_id
• template_id → go_id

OUTPUT
• go_results: cafa_id, go_id, source, confidence
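To make the INPUT → RESULTS → MAPPING → OUTPUT flow concrete, here is a minimal sketch using Python's built-in `sqlite3` for the Pfam branch only. Column names follow the slide; the mapping table name `pfam_to_go`, the column types, the `'pfam'` source label, and the confidence value of 1.0 are assumptions made for illustration. The sample rows use the target and Pfam family from the example result slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cafa_targets (cafa_id TEXT PRIMARY KEY, uniprot_id TEXT, gi_access_id TEXT);
CREATE TABLE pfam_results (cafa_id TEXT, pfam_id TEXT);
-- Hypothetical name: the slide shows this mapping table without a name.
CREATE TABLE pfam_to_go (pfam_id TEXT, go_id TEXT);
CREATE TABLE go_results (cafa_id TEXT, go_id TEXT, source TEXT, confidence REAL);
""")

# Illustrative rows: one target, one Pfam hit, two GO terms for that family
# (zinc ion binding and nucleic acid binding).
conn.execute("INSERT INTO cafa_targets VALUES ('T38114', NULL, NULL)")
conn.execute("INSERT INTO pfam_results VALUES ('T38114', 'PF00642')")
conn.executemany(
    "INSERT INTO pfam_to_go VALUES (?, ?)",
    [("PF00642", "GO:0008270"), ("PF00642", "GO:0003676")],
)

# RESULTS joined with MAPPING gives OUTPUT rows.
# The confidence of 1.0 is a placeholder, not the project's scoring.
conn.execute("""
INSERT INTO go_results (cafa_id, go_id, source, confidence)
SELECT r.cafa_id, m.go_id, 'pfam', 1.0
FROM pfam_results AS r
JOIN pfam_to_go AS m ON m.pfam_id = r.pfam_id
""")

rows = conn.execute(
    "SELECT cafa_id, go_id, source FROM go_results ORDER BY go_id"
).fetchall()
```

The BLAST and SMURF branches would follow the same join pattern through their own mapping tables, with BLAST needing the extra refseq_id → uniprot_id → go_id hop for RefSeq hits.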

Page 10

Final Results Statistics

[Figure: Distribution of sequence hits by method; region counts 789, 69, 12, 19, 4, 3,445, 1,356]

Of 8,904 unknown sequences…
• 4,265 had at least one hit in PDB BLAST
• 4,824 had at least one hit in Pfam
• 104 had at least one hit in SMURF

In total, 5,694 unique sequences had at least one hit, a 63.9% success rate
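The seven counts from the figure can be checked against the per-method totals. The transcript does not say which count belongs to which region of the diagram, so the sketch below, in Python, simply tries every assignment of the seven numbers to the seven regions of a three-set Venn diagram and keeps those that reproduce the stated totals. The region names are illustrative labels, and any resulting assignment is an inference rather than something stated in the slides.

```python
from itertools import permutations

counts = [789, 69, 12, 19, 4, 3445, 1356]  # numbers printed in the figure
totals = {"blast": 4265, "pfam": 4824, "smurf": 104}  # per-method totals
regions = ["blast_only", "pfam_only", "smurf_only",
           "blast_pfam", "blast_smurf", "pfam_smurf", "all_three"]


def consistent_assignments():
    """Return every region assignment that reproduces the three totals."""
    found = []
    for perm in permutations(counts):
        r = dict(zip(regions, perm))
        blast = r["blast_only"] + r["blast_pfam"] + r["blast_smurf"] + r["all_three"]
        pfam = r["pfam_only"] + r["blast_pfam"] + r["pfam_smurf"] + r["all_three"]
        smurf = r["smurf_only"] + r["blast_smurf"] + r["pfam_smurf"] + r["all_three"]
        if (blast, pfam, smurf) == (totals["blast"], totals["pfam"], totals["smurf"]):
            found.append(r)
    return found


solutions = consistent_assignments()
unique_hits = sum(counts)                        # sequences with at least one hit
success_rate = round(100 * unique_hits / 8904, 1)
```

The sum of the seven regions equals the 5,694 unique sequences reported, and dividing by the 8,904 targets gives the quoted 63.9%.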

Page 11

Example Result

T38114

MDLDMNGGNKRVFQRLGGGSNRPTTDSNQKVCFHWRAGRCNRYPCPYLHRELPGPGSGPVAASSNKRVADESGFAGPSHR

RGPGFSGTANNWGRFGGNRTVTKTEKLCKFWVDGNCPYGDKCRYLHCWSKGDSFSLLTQLDGHQKVVTGIALPSGSDKLY

TASKDETVRIWDCASGQCTGVLNLGGEVGCIISEGPWLLVGMPNLVKAWNIQNNADLSLNGPVGQVYSLVVGTDLLFAGT

QDGSILVWRYNSTTSCFDPAASLLGHTLAVVSLYVGANRLYSGAMDNSIKVWSLDNLQCIQTLTEHTSVVMSLICWDQFL

LSCSLDNTVKIWAATEGGNLEVTYTHKEEYGVLALCGVHDAEAKPVLLCSCNDNSLHLYDLPSFTERGKILAKQEIRSIQ

IGPGGIFFTGDGSGQVKVWKWSTESTPILS

• BLAST: matches with PDB structures 2OVP, 3MKS, 2CNX, 1P22, 1NEX, 3N0E
  ▫ Transcription, mitosis, methylation, protein binding
• Pfam: match to family PF00642
  ▫ Zinc ion binding, nucleic acid binding
• SMURF: match to 7-bladed β-propeller template
  ▫ WD domain (protein binding)

Page 12

Possible Future Directions

• Improving functional annotation for β-propellers identified by SMURF
  ▫ Analyze training set of propeller proteins with known function to build probabilistic model of protein function based on propeller type
• Addition of other structural prediction tools for motifs with known function
  ▫ G protein-coupled receptors, membrane-bound proteins
• Expansion of BLAST search to include full nr database
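The slides do not specify what the probabilistic model for propellers would look like. One simple possibility, sketched here in Python purely as an assumption, is to estimate the conditional frequency of each function given the propeller type from a labeled training set. The function name `fit_function_model` and the example training pairs are made up for illustration.

```python
from collections import Counter, defaultdict


def fit_function_model(training):
    """Estimate P(function | propeller type) by relative frequency.

    `training` is an iterable of (propeller_type, go_term) pairs taken
    from propeller proteins whose function is already known.
    """
    counts = defaultdict(Counter)
    for propeller_type, go_term in training:
        counts[propeller_type][go_term] += 1
    model = {}
    for propeller_type, term_counts in counts.items():
        total = sum(term_counts.values())
        model[propeller_type] = {
            term: n / total for term, n in term_counts.items()
        }
    return model


# Made-up training pairs, only to show the shape of the input.
toy_training = [
    ("7-bladed", "protein binding"),
    ("7-bladed", "protein binding"),
    ("7-bladed", "signaling"),
    ("6-bladed", "catalytic activity"),
]
toy_model = fit_function_model(toy_training)
```

Under this sketch, the estimated frequency for a (propeller type, term) pair could be stored as the `confidence` value of the corresponding SMURF-derived row in `go_results`.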

Page 13

Questions?