Center for Mass Spectrometry and Proteomics | Phone | (612)625-2280 | (612)625-2279

© 2015 Regents of the University of Minnesota. All rights reserved.

Protein Sequence Databases

…and your Mass Spectrometry-based Proteomics Experiment

Outline

• Protein Database (DB)
  • Origin
  • Sources
  • Format
  • Size
  • Composition

• Selecting a database for mass spec search

• Effect of DB on mass spec search results

• Post MS analysis: protein annotation, ontology, alignment

Terminology

• FASTA

• Database repository

• NCBI database

• UniProtKB

• Swiss-Prot

• RefSeq (reference sequence)

• Homology

• Contaminants DB

• Ontology

FASTA Protein Sequence

• Name and Origin

• FASTA (pronounced ‘fast-aye’)

• ORIGIN: for sequence similarity alignment tool (1985)

• REF: DJ Lipman, WR Pearson (1985) PMID: 2983426 "The algorithm has been implemented in a computer program designed to search protein databases very rapidly. For example, comparison of a 200-amino-acid sequence to the 500,000 residues in the National Biomedical Research Foundation library would take less than 2 minutes on a minicomputer, and less than 10 minutes on a microcomputer (IBM PC)."

• Stands for “fast all” – the file format worked with ‘all’ alphabets (amino acid and nucleotide)

FASTA Protein Sequence Format

• Structure: TEXT file

• Line 1: description line with sequence identifier

• Line 2 onward: protein sequence in one-letter amino acid code, wrapped at up to 80 characters per line

• Allowed characters:
  • Amino acid one-letter code
  • X
  • *
  • -
  • Custom one-letter amino acid codes
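The format rules above can be sketched as a small reader. This is a minimal illustration, not the parser used by any particular search program: the function names `read_fasta` and `invalid_residues` are hypothetical, and any custom one-letter codes have to be passed in explicitly because the slides do not list them.

```python
import io

# Standard 20 amino acids plus the special characters listed on the slide.
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")
SPECIAL_CHARACTERS = set("X*-")


def read_fasta(handle):
    """Yield (description, sequence) pairs from a FASTA text stream."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]  # description line without the leading '>'
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data found before any '>' description line")
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)


def invalid_residues(sequence, custom_codes=""):
    """Return the set of characters that are not allowed in the sequence."""
    allowed = STANDARD_RESIDUES | SPECIAL_CHARACTERS | set(custom_codes)
    return set(sequence.upper()) - allowed


if __name__ == "__main__":
    # Made-up record used only to exercise the reader.
    demo = io.StringIO(">demo_1 hypothetical example protein\nSELEEQLTPV\nAEETR\n")
    for description, sequence in read_fasta(demo):
        print(description, len(sequence), invalid_residues(sequence))
```

Checking characters before import is a cheap way to catch a file that is not really protein FASTA (for example, a nucleotide file or a file with stray digits).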

FASTA Format Header Line Sequence Identifiers

Line 1: description line with sequence identifier

FASTA Protein Sequence from NCBI: example

[Figure: example NCBI FASTA record with Line 1 (description line) and Line 2 (sequence) labeled]

NOTE: In Sept 2016, gi numbers were replaced with accession.version identifiers
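Because the identifier style depends on the repository and on when the file was downloaded, it can help to check which style a header uses before importing a database. The sketch below recognizes three common layouts: UniProtKB (`sp|accession|entry_name` or `tr|...`), the legacy NCBI `gi|number|db|accession.version|` form, and a bare NCBI accession.version. The function name `parse_header`, the regular expressions, and the example headers are illustrative assumptions, not the parsing rules of any specific search program.

```python
import re

# UniProtKB: >sp|ACCESSION|ENTRY_NAME description   (sp = Swiss-Prot, tr = TrEMBL)
UNIPROT = re.compile(r"^(sp|tr)\|([^|]+)\|(\S+)")
# Legacy NCBI: >gi|12345|ref|ACCESSION.VERSION| description
NCBI_GI = re.compile(r"^gi\|(\d+)\|[^|]*\|([^|]+)\|")
# Current NCBI: >ACCESSION.VERSION description
ACC_VERSION = re.compile(r"^([A-Za-z_]+\d+\.\d+)\b")


def parse_header(header):
    """Classify a FASTA description line and extract its accession."""
    header = header.lstrip(">").strip()
    match = UNIPROT.match(header)
    if match:
        database = "Swiss-Prot" if match.group(1) == "sp" else "TrEMBL"
        return {
            "style": "UniProtKB",
            "database": database,
            "accession": match.group(2),
            "entry_name": match.group(3),
        }
    match = NCBI_GI.match(header)
    if match:
        return {"style": "NCBI (legacy gi)", "gi": match.group(1), "accession": match.group(2)}
    match = ACC_VERSION.match(header)
    if match:
        return {"style": "NCBI accession.version", "accession": match.group(1)}
    # Fall back to the first whitespace-delimited token.
    return {"style": "unknown", "accession": header.split()[0] if header else ""}


if __name__ == "__main__":
    # Made-up headers that only imitate the layout of each style.
    for line in (
        ">sp|X00000|DEMO_HUMAN Demo protein OS=Homo sapiens",
        ">gi|123456|ref|NP_000000.1| demo protein [Homo sapiens]",
        ">NP_000000.1 demo protein [Homo sapiens]",
    ):
        print(parse_header(line))
```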

Selecting a Protein Sequence Database

• Public repositories, such as
  • NCBI
  • UniProtKB
    • Swiss-Prot: manually annotated and reviewed
    • TrEMBL: automatically annotated and not reviewed

• Custom (from customer)
  • NOTE: format is important!

• Represent species (1 or more) from which protein sample originated
  • Example: Mouse protein expressed in E. coli

• Ideal size range: ~2,000 to < 1 million entries

Selecting a Protein Database: UniProtKB repository

Selecting a Protein Database: NCBI Ref Seq repository

Choose Your Taxonomy or Taxonomies

NOTES:

• If a recombinant protein is expressed in a host cell, include both the host proteins and the expressed protein(s)

• If the protein database for your species has <2000 proteins, merge it with another protein database (e.g., yeast) for statistical reasons

• Protein sequence headers must be parsed correctly
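The notes above amount to a merge step followed by a size check. A minimal sketch, assuming each database has already been read into a list of (description, sequence) pairs with non-empty description lines; the names `merge_databases` and `size_in_range` are hypothetical, and the bounds come from the size range given earlier in the slides (~2,000 to < 1 million entries).

```python
MIN_ENTRIES = 2000        # lower bound from the slides
MAX_ENTRIES = 1_000_000   # upper bound from the slides


def merge_databases(*databases):
    """Combine several record lists, keeping the first record seen per identifier.

    The identifier is taken to be the first whitespace-delimited token of the
    description line, which is an assumption about the header layout.
    """
    merged = {}
    for records in databases:
        for description, sequence in records:
            identifier = description.split()[0]
            merged.setdefault(identifier, (description, sequence))
    return list(merged.values())


def size_in_range(records):
    """True if the merged database falls inside the suggested size range."""
    return MIN_ENTRIES <= len(records) < MAX_ENTRIES


if __name__ == "__main__":
    # Toy records: an expressed protein plus two host proteins.
    expressed = [("demo_mouse_1 expressed protein", "SELEEQLTPVAEETR")]
    host = [("demo_host_1 host protein", "EQVAEVR"), ("demo_host_2 host protein", "MKTAAA")]
    combined = merge_databases(expressed, host)
    print(len(combined), "entries; size in range:", size_in_range(combined))
```

If the check fails on the low side, a further database (the slides suggest yeast) can be passed as an additional argument to the same merge.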

Taxonomy specification - UniProtKB

(19996)

Taxonomy specification - NCBI

Protein Database repository content for Thirteen-lined Ground Squirrel

Database Source               Number of Proteins
Swiss-Prot* (reviewed)        20
TrEMBL* (unreviewed)          20,076
UniProt Reference Proteome    19,966
NCBI (‘non-redundant’)        30,130
NCBI Reference Sequence       29,842

* From UniProt

Protein Database Characteristics

…related to your mass spectrometry experiment

Splice-form variants

Sequence alignments: Protein Cytochrome P450 2D6

Protein Sequence Variants

SNPs (single nucleotide polymorphisms)

https://hive.biochemistry.gwu.edu

Natural variants

In silico trypsin digest, ‘native’ protein

In silico trypsin digest, with VARIANTS

[Figure: in silico trypsin digest with variant peptide examples 1 and 2 marked]
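An in silico trypsin digest can be sketched in a few lines. The version below assumes the commonly used rule that trypsin cleaves after K or R except when the next residue is P; the slides do not state which rule the search program applies, and the function names are hypothetical. The example sequences are the native and Q -> K variant forms of peptide example 1 from the mass table in this deck.

```python
def tryptic_sites(sequence):
    """Positions after which trypsin is assumed to cut (after K/R, not before P)."""
    return [
        i + 1
        for i, residue in enumerate(sequence[:-1])
        if residue in "KR" and sequence[i + 1] != "P"
    ]


def digest(sequence, missed_cleavages=0):
    """Return tryptic peptides, allowing up to `missed_cleavages` skipped sites."""
    if not sequence:
        return []
    bounds = [0] + tryptic_sites(sequence) + [len(sequence)]
    peptides = []
    for start in range(len(bounds) - 1):
        last = min(start + 2 + missed_cleavages, len(bounds))
        for end in range(start + 1, last):
            peptides.append(sequence[bounds[start]:bounds[end]])
    return peptides


if __name__ == "__main__":
    native = "SELEEQLTPVAEETR"
    variant = "SELEEKLTPVAEETR"  # Q -> K introduces a new cleavage site
    print(digest(native))
    print(digest(variant))
    print(digest(variant, missed_cleavages=1))
```

With no missed cleavages the native sequence stays intact, while the variant splits into two shorter peptides; allowing one missed cleavage also keeps the full-length variant peptide.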

Effect of Variant on Peptide Mass

Peptide example          Peptide Mass*    Peptide Sequence
1 – native               1730.8443        SELEEQLTPVAEETR
1 – variant (Q -> K)     1730.8806        SELEEKLTPVAEETR
1 – variant (Q -> K)     734.3566         SELEEK
1 – variant (Q -> K)     1015.5418        LTPVAEETR
2 – native               830.4366         EQVAEVR
2 – variant (V -> E)     860.4108         EQEAEVR

* Monoisotopic [M + H]+
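The masses in the table can be reproduced by summing monoisotopic residue masses and adding one water and one proton. The sketch below uses standard monoisotopic values rounded to five decimals, so results may differ from the table in the fourth decimal place; the function name `mh_plus` is hypothetical, and unmodified peptides are assumed.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O added for the peptide termini
PROTON = 1.007276   # charge carrier for [M + H]+


def mh_plus(peptide):
    """Monoisotopic [M + H]+ of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER + PROTON


if __name__ == "__main__":
    for peptide in (
        "SELEEQLTPVAEETR", "SELEEKLTPVAEETR", "SELEEK",
        "LTPVAEETR", "EQVAEVR", "EQEAEVR",
    ):
        print(f"{peptide:16s} {mh_plus(peptide):.4f}")
```

The Q -> K substitution shifts the intact peptide by only about 0.036 Da, but because K creates a new tryptic site the peptide is more likely to appear as the two shorter products; the V -> E substitution shifts the mass by about 30 Da. In either case, a database that lacks the variant sequence gives the search program nothing to match the spectrum against.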

Proteomics Search Program Meets Protein Sequence Database

• Protein sequence file is downloaded to local computer

• Merge with common lab contaminants (keratins and more) database
  • http://www.thegpm.org/crap/

• Protein database is imported or indexed in the proteomics search program (sequence format is critical)

• REVERSED sequences are generated for False Discovery Rate (FDR) calculations

• Protein sequences are digested with enzymes in silico
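The reversed-sequence step in the list above can be sketched as follows. Search programs generate decoys internally and use their own naming conventions, so the `REV_` prefix, the function names, and the simple decoy/target ratio used as the FDR estimate are all illustrative assumptions rather than the behavior of a specific program.

```python
def add_reversed_decoys(records, prefix="REV_"):
    """Append a reversed-sequence decoy for every target record.

    `records` is a list of (description, sequence) pairs; the decoy keeps the
    description with a prefix so that target and decoy hits can be told apart.
    """
    decoys = [(prefix + description, sequence[::-1]) for description, sequence in records]
    return list(records) + decoys


def estimate_fdr(n_target_hits, n_decoy_hits):
    """One common estimate: decoy hits divided by target hits above a score cutoff."""
    if n_target_hits == 0:
        return 0.0
    return n_decoy_hits / n_target_hits


if __name__ == "__main__":
    targets = [("demo_1 hypothetical protein", "SELEEQLTPVAEETR")]
    for description, sequence in add_reversed_decoys(targets):
        print(description, sequence)
    print("estimated FDR:", estimate_fdr(n_target_hits=200, n_decoy_hits=2))
```

Reversing preserves the amino acid composition and length of every protein, so matches to decoys approximate how often a random match of similar quality would occur among the targets.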

Database search > Protein List

• Database search algorithm matches spectrum > peptide > protein

• RESULTS: List of protein identifications with accession numbers

• POST Database search options (outside CMSP):
  1. Protein annotation
  2. Sequence alignment
  3. Obtain related Gene Ontology information

POST Database search options

What you can do with your protein list.

1) Protein Annotation from UniProtKB

2) Sequence alignment with UniProt alignment tool

2) Sequence alignment with UniProt alignment tool: numerous amino acid labeling options

* (asterisk) indicates positions which have a single, fully conserved residue.

: (colon) indicates conservation between groups of strongly similar properties, scoring > 0.5 in the Gonnet PAM 250 matrix.

. (period) indicates conservation between groups of weakly similar properties, scoring <= 0.5 in the Gonnet PAM 250 matrix.

2) Sequence alignment with NCBI BLAST

3) Link Gene Ontology information to Proteins

• Define: “The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products across databases.”

• Ontologies/Vocabularies
  • molecular function: molecular activities of gene products
  • cellular component: where gene products are active
  • biological process: pathways and larger processes made up of the activities of multiple gene products

(http://geneontology.org/page/documentation)

Molecular Function Pie Chart for a List of 96 Protein Identifiers (gi numbers) submitted to PANTHER (http://www.pantherdb.org/)

Protein list from Supplemental data REF: Thu TM et al (2016) Cell Reports, 15(6):1254-65; PMID: 27134171

Protein Class Pie Chart for a List of 96 Protein Identifiers (gi numbers) submitted to PANTHER (http://www.pantherdb.org/)

Protein list from Supplemental data REF: Thu TM et al (2016) Cell Reports, 15(6):1254-65; PMID: 27134171

Biological Process Pie Chart for a List of 96 Protein Identifiers (gi numbers) submitted to PANTHER (http://www.pantherdb.org/)

Protein list from Supplemental data REF: Thu TM et al (2016) Cell Reports, 15(6):1254-65; PMID: 27134171

Database Tools for Proteins

• http://geneontology.org/

• http://string-db.org/

• http://www.pantherdb.org/

• http://www.ingenuity.com/products/ipa (licensed at UM via MSI)

ALSO:

Match mass spec data to your RNA-Seq data with:

• https://galaxyp.msi.umn.edu/