Protein analysis and proteomics
July 29, 2009August 5, 2009
BioinformaticsM.E:800.707J. Pevsner
Many of the images in this powerpoint presentationare from Bioinformatics and Functional Genomicsby J Pevsner (copyright © 2009 by Wiley-Blackwell).
These images and materials may not be usedwithout permission from the publisher.
Visit http://www.bioinfbook.org
Copyright notice
Outline for tonight: Protein analysis and proteomics
Individual proteins Perspective 1: Protein families (domains and motifs) Perspective 2: Physical properties (3D structure) Perspective 3: Localization Perspective 4: Function
protein
Page 389
RNADNA
protein
[1] Protein families
[4] Protein function
[2] Physical properties
Page 389
[3] Protein localization
Gene ontology (GO):--cellular component--biological process--molecular function
The Human Proteome Organisation (HUPO) Proteomics Standards Initiative (PSI)
Goals: defining standards for proteomic data representation to facilitate the comparison, exchange, and verification of data
The Human Proteome Organisation (HUPO) Proteomics Standards Initiative (PSI)
Work groups
# Gel Electrophoresis# Mass Spectrometry# Molecular Interactions# Protein Modifications# Proteomics Informatics# Sample Processing
Themes
# Controlled vocabularies# MIAPE: Minimum information about a proteomics experiment
The Human Proteome Organisation (HUPO) Proteomics Standards Initiative (PSI)
http://www.psidev.info/
Perspective 1: Protein domains and motifs
Page 389
Definitions
Signature: • a protein category such as a domain or motif
Page 390
Definitions
Signature: • a protein category such as a domain or motif
Domain: • a region of a protein that can adopt a 3D structure• a fold• a family is a group of proteins that share a domain• examples: zinc finger domain immunoglobulin domain
Motif (or fingerprint):• a short, conserved region of a protein• typically 10 to 20 contiguous amino acid residues
Page 390
15 most common domains (human)
Zn finger, C2H2 type 1093 proteinsImmunoglobulin 1032EGF-like 471Zn-finger, RING 458Homeobox 417Pleckstrin-like 405RNA-binding region RNP-1400SH3 394Calcium-binding EF-hand 392Fibronectin, type III 300PDZ/DHR/GLGF 280Small GTP-binding protein 261BTB/POZ 236bHLH 226Cadherin 226
Page 391Source: Integr8 at EBI website
15 most common domains (various species)
The European Bioinformatics Institute (EBI) offers many key proteomics resources at the Integr8 site:
http://www.ebi.ac.uk/proteome/
Page 391
1. Go to the Integr8 site: http://www.ebi.ac.uk/proteome/
2. Browse species; choose Homo sapiens.
3. Click “Proteome analysis”
4. Obtain a variety of statistics, such as common repeats, domains, average protein length
Source: Integr8 at EBI website (updated 7/09)
amino acid
fre
que
ncy
Amino acid composition
Source: Integr8 at EBI website (updated 7/09)
Re
lativ
e fr
eq
uen
cy
protein length
Average protein length : 468+/- 522 amino acid residues Size range: 4 - 34350 amino acid residues
Definition of a domain
According to InterPro at EBI (http://www.ebi.ac.uk/interpro/):
A domain is an independent structural unit, found aloneor in conjunction with other domains or repeats.Domains are evolutionarily related.
According to SMART (http://smart.embl-heidelberg.de):
A domain is a conserved structural entity with distinctivesecondary structure content and a hydrophobic core.Homologous domains with common functions usuallyshow sequence similarities.
Page 390
Varieties of protein domains
Page 393
Extending along the length of a protein
Occupying a subset of a protein sequence
Occurring one or more times
Example of a protein with domains: Methyl CpG binding protein 2 (MeCP2)
MBD
Page 393
TRD
The protein includes a methylated DNA binding domain(MBD) and a transcriptional repression domain (TRD).MeCP2 is a transcriptional repressor.
Mutations in the gene encoding MeCP2 cause RettSyndrome, a neurological disorder affecting girlsprimarily.
Page 393
Result of an MeCP2 blastp search:A methyl-binding domain shared by several proteins
domain
Page 393
Are proteins that share only a domain homologous?
Example of a multidomain protein: HIV-1 pol
Pol (NP_789740), 995 amino acids longGag-Pol (NP_057849), 1435 amino acids
• cleaved into three proteins with distinct activities:-- aspartyl protease-- reverse transcriptase-- integrase
We will explore HIV-1 pol and other proteins at theExpert Protein Analysis System (ExPASy) server.
Visit www.expasy.org/
Page 394
Page 394
http://www.expasy.ch/
1. Go to ExPASy (http://www.expasy.ch/)2. If you know the SwissProt accession of your protein, enter it at top.3. Otherwise go into Swiss-Prot/TrEMBL, click SRS (Sequence Retrieval System), click Start, then click continue, then search for your protein of interest.
SwissProt: which HIV-1 pol is “correct”?
Use P04587
The SwissProt entry for any protein provideshighly useful information…
Page 395
SwissProt entry for HIV-1 pol links to many databases
Page 395
ProDom entry for HIV-1 pol shows many related proteins
www.uniprot.org
Three protein databases recently merged to form UniProt:
• SwissProt
• TrEMBL (translated European Molecular Biology Lab)
• Protein Information Resource (PIR)
You can search for information on your favorite protein there; a BLAST server is provided.
Proteins can have both domains and motifs (patterns)
Domain(aspartylprotease)
Domain(reversetranscriptase)
Motif(severalresidues)
Motif(severalresidues)
Page 396
Definition of a motif
A motif (or fingerprint) is a short, conserved region of a protein. Its size is often 10 to 20 amino acids.
Simple motifs include transmembrane domains andphosphorylation sites. These do not imply homologywhen found in a group of proteins.
PROSITE (www.expasy.org/prosite) is a dictionary of motifs (there are currently 1600 entries). In PROSITE,a pattern is a qualitative motif description (a proteineither matches a pattern, or not). In contrast, a profileis a quantitative motif description. We will encounterprofiles in Pfam, ProDom, SMART, and other databases.
Page 394
Summary of Perspective 1: Protein domains and motifs
A signature is a protein category such as a domain or motif.
You can learn about domains at Integr8, and at databases such as InterPro and Pfam.
A motif (or fingerprint) is a short, conserved sequence. You can study motifs at Prosite at ExPASy.
Perspective 2: Physical properties of proteins
Page 397
Page 398
Physical properties of proteins
Many websites are available for the analysis ofindividual proteins. ExPASy and ISREC are twoexcellent resources.
The accuracy of these programs is variable. Predictions based on primary amino acid sequence (such as molecular weight prediction) are likely to be more trustworthy. For many other properties (such asposttranslational modification of proteins by specific sugars), experimental evidence may be required rather than prediction algorithms.
Page 399
Page 399
Access a variety of protein analysis programsfrom the top right of the ExPASy home page
Page 400
Page 400
Protein secondary structure
Protein secondary structure is determined by the amino acid side chains.
Myoglobin is an example of a protein having many-helices. These are formed by amino acid stretches4-40 residues in length.
Thioredoxin from E. coli is an example of a proteinwith many sheets, formed from strands composedof 5-10 residues. They are arranged in parallel orantiparallel orientations.
Page 425
Fig. 11.3Page 427
Myoglobin(John Kendrew, 1958)
Thioredoxin
Fig. ~11.3Page 427
Secondary structure prediction
Chou and Fasman (1974) developed an algorithmbased on the frequencies of amino acids found in helices, -sheets, and turns.
Proline: occurs at turns, but not in helices.
GOR (Garnier, Osguthorpe, Robson): related algorithm
Modern algorithms: use multiple sequence alignmentsand achieve higher success rate (about 70-75%)
Page 427
Secondary structure prediction
Web servers:
GOR4JpredNNPREDICTPHDPredatorPredictProteinPSIPREDSAM-T99sec
Table 11-3Page 429
Go to http://pbil.univ-lyon1.fr/,click “Secondary structure prediction”to access this prediction tool
Tertiary protein structure: protein folding
Main approaches:
[1] Experimental determination (X-ray crystallography, NMR)
[2] Prediction
► Comparative modeling (based on homology)
► Threading
► Ab initio (de novo) prediction (Ingo Ruczinski at JHSPH)
Page 430
Experimental approaches to protein structure
[1] X-ray crystallography-- Used to determine 80% of structures-- Requires high protein concentration-- Requires crystals-- Able to trace amino acid side chains-- Earliest structure solved was myoglobin
[2] NMR-- Magnetic field applied to proteins in solution-- Largest structures: 350 amino acids (40 kD)-- Does not require crystallization
Page 430
Steps in obtaining a protein structure
Target selection
Obtain, characterize protein
Determine, refine, model the structure
Deposit in repositoryFig 11.5page 431
Priorities for target selection for protein structures
Page 432
Historically, small, soluble, abundant proteins werestudied (e.g. hemoglobin, cytochromes c, insulin).
Modern criteria:• Represent all branches of life• Represent previously uncharacterized families• Identify medically relevant targets• Some are attempting to solve all structures within an individual organism (Methanococcus jannaschii, Mycobacterium tuberculosis)
The Protein Data Bank (PDB)
Page 434
• PDB is the principal repository for protein structures• Established in 1971• Accessed at http://www.rcsb.org/pdb or simply http://www.pdb.org• Currently contains 59,000 structure entities
Updated 7/28/09
Fig. 11.7Page 435
stru
ctur
es
Fig. 9.6Page 281
yearUpdated 12/08
PDB content growth (www.pdb.org)
1972198019802000
yearly
total
10,000
2008
50,000
30,000
PDB holdings (12/08)
50,621 proteins, peptides2,225 protein/nucl. complexes1,946 nucleic acids33 other; carbohydrates54,825 total
Table 11-4Page 435
Fig. ~11.9Page 436
Fig. ~11.10Page 436
Viewing hemoglobin (accession 2H35) at PDB
Viewing structures at PDB: WebMol
Fig. 11.11Page 437
Protein Data Bank
Swiss-Prot, NCBI, EMBL
CATH, Dali, SCOP, FSSP
gateways to access PDB files
databases that interpret PDB files
The CATH Hierarchy
Fig. 11.18Page 444
Access to PDB through NCBI
Page 437
You can access PDB data at the NCBI several ways.
• Go to the Structure site, from the NCBI homepage• Use Entrez• Perform a BLAST search, restricting the output to the PDB database
Fig. ~11.12Page 438
Do a blastp search;set the database to pdb (Protein Data Bank)
Structure links
Structure accession(e.g. 2JTZ)
Fig. 9.13 Page 288
Fig. 9.14 Page 289
Access to PDB structures through NCBI
Page 291
Molecular Modeling DataBase (MMDB)
Cn3D (“see in 3D” or three dimensions):structure visualization software
Vector Alignment Search Tool (VAST):view multiple structures
Fig. 9.15 Page 290
Fig. 9.15 Page 290
Fig. 9.16 Page 291
Fig. 9.17 Page 292
Access to structure data at NCBI: VAST
Page 294
Vector Alignment Search Tool (VAST) offers a varietyof data on protein structures, including
-- PDB identifiers-- root-mean-square deviation (RMSD) values to describe structural similarities-- NRES: the number of equivalent pairs of alpha carbon atoms superimposed-- percent identity
Fig. 9.18Page 293
Predicting protein structure: CASP8 competition
8th Community Wide Experiment on theCritical Assessment of Techniques for Protein Structure Prediction
http://predictioncenter.gc.ucdavis.edu/http://predictioncenter.org/casp8/index.cgi
Percent of residues (C)
Dis
tan
ce c
uto
ff (
Å)most groups had the right prediction for this structure (but not those at arrow 2)
most groups had the wrong prediction for this structure(but arrow 3 did better)
half the groups got it wrong(arrow 4), half got it right (arrow 5); the key difference is the multiple sequence alignment
2
1
3
4
5
Protein structure and human disease
In some cases, a single amino acid substitutioncan induce a dramatic change in protein structure.For example, the F508 mutation of CFTR altersthe helical content of the protein, and disruptsintracellular trafficking.
Other changes are subtle. The E6V mutation in thegene encoding hemoglobin beta causes sickle-cell anemia. The substitution introduces a hydrophobic patch on the protein surface,leading to clumping of hemoglobin molecules.
Page 311
Protein structure and human disease
Disease ProteinCystic fibrosis CFTRSickle-cell anemia hemoglobin beta“mad cow” disease prion proteinAlzheimer disease amyloid precursor protein
Table 9.5Page 312
Introduction to Perspectives 3 and 4: Gene Ontology (GO) Consortium
Page 237
The Gene Ontology Consortium
An ontology is a description of concepts. The GOConsortium compiles a dynamic, controlled vocabularyof terms related to gene products.
There are three organizing principles: Molecular functionBiological processCellular compartment
You can visit GO at http://www.geneontology.org.There is no centralized GO database. Instead, curatorsof organism-specific databases assign GO termsto gene products for each organism.
Page 237
Page 241
GO terms are assigned to Entrez Gene entries
Page 241
Page 241
The Gene Ontology Consortium: Evidence Codes
IC Inferred by curatorIDA Inferred from direct assayIEA Inferred from electronic annotationIEP Inferred from expression patternIGI Inferred from genetic interactionIMP Inferred from mutant phenotypeIPI Inferred from physical interactionISS Inferred from sequence or structural similarityNAS Non-traceable author statementND No biological dataTAS Traceable author statement
Page 240
Perspective 3: Protein localization
Page 242
protein
Protein localization
Page 242
Protein localization
Proteins may be localized to intracellular compartments,cytosol, the plasma membrane, or they may be secreted. Many proteins shuttle between multiple compartments.
A variety of algorithms predict localization, but thisis essentially a cell biological question.
Page 240
Page 242
Page 244
Page 244
Localization of 2,900 yeast proteins
Michael Snyder and colleagues incorporated epitopetags into thousands of S. cerevisiae cDNAs,and systematically localized proteins (Kumar et al., 2002).
See http://ygac.med.yale.edu for the TRIPLES database including ~3,000 fluorescence micrographs.
Page 243
Perspective 4: Protein function
Page 243
Protein function
Function refers to the role of a protein in the cell.We can consider protein function from a varietyof perspectives.
Page 243
1. Biochemical function(molecular function)
RBP binds retinol,could be a carrier
Page 245
2. Functional assignmentbased on homology
RBPcould bea carrier
too
Othercarrier proteins
Page 245
3. Functionbased on structure
RBP forms a calyx
Page 245
4. Function based onligand binding specificity
RBP binds vitamin A
Page 245
5. Function based oncellular process
DNA RNA
RBP is abundant,soluble, secreted
Page 245
6. Function basedon biological process
RBP is essential for vision
Page 245
7. Function based on “proteomics”or high throughput “functional genomics”
High throughput analyses show...
RBP levels elevated in renal failureRBP levels decreased in liver disease
Page 245
Functional assignment of enzymes:the EC (Enzyme Commission) system
Oxidoreductases 1,003Transferases 1,076Hydrolases 1,125Lyases 356Isomerases 156Ligases 126
Page 246
Functional assignment of proteins:Clusters of Orthologous Groups (COGs)
Information storage and processing
Cellular processes
Metabolism
Poorly characterized
Page 247