Download - Protein analysis and proteomics July 29, 2009 August 5, 2009 Bioinformatics M.E:800.707 J. Pevsner [email protected]

Protein analysis and proteomics

July 29, 2009August 5, 2009

BioinformaticsM.E:800.707J. Pevsner

[email protected]

Many of the images in this powerpoint presentationare from Bioinformatics and Functional Genomicsby J Pevsner (copyright © 2009 by Wiley-Blackwell).

These images and materials may not be usedwithout permission from the publisher.

Visit http://www.bioinfbook.org

Copyright notice

Outline for tonight: Protein analysis and proteomics

Individual proteins Perspective 1: Protein families (domains and motifs) Perspective 2: Physical properties (3D structure) Perspective 3: Localization Perspective 4: Function

protein

Page 389

RNADNA

protein

[1] Protein families

[4] Protein function

[2] Physical properties

Page 389

[3] Protein localization

Gene ontology (GO):--cellular component--biological process--molecular function

The Human Proteome Organisation (HUPO) Proteomics Standards Initiative (PSI)

Goals: defining standards for proteomic data representation to facilitate the comparison, exchange, and verification of data


Work groups

# Gel Electrophoresis# Mass Spectrometry# Molecular Interactions# Protein Modifications# Proteomics Informatics# Sample Processing

Themes

# Controlled vocabularies# MIAPE: Minimum information about a proteomics experiment


http://www.psidev.info/

Perspective 1: Protein domains and motifs

Page 389

Definitions

Signature: • a protein category such as a domain or motif

Page 390

Definitions

Signature: • a protein category such as a domain or motif

Domain: • a region of a protein that can adopt a 3D structure• a fold• a family is a group of proteins that share a domain• examples: zinc finger domain immunoglobulin domain

Motif (or fingerprint):• a short, conserved region of a protein• typically 10 to 20 contiguous amino acid residues

Page 390

15 most common domains (human)

Zn finger, C2H2 type 1093 proteinsImmunoglobulin 1032EGF-like 471Zn-finger, RING 458Homeobox 417Pleckstrin-like 405RNA-binding region RNP-1400SH3 394Calcium-binding EF-hand 392Fibronectin, type III 300PDZ/DHR/GLGF 280Small GTP-binding protein 261BTB/POZ 236bHLH 226Cadherin 226

Page 391Source: Integr8 at EBI website

15 most common domains (various species)

The European Bioinformatics Institute (EBI) offers many key proteomics resources at the Integr8 site:

http://www.ebi.ac.uk/proteome/

Page 391

1. Go to the Integr8 site: http://www.ebi.ac.uk/proteome/

2. Browse species; choose Homo sapiens.

3. Click “Proteome analysis”

4. Obtain a variety of statistics, such as common repeats, domains, average protein length

Source: Integr8 at EBI website (updated 7/09)

amino acid

fre

que

ncy

Amino acid composition

Source: Integr8 at EBI website (updated 7/09)

Re

lativ

e fr

eq

uen

cy

protein length

Average protein length : 468+/- 522 amino acid residues Size range: 4 - 34350 amino acid residues

Definition of a domain

According to InterPro at EBI (http://www.ebi.ac.uk/interpro/):

A domain is an independent structural unit, found aloneor in conjunction with other domains or repeats.Domains are evolutionarily related.

According to SMART (http://smart.embl-heidelberg.de):

A domain is a conserved structural entity with distinctivesecondary structure content and a hydrophobic core.Homologous domains with common functions usuallyshow sequence similarities.

Page 390

Varieties of protein domains

Page 393

Extending along the length of a protein

Occupying a subset of a protein sequence

Occurring one or more times

Example of a protein with domains: Methyl CpG binding protein 2 (MeCP2)

MBD

Page 393

TRD

The protein includes a methylated DNA binding domain(MBD) and a transcriptional repression domain (TRD).MeCP2 is a transcriptional repressor.

Mutations in the gene encoding MeCP2 cause RettSyndrome, a neurological disorder affecting girlsprimarily.

Result of an MeCP2 blastp search:A methyl-binding domain shared by several proteins

domain

Are proteins that share only a domain homologous?

Example of a multidomain protein: HIV-1 pol

Pol (NP_789740), 995 amino acids longGag-Pol (NP_057849), 1435 amino acids

• cleaved into three proteins with distinct activities:-- aspartyl protease-- reverse transcriptase-- integrase

We will explore HIV-1 pol and other proteins at theExpert Protein Analysis System (ExPASy) server.

Visit www.expasy.org/

Page 394

http://www.expasy.ch/

1. Go to ExPASy (http://www.expasy.ch/)2. If you know the SwissProt accession of your protein, enter it at top.3. Otherwise go into Swiss-Prot/TrEMBL, click SRS (Sequence Retrieval System), click Start, then click continue, then search for your protein of interest.

SwissProt: which HIV-1 pol is “correct”?

Use P04587

The SwissProt entry for any protein provideshighly useful information…

SwissProt entry for HIV-1 pol links to many databases

ProDom entry for HIV-1 pol shows many related proteins

www.uniprot.org

Three protein databases recently merged to form UniProt:

• SwissProt

• TrEMBL (translated European Molecular Biology Lab)

• Protein Information Resource (PIR)

You can search for information on your favorite protein there; a BLAST server is provided.

Proteins can have both domains and motifs (patterns)

Domain(aspartylprotease)

Domain(reversetranscriptase)

Motif(severalresidues)

Motif(severalresidues)

Definition of a motif

A motif (or fingerprint) is a short, conserved region of a protein. Its size is often 10 to 20 amino acids.

Simple motifs include transmembrane domains andphosphorylation sites. These do not imply homologywhen found in a group of proteins.

PROSITE (www.expasy.org/prosite) is a dictionary of motifs (there are currently 1600 entries). In PROSITE,a pattern is a qualitative motif description (a proteineither matches a pattern, or not). In contrast, a profileis a quantitative motif description. We will encounterprofiles in Pfam, ProDom, SMART, and other databases.

Page 394

Summary of Perspective 1: Protein domains and motifs

A signature is a protein category such as a domain or motif.

You can learn about domains at Integr8, and at databases such as InterPro and Pfam.

A motif (or fingerprint) is a short, conserved sequence. You can study motifs at Prosite at ExPASy.

Perspective 2: Physical properties of proteins

Page 397

Physical properties of proteins

Many websites are available for the analysis ofindividual proteins. ExPASy and ISREC are twoexcellent resources.

The accuracy of these programs is variable. Predictions based on primary amino acid sequence (such as molecular weight prediction) are likely to be more trustworthy. For many other properties (such asposttranslational modification of proteins by specific sugars), experimental evidence may be required rather than prediction algorithms.

Page 399

Access a variety of protein analysis programsfrom the top right of the ExPASy home page

Protein secondary structure

Protein secondary structure is determined by the amino acid side chains.

Myoglobin is an example of a protein having many-helices. These are formed by amino acid stretches4-40 residues in length.

Thioredoxin from E. coli is an example of a proteinwith many sheets, formed from strands composedof 5-10 residues. They are arranged in parallel orantiparallel orientations.

Page 425

Fig. 11.37

Myoglobin(John Kendrew, 1958)

Thioredoxin

Fig. ~11.3Page 427

Secondary structure prediction

Chou and Fasman (1974) developed an algorithmbased on the frequencies of amino acids found in helices, -sheets, and turns.

Proline: occurs at turns, but not in helices.

GOR (Garnier, Osguthorpe, Robson): related algorithm

Modern algorithms: use multiple sequence alignmentsand achieve higher success rate (about 70-75%)

Page 427

Secondary structure prediction

Web servers:

GOR4JpredNNPREDICTPHDPredatorPredictProteinPSIPREDSAM-T99sec

Table 11-3Page 429

Go to http://pbil.univ-lyon1.fr/,click “Secondary structure prediction”to access this prediction tool

Tertiary protein structure: protein folding

Main approaches:

[1] Experimental determination (X-ray crystallography, NMR)

[2] Prediction

► Comparative modeling (based on homology)

► Threading

► Ab initio (de novo) prediction (Ingo Ruczinski at JHSPH)

Page 430

Experimental approaches to protein structure

[1] X-ray crystallography-- Used to determine 80% of structures-- Requires high protein concentration-- Requires crystals-- Able to trace amino acid side chains-- Earliest structure solved was myoglobin

[2] NMR-- Magnetic field applied to proteins in solution-- Largest structures: 350 amino acids (40 kD)-- Does not require crystallization

Page 430

Steps in obtaining a protein structure

Target selection

Obtain, characterize protein

Determine, refine, model the structure

Deposit in repositoryFig 11.5page 431

Priorities for target selection for protein structures

Page 432

Historically, small, soluble, abundant proteins werestudied (e.g. hemoglobin, cytochromes c, insulin).

Modern criteria:• Represent all branches of life• Represent previously uncharacterized families• Identify medically relevant targets• Some are attempting to solve all structures within an individual organism (Methanococcus jannaschii, Mycobacterium tuberculosis)

The Protein Data Bank (PDB)

Page 434

• PDB is the principal repository for protein structures• Established in 1971• Accessed at http://www.rcsb.org/pdb or simply http://www.pdb.org• Currently contains 59,000 structure entities

Updated 7/28/09

Fig. 11.7Page 435

stru

ctur

es

Fig. 9.6Page 281

yearUpdated 12/08

PDB content growth (www.pdb.org)

1972198019802000

yearly

total

10,000

2008

50,000

30,000

PDB holdings (12/08)

50,621 proteins, peptides2,225 protein/nucl. complexes1,946 nucleic acids33 other; carbohydrates54,825 total

Table 11-4Page 435

Fig. ~11.9Page 436

Fig. ~11.10Page 436

Viewing hemoglobin (accession 2H35) at PDB

Viewing structures at PDB: WebMol

Fig. 11.11Page 437

Protein Data Bank

Swiss-Prot, NCBI, EMBL

CATH, Dali, SCOP, FSSP

gateways to access PDB files

databases that interpret PDB files

The CATH Hierarchy

Fig. 11.18Page 444

Access to PDB through NCBI

Page 437

You can access PDB data at the NCBI several ways.

• Go to the Structure site, from the NCBI homepage• Use Entrez• Perform a BLAST search, restricting the output to the PDB database

Fig. ~11.12Page 438

Do a blastp search;set the database to pdb (Protein Data Bank)

Structure links

Structure accession(e.g. 2JTZ)

Fig. 9.13 Page 288

Fig. 9.14 Page 289

Access to PDB structures through NCBI

Page 291

Molecular Modeling DataBase (MMDB)

Cn3D (“see in 3D” or three dimensions):structure visualization software

Vector Alignment Search Tool (VAST):view multiple structures

Fig. 9.15 Page 290

Fig. 9.16 Page 291

Fig. 9.17 Page 292

Access to structure data at NCBI: VAST

Page 294

Vector Alignment Search Tool (VAST) offers a varietyof data on protein structures, including

-- PDB identifiers-- root-mean-square deviation (RMSD) values to describe structural similarities-- NRES: the number of equivalent pairs of alpha carbon atoms superimposed-- percent identity

Fig. 9.18Page 293

Predicting protein structure: CASP8 competition

8th Community Wide Experiment on theCritical Assessment of Techniques for Protein Structure Prediction

http://predictioncenter.gc.ucdavis.edu/http://predictioncenter.org/casp8/index.cgi

Percent of residues (C)

Dis

tan

ce c

uto

ff (

Å)most groups had the right prediction for this structure (but not those at arrow 2)

most groups had the wrong prediction for this structure(but arrow 3 did better)

half the groups got it wrong(arrow 4), half got it right (arrow 5); the key difference is the multiple sequence alignment

2

1

3

4

5

Protein structure and human disease

In some cases, a single amino acid substitutioncan induce a dramatic change in protein structure.For example, the F508 mutation of CFTR altersthe helical content of the protein, and disruptsintracellular trafficking.

Other changes are subtle. The E6V mutation in thegene encoding hemoglobin beta causes sickle-cell anemia. The substitution introduces a hydrophobic patch on the protein surface,leading to clumping of hemoglobin molecules.

Page 311

Protein structure and human disease

Disease ProteinCystic fibrosis CFTRSickle-cell anemia hemoglobin beta“mad cow” disease prion proteinAlzheimer disease amyloid precursor protein

Table 9.5Page 312

Introduction to Perspectives 3 and 4: Gene Ontology (GO) Consortium

Page 237

The Gene Ontology Consortium

An ontology is a description of concepts. The GOConsortium compiles a dynamic, controlled vocabularyof terms related to gene products.

There are three organizing principles: Molecular functionBiological processCellular compartment

You can visit GO at http://www.geneontology.org.There is no centralized GO database. Instead, curatorsof organism-specific databases assign GO termsto gene products for each organism.

Page 237

GO terms are assigned to Entrez Gene entries

The Gene Ontology Consortium: Evidence Codes

IC Inferred by curatorIDA Inferred from direct assayIEA Inferred from electronic annotationIEP Inferred from expression patternIGI Inferred from genetic interactionIMP Inferred from mutant phenotypeIPI Inferred from physical interactionISS Inferred from sequence or structural similarityNAS Non-traceable author statementND No biological dataTAS Traceable author statement

Page 240

Perspective 3: Protein localization

Page 242

protein

Protein localization

Page 242

Protein localization

Proteins may be localized to intracellular compartments,cytosol, the plasma membrane, or they may be secreted. Many proteins shuttle between multiple compartments.

A variety of algorithms predict localization, but thisis essentially a cell biological question.

Page 240

Localization of 2,900 yeast proteins

Michael Snyder and colleagues incorporated epitopetags into thousands of S. cerevisiae cDNAs,and systematically localized proteins (Kumar et al., 2002).

See http://ygac.med.yale.edu for the TRIPLES database including ~3,000 fluorescence micrographs.

Page 243

Perspective 4: Protein function

Page 243

Protein function

Function refers to the role of a protein in the cell.We can consider protein function from a varietyof perspectives.

Page 243

1. Biochemical function(molecular function)

RBP binds retinol,could be a carrier

Page 245

2. Functional assignmentbased on homology

RBPcould bea carrier

too

Othercarrier proteins

Page 245

3. Functionbased on structure

RBP forms a calyx

Page 245

4. Function based onligand binding specificity

RBP binds vitamin A

Page 245

5. Function based oncellular process

DNA RNA

RBP is abundant,soluble, secreted

Page 245

6. Function basedon biological process

RBP is essential for vision

Page 245

7. Function based on “proteomics”or high throughput “functional genomics”

High throughput analyses show...

RBP levels elevated in renal failureRBP levels decreased in liver disease

Page 245

Functional assignment of enzymes:the EC (Enzyme Commission) system

Oxidoreductases 1,003Transferases 1,076Hydrolases 1,125Lyases 356Isomerases 156Ligases 126

Page 246

Functional assignment of proteins:Clusters of Orthologous Groups (COGs)

Information storage and processing

Cellular processes

Metabolism

Poorly characterized

Page 247

Download - Protein analysis and proteomics July 29, 2009 August 5, 2009 Bioinformatics M.E:800.707 J. Pevsner [email protected]

Top Related