Computational Methods and Bioinformatics in Proteomic Studies

Bioinformatics: Building Bridges, April 14, 2005

Tim Griffin

Dept. Biochemistry, Molecular Biology and Biophysics

tgriffin@umn.edu

Interdisciplinary biology in the 21st century

Genome-era biology: system-wide studies

The yeast genome on a chip

DeRisi et al, 1997, Science 278:680

The simple view: one gene, one protein

The reality: biological systems are complex

Protein interaction network in Drosophila

Science (2003) 302, p. 1727

[Figure: control points in eukaryotic gene expression. Nucleus: DNA, primary RNA transcript, mRNA. Cytosol: mRNA, inactive mRNA, protein, active protein, inactive protein. Labeled control steps: transcriptional control, RNA processing control, RNA transport control, translational control, protein activity control.]

Why analyze at the protein level?

control of eukaryotic gene expression

What is proteomics?

“Proteomics includes not only the identification and quantification of proteins, but also the determination of their localization, modifications, interactions, activities, and, ultimately, their function.”

- Stan Fields in Science, 2001.

Alternatively: proteomics = “fast biochemistry”

Proteomics: a complement to genomics

What proteomic analysis has to offer:

• measurement of protein response, which is not always indicated by mRNA response

• post-translational modifications

• macromolecular interactions

• sub-cellular location

• high-resolution structural and molecular characterization

Genomics, Proteomics, and Systems Biology

[Figure: technologies ranked by maturity (mature, prototype, emerging) across genomics, proteomics, and computational biology.
Genomics: genomic DNA sequencing, mRNA arrays.
Proteomics: protein cataloguing, protein products, functional protein, quantitative profiling, protein phosphorylation, protein dynamics, protein modifications, subcellular location, catalytic activity, descriptive protein interaction maps, 3D structure.
Goals: identify system components, measure and define system properties, interactions between components.]

Proteomics technologies and methods

• Two-dimensional gel electrophoresis

• mass spectrometry

• protein chips

• yeast 2-hybrid

• phage display

• antibody engineering

• high-throughput protein expression

• high-throughput X-ray crystallography

The 1990s revolution: mass spectrometry

Development of physical methods to mass analyze large biomolecules

[Figure: stages of a mass spectrometer: ionization, separation by m/z, detection.
Ionization: MALDI; electrospray (liquid chromatography, nanospray).
Mass analyzers: quadrupole, ion trap, time-of-flight.]

Mass analysis of proteins, peptides, DNA

Electrospray ionization (ESI)

• protein and peptide analysis, multiply charged ions

• quadrupole and TOF detection

• tandem mass spectrometry

• solution phase ionization – enables online coupling with liquid chromatography (LC)

Separations of complex mixtures: crowd control

• Enables the processing of the many components in big protein mixtures

[Figure: protein mixture, digested with trypsin into peptides, which pass through a "turnstile" one at a time (1, 2, 3, ...).]

Identification of protein mixtures by tandem mass spectrometry

µLC → ESI

1. MS “survey” scan

2. select specific peptide

3. CID (fragment peptide by collision with Ar)

4. detect fragments: tandem mass spectrum (MS/MS)

[Figure: survey-scan spectrum (m/z 600-1400) and MS/MS spectrum (m/z 200-1200), both plotted as relative abundance vs. m/z.]

Peptide sequence determination from MS/MS spectra

H2N-N-S-G-D-I-V-N-L-G-S-I-A-G-R-COOH

Collision-induced dissociation (CID) creates two prominent ion series:

b-series: b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 (fragments containing the N-terminus)

y-series: y14 y13 y12 y11 y10 y9 y8 y7 y6 y5 y4 y3 y2 y1 (fragments containing the C-terminus)

[Figure: observed MS/MS spectrum compared with theoretical spectra (from a DNA or protein database), relative abundance vs. m/z (200-1200), leading to peptide identification and then protein identification.]
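The b- and y-ion ladders for the example peptide can be computed directly from residue masses. The following is a minimal sketch, assuming singly charged fragments and standard monoisotopic residue masses; the function name `fragment_ions` is illustrative and not part of any tool mentioned in the talk.

```python
# Monoisotopic residue masses (Da) for the amino acids in the example peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "L": 113.08406, "I": 113.08406, "N": 114.04293, "D": 115.02694,
    "R": 156.10111,
}
WATER = 18.01056   # mass of H2O
PROTON = 1.00728   # mass of the charge-carrying proton


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[i-1] is b_i (first i residues); y_ions[i-1] is y_i (last i residues).
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses:                 # walk from the N-terminus
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses):       # walk from the C-terminus
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


if __name__ == "__main__":
    b, y = fragment_ions("NSGDIVNLGSIAGR")
    for i, (bi, yi) in enumerate(zip(b, y), start=1):
        print(f"b{i:<2} {bi:9.4f}    y{i:<2} {yi:9.4f}")
```

Because I and L have the same mass, a ladder like this cannot distinguish them, which is one reason sequence databases help in practice.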

Identification of protein mixtures by mass spectrometry

1. De novo (i.e. manually)

2. Database searching:

[Figure: MS/MS spectrum (relative abundance vs. m/z, 200-1200) of H2N-NSGDIVNLGSIAGR-COOH, annotated with the fragment ladder R, GR, AGR, IAGR, SIAGR, GSIAGR, LGSIAGR, NLGSIAGR, VNLGSIAGR, IVNLGSIAGR, DIVNLGSIAGR, GDIVNLGSIAGR.]

Peptide sequence identifies the protein

YMR134W, yeast protein involved in iron metabolism

High-throughput protein identification by LC-MS/MS and automated sequence database searching

Protein sequence and/or DNA sequence database search

[Figure: several raw MS/MS spectra (relative abundance vs. m/z, 200-1200), one matched to H2N-NSGDIVNLGSIAGR-COOH with its fragment ladder annotated.]

Raw MS/MS spectrum

Peptide sequence match

Direct identification of 1000+ proteins from complex mixtures

Protein identification

Case Study: Proteomic Analysis of Oral Cancer Progression

• Mouth cancer, tongue cancer, throat cancer

• In the USA, ~30,000 people are newly diagnosed with oral cancer each year, and one person dies from oral cancer every hour of every day

• 350,000 to 400,000 new cases annually worldwide

• Fewer than half of patients will be alive in 5 years; patients have a 20x higher risk of developing second primary tumors

• However, the cure rate is 80 to 90% when the cancer is found early. Unfortunately, at this time, the majority are found as late-stage cancers

Progression of oral cancer

Can we find molecular markers that predict this transition?

insult or injury

Malignancy transformation rate = 5-17%

(adapted from Dr. Nelson Rhodus, U of M Dental School)

Saliva as a diagnostic fluid in oral cancer progression

• Readily available, non-invasive collection

• Heterogeneous human fluid with large dynamic range of protein abundances – requires fractionation

• Many post-translationally modified proteins

• Currently only 100-150 proteins have been identified in whole saliva (LC-MS/MS)

First step: obtain a comprehensive profile of the protein components from a normal individual saliva sample

insult or injury

Multidimensional separations followed by mass spectrometry

Whole saliva protein mixture

→ FFE fractionation (70 fractions)

→ RP-capLC

→ ESI-MS/MS (500,000+ spectra)

→ Protein sequence and/or DNA sequence database search

→ Protein identification

Raw data processing: Automated database searching

Computational algorithms for searching MS/MS spectra against protein sequence databases, mRNA sequences, DNA sequences

ProFound, Mascot, PepSea, MS-Fit, MOWSE, Peptident, Multident, Sequest, PepFrag, MS-Tag

[Figure: MS/MS spectrum, relative abundance vs. m/z (200-1200).]

Protein identification
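The idea behind database searching can be sketched in a few lines. This is a toy illustration only: it is not the scoring used by Sequest, Mascot, or any other tool named above. The two protein sequences are invented, the digest ignores missed cleavages and the proline rule, and the score is a plain shared-peak count.

```python
import re

# Monoisotopic residue masses (Da).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728


def tryptic_peptides(protein):
    """Cleave after every K or R (simplified: no proline rule, no missed cleavages)."""
    return [p for p in re.findall(r"[^KR]*[KR]|[^KR]+$", protein) if p]


def theoretical_fragments(peptide):
    """Singly charged b- and y-ion m/z values, interleaved as b, y, b, y, ..."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    frags = []
    for i in range(1, len(masses)):
        frags.append(sum(masses[:i]) + PROTON)          # b_i
        frags.append(sum(masses[i:]) + WATER + PROTON)  # y_(n-i)
    return frags


def shared_peak_count(observed, peptide, tol=0.5):
    """Number of theoretical fragments with an observed peak within tol (m/z)."""
    return sum(
        any(abs(obs - theo) <= tol for obs in observed)
        for theo in theoretical_fragments(peptide)
    )


def search(observed, database, tol=0.5):
    """Rank every tryptic peptide in the database by shared peak count."""
    hits = []
    for name, sequence in database.items():
        for pep in tryptic_peptides(sequence):
            hits.append((shared_peak_count(observed, pep, tol), name, pep))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    # Hypothetical database: both sequences are invented for illustration.
    database = {
        "protein_A": "MKTAYIAKNSGDIVNLGSIAGRLLDEK",
        "protein_B": "MSTEELVKGAPWDFRHHQCCK",
    }
    # Simulated "observed" spectrum: only the y-ions of the target peptide.
    observed = theoretical_fragments("NSGDIVNLGSIAGR")[1::2]
    print(search(observed, database)[0])
```

Real search engines add precursor-mass filtering, intensity-aware scoring, and modification handling; the sketch only shows why a good spectrum singles out one peptide, and through it one protein.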

Choosing a sequence database

• National Center for Biotechnology Information (NCBI)

• Swiss-Prot/TrEMBL

• Protein Information Resource (PIR)

• European Bioinformatics Institute (EBI)

Considerations: organism-specificity, redundancy, annotation

Analysis of processed data: quality control of protein matches

Unfiltered – 10^5+ matches (lots of noise and junk)

↓ filtering

Filtered – thousands of “true” matches

Probability of sequence match via statistical modeling

Keller et al (2002) Analytical Chemistry 74, 5383

Sequence matches automatically assigned a P score between 0 and 1
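Once each match carries a probability, filtering reduces to a threshold plus an estimate of how many accepted matches are still wrong. The sketch below assumes the probabilities have already been assigned; it does not reproduce the statistical model of Keller et al. The 0.9 cutoff and the example matches are hypothetical.

```python
def filter_matches(matches, min_probability=0.9):
    """Keep matches whose probability is at least min_probability.

    matches: list of (peptide, probability) pairs, probability in [0, 1].
    Returns (accepted, expected_error_rate), where the error rate is the
    mean of (1 - p) over the accepted matches.
    """
    accepted = [(pep, p) for pep, p in matches if p >= min_probability]
    if not accepted:
        return accepted, 0.0
    expected_wrong = sum(1.0 - p for _, p in accepted)
    return accepted, expected_wrong / len(accepted)


if __name__ == "__main__":
    # Hypothetical matches and probabilities, for illustration only.
    matches = [
        ("NSGDIVNLGSIAGR", 0.99),
        ("EACDPLR", 0.95),
        ("AAAK", 0.40),
        ("GGGR", 0.10),
    ]
    accepted, error_rate = filter_matches(matches)
    print(accepted)
    print(f"expected error rate among accepted matches: {error_rate:.3f}")
```

Raising the cutoff lowers the expected error rate but discards more true matches, so the threshold is a sensitivity trade-off rather than a fixed constant.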

Collating and interpreting the data: Interact software tool

http://www.systemsbiology.org/Default.aspx?pagename=proteomicssoftware

Result: Processed and Filtered Data

Saliva example: 433 unique proteins identified

Interpreting the data: annotated protein databases

National Center for Biotechnology Information (NCBI)

ExPASy/Uniprot

European Bioinformatics Institute (EBI)

Organism/biology specific:

Saccharomyces Genome Database (SGD)

Human Mitochondrial Protein Database

Human Proteome Organization (HUPO)

Mining databases for data interpretation: Example 1

Mining databases for data interpretation: Example 2

Classification of interpreted data: subcellular localization

Subcellular Localization (number of proteins, share of total)

Secreted/Extracellular (132) 30.5%
Cytoplasmic (87) 20.1%
Unknown (56) 12.9%
Structure/Cytoskeleton (47) 10.8%
Cytoplasmic/Nuclear (30) 6.9%
Membrane (25) 5.8%
Nuclear (17) 3.9%
Endoplasmic (9) 2.1%
Lysosomal (9) 2.1%
Mitochondrial (7) 1.6%
Ribosomal (6) 1.4%
Desmosomal (3) 0.7%
Endosomal (3) 0.7%
Peroxisomal (2) 0.5%
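The chart percentages follow from the category counts, which sum to the 433 proteins reported for the saliva example. A short sketch of that tally, using only the counts shown on the slide (the variable names are illustrative):

```python
# Category counts as listed on the slide.
localization_counts = {
    "Secreted/Extracellular": 132, "Cytoplasmic": 87, "Unknown": 56,
    "Structure/Cytoskeleton": 47, "Cytoplasmic/Nuclear": 30, "Membrane": 25,
    "Nuclear": 17, "Endoplasmic": 9, "Lysosomal": 9, "Mitochondrial": 7,
    "Ribosomal": 6, "Desmosomal": 3, "Endosomal": 3, "Peroxisomal": 2,
}

total = sum(localization_counts.values())

# Share of each category, in percent, rounded to one decimal place.
percentages = {
    name: round(100 * count / total, 1)
    for name, count in localization_counts.items()
}

if __name__ == "__main__":
    print("total proteins:", total)
    for name, pct in percentages.items():
        print(f"{name:<24} {localization_counts[name]:>4} {pct:>5.1f}%")
```

The recomputed values agree with the chart labels to within 0.1 percentage point; small differences of that size are typical when a charting tool adjusts rounding so the slices sum to 100%.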

Classification of interpreted data: functional characterization

Biological Function (number of proteins, share of total)

Structural/Cytoskeletal (65) 15.0%
Protein Degradation/Inhibitor (57) 13.2%
Defense/Immunoresponse (43) 9.9%
Transport (41) 9.5%
Unknown (39) 9.0%
Signaling (37) 8.5%
Metabolism-Glycolysis/Carbohydrates (33) 7.6%
Protein Folding/Repair (28) 6.5%
Metabolism-Other (23) 5.3%
Redox (16) 3.7%
Protein Synthesis (14) 3.2%
Cell Adhesion/Communication (12) 2.8%
RNA Binding/Modification (10) 2.3%
Protein Modification/Polymerization (6) 1.4%
DNA Binding/Transcription (6) 1.4%
Cell Growth/Differentiation (3) 0.7%

What about quantitative measurements?

Can we find molecular markers that predict this transition?

insult or injury

Malignancy transformation rate = 5-17%

(adapted from Dr. Nelson Rhodus, U of M Dental School)

Stable-isotope labeling of proteins for quantitative profiling

State 1 and State 2

→ label with “light” (-L) or “heavy” (-H) reagent

→ combine and proteolyze

→ analyze by MS

[Figure: mass spectra (intensity vs. m/z) in which each peptide appears as a pair of light (L) and heavy (H) peaks.]

relative protein abundance = intensity[light] / intensity[heavy]

-L and -H labels are chemically identical, but isotopically different due to incorporation of stable isotopes (e.g. 2H, 15N, 13C…)

20° vs. 37°

Chemically identical but isotopically different peptides ionize with the same efficiency and act as mutual internal standards
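The ratio on the slide is a simple quotient of peak intensities. A minimal sketch, assuming each peptide's light and heavy signals are summed over their isotope peaks before dividing; the intensities below are hypothetical and the function name is illustrative.

```python
def relative_abundance(light_intensities, heavy_intensities):
    """Ratio of summed light-peak intensity to summed heavy-peak intensity."""
    light = sum(light_intensities)
    heavy = sum(heavy_intensities)
    if heavy == 0:
        raise ValueError("heavy signal is zero; ratio undefined")
    return light / heavy


if __name__ == "__main__":
    # Hypothetical isotope-envelope intensities (counts) for one peptide pair.
    light_peaks = [120, 80, 40]
    heavy_peaks = [60, 40, 20]
    print(relative_abundance(light_peaks, heavy_peaks))
```

Because the two forms of the peptide ionize with the same efficiency, this ratio reflects the relative amount of the parent protein in the two states without any external calibration.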

Quantitative analysis of mRNA data

DeRisi et al, 1997, Science 278:680

[Figure: microarray comparison of Sample 1 and Sample 2.]

Automated Quantitative Proteomics

mixture 1 (light) + mixture 2 (heavy)

→ combine and proteolyze

→ multi-dimensional separation

→ mass analysis

→ quantify (light/heavy peak pair in the MS scan, m/z 550-580)

→ identify (MS/MS, m/z 200-800): NH2-EACDPLR-COOH

Quantitative analysis

[Figure: +TOF MS spectrum (20 MCA scans from mm_sample.wiff; max. 274.0 counts), intensity (counts) vs. m/z (amu), 1914-1934. Labeled peaks: 1916.9909, 1917.9946, 1918.9924, 1920.0007, 1921.0165, 1924.9803, 1926.0240, 1927.0231, 1928.0203, 1929.0322, 1930.0176, 1931.0077. Peak clusters are marked Sample 1 and Sample 2.]

Relative intensity = relative protein abundance

Disease proteomics: androgen-induced effects in prostate cancer

306 peptides, 79 differentially expressed (26%) (d0/d8 > 1.5 or < 0.67)

- androgen + androgen
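The cutoffs on this slide amount to a two-sided threshold on the d0/d8 ratio; 0.67 is approximately 1/1.5, so the band is symmetric on a log scale. A minimal sketch of that classification, with hypothetical ratios (the function names and category labels are illustrative):

```python
def classify_ratio(d0_over_d8, upper=1.5, lower=0.67):
    """Classify a peptide by its d0/d8 intensity ratio using the slide's cutoffs."""
    if d0_over_d8 > upper:
        return "higher in d0"
    if d0_over_d8 < lower:
        return "lower in d0"
    return "unchanged"


def fraction_changed(ratios, upper=1.5, lower=0.67):
    """Fraction of ratios that fall outside the [lower, upper] band."""
    changed = sum(classify_ratio(r, upper, lower) != "unchanged" for r in ratios)
    return changed / len(ratios)


if __name__ == "__main__":
    # Hypothetical d0/d8 ratios, not data from the study.
    ratios = [0.5, 0.9, 1.0, 1.2, 2.4]
    print([classify_ratio(r) for r in ratios])
    print(fraction_changed(ratios))
    # The slide's own tally: 79 of 306 peptides, as a rounded percentage.
    print(round(100 * 79 / 306))
```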

Dealing with the data

Data acquisition

Raw data processing (Database searching)

Analysis of processed data (Statistical filtering, quantitative analysis)

Data organization and interpretation

Archiving and databasing

Modeling (Computational Biology)

Need for better data archives and repositories

http://proteomics.jhu.edu/dl/pathidb.php

Archiving challenges: different data formats

http://sashimi.sourceforge.net/software_glossolalia.html

[Figure, repeated from earlier in the talk: control of eukaryotic gene expression]

Computational Biology: Integrating proteomics and genomics data

mRNA versus protein abundance ratios, Gal/Eth

[Figure: scatter plot of protein abundance ratio (log10) vs. mRNA abundance ratio (log10), both axes from -3 to 3, with regions numbered 1-4. Labeled genes: MLS1, ICL1, PCK1, GAL5, FBP1, GAL7, GAL10, CDC19, PDC1, GAL1, GAL3, PFK1, TPI1, FBA1, ACO1.]

Integrating proteomics and genomics data: Elucidating gene expression regulatory networks

Griffin TJ et al (2002) Mol Cell Proteomics 1: 323

[Figure: scatter plot of protein abundance ratio (log10) vs. mRNA abundance ratio (log10), both axes from -3 to 3, regions numbered 1-4, with these protein groups highlighted:]

Protein synthesis

Mitochondrially located proteins

rRNA processing

Post-transcriptionally regulated proteins?
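One simple way to look for candidates is to flag genes whose protein ratio departs from the mRNA ratio on the log10 scale used in the plot. The sketch below is an assumption-laden illustration, not the analysis in Griffin et al. (2002): the 0.5 log-unit threshold, the gene names, and the ratios are all hypothetical.

```python
import math


def log10_ratio(ratio):
    """log10 of a linear abundance ratio (must be positive)."""
    if ratio <= 0:
        raise ValueError("abundance ratio must be positive")
    return math.log10(ratio)


def discordant_genes(ratios, threshold=0.5):
    """Names of genes whose protein and mRNA log10 ratios differ by more than threshold.

    ratios: dict mapping gene name -> (mrna_ratio, protein_ratio), linear scale.
    """
    flagged = []
    for gene, (mrna, protein) in ratios.items():
        if abs(log10_ratio(protein) - log10_ratio(mrna)) > threshold:
            flagged.append(gene)
    return flagged


if __name__ == "__main__":
    # Hypothetical (mRNA ratio, protein ratio) pairs; names are placeholders.
    ratios = {
        "gene_a": (10.0, 10.0),   # mRNA and protein change together
        "gene_b": (1.0, 10.0),    # protein changes, mRNA does not
        "gene_c": (100.0, 50.0),  # similar direction and magnitude
    }
    print(discordant_genes(ratios))
```

Genes flagged this way sit away from the diagonal of the scatter plot, which is where post-transcriptional regulation would be expected to show up; measurement noise can put genes there too, so a flag is only a starting point.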

Computational biology: integrating information to assign function

Cytoscape: http://www.cytoscape.org/

Modeling cellular circuitry based on genomic and proteomic data

Is the virtual human on the horizon?

Acknowledgements

University of Minnesota

Griffin Laboratory: Mikel Roe, Sri Bandhakavi, Hongwei Xie, Clive Nyauncho

Funding: Minnesota Medical Foundation, NIH

U of M Dental School: Dr. Nelson Rhodus

MSI: Patton Fast
