Computational Methods and Bioinformatics in Proteomic Studies
Bioinformatics: Building Bridges, April 14, 2005
Tim Griffin
Dept. Biochemistry, Molecular Biology and [email protected]
Interdisciplinary biology in the 21st century
Genome-era biology: system-wide studies
The yeast genome on a chip
DeRisi et al, 1997, Science 278:680
The simple view: one gene, one protein
The reality: biological systems are complex
Protein interaction network in Drosophila
Science (2003) 302, p. 1727
Why analyze at the protein level?
[Figure: control of eukaryotic gene expression. In the nucleus, DNA gives a primary RNA transcript and then mRNA; in the cytosol, mRNA gives protein and then active protein, with inactive mRNA and inactive protein as alternative fates. Control points: transcriptional control, RNA processing control, RNA transport control, translational control, protein activity control.]
What is proteomics?
“Proteomics includes not only the identification
and quantification of proteins, but also the
determination of their localization, modifications,
interactions, activities, and, ultimately, their
function.”
-Stan Fields in Science, 2001.
Alternatively: proteomics = “fast biochemistry”
Proteomics: a complement to genomics
What proteomic analysis has to offer:
• measurement of protein response, which is not always indicated by mRNA response
• post-translational modifications
• macromolecular interactions
• sub-cellular location
• high-resolution structural and molecular characterization
Genomics, Proteomics, and Systems Biology
[Figure: technologies ranked as mature, prototype, or emerging. Genomics: genomic DNA sequencing, mRNA arrays. Proteomics: protein cataloguing (protein products); functional protein analysis (quantitative profiling, protein phosphorylation, protein dynamics, protein modifications, sub-cellular location, catalytic activity); descriptive protein interaction maps; 3D structure. Computational biology. Goals: identify system components, measure and define system properties, interactions between components.]
Proteomics technologies and methods
• Two-dimensional gel electrophoresis
• mass spectrometry
• protein chips
• yeast 2-hybrid
• phage display
• antibody engineering
• high-throughput protein expression
• high-throughput X-ray crystallography
The 1990s revolution: mass spectrometry
Development of physical methods to mass analyze large biomolecules
[Figure: stages of a mass spectrometer. Ionization (MALDI; electrospray, via liquid chromatography or nanospray) → separation by m/z (quadrupole, ion trap, time-of-flight) → detection.]
Mass analysis of proteins, peptides, DNA
Electrospray ionization (ESI)
• protein and peptide analysis, multiply charged ions
• quadrupole and TOF detection
• tandem mass spectrometry
• solution phase ionization – enables online coupling with liquid chromatography (LC)
Separations of complex mixtures: crowd control
• Enables the processing of the many components in big protein mixtures
[Figure: protein mixture → trypsin → peptides, which pass through a "turnstile" one at a time (1, 2, 3, ...).]
Identification of protein mixtures by tandem mass spectrometry
[Figure: µLC-ESI-MS/MS workflow, with spectra plotted as relative abundance vs. m/z.]
1. MS “survey” scan
2. select specific peptide
3. CID (collision with Ar): fragment peptide
4. detect fragments: tandem mass spectrum (MS/MS)
Peptide sequence determination from MS/MS spectra
Collision-induced dissociation (CID) creates two prominent ion series:
b-series: N-terminal fragments, b1 to b14
y-series: C-terminal fragments, y14 to y1
Example peptide: H2N-N-S-G-D-I-V-N-L-G-S-I-A-G-R-COOH
[Figure: the observed spectrum is compared with the theoretical spectrum (DNA or protein database), giving peptide identification and then protein identification. Spectra plotted against m/z, 200 to 1200.]
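A minimal sketch of how the theoretical b- and y-series for the example peptide could be computed. It assumes singly charged fragments and standard monoisotopic residue masses; the function names are illustrative and not taken from any search engine.

```python
# Monoisotopic residue masses (Da) for the amino acids in the example peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "L": 113.08406, "I": 113.08406, "N": 114.04293, "D": 115.02694,
    "R": 156.10111,
}
WATER = 18.01056   # added to C-terminal (y) fragments
PROTON = 1.00728   # charge carrier for singly charged ions


def b_ions(peptide):
    """Singly charged b-ion m/z values (N-terminal fragments b1..bn)."""
    total, out = 0.0, []
    for aa in peptide:
        total += RESIDUE_MASS[aa]
        out.append(total + PROTON)
    return out


def y_ions(peptide):
    """Singly charged y-ion m/z values (C-terminal fragments y1..yn)."""
    total, out = 0.0, []
    for aa in reversed(peptide):
        total += RESIDUE_MASS[aa]
        out.append(total + WATER + PROTON)
    return out


peptide = "NSGDIVNLGSIAGR"
for i, (b, y) in enumerate(zip(b_ions(peptide), y_ions(peptide)), start=1):
    print(f"b{i:<2} {b:9.4f}   y{i:<2} {y:9.4f}")
```

Matching these theoretical values against the observed peaks is the basis of the peptide identification step.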
Identification of protein mixtures by mass spectrometry
1. De novo (i.e. manually)
2. Database searching:
[Figure: MS/MS spectrum (relative abundance vs. m/z, 200 to 1200) of H2N-NSGDIVNLGSIAGR-COOH, annotated with its C-terminal fragment ladder: R, GR, AGR, IAGR, SIAGR, GSIAGR, LGSIAGR, NLGSIAGR, VNLGSIAGR, IVNLGSIAGR, DIVNLGSIAGR, GDIVNLGSIAGR.]
Peptide sequence identifies the protein
YMR134W, yeast protein involved in iron metabolism
High-throughput protein identification by LC-MS/MS and automated sequence database searching
[Figure: raw MS/MS spectrum → protein sequence and/or DNA sequence database search → peptide sequence match (H2N-NSGDIVNLGSIAGR-COOH and its fragment ladder) → protein identification.]
Direct identification of 1000+ proteins from complex mixtures
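A rough sketch of the first step of a sequence database search: digest each database protein in silico with trypsin and keep peptides whose mass matches the measured peptide mass. The two database entries are made-up toy sequences (not real proteins), the 0.5 Da tolerance is an assumption, and real search engines go on to score fragment ions as well.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(protein):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def candidates(observed_mass, database, tol_da=0.5):
    """Return (protein name, peptide) pairs within tol_da of the observed mass."""
    hits = []
    for name, seq in database.items():
        for pep in tryptic_peptides(seq):
            if abs(peptide_mass(pep) - observed_mass) <= tol_da:
                hits.append((name, pep))
    return hits


# Hypothetical toy database; the sequences are invented for illustration.
toy_db = {
    "toy_protein_A": "MKTAYIAKNSGDIVNLGSIAGRLLEK",
    "toy_protein_B": "MSEQNNTEMTFQIQRPLK",
}
observed = peptide_mass("NSGDIVNLGSIAGR") + 0.02  # simulated measurement error
print(candidates(observed, toy_db))
```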
Case Study: Proteomic Analysis of Oral Cancer Progression
• Mouth cancer, tongue cancer, throat cancer
• In the USA, ~30,000 people are newly diagnosed with oral cancer each year; one person dies from oral cancer every hour of every day
• 350,000 to 400,000 new cases annually worldwide
• Fewer than half of patients will be alive in 5 years; 20x higher risk of developing second primary tumors
• However, 80 to 90% cure rate when found early. Unfortunately, at this time, the majority are found as late-stage cancers
Progression of oral cancer
Can we find molecular markers that predict this transition?
insult or injury
Malignant transformation rate = 5-17%
(adapted from Dr. Nelson Rhodus, U of M Dental School)
Saliva as a diagnostic fluid in oral cancer progression
• Readily available, non-invasive collection
• Heterogeneous human fluid with large dynamic range of protein abundances – requires fractionation
• Many post-translationally modified proteins
• Currently only 100-150 proteins have been identified in whole saliva (LC-MS/MS)
First step: obtain a comprehensive profile of the protein components from a normal individual saliva sample
Multidimensional separations followed by mass spectrometry
Whole saliva protein mixture
→ FFE fractionation (70 fractions)
→ RP-capLC
→ ESI-MS/MS (500,000+ spectra)
→ Protein sequence and/or DNA sequence database search
→ Protein identification
Raw data processing: Automated database searching
Computational algorithms for searching MS/MS spectra against protein sequence databases, mRNA sequences, DNA sequences
ProFound, Mascot, PepSea, MS-Fit, MOWSE, Peptident, Multident, Sequest, PepFrag, MS-Tag
[Figure: MS/MS spectrum (relative abundance vs. m/z) → protein identification.]
Choosing a sequence database
• National Center for Biotechnology Information (NCBI)
• Swiss-Prot/TrEMBL
• Protein Information Resource (PIR)
• European Bioinformatics Institute (EBI)
Considerations: organism-specificity, redundancy, annotation
Analysis of processed data: quality control of protein matches
Unfiltered: 10^5+ matches (lots of noise and junk)
→ filtering
Filtered: thousands of “true” matches
Probability of sequence match via statistical modeling
Keller et al (2002) Analytical Chemistry 74, 5383
Sequence matches automatically assigned a P score between 0 and 1
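A small sketch of how P scores could be used for filtering. It assumes each match already carries a probability from the statistical model; the 0.9 cutoff, the peptide names, and the function name are illustrative assumptions. If the probabilities are accurate, the sum of (1 - P) over accepted matches estimates how many of them are wrong.

```python
def filter_matches(matches, min_probability=0.9):
    """Keep matches at or above the cutoff and estimate their error rate.

    matches: list of (peptide, probability) pairs, probability in [0, 1].
    Returns (accepted matches, estimated fraction of accepted that are wrong).
    """
    accepted = [m for m in matches if m[1] >= min_probability]
    expected_false = sum(1.0 - p for _, p in accepted)
    error_rate = expected_false / len(accepted) if accepted else 0.0
    return accepted, error_rate


# Hypothetical matches with made-up probabilities.
matches = [
    ("PEPTIDEA", 0.99),
    ("PEPTIDEB", 0.95),
    ("PEPTIDEC", 0.40),
    ("PEPTIDED", 0.05),
]
accepted, error_rate = filter_matches(matches)
print(f"{len(accepted)} accepted, estimated error rate {error_rate:.2%}")
```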
Collating and interpreting the data: Interact software tool
http://www.systemsbiology.org/Default.aspx?pagename=proteomicssoftware
Result: Processed and Filtered Data
Saliva example: 433 unique proteins identified
Interpreting the data: annotated protein databases
National Center for Biotechnology Information (NCBI)
ExPASy/Uniprot
European Bioinformatics Institute (EBI)
Organism/biology specific:
Saccharomyces Genome Database (SGD)
Human Mitochondrial Protein Database
Human Proteome Organization (HUPO)
Mining databases for data interpretation: Example 1
Mining databases for data interpretation: Example 2
Classification of interpreted data: subcellular localization
Subcellular localization (number of proteins, % of 433):
Unknown (56) 12.9%
Secreted/Extracellular (132) 30.5%
Cytoplasmic (87) 20.1%
Structure/Cytoskeleton (47) 10.8%
Cytoplasmic/Nuclear (30) 6.9%
Membrane (25) 5.8%
Nuclear (17) 3.9%
Endoplasmic (9) 2.1%
Lysosomal (9) 2.1%
Mitochondrial (7) 1.6%
Ribosomal (6) 1.4%
Desmosomal (3) 0.7%
Endosomal (3) 0.7%
Peroxisomal (2) 0.5%
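As a quick check of the chart arithmetic, the percentages can be recomputed from the category counts. This is only a sketch; the counts are copied from the slide and the variable names are arbitrary.

```python
# Protein counts per subcellular localization category, from the slide.
counts = {
    "Unknown": 56, "Secreted/Extracellular": 132, "Cytoplasmic": 87,
    "Structure/Cytoskeleton": 47, "Cytoplasmic/Nuclear": 30, "Membrane": 25,
    "Nuclear": 17, "Endoplasmic": 9, "Lysosomal": 9, "Mitochondrial": 7,
    "Ribosomal": 6, "Desmosomal": 3, "Endosomal": 3, "Peroxisomal": 2,
}

total = sum(counts.values())  # should equal the 433 unique proteins
for name, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{name:<25} {n:>4} {100 * n / total:5.1f}%")
```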
Classification of interpreted data: functional characterization
Biological function (number of proteins, % of 433):
Unknown (39) 9.0%
Transport (41) 9.5%
Signaling (37) 8.5%
RNA Binding/Modification (10) 2.3%
Redox (16) 3.7%
Protein Synthesis (14) 3.2%
Protein Modification/Polymerization (6) 1.4%
Protein Folding/Repair (28) 6.5%
Protein Degradation/Inhibitor (57) 13.2%
Metabolism-Other (23) 5.3%
Metabolism-Glycolysis/Carbohydrates (33) 7.6%
DNA Binding/Transcription (6) 1.4%
Defense/Immunoresponse (43) 9.9%
Structural/Cytoskeletal (65) 15.0%
Cell Growth/Differentiation (3) 0.7%
Cell Adhesion/Communication (12) 2.8%
What about quantitative measurements?
Can we find molecular markers that predict this transition?
insult or injury
Malignant transformation rate = 5-17%
(adapted from Dr. Nelson Rhodus, U of M Dental School)
Stable-isotope labeling of proteins for quantitative profiling
[Figure: State 1 and State 2 samples; intensity vs. m/z spectra showing paired light (L) and heavy (H) peaks.]
1. label with “light” (-L) or “heavy” (-H) reagent
2. combine and proteolyze
3. analyze by MS
$\text{relative protein abundance} = \dfrac{\text{intensity[light]}}{\text{intensity[heavy]}}$
-L and -H labels are chemically identical, but isotopically different due to incorporation of stable isotopes (e.g. 2H, 15N, 13C…)
20° vs. 37°
Chemically identical but isotopically different peptides ionize with the same efficiency and act as mutual internal standards
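A minimal sketch of the ratio calculation for light/heavy peptide pairs. The peptide names and intensities are invented for illustration; in practice the intensities would come from integrated peak areas in the MS scan.

```python
import math


def relative_abundance(light_intensity, heavy_intensity):
    """Relative protein abundance as the light/heavy intensity ratio."""
    if heavy_intensity <= 0:
        raise ValueError("heavy intensity must be positive")
    return light_intensity / heavy_intensity


# Hypothetical (light, heavy) intensities for three peptide pairs.
pairs = {
    "peptide_1": (1200.0, 600.0),
    "peptide_2": (300.0, 310.0),
    "peptide_3": (150.0, 900.0),
}

for name, (light, heavy) in pairs.items():
    ratio = relative_abundance(light, heavy)
    print(f"{name}: ratio = {ratio:.2f}, log2 = {math.log2(ratio):+.2f}")
```

Reporting the log ratio alongside the raw ratio makes increases and decreases of the same magnitude symmetric around zero.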
Quantitative analysis of mRNA data
DeRisi et al, 1997, Science 278:680
[Figure: comparison of Sample 1 and Sample 2.]
Automated Quantitative Proteomics
[Figure: mixture 1 (light) and mixture 2 (heavy) → combine and proteolyze → multi-dimensional separation → mass analysis. The MS scan (m/z 550 to 580) shows the light and heavy peptide pair used to quantify; the MS/MS scan (m/z 200 to 800) is used to identify the peptide, here NH2-EACDPLR-COOH.]
Quantitative analysis
[Figure: +TOF MS spectrum, 20 MCA scans from mm_sample.wiff, max. 274.0 counts. Intensity (counts) vs. m/z (amu), 1914 to 1934. Two isotope clusters, labeled Sample 1 and Sample 2: peaks at 1916.9909, 1917.9946, 1918.9924, 1920.0007, 1921.0165 and at 1924.9803, 1926.0240, 1927.0231, 1928.0203, 1929.0322, 1930.0176, 1931.0077.]
Relative intensity = relative protein abundance
Disease proteomics: androgen-induced effects in prostate cancer
306 peptides, 79 differentially expressed (26%) (d0/d8 > 1.5 or < 0.67)
- androgen + androgen
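The differential-expression cutoff on this slide can be sketched as a simple threshold on the d0/d8 ratio. The thresholds (1.5 and 0.67) are from the slide; the example ratios and the label names are made up for illustration.

```python
def classify(ratio, upper=1.5, lower=0.67):
    """Label a d0/d8 ratio using the slide's cutoffs."""
    if ratio > upper:
        return "d0_higher"
    if ratio < lower:
        return "d8_higher"
    return "unchanged"


# Hypothetical d0/d8 ratios for five peptides.
ratios = [2.1, 0.5, 1.0, 1.4, 0.66]
changed = [r for r in ratios if classify(r) != "unchanged"]
print(f"{len(changed)} of {len(ratios)} peptides differentially expressed "
      f"({100 * len(changed) / len(ratios):.0f}%)")
```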
Dealing with the data
Data acquisition
Raw data processing (database searching)
Analysis of processed data (statistical filtering, quantitative analysis)
Data organization and interpretation
Archiving and databasing
Modeling (computational biology)
Need for better data archives and repositories
http://proteomics.jhu.edu/dl/pathidb.php
Archiving challenges: different data formats
http://sashimi.sourceforge.net/software_glossolalia.html
[Figure: control of eukaryotic gene expression, from DNA through primary RNA transcript and mRNA to active protein, with control at the transcriptional, RNA processing, RNA transport, translational, and protein activity levels.]
Computational Biology: Integrating proteomics and genomics data
mRNA versus protein abundance ratios, Gal/Eth
[Figure: scatter plot of protein abundance ratio (log10) vs. mRNA abundance ratio (log10), both axes from -3 to 3. Labeled genes: MLS1, ICL1, PCK1, GAL5, FBP1, GAL7, GAL10, CDC19, PDC1, GAL1, GAL3, PFK1, TPI1, FBA1, ACO1. Regions marked 1 to 4 and A.]
Integrating proteomics and genomics data: elucidating gene expression regulatory networks
Griffin TJ et al (2002) Mol Cell Proteomics 1: 323
[Figure: scatter plot of protein abundance ratio (log10) vs. mRNA abundance ratio (log10), both axes from -3 to 3, with regions 1 to 4 marked. Annotated groups: protein synthesis, mitochondrially located proteins, rRNA processing.]
Post-transcriptionally regulated proteins?
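One way to flag candidates for post-transcriptional regulation is to look for genes whose protein ratio departs from their mRNA ratio on the log10 scale. This is only a sketch: the gene names and ratios are invented, and the 0.5 log10 cutoff is an arbitrary assumption rather than the criterion used in the cited study.

```python
import math


def discordant(ratios, min_log10_gap=0.5):
    """Return genes whose protein and mRNA log10 ratios differ by the cutoff.

    ratios: dict of gene -> (mRNA abundance ratio, protein abundance ratio).
    The returned value for each gene is log10(protein) - log10(mRNA).
    """
    flagged = {}
    for gene, (mrna, protein) in ratios.items():
        gap = math.log10(protein) - math.log10(mrna)
        if abs(gap) >= min_log10_gap:
            flagged[gene] = gap
    return flagged


# Hypothetical abundance ratios (condition A / condition B).
toy = {
    "gene_a": (10.0, 9.0),   # mRNA and protein change together
    "gene_b": (1.0, 8.0),    # protein changes, mRNA does not
    "gene_c": (0.1, 0.12),   # both decrease together
}
for gene, gap in discordant(toy).items():
    print(f"{gene}: protein vs. mRNA log10 gap = {gap:+.2f}")
```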
Computational biology: integrating information to assign function
Cytoscape: http://www.cytoscape.org/
Modeling cellular circuitry based on genomic and proteomic data
Is the virtual human on the horizon???
Acknowledgements
University of Minnesota
Griffin Laboratory: Mikel Roe, Sri Bandhakavi, Hongwei Xie, Clive Nyauncho
Funding: Minnesota Medical Foundation, NIH
U of M Dental School: Dr. Nelson Rhodus
MSI: Patton Fast