computational methods and bioinformatics in proteomic studies bioinformatics: building bridges april...

Computational Methods and Bioinformatics in Proteomic Studies

Bioinformatics: Building Bridges, April 14, 2005

Tim Griffin

Dept. Biochemistry, Molecular Biology and Biophysics

tgriffin@umn.edu

Interdisciplinary biology in the 21st century

Genome-era biology: system-wide studies

The yeast genome on a chip

DeRisi et al, 1997, Science 278:680

The simple view: one gene, one protein

The reality: biological systems are complex

Protein interaction network in Drosophila

Science (2003) 302, p. 1727

Why analyze at the protein level?

[Figure: control of eukaryotic gene expression. Labels: DNA, primary RNA transcript and mRNA in the nucleus; mRNA, inactive mRNA, protein, active protein and inactive protein in the cytosol. Control points: transcriptional control, RNA processing control, RNA transport control, translational control, protein activity control.]

What is proteomics?

“Proteomics includes not only the identification and quantification of proteins, but also the determination of their localization, modifications, interactions, activities, and, ultimately, their function.”

- Stan Fields in Science, 2001.

Alternatively: proteomics = “fast biochemistry”

Proteomics: a complement to genomics

What proteomic analysis has to offer:

• measurement of protein response, which is not always indicated by mRNA response

• post-translational modifications

• macromolecular interactions

• sub-cellular location

• high-resolution structural and molecular characterization

Genomics, Proteomics, and Systems Biology

[Figure: how genomics, proteomics and computational biology feed systems biology. Genomics: genomic DNA sequencing, mRNA arrays. Proteomics: protein cataloguing, protein products, functional protein and quantitative profiling, protein phosphorylation, protein dynamics, protein modifications, sub-cellular location, catalytic activity, descriptive protein interaction maps, 3D structure. Computational biology: identify system components, measure and define system properties, interactions between components.]

Proteomics technologies and methods

• Two-dimensional gel electrophoresis

• mass spectrometry

• protein chips

• yeast 2-hybrid

• phage display

• antibody engineering

• high-throughput protein expression

• high-throughput X-ray crystallography

The 1990s revolution: mass spectrometry

Development of physical methods to mass analyze large biomolecules

[Figure: the three stages of a mass spectrometer. Ionization: MALDI, or electrospray (liquid chromatography, nanospray). Separation by m/z: quadrupole, ion trap, time-of-flight. Detection.]

Mass analysis of proteins, peptides, DNA

Electrospray ionization (ESI)

• protein and peptide analysis, multiply charged ions

• quadrupole and TOF detection

• tandem mass spectrometry

• solution phase ionization – enables online coupling with liquid chromatography (LC)

Separations of complex mixtures: crowd control

• Enables the processing of the many components in big protein mixtures

[Figure: protein mixture → trypsin → peptides → turnstile that admits peptides one at a time (1, 2, 3, ...)]

Identification of protein mixtures by tandem mass spectrometry

µLC → ESI

1. MS “survey” scan

2. select specific peptide

3. CID (fragment peptide)

4. detect fragments: tandem mass spectrum (MS/MS)

[Figure: survey scan spanning m/z 600 to 1400 and MS/MS spectrum spanning m/z 200 to 1200]

Peptide sequence determination from MS/MS spectra

Collision-induced dissociation (CID) creates two prominent ion series:

b-series (numbered from the N-terminus): b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14

H2N-N-S-G-D-I-V-N-L-G-S-I-A-G-R-COOH

y-series (numbered from the C-terminus): y14 y13 y12 y11 y10 y9 y8 y7 y6 y5 y4 y3 y2 y1
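The two ion series can be sketched numerically. The block below is a minimal illustration, not the software used in the talk: it assumes singly charged fragments and standard monoisotopic residue masses, and the names `RESIDUE_MASS` and `fragment_ions` are made up for this example.

```python
# Monoisotopic residue masses (Da) for the amino acids in the example peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "L": 113.08406, "I": 113.08406, "N": 114.04293, "D": 115.02694,
    "R": 156.10111,
}
WATER = 18.01056   # H2O, added to every y ion
PROTON = 1.00728   # charge carrier for a singly charged ion


def fragment_ions(peptide):
    """Return (b, y) lists of singly charged fragment m/z values.

    b[i-1] is b_i (first i residues); y[i-1] is y_i (last i residues).
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b, y = [], []
    running = 0.0
    for m in masses:                 # grow from the N-terminus
        running += m
        b.append(running + PROTON)
    running = 0.0
    for m in reversed(masses):       # grow from the C-terminus
        running += m
        y.append(running + WATER + PROTON)
    return b, y


b_ions, y_ions = fragment_ions("NSGDIVNLGSIAGR")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i:<2} {b:9.4f}   y{i:<2} {y:9.4f}")
```

Because leucine and isoleucine share the same residue mass, a ladder like this cannot tell them apart on mass alone.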

Identification of protein mixtures by mass spectrometry

1. De novo (i.e. manually)

2. Database searching:

[Figure: the observed MS/MS spectrum is compared with theoretical spectra (DNA or protein database) to give the peptide identification and then the protein identification; both spectra span m/z 200 to 1200]

H2N-NSGDIVNLGSIAGR-COOH

[Figure: MS/MS spectrum (m/z 200 to 1200) annotated with the fragment ladder R, GR, AGR, IAGR, SIAGR, GSIAGR, LGSIAGR, NLGSIAGR, VNLGSIAGR, IVNLGSIAGR, DIVNLGSIAGR, GDIVNLGSIAGR]

Peptide sequence identifies the protein

YMR134W, yeast protein involved in iron metabolism
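How a peptide sequence points back to a protein can be sketched as an in-silico tryptic digest followed by a lookup. This is a toy illustration under stated assumptions: the cleavage rule is the common "after K or R, not before P" with no missed cleavages, and the two database entries (`PROT_A`, `PROT_B`) are invented sequences, not real yeast proteins.

```python
def tryptic_digest(sequence):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):        # C-terminal peptide without K/R
        peptides.append(sequence[start:])
    return peptides


# Hypothetical protein database: names and sequences are made up.
database = {
    "PROT_A": "MKTAYIAKNSGDIVNLGSIAGRLLEK",
    "PROT_B": "MSRPLVKGDEALR",
}

# Index every tryptic peptide by the proteins that contain it.
peptide_index = {}
for protein_id, sequence in database.items():
    for peptide in tryptic_digest(sequence):
        peptide_index.setdefault(peptide, []).append(protein_id)

print(peptide_index.get("NSGDIVNLGSIAGR", []))
```

A peptide found in only one database entry identifies that protein unambiguously; a peptide shared by several entries does not, which is one reason database redundancy matters.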

High-throughput protein identification by LC-MS/MS and automated sequence database searching

Raw MS/MS spectrum → protein sequence and/or DNA sequence database search → peptide sequence match (H2N-NSGDIVNLGSIAGR-COOH) → protein identification

Direct identification of 1000+ proteins from complex mixtures

Case Study: Proteomic Analysis of Oral Cancer Progression

• Mouth cancer, tongue cancer, throat cancer

• In the USA, ~30,000 people are newly diagnosed with oral cancer each year, and a person dies from oral cancer every hour of every day

• 350,000 to 400,000 new cases annually worldwide

• Fewer than half will be alive in 5 years; 20x higher risk of developing second primary tumors

• However, the cure rate is 80 to 90% when found early. Unfortunately, at this time, the majority are found as late-stage cancers

Progression of oral cancer

Can we find molecular markers that predict this transition?

insult or injury

Malignancy transformation rate = 5-17%

(adapted from Dr. Nelson Rhodus, U of M Dental School)

Saliva as a diagnostic fluid in oral cancer progression

• Readily available, non-invasive collection

• Heterogeneous human fluid with large dynamic range of protein abundances – requires fractionation

• Many post-translationally modified proteins

• Currently only 100-150 proteins have been identified in whole saliva (LC-MS/MS)

First step: obtain a comprehensive profile of the protein components from a normal individual saliva sample

Multidimensional separations followed by mass spectrometry

Whole saliva protein mixture → FFE fractionation (70 fractions) → RP-capLC → ESI-MS/MS (500,000+ spectra) → protein sequence and/or DNA sequence database search → protein identification

Raw data processing: Automated database searching

Computational algorithms for searching MS/MS spectra against protein sequence databases, mRNA sequences, DNA sequences

ProFound Mascot PepSea MS-Fit MOWSE Peptident Multident Sequest PepFrag MS-Tag

Protein identification
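The core idea behind automated searching can be sketched as a shared peak count: generate theoretical fragments for each candidate peptide and count how many fall near an observed peak. This is a deliberately simplified stand-in, not how Sequest, Mascot or the other listed tools actually score. It uses y ions only, an assumed 0.5 Da tolerance, a synthetic "observed" spectrum, and a hypothetical candidate list.

```python
# Monoisotopic residue masses (Da); only residues used by the candidates below.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "E": 129.04259, "R": 156.10111,
}
WATER = 18.01056
PROTON = 1.00728


def y_ions(peptide):
    """Singly charged y-ion m/z values, y1 first."""
    ions, running = [], WATER + PROTON
    for aa in reversed(peptide):
        running += RESIDUE_MASS[aa]
        ions.append(running)
    return ions


def shared_peak_count(observed, peptide, tolerance=0.5):
    """Number of theoretical y ions with an observed peak within `tolerance` Da."""
    return sum(
        any(abs(mz - peak) <= tolerance for peak in observed)
        for mz in y_ions(peptide)
    )


# Synthetic "observed" spectrum: y ions of the slide's peptide, shifted by
# 0.01 Da to mimic measurement error, plus two made-up noise peaks.
observed = [mz + 0.01 for mz in y_ions("NSGDIVNLGSIAGR")] + [350.0, 900.0]

# Hypothetical candidate list; a real search draws candidates from a database.
candidates = ["NSGDIVNLGSIAGR", "GAISGLNVIDGSNR", "EACDPLR"]
scores = {pep: shared_peak_count(observed, pep) for pep in candidates}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

The second candidate is a scrambled version of the first with the same total mass, so the intact-peptide mass alone cannot separate them; the fragment ladder can.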

Choosing a sequence database

• National Center for Biotechnology Information (NCBI)

• Swiss-Prot/TrEMBL

• Protein Information Resource (PIR)

• European Bioinformatics Institute (EBI)

Considerations: organism-specificity, redundancy, annotation

Analysis of processed data: quality control of protein matches

Unfiltered – 10^5+ matches (lots of noise and junk)

↓ filtering

Filtered – thousands of “true” matches

Probability of sequence match via statistical modeling

Keller et al (2002) Analytical Chemistry 74, 5383

Sequence matches automatically assigned a P score between 0 and 1
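Once every match carries a probability, filtering becomes a threshold plus an error estimate: if the probabilities are accurate, the expected number of wrong matches among those accepted is the sum of (1 - P). The sketch below assumes a cutoff of 0.9, which is an illustrative choice and not a value from the talk, and the probabilities are invented (the peptide strings are borrowed from elsewhere in these slides or made up).

```python
def filter_matches(matches, min_probability=0.9):
    """Keep matches at or above the cutoff and estimate their error rate.

    `matches` is a list of (peptide, probability) pairs. The error rate is
    the expected fraction of accepted matches that are incorrect.
    """
    accepted = [(pep, p) for pep, p in matches if p >= min_probability]
    expected_wrong = sum(1.0 - p for _, p in accepted)
    error_rate = expected_wrong / len(accepted) if accepted else 0.0
    return accepted, error_rate


# Made-up probabilities for illustration only.
matches = [
    ("NSGDIVNLGSIAGR", 0.99),
    ("EACDPLR", 0.95),
    ("LGSIAGR", 0.60),
    ("GDEALR", 0.10),
    ("TAYIAK", 0.02),
]

accepted, error_rate = filter_matches(matches)
print(len(accepted), "accepted, estimated error rate", round(error_rate, 3))
```

Raising the cutoff lowers the estimated error rate at the cost of discarding some correct matches, so the threshold is a trade-off rather than a fixed constant.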

Collating and interpreting the data: Interact software tool

http://www.systemsbiology.org/Default.aspx?pagename=proteomicssoftware

Result: Processed and Filtered Data

Saliva example: 433 unique proteins identified

Interpreting the data: annotated protein databases

National Center for Biotechnology Information (NCBI)

ExPASy/Uniprot

European Bioinformatics Institute (EBI)

Organism/biology specific:

Saccharomyces Genome Database (SGD)

Human Mitochondrial Protein Database

Human Proteome Organization (HUPO)

Mining databases for data interpretation: Example 1

Mining databases for data interpretation: Example 2

Classification of interpreted data: subcellular localization

Unknown (56) 12.9%

Secreted/Extracellular (132) 30.5%

Cytoplasmic (87) 20.1%

Structure/Cytoskeleton (47) 10.8%

Cytoplasmic/Nuclear (30) 6.9%

Membrane (25) 5.8%

Nuclear (17) 3.9%

Endoplasmic (9) 2.1%

Lysosomal (9) 2.1%

Mitochondrial (7) 1.6%

Ribosomal (6) 1.4%

Desmosomal (3) 0.7%

Endosomal (3) 0.7%

Peroxisomal (2) 0.5%

Total: 433 proteins; each percentage is the count divided by this total.
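The percentages on this slide follow directly from the counts. A short check, using only the counts listed above (the variable names are arbitrary):

```python
# Protein counts per subcellular localization, as listed on the slide.
counts = {
    "Unknown": 56,
    "Secreted/Extracellular": 132,
    "Cytoplasmic": 87,
    "Structure/Cytoskeleton": 47,
    "Cytoplasmic/Nuclear": 30,
    "Membrane": 25,
    "Nuclear": 17,
    "Endoplasmic": 9,
    "Lysosomal": 9,
    "Mitochondrial": 7,
    "Ribosomal": 6,
    "Desmosomal": 3,
    "Endosomal": 3,
    "Peroxisomal": 2,
}

total = sum(counts.values())
percentages = {name: round(100 * n / total, 1) for name, n in counts.items()}

# Largest categories first.
for name, n in sorted(counts.items(), key=lambda item: -item[1]):
    print(f"{name:<24} {n:>4} {percentages[name]:5.1f}%")
print("total", total)
```

The counts sum to 433, matching the number of unique proteins reported for the saliva sample.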

Classification of interpreted data: functional characterization

Biological Function

Unknown (39) 9.0%

Transport (41) 9.5%

Signaling (37) 8.5%

RNA Binding/Modification (10) 2.3%

Redox (16) 3.7%

Protein Synthesis (14) 3.2%

Protein Modification/Polymerization (6) 1.4%

Protein Folding/Repair (28) 6.5%

Protein Degradation/Inhibitor (57) 13.2%

Metabolism-Other (23) 5.3%

Metabolism-Glycolysis/Carbohydrates (33) 7.6%

DNA Binding/Transcription (6) 1.4%

Defense/Immunoresponse (43) 9.9%

Structural/Cytoskeletal (65) 15.0%

Cell Growth/Differentiation (3) 0.7%

Cell Adhesion/Communication (12) 2.8%

Total: 433 proteins; each percentage is the count divided by this total.

What about quantitative measurements?

Can we find molecular markers that predict this transition?

insult or injury

Malignancy transformation rate = 5-17%

(adapted from Dr. Nelson Rhodus, U of M Dental School)

Stable-isotope labeling of proteins for quantitative profiling

State 1 vs. State 2 (20° vs. 37°)

label with “light” (-L) or “heavy” (-H) reagent

combine and proteolyze

analyze by MS

relative protein abundance = intensity[light] / intensity[heavy]

-L and -H labels are chemically identical, but isotopically different due to incorporation of stable isotopes (e.g. 2H, 15N, 13C…)

Chemically identical but isotopically different peptides ionize with the same efficiency and act as mutual internal standards
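The ratio itself is a one-line calculation once the light and heavy signals have been extracted. The sketch below assumes each form is summarized by summing the intensities of its isotope peaks, which is one common choice and not necessarily the one used in the talk; the intensity values are invented.

```python
def abundance_ratio(light_intensities, heavy_intensities):
    """Relative abundance as summed light intensity over summed heavy intensity."""
    light = sum(light_intensities)
    heavy = sum(heavy_intensities)
    if heavy == 0:
        raise ValueError("heavy signal is zero; ratio is undefined")
    return light / heavy


# Made-up isotope-peak intensities (counts) for one peptide pair.
light_peaks = [120.0, 90.0, 40.0]
heavy_peaks = [60.0, 45.0, 20.0]

ratio = abundance_ratio(light_peaks, heavy_peaks)
print(f"light/heavy = {ratio:.2f}")
```

Because both forms are measured in the same scan, instrument drift and ionization effects largely cancel in the ratio, which is the point of using the pair as mutual internal standards.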

Quantitative analysis of mRNA data

DeRisi et al, 1997, Science 278:680

Sample 1 vs. Sample 2

Automated Quantitative Proteomics

mixture 1 (light) + mixture 2 (heavy) → combine and proteolyze → multi-dimensional separation → mass analysis

quantify: light and heavy peak pair in the MS scan (m/z 550 to 580)

identify (MS/MS): NH2-EACDPLR-COOH (fragment spectrum, m/z 200 to 800)

Quantitative analysis

[Figure: +TOF MS, 20 MCA scans from mm_sample.wiff; max. 274.0 counts; x-axis: m/z, amu (1914 to 1934). Two isotope clusters, labeled Sample 1 and Sample 2, with peaks at 1916.9909, 1917.9946, 1918.9924, 1920.0007, 1921.0165 and at 1924.9803, 1926.0240, 1927.0231, 1928.0203, 1929.0322, 1930.0176, 1931.0077.]

Relative intensity = relative protein abundance

Disease proteomics: androgen-induced effects in prostate cancer

306 peptides, 79 differentially expressed (26%) (d0/d8 > 1.5 or < 0.67)

- androgen + androgen
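The cutoff on this slide translates into a simple classification rule. In the sketch below the thresholds 1.5 and 0.67 and the figures 79 and 306 come from the slide; the per-peptide ratios and their names are invented, and the labels avoid saying which sample carried d0 because the slide does not state it.

```python
def classify(ratio, upper=1.5, lower=0.67):
    """Label a d0/d8 ratio using the slide's cutoffs."""
    if ratio > upper:
        return "d0 higher"
    if ratio < lower:
        return "d8 higher"
    return "unchanged"


# Made-up d0/d8 ratios for illustration only.
ratios = {"peptide_1": 2.4, "peptide_2": 1.1, "peptide_3": 0.4, "peptide_4": 0.9}
labels = {name: classify(r) for name, r in ratios.items()}
changed = [name for name, label in labels.items() if label != "unchanged"]
fraction_changed = len(changed) / len(ratios)

# The slide's own figures: 79 of 306 peptides pass the cutoffs.
slide_fraction = 79 / 306

print(labels)
print(f"{fraction_changed:.0%} of the toy set; slide: {slide_fraction:.0%}")
```

The two cutoffs are reciprocals of each other (1/1.5 is about 0.67), so a 1.5-fold change counts equally in either direction.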

Dealing with the data

Data acquisition

Raw data processing (Database searching)

Analysis of processed data (Statistical filtering, quantitative analysis)

Data organization and interpretation

Archiving and databasing

Modeling (Computational Biology)

Need for better data archives and repositories

http://proteomics.jhu.edu/dl/pathidb.php

Archiving challenges: different data formats

http://sashimi.sourceforge.net/software_glossolalia.html

[Figure: control of eukaryotic gene expression, with control points running from transcription in the nucleus to protein activity in the cytosol]

Computational Biology: Integrating proteomics and genomics data

[Figure: mRNA versus protein abundance ratios, Gal/Eth; x-axis: mRNA abundance ratio (log10), from -3 to 3; labeled points: PDC1, GAL1, GAL3]

Integrating proteomics and genomics data: Elucidating gene expression regulatory networks

Griffin TJ et al (2002) Mol Cell Proteomics 1: 323

[Figure: x-axis: mRNA abundance ratio (log10), from -3 to 3; highlighted groups: protein synthesis, mitochondrially located proteins, rRNA processing]

Post-transcriptionally regulated proteins?
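One way to look for such candidates is to compare the two ratios on a log scale and flag genes where they disagree. The sketch below is an assumption-laden illustration, not the analysis from the cited paper: the gene names and ratios are invented, and the 0.5 log10-unit gap is an arbitrary cutoff.

```python
import math


def log10_gap(mrna_ratio, protein_ratio):
    """Difference between protein and mRNA abundance ratios in log10 units."""
    return math.log10(protein_ratio) - math.log10(mrna_ratio)


def discordant_genes(ratios, min_gap=0.5):
    """Genes whose protein and mRNA ratios differ by more than `min_gap`."""
    return [
        gene
        for gene, (mrna, protein) in ratios.items()
        if abs(log10_gap(mrna, protein)) > min_gap
    ]


# Hypothetical (mRNA ratio, protein ratio) pairs for two growth conditions.
ratios = {
    "GENE_A": (10.0, 10.0),   # mRNA and protein change together
    "GENE_B": (1.0, 10.0),    # protein changes, mRNA does not
    "GENE_C": (0.1, 0.2),     # small disagreement, below the cutoff
}

flagged = discordant_genes(ratios)
print(flagged)
```

A flagged gene is only a candidate: measurement error in either ratio can produce the same gap, so replicate measurements would be needed before reading it as regulation.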

Computational biology: integrating information to assign function

Cytoscape: http://www.cytoscape.org/

Modeling cellular circuitry based on genomic and proteomic data

Is the virtual human on the horizon???

Acknowledgements

University of Minnesota

Griffin Laboratory: Mikel Roe, Sri Bandhakavi, Hongwei Xie, Clive Nyauncho

Funding: Minnesota Medical Foundation, NIH

U of M Dental School: Dr. Nelson Rhodus

MSI: Patton Fast
