Overview - MS Proteomics in One Slide
Obtain protein -> Digest into peptides -> Acquire spectra in mass spectrometer
MS: masses of peptides
MS/MS: fragments of a peptide
Match to sequence database
Results!
But it’s more complex than that….
Most things are more difficult than for genomics / transcriptomics….
• No amplification – we only ever lose signal
• Can’t sequence peptides as sensitively as DNA/RNA
• More complications – peptides can be modified in many, many ways
• Mapping spectra -> peptides -> proteins is not as easy as mapping reads to transcripts
• Implications for quantification
Data Analysis Challenges & Solutions
Let’s look at data analysis, using an example experiment….
We have a cancer cell line. We treated it with secret compound Z.
We want to know what effect Z has on the proteome of the cells.
What proteins are in the samples?
Which proteins significantly change in amount between the samples?
This will be a discovery, shotgun proteomics experiment.
Treated Cells
Control Cells
Peptide LC-MS
https://commons.wikimedia.org/wiki/File:Mass_spectrometry_protocol.png
Optional separation
An LC MS System
Oxford TDI Proteomics Core
https://commons.wikimedia.org/wiki/File:Mass_spectrometry_protocol.png
MS/MS Fragmentation
A Real Spectrum
Peptide from Keratin – a common contaminant!
Great, what do we get out of the machine?
15 RAW data files in vendor format (5 x triplicate runs)
Approx 30GB of raw data
500,000 – 1,000,000 spectra
Formats, formats, formats
Fragment Spectra Peak Lists: PKL, MGF, DTA, MS2
Generic Flexible Formats: mzML, mzXML, mzData
Vendor Raw Data Formats: Thermo .RAW, Bruker .yep / .baf, Waters .RAW, Agilent .d, AB Sciex .wiff
Converted by: Vendor Software, Academic Converters, Bespoke Scripts
Converters and BioHPC
BioHPC cannot install license-restricted mass-spec vendor software, so we need to use open formats.
To run protein ID analyses on BioHPC we recommend obtaining MGF format data.
Ask your proteomics core, or use ProteoWizard.
http://proteowizard.sourceforge.net/
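For a feel of what an MGF file contains, here is a minimal reader sketch. It assumes the common MGF layout (spectra delimited by BEGIN IONS / END IONS, KEY=value header lines, then "m/z intensity" peak lines); the function name `read_mgf` and the example spectrum are made up for illustration, and real files may carry extra fields this sketch ignores.

```python
import io


def read_mgf(handle):
    """Yield one dict per spectrum from an MGF-formatted text stream."""
    spectrum = None
    for raw in handle:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            spectrum = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
            spectrum = None
        elif spectrum is not None:
            if "=" in line and not line[0].isdigit():
                # header line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                spectrum["params"][key.upper()] = value
            else:
                # peak line: m/z followed by an optional intensity
                fields = line.split()
                mz = float(fields[0])
                intensity = float(fields[1]) if len(fields) > 1 else 0.0
                spectrum["peaks"].append((mz, intensity))


# Hypothetical two-peak spectrum, for illustration only.
example = """BEGIN IONS
TITLE=hypothetical_scan_1
PEPMASS=571.29 12345.0
CHARGE=2+
175.119 1200.0
262.151 850.5
END IONS
"""

spectra = list(read_mgf(io.StringIO(example)))
print(len(spectra), spectra[0]["params"]["TITLE"])
```

In practice an established library or the search engine itself reads these files; the sketch only shows that MGF is plain text you can inspect.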
Peptide Identification
Identify peptides by matching experimental spectra to theoretical ones from a protein sequence
database.
Input Spectrum + Sequence Database (e.g. UniProtKB) -> Database Search Engine
Result: HQGVMVGMGQK, Score: 46
A Peptide to Spectrum Match (PSM)
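To make "matching experimental spectra to theoretical ones" concrete, the sketch below computes singly charged b- and y-ion m/z values for an unmodified peptide and counts how many fall near an observed peak. The shared-peak count is a deliberately naive stand-in for a score; real search engines use more elaborate scoring. The function names, the 0.02 tolerance and the observed peak list are assumptions for illustration.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_mz(peptide):
    """Return (b_ions, y_ions) as singly charged m/z values."""
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)          # N-terminal fragment
        y_ions.append(sum(masses[i:]) + WATER + PROTON)  # C-terminal fragment
    return b_ions, y_ions


def count_matches(peptide, observed_mz, tol=0.02):
    """Count theoretical fragments with an observed peak within tol (Da)."""
    b_ions, y_ions = fragment_mz(peptide)
    return sum(
        any(abs(t - o) <= tol for o in observed_mz)
        for t in b_ions + y_ions
    )


b_ions, y_ions = fragment_mz("HQGVMVGMGQK")
# Made-up observed peaks: two near real fragments, one unrelated.
shared = count_matches("HQGVMVGMGQK", [147.113, 138.066, 500.0])
print(shared)
```

A search engine repeats this comparison for every candidate peptide in the database whose mass matches the precursor, then keeps the best-scoring candidate as the PSM.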
SearchGUI
How to run searches easily?
Many tools, each with its own command line and parameter formats.
Use CompOmics SearchGUI
Installed on BioHPC
Installed search engines:
• X! Tandem
• MS-GF+
• OMSSA
• Comet
http://compomics.github.io/projects/searchgui.html
PSM Scoring
Making scores more meaningful…
Mascot Score = 46
Xcorr = 2.43
HyperScore = 0.9844
Every search engine uses a different scoring algorithm.
Rules for calling a good ID have evolved, but may not be based on good evidence.
Very hard to compare or combine results.
Can we transform them into something more useful?
PSM Re-Scoring
Take score(s) from the search engine
Map to a standard scale
Fit distributions for true and false IDs
Obtain probability for score x.
Keller, A., Nesvizhskii, A. I., Kolker, E., & Aebersold, R. (2002). Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Analytical chemistry, 74(20), 5383-5392.
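The last two steps (fit distributions, obtain a probability for score x) amount to Bayes' rule on a two-component mixture. The sketch below assumes the distributions have already been fitted and uses two normal distributions purely for illustration; the cited model fits its own distribution shapes. All parameter values and function names here are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def prob_correct(score, prior_true, true_params, false_params):
    """Posterior probability that a PSM with this score is a true ID.

    prior_true   : assumed fraction of all PSMs that are correct
    true_params  : (mean, sd) of the score distribution for true IDs
    false_params : (mean, sd) of the score distribution for false IDs
    """
    p_true = prior_true * normal_pdf(score, *true_params)
    p_false = (1.0 - prior_true) * normal_pdf(score, *false_params)
    return p_true / (p_true + p_false)


# Hypothetical fitted parameters on a standardized score scale.
TRUE_IDS = (3.0, 1.0)
FALSE_IDS = (0.0, 1.0)

for x in (0.0, 1.5, 3.0):
    print(x, round(prob_correct(x, 0.3, TRUE_IDS, FALSE_IDS), 3))
```

The benefit is that a probability means the same thing whichever search engine produced the original score, which is what makes results comparable and combinable.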
Target-Decoy Method
Are our probabilities really accurate?
Elias, J. E., & Gygi, S. P. (2007). Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nature methods, 4(3), 207-214.
TARGETS: Real Sequences
DECOYS: Fake Sequences
A correct match is always to a real sequence.
An incorrect random match is equally likely to hit a target or a decoy sequence.
Estimate the number of incorrect target matches by counting the decoy matches.
An empirical estimate of the False Discovery Rate (FDR).
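The counting argument above can be sketched in a few lines. This assumes targets and decoys were searched together in databases of equal size, and uses decoys / targets above a score threshold as the estimator, which is one common variant rather than the only formula in use. The PSM list is made up.

```python
def estimate_fdr(psms, threshold):
    """Estimate FDR among target PSMs scoring at or above threshold.

    psms: iterable of (score, is_decoy) pairs.
    Each decoy hit stands in for one incorrect target hit.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


# Hypothetical PSMs: (score, is_decoy)
example_psms = [(50, False), (42, False), (35, False), (28, False),
                (24, True), (15, True), (12, False)]

print(estimate_fdr(example_psms, threshold=20))
```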
Filtering PSMs
A large experiment produces > 100,000 PSMs.
No way to manually inspect each one!
We usually report PSMs filtered to a specific False Discovery Rate.
1% FDR is most common.
Matches with post-translational modifications require special treatment.
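Filtering to a fixed FDR is usually done by ranking PSMs and finding how far down the list one can go. The sketch below assumes a combined target-decoy search and higher-is-better scores, converts the running decoy/target ratio into q-values (the lowest FDR at which a PSM would still be accepted), and keeps target PSMs under the cut-off. Names and data are hypothetical, and tied scores are not handled specially.

```python
def filter_at_fdr(psms, max_fdr=0.01):
    """Return target PSMs whose q-value is at most max_fdr.

    psms: list of (score, is_decoy) pairs.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # Running FDR estimate at each rank: decoys so far / targets so far.
    fdrs, targets, decoys = [], 0, 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: smallest FDR at this rank or at any lower-scoring rank.
    qvalues, running_min = [], float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvalues.append(running_min)
    qvalues.reverse()

    return [psm for psm, q in zip(ranked, qvalues)
            if q <= max_fdr and not psm[1]]


# Hypothetical PSMs: (score, is_decoy)
scored = [(50, False), (45, False), (40, False),
          (35, True), (30, False), (25, True)]

print(filter_at_fdr(scored, max_fdr=0.01))
```

With only a handful of PSMs the estimate is very coarse; on a real dataset of > 100,000 PSMs a 1% cut-off is meaningful.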
PeptideShaker
How to do all the combining and filtering easily?
Use CompOmics PeptideShaker
Installed on BioHPC
http://compomics.github.io/projects/peptide-shaker.html
Protein Inference I
We don’t identify proteins – we only identify peptides!
Peptides could come from one or more proteins – how do we resolve this?
[Figure: peptides A-E mapped to proteins 1-5, with proteins marked Present or Not Present]
Present: Peptide A is uniquely assigned
???: All peptides are shared
Protein Inference II
(Very) Naïve rules
e.g. If protein is identified with 2 unique peptides it is present
Parsimony
The smallest list of proteins that can explain the peptides identified is the most likely.
Minimal set cover / minimal partial set cover etc.
Bayesian Models
Consider probabilities that proteins produced peptides and spectra.
Prior information – probability peptide x can be observed by MS etc.
Correlation with other sources
e.g. did RNA-Seq on the same sample find mRNA for the protein?
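As an illustration of the parsimony idea, here is a greedy set-cover sketch: repeatedly pick the protein that explains the most still-unexplained peptides. Greedy selection only approximates the true minimal set cover, and the protein-to-peptide map below is made up; it is not how any particular tool implements inference.

```python
def parsimonious_proteins(protein_to_peptides):
    """Greedy approximation to the smallest protein list explaining all peptides.

    protein_to_peptides: dict mapping protein ID -> set of identified peptides.
    """
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        # Sorting the IDs first makes tie-breaking deterministic.
        best = max(
            sorted(protein_to_peptides),
            key=lambda prot: len(protein_to_peptides[prot] & uncovered),
        )
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen


# Hypothetical mapping of proteins to the peptides they could have produced.
candidates = {
    "P1": {"A", "B"},
    "P2": {"B", "C"},
    "P3": {"C"},
    "P4": {"C", "D"},
}

print(parsimonious_proteins(candidates))
```

Here P2 and P3 are dropped because every peptide they could explain is already explained by P1 or P4. Note that parsimony says they are unnecessary, not that they are absent.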
Protein Scoring - ProteinProphet
Start with a list of peptides and their identification probabilities.
Map peptides to all possible proteins that contain them.
Group proteins that can’t be distinguished - no unique peptides.
Adjust peptide probabilities based on number of siblings.
Assign weights for shared peptides to each protein containing them.
Compute protein probability assuming peptide IDs are independent events.
$$\mathrm{ProteinProb}_i = 1 - \prod_{j=1}^{N} \left(1 - \mathrm{Weight}_{i,j} \cdot \mathrm{PeptideProb}_j\right)$$
Protein Probability = Probability at least one of the peptide IDs was correct
= 1 - Probability all of the IDs were wrong
Nesvizhskii, A. I., Keller, A., Kolker, E., & Aebersold, R. (2003). A statistical model for identifying proteins by tandem mass spectrometry. Analytical chemistry, 75(17), 4646-4658.
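The protein probability formula is easy to evaluate directly. This sketch covers only that final step, taking the weights and adjusted peptide probabilities as given; the function name and the numbers are illustrative assumptions.

```python
import math


def protein_probability(weights, peptide_probs):
    """1 - product over peptides of (1 - weight * peptide probability)."""
    return 1.0 - math.prod(1.0 - w * p for w, p in zip(weights, peptide_probs))


# Two unique peptides (weight 1) with probabilities 0.9 and 0.8:
# 1 - (0.1 * 0.2) = 0.98
two_unique = protein_probability([1.0, 1.0], [0.9, 0.8])

# One peptide shared equally with another protein (weight 0.5):
# 1 - (1 - 0.5 * 0.8) = 0.4
one_shared = protein_probability([0.5], [0.8])

print(round(two_unique, 3), round(one_shared, 3))
```

Two moderately confident unique peptides give a higher protein probability than either alone, while a shared peptide contributes only in proportion to its weight.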
Quantification – Spectral Counts
Now that we have protein IDs, we could do quantification by counting the number of PSMs assigned to each protein.
The number of PSMs produced by a protein is proportional to its abundance
BUT
Longer proteins generate more peptides = more PSMs
Some proteins are just difficult = fewer PSMs than expected
Can compare spectral counts of same protein, or normalize by length, Mw, expected number of peptides etc.
Not good for less-abundant proteins. Low spectral counts = poor comparisons
1 v 2 not as accurate as 10 v 20
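One way to normalize by length, shown here as an example and not as the method the slide prescribes, is the normalized spectral abundance factor (NSAF): divide each protein's spectral count by its length, then divide by the sum of those values over all proteins in the run. The counts and lengths below are made up.

```python
def nsaf(spectral_counts, lengths):
    """Length-normalized spectral counts that sum to 1 within a run.

    spectral_counts: dict protein -> number of PSMs
    lengths        : dict protein -> protein length in residues
    """
    saf = {prot: spectral_counts[prot] / lengths[prot] for prot in spectral_counts}
    total = sum(saf.values())
    return {prot: value / total for prot, value in saf.items()}


# Hypothetical run: the long protein has more PSMs but is four times longer.
counts = {"short_protein": 10, "long_protein": 20}
protein_lengths = {"short_protein": 100, "long_protein": 400}

normalized = nsaf(counts, protein_lengths)
print(normalized)
```

After normalization the short protein ranks as more abundant even though it produced fewer PSMs, which is the length effect described above.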
Astrocyte CompOmics Protein ID Workflow
Uses CompOmics tools, runs on BioHPC Nucleus cluster
See https://astrocyte.biohpc.swmed.edu/brand/biohpc
BioHPC provides a simple workflow to:
• Identify peptides with 3 search engines
• Combine the results
• Perform target-decoy validation
• Export reports
• Download project for inspection in PeptideShaker GUI
The Example Experiment
We want to do an exhaustive and accurate comparison, so we will use SILAC quantification and
fractionate our samples at the peptide level.
Treated Cells and Control Cells: one grown in Normal Growth Medium, the other in Heavy Growth Medium
Lysis (each sample) -> Mix -> Digest -> Fractionate -> 5 MS Runs
REPEAT IN TRIPLICATE
Quantification – SILAC I
Treated Cells
Control Cells
Normal Growth Medium
Heavy Growth Medium
We always see each peptide twice: heavy and light forms
Protein ratio = peptide Heavy to Light ratio
So, let’s find ratios for all peptides in our MS data…..
Quantification – SILAC II
Many scans through time see the same SILAC pair as the peptide elutes
Extract intensity of light and heavy at each scan through time
Find SILAC pair signatures in the MS run
[Plots: SILAC pair in m/z vs. Time; Heavy vs. Light intensity]
Peptide Ratio = Slope
OR
Peptide Ratio = Ratio of area under curves
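Both ways of getting a peptide ratio can be sketched from the extracted intensities. The slope version here fits heavy against light with a least-squares line forced through the origin, which is an assumption about how the slope is taken; the area version uses trapezoidal integration over the elution profile. The elution data are made up, with heavy set to exactly twice light.

```python
def ratio_by_slope(light, heavy):
    """Slope of heavy vs. light intensity, least squares through the origin."""
    return sum(l * h for l, h in zip(light, heavy)) / sum(l * l for l in light)


def ratio_by_area(times, light, heavy):
    """Ratio of areas under the heavy and light elution curves (trapezoid rule)."""
    def area(values):
        return sum(
            (times[i + 1] - times[i]) * (values[i] + values[i + 1]) / 2.0
            for i in range(len(times) - 1)
        )
    return area(heavy) / area(light)


# Hypothetical elution profile sampled at five scans.
scan_times = [0.0, 1.0, 2.0, 3.0, 4.0]
light_intensity = [0.0, 10.0, 40.0, 10.0, 0.0]
heavy_intensity = [0.0, 20.0, 80.0, 20.0, 0.0]

slope_ratio = ratio_by_slope(light_intensity, heavy_intensity)
area_ratio = ratio_by_area(scan_times, light_intensity, heavy_intensity)
print(slope_ratio, area_ratio)
```

On clean data the two agree; on noisy data they differ in how much weight the low-intensity scans at the edges of the peak receive.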
SILAC Pair Finding
From: Cox, J., & Mann, M. (2008). MaxQuant enables high peptide identification rates, individualized ppb-range mass accuracies and proteome-wide protein quantification. Nature biotechnology, 26(12), 1367-1372.
SILAC – Protein Level
In SILAC each peptide gives a ratio estimate for the protein(s) it originates from.
Ratios from multiple peptides can be combined into a protein ratio in various ways.
Simple mean / median of peptide ratios
Weighted mean / median – more abundant peptides contribute more
Find & discard outliers
Plot H vs L peptide areas and perform linear regression
Multiple peptide observations allow estimate of protein ratio error via variability
between peptide ratios.
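A minimal sketch of one of the options above: take the median of the peptide log2 ratios as the protein ratio, and use the standard deviation of those log2 ratios as a rough error estimate. Working in log space and choosing the median are illustrative choices; the peptide ratios are made up.

```python
import math
import statistics


def protein_ratio(peptide_ratios):
    """Combine peptide H/L ratios into (protein ratio, spread of log2 ratios)."""
    logs = [math.log2(r) for r in peptide_ratios]
    center = statistics.median(logs)
    # Spread needs at least two peptides; otherwise it is undefined.
    spread = statistics.stdev(logs) if len(logs) > 1 else float("nan")
    return 2.0 ** center, spread


# Hypothetical protein with five peptides, two of them outlying.
ratio, spread = protein_ratio([1.0, 2.0, 2.0, 2.0, 8.0])
print(ratio, round(spread, 3))
```

The median ignores the two outlying peptides, while the large spread flags that the peptides disagree and the protein ratio deserves less trust.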
MaxQuant
Installed in BioHPC winDCV session (Windows Only)
Uses vendor-specific input files (Thermo RAW)
Identification & Quantitation in one package
Good for SILAC experiments, not recommended for non-quantitative work
Significance Analysis I
Once we have the protein quantitation we can look for meaningful differences between samples.
At this point proteomics data is much like other datasets. You can apply techniques from e.g.
microarray analysis to proteomics data.
Proteomics has a poor reputation for statistical rigor: e.g. many people assume that in a 1:1 mixture
log2 ratios are normally distributed, so anything beyond 2 s.d. is an interesting change:
log2 Protein Ratio
No!
The variance of proteomics measurements is highly dependent on intensity
Significance Analysis II
To make well-grounded decisions we must model the variance of the measurements,
which depends on intensity:
[Plot: Log Intensity (10 to 30) vs. Log Ratio (-2 to 2), with two proteins A and B marked]
A & B have the same ratios between samples
A changes significantly
B does not change significantly
Can use microarray-focused packages, such as LIMMA or plgem for R
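To illustrate why the same ratio can be significant for A and not for B, the sketch below groups proteins into bins of similar intensity and scores each log ratio against the spread within its own bin. This is a crude stand-in, not what LIMMA or plgem do, and it assumes lower-intensity proteins have noisier ratios. The function name, bin size and data are all made up.

```python
import statistics


def intensity_binned_z(log_ratios, log_intensities, bin_size=6):
    """z-score each log ratio against proteins of similar intensity."""
    order = sorted(range(len(log_ratios)), key=lambda i: log_intensities[i])
    z = [0.0] * len(log_ratios)
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        values = [log_ratios[i] for i in members]
        center = statistics.median(values)
        spread = statistics.pstdev(values) or 1.0  # avoid division by zero
        for i in members:
            z[i] = (log_ratios[i] - center) / spread
    return z


# Hypothetical data: six noisy low-intensity proteins, six tight high-intensity ones.
# B (index 1) and A (index 8) both have a log ratio of 1.0.
ratios = [-1.5, 1.0, 1.4, -1.2, 0.3, -0.8,
          0.05, -0.1, 1.0, 0.0, 0.1, -0.05]
intensities = [12.0, 13.0, 14.0, 15.0, 16.0, 17.0,
               24.0, 25.0, 26.0, 27.0, 28.0, 29.0]

z_scores = intensity_binned_z(ratios, intensities)
print(round(z_scores[8], 2), round(z_scores[1], 2))
```

A stands out against its tightly distributed neighbours, while B's identical ratio is unremarkable among noisy low-intensity measurements. A dedicated package models the intensity dependence far more carefully than fixed bins do.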
Introductory Web Resources
Proteome Software Wiki
http://proteome-software.wikispaces.com/Proteomics
http://proteome-software.wikispaces.com/Bioinformatics
CompOmics Tutorials
https://compomics.com/bioinformatics-for-proteomics/
Steen & Steen Lab @ Harvard
http://www.childrenshospital.org/cfapps/research/data_admin/Site602/mainpageS602P0.html
What do you want to do on BioHPC?
We have installed various software.
Is there anything else you need? What analyses do you want to do on BioHPC?
• File Conversion: ProteoWizard / msConvert (winDCV)
• Peptide Identification Search Engines: X!Tandem, OMSSA, MSGF+, Comet, SearchGUI
• Postprocessing Tools: PeptideShaker, Trans-Proteomics Pipeline
• Quantitative Proteomics: MaxQuant (winDCV)
• Downstream Statistics: PeptideShaker, Perseus (winDCV), R, Python