
Overview - MS Proteomics in One Slide


Obtain protein -> Digest into peptides -> Acquire spectra in mass spectrometer

MS: masses of peptides

MS/MS: fragments of a peptide

Match to sequence database

Results!

But it’s more complex than that….


Most things are more difficult than for genomics / transcriptomics….

• No amplification – we only ever lose signal

• Can’t sequence peptides as sensitively as DNA/RNA

• More complications – peptides can be modified in many, many ways

• Mapping spectra -> peptides -> proteins is not as easy as mapping reads to transcripts

• Implications for quantification

Data Analysis Challenges & Solutions

Let’s look at data analysis, using an example experiment….

We have a cancer cell line. We treated it with secret compound Z.

We want to know what effect Z has on the proteome of the cells.

What proteins are in the samples?

Which proteins significantly change in amount between the samples?

This will be a discovery, shotgun proteomics experiment.

Treated Cells

Control Cells

Peptide LC-MS


https://commons.wikimedia.org/wiki/File:Mass_spectrometry_protocol.png

Optional separation

An LC MS System


Oxford TDI Proteomics Core



MSMS Fragmentation


A Real Spectrum


Peptide from Keratin – a common contaminant!

Great, what do we get out of the machine?

15 RAW data files in vendor format (5 x triplicate runs)

Approx 30GB of raw data

500,000 – 1,000,000 spectra

Formats, formats, formats

Vendor Raw Data Formats: Thermo .RAW, Bruker .yep / .baf, Waters .RAW, Agilent .d, AB Sciex .wiff

Converted with: Vendor Software, Academic Converters, Bespoke Scripts

Generic Flexible Formats: mzML, mzXML, mzData

Fragment Spectra Peak Lists: PKL, MGF, DTA, MS2

Converters and BioHPC


BioHPC cannot install license-restricted mass-spec vendor software, so we need to use open formats.

To run protein ID analyses on BioHPC we recommend obtaining MGF format data.

Ask your proteomics core, or use ProteoWizard.

http://proteowizard.sourceforge.net/
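To show what MGF data looks like once converted, here is a minimal reader sketch. It assumes the common MGF layout (BEGIN IONS / END IONS blocks, KEY=value headers, then "m/z intensity" peak lines). The function name `read_mgf` and the example spectrum are made up for illustration; real files may carry extra fields this sketch ignores.

```python
from io import StringIO


def read_mgf(handle):
    """Yield one dict per spectrum from an MGF-formatted text stream."""
    spectrum = None
    for raw in handle:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            spectrum = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
            spectrum = None
        elif spectrum is not None:
            if "=" in line and not line[0].isdigit():
                # Header line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                spectrum["params"][key.upper()] = value
            else:
                # Peak line: m/z followed by intensity
                fields = line.split()
                spectrum["peaks"].append((float(fields[0]), float(fields[1])))


# Invented two-peak spectrum, for illustration only.
EXAMPLE_MGF = """BEGIN IONS
TITLE=example scan 1
PEPMASS=579.29 12345.0
CHARGE=2+
175.119 1200.5
276.166 850.0
END IONS
"""

spectra = list(read_mgf(StringIO(EXAMPLE_MGF)))
print(spectra[0]["params"]["PEPMASS"], len(spectra[0]["peaks"]))
```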

Peptide Identification

Identify peptides by matching experimental spectra to theoretical ones from a protein sequence database.

Database Search Engine

Sequence Database

Input Spectrum

e.g. UniProtKB

HQGVMVGMGQK

Score: 46

A Peptide-to-Spectrum Match (PSM)
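The "theoretical spectrum" idea can be sketched as follows: compute singly charged b- and y-ion m/z values for a candidate peptide from standard monoisotopic residue masses, then count how many fall within a tolerance of observed peaks. The function names and the 0.02 tolerance are illustrative assumptions; real search engines use far more elaborate scoring and also handle modifications and higher charge states.

```python
# Standard monoisotopic residue masses (Da) for the 20 common amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as singly charged m/z lists for an unmodified peptide."""
    masses = [MONO[aa] for aa in peptide]
    n = len(masses)
    # b_i: first i residues plus a proton
    b = [sum(masses[:i]) + PROTON for i in range(1, n)]
    # y_i: last i residues plus water plus a proton
    y = [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]
    return b, y


def count_matches(observed_mz, theoretical_mz, tol=0.02):
    """Count theoretical fragments with an observed peak within +/- tol (hypothetical tolerance)."""
    return sum(any(abs(o - t) <= tol for o in observed_mz) for t in theoretical_mz)


b_ions, y_ions = fragment_ions("HQGVMVGMGQK")
print([round(mz, 4) for mz in y_ions[:3]])
```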

SearchGUI


How to run searches easily?

Many tools, each with its own command line and parameter formats.

Use CompOmics SearchGUI

Installed on BioHPC

Installed search engines:

• X! Tandem

• MS-GF+

• OMSSA

• Comet

http://compomics.github.io/projects/searchgui.html

PSM Scoring

Making scores more meaningful…

Mascot Score = 46

Xcorr = 2.43

HyperScore = 0.9844

Every search engine uses a different scoring algorithm.

Rules for calling a good ID have evolved, but may not be based on good evidence.

Very hard to compare or combine results.

Can we transform them into something more useful?

PSM Re-Scoring

Take score(s) from the search engine

Map to a standard scale

Fit distributions for true and false IDs

Obtain probability for score x.

Keller, A., Nesvizhskii, A. I., Kolker, E., & Aebersold, R. (2002). Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Analytical chemistry, 74(20), 5383-5392.
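The last step above, turning a score x into a probability, is Bayes' rule applied to the two fitted distributions. The sketch below uses two normal distributions with fixed, made-up parameters purely for illustration; the cited approach estimates the distributions from the data rather than hard-coding them, and the function names here are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def posterior_correct(score, prior_correct, correct=(4.0, 1.0), incorrect=(0.0, 1.0)):
    """P(correct | score) for a two-component mixture.

    `correct` and `incorrect` are (mean, sd) pairs on the standardised score
    scale; the default values are invented for this example.
    """
    p_true = prior_correct * normal_pdf(score, *correct)
    p_false = (1 - prior_correct) * normal_pdf(score, *incorrect)
    return p_true / (p_true + p_false)


for x in (1.0, 2.0, 3.0, 5.0):
    print(x, round(posterior_correct(x, prior_correct=0.3), 3))
```

With equal spreads the probability rises smoothly with score; the prior (the overall fraction of correct IDs) shifts where the 50% point falls.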

Target-Decoy Method

Are our probabilities really accurate?

Elias, J. E., & Gygi, S. P. (2007). Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nature methods, 4(3), 207-214.

Fake Sequences

DECOYS

Real Sequences

TARGETS

A correct match is always to a real sequence.

An incorrect random match is equally likely to be to a target or a decoy sequence.

Estimate the number of incorrect target matches by counting the decoy matches.

An empirical estimate of the False Discovery Rate (FDR).
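The counting argument above can be sketched in a few lines. This assumes a concatenated target-decoy search with equally sized target and decoy databases, and uses decoys / targets as the estimate, which is one common convention among several. The PSM list is invented.

```python
def estimate_fdr(psms, threshold):
    """Estimate FDR among target PSMs scoring at or above `threshold`.

    `psms` is a list of (score, is_decoy) pairs. Each decoy hit above the
    threshold is taken to stand for one incorrect target hit.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0  # nothing reported, so nothing to be wrong about
    return decoys / targets


# Invented example: (score, is_decoy)
example_psms = [(50, False), (45, False), (40, True), (35, False), (30, False), (20, True)]
print(estimate_fdr(example_psms, threshold=30))
```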


Filtering PSMs

A large experiment produces > 100,000 PSMs.

No way to manually inspect each one!

We usually report PSMs filtered to a specific False Discovery Rate.

1% FDR is most common.

Matches with post-translational modifications require special treatment.
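Filtering to a chosen FDR is often done with q-values: for each PSM, the lowest FDR at which it would still be accepted. A minimal sketch, assuming (score, is_decoy) pairs from a target-decoy search and ignoring tied scores and PTM-specific handling; the function names are hypothetical.

```python
def q_values(psms):
    """Return (score, is_decoy, q_value) tuples, best score first."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))
    # q-value = smallest FDR seen at this cutoff or any more permissive one
    q = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q.append(running_min)
    q.reverse()
    return [(score, is_decoy, qv) for (score, is_decoy), qv in zip(ranked, q)]


def filter_psms(psms, max_fdr=0.01):
    """Keep target PSMs whose q-value is at or below `max_fdr`."""
    return [(score, qv) for score, is_decoy, qv in q_values(psms) if not is_decoy and qv <= max_fdr]


# Invented example: (score, is_decoy)
example_psms = [(50, False), (45, False), (40, True), (35, False), (30, False), (20, True)]
print(filter_psms(example_psms, max_fdr=0.01))
```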

PeptideShaker


How to do all the combining and filtering easily?

Use CompOmics PeptideShaker

Installed on BioHPC

http://compomics.github.io/projects/peptide-shaker.html

Protein Inference I

We don’t identify proteins – we only identify peptides!

Peptides could come from one or more proteins – how do we resolve this?

[Figure: identified peptides A to E linked to candidate proteins 1 to 5; proteins are marked Present or Not Present]

Present: Peptide A is uniquely assigned

??? All peptides are shared

Protein Inference II

(Very) Naïve rules

e.g. If protein is identified with 2 unique peptides it is present

Parsimony

The smallest list of proteins that can explain the peptides identified is the most likely.

Minimal set cover / minimal partial set cover etc.

Bayesian Models

Consider probabilities that proteins produced peptides and spectra.

Prior information – probability peptide x can be observed by MS etc.

Correlation with other sources

e.g. did RNA-Seq on the same sample find mRNA for the protein?
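The parsimony idea can be sketched with a greedy set cover: repeatedly pick the protein that explains the most still-unexplained peptides. Greedy selection is only an approximation and is not guaranteed to find the true minimal set. The peptide-to-protein mapping below is invented and is not the one from the figure.

```python
def parsimonious_proteins(protein_to_peptides):
    """Greedy set cover: return a short list of proteins explaining all peptides.

    `protein_to_peptides` maps protein name -> set of identified peptides.
    Ties are broken alphabetically by protein name.
    """
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        best = max(
            sorted(protein_to_peptides),
            key=lambda prot: len(protein_to_peptides[prot] & uncovered),
        )
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen


# Invented mapping for illustration.
example = {
    "P1": {"A", "B"},
    "P2": {"B", "C"},
    "P3": {"C", "D", "E"},
    "P4": {"D"},
    "P5": {"E"},
}
print(parsimonious_proteins(example))
```

Here two proteins are enough to explain all five peptides; the other three are not needed, though the data cannot prove they are absent.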

Protein Scoring - ProteinProphet

Start with a list of peptides and their identification probabilities.

Map peptides to all possible proteins that contain them.

Group proteins that can’t be distinguished - no unique peptides.

Adjust peptide probabilities based on number of siblings.

Assign weights for shared peptides to each protein containing them.

Compute protein probability assuming peptide IDs are independent events.

$$\mathrm{ProteinProb}_i = 1 - \prod_{j=1}^{N} \left(1 - \mathrm{Weight}_{i,j}\,\mathrm{PeptideProb}_j\right)$$

Protein Probability = Probability at least one of the peptide IDs was correct = 1 - Probability all of the IDs were wrong

Nesvizhskii, A. I., Keller, A., Kolker, E., & Aebersold, R. (2003). A statistical model for identifying proteins by tandem mass spectrometry. Analytical chemistry, 75(17), 4646-4658.
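The final step, combining peptide probabilities into a protein probability, is small enough to write out directly. This sketch covers only that product formula; the sibling adjustment and the iterative weight estimation are not shown, and the function name and numbers are illustrative.

```python
def protein_probability(peptide_probs, weights=None):
    """1 - product over peptides of (1 - weight * peptide probability).

    `weights` gives the share of each peptide assigned to this protein
    (1.0 for a unique peptide); it defaults to all ones.
    """
    if weights is None:
        weights = [1.0] * len(peptide_probs)
    all_wrong = 1.0
    for weight, prob in zip(weights, peptide_probs):
        all_wrong *= 1.0 - weight * prob
    return 1.0 - all_wrong


# Two unique peptides with probabilities 0.9 and 0.8
print(round(protein_probability([0.9, 0.8]), 4))
# Same peptides, but the second is shared and only half-assigned to this protein
print(round(protein_probability([0.9, 0.8], weights=[1.0, 0.5]), 4))
```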

Quantification – Spectral Counts

Now we have protein IDs we could do quantification by counting the number of PSMs assigned to each protein.

The number of PSMs produced by a protein is proportional to its abundance

BUT

Longer proteins generate more peptides = more PSMs

Some proteins are just difficult = fewer PSMs than expected

Can compare spectral counts of same protein, or normalize by length, Mw, expected number of peptides etc.

Not good for less-abundant proteins. Low spectral counts = poor comparisons

1 v 2 not as accurate as 10 v 20
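As one example of the normalization mentioned above, spectral counts can be divided by protein length and rescaled so the values sum to 1 within a sample. This is a sketch in the spirit of length-normalized abundance factors; the function name and numbers are made up, and it assumes at least one protein has a non-zero count.

```python
def length_normalized_counts(spectral_counts, lengths):
    """Divide each protein's PSM count by its length, then scale to sum to 1.

    `spectral_counts` and `lengths` are dicts keyed by protein name.
    """
    per_length = {prot: spectral_counts[prot] / lengths[prot] for prot in spectral_counts}
    total = sum(per_length.values())
    return {prot: value / total for prot, value in per_length.items()}


# Invented example: equal counts, but P2 is twice as long as P1
counts = {"P1": 20, "P2": 20}
lengths = {"P1": 200, "P2": 400}
print(length_normalized_counts(counts, lengths))
```

With equal raw counts, the shorter protein comes out as the more abundant one, which is the correction the length bias calls for.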

Astrocyte CompOmics Protein ID Workflow


Uses CompOmics tools, runs on BioHPC Nucleus cluster

See https://astrocyte.biohpc.swmed.edu/brand/biohpc

BioHPC provides a simple workflow to:

• Identify peptides with 3 search engines

• Combine the results

• Perform target-decoy validation

• Export reports

• Download project for inspection in PeptideShaker GUI

The Example Experiment

We want to do an exhaustive and accurate comparison, so we will use SILAC quantification and fractionate our samples at the peptide level.

[Figure: Treated Cells and Control Cells, grown in Normal Growth Medium and Heavy Growth Medium -> Lysis -> Mix -> Digest -> Fractionate -> 5 MS Runs]

REPEAT IN TRIPLICATE

Quantification – SILAC I

[Figure: Treated Cells and Control Cells, grown in Normal Growth Medium and Heavy Growth Medium]

We always see each peptide twice: heavy and light forms

Protein ratio = peptide Heavy to Light ratio

So, let’s find ratios for all peptides in our MS data…..

Quantification – SILAC II

[Figure: a SILAC pair (Light and Heavy) seen over m/z and Time, with Heavy intensity plotted against Light intensity]

Find SILAC pair signatures in the MS run

Many scans through time see the same SILAC pair as the peptide elutes

Extract intensity of light and heavy at each scan through time

Peptide Ratio = Slope

OR

Peptide Ratio = Ratio of area under curves
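Both ratio estimates can be sketched from the per-scan intensities. The slope version here fits a least-squares line through the origin for heavy against light, which is an assumption made for this example; the area version uses trapezoidal integration over elution time. The function names and the elution profile are invented.

```python
def ratio_by_slope(light, heavy):
    """Heavy/light ratio as the slope of a through-origin least-squares fit."""
    return sum(l * h for l, h in zip(light, heavy)) / sum(l * l for l in light)


def ratio_by_area(times, light, heavy):
    """Heavy/light ratio as the ratio of areas under the two elution curves."""
    def area(values):
        # Trapezoid rule over consecutive scans
        return sum(
            (t1 - t0) * (v0 + v1) / 2
            for t0, t1, v0, v1 in zip(times, times[1:], values, values[1:])
        )
    return area(heavy) / area(light)


# Invented elution profile: heavy is exactly twice light at every scan
times = [0.0, 1.0, 2.0, 3.0, 4.0]
light = [0.0, 100.0, 400.0, 100.0, 0.0]
heavy = [2 * value for value in light]
print(ratio_by_slope(light, heavy), ratio_by_area(times, light, heavy))
```

On clean data the two agree; on noisy data the slope gives more weight to the intense scans near the apex of the peak.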

SILAC Pair Finding

From: Cox, J., & Mann, M. (2008). MaxQuant enables high peptide identification rates, individualized ppb-range mass accuracies and proteome-wide protein quantification. Nature biotechnology, 26(12), 1367-1372.

SILAC – Protein Level

In SILAC each peptide gives a ratio estimate for the protein(s) it originates from.

Ratios from multiple peptides can be combined into a protein ratio in various ways.

Simple mean / median of peptide ratios

Weighted mean / median – more abundant peptides contribute more

Find & discard outliers

Plot H vs L peptide areas and perform linear regression

Multiple peptide observations allow an estimate of protein ratio error via the variability between peptide ratios.
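Two of the combination options listed above, sketched on invented numbers. Working in log2 space is a choice made here so that 2-fold up and 2-fold down are treated symmetrically; the function names are hypothetical.

```python
import math
import statistics


def protein_ratio(peptide_ratios):
    """Median peptide ratio (in log2 space) plus the spread between peptides.

    Returns (ratio, log2_standard_deviation); the spread is NaN when only
    one peptide is available.
    """
    logs = [math.log2(ratio) for ratio in peptide_ratios]
    center = statistics.median(logs)
    spread = statistics.stdev(logs) if len(logs) > 1 else float("nan")
    return 2 ** center, spread


def weighted_protein_ratio(peptide_ratios, intensities):
    """Intensity-weighted mean of log2 peptide ratios, returned as a ratio."""
    total = sum(intensities)
    mean_log = sum(w * math.log2(r) for r, w in zip(peptide_ratios, intensities)) / total
    return 2 ** mean_log


# Invented heavy/light ratios for four peptides of one protein
print(protein_ratio([2.0, 2.0, 4.0, 1.0]))
# Invented case where the more intense peptide dominates
print(weighted_protein_ratio([2.0, 4.0], intensities=[3.0, 1.0]))
```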

MaxQuant


Installed in BioHPC winDCV session (Windows only)

Uses vendor-specific input files (Thermo RAW)

Identification & quantitation in one package

Good for SILAC experiments, not recommended for non-quantitative work

Significance Analysis I

Once we have the protein quantitation we can look for meaningful differences between samples.

At this point proteomics data is much like other datasets. You can apply techniques from e.g. microarray analysis to proteomics data.

Proteomics has a poor reputation for statistical rigor: e.g. many people assume that in a 1:1 mixture log2 ratios are normally distributed, and that anything beyond 2 s.d. is an interesting change:

log2 Protein Ratio

No!

The variance of proteomics measurements is highly dependent on intensity

Significance Analysis II

To make well-grounded decisions we must model the variance of the measurements, which depends on intensity:

[Figure: Log Intensity plotted against Log Ratio, with two proteins A and B highlighted]

A & B have the same ratios between samples

A changes significantly

B does not change significantly

Can use microarray-focused packages, such as LIMMA or plgem for R
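To illustrate why the same ratio can be significant for one protein and not another, the sketch below scores each log ratio against the spread of proteins with similar intensity. It is a crude stand-in, not what LIMMA or plgem do internally: proteins are sorted by intensity, cut into fixed-size bins, and given a robust z-score (median and scaled MAD) within their bin. All data are invented, and placing A at high intensity and B at low intensity is an assumption for the example.

```python
import statistics


def robust_z_by_intensity(proteins, bin_size=6):
    """Robust z-score of each log ratio relative to proteins of similar intensity.

    `proteins` is a list of (name, log_intensity, log_ratio) tuples.
    """
    ranked = sorted(proteins, key=lambda p: p[1])
    z_scores = {}
    for start in range(0, len(ranked), bin_size):
        group = ranked[start:start + bin_size]
        ratios = [ratio for _, _, ratio in group]
        center = statistics.median(ratios)
        mad = statistics.median([abs(r - center) for r in ratios])
        spread = 1.4826 * mad  # scales MAD to match s.d. for normal data
        for name, _, ratio in group:
            z_scores[name] = (ratio - center) / spread if spread > 0 else 0.0
    return z_scores


# Invented data: noisy low-intensity proteins, tight high-intensity proteins.
data = [
    ("p1", 12.0, -1.5), ("p2", 12.5, 1.2), ("p3", 13.0, -0.8),
    ("p4", 13.5, 0.9), ("p5", 14.0, -1.1), ("B", 14.5, 1.0),
    ("p6", 26.0, -0.1), ("p7", 26.5, 0.1), ("p8", 27.0, 0.05),
    ("p9", 27.5, -0.05), ("p10", 28.0, 0.0), ("A", 28.5, 1.0),
]
z = robust_z_by_intensity(data)
print(round(z["A"], 2), round(z["B"], 2))
```

A and B both have a log ratio of 1.0, but only A stands out from its intensity neighbours.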

Introductory Web Resources

Proteome Software Wiki

http://proteome-software.wikispaces.com/Proteomics

http://proteome-software.wikispaces.com/Bioinformatics

CompOmics Tutorials

https://compomics.com/bioinformatics-for-proteomics/

Steen & Steen Lab @ Harvard

http://www.childrenshospital.org/cfapps/research/data_admin/Site602/mainpageS602P0.html


What do you want to do on BioHPC?


We have installed various software. Is there anything else you need? What analyses do you want to do on BioHPC?

• File Conversion: ProteoWizard / msConvert (winDCV)

• Peptide Identification Search Engines: X! Tandem, OMSSA, MS-GF+, Comet, SearchGUI

• Postprocessing Tools: PeptideShaker, Trans-Proteomic Pipeline

• Quantitative Proteomics: MaxQuant (winDCV)

• Downstream Statistics: PeptideShaker, Perseus (winDCV), R, Python
