a current perspective on using r and bioconductor for proteomics data analysis · 2015-12-04 ·...

A current perspective on using R and Bioconductorfor proteomics data analysis

S. Gibb1,3, LM. Breckels1,2, T. Lin Pedersen4, VA. Petyuk5, KS. Lilley2 and L. Gatto1,2,∗

1Computational Proteomics Unit and 2Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, UK3Department of Anesthesiology and Intensive Care, Medical Faculty Carl Gustav Carus, Technical University Dresden, Germany

4Chr. Hansen A/S, Hørsholm / Technical University of Denmark, Kgs. Lyngby, Denmark5Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, USA

∗[email protected] http://cpu.sysbiol.cam.ac.uk

IntroductionThe R statistical environment and programming language is a key player in manydomains that require robust data analysis. The Bioconductor project offer awide range of R packages dedicated to the analysis and comprehension of highthroughput biology. Originally focused on genomics, Bioconductor is gainingincreasing attention in the proteomics, metabolomics and mass spectrometrycommunities, as reflected by the download statistics and package contributions.

010

2030

4050

Num

ber

of B

ioco

nduc

tor

pack

ages

2.6 2.7 2.8 2.9 2.10 2.11 2.12 2.13 2.14

Bioconductor Versions

ProteomicsMassSpectrometryMassSpectrometryData

010

000

2000

030

000

4000

050

000

Pac

kage

dow

nloa

ds

2.6 2.7 2.8 2.9 2.10 2.11 2.12 2.13 2.14

Bioconductor Versions

ProteomicsMassSpectrometry

Figure 1 : The left figure shows the number of Bioconductor packages dedicated to Proteomics,Mass Spectrometry and Mass Spectrometry Data. On the right figure, we show the number ofdistinct package downloads. Note that the current version, 2.14 is still downloaded. (NB: the datafor Bioc 2.12 are interpolated due to massive scripted downloads).

Here, we present an overview of current Bioconductor infrastructure dedicated toproteomics and mass spectrometry. These software packages tightlyinteroperate, providing well defined pipeline workflows and a flexible andin-depth development environment.

Working with raw data

The proteomics community has developed a range of data standards and formatsfor MS data (e.g. mzML, mzIdentML) to overcome the shortcomings of closed,binary vendor-specific formats.

One of the main projects that implement parsers for the XML-based open formats isthe C++ proteowizard project, which is interfaced by the mzR Bioconductor packageusing the Rcpp infrastructure for fast raw and (starting with Bioconductor version3.0) identification data. mzIdentML files can also be parsed with the mzID package.

library("mzR")

ms <- openMSfile("raw_data.raw")

id <- openIDfile("msgf-res.mzid")

library("mzID")

id2 <- mzID("msgf-res2.mzid")

The resulting ms object is a file handle that allows fast access to the individual spec-tra. mzR is used by a variety of other packages like xcms, MSnbase, RMassBankand TargetSearch.

IdentificationR/Bioconductor provide software to parse mzIdentML files (mzID and mzR, seeabove), directly run identification R (rTANDEM, MSGFplus) and optimise FDRcalculations based on decoy searches. Below, we use MSnID to define identifica-tion searches filters using precursor mass error, identification scores and missedcleavage to minimise false discovery rates.

(id <- apply_filter(id, filter))

: MSnID object

: Working directory: "."

: #Spectrum Files: 24

: #PSMs: 514638 at 0.042 % FDR

: #peptides: 56546 at 0.11 % FDR

: #accessions: 30944 at 0.56 % FDR

QuantitationSeveral quantitation pipelines are supported: MSnbase and isobar for isobarictagging such as iTRAQ and TMT, MALDIquant for MALDI data, MSnbase forspectra counting and other MS2 label free methods and synapter supportsa DIA complete pipeline, including ion mobility separation on Waters Synaptinstruments.

MS data processing

MS spectrum processing is available in multiple packages, optimised for specificuse cases and pipelines: MSnbase, xcms, MALDIquant.

2000 4000 6000 8000 10000

A

2000 4000 6000 8000 10000

B

● ●●●●●

●

●●●●●

●● ●●● ● ●●

●

●●

●

●

● ●●

● ● ●

2000 4000 6000 8000

−2

−1

01

23

4 C

4180 4190 4200 4210 4220 4230 4240

D

4180 4190 4200 4210 4220 4230 4240

E

2000 4000 6000 8000 10000

F

1206.796

1466.027

1545.929

3262.745

4209.948

4644.373

5336.8715904.68

7766.411 9290.533

Figure 2 : Illustration of the MALDIquant preprocessing pipeline. The first figure (A) shows a rawspectrum with the estimated baseline. In the second figure (B) the spectrum is variance-stabilised,smoothed, baseline-corrected and the detected peaks are marked with blue crosses. The third figure(C) is an example of a fitted warping function for peak alignment. In the next figures, four peaksare shown before (D) and after (E) performing the alignment. The last figure (F) represents amerged spectrum with discovered and labelled peaks.

VisualisationUsing R for high quality programmable figures and interactive graphical userinterfaces such as https://lgatto.shinyapps.io/shinyMA/.

Retention time

M/Z

521

521.22

521.44

521.665

521.885

522.11

522.33

522.555

522.775

523

30:1 30:33 31:5 31:37 32:9 32:45 33:17 33:49 34:22 34:58

3

4

5

6

7

8

521.0

521.5

522.0

522.5

523.0

31

32

33

34

5.0e+07

1.0e+08

1.5e+08

M/ZRetention time200

400

600

800

1000

30.02

30.03

30.04

30.05

30.06

30.07

0.0e+00

5.0e+07

1.0e+08

1.5e+08

2.0e+08

2.5e+08

3.0e+08

M/ZRetention time

0 500 1000 1500 2000 2500 3000 3500

020

4060

8010

0

Time (sec)

TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXMLTIC: 7.22e+09

400 500 600 700 800 900 1000

0.0e

+00

1.5e

+08

3.0e

+08

peaks(ms, i)[,1]

Acquisition 2807Retention time 30:1

521.0 521.5 522.0 522.5

peaks(ms, i)[,1]

peak

s(m

s, i)

[,2]

200 400 600 800 1000 1200 1400

0.0e

+00

4.0e

+07

8.0e

+07

1.2e

+08

Prec M/Z521.34

200 400 600 800 1000 1200

0e+

001e

+06

2e+

063e

+06

4e+

06

Prec M/Z659.37

200 400 600 800

0e+

001e

+06

2e+

063e

+06

4e+

06

Prec M/Z540.36

200 400 600 800 1000

050

0000

1000

000

1500

000

2000

000 Prec M/Z

592.85

100 200 300 400 500 600 700

0e+

002e

+06

4e+

066e

+06

Prec M/Z416.31

200 400 600 800 1000

050

0000

1500

000

2500

000

3500

000

Prec M/Z621.87

200 400 600 800 1000 1200 1400

050

0000

1500

000

2500

000

Prec M/Z781.91

100 200 300 400 500 600 700 800

050

0000

1500

000

2500

000

Prec M/Z476.82

100 200 300 400 500 600 700

050

000

1000

0020

0000

3000

00

Prec M/Z434.77

200 400 600 800 1000

0e+

002e

+05

4e+

056e

+05

Prec M/Z647.37

Reporter ions

Inte

nsity

0.1

0.2

0.3

0.4

0.1

0.2

0.3

0.4

max

●

●

●

●

● ●

●

●

●

●

● ●

● ●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

1.0 1.5 2.0 2.5 3.0 3.5 4.0

trap

●

●

●

●

●●

●

●

●

●

●●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

1.0 1.5 2.0 2.5 3.0 3.5 4.0

K.A

EF

VE

VT

K.L

K.H

LVD

EP

QN

LIK.Q

Figure 3 : Visualising raw data and process quantitative data (top) and interactive graphicalinterfaces (bottom) from the pRolocGUI and MSGFgui packages.

Data processing, statistics and machine learning

As an environment for statistical computing, the very best of data processing,statistical modelling and machine learning is readily available in packages suchas MSnbase, isobar, MSstats, msmsTests.

MSstats offers set of tools for statistical relative protein significance analysis inDDA, SRM and DIA experiments as well as power calculation. msmsTests lever-ages powerful statistical modelling from the edgeR package to analyse spectralcounting data.

ConclusionA wide range of other applications such as post-translational modifications(isobar), quality control (qcmetrics and proteoQC) or spatial proteomics(pRoloc) are available as Bioconductor packages. A complete list is availableon the Bioconductor page and in the RforProteomics package. Contributionsand discussions are welcome on our Google group and GitHub wiki.

The R project http://www.r-project.org/

Bioconductor http://bioconductor.org/

RforProteomics http://is.gd/R4Proteomics

Google group https://is.gd/rbioc_sig_proteomics

Support site https://support.bioconductor.org

This work was supported by the European Union 7th Framework Program PRIME-XS

project and a BBSRC Tools and Resources Development Fund.

HUPO meeting 5 – 8 October 2014 Madrid

[email protected]

http://cpu.sysbiol.cam.ac.uk

https://lgatto.shinyapps.io/shinyMA/

http://www.r-project.org/

http://bioconductor.org/

http://is.gd/R4Proteomics

https://is.gd/rbioc_sig_proteomics

https://support.bioconductor.org

a current perspective on using r and bioconductor for proteomics data analysis · 2015-12-04 ·...

Documents