a current perspective on using r and bioconductor for proteomics data analysis · 2015-12-04 ·...

1
A current perspective on using R and Bioconductor for proteomics data analysis S. Gibb 1,3 , LM. Breckels 1,2 , T. Lin Pedersen 4 , VA. Petyuk 5 , KS. Lilley 2 and L. Gatto 1,2,* 1 Computational Proteomics Unit and 2 Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, UK 3 Department of Anesthesiology and Intensive Care, Medical Faculty Carl Gustav Carus, Technical University Dresden, Germany 4 Chr. Hansen A/S, Hørsholm / Technical University of Denmark, Kgs. Lyngby, Denmark 5 Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, USA * [email protected] http://cpu.sysbiol.cam.ac.uk Introduction The R statistical environment and programming language is a key player in many domains that require robust data analysis. The Bioconductor project offer a wide range of R packages dedicated to the analysis and comprehension of high throughput biology. Originally focused on genomics, Bioconductor is gaining increasing attention in the proteomics, metabolomics and mass spectrometry communities, as reflected by the download statistics and package contributions. 0 10 20 30 40 50 Number of Bioconductor packages 2.6 2.7 2.8 2.9 2.10 2.11 2.12 2.13 2.14 Bioconductor Versions Proteomics MassSpectrometry MassSpectrometryData 0 10000 20000 30000 40000 50000 Package downloads 2.6 2.7 2.8 2.9 2.10 2.11 2.12 2.13 2.14 Bioconductor Versions Proteomics MassSpectrometry Figure 1 : The left figure shows the number of Bioconductor packages dedicated to Proteomics, Mass Spectrometry and Mass Spectrometry Data. On the right figure, we show the number of distinct package downloads. Note that the current version, 2.14 is still downloaded. (NB: the data for Bioc 2.12 are interpolated due to massive scripted downloads). Here, we present an overview of current Bioconductor infrastructure dedicated to proteomics and mass spectrometry. These software packages tightly interoperate, providing well defined pipeline workflows and a flexible and in-depth development environment. Working with raw data The proteomics community has developed a range of data standards and formats for MS data (e.g. mzML, mzIdentML) to overcome the shortcomings of closed, binary vendor-specific formats. One of the main projects that implement parsers for the XML-based open formats is the C++ proteowizard project, which is interfaced by the mzR Bioconductor package using the Rcpp infrastructure for fast raw and (starting with Bioconductor version 3.0) identification data. mzIdentML files can also be parsed with the mzID package. library("mzR") ms <- openMSfile("raw_data.raw") id <- openIDfile("msgf-res.mzid") library("mzID") id2 <- mzID("msgf-res2.mzid") The resulting ms object is a file handle that allows fast access to the individual spec- tra. mzR is used by a variety of other packages like xcms, MSnbase, RMassBank and TargetSearch. Identification R/Bioconductor provide software to parse mzIdentML files (mzID and mzR, see above), directly run identification R (rTANDEM, MSGFplus) and optimise FDR calculations based on decoy searches. Below, we use MSnID to define identifica- tion searches filters using precursor mass error, identification scores and missed cleavage to minimise false discovery rates. (id <- apply_filter(id, filter)) : MSnID object : Working directory: "." : #Spectrum Files: 24 : #PSMs: 514638 at 0.042 % FDR : #peptides: 56546 at 0.11 % FDR : #accessions: 30944 at 0.56 % FDR Quantitation Several quantitation pipelines are supported: MSnbase and isobar for isobaric tagging such as iTRAQ and TMT, MALDIquant for MALDI data, MSnbase for spectra counting and other MS 2 label free methods and synapter supports a DIA complete pipeline, including ion mobility separation on Waters Synapt instruments. MS data processing MS spectrum processing is available in multiple packages, optimised for specific use cases and pipelines: MSnbase, xcms, MALDIquant. 2000 4000 6000 8000 10000 A 2000 4000 6000 8000 10000 B ●● ●● ●●● ●● 2000 4000 6000 8000 -2 -1 0 1 2 3 4 C 4180 4190 4200 4210 4220 4230 4240 D 4180 4190 4200 4210 4220 4230 4240 E 2000 4000 6000 8000 10000 F 1206.796 1466.027 1545.929 3262.745 4209.948 4644.373 5336.871 5904.68 7766.411 9290.533 Figure 2 : Illustration of the MALDIquant preprocessing pipeline. The first figure (A) shows a raw spectrum with the estimated baseline. In the second figure (B) the spectrum is variance-stabilised, smoothed, baseline-corrected and the detected peaks are marked with blue crosses. The third figure (C) is an example of a fitted warping function for peak alignment. In the next figures, four peaks are shown before (D) and after (E) performing the alignment. The last figure (F) represents a merged spectrum with discovered and labelled peaks. Visualisation Using R for high quality programmable figures and interactive graphical user interfaces such as https://lgatto.shinyapps.io/shinyMA/. Retention time M/Z 521 521.22 521.44 521.665 521.885 522.11 522.33 522.555 522.775 523 30:1 30:33 31:5 31:37 32:9 32:45 33:17 33:49 34:22 34:58 3 4 5 6 7 8 521.0 521.5 522.0 522.5 523.0 31 32 33 34 5.0e+07 1.0e+08 1.5e+08 M/Z Retention time 200 400 600 800 1000 30.02 30.03 30.04 30.05 30.06 30.07 0.0e+00 5.0e+07 1.0e+08 1.5e+08 2.0e+08 2.5e+08 3.0e+08 M/Z Retention time 0 500 1000 1500 2000 2500 3000 3500 0 20 40 60 80 100 Time (sec) TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML TIC: 7.22e+09 400 500 600 700 800 900 1000 0.0e+00 1.5e+08 3.0e+08 peaks(ms, i)[,1] Acquisition 2807 Retention time 30:1 521.0 521.5 522.0 522.5 peaks(ms, i)[,1] 200 400 600 800 1000 1200 1400 0.0e+00 4.0e+07 8.0e+07 1.2e+08 Prec M/Z 521.34 200 400 600 800 1000 1200 0e+00 1e+06 2e+06 3e+06 4e+06 Prec M/Z 659.37 200 400 600 800 0e+00 1e+06 2e+06 3e+06 4e+06 Prec M/Z 540.36 200 400 600 800 1000 0 500000 1000000 1500000 2000000 Prec M/Z 592.85 100 200 300 400 500 600 700 0e+00 2e+06 4e+06 6e+06 Prec M/Z 416.31 200 400 600 800 1000 0 500000 1500000 2500000 3500000 Prec M/Z 621.87 200 400 600 800 1000 1200 1400 0 500000 1500000 2500000 Prec M/Z 781.91 100 200 300 400 500 600 700 800 0 500000 1500000 2500000 Prec M/Z 476.82 100 200 300 400 500 600 700 0 50000 100000 200000 300000 Prec M/Z 434.77 200 400 600 800 1000 0e+00 2e+05 4e+05 6e+05 Prec M/Z 647.37 Reporter ions Intensity 0.1 0.2 0.3 0.4 0.1 0.2 0.3 0.4 max 1.0 1.5 2.0 2.5 3.0 3.5 4.0 trap 1.0 1.5 2.0 2.5 3.0 3.5 4.0 K.AEFVEVTK.L K.HLVDEPQNLIK.Q Figure 3 : Visualising raw data and process quantitative data (top) and interactive graphical interfaces (bottom) from the pRolocGUI and MSGFgui packages. Data processing, statistics and machine learning As an environment for statistical computing, the very best of data processing, statistical modelling and machine learning is readily available in packages such as MSnbase, isobar, MSstats, msmsTests. MSstats offers set of tools for statistical relative protein significance analysis in DDA, SRM and DIA experiments as well as power calculation. msmsTests lever- ages powerful statistical modelling from the edgeR package to analyse spectral counting data. Conclusion A wide range of other applications such as post-translational modifications (isobar), quality control (qcmetrics and proteoQC) or spatial proteomics (pRoloc) are available as Bioconductor packages. A complete list is available on the Bioconductor page and in the RforProteomics package. Contributions and discussions are welcome on our Google group and GitHub wiki. The R project http://www.r-project.org/ Bioconductor http://bioconductor.org/ RforProteomics http://is.gd/R4Proteomics Google group https://is.gd/rbioc_sig_proteomics Support site https://support.bioconductor.org This work was supported by the European Union 7 th Framework Program PRIME-XS project and a BBSRC Tools and Resources Development Fund. HUPO meeting 5 – 8 October 2014 Madrid

Upload: others

Post on 28-Jun-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A current perspective on using R and Bioconductor for proteomics data analysis · 2015-12-04 · proteomics and mass spectrometry. These software packages tightly interoperate, providing

A current perspective on using R and Bioconductorfor proteomics data analysis

S. Gibb1,3, LM. Breckels1,2, T. Lin Pedersen4, VA. Petyuk5, KS. Lilley2 and L. Gatto1,2,∗

1Computational Proteomics Unit and 2Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, UK3Department of Anesthesiology and Intensive Care, Medical Faculty Carl Gustav Carus, Technical University Dresden, Germany

4Chr. Hansen A/S, Hørsholm / Technical University of Denmark, Kgs. Lyngby, Denmark5Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, USA

[email protected] http://cpu.sysbiol.cam.ac.uk

IntroductionThe R statistical environment and programming language is a key player in manydomains that require robust data analysis. The Bioconductor project offer awide range of R packages dedicated to the analysis and comprehension of highthroughput biology. Originally focused on genomics, Bioconductor is gainingincreasing attention in the proteomics, metabolomics and mass spectrometrycommunities, as reflected by the download statistics and package contributions.

010

2030

4050

Num

ber

of B

ioco

nduc

tor

pack

ages

2.6 2.7 2.8 2.9 2.10 2.11 2.12 2.13 2.14

Bioconductor Versions

ProteomicsMassSpectrometryMassSpectrometryData

010

000

2000

030

000

4000

050

000

Pac

kage

dow

nloa

ds

2.6 2.7 2.8 2.9 2.10 2.11 2.12 2.13 2.14

Bioconductor Versions

ProteomicsMassSpectrometry

Figure 1 : The left figure shows the number of Bioconductor packages dedicated to Proteomics,Mass Spectrometry and Mass Spectrometry Data. On the right figure, we show the number ofdistinct package downloads. Note that the current version, 2.14 is still downloaded. (NB: the datafor Bioc 2.12 are interpolated due to massive scripted downloads).

Here, we present an overview of current Bioconductor infrastructure dedicated toproteomics and mass spectrometry. These software packages tightlyinteroperate, providing well defined pipeline workflows and a flexible andin-depth development environment.

Working with raw data

The proteomics community has developed a range of data standards and formatsfor MS data (e.g. mzML, mzIdentML) to overcome the shortcomings of closed,binary vendor-specific formats.

One of the main projects that implement parsers for the XML-based open formats isthe C++ proteowizard project, which is interfaced by the mzR Bioconductor packageusing the Rcpp infrastructure for fast raw and (starting with Bioconductor version3.0) identification data. mzIdentML files can also be parsed with the mzID package.

library("mzR")

ms <- openMSfile("raw_data.raw")

id <- openIDfile("msgf-res.mzid")

library("mzID")

id2 <- mzID("msgf-res2.mzid")

The resulting ms object is a file handle that allows fast access to the individual spec-tra. mzR is used by a variety of other packages like xcms, MSnbase, RMassBankand TargetSearch.

IdentificationR/Bioconductor provide software to parse mzIdentML files (mzID and mzR, seeabove), directly run identification R (rTANDEM, MSGFplus) and optimise FDRcalculations based on decoy searches. Below, we use MSnID to define identifica-tion searches filters using precursor mass error, identification scores and missedcleavage to minimise false discovery rates.

(id <- apply_filter(id, filter))

: MSnID object

: Working directory: "."

: #Spectrum Files: 24

: #PSMs: 514638 at 0.042 % FDR

: #peptides: 56546 at 0.11 % FDR

: #accessions: 30944 at 0.56 % FDR

QuantitationSeveral quantitation pipelines are supported: MSnbase and isobar for isobarictagging such as iTRAQ and TMT, MALDIquant for MALDI data, MSnbase forspectra counting and other MS2 label free methods and synapter supportsa DIA complete pipeline, including ion mobility separation on Waters Synaptinstruments.

MS data processing

MS spectrum processing is available in multiple packages, optimised for specificuse cases and pipelines: MSnbase, xcms, MALDIquant.

2000 4000 6000 8000 10000

A

2000 4000 6000 8000 10000

B

● ●●●●●

●●●●●

●● ●●● ● ●●

●●

● ●●

● ● ●

2000 4000 6000 8000

−2

−1

01

23

4 C

4180 4190 4200 4210 4220 4230 4240

D

4180 4190 4200 4210 4220 4230 4240

E

2000 4000 6000 8000 10000

F

1206.796

1466.027

1545.929

3262.745

4209.948

4644.373

5336.8715904.68

7766.411 9290.533

Figure 2 : Illustration of the MALDIquant preprocessing pipeline. The first figure (A) shows a rawspectrum with the estimated baseline. In the second figure (B) the spectrum is variance-stabilised,smoothed, baseline-corrected and the detected peaks are marked with blue crosses. The third figure(C) is an example of a fitted warping function for peak alignment. In the next figures, four peaksare shown before (D) and after (E) performing the alignment. The last figure (F) represents amerged spectrum with discovered and labelled peaks.

VisualisationUsing R for high quality programmable figures and interactive graphical userinterfaces such as https://lgatto.shinyapps.io/shinyMA/.

Retention time

M/Z

521

521.22

521.44

521.665

521.885

522.11

522.33

522.555

522.775

523

30:1 30:33 31:5 31:37 32:9 32:45 33:17 33:49 34:22 34:58

3

4

5

6

7

8

521.0

521.5

522.0

522.5

523.0

31

32

33

34

5.0e+07

1.0e+08

1.5e+08

M/ZRetention time200

400

600

800

1000

30.02

30.03

30.04

30.05

30.06

30.07

0.0e+00

5.0e+07

1.0e+08

1.5e+08

2.0e+08

2.5e+08

3.0e+08

M/ZRetention time

0 500 1000 1500 2000 2500 3000 3500

020

4060

8010

0

Time (sec)

TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXMLTIC: 7.22e+09

400 500 600 700 800 900 1000

0.0e

+00

1.5e

+08

3.0e

+08

peaks(ms, i)[,1]

Acquisition 2807Retention time 30:1

521.0 521.5 522.0 522.5

peaks(ms, i)[,1]

peak

s(m

s, i)

[,2]

200 400 600 800 1000 1200 1400

0.0e

+00

4.0e

+07

8.0e

+07

1.2e

+08

Prec M/Z521.34

200 400 600 800 1000 1200

0e+

001e

+06

2e+

063e

+06

4e+

06

Prec M/Z659.37

200 400 600 800

0e+

001e

+06

2e+

063e

+06

4e+

06

Prec M/Z540.36

200 400 600 800 1000

050

0000

1000

000

1500

000

2000

000 Prec M/Z

592.85

100 200 300 400 500 600 700

0e+

002e

+06

4e+

066e

+06

Prec M/Z416.31

200 400 600 800 1000

050

0000

1500

000

2500

000

3500

000

Prec M/Z621.87

200 400 600 800 1000 1200 1400

050

0000

1500

000

2500

000

Prec M/Z781.91

100 200 300 400 500 600 700 800

050

0000

1500

000

2500

000

Prec M/Z476.82

100 200 300 400 500 600 700

050

000

1000

0020

0000

3000

00

Prec M/Z434.77

200 400 600 800 1000

0e+

002e

+05

4e+

056e

+05

Prec M/Z647.37

Reporter ions

Inte

nsity

0.1

0.2

0.3

0.4

0.1

0.2

0.3

0.4

max

● ●

● ●

● ●

● ●

● ●

● ●

1.0 1.5 2.0 2.5 3.0 3.5 4.0

trap

●●

●●

●●

●●

1.0 1.5 2.0 2.5 3.0 3.5 4.0

K.A

EF

VE

VT

K.L

K.H

LVD

EP

QN

LIK.Q

Figure 3 : Visualising raw data and process quantitative data (top) and interactive graphicalinterfaces (bottom) from the pRolocGUI and MSGFgui packages.

Data processing, statistics and machine learning

As an environment for statistical computing, the very best of data processing,statistical modelling and machine learning is readily available in packages suchas MSnbase, isobar, MSstats, msmsTests.

MSstats offers set of tools for statistical relative protein significance analysis inDDA, SRM and DIA experiments as well as power calculation. msmsTests lever-ages powerful statistical modelling from the edgeR package to analyse spectralcounting data.

ConclusionA wide range of other applications such as post-translational modifications(isobar), quality control (qcmetrics and proteoQC) or spatial proteomics(pRoloc) are available as Bioconductor packages. A complete list is availableon the Bioconductor page and in the RforProteomics package. Contributionsand discussions are welcome on our Google group and GitHub wiki.

The R project http://www.r-project.org/

Bioconductor http://bioconductor.org/

RforProteomics http://is.gd/R4Proteomics

Google group https://is.gd/rbioc_sig_proteomics

Support site https://support.bioconductor.org

This work was supported by the European Union 7th Framework Program PRIME-XS

project and a BBSRC Tools and Resources Development Fund.

HUPO meeting 5 – 8 October 2014 Madrid