a current perspective on using r and bioconductor for proteomics data analysis · 2015-12-04 ·...
A current perspective on using R and Bioconductor for proteomics data analysis
S. Gibb1,3, LM. Breckels1,2, T. Lin Pedersen4, VA. Petyuk5, KS. Lilley2 and L. Gatto1,2,∗
1Computational Proteomics Unit and 2Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, UK
3Department of Anesthesiology and Intensive Care, Medical Faculty Carl Gustav Carus, Technical University Dresden, Germany
4Chr. Hansen A/S, Hørsholm / Technical University of Denmark, Kgs. Lyngby, Denmark
5Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, USA
∗[email protected] http://cpu.sysbiol.cam.ac.uk
Introduction
The R statistical environment and programming language is a key player in many domains that require robust data analysis. The Bioconductor project offers a wide range of R packages dedicated to the analysis and comprehension of high-throughput biology. Originally focused on genomics, Bioconductor is gaining increasing attention in the proteomics, metabolomics and mass spectrometry communities, as reflected by the download statistics and package contributions.
[Figure 1, left panel: number of Bioconductor packages per Bioconductor version (2.6 to 2.14) for the Proteomics, MassSpectrometry and MassSpectrometryData categories. Right panel: package downloads per Bioconductor version (2.6 to 2.14) for the Proteomics and MassSpectrometry categories.]
Figure 1: The left figure shows the number of Bioconductor packages dedicated to Proteomics, Mass Spectrometry and Mass Spectrometry Data. On the right figure, we show the number of distinct package downloads. Note that the current version, 2.14, is still downloaded. (NB: the data for Bioc 2.12 are interpolated due to massive scripted downloads).
Here, we present an overview of current Bioconductor infrastructure dedicated to proteomics and mass spectrometry. These software packages tightly interoperate, providing well defined pipeline workflows and a flexible and in-depth development environment.
Working with raw data
The proteomics community has developed a range of data standards and formats for MS data (e.g. mzML, mzIdentML) to overcome the shortcomings of closed, binary vendor-specific formats.
One of the main projects implementing parsers for the XML-based open formats is the C++ ProteoWizard project, which the mzR Bioconductor package interfaces through the Rcpp infrastructure for fast access to raw and (starting with Bioconductor version 3.0) identification data. mzIdentML files can also be parsed with the mzID package.
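library("mzR")
## Open a raw MS data file in an open format such as mzML (file names are placeholders)
ms <- openMSfile("raw_data.mzML")
## Open an mzIdentML identification file (Bioconductor >= 3.0)
id <- openIDfile("msgf-res.mzid")

library("mzID")
## Alternative mzIdentML parser
id2 <- mzID("msgf-res2.mzid")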
The resulting ms object is a file handle that allows fast access to the individual spectra. mzR is used by a variety of other packages like xcms, MSnbase, RMassBank and TargetSearch.
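As a minimal sketch of what access through the file handle can look like, the lines below read spectrum metadata and one peak list. It assumes that the msdata package is installed; its MM14.mzML example file is a stand-in for the poster's own data, not the file used above.

```r
library("mzR")

## Assumption: the 'msdata' package is installed; MM14.mzML is only a
## stand-in example file, not the data used on the poster.
f <- list.files(system.file("microtofq", package = "msdata"),
                pattern = "MM14\\.mzML$", full.names = TRUE)[1]
ms <- openMSfile(f)

runInfo(ms)          # summary of the run (scan count, m/z range, MS levels)
hd <- header(ms)     # data.frame with one row of metadata per spectrum
pk <- peaks(ms, 1)   # two-column matrix (m/z, intensity) of the first spectrum

head(hd[, c("msLevel", "retentionTime", "peaksCount")])
head(pk)

close(ms)            # release the file handle
```

Only the requested spectra are read from disk, which is what makes the handle-based design convenient for large files.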
Identification
R/Bioconductor provides software to parse mzIdentML files (mzID and mzR, see above), run identification searches directly from R (rTANDEM, MSGFplus) and optimise FDR calculations based on decoy searches. Below, we use MSnID to define identification search filters based on precursor mass error, identification scores and missed cleavages to minimise false discovery rates.
(id <- apply_filter(id, filter))
: MSnID object
: Working directory: "."
: #Spectrum Files: 24
: #PSMs: 514638 at 0.042 % FDR
: #peptides: 56546 at 0.11 % FDR
: #accessions: 30944 at 0.56 % FDR
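The filter applied above is not shown on the poster. A sketch of how such a filter might be built with MSnID follows; it uses the C. elegans example file shipped with the package rather than the poster's 24 spectrum files, and the object names (msnid, filtObj) and all thresholds are illustrative assumptions.

```r
library("MSnID")

## Assumption: example identification results shipped with MSnID stand in
## for the poster's data; all thresholds below are illustrative only.
msnid <- MSnID(".")
mzid_file <- system.file("extdata", "c_elegans.mzid.gz", package = "MSnID")
msnid <- read_mzIDs(msnid, mzid_file)

## Derive the quantities the filter is based on
msnid <- assess_missed_cleavages(msnid,
                                 missedCleavagePattern = "[KR](?=[^P$])")
msnid$absParentMassErrPPM <- abs(mass_measurement_error(msnid))
msnid$msmsScore <- -log10(msnid$`MS-GF:SpecEValue`)

## Define the filter: precursor mass error, score and missed cleavages
filtObj <- MSnIDFilter(msnid)
filtObj$absParentMassErrPPM <- list(comparison = "<", threshold = 10)
filtObj$msmsScore <- list(comparison = ">", threshold = 10)
filtObj$numMissCleavages <- list(comparison = "<=", threshold = 2)

## Estimate the FDR the filter would give, then apply it
evaluate_filter(msnid, filtObj, level = "peptide")
msnid_filtered <- apply_filter(msnid, filtObj)
msnid_filtered
```

Printing the filtered object gives a summary of the same form as the output above, with PSM, peptide and accession counts and their decoy-based FDR estimates.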
Quantitation
Several quantitation pipelines are supported: MSnbase and isobar for isobaric tagging such as iTRAQ and TMT, MALDIquant for MALDI data, MSnbase for spectral counting and other MS2 label-free methods, and synapter for a complete DIA pipeline, including ion mobility separation on Waters Synapt instruments.
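As an illustration of the isobaric tagging case, the sketch below quantifies iTRAQ 4-plex reporter ions with MSnbase. It relies on the itraqdata example set shipped with the package; the subset size and the choice of trapezoidation and sum normalisation are assumptions made for brevity, not the poster's settings.

```r
library("MSnbase")

## Assumption: the 'itraqdata' example experiment shipped with MSnbase is
## used; a small subset keeps the sketch quick.
data(itraqdata)
sub <- itraqdata[1:10]

## Integrate the reporter ion peaks by trapezoidation
qnt <- quantify(sub, method = "trap", reporters = iTRAQ4, verbose = FALSE)
head(exprs(qnt))             # one row per spectrum, one column per reporter

## Scale each spectrum's reporter intensities by their sum
qnt_norm <- normalise(qnt, method = "sum")
head(exprs(qnt_norm))
```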
MS data processing
MS spectrum processing is available in multiple packages, optimised for specific use cases and pipelines: MSnbase, xcms, MALDIquant.
[Figure 2, panels A to F. Peaks labelled in panel F: 1206.796, 1466.027, 1545.929, 3262.745, 4209.948, 4644.373, 5336.871, 5904.68, 7766.411 and 9290.533.]
Figure 2: Illustration of the MALDIquant preprocessing pipeline. The first figure (A) shows a raw spectrum with the estimated baseline. In the second figure (B) the spectrum is variance-stabilised, smoothed, baseline-corrected and the detected peaks are marked with blue crosses. The third figure (C) is an example of a fitted warping function for peak alignment. In the next figures, four peaks are shown before (D) and after (E) performing the alignment. The last figure (F) represents a merged spectrum with discovered and labelled peaks.
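The steps named in the Figure 2 caption can be sketched with MALDIquant roughly as follows. The example spectra are those shipped with the package, and every parameter value (window sizes, SNR, tolerance, methods) is an illustrative assumption rather than the setting used for the figure.

```r
library("MALDIquant")

## Assumption: example spectra shipped with MALDIquant; all parameter
## values below are illustrative.
data(fiedler2009subset)
spectra <- fiedler2009subset

spectra <- transformIntensity(spectra, method = "sqrt")       # variance stabilisation
spectra <- smoothIntensity(spectra, method = "SavitzkyGolay",
                           halfWindowSize = 10)               # smoothing
spectra <- removeBaseline(spectra, method = "SNIP",
                          iterations = 100)                   # baseline correction
spectra <- calibrateIntensity(spectra, method = "TIC")        # intensity calibration

## Peak-based alignment using a fitted warping function
spectra <- alignSpectra(spectra, halfWindowSize = 20, SNR = 2,
                        tolerance = 0.002, warpingMethod = "lowess")

## Peak detection, then binning of peaks across spectra
peaks <- detectPeaks(spectra, method = "MAD", halfWindowSize = 20, SNR = 2)
peaks <- binPeaks(peaks, tolerance = 0.002)

## Merge all spectra into one average spectrum
avg <- averageMassSpectra(spectra, method = "mean")
```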
Visualisation
R can be used to produce high-quality programmable figures and interactive graphical user interfaces such as https://lgatto.shinyapps.io/shinyMA/.
[Figure 3 panels: raw data maps with Retention time and M/Z axes (M/Z 521 to 523, retention time 30:1 to 34:58); a plot against Time (sec) titled TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML, TIC: 7.22e+09; the spectrum of acquisition 2807 (retention time 30:1) plotted from peaks(ms, i); MS2 spectra with precursor M/Z 521.34, 659.37, 540.36, 592.85, 416.31, 621.87, 781.91, 476.82, 434.77 and 647.37; reporter ion intensities in panels labelled max and trap.]
Figure 3: Visualising raw data and processed quantitative data (top) and interactive graphical interfaces (bottom) from the pRolocGUI and MSGFgui packages.
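A programmable figure of raw data can be as simple as the base R sketch below, which draws the total ion current over retention time and one spectrum from its peak matrix. It assumes the msdata package is installed and uses its MM14.mzML example file, not the TMT data shown in Figure 3.

```r
library("mzR")

## Assumption: the 'msdata' example file stands in for the poster's data.
f <- list.files(system.file("microtofq", package = "msdata"),
                pattern = "MM14\\.mzML$", full.names = TRUE)[1]
ms <- openMSfile(f)
hd <- header(ms)
i <- 1
pk <- peaks(ms, i)

op <- par(mfrow = c(1, 2))
## Total ion current along the run
plot(hd$retentionTime, hd$totIonCurrent, type = "l",
     xlab = "Retention time (sec)", ylab = "Total ion current")
## One spectrum drawn as vertical lines
plot(pk[, 1], pk[, 2], type = "h",
     xlab = "M/Z", ylab = "Intensity",
     main = paste("Acquisition", hd$acquisitionNum[i]))
par(op)

close(ms)
```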
Data processing, statistics and machine learning
R is an environment for statistical computing, so state-of-the-art data processing, statistical modelling and machine learning are readily available in packages such as MSnbase, isobar, MSstats and msmsTests.
MSstats offers a set of tools for statistical relative protein significance analysis in DDA, SRM and DIA experiments as well as power calculation. msmsTests leverages powerful statistical modelling from the edgeR package to analyse spectral counting data.
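To give an idea of the kind of count-based modelling that edgeR provides, the sketch below calls edgeR directly on a simulated spectral count matrix. It does not use the msmsTests interface, and the data, group sizes and effect size are all invented for illustration.

```r
library("edgeR")

## Assumption: simulated spectral counts, 200 proteins in 6 runs
## (3 control, 3 treated); none of this is real data.
set.seed(1)
counts <- matrix(rnbinom(200 * 6, mu = 20, size = 5), nrow = 200,
                 dimnames = list(paste0("prot", 1:200), paste0("run", 1:6)))
counts[1:10, 4:6] <- counts[1:10, 4:6] * 4    # 10 proteins with a 4-fold change
group <- factor(rep(c("control", "treated"), each = 3))

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)     # normalise for differences in total counts
y <- estimateDisp(y)        # negative binomial dispersion estimates
et <- exactTest(y)          # treated versus control

topTags(et, n = 10)         # proteins ranked by evidence of differential abundance
```

The negative binomial model accounts for the overdispersion typical of count data, which is the reason count-based tools of this kind suit spectral counting.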
Conclusion
A wide range of other applications such as post-translational modifications (isobar), quality control (qcmetrics and proteoQC) or spatial proteomics (pRoloc) are available as Bioconductor packages. A complete list is available on the Bioconductor page and in the RforProteomics package. Contributions and discussions are welcome on our Google group and GitHub wiki.
The R project http://www.r-project.org/
Bioconductor http://bioconductor.org/
RforProteomics http://is.gd/R4Proteomics
Google group https://is.gd/rbioc_sig_proteomics
Support site https://support.bioconductor.org
This work was supported by the European Union 7th Framework Program PRIME-XS
project and a BBSRC Tools and Resources Development Fund.
HUPO meeting 5 – 8 October 2014 Madrid