pride cluster 062016 update
Post on 11-Apr-2017
Spectrum clustering of PRIDE MS/MS data
Dr. Juan Antonio Vizcaíno
Proteomics Team Leader, EMBL-EBI, Hinxton, Cambridge, UK
Juan A. Vizcaíno, juan@ebi.ac.uk
Seminar, 20 June 2016
Overview
• Brief introduction to PRIDE
• PRIDE Cluster: motivation and concept, first implementation
• PRIDE Cluster second implementation
PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride
• PRIDE stores mass spectrometry (MS)-based proteomics data:
• Peptide and protein expression data (identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
Martens et al., Proteomics, 2005; Vizcaíno et al., NAR, 2016
ProteomeXchange: A Global, distributed proteomics database
PASSEL (SRM data)
PRIDE (MS/MS data)
MassIVE (MS/MS data)
[Figure labels for submitted data types: Raw, ID/Q, Meta]
155 datasets/month since July 2015
Mandatory raw data deposition since July 2015
Motivation
• Data is stored in PRIDE as originally analysed by the submitters (no data reprocessing is done)
• Heterogeneous quality, difficult to make the data comparable
• Enable assessment of (published) proteomics data
• Pre-requisite for data reuse (e.g. in other bioinformatics resources such as UniProt)
PRIDE Cluster - Concept
Griss et al., Nat Methods, 2013
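[Figure: originally submitted identified spectra grouped into one cluster and summarised by a consensus spectrum; most spectra are identified as NMMAACDPR, a minority as PPECPDFDPPR]
Threshold: at least 10 spectra in a cluster and a ratio (proportion of spectra supporting the most common peptide sequence) >70%.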
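The threshold on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PRIDE Cluster code: the function names `most_common_ratio` and `is_reliable` are hypothetical, a cluster is represented simply as a list of the peptide sequences assigned to its spectra, and the "ratio" is assumed to be the share of spectra carrying the most frequent sequence.

```python
from collections import Counter


def most_common_ratio(peptides):
    """Return (peptide, ratio) for the most frequent sequence in a cluster.

    `peptides` holds one peptide sequence per identified spectrum.
    """
    peptide, count = Counter(peptides).most_common(1)[0]
    return peptide, count / len(peptides)


def is_reliable(peptides, min_size=10, min_ratio=0.70):
    """Apply the slide's threshold: at least `min_size` spectra and ratio > `min_ratio`."""
    if len(peptides) < min_size:
        return False
    _, ratio = most_common_ratio(peptides)
    return ratio > min_ratio


# Hypothetical cluster: 8 spectra identified as one peptide, 2 as another.
cluster = ["NMMAACDPR"] * 8 + ["PPECPDFDPPR"] * 2
print(most_common_ratio(cluster))
print(is_reliable(cluster))
```

Keeping `min_size` and `min_ratio` as parameters makes it easy to apply a different cut-off to the same data.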
PRIDE Cluster: Implementation
• Griss et al., Nat. Methods, 2013
• Clustered all public, identified spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
• EBI farm, LSF
PRIDE Archive
• World-leading repository for MS/MS-based proteomics data
• Founding member of ProteomeXchange
PRIDE Cluster
Sequence-based search engines
Spectrum clustering
Incorrectly or unidentified spectra
PRIDE Cluster: Second Implementation
First implementation (Griss et al., Nat. Methods, 2013):
• Clustered all public, identified spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
• EBI farm, LSF
Second implementation (Griss et al., Nat. Methods, 2016, in press):
• Clustered all public spectra in PRIDE by April 2015
• Apache Hadoop
• Starting with 256 M spectra
• 190 M unidentified spectra (filtered down to 111 M spectra that are likely to represent a peptide)
• 66 M identified spectra
• Result: 28 M clusters
• 5 calendar days on a 30-node Hadoop cluster, 340 CPU cores
PRIDE Cluster - Concept
Griss et al., Nat Methods, 2016
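[Figure: originally submitted identified spectra grouped into one cluster and summarised by a consensus spectrum; most spectra are identified as NMMAACDPR, a minority as PPECPDFDPPR]
Threshold: at least 3 spectra in a cluster and a ratio (proportion of spectra supporting the most common peptide sequence) >70%.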
Output of the analysis
• 1. Inconsistent spectrum clusters
• 2. Clusters including identified and unidentified spectra
• 3. Clusters just containing unidentified spectra
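The three output categories can be illustrated with a small triage function. This is a sketch under assumptions rather than the actual PRIDE Cluster logic: the function name `categorise` and the category labels are hypothetical, each spectrum is represented by its peptide sequence or `None` if unidentified, and the 50% cut-off for "inconsistent" is taken from the inconsistent-cluster slide in this talk. The order in which the checks are applied is also an assumption, since a real cluster could fall into more than one category.

```python
from collections import Counter


def categorise(labels, inconsistent_max_ratio=0.50):
    """Assign a cluster to one output category.

    `labels` has one entry per spectrum: a peptide sequence, or None
    when the spectrum was submitted without an identification.
    """
    identified = [p for p in labels if p is not None]
    if not identified:
        return "unidentified_only"            # category 3
    top_count = Counter(identified).most_common(1)[0][1]
    if top_count / len(identified) <= inconsistent_max_ratio:
        return "inconsistent"                 # category 1
    if len(identified) < len(labels):
        return "identified_and_unidentified"  # category 2
    return "consistent_identified"            # none of the three categories


# Hypothetical clusters, one per outcome.
print(categorise([None, None, None]))
print(categorise(["NMMAACDPR", "PPECPDFDPPR", "IGGIGTVPVGR", "VFDEFKPLVEEPQNLIK"]))
print(categorise(["NMMAACDPR", "NMMAACDPR", "NMMAACDPR", None]))
```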
1. Re-analysis of inconsistent clusters
[Figure: originally submitted identified spectra grouped into one cluster with a consensus spectrum; the spectra are identified as NMMAACDPR, PPECPDFDPPR, VFDEFKPLVEEPQNLIK and IGGIGTVPVGR]
No sequence has a proportion in the cluster >50%.
1. Re-analysis of inconsistent clusters
• Re-analysed 3,997 large (>100 spectra), inconsistent clusters with PepNovo, SpectraST, X!Tandem.
• 453 clusters (11%) were identified as peptides originating from keratins, trypsin, albumin, and hemoglobin.
• In these cases, it is likely that a contaminants database was not used in the original search.
2. Inferring identifications for originally unidentified spectra
• 9.1 M unidentified spectra were contained in clusters with a reliable identification.
• These are candidate new identifications (that need to be confirmed), often missed due to search engine settings
• Example: 49,263 reliable clusters (containing 560,000 identified and 130,000 unidentified spectra) contained phosphorylated peptides, many of them from non-enriched studies.
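How an identification might be inferred for unidentified spectra can be sketched as follows. This is a minimal illustration, not the published method: the function name `candidate_identifications` is hypothetical, a cluster is modelled as a list of `(spectrum_id, peptide_or_None)` pairs, and the reliability check reuses the second implementation's threshold (at least 3 spectra, ratio >70%), here applied to the identified spectra only, which is an assumption. The results are candidates that still need to be confirmed.

```python
from collections import Counter


def candidate_identifications(cluster, min_identified=3, min_ratio=0.70):
    """Propose the cluster's dominant peptide for its unidentified spectra.

    `cluster` is a list of (spectrum_id, peptide_or_None) pairs.
    Returns {spectrum_id: peptide} for unidentified members, or an empty
    dict if the cluster does not have a reliable identification.
    """
    identified = [pep for _, pep in cluster if pep is not None]
    if len(identified) < min_identified:
        return {}
    peptide, count = Counter(identified).most_common(1)[0]
    if count / len(identified) <= min_ratio:
        return {}
    return {sid: peptide for sid, pep in cluster if pep is None}


# Hypothetical cluster: three identified spectra agree, two are unidentified.
cluster = [
    ("s1", "NMMAACDPR"),
    ("s2", "NMMAACDPR"),
    ("s3", "NMMAACDPR"),
    ("s4", None),
    ("s5", None),
]
print(candidate_identifications(cluster))
```

Returning an empty dict for unreliable clusters keeps the caller from propagating an identification that the cluster itself does not support.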
3. Consistently unidentified clusters
• 19 M clusters contain only unidentified spectra.
• 41,155 of these clusters have more than 100 spectra (= 12 M spectra).
• Most are likely to be derived from peptides.
• They could correspond to PTMs or variant peptides.
• With various methods, we found likely identifications for about 20%.
• A vast amount of data mining remains to be done.
PRIDE Cluster as a Public Data Mining Resource
• http://www.ebi.ac.uk/pride/cluster
• Spectral libraries for 16 species.
• All clustering results, as well as specific subsets of interest, are available.
• Source code (open source) and Java API
Applications of spectrum clustering…
• In individual datasets or small groups of “similar” datasets:
• Can be used to target spectra that are “consistently” unidentified.
• Unidentified spectra could represent PTMs or sequence variants.
• Try “more-expensive” computational analysis methods (e.g. spectral searches, de novo).
• When mixing identified and unidentified spectra from different experiments, if PTMs that were not found initially are identified, one could modify the initial search parameters.
• For quantification purposes, could the alignment of different runs be improved by clustering the spectra first?
Acknowledgements
Johannes Griss
Rui Wang
Yasset Perez-Riverol
Steve Lewis
Henning Hermjakob
OpenMS team (led by O. Kohlbacher)
David Tabb
The rest of the PRIDE team, especially Noemi del Toro and Jose A. Dianes