pride cluster 062016 update

Spectrum clustering of PRIDE MS/MS data

Dr. Juan Antonio Vizcaíno

Proteomics Team Leader
EMBL-EBI
Hinxton, Cambridge, UK

Juan A. Vizcaíno (juan@ebi.ac.uk)

Seminar, 20 June 2016

Overview

• Brief introduction to PRIDE

• PRIDE Cluster: motivation and concept, first implementation

• PRIDE Cluster second implementation

PRIDE (PRoteomics IDEntifications) database

http://www.ebi.ac.uk/pride

• PRIDE stores mass spectrometry (MS)-based proteomics data:

• Peptide and protein expression data (identification and quantification)

• Post-translational modifications

• Mass spectra (raw data and peak lists)

• Technical and biological metadata

• Any other related information

• Full support for tandem MS approaches

Martens et al., Proteomics, 2005; Vizcaíno et al., NAR, 2016

ProteomeXchange: A Global, distributed proteomics database

PASSEL (SRM data)

PRIDE (MS/MS data)

MassIVE (MS/MS data)

155 datasets/month since July 2015

Mandatory raw data deposition since July 2015

Motivation

• Data is stored in PRIDE as originally analysed by the submitters (no data reprocessing is done)

• Heterogeneous quality, difficult to make the data comparable

• Enable assessment of (published) proteomics data

• Pre-requisite for data reuse (e.g. in other bioinformatics resources such as UniProt)

PRIDE Cluster - Concept

Griss et al., Nat Methods, 2013

[Figure: originally submitted identified spectra (example peptide sequences NMMAACDPR and PPECPDFDPPR) are grouped into a cluster and summarised by a consensus spectrum.]

Threshold: at least 10 spectra in a cluster and ratio >70%.
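The threshold can be written as a small check. This is a minimal sketch, not the PRIDE Cluster code: it assumes a cluster is a plain list of peptide sequences (one per identified spectrum) and takes "ratio" to mean the share of spectra carrying the most frequent sequence. The function names are hypothetical.

```python
from collections import Counter

MIN_SPECTRA = 10   # from the slide: at least 10 spectra in a cluster
MIN_RATIO = 0.70   # from the slide: ratio > 70%


def top_sequence_ratio(sequences):
    """Return (most frequent sequence, its share of the cluster)."""
    if not sequences:
        return None, 0.0
    sequence, count = Counter(sequences).most_common(1)[0]
    return sequence, count / len(sequences)


def is_reliable(sequences):
    """Apply the size and ratio thresholds (assumed interpretation)."""
    _, ratio = top_sequence_ratio(sequences)
    return len(sequences) >= MIN_SPECTRA and ratio > MIN_RATIO


cluster = ["NMMAACDPR"] * 8 + ["PPECPDFDPPR"] * 2
print(top_sequence_ratio(cluster))  # ('NMMAACDPR', 0.8)
print(is_reliable(cluster))         # True
```

A cluster with nine spectra, or one where the top sequence covers exactly 70%, fails the check under this reading.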

PRIDE Cluster: Implementation

• Griss et al., Nat. Methods, 2013

• Clustered all public, identified spectra in PRIDE

• EBI compute farm, LSF

• 20.7 M identified spectra

• 610 CPU days, two calendar weeks

• Validation, calibration

• Feedback into PRIDE datasets

PRIDE Archive

• World-leading repository for MS/MS-based proteomics data

• Founding member of ProteomeXchange

PRIDE Cluster

Sequence-based search engines

Spectrum clustering

Incorrectly or unidentified spectra

PRIDE Cluster: Second Implementation

First implementation (Griss et al., Nat. Methods, 2013):

• Clustered all public, identified spectra in PRIDE

• EBI compute farm, LSF

• 20.7 M identified spectra

• 610 CPU days, two calendar weeks

• Validation, calibration

• Feedback into PRIDE datasets

Second implementation (Griss et al., Nat. Methods 2016, in press):

• Clustered all public spectra in PRIDE by April 2015

• Apache Hadoop

• Starting with 256 M spectra

• 190 M unidentified spectra (filtered to 111 M spectra that are likely to represent a peptide)

• 66 M identified spectra

• Result: 28 M clusters

• 5 calendar days on a 30-node Hadoop cluster, 340 CPU cores
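The figures for the second implementation can be cross-checked with simple arithmetic. The sketch below only uses numbers quoted on the slide. It assumes that the clustering input was the 111 M filtered unidentified spectra plus the 66 M identified spectra, which the slide implies but does not state. The core-day figure is an upper bound that assumes every core was busy for the whole run.

```python
# Counts in millions of spectra, as quoted on the slide.
total = 256
unidentified = 190
identified = 66
unidentified_kept = 111  # unidentified spectra likely to represent a peptide
clusters = 28            # millions of clusters

# The two groups add up to the starting total.
assert unidentified + identified == total

# Assumption: filtered unidentified + all identified spectra were clustered.
clustered = unidentified_kept + identified
kept_fraction = unidentified_kept / unidentified
mean_cluster_size = clustered / clusters

# Upper bound on compute: 5 calendar days on 340 cores, all fully used.
max_core_days = 5 * 340

print(f"spectra clustered: {clustered} M")
print(f"unidentified spectra kept by the filter: {kept_fraction:.0%}")
print(f"mean spectra per cluster: {mean_cluster_size:.1f}")
print(f"at most {max_core_days} core-days")
```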

Output of the analysis

• 1. Inconsistent spectrum clusters

• 2. Clusters including identified and unidentified spectra

• 3. Clusters just containing unidentified spectra
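The three outputs can be illustrated by sorting clusters into categories. This is a hedged sketch with a hypothetical `categorise` function: it assumes a cluster is a list of peptide sequences with `None` for an unidentified spectrum, that "inconsistent" means no sequence exceeds 50% of the identified spectra, and that the inconsistency check takes precedence over the mixed category. None of these details are confirmed by the slides.

```python
from collections import Counter


def categorise(cluster):
    """cluster: list of peptide sequences, None for an unidentified spectrum."""
    identified = [seq for seq in cluster if seq is not None]
    if not identified:
        return "unidentified only"            # output 3
    top_count = Counter(identified).most_common(1)[0][1]
    if top_count / len(identified) <= 0.5:
        return "inconsistent"                 # output 1 (assumed 50% rule)
    if len(identified) < len(cluster):
        return "identified and unidentified"  # output 2
    return "consistent, fully identified"


print(categorise([None, None, None]))
print(categorise(["NMMAACDPR", "PPECPDFDPPR", "IGGIGTVPVGR", "NMMAACDPR"]))
print(categorise(["NMMAACDPR", "NMMAACDPR", "NMMAACDPR", None]))
```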

1. Re-analysis of inconsistent clusters

[Figure: a cluster whose spectra carry conflicting identifications (NMMAACDPR, PPECPDFDPPR, VFDEFKPLVEEPQNLIK, IGGIGTVPVGR) around a single consensus spectrum.]

No sequence has a proportion in the cluster >50%.

• Re-analysed 3,997 large (>100 spectra), inconsistent clusters with PepNovo, SpectraST, X!Tandem.

• 453 clusters (11%) were identified as peptides originating from keratins, trypsin, albumin, and hemoglobin.

• In these cases, it is likely that a contaminants database was not used in the original search.

Validation

2. Inferring identifications for originally unidentified spectra

• 9.1 M unidentified spectra were contained in clusters with a reliable identification.

• These are candidate new identifications (which still need to be confirmed), often missed because of the original search engine settings.

• Example: 49,263 reliable clusters (containing 560,000 identified and 130,000 unidentified spectra) contained phosphorylated peptides, many of them from non-enriched studies.
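The inference step can be sketched as follows: when a cluster has a reliable identification, its unidentified members inherit that sequence as a candidate. This is an illustration under stated assumptions, not the published method. The function name and data layout are hypothetical, and the reliability thresholds (at least 10 identified spectra, top sequence above 70%) are borrowed from the first implementation and applied to the identified spectra only.

```python
from collections import Counter


def candidate_identifications(cluster, min_spectra=10, min_ratio=0.70):
    """cluster: list of (spectrum_id, sequence or None) pairs.

    Returns {spectrum_id: sequence} for the unidentified spectra of a
    reliable cluster, or an empty dict otherwise. These are candidates
    that still need to be confirmed.
    """
    identified = [seq for _, seq in cluster if seq is not None]
    if len(identified) < min_spectra:
        return {}
    sequence, count = Counter(identified).most_common(1)[0]
    if count / len(identified) <= min_ratio:
        return {}
    return {sid: sequence for sid, seq in cluster if seq is None}


cluster = [(f"s{i}", "NMMAACDPR") for i in range(10)]
cluster += [("u1", None), ("u2", None)]
print(candidate_identifications(cluster))
# {'u1': 'NMMAACDPR', 'u2': 'NMMAACDPR'}
```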

3. Consistently unidentified clusters

• 19 M clusters contain only unidentified spectra.

• 41,155 of these clusters have more than 100 spectra (= 12 M spectra).

• Most are likely to be derived from peptides.

• They could correspond to PTMs or variant peptides.

• With various methods, we found likely identifications for about 20%.

• A vast amount of data mining remains to be done.

PRIDE Cluster as a Public Data Mining Resource

• http://www.ebi.ac.uk/pride/cluster

• Spectral libraries for 16 species.

• All clustering results, as well as specific subsets of interest, available.

• Source code (open source) and Java API

Applications of spectrum clustering…

• In individual datasets or small groups of “similar” datasets:

  • Can be used to target spectra that are “consistently” unidentified.

  • Unidentified spectra could represent PTMs or sequence variants.

  • Try “more-expensive” computational analysis methods (e.g. spectral searches, de novo).

• When mixing identified and unidentified spectra from different experiments, if “non-initially” found PTMs are identified, one could modify the initial search parameters.

• For quantification purposes, the alignment of different runs could be improved by clustering the spectra first?

Acknowledgements

Johannes Griss

Rui Wang

Yasset Perez-Riverol

Steve Lewis

Henning Hermjakob

OpenMS team (led by O. Kohlbacher)

David Tabb

The rest of the PRIDE team, especially Noemi del Toro and Jose A. Dianes

Questions?
