PRIDE and ProteomeXchange

PRIDE resources and ProteomeXchange Dr. Juan Antonio Vizcaíno Proteomics Team Leader EMBL-EBI Hinxton, Cambridge, UK

Upload: juan-antonio-vizcaino

Post on 15-Apr-2017


TRANSCRIPT

EMBL-EBI Now and in the Future

PRIDE resources and ProteomeXchange

Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-EBI
Hinxton, Cambridge, UK

Juan A. [email protected] Proteomics Bioinformatics Course 2016Hinxton, 8 December 2016

Data resources at EMBL-EBI

Genes, genomes & variation: European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive, Ensembl, Ensembl Genomes, GWAS Catalog, Metagenomics portal
Gene & protein expression: ArrayExpress, Expression Atlas, PRIDE
Protein sequences, families & motifs: InterPro, Pfam, UniProt
Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
Chemical biology: ChEMBL, ChEBI
Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
Systems: BioModels, Enzyme Portal, BioSamples
Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology

This slide shows the core resources at the EBI, illustrating the range of data you can access through the EBI.

Overview

PRIDE Archive (in the context of ProteomeXchange and the PSI standards)

How to submit data to PRIDE: PRIDE tools

How to access data in PRIDE Archive

PRIDE Cluster and PRIDE Proteomes


PRIDE (PRoteomics IDEntifications) Archive

http://www.ebi.ac.uk/pride/archive

PRIDE stores mass spectrometry (MS)-based proteomics data:
Peptide and protein expression data (identification and quantification)
Post-translational modifications
Mass spectra (raw data and peak lists)
Technical and biological metadata
Any other related information

Full support for tandem MS approaches
Any type of data can be stored.

Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016


Data content in PRIDE Archive

Submission-driven resource

PRIDE is organised into datasets (groups of assays)

An assay represents one MS run (in most cases).

No data reprocessing at present. PRIDE aims to represent the authors' view of the data

Supported formats: PRIDE XML and mzIdentML.

Raw data is also now stored


ProteomeXchange: A global, distributed proteomics database

http://www.proteomexchange.org

Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.

Member repositories:
PRIDE (MS/MS data)
MassIVE (MS/MS data)
PASSEL (SRM data)
jPOST (MS/MS data), new in 2016

Data types: raw data (Raw), identification and quantification results (ID/Q), metadata (Meta)

Mandatory raw data deposition since July 2015

Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017, in press

ProteomeXchange data workflow

[Diagram: ProteomeXchange data workflow. Elements shown: researchers' results, raw data and metadata; receiving repositories PRIDE, MassIVE and jPOST (MS/MS data as complete submissions; any other workflow mainly as partial submissions) and PASSEL (SRM data); datasets; ProteomeCentral (metadata / manuscript, raw data, results); journals; research groups reanalysing datasets; reprocessed results in PeptideAtlas and MassIVE.]

Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017, in press

ProteomeXchange data workflow (extended)

[Diagram: the same workflow with additional data consumers: UniProt/neXtProt, GPMDB, proteomicsDB, other DBs, and OmicsDI (integration with other omics datasets).]

PRIDE: Source of MS proteomics data

PRIDE Archive already provides or will soon provide MS proteomics data to other EMBL-EBI resources such as UniProt, Ensembl and the Expression Atlas.

http://www.ebi.ac.uk/pride

Explain that PRIDE is working in two main directions:
Develop submission/dissemination pipelines for MS proteomics data involving the main proteomics resources (ProteomeXchange consortium).
Integrate proteomics information (peptide/protein expression data) with other EBI resources such as Ensembl (genomics), the Expression Atlas (transcriptomics) and UniProt (protein sequence information).
Proteomics data is needed to have a more complete picture of biology.

Overview

PRIDE Archive (in the context of ProteomeXchange and the PSI standards)

How to submit data to PRIDE: PRIDE tools

How to access data in PRIDE Archive

A sneak peek at other PRIDE resources


Complete vs Partial submissions: processed results

[Panels: Complete, Partial]

For complete submissions, it is possible to connect the spectra with the processed identification results, and they can be visualized.

Complete vs Partial submissions: experimental metadata

[Panels: Complete, Partial]

General experimental metadata about the projects is similar. However, at the assay level, the information in partial submissions is less detailed.

How to perform a complete PX submission to PRIDE

Decide between a complete/partial submission

File conversion/export to mzIdentML (or PRIDE XML)

File check before submission (PRIDE Inspector)

Experimental annotation and actual file submission (PX submission tool)

Post-submission steps


PX Data workflow for MS/MS data

Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).

Result files:
Complete submissions: Result files can be converted to the mzIdentML data standard (or PRIDE XML).
Partial submissions: For workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.

Metadata: Sufficiently detailed description of sample origin, workflow, instrumentation, submitter.

Other files: Optional files (the list can be extended):
QUANT: Quantification-related results
PEAK: Peak list files
GEL: Gel images
FASTA
SP_LIBRARY
OTHER: Any other file type

[Diagram labels: Published, Raw files, Other files]
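The file categories above can be sketched as a small lookup. This is only an illustration: the extension-to-category mapping, the function names and the rule used to tell complete from partial submissions are assumptions made for this example, not the logic of the PX Submission Tool.

```python
from collections import Counter
from pathlib import PurePath

# Hypothetical extension-to-category mapping, for illustration only.
# The category names follow the slide; the extensions are assumptions.
CATEGORY_BY_EXTENSION = {
    ".raw": "RAW",
    ".wiff": "RAW",
    ".mzid": "RESULT",   # mzIdentML result file
    ".mgf": "PEAK",
    ".mzml": "PEAK",
    ".mzxml": "PEAK",
    ".fasta": "FASTA",
    ".png": "GEL",
}


def categorise(filename: str) -> str:
    """Return the assumed category for a file, or OTHER if unknown."""
    suffix = PurePath(filename).suffix.lower()
    return CATEGORY_BY_EXTENSION.get(suffix, "OTHER")


def submission_type(filenames: list[str]) -> str:
    """Guess 'complete' vs 'partial' from the categories present.

    Assumption: a complete submission has at least one RAW file and one
    RESULT file in a supported standard format (mzIdentML here).
    """
    counts = Counter(categorise(name) for name in filenames)
    if counts["RAW"] and counts["RESULT"]:
        return "complete"
    return "partial"
```

With these assumptions, `submission_type(["run1.raw", "run1.mzid", "run1.mgf"])` gives `"complete"`, while a raw file accompanied only by a native search engine output file is treated as `"partial"`.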


PRIDE Components: Data Submission Process

PRIDE Converter 2
PRIDE Inspector
PX Submission Tool

Formats: mzIdentML, PRIDE XML

In addition to PRIDE Archive, the PRIDE team develops and maintains different tools and software libraries to facilitate the handling and visualisation of MS proteomics data and the submission process.

Tools: RESULT file generation

Native file export to mzIdentML
Search tools: Mascot, ProteinPilot, Scaffold, PEAKS, MSGF+, others
Final RESULT file: mzIdentML
Spectra files: mzML, mzXML, mzData, mgf, pkl, ms2, dta, apl

Complete submissions: Search Engine Results + MS files

Search engines and tools that export mzIdentML:
Mascot
MSGF+
MyriMatch and related tools from D. Tabb's lab
OpenMS
PEAKS
PeptideShaker
ProCon (ProteomeDiscoverer, Sequest)
Scaffold
TPP via the idConvert tool (ProteoWizard)
ProteinPilot (from version 5.0)
X!Tandem native conversion (Beta, PILEDRIVER)
Others: library for X!Tandem conversion, lab internal pipelines, Crux
Soon: ProteomeDiscoverer (Thermo)

An increasing number of tools support export to mzIdentML 1.1

Referenced spectral files need to be submitted as well (all open formats are supported).
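Because an mzIdentML file points to its spectra through `SpectraData` elements with a `location` attribute, a quick pre-submission check can list the referenced file names and compare them with the files you plan to upload. The sketch below is a minimal illustration using only the Python standard library; the function names are made up for this example and it is not part of PRIDE Inspector or the PX Submission Tool.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def referenced_spectra_files(mzid_path: str) -> list[str]:
    """Return the file names found in SpectraData 'location' attributes."""
    names = []
    # iterparse streams the document, which helps with large result files.
    for _event, elem in ET.iterparse(mzid_path, events=("end",)):
        # Compare the local tag name so the namespace version does not matter.
        if elem.tag.rsplit("}", 1)[-1] == "SpectraData":
            location = elem.get("location", "")
            # Locations may be file URIs, Windows paths or bare file names.
            path = unquote(urlparse(location).path or location)
            names.append(PurePosixPath(path.replace("\\", "/")).name)
        elem.clear()
    return names


def missing_spectra(mzid_path: str, submitted: set[str]) -> list[str]:
    """List referenced spectra files that are absent from the submission."""
    return [n for n in referenced_spectra_files(mzid_path) if n not in submitted]
```

An empty list from `missing_spectra("result.mzid", {"run1.mgf", "run2.mgf"})` would suggest that every referenced spectra file is included; matching is by file name only, which is an assumption of this sketch.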

Updated list: http://www.psidev.info/tools-implementing-mzIdentML#.


Tools: RESULT file generation

Coming soon: Support for mzTab
Search tools: Mascot, MaxQuant, others
Final RESULT file: mzTab (native file export)
Spectra files: mzML, mzXML, mzData, mgf, pkl, ms2, dta, apl

PRIDE Inspector Toolsuite

Wang et al., Nat. Biotechnology, 2012
Perez-Riverol et al., MCP, 2016

PRIDE Inspector 2 supports:
PRIDE XML
mzIdentML + all types of spectra files
mzML
mzTab identification and quantification (+ all types of spectra files)

https://github.com/PRIDE-Toolsuite/

PRIDE Inspector Toolsuite: PRIDE Inspector 2

New visualisation functionality for Protein Groups

PX Submission Tool

Desktop application for data submissions to ProteomeXchange via PRIDE

Implemented in Java 7
Streamlines the submission process
Capture mappings between files
Retain metadata
Fast file transfer with Aspera (FASP transfer technology)
FTP also available
Command line option

Submission tool screenshot


PX submission tool: screenshots

PRIDE Archive: over 5,000 datasets from over 51 countries and 2,000 groups

Datasets by country:
USA: 814
Germany: 528
UK: 338
China: 328
France: 222
Netherlands: 175
Canada: 137

Data volume:
Total: ~275 TB
Number of all files: ~560,000
PXD000320-324: ~4 TB
PXD002319-26: ~2.4 TB
PXD001471: ~1.6 TB

1,973 datasets, i.e. 52% of all, are publicly accessible
~90% of all ProteomeXchange datasets

PRIDE Archive growth
[Chart: submissions per year, all submissions vs complete submissions]
In the last 12 months: ~165 submitted datasets per month

Top species studied by at least 100 datasets:
Homo sapiens: 2,010
Mus musculus: 604
Saccharomyces cerevisiae: 191
Arabidopsis thaliana: 140
Rattus norvegicus: 127
>900 reported taxa in total

(> 922 processed by MaxQuant)

Public data release: when does it happen?

When the author tells us to do it (the authors can do it by themselves)

When we find out that a dataset has been published

We look for PXD identifiers in PubMed abstracts.

If your PXD identifier is not in the abstract, a paper may have been published and the data is still private. Let us know!

New web form on the PRIDE website to facilitate the process
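The search for PXD identifiers in abstracts can be pictured with a simple pattern match. This is a minimal sketch, not the procedure the PRIDE team runs: the accession format (PXD followed by six digits) is inferred from the examples in this talk, such as PXD000320, and the function name is made up for the example.

```python
import re

# Accession format inferred from examples such as PXD000320 and PXD001471.
PXD_PATTERN = re.compile(r"\bPXD\d{6}\b")


def find_pxd_identifiers(abstract: str) -> list[str]:
    """Return the distinct PXD accessions in the text, in order of appearance."""
    # dict.fromkeys removes duplicates while keeping the first-seen order.
    return list(dict.fromkeys(PXD_PATTERN.findall(abstract)))
```

An abstract that never mentions the accession returns an empty list, which is the situation described above where a paper is published but the dataset stays private.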

Partial submissions can be used to store other data types

Everything can be stored, not only MS/MS data: very flexible mechanism to be able to capture all types of datasets.

PRIDE does not actively store SRM data (PASSEL).

Top down proteomics datasets.

Mass Spectrometry Imaging datasets.

Data independent acquisition techniques: e.g. SWATH-MS, MSE, HD-MSE, etc.

A public repository for mass spectrometry imaging data (Römpp et al., 2015)

[Figure: Thermo RAW data / UDP are converted to imzML and uploaded to the PRIDE database (European Bioinformatics Institute, Cambridge, UK), then downloaded from PRIDE and displayed in MSiReader. Panels C and D: image from the original publication [13] (Mirion software, JLU) and image reconstructed from ProteomeXchange data.]

Vendor-independent data format
Freely available software (open source)
Open data, free to reuse
Anybody can do this!

No file size limit!


Ways to access data in PRIDE Archive

PRIDE web interface

File repository

REST web service

PRIDE Inspector tool

PRIDE Archive web interface

PRIDE Archive web interface (2)


ProteomeCentral: Centralised portal for all PX datasets

http://proteomecentral.proteomexchange.org/cgi/GetDataset
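A link to a single dataset in ProteomeCentral can be built from the portal URL above. The sketch below assumes that the page accepts the accession through an `ID` query parameter; that parameter name and the function name are assumptions of this example, so check the resulting link in a browser before relying on it.

```python
from urllib.parse import urlencode

# Base URL as given on the slide.
GET_DATASET_URL = "http://proteomecentral.proteomexchange.org/cgi/GetDataset"


def dataset_url(accession: str) -> str:
    """Build a ProteomeCentral link for one dataset.

    Assumption: the accession is passed in an `ID` query parameter.
    """
    return f"{GET_DATASET_URL}?{urlencode({'ID': accession})}"
```

For example, `dataset_url("PXD000320")` produces the portal URL followed by `?ID=PXD000320`.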

RSS feed and Twitter for following announcements of public datasets

http://groups.google.com/group/proteomexchange/feed/rss_v2_0_msgs.xml
@proteomexchange


Added value resources: PRIDE Cluster and PRIDE Proteomes

Condensed, cross-dataset, QC-filtered view on PRIDE data.
PRIDE Cluster: Peptide centric.
PRIDE Proteomes: Protein centric (identification data)

PRIDE Cluster

Provide an aggregated peptide-centric view of PRIDE Archive.
Hypothesis: the same peptide will generate similar MS/MS spectra across experiments.
New version of spectral clustering algorithm to reliably group spectra coming from the same peptide.
Enables QC of peptide-spectrum matches (PSMs).
Infer reliable identifications by comparing submitted identifications of spectra within a cluster.
After clustering, a representative spectrum is built for all peptides consistently identified across different datasets.
Used to build spectral libraries (for 16 species).

Griss et al., Nat. Methods, 2013
Griss et al., Nat. Methods, 2016
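The idea of comparing submitted identifications within a cluster can be illustrated with a small calculation: find the most frequently submitted peptide in a cluster and the share of PSMs that agree with it. This is a simplified sketch and not the PRIDE Cluster implementation; the function names and the size and ratio thresholds are hypothetical values chosen for the example.

```python
from collections import Counter


def cluster_consensus(peptides: list[str]) -> tuple[str, float]:
    """Return the most frequent peptide in a cluster and its share of PSMs."""
    if not peptides:
        raise ValueError("cluster has no identified spectra")
    sequence, count = Counter(peptides).most_common(1)[0]
    return sequence, count / len(peptides)


def is_reliable(peptides: list[str], min_size: int = 3, min_ratio: float = 0.7) -> bool:
    """Flag a cluster as reliable under hypothetical thresholds.

    Assumption: a cluster counts as reliable when it holds at least
    `min_size` PSMs and at least `min_ratio` of them agree on one peptide.
    """
    if len(peptides) < min_size:
        return False
    _sequence, ratio = cluster_consensus(peptides)
    return ratio >= min_ratio
```

A cluster in which nine of ten spectra were submitted as the same peptide has a ratio of 0.9 and passes these thresholds; the remaining spectrum is a candidate for a questionable identification.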


Example: one perfect cluster

880 PSMs give the same peptide ID
4 species
28 datasets
Same instruments

http://www.ebi.ac.uk/pride/cluster/

PRIDE Cluster as a Public Data Mining Resource

http://www.ebi.ac.uk/pride/cluster

Spectral libraries for 16 species.
All clustering results, as well as specific subsets of interest, available.
Source code (open source) and Java API

PRIDE Proteomes web interface: identification info

Unique/Shared Peptides
Mass spec-based sequence coverage
PTM detected
Observed tissues
Biological vs Sample Prep PTMs

http://wwwdev.ebi.ac.uk/pride/proteomes/

Conclusions

Main characteristics of PRIDE Archive and ProteomeXchange

PX/PRIDE submission workflow for MS/MS data:
PRIDE Inspector
PX submission tool

PRIDE/ProteomeXchange has become the de facto standard for data submission and data availability in proteomics

PRIDE Proteomes and PRIDE Cluster: new resources

PRIDE resources

Do you want to know a bit more?

http://www.slideshare.net/JuanAntonioVizcaino

Acknowledgements: People

Attila Csordas
Tobias Ternent
Gerhard Mayer (de.NBI)
Johannes Griss
Yasset Perez-Riverol
Manuel Bernal-Llinares
Andrew Jarnuczak
Enrique Perez

Former team members, especially Rui Wang, Florian Reisinger, Noemi del Toro, Jose A. Dianes & Henning Hermjakob

Acknowledgements: The PRIDE Team

All data submitters !!!

@pride_ebi


Questions?
