The ProteomeXchange Consortium

PRIDE and ProteomeXchange

Henning Hermjakob

Head of Molecular Systems

European Bioinformatics Institute

Director of Bioinformatics

National Center for Protein Sciences, Beijing

Data resources at EMBL-EBI

Genes, genomes & variation: European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive, Ensembl, Ensembl Genomes, GWAS Catalog, Metagenomics portal, RNA Central

Gene, protein & metabolite expression: ArrayExpress, Expression Atlas, Metabolights, PRIDE

Protein sequences, families & motifs: InterPro, Pfam, UniProt

Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank

Chemical biology: ChEMBL, SureChEMBL, ChEBI

Systems: BioModels, BioSamples, Enzyme Portal, IntAct, Reactome

Literature & ontologies: Europe PubMed Central, BioStudies, Gene Ontology, Experimental Factor Ontology

A Proteomics Workflow

Sample Raw data Id/Quant Analysis Res

Vizcaíno JA, et al. 2016 update of the PRIDE database and its related tools. Nucleic Acids Res. 2016 Jan 4;44(D1):D447-56.

Metadata

ProteomeXchange: A global, distributed proteomics database

• PASSEL (SRM data)
• PRIDE (MS/MS data)
• MassIVE (MS/MS data)
• jPOST (MS/MS data)

Diagram labels: Raw, ID/Q, Meta

Mandatory raw data deposition since July 2015

>150 datasets/month since July 2015

ProteomeXchange: 3,802 datasets up until April 1st, 2016

Data volume:

Total: ~150 TB

Number of all files: ~400,000

PXD001860: ~12 TB

PXD000320-324: ~4 TB

PXD002319-26: ~2.4 TB

PXD001471: ~1.6 TB

Origin:

885 USA

465 Germany

342 United Kingdom

264 China

194 France

158 Netherlands

136 Canada

126 Switzerland

107 Denmark

104 Spain

99 Australia

95 Japan

72 Belgium

68 Austria

63 Sweden

61 India

51 Norway

43 Taiwan

30 Italy

29 Brazil

28 Singapore

28 Finland

27 Ireland

27 Russia

26 Israel …

Datasets/year:

2012: 102

2013: 527

2014: 963

2015: 1758

2016: 452
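The yearly counts above can be cross-checked against the 3,802 total quoted for April 1st, 2016. The following is a minimal sketch that only uses the numbers from this slide; the variable names are illustrative, and the 2016 figure covers a partial year, so it is left out of the growth comparison.

```python
# Datasets per year as listed on the slide (2016 runs only up to April 1st).
datasets_per_year = {2012: 102, 2013: 527, 2014: 963, 2015: 1758, 2016: 452}

# The sum should match the total number of ProteomeXchange datasets quoted above.
total = sum(datasets_per_year.values())

# Year-over-year growth factor, restricted to complete years.
growth = {
    year: datasets_per_year[year] / datasets_per_year[year - 1]
    for year in (2013, 2014, 2015)
}

print("total datasets:", total)
for year, factor in growth.items():
    print(year, "growth factor vs. previous year:", round(factor, 2))
```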

Top species, each studied in at least 20 datasets:

1526 Homo sapiens

485 Mus musculus

150 Saccharomyces cerevisiae

121 Arabidopsis thaliana

102 Rattus norvegicus

86 Escherichia coli

44 Bos taurus

35 Drosophila melanogaster

32 Glycine max

~ 700 species in total

Funding

All ProteomeXchange partners are independently funded

• Basic institutional core funding

• Research grants

• PRIDE:

• 2 FTE EMBL-EBI core funding

• 2 FTE Wellcome Trust PRIDE

• 2 FTE BBSRC BBR

• 0.2 de.NBI (proto-Elixir)

Key development phase was enabled by EU ProteomeXchange grant 2011-2014 (number 260558) with funding to both EU and US partners

Challenge: Scarce international funding

• Two failed applications to joint NSF/BBSRC call

Usage

Ca. 2,000 data submitters per year, counted through intensive email contact.

Ca. 5,000 data access users per year, counted as distinct IP addresses in web logs.

Dataset-level metrics through log analysis are challenging.
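Counting users as distinct IP addresses, and attributing hits to datasets, can be sketched as below. This is an illustration under stated assumptions, not the PRIDE log pipeline: the log lines are invented samples in Common Log Format, the URL paths are hypothetical, and `summarize` is a made-up helper name. Only the `PXD` plus six digits accession pattern is taken from the accessions shown in these slides.

```python
import re
from collections import Counter

# ProteomeXchange accessions as they appear in this deck: "PXD" + six digits.
ACCESSION = re.compile(r"PXD\d{6}")

# Invented sample lines (Common Log Format); real server logs may look different.
SAMPLE_LOG = """\
192.0.2.10 - - [01/Apr/2016:10:00:00 +0000] "GET /files/PXD001641/a.raw HTTP/1.1" 200 512
192.0.2.10 - - [01/Apr/2016:10:00:05 +0000] "GET /files/PXD001641/b.raw HTTP/1.1" 200 512
198.51.100.7 - - [01/Apr/2016:10:01:00 +0000] "GET /files/PXD000561/c.raw HTTP/1.1" 200 512
203.0.113.5 - - [01/Apr/2016:10:02:00 +0000] "GET /index.html HTTP/1.1" 200 128
"""


def summarize(log_text):
    """Return (set of distinct client IPs, Counter of hits per accession)."""
    ips = set()
    hits = Counter()
    for line in log_text.splitlines():
        if not line.strip():
            continue
        # In Common Log Format the client address is the first field.
        ips.add(line.split(" ", 1)[0])
        match = ACCESSION.search(line)
        if match:
            hits[match.group()] += 1
    return ips, hits


ips, hits = summarize(SAMPLE_LOG)
print("distinct IPs:", len(ips))
print("hits per dataset:", dict(hits))
```

Distinct IP addresses are only a proxy for users: shared gateways merge several people into one address, and dynamic addresses split one person into several, which is part of why dataset-level metrics from logs are hard to interpret.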

Downloads per dataset (hits / no. of files = hits per file):

Dataset     Hits / files       Dataset title
PXD001641   31808/91 = 350     Single muscle fiber proteomics reveals unexpected mitochondrial specialization
PXD001126   4897/26 = 188      Building high-quality assay libraries for targeted analysis of SWATH MS data
PXD001574   4436/32 = 139      Phospho-iTRAQ
PXD000475   6638/50 = 133      Yersinia enterocolitica SOR17
PXD000700   266/2 = 133        Proteomic analysis of accessory gland in sexually mature Eriocheir sinensis
PXD000561   46578/2383 = 20    A draft map of the human proteome
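The normalisation used in this table divides download hits by the number of files in each dataset, so that large datasets do not dominate simply because they contain more files. A minimal sketch reproducing the ratios from the figures above (variable names are illustrative):

```python
# (accession, download hits, number of files) as listed in the table.
downloads = [
    ("PXD001641", 31808, 91),
    ("PXD001126", 4897, 26),
    ("PXD001574", 4436, 32),
    ("PXD000475", 6638, 50),
    ("PXD000700", 266, 2),
    ("PXD000561", 46578, 2383),
]

# Hits per file: raw hits normalised by dataset size.
hits_per_file = {acc: hits / files for acc, hits, files in downloads}

# Rank datasets by the normalised value, highest first.
for acc, ratio in sorted(hits_per_file.items(), key=lambda item: item[1], reverse=True):
    print(acc, round(ratio))
```

PXD000561 has the most raw hits in the table but the lowest hits-per-file value, which illustrates why the two measures rank datasets differently.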

Usage

Strong culture of data citation in Proteomics

Data Re-Use: Anecdotal Evidence

Systematic Tracking of Data Re-use: OmicsDI

www.omicsDI.org

Contingency Planning

In proteomics, contingency planning is not hypothetical:

• NCBI Peptidome closed in 2011:
  • Managed process, data remains accessible on FTP server
  • Data transferred to EBI PRIDE, recurated
• Tranche closed in 01/2013:
  • Heavily used repository, built on BitTorrent technology
  • Became unreliable in 2012
  • Closed in 1/2013
  • MassIVE and PeptideAtlas tried to rescue data, largely unsuccessful
  • There are now dead data links in the public literature

Best intentions – one for all, all for one…

Acknowledgements

• ProteomeXchange partners, in particular:
  • Eric Deutsch, ISB, Seattle
  • Nuno Bandeira, UCSD
  • Yasushi Ishihama, jPOST
  • Andy Jones, U Liverpool
  • Lennart Martens, U Gent
  • Pierre-Alain Binz, SIB, Geneva
  • Martin Eisenacher, MPC, Bochum
  • Ruedi Aebersold, ETH Zurich
  • Laurent Gatto, U Cambridge
• Editors
  • Mike Dunn, Proteomics
  • Achim Kraus, Proteomics
  • Ralph Bradshaw, MCP
• PRIDE team
  • Juan Antonio Vizcaino
  • Attila Csordas
  • Johannes Griss, EBI/U Vienna
  • Tobias Ternent
  • Yasset Perez Riverol
  • Mingze Bai
  • Noemi del Toro Ayllon
• Funding:
  • Wellcome Trust PRIDE
  • EU FW7 ProteomeXchange
  • BBSRC BBR Process
  • BBSRC BBR ProteoGenomics
  • NIH BD2K Center of Excellence @ UCLA, Grant number 1U54GM114833-01

All data providers!

• ?

proteomexchange.org psidev.info

If the Human Genome Project had not followed an open data release policy, what would we be searching our spectra against today?

Proteomics Minimum Data Requirements

The HUPO Proteomics Standards Initiative

• mzML: spectra, 2011
• mzIdentML: identification, 2012
• mzQuantML: quantitation, 2013
• mzTab: summary, 2014
• Controlled vocabulary: 2013
• qcML: quality control, 2014
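As an illustration of what a PSI standard format buys in practice, the sketch below counts spectra per MS level in an mzML-style document using only the Python standard library. It rests on several assumptions: the embedded fragment is a heavily reduced toy, not a schema-valid mzML file (real files carry many more required elements and attributes), and `ms_level_counts` is a hypothetical helper. The namespace and the `MS:1000511` ("ms level") controlled vocabulary term are those used by mzML and the PSI-MS vocabulary.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# XML namespace used by mzML documents.
NS = {"mz": "http://psi.hupo.org/ms/mzml"}

# Reduced toy fragment for illustration only; NOT a schema-valid mzML file.
TOY_MZML = """\
<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <run id="toy_run">
    <spectrumList count="3">
      <spectrum index="0" id="scan=1">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1"/>
      </spectrum>
      <spectrum index="1" id="scan=2">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
      </spectrum>
      <spectrum index="2" id="scan=3">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
      </spectrum>
    </spectrumList>
  </run>
</mzML>
"""


def ms_level_counts(xml_text):
    """Count spectra per MS level, identified by the CV term MS:1000511."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for spectrum in root.iterfind(".//mz:spectrum", NS):
        for param in spectrum.iterfind("mz:cvParam", NS):
            if param.get("accession") == "MS:1000511":
                counts[int(param.get("value"))] += 1
    return counts


counts = ms_level_counts(TOY_MZML)
print(dict(counts))
```

Because the MS level is identified by a controlled vocabulary accession rather than a vendor-specific field name, the same lookup works regardless of which instrument or converter produced the file. For real data, a dedicated mzML reader is a better choice than hand-rolled XML parsing.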

“In particular, novel protein sequences should be deposited in UniProt (www.uniprot.org); molecular interactions in an IMEx partner database (imex.sf.net); and protein identification data in PRIDE (www.ebi.ac.uk/pride), World-2DPAGE (www.expasy.org/world-2dpage/), or a comparable database.”

If a manuscript is accepted by the journal, all mass spectra contributing to the described work must be deposited in electronic form by the time of publication at a publicly accessible site that is independent of the authors' control.

Data Deposition Requirements

ProteomeXchange data flow

[Diagram. Labels: ProteomeCentral; Metadata / Manuscript; Raw Data*; Results; Journals; UniProt/neXtProt; PeptideAtlas; GPMDB; Other DBs. Receiving repositories: PASSEL (SRM data), PRIDE (MS/MS data), MassIVE (MS/MS data). Legend: Researcher's results; Reprocessed results; Raw data*; Metadata.]

Curators!

Attila Csordas

Tobias Ternent

History

2005:

• First PRIDE Publication

• ProteomeXchange concept published

2011: Peptidome (NCBI) closes

2011: EU ProteomeXchange grant start

• DB partners PRIDE, PeptideAtlas (ISB, Seattle)

2012: ProteomeXchange production start

2013: Tranche closes

2014: ProteomeXchange grant ends

2014: MassIVE (UCSD) joins

2015: Mandatory submissions by MCP

2016: jPOST (Japan) joins
