The ProteomeXchange Consortium
PRIDE and ProteomeXchange
Henning Hermjakob
Head of Molecular Systems
European Bioinformatics Institute
Director of Bioinformatics
National Center for Protein Sciences, Beijing
Data resources at EMBL-EBI
Categories:
• Genes, genomes & variation
• Gene, protein & metabolite expression
• Protein sequences, families & motifs
• Chemical biology
• Molecular structures
• Systems
• Literature & ontologies
Resources shown:
• RNAcentral, ArrayExpress, Expression Atlas, MetaboLights, PRIDE
• InterPro, Pfam, UniProt
• ChEMBL, SureChEMBL, ChEBI
• Protein Data Bank in Europe, Electron Microscopy Data Bank
• European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive
• BioModels, BioSamples, Enzyme Portal, IntAct, Reactome
• Ensembl, Ensembl Genomes, GWAS Catalog, Metagenomics portal
• Europe PubMed Central, BioStudies, Gene Ontology, Experimental Factor Ontology
A Proteomics Workflow
Sample Raw data Id/Quant Analysis Res
Vizcaíno JA, et al. 2016 update of the PRIDE database and its related tools. Nucleic Acids Res. 2016 Jan 4;44(D1):D447-56.
Metadata
ProteomeXchange: A global, distributed proteomics database
Member repositories:
• PASSEL (SRM data)
• PRIDE (MS/MS data)
• MassIVE (MS/MS data)
• jPOST (MS/MS data)
Data types: Raw, ID/Q, Meta
Mandatory raw data deposition since July 2015
>150 datasets/month since July 2015
ProteomeXchange: 3,802 datasets up until April 1st, 2016
Data volume:
Total: ~150 TB
Number of all files: ~400,000
PXD001860: ~12 TB
PXD000320-324: ~4 TB
PXD002319-26: ~2.4 TB
PXD001471: ~1.6 TB
Origin:
885 USA
465 Germany
342 United Kingdom
264 China
194 France
158 Netherlands
136 Canada
126 Switzerland
107 Denmark
104 Spain
99 Australia
95 Japan
72 Belgium
68 Austria
63 Sweden
61 India
51 Norway
43 Taiwan
30 Italy
29 Brazil
28 Singapore
28 Finland
27 Ireland
27 Russia
26 Israel …
Datasets/year:
2012: 102
2013: 527
2014: 963
2015: 1758
2016: 452
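As a quick consistency check, the yearly counts above can be summed and compared with the stated total of 3,802 datasets. The sketch below only uses the numbers from the slide; the variable names are illustrative, and the 2016 monthly rate assumes that the 2016 figure covers January to March (the slide states "up until April 1st, 2016").

```python
# Dataset counts per year, copied from the slide.
datasets_per_year = {2012: 102, 2013: 527, 2014: 963, 2015: 1758, 2016: 452}

# The yearly counts should add up to the stated ProteomeXchange total.
total = sum(datasets_per_year.values())
print(total)  # 3802

# Year-on-year growth factor for the complete years.
for year in (2013, 2014, 2015):
    growth = datasets_per_year[year] / datasets_per_year[year - 1]
    print(f"{year}: x{growth:.2f} relative to {year - 1}")

# Assumption: the 2016 figure covers three months (January to March).
monthly_rate_2016 = datasets_per_year[2016] / 3
print(f"2016: {monthly_rate_2016:.1f} datasets/month")
```

Under that assumption the 2016 rate comes out just above 150 datasets per month, in line with the ">150 datasets/month" figure quoted earlier.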
Top Species studied by at least 20 datasets:
1526 Homo sapiens
485 Mus musculus
150 Saccharomyces cerevisiae
121 Arabidopsis thaliana
102 Rattus norvegicus
86 Escherichia coli
44 Bos taurus
35 Drosophila melanogaster
32 Glycine max
~ 700 species in total
Funding
All ProteomeXchange partners are independently funded
• Basic institutional core funding
• Research grants
• PRIDE:
  • 2 FTE EMBL-EBI core funding
  • 2 FTE Wellcome Trust PRIDE
  • 2 FTE BBSRC BBR
  • 0.2 FTE de.NBI (proto-ELIXIR)
Key development phase was enabled by EU ProteomeXchange grant 2011-2014 (number 260558) with funding to both EU and US partners
Challenge: Scarce international funding
• Two failed applications to joint NSF/BBSRC call
Usage
Ca. 2,000 data submitters per year, defined through
intensive email contact.
Ca. 5,000 data access users/year, defined through distinct
IP addresses in web logs.
Dataset level metrics through log analysis - challenging
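A minimal sketch of what such a log analysis might look like: count distinct IP addresses as a proxy for users, and count successful requests per dataset accession. The log format (Apache common log format), the URL paths and the sample lines are assumptions for illustration, not the actual PRIDE logs or pipeline. Distinct IPs are only an approximation of users, since several people can share one address and one person can use several.

```python
import re
from collections import Counter

# Hypothetical sample lines in Apache common log format (not real PRIDE logs).
SAMPLE_LOG = """\
192.0.2.10 - - [01/Apr/2016:10:00:00 +0000] "GET /archive/PXD001641/file1.raw HTTP/1.1" 200 1024
192.0.2.10 - - [01/Apr/2016:10:05:00 +0000] "GET /archive/PXD001641/file2.raw HTTP/1.1" 200 2048
198.51.100.7 - - [01/Apr/2016:11:00:00 +0000] "GET /archive/PXD000561/file1.raw HTTP/1.1" 200 4096
203.0.113.5 - - [01/Apr/2016:12:00:00 +0000] "GET /archive/PXD001641/missing.raw HTTP/1.1" 404 0
"""

# Client IP, requested path and HTTP status code of a GET request.
LINE_RE = re.compile(r'^(\S+) .*?"GET (\S+) HTTP/[\d.]+" (\d{3})')
# ProteomeXchange accessions as they appear in this document: PXD plus six digits.
ACCESSION_RE = re.compile(r"PXD\d{6}")


def summarise(log_text):
    """Return (number of distinct IPs, hits per dataset) for successful requests."""
    ips = set()
    hits = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not look like GET requests
        ip, path, status = match.groups()
        if status != "200":
            continue  # ignore failed requests
        ips.add(ip)
        accession = ACCESSION_RE.search(path)
        if accession:
            hits[accession.group()] += 1
    return len(ips), hits


distinct_users, hits_per_dataset = summarise(SAMPLE_LOG)
print(distinct_users)
print(dict(hits_per_dataset))
```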
Downloads (hits / no. of files = hits per file)
Dataset      Hits / files         Dataset title
PXD001641    31808 / 91 = 350     Single muscle fiber proteomics reveals unexpected mitochondrial specialization
PXD001126    4897 / 26 = 188      Building high-quality assay libraries for targeted analysis of SWATH MS data
PXD001574    4436 / 32 = 139      Phospho-iTRAQ
PXD000475    6638 / 50 = 133      Yersinia enterocolitica SOR17
PXD000700    266 / 2 = 133        Proteomic analysis of accessory gland in sexually mature Eriocheir sinensis
PXD000561    46578 / 2383 = 20    A draft map of the human proteome
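The ratio in the download table is hits divided by the number of files in the dataset, rounded to the nearest integer. The short sketch below recomputes it from the numbers on the slide; the data structure and names are illustrative only.

```python
# (accession, hits, number of files), copied from the download table.
downloads = [
    ("PXD001641", 31808, 91),
    ("PXD001126", 4897, 26),
    ("PXD001574", 4436, 32),
    ("PXD000475", 6638, 50),
    ("PXD000700", 266, 2),
    ("PXD000561", 46578, 2383),
]

# Normalising by file count keeps large datasets from dominating on raw hits alone.
hits_per_file = {acc: round(hits / files) for acc, hits, files in downloads}

# Rank datasets by hits per file, highest first.
for acc, ratio in sorted(hits_per_file.items(), key=lambda item: -item[1]):
    print(acc, ratio)
```

This normalisation explains why PXD000561 has the most raw hits in the table but the lowest ratio: its hits are spread over 2,383 files.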
Usage
Strong culture of data citation in Proteomics
Contingency Planning
In proteomics, contingency planning is not hypothetical:
• NCBI Peptidome closed in 2011:
  • Managed process, data remains accessible on FTP server
  • Data transferred to EBI PRIDE, recurated
• Tranche closed in 01/2013:
  • Heavily used repository, built on BitTorrent technology
  • Became unreliable in 2012
  • Closed in 1/2013
  • MassIVE and PeptideAtlas tried to rescue data, largely unsuccessful
  • There are now dead data links in the public literature
Acknowledgements
• ProteomeXchange partners, in particular:
• Eric Deutsch, ISB, Seattle
• Nuno Bandeira, UCSD
• Yasushi Ishihama, jPOST
• Andy Jones, U Liverpool
• Lennart Martens, U Gent
• Pierre-Alain Binz, SIB, Geneva
• Martin Eisenacher, MPC, Bochum
• Ruedi Aebersold, ETH Zurich
• Laurent Gatto, U Cambridge
• Editors
• Mike Dunn, Proteomics
• Achim Kraus, Proteomics
• Ralph Bradshaw, MCP
• PRIDE team
• Juan Antonio Vizcaino
• Attila Csordas
• Johannes Griss, EBI/U Vienna
• Tobias Ternent
• Yasset Perez Riverol
• Mingze Bai
• Noemi del Toro Ayllon
• Funding:
• Wellcome Trust PRIDE
• EU FP7 ProteomeXchange
• BBSRC BBR Process
• BBSRC BBR ProteoGenomics
• NIH BD2K Center of Excellence @ UCLA, Grant number 1U54GM114833-01
All data providers!
proteomexchange.org psidev.info
If the Human Genome Project had not followed an open data release policy, what would we be searching our spectra against today?
The HUPO Proteomics Standards Initiative
• mzML: spectra, 2011
• mzIdentML: identification, 2012
• mzQuantML: quantitation, 2013
• mzTab: summary, 2014
• Controlled vocabulary: 2013
• qcML: quality control, 2014
“In particular, novel protein sequences should be deposited in UniProt (www.uniprot.org); molecular interactions in an IMEx partner database (imex.sf.net); and protein identification data in PRIDE (www.ebi.ac.uk/pride), World-2DPAGE (www.expasy.org/world-2dpage/), or a comparable database.”
If a manuscript is accepted by the journal, all mass spectra
contributing to the described work must be deposited in
electronic form by the time of publication at a publicly
accessible site that is independent of the authors' control.
Data Deposition Requirements
ProteomeXchange data flow
[Figure: ProteomeXchange data flow. Labels in the diagram: Researcher's results, Raw data*, Metadata; Receiving repositories: PASSEL (SRM data), PRIDE (MS/MS data), MassIVE (MS/MS data); ProteomeCentral; Metadata / Manuscript; Raw Data*; Results; Journals; Reprocessed results; UniProt/neXtProt; PeptideAtlas; GPMDB; Other DBs]
History
2005:
• First PRIDE Publication
• ProteomeXchange concept published
2011: Peptidome (NCBI) closes
2011: EU ProteomeXchange grant start
• DB partners PRIDE, PeptideAtlas (ISB, Seattle)
2012: ProteomeXchange production start
2013: Tranche closes
2014: ProteomeXchange grant ends
2014: MassIVE (UCSD) joins
2015: Mandatory submissions by MCP
2016: jPOST (Japan) joins