The Tranche data repository: Progress made and lessons learned from 24.4 TB of data, 1,612 users and 12,458 depositions.
Philip Andrews, University of Michigan
Center for Computational Medicine and Bioinformatics
Proteome Commons Session sponsored by: Statistical Proteomics Initiative (SPI) of HUPO
Empty archives: the problem
Most researchers agree that open access to data is the scientific ideal, so what is stopping it happening?
Nature Vol 461 | 10 September 2009
“Dark data” is not carefully indexed and stored so it becomes nearly invisible to scientists and other potential users and therefore is more likely to remain underutilized and eventually lost. (Bryan Heidorn, 2008)
Data not in the public domain has less value
If data is in the public domain can/will it actually be utilized?
“[The institutional data repository] is like a roach motel. Data goes in, but it doesn’t come out.”
(Dorothea Salo, 2008)
Why is data sharing an important issue?
• Volume of high value data is increasing.
• Crucial for peer review and critical analysis.
• Economics are driving demand for data reuse.
• Cultural shift: Public access to data collected with public funds.
• Storage and transport getting better and cheaper.
• Improved algorithms require access to data.
• Raises research quality and lowers risk of unethical behavior.
Omics data sets all have different properties (and different solutions are being pursued)
• Microarrays (RNAseq)
• ChIPseq
• Metabolomics
• Proteomics
• Glycomics
The Proteomics data challenge
• Proteomics projects generate large amounts of data at high cost.
  – Data sets can be quite large.
  – File formats proprietary.
  – Reuse of data sets has been limited.
• Proteomics technologies evolve rapidly.
  – Data types and relationships change quickly.
• Annotation has been variable.
• Documentation for peer review.
  – The Paris Guidelines were developed by major proteomics journals and domain experts to address data documentation concerns.
Technical challenges in proteome informatics impact data sharing
• Proteome technologies evolve rapidly.
  – Software always lags behind hardware.
  – Software always lags behind applications.
• Database structures are inadequate
  – missing data, data quality, complex interactions, changing interactions, new data types, pedigree, etc.
What does Open Access mean for data?
(by imperfect analogy to open access publications)
• Data sets are online
• Free of restrictions (copyright)
• Free of cost (to user)
• No technical barriers to use
Open data requirements
• NIH: Policy for Sharing of Data Obtained in NIH Supported or Conducted Genome-Wide Association Studies (GWAS)
• Final NIH Statement on Sharing Research Data, Release Date: February 26, 2003, NOT-OD-03-032
• Paris guidelines for publication
• Amsterdam accord for data sharing in proteomics
• MCP requirements for publication
What was life like when we started the Tranche repository?
• Real concerns about ability of proteomics to deliver quality data (resulted in Paris guidelines for publication).
• Recognition that reproducibility across laboratories for identical experiments required access to raw data.
• Expectation that few investigators would share raw data.
• No turnkey way to share data existed.
• Large number of proprietary data formats.
Selected data sharing issues
• How do we encourage researchers to share data?
• How should citations, authorship, intellectual property, etc. be handled?
• What are the legal constraints?
• Who should pay for infrastructure and maintenance?
• Is access to the data sets alone sufficient?
Why do investigators not want to share their data?
• Competitive edge (funding process)
• All the normal reasons primates don’t share:
  – What if my data sets are not as good as everyone else’s? (data insecurity)
  – What if I missed something obvious? (also data insecurity)
  – It’s a valuable dataset. (data hoarding)
  – It’s mine! (emotional response, greed)
  – It’s too hard. (laziness, priorities)
  – It gives me no advantage.
Why do investigators want to share their data?
• All the reasons primates want to share:
  – Because it’s the right thing to do (altruism)
  – Look how great my data sets are (boasting)
  – More eyes are better and I will get value back (enlightened self interest)
  – Improved tools will be developed to analyze data.
  – Larger datasets become available that they can use (if you share, then others are more likely to share).
  – New perspectives gained from colleagues.
  – Reputation enhanced as good citizen and colleague, etc.
What can be done to promote data sharing?
• Provide better infrastructure (lower the energy barrier).
• Provide positive incentives.
• Value data set release more highly.
  – Role for funding agencies and journals.
• Allow initial author of data to retain some control over when, where, how.
• Address and possibly modify scientific mores on citation and authorship.
• Modify community views on interpretations and errors.
Should data sets be copyrighted?
• What are the limits on primary use? Secondary use?
• Should we (scientists) retain moral rights?
• What are the consequences if all rights are waived?
• What happens to intellectual property rights?
The Tranche Data Repository
• Distributed data repository (not a database)
• Allows secure upload and dissemination of files
• Two-stage release (encrypted, public)
• Links well to other resources
• Data provenance and fidelity are inherent to system
• Key clients were researchers and publishers
• Closely integrated with Proteomecommons.org
http://www.trancheproject.org
Proteomecommons.org
• Online project management based on a social network model
• Allows annotation standards to be applied
• Provides linkage of data sets to publications
• Primarily used for collaborative projects
http://www.proteomecommons.org/
Tranche is a cloud storage system…..
…..linked to the Proteomecommons.org resource
Tranche system current stats
• Tranche citation codes have appeared in a large number of journals
• 1,612 registered users
• 12,458 depositions representing approximately a billion mass spectra (24.4 TB in total)
• Average data chunk has over three replications
• Cutting-edge data sets are available for public access as soon as manuscripts are accepted
Tranche Stats (as of 2010)
• Fewer than 40 of the 9,638 data sets in Tranche invoke a restrictive data license!
• 245 out of > 700 registered users are from the US.
• Europe is a major user, with Asia and Canada representing other major users.
• About 20% of data in Tranche is encrypted and waiting for manuscript publication; this share has remained fairly constant.
Tranche Features
• Registration required for data upload (provenance) but not download
• Two status levels for data
  – Encrypted (PI has password)
  – Public (for public dissemination)
• Hash code is used for citation
• Data license embedded in file with provenance info
• At time of publishing, an HTML data page is generated and uniquely indexed by Google
• Public data mined by TheGPMdb, PeptideAtlas, etc.
• Can link data sets to your publications
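The "hash code is used for citation" idea can be illustrated with a minimal sketch. The slides do not describe Tranche's actual hash format, so SHA-256 is assumed here as a stand-in, and the function names `citation_code` and `verify_download` are hypothetical. The point is only that a code derived from the file contents both identifies a data set and lets a reader check that a download matches what was cited.

```python
import hashlib


def citation_code(data: bytes) -> str:
    """Return a hex digest that identifies the exact bytes of a data set.

    Assumption: SHA-256 stands in for whatever hash the repository uses.
    """
    return hashlib.sha256(data).hexdigest()


def verify_download(data: bytes, cited_code: str) -> bool:
    """Check that downloaded bytes match the code cited in a publication."""
    return citation_code(data) == cited_code


if __name__ == "__main__":
    original = b"example spectra file contents"  # placeholder data
    code = citation_code(original)
    print(verify_download(original, code))          # unchanged file
    print(verify_download(original + b"x", code))   # altered file
```

Because the code depends only on the bytes, any copy of the data set on any server can be checked against the same citation.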
Proteomecommons.org Project Manager
• Collaborative resource for management of proteomics projects
• Users track and manage all aspects of their projects.
• Data sets (upload, make public, track downloads, annotate, delete, hide, version)
• Publications (add, link to data sets, track references)
• Groups, Tools, Links, News
• Contains an extensive annotation management tool
• Collaborative annotation
• Management and tracking of annotations
• Dynamic data model
• Populates web page with data links and annotations when published
ProteomeCommons Annotation Interface
• Provides annotations linked to datasets
• Annotations are divided into functional categories
• Responsibilities for annotation categories can be assigned to domain experts on project
• % completion calculated for each category
• Supports annotation standards (MIAPE)
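A per-category completion percentage of the kind listed above could be computed roughly as follows. This is a sketch under stated assumptions, not the ProteomeCommons implementation: the data layout (category name mapped to a dictionary of fields), the rule that `None` and empty strings count as unfilled, and the category and field names in the example are all hypothetical and are not taken from MIAPE.

```python
def completion_by_category(annotations):
    """Map each category to the percentage of its fields that are filled in.

    `annotations` maps category name -> {field name: value or None}.
    Assumption: None and "" count as unfilled; anything else is filled.
    """
    result = {}
    for category, fields in annotations.items():
        if not fields:
            # A category with no fields defined is reported as 0% complete.
            result[category] = 0.0
            continue
        filled = sum(1 for value in fields.values() if value not in (None, ""))
        result[category] = 100.0 * filled / len(fields)
    return result


if __name__ == "__main__":
    example = {
        "Sample": {"organism": "Homo sapiens", "tissue": None},
        "Instrument": {"model": "example model", "vendor": "example vendor"},
    }
    for name, percent in completion_by_category(example).items():
        print(f"{name}: {percent:.0f}% complete")
```

Keeping the calculation per category matches the slide's point that each category can be assigned to a different domain expert, who can then see how much of their own part remains.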
Current Status of Tranche
• In “maintenance” mode.
• Servers are being consolidated and failing servers shut down.
• System will be transferred to NIH/NCRR Center for Computational Mass Spectrometry, UCSD (Nuno Bandeira and Pavel Pevzner)
Acknowledgements
• Proteomecommons/Tranche Development Team:
  – James (Augie) Hill
  – Bryan Smith
  – Mark Gjukich
  – Jayson Falkner (emeritus)
• caBIG mentors
  – Dong Fu
  – Baris Suzek
• caBIG Staff
  – Brian Davis
  – Michael Keller
  – Natasha Sefcovic
Funding
• NCRR (P41 RR018627)
• NCI CPTC project
• State of Michigan (CTA)
ProteomeXchange
• Eric Deutsch
• Ron Beavis
• Lennart Martens
• Henning Hermjakob
• Doug Slotta
• Akhilesh Pandey
Some Views on Data Sharing
• It is critically important to protect IP.
  – A raw data set is not necessarily IP.
  – The correct interpretation of a data set may represent valuable IP.
• Data sets need to be available for reuse and to validate interpretation.
• Making a dataset publicly available is equivalent to publishing (more or less).
• Data set release needs to be more highly valued.