The Tranche data repository: Progress made and lessons learned from 24.4 TB of data, 1,612 users and 12,458 depositions. Philip Andrews, University of Michigan, Center for Computational Medicine and Bioinformatics. Proteome Commons Session sponsored by: Statistical Proteomics Initiative (SPI) of HUPO


TRANSCRIPT

Page 1: The Tranche data repository:

The Tranche data repository: Progress made and lessons learned from 24.4 TB of data, 1,612 users and 12,458 depositions.

Philip Andrews, University of Michigan

Center for Computational Medicine and Bioinformatics

Proteome Commons Session sponsored by: Statistical Proteomics Initiative (SPI) of HUPO

Page 2: The Tranche data repository:

Empty archives - the problem

Most researchers agree that open access to data is the scientific ideal, so what is stopping it happening?

Nature Vol 461 | 10 September 2009

Proteome Commons

Page 3: The Tranche data repository:

“Dark data” is not carefully indexed and stored so it becomes nearly invisible to scientists and other potential users and therefore is more likely to remain underutilized and eventually lost. (Bryan Heidorn, 2008)

Data not in the public domain has less value

Page 4: The Tranche data repository:

If data is in the public domain can/will it actually be utilized?

“[The institutional data repository] is like a roach motel. Data goes in, but it doesn’t come out.”

(Dorothea Salo, 2008)

Page 5: The Tranche data repository:

Why is data sharing an important issue?

• Volume of high value data is increasing.
• Crucial for peer review and critical analysis.
• Economics are driving demand for data reuse.
• Cultural shift: Public access to data collected with public funds.
• Storage and transport getting better and cheaper.
• Improved algorithms require access to data.
• Raises research quality and lowers risk of unethical behavior.

Page 6: The Tranche data repository:

Omics data sets all have different properties (and different solutions are being pursued)

• Microarrays (RNAseq)
• ChIPseq
• Metabolomics
• Proteomics
• Glycomics

Page 7: The Tranche data repository:

The Proteomics data challenge

• Proteomics projects generate large amounts of data at high cost.
  – Data sets can be quite large.
  – File formats proprietary.
  – Reuse of data sets has been limited.
• Proteomics technologies evolve rapidly.
  – Data types and relationships change quickly.
• Annotation has been variable.
• Documentation for peer review.
  – The Paris Guidelines were developed by major proteomics journals and domain experts to address data documentation concerns.

Page 8: The Tranche data repository:

Technical challenges in proteome informatics impact data sharing

• Proteome technologies evolve rapidly.
  – Software always lags behind hardware.
  – Software always lags behind applications.
• Database structures are inadequate: missing data, data quality, complex interactions, changing interactions, new data types, pedigree, etc.

Page 9: The Tranche data repository:

What does Open Access mean for data?

(by imperfect analogy to open access publications)

• Data sets are online
• Free of restrictions (copyright)
• Free of cost (to user)
• No technical barriers to use

Page 10: The Tranche data repository:

Open data requirements

• NIH: Policy for Sharing of Data Obtained in NIH Supported or Conducted Genome-Wide Association Studies (GWAS)
• Final NIH Statement on Sharing Research Data, Release Date: February 26, 2003, NOT-OD-03-032
• Paris guidelines for publication
• Amsterdam accord for data sharing in proteomics
• MCP requirements for publication

Page 11: The Tranche data repository:

What was life like when we started the Tranche repository?

• Real concerns about ability of proteomics to deliver quality data (resulted in Paris guidelines for publication).
• Recognition that reproducibility across laboratories for identical experiments required access to raw data.
• Expectation that few investigators would share raw data.
• No turnkey way to share data existed.
• Large number of proprietary data formats.

Page 12: The Tranche data repository:

Selected data sharing issues

• How do we encourage researchers to share data?
• How should citations, authorship, intellectual property, etc. be handled?
• What are the legal constraints?
• Who should pay for infrastructure and maintenance?
• Is access to the data sets alone sufficient?

Page 13: The Tranche data repository:

Why do investigators not want to share their data?

• Competitive edge (funding process)
• All the normal reasons primates don’t share:
  – What if my data sets are not as good as everyone else’s? (data insecurity)
  – What if I missed something obvious? (also data insecurity)
  – It’s a valuable dataset. (data hoarding)
  – It’s mine! (emotional response, greed)
  – It’s too hard. (laziness, priorities)
  – It gives me no advantage.

Page 14: The Tranche data repository:

Why do investigators want to share their data?

• All the reasons primates want to share:
  – Because it’s the right thing to do (altruism)
  – Look how great my data sets are (boasting)
  – More eyes are better and I will get value back (enlightened self-interest)
  – Improved tools will be developed to analyze data.
  – Larger datasets become available that they can use (if you share, then others are more likely to share).
  – New perspectives gained from colleagues.
  – Reputation enhanced as good citizen and colleague, etc.

Page 15: The Tranche data repository:

What can be done to promote data sharing?

• Provide better infrastructure (lower the energy barrier).
• Provide positive incentives.
• Value data set release more highly.
  – Role for funding agencies and journals.
• Allow initial author of data to retain some control over when, where, how.
• Address and possibly modify scientific mores on citation and authorship.
• Modify community views on interpretations and errors.

Page 16: The Tranche data repository:

Should data sets be copyrighted?

• What are the limits on primary use? Secondary use?
• Should we (scientists) retain moral rights?
• What are the consequences if all rights are waived?
• What happens to intellectual property rights?

Page 17: The Tranche data repository:

The Tranche Data Repository

• Distributed data repository (not a database)
• Allows secure upload and dissemination of files
• Two-stage release (encrypted, public)
• Links well to other resources
• Data provenance and fidelity are inherent to system
• Key clients were researchers and publishers
• Closely integrated with Proteomecommons.org

http://www.trancheproject.org

Page 18: The Tranche data repository:

Proteomecommons.org

• Online project management based on a social network model
• Allows annotation standards to be applied
• Provides linkage of data sets to publications
• Primarily used for collaborative projects

http://www.proteomecommons.org/

Page 19: The Tranche data repository:

Tranche is a cloud storage system…..

…..linked to the Proteomecommons.org resource

Page 20: The Tranche data repository:

Tranche system current stats

• Tranche citation codes have appeared in a large number of journals
• 1,612 registered users
• 12,458 depositions representing approximately a billion mass spectra (24.4 TB in total)
• Average data chunk has over three replications
• Cutting-edge data sets are available for public access as soon as manuscripts are accepted
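The figures on this slide imply some rough averages. The following is only a back-of-the-envelope sketch: it assumes decimal terabytes, treats "approximately a billion" spectra as exactly 1e9, and reads "over three replications" as a lower bound of three copies. None of the derived numbers appear in the original slides.

```python
# Rough averages derived from the slide's headline figures.
TOTAL_BYTES = 24.4e12   # 24.4 TB, assuming decimal terabytes (assumption)
DEPOSITIONS = 12_458
SPECTRA = 1e9           # "approximately a billion" taken as 1e9 (assumption)
MIN_REPLICAS = 3        # "over three replications" taken as a lower bound

# Average size of one deposition, in gigabytes.
avg_deposition_gb = TOTAL_BYTES / DEPOSITIONS / 1e9

# Average number of spectra per deposition.
avg_spectra_per_deposition = SPECTRA / DEPOSITIONS

# Average bytes of deposited data per spectrum (includes any non-spectrum files).
avg_bytes_per_spectrum = TOTAL_BYTES / SPECTRA

# Lower bound on raw storage across the network, in terabytes.
min_raw_storage_tb = TOTAL_BYTES * MIN_REPLICAS / 1e12

print(f"Average deposition size: {avg_deposition_gb:.2f} GB")
print(f"Average spectra per deposition: {avg_spectra_per_deposition:,.0f}")
print(f"Average bytes per spectrum: {avg_bytes_per_spectrum:,.0f}")
print(f"Raw storage with replication: at least {min_raw_storage_tb:.1f} TB")
```

Under these assumptions a typical deposition is on the order of 2 GB, and the replicated network holds at least three times the logical data volume.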


Page 21: The Tranche data repository:

Tranche Stats (as of 2010)

• Fewer than 40 of the 9,638 data sets in Tranche invoke a restrictive data license!
• 245 out of > 700 registered users are from the US.
• Europe is a major user, with Asia and Canada representing other major users.
• About 20% of data in Tranche is encrypted and waiting for manuscript publication; this fraction has remained fairly constant.

Page 22: The Tranche data repository:

Tranche  Features  

• Registration required for data upload (provenance) but not download
• Two status levels for data
  – Encrypted (PI has password)
  – Public (for public dissemination)
• Hash code is used for citation
• Data license embedded in file with provenance info
• At time of publishing, an HTML data page is generated and uniquely indexed by Google
• Public data mined by TheGPMdb, PeptideAtlas, etc.
• Can link data sets to your publications
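Using a hash code as the citation means the identifier is derived from the data itself, so anyone who downloads a cited data set can check that it is byte-for-byte what was deposited. The slides do not say which hash function Tranche uses; the sketch below assumes SHA-256 purely for illustration, and `content_hash` and `matches_citation` are hypothetical helper names, not part of any Tranche API.

```python
import hashlib
import io


def content_hash(stream, block_size=1 << 20):
    """Hash a binary stream in fixed-size blocks (SHA-256 is an assumption)."""
    digest = hashlib.sha256()
    # Read until the stream returns an empty bytes object.
    for block in iter(lambda: stream.read(block_size), b""):
        digest.update(block)
    return digest.hexdigest()


def matches_citation(stream, cited_hash):
    """Return True if the stream's content hashes to the cited code."""
    return content_hash(stream) == cited_hash


# Hypothetical payload standing in for a deposited data set.
deposited = b"example spectra payload"
citation_code = content_hash(io.BytesIO(deposited))

# An unmodified copy verifies; a copy with one extra byte does not.
print(matches_citation(io.BytesIO(deposited), citation_code))
print(matches_citation(io.BytesIO(deposited + b"!"), citation_code))
```

Because the code depends only on content, the same check works regardless of which server in a distributed repository supplied the file.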

Page 23: The Tranche data repository:

Proteomecommons.org Project Manager

• Collaborative resource for management of proteomics projects
• Users track and manage all aspects of their projects.
• Data sets (upload, make public, track downloads, annotate, delete, hide, version)
• Publications (add, link to data sets, track references)
• Groups, Tools, Links, News
• Contains an extensive annotation management tool
• Collaborative annotation
• Management and tracking of annotations
• Dynamic data model
• Populates web page with data links and annotations when published

Page 24: The Tranche data repository:

ProteomeCommons Annotation Interface

• Provides annotations linked to datasets
• Annotations are divided into functional categories
• Responsibilities for annotation categories can be assigned to domain experts on project
• % completion calculated for each category
• Supports annotation standards (MIAPE)
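The per-category completion figure can be pictured as the share of annotation fields in a category that have been filled in. The slides do not describe how ProteomeCommons computes it, so the sketch below is one plausible reading; the function name, the category names, and the field names are all hypothetical.

```python
def completion_by_category(annotations):
    """Return percent of non-empty fields for each annotation category.

    `annotations` maps a category name to a dict of field -> value,
    where None or an empty string means the field is not yet annotated.
    """
    result = {}
    for category, fields in annotations.items():
        if not fields:
            # A category with no fields has nothing to complete.
            result[category] = 0.0
            continue
        filled = sum(1 for value in fields.values() if value not in (None, ""))
        result[category] = 100.0 * filled / len(fields)
    return result


# Hypothetical annotation state for one data set.
example = {
    "sample": {"organism": "value-a", "tissue": None},
    "instrument": {"model": "value-b", "ion_source": "value-c", "analyzer": ""},
}

for category, percent in completion_by_category(example).items():
    print(f"{category}: {percent:.1f}% complete")
```

Tracking completion per category rather than per data set fits the slide's point that each category can be assigned to a different domain expert.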


Page 25: The Tranche data repository:

Current Status of Tranche

• In “maintenance” mode.
• Servers are being consolidated and failing servers shut down.
• System will be transferred to NIH/NCRR Center for Computational Mass Spectrometry, UCSD (Nuno Bandeira and Pavel Pevzner)

Page 26: The Tranche data repository:

Acknowledgements

• Proteomecommons/Tranche Development Team:
  – James (Augie) Hill
  – Bryan Smith
  – Mark Gjukich
  – Jayson Falkner (emeritus)
• caBIG mentors
  – Dong Fu
  – Baris Suzek
• caBIG Staff
  – Brian Davis
  – Michael Keller
  – Natasha Sefcovic

Funding: NCRR (P41 RR018627), NCI CPTC project, State of Michigan (CTA)

ProteomeXchange: Eric Deutsch, Ron Beavis, Lennart Martens, Henning Hermjakob, Doug Slotta, Akhilesh Pandey

Page 27: The Tranche data repository:

Some Views on Data Sharing

• It is critically important to protect IP.
  – A raw data set is not necessarily IP.
  – The correct interpretation of a data set may represent valuable IP.
• Data sets need to be available for reuse and to validate interpretation.
• Making a dataset publicly available is equivalent to publishing (more or less).
• Data set release needs to be more highly valued.