2014 Genome Informatics: Linked Data


Upload: encode-dcc

Post on 07-Jul-2015


DESCRIPTION

Poster at 2014 Genome Informatics describing linked data of file provenance.

TRANSCRIPT

Page 1: 2014 Genome Informatics Linked Data

Metadata-driven tools to access and reproduce ENCODE data and pipelines

Venkat S Malladi1, Esther T Chan1, Ben C Hitz1, Eurie L Hong1, J Seth Strattan1, Timothy R. Dreszer2, Laurence D Rowe1, Cricket A Sloan1, Nikhil R Podduturi1, Morgan Maddren2, Stuart Miyasato1, Matt Simison1, W James Kent2, J Michael Cherry1

1Stanford University School of Medicine, Department of Genetics, Stanford, CA; 2University of California at Santa Cruz, Center for Biomolecular Science and Engineering, Santa Cruz, CA

The Encyclopedia of DNA Elements (ENCODE) project is a collaborative effort to create a comprehensive catalog of functional elements in the human and mouse genomes. Now in its 9th year, ENCODE has grown to include more than 40 experimental techniques to survey DNA-binding proteins, RNA-binding proteins, the transcriptional landscape, and chromatin structure in 400+ cell lines and tissues. All experimental data and computational analyses of these data are submitted to the Data Coordination Center (DCC) for validation, tracking, storage, and distribution to community resources and the scientific community. As the volume of data increases, the accessibility and reproducibility of data become key challenges. Here, we present our implementation of a scalable, metadata-driven analysis tracking system and representational state transfer application programming interface (REST API) to access ENCODE data files and metadata. The metadata are stored in a structured data model and annotated to ensure data provenance. In addition, the breadth of metadata describing the pipelines, software, and analysis steps supports the reproducibility of ENCODE analysis standards and promotes easy application of these analyses to other data. Along with currently enabling sharing and coordination of ongoing production within the ENCODE consortium, we believe the REST API and metadata can also be used by the larger genomics community to facilitate further analysis of ENCODE data, as well as integration of their own data with this and other collaborative projects that adopt these increasingly utilized standards. Data from the ENCODE project can be accessed via the ENCODE portal (http://www.encodeproject.org), and documentation for the REST API is available at https://www.encodeproject.org/help/rest-api.

@ENCODEDCC

[email protected]

ENCODE DCC: https://www.encodeproject.org

[Panel titles: Pipeline Metadata · QA Metrics]

REST API
• Query string is a URL
• Metadata returned in JSON format
• File download
• Documentation: https://www.encodeproject.org/help/rest-api
• Sample code: https://github.com/ENCODE-DCC/submission_sample_scripts
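Because the query string is itself the query and results come back as JSON, the REST API can be driven from a few lines of standard-library Python. The snippet below is a minimal sketch: the `/search/` endpoint, the `format=json` parameter, and the `type`/`assay_term_name` filters follow the portal's documented query style, but the specific example query is illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE = "https://www.encodeproject.org"

def search_url(**params):
    # The query string itself is the query: each key=value pair narrows
    # the search, and format=json asks for JSON instead of an HTML page.
    params.setdefault("format", "json")
    return BASE + "/search/?" + urlencode(params)

def get_json(url):
    # GET the URL with a JSON Accept header and decode the response
    # (requires network access to the ENCODE portal).
    req = Request(url, headers={"Accept": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)

url = search_url(type="Experiment", assay_term_name="ChIP-seq")
# get_json(url)["@graph"] would hold the matching experiment records.
```

The same pattern works for any object type on the portal; only the query parameters change.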

I. Principles driving metadata definition

• Track how an analysis was done
• Communicate key assumptions and purpose
• Provide easily accessible quality metrics and analysis standards

Transparency · Reproducibility · Provenance

• Can we recapitulate the analysis of all ENCODE data X years from now?
• Can someone rerun the same analysis and get the same results?
• What files were generated using software X?
• What files (e.g. fastqs, assembly) were used to generate file Y?
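The provenance questions above reduce to walking links between file records. As a minimal sketch, assume each file record carries a `derived_from` list naming its input files (the field name used by ENCODE file objects; the toy records here are invented):

```python
# Toy file records keyed by name; 'derived_from' points at the files
# each one was generated from (invented example data).
files = {
    "fileY.peaks": {"derived_from": ["fileX.bam"]},
    "fileX.bam": {"derived_from": ["file1.fastq", "GRCh38.assembly"]},
    "file1.fastq": {"derived_from": []},
    "GRCh38.assembly": {"derived_from": []},
}

def provenance(name):
    # Walk derived_from links recursively: everything the named file
    # was ultimately generated from, in discovery order.
    out = []
    for parent in files[name]["derived_from"]:
        out.append(parent)
        out.extend(provenance(parent))
    return out

provenance("fileY.peaks")
# → ["fileX.bam", "file1.fastq", "GRCh38.assembly"]
```

The same traversal run in the other direction answers "what files were generated using software X" once steps record their software.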

II. Capturing Metadata in Objects

FILE
• Format (e.g. fastq, bam)
• md5sum
• Assembly
• File size

SOFTWARE
• Name
• Source URL
• Bug tracker
• References
• Version

ANALYSIS STEP
• Software
• Input files
• Output files
• QA metrics

Rich metadata are captured in objects detailing important pipeline variables that reflect the discrete steps of the process. Object attribute values are expressed using ontologies, controlled vocabularies, and restricted formats wherever possible to promote consistency and interoperability.

III. Relationships between metadata objects reflect underlying experimental processes
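The three object types in section II can be sketched as plain data classes. This is an illustrative shape only, not the actual ENCODE schema; field names follow the bullet lists above, and the example values are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class File:
    format: str        # e.g. fastq, bam
    md5sum: str
    assembly: str
    file_size: int

@dataclass
class Software:
    name: str
    source_url: str
    bug_tracker: str
    references: List[str]
    version: str

@dataclass
class AnalysisStep:
    software: Software
    input_files: List[File]
    output_files: List[File]
    qa_metrics: Dict[str, float] = field(default_factory=dict)
```

Because steps reference whole File and Software objects rather than bare names, the relationships described in section III can be traversed directly from the metadata.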

Primary Data (fastq)

[Pipeline diagram: fastq primary data enter each pipeline as File 1. Pipeline 1 and Pipeline 2 each chain analysis steps (Step 1 through Step 3 or 4); every step links the software it runs (Software 1-4) to its input and output files (File 1-4) and to QA Metrics, so each output file's provenance traces back to the original fastq.]
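The chaining in the diagram can be sketched in a few lines: each step records its software, inputs, outputs, and a QA metric, and the next step consumes the previous step's output files, mirroring the analysis step objects of section II. Step names, software names, and the QA metric here are all illustrative.

```python
def run_step(name, software, input_files, produce):
    # Record software, inputs, outputs and a toy QA metric for one step;
    # `produce` stands in for the real computation.
    output = produce(input_files)
    return {"step": name, "software": software,
            "input_files": list(input_files),
            "output_files": [output],
            "qa_metrics": {"n_inputs": len(input_files)}}

# Chain two steps: the mapping step's output bam feeds the peak caller.
s1 = run_step("mapping", "Software 1", ["file1.fastq"],
              lambda ins: "file2.bam")
s2 = run_step("peak calling", "Software 2", s1["output_files"],
              lambda ins: "file3.peaks")
```

Serializing these step records is exactly the pipeline metadata the DCC stores, which is what makes a run reproducible later.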

Next Steps

• Use GPUs to accelerate pipelines (see poster #)
• Publicly available, open-source ENCODE pipelines on GitHub (https://github.com/ENCODE-DCC)

[Data panel: Primary Data (fastq), Mapped Reads (bam), Uniform Peak Calls, with associated QA Metrics.]