2014 genome informatics linked data
DESCRIPTION
Poster at 2014 Genome Informatics describing linked data of file provenance.

TRANSCRIPT
Metadata-driven tools to access and reproduce ENCODE data and pipelines
Venkat S Malladi1, Esther T Chan1, Ben C Hitz1, Eurie L Hong1, J Seth Strattan1, Timothy R. Dreszer2, Laurence D Rowe1, Cricket A Sloan1, Nikhil R Podduturi1, Morgan Maddren2, Stuart Miyasato1, Matt Simison1, W James Kent2, J Michael Cherry1
1Stanford University School of Medicine, Department of Genetics, Stanford, CA; 2University of California at Santa Cruz, Center for Biomolecular Science and Engineering, Santa Cruz, CA
The Encyclopedia of DNA Elements (ENCODE) project is a collaborative effort to create a comprehensive catalog of functional elements in the human and mouse genomes. Now in its 9th year, ENCODE has grown to include more than 40 experimental techniques to survey DNA-binding proteins, RNA-binding proteins, the transcriptional landscape, and chromatin structure in 400+ cell lines and tissues. All experimental data and computational analyses of these data are submitted to the Data Coordination Center (DCC) for validation, tracking, storage, and distribution to community resources and the scientific community. As the volume of data increases, the accessibility and reproducibility of the data become key challenges. Here, we present our implementation of a scalable, metadata-driven analysis tracking system and representational state transfer application programming interface (REST API) to access ENCODE data files and metadata. The metadata are stored in a structured data model and annotated to ensure data provenance. In addition, the breadth of metadata describing the pipelines, software, and analysis steps supports reproducibility of ENCODE analysis standards and promotes easy application of these analyses to other data. Along with currently enabling sharing and coordination of ongoing production within the ENCODE consortium, we believe the REST API and metadata can also be used by the larger genomics community to facilitate further analysis of ENCODE data, as well as integration of their own data with this and other collaborative projects that adopt these increasingly utilized standards. Data from the ENCODE project can be accessed via the ENCODE portal (http://www.encodeproject.org) and documentation for the REST API can be accessed at https://www.encodeproject.org/help/rest-api.
@ENCODE-DCC
[email protected]
ENCODE DCC https://www.encodeproject.org
Pipeline Metadata
QA Metrics
Metadata returned in JSON format
Sample code https://github.com/ENCODE-DCC/submission_sample_scripts
Query string is a URL
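Because every query string is itself a URL, searches against the portal can be assembled programmatically. A minimal sketch in Python (the `type` and `format` parameters follow the portal's query syntax; the specific search values below are illustrative, not taken from the poster):

```python
from urllib.parse import urlencode

BASE = "https://www.encodeproject.org"

def search_url(**params):
    """Build an ENCODE portal search URL.

    format=json asks the portal for metadata rather than the HTML page.
    """
    params.setdefault("format", "json")
    return f"{BASE}/search/?{urlencode(params)}"

# Example: all ChIP-seq experiments (assay name is illustrative).
url = search_url(type="Experiment", assay_term_name="ChIP-seq")
print(url)
```

The same URL pasted into a browser returns the human-readable search page; with `format=json` it returns the machine-readable metadata.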
ENCODE REST API Documentation https://www.encodeproject.org/help/rest-api
REST API
File Download
Next Steps
I. Principles driving metadata definition
• Track how an analysis was done
• Communicate key assumptions and purpose
• Provide easily accessible quality metrics and analysis standards
Transparency Reproducibility Provenance
• Can we recapitulate the analysis of all ENCODE data X years from now?
• Can someone rerun the same analysis and get the same results?
• What files were generated using software X?
• What files (e.g. fastqs, assembly) were used to generate file Y?
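The provenance questions above reduce to walking links between file metadata objects. A sketch of answering "what files were used to generate file Y?", assuming files reference their inputs through a `derived_from` list (the accessions and objects below are hypothetical stand-ins for JSON returned by the portal):

```python
# Hypothetical file objects keyed by accession, mimicking portal JSON.
files = {
    "ENCFF003": {"accession": "ENCFF003", "derived_from": ["ENCFF002"]},
    "ENCFF002": {"accession": "ENCFF002", "derived_from": ["ENCFF001"]},
    "ENCFF001": {"accession": "ENCFF001", "derived_from": []},  # e.g. a fastq
}

def provenance(accession):
    """Return every upstream file that an accession ultimately derives from."""
    seen, stack = [], list(files[accession]["derived_from"])
    while stack:
        acc = stack.pop()
        if acc not in seen:
            seen.append(acc)
            stack.extend(files[acc]["derived_from"])
    return seen

print(provenance("ENCFF003"))  # ['ENCFF002', 'ENCFF001']
```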
II. Capturing Metadata in Objects
• Format (e.g. fastq, bam)
• md5sum
• Assembly
• File size
FILE
III. Relationships between metadata objects reflect underlying experimental processes
FILES
• Name
• Source URL
• Bug Tracker
• References
• Version
SOFTWARE
• Software
• Input files
• Output files
• QA Metrics
Analysis Steps
Steps
Rich metadata are captured into objects detailing important pipeline variables that reflect the discrete steps of the process. Object attribute values are expressed using ontologies, controlled vocabularies and restricted formats wherever possible to promote consistency and interoperability.
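Each such object is retrievable as JSON through the REST API. A sketch of fetching one object with only the standard library, assuming the portal's documented `format=json` parameter (the accession shown is a placeholder, not a real file):

```python
import json
from urllib.request import Request, urlopen

BASE = "https://www.encodeproject.org"

def object_url(path):
    """URL that returns the metadata object at `path` as JSON."""
    return f"{BASE}{path}?format=json"

def get_object(path):
    """Fetch one metadata object from the portal (performs a network call)."""
    req = Request(object_url(path), headers={"Accept": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)

# e.g. get_object("/files/ENCFF000AAA/") would return a FILE object whose
# keys include the attributes above (file format, md5sum, assembly, size).
```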
[Figure: Two example pipelines drawn as graphs. Primary data (fastq) flows through analysis steps (Step 1 through Step 4); each step runs a Software object (Software 1 through Software 4), consumes input files, and produces output files (File 1 through File 4) with associated QA Metrics.]
• Use GPU to accelerate pipelines (see poster # )
• Publicly available, open-source ENCODE pipelines through GitHub (https://github.com/ENCODE-DCC)
[Figure: Data flow from primary data (fastq) to mapped reads (bam) to uniform peak calls, with QA Metrics associated at each stage.]