2014 Genome Informatics: Linked Data


Upload: encode-dcc

Post on 07-Jul-2015


DESCRIPTION

Poster at 2014 Genome Informatics describing linked data of file provenance.

TRANSCRIPT

Page 1: 2014 Genome Informatics Linked Data

Metadata-driven tools to access and reproduce ENCODE data and pipelines

Venkat S Malladi1, Esther T Chan1, Ben C Hitz1, Eurie L Hong1, J Seth Strattan1, Timothy R. Dreszer2, Laurence D Rowe1, Cricket A Sloan1, Nikhil R Podduturi1, Morgan Maddren2, Stuart Miyasato1, Matt Simison1, W James Kent2, J Michael Cherry1

1Stanford University School of Medicine, Department of Genetics, Stanford, CA; 2University of California at Santa Cruz, Center for Biomolecular Science and Engineering, Santa Cruz, CA

The Encyclopedia of DNA Elements (ENCODE) project is a collaborative effort to create a comprehensive catalog of functional elements in the human and mouse genomes. Now in its 9th year, ENCODE has grown to include more than 40 experimental techniques to survey DNA-binding proteins, RNA-binding proteins, the transcriptional landscape, and chromatin structure in 400+ cell lines and tissues. All experimental data and computational analyses of these data are submitted to the Data Coordination Center (DCC) for validation, tracking, storage, and distribution to community resources and the scientific community. As the volume of data increases, the accessibility and reproducibility of data become key challenges. Here, we present our implementation of a scalable, metadata-driven analysis tracking system and representational state transfer application programming interface (REST API) to access ENCODE data files and metadata. The metadata are stored in a structured data model and annotated to ensure data provenance. In addition, the breadth of metadata describing the pipelines, software, and analysis steps supports the reproducibility of ENCODE analysis standards and promotes easy application of these analyses to other data. Along with currently enabling sharing and coordination of ongoing production within the ENCODE consortium, we believe the REST API and metadata can also be used by the larger genomics community to facilitate further analysis of ENCODE data, as well as integration of their own data with this and other collaborative projects that adopt these increasingly utilized standards. Data from the ENCODE project can be accessed via the ENCODE portal (http://www.encodeproject.org), and documentation for the REST API is available at https://www.encodeproject.org/help/rest-api.

@ENCODEDCC

[email protected]

ENCODE DCC: https://www.encodeproject.org

[Panel titles: Pipeline Metadata · QA Metrics]

REST API
• Query string is a URL
• Metadata returned in JSON format
• File download
• Documentation: https://www.encodeproject.org/help/rest-api
• Sample code: https://github.com/ENCODE-DCC/submission_sample_scripts
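Because the query string is itself the query and results come back as JSON, the REST API can be driven from a few lines of standard-library Python. The snippet below is a minimal sketch: the `/search/` endpoint, the `format=json` parameter, and the `type`/`assay_term_name` filters follow the portal's documented query style, but the specific example query is illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE = "https://www.encodeproject.org"

def search_url(**params):
    # The query string itself is the query: each key=value pair narrows
    # the search, and format=json asks for JSON instead of an HTML page.
    params.setdefault("format", "json")
    return BASE + "/search/?" + urlencode(params)

def get_json(url):
    # GET the URL with a JSON Accept header and decode the response
    # (requires network access to the ENCODE portal).
    req = Request(url, headers={"Accept": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)

url = search_url(type="Experiment", assay_term_name="ChIP-seq")
# get_json(url)["@graph"] would hold the matching experiment records.
```

The same pattern works for any object type on the portal; only the query parameters change.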

I. Principles driving metadata definition

• Track how an analysis was done
• Communicate key assumptions and purpose
• Provide easily accessible quality metrics and analysis standards

Transparency · Reproducibility · Provenance

• Can we recapitulate the analysis of all ENCODE data X years from now?
• Can someone rerun the same analysis and get the same results?
• What files were generated using software X?
• What files (e.g. fastqs, assembly) were used to generate file Y?
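The provenance questions above reduce to walking links between file records. As a minimal sketch, assume each file record carries a `derived_from` list naming its input files (the field name used by ENCODE file objects; the toy records here are invented):

```python
# Toy file records keyed by name; 'derived_from' points at the files
# each one was generated from (invented example data).
files = {
    "fileY.peaks": {"derived_from": ["fileX.bam"]},
    "fileX.bam": {"derived_from": ["file1.fastq", "GRCh38.assembly"]},
    "file1.fastq": {"derived_from": []},
    "GRCh38.assembly": {"derived_from": []},
}

def provenance(name):
    # Walk derived_from links recursively: everything the named file
    # was ultimately generated from, in discovery order.
    out = []
    for parent in files[name]["derived_from"]:
        out.append(parent)
        out.extend(provenance(parent))
    return out

provenance("fileY.peaks")
# → ["fileX.bam", "file1.fastq", "GRCh38.assembly"]
```

The same traversal run in the other direction answers "what files were generated using software X" once steps record their software.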

II. Capturing Metadata in Objects

FILE
• Format (e.g. fastq, bam)
• md5sum
• Assembly
• File size

SOFTWARE
• Name
• Source URL
• Bug tracker
• References
• Version

ANALYSIS STEP
• Software
• Input files
• Output files
• QA metrics

Rich metadata are captured in objects detailing important pipeline variables that reflect the discrete steps of the process. Object attribute values are expressed using ontologies, controlled vocabularies, and restricted formats wherever possible to promote consistency and interoperability.

III. Relationships between metadata objects reflect underlying experimental processes
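The three object types in section II can be sketched as plain data classes. This is an illustrative shape only, not the actual ENCODE schema; field names follow the bullet lists above, and the example values are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class File:
    format: str        # e.g. fastq, bam
    md5sum: str
    assembly: str
    file_size: int

@dataclass
class Software:
    name: str
    source_url: str
    bug_tracker: str
    references: List[str]
    version: str

@dataclass
class AnalysisStep:
    software: Software
    input_files: List[File]
    output_files: List[File]
    qa_metrics: Dict[str, float] = field(default_factory=dict)
```

Because steps reference whole File and Software objects rather than bare names, the relationships described in section III can be traversed directly from the metadata.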

Primary Data (fastq)

[Pipeline diagram: fastq primary data enter each pipeline as File 1. Pipeline 1 and Pipeline 2 each chain analysis steps (Step 1 through Step 3 or 4); every step links the software it runs (Software 1-4) to its input and output files (File 1-4) and to QA Metrics, so each output file's provenance traces back to the original fastq.]
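The chaining in the diagram can be sketched in a few lines: each step records its software, inputs, outputs, and a QA metric, and the next step consumes the previous step's output files, mirroring the analysis step objects of section II. Step names, software names, and the QA metric here are all illustrative.

```python
def run_step(name, software, input_files, produce):
    # Record software, inputs, outputs and a toy QA metric for one step;
    # `produce` stands in for the real computation.
    output = produce(input_files)
    return {"step": name, "software": software,
            "input_files": list(input_files),
            "output_files": [output],
            "qa_metrics": {"n_inputs": len(input_files)}}

# Chain two steps: the mapping step's output bam feeds the peak caller.
s1 = run_step("mapping", "Software 1", ["file1.fastq"],
              lambda ins: "file2.bam")
s2 = run_step("peak calling", "Software 2", s1["output_files"],
              lambda ins: "file3.peaks")
```

Serializing these step records is exactly the pipeline metadata the DCC stores, which is what makes a run reproducible later.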

Next Steps

• Use GPUs to accelerate pipelines (see poster #)
• Publicly available, open-source ENCODE pipelines on GitHub (https://github.com/ENCODE-DCC)

[Data panel: Primary Data (fastq), Mapped Reads (bam), Uniform Peak Calls, with associated QA Metrics.]