Automation of Biological Data Analysis and Report Generation

Dmitry Grapov, PhD


Post on 10-May-2015


DESCRIPTION

I've been experimenting with automating simple and complex data analysis and report generation tasks for biological data, mostly using R and LaTeX. These slides show some of my progress and the challenges encountered.

TRANSCRIPT

Page 1: Automation of (Biological) Data Analysis and Report Generation

Automation of Biological Data Analysis and Report Generation

Dmitry Grapov, PhD

Page 2: Automation of (Biological) Data Analysis and Report Generation

Bots write the darndest things

http://www.latimes.com/local/lanow/earthquake-27-quake-strikes-near-westwood-california-rdivor,0,3229825.story#axzz2wQwc82EK

• fill in the template (easy)

• human-guided automation (e.g. Metaboanalyst, intermediate)

• intelligent/reactive writing (e.g. ~AI, advanced)
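The "fill in the template" level can be sketched in a few lines. This is a minimal illustration in Python rather than the R and LaTeX tooling used in the talk; the template wording and the field names (`n_samples`, `n_variables`, `n_significant`, `alpha`) are hypothetical.

```python
from string import Template

# Hypothetical report template; the field names are illustrative only.
REPORT = Template(
    "A total of $n_samples samples and $n_variables variables were analyzed. "
    "$n_significant variables differed between groups (adjusted p < $alpha)."
)

def fill_report(n_samples, n_variables, n_significant, alpha=0.05):
    """Return the report sentence with the computed values substituted in."""
    return REPORT.substitute(
        n_samples=n_samples,
        n_variables=n_variables,
        n_significant=n_significant,
        alpha=alpha,
    )

print(fill_report(24, 310, 17))
```

The numbers would normally come from the analysis itself, so the text and the results cannot drift apart.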

http://narrativescience.com/

Page 3: Automation of (Biological) Data Analysis and Report Generation

Humans + Bots

Interaction:

• Bots and humans combine in guided analyses

• Humans: make choices (based on bot guides)

• Bots: automate!

Facilitate:

• workflow logging and template creation

• reproducible results

Bot: Initial data and meta data parsing and quality validation

(need: template input)

Human: data cleaning and experimental design identification

(use: multiple choice, dynamic GUI)

Bot: instantiation of complex workflows

Human: overview of bot assumptions and results

Bot: Numerical and text output generation

Page 4: Automation of (Biological) Data Analysis and Report Generation

Humans + Bots write darnder things?

Choose Your Own Life Adventure!

https://github.com/dgrapov/AdventureR

Page 5: Automation of (Biological) Data Analysis and Report Generation

Data Analysis Tasks

Visualization (how does it look?)

• histograms, density plots, box plots, line plots, scatter plots, networks, etc.

Statistical Analysis (what is statistically significant?)

• summary tables, ANOVA, FDR adjustment, power analysis, etc.

Exploration (what are the major patterns/trends?)

• clustering, PCA, ICA, etc.

Predictive Modeling (what explains my hypothesis?)

• mixed effects, partial least squares (O-/PLS/-DA), etc.

Network Analysis and Mapping (how are things related?)

• Functional analysis: pathway enrichment or overrepresentation

• Networks: biochemical, structural, mass spectral and empirical networks

• Mapping: projection of analysis results onto network
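As one concrete instance of the statistical tasks listed above, FDR adjustment can be sketched with the Benjamini-Hochberg procedure. The slide does not say which FDR method is meant, so the choice of Benjamini-Hochberg is an assumption, and the sketch is in Python for illustration rather than the R used in the talk.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

Each raw p-value is scaled by the number of tests divided by its rank, then capped by the adjusted value of the next larger p-value.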

Page 6: Automation of (Biological) Data Analysis and Report Generation

WCMC Data Analysis Reports ™

• Statistical analysis

• Clustering

• PCA

• O-PLS-DA

• Biochemical enrichment

• Network mapping

Input template: BinBase

• inference of experimental goals from sample meta data

• mapping variables to external databases

Tasks:

Report:

Tools:

Page 7: Automation of (Biological) Data Analysis and Report Generation

Automation Challenges

Data cleaning and quality validation

• use: quality control samples; identify: precision/accuracy, normalization, batch corrections; mitigate: outliers, missing values, batch effects, etc.
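One way to use quality control samples to judge precision is the relative standard deviation (RSD) of each variable across repeated QC injections. The sketch below is in Python for illustration; the function names, the table layout and the 30% cutoff are all assumptions, not values taken from the talk.

```python
from statistics import mean, stdev

def qc_rsd(values):
    """Relative standard deviation (%) of one variable across QC samples."""
    return 100.0 * stdev(values) / mean(values)

def flag_imprecise(qc_table, max_rsd=30.0):
    """Return variable names whose QC RSD exceeds max_rsd (assumed cutoff)."""
    return [name for name, values in qc_table.items() if qc_rsd(values) > max_rsd]

# Hypothetical QC measurements: one list of intensities per variable.
qc_table = {
    "glucose": [100, 102, 98, 101],
    "noisy": [10, 40, 5, 60],
}
print(flag_imprecise(qc_table))
```

Flagged variables are candidates for human review rather than automatic removal, which fits the human-guided approach described in the slides.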

Identification of experimental goals

• use: meta data, identify: main and accessory effects; choose: statistics, multivariate tests and visualizations

Integration of multiple tasks to evolve robust analyses

• tasks: statistics, multivariate, functional, networks, database mapping, etc.

Data analysis report generation

• use: R, LaTeX, markdown


Page 8: Automation of (Biological) Data Analysis and Report Generation

Challenges to automated metabolite ID mapping

Stereochemistry?

Search: catechin

Best Match: Catechin

Biologically relevant:

D-catechin

Synonyms?

Search: UDP GlcNAc

FAIL: UDP GlcNac

PASS: UDP-GlcNac

Page 9: Automation of (Biological) Data Analysis and Report Generation

Strategies for automated metabolite ID mapping (from synonym)

#1: CTS+

• Use CTS to translate from synonyms to KEGG (KID) and PubChem (CID)

• Use KEGGREST and PUG to filter and choose most appropriate IDs

• Use fuzzy matching and word similarity metrics (e.g. Damerau–Levenshtein distance)

#2: Web query

• Use KEGGREST + PubChem PUG to translate synonyms to IDs

• For KEGG ID: synonym → SID → KID

#3: Curated DB

• Generate a curated DB for KEGG and CID translations +

• Include InChI Keys

• Map to other DBs

• Allow fuzzy matching on synonyms

• e.g. IDEOM http://bioinformatics.oxfordjournals.org/content/early/2012/02/04/bioinformatics.bts069
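Fuzzy matching on synonyms can be sketched as follows. This Python illustration uses the restricted form of Damerau–Levenshtein distance (optimal string alignment), which is an assumption since the slides do not specify a variant. The helper names, the normalization rule and the `max_distance` cutoff are hypothetical, and no database is queried.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

def normalize(name):
    """Lowercase and keep only letters and digits (drops spaces, hyphens)."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def best_match(query, synonyms, max_distance=2):
    """Return the closest synonym, or None if nothing is within max_distance."""
    q = normalize(query)
    scored = [(osa_distance(q, normalize(s)), s) for s in synonyms]
    distance, match = min(scored)
    return match if distance <= max_distance else None

print(best_match("UDP GlcNAc", ["UDP-GlcNAc", "UDP-glucose", "Catechin"]))
```

Normalizing before comparing handles the space-versus-hyphen failure from the previous slide. String similarity alone does not address the stereochemistry problem, where the closest name is not necessarily the biologically relevant compound.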

Page 10: Automation of (Biological) Data Analysis and Report Generation

Interactive Analysis and Report Generation

knitr (http://yihui.name/knitr/)

Analysis Report Generation

• Analysis on rails or open sandbox

• Humans facilitate robust results generation + Bots ensure reproduction

• Generation of Methods and Results should be automatable

Page 11: Automation of (Biological) Data Analysis and Report Generation

Devium 2.0

Human-guided automated data analysis and report generator

Human-guided automation could help ensure robust results by making choices which are otherwise difficult to automate.

https://github.com/dgrapov/DeviumWeb

Page 12: Automation of (Biological) Data Analysis and Report Generation

MetaMapR

Linking data analysis and biology

https://github.com/dgrapov/MetaMapR

Integration of complex work flows is key to automation.

Page 13: Automation of (Biological) Data Analysis and Report Generation

Future Goals

+ Workflows for complex experiments (e.g. time-course)

+ Biochemical functional analysis (pathway enrichment)

+ GUI for report generation (Devium 2.0)

+ Integrate multi-’Omic’ data sets (MetaMapR 2.0)

+ Scientific literature mining (RapportR)

+ Interactive plots and networks (JavaScript)

Page 14: Automation of (Biological) Data Analysis and Report Generation

metabolomics.ucdavis.edu

This research was supported in part by NIH 1 U24 DK097154