Bioinformatics Tools for Microarray Analysis Connie Wu Dr. Jim Breaux Dr. Sandeep Gulati ViaLogy Southern California Bioinformatics Institute Summer 2004 Funded by the National Science Foundation and National Institutes of Health

Post on 21-Dec-2015


Company Overview

• Discovered and developed software implementation of Active Signal Processing (called Quantum Resonance Interferometry)

• Applying QRI to analysis of DNA microarrays enhances performance:

• Increased detection sensitivity and dynamic range

• Increased specificity

• Increased reproducibility

Company Overview

• VMAxS: web-based service for analyzing microarrays using QRI.

[Workflow diagram. Labels: Microarray image; VMAxS (Active Signal Processing); Signal Values (Cel Report); Cel Report File Reader; Further Analysis in R]

Project 1: Development of a more efficient file reader

• Cel Report

• VMAxS generates a Cel Report with gene- and feature-level signal for a single microarray.

• ~22000 genes

• ≤ 69 features per gene

• ≤ 7 statistical values for each gene and feature
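The limits above put a ceiling on how much data one Cel Report can hold. The following is a rough sketch of that arithmetic, assuming each statistical value is stored as one 8-byte double (the storage size is an assumption, not something the slides state); `max_values` is a hypothetical helper name.

```python
# Upper bounds taken from the slide; BYTES_PER_VALUE is an assumption.
GENES = 22_000          # "~22000 genes"
MAX_FEATURES = 69       # "<= 69 features per gene"
MAX_STATS = 7           # "<= 7 statistical values for each gene and feature"
BYTES_PER_VALUE = 8     # assumed: one double-precision number per value


def max_values(genes=GENES, features=MAX_FEATURES, stats=MAX_STATS):
    """Return (gene-level, feature-level) upper bounds on stored values."""
    gene_level = genes * stats
    feature_level = genes * features * stats
    return gene_level, feature_level


gene_level, feature_level = max_values()
total_bytes = (gene_level + feature_level) * BYTES_PER_VALUE
print(gene_level, feature_level, round(total_bytes / 1e6, 1), "MB")
```

Under these assumptions a single report holds at most about 10.8 million values, or roughly 86 MB if every gene had the maximum number of features. Real reports are smaller, since 69 and 7 are upper limits.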

Project 1: Development of a more efficient file reader

• Goals:

• Read through the entire file in the shortest possible time

• Store the data in an R data structure for further analysis

• Extract the statistic of interest with all labels attached (e.g., gene names, gene feature names)

R version of the Cel Report reader: one execution takes over 30 sec. on average.

Feature-level results: The Cel Report

[Figure: Cel Report layout, annotated with three regions: Header; First gene; Rest of the file]

Cel Report Example

[Figure: Cel Report excerpt. The identifier Array_1/1007_s_at is annotated as Filename (Array_1) and Probeset ID (1007_s_at).]

Cel Report Example

[Figure: Cel Report excerpt, annotated: Values per gene; Features per gene; Gene Results]

Things to consider…

• Reading a file when no header information is disclosed

• Reading the file as efficiently as possible: “open, read, close” in one step

• Use a more efficient language: C

• Interface C with R

• Transferring C data structures to R data structures

C Data Structure

1. Gene Feature ID

2. Gene Feature

3. Gene ID

4. Number of features per Gene

5. Gene Results

R Data Structure

1. Feature Data

2. Number of Features

3. Gene Results
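One way to picture this three-part layout is a flat feature vector for all genes, a per-gene feature count that says where each gene's features start and stop, and a table of gene-level results. The sketch below is illustrative only: it is written in Python rather than R, the class `CelReport`, its field names, and the method `features_for` are hypothetical, and the values are made up.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CelReport:
    """Hypothetical container mirroring the three parts listed above."""
    feature_ids: List[str]          # one label per feature, all genes concatenated
    feature_values: List[float]     # one statistic per feature, same order
    n_features: Dict[str, int]      # gene ID -> number of features, in file order
    gene_results: Dict[str, float]  # gene ID -> gene-level statistic

    def features_for(self, gene_id: str) -> Dict[str, float]:
        """Slice the flat feature vectors for one gene, keeping labels attached."""
        start = 0
        for gid, count in self.n_features.items():
            if gid == gene_id:
                ids = self.feature_ids[start:start + count]
                vals = self.feature_values[start:start + count]
                return dict(zip(ids, vals))
            start += count
        raise KeyError(gene_id)


# Made-up values, purely for illustration.
report = CelReport(
    feature_ids=["f1", "f2", "f3", "f4", "f5"],
    feature_values=[10.0, 12.5, 11.0, 3.0, 4.5],
    n_features={"geneA": 3, "geneB": 2},
    gene_results={"geneA": 11.2, "geneB": 3.7},
)
print(report.features_for("geneB"))
```

Storing the feature counts separately is what allows a variable number of features per gene without padding every gene out to the maximum.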

Output

[Figure: reader output, labeled Feature Data, Number of Features, Gene Result]

Corresponding Values from the Cel Report

[Figure: Cel Report excerpt, with the values corresponding to Feature Data, Number of Features, and Gene Result marked]

Advantages…

All vectors in C are dynamically allocated.

Both time and memory efficient:

1. The file is read only once

2. Only the required amount of memory is allocated for each data set

Runtime Comparison

Cel Reports (~22000 genes each)    R Version        C Version

16                                 9 min 25 sec     28 sec

42                                 37 min 57 sec    1 min 12 sec
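The timings above can be turned into per-report times and speedup factors with a few lines of arithmetic. This is a small sketch using only the numbers from the slide; the helper names `to_seconds` and `speedup` are hypothetical.

```python
def to_seconds(minutes, seconds):
    """Convert a (minutes, seconds) runtime into seconds."""
    return 60 * minutes + seconds


def speedup(r_seconds, c_seconds):
    """How many times faster the C version ran than the R version."""
    return r_seconds / c_seconds


# (number of Cel Reports, R runtime in s, C runtime in s), from the slide
runs = [
    (16, to_seconds(9, 25), to_seconds(0, 28)),
    (42, to_seconds(37, 57), to_seconds(1, 12)),
]

for n, r_s, c_s in runs:
    print(f"{n} reports: R {r_s / n:.1f} s/report, "
          f"C {c_s / n:.2f} s/report, speedup {speedup(r_s, c_s):.1f}x")
```

The C reader comes out at roughly 20 times faster on 16 reports and roughly 32 times faster on 42 reports, at about 1.7 seconds per report in both runs.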

Project 2: Development of an automated comparative performance report

• Compare the performance of ViaLogy’s analytical process to that of the current standard approach (e.g., GCOS from Affymetrix)

• Write an R script to automatically generate the following plots for the performance report:

1. Sensitivity Bar Plots

2. CV Plots

3. ECDF Plots

Sensitivity Bar Plots

• Compares the Sensitivity of VMAxS to GCOS

1. Genes called Present in GCOS

2. Genes called Present in VMAxS

3. Genes called Present in GCOS, Absent in VMAxS

4. Genes called Present in VMAxS, Absent in GCOS
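The four bars can be computed by counting genes in each category. The sketch below assumes each method's calls are available as a mapping from gene ID to "P" (Present) or "A" (Absent); that input layout, the function name `sensitivity_counts`, and the example calls are assumptions for illustration, and the original report was generated in R.

```python
def sensitivity_counts(gcos_calls, vmaxs_calls):
    """Count genes in the four bar-plot categories.

    Both arguments map gene ID -> "P" or "A"; only genes present in
    both mappings are counted.
    """
    genes = gcos_calls.keys() & vmaxs_calls.keys()
    return {
        "P in GCOS": sum(gcos_calls[g] == "P" for g in genes),
        "P in VMAxS": sum(vmaxs_calls[g] == "P" for g in genes),
        "P in GCOS, A in VMAxS": sum(
            gcos_calls[g] == "P" and vmaxs_calls[g] == "A" for g in genes
        ),
        "P in VMAxS, A in GCOS": sum(
            vmaxs_calls[g] == "P" and gcos_calls[g] == "A" for g in genes
        ),
    }


# Made-up calls for four genes, purely for illustration.
gcos = {"g1": "P", "g2": "A", "g3": "P", "g4": "A"}
vmaxs = {"g1": "P", "g2": "P", "g3": "A", "g4": "P"}
counts = sensitivity_counts(gcos, vmaxs)
print(counts)
```

The last two categories show the disagreements between the methods: genes detected by only one of them.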

CV Plots

• Purpose: Compare reproducibility

• Displays scatter plots of CV values for each gene.

• CV_i = std. dev. / mean of the replicate signal values for gene i

• For each group of replicates, plot CV_i,GCOS vs. CV_i,VMAxS
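The per-gene CV pairs behind the scatter plot can be sketched as follows. This assumes replicate signal values are held as a mapping from gene ID to a list of numbers, and it uses the sample standard deviation (n - 1 denominator); the slides do not say which form of standard deviation was used. The function names and example signals are hypothetical.

```python
from statistics import mean, stdev


def cv(values):
    """Coefficient of variation: sample std. dev. divided by the mean."""
    return stdev(values) / mean(values)


def cv_pairs(gcos_signals, vmaxs_signals):
    """Return {gene: (CV under GCOS, CV under VMAxS)} for genes in both inputs."""
    return {
        gene: (cv(gcos_signals[gene]), cv(vmaxs_signals[gene]))
        for gene in gcos_signals
        if gene in vmaxs_signals
    }


# Made-up replicate signals for one gene, purely for illustration.
gcos = {"g1": [100.0, 110.0, 90.0]}
vmaxs = {"g1": [100.0, 102.0, 98.0]}
pairs = cv_pairs(gcos, vmaxs)
print(pairs)
```

Each pair is one point in the scatter plot. A gene whose CV is smaller under one method was measured more reproducibly by that method.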

ECDF Plot

• Displays empirical cumulative distribution function (ECDF) of the CV values for each analytical method
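For a sample of CV values, the ECDF at x is the fraction of values less than or equal to x. A minimal sketch with made-up CV values follows; the helper name `ecdf` is hypothetical.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function F where F(x) is the fraction of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted values are <= x
        return bisect_right(ordered, x) / n

    return F


# Made-up CV values, purely for illustration.
F = ecdf([0.20, 0.02, 0.10, 0.05])
print(F(0.05), F(0.15), F(1.0))
```

When two methods are plotted together, the curve that rises faster belongs to the method with more genes at low CV, that is, the more reproducible one.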

Subgroup Analysis

• For a given set of replicates, break down the data into smaller groups and compare the reproducibility in smaller sets of data

• One way to break down: consider PRESENT/ABSENT calls

• Divide the genes into groups based on the number of PRESENT calls received for each analytical method, e.g.:

• 6 P in VMAxS, 0 P in GCOS

• 6 P in VMAxS, 1 P in GCOS

• 6 P in VMAxS, 2 P in GCOS

• …

• 0 P in VMAxS, 6 P in GCOS

• Total of 49 (7x7) groups for 6 replicates.

PCount Table

Displays the total number of genes in each group
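A table of this kind can be built by counting, for each gene, how many of its replicates were called Present by each method. The sketch below assumes the calls for a gene are given as a string with one "P" or "A" per replicate; that encoding, the function name `pcount_table`, and the example calls are assumptions for illustration.

```python
def pcount_table(vmaxs_calls, gcos_calls, n_replicates=6):
    """Build the (n+1) x (n+1) table of gene counts.

    table[i][j] is the number of genes with i Present calls in VMAxS
    and j Present calls in GCOS across the replicates.
    """
    size = n_replicates + 1
    table = [[0] * size for _ in range(size)]
    for gene, v_calls in vmaxs_calls.items():
        g_calls = gcos_calls[gene]
        table[v_calls.count("P")][g_calls.count("P")] += 1
    return table


# Made-up calls for three genes over six replicates.
vmaxs = {"g1": "PPPPPP", "g2": "PPPPPP", "g3": "AAAAAA"}
gcos = {"g1": "PPPPAA", "g2": "AAAAAA", "g3": "AAAAAA"}
table = pcount_table(vmaxs, gcos)
for row in table:
    print(row)
```

With six replicates each method can give 0 to 6 Present calls, which yields the 7x7 = 49 groups. Each cell then defines one subgroup of genes for its own CV or ECDF plot.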

CV Plots

ECDF Plot

Acknowledgement

• Dr. Jim Breaux

• Dr. Sandeep Gulati

• The rest of the ViaLogy staff

• Professors and staff members of SoCalBSI

• Fellow interns, especially Lien Chung

• NSF & NIH