bioinformatics computation and visualization at the university louisville

BIOINFORMATICS COMPUTATION AND VISUALIZATION AT THE UNIVERSITY LOUISVILLE

Eric C. Rouchka and Adel S. Elmaghraby

Computer Engineering and Computer Science Department

November 16, 2010


Page 1: Bioinformatics Computation and Visualization at the University Louisville


Page 2: Bioinformatics Computation and Visualization at the University Louisville

Abstract

Current high-throughput molecular biology techniques are providing researchers with data that grows at a rate equal to or faster than Moore’s law. The ability to store, manipulate and analyze this “Big Data” requires intelligent utilization of HPC hardware and software resources. At the University of Louisville, we are specifically interested in understanding gene expression for a variety of disease and disorder states through analysis of microarray data, next-generation sequencing of transcriptomes and visualization of high-resolution in-situ hybridization images of the central nervous system. We use a variety of approaches, including GPU computing and a Dell Visualization Cluster, to help achieve faster results.

HPC, GPU, Clusters

Page 3: Bioinformatics Computation and Visualization at the University Louisville

GENE EXPRESSION VISUALIZATION

Analyzing in-situ hybridization images of the central nervous system.

Page 4: Bioinformatics Computation and Visualization at the University Louisville

Statement of Problem

A multitude of high-resolution biological imaging techniques are available, including:

◦ Magnetic resonance

◦ Ultrasound

◦ Computed tomography

◦ X-ray

◦ Histological

Page 5: Bioinformatics Computation and Visualization at the University Louisville

University of Louisville Database

View genes involved in CNS

◦ neurotransmitters

◦ neuroreceptors

Ages:

◦ E13.5 (Embryonic day 13.5)

◦ P0 (Postnatal day 0 – newborn)

◦ P7 (Postnatal day 7)

Typical Size (TIFF Format)

◦ 6000 x 6000 pixels

◦ ranges from 3000 x 3000 to 30,000 x 30,000

◦ 30 MB to 800 MB per image

Page 6: Bioinformatics Computation and Visualization at the University Louisville

CNS Image Types

Whole Brain

Eye (Retina)

Spinal Cord

Page 7: Bioinformatics Computation and Visualization at the University Louisville

UofL In-Situ Hybridization Database

GOALS

◦ Tie in-situ database into gene expression (microarray; rtPCR) experiments

◦ Link to other existing information

◦ Localize and quantify signal

Page 8: Bioinformatics Computation and Visualization at the University Louisville

Purpose

Search Images of Interest

◦ by gene name

◦ by developmental stage

◦ by tissue type

Share Images

◦ publicly

◦ private groups

Annotate Images

Store Images

Page 9: Bioinformatics Computation and Visualization at the University Louisville
Page 10: Bioinformatics Computation and Visualization at the University Louisville

Partitioning for Web Viewing
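
The slide does not say how the images are partitioned, so the following is only a minimal sketch of one common approach: cutting a large image into fixed-size tiles and building a pyramid of half-resolution levels so a web viewer fetches only the tiles it needs. The 256-pixel tile size, the halving scheme, and all function names are assumptions for illustration, not the UofL implementation.

```python
def tile_grid(width, height, tile=256):
    """Return (columns, rows) of tiles needed to cover a width x height image."""
    cols = -(-width // tile)   # integer ceiling division
    rows = -(-height // tile)
    return cols, rows


def tile_boxes(width, height, tile=256):
    """Yield (left, top, right, bottom) pixel boxes, row by row; edge tiles are clipped."""
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            yield (left, top, min(left + tile, width), min(top + tile, height))


def pyramid_levels(width, height, tile=256):
    """List (width, height, cols, rows) per level, halving until one tile suffices."""
    levels = []
    w, h = width, height
    while True:
        cols, rows = tile_grid(w, h, tile)
        levels.append((w, h, cols, rows))
        if w <= tile and h <= tile:
            break
        w = max(1, -(-w // 2))  # halve, rounding up
        h = max(1, -(-h // 2))
    return levels


if __name__ == "__main__":
    # Typical image size quoted in the slides: 6000 x 6000 pixels.
    for level, (w, h, cols, rows) in enumerate(pyramid_levels(6000, 6000)):
        print(f"level {level}: {w} x {h} pixels, {cols} x {rows} tiles")
```

With these assumed settings, a viewer at low zoom requests a handful of tiles from a coarse level instead of loading a full TIFF of 30 MB to 800 MB.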

Page 11: Bioinformatics Computation and Visualization at the University Louisville
Page 12: Bioinformatics Computation and Visualization at the University Louisville

Extending the image display on multiple tiles (15,000 x 4,800 available display pixels)

High Resolution in-situ hybridization of mouse retina

Utilizing the Dell Video Wall

Page 13: Bioinformatics Computation and Visualization at the University Louisville

Research Areas

FUNCTIONAL GENOMICS

Control of gene expression

• TSS classification

• Transcription factor detection

• Translational control

DNA and Protein Sequence Analysis

• Primer design

• Gene structure prediction

Sources of Variability

• SNP analysis

• Repeat analysis

• Alternative splicing

Other

• 2nd level microarray analysis

• Gene interactions

• In-situ hybridization

• Machine learning

• DNA computing

Page 14: Bioinformatics Computation and Visualization at the University Louisville

Moore’s Law vs. GenBank

Figure: Log Growth of GenBank. X-axis: Year. Y-axes: Number of Bases in GenBank (log2) and Predicted Number of Transistors (log2). Annotation: 286 Processor, 134,000 Transistors.

Page 15: Bioinformatics Computation and Visualization at the University Louisville

Hard Drive Storage vs. NGS

Stein L. (2010) Genome Biology 11:207.

Page 16: Bioinformatics Computation and Visualization at the University Louisville

The LINE-1 Retrotransposon

Adapted from: Babushok DV, Kazazian HH, Jr. Progress in understanding the biology of the human mutagen LINE-1. Hum Mutat. 2007 Jun;28(6):527-39.

Diagram labels (scale 1K to 6K): 5’ UTR, Antisense Promoter, ORF 1, ORF 2, 3’ UTR, Poly-A

• Long Interspersed Nuclear Element-1

• A repeat sequence found pervasively throughout the genome.

• Each copy may or may not be capable of transcription or retrotransposition.

Page 17: Bioinformatics Computation and Visualization at the University Louisville

What is the LINE-1 Life Cycle?

Ribosomes

ORF 2: endonuclease, reverse transcriptase

ORF 1: Zipper domain

• Comprises ~21% of the genome

• As many as 100 copies are estimated to be active and capable of retrotransposition.

• Most copies of LINE-1, however, are truncated at the 5’ end, or otherwise not intact, and are inactive.

Figure: histogram of LINE-1 copies. X-axis: Distance from 3’ end (0 to 7000). Y-axis: Counts, Number of Observations (0 to 700).

Page 18: Bioinformatics Computation and Visualization at the University Louisville

How does LINE-1 affect cellular function?

Down regulation

Splice isoforms

Ectopic expression

Diagram labels (scale 1K to 6K): 5’ UTR, Antisense Promoter, ORF 1, ORF 2, 3’ UTR, Poly-A

Page 19: Bioinformatics Computation and Visualization at the University Louisville

LINE1 5’ Signature and Reverse Complement

CTTGGCTCCTCCCC

GGGGAGGAGCCAAG

String search for exact match in fastq records

SRR000921.50547 …TGAGTAAATAATGGA*GGGGAGGAGCCAAGAT…

SRR003709.200687 …CTTGGCTCCTCCCCC*AAAAGGAATCATTTTAAA…

Identify and collect those sequences that contain the 5’ LINE1 signature element and at least 25 nucleotides of additional sequence that flanks the LINE1 element.

Align against the reference human sequence with BLAT.

Identify and collect those sequences whose flanking sequence maps uniquely to the genome but whose alignment does not extend to cover the LINE1 element. Isolate the flanking sequence and create a BLAST-searchable database comprised of the flanks.

Diagram labels: Flanking sequence, LINE1 element

>SRR000921.50547

…TGAGTAAATAATGGA

>SRR003709.200687

AAAAGGAATCATTTTAAA…

Convert all fastq to fasta and BLAST against the flanking sequence database.
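
The first two steps of this pipeline (exact signature search in fastq records, then keeping reads with at least 25 flanking nucleotides) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, the flank is assumed to lie upstream of GGGGAGGAGCCAAG and downstream of CTTGGCTCCTCCCC (as in the two example reads), and the exact boundary marked by `*` on the slide is simplified to the edge of the 14-mer. The BLAT and BLAST steps are not shown.

```python
SIGNATURE = "GGGGAGGAGCCAAG"   # one of the two 14-mers shown on the slide
MIN_FLANK = 25                 # minimum flanking length stated on the slide

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_flank(read, signature=SIGNATURE, min_flank=MIN_FLANK):
    """Return the flanking sequence next to the signature, or None.

    Assumption: the flank precedes the signature in its given orientation
    and follows it when the read carries the reverse complement.
    """
    pos = read.find(signature)
    if pos != -1:
        flank = read[:pos]
    else:
        rc = reverse_complement(signature)
        pos = read.find(rc)
        if pos == -1:
            return None
        flank = read[pos + len(rc):]
    return flank if len(flank) >= min_flank else None


def iter_fastq(lines):
    """Yield (read_id, sequence) from FASTQ text, assuming 4 lines per record."""
    it = iter(lines)
    for header in it:
        seq = next(it, "").strip()
        next(it, None)  # '+' separator line
        next(it, None)  # quality line
        header = header.strip()
        if header.startswith("@"):
            yield header[1:].split(" ")[0], seq


def collect_flanks(lines):
    """Map read id -> flank for reads that pass the signature and length filters."""
    flanks = {}
    for read_id, seq in iter_fastq(lines):
        flank = extract_flank(seq)
        if flank is not None:
            flanks[read_id] = flank
    return flanks


if __name__ == "__main__":
    # Synthetic reads for illustration only (not real SRA data).
    demo = [
        "@read1", "T" * 25 + "GGGGAGGAGCCAAGAT", "+", "I" * 41,
        "@read2", "CTTGGCTCCTCCCC" + "A" * 10, "+", "I" * 24,
    ]
    for read_id, flank in collect_flanks(demo).items():
        print(f">{read_id}\n{flank}")
```

The collected flanks could then be written out as FASTA records, which is the input form expected by alignment tools such as BLAT.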

Page 20: Bioinformatics Computation and Visualization at the University Louisville

GPU AND CUDA

Page 21: Bioinformatics Computation and Visualization at the University Louisville

Typical Computational Operations

RNA folding using the Nussinov algorithm, a dynamic programming method with O(n^3) time complexity.

Clustering of gene data and textual information.
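
To make the cubic cost of RNA folding concrete, here is a minimal CPU sketch of Nussinov-style base-pair maximization. It is not the lab's GPU implementation: the pairing rules (Watson-Crick plus G-U wobble), the `min_loop` parameter, and the function names are assumptions chosen for illustration. The three nested loops (span, start position, split point) are where the O(n^3) time comes from.

```python
# Allowed pairs: Watson-Crick plus G-U wobble (an assumption for this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq, min_loop=0):
    """Return the maximum number of nested base pairs in an RNA sequence.

    dp[i][j] holds the best pair count for seq[i..j]. min_loop is the minimum
    number of unpaired bases required between paired positions.
    """
    n = len(seq)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]
    for span in range(1, n):               # subsequence length minus one
        for i in range(n - span):          # start position
            j = i + span
            best = max(dp[i + 1][j], dp[i][j - 1])   # leave i or j unpaired
            if (seq[i], seq[j]) in PAIRS and j - i > min_loop:
                inner = dp[i + 1][j - 1] if i + 1 <= j - 1 else 0
                best = max(best, inner + 1)          # pair i with j
            for k in range(i + 1, j):                # split into two substructures
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]


if __name__ == "__main__":
    print(nussinov("GGGAAAUCC"))
```

Cells with the same span do not depend on one another, so each anti-diagonal of the table can be filled in parallel, which is one reason this recurrence is a candidate for GPU acceleration.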

Page 22: Bioinformatics Computation and Visualization at the University Louisville

A binary matrix representation of a secondary structure of an RNA sequence.

Hamada M et al. Bioinformatics 2009;25:465-473

© The Author 2008. Published by Oxford University Press. All rights reserved.

Page 23: Bioinformatics Computation and Visualization at the University Louisville

VaIG Lab

Use Dell Alienware with Nvidia GPUs for

◦ Hierarchical clustering of DNA microarray data, 48 times speedup over single core CPU, using Tesla C-870

◦ Nussinov RNA folding, 290 times speedup, using Tesla C-2050

◦ Processing of PubMed abstracts, ongoing

◦ SAT (propositional logic) as applied to haplotype inference, ongoing

◦ Semi-supervised support vector machine (S3VM), ongoing

Page 24: Bioinformatics Computation and Visualization at the University Louisville

Sample Speed up using GPU

Compute pairwise Manhattan distance and Pearson correlation coefficient of data points with GPU

Dar-Jen Chang, Ahmed H. Desoky, Ming Ouyang, Eric C. Rouchka, 2009 10th ACIS International Conference on Software Engineering, Artificial Intelligences, Networking and Parallel/Distributed Computing
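
For reference, the two measures named in this paper title can be written as a plain CPU baseline. This is a minimal sketch under stated assumptions, not the GPU code from the paper: function names are hypothetical, and each data point is assumed to be an equal-length list of numbers (for example, one gene's expression values across arrays).

```python
import math


def manhattan(x, y):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def pearson(x, y):
    """Pearson correlation coefficient; NaN if either vector is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def pairwise(points, metric):
    """Return the symmetric n x n matrix of metric(points[i], points[j])."""
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            value = metric(points[i], points[j])
            matrix[i][j] = matrix[j][i] = value
    return matrix


if __name__ == "__main__":
    points = [[0, 0], [1, 1], [3, 0]]
    for row in pairwise(points, manhattan):
        print(row)
```

Every entry of the matrix is independent of the others, so the n(n-1)/2 pair computations map naturally onto GPU threads; the matrix is also the usual input to hierarchical clustering.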

Page 25: Bioinformatics Computation and Visualization at the University Louisville

UNIVERSITY OF LOUISVILLE AND DELL

A positive experience

Page 26: Bioinformatics Computation and Visualization at the University Louisville

J.B. Speed School Industry Affiliates