
COT 6930 HPC & Bioinformatics

Microarray Data Analysis

Xingquan Zhu

Dept. of Computer Science and Engineering

[Figure: DNA → (transcription) → RNA → (translation) → protein → phenotype, with the associated databases: genomic DNA databases (DNA); cDNA, ESTs, UniGene and gene expression databases (RNA); protein sequence databases and protein structure databases (protein).]

Outline

• Gene Expression and Biological Network
  • What, Why, and How
• DNA Microarray
  • Microarray Construction
  • Comparative Hybridization
  • Data Analysis
• Public Databases

Gene Expression

Gene expression
• Genes are expressed when they are transcribed into RNA
• Amount of mRNA indicates gene activity
  • No mRNA → gene is off
  • mRNA present → gene is on & performing function

Biologically
• Some genes are always expressed in all tissues
  • Estimated 10,000 housekeeping / ubiquitous genes
• Other genes are selectively on
  • Depending on tissue, disease, and/or environment
• Change in environment → change in gene expression
  • So organism can respond

Biological Network

• Gene expression does not happen in isolation
• Individual genes code for function
  • Produce mRNA → protein performing function
• Sets of genes can form pathways
  • Gene products can turn on / off other genes
• Sets of pathways can form networks
  • When pathways interact
• Biology is a study of networks
  • Genes
  • Proteins
  • Etc…

Types of Biological Networks

• Genetic network
  • Interactions between genes, gene products
• Gene regulation network
  • Network of control decisions to turn genes on / off
  • Subset of genetic network
• Metabolic network
  • Network of interactions between proteins
  • Synthesize / break down molecules (enzymes, cofactors)

An example of Genetic Network

Gene Regulation Network

An example of Metabolic network

Examining Biological Networks – Benefits

• Learn about gene function / regulation
  • Tissue differentiation
  • Response to environmental factors
• Identify / treat diseases
  • Discover genetic causes of disease
  • Evaluate effect of drugs
• Detect impact of DNA sequence variation (mutations)
  • Detection of mutations (e.g., SNPs)
  • Genetic typing

Examining Biological Networks – Approach

Measure protein / mRNA in cells
• In different tissues (e.g., brain vs. muscle)
  • Find gene / protein with tissue-specific function
• As environment changes
  • Find genes / proteins responsible for response
• In healthy & diseased tissues
  • Find proteins / genes responsible for disease (if any)
  • Help identify diseases based on gene expression
• In different individuals
  • Detect DNA sequence variation

Examining Biological Networks

Direct approach
• Measure protein production / interaction in cell
  • 2D electrophoresis
  • Mass spectrometry
  • Protein microarray
• Advantages
  • Precise results on proteins
• Disadvantages
  • Low throughput (for now)

Examining Biological Networks

Indirect approach
• Measure mRNA production (gene expression) in cell
  • Random ESTs
  • DNA microarray
• Advantages
  • High throughput
  • Can test large variety of mRNA simultaneously
• Disadvantages
  • RNA level not always correlated with protein level / function
  • Misses changes at protein level
  • Results may thus be less precise


DNA Microarray

Question
• How to determine whether a gene is expressed, or how to measure mRNA?

DNA Microarray

Hybridization to the Chip

The Chip is Scanned

Images

Video: http://www.youtube.com/watch?v=VNsThMNjKhM

Oligonucleotide (GeneChip) vs. Spotted Arrays

GeneChip Microarray
• A gene is a probe set
• A set of (11-16) probes form a probe set
• Probe length: 25 bp
• Can use small amount of RNA
• Efficient hybridization

Spotted Microarray
• One probe per gene
• Probe length: hundreds to 1k bp
• Less expensive

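[Figure, from Affymetrix Inc.: a 1.28 cm × 1.28 cm chip. A probe set is made of probe pairs; each probe pair consists of a perfect match (PM) probe cell and a mismatch (MM) probe cell.]

GeneChip: Chip -> Probe set -> Probe pair -> Probe
• 25-mer unique oligo
• Mismatch in the middle nucleotide
• Multiple probes (11~16) for each gene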

GeneChip Array Design

Affymetrix GeneChip

DNA Microarray Design & Analysis

Microarray
• Microarray construction
• Array design
  • Choosing probe sequences

Comparative Hybridization (data collection)
• Measure relative amount of mRNA
• Image processing of scanned images
  • Spot detection, normalization, quantization

Data Analysis
• Statistical test, noise handling (low-level)
• Clustering, classification (high-level)

cDNA

Complementary DNA
• Sequences are the complements of the original mRNA sequences

Why don't we simply capture mRNA?
• The environment is full of RNA-digesting enzymes
• Free RNA is quickly degraded
• To prevent the experimental samples from being lost, they are reverse-transcribed back into the more stable DNA form

DNA Microarray Construction

Construction
• Drops (spots) of cDNA fragments as probes
• Attach to glass slide / nylon array at known locations
• Use mechanical pins & robotics

Use
• Label cDNA with fluorescent dyes (fluor)
• Measure contrast in intensity
• Use laser / CCD scanner

DNA Microarray: Automatic Detection

DNA Microarray

Choice of probe
• Include genes of interest
  • Examine sequence databases
• Avoid redundancy
  • No duplicate probes
• Avoid cross hybridization
  • GeneChip alleviates this problem by using probe pairs (PM, MM)
• Can use software to help choose probes
• Or simply buy pre-designed arrays
  • Complete genomes of yeast, Drosophila, C. elegans
  • 33,000+ human genes from GenBank RefSeq on 2 microarrays
  • Expensive but labor-saving


Microarray
• Microarray construction
  • Spotted cDNA arrays, in situ photolithography…
• Array design





Comparative Hybridization

Goal
• Measure relative amount of mRNA expressed

Algorithm
• Choose cell populations
• mRNA extraction and reverse transcription
• Fluorescent labeling of cDNA's (normalized)
• Hybridization to microarray
• Scan the hybridized array
• Interpret scanned image
  • Color determined by relative RNA concentrations
  • Brightness determined by total concentration
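The color / brightness reading of a spot can be sketched numerically. This is only an illustration, assuming each spot has already been reduced to two positive channel intensities (called red and green here, an assumption not stated on the slide). The log ratio M plays the role of "color" (relative concentration) and the average log intensity A plays the role of "brightness" (total concentration). The function name ma_values and the intensities are hypothetical.

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot from two positive channel intensities."""
    m = math.log2(red / green)          # relative amount: >0 red-dominant, <0 green-dominant
    a = 0.5 * math.log2(red * green)    # overall brightness on a log scale
    return m, a

# Hypothetical spot intensities: (red channel, green channel)
spots = {
    "gene_a": (800.0, 200.0),   # four times more signal in the red channel
    "gene_b": (150.0, 150.0),   # equal in both channels
    "gene_c": (100.0, 400.0),   # four times more signal in the green channel
}

for name, (red, green) in spots.items():
    m, a = ma_values(red, green)
    print(f"{name}: M={m:+.2f}  A={a:.2f}")
```

Working on the log scale makes a 4-fold increase and a 4-fold decrease symmetric (+2 and -2), which a raw ratio does not.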

DNA Microarray Methodology

Anatomy of a Comparative Gene Expression Study
http://www.cs.wustl.edu/~jbuhler/research/array/#diagram

Flash Animation
http://www.bio.davidson.edu/courses/genomics/chip/chip.html

Streamlined Array Analysis

Raw data → Normalize → Filter → Significance / Classification / Clustering → Gene lists → Function (Gene Ontology)

• Filter: Present/Absent, Minimum value, Fold change
• Significance: t-test
• Classification: Machine learning
• Clustering: Hierarchical CL, Biclustering

Raw data (one VALUE / ABS_CALL pair per sample):

Sample:  normal  tumor  tumor  normal  normal  tumor
ID_REF  VALUE ABS_CALL  VALUE ABS_CALL  VALUE ABS_CALL  VALUE ABS_CALL  VALUE ABS_CALL  VALUE ABS_CALL
AFFX-BioB-5_at  210.6 P  234.6 P  362.5 P  389 P  305.6 P  330.5 P
AFFX-BioB-M_at  393 P  327.8 P  501.4 P  816.5 P  542 P  440.8 P
AFFX-BioB-3_at  264.9 P  164.6 P  244.7 P  379.7 P  261.3 P  303.7 P
AFFX-BioC-5_at  738.6 P  676.1 P  737.6 P  1191.2 P  917 P  767.9 P
AFFX-BioC-3_at  356.3 P  365.9 P  423.4 P  711.6 P  560.3 P  484.9 P
AFFX-BioDn-5_at  566.3 P  442.2 P  649.7 P  834.3 P  599.1 P  606.9 P
AFFX-BioDn-3_at  3911.8 P  3703.7 P  4680.9 P  6037.7 P  4653.7 P  4232 P
AFFX-CreX-5_at  6433.3 P  5980 P  7734.7 P  10591 P  8162.1 P  8428 P
AFFX-CreX-3_at  11917.8 P  9376.7 P  11509.3 P  16814.4 P  13861.8 P  13653.4 P
AFFX-DapX-5_at  12.2 A  44.3 M  31.2 A  37.7 P  33.3 A  12.8 A
AFFX-DapX-M_at  57.8 M  42.5 A  79 M  48.8 P  39.5 A  39.2 A
AFFX-DapX-3_at  29.8 A  6.2 A  23.4 A  28.4 A  3.2 A  7.6 A
AFFX-LysX-5_at  15.3 A  16.2 A  15.6 A  16.7 A  3.1 A  3.9 A
AFFX-LysX-M_at  33.2 A  12 A  17.7 A  37.3 A  49.2 A  9.1 A
AFFX-LysX-3_at  40.7 M  10.7 A  36.2 A  22.1 A  22.8 A  28.2 A
AFFX-PheX-5_at  7.8 A  3 A  7.6 A  5.6 A  5 A  6.4 A
AFFX-PheX-M_at  4.2 A  4.8 A  6.8 A  6.1 A  3.7 A  5.5 A
AFFX-PheX-3_at  54.2 A  39.6 A  19.4 A  16.1 A  44.7 A  31.2 A
AFFX-ThrX-5_at  8.2 A  11.2 A  13.2 A  9.5 A  8.5 A  7.5 A
AFFX-ThrX-M_at  38.1 A  30.6 A  37.6 A  7.2 A  26.9 A  36.3 A
AFFX-ThrX-3_at  15.2 A  5 A  15 A  8.3 A  36.8 A  11.5 A
AFFX-TrpnX-5_at  11.2 A  11.8 A  22.2 A  22.1 A  8.9 A  35.6 A
AFFX-TrpnX-M_at  9 A  8.1 A  9.1 A  8.7 A  8.1 A  12 A
AFFX-TrpnX-3_at  19.8 A  12.8 A  11.8 A  43.2 M  17.4 A  10 A
AFFX-HUMISGF3A/M97935_5_at  82.7 P  120.7 P  92.7 P  46.4 P  55.9 P  46.5 P
AFFX-HUMISGF3A/M97935_MA_at  397.6 P  416.7 P  244.8 A  181.4 A  197.5 A  192.3 A
AFFX-HUMISGF3A/M97935_MB_at  206.2 P  303 P  300.8 P  253.5 P  195.3 P  216 P
AFFX-HUMISGF3A/M97935_3_at  663.8 P  723.9 P  812.1 P  666.1 P  629.4 P  754.1 P
AFFX-HUMRGE/M10098_5_at  547.6 P  405.9 P  6894.7 P  3496.1 P  1958.5 P  5799.4 P
AFFX-HUMRGE/M10098_M_at  239.1 P  175.8 P  3675 P  1348.6 P  695.9 P  2428.2 P
AFFX-HUMRGE/M10098_3_at  1236.4 P  721.4 P  9076.1 P  7795.9 P  4237.1 P  7890 P
AFFX-HUMGAPDH/M33197_5_at  19508 P  19267.1 P  22892 P  26584 P  29666.6 P  25038.1 P
AFFX-HUMGAPDH/M33197_M_at  18996.6 P  20610.4 P  21573.7 P  29936 P  30106.6 P  22380.2 P
AFFX-HUMGAPDH/M33197_3_at  18016.4 P  17463.8 P  20921.3 P  26908.3 P  28382.2 P  21885 P
AFFX-HSAC07/X00351_5_at  23294.6 P  21783.7 P  18423.3 P  21858.9 P  23517.1 P  19450.3 P
AFFX-HSAC07/X00351_M_at  25373.1 P  24922.8 P  22384.2 P  25760.2 P  27718.5 P  21401.6 P
AFFX-HSAC07/X00351_3_at  20032.8 P  20251.1 P  20961.7 P  23494.6 P  23381.2 P  21173.3 P

Microarray data

[Figure: data matrix with rows Gene 1, Gene 2, …, Gene N and columns Exp 1 (E1), Exp 2 (E2), Exp 3 (E3).]

Microarray data analysis

begin with a data matrix (gene expression values versus samples)

Typically, there are many genes (>> 10,000) and few samples (~ 10)

Low-Level Data Analysis

• Normalization: when you have variability in measurements, you need replication and statistics to find real differences
• Significance test: it's not just the genes with a 2-fold increase, but those with a significant p-value across replicates
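To see why fold change alone can mislead, here is a minimal sketch that compares fold change with a Welch t statistic computed across replicates. The replicate values and gene names are invented for illustration. Turning the t statistic into a p-value needs the t distribution, which is left to a statistics library and is not shown here.

```python
import math
from statistics import mean, variance

def fold_change(control, treated):
    """Ratio of the mean treated value to the mean control value."""
    return mean(treated) / mean(control)

def welch_t(control, treated):
    """Welch's t statistic: difference of means over its standard error."""
    standard_error = math.sqrt(
        variance(control) / len(control) + variance(treated) / len(treated)
    )
    return (mean(treated) - mean(control)) / standard_error

# Hypothetical replicate measurements: (normal replicates, tumor replicates)
genes = {
    "noisy_gene":  ([100, 400, 250], [900, 200, 400]),   # 2-fold, but replicates disagree
    "steady_gene": ([100, 105, 95],  [150, 155, 145]),   # 1.5-fold, replicates agree
}

for name, (normal, tumor) in genes.items():
    print(f"{name}: fold={fold_change(normal, tumor):.2f}  t={welch_t(normal, tumor):.2f}")
```

In this made-up data the gene with the larger fold change has the much smaller t statistic, because its replicates scatter widely.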

Sources of Variability in Raw Data

• Biological variability
• Sample preparation
  • Probe labeling
  • RNA extraction
• Experimental condition
  • Temperature, time, mixing, etc.
• Scanning
  • Laser and detector, chemistry of the fluorescent label
• Image analysis
  • Identifying and quantifying each spot on the array

Data Normalization

• Can control for many of the experimental sources of variability (systematic, not random or gene specific)
• Bring each image to the same average brightness
• Can use simple math or fancy:
  • Divide by the mean (whole chip or by sectors)
  • LOESS (locally weighted regression)
• No sure biological standards
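The "simple math" option, dividing by the whole-chip mean, can be sketched as follows. The two arrays are illustrative values only: the second is the first scanned twice as bright, so after scaling they should agree. The function name scale_to_mean is hypothetical, and LOESS normalization is not shown.

```python
def scale_to_mean(values, target=1.0):
    """Divide every intensity by the array mean so the array averages `target`."""
    array_mean = sum(values) / len(values)
    return [target * v / array_mean for v in values]

# Illustrative intensities for the same four probes on two arrays
chip_a = [210.6, 393.0, 264.9, 738.6]
chip_b = [421.2, 786.0, 529.8, 1477.2]   # same pattern, twice the overall brightness

norm_a = scale_to_mean(chip_a)
norm_b = scale_to_mean(chip_b)

for a, b in zip(norm_a, norm_b):
    print(f"{a:.3f}  {b:.3f}")
```

This removes only a global brightness difference. Intensity-dependent or position-dependent bias is what the by-sector and LOESS variants are meant to address.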

Scatter plots
• One of the most common visualization methods for microarray data
• Useful to compare gene expression values from two microarray experiments (e.g. control, experimental)
• Each dot corresponds to a gene expression value
• Most dots fall along a line
• Outliers represent up-regulated or down-regulated genes

Scatter plot analysis of microarray data

[Figure: scatter plots of expression level (low to high) for brain, astrocyte, and fibroblast samples, with up- and down-regulated genes marked.]

Differential Gene Expression in Different Tissue and Cell Types

The major goal of scatter plot is to identify genes that are differentially regulated between different experimental conditions.

We are interested in outliers
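Picking outliers off the diagonal of a scatter plot amounts to thresholding the log ratio of the two conditions. The sketch below assumes a 2-fold cutoff (|log2 ratio| ≥ 1), which is a common convention rather than something the slide fixes. The gene names and values are hypothetical.

```python
import math

def flag_outliers(control, experiment, min_log2_fold=1.0):
    """Label each gene 'up', 'down', or 'unchanged' from two expression dicts."""
    calls = {}
    for gene, control_value in control.items():
        log_ratio = math.log2(experiment[gene] / control_value)
        if log_ratio >= min_log2_fold:
            calls[gene] = "up"            # point lies well above the diagonal
        elif log_ratio <= -min_log2_fold:
            calls[gene] = "down"          # point lies well below the diagonal
        else:
            calls[gene] = "unchanged"     # point lies near the diagonal
    return calls

# Hypothetical expression values for three genes in two experiments
control = {"g1": 100.0, "g2": 200.0, "g3": 400.0}
experiment = {"g1": 110.0, "g2": 900.0, "g3": 90.0}

calls = flag_outliers(control, experiment)
print(calls)
```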

Higher Level Data Analysis

Computational tasks:
• Clustering
• Classification
• Statistical validation
• Data visualization
• Pattern detection

Biological problems:
• Discovery of common sequences in co-regulated genes
• Meta-studies using data from multiple experiments
• Linkage between gene expression data and gene sequence / function / metabolic pathways databases


Why care about "clustering"?

[Figure: expression matrix with columns E1, E2, E3; rows Gene 1, Gene 2, …, Gene N are reordered by clustering (Gene N, Gene 1, Gene 2).]

• Discover functional relation
  • Similar expression → functionally related
• Assign function to unknown gene
• Find which gene controls which other genes

Types of Clustering Methods

• Hierarchical
  • Link similar genes, build up to a tree of all
• K-means Clustering
• Self Organizing Maps (SOM)
  • Split all genes into similar sub-groups
  • Finds its own groups (machine learning)
• Bi-Clustering

Some distance measures

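Given vectors x = (x_1, …, x_n), y = (y_1, …, y_n)

Euclidean distance:
\[ d_E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]

Manhattan distance:
\[ d_M(x, y) = \sum_{i=1}^{n} |x_i - y_i| \]

Correlation distance:
\[ d_C(x, y) = 1 - \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} \]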
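The three distance measures named above (Euclidean, Manhattan, correlation) can be sketched in plain Python. The expression profiles g1, g2, g3 are hypothetical and chosen to show how correlation distance ignores magnitude and responds only to the shape of the profile.

```python
import math

def euclidean(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def correlation_distance(x, y):
    """One minus the Pearson correlation of x and y (0 = same shape, 2 = opposite)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - covariance / (spread_x * spread_y)

# Hypothetical expression profiles over three experiments
g1 = [1.0, 2.0, 3.0]
g2 = [10.0, 20.0, 30.0]   # same trend as g1, ten times larger
g3 = [3.0, 2.0, 1.0]      # opposite trend to g1

print(euclidean(g1, g2), manhattan(g1, g2), correlation_distance(g1, g2))
print(euclidean(g1, g3), manhattan(g1, g3), correlation_distance(g1, g3))
```

Here g1 and g2 are far apart in Euclidean and Manhattan terms but have correlation distance near 0, which is why correlation-based measures are often preferred when the interest is in co-regulation rather than absolute level.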

Finding a Centroid

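We use the following equation to find the n-dimensional centroid point amid k n-dimensional points:

\[ CP(x_1, x_2, \ldots, x_k) = \left( \frac{\sum_{i=1}^{k} x_i^{\text{1st}}}{k}, \; \frac{\sum_{i=1}^{k} x_i^{\text{2nd}}}{k}, \; \ldots, \; \frac{\sum_{i=1}^{k} x_i^{\text{nth}}}{k} \right) \]

Let's find the midpoint between 3 2D points, say: (2,4) (5,2) (8,9)

\[ CP = \left( \frac{2 + 5 + 8}{3}, \; \frac{4 + 2 + 9}{3} \right) = (5, 5) \]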

Hierarchical Clustering

• Treat each example as a cluster
• While (clusters > 1)
  • Merge two clusters with the least distance
  • Update cluster centroid
  • Clusters--
• Endwhile

• Easy
• No need to specify the number of clusters beforehand
• Trouble to interpret "tree" structure
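The merge loop above can be sketched in Python. This is a naive version assuming centroid-to-centroid Euclidean distance as the cluster distance; other linkage rules exist and the slide does not say which one is intended. The function names and the four sample points are hypothetical.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    return tuple(sum(p[d] for p in points) / len(points) for d in range(len(points[0])))

def agglomerate(points):
    """Repeatedly merge the two clusters with the closest centroids.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    which is the information a dendrogram ("tree") is drawn from.
    """
    clusters = [[p] for p in points]            # treat each example as a cluster
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]      # centroid is recomputed on demand
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return history

# Hypothetical 2D expression profiles
genes = [(1.0, 1.0), (1.5, 1.0), (8.0, 8.0), (8.0, 9.0)]
history = agglomerate(genes)
for left, right, d in history:
    print(f"merge {left} + {right} at distance {d:.2f}")
```

The pairwise search makes every merge quadratic in the number of clusters, so this form is only practical for small gene sets.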

K-means Algorithm

1. Choose k initial center points randomly
2. Cluster data using Euclidean distance (or other distance metric)
3. Calculate new center points for each cluster using only points within the cluster
4. Re-cluster all data using the new center points
   1. This step could cause data points to be placed in a different cluster
5. Repeat steps 3 & 4 until the center points have moved such that in step 4 no data points are moved from one cluster to another, or some other convergence criterion is met
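The five steps can be sketched as a small Python function. It assumes Euclidean distance, initial centers drawn from the data points, and a fixed random seed for repeatability. The function names, the seed parameter, the max_iter safeguard, and the sample points are all illustrative choices, not part of the slide.

```python
import math
import random

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    return tuple(sum(p[d] for p in points) / len(points) for d in range(len(points[0])))

def kmeans(points, k, seed=0, max_iter=100):
    """Return (centers, assignment) where assignment[i] is the cluster of points[i]."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                  # step 1: random initial centers
    assignment = None
    for _ in range(max_iter):
        # steps 2 and 4: put every point in the cluster of its nearest center
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:             # step 5: no point changed cluster
            break
        assignment = new_assignment
        # step 3: recompute each center from the points currently in its cluster
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                              # an empty cluster keeps its old center
                centers[c] = centroid(members)
    return centers, assignment

# Hypothetical 2D data with two visible groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, assignment = kmeans(points, k=2)
print(centers)
print(assignment)
```

Which cluster gets label 0 depends on the random start, so results are best compared by membership rather than by label.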

An example with k=2

1. We Pick k=2 centers at random

2. We cluster our data around these center points

K-means example with k=2

3. We recalculate centers based on our current clusters


4. We re-cluster our data around our new center points


5. We repeat the last two steps until no more data points are moved into a different cluster

Cluster Quality

Since any data can be clustered, how do we know our clusters are meaningful?
• The size (diameter) of the cluster vs. the inter-cluster distance
• Distance between the members of a cluster and the cluster's center
• Diameter of the smallest sphere

Cluster Quality Continued

[Figure: two cluster pairs, each cluster with size=5; one pair separated by distance=20, the other by distance=5.]

Quality of cluster assessed by ratio of distance to nearest cluster and cluster diameter
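A minimal sketch of this ratio follows. It assumes the diameter is the largest member-to-member distance and that the distance to another cluster is measured between the closest members; the slide does not fix either definition, so both are assumptions. The clusters are hypothetical.

```python
import math
from itertools import combinations

def diameter(cluster):
    """Largest distance between any two members of the cluster."""
    return max(math.dist(p, q) for p, q in combinations(cluster, 2))

def nearest_cluster_distance(cluster, others):
    """Smallest member-to-member distance from `cluster` to any other cluster."""
    return min(math.dist(p, q) for other in others for p in cluster for q in other)

def quality(cluster, others):
    """Distance to the nearest cluster divided by the cluster's own diameter."""
    return nearest_cluster_distance(cluster, others) / diameter(cluster)

# Hypothetical clusters on a line, each with diameter 5
a = [(0.0, 0.0), (5.0, 0.0)]
far_neighbor = [(25.0, 0.0), (30.0, 0.0)]    # 20 away from a
near_neighbor = [(10.0, 0.0), (15.0, 0.0)]   # 5 away from a

print(quality(a, [far_neighbor]))    # well separated: ratio above 1
print(quality(a, [near_neighbor]))   # gap no larger than the cluster itself
```

A larger ratio means the cluster is compact relative to its separation from its neighbors.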

Cluster Quality Continued

Quality can be assessed simply by looking at the diameter of a cluster

A cluster can be formed even when there is no similarity between clustered patterns. This occurs because the algorithm forces k clusters to be created.

k-means comments

Strength
• Easy
• Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.

Weakness
• Sensitive to the initial seeds
• Applicable only when mean is defined, then what about categorical data?
• Need to specify k, the number of clusters, in advance
• Unable to handle noisy data and outliers
• Not suitable to discover clusters with non-convex shapes

A Problem of K-means

Sensitive to outliers
• Outlier: objects with extremely large values
• May substantially distort the distribution of the data

When mean is not meaningful
• K-medoids: use the medoid, the most centrally located object in a cluster, as the cluster center

[Figure: two scatter plots, both axes ranging from 0 to 10.]
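The effect of an outlier on the mean, and why a medoid is more robust, can be sketched as follows. The medoid here is taken to be the member with the smallest total Euclidean distance to all other members. The points are hypothetical, with the last one placed far from the rest on purpose.

```python
import math

def mean_point(points):
    """Coordinate-wise mean; need not coincide with any actual member."""
    return tuple(sum(p[d] for p in points) / len(points) for d in range(len(points[0])))

def medoid(points):
    """Member with the smallest total distance to all other members."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

# Four tightly grouped points plus one outlier
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (10.0, 10.0)]

print(mean_point(cluster))   # pulled toward the outlier, outside the tight group
print(medoid(cluster))       # stays on a member of the tight group
```

A full k-medoids method would use this choice of center inside the same assign-and-update loop as k-means; only the center computation is shown here.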

A Problem of K-means: Differing Density

Original Points K-means (3 Clusters)

Clusters with non-convex shapes

Original Points K-means (2 Clusters)

A parallel k-means package

Parallel K-Means Data Clustering
http://www.ece.northwestern.edu/~wkliao/Kmeans/index.html

Other clustering methods

• Self Organizing Maps (SOM)
  • Determine its own groups by using neural networks
• Bi-clustering
  • Simultaneously merge columns and rows into clusters
  • Group of genes
  • Group of examples

Two-way clustering of genes (y-axis) and cell lines (x-axis)


Public Databases

• Gene expression data is an essential aspect of annotating the genome
• Publication and data exchange for microarray experiments
• Data mining / meta-studies
• Common data format - XML
• MIAME (Minimum Information About a Microarray Experiment)

GEO at the NCBI

Array Express at EMBL


