DNA Microarrays: Basics

DNA Chips and Their Analysis Comp. Genomics: Lectures 10-11 based on many sources, primarily Zohar Yakhini

Upload: rae-deleon

Post on 30-Dec-2015


TRANSCRIPT

Page 1: DNA Microarras: Basics

DNA Chips and Their Analysis

Comp. Genomics: Lectures 10-11
based on many sources, primarily Zohar Yakhini

Page 2: DNA Microarras: Basics

DNA Microarrays: Basics

• What are they?
• Types of arrays (cDNA arrays, oligo arrays).
• What is measured using DNA microarrays?
• How are the measurements done?

Page 3: DNA Microarras: Basics

DNA Microarrays: Computational Questions

• Design of arrays.
• Techniques for analyzing experiments.
• Detecting differential expression.
• Similar expression: clustering.
• Other analysis techniques (many).
• Machine learning techniques, and applications for advanced diagnosis.

Page 4: DNA Microarras: Basics

What is a DNA Microarray (I)

• A surface (nylon, glass, or plastic).
• Containing hundreds to thousands of pixels.
• Each pixel has copies of a sequence of single-stranded DNA (ssDNA).
• Each such sequence is called a probe.

Page 5: DNA Microarras: Basics

What is a DNA Microarray (II)

• An experiment with 500-10k elements.
• A way to concurrently explore the function of multiple genes.
• A snapshot of the expression level of 500-10k genes under given test conditions.

Page 6: DNA Microarras: Basics

Some Microarray Terminology

• Probe: ssDNA printed on the solid substrate (nylon or glass). These are short substrings of the genes we are going to be testing.

• Target: cDNA which has been labeled and is to be washed over the probe.

Page 7: DNA Microarras: Basics

Back to Basics: Watson and Crick

James Watson and Francis Crick discovered, in 1953, the double helix structure of DNA.

From Zohar Yakhini

Page 8: DNA Microarras: Basics

Watson-Crick Complementarity

A binds to T
C binds to G

AATGCTTAGTCTTACGAATCAG

Perfect match

AATGCGTAGTCTTACGAATCAG

One-base mismatch

From Zohar Yakhini
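The pairing rules and the one-base mismatch above can be checked with a few lines of code. This is a minimal sketch, not part of the original lecture: the helper names `reverse_complement` and `mismatches` are hypothetical, and the two sequences are the ones shown on the slide.

```python
# Watson-Crick pairing rules for DNA bases: A-T and C-G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the strand that would hybridize perfectly to seq."""
    return "".join(PAIR[base] for base in reversed(seq))


def mismatches(a, b):
    """Count positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


perfect = "AATGCTTAGTCTTACGAATCAG"   # first sequence on the slide
one_off = "AATGCGTAGTCTTACGAATCAG"   # second sequence on the slide

print(mismatches(perfect, one_off))  # number of differing positions
print(reverse_complement(perfect))   # its Watson-Crick partner strand
```

The two slide sequences differ at a single position (T versus G at the sixth base), which is what makes the second one a one-base mismatch.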

Page 9: DNA Microarras: Basics

Array Based Hybridization Assays (DNA Chips)

Unknown sequence or mixture (target). Many copies.

• Array of probes.
• Thousands to millions of different probe sequences per array.

From Zohar Yakhini

Page 10: DNA Microarras: Basics

Array Based Hybridization Assays

• Target hybridizes to Watson-Crick complementary probes only

• Therefore – the fluorescence pattern is indicative of the target sequence.

From Zohar Yakhini

Page 11: DNA Microarras: Basics

Central Dogma of Molecular Biology (reminder)

Gene (DNA) → Transcription → mRNA → Translation → Protein

Cells express different subsets of the genes in different tissues and under different conditions.

From Zohar Yakhini

Page 12: DNA Microarras: Basics

Expression Profiling on MicroArrays

• Differentially label the query sample and the control (1-3).

• Mix and hybridize to an array.

• Analyze the image to obtain expression levels information.

From Zohar Yakhini

Page 13: DNA Microarras: Basics

Microarray: 2 Types of Fabrication

1. cDNA Arrays: Deposition of DNA fragments
– Deposition of PCR-amplified cDNA clones
– Printing of already synthesized oligonucleotides

2. Oligo Arrays: In situ synthesis
– Photolithography
– Ink Jet Printing
– Electrochemical Synthesis

By Steve Hookway lecture and Sorin Draghici’s book “Data Analysis Tools for DNA Microarrays”

Page 14: DNA Microarras: Basics

cDNA Microarrays vs. Oligonucleotide Probes and Cost

cDNA Arrays:
• Long sequences
• Spot unknown sequences
• More variability
• Arrays cheaper

Oligonucleotide Arrays:
• Short sequences
• Spot known sequences
• More reliable data
• Arrays typically more expensive

By Steve Hookway lecture and Sorin Draghici’s book “Data Analysis Tools for DNA Microarrays”

Page 15: DNA Microarras: Basics

Photolithography (Affymetrix)

• Similar to process used to generate VLSI circuits

• Photolithographic masks are used to add each base

• If base is present, there will be a “hole” in the corresponding mask

• Can create high density arrays, but sequence length is limited

From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

[Figure labels: photodeprotection, mask, base C]

Page 16: DNA Microarras: Basics

Photolithography (Affymetrix)

From Zohar Yakhini

Page 17: DNA Microarras: Basics

Ink Jet Printing

• Four cartridges are loaded with the four nucleotides: A, G, C, T

• As the printer head moves across the array, the nucleotides are deposited in pixels where they are needed.

• This way (many copies of) a 20-60 base long oligo is deposited in each pixel.

By Steve Hookway lecture and Sorin Draghici’s book “Data Analysis Tools for DNA Microarrays”

Page 18: DNA Microarras: Basics

Ink Jet Printing (Agilent)

[Figure labels: A, G, T, C]

The array is a stack of images in the colors A, C, G, T.

From Zohar Yakhini

Page 19: DNA Microarras: Basics

Inkjet Printed Microarrays

Inkjet head, squirting phosphoramidites

From Zohar Yakhini

Page 20: DNA Microarras: Basics

Electrochemical Synthesis

• Electrodes are embedded in the substrate to manage individual reaction sites

• Electrodes are activated in necessary positions in a predetermined sequence that allows the sequences to be constructed base by base

• Solutions containing specific bases are washed over the substrate while the electrodes are activated

From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

Page 21: DNA Microarras: Basics

Expression Profiling on MicroArrays

• Differentially label the query sample and the control (1-3).

• Mix and hybridize to an array.

• Analyze the image to obtain expression levels information.

From Zohar Yakhini

Page 22: DNA Microarras: Basics

Expression Profiling: a FLASH Demo

URL: http://www.bio.davidson.edu/courses/genomics/chip/chip.html

Page 23: DNA Microarras: Basics

Expression Profiling – Probe Design Issues

• Probe specificity and sensitivity.
• Special designs for splice variations or other custom purposes.
• Flat thermodynamics.
• Generic and universal systems

From Zohar Yakhini

Page 24: DNA Microarras: Basics

Hybridization Probes

• Sensitivity: Strong interaction between the probe and its intended target, under the assay's conditions. How much target is needed for the reaction to be detectable or quantifiable?

• Specificity: No potential cross-hybridization.

From Zohar Yakhini

Page 25: DNA Microarras: Basics

Specificity

• Symbolic specificity

• Statistical protection in the unknown part of the genome.

Methods, software and application in collaboration with Peter Webb, Doron Lipson.

From Zohar Yakhini

Page 26: DNA Microarras: Basics

Reading Results: Color Coding

• Numeric tables are difficult to read
• Data is presented with a color scale
• Coding scheme:
– Green = repressed (less mRNA) gene in experiment
– Red = induced (more mRNA) gene in experiment
– Black = no change (1:1 ratio)
• Or
– Green = control condition (e.g. aerobic)
– Red = experimental condition (e.g. anaerobic)
• We usually use ratio

Campbell & Heyer, 2003
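The first coding scheme above can be written as a small function of the experiment-to-control ratio. This is a sketch under assumptions: the slide only says "ratio", so working on a log2 scale, the tolerance `tol`, and the function names `log_ratio` and `color` are illustrative choices rather than anything stated in the lecture.

```python
import math


def log_ratio(experiment, control):
    """log2 of the experiment/control intensity ratio (assumed scale)."""
    return math.log2(experiment / control)


def color(experiment, control, tol=1e-9):
    """Map a pair of intensities to the slide's color code."""
    lr = log_ratio(experiment, control)
    if lr > tol:
        return "red"      # induced: more mRNA in the experiment
    if lr < -tol:
        return "green"    # repressed: less mRNA in the experiment
    return "black"        # no change (1:1 ratio)


for exp, ctl in [(200.0, 100.0), (50.0, 100.0), (100.0, 100.0)]:
    print(exp, ctl, color(exp, ctl))
```

A log scale is a common choice because a two-fold induction and a two-fold repression then sit at the same distance from zero, with opposite signs.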

Page 27: DNA Microarras: Basics

cDNA array,Inkjet deposition

In-Situ synthesized oligonucleotide array. 25-60 mers.

Thermal Ink Jet Arrays, by Agilent Technologies

Page 28: DNA Microarras: Basics

Application of Microarrays

• We know the function of only about 30% of the 30,000 genes in the Human Genome
– Gene exploration
– Functional Genomics

• DNA microarrays are just the first among many high throughput genomic devices (first used approx. 1996)

http://www.gene-chips.com/sample1.html

By Steve Hookway lecture and Sorin Draghici’s book “Data Analysis Tools for DNA Microarrays”

Page 29: DNA Microarras: Basics

A Data Mining Problem

• On a given microarray, we test on the order of 10k elements at one time
• Number of microarrays used in a typical experiment is no more than 100.
• Insufficient sampling.
• Data is obtained faster than it can be processed.
• High noise.
• Algorithmic approaches to work through this large data set and make sense of the data are desired.

Page 30: DNA Microarras: Basics

Informative Genes in a Two Classes Experiment

• Differentially expressed in the two classes.
• Identifying (statistically significant) informative genes

- Provides biological insight

- Indicate promising research directions

- Reduce data dimensionality

- Diagnostic assay

From Zohar Yakhini

Page 31: DNA Microarras: Basics

Scoring Genes

Expression pattern and pathological diagnosis information (annotation), for a single gene:

+ + - - + + + - - + - - + + -
a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15

Permute the annotation by sorting the expression pattern (ascending, say).

Informative genes

+ + + + + + + + - - - - - - -
- - - - - - - + + + + + + + +
- - - - + - - + + - + + + + +
etc.

Non-informative genes

+ - + - + + + + - - + + - - -
- + + - + - - + + - + + - - +
+ - - - + + - + + - + + - + -
etc.

From Zohar Yakhini

Page 32: DNA Microarras: Basics

Separation Score

• Compute a Gaussian fit for each class: $(\mu_1, \sigma_1)$, $(\mu_2, \sigma_2)$.

• The Separation Score is $(\mu_1 - \mu_2)/(\sigma_1 + \sigma_2)$.
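The separation score can be sketched as follows. Assumptions, none of which come from the slide: the Gaussian fit is taken to be the sample mean and the sample standard deviation (`statistics.stdev`), the function name `separation_score` is hypothetical, and the expression values are made up. In practice one would usually rank genes by the absolute value of the score.

```python
from statistics import mean, stdev


def separation_score(class1, class2):
    """(mu1 - mu2) / (sigma1 + sigma2) for one gene's expression values.

    Each class needs at least two values so that stdev is defined.
    """
    return (mean(class1) - mean(class2)) / (stdev(class1) + stdev(class2))


# Made-up expression levels of one gene in two classes of samples.
normals = [1.0, 2.0, 3.0]
tumors = [7.0, 8.0, 9.0]

score = separation_score(normals, tumors)
print(score, abs(score))
```

A large absolute score means the class means are far apart relative to the spread within each class.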

Page 33: DNA Microarras: Basics

Threshold Error Rate (TNoM) Score

Find the threshold that best separates tumors from normals, count the number of errors committed there.

Ex 1: - + + - + - - + + - + + - - +
# of errors = min(7,8) = 7. Not informative.

Ex 2: + + + + + + + + - - - - - - -
A perfect single gene classifier gets a score of 0. Very informative.

From Zohar Yakhini
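A threshold error count can be computed in one pass over the sorted labels. This sketch assumes one particular reading of the score: every possible threshold position is tried, in both orientations (low expression predicts '-', or low expression predicts '+'), and the smallest error count is returned. The function name `tnom` and the '+'/'-' label encoding are illustrative.

```python
def tnom(labels):
    """Minimum number of misclassified samples over all thresholds.

    labels: sequence of '+' / '-' annotations, ordered by ascending
    expression of the gene being scored.
    """
    n = len(labels)
    n_plus = sum(label == "+" for label in labels)
    n_minus = n - n_plus
    # Threshold before the first sample: everything gets one label.
    best = min(n_plus, n_minus)
    plus_left = 0
    for t, label in enumerate(labels, start=1):
        plus_left += label == "+"
        minus_right = n_minus - (t - plus_left)
        # Orientation A: left of threshold predicted '-', right '+'.
        errors_a = plus_left + minus_right
        # Orientation B is the mirror image of A.
        best = min(best, errors_a, n - errors_a)
    return best


print(tnom("+" * 8 + "-" * 7))  # perfectly separated labels
print(tnom("+-+-"))             # interleaved labels
```

For the perfectly separated pattern of Ex 2 the function returns 0, the score of a perfect single-gene classifier.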

Page 34: DNA Microarras: Basics

p-Values

• Relevance scores are more useful when we can compute their significance:
– p-value: The probability of finding a gene with a given score if the labeling is random

• p-Values allow for higher level statistical assessment of data quality.

• p-Values provide a uniform platform for comparing relevance, across data sets.

• p-Values enable class discovery

From Zohar Yakhini
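For small sample sizes, the p-value defined above can be computed exactly by enumerating every labeling. The sketch below is an illustration, not the lecture's method: it reuses a threshold error count as the relevance score (the hypothetical `tnom` helper, minimized over all thresholds and both orientations) and counts the labelings that score at least as well as the observed one.

```python
from itertools import combinations


def tnom(labels):
    """Minimum threshold error count over all thresholds (assumed score)."""
    n = len(labels)
    n_plus = sum(label == "+" for label in labels)
    best = min(n_plus, n - n_plus)
    plus_left = 0
    for t, label in enumerate(labels, start=1):
        plus_left += label == "+"
        errors = plus_left + (n - n_plus) - (t - plus_left)
        best = min(best, errors, n - errors)
    return best


def exact_p_value(n_plus, n_minus, observed):
    """Fraction of all labelings whose score is <= observed (lower is better)."""
    n = n_plus + n_minus
    hits = total = 0
    for plus_positions in combinations(range(n), n_plus):
        chosen = set(plus_positions)
        labels = ["+" if i in chosen else "-" for i in range(n)]
        total += 1
        hits += tnom(labels) <= observed
    return hits / total


# With 4 '+' and 4 '-' samples, only 2 of the 70 labelings separate perfectly.
print(exact_p_value(4, 4, 0))
```

Enumeration grows combinatorially with the number of samples, so for realistic data sets one would sample random labelings or use a closed-form distribution instead.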

Page 35: DNA Microarras: Basics

Breast Cancer BRCA1/BRCA2 data: BRCA1 Differential Expression

[Figure: expression heat map, genes (rows, axis ticks 100 to 700) by tissues (columns). Tissue groups: BRCA1 mutants and BRCA1 wildtype. Gene groups: genes over-expressed in BRCA1 mutants and genes under-expressed in BRCA1 mutants. Highlighted: sporadic sample s14321, with BRCA1-mutant expression profile.]

Collab with NIH, NEJM 2001

From Zohar Yakhini

Page 36: DNA Microarras: Basics

[Figure: expression heat map, genes (rows, axis ticks 50 to 550) by tissues (columns). LUCA, 38 samples, 14.May.2001 Kaminski.]

Lung Cancer Informative Genes

Data from Naftali Kaminski’s lab, at Sheba.

•24 tumors (various types and origins)

•10 normals (normal edges and normal lung pools)

From Zohar Yakhini

Page 37: DNA Microarras: Basics

And Now: Global Analysis of Gene Expression Data

Most common tasks:

1. Construct gene network from experiments.
2. Cluster - either genes, or experiments


Page 40: DNA Microarras: Basics

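Pearson Correlation Coefficient, r. Values are in the interval [-1,1].

• Gene expression over d experiments is a vector in $\mathbb{R}^d$, e.g. for gene C: (0, 3, 3.58, 4, 3.58, 3)

• Given two vectors X and Y that contain N elements, we calculate r as follows:

$$r = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}}$$

where $\bar{X}$ and $\bar{Y}$ are the means of X and Y.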

Cho & Won, 2003
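The Pearson coefficient of two expression vectors can be computed directly from its standard definition (covariance divided by the product of the standard deviations). This is a minimal sketch: the function name `pearson` is illustrative, and the only data taken from the lecture is the gene C vector.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


gene_c = [0, 3, 3.58, 4, 3.58, 3]
scaled = [2 * v + 1 for v in gene_c]   # same shape, different scale
flipped = [-v for v in gene_c]         # mirror-image profile

print(pearson(gene_c, scaled))   # close to 1: highly correlated
print(pearson(gene_c, flipped))  # close to -1: anti-correlated
```

Because r is invariant to shifting and positive scaling, two genes with the same expression shape but different magnitudes count as perfectly correlated.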

Page 41: DNA Microarras: Basics

Intuition for Pearson Correlation Coefficient

r(v1,v2) close to 1: v1, v2 highly correlated.

r(v1,v2) close to -1: v1, v2 anti-correlated.

r(v1,v2) close to 0: v1, v2 not correlated.

Page 42: DNA Microarras: Basics

Pearson Correlation and p-Values

When entries in v1, v2 are distributed according to a normal distribution, we can assign (and efficiently compute) p-Values for a given result.

These p-Values are determined by the Pearson correlation coefficient, r, and the dimension, d, of the vectors. For the same r, vectors of higher dimension will be assigned a more significant (smaller) p-Value.

Page 43: DNA Microarras: Basics

Spearman Rank Order Coefficient (a close relative of Pearson, non-parametric)

• Replace each entry xi by its rank in vector x.

• Then compute Pearson correlation coefficients of rank vectors.

• Example:
X = Gene C = (0, 3.00, 3.41, 4, 3.58, 3.01)
Y = Gene D = (0, 1.51, 2.00, 2.32, 1.58, 1)

• Ranks(X) = (1,2,4,6,5,3)
• Ranks(Y) = (1,3,5,6,4,2)
• Ties should be taken care of, but: (1) rare (2) can randomize (small effect)
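The rank transformation and the resulting coefficient for the Gene C / Gene D example can be reproduced as follows. This is a sketch that assumes no ties, as in the example; the helper names `ranks`, `pearson` and `spearman` are illustrative.

```python
import math


def ranks(values):
    """Rank of each entry (1 = smallest); assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


gene_c = [0, 3.00, 3.41, 4, 3.58, 3.01]
gene_d = [0, 1.51, 2.00, 2.32, 1.58, 1]

print(ranks(gene_c), ranks(gene_d))
print(spearman(gene_c, gene_d))
```

Without ties the result agrees with the shortcut $1 - 6\sum d_i^2 / (n(n^2-1))$, where $d_i$ is the difference between the two ranks of entry i; here $\sum d_i^2 = 4$ and $n = 6$.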

Page 44: DNA Microarras: Basics

From Pearson Correlation Coefficients to a Gene Network

• Compute correlation coefficient for all pairs of genes (what about missing data?)

• Choose p-Value threshold.

• Put an edge between gene i and gene j iff the p-Value of their correlation is below the threshold (i.e. the correlation is significant).

Page 45: DNA Microarras: Basics

Things May Get Messy

• What to do with significant yet negative correlation coefficients? Usually care only about the p-value and put a “normal edge”.

• Cases composed of multiple experiments where distribution is far from normal.

Page 46: DNA Microarras: Basics

Things Do Get Messy

[Figure: percentage of significantly correlated pairs (0 to 1), per experiment]

Page 47: DNA Microarras: Basics

What to Do when Things Get Messy?


Page 49: DNA Microarras: Basics

What to do when things Get Messy

1) Create a single vector of all experiments per gene. Compute correlations based on these vectors. This is the common approach.

Disadvantage: Outcome is dominated by the larger experiments.

Page 50: DNA Microarras: Basics

What to do when things Get Messy

2) For each edge, count the no. of experiments where it appears significantly. Take edges exceeding some threshold.

Disadvantage: Outcome is somewhat dominated by experiments with many significant correlations.

Page 51: DNA Microarras: Basics

What to do when things Get Messy

3) For each edge, make a weighted count of the experiments where it appears significantly. Weights are higher if the experiment has few significant correlations. Take edges exceeding some threshold.

Disadvantage: No solid mathematical justification.

Page 52: DNA Microarras: Basics

Summary of the procedure

1. Public microarray data sets (genes × samples)
2. Pearson correlation → pair-wise gene co-expression matrices (genes × genes)
3. Gene pair score → network of conserved co-expression links (nodes represent genes; edges represent highly correlated expressions)
4. Cluster detection → highly inter-connected clusters

Gene pair score:

$$\mathrm{score}(g_i, g_j) = \frac{\sum_{k=1}^{n} x^{k}_{i,j} \, \ln(1/p_k)}{\sum_{k=1}^{n} \ln(1/p_k)}$$

• $\langle g_i, g_j \rangle$ - a gene pair
• $n$ - number of datasets
• $x^{k}_{i,j}$ - 1 if $g_i$ and $g_j$ are significantly correlated in dataset k, 0 otherwise
• $p_k$ - proportion of significantly correlated gene pairs in dataset k
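The gene pair score is a weighted vote over datasets, where a dataset with few significant pairs (small p_k) carries more weight. The sketch below assumes the weighting ln(1/p_k), normalized by the sum of all weights, which is one reading of the slide's formula; the function name `gene_pair_score` and the numbers are made up for illustration.

```python
import math


def gene_pair_score(x, p):
    """Weighted fraction of datasets in which a gene pair is significant.

    x: list of 0/1 flags, x[k] = 1 if the pair is significantly
       correlated in dataset k.
    p: list of proportions, p[k] = fraction of significantly correlated
       gene pairs in dataset k (0 < p[k] < 1).
    """
    weights = [math.log(1.0 / pk) for pk in p]
    return sum(xk * w for xk, w in zip(x, weights)) / sum(weights)


# Made-up example with three datasets; the pair is significant in the
# first and third, which are also the most selective ones.
flags = [1, 0, 1]
proportions = [0.1, 0.5, 0.01]
print(gene_pair_score(flags, proportions))
```

The score lies between 0 (never significant) and 1 (significant in every dataset), so a single cutoff can be applied across gene pairs.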

Page 53: DNA Microarras: Basics

The Outcome (Whole Network)

Page 54: DNA Microarras: Basics

Outcome after Clustering

[Figure: panels A and B show clusters at node score cutoffs 0.2, 0.15, 0.1 and 0.05. Cluster annotations: ribosome-related, chloroplast-related, ER and mitochondrion-related, chloroplast and ribosome-related, chloroplast and ER-related; clusters marked * (1) to * (4) and +.]

Page 55: DNA Microarras: Basics

But what is Clustering?

Page 56: DNA Microarras: Basics

Grouping and Reduction

• Grouping: Partition items into groups. Items in same group should be similar. Items in different groups should be dissimilar.

• Grouping may help discover patterns in the data.

• Reduction: reduce the complexity of data by removing redundant probes (genes).

Page 57: DNA Microarras: Basics

Unsupervised Grouping: Clustering

• Pattern discovery via clustering similar entities together

• Techniques most often used:
– k-Means Clustering
– Hierarchical Clustering
– Biclustering
– Alternative Methods: Self Organizing Maps (SOMs), plaid models, singular value decomposition (SVD), order preserving submatrices (OPSM), ...

Page 58: DNA Microarras: Basics

Clustering Overview

• Different similarity measures in use:
– Pearson Correlation Coefficient
– Cosine Coefficient
– Euclidean Distance
– Information Gain
– Mutual Information
– Signal to noise ratio
– Simple Matching for Nominal

Page 59: DNA Microarras: Basics

Clustering Overview (cont.)

• Different Clustering Methods
– Unsupervised
• k-means Clustering
• Hierarchical Clustering
• Self-organizing map
– Supervised
• Support vector machine
• Ensemble classifier

Data Mining

Page 60: DNA Microarras: Basics

Clustering Limitations

• Any data can be clustered, therefore we must be careful what conclusions we draw from our results

• Clustering is often randomized. It can, and will, produce different results for different runs on same data

Page 61: DNA Microarras: Basics

k-means Clustering

• Given a set of m data points in d-dimensional space and an integer k.

• We want to find the set of k “centers” in d-dimensional space that minimizes the Euclidean (mean square) distance from each data point to its nearest center.

• No exact polynomial-time algorithms are known for this problem (no wonder, NP-hard!).

“A Local Search Approximation Algorithm for k-Means Clustering” by Kanungo et al.

Page 62: DNA Microarras: Basics

K-means Heuristic (Lloyd’s Algorithm)

• Has been shown to converge to a locally optimal solution

• But can converge to a solution arbitrarily bad compared to the optimal solution

•“K-means-type algorithms: A generalized convergence theorem and characterization of local optimality” by Selim and Ismail

•“A Local Search Approximation Algorithm for k-Means Clustering” by Kanungo et al.

[Figure: K=3 example showing data points, optimal centers and heuristic centers]

Page 63: DNA Microarras: Basics

Euclidean Distance

$$d_E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Now to find the distance between two points, say the origin O and the point A = (3,4):

$$d_E(O, A) = \sqrt{3^2 + 4^2} = 5$$

Simple and Fast! Remember this when we consider the complexity!

Page 64: DNA Microarras: Basics

Finding a Centroid

We use the following equation to find the n dimensional centroid point (center of mass) amid k (n dimensional) points:

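$$CP(x_1, x_2, \ldots, x_k) = \left( \frac{\sum_{i=1}^{k} x_{i,1}}{k}, \frac{\sum_{i=1}^{k} x_{i,2}}{k}, \ldots, \frac{\sum_{i=1}^{k} x_{i,n}}{k} \right)$$

where $x_{i,j}$ denotes the j-th coordinate of the i-th point.

Example: The midpoint between three 2D points, say: (2,4) (5,2) (8,9)

$$CP = \left( \frac{2 + 5 + 8}{3}, \frac{4 + 2 + 9}{3} \right) = (5, 5)$$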

Page 65: DNA Microarras: Basics

K-means Iterative Heuristic

1. Choose k initial center points “randomly”
2. Cluster data using Euclidean distance (or other distance metric)
3. Calculate new center points for each cluster, using only points within the cluster
4. Re-cluster all data using the new center points (this step could cause some data points to be placed in a different cluster)
5. Repeat steps 3 & 4 until no data points are moved from one cluster to another (stabilization), or until some other convergence criterion is met

From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
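The iterative heuristic above can be sketched in a few lines of standard-library Python. Assumptions made for the sake of a reproducible example: the initial centers are passed in explicitly rather than drawn at random, an empty cluster keeps its previous center, and the names `centroid`, `assign` and `kmeans` are illustrative.

```python
import math


def centroid(points):
    """Coordinate-wise mean (center of mass) of a list of points."""
    k = len(points)
    return tuple(sum(coords) / k for coords in zip(*points))


def assign(points, centers):
    """Index of the nearest center (Euclidean distance) for each point."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]


def kmeans(points, centers, max_iter=100):
    """Lloyd-style iteration starting from the given initial centers."""
    labels = assign(points, centers)                 # step 2
    for _ in range(max_iter):
        new_centers = []
        for j in range(len(centers)):                # step 3
            members = [p for p, l in zip(points, labels) if l == j]
            new_centers.append(centroid(members) if members else centers[j])
        centers = new_centers
        new_labels = assign(points, centers)         # step 4
        if new_labels == labels:                     # step 5: stabilized
            break
        labels = new_labels
    return centers, labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, [(0, 0), (10, 10)])
print(centers, labels)
print(centroid([(2, 4), (5, 2), (8, 9)]))  # the centroid example: (5.0, 5.0)
```

With random initial centers, as in step 1 of the heuristic, different runs may end in different local optima, so it is common to run the procedure several times and keep the best result.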

Page 66: DNA Microarras: Basics

An example with 2 clusters

1. We Pick 2 centers at random

2. We cluster our data around these center points

Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

Page 67: DNA Microarras: Basics

K-means example with k=2

3. We recalculate centers based on our current clusters

Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

Page 68: DNA Microarras: Basics

K-means example with k=2

4. We re-cluster our data around our new center points

Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

Page 69: DNA Microarras: Basics

k-means example with k=2

5. We repeat the last two steps until no more data points are moved into a different cluster

Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

Page 70: DNA Microarras: Basics

Choosing k

• Run algorithm on data with several different values of k

• Use prior knowledge about the characteristics of your test (e.g. cancerous vs non-cancerous tissues, in case it is the experiments that are being clustered)

Page 71: DNA Microarras: Basics

Cluster Quality

• Since any data can be clustered, how do we know our clusters are meaningful?
– The size (diameter) of the cluster vs. the inter-cluster distance
– Distance between the members of a cluster and the cluster’s center
– Diameter of the smallest sphere containing the cluster

From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

Page 72: DNA Microarras: Basics

Cluster Quality Continued

[Figure: two pairs of clusters, each cluster with diameter = 5; one pair at distance = 20, the other at distance = 5]

Quality of cluster assessed by ratio of distance to nearest cluster and cluster diameter

Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

Page 73: DNA Microarras: Basics

Cluster Quality Continued

Quality can be assessed simply by looking at the diameter of a cluster (but is the diameter alone sufficient?)

Warning: A cluster can be formed by the heuristic even when there is no similarity between clustered patterns. This occurs because the algorithm forces k clusters to be created.

From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

Page 74: DNA Microarras: Basics

Properties of k-means Clustering

• The random selection of initial center points implies the following properties
– Non-Determinism / Randomized
– May produce incoherent clusters

• One solution is to choose the centers randomly from existing points

From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

Page 75: DNA Microarras: Basics

Heuristic’s Complexity

• Linear in the number of data points, N

• Can be shown to have run time cN, where c does not depend on N, but rather on the number of clusters, k

• Each iteration computes N·k point-to-center distances, each costing O(d) in dimension d, so a single iteration takes O(N·k·d) time

• Efficient

From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

Page 76: DNA Microarras: Basics

Hierarchical Clustering

- a different clustering paradigm

Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

Page 77: DNA Microarras: Basics

Hierarchical Clustering (cont.)

Pairwise correlations between genes (columns are Genes D to N):

             D      E      F      G      H      I      J      K      L      M      N
Gene C    0.94   0.96  -0.40   0.95  -0.95   0.41   0.36   0.23   0.95  -0.94  -1
Gene D           0.84  -0.10   0.94  -0.94   0.68   0.24  -0.07   0.94  -1     -0.94
Gene E                 -0.57   0.89  -0.89   0.21   0.30   0.43   0.89  -0.84  -0.96
Gene F                        -0.35   0.35   0.60  -0.43  -0.79  -0.35   0.10   0.40
Gene G                               -1      0.48   0.22   0.11   1     -0.94  -0.95
Gene H                                      -0.48  -0.21  -0.11  -1      0.94   0.95
Gene I                                              0     -0.75   0.48  -0.68  -0.41
Gene J                                                     0      0.22  -0.24  -0.36
Gene K                                                            0.11   0.07  -0.23
Gene L                                                                  -0.94  -0.95
Gene M                                                                          0.94

Campbell & Heyer, 2003

Page 78: DNA Microarras: Basics

Hierarchical Clustering (cont.)

          Gene D  Gene E  Gene F  Gene G
Gene C     0.94    0.96   -0.40    0.95
Gene D             0.84   -0.10    0.94
Gene E                    -0.57    0.89
Gene F                            -0.35

C and E form cluster 1.

            Gene D  Gene F  Gene G
Cluster 1    0.89   -0.485   0.92
Gene D              -0.10    0.94
Gene F                      -0.35

Average “similarity” of cluster 1 to:

• Gene D: (0.94+0.84)/2 = 0.89
• Gene F: (-0.40+(-0.57))/2 = -0.485
• Gene G: (0.95+0.89)/2 = 0.92

Page 79: DNA Microarras: Basics

Hierarchical Clustering (cont.)

            Gene D  Gene F  Gene G
Cluster 1    0.89   -0.485   0.92
Gene D              -0.10    0.94
Gene F                      -0.35

G and D form cluster 2.

[Tree so far: cluster 1 = (C, E), cluster 2 = (G, D), F unmerged]

Page 80: DNA Microarras: Basics

Hierarchical Clustering (cont.)

            Cluster 2  Gene F
Cluster 1     0.905    -0.485
Cluster 2              -0.225

Clusters 1 and 2 form cluster 3.

Page 81: DNA Microarras: Basics

Hierarchical Clustering (cont.)

            Gene F
Cluster 3   -0.355

Cluster 3 and F form cluster 4.

Page 82: DNA Microarras: Basics

Hierarchical Clustering (cont.)

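[Final tree: cluster 4 = (F, cluster 3); cluster 3 = (cluster 1, cluster 2); cluster 1 = (C, E); cluster 2 = (G, D)]

Does the algorithm look familiar?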
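The merging procedure walked through on the preceding slides can be sketched as a short agglomerative loop. It follows the slides' rule as read from the worked numbers: merge the most similar pair, then set the new cluster's similarity to every other item to the plain average of the two merged items' similarities. The function name `agglomerate` and the frozenset-keyed dictionary are illustrative choices; the similarities are the five-gene values from the slides.

```python
def agglomerate(similarity):
    """Repeatedly merge the most similar pair; return the merge history.

    similarity: dict mapping frozenset({a, b}) -> similarity of a and b,
    given for every pair of items.
    """
    sim = dict(similarity)
    merges = []
    next_id = 1                      # clusters are numbered 1, 2, 3, ...
    while sim:
        pair = max(sim, key=sim.get)             # most similar pair
        a, b = sorted(pair, key=str)
        merges.append((a, b, sim.pop(pair)))
        others = {x for key in sim for x in key} - {a, b}
        for o in others:
            # Average of the two merged items' similarities to o.
            merged = (sim.pop(frozenset({a, o})) + sim.pop(frozenset({b, o}))) / 2
            sim[frozenset({next_id, o})] = merged
        next_id += 1
    return merges


pairs = {
    ("C", "D"): 0.94, ("C", "E"): 0.96, ("C", "F"): -0.40, ("C", "G"): 0.95,
    ("D", "E"): 0.84, ("D", "F"): -0.10, ("D", "G"): 0.94,
    ("E", "F"): -0.57, ("E", "G"): 0.89, ("F", "G"): -0.35,
}
merges = agglomerate({frozenset(k): v for k, v in pairs.items()})
for a, b, s in merges:
    print(a, b, round(s, 3))
```

The merge order reproduces the slides: (C, E), then (D, G), then clusters 1 and 2, and finally F at a similarity of about -0.355.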

Page 83: DNA Microarras: Basics

Clustering of entire yeast genome

Campbell & Heyer, 2003

Page 84: DNA Microarras: Basics

Hierarchical Clustering:Yeast Gene Expression Data

Eisen et al., 1998

Page 85: DNA Microarras: Basics

A SOFM Example With Yeast

“Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation” by Tamayo et al.

Page 86: DNA Microarras: Basics

SOM Description

• Each unit of the SOM has a weighted connection to all inputs

• As the algorithm progresses, neighboring units are grouped by similarity

Input Layer

Output Layer

From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici

Page 87: DNA Microarras: Basics

An Example Using Color

Each color in the map is associated with a weight

From http://davis.wpi.edu/~matt/courses/soms/

Page 88: DNA Microarras: Basics

Cluster Analysis of Microarray Expression Data Matrices

Application of cluster analysis techniques in the elucidation of gene expression data.

Important paradigm: Guilt by association

Page 89: DNA Microarras: Basics

Cluster Analysis

• Cluster Analysis is an unsupervised procedure which involves grouping of objects based on their similarity in feature space.

• In the Gene Expression context, Genes are grouped based on the similarity of their Condition feature profile.

• Cluster analysis was first applied to Gene Expression data from Brewer’s Yeast (Saccharomyces cerevisiae) by Eisen et al. (1998).

Two general conclusions can be drawn from these clusters:

• Genes clustered together may be related within a biological module/system.

• If there are genes of known function within a cluster these may help to classify this biological module/system.

[Figure: a Genes × Conditions expression matrix is clustered into groups A, B and C (axes X, Y, Z).]

Clusters A, B and C represent groups of related genes.

Page 90: DNA Microarras: Basics

From Data to Biological Hypothesis

Cluster C with four Genes may represent System C.

Relating these genes aids in elucidation of this System C.

[Figure labels: Gene Expression Microarray Cluster Set, Conditions (A-Z), Genes 1-7, clusters A, B, C (axes X, Y); System C: External Stimulus (Condition X), Regulator Protein, Toxin, DNA (Gene a, Gene b, Gene c, Gene d), Gene Expression, Toxin Pump, Cell Membrane]

Page 91: DNA Microarras: Basics

Some Drawbacks of Clustering Biological Data

1. Clustering works well over small numbers of conditions but a typical Microarray may have hundreds of experimental conditions. A global clustering may not offer sufficient resolution with so many features.

2. As with other clustering applications, it may be difficult to cluster noisy expression data.

3. Biological Systems tend to be inter-related and may share numerous factors (Genes). Clustering enforces partitions which may not accurately represent these relationships.

4. Clustering Genes over all Conditions only finds the strongest signals in the dataset as a whole. More ‘local’ signals within the data matrix may be missed.

[Figure: clusters A, B and C (axes X, Y, Z) may represent a more complex system, with inter-related clusters (3) and local signals (4)]

Page 92: DNA Microarras: Basics

How do we better model more complex systems?

• One technique that allows detection of all signals in the data is biclustering.

• Instead of clustering genes over all conditions, biclustering clusters genes with respect to subsets of conditions.

This enables better representation of:

- interrelated clusters (genes may belong to more than one bicluster).
- local signals (genes correlated over only a few conditions).
- noisy data (allows erratic genes to belong to no cluster).

Page 93: DNA Microarras: Basics

Biclustering

• Technique first described by J.A. Hartigan in 1972 and termed ‘Direct Clustering’.

• First introduced to Microarray expression data by Cheng and Church (2000).

[Figure: a data matrix of Genes 1-9 by Conditions A-H. Clustering (of genes) over all conditions groups Genes 1, 4 and 9. Biclustering (of genes AND conditions) finds Genes 1, 4, 6, 7 and 9 over Conditions B, E and F.]

Clustering misses local signal {(B,E,F),(1,4,6,7,9)} present over subset of conditions.

Biclustering discovers local coherences over a subset of conditions.

Page 94: DNA Microarras: Basics

Approaches to Biclustering Microarray Gene Expression

• First applied to Gene Expression Data by Cheng and Church (2000).
– Used a sub-matrix scoring technique to locate biclusters.

• Tanay et al. (2002)
– Modelled the expression data on bipartite graphs and used graph techniques to find ‘complete graphs’ or biclusters.

• Lazzeroni and Owen
– Used matrix reordering to represent different ‘layers’ of signals (biclusters): ‘Plaid Models’ to represent multiple signals within data.

• Ben-Dor et al. (2002)
– “Biclusters” depending on order relations (OPSM).

Page 95: DNA Microarras: Basics

Bipartite Graph Modelling

• First proposed in: “Discovering statistically significant biclusters in gene expression data”, Tanay et al., Bioinformatics 2002

Within the graph modelling paradigm biclusters are equivalent to complete bipartite sub-graphs.

Tanay and colleagues used probabilistic models to determine the least probable sub-graphs (those showing most order and consequently most surprising) to identify biclusters.

[Figure: Data Matrix M (Genes 1-7 by Conditions A-F) drawn as bipartite Graph G; the Sub-Matrix (Bicluster) of Genes 1, 4, 6 and Conditions A, D corresponds to Sub-graph H (Bicluster).]

Page 96: DNA Microarras: Basics

The Cheng and Church Approach

The core element in this approach is the development of a scoring to prioritise sub-matrices.

This scoring is based on the concept of the residue of an entry in a matrix.

In the Matrix (I,J) the residue score of element $a_{ij}$ is given by:

$$R(a_{ij}) = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}$$

where $a_{iJ}$ is the average of row i, $a_{Ij}$ is the average of column j, and $a_{IJ}$ is the average of the whole matrix.

In words, the residue of an entry is the value of the entry minus the row average, minus the column average, plus the average value in the matrix.

This score gives an idea of how the value $a_{ij}$ fits into the data in the surrounding matrix.
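The residue can be computed for every entry of a sub-matrix in a few lines. The sketch below also averages the squared residues into a single number, the mean squared residue, which is what Cheng and Church's approach uses as the H score of a sub-matrix; the function names `residues` and `h_score` and the toy matrices are illustrative.

```python
def residues(matrix):
    """Residue of every entry: value - row mean - column mean + overall mean."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_mean = [sum(row) / n_cols for row in matrix]
    col_mean = [sum(matrix[i][j] for i in range(n_rows)) / n_rows
                for j in range(n_cols)]
    all_mean = sum(row_mean) / n_rows
    return [[matrix[i][j] - row_mean[i] - col_mean[j] + all_mean
             for j in range(n_cols)]
            for i in range(n_rows)]


def h_score(matrix):
    """Mean squared residue of the matrix (lower means more coherent)."""
    res = residues(matrix)
    return sum(v * v for row in res for v in row) / (len(matrix) * len(matrix[0]))


# Rows differ only by a constant shift, so every residue vanishes.
coherent = [[1, 2, 3], [2, 3, 4], [4, 5, 6]]
# A checkerboard has no shared row/column trend.
incoherent = [[1, 0], [0, 1]]

print(h_score(coherent))
print(h_score(incoherent))
```

A score near zero indicates that the genes in the sub-matrix rise and fall together across the chosen conditions, which is the kind of coherence a bicluster is meant to capture.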

Page 97: DNA Microarras: Basics

Conclusions:

• High throughput Functional Genomics (Microarrays) requires Data Mining Applications.

• Biclustering resolves Expression Data more effectively than single dimensional Cluster Analysis.

Future Research/Questions:

• Implement a simple H score program to facilitate study of the H score concept.

• Are there other alternative scorings which would better apply to gene expression data?

• Do un-biclustered genes have any significance? Horizontally transferred genes?

• Implement full scale biclustering program and look at better adaptation to expression data sets and the biological context.

Page 98: DNA Microarras: Basics

References

• Basic microarray analysis: grouping and feature reduction by Soumya Raychaudhuri, Patrick D. Sutphin, Jeffery T. Chang and Russ B. Altman; Trends in Biotechnology Vol. 19 No. 5 May 2001

• Self Organizing Maps, Tom Germano, http://davis.wpi.edu/~matt/courses/soms

• “Data Analysis Tools for DNA Microarrays” by Sorin Draghici; Chapman & Hall/CRC 2003

• Self-Organizing-Feature-Maps versus Statistical Clustering Methods: A Benchmark by A. Ultsch, C. Vetter; FG Neuroinformatik & Künstliche Intelligenz Research Report 0994

Page 99: DNA Microarras: Basics

References

• Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation by Tamayo et al.

• A Local Search Approximation Algorithm for k-Means Clustering by Kanungo et al.

• K-means-type algorithms: A generalized convergence theorem and characterization of local optimality by Selim and Ismail