embnet's introduction to bioinformatics€¦ · swiss institute of bioinformatics dna...
TRANSCRIPT
Swiss Institute of Bioinformatics
EMBnet's introduction to bioinformatics
Introduction to microarrays
Thierry Sengstag, PhDBioinformatics Core Facility
Swiss Institute of Bioinformatics
Part ITechnology of microarrays
Swiss Institute of Bioinformatics
Biology Fundamentals - Genes
Swiss Institute of Bioinformatics
Biology Fundamentals - Expression
Transcriptome: Genes
Proteome: Proteins
Microarrays
Swiss Institute of Bioinformatics
Genomics Fundamentals - Complexity
Difficulties:
§Contaminations
§Alternative Splicing
§Alternative PolyAdenylation
mRNA purification
Swiss Institute of Bioinformatics
RNA abundance in mammalian cells
rRNA
tRNAmRNA
80% 1%1-50
50-500
500+
Molecules/cell
3 x 106 molecules/cell 3 x 105 molecules/cell1-2 x104 different genes
Swiss Institute of Bioinformatics
DNA microarray is a technology that allows scientists to simultaneously detect thousands of genes in a small sample and to analyze the expression of those genes.
Microarrays are ordered sets of DNA molecules of known sequence attached on a surface at a known position (spot) .
In a microarray experiment one hybridizes mRNA molecules of an extract of a sample to the spots of the microarray.
Main families of microarrays:
• Spotted arrays, PCR products spotted on chip, ~500 nt
• Oligo-arrays, e.g. Agilent, oligos ~60 nt length, in-situ
• Affymetrix, short sequence oligos, 25 nt, in-situ
What are DNA Microarrays ?
Swiss Institute of Bioinformatics
1- Samples
2- Extracting mRNA
3- Labeling
4- Hybridizing
5- Scanning
6- Visualizing
Swiss Institute of Bioinformatics
Various technological choices:• 104 to 106 features on a single array• Single- vs two-color approach• Hybridization protocols
Questions addressed:• What are the differences (in gene expression) between cell lines ?• What is the difference between knock-out and wild-type mice?• What is the difference between a tumor and a healthy tissue ?• Are there different tumor types ?
Key concept:Compare gene expression in two (or more) cell/tissue types ?
Gene expression assessed by measuring the number of RNA transcripts in a tissue sample. (Primary goal of this course.)
What are DNA Microarrays ?
Swiss Institute of Bioinformatics
Phase 1: Preparation of the microarray environment
• Which sequences do we want to interrogate on the arrays ?
• Other technically important questions (choice of scanner, chemistry, etc…)
• Presently (2006): commercial platforms are standard for "usual" organisms; "exotic" organisms still require custom-made arrays
Phase 2: Use of the microarray
• "Experimental design" (with a statistician)
• Preparation of RNA samples
• Hybridization, scanning, signal extraction
• Statistical analysis
Two major phases of a microarray experiment
Swiss Institute of Bioinformatics
Phase 1: Which sequences should we spot ?
Depends on the organism:
• Human, Mouse, Rat, … have large databases of sequences that can be used to design probes by bioinformatics means
• "Exotic" organisms require cloning of 100s to 1000s mRNA transcripts, then spotting of DNA after PCR amplification (sequencing of interesting genes can be done after microarrayexperiment)
Depends on platform:
• Oligo arrays can probe any region of a mRNA transcript
• Affymetrix require the sequence to be in the 3'-UTR region
Swiss Institute of Bioinformatics
Biological question(e.g. Differentially expressed genes,
Sample class prediction, etc.)
Testing
Biological verification and interpretation
Microarray experiment
Significance
Experimental design
Image analysis/Quality assessment
Normalization
Clustering Discrimination
(failed)
Pre-processing steps
Data Analysis
Scientific Process
Phase 2:
Swiss Institute of Bioinformatics
Spotted array preparation
“Average” mouse mRNA
cDNA isolation
Test sequence(probe) production
~100 - ~2000 bp
RT-PCR (conversionmRNA-cDNA, amplification)
Swiss Institute of Bioinformatics
Array Production: Spotting
Swiss Institute of Bioinformatics
Spotting in action…
1. Some rounds of pin cleaning2. Pickup PCR products from plate3. Spot one feature on each subarray
Spotting arrays
Swiss Institute of Bioinformatics
Oligo array preparation
Sequence databases
Millions of experiencesworldwide
Probe (sequence) design-known genes -putative genes-alternative splicing-GC contents
Gene-specific sequences
~60 bp sequences
In-situ synthesis
Swiss Institute of Bioinformatics
Spotted and oligo array usage
Hybridizationwashing
Relative mRNAlevels
Scanning
cy5 labeledcDNA
cy3 labeledcDNA
Mix
Swiss Institute of Bioinformatics
Affymetrix chip preparation
Sequence databases
Millions of experimentsworldwide
Probe (sequence) design-known genes -putative genes-alternative splicing-GC contents
Bioinformatics thinkingyields gene-specific sequences (3’-end)
25 nt sequences
In-situ synthesis
~100s of nt “consensus”sequences
Swiss Institute of Bioinformatics
Affymetrix chip usage
Hybridizationwashing
Relative mRNA levels
Scanning
Scanning
Swiss Institute of Bioinformatics
Affymetrix system
(11 to 16)
Usually the most 3 prime area, often UTR
25mer25mer
25mer
AAAA..
25mer
Swiss Institute of Bioinformatics
Probe preparation & hybridization
• Extract mRNA or total RNA• RT, add 5’ anchor• PCR with labelled nucleotide (Cy3, Cy5, DIG, …)• Overlay probe on the chip, put in the hybridization
chamber, wash
Swiss Institute of Bioinformatics
Scanner basics
• Based on fluorescence– 1 or 2 lasers: cy3 cy5
(seldom more)
• Most scanners are confocal– Target a very limited volume
of space(signal only from focal plane)
– Need to “scan” the surface
• 16-bits ADC converters– Range of values: 0-65535– Log2 range: 0 – 16
• Scan various supports– Glass Slide (e.g. Agilent,
PerkinElmer)– Affymetrix
Swiss Institute of Bioinformatics
Confocal scanner
Dye Photons Electrons SignalLaser PMT
A/DConverter
excitation amplification FilteringTime-spaceaveraging
Swiss Institute of Bioinformatics
Images from Scanner
• Resolution– standard 10 µm [currently, best ~ 5µm]– 100µm spot on chip = 10 pixels in diameter
• Image format– Typically: TIFF (tagged image file format) 16 bit (65,536
levels of gray) – also other formats – 1cm x 1cm image at 16 bit = 2Mb
• Separate image for each fluorescent sample– channel 1, channel 2, etc.
Swiss Institute of Bioinformatics
Images in analysis software
• The two 16-bit images (Cy3, Cy5) are viewed as 8-bit images
• Display fluorescence intensities for both wavelengths using a 24-bit RGB overlay image
• RGB image :– Blue values (B) are set to 0
– Red values (R) are used for Cy5 intensities
– Green values (G) are used for Cy3 intensities
• Qualitative representation of results
Swiss Institute of Bioinformatics
Images : examples
Cy3
Cy5
repressedControl > Treatedgreen
inducedControl < Treatedred
unchangedControl = Treatedyellow
Gene expression
Signal strengthSpot color
Pseudo-color overlay
Swiss Institute of Bioinformatics
Image analysis (scanner variability)
ScanArray 4000
Agilent G2565AA
Swiss Institute of Bioinformatics
Image processing
• Align channels• Identify spot pixels• Identify background pixels• Compute representative value, e.g.
– Mean foreground value– Median background value
Swiss Institute of Bioinformatics
2-color Arrays Image Processing
GenePix
Swiss Institute of Bioinformatics
2-color Arrays Image Processing
A difficult case… JJJJ
Swiss Institute of Bioinformatics
Quantification of Expression
For each spot on the slide, calculate
Red intensity = Rfg - Rbg
(fg = foreground, bg = background) and
Green intensity = Gfg - Gbg
and combine them in the log (base 2) ratio
Log2(Red/Green)
Often, fg = mean and bg = median of relevant pixel intensities
Swiss Institute of Bioinformatics
M and A values
• M-value is the log2 of the ratio of expression values
– Properties:• If the gene is expressed with the same intensity in the red
and green conditions: M=0• If the gene is moreexpressed in the redcondition: M>0• if the gene is moreexpressed in the greencondition: M<0
• A-value is the average of the log2 of expression
M = log 2( Red / Green )= log 2( Red ) – log 2( Green )
A = 1/2 ( log 2( Red ) + log 2( Green ) )= log 2( sqrt( Red * Green ) )
Swiss Institute of Bioinformatics
Why we take logs
Linear scale Log scale
Better representation of genes with "medium" expression:
Biologically, a unit change in log2 represents a 2-fold change.
Swiss Institute of Bioinformatics
MvA plots
• Relationship between Intensity and MvA plots
log2(Green)
log2(Red)
0 160
16M
A16
45o Rotation
Red saturationRed saturation
Green saturationGreen saturation
0
+1
-1(+ stretching by afactor 2 along M axis)
Swiss Institute of Bioinformatics
Hybridization of extra materialwith known concentrations (spikes)
A real-life MvA plot
Swiss Institute of Bioinformatics
End of Part I
Swiss Institute of Bioinformatics
Part IIExtraction of gene signal
from microarrays
Swiss Institute of Bioinformatics
Biological question(e.g. Differentially expressed genes,
Sample class prediction, etc.)
Testing
Biological verification and interpretation
Microarray experiment
Significance
Experimental design
Image analysis/Quality assessment
Normalization
Clustering Discrimination
(failed)
Pre-processing steps
Data Analysis
Scientific Process
Swiss Institute of Bioinformatics
Steps in Images Processing
• Addressing (or Gridding)– Assigning coordinates to each spot
• Segmentation– Classification of pixels as either foreground (signal)
or background
• Information Extraction– Foreground fluorescence intensity pairs (R,G)
– Background intensities
– Quality measures
Swiss Institute of Bioinformatics
Addressing
This is the process of assigning coordinates to each of the spots.
Automating this part of the procedure permits high throughput analysis.
4 by 4 grids19 by 21 spots per grid
Swiss Institute of Bioinformatics
Addressing
4 by 4 grids
Within the same batch of print runs; estimate translation of grids
Swiss Institute of Bioinformatics
Problems in automatic addressing
• Misregistration of the red and green channels
• Rotation of the array in the image
RotationRotation
Swiss Institute of Bioinformatics
Problems in automatic addressing
• Skew in the array
Swiss Institute of Bioinformatics
ScanAlyze
• Parameters to address spot positions– Separation between rows and columns of
grids
– Individual translation of grids
– Separation between rows and columns of spots within each grid
– Small individual translation of spots
– Overall position of the array in the image
Addressing
• Basic structure of images known(determined by the arrayer)
Swiss Institute of Bioinformatics
Steps in Images Processing
• Addressing (or Gridding)– Assigning coordinates to each spot
• Segmentation– Classification of pixels as either foreground (signal)
or background
• Information Extraction– Foreground fluorescence intensity pairs (R,G)
– Background intensities
– Quality measures
Swiss Institute of Bioinformatics
Segmentation Methods
• Fixed circles
• Adaptive circles
• Adaptive shape – Edge detection – Seeded Region Growing (R. Adams and L.
Bishof (1994): Regions grow outwards from seed pointspreferentially according to the difference between a pixel’s value and the running mean of values in an adjoining region
• Histogram methods
Swiss Institute of Bioinformatics
Fixed circle segmentation
• Fits a circle with a constant diameter to all spots in the image
• Easy to implement
• The spots should be of the same shape and size
May not be goodfor this example
Swiss Institute of Bioinformatics
Adaptive circle segmentation
• The circle diameter is estimated separately for each spot
Dapple finds spots by detecting edges of spots (second derivative)
• Problematic if spot exhibits oval shapes
Swiss Institute of Bioinformatics
Limitation of circular segmentation
—Small spot
—Not circular
Result ofSeed Region Growing
Swiss Institute of Bioinformatics
Adaptive shape segmentation
• Specification of starting points or seeds– Bonus: already know geometry of array
• Regions grow outwards from the seed points preferentially according to the difference between a pixel’s value and the running mean of values in an adjoining region
Swiss Institute of Bioinformatics
Histogram segmentation
• Choose target mask larger than any spot
• Fg and bg intensities determined from the histogram of pixel values forpixels within the masked area
• Example : QuantArray– Background : mean between
5th and 20th percentile– Foreground : mean between
80th and 95th percentile
• May not work well when a large target mask is set to compensate for variation in spot size
Bkgd Foreground
Swiss Institute of Bioinformatics
Steps in Images Processing
• Addressing (or Gridding)– Assigning coordinates to each spot
• Segmentation– Classification of pixels as either foreground (signal)
or background
• Information Extraction– Foreground fluorescence intensity pairs (R,G)
– Background intensities
– Quality measures
Swiss Institute of Bioinformatics
Information Extraction
• Spot Intensities§mean of pixel intensities
§median of pixel intensities
§ Pixel variation (e.g. IQR)
• Background values§None
§ Local
§Constant (global)
• Quality Information
Take the average
Swiss Institute of Bioinformatics
Spot ‘foreground’ intensity
• The total amount of hybridization for a spot is proportional to the total fluorescence generated by the spot
• Spot intensity = sum of pixel intensities within the spot mask
• Since later calculations are based on ratiosbetween Cy5 and Cy3, we compute the average* pixel value over the spot mask*alternative : ratios of medians may be better than
means if bright specks present
Swiss Institute of Bioinformatics
Background intensity
• The measured fluorescence intensity includes a contribution of non-specific hybridizationand other chemicals on the glass
• Fluorescence from regions not occupied by DNA should be different from regions occupied by DNA
→ one solution is to use local negative controls (spotted DNA that should not hybridize)
Swiss Institute of Bioinformatics
BG: None
• Do not consider the background
– Can be better than some forms of local background determination with good quality arrays
+−++−+
=bgGbgfgGfg
bgRbgfgRfg
GG
RRM
,,
,,2log
σσσσ
++++
≈)(
)(log
,,
,,2
bgGfgGfg
bgRfgRfg
G
R
σσσσ
With a loose mathematicalnotation:
++
=fgGfg
fgRfg
G
RM
,
,2log
σσ
worse than
Swiss Institute of Bioinformatics
BG: Local
• Focus on small regions surrounding the spot mask
• Median of pixel values in this region
• Most software implements such an approach
ImaGene Spot, GenePixScanAlyze
• By ignoring pixels immediately surrounding the spots, bg estimate is less sensitive to the performance of the segmentation procedure
Swiss Institute of Bioinformatics
Background can matter
Without BG correction With BG correction
Swiss Institute of Bioinformatics
Summary
• Image analysis is a crucial preprocessing step
– Association of a "geographic" location (and corresponding annotation) with signal intensities
– Several non-trivial technical choices (scanner, image analysis software, etc…) can affect the quality of the signal
• Bg correction is sometimes not desirable (low bg arrays)
Swiss Institute of Bioinformatics
Biological question(e.g. Differentially expressed genes,
Sample class prediction, etc.)
Testing
Biological verification and interpretation
Microarray experiment
Significance
Experimental design
Image analysis/Quality assessment
Normalization
Clustering Discrimination
(failed)
Pre-processing steps
Data Analysis
Scientific Process
Swiss Institute of Bioinformatics
Quality assessment overview
Visual inspection of images
Evaluation of MvA plots
Compare statistical summaries for the chips
Swiss Institute of Bioinformatics
Good: low bg, detectable d.e.Bad: high bg, ghost spots, little d.e.
Co-registration and overlay offers a quick visualization, revealing information on color balance, uniformity of hybridization, spot uniformity, background, and artifacts such as dust or scratches
Red/Green overlay images
Swiss Institute of Bioinformatics
Spatial plots: background from two slides
Swiss Institute of Bioinformatics
Practical Problems 1
Comet Tails
§ Likely cause: insufficiently rapid immersion of the slides in the succinic anhydride blocking solution
Swiss Institute of Bioinformatics
Practical Problems 2
Blotches, unequal spot sizes, overlapping spots
Swiss Institute of Bioinformatics
Practical Problems 3
High Background• 2 likely causes:
– Insufficient blocking
– Precipitation of the labeled probe
Weak Signals
Swiss Institute of Bioinformatics
Practical Problems 4
Spot overlap§Likely cause: Too much rehydrationduring post -processing
Swiss Institute of Bioinformatics
Practical Problems 5
Swiss Institute of Bioinformatics
Artifacts in microarrays
• We are interested in finding true biologically meaningful differences between sample types
• Due to other sources of systematic variation, there are also usually artifactual differences
• Sources of artifacts include:– print tips - differences in subarrays
– plate effects – differences in rows within subarray
– batch effects
– hybridization artifacts
Swiss Institute of Bioinformatics
Sample boxplot
median
Q3
Q1
`whiskers’
suspected outliers
Swiss Institute of Bioinformatics
Liver samples from 16 mice: 8 WT, 8 ApoAI KO
Boxplots of log2R/G
(Example data associated to limmaGUI package.)
Swiss Institute of Bioinformatics
Biological question(e.g. Differentially expressed genes,
Sample class prediction, etc.)
Testing
Biological verification and interpretation
Microarray experiment
Estimation
Experimental design
Image analysis/Quality assessment
Normalization
Clustering Discrimination
(failed)
Pre-processing steps
Data Analysis
Scientific Process
Swiss Institute of Bioinformatics
Boxplots of log ratios by pin groupLowess lines through points from pin groups
Pin group (sub-array) effects
Swiss Institute of Bioinformatics
Boxplots, highlighting pin group effects
Clear example of spatial bias
Print-tip group
Log-ratio
Swiss Institute of Bioinformatics
Preprocessing: Normalization
• Why?To correct for systematic differences between samples on the same slide, or between slides, which do not represent true biological variation between samples
• How do we know it is necessary?By examining self-self hybridizations, where no true differential expression is occurring.
There are dye biases which vary with spot intensity, location on the array, plate origin, pins, scanning parameters, etc.
Swiss Institute of Bioinformatics
What is self-self hybridization?
• In dual channel (2-color) microarrays, such as cDNA arrays, two samples are each labeledwith a different fluorescent dye
• In most studies, the samples are from different sources (e.g. cancer vs. normal)
• However, it is also possible to co-hybridize two samples from the same source (but differently labeled)
Swiss Institute of Bioinformatics
Dual channel co-hybridizations
(self-self)Control sample
Treated sample
Control sample
Control sample
Swiss Institute of Bioinformatics
False color overlay Boxplots within pin-groups Scatter (MA-)plots
Self-self hybridizations
Swiss Institute of Bioinformatics
Normalization: global
• Normalization based on a global adjustment
log2 R/G → log2 R/G - c = log2 R/(kG)
• Common choices for k or c = log2k are c = median or mean of log ratios for a particular gene set (e.g. all genes, or control, or ‘housekeeping’ genes)
• Another possibility is total intensitynormalization, where k = ∑Ri/ ∑Gi
Swiss Institute of Bioinformatics
Effect of global normalization
Swiss Institute of Bioinformatics
Normalization: intensity-dependent
• Here, run a line through the middle of the MA plot, shifting the M value of the pair (A,M) by c=c(A), i.e.
log2 R/G → log2 R/G - c (A) = log2 R/(k(A)G)
• One estimate of c(A) is made using the LOWESS (or loess) function of Cleveland (1979): LOcallyWEighted Scatterplot Smoothing
Swiss Institute of Bioinformatics
Effect of lowess normalization
Swiss Institute of Bioinformatics
Comparison between arrays
• Different arrays often do not show identical signal distribution of M values– Various technical reasons (e.g. labeling efficiency, amount of
labelled RNA, scanner settings, etc…)
• Need to normalize the signalbetween chips– Multiple possibilities, one
often used: "scale normalization"
Swiss Institute of Bioinformatics
Boxplots of log ratios from 3 replicate self-self hybs
Left panel: beforenormalization
Middle panel: after within print-tip groupnormalization
Right panel: after a further between-slide scalenormalization
Scale normalization: between slides
Idea: make the median spread of M values identical by multiplyingthem by a chip-specific constant
Swiss Institute of Bioinformatics
Assume: All slides have the same spread in M
• True log ratio is mij wherei represents differentslides and j represents different spots
• Observed is Mij, where Mij = ai mij
• Robust estimate of ai is
MAD i = medianj { |mij - median(mij) | }
• Could instead make same assumption for print tipgroups (rather than slides)
Taking scale into account
Swiss Institute of Bioinformatics
NCI 60 experiments
Swiss Institute of Bioinformatics
Same normalization on another data set
Before
After
Swiss Institute of Bioinformatics
Normalization: Summary
• Reduces systematic (not random) effects• Makes it possible to compare several arrays
• Use logratios (MVA plots)• Lowess normalization (dye bias)• Pin-group location normalization• Pin-group scale normalization• Between slide scale normalization
• Control Spots• Normalization introduces more variability• Outliers (bad spots) handled with replication
Swiss Institute of Bioinformatics
cDNA gene expression data
Data on p genes for n samples:
Genes(Spots)
mRNA samples
Gene expression level of gene i in mRNA sample j
= (normalized) Log2( Red intensity / Green intensity)
sample1 sample2 sample3 sample4 sample5 …1 0.46 0.30 0.80 1.51 0.90 ...2 -0.10 0.49 0.24 0.06 0.46 ...3 0.15 0.74 0.04 0.10 0.20 ...4 -0.45 -1.03 -0.79 -0.56 -0.32 ...5 -0.06 1.06 1.35 1.09 -1.09 ...
M
Swiss Institute of Bioinformatics
Software for Microarray Analysis
• Very large number of commercial and free softwares (GeneSpring, PathwayAssist,…)
• There are several R packages for microarrayanalysis available as part of the open source BioConductor project http://www.bioconductor.org/
• BioC software often created by the author of the methodology
Swiss Institute of Bioinformatics
Biological question(e.g. Differentially expressed genes,
Sample class prediction, etc.)
Testing
Biological verification and interpretation
Microarray experiment
Significance
Experimental design
Image analysis/Quality assessment
Normalization
Clustering Discrimination
(failed)
Pre-processing steps
Data Analysis
Scientific Process
Swiss Institute of Bioinformatics
cDNA gene expression data
Data on p genes for n samples:
Genes(Spots)
mRNA samples
Gene expression level of gene i in mRNA sample j
= (normalized) Log2( Red intensity / Green intensity)
sample1 sample2 sample3 sample4 sample5 …1 0.46 0.30 0.80 1.51 0.90 ...2 -0.10 0.49 0.24 0.06 0.46 ...3 0.15 0.74 0.04 0.10 0.20 ...4 -0.45 -1.03 -0.79 -0.56 -0.32 ...5 -0.06 1.06 1.35 1.09 -1.09 ...
M
Swiss Institute of Bioinformatics
Replicated experiments
• Have n replicates
• For each gene, have n values of M = log2 fold change, one from each array
• Summarize M1, ..., Mn for each gene by
– M = average (M1, ..., Mn)
– s = SD(M1, ..., Mn)
• Rank genes in order of strength of evidence in favor of DE
• How might we do this?
Swiss Institute of Bioinformatics
Which genes are DE?
• Difficult to judge significance
– massive multiple testing problem
– genes dependent
– don’t know null distribution of M
• Strategy
– aim to rank genes
– assume most genes are not DE (depending on type of experiment and array)
– find genes separated from the majority
Swiss Institute of Bioinformatics
Ranking criteria
• Genes i = 1, ...,p
• M i = average log2 fold change for gene i
– Problem : genes with large variability likely to be selected, even if not DE
• Fix that by taking variability into account: use ti = Mi/ (si/√n)
– Problem : genes with extremely small variances make very large t
– When the number of replicates is small, the smallest si are likely to be underestimates
Swiss Institute of Bioinformatics
Summary
• Image analysis is important to extract information from the array– Background may or may not be taken into account
• Normalization procedures are always needed– To remove systematic (technical) effects
– To allow comparisons between chips
• Identification of differentially expressed genes is difficult– No absolute estimation of significance is possible
– Ranking of genes by significance is possible
Swiss Institute of Bioinformatics
End of part II
Swiss Institute of Bioinformatics
Part IIIHigher-level analysis
Swiss Institute of Bioinformatics
Finding biological information
Once the matrix of gene-expression vs samples is available, statistical tools can be used to:
• Find similarity (or difference) of expression pattern in differentially expressed genes
• Find differentially expressed functional groups of genes (pathway analysis, gene ontology)
• Find classes in the set of samples(unsupervised analysis)
• Use differentially expressed genes as a mean to classify samples in known categories(supervised analysis)
• Find genes significantly related to survival in a pool of patients
Swiss Institute of Bioinformatics
Unsupervised analysis: Cluster analysis
• data matrix (n,p)
• distance matrix (n,n), similarity matrix (n,n)
• cluster formation:– mutually exclusive
clusters– hierarchical clusters
• comparison of clusters, means and variances
Dendrogram
Swiss Institute of Bioinformatics
Hierarchical Clustering (real case)
Sorlie et al. Proc Natl Acad Sci U S A 2001 Sep 11;98(1 9):10869-74
Swiss Institute of Bioinformatics
Swiss Institute of Bioinformatics
Unsupervised analysis: PCA
• Principal Components Analysis (PCA)
• Columns (resp. rows) of expression matrix viewed as points in multidimensional space
• Find “dominant” directions in space and hope these directions can be associated with known parameters
• Genes (resp. samples) with largest projection on those vectors explain the parameter
X1
X2
PC1PC2
X1
X2
Swiss Institute of Bioinformatics
Supervised analysis example: KNN
• Based on a measure of distance between observations (e.g. Euclidean distance or one minus correlation)
• k-nearest neighbor rule (Fix and Hodges (1951)) classifies an observation X as follows:– find the k observations in the learning set closest to X
– predict the class of X by majority vote, i.e., choose the class that is most common among those k observations.
• The number of neighbors k can be chosen by cross-validation
?
Gene 1
Gene 2
k-Nearest Neighbor (knn)Data MatrixGene
Sam
ple
Swiss Institute of Bioinformatics
High-level analysis (summary)
• Needed to extract biological information after low-level processing conducted
• Can be used to:– Discover new interactions between genes (unsupervised
analysis)
– Validate biological hypotheses (supervised analysis)
• Contact your local statistician for more on this topic !
Swiss Institute of Bioinformatics
End of Part III