Swiss Institute of Bioinformatics EMBnet's introduction to bioinformatics Introduction to microarrays Thierry Sengstag, PhD Bioinformatics Core Facility

Post on 08-Jul-2020


Page 1

Swiss Institute of Bioinformatics

EMBnet's introduction to bioinformatics

Introduction to microarrays

Thierry Sengstag, PhD

Bioinformatics Core Facility

Page 2

Part I: Technology of microarrays

Page 3

Biology Fundamentals - Genes

Page 4

Biology Fundamentals - Expression

Transcriptome: Genes

Proteome: Proteins

Microarrays

Page 5

Genomics Fundamentals - Complexity

Difficulties:

• Contaminations

• Alternative splicing

• Alternative polyadenylation

mRNA purification

Page 6

RNA abundance in mammalian cells

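[Figure: composition of cellular RNA (rRNA, tRNA, mRNA; percentages shown: 80%, 1%) and mRNA abundance classes of 1-50, 50-500 and 500+ molecules/cell. Totals shown: 3 x 10^6 molecules/cell, 3 x 10^5 molecules/cell, 1-2 x 10^4 different genes]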

Page 7

DNA microarray is a technology that allows scientists to simultaneously detect thousands of genes in a small sample and to analyze the expression of those genes.

Microarrays are ordered sets of DNA molecules of known sequence, each attached to a surface at a known position (spot).

In a microarray experiment one hybridizes mRNA molecules of an extract of a sample to the spots of the microarray.

Main families of microarrays:

• Spotted arrays, PCR products spotted on chip, ~500 nt

• Oligo-arrays, e.g. Agilent, oligos ~60 nt length, in-situ

• Affymetrix, short sequence oligos, 25 nt, in-situ

What are DNA microarrays?

Page 8

1- Samples

2- Extracting mRNA

3- Labeling

4- Hybridizing

5- Scanning

6- Visualizing

Page 9

Various technological choices:

• 10^4 to 10^6 features on a single array

• Single- vs two-color approach

• Hybridization protocols

Questions addressed:

• What are the differences (in gene expression) between cell lines?

• What is the difference between knock-out and wild-type mice?

• What is the difference between a tumor and a healthy tissue?

• Are there different tumor types?

Key concept: compare gene expression in two (or more) cell/tissue types.

Gene expression is assessed by measuring the number of RNA transcripts in a tissue sample. (Primary goal of this course.)

What are DNA microarrays?

Page 10

Phase 1: Preparation of the microarray environment

• Which sequences do we want to interrogate on the arrays?

• Other technically important questions (choice of scanner, chemistry, etc.)

• Presently (2006): commercial platforms are standard for "usual" organisms; "exotic" organisms still require custom-made arrays

Phase 2: Use of the microarray

• "Experimental design" (with a statistician)

• Preparation of RNA samples

• Hybridization, scanning, signal extraction

• Statistical analysis

Two major phases of a microarray experiment

Page 11

Phase 1: Which sequences should we spot?

Depends on the organism:

• Human, Mouse, Rat, … have large databases of sequences that can be used to design probes by bioinformatics means

• "Exotic" organisms require cloning of 100s to 1000s of mRNA transcripts, then spotting of DNA after PCR amplification (sequencing of interesting genes can be done after the microarray experiment)

Depends on the platform:

• Oligo arrays can probe any region of an mRNA transcript

• Affymetrix arrays require the sequence to be in the 3'-UTR region

Page 12

Phase 2:

[Figure: flowchart of the scientific process. Boxes: Biological question (e.g. differentially expressed genes, sample class prediction, etc.); Experimental design; Microarray experiment; Pre-processing steps: Image analysis/Quality assessment (with a "failed" branch), Normalization; Data analysis: Significance, Testing, Clustering, Discrimination; Biological verification and interpretation]

Page 13

Spotted array preparation

“Average” mouse mRNA

cDNA isolation

Test sequence (probe) production

~100 - ~2000 bp

RT-PCR (conversion mRNA-cDNA, amplification)

Page 14

Array Production: Spotting

Page 15

Spotting in action…

1. Some rounds of pin cleaning

2. Pick up PCR products from plate

3. Spot one feature on each subarray

Spotting arrays

Page 16

Oligo array preparation

Sequence databases

Millions of experiments worldwide

Probe (sequence) design:

– known genes

– putative genes

– alternative splicing

– GC content

Gene-specific sequences

~60 bp sequences

In-situ synthesis

Page 17

Spotted and oligo array usage

Hybridization, washing

Relative mRNA levels

Scanning

Cy5-labeled cDNA

Cy3-labeled cDNA

Mix

Page 18

Affymetrix chip preparation

Sequence databases

Millions of experiments worldwide

Probe (sequence) design:

– known genes

– putative genes

– alternative splicing

– GC content

Bioinformatics thinking yields gene-specific sequences (3'-end)

25 nt sequences

In-situ synthesis

~100s of nt "consensus" sequences

Page 19

Affymetrix chip usage

Hybridization, washing

Relative mRNA levels

Scanning

Scanning

Page 20

Affymetrix system

[Figure: Affymetrix probe set: (11 to 16) 25-mer probes per transcript, usually in the most 3' area, often the UTR; the poly(A) tail is shown as AAAA..]

Page 21

Probe preparation & hybridization

• Extract mRNA or total RNA

• RT, add 5' anchor

• PCR with labelled nucleotide (Cy3, Cy5, DIG, …)

• Overlay probe on the chip, put in the hybridization chamber, wash

Page 22

Scanner basics

• Based on fluorescence

– 1 or 2 lasers: Cy3, Cy5 (seldom more)

• Most scanners are confocal

– Target a very limited volume of space (signal only from focal plane)

– Need to "scan" the surface

• 16-bit A/D converters

– Range of values: 0-65535

– Log2 range: 0-16

• Scan various supports

– Glass slide (e.g. Agilent, PerkinElmer)

– Affymetrix
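The 16-bit figures quoted above follow directly from the bit depth. As a minimal sketch (the function name `adc_range` is hypothetical and an ideal converter is assumed), the value range and the log2 span can be checked like this:

```python
import math

ADC_BITS = 16  # bit depth quoted on the slide


def adc_range(bits):
    """Return (largest code, log2 span) for an ideal A/D converter."""
    max_value = 2 ** bits - 1             # largest value the converter can emit
    log2_span = math.log2(max_value + 1)  # number of doublings that fit in the range
    return max_value, log2_span


max_value, log2_span = adc_range(ADC_BITS)
print(max_value, log2_span)
```

Every intensity on the array therefore has to fit in 16 doublings, which is why saturated spots pile up at the top of the scale.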

Page 23

Confocal scanner

[Figure: signal chain of a confocal scanner. Elements: Laser, Dye, Photons, PMT, Electrons, A/D converter, Signal. Stage labels: excitation, amplification, filtering, time-space averaging]

Page 24

Images from Scanner

• Resolution

– Standard 10 µm [currently, best ~5 µm]

– 100 µm spot on chip = 10 pixels in diameter

• Image format

– Typically: TIFF (tagged image file format), 16 bit (65,536 levels of gray)

– Also other formats

– 1 cm x 1 cm image at 16 bit = 2 MB

• Separate image for each fluorescent sample

– Channel 1, channel 2, etc.
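The image-size and spot-diameter figures above can be reproduced with a little arithmetic. The sketch below assumes an uncompressed image with no header overhead and the standard 10 µm pixel; the function name `image_size_bytes` is hypothetical.

```python
def image_size_bytes(width_um, height_um, pixel_um, bits_per_pixel):
    """Size of an uncompressed single-channel scan, ignoring file headers."""
    pixels_x = round(width_um / pixel_um)
    pixels_y = round(height_um / pixel_um)
    return pixels_x * pixels_y * bits_per_pixel // 8


# 1 cm x 1 cm scanned at 10 µm per pixel, 16 bits per pixel
size = image_size_bytes(10_000, 10_000, 10, 16)

# A 100 µm spot at 10 µm per pixel
spot_diameter_pixels = 100 // 10

print(size, spot_diameter_pixels)
```

Halving the pixel size to 5 µm quadruples the number of pixels, and so the file size, for the same scanned area.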

Page 25

Images in analysis software

• The two 16-bit images (Cy3, Cy5) are viewed as 8-bit images

• Display fluorescence intensities for both wavelengths using a 24-bit RGB overlay image

• RGB image:

– Blue values (B) are set to 0

– Red values (R) are used for Cy5 intensities

– Green values (G) are used for Cy3 intensities

• Qualitative representation of results
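The overlay described above can be sketched in a few lines. This is an illustration only: it assumes the 16-bit to 8-bit reduction is done by dropping the low byte, whereas analysis software may apply other scalings, and the helper names are hypothetical.

```python
def to_8bit(value_16bit):
    """Reduce a 16-bit intensity (0-65535) to 8 bits (0-255) by dropping the low byte."""
    return value_16bit >> 8


def overlay_pixel(cy5, cy3):
    """Build one (R, G, B) pixel: Cy5 goes to red, Cy3 to green, blue stays 0."""
    return (to_8bit(cy5), to_8bit(cy3), 0)


def overlay_image(cy5_image, cy3_image):
    """Combine two equally sized 16-bit images (lists of rows) into an RGB image."""
    return [
        [overlay_pixel(r, g) for r, g in zip(row5, row3)]
        for row5, row3 in zip(cy5_image, cy3_image)
    ]


cy5 = [[65535, 0], [32768, 1000]]
cy3 = [[0, 65535], [32768, 1000]]
rgb = overlay_image(cy5, cy3)
print(rgb)
```

Equal Cy5 and Cy3 intensities give equal red and green values, which the eye reads as yellow. The reduction to 8 bits is also why the overlay is only a qualitative view of the data.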

Page 26

Images : examples

Pseudo-color overlay of the Cy3 and Cy5 images:

Gene expression   Signal strength     Spot color
repressed         Control > Treated   green
induced           Control < Treated   red
unchanged         Control = Treated   yellow

Page 27

Image analysis (scanner variability)

ScanArray 4000

Agilent G2565AA

Page 28

Image processing

• Align channels

• Identify spot pixels

• Identify background pixels

• Compute representative value, e.g.

– Mean foreground value

– Median background value

Page 29

2-color Arrays Image Processing

GenePix

Page 30

2-color Arrays Image Processing

A difficult case…

Page 31

Quantification of Expression

For each spot on the slide, calculate

Red intensity = Rfg - Rbg

(fg = foreground, bg = background) and

Green intensity = Gfg - Gbg

and combine them in the log (base 2) ratio

Log2(Red/Green)

Often, fg = mean and bg = median of relevant pixel intensities
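The calculation above can be sketched for a single spot as follows. The function name and the pixel values are hypothetical. Returning `None` when a background-corrected intensity is not positive is an assumption of this sketch, since the logarithm is then undefined.

```python
import math
from statistics import mean, median


def spot_log_ratio(red_fg, red_bg, green_fg, green_bg):
    """log2(Red/Green) from foreground and background pixel lists of one spot."""
    red = mean(red_fg) - median(red_bg)        # Red intensity = Rfg - Rbg
    green = mean(green_fg) - median(green_bg)  # Green intensity = Gfg - Gbg
    if red <= 0 or green <= 0:
        return None  # log ratio undefined; such spots need separate handling
    return math.log2(red / green)


ratio = spot_log_ratio(
    red_fg=[850, 950], red_bg=[90, 100, 110],
    green_fg=[250, 350], green_bg=[90, 100, 110],
)
print(ratio)
```

In this made-up example the corrected intensities are 800 (red) and 200 (green), a 4-fold ratio.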

Page 32

M and A values

• M-value is the log2 of the ratio of expression values

M = log2(Red / Green) = log2(Red) - log2(Green)

– Properties:

• If the gene is expressed with the same intensity in the red and green conditions: M = 0

• If the gene is more expressed in the red condition: M > 0

• If the gene is more expressed in the green condition: M < 0

• A-value is the average of the log2 of expression

A = 1/2 (log2(Red) + log2(Green)) = log2(sqrt(Red * Green))
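As a small illustration of these definitions, the sketch below converts a (Red, Green) pair to (M, A) and back. The function names are hypothetical and the inputs are assumed to be positive, background-corrected intensities.

```python
import math


def m_a_values(red, green):
    """Return (M, A) for positive red and green intensities."""
    log_r = math.log2(red)
    log_g = math.log2(green)
    return log_r - log_g, 0.5 * (log_r + log_g)


def from_m_a(m, a):
    """Invert the transform: log2(Red) = A + M/2 and log2(Green) = A - M/2."""
    return 2 ** (a + m / 2), 2 ** (a - m / 2)


m, a = m_a_values(4096, 1024)
print(m, a)             # M > 0: more expressed in the red condition
print(from_m_a(m, a))   # recovers the original intensities
```

No information is lost by moving to M and A: the pair is just another coordinate system for the two log intensities.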

Page 33

Why we take logs

Linear scale Log scale

Better representation of genes with "medium" expression:

Biologically, a unit change in log2 represents a 2-fold change.
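A short numerical illustration of the point, using made-up intensities and a hypothetical helper name: a doubling and a halving are asymmetric on the linear scale (2 and 0.5) but symmetric on the log2 scale (+1 and -1).

```python
import math


def log2_fold_change(treated, control):
    """log2 of the treated/control ratio; +1 is a doubling, -1 a halving."""
    return math.log2(treated / control)


up = log2_fold_change(800, 400)    # linear ratio 2.0
down = log2_fold_change(200, 400)  # linear ratio 0.5
print(up, down)
```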

Page 34

MvA plots

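• Relationship between intensity and MvA plots

[Figure: the intensity plot (log2(Red) against log2(Green), both axes from 0 to 16) becomes the MvA plot (M against A, A from 0 to 16) by a 45° rotation (+ stretching by a factor 2 along the M axis). The lines M = 0, +1 and -1 and the red and green saturation limits are marked.]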

Page 35

Hybridization of extra material with known concentrations (spikes)

A real-life MvA plot

Page 36

End of Part I

Page 37

Part II: Extraction of gene signal from microarrays

Page 38

[Figure: flowchart of the scientific process. Boxes: Biological question (e.g. differentially expressed genes, sample class prediction, etc.); Experimental design; Microarray experiment; Pre-processing steps: Image analysis/Quality assessment (with a "failed" branch), Normalization; Data analysis: Significance, Testing, Clustering, Discrimination; Biological verification and interpretation]

Page 39

Steps in image processing

• Addressing (or gridding)

– Assigning coordinates to each spot

• Segmentation

– Classification of pixels as either foreground (signal) or background

• Information extraction

– Foreground fluorescence intensity pairs (R, G)

– Background intensities

– Quality measures

Page 40

Addressing

This is the process of assigning coordinates to each of the spots.

Automating this part of the procedure permits high throughput analysis.

4 by 4 grids

19 by 21 spots per grid
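To make the idea of addressing concrete, here is a minimal sketch that computes nominal spot centres for the layout quoted above. It assumes a perfectly regular array and reads "19 by 21" as 19 rows by 21 columns. The pitches (4500 µm between grids, 200 µm between spots) and the function name are hypothetical values chosen for illustration.

```python
def spot_centres(grid_rows, grid_cols, spot_rows, spot_cols,
                 grid_pitch, spot_pitch, origin=(0.0, 0.0)):
    """Map (grid_row, grid_col, spot_row, spot_col) to a nominal (x, y) centre."""
    x0, y0 = origin
    centres = {}
    for gr in range(grid_rows):
        for gc in range(grid_cols):
            for sr in range(spot_rows):
                for sc in range(spot_cols):
                    # Grid offset plus spot offset inside the grid
                    x = x0 + gc * grid_pitch + sc * spot_pitch
                    y = y0 + gr * grid_pitch + sr * spot_pitch
                    centres[(gr, gc, sr, sc)] = (x, y)
    return centres


centres = spot_centres(4, 4, 19, 21, grid_pitch=4500.0, spot_pitch=200.0)
print(len(centres))            # total number of spots on the array
print(centres[(1, 2, 3, 4)])   # one example address
```

Real images deviate from this ideal grid through translation, rotation and skew, so addressing software adjusts these nominal positions rather than using them directly.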

Page 41

Addressing

4 by 4 grids

Within the same batch of print runs; estimate translation of grids

Page 42

Problems in automatic addressing

• Misregistration of the red and green channels

• Rotation of the array in the image

Rotation

Page 43

Problems in automatic addressing

• Skew in the array

Page 44

Addressing

• Basic structure of images known (determined by the arrayer)

• Parameters to address spot positions

– Separation between rows and columns of grids

– Individual translation of grids

– Separation between rows and columns of spots within each grid

– Small individual translation of spots

– Overall position of the array in the image

ScanAlyze

Page 45

Steps in image processing

• Addressing (or gridding)

– Assigning coordinates to each spot

• Segmentation

– Classification of pixels as either foreground (signal) or background

• Information extraction

– Foreground fluorescence intensity pairs (R, G)

– Background intensities

– Quality measures

Page 46

Segmentation Methods

• Fixed circles

• Adaptive circles

• Adaptive shape

– Edge detection

– Seeded Region Growing (R. Adams and L. Bischof, 1994): regions grow outwards from seed points preferentially according to the difference between a pixel's value and the running mean of values in an adjoining region

• Histogram methods

Page 47

Fixed circle segmentation

• Fits a circle with a constant diameter to all spots in the image

• Easy to implement

• The spots should be of the same shape and size

May not be good for this example

Page 48

Adaptive circle segmentation

• The circle diameter is estimated separately for each spot

Dapple finds spots by detecting edges of spots (second derivative)

• Problematic if spots have an oval shape

Page 49

Limitation of circular segmentation

• Small spot

• Not circular

Result of Seeded Region Growing

Page 50

Adaptive shape segmentation

• Specification of starting points or seeds

– Bonus: already know geometry of array

• Regions grow outwards from the seed points preferentially according to the difference between a pixel’s value and the running mean of values in an adjoining region
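The growth rule above can be illustrated with a deliberately simplified sketch. It is not the full Adams and Bischof algorithm, in which several seeded regions compete for pixels. Here a single foreground region grows from one seed, and a pixel is accepted only if it lies within a hypothetical `tolerance` of the region's running mean. The function name, the tolerance and the toy image are all assumptions made for illustration.

```python
import heapq


def grow_region(image, seed, tolerance):
    """Grow one 4-connected region from `seed`, closest-to-mean pixels first."""
    rows, cols = len(image), len(image[0])
    region = {seed}
    total = image[seed[0]][seed[1]]  # sum of values in the region
    queued = {seed}                  # pixels already placed on the frontier
    frontier = []                    # heap of (difference to mean, (row, col))

    def push_neighbours(r, c):
        running_mean = total / len(region)
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in queued:
                queued.add((nr, nc))
                diff = abs(image[nr][nc] - running_mean)
                heapq.heappush(frontier, (diff, (nr, nc)))

    push_neighbours(*seed)
    while frontier:
        _, (r, c) = heapq.heappop(frontier)
        running_mean = total / len(region)
        if abs(image[r][c] - running_mean) > tolerance:
            continue  # too different from the region: treated as background
        region.add((r, c))
        total += image[r][c]
        push_neighbours(r, c)
    return region


image = [
    [10, 10, 10, 10, 10],
    [10, 90, 95, 10, 10],
    [10, 92, 100, 98, 10],
    [10, 10, 96, 10, 10],
    [10, 10, 10, 10, 10],
]
spot = grow_region(image, seed=(2, 2), tolerance=20)
print(sorted(spot))
```

The resulting mask follows the irregular outline of the bright pixels, which is the advantage of adaptive shape segmentation over fitted circles.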

Page 51

Histogram segmentation

• Choose target mask larger than any spot

• Fg and bg intensities determined from the histogram of pixel values for pixels within the masked area

• Example: QuantArray

– Background: mean between 5th and 20th percentile

– Foreground: mean between 80th and 95th percentile

• May not work well when a large target mask is set to compensate for variation in spot size

[Figure: histogram of pixel values within the mask, with the background (Bkgd) and foreground bands marked]
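A minimal sketch of the percentile rule quoted above, assuming the band boundaries are taken as simple rank cut-offs in the sorted pixel list. How QuantArray computes its percentiles is not stated in the slides, so this is an illustration and not that program's implementation. The function name and pixel values are hypothetical.

```python
def percentile_band_mean(values, low_pct, high_pct):
    """Mean of the sorted values whose rank lies between two percentiles."""
    ordered = sorted(values)
    n = len(ordered)
    band = ordered[int(n * low_pct / 100):int(n * high_pct / 100)]
    return sum(band) / len(band)


# Mask of 100 pixels: 70 background pixels and a spot of 30 pixels
mask = [100] * 70 + [1000] * 30
background = percentile_band_mean(mask, 5, 20)
foreground = percentile_band_mean(mask, 80, 95)
print(background, foreground)

# Same mask size, but the spot covers only 10 pixels
small_spot_mask = [100] * 90 + [1000] * 10
diluted_foreground = percentile_band_mean(small_spot_mask, 80, 95)
print(diluted_foreground)
```

The second case shows the weakness noted in the last bullet: when the mask is much larger than the spot, background pixels fall inside the 80th to 95th percentile band and drag the foreground estimate down.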

Page 52

Steps in image processing

• Addressing (or gridding)

– Assigning coordinates to each spot

• Segmentation

– Classification of pixels as either foreground (signal) or background

• Information extraction

– Foreground fluorescence intensity pairs (R, G)

– Background intensities

– Quality measures

Page 53

Information Extraction

• Spot intensities

– Mean of pixel intensities

– Median of pixel intensities

– Pixel variation (e.g. IQR)

• Background values

– None

– Local

– Constant (global)

• Quality information

Take the average

Page 54

Spot ‘foreground’ intensity

• The total amount of hybridization for a spot is proportional to the total fluorescence generated by the spot

• Spot intensity = sum of pixel intensities within the spot mask

• Since later calculations are based on ratios between Cy5 and Cy3, we compute the average* pixel value over the spot mask

*Alternative: ratios of medians may be better than means if bright specks are present

Page 55

Background intensity

• The measured fluorescence intensity includes a contribution of non-specific hybridization and other chemicals on the glass

• Fluorescence from regions not occupied by DNA should be different from regions occupied by DNA

→ one solution is to use local negative controls (spotted DNA that should not hybridize)

Page 56

BG: None

• Do not consider the background

– Can be better than some forms of local background determination with good quality arrays

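With a loose mathematical notation:

\[
M = \log_2 \frac{R_{fg} + \sigma_{R,fg} - R_{bg} + \sigma_{R,bg}}{G_{fg} + \sigma_{G,fg} - G_{bg} + \sigma_{G,bg}} \approx \log_2 \frac{R_{fg} + (\sigma_{R,fg} + \sigma_{R,bg})}{G_{fg} + (\sigma_{G,fg} + \sigma_{G,bg})}
\]

is worse than

\[
M = \log_2 \frac{R_{fg} + \sigma_{R,fg}}{G_{fg} + \sigma_{G,fg}}
\]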

Page 57

BG: Local

• Focus on small regions surrounding the spot mask

• Median of pixel values in this region

• Most software implements such an approach

[Figure: local background regions used by ImaGene, Spot, GenePix and ScanAlyze]

• By ignoring pixels immediately surrounding the spots, bg estimate is less sensitive to the performance of the segmentation procedure
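One way to picture a local background estimate is a ring around the spot, with a gap between the spot mask and the ring. The sketch below takes the median of the pixels in such a ring. The ring shape, the radii and the function name are assumptions of this example. Each software package defines its own background region.

```python
from statistics import median


def local_background(image, centre, inner_radius, outer_radius):
    """Median of pixels whose distance d to `centre` satisfies inner < d <= outer."""
    cr, cc = centre
    ring = []
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            d2 = (r - cr) ** 2 + (c - cc) ** 2
            if inner_radius ** 2 < d2 <= outer_radius ** 2:
                ring.append(value)
    return median(ring)


# 7 x 7 toy image: flat background of 50, a small bright spot, one dust speck
image = [[50] * 7 for _ in range(7)]
for r, c in ((3, 3), (2, 3), (4, 3), (3, 2), (3, 4)):
    image[r][c] = 1000      # spot pixels (radius 1 around the centre)
image[0][3] = 5000          # bright speck inside the background ring

bg = local_background(image, centre=(3, 3), inner_radius=2, outer_radius=3)
print(bg)
```

Because the inner radius is larger than the spot, pixels on the spot's edge never enter the estimate, and the median keeps the single bright speck from inflating it.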

Page 58

Background can matter

Without BG correction With BG correction

Page 59

Summary

• Image analysis is a crucial preprocessing step

– Association of a "geographic" location (and corresponding annotation) with signal intensities

– Several non-trivial technical choices (scanner, image analysis software, etc.) can affect the quality of the signal

• Bg correction is sometimes not desirable (low bg arrays)

Page 60

[Figure: flowchart of the scientific process. Boxes: Biological question (e.g. differentially expressed genes, sample class prediction, etc.); Experimental design; Microarray experiment; Pre-processing steps: Image analysis/Quality assessment (with a "failed" branch), Normalization; Data analysis: Significance, Testing, Clustering, Discrimination; Biological verification and interpretation]

Page 61

Quality assessment overview

Visual inspection of images

Evaluation of MvA plots

Compare statistical summaries for the chips

Page 62

Good: low bg, detectable differential expression (d.e.)

Bad: high bg, ghost spots, little d.e.

Co-registration and overlay offers a quick visualization, revealing information on color balance, uniformity of hybridization, spot uniformity, background, and artifacts such as dust or scratches

Red/Green overlay images

Page 63

Spatial plots: background from two slides

Page 64

Practical Problems 1

Comet Tails

• Likely cause: insufficiently rapid immersion of the slides in the succinic anhydride blocking solution

Page 65

Practical Problems 2

Blotches, unequal spot sizes, overlapping spots

Page 66

Practical Problems 3

High Background

• 2 likely causes:

– Insufficient blocking

– Precipitation of the labeled probe

Weak Signals

Page 67

Practical Problems 4

Spot overlap

• Likely cause: too much rehydration during post-processing

Page 68

Practical Problems 5

Page 69

Artifacts in microarrays

• We are interested in finding true biologically meaningful differences between sample types

• Due to other sources of systematic variation, there are also usually artifactual differences

• Sources of artifacts include:

– print tips – differences in subarrays

– plate effects – differences in rows within subarray

– batch effects

– hybridization artifacts

Page 70

Sample boxplot

median

Q3

Q1

"whiskers"

suspected outliers
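The elements labelled in the boxplot can be computed as in the sketch below. It assumes the common convention that whiskers extend to the most extreme values within 1.5 x IQR of the quartiles and that anything beyond is a suspected outlier. The slide does not state which convention its figure uses. The quantile rule (linear interpolation) and the function names are also choices made for this example.

```python
def quantile(ordered, q):
    """Linear-interpolation quantile of an already sorted list (0 <= q <= 1)."""
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])


def boxplot_summary(values, whisker=1.5):
    """Quartiles, whisker ends and suspected outliers for one box."""
    ordered = sorted(values)
    q1, med, q3 = (quantile(ordered, q) for q in (0.25, 0.5, 0.75))
    iqr = q3 - q1
    low_fence, high_fence = q1 - whisker * iqr, q3 + whisker * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return {
        "q1": q1, "median": med, "q3": q3,
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }


summary = boxplot_summary([-4, -3, -2, -1, 0, 1, 2, 3, 20])
print(summary)
```

For log ratios from a well-behaved slide, the median should sit near 0 and the box should be narrow. A shifted or unusually wide box is a warning sign.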

Page 71

Liver samples from 16 mice: 8 WT, 8 ApoAI KO

Boxplots of log2(R/G)

(Example data associated with the limmaGUI package.)

Page 72

[Figure: flowchart of the scientific process. Boxes: Biological question (e.g. differentially expressed genes, sample class prediction, etc.); Experimental design; Microarray experiment; Pre-processing steps: Image analysis/Quality assessment (with a "failed" branch), Normalization; Data analysis: Estimation, Testing, Clustering, Discrimination; Biological verification and interpretation]

Page 73

Boxplots of log ratios by pin group

Lowess lines through points from pin groups

Pin group (sub-array) effects

Boxplots, highlighting pin group effects

Clear example of spatial bias

[Figure: boxplots of log-ratio (y-axis) by print-tip group (x-axis)]

Preprocessing: Normalization

• Why? To correct for systematic differences between samples on the same slide, or between slides, which do not represent true biological variation between samples

• How do we know it is necessary? By examining self-self hybridizations, where no true differential expression occurs

There are dye biases which vary with spot intensity, location on the array, plate origin, pins, scanning parameters, etc.

What is self-self hybridization?

• In dual channel (2-color) microarrays, such as cDNA arrays, two samples are each labeled with a different fluorescent dye

• In most studies, the samples are from different sources (e.g. cancer vs. normal)

• However, it is also possible to co-hybridize two samples from the same source (but differently labeled)

Dual channel co-hybridizations

[Diagram: a standard co-hybridization pairs a control sample with a treated sample; a self-self co-hybridization pairs a control sample with a control sample]

Self-self hybridizations

[Figure, three panels: false color overlay; boxplots within pin-groups; scatter (MA-)plots]

Normalization: global

• Normalization based on a global adjustment

$\log_2(R/G) \rightarrow \log_2(R/G) - c = \log_2\big(R/(kG)\big)$

• Common choices for $k$ or $c = \log_2 k$ are $c$ = median or mean of log ratios for a particular gene set (e.g. all genes, or control, or ‘housekeeping’ genes)

• Another possibility is total intensity normalization, where $k = \sum_i R_i / \sum_i G_i$
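As a rough illustration of the two choices above, the following sketch computes the median-based shift c and the total-intensity factor k for one array. The function names and the intensity values are hypothetical, and the median is taken over all spots rather than over a control or housekeeping set.

```python
import math
import statistics

def global_median_normalize(red, green):
    """Shift all log ratios by c = median of log2(R/G) over the chosen spots."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    c = statistics.median(m)
    return [value - c for value in m], c

def total_intensity_factor(red, green):
    """Total intensity normalization: k = sum(R_i) / sum(G_i)."""
    return sum(red) / sum(green)

# Hypothetical intensities for five spots on one two-color array.
red = [200.0, 400.0, 800.0, 1600.0, 100.0]
green = [100.0, 200.0, 400.0, 400.0, 100.0]

normalized, c = global_median_normalize(red, green)
k = total_intensity_factor(red, green)

# Subtracting c from log2(R/G) is the same as rescaling G by 2**c.
rescaled = [math.log2(r / (2 ** c * g)) for r, g in zip(red, green)]
```

Here most spots show a two-fold red excess, so the median shift is c = 1 and the normalized log ratios are centered on zero; the total-intensity factor gives a different k because it is dominated by the brightest spots.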

Effect of global normalization

Normalization: intensity-dependent

• Here, run a smooth curve through the middle of the MA plot, shifting the M value of the pair (A, M) by $c = c(A)$, i.e.

$\log_2(R/G) \rightarrow \log_2(R/G) - c(A) = \log_2\big(R/(k(A)\,G)\big)$

• One estimate of $c(A)$ is made using the LOWESS (or loess) function of Cleveland (1979): LOcally WEighted Scatterplot Smoothing
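The idea of an intensity-dependent shift can be sketched with a simplified local smoother. This is not Cleveland's full LOWESS: it fits a tricube-weighted straight line around each point and omits the robustness iterations. The function names, the `frac` span parameter, and the synthetic data are assumptions for illustration; M = log2(R/G) and A = ½ log2(R·G) are the usual MA-plot coordinates.

```python
import math

def ma_values(red, green):
    """Return (M, A) per spot: M = log2(R/G), A = 0.5 * log2(R*G)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a

def local_linear_fit(a, m, frac=0.5):
    """Estimate c(A) at every spot by a tricube-weighted local linear fit.

    Simplified stand-in for LOWESS: no robustness iterations.
    """
    n = len(a)
    k = max(2, math.ceil(frac * n))
    fitted = []
    for x0 in a:
        dists = sorted(abs(x - x0) for x in a)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest spot
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in a]
        sw = sum(w)
        xbar = sum(wi * x for wi, x in zip(w, a)) / sw
        ybar = sum(wi * y for wi, y in zip(w, m)) / sw
        sxx = sum(wi * (x - xbar) ** 2 for wi, x in zip(w, a))
        sxy = sum(wi * (x - xbar) * (y - ybar) for wi, x, y in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(ybar + slope * (x0 - xbar))
    return fitted

def intensity_normalize(a, m, frac=0.5):
    """Subtract the fitted curve c(A) from each M value."""
    c = local_linear_fit(a, m, frac)
    return [mi - ci for mi, ci in zip(m, c)]

# Synthetic spots whose dye bias grows linearly with intensity A.
a = [6.0 + 0.5 * i for i in range(20)]
m = [0.2 * x - 1.5 for x in a]
normalized = intensity_normalize(a, m)
```

Because the synthetic bias is a pure function of A, the normalized M values come out at essentially zero; on real data the residual scatter around the curve is what remains for downstream analysis.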

Effect of lowess normalization

Comparison between arrays

• Different arrays often do not show identical distributions of M values

– Various technical reasons (e.g. labeling efficiency, amount of labeled RNA, scanner settings, etc.)

• Need to normalize the signal between chips

– Multiple possibilities; one often used is "scale normalization"

Scale normalization: between slides

Idea: make the spread of M values (as measured by the median absolute deviation) identical across slides by multiplying them by a chip-specific constant

Boxplots of log ratios from 3 replicate self-self hybridizations

Left panel: before normalization

Middle panel: after within print-tip group normalization

Right panel: after a further between-slide scale normalization

Taking scale into account

Assume: all slides have the same spread in M

• True log ratio is $m_{ij}$, where $i$ indexes slides and $j$ indexes spots

• Observed is $M_{ij}$, where $M_{ij} = a_i m_{ij}$

• A robust estimate of $a_i$ (up to a constant common to all slides) is the median absolute deviation of the observed values:

$\mathrm{MAD}_i = \mathrm{median}_j \{ |M_{ij} - \mathrm{median}_j(M_{ij})| \}$

• Could instead make the same assumption for print-tip groups (rather than slides)
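A minimal sketch of between-slide scale normalization under the stated assumption follows. The slide only fixes each factor up to a constant shared by all slides; dividing each MAD by the geometric mean of all MADs, as done here, is one possible way to set that constant and is an assumption of this example, as are the function names and the data.

```python
import math
import statistics

def mad(values):
    """Median absolute deviation of one slide's M values."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def scale_normalize(slides):
    """Rescale each slide so that all slides share the same MAD.

    Assumption: the scale factor for slide i is MAD_i divided by the
    geometric mean of all MADs, so the overall scale is left unchanged.
    """
    mads = [mad(m) for m in slides]
    geo_mean = math.exp(sum(math.log(d) for d in mads) / len(mads))
    factors = [d / geo_mean for d in mads]
    scaled = [[v / a for v in m] for m, a in zip(slides, factors)]
    return scaled, factors

# Three hypothetical slides with the same shape but different spread.
slides = [
    [-1.0, -0.5, 0.0, 0.5, 1.0],
    [-2.0, -1.0, 0.0, 1.0, 2.0],
    [-4.0, -2.0, 0.0, 2.0, 4.0],
]
scaled, factors = scale_normalize(slides)
scaled_mads = [mad(m) for m in scaled]
```

After rescaling, the three slides have the same MAD, which is what the right-hand boxplot panel of a scale-normalized experiment would show.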

NCI 60 experiments

Same normalization on another data set

[Figure, two panels: before and after normalization]

Normalization: Summary

• Reduces systematic (not random) effects

• Makes it possible to compare several arrays

• Use log ratios (MVA plots)

• Lowess normalization (dye bias)

• Pin-group location normalization

• Pin-group scale normalization

• Between-slide scale normalization

• Control spots

• Normalization introduces more variability

• Outliers (bad spots) handled with replication

cDNA gene expression data

Data on p genes for n samples: rows are genes (spots), columns are mRNA samples

M =

Gene   sample1  sample2  sample3  sample4  sample5  …
1         0.46     0.30     0.80     1.51     0.90  ...
2        -0.10     0.49     0.24     0.06     0.46  ...
3         0.15     0.74     0.04     0.10     0.20  ...
4        -0.45    -1.03    -0.79    -0.56    -0.32  ...
5        -0.06     1.06     1.35     1.09    -1.09  ...

Entry (i, j) of M: gene expression level of gene i in mRNA sample j = (normalized) log2(Red intensity / Green intensity)

Software for Microarray Analysis

• A very large number of commercial and free software packages exist (GeneSpring, PathwayAssist, …)

• Several R packages for microarray analysis are available as part of the open source Bioconductor project http://www.bioconductor.org/

• BioC software is often created by the author of the methodology

[Diagram: workflow of a microarray study. Scientific process: biological question (e.g. differentially expressed genes, sample class prediction, etc.), experimental design, microarray experiment, biological verification and interpretation. Pre-processing steps: image analysis/quality assessment (with a "(failed)" branch), normalization. Data analysis: significance, testing, clustering, discrimination.]

Replicated experiments

• Have n replicates

• For each gene, have n values of M = log2 fold change, one from each array

• Summarize $M_1, \ldots, M_n$ for each gene by

– $\bar{M} = \mathrm{average}(M_1, \ldots, M_n)$

– $s = \mathrm{SD}(M_1, \ldots, M_n)$

• Rank genes in order of strength of evidence in favor of DE

• How might we do this?
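The per-gene summaries above are simple to compute once the replicate log ratios are grouped by gene. The sketch below uses hypothetical gene names and values, and takes SD to mean the sample standard deviation (n - 1 in the denominator), which is an assumption.

```python
import statistics

def summarize_replicates(m_by_gene):
    """Map each gene to (average M, sample SD of M) across replicate arrays."""
    return {
        gene: (statistics.mean(values), statistics.stdev(values))
        for gene, values in m_by_gene.items()
    }

# Hypothetical log2 fold changes from n = 3 replicate arrays.
m_by_gene = {
    "gene1": [1.0, 1.2, 0.8],
    "gene2": [-0.1, 0.1, 0.0],
}
summary = summarize_replicates(m_by_gene)
```

These two numbers per gene are the raw material for any ranking of genes by evidence of differential expression.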

Which genes are DE?

• Difficult to judge significance

– massive multiple testing problem

– genes are not independent of each other

– the null distribution of M is unknown

• Strategy

– aim to rank genes

– assume most genes are not DE (depending on type of experiment and array)

– find genes separated from the majority

Ranking criteria

• Genes $i = 1, \ldots, p$

• $\bar{M}_i$ = average log2 fold change for gene $i$

– Problem: genes with large variability are likely to be selected, even if not DE

• Fix that by taking variability into account: use $t_i = \bar{M}_i / (s_i / \sqrt{n})$

– Problem: genes with extremely small variances give very large $t_i$

– When the number of replicates is small, the smallest $s_i$ are likely to be underestimates
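Both problems can be seen on a toy example. The three genes and their replicate values below are invented for illustration: one gene has a large, consistent change, one has a sizeable average driven by noisy replicates, and one has a tiny change with an even tinier variance.

```python
import math
import statistics

def t_statistic(values):
    """t = mean(M) / (SD(M) / sqrt(n)) for one gene's replicate log ratios."""
    n = len(values)
    return statistics.mean(values) / (statistics.stdev(values) / math.sqrt(n))

# Hypothetical log2 fold changes from n = 3 replicate arrays.
replicates = {
    "geneA": [2.0, 2.4, 1.6],     # large, consistent change
    "geneB": [1.5, -1.2, 2.7],    # sizeable average, very variable
    "geneC": [0.11, 0.10, 0.12],  # tiny change, tiny variance
}

rank_by_mean = sorted(
    replicates, key=lambda g: abs(statistics.mean(replicates[g])), reverse=True
)
rank_by_t = sorted(
    replicates, key=lambda g: abs(t_statistic(replicates[g])), reverse=True
)
```

Ranking by the average puts the noisy geneB ahead of geneC, while ranking by t pushes geneC to the top even though its fold change is negligible, which is the small-variance problem described above.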

Summary

• Image analysis is important to extract information from the array

– Background may or may not be taken into account

• Normalization procedures are always needed

– To remove systematic (technical) effects

– To allow comparisons between chips

• Identification of differentially expressed genes is difficult

– No absolute estimation of significance is possible

– Ranking of genes by significance is possible

End of part II

Part III

Higher-level analysis

Finding biological information

Once the matrix of gene-expression vs samples is available, statistical tools can be used to:

• Find similarity (or difference) of expression pattern in differentially expressed genes

• Find differentially expressed functional groups of genes (pathway analysis, gene ontology)

• Find classes in the set of samples (unsupervised analysis)

• Use differentially expressed genes as a means to classify samples into known categories (supervised analysis)

• Find genes significantly related to survival in a pool of patients

Unsupervised analysis: Cluster analysis

• data matrix (n,p)

• distance matrix (n,n), similarity matrix (n,n)

• cluster formation:

– mutually exclusive clusters

– hierarchical clusters

• comparison of clusters, means and variances

[Figure: dendrogram]
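The steps listed above (data matrix, distance matrix, hierarchical cluster formation) can be sketched in a few lines. The choices made here are assumptions for illustration only: distance is one minus the Pearson correlation, clusters are joined by average linkage, and the sample names and profiles are invented. The returned list of merges is the information a dendrogram displays.

```python
import math

def correlation_distance(x, y):
    """One minus the Pearson correlation of two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)

def average_linkage(profiles):
    """Agglomerative clustering; returns merges as (cluster, cluster, height)."""
    names = list(profiles)
    # Distance matrix (n, n) stored as a dictionary keyed by name pairs.
    dist = {
        (a, b): correlation_distance(profiles[a], profiles[b])
        for a in names for b in names if a != b
    }

    def linkage(c1, c2):
        return sum(dist[(a, b)] for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        pairs = [
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        merges.append((clusters[i], clusters[j], linkage(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical profiles: s1/s2 rise together, s3/s4 fall together.
profiles = {
    "s1": [1.0, 2.0, 3.0, 4.0],
    "s2": [1.1, 2.1, 2.9, 4.2],
    "s3": [4.0, 3.0, 2.0, 1.0],
    "s4": [4.2, 2.9, 2.1, 1.0],
}
merges = average_linkage(profiles)
```

The first two merges join the correlated pairs at a small height, and the last merge joins the two anti-correlated groups at a height close to 2.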

Hierarchical Clustering (real case)

Sorlie et al. Proc Natl Acad Sci U S A 2001 Sep 11;98(19):10869-74

Unsupervised analysis: PCA

• Principal Components Analysis (PCA)

• Columns (resp. rows) of expression matrix viewed as points in multidimensional space

• Find “dominant” directions in space and hope these directions can be associated with known parameters

• Genes (resp. samples) with largest projection on those vectors explain the parameter

[Figure: scatter plots of data in the (X1, X2) plane, with the principal axes PC1 and PC2 overlaid]
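Finding the "dominant" direction can be sketched without any library by applying power iteration to the covariance matrix. This is one simple way to obtain the first principal component, not necessarily how a statistics package does it; the function name, the iteration count, and the two-dimensional data are assumptions for illustration.

```python
import math

def first_principal_component(points, iterations=200):
    """Return (unit vector of PC1, variance along PC1) by power iteration.

    Caveat: the all-ones start vector must not be orthogonal to PC1.
    """
    n, d = len(points), len(points[0])
    means = [sum(p[k] for p in points) / n for k in range(d)]
    centered = [[p[k] - means[k] for k in range(d)] for p in points]
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(
        v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)
    )
    total_variance = sum(cov[a][a] for a in range(d))
    return v, variance, total_variance

# Hypothetical points scattered along the diagonal of the (X1, X2) plane.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]
pc1, pc1_variance, total_variance = first_principal_component(points)
explained = pc1_variance / total_variance
```

For these points PC1 lies close to the diagonal and carries almost all of the variance, matching the picture of a long axis (PC1) and a short axis (PC2) through the point cloud.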

Supervised analysis example: KNN

• Based on a measure of distance between observations (e.g. Euclidean distance or one minus correlation)

• k-nearest neighbor rule (Fix and Hodges (1951)) classifies an observation X as follows:

– find the k observations in the learning set closest to X

– predict the class of X by majority vote, i.e., choose the class that is most common among those k observations.

• The number of neighbors k can be chosen by cross-validation

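[Figure: k-nearest neighbor (knn) illustration: an unclassified observation "?" among labeled samples in the (Gene 1, Gene 2) plane, next to a gene × sample data matrix]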
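The k-nearest neighbor rule and the choice of k by cross-validation can be sketched as follows. Euclidean distance, leave-one-out cross-validation, the function names, the class labels, and the two-gene learning set are all assumptions made for this example.

```python
import math
from collections import Counter

def knn_predict(learning_set, x, k=3):
    """Classify x by majority vote among its k nearest labeled observations."""
    neighbours = sorted(learning_set, key=lambda item: math.dist(item[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def loo_accuracy(learning_set, k):
    """Leave-one-out cross-validation accuracy for a given k."""
    hits = 0
    for i, (features, label) in enumerate(learning_set):
        rest = learning_set[:i] + learning_set[i + 1:]
        hits += knn_predict(rest, features, k) == label
    return hits / len(learning_set)

# Hypothetical learning set: (expression of Gene 1, Gene 2) and a class label.
learning_set = [
    ((1.0, 1.2), "normal"),
    ((1.2, 0.8), "normal"),
    ((0.8, 1.0), "normal"),
    ((3.0, 3.1), "tumour"),
    ((3.2, 2.9), "tumour"),
    ((2.8, 3.3), "tumour"),
]

prediction_high = knn_predict(learning_set, (2.9, 3.0), k=3)
prediction_low = knn_predict(learning_set, (1.1, 1.0), k=3)
accuracy_by_k = {k: loo_accuracy(learning_set, k) for k in (1, 3, 5)}
```

With only three samples per class, k = 5 forces each left-out sample to be outvoted by the other class, which shows why k is worth tuning rather than fixing in advance.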

High-level analysis (summary)

• Needed to extract biological information once low-level processing has been carried out

• Can be used to:

– Discover new interactions between genes (unsupervised analysis)

– Validate biological hypotheses (supervised analysis)

• Contact your local statistician for more on this topic!

End of Part III