Download - Steps involved in microarray analysis after the experimentsbioinfo.mbb.yale.edu/mbb452a/2002/expression.pdf · Processing flow chart Data input Background correction Cy5/Cy3 normalisation


Steps involved in microarray analysis after the experiments

• Scanning slides to create images
• Conversion of images to numerical data
• Processing of raw numerical data
• Further analysis
  – Clustering
  – Integration with genomic data


Processing raw microarray data

• Main aims:
  – To identify and reduce the noise found in microarray data
  – To identify differentially expressed genes/fragments in an experiment

Methods used so far…

• Many have been ad hoc
  – Visual identification (!)
  – 2-fold change in hybridisation signals
  – Ranking ratios of hybridisation signals

• Microarray expts are
  – Easy to do, but very sensitive
  – Lot of noise, artifacts, errors
  – ~20,000 spots per slide

• Without good processing the results can be completely wrong

This is changing…

• Opposing camps
  – Ad hoc camp
    • Biologists
    • Too simple and may be wrong
  – Complicated maths
    • Biostatisticians
    • Incomprehensible
    • Idealised datasets
  – Must strike a balance

[Figure: PubMed hits per year, 1997–2002 (y-axis 0–1000), with series labelled microarray, clustering and processing]

Getting the most out of your microarray

• Processing the raw data

• Cleaning and assessing the quality of your data

• Identifying differentially hybridised spots

• How do you get the correct list of differentially expressed genes out of ~20000 data points?







Processing flow chart

Data input → Background correction → Cy5/Cy3 normalisation → Merging replicate experiments → Score differential hybridisation


The data – a GenePix file


GenePix file – what do we look at?

• Measure red and green intensities separately

• Signal intensity = foreground – background




• Foreground signal

• Background signal

• Ratio = red/green or green/red
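The quantities on this slide can be sketched in a few lines of Python. This is a minimal illustration only: the function names are hypothetical, and flooring negative signals at zero and returning None for an undefined ratio are assumptions, not something the slides specify.

```python
def spot_signal(foreground, background):
    """Signal intensity = foreground - background (floored at zero by assumption)."""
    return max(foreground - background, 0.0)


def spot_ratio(red_fg, red_bg, green_fg, green_bg):
    """Red/green ratio of the background-corrected signals.

    Returns None when the corrected green signal is zero, because the
    ratio is undefined in that case.
    """
    red = spot_signal(red_fg, red_bg)
    green = spot_signal(green_fg, green_bg)
    if green == 0:
        return None
    return red / green


# Red: 1200 - 200 = 1000, green: 600 - 100 = 500, so the ratio is 2.0
print(spot_ratio(1200, 200, 600, 100))
```

The green/red ratio is obtained by swapping the arguments.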


Background correction

• GenePix background
  – Median intensity of immediate area surrounding each spot

• But
  – Very variable between individual spots
  – Artifactual background from smudges

• Therefore adds to noise



• Calculate the average background from surrounding area of spot

• Recommendation of 3x3 – 5x5 area

• Repeat for red and green separately


• Still have variable distribution of intensity

• Much smoother distribution of background intensity

• Remove artifactual smudges
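The averaging step above might look like the following sketch. It assumes the per-spot background values are held in a 2D list laid out like the slide, and that the 3x3 to 5x5 area is counted in spots; both are assumptions, and `smoothed_background` is a hypothetical name.

```python
def smoothed_background(bg, radius=1):
    """Replace each spot's background by the mean over its neighbourhood.

    radius=1 gives a 3x3 window and radius=2 a 5x5 window.
    Windows are clipped at the edges of the grid.
    """
    n_rows, n_cols = len(bg), len(bg[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            window = [
                bg[i][j]
                for i in range(max(0, r - radius), min(n_rows, r + radius + 1))
                for j in range(max(0, c - radius), min(n_cols, c + radius + 1))
            ]
            out[r][c] = sum(window) / len(window)
    return out


# One smudged spot (1000) in an otherwise flat background of 100
red_bg = [[100, 100, 100],
          [100, 1000, 100],
          [100, 100, 100]]
print(smoothed_background(red_bg)[1][1])
```

The same function would be run once on the red background grid and once on the green one.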


Red/green normalisation

• Normalise red and green intensities
  – Spots with equal hybridisation should have similar intensities
  – i.e. ratio ~ 1 for similarly expressed genes

• Otherwise, will have wrong list of differentially expressed genes

• Multiply one set of intensities by a scale factor

• But must obtain scale factor

Majority method

[Figure: histogram of number of spots (0–5000) against red/green ratio (0–4)]

• Majority method:
  – Assume that most spots do not change expression level
  – Find average ratio (red/green intensity)
  – Scale factor is the amount by which the average ratio must be multiplied so that it equals 1
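A minimal sketch of the majority method follows. It uses the arithmetic mean as the "average ratio" and applies the factor to the red channel; both choices, and the function names, are assumptions made for illustration.

```python
from statistics import mean


def majority_scale_factor(red, green):
    """Scale factor that brings the average red/green ratio to 1."""
    ratios = [r / g for r, g in zip(red, green) if g > 0]
    return 1.0 / mean(ratios)


def normalise_red(red, factor):
    """Multiply one set of intensities (here red) by the scale factor."""
    return [r * factor for r in red]


red = [200, 400, 600]
green = [100, 200, 300]
factor = majority_scale_factor(red, green)   # average ratio is 2, so factor is 0.5
print(normalise_red(red, factor))
```

A median could be substituted for the mean if a few strongly changed spots are expected to distort the average.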

Majority method

• But several issues
  – Hybridisation levels differ according to location on slide
  – Scanning properties for different colours differ at different intensities

• Scale factor must take these into account


Intensity considerations

• Simple scale factor fits a straight line

[Figure: scatter plot of red intensity against green intensity, both axes 0–60000]

• Distribution is curved
  – Ratio can differ by as much as 10-fold depending on intensity

• Different scale factors should be used for different intensities
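One simple way to obtain intensity-dependent scale factors is to sort the spots by intensity, split them into bins, and compute a separate factor per bin. The sketch below does this with equal-sized bins on green intensity and the median ratio in each bin; the binning scheme and the function names are assumptions, not the method used in the slides.

```python
from statistics import median


def binned_scale_factors(red, green, n_bins=4):
    """Return (upper green bound, scale factor) pairs, one per intensity bin."""
    spots = sorted((g, r) for r, g in zip(red, green) if g > 0 and r > 0)
    bins = []
    for b in range(n_bins):
        start = b * len(spots) // n_bins
        end = (b + 1) * len(spots) // n_bins
        chunk = spots[start:end]
        if not chunk:
            continue
        centre_ratio = median(r / g for g, r in chunk)
        bins.append((chunk[-1][0], 1.0 / centre_ratio))
    return bins


def normalise_red(red_value, green_value, bins):
    """Scale a red intensity with the factor of the bin its green value falls in."""
    for upper_green, factor in bins:
        if green_value <= upper_green:
            return red_value * factor
    return red_value * bins[-1][1]   # above the last bound: reuse the last factor


green = [100, 200, 1000, 2000]
red = [200, 400, 1000, 2000]
bins = binned_scale_factors(red, green, n_bins=2)
print(bins)
```

In this toy data the low-intensity spots are twice as red as green while the high-intensity spots are balanced, so a single global factor would over- or under-correct one of the two groups.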


Positional considerations

• Different regions of the slide have different levels of hybridization

• Difference in average ratio can be ~10 fold

• Different scale factors needed for each region of slide


Scale factor for each spot

• Calculate the scale factor using surrounding area of spot

• Recommendation of 12x12 – 20x20 area
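A per-spot scale factor could be computed as sketched below. The sketch assumes the red/green ratios are stored in a 2D list laid out like the slide, that the recommended area is counted in spots, and that the local "average" is taken as a median; these are assumptions, as is the name `local_scale_factors`. The default radius of 6 gives a 13x13 window, inside the recommended range.

```python
from statistics import median


def local_scale_factors(ratios, radius=6):
    """Scale factor per spot, from the median ratio in its neighbourhood.

    Missing (None) or non-positive ratios are skipped. Windows are
    clipped at the edges of the grid.
    """
    n_rows, n_cols = len(ratios), len(ratios[0])
    factors = [[1.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            window = [
                ratios[i][j]
                for i in range(max(0, r - radius), min(n_rows, r + radius + 1))
                for j in range(max(0, c - radius), min(n_cols, c + radius + 1))
                if ratios[i][j] is not None and ratios[i][j] > 0
            ]
            if window:
                factors[r][c] = 1.0 / median(window)
    return factors


# Left half of the slide is twice as red as the right half
ratios = [[2.0, 2.0, 1.0, 1.0],
          [2.0, 2.0, 1.0, 1.0]]
print(local_scale_factors(ratios, radius=1))
```

Multiplying each spot's ratio by its own factor then removes the left/right difference that a single global factor would leave in place.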


• Raw data has large positional dependence

• Normalisation without positional data shifts ratios towards red intensity, but does not remove the artifact

• Positional normalisation removes most of the artifact

[Figure: three panels labelled Before, No positional data, Positional data]



Combining multiple experiments

• Often have replicates of the same experiment

• Do you have to scale between them?

• How do you combine the data from them?


Replicate scaling

• Different slides may have different spreads in intensity and ratios

• Adjust spread of distributions by measuring standard deviation

• Estimate of variance by quartiles, fitting distribution, or bootstrapping
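As a sketch of the quartile option, the spread of each slide can be estimated by its interquartile range (IQR) and every slide rescaled to a common spread. Rescaling about each slide's median and using the median IQR as the target are assumptions, and the function names are hypothetical.

```python
from statistics import median, quantiles


def iqr(values):
    """Interquartile range, a quartile-based estimate of spread."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


def rescale_spread(slides):
    """Rescale each slide's values about its median to a common IQR.

    `slides` is a list of lists, one list of values (for example
    log ratios) per replicate slide.
    """
    spreads = [iqr(values) for values in slides]
    target = median(spreads)
    rescaled = []
    for values, spread in zip(slides, spreads):
        centre = median(values)
        rescaled.append([centre + (v - centre) * target / spread for v in values])
    return rescaled


narrow = [-2, -1, 0, 1, 2]
wide = [-4, -2, 0, 2, 4]
for slide in rescale_spread([narrow, wide]):
    print(slide)
```

After rescaling, the two toy slides have the same spread, so neither dominates when the replicates are merged.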


Combining replicate data

• Take medians of ratios?
• Take means of intensity values?
• Take weighted means?
• Treat each experiment individually and see which spots are consistently differentially expressed?
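The options above can give quite different answers for the same spot, as the following sketch shows for one spot measured on three replicate slides. The function names and the 2-fold threshold are illustrative assumptions.

```python
from statistics import mean, median


def median_of_ratios(red_reps, green_reps):
    """Median of the per-replicate red/green ratios."""
    return median(r / g for r, g in zip(red_reps, green_reps))


def ratio_of_mean_intensities(red_reps, green_reps):
    """Ratio of the mean red intensity to the mean green intensity."""
    return mean(red_reps) / mean(green_reps)


def consistently_changed(red_reps, green_reps, fold=2.0):
    """True if every replicate shows at least `fold` change in the same direction."""
    ratios = [r / g for r, g in zip(red_reps, green_reps)]
    return all(x >= fold for x in ratios) or all(x <= 1.0 / fold for x in ratios)


# The third replicate is much brighter and has a much larger ratio
red = [400, 420, 4000]
green = [200, 200, 400]
print(median_of_ratios(red, green))           # driven by the typical replicate
print(ratio_of_mean_intensities(red, green))  # driven by the bright replicate
print(consistently_changed(red, green))
```

The median of ratios is robust to the single bright replicate, while the ratio of mean intensities is dominated by it.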

Processing flow chart with quality checks

Data input → Background correction → Cy5/Cy3 normalisation → Merging replicate experiments → Score differential hybridisation

Quality checks: Spot quality, Artifactual regions, Duplicate spot variability, Replicate experiment variability



Filter bad spots

• Small spots
• Smudgy spots
• Non-round spots
• etc…
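A filter of this kind might be sketched as follows. Everything specific here is hypothetical: the dictionary keys (`diameter`, `circularity`, `smudged`) and the thresholds are placeholders for whatever spot measurements and cut-offs the image analysis software actually provides.

```python
def is_good_spot(spot, min_diameter=50, min_circularity=0.8):
    """Keep a spot only if it is large enough, round enough and not smudged.

    Keys and thresholds are illustrative assumptions.
    """
    return (
        spot["diameter"] >= min_diameter
        and spot["circularity"] >= min_circularity
        and not spot.get("smudged", False)
    )


def filter_spots(spots, **thresholds):
    """Return only the spots that pass the quality checks."""
    return [s for s in spots if is_good_spot(s, **thresholds)]


spots = [
    {"id": "A1", "diameter": 100, "circularity": 0.95},
    {"id": "A2", "diameter": 30, "circularity": 0.95},                    # small
    {"id": "A3", "diameter": 100, "circularity": 0.50},                   # non-round
    {"id": "A4", "diameter": 100, "circularity": 0.95, "smudged": True},  # smudgy
]
print([s["id"] for s in filter_spots(spots)])
```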



Artifactual array regions

Remove regions that have artifactual background after correction

Background artifacts usually vary from slide to slide, i.e. they are not consistent



Measurement of chip quality

• How successful is an experiment?
  – How consistent are the hybridisations within experiments

Measurement of intrachip quality

• Genes are often placed as neighbouring pairs

• Variation between them provides a measure of variation within an experiment
  – (x1 – x2)/(x1 + x2)

• Mean gives overall variation for expt.

• Remove outliers

• Are the same spots always inconsistent across replicate expts?
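The duplicate-spot measure can be sketched as below. Taking the mean of the absolute values (the signed values would largely cancel) and the outlier threshold of 0.3 are assumptions, and the function names are hypothetical. Values are assumed positive, so x1 + x2 is never zero.

```python
from statistics import mean


def pair_variation(x1, x2):
    """Variation between duplicate spots: (x1 - x2) / (x1 + x2)."""
    return (x1 - x2) / (x1 + x2)


def chip_variation(pairs):
    """Overall variation for one experiment: mean absolute pair variation."""
    return mean([abs(pair_variation(a, b)) for a, b in pairs])


def drop_outlier_pairs(pairs, max_variation=0.3):
    """Remove duplicate pairs whose variation exceeds the threshold."""
    return [(a, b) for a, b in pairs if abs(pair_variation(a, b)) <= max_variation]


pairs = [(110, 90), (100, 100), (300, 100)]   # variations 0.1, 0.0 and 0.5
print(chip_variation(pairs))                  # about 0.2, i.e. 20%
print(drop_outlier_pairs(pairs))
```

Recording which pairs are dropped on each slide makes it possible to check whether the same spots are inconsistent across replicate experiments.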

[Figure: histogram of number of spots against variation, annotated "Experiment quality" and "Outliers"]

Intrachip variability – single experiment quality

[Figure: histograms of number of spots against variation for two experiments, Mean = 3.7% and Mean = 10.7%]

Filtering poor duplicates

[Figure: scatter plots of ratio 2 against ratio 1, both axes 0–5]



Measurement of interchip quality

• How consistent are replicate experiments?

• Use same measure for equivalent spots: (x1 – x2)/(x1 + x2)

• Mean gives overall variation for an experiment with respect to the other replicates

• Remove outlier experiments
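Applied between slides, the same measure gives each replicate a score against all the others. The sketch below assumes each replicate is a list of values in the same spot order and uses the absolute value of the measure; the function names are hypothetical.

```python
from statistics import mean


def variation(x1, x2):
    """Absolute value of (x1 - x2) / (x1 + x2) for equivalent spots."""
    return abs(x1 - x2) / (x1 + x2)


def mean_variation_vs_others(replicates):
    """One score per replicate: mean variation against every other replicate."""
    scores = []
    for i, rep in enumerate(replicates):
        values = [
            variation(a, b)
            for j, other in enumerate(replicates)
            if j != i
            for a, b in zip(rep, other)
        ]
        scores.append(mean(values))
    return scores


replicates = [
    [1.0, 2.0],
    [1.0, 2.0],
    [3.0, 6.0],   # disagrees with the other two
]
print(mean_variation_vs_others(replicates))
```

The replicate with the highest score is the candidate outlier experiment; how high a score must be before the experiment is removed is a judgment the slides leave open.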

Measurement of replicate chip quality

[Figure: scatter plots of red intensity against green intensity for replicate experiments, annotated Mean var. = 8.4%, Mean var. = 8.2% and Mean var. = 7.3%]

[Figure: histogram of number of spots against variation]

Measurement of interchip quality

• Consistency is quite easy to see by eye, but this measure provides a systematic and objective method for determining it

• Use to measure overall consistency of replicates

• Identify and remove bad replicate expts

• Also identify regions of the slide that are consistently error prone








Scoring differentially hybridized probes

• Identify spots that have differential hybridisation on red and green channels

• Many different methods being published

Scoring differentially: ratio-based method

• Calculate red/green ratio for each spot

• Plot distribution

• Define cut-off based on normal distribution (or use 2-fold cut-off)

[Figure: histogram of number of spots (0–3000) against ratio (red/green), 0–2]
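The ratio-based method can be sketched as follows. Placing the normal-distribution cut-offs at two standard deviations from the mean is an assumption, since the slides do not say where the cut-off lies; the function names are hypothetical.

```python
from statistics import mean, stdev


def normal_cutoffs(ratios, n_sd=2.0):
    """Lower and upper cut-offs at mean -/+ n_sd standard deviations."""
    m, s = mean(ratios), stdev(ratios)
    return m - n_sd * s, m + n_sd * s


def score_spots(ratios, low, high):
    """Indices of spots whose ratio falls outside the cut-offs."""
    return [i for i, x in enumerate(ratios) if x < low or x > high]


ratios = [0.9, 1.0, 1.1, 1.0, 0.9, 1.1, 1.0, 1.0, 3.0, 1.0]

low, high = normal_cutoffs(ratios)
print(score_spots(ratios, low, high))   # cut-off from the distribution
print(score_spots(ratios, 0.5, 2.0))    # fixed 2-fold cut-off
```

Both cut-offs pick out the single spot with a ratio of 3.0 in this toy data.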

Problems

• Many experiments don’t give normal distribution

• Ratios ignore the signal intensity
  – More stringent for high intensity spots

[Figure: histogram of number of spots (0–900) against ratio (red/green), 0–5]

[Figure: scatter plot of red intensity against green intensity, both axes 0–60000]
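A small worked example shows why a fixed ratio cut-off behaves differently at different intensities. It assumes, purely for illustration, that both channels carry additive noise of the same fixed size; the noise model and the numbers are not from the slides.

```python
def ratio_with_noise(red, green, noise):
    """Ratio of an unchanged spot after noise pushes red up and green down."""
    return (red + noise) / (green - noise)


# Both spots are truly unchanged (red == green); only the intensity differs.
low_intensity = ratio_with_noise(100, 100, 50)        # 150 / 50
high_intensity = ratio_with_noise(10000, 10000, 50)   # 10050 / 9950

print(low_intensity)    # well beyond a 2-fold cut-off from noise alone
print(high_intensity)   # barely moves from 1
```

Under this assumption a low intensity spot can cross a 2-fold cut-off by noise alone, whereas a high intensity spot must show a large real change to reach the same ratio.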

Processing flow chart

Data input → Background correction → Cy5/Cy3 normalisation → Merging replicate experiments → Score differential hybridisation

Quality checks: Spot quality, Artifactual regions, Duplicate spot variability


• Raw microarray data requires a lot of initial processing before being useful
• Very important, as it can completely change the answers you get
• The issues are beginning to emerge
• Different people have different ideas of how to resolve them
• There is no standard method yet – each has problems
• Very labour intensive, but can be computed relatively easily
