a - immunogenomica - dna micro array data analysis 2nd ed

CSC – Scientific Computing Ltd. is a non-profit organization for high-

performance computing and networking in Finland. CSC is owned by the

Ministry of Education. CSC runs a national large-scale facility for compu-

tational science and engineering and supports the university and research

community. CSC is also responsible for the operations of the Finnish Uni-

versity and Research Network (Funet).

All rights reserved. The PDF version of this book or parts of it can be

used in Finnish universities as course material, provided that this copyright

notice is included. However, this publication may not be sold or included

as part of other publications without permission of the publisher.

c© The authors and

CSC – Scientific Computing Ltd.

2005

Second edition

ISBN 952-5520-11-0 (print)

ISBN 952-5520-12-9 (PDF)

http://www.csc.fi/oppaat/siru/

http://www.csc.fi/molbio/arraybook/

Printed at

Picaset Oy

Helsinki 2005

DNA microarray data analysis 5

Preface

This is the second, revised and slightly expanded edition of the DNA microar-ray data analysis guidebook. As a change to the previous edition, some relativelyquickly changing material such as software tutorials have been exclusively pub-lished on the book’s web site. Please see http://www.csc.fi/molbio/arraybook/for more information and to access the extra material.

DNA microarrays generate large amounts of numerical data, which should beanalyzed effectively. In this book, we hope to offer a broad view of basic theoryand techniques behind the DNA microarray data analysis. Our aim was not to becomprehensive, but rather to cover the basics, which are unlikely to change muchover years. Especially, we hope that researchers starting their data analysis canbenefit from the book.

The text emphasizes gene expression analysis. Topics, such as genotyping,are discussed shortly. This book does not cover the wet-lab practises, such as sam-ple preparation or hybridization. Rather, we start when the microarrays have beenscanned, and the resulting images are being analyzed. Also, we take the files withsignal intensities, which usually generate questions such as: “How is the data nor-malized?” or “How do I identify the genes which are upregulated?”, and providesome simple solutions to these specific questions and many others.

Each chapter has a section on suggested reading, which introduces some ofthe relevant literature. Some chapters have additional information available on theweb. In the first edition the software examples were included in the book, but wehave now moved them into Internet. This allows us to better keep the material upto date.

Juha Haataja and Leena Jukka are warmly acknowledged for their supportduring the production of this book.

We are very interested in receiving feedback about this publication. Especially,if you feel that some essential technique has been missed, let us know. Please sendyour comments to the e-mail address [email protected].

Espoo, 23rd December 2005

The authors

6 DNA microarray data analysis

List of Contributors

Iiris HovattaNational Public Health InstituteHaartmaninkatu 8FI-00290 HelsinkiFinland

Katja KimppaPerkinElmer Life Sciences and Analytical Sciences- Wallac OyP.O.Box 10FI-20101 TurkuFinland

M. Minna LaineCSC, the Finnish IT center for scienceKeilaranta 14FI-02101 EspooFinland

Antti LehmussolaTampere University of TechnologyP.O.Box 553FI-33101 TampereFinland

Tomi PasanenUniversity of HelsinkiP.O.Box 68FI-00014 University of HelsinkiFinland

Janna SaarelaBiomedicum Biochip CenterHaartmaninkatu 8FI-00290 HelsinkiFinland

Ilana SaarikkoUniversity of HelsinkiP.O.Box 68FI-00014 University of HelsinkiFinland

Juha SaharinenNational Public Health InstituteHaartmaninkatu 8FI-00290 HelsinkiFinland

Pekka TiikkainenVTTP.O.Box 106FI-20521 TurkuFinland

Teemu ToivanenCentre for BiotechnologyTykistökatu 6FI-20521 TurkuFinland

Martti TolvanenInstitute of Medical TechnologyBiokatu 8FI-33520 TampereFinland

Jarno TuimalaCSC, the Finnish IT center for scienceKeilaranta 14FI-02101 EspooFinland

Mauno VihinenInstitute of Medical TechnologyBiokatu 8FI-33520 TampereFinland

Garry WongA. I. Virtanen -instituteUniversity of KuopioFI-70211 KuopioFinland

Contents 7

Contents

Preface 5

List of Contributors 6

I Introduction 14

1 Introduction 151.1 Why perform microarray experiments? . . . . . . . . . . . . . 15

1.2 What is a microarray? . . . . . . . . . . . . . . . . . . . . . . 15

1.3 Microarray production . . . . . . . . . . . . . . . . . . . . . 16

1.4 Where can I obtain microarrays? . . . . . . . . . . . . . . . . 17

1.5 Extracting and labeling the RNA sample . . . . . . . . . . . . 18

1.6 RNA extraction from scarse tissue samples . . . . . . . . . . . 20

1.7 Hybridization . . . . . . . . . . . . . . . . . . . . . . . . . . 20

1.8 Scanning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

1.9 Typical research applications of microarrays . . . . . . . . . . 21

1.10 Experimental design and controls . . . . . . . . . . . . . . . . 23

1.11 Suggested reading . . . . . . . . . . . . . . . . . . . . . . . . 23

2 Affymetrix GeneChip system 252.1 Affymetrix technology . . . . . . . . . . . . . . . . . . . . . 25

2.2 Single Array analysis . . . . . . . . . . . . . . . . . . . . . . 25

2.3 Detection p-value . . . . . . . . . . . . . . . . . . . . . . . . 26

2.4 Detection call . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.5 Signal algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.6 Analysis tips . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.7 Comparison analysis . . . . . . . . . . . . . . . . . . . . . . 28

2.8 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . 28

2.9 Change p-value . . . . . . . . . . . . . . . . . . . . . . . . . 29

2.10 Change call . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

2.11 Signal Log Ratio Algorithm . . . . . . . . . . . . . . . . . . . 29

2.12 Quality control tips . . . . . . . . . . . . . . . . . . . . . . . 30



3 Genotyping systems 323.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.2 Methodologies . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.3 Genotype calls . . . . . . . . . . . . . . . . . . . . . . . . . . 33


4 Image analysis 354.1 Digital image processing . . . . . . . . . . . . . . . . . . . . 35

4.1.1 Digital image . . . . . . . . . . . . . . . . . . . . . . . . 35

4.1.2 Image storage . . . . . . . . . . . . . . . . . . . . . . . . 35

4.1.3 Image analysis . . . . . . . . . . . . . . . . . . . . . . . 36

4.2 Microarray image analysis . . . . . . . . . . . . . . . . . . . 37

4.2.1 Gridding . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.2.2 Segmentation . . . . . . . . . . . . . . . . . . . . . . . . 38

4.2.3 Information extraction . . . . . . . . . . . . . . . . . . . 40

4.2.4 Quality control . . . . . . . . . . . . . . . . . . . . . . . 40

4.2.5 Affymetrix GeneChip analysis . . . . . . . . . . . . . . . 41

4.3 Using image analysis software . . . . . . . . . . . . . . . . . 42

5 Overview of data analysis 435.1 cDNA microarray data analysis . . . . . . . . . . . . . . . . . 43

5.2 Affymetrix data analysis . . . . . . . . . . . . . . . . . . . . 44

5.3 Data analysis pipeline . . . . . . . . . . . . . . . . . . . . . . 44

6 Experimental design 476.1 Why do we need to consider experimental design? . . . . . . . 47

6.2 Choosing and using controls . . . . . . . . . . . . . . . . . . 47

6.3 Choosing and using replicates . . . . . . . . . . . . . . . . . . 48

6.4 Choosing a technology platform . . . . . . . . . . . . . . . . 49

6.5 Gene clustering v. gene classification . . . . . . . . . . . . . . 49

6.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 50


7 Reporting results 517.1 Why the results should be reported . . . . . . . . . . . . . . . 51

7.2 What details should be reported: the MIAME standard . . . . 52

7.3 How the data should be presented: the MAGE standard . . . . 52

7.3.1 MAGE-OM . . . . . . . . . . . . . . . . . . . . . . . . . 52

7.3.2 MAGE-ML; an XML-representation of MAGE-OM . . . 53

7.3.3 MAGE-STK . . . . . . . . . . . . . . . . . . . . . . . . 53

7.4 Where and how to submit your data . . . . . . . . . . . . . . 53

7.4.1 ArrayExpress and GEO . . . . . . . . . . . . . . . . . . . 53

7.4.2 MIAMExpress . . . . . . . . . . . . . . . . . . . . . . . 54

7.4.3 GEO . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

Contents 9

7.4.4 Other options and aspects . . . . . . . . . . . . . . . . . 55

7.4.5 Your role in providing well-documented data . . . . . . . 55


8 Basic statistics 588.1 Why statistics are needed . . . . . . . . . . . . . . . . . . . . 58

8.2 Basic concepts . . . . . . . . . . . . . . . . . . . . . . . . . . 58

8.2.1 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 58

8.2.2 Constants . . . . . . . . . . . . . . . . . . . . . . . . . . 58

8.2.3 Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 59

8.2.4 Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

8.3 Simple statistics . . . . . . . . . . . . . . . . . . . . . . . . . 59

8.3.1 Number of subjects . . . . . . . . . . . . . . . . . . . . . 59

8.3.2 Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

8.3.3 Trimmed mean . . . . . . . . . . . . . . . . . . . . . . . 59

8.3.4 Median . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

8.3.5 Percentile . . . . . . . . . . . . . . . . . . . . . . . . . . 60

8.3.6 Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

8.3.7 Variance and the standard deviation . . . . . . . . . . . . 60

8.3.8 Coefficient of variation . . . . . . . . . . . . . . . . . . . 60

8.4 Effect statistics . . . . . . . . . . . . . . . . . . . . . . . . . 61

8.4.1 Scatter plot . . . . . . . . . . . . . . . . . . . . . . . . . 61

8.4.2 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . 61

8.4.3 Linear regression . . . . . . . . . . . . . . . . . . . . . . 62

8.5 Frequency distributions . . . . . . . . . . . . . . . . . . . . . 64

8.5.1 Normal distribution . . . . . . . . . . . . . . . . . . . . . 64

8.5.2 t-distribution . . . . . . . . . . . . . . . . . . . . . . . . 65

8.5.3 Skewed distribution . . . . . . . . . . . . . . . . . . . . . 66

8.5.4 Checking the distribution of the data . . . . . . . . . . . . 66

8.6 Transformation . . . . . . . . . . . . . . . . . . . . . . . . . 68

8.6.1 Log2-transformation . . . . . . . . . . . . . . . . . . . . 68

8.7 Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

8.8 Missing values and imputation . . . . . . . . . . . . . . . . . 70

8.9 Statistical testing . . . . . . . . . . . . . . . . . . . . . . . . 71

8.9.1 Basics of statistical testing . . . . . . . . . . . . . . . . . 71

8.9.2 Choosing a test . . . . . . . . . . . . . . . . . . . . . . . 71

8.9.3 Threshold for p-value . . . . . . . . . . . . . . . . . . . 72

8.9.4 Hypothesis pair . . . . . . . . . . . . . . . . . . . . . . . 72

8.9.5 Calculation of test statistic and degrees of freedom . . . . 72

8.9.6 Critical values table . . . . . . . . . . . . . . . . . . . . . 74

8.9.7 Drawing conclusions . . . . . . . . . . . . . . . . . . . . 74

8.9.8 Multiple testing . . . . . . . . . . . . . . . . . . . . . . . 74


8.9.9 Empirical p-value . . . . . . . . . . . . . . . . . . . . . 75

8.10 Analysis of variance . . . . . . . . . . . . . . . . . . . . . . . 75

8.10.1 Basics of ANOVA . . . . . . . . . . . . . . . . . . . . . 75

8.10.2 Completely randomized experiment . . . . . . . . . . . . 76

8.10.3 One-way and two-way ANOVA . . . . . . . . . . . . . . 77

8.10.4 Mixed model ANOVA . . . . . . . . . . . . . . . . . . . 77

8.10.5 Experimental design and ANOVA . . . . . . . . . . . . . 78

8.10.6 Balanced design . . . . . . . . . . . . . . . . . . . . . . 78

8.10.7 Common reference and loop designs . . . . . . . . . . . . 78

8.11 Error models . . . . . . . . . . . . . . . . . . . . . . . . . . . 808.12 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80


II Analysis 81

9 Preprocessing of data 829.1 Rationale for preprocessing . . . . . . . . . . . . . . . . . . . 82

9.2 Missing values . . . . . . . . . . . . . . . . . . . . . . . . . . 82

9.3 Checking the background reading . . . . . . . . . . . . . . . . 839.4 Calculation of expression change . . . . . . . . . . . . . . . . 84

9.4.1 Intensity ratio . . . . . . . . . . . . . . . . . . . . . . . . 86

9.4.2 Log ratio . . . . . . . . . . . . . . . . . . . . . . . . . . 86

9.4.3 Variance stabilizing transformation . . . . . . . . . . . . 87

9.4.4 Fold change . . . . . . . . . . . . . . . . . . . . . . . . . 87

9.5 Handling of replicates . . . . . . . . . . . . . . . . . . . . . . 87

9.5.1 Types of replicates . . . . . . . . . . . . . . . . . . . . . 87

9.5.2 Time series . . . . . . . . . . . . . . . . . . . . . . . . . 88

9.5.3 Case-control studies . . . . . . . . . . . . . . . . . . . . 88

9.5.4 Power analysis . . . . . . . . . . . . . . . . . . . . . . . 88

9.5.5 Averaging replicates . . . . . . . . . . . . . . . . . . . . 88

9.6 Checking the quality of replicates . . . . . . . . . . . . . . . . 89

9.6.1 Quality check of replicate chips . . . . . . . . . . . . . . 89

9.6.2 Quality check of replicate spots . . . . . . . . . . . . . . 90

9.6.3 Excluding bad replicates . . . . . . . . . . . . . . . . . . 90

9.7 Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

9.8 Filtering bad data . . . . . . . . . . . . . . . . . . . . . . . . 91

9.9 Filtering uninteresting data . . . . . . . . . . . . . . . . . . . 92

9.10 Simple statistics . . . . . . . . . . . . . . . . . . . . . . . . . 93

9.10.1 Mean and median . . . . . . . . . . . . . . . . . . . . . . 93

9.10.2 Standard deviation . . . . . . . . . . . . . . . . . . . . . 93

9.10.3 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

9.11 Skewness and normality . . . . . . . . . . . . . . . . . . . . . 94

9.11.1 Linearity . . . . . . . . . . . . . . . . . . . . . . . . . . 95

Contents 11

9.12 Spatial effects . . . . . . . . . . . . . . . . . . . . . . . . . . 96

9.13 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . 98

9.14 Similarity of dynamic range, mean and variance . . . . . . . . 98

9.15 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98


10 Normalization 9910.1 What is normalization? . . . . . . . . . . . . . . . . . . . . . 99

10.2 Sources of systematic bias . . . . . . . . . . . . . . . . . . . 99

10.2.1 Dye effect . . . . . . . . . . . . . . . . . . . . . . . . . . 99

10.2.2 Scanner malfunction . . . . . . . . . . . . . . . . . . . . 100

10.2.3 Uneven hybridization . . . . . . . . . . . . . . . . . . . . 100

10.2.4 Printing tip . . . . . . . . . . . . . . . . . . . . . . . . . 100

10.2.5 Plate and reporter effects . . . . . . . . . . . . . . . . . . 100

10.2.6 Batch effect and array design . . . . . . . . . . . . . . . . 101

10.2.7 Experimenter issues . . . . . . . . . . . . . . . . . . . . 101

10.2.8 What might help to track the sources of bias? . . . . . . . 101

10.3 Normalization terminology . . . . . . . . . . . . . . . . . . . 101

10.3.1 Normalization, standardization and centralization . . . . . 102

10.3.2 Per-chip and per-gene normalization . . . . . . . . . . . . 103

10.3.3 Global and local normalization . . . . . . . . . . . . . . . 103

10.4 Performing normalization . . . . . . . . . . . . . . . . . . . . 103

10.4.1 Choice of the method . . . . . . . . . . . . . . . . . . . . 103

10.4.2 Basic idea . . . . . . . . . . . . . . . . . . . . . . . . . . 104

10.4.3 Control genes . . . . . . . . . . . . . . . . . . . . . . . . 104

10.4.4 Linearity of data matters . . . . . . . . . . . . . . . . . . 105

10.4.5 Basic normalization schemes for linear data . . . . . . . . 105

10.4.6 Special situations . . . . . . . . . . . . . . . . . . . . . . 105

10.5 Mathematical calculations . . . . . . . . . . . . . . . . . . . . 106

10.5.1 Mean centering . . . . . . . . . . . . . . . . . . . . . . . 106

10.5.2 Median centering . . . . . . . . . . . . . . . . . . . . . . 106

10.5.3 Trimmed mean centering . . . . . . . . . . . . . . . . . . 106

10.5.4 Standardization . . . . . . . . . . . . . . . . . . . . . . . 106

10.5.5 Lowess smoothing . . . . . . . . . . . . . . . . . . . . . 107

10.5.6 Ratio statistics . . . . . . . . . . . . . . . . . . . . . . . 108

10.5.7 Analysis of variance . . . . . . . . . . . . . . . . . . . . 108

10.5.8 Spiked controls . . . . . . . . . . . . . . . . . . . . . . . 108

10.5.9 Dye-swap experiments . . . . . . . . . . . . . . . . . . . 108

10.6 RMA preprocessing for Affymetrix data . . . . . . . . . . . . 109

10.7 Some caution is needed . . . . . . . . . . . . . . . . . . . . . 109

10.8 Graphical example . . . . . . . . . . . . . . . . . . . . . . . 111

10.9 Example of calculations . . . . . . . . . . . . . . . . . . . . . 111


10.10 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113


11 Finding differentially expressed genes 11511.1 Identifying over- and under-expressed genes . . . . . . . . . . 115

11.1.1 Filtering by absolute expression change . . . . . . . . . . 115

11.1.2 Statistical single chip methods . . . . . . . . . . . . . . . 116

11.1.3 Noise envelope . . . . . . . . . . . . . . . . . . . . . . . 116

11.1.4 Sapir and Churchill’s single slide method . . . . . . . . . 116

11.1.5 Chen’s single slide method . . . . . . . . . . . . . . . . . 117

11.1.6 Newton’s single slide method . . . . . . . . . . . . . . . 118

11.2 What about the confidence? . . . . . . . . . . . . . . . . . . . 119

11.2.1 Only some treatments have replicates . . . . . . . . . . . 119

11.2.2 All the treatments have replicates: two-sample t-test . . . 120

11.2.3 All the treatments have replicates: one-sample t-test . . . 121

11.2.4 Non-parametric tests . . . . . . . . . . . . . . . . . . . . 121

11.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122


12 Cluster analysis of microarray information 12312.1 Basic concept of clustering . . . . . . . . . . . . . . . . . . . 123

12.2 Principles of clustering . . . . . . . . . . . . . . . . . . . . . 123

12.3 Hierarchical clustering . . . . . . . . . . . . . . . . . . . . . 124

12.4 Self-organizing map . . . . . . . . . . . . . . . . . . . . . . . 125

12.5 K-means clustering . . . . . . . . . . . . . . . . . . . . . . . 126

12.6 KNN classification - a simple supervised method . . . . . . . 127

12.7 Principal component analysis . . . . . . . . . . . . . . . . . . 129

12.8 Pros and cons of clustering . . . . . . . . . . . . . . . . . . . 130

12.9 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

12.10 Function prediction . . . . . . . . . . . . . . . . . . . . . . . 133

12.11 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133


III Data mining 134

13 Gene regulatory networks 13513.1 What are gene regulatory networks? . . . . . . . . . . . . . . 135

13.2 Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . 135

13.3 Bayesian network . . . . . . . . . . . . . . . . . . . . . . . . 137

13.4 Calculating Bayesian network parameters . . . . . . . . . . . 139

13.5 Searching Bayesian network structure . . . . . . . . . . . . . 140

13.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 141


Contents 13

14 Data mining for promoter sequences 14314.1 General introduction . . . . . . . . . . . . . . . . . . . . . . 143

14.2 Introduction to gene regulation . . . . . . . . . . . . . . . . . 143

14.3 Input data I: promoter region sequences . . . . . . . . . . . . 14514.4 Using BioMart to retrieve promoter regions . . . . . . . . . . 149

14.4.1 Final warnings regarding upstream data . . . . . . . . . . 149

14.5 Input data II: the matrices for transcription factor binding sites 149

14.5.1 Searching known patterns . . . . . . . . . . . . . . . . . 150

14.5.2 Pattern search without prior knowledge of what you search 151

14.6 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152


15 Biological sequence annotations 15415.1 Annotations in gene expression studies . . . . . . . . . . . . . 154

15.2 Using annotations with gene expression microarrays . . . . . . 154

15.3 Using Ensembl server to batch process of annotations . . . . . 155

15.4 Biological pathways and GeneOntologies . . . . . . . . . . . 156

15.5 Using GeneOntologies for gene expression data analysis . . . 156

15.6 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . 157

15.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

16 Software issues 16016.1 Data format conversions problems . . . . . . . . . . . . . . . 160

16.2 A standard file format . . . . . . . . . . . . . . . . . . . . . . 161

16.3 Programming . . . . . . . . . . . . . . . . . . . . . . . . . . 161

16.3.1 Perl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

16.3.2 R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

16.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

Index 162

Part I

Introduction

1 Introduction 15

1 Introduction

Garry Wong

Microarray technologies as a whole provide new tools that transform the wayscientific experiments are carried out. The principle advantage of microarray tech-nologies compared with traditional methods is one of scale. In place of conductingexperiments based on results from one or a few genes, microarrays allow for thesimultaneous interrogation of thousands of genes or entire genomes.

1.1 Why perform microarray experiments?

The answers to this question span a wide range from the formally considered, well-constructed hypotheses with elegant supporting arguments to “I can’t think of any-thing else to do, so lets’s do a microarray experiment”. The true motivation forperforming these experiments lies likely somewhere between the two extremes. Itis this combination of generating a scientific hypothesis (elegant or not), and at thesame time being able to produce massive amounts of data that has made researchin microarrays so attractive. Nonetheless, the production and use of microarrays isset with high technical and instrumentation demands. Moreover, the computationand statistical requirements for dealing with the data can be daunting, especiallyto those scientists used to single experiment – single result analysis. So, for thosewilling to try this new technology, microarray experiments are performed to answera wide range of biological questions to which the answers are to be found in therealm of hundreds, thousands, or an entire genome of individual genes.

1.2 What is a microarray?

Microarrays are microscope slides that contain an ordered series of samples (DNA,RNA, protein, tissue). The type of microarray depends upon the material placedonto the slide: DNA, DNA microarray; RNA, RNA microarray; protein, proteinmicroarray; tissue, tissue microarray. Since the samples are arranged in an orderedfashion data obtained from the microarray can be traced back to any of the samples.This means that genes on the microarray are addressable. The number of orderedsamples on a microarray can number into the hundred of thousands. The typicalmicroarray contains several thousands of addressable genes.

The most commonly used microarray is the DNA microarray. The DNA


printed or spotted onto the slides can be chemically synthesized long oligonu-cleotides or enzymatically generated PCR products. The slides contain chemicallyreactive groups (typically aldehydes or primary amines) that help to stabilize theDNA onto the slide, either by covalent bonds or electrostatic interactions. An alter-native technology allows the DNA to be synthesized directly onto the slide itself bya photolithographic process. This process has been commercialized and is widelyavailable. DNA microarrays are used to determine

1. The expression levels of genes in a sample, commonly termed expressionprofiling.

2. The sequence of genes in a sample, commonly termed minisequencing forshort nucleotide reads, and mutation or SNP (single nucleotide polymor-phism) analysis for single nucleotide reads.

1.3 Microarray production

Printing microarrays in not a trivial task and is both an art and a science. Thejob requires considerable expertise in chemistry, engineering, programming, largeproject management, and molecular biology. The aim during printing is to producereproducible spots with consistent morphology. Early versions of printers werecustom made with a basic design taken from a prototype version. Some were builtfrom laboratory robots. Current commercial microarray printers are available foralmost every size application and have made the task of printing microarrays feasi-ble and affordable for many nonspecialized laboratories. The basic printer consistsof: a nonvibrating table surface where the raw glass slides are placed, a movinghead in the x-y-z plane that contains the pins or pens to transfer the samples ontothe array, a wash station to clean the pins/pens between samples, a drying stationfor the pins/pens, a place for the samples to be printed, and a computer to con-trol the operation. Some of these procedures can be automated such as replacingthe samples to be printed, although most complete systems are semi-automated.Samples to be printed are concentrated and stored in microtitre plates.

The printers are operated in dust-free, temperature and humidity controlledrooms. Some printer designs have their own self-contained environmental controls.Printing pen designs have been adapted from ink methods and include quill, ball-point, ink-jet, and P-ring techniques. The pens can get stuck and need to be cleanedfrequently. Multiple pens placed on a printing head can multiplex the printingoperation and speed up the process. Thousands of samples in duplicate or triplicateare printed in a single run over perhaps a hundred or more slides; thus, printingtimes of several days are common. Since printing times can be long and samplevolumes are small, sample evaporation is a major concern. As a result, hygroscopicprinting buffers, often containing DMSO have been developed and are highly usefulto alleviate evaporation. A typical printer design is shown in Figure 1.1.

An alternative technology to produce microarrays is via in situ synthesis di-rectly onto the solid support. This procedure involves the photoactivation and de-protection of nucleic acids using directed light to selectively elongate or block syn-thesis of oligonucleotides at directed areas on the array. Affymetrix arrays are

1 Introduction 17

formed by directing light through photolithographic masks. Other companies havedeveloped technologies to direct light through micromirrors, microchannels, andinto microchambers. Alternatives to directed light also include the use of directedelectrochemistry to activate and protect nucleic acids.

In addition to printing and photolitography, bead-based methods such as Illu-mina are available. On Illumina chips the probe, typically a longer oligonucleotide(50 bp), is bound to a small bead that is present on the chip in multiple copies.

Figure 1.1: Typical picture of a printer.

1.4 Where can I obtain microarrays?

Microarrays can be obtained from a variety of sources. Commercial microarraysare of high quality, good density and available for the most commonly studiedorganisms including human, mouse, rat, and yeast. Many companies also havespecialized arrays available for more focused investigations. Also microarray corefacilities located in academic and government institutions produce microarrays thatare available for use (Table 1.1). The microarrays from these centers have definitecost advantages and excellent quality standards, but tend to have a more restrictedfocus in terms of the types of arrays and species covered. Custom microarrays thatare designed and produced for individual or limited use are a growing application.Custom fabrication of a microarray still requires a microarray printer or spotterfacility to produce the array, but has the advantage of producing smaller batches ofslides and more flexibility in the genes on array. If you are fortunate to have yourown facility, microarrays can be produced at your own convenience. Microarraysonce made store well in dark dessicated plastic slide boxes. Some manufacturerssuggest storage at −20 ◦C while others find room temperature adequate. The shelf


life of microarrays has been claimed to be up to 6 months although this has notbeen empirically tested.

Table 1.1: Places where microarrays can be obtained.

Center name Website

Biomedicum Biochip Center www.helsinki.fi/biochipcenterFinnish DNA Microarray Centre microarrays.btk.utu.fi

Norwegian Microarray Consortium www.med.uio.no/dnr/microarray/Affymetrix Inc. www.affymetrix.com

Agilent Technologies www.agilent.comTelechem International Inc. www.arrayit.com

1.5 Extracting and labeling the RNA sample

A typical workflow of the microarray experiment has been summarized in Fig-ure 1.2. Once microarrays have been made and obtained, the next stage is to obtainsamples for labeling and hybridization. Labeling RNA for expression analysis gen-erally involves three steps:

1. Isolation of RNA.

2. Labeling the RNA by a reverse transcription procedure with fluorescentmarkers.

3. Purification of the labeled products.

RNA can be extracted from tissue or cell samples by common organic extrac-tion procedures used in most molecular biology labs. Many commercial kits areavailable for this task. Both total RNA and mRNA can be used for labeling, but thecontaminating genomic DNA must be removed by DNAase treatment. The amountof total RNA necessary for a single labeling reaction is about 20 μg while theamount of mRNA necessary is about 0.5 μg. Lesser amounts are known to work,but require extreme purity and well developed protocols. Thus, while the absoluteamounts may vary, the purity and integrity of the RNA is an absolute must. It isgenerally a good idea to check the RNA samples before using them in microarrayexperiments. In fact, for many core facilities it is a requirement. This can be done,e.g., by assaying the absorption ratio 260/280 lambda and/or running a sample onan ethidium bromide stained agarose gel.

Direct labeling of the RNA is achieved by producing cDNA from the RNA byusing the enzyme reverse transcriptase and then incorporating the fluorescent la-bels, most commonly Cy3 and Cy5 . Other fluorophores are available (e.g. Cy3.5,TAMRO, Texas red) but have not yet found widespread use. In the indirect proce-dure, a reactive group, usually a primary amine, is incorporated into the cDNA first,and the Cy3 or Cy5 is then coupled to the cDNA in a separate reaction. The advan-tage of the indirect method is a higher labeling efficiency due to the incorporation

1 Introduction 19

Figure 1.2: Work flow of a typical expression microarray experiment.


of a smaller molecule during the reverse transcription step. Once fluorescentlylabeled probes are made, the free unincorporated nucleotides must be removed.This is typically done by column chromatography using convenient spin-columnsor by ethanol precipitation of the sample. Some protocols perform both purificationsteps. As a small aside, radioactivity is still around and may even make a comebackin microarrays. Incorporation of 33P- or 35S-labeled nucleotides into cDNAs havehigh rates and provide more sensitivity than fluoresecently labeled probes. Thesefeatures have been exploited in plastic microarrays that have the gene density andsize of microarrays but require far less instrumentation and are reusable.

1.6 RNA extraction from scarse tissue samples

For many microarray applications there is a scarcity of tissue available for RNA ex-traction. This seems to be the case especially in human tissue studies. In response,many scientists have developed techniques aimed at getting around this problem.These procedures generally involve either PCR amplification of the cDNAs madefrom the original RNAs, or production of more RNA from the original RNA sam-ple by hybridization of a T7 or T3 promoter followed by RNA sythesis with RNApolymerase. As usual for any amplification procedure, proper controls and inter-pretation of the results need to be considered.

A related issue in isolating tissues for microarray studies is the dissection ofsmall populations of cells or even single cells. Sophisticated instruments have beendeveloped for this application and many are commercially available. These laser-assisted microdissection machines, while expensive, are nonetheless fairly straight-forward to use and provide a convenient method for obtaining pure cell samples.

1.7 Hybridization

Conditions for hybridizing fluorescently labeled DNAs onto microarrays are re-markably similar to hybridizations for other molecular biology applications. Gener-ally the hybridization solution contains salt in the form of buffered standard sodiumcitrate (SSC), a detergent such as sodium dodecyl sulphate (SDS), and nonspecificDNA such as yeast tRNA, salmon sperm DNA, and/or repetitive DNA such as hu-man Cot-1. Other nonspecific blocking reagents used in hybridization reactionsinclude bovine serum albumin or Denhardt’s reagent. Lastly, the hybridization so-lution should contain the labeled cDNAs produced from the different RNA popula-tions.

Hybridization temperatures vary depending upon the buffers used, but gen-erally are performed at approximately 15 − 20 ◦C below the melting temperature,which is 42 − 45 ◦C for PCR products in 4X SSC and 42 − 50 ◦C for long oligos.Hybridization volumes vary widely from 20μl to several mLs. For small hybridis-ation volumes, hydrophobic cover slips are used. For larger volumes, hybridizationchambers can be used.

Hybridization chambers are necessary to keep the temperature constant andthe hybridization solution from evaporating. Hybridization chambers vary substan-tially from the most expensive high-tech automated instruments to empty pipette

1 Introduction 21

boxes with a few wet paper towels inserted. The range of solutions for providinga thermally stable, humidified enviroment for a microscope slide is only virtuallyunlimited. Some might even consider a sauna as a potential chamber. In smallvolumes, the hybridization kinetics are rapid so a few hours can yield reproducibleresults, although overnight hybridizations are more common.

1.8 Scanning

Following hybridization, microarrays are washed for several minutes in decreas-ing salt buffers and finally dried, either by centrifugation of the slide, or a rinsein isopropanol followed by quick drying with nitrogen gas or filtered air. Fluores-cently labeled microarrays can then be “read” with commercially available scan-ners. Most microarray scanners are basically scanning confocal microscopes withlasers exciting at wavelengths specifically for Cy3 and Cy5, the typical dyes usedin experiments. The scanner excites the fluorescent dyes present at each spot on themicroarray and the dye then emits at a characteristic wavelength that is capturedin a photomultiplier tube. The amount of signal emitted is directly in proportionto the amount of dye at the spot on the microarray and these values are obtainedand quantitated on the scanner. A reconstruction of the signals from each locationon the microarray is then produced. For cDNA microarrays one intensity value isgenerated for the Cy3 and another for the Cy5. Hence, cDNA microarrays producetwo-color data. Affymetrix (oligo) chips produce one-color data, because only onemRNA sample is hybridized to every chip (see chapter 2). When both dyes arereconstructed together, a composite image is generated. This image produces thetypical microarray picture.

1.9 Typical research applications of microarrays

The types and numbers of applications for microarray experiments are quite vari-able and constantly increasing. Microarrays used to monitor the expression level ofgenes in comparison between two conditions remains one of the most widespreaduses of microarrays. This type of study, termed gene expression profiling, can beused to determine the function of particular genes during a particular state, such asnutrition, temperature, or chemical environment. Such results could be observed asup- or down-regulation, or unchanged during particular conditions. For example,a group of genes could be up-regulated during heat shock, and as a group, thesegenes could be assigned as heat shock responsive genes. Some genes in this groupmay have already been identified as heat shock responsive, but other genes in thegroup may not have been assigned any function. Based on a similar response toheat shock, new functions are then assigned to the genes. Therefore, extrapola-tion of function based on common changes in expression remains one of the mostwidespread applications of microarray research. By assumption, genes that sharecommon regulatory patterns also share the same function.

In agriculture, microarrays have been used to identify genes which are in-volved in the ripening of tomatos, for example. In this type of study, RNAs are iso-lated from raw and ripened fruit, and then compared to determine which genes are


expressed during the process. Genes which are down-regulated during the ripeningmay also provide useful information about the process.

On a basic scientific level, microarrays have been used to map the cellular, re-gional, or tissue-specific localization of genes and their respectively encoded pro-teins. Microarrays have been used: at the subcellular level to map genes that encodemembrane or cytosolic proteins; at the cellular level to map genes that distinguishbetween different types of immune cells; at the tissue region level to distinguishgenes which encode hippocampus or cortex brain region specific proteins; and atthe tissue level to identify genes which are expressed in muscle, liver, or heart tis-sues.

Pharmacological studies have also used microarrays as a means of discerningthe mechanism of action of therapeutic agents and as a corrollary to develop newdrug targets. The guiding principle in this endeavor is that genes regulated by ther-apeutic agents result from the actions of the drug. Identification of the genes thatare regulated by a certain drug could potentially provide insight into the mechanismof action of the drug, prediction of toxicologic properties, and new drug targets.

One of the most exciting areas of application is the diagnosis of clinicallyrelevent diseases. The oncology field has been especially active and to an extentsuccessful in using microarrays to differentiate between cancer cell types. Theability to identify cancer cells based on gene expression represents a novel method-ology that has real benefits. In difficult cases where a morphological or an antigenmarker is not available or reliable enough to distinguish cancer cell types, geneexpression profiling using microarrays can be extremely valuable. Programs topredict clinical outcome and to design individual therapies based on expressionprofiling results are well underway.

A very recent application of microarrays has been to perform comparativegenomic analysis. Genome projects are producing sequences on a massive level,yet there still does not exist sufficient resources to sequence every organism thatseems interesting or worthy of the effort. Therefore, microarrays have been used asa shortcut to both characterize the genes within an organism (structural genomics)and also to determine whether those genes are expressed in a similar way to areference organism (functional genomics). A good example of this is in the speciesOryza sativa (rice). Microarrays based on rice sequences can be used to hybridizecDNAs derived from other plant species such as corn or barley. The genome sizesin the latter are simply too large for whole genome projects, so hybridization withmicroarrays to rice genes presents an agile way to address this question.

Comparative genomic hybridization (CGH) is another important application.Here, gene copy numbers are compared between two samples and genomic DNArather than RNA transcripts are labeled. CGH can detect gene amplification ordeletion events underlying tumour genesis. It can also be used to detect genomicrearrangement or abnormality events with serious health consequences in humanssuch as trisomies. RNAinterference (RNAi) arrays, where DNA producing RNAitranscripts are printed onto solid supports and cell cultures are placed on top ofthe printed spots provide still another platform by which arrays can be utilized tounderstand basic oncology mechanisms and pathways.

Single nucleotide polymorphism (SNP) microarrays are designed to detect the

1 Introduction 23

presence of single nucleotide differences between genomic samples. SNPs occur atfrequencies of approximately 1 in 1000 bases in humans and underlie the genomicdifferences between individuals. Mapping and obtaining frequencies of identifiedSNPs should provide a genetic basis for identifying disease genes, predicting ef-fects of the environment as well as responses to therapeutic agents. Minisequenc-ing, primer extension, and differential hybridization methods have been developedon the microarray platform with all the advantages of expression arrays: highthroughput, reproducibility, economy, and speed.

Since the sequencing of complete genomes of yeast, mice, rats, and humans, itwas noticed that many RNA transcripts previously sequenced represented multiplealternatively spliced genes as well as nonprotein encoding RNAs. These RNAscould be pseudogenes, alternative transcripts, antisense or microRNAs. Producingmicroarrays for these transcripts presents a problem since all possible alternativeforms of transcripts are not known. To measure the abundance of these transcripts,which originate from the genome, tiling arrays have been produced. These arraysinterrogate the transcript abundance at 30 base pair resolution across an area of thegenome. These arrays hold the promise of identifying not only novel transcripts,but also their intron-exon boundaries.

Indeed, the use of microarrays to determine whether a gene is present andwhether it goes up or down under certain conditions will continue to spawn evenmore applications that now depend only upon the imagination of the microarrayresearcher.

1.10 Experimental design and controls

Good experimental design in a microarray project requires the same principles andpractices that are part of any scientific investigation. Appropriate controls are thefoundation to any experiment. Both positive controls and negative controls can pro-vide confidence in the results and even provide insight into the success or failureof the experimental protocol. Sufficient replicates should be planned to decreaseexperimental error and to provide statistical power. Forethought and consultationon the correct statistical practices and procedures for the design are always advan-tageous. Attention to experimental parameters is a must, so that whatever treat-ment, time, dose, individual, or tissue location is being studied, the results will beinterpretable with minimum number of confounders. Indeed, probably the only dif-ference between good experimental design in microarray and other experiments isthat the time budgeted for data analysis seems always to be underestimated. As afinal suggestion, attention to the mundane but critical statistical and data analysiselements of a microarray experiment will greatly increase your ratio of joy to painat the end of your microarray journey.

1.11 Suggested reading

1. Aarnio, V., Paananen, J., Wong, G. (2005) Analysis of microarray studiesperformed in the neurosciences. J. Mol. Neurosci., 27, 261-8.


2. Barnes, M., Freudenburg, J., Thompson, S., Aronow, B., and Pavlidis, P.(2005) Experimental comparison and cross-validation of the Affymetrix andIllumina gene expression analysis platforms, PNAS, 33, 5914-5923.

3. Bertone, P., Gerstein, M., Snyder, M. (2005) Applications of DNA tilingarrays to experimental genome annotation and regulatory pathway discovery.Chromosome Res., 13, 259-74.

4. Imbeaud, S., Auffray, C. (2005) ’The 39 steps’ in gene expression profiling:critical issues and proposed best practices for microarray experiments. DrugDiscov Today., 10, 1175-82.

5. Kallioniemi, O.(2004) Medicine: profile of a tumour. Nature, 428, 379-82.

6. Lueking, A., Cahill, D.J., Mullner, S. (2005) Protein biochips: A new andversatile platform technology for molecular medicine. Drug Discov. Today.,10, 789-94.

7. Oostlander, A.E., Meijer, G.A., Ylstra, B. (2004) Microarray-based compar-ative genomic hybridization and its applications in human genetics. Clin.Genet., 66, 488-95.

8. Schena, M., Shalon, D., Davis, R.W., Brown, P.O. (1995) Quantitative mon-itoring of gene expression patterns with a complementary DNA microarray.Science, 270, 467-470.

9. Stoughton, R.B. (2005) Applications of DNA microarrays in biology. Annu.Rev. Biochem., 74, 53-82.

10. Venkatasubbarao, S. (2004) Microarrays–status and prospects. Trends Biotech-nol., 22, 630-7.

11. Wheeler, D.B., Carpenter, A.E., Sabatini, D.M. (2005) Cell microarrays andRNA interference chip away at gene function. Nat. Genet., 37 Suppl, S25-30.

2 Affymetrix GeneChip system 25

2 Affymetrix GeneChipsystem

Janna Saarela and Iiris Hovatta

2.1 Affymetrix technology

There are two principal differences between the Affymetrix GeneChip system and“traditional” cDNA microarrays in studying gene expression. First, instead ofhybridizing two RNAs labeled with different fluorophores competitively on onecDNA microarray, a single RNA is hybridized on the array in the Affymetrix sys-tem, and the comparisons are then made computationally. Second, in the Affymetrixarrays each gene is represented as a probe set of 10–25 oligonucleotide pairs insteadof one full length or partial cDNA clone. The oligonucleotide pair (probe pair)comprises of one oligonucleotide perfectly matching to the gene sequence (PerfectMatch, PM) and a second oligonucleotide having one nucleotide mismatch in themiddle of it (Mismatch, MM), that serves as a hybridization specificity control.Probes are designed within 500 base pairs of the 3’ end of each gene to hybridizeuniquely in the same, predetermined hybridization conditions. Some housekeepinggenes are represented as three probe sets, one set designed to the 5’ end of the gene,second set to the middle of the gene and the third to the 3’end. They serve as con-trols for the quality of the hybridized RNA. In addition to species specific genes,some spiked-in control probe sets are introduced to facilitate the controlling of thehybridization. In the experiment, biotin-labeled RNA is hybridized to the array,which is washed, stained with phycoerythrin-conjugated streptavidin, and scannedwith the Gene Array Scanner. A grid is automatically laid over the array image andbased on the intensities of each probe pair expression measurements are performedwithin the GCOS, (a new version of the Affymetrix Microarray Suite, version 5),version 5 software.

2.2 Single Array analysis

This analysis generates qualitative and quantitative values from one gene expres-sion experiment array and provides initial data required to perform comparisonsbetween arrays. A qualitative value, a Detection call, indicates whether a transcript


is reliably detected (Present) or not detected (Absent) in the array. The Detectioncall is determined by comparing the Detection p-value generated in the analysisagainst user-defined cut-offs. A quantitative value, a signal, assigns a relative mea-sure of abundance to the transcript.

2.3 Detection p-value

The detection algorithm uses a two-step procedure to determine the Detection p-value. First it calculates Discrimination score (R) for each probe pair. The Dis-crimination score measures the target-specific intensity difference of the probe pairrelative to its overall hybridization intensity:

R = PM − M M

PM + M M

Second, the Discrimination score (R) is tested against the user-defined thresholdvalue Tau, which is a small positive number (by default 0.015). Those probe pairswith the discrimination score higher than Tau vote for the presence of the transcript,while the scores lower than Tau vote for the absence. The voting results of all probepairs are summarized as a p-value, which is generated by using the One-sidedWilcoxon’s Signed Rank test as a statistical method. The higher the discriminationscores are above Tau, the smaller the p-value and the more likely the transcriptis present. The lower the scores are below Tau, the larger the p-value, and morelikely the transcript is absent. Additionally, prior to the two-step Detection p-valuecalculation, the level of photomultiplier saturation is evaluated for each probe pair.If all probe pairs in a probe set are saturated, the corresponding transcript is givena Present call. If the Mismatch (MM) probe is saturated, that probe pair is rejectedfrom further analysis. Tau can be used to modify the specificity and the sensitivityof the assay: increasing the value of Tau increases the specificity, but lowers thesensitivity, while decreasing Tau lowers the specificity and increases the sensitivity.

2.4 Detection call

A qualitative value, the Detection call is determined by comparing the Detection p-value with the user-modifiable p-value cut-offs (alpha1 and alpha2, α1 = 0.04, α2= 0.06 by default). Any p-value smaller than alpha1 is assigned a Present call, andthat larger than alpha2 an Absent call. Marginal calls are given to values betweenalpha1 and alpha2. The p-value cut-offs can also be adjusted to modify specificityand sensitivity.

2.5 Signal algorithm

Signal is a quantitative value calculated for each probe set and it represents therelative level of expression of a corresponding transcript. Signal is calculated as aweighted mean using the One-Step Tukey’s Biweight Estimate. The specific signalfor each probe pair is estimated by taking the log of the Perfect match intensityafter subtracting the stray signal estimate. It does not make physiological sense if


the intensity of the Mismatch probe cell is higher than the Perfect match intensity,and therefore an imputed value called Change Threshold (CT) is used instead of theuninformative Mismatch intensity. Three rules are employed to form a stray signalestimate:

• If the Mismatch value is less than the Perfect match, the Mismatch value isconsidered informative and the intensity is used directly as an estimate ofstray signal.

• If the Mismatch probe cells are generally informative across the probe setexcept for a few Mismatches, an adjusted Mismatch value is used for unin-formative Mismatches based on the bi-weight mean of the Perfect match andthe Mismatch ratio.

• If the Mismatch cells are generally uninformative, the uninformative Mis-matches are replaced with a value slightly smaller than the Perfect match.The transcripts represented by such probe sets are generally called Absentby the Detection algorithm.

Each probe pair is considered as having a potential vote in determining theSignal value. The probe pair is weighted more strongly if the signal of the probepair is closer to the median value for that probe set. Once the weight of each probepair is determined, the mean of the weighted intensity values for that probe set isidentified. This mean is the quantitative value Signal.

2.6 Analysis tips

For each experiment ensure the overall quality of the hybridization by visually in-specting the image file. It is important that quality control values are comparablefor arrays that are directly compared to each other. Check at least the followingdata points:

• Background is 40–70 (using current scanner settings).

• Scaling Factor is close to one (using current scanner settings and Target In-tensity 100) or at least in the same level for those probe arrays you are plan-ning to compare with each other.

• All positive hybridization controls (BioB, BioC, BioD, CreX) are present.

• Ratios of signals of 3’, middle and 5’ probe sets (GAPDH, Beta-actin) areclose to one. Note, however, that performing several rounds of cDNA syn-thesis and in vitro transcription to amplify small amounts of RNA (< 1 ug)increases the 3’/5’ ratio.

• Percentage of present calls is between 30-50

Experiments analyzed by GCOS software can be exported from it by the DataTransfer Tool. The Data Transfer Tool can create *.CAB files, which can be im-ported into the GCOS or for each experiment, four files (“*.EXP” experiment file,


“*.DAT” image file, “*.CEL” file and “*.CHP” data file), which can be opened inan earlier version of the software, MAS, version 5 and in other suitable software.

Some sets of experiments may benefit from using only the Prefect Match probesignals in the analysis. This can be achieved with software such as dChip [2].

2.7 Comparison analysis

Comparison analysis provides a computational comparison between two RNA sam-ples hybridized to two GeneChip probe arrays of a same type. For the analysis onearray is designated as an experiment and the other as a baseline and comparedwith each other in order to detect and quantify differences in gene expression be-tween the two samples. Two sets of algorithms are used to generate qualitative(Change, Change p-value) and quantitative (Signal Log Ratio) estimates of the po-tential change.

2.8 Normalization

Before a comparison of two probe arrays can be made, variations between the twoexperiments caused by technical and biological factors must be corrected by scal-ing or normalization. Main sources of technical variation in an array experimentare quality and quantity of the labeled RNA hybridized as well as differences inreagents, stain and chip handling. Biological variation, though irrelevant to studyquestion, may arise from differences in genetic background, growth conditions,time, sex, age, etc. Careful study design followed by standardized labeling andhybridization protocols, and either scaling or normalization should be used to min-imize this variation.

Scaling (recommended by Affymetrix) and normalization can be made usingeither data from user-selected probe sets or all probe sets. When using selectedprobe sets (for example a group of housekeeping genes) for scaling/normalization,it is important to have a priori knowledge of gene expression profiles of selectedgenes. Thus in most cases less bias is introduced to the experiment when the dataof all probe sets is used for scaling/normalization. When scaling is applied, theintensity of the probe sets from experimental and baseline arrays is scaled to same,user-defined target intensity (recommendation using current scanner setting is 100).If normalization is applied, the intensity of the probe sets from experimental arrayis normalized to the intensity of the probe sets on the baseline array.

An additional normalization factor, Robust Normalization, is also introducedto the data. It accounts for probe set characteristics resulting from sequence-relatedfactors, such as affinity of the probe set to the RNA and linearity of the hybridiza-tion of each probe pair. More specifically, this factor corrects for the inevitableerror of using an average intensity of all the probes (or selected probes) on thearray as a normalization factor for every probe set. Robust normalization of theprobe set is calculated once the initial scaling/normalization factor is determined,by calculating a slightly higher and a slightly lower factor than the original. User-modified parameter, Perturbation, which can vary between 1.00 (no perturbation)and 1.49, defines the range by which the normalization factor is adjusted up and


down. The perturbation value directly affects the subsequent p-value calculation,since of the p-values resulting from applying the normalization factor and its twoperturbed variants, the most conservative is used to evaluate whether any changein level is justified by the data. Increasing the perturbation value can reduce thenumber of false changes, but may also block the detection of some true changes.

2.9 Change p-value

Differences between Perfect Match and Mismatch intensities as well as betweenPerfect Match and background intensities are used to calculate the Change p-valueby the Wilcoxon’s signed rank test. In addition, the level of photomultiplier sat-uration for each probe pair is evaluated. In the computation, any saturated probecell is rejected from the analysis (number of used cells can be determined from theStat common pairs column). The Change p-value ranges from 0.0 to 1.0 providingan estimate of the likelihood and direction of the change. For values close to 0.5,true change in the level of expression between the two RNA samples is unlikely.Values close to 0.0 indicate probability for an increase in gene expression level inthe experiment array compared to baseline, whereas values close to 1.0 indicatelikelihood for a decrease in gene expression level.

2.10 Change call

To make the Change Call, Change p-value is categorized by cutoff values calledgamma1 (γ 1) and gamma2 (γ 2). Gamma1 (γ 1) distinguishes between Changecalls, Increase, Marginal Increase and No Change. Gamma2 (γ 2) draws the linebetween Change calls, Decrease, Marginal Decrease and No change (Figure 2.1).These values are not directly user-definable, but are derived from two user-adjustableparameters, γ 1L and γ 1H, which define the lower and upper boundaries for gamma1.Gamma 2 is computed as a linear interpolation of γ 2L and γ 2H.

It is possible to compensate for effects that influence calls based on low andhigh signals by adjusting the stringency of calls associated with low and high signalranges (γ L and γ H) independently. This option is not used by default, since gammavalues are set as γ 1L = γ 1H and γ 2L = γ 2H.

2.11 Signal Log Ratio Algorithm

Signal Log Ratio algorithm estimates the measure and the direction of change of agene/transcript when two arrays are compared. Each probe pair on the experimentarray is compared to the corresponding probe pair in the baseline array in the cal-culation of Signal Log Ratio. This process eliminates differences due to differentprobe binding coefficients. A One-Step Tukey’s Biweight method is used in com-puting the Signal Log Ratio value by taking a mean of the log ratios of probe pairintensities across the two arrays. The base 2 log scale is used, translating the SignalLog Ratio of 1.0 to a 2-fold increase in the expression level and of -1.0 to a 2-folddecrease. No change in the expression level is thus indicated by a Signal Log Ratiovalue 0.


Figure 2.1: A representation of p-values in a data set. The Y-axis is the probe set signal andthe X-axis is the p-values. Values γ 1L, γ 2L, γ 1H, γ 2H describe user-adjustable valuesused in making the Change Call (I= Increased, MI = Marginal Increase, MD = MarginalDecrease, D = Decrease).

Tukey’s Biweight method also gives estimate of the amount of variation in thedata. Confidence intervals are generated from the scale of variation of the data.A 95% confidence interval shows a range of values, which will include the truevalue 95% of the time. Small confidence interval implies that the expression datais more exact, while large confidence intervals reflect more noise and uncertaintyin estimating the true level. Since the confidence intervals attached to Signal LogRatios are computed from variation between probes, they may not reflect the fullwidth of experimental variation.

2.12 Quality control tips

In order to reliably compare different arrays to each other it is important to alwaysperform the labeling and hybridization using a standardized protocol starting fromthe same amount of RNA. It is essential to keep track of every step of the exper-iment starting from the tissue collection and to apply rigorous quality controls inall steps. This will minimize the false positives and maximize the sensitivity whencomparing several arrays to each other.

Using small amount of RNA as a starting material and several rounds of cDNAsynthesis and in vitro transcription introduce a systematic bias to the data. There-fore arrays labeled using two different labeling methods (one-cycle and two-cycle)are not directly comparable with each other. Also more replicate samples may beneeded when performing amplification of RNA since the variation of the resultswill be higher. Some algorithms have been developed trying to minimize the bias[1].



1. Cheng, L., and W. H. Wong (2003) DNA-Chip Analyzer (dChip). In: Theanalysis of gene expression data: methods and software. Edited by G Parmi-giani, ES Garrett, R Irizarry and SL Zeger. Springer.

2. Wilson, C. L., Pepper, S. D., Hey, Y., and Miller, C. J. (2004) Amplificationprotocols introduce systematic but reproducible errors into gene expressionstudies. Biotechniques, 36, 498-506.


3 Genotyping systems

Janna Saarela

3.1 Introduction

SNP (single nucleotide polymorphism) microarrays provide an efficient and rela-tively inexpensive tool for studying several genetic variations in multiple samplessimultaneously. Numerous methods have been developed for SNP genotyping andtwo of them, namely single base extension (SBE) followed by tag-array hybridiza-tion and allele-specific primer extension, are shortly described here.

3.2 Methodologies

For the single base extension reaction, multiple genomic DNA-regions flankingSNP sites are amplified by PCR using SNP-specific primers (Figure 3.1:A). Af-ter enzymatic removal of the excess primers and nucleotides single base extensionreactions are carried out with detection primers each containing a sequence comple-mentary to the predefined TAG sequence spotted on the array and the SNP-specificsequence just preceding the variation. One allele-defining nucleotide, each labeledwith different fluorescent dye, is added to the detection primer in the SBE reaction.Finally, detection primers are hybridized to the TAG sequences spotted on an array.Genotypes are determined by comparing the signals of two possible nucleotideslabeled with different fluorescent dyes on each spot.

For the second method, allele-specific primer extension, two amino-modifieddetection primers, each containing one of the variable nucleotides of the SNP astheir 3’ nucleotide, are spotted and covalently linked to chemically activated mi-croscope slides (Figure 3.1:B). Genomic DNA flanking the SNP is amplified withSNP-specific primers containing T7 or T3 RNA polymerase recognition sequencein their 5’ end. Multiple SNPs can be amplified in one multiplex PCR reactionsimultaneously. By using T7 (or T3) RNA polymerase, PCR products are subse-quently transcribed to RNA, which is then hybridized to the SNP array containingthe two detection primers for each SNP. Reverse transcriptase enzyme and fluo-rescently labeled nucleotides are then employed to visualize the sequence-specificextension reaction of the allele-defining detection primer(s). Up to 80 differentsamples can be analyzed on one slide when as many identical subarrays of de-

3 Genotyping systems 33

tection primers are spotted on a slide, which is partitioned into small hybridizationchambers. Genotypes are determined by comparing the signals of the two detectionprimers representing the SNP on the array.

Figure 3.1: Principles of the two SNP genotyping methods. A. Single base primer exten-sion followed by TAG-array hybridization. B. Allele-specific primer extension on array.Traffic lights at the bottom of the figures describe three possible genotypes with eachmethod.

3.3 Genotype calls

To make the genotype call, signals of the two possible variants are compared. Incase of the SBE reaction, intensities of the two different fluorophores in a spotrepresenting the SNP are measured. In case of the allele-specific primer extensionmethod, each SNP is represented by two spots, both putatively labeled (with thesame fluorophore). The intensities of the two spots are measured and comparedto each other to determine the genotype. The same methods can be used for bothcomparisons. One method is to calculate the part of the signal of one allele overthe total signal intensity:

P = signal of allele 1

(signal of allele 1 + signal of allele 2)

All the values are between zero and one (Figure 3.2). Values close to onerepresent the homozygote allele 1 genotype, and those close to zero represent theother homozygote genotypes (allele 2). Those in-between (close to 0.5) represent


the heterozygote genotypes. A scatter plot can be formed having the values for Pin the X-axis and logarithm of the total intensity as the Y-axis. Three clear groupsof signals should show clearly, each representing one of the possible genotypes.

Figure 3.2: An example of genotype calls. TT and CC denote homozygous individual,and TC denotes heterozygous individuals.


1. Guo, Z., Guilfoyle, R. A., Thiel, A. J., Wang, R., and Smith, L. M. (1994)Direct fluorescence analysis of genetic polymorphisms by hybridization witholigonucleotide arrays on glass supports. Nucleic Acids Res., 22, 5456-5465.

2. Hirschhorn, J. N., Sklar, P., Lindblad-Toh, K., Lim, Y. M., Ruiz-Gutierrez,M., Bolk, S., Langhorst, B., Schaffner, S., Winchester, E., Lander E. S.(2000) SBE-TAGS: an array-based method for efficient single-nucleotidepolymorphism genotyping. Proc. Natl. Acad. Sci. U S A, 97, 12164-9.

3. Pastinen, T., Raitio, M., Lindroos, K., Tainola, P., Peltonen, L., Syvanen,A-C. (2000) A system for specific, high-throughput genotyping by allele-specific primer extension on microarrays. Genome Res., 10,1031-42.

4. Syvanen, A-C. (1994) Detection of point mutations in human genes by thesolid-phase minisequencing method. Clin. Chim. Acta., 226, 225-36.

4 Image analysis 35

4 Image analysis

Antti Lehmussola, Katja Kimppa and Pekka Tiikkainen

4.1 Digital image processing

In the current section, some general level concepts of digital image processing areintroduced. Instead of presenting in-depth theory of image processing, this sectionserves as a recap on the general principles of image processing used in microar-ray technology. More detailed information concerning digital images and imageprocessing can be found from the literature.

4.1.1 Digital image

Briefly, a digital image is a reconstruction of a continuous analog image. Whereasan analog monochrome image is a continuous light intensity distribution repre-sented as a function of spatial position, a digital monochrome image is a finite setof the analog values. Moreover, a digital color image is a combination of differentmonochrome images, each representing the intensity of particular color.

The transformation from a continuous monochrome image f(y,x) into digitalform is made through sampling and quantization, as illustrated in Figure 4.1. Withsampling, the continuous coordinate values are discretized. Owing to the resolu-tion determined by the image acquisition system, the continuous image is sampledboth horizontally and vertically to M × N discrete image elements. Thereafter, thesampled image is quantized. That is, the continuous amplitude value of each sam-pled image element is represented discretely within given bit limits. For example,16-bit gray-scale image can discriminate 65,536 different intensity values in eachpixel. As a result of sampling and quantization, a representation of the continuousimage is a M × N matrix F(y, x) containing real numbers. The elements of thematrix are called pixels. Each image pixel has a unique coordinate (y, x), wherey = [0, M − 1] and x = [0, N − 1]. Commonly, the origin (0,0) is located in theupper-left corner of the image.

4.1.2 Image storage

The space required for storing a digital image is defined by the spatial resolutionand the number of graylevels of the image. According to the resolution, a digitalimage consists of M × N pixels, and the intensity of each pixel is represented with


Continuous signalQuantized signal

A

B

Figure 4.1: Illustration of sampling and quantization. (A) Continuous monochrome imageis represented as M × N matrix of real numbers; that is, as a digital image. (B) Exampleof quantization in one-dimensional space. Continuous signal is quantized into ten discretelevels.

certain number of bits. With k bits, each pixel has 2k possible intensity values.Therefore, the number of bits needed to store a digital image is b = M × N × k.

The storage of high resolution, high accuracy images acquires excessive amountof file space. The problematic nature of unreasonably large files emphasizes the ne-cessity for data compression. Data compression methods can be categorized intotwo groups: lossless and lossy. With lossless compression, the redundancy in theimage is reduced using information theory, and all original data can be recoveredwhen the file is uncompressed. An example of an image format utilizing loss-less compression is tagged image format (TIFF). However, highly more efficientcompression results can be achieved with lossy compression. The most commonmethod for lossy compression is transform coding, which transforms the imageinto frequency domain, where irrelevant information can be eliminated. It followsthat the higher the compression rate, the more original information is lost. Thepermanent loss of information will generate compression artifacts and prevent re-covery of the original image. Perhaps the most well-known format using lossyimage compression is JPEG. Although lossy compression produces highly com-pressed images, it is recommended to use these images only in case of humanvisual-inspection.

4.1.3 Image analysis

Image analysis is a field of digital image processing having probably the widestscale of different applications. With the aid of image analysis, meaningful informa-tion can be extracted from digital images, facilitating the automatic understandingof contents of the images. The analysis of several attributes in images is achievedusing variety of different lower lever image processing techniques.

One of the most important stages in image analysis is segmentation: a method

4 Image analysis 37

for automatic identification of objects in images. In segmentation, the image is di-vided into regions of equal characteristics determined by the nature of the analysistask in question. Several approaches exist for segmentation, some of them verysimple and some more complex. One way to categorize different segmentation ap-proaches is to group them into thresholding, edge and region based methods. Withthresholding based methods, an appropriate intensity threshold for discriminatingregion of interest from the other parts of the image is determined. Edge basedmethods categorize images into different areas utilizing strong edges between theareas originating from different sources. Finally, the region based methods aim atgrowing homogenous regions from predefined seed points.

4.2 Microarray image analysis

Image analysis provides a measurement framework for microarray technology, andtherefore is an essential part of each microarray experiment. As a result of thoroughimage analysis, scanned microarray image is quantified into gene expression val-ues. In this chapter, the different stages of microarray image analysis are explained,and some challenges arising from the stages are discussed.

Beyond the experimental differences between the different microarray plat-forms, the image analysis approaches can also vary. Academically, the analysis ofcDNA microarray images has acquired far more attention than of commercial plat-forms. Thus, the main focus of this chapter is concentrated on the methods used foranalysis of cDNA microarray images. However, in addition to the cDNA microar-ray experiments, many commercial manufactures utilize similar array structure fore.g. oligonucleotide microarrays. Therefore, the basic theory of cDNA image anal-ysis can be generalized to other spotted microarrays. Finally, a brief introductionto analysis of Affymetrix GeneChip images is given.

4.2.1 Gridding

As an initial stage in the microarray image analysis, the location of each spot onthe array needs to be determined. This stage is often referred as gridding, gridalignment, or spot addressing. Due to the congruency of this chapter, this stageis referred as gridding. Basically, gridding serves two separate objectives. First,a tentative location of each spot needs to be detected. Extensively varying back-ground and various error sources in microarray images will hinder the possibilityof segmenting all spots globally. Therefore, the gridding information is used forprocessing each spot locally. By analyzing each spot individually, clearly morereliable segmentation is obtained. Second, each spot on the image needs to be ad-dressed unambiguously to a particular spot in the physical experiment slide; that is,to a certain gene. Understandably, the whole experiment loses its meaning, if theexpression levels on the array are addressed to wrong genes.

Essentially, the physical architecture of the microarray is defined by the print-ing process. For example, the number of rows and columns in the array, as well asthe number of subgrids, is known in advance. Although this knowledge of the basicstructure of the microarray image can facilitate the gridding, it does not provide ac-


curate solution for the problem. In an ideal case, a microarray image would consistof an entirely regular, non-tilted grid of equal-sized spots. Nevertheless, severalerror sources in real-life microarrays make this hypothesis overly simplistic; spotscan have irregular shapes and sizes, the space between rows and columns may vary,subgrids may have different orientations, and several artifacts can degrade the im-age.

The problematic nature of finding initial spot locations has encouraged theusage of several approaches for microarray gridding. Both manual and automaticmethods have been introduced as a solution. In manual gridding, the experimenteradjusts a grid manually on the microarray image using some software tool. Clearly,such approach is very time consuming and prone to human errors. In addition,a reproduction of manual gridding is practically impossible. A natural extensionfor manual approach is semiautomatic gridding, where the final spot locations areobtained by manually fine-tuning some automatically generated grid proposal. Allgridding methods requiring human interaction are time consuming and the resultsnon-reproducible. Thus, a truly high-throughput microarray analysis is not ob-tained until the spots are located automatically.

Several methods have been proposed for automatic gridding of microarray im-ages. A common approach uses horizontal and vertical projections of the image. InFigure 4.2 this method is described in more detail. Almost universally, automaticmethods aim at locating orthogonal rows and columns from the image, instead ofprocessing each spot individually. In such case, misalignments or irregular sizesof individual spots can acts as a significant error sources. Recently, some efforthas been made to avoid these errors by gridding spots individually. Unfortunately,some automatic methods still lack required robustness, and therefore manual grid-ding has sustained its popularity. Commercial microarray manufacturers have triedto solve the gridding problem by using physical markers on the array. Using thesemarkers, the grid can be automatically aligned to approximately correct location,and only local adjustment is needed.

4.2.2 Segmentation

Suggestive identification of rough spot locations is followed by segmentation. Insegmentation, the scanned microarray image is divided into regions of two classes:foreground (spot) and background. Ideally, the segmentation would be rather trivialprocess, if all the spots had circular shape, similar size, and homogenous intensity.Instead, fairly common, bad quality of microarrays makes the segmentation morechallenging.

Several approaches have been proposed for microarray spot segmentation. Es-sentially, the variety of different algorithms reflects the complexity of the prob-lem. Currently, different methods have different advantages as disadvantages, andhence, single superior method is not available. In the simplest approach, fixed circlesegmentation, all the spots are assumed to be circular and of same size. Circularmask of fixed radius is placed on each identified spot location. Everything insidethe mask is considered as foreground, and outside as background. Typically inreal-life microarrays, the hypothesis of totally circular spots with similar sizes is

4 Image analysis 39

Figure 4.2: A popular approach for gridding. Horizontal and vertical projections areused for separating each spot from each other. Each pixel is summed according to bothhorizontal and vertical axes. Thereafter, grid lines are drawn in local minima of bothprojections.

often overly optimistic. Therefore, the algorithm frequently classifies pixels incor-rectly. On the other hand, although classification errors can have a drastic effecton single spot measurement, fixed circle segmentation can also be considered as ahighly robust algorithm. At the time many more sophisticated segmentation meth-ods are mislead, for example, by high noise level, or contamination artifacts, theuse of fixed circle segmentation will guarantee a measurement from equal numberof pixels for each spot. A natural extension for this algorithm is adaptive circlesegmentation, which estimates the circle radius independently for each spot. More-over, an example of more sophisticated approach is watershed segmentation. Thealgorithm utilizes watershed transformation for morphologically filtered image. Anillustration of segmentation results for these three example algorithms is given inFigure 4.3.

A B C

Figure 4.3: Illustration of segmentation results for (A) fixed circle, (B) adaptive circle,and (C) watershed algorithm. Different approaches produce different results for the seg-mentation. In this example, watershed segmentation clearly distinguishes the spots betterfrom the background than other algorithms.


4.2.3 Information extraction

As the final stage in microarray image analysis, the intensity for each spot is esti-mated. The intensity of each pixel being proportional to the level of hybridizationon the specific array location, the total fluorescence can be estimated by using allpixels inside the segmented spots. In general, the fluorescence level is estimatedby calculating the mean of pixels inside the segmented spot. Occasionally also me-dian or trimmed averages can be used for the estimation. The results obtained in theinformation extraction stage are highly dependent on the success of segmentation.Because the division between the background and foreground pixels was done inthe segmentation stage, corrupted segmentation results may unavoidably give riseto incorrect expression estimates in the information extraction stage.

Typically, the measured fluorescent intensity is considered as a combinationof true signal arising from the hybridization of target molecules, and backgroundcomponent originating from unspecific hybridization, contamination and other ar-tifacts. For this reason, the level of true expression is estimated by subtracting anestimate of background level from the measured fluorescent intensity. Thenceforth,the intensity values are said to be background corrected. Whereas the estimation ofthe total fluorescence was straightforward, the estimation of the current backgroundlevel is more challenging.

Several approaches have been proposed for background estimation. Funda-mentally, the different background estimation approaches can be categorized intolocal and global methods. The distinction originates from the difference in in-terpreting the existence of additive background. While the global methods relyon assumption about constant background for the whole image, the local methodsassume that the background level varies locally throughout the image. However,global background estimation methods are generally considered as unreliable, sincethe background level can vary extensively throughout the whole array. On the otherhand, estimation of background level only in local neighborhood of each spot canbe highly sensitive to noise. A common approach for background estimation isto calculate the mean of pixels inside the local neighborhood of spot. The back-ground correction has proven to be a problematic issue since sometimes the esti-mated background level can exceed the level of estimated foreground level whichshould indicate negative expression level. For this, and other problems arising fromthe background correction, more robust estimators have been tried to develop.

4.2.4 Quality control

Image analysis offers sophisticated tools for automatic microarray quality control.Some of the spots may be corrupted due to the experimental errors or failures inthe printing stage, as illustrated in Figure 4.4. At worst, erroneous measurementscan introduce expression patterns leading to incorrect biological conclusions. Cur-rently, the quality of each microarray spot is often estimated individually by a hu-man experimenter. Naturally, the number of microarray spots being tens of thou-sands, such approach is extremely time consuming and prone to human errors.

Essential part of the automatic quality assessment is the selection of appropri-ate features. First, the features believed to reflect the quality of spots are selected.

4 Image analysis 41

Examples of such features could be spot morphology, size, intensity, and homo-geneity. Second, using image processing techniques these features are evaluatedin each spot. Finally by combining the results, an individual quality value can beassessed automatically for each spot. By assessing the spot quality automatically,entirely equivalent analysis can be guaranteed for every spot.

A B C

Figure 4.4: Varying quality of microarrays makes the image analysis more challenging.Here, some common errors are illustrated both in two and three dimensions: (A) Spot withhole inside, (B) spot contaminated with high intensity noise, and (C) connected spots.

4.2.5 Affymetrix GeneChip analysis

Affymetrix GeneChip platform differs extensively from normal cDNA microarraytechnology. More detailed description of GeneChips is given in Chapter 2. In im-age processing point of view, first fundamental difference between GeneChips andother microarrays is the image storage. While cDNA microarray images are oftenstored using popular TIFF format, GeneChip images are stored with Affymetrix’sself-developed image format. Therefore, raw GeneChip images are not compatiblewith normal image viewers.

Because of the commercial aspect, GeneChip images are traditionally ana-lyzed using Affymetrix’s softwares. Thus, the development of image analysis al-gorithms is mostly driven by Affymetrix’s own software development. The mainobjective of image analysis is calculating a single intensity value for each probecell on a scanned array. First, a grid separating each probe cell from each other isautomatically aligned on the image. Automatic gridding is guided by the physicalmarkers on the corners of the array. Later versions of Affymetrix’s analysis soft-ware are capable of fine-tuning the initial grid by locating the minimum variance ofcell pixels. Finally, the bordering pixels of the probe cells are excluded, and repre-sentative intensity values for hybridization levels are calculated. An intensity valuefor each probe cell is stored for the subsequent analysis. Thereafter, the majority ofpreprocessing is done using cell intensities; that is, single intensity values for eachprobe. In detail, this aspect is concerned in Chapter 2. Currently, image analysishas a lesser role in Affymetrix platform than in spotted microarrays. However, al-though Affymetrix has decided to give less weight for analyzing images, automaticimage analysis contains as many possible usage scenarios for GeneChip images


than for traditional microarray images.

Figure 4.5: The main difference between GeneChips and spotted microarrays are inphysical architecture. Instead of spots, GeneChip arrays consist of rectangular probe cellsattached to each other. However, despite the differences in physical architecture, bothsystems still share some similar concepts in image analysis stage. For example, bothplatforms use gridding as a primary image analysis stage.

Suggested reading

1. Gonzalez, R.C., Woods, R.E. (2002) Digital Image Processing, Prentice-Hall, New Jersey.

2. Russ, J.C. (1999) The Image Processing Handbook, CRC Press, Boca Raton.

3. Zhang, W., Shmulevich, I., Astola, J. (2004) Microarray Quality Control,John Wiley & Sons, New Jersey.

4.3 Using image analysis software

Please see the book web site at http://www.csc.fi/molbio/arraybook/ for ex-tra material on using different image analysis software.

5 Overview of data analysis 43

5 Overview of data analysis

M. Minna Laine, and Jarno Tuimala

In this book, we emphasize to microarray data analysis after the microarrayshave been hybridized, scanned and the images have been analyzed with an imageanalysis software. Before any experiments in the laboratory have been initiated,the experiment and its analysis should be planned carefully. Chapter 6 describespoints that need to be considered when designing the experiments.

With an enormous amount of data, we need standardized systems and tools fordata management in order to publish the results in a proper and sound way, as wellas to be able to benefit from other publicly available gene expression data. Theseaspects are discussed in Chapter 7.

The following schematic workflow outlines the basic analysis steps of ex-ploratory data analysis.

5.1 cDNA microarray data analysis

Chapter 4 gives an overview of image analysis. We then proceed to the data anal-ysis with the results from scanned images. At this point, images have been eval-uated, bad spots have been investigated and the spots have preferably been scoredwith flags indicating whether the spot was good, bad, or borderline. This is crucial,because in the later stages of the analysis the visual inspection of individual spotsis not possible.

Next steps in the analysis are preprocessing (Chapter 9), normalization (Chap-ter 10), and quality control (Chapters 8, 9, and 10). The goal of these analyses is toorganize results in a meaningful order: flags, controls, and experiments are pointedout and checked. Variation due to systematic errors are removed, and data fromdifferent chips are made comparable.

Statistics is needed in many steps during the analysis. Therefore, the wholeChapter 8 has been dedicated to basics of statistics. Statistical tools are used forevaluation of the raw data during the above-mentioned steps, and in the furtheranalysis to find significantly differentially expressed genes.

In these further analysis steps, statistically significant, quality-checked dataare separated from not interesting and not-trustworthy data (Chapter 9). The nextstep is to find the differentially expressed genes using statistical tools (Chapter11), or to group the good quality data (usually only a small fraction of the original


raw data) into meaningful clusters by e.g. clustering (Chapter 12). The goal ofclustering is to find similarly behaving genes or patterns related to time scale, timepoint, developmental phase or treatment of the sample.

At this point, we have already manipulated our data quite a lot, and spentconsiderable time with a computer. Despite that and depending on what we arelooking for, we may be at the very beginning of the challenging part of the dataanalysis. Next we need to link the observations to biological data, to regulationof genes, and to annotations of functions and biological processes. This part, datamining, is described in Chapters 13, 14 and 15.

5.2 Affymetrix data analysis

Putting model-based methods aside, exploratory data analysis using Affymetrixchips is very similar to cDNA microarray data analysis. The biggest differenceis normalization. If comparison analysis (see 2.7) is conducted, the Affymetrixdata can be treated similarly to a background corrected cDNA data. If single arrayanalysis is performed, the basic normalization scheme, e.g., RMA preprocessingcan be used.

5.3 Data analysis pipeline

To get an overview of the data analysis pipeline, consult Figure 5.1. It covers thebasic methods introduced in this book. The flow chart also helps to choose the rightmethod for the situation. The Table 5.1 contains a short list of analysis methods,and links to the chapters, where more details are available.

The flow chart should be read flexibly, though. For example, there can beseveral filtering steps instead of just one shown on the chart. Or, when there aremore than two conditions in the experiment, the data can be analyzed using twoconditions route. Note that all possible orders of analysis have not been shown forclarity, and the schema is only meant to help you in the analysis process, not to betaken as an absolute truth.

There are different opinions on which order the steps should be taken, andwhich portion of data should be transferred to the next step, so please, study therelevant literature to find out more about limitations and characteristics of differentanalysis steps. We would very much like to encourage you to make your owneducated decisions, and not to take this book with a face value!

5 Overview of data analysis 45

Figure 5.1: Overview of data analysis methods discussed in this book. Note that allpossible orders of analysis have not been shown.


Table 5.1: Chapters for methods shown in Figure 5.1.

Method Chapter

Experimental design 6

Data management 7

MIAME standard 7

MAGE standard 7

Basic statistics 8

Background correction 9

Calculation of expression change 9

Quality checking 9

Filtering 9

Normalization 10

Single slide methods 11

t-test 11

ANOVA 8

Clustering 12

Classification 12

PCA 12

Function prediction 12

Gene regulatory networks 13

Promoter sequence analysis 14

Annotations 14, 15

Ontologies 14, 15

6 Experimental design 47

6 Experimental design

Garry Wong

6.1 Why do we need to consider experimental design?

Good experimental design, above the myriad of other considerations, will likelyprovide you with the greatest amount of satisfaction and least amount of frustra-tion in executing a microarray project. In the early days of microarray research, asimple design of control v. treatment and replicates of each would suffice. How-ever, those days are rapidly going by the wayside and 30–80 chip experiments ina single design are becoming common. While the design depends primarily onthe scientific question that is being proposed (hypothesis), other considerations,such as appropriate controls, platforms, and statistical issues merit serious thought.This is to say nothing about costs! While many projects have similar objectives, itwould be foolish for us to design protocols in a guidebook without intimate knowl-edge of the system and the intricacies involved in the study. Moreover, many fieldshave specialized and/or traditional methodologies that require non-typical designs.Nonetheless, there are several variables in the experimental design that are commonto most studies, and we will try to describe them here. We hope that the consider-ation of the strengths and weaknesses of each variable on both the biological andstatistical level would help the individual investigator better design his/her own mi-croarry study. Most of the discussion in this chapter concerns biological aspectsof expression arrays, although places where SNP arrays are relevant are pointedout. Statistical issues to be considered are found in Chapter 8, especially in section8.10.

6.2 Choosing and using controls

The control serves as a reference for an experiment commonly termed the treat-ment. The treatment can be chemical (drugs, toxins), biological (viruses, microbes,transgenes) or environmental (stress, irradiation) in nature. The individual treat-ments can be given in a time series (time course) or at different doses (dose re-sponse). In all these cases, the control should be matched as closely as possiblegenetically. This can mean that the controls are sibs or that the animal used is aninbred strain, or a combination of the two. Controls with the same environmental


influences are usually dealt with by using litter siblings that have been raised iden-tically. Physiological matching can be done by taking the same sex, age, and healthstatus.

Controls for human tissues are problematic since age, cause of death, and post-mortum interval are difficult to match. Moreover, human tissues for research pur-poses are often stored for unequal intervals, which greatly affects RNA quality.Because of this potential confounder, many human studies have been done usinghuman cell lines rather than tissues. Another solution is to match tissue by takingdifferent regions from the same human tissue, one that is healthy to serve as thecontrol, and one that carries the disease as the treatment sample.

Controls for mice studies have a different problem in that some transgenicmice are crosses between xxx and yyy strains. Therefore, both transgenic and non-transgenic littermates may have quite different genetic backgrounds. The solutionwould be to backcross to one of the parental strains until both the control and trans-genic mouse have the same genetic background. This is somewhat time consumingsince mice have a reproductive cycle of 2–3 months. Another solution would beto make sure that the transgenic lines that are produced have a homogeneous back-ground.

Controls for cell based experiments generally consist of identical cultureswithout the physiological, physical, or chemical treatments. The controls may alsoinclude cells derived from other sources such as the equivalent or healthy tissues.When cells are cocultured, the controls become more complicated. Cultures ofeach of the individual constituent cell populations in addition to a coculture couldbe used as a control. In practice this means that if a coculture of A and B cells ismade, then the controls will be A alone, B alone, A and B together.

Because of all these potential variables that could effect the microarray results,it often makes sense to create a design with more than one control within the study.

6.3 Choosing and using replicates

Replicates are repeated experiments with the same sample (Table 6.1). Replicatesprovide a measure of the experimental variation, such as in RNA isolation, labelingefficiency, or in chip quality. It is highly recommended to include replicates in yourdesign, otherwise you will not know the inherent experimental variation. Dupli-cates are common practice and triplicates provide even more information. Somedesigns incorporate combinations of duplicates and triplicates. In some cases, itmay not be practical to perform replicates due to lack of tissue or other reason.In these cases, replicated spots on the chip can provide some idea of the varia-tion in hybridization. Including a large number of both controls and treatmentswithout replicates is another design strategy that has been used. Confirmation ofresults from non-replicated experimental designs using independent methods suchas quantitative PCR and Northern blotting is also one way to circumvent the needfor replicates.

6 Experimental design 49

6.4 Choosing a technology platform

For expression arrays, there are two principle platforms: spotted arrays and in situsynthesized arrays such as Affymetrix. The basic principles have been presentedin the introduction and detailed aspects are covered later in the book. Apart fromthe experimental design issues presented here, the Affymetrix platform uses onelabeling reaction per chip, so a control reaction is performed and hybridized andthen the treatment reaction is performed on a different chip. The results are thencompared after having been normalized by independent spiked standards. This isin contrast to expression arrays where there is a direct comparison between controland treated. While not completely obvious, the advantage of the Affymetrix plat-form is that multiple experiments can be compared directly if they are performedon the same chip series. Thus, many of the steps that can cause experimental vari-ation are controlled. The labeling, hybridization, and readout is done on identicalinstruments with identical protocols on identically designed chips, so other thannormalization to an external standard, all experiments performed on the same chipseries should be comparable. This adds tremendous power to the experimental de-sign. Since control and treatment are hybridized on different Affymetrix chips,the design requires twice as many chips compared to expression arrays. However,controls can be compared with all other samples so there is a small benefit in thisdesign if many controls are needed. Moreover, comparison to publicly availabledata can be performed. Affymetrix platform has the added advantage of multipleprobes for each gene, so in a sense, you get 16–25 measurements for each generather than the 1 or 2 from typical expression arrays. This is especially useful ifyou wish to hybridize to genes with degenerated sequences such as splice forms,highly polymorphic, or related species. The selection of gene chips available fromAffymetrix is more limited than currently available expression arrays, but containsthe major species used in scientific research including Saccharomyces cerevisiae(yeast), Arabidopsis Thaliana (mustard weed), Caenorhabditis elegans (worm),Mus musculus (mouse), and Homo Sapiens (human).

6.5 Gene clustering v. gene classification

It often makes more sense to identify genes and their related expression levels thatpredict the experimental group they are derived from (gene classification) than tofind groups of genes that act in a particular way after treatment (gene clustering).For classification studies, large numbers of samples for each treatment groups areneeded in order to robustly find a set of genes that predicts their original treatmentgroup. The large number of samples is also necessary to validate the predicted clas-sification based on training sets derived from the data. In this case of using geneexpression to predict human tumor type as a diagnostic tool, individual variationfrom human subjects, and variation in the samples due to which part of the tumorwas dissected, how the tissue was obtained, and the other experimental variablestypically suggests that random variation in the profiling is unavoidable, further in-dicating that sample sizes should be large.

In contrast, the number of genes profiled need not to be large if the genes


Table 6.1: Typical experimental designs and number of microarrays needed.

Experimental design Samples Replicates Cy3/Cy5 Affymetrix

Control v. Treated 1 1 1 2Control v. Treated 1 3 3 6Control v. Treated 5 2 10 12Control v. Treatments (4) 1 1 4 5Control (2) v. Treatments (2) 1 1 4 4Control (2) v. Treatments (2) vs.time points (4)

1 1 16 16

Control (2) v. Treatments (2) vs.time points (4)

1 3 48 48

Control (2) v. Treatments (2) vs.time points (4)

4 2 64 96-112

Controls (1) v. Treatments (2) vs.time points (4)

6 2 96 144

consistently predict the classification group. This is unfortunately not the usualcase.

6.6 Conclusions

The experimentalist is likely to know the most about the scientific question andtherefore have the greatest input into the experimental design. Nonetheless, con-sultations with a statistical expert prior to experiments could help to decrease exper-imental variation derived from the limitations listed above. Two excellent reviewarticles are listed at the end of this chapter. In the end, there are many factors influ-encing the successful outcome of a microarray study, and attention to as many aspossible will greatly aid in the final outcome.


1. Churchill, G. A. (2002) Fundamentals of experimental design for cDNA mi-croarrays. Nature Genetics supplement, 32, 490-495.

2. Imbeaud, S., Auffray, C. (2005) ’The 39 steps’ in gene expression profiling:critical issues and proposed best practices for microarray experiments. DrugDiscov. Today, 10, 1175-1182.

3. Yang, Y. H. (2002) Design issues for cDNA microarray experiments. NatureReviews, 3, 579-588.

7 Reporting results 51

7 Reporting results

M. Minna Laine and Teemu Toivanen

7.1 Why the results should be reported

Although microarrays work best as a screening tool, sometimes the final results thatyou wish to publish lean largely on the microarray data.

There are several good reasons why you should make your microarray datapublicly available. It will be beneficial for you and others. Analyzing several good-quality, well-documented experiments together will ease the task of finding actualcorrelations between the gene expression. Since the microarray experiments areexpensive, adding publicly available information from other microarrays into yourown findings will save money.

You should think of publishing your microarray results already from the verybeginning of the experimental work. Later on, it may be very difficult to “mine yournotebook” for all the details on how you performed your experiments and how youhandled your data.

Microarray data without detailed information about the experimental condi-tions and analysis steps are useless, therefore, they should always be accompaniedwith that supportive information. To be able to compare your data with other re-searchers’ experiments, all this information has to be reported in a common way,using a widely accepted standard. Such a standard is called Minimum InformationAbout a Microarray Experiment, MIAME.

Furthermore, major scientific journals, first ones being Nature and Cell, aswell as EMBO Journal, have already set a condition that the microarray data in thesubmitted paper have to be MIAME-compliant and a complete data set submittedto a public data repository, before they accept the paper for publication. This de-mand has become a prerequisite also for several other journals, in a same way aspublishing your sequence data in public databases, GenBank and EMBL. In manyinstances, the approved databases are Gene Expression Omnibus (GEO) at the Na-tional Center for Biotechnology Information (NCBI); ArrayExpress at the Euro-pean Bioinformatics Institute (EMBL-EBI); or the Center for Information BiologyGene Expression (CIBEX) database at DNA Data Bank of Japan (DDBJ).


7.2 What details should be reported: the MIAME standard

The MIAME standard outlines the minimum information that should be reportedabout a microarray experiment to enable its unambiguous interpretation and repro-duction.

The latest version of the MIAME standard can be found from the MGED (Mi-croarray Gene Expression Data) group’s web page, see http://www.mged.org/

miame. Simply, "MIAME is what one scientist should tell another scientist abouthis/her microarray experiment so that the other scientist could reproduce it andverify the results".

Microarray Gene Expression Data (MGED) Society is an international groupof people from different disciplines who has a common task to standardize themethods for presenting and sharing microarray data. In addition to MIAME buildup,MGED Society has several working groups developing microarray standards, suchas The Ontology Working Group (http://mged.sourceforge.net/ontologies/index.php).

The ontology is a defined vocabulary to describe and name different terms(events, data, methods, material) related to microarray research so that the microar-ray data can be annotated based on this ontology. MGED ontology is used as abasis for building the MAGE object model and markup language. MGED ontologyis an integral part of the MIAME standard, and they are being developed together.

7.3 How the data should be presented: the MAGE standard

MAGE (MicroArray and Gene Expression) is a standard for the representation ofmicroarray expression data and it is able to capture information specified by MI-AME.

MAGE has been developed by the EBI and Rosetta Biosoftware and has laterbecame an adopted specification of the OMG (Object Management Group) stan-dards group (http://www.omg.com), but it’s still an ongoing project. MAGE con-sists of three parts: An object model (MAGE-OM), a document exchange format(MAGE-ML), which is derived directly from the object model, and software toolk-its (MAGE-STK), which seek to enable users to create MAGE-ML.

7.3.1 MAGE-OM

MAGE-OM (Microarray Gene Expression Object Model) is an object model forthe representation of microarray expression data that facilitates the exchange ofmicroarray information between different data systems and organizations. MAGE-OM is a data-centric model that contains 132 classes grouped into 17 packagescontaining, in total, 123 attributes and 223 associations between classes. Enti-ties defined by the MIAME standard are organized as MAGE-OM componentsinto packages, such as Experiment, BioMaterial, ArrayDesign, BioSequence, Ar-ray, and BioAssay packages, and the gene-expression data into a BioAssayDatapackage. The packages are used to organize classes that share a common purpose,and the attributes and associations define further the classes and their relations. A


database created using MAGE-OM is able to store experimental data from differ-ent types of DNA technologies such as cDNA, oligonucleotides or Affymetrix. It isalso capable of storing experiment working processes, protocols, array designs andanalyzing results. MAGE-OM has been translated into an XML-based data format,MAGE-ML, to facilitate the exchange of data.

7.3.2 MAGE-ML; an XML-representation of MAGE-OM

MicroArray Gene Expression Mark-up Language - MAGE-ML - is an XML-basedfile format able to capture all MIAME required information, and it is derived di-rectly from MAGE-OM. It is important to remember that valid MAGE-ML filedoes not necessarily mean that it contains all the information required by MIAME,e.g. it can be used to transfer some manufacturer specific data without any expres-sion information. For example, the manufacturer of a chip may deliver to a user aMAGE-ML file containing information about the chip design and production pro-cess. Then the user can import the file directly to a local database. Because ofthe complexity of the model it’s not quite human readable (though it’s XML), but asoftware is needed to support importing and exporting MAGE-ML. Implementationof MAGE-ML include Rosetta Biosoftware, which supports it, and MIAMExpressthat produces files in that format. The EBI has developed a freely available soft-ware tool kit (MAGE-STK) that eases the integration of MAGE-ML into end users’systems.

7.3.3 MAGE-STK

The MAGE Software Toolkit is a collection of Open Source packages that imple-ment the MAGE Object Model in various programming languages. The suite cur-rently supports three implementations: MAGE-Perl, MAGE-Java, and MAGEC++.The idea is to be able to have an intermediate object layer that can then be used toexport data to MAGE-ML, to store data in a persistent data store such as a relationaldatabase, or as input to software-analysis tools.

7.4 Where and how to submit your data

7.4.1 ArrayExpress and GEO

There are two central repositories for microarray data, ArrayExpress (http://www.ebi.ac.uk/arrayexpress) at the EBI, and GEO (http://www.ncbi.nlm.nih.gov/geo/) at NCBI. ArrayExpress and GEO are designed to accept, hold anddistribute MIAME compliant microarray data. Both of them provide you with anaccession number that can be used in the publication analogous to GenBank ID ofa sequence. These are accepted databases to dispose your data by, e.g., Nature andScience.

ArrayExpress supports all the requirements specified by the MIAME standard,as developed by the MGED group and aims to store well-annotated data. In addi-tion to sending your data into ArrayExpress, you can search, browse and retrieve(in MAGE-ML format) other microarray data from it. Currently (December 2005)


ArrayExpress contains close to 1100 publicly available experiments having 30 000hybridizations , and GEO contains 2700 series of samples with 60 000 chips’ (sam-ple’s) data. In addition to ArrayExpress and GEO, a similar database called CIBEXis under development in Japan.

There are several species- or site-specific public data repositories available,such as the Stanford Microarray Database (SMD, http://genome-www5.stanford.edu/MicroArray/SMD/) (for review, see a paper by Penkett and Bähler, 2004).However, the way data are presented in many of these special gene expressiondatabases is quite diverse, or sometimes only the analyzed data have been madeavailable, which makes it difficult to retrieve and evaluate the data, and impossibleto co-analyze them together with your experiments. Besides direct submission us-ing MAGE-ML (to ArrayExpress and GEO) or SOFT (to GEO), microarray datacan be submitted to the public databases via web interfaces. Next we will descibesome basic aspects of such submission tools.

7.4.2 MIAMExpress

MIAMExpress (http://www.ebi.ac.uk/miamexpress/) is a MIAME-compliantmicroarray data submission tool. It has a web interface that allows you in a step-by-step manner to submit your microarray data to ArrayExpress database. First youwill create an account, log in to that account, and then select a submission type:Protocol, Array design or Experiment. Protocol and Array design can be submittedindividually, but the Experiment always need to be accompanied by details aboutProtocol and Array Design to complete the submission. In addition to MIAMEattributes, you can add other qualifiers. If the microarray data is being submittedto a journal, the journal name and status (submitted/in press/accepted) should beindicated. After filling the information about a certain protocol and array design,you next link that information to your experiment. The actual experiment results aresubmitted as scanned files (raw data), that can be either CEL files for Affymetrixdata, or .gpr files, which contain two wavelengths in a same file, for others. Ifyou wish, you can also add a data file corresponding to the transformation of allthe results generated from the raw datafiles in a given experiment. In that case,the submitted protocol should describe all steps necessary to allow a third party torecreate the transformed data file.

7.4.3 GEO

GEO web interface works in a similar manner, although the vocabulary differs; firstyou log in, identify yourself, and then define your platform (array type and spec-ifications, annotated gene list as tab-delimited text file, organism used, etc.) us-ing pop-up menues. Next you bring in sample data (normalized or scaled values).These tables contain columns with standard headings, which can be either requiredor optional, and non-standard headings (descriptive fields), which are user-definedand always optional. Raw values (such as Affymetrix CEL files) are brought intodatabase as supplementary data. Samples (yours and others in GEO database) canbe grouped in series such as dose-response or time course series, or as repeat sam-ples’ group.


7.4.4 Other options and aspects

There are several other MIAME-compliant data management systems or software.To see a current list, please visit http://www.mged.org/Workgroups/MIAME/

miame_software.html. As an example, one of these is an open-source applica-tion called BASE (BioArray Software Environment, http://base.thep.lu.se/).BASE consists of array production LIMS (see below), a relational database, and aweb interface for management and analysis of microarray data. Additional analysistools can be added as plug-ins. In the future, CSC may provide a centralized placeand tools to submit your microarray data to, e.g., ArrayExpress. Since microarraydata is connected to much more information than just that required by MIAMEstandards, it may be reasonable to set up a LIMS (Laboratory Information Manage-ment System) for your laboratory for a storage place to all the information relatedto the microarrays, especially if you use custom-made arrays instead of commercialones.

7.4.5 Your role in providing well-documented data

Although both main data repositories, ArrayExpress and GEO, support MIAME,and recently their submission tools have been improved so that it is ever easierto submit your data in an appropriate form following the MIAME checklist (Fig-ure 7.1), it is still every researcher’s own choice how well they will describe theirdata. The more time and efforts you use when describing and submitting well-annotated data, the more useful your data will be for the research community.


1. Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeck-ert, C., Aach, J., Ansorge, W., Ball, C. A., Causton, H. C., Gaasterland,T., Glenisson, P., Holstege, F. C., Kim, I. F., Markowitz, V., Matese, J. C.,Parkinson, H., Robinson, A., Sarkans, U., Schulze-Kremer, S., Stewart, J.,Taylor, R., Vilo, J., and Vingron, M. (2001) Minimum information about amicroarray experiment (MIAME)-toward standards for microarray data. Nat.Genet., 29, 365-71.

2. Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygu-nawardena, N., Holloway, E., Kapushesky, M., Kemmeren, P., Garcia, L. G.,Oezcimen, A., Rocca-Serra, P., and Sansone, S-A. (2003) ArrayExpress-apublic repository for microarray gene expression data at the EBI, NucleicAcids Research, 31, 68-71.

3. Edgar, R., Domrachev, M., and Lash, A. E. (2002) Gene Expression Om-nibus: NCBI gene expression and hybridization array data repository, Nu-cleic Acids Research, 30, 207-10.

4. Gardiner-Garden, M. and Littlejohn, T. G. (2001) A comparison of microar-ray databases, Briefings in Bioinformatics 2, 143-158.


Experiment Design:

• The goal of the experiment - one line maximum (e.g., the title from the related publication)

• A brief description of the experiment (e.g., the abstract from the related publication)

• Keywords, for example, time course, cell type comparison, array CGH (the use of MGED ontology terms is recommended).

• Experimental factors - the parameters or conditions tested, such as time, dose, or genetic variation (the use of MGED ontologyterms is recommended).

• Experimental design - relationships between samples, treatments, extracts, labeling, and arrays (e.g., a diagram or table).

• Quality control steps taken (e.g., replicates or dye swaps).

• Links to the publication, any supplemental websites or database accession numbers.

Samples used, extract preparation and labeling:

• The origin of each biological sample (e.g., name of the organism, the provider of the sample) and its characteristics (e.g., gender,age, developmental stage, strain, or disease state).

• Manipulation of biological samples and protocols used (e.g., growth conditions, treatments, separation techniques).

• Experimental factor value for each experimental factor, for each sample (e.g., ’time = 30 min’ for a sample in a time courseexperiment).

• Technical protocols for preparing the hybridization extract (e.g., the RNA or DNA extraction and purification protocol), andlabeling.

• External controls (spikes), if used.

Hybridization procedures and parameters:

• The protocol and conditions used for hybridization, blocking and washing, including any post-processing steps such as staining.

Measurement data and specifications:

• Data

– The raw data, i.e. scanner or imager and feature extraction output (providing the images is optional). The data shouldbe related to the respective array designs (typically each row of the imager output should be related to a feature on thearray - see Array Designs).

– The normalized and summarized data, i.e., set of quantifications from several arrays upon which the authors base theirconclusions (for gene expression experiments also known as gene expression data matrix and may consist of averagednormalized log ratios). The data should be related to the respective array designs (typically each row of the summarizeddata will be related to one biological annotation, such as a gene name).

• Data extraction and processing protocols

– Image scanning hardware and software, and processing procedures and parameters.

– Normalization, transformation and data selection procedures and parameters.

Array Design:

• General array design, including the platform type (whether the array is a spotted glass array, an in situ synthesized array, etc.);surface and coating specifications and spotting protocols used (for custom made arrays), or product identifiers (the name ormake, catalogue reference numbers) for commercially available arrays.

• Array feature and reporter annotation, normally represented as a table (for instance see Tables 1, 2 below), including

– For each feature (spot) on the array, its location on the array (e.g., metacolumn, metarow, column, row) and the reporterpresent in the location (note that the same reporter may be present on several features).

– For each reporter unambiguous characteristics of the reporter molecule, including

� Reporter role - control or measurement

� The sequence for oligonucleotide based reporters

� The source, preparation and database accession number for long (e.g., cDNA or PCR product based) reporters

� Primers for PCR product based reporters

– Appropriate biological annotation for each reporter, for instance a gene identifier or name (note that different reporterscan have the same biological annotation)

• Principal array organism(s)

Figure 7.1: The MIAME checklist. Last update January 2005. Reference:http://www.mged.org/Workgroups/MIAME/miame_checklist.html.


5. Gollub, J., Ball, C.A., Binkley, G., Demeter, J., Finkelstein, D. B., Hebert, J.M., Hernandez-Boussard, T., Jin, H., Kaloper, M., Matese, J. C., Schroeder,M., Brown, P. O., Botstein, P., and Sherlock, G. (2003) The Stanford Mi-croarray Database: data access and quality assessment tools, Nucleic AcidsResearch, 31, 94-96. http://genomebiology.com/2002/3/9/research/0046.1

6. Penkett, C. J. and Bähler, J. (2004) Navigating public microarray databases.Comp Funct Genom 2004; 5: 471-479.

7. Saal, L. H., Troein, C., Vallon-Christersson, J., Gruvberger, S., Borg, Å., andPeterson, C. (2002) BioArray Software Environment: A Platform for Com-prehensive Management and Analysis of Microarray Data, Genome Biology,3, software0003.1-0003.6.

8. Spellman, P. T., Miller, M., Stewart, J., Troup, C., Sarkans, U., Chervitz, S.,Bernhart, D., Sherlock, G., Ball, C., Lepage, M., Swiatek, M., Marks, W. L.,Goncalves, J., Markel, S., Iordan, D., Shojatalab, M., Pizarro, A., White, J.,Hubley, R., Deutsch, D., Senger, M., Aronow, B. J., Robinson, A., Bassett,D., Stoeckert, C. J. Jr, and Brazma, A. (2002) Design and implementation ofmicroarray gene expression markup language (MAGE-ML), Genome Biol-ogy, 3, research0046.1-0046.9.

9. Stoeckert, C. J., Causton, H. C., and Ball, C. A. (2002) Microarray databases:standards and ontologies, Nature Genetics, 32, 469-473.


8 Basic statistics

Jarno Tuimala

8.1 Why statistics are needed

One important use of statistics is to summarize a collection of data in a clear and un-derstandable way. There are two basic methods: numerical and graphical. Graphi-cal methods are best at identifying patterns in the data. Numerical approaches aremore precise and objective. Since numerical and graphical approaches complementeach other, both are needed.

8.2 Basic concepts

8.2.1 Variables

A variable is any measured characteristic that differs between subjects. For exam-ple, if the weight of 50 individuals were measured, then weight would be a variable.The numerical values of weight for the subjects are called measurements, scores,data points or observations.

Variables can be quantitative or qualitative. Quantitative variables describecertain quantities of subjects, and can be measured on a continuous or discretescale. On the continuous scale, there is an infinite number of values variables cantake. Discrete variables can take on only a limited number of values. For example,weight is measured on a continuous scale, and school grades on a discrete scale.

Qualitative variables describe the attributes of the subjects, for example, haircolour. They are always measured on a discrete scale.

When an experiment is conducted, some variables are manipulated by the ex-perimenter, and others are measured from subjects. The former are factors or inde-pendent variables, the latter, dependent variables.

8.2.2 Constants

Measures of subjects, i.e. variables, which can take on only one value during theexperiment are called constants. For example, if only women are included in astudy, then the variable sex is constant during the study.

8 Basic statistics 59

8.2.3 Distribution

Distribution describes the scores and the number of scores the variable can take.A continuous variable can be described with a histogram, which is a graphicalrepresentation of the distribution (Figure 8.5). Distributions are more thoroughlycovered in the subsequent sections.

8.2.4 Errors

Errors are inaccuracies in measurements. In the laboratory analyses, there are oftenan innumerable amount of potential sources of errors, but only a few are usually ofparamount importance.

Errors can be systematic (biases) or random. Systematic errors affect the mea-surements such a way that you get a wrong estimate of whatever you are measuring.Systematic errors affect all the subjects similarly. Random errors lack such predic-tivity. For example, one laboratory technician can be biased, if his/her analysisresults are always consistently higher than the results acquired by other techni-cians. If the same technician drops the tube so that the liquid is spilled on the floor,but he/she then slurps the liquid back up and transfers it to the tube with a pipet(a procedure known as linoleum blot), a random error has been created. If all thetubes are always cast on the floor, that would systematically bias the results.

8.3 Simple statistics

8.3.1 Number of subjects

Number of subjects (N) or sample size both describe the same thing: How manysubjects are included in the study. Often the sample size is fixed before the experi-ment is conducted. For example, if 1 000 genes are spotted on the DNA microarray,then the sample size would be 1 000 genes.

8.3.2 Mean

The arithmetic mean (m) of a variable is calculated as the sum of all measurementsdivided by the number of measurements. The mean is a good measure of centraltendency for roughly symmetric distributions, but it can be misleading when usedfor describing skewed distributions, which due to the probable influence of extremescores. For skewed distributions, the trimmed mean or median usually gives moremeaningful results.

8.3.3 Trimmed mean

A trimmed mean is calculated by discarding a certain percentage of the lowestand the highest scores and then computing the mean of the remaining scores. Atrimmed mean is obviously less susceptible to the effects of extreme scores than thearithmetic mean, but it is also a less efficient descriptor than the mean for normaldistributions.


8.3.4 Median

The median is the middle of a distribution: half the scores are above the median andhalf are below. The median is less sensitive to extreme scores than the mean andthis makes it a better measure of central tendency for highly skewed distributions.

8.3.5 Percentile

The percentile is a certain cut-off, under which the specified percentage of the ob-servations lie. The median is the 50th percentile of the distribution. Similarly, the80th percentile is a point of the distribution under which 80% of the observationslie.

8.3.6 Range

The range is the simplest measure of spread: It is the difference between the largestand the smallest value of the distribution. Since range is based on only two valuesit is very sensitive to extreme values.

8.3.7 Variance and the standard deviation

The variance (d) and standard deviation (s or sd) are both measures of spread. Thevariance is calculated as the averaged squared deviation of observations from theirmean. For example, for the scores 1, 2, and 3, the mean is two, and the varianceis: [(1− 2)2 + (2− 2)2 + (3− 2)2]/3 = 0.667. The standard deviation is the squareroot of the variance:

√(0.667) = 0.827. Standard deviation is often more easily

interpreted than variance, because its unit is the same as the unit of measuments.

8.3.8 Coefficient of variation

The coefficient of variation (CV) describes the relative variability of the data. It isespecially good in situations, where distributions of unequal magnitude are com-pared. For example, the absolute variability measured in kilograms is very smallin a mouse population, but very large in an elephant population. In such cases, thecoefficient of variation gives a better description of the variability than the standarddeviation, because we are usually interested in how the variability is related to themean of the species, and because the CV is independent of the absolute values ofthe variable. The coefficient of variation is calculated as

CV = s

m,

where s is the standard deviation and m is the mean.Often the CV is reported as a percentage. In biological research, it is a widely-

used measure of reliability for replicated experiments.


8.4 Effect statistics

8.4.1 Scatter plot

The scatter plot is an important graphical tool for studying the spread and linear-ity of the data. In its simplest form, two variables are plotted along the axes, andmarks are drawn according to these coordinates. For example, green and red chan-nel intensities of DNA microarray experiments can be depicted as a scatter plot(Figure 8.1).

8.4.2 Correlation

The correlation of two variables represents the degree to which the variables arelinearly related. When two variables are perfectly linearly related, the points inthe scatter plot fall on a straight line. Two summary measures or correlation co-efficients, Pearson’s correlation and Spearmans’s rho, are most commonly used.Both of these measure range from perfectly positive linear relationship to perfectlynegative linear relationship, or from -1 to 1.

Pearson’s correlation between two variables is calculated as follows. First acovariance of the variables is calculated. If the observations of the two variables(X and Y ) form pairs like (xi ,yi ) covariance (sXY ) is as

sXY =∑

(xi − x)(yi − y)

n − 1,

where xi and yi are the iths observations of variables, x and y are the meansof the variables and n is the number of pairs of observations or the number ofobservations. Pearson’s correlation (r) is then calculated using the formula

r = sXY

sx sy,

where sx and sy are the standard deviations of the variables.Spearman’s correlation is based on ranks. Before calculating the value of

Spearman’s correlations all the observations in the two variables are arranged fromthe smallest to the largest, i.e., ranked. If the correlation is perfect, the smallestobservations of variable X corresponds to the smallest observation of variable Y ,and so forth until the largest observations. Compatability of ranks can be describedusing the the following formula:

d2i = [R(xi )− R(yi)]

2,

where R(xi ) and R(yi ) are the ranks of the observations of variables. Spear-man’s correlation is then derived from d 2

i by scaling:

rs = 1− 6∑

d2i

n3 −n,

where n is the number of observations. In many statistical programs Spear-man’s correlation is simply calculated using the formula for Pearson’s correlation,but using the ranks as observations.


It is not wrong to calculate the correlation between variables, which are notlinearly related, but it does not make much sense. If the variables are not linearlyrelated, the correlation does not describe their relationships effectively, and no con-clusions can be based on the correlation coefficient only.

Numerical and graphical tools often complement each other very effectively.Correlation and scatter plot are a good example of such a mutually beneficial pairof tools.

0 10000 20000 30000 40000

010

000

2000

030

000

Green channel intensity

Red

cha

nnel

inte

nsity

correlation = 0.962

Figure 8.1: An example of a scatter plot. The red and green channel intensities froma two-color DNA microarray experiment have been depicted as a scatter plot. Variablesseem to have a linear relationship, because a straight line can be drawn through the datapoints. In addition, the linear correlation between these variables is 0.962.

8.4.3 Linear regression

Linear regression is used for describing how much the dependent variable (the pre-dicted variable) changes if the independent variable (the predictor variable) changesa certain bit. It would be wrong to apply linear regression to non-linear data. There-fore, before applying linear regression, the linearity of the data should be checkedfrom a scatter plot (Figure 8.1). In addition, the normality of the dependent variableshould be checked before running the analysis (see the following sections).

Linear regression fits a straight line through the data points so that the squaresum deviation of the line from the data points is minimized. The result of the


procedure is a regression equation

Y = aX +b,

where b indicates where the regression line cuts the vertical axis. If only two vari-ables are used in the linear regression model, a is equal to the Pearson correlation,and indicates the slope of the regression line. It also indicates by which numberchange in X (independent variable) must be multiplied to give the correspondingchange in Y (dependent variable).

0 10000 20000 30000 40000

010

000

2000

030

000


Red

cha

nnel

inte

nsity

correlation = 0.962

Figure 8.2: The two linear regression lines fitted into two-color DNA microarray results.The results are different if green channel intensity (green line) is used as a dependentvariable instead of red channel intensity (red line).

Correlation is completely symmetric. The correlation between variables Aand B is the same as correlation between variables B and A. Linear regression doesnot have the same symmetric property. For linear regression it is absolutely vital toknow which is the dependent variable, in which we want to predict the change. Thisshould be decided when the experiment is planned. For example, we could try topredict how many papers we could print with a certain cartridge and machine, but itwould not be possible to predict which cartridge and machine we have, if we knowthat a certain number of papers have been printed. For microarray experiments,linear regression is sometimes used for normalization. This is generally not a good


idea, because it is not possible to decide which labeled sample (red or green) shouldbe used as an independent and/or dependent variable (Figure 8.2).

−4 −2 0 2 4

0.0

0.2

0.4

0.6

0.8

Normal density

mean=0, sd=1x

dens

ity

0 2 4 6 8 10

0.0

0.2

0.4

0.6

0.8

Normal density

mean=5, sd=1x

dens

ity

−4 −2 0 2 4

0.0

0.2

0.4

0.6

0.8

Normal density

mean=0, sd=2.5x

dens

ity

−4 −2 0 2 4

0.0

0.2

0.4

0.6

0.8

Normal density

mean=0, sd=0.5x

dens

ity

Figure 8.3: Normal distributions of different shapes. The mean defines the place ofthe distribution along the horizontal axis, and standard deviation (sd) defines the shapeof the curve. Small standard deviation means tight, and large standard deviation a flatdistribution.

8.5 Frequency distributions

8.5.1 Normal distribution

Normal distributions are a family of distributions that have the same general shape.They are symmetric around their mean with more measurements in the middle thanin the tails. Normal distributions are sometimes described as bell shaped. Theshape of the normal distribution can be specified mathematically in terms of themean and standard deviation (Figure 8.3).

A standard normal distribution is a normal distribution with a mean of 0 anda standard deviation of 1. Any normal distribution can be transformed to standardnormal distribution by the formula

Z = (X −m)

s,

where X is a score from the original normal distribution, m is the mean of the


original normal distribution, and s is the standard deviation of the original normaldistribution. This mathematical procedure is also called standardization.

There are biological and historical reasons for the widespread usage of normaldistribution: Many biologically relevant variables are distributed approximatelynormally. Mathematical statisticians also work comfortably with normal distri-butions, and many kinds of statistical tests can be and are derived from normaldistributions. Some of these tests will be described in the next chapters. We havealready had a peek at one application of normal distribution, linear regression.

−4 −2 0 2 4

0.0

0.1

0.2

0.3

0.4

Normal density

mean=0, sd=1x

dens

ity

−4 −2 0 2 4

0.0

0.1

0.2

0.3

0.4

t density

mean=0, sd=1, df=1x

dens

ity

−4 −2 0 2 4

0.0

0.1

0.2

0.3

0.4

t density

mean=0, sd=1, df=10x

dens

ity

−4 −2 0 2 4

0.0

0.1

0.2

0.3

0.4

t density

mean=0, sd=1, df=100x

dens

ity

Figure 8.4: The mean of the t-distribution defines its place on x-axis. The shape ofthe distribution is defined by standard deviation (s) and degrees of freedom (df). Stan-dard deviation defines the tightness of the distribution, whereas df defines the amount ofresemblanse to normal distribution.

8.5.2 t-distribution

The t-distribution resembles normal distribution, and in mathematical terms it isan approximation of the normal distribution for small samples sizes. T-distributiondiffers from the normal distribution, because it has an additional parameter, degreesof freedom (df), which affects its shape. Degrees of freedom reflect the sample size.

Degrees of freedom can take on any real number greater than zero. A t-distribution with a smaller df has more area in the tails than the distribution with alarger df. As the df increases, the t-distribution approaches the standard normal dis-


tribution. For practical purposes, the t-distribution approaches the standard normaldistribution relatively quickly, so that when df = 50, the two are virtually identical(Figure 8.4).

Histogram of gmean

gmean

Fre

quen

cy

0 10000 20000 30000 40000

020

040

060

080

010

0012

00

Figure 8.5: A typical histogram of the DNA microarray intensity values of one channel(gmean). The distribution is highly skewed to the right. Median, mean and 68% standarddeviation are depicted as red, green, and blue vertical lines, respectively.

8.5.3 Skewed distribution

A distribution is skewed if one of its tails is longer than the other. Distributionshaving a longer tail on the right are positively skewed, and distributions having alonger tail on the left are negatively skewed. The best way to identify a skeweddistribution is to draw a histogram of the distribution (Figure 8.5). Numerically,a skewed distribution can be identified by comparing the mean and median of thedistribution. If the mean is larger than median, the distribution is skewed to theright. For left skewed distributions the mean is lower than median.

8.5.4 Checking the distribution of the data

Many statistical tests assume that the data is normally distributed, which meansthat the histogram drawn from the values of the variable resembles the normaldistribution. Even the most basic descriptive statistics can be misleading if the


distribution is highly skewed. For example, standard deviation does nor bear ameaningful interpretation if the distribution significantly deviates from normality(Figure 8.5).

Normality of the data is most easily checked from an appropriate histogram . Ifthe distribution is, as judged by eye, approximately symmetric and does not containmore than one peak, it can be assumed to be normally distributed (Figure 8.5 andFigure 9.6). Other graphical possibilities include a density plot, which is basicallya smoothed histogram (Figure 8.6).

−4 −2 0 2 4

0.0

0.5

1.0

1.5

2.0

Normal density

mean=0, sd=1intensity

dens

ity

Figure 8.6: Some real data compared to a normal distribution. Black, normal distribu-tion; red, non-transformed data; green, log-transformed data. The log-transformation wasperformed, because it often makes highly skewed distribution more normally distributed.The values have been standardized before plotting.

A more formal way to test for the normality of the data is the normal prob-ability plot. The sample values and the theoretical values assuming normality areplotted against each other in the normal probability plot. If the points fall on theline, the distribution is normal (Figure 8.7).

Both the histogram and normal probability plot are valid methods for checkingthe normality of the data, but the plot can detect much smaller deviations fromnormality than histogram.


−4 −2 0 2 4

020

000

4000

0

Normal Q−Q Plot

log−transformed raw dataTheoretical Quantiles

Sam

ple

Qua

ntile

s

−4 −2 0 2 4

911

1315

Normal Q−Q Plot

log−transformed raw data, no NAsTheoretical Quantiles

Sam

ple

Qua

ntile

s

−4 −2 0 2 4

020

000

4000

0

Normal Q−Q Plot

log−transformed raw dataTheoretical Quantiles

Sam

ple

Qua

ntile

s

−4 −2 0 2 4

911

1315

Normal Q−Q Plot

log−transformed raw data, no NAsTheoretical Quantiles

Sam

ple

Qua

ntile

s

−4 −2 0 2 4

02

46

Normal Q−Q Plot

standardized raw dataTheoretical Quantiles

Sam

ple

Qua

ntile

s

−4 −2 0 2 4

−1

01

23

Normal Q−Q Plot

standardized log−transformed dataTheoretical Quantiles

Sam

ple

Qua

ntile

s

Figure 8.7: The normal probability plots. The raw data has been pushed through differentkinds of transformations, and the results are checked for normality. The standardized log-transformed data was drawn with green in the Figure 8.6. According to the plots, none ofthe data sets is normally distributed, because the datapoints do not fall on a straight line.

8.6 Transformation

Transformation means a mathematical procedure, where a new variable is con-structed or derived from the original one by applying a certain function or formula.Data is often transformed, because the original data does not fulfill the distribu-tional presumptions. For example, many statistical procedures assume that the datais normally distributed. If this is not the case, transformation can be applied in sucha way that the distribution becomes more normal-like.

8.6.1 Log2-transformation

Log2-transformation is often used with DNA microarray experiments. Usually, theintensity ratio is log2-transformed. The resulting new varible is called log ratio.The increase of one in the log ratio means that the actual intensity or expressionhas doubled.


0 10000 20000 30000 40000

010

000

2000

030

000

correlation=0.962


Red

cha

nnel

inte

nsity

Figure 8.8: A scatter plot where one probable outlier has been marked with a red dot.

8.7 Outliers

Outliers are by definition atypical or infrequent observations, data points which donot appear to follow the characteristic distribution of the rest of the data. Thesemay reflect the properties of the variable, or be due to measurement errors or otheranomalies which should not be included in the analyses.

Typically, we believe that outliers represent random errors, which we wouldlike to control or get rid of. Outliers can have many undesirable properties: Theyoften attract the linear regression line, which might lead to wrong conclusions.They can also artificially increase the correlation coefficient or decrease the valueof the legitimate correlation. The measure of spread, range, is unreliable if the dataincludes outliers.

Outliers are often excluded from the data after the superficial analysis withquantitative methods. For example, observations that are outside the range of 2standard deviations from the mean, can be discarded. However, definition of anoutlier is subjective, and in principle the decisions should be made on individualbasis for every suspicious observation.

There are two graphical methods with which to identify outliers, scatter plotand box plot (Figure 8.8 and Figure 8.9).


010

000

2000

030

000

4000

0

Typical box plot

Gre

en c

hann

el in

tens

ity

Minimum value25% quantile

Mean

Median

75% quantile

Maximum value

Figure 8.9: An example of a box plot. Outliers are often indicated with individual markson the top of the boxplot.

8.8 Missing values and imputation

The missing values can sometimes markedly change the results, if only mean ormedian values are inspected. Take, for example, the following experiment. Wewant to find the genes, which can be used for dividing patients into two cancersubtypes, those responding and not responding to the treatment. We measure theexpression values (log ratios) in the these subtypes, and plot the results.

Responsive group

No. of obs. 20 3 4 2 1

|------|-------------------------------|

Expression 0.0 mean 5.0 10.0

Non-responsive group

No. of obs. 0 3 4 2 1

|----------------------|---------------|

Expression 0.0 5.0 mean 10.0


In the example, the missing values influence the mean very much. In a sense,missing values draw the mean towards them. Even the median would not havehelped us here, because it would have scored 0 for the responsive group!

Missing values are usually either removed from the analysis or estimated usingthe rest of the data in a process called imputation. These methods are covered inmore detail in the next chapter.

8.9 Statistical testing

Statistical testing tries to answer the question: “Can the difference between theseobservations be explained by chance alone?” or “How significant is this differ-ence?”. Statistical testing can also be viewed as hypothesis testing, where twodifferent hypotheses are compared. For example, we can test whether cancer pa-tients and their healthy controls differ statistically significantly for the expressionof the gene X.

8.9.1 Basics of statistical testing

A suitable statistical test can give us an answer to the questions above. In practise,statistical testing is divided into several substeps:

1. Select an appropriate statistical test.

2. Select a threshold for p-value.

3. Form the pair of hypotheses you want to compare.

4. Calculate the test statistic.

5. Calculate degrees of freedom.

6. Compare the test statistic to the critical values table of the test statistic distri-bution.

7. Find out a p-value, which corresponds to the test statistic.

8. Draw conclusions.

These substeps are covered in more detail in the following subsections.

8.9.2 Choosing a test

There are at least a couple of hundred of different statistical tests available, but onlya few of those are covered here in more detail. Usually the means of certain groupsare compared in the microarray experiments. For example, we can test whether theexpression of a certain gene is higher in the cancer patients than in their healthycontrols. The tests introduced here compare the means of two or several groupswith each other.

When choosing a test, there are two essential questions, which need to beanswered: Is there more than two groups to compare, and should we assume thatthe data is normally distributed?


If two groups are compared, there are two applicable tests, the t-test and Mann-Whitney U test. If more than two groups are compared, an analysis of variance(ANOVA) or a Kruskal-Wallis test is used. The ANOVA procedure is covered inmore detail in section 6.10 of this book. Note, that if the ANOVA procedure isapplied to two group means only, it will produce the same results as the t-testsintroduced in the next sections.

If the data is normally distributed (see the tests for this in section 6.5.4), thet-test for two groups, or the ANOVA for multiple groups can be used for compar-isons. If the data is not normally distributed, and each group has at least five obser-vations, the Mann-Whitney U test or Kruskal-Wallis test can be applied. However,if less than five observations are available per group, it is better to use the t-test orANOVA.

Tests, which assume that the data is normally distributed, are called parametrictests. Non-parametric tests do not make the normality assumption.

8.9.3 Threshold for p-value

The p-value is usually associated with a statistical test, and it is the risk that wereject the null hypothesis (see section 6.9.4), when it actually is true. Before test-ing, a threshold for p-value should be decided. This is a cut-off below which theresults are statistically significant, and above which the results are not statisticallysignificant. Often a threshold of 0.05 is used. This means that every 20th time weconclude by chance alone that the difference between groups is statistically signif-icant, when it actually isn’t.

If the compared groups are large enough, even the tiniest difference can geta significant p-value. In such cases it needs to be carefully weighted whether thestatistical significance is just that, statistical significance, or is there a real biologicalphenomenon acting in the background.

8.9.4 Hypothesis pair

Before applying the test to the data, a hypothesis pair should be formed. A hypoth-esis pair consists of a null hypothesis (H0) and an alternative hypothesis (H1). Forthe tests described here, the hypotheses are always formulated as follows.

H0 = There is no difference in means between compared groupsH1 = There is a difference in means between compared groups

8.9.5 Calculation of test statistic and degrees of freedom

The test statistic is a standardized numerical description of the differences betweenthe group means. Depending on the test type, the actual mathematical formula forthe calculation of the test statistic is different. Here we will present these formulasfor a couple of t-tests.

The simplest t-test is the one-sample t-test. It is constructed to compare asample group mean with a certain hypothesis. The general formula of the test is

T = M −μs√n

,


where M is the sample mean, μ is the expected sample mean (hypothesis), s isthe standard deviation and n is the number of observations in the sample group.Degrees of freedom are calculated using the following formula:

d f = n − 1

The means of two samples are compared with an independent samples t-test.In order to make the test, we need to know whether the variances of the groups areequal. As a rule of thumb, the variances of the groups can be assumed to be equal,if the variance of the first group is not more than three times larger than the varianceof the second group. Assuming that the variances of the two-groups are not equal,the formula of the test statistic (Welsh’s t-test) is:

T = Xi − X j√s2i

ni+ s2

jn j

,

where Xi and X j are the means of the compared groups, s 2i and s2

j are the variancesof the compared groups, and n i and n j are the numbers of observations in thecompared groups.

The degrees of freedom are calculated with a formula:

d f =

(s2i

ni+ s2

jn j

)2

(s2i

ni)2

ni −1 + (s2j

n j)2

n j −1

,

where s2i and s2

j are the variances of the compared groups, and n i and n j are thenumbers of observations in the compared groups.

If the variances of the groups are equal, the test statistic (student’s t-test) iscalculated with the formula:

T = x̄1 − x̄2

s ×√

1n1

+ 1n2

,

where

s =√∑n1

i=1(x1i − x̄1)2 +∑n2i=1(x2i − x̄2)2

n1 +n2 − 2.

Degrees of freedom for the equal variances t-test are calculated as:

d f = ni +n j − 2

Formulas for the test statistics of non-parametric tests are not presented here,but they can be easily found from any statistical textbook.


8.9.6 Critical values table

A critical values table contains the standardized values of a distribution. The criticalvalue table for t-tests contains the standardized values of the t-distribution. Recall,that the shape of the t-distribution was partly defined by degrees of freedom. There-fore, dfs are also essentially needed when reading the critical values table for thet-distribution. Below (Table 8.1) is a sample of the ten first rows of a t-distributiontable

Table 8.1: Probability of exceeding the critical value.

df 0.10 0.05 0.025 0.01 0.005 0.001

1. 3.078 6.314 12.706 31.821 63.657 318.3132. 1.886 2.920 4.303 6.965 9.925 22.3273. 1.638 2.353 3.182 4.541 5.841 10.2154. 1.533 2.132 2.776 3.747 4.604 7.1735. 1.476 2.015 2.571 3.365 4.032 5.8936. 1.440 1.943 2.447 3.143 3.707 5.2087. 1.415 1.895 2.365 2.998 3.499 4.7828. 1.397 1.860 2.306 2.896 3.355 4.4999. 1.383 1.833 2.262 2.821 3.250 4.296

10. 1.372 1.812 2.228 2.764 3.169 4.143

From the table, read the value at the intersection of the degrees of freedom andthe set p-value threshold divided by two. For example, if we used p-value of 0.05and df = 9, the critical value would be 2.262.

8.9.7 Drawing conclusions

Compare the test statistic and the critical value from the table. If the calculatedtest statistic is larger, then the null hypothesis is rejected, and we can conclude thatthere is a difference between the groups with a certain p-value threshold.

8.9.8 Multiple testing

The more analyses are performed on a data-set, the more the results will meet theconventional significance level by chance alone. Often the p-value is corrected toaccount for this problem. One commonly used correction is the Bonferroni correc-tion, where the original p-value is adjusted by the number of comparisons to createa new corrected p-value against which those comparisons should be tested.

Another increasingly popular multiple testing correction method is false dis-covery rate (FDR). FDR controls for the expected proportion of false positivesinstead of the chance of any false positives as Bonferroni correction does. FDRthreshold is calculated from the distribution of observed p-values, but FDR is notinterpreted in a similar fashion as p-value. For example, a p-value of 0.2 is inmost cases utterly unacceptable. In contrast, if a hundred genes with an FDR of 0.2are reported, we should expect 20 of them to be correct. Especially in DNA mi-


croarrays studies FDR is often reported as a q-value, which is actually a "posteriorBayesian p-value". FDR seems to be more appropriate than Bonferroni correctedp-value for large datasets, e.g., microarray data. One widely used FDR method wasdeveloped by Benjamini and Hochberg (1995).

8.9.9 Empirical p-value

Permutation tests are a family of methods that can be used for calculation of em-pirical p-values. For example, an empirical t-test p-value for a gene discriminatingbetween two groups of samples, e.g., cancer patients and their healthy controls, canbe calculated as follows. First a t-test statistic for the original dataset is calculated.Then the sample labels, i.e., the group the individual samples belong to, are ran-domized very many times (10 000). A t-test statistic is recalculated for every ofthe randomized datasets, and the empirical p-value is the percentage of randomizeddataset that got a larger t-test statistic than the original dataset. For example, ifthe original t-test statistic was 4.0 and out of 10 000 randomization 40 got a largert-test statistic than 4.0, the empirical p-value would be 40 / 10 000 = 0.004. Thegenes having the smallest empirical p-values best dicriminate the groups from eachother. However, if the number of individuals in both groups is small (<10), it mightbe better to use critical values table for calculation of p-values.

8.10 Analysis of variance

The purpose of the analysis of variance (ANOVA) is to test for significant differ-ences between the means of several groups. If we are comparing two means, theANOVA will give the same result as the t-test. However, unlike the t-test, ANOVAdoes not specify which of the groups are significantly different from each other,only that there are significant differences.

It may seem odd that a procedure that compares means is called analysis ofvariance. The name is derived from the fact that in order to test for a statisticalsignificance between means, we actually need to compare or analyze variances.

8.10.1 Basics of ANOVA

At the heart of the ANOVA is a procedure where the variances are divided up orpartioned. Variance is computed as the sum of squared deviations from the overallmean, divided by the corrected sample size. Thus, given a certain sample size, thevariance is a function of sums of squares (squared deviations from the means), orSS. In the most simple form SS can be divided as SStotal = SStreatment + SSerror;The variance that is not explained by the treatment or group membership is thoughtbe due to experimental errors.

There are some assumptions the data should fulfill in order to be eligible for theANOVA analysis. First of all, the data should be normally distributed. Furthermore,variances of the compared groups should be similar. These assumption are robustto small deviation from the assumptions. The strictest assumption, for which thetest is not robust, is the assumption of independence. The observations should beindependent of each other. Therefore, a time series can not usually be analyzed by


the ANOVA.

8.10.2 Completely randomized experiment

A completely randomized experiment is the simplest ANOVA design. It is a basicdesign from which other designs can be derived by adding some constraints on thetreatment of experimental units or samples. Completely randomized ANOVA al-lows us to compare the means of different treatments, and is also known as the one-way ANOVA. For microarray experiments, this would mean, for example, compar-ison of the mean expression of different cell lines or different subarrays. Before theexperiment, subjects (here, chips) are randomly selected for the treatment groups.If randomization is not used, the effect of treatment might get masked by the effectof variation of the subjects.

Let us consider the following experiment (Table 8.2). We want to compare theeffect of three different bacterial treatments on the same cell line. The nine avail-able cell culture flasks are randomized into treatment groups. The experiment isconducted, and the mean expression of certain immunorelated genes is quantified.

The partition of variance is done in two phases. First, the SS within groups orSSerror (how much individual observations deviate from the group mean) is calcu-lated using the groupwise means (Table 8.3). This reflects the various errors in theexperiment. Then the SS between groups or SStreatment (how much the individualobservations deviate from the overall mean) is calculated using the overall mean(Table 8.4). It reflects the effect of treatment on different groups.

SStreatment has k-1 degrees of freedom, where k is the number of groups.SSerror has N-k dfs, where N is the total number of subjects, and k is the number ofgroups. Mean squares (MS) are calculated by dividing the SSs by the concommitantdfs.

The results are summarized in an ANOVA table (Table 8.5), which reportssum of squares, and the F-statistic with an associated p-value. The F-test statisticis calculated as

F = MStreatment

MSerror

The p-value reported in the table below has been read from the F-distributiontable of critical values with the appropriate dfs and F-statistic.

Table 8.2: Primary data.

Bacteria 1 Bacteria 2 Bacteria 3

Observation 1 2 6 4Observation 2 3 7 5Observation 3 1 5 3

Mean 2 6 4Overall mean 4


Table 8.3: Calculation of error terms.


Observation 1 2-2=0 6-6=0 4-4=0Observation 2 2-3=-1 6-7=-1 4-5=-1Observation 3 2-1=1 6-5=1 4-3=1

Sum of squares 0+1+1 0+1+1 0+1+1SS within groups 2+2+2=6

Table 8.4: Calculation of treatment effects.


Observation 1 4-2=2 4-6=-2 4-4=0Observation 2 4-3=1 4-7=-3 4-5=-1Observation 3 4-1=3 4-5=-1 4-3=1

Sum of squares 4+1+9=14 4+9+1=14 0+1+1=2SS total 14+14+2=30

Table 8.5: ANOVA table presents a summary of the results.

SS df MS F p-value

Effect 24 2 12 12 <0.01Error 6 6 1

The interpretation of the ANOVA p-value is similar to the t-test. Becausethe calculated p-value is lower than 0.05, we conclude that the null hypothesisis rejected, and the alternative hypothesis comes to power. More specifically, weknow that there is a difference between the treatment groups, but we do not knowwhich treatments differ from each other. T-test would give a definite answer, if weneed to know specifically which groups are different.

8.10.3 One-way and two-way ANOVA

The completely randomized ANOVA design presented above is also known as theone-way ANOVA, because the states of only one variable are studied. If there hadbeen two variables, say the treatment and the cell-line, then the procedure wouldhave been called a two-way ANOVA. There are many more experimental designsfor the ANOVA, which are more thoroughly covered elsewhere (Sokal, Biometry).

8.10.4 Mixed model ANOVA

When an experiment is performed levels of some independent variables are nor-mally manipulated by the researcher. For example, in the bacterial treatment ex-ample (see above) researcher decided which bacterial treatment the cell-lines re-


ceived. The independent variables that are manipulated by experimenter are calledfixed factors in the context of ANOVA analysis and experiment design. Levelsof all independent variables are not typically manipulated, either because it is notpossible or worth while. However, many things can affect the bacterial treatmentexperiment. For example, it might be that some culture flasks are slightly hotterthan others, possibly due to their placing in the incubator. Independent variablesthat are not manipulated by experimenter are called random factors.

ANOVA models that include both fixed and random factors are called mixedmodels. Mixed models are very versatile ANOVA tools, and they have many usesin data analysis of, e.g., DNA microarray data. One use of mixed models in DNAmicroarray data analysis is the removal of batch effects from the data. For example,it is quite often observed that chips hybridized on the same day cluster together inhierarchical clustering. This is not usually desired, and the effect can be removedfrom data using a suitable mixed model ANOVA.

8.10.5 Experimental design and ANOVA

Experimental design in respect to DNA microrray experiments is covered in moredetails in Chapter 6. Here we discuss only some specific issues related to ANOVAanalysis of data. Namely, ANOVA can be used for the analysis of DNA microarraydata, but in order to make the analysis as powerful as possible, we need to considersome experimental design issues.

8.10.6 Balanced design

The most powerful data analysis is possible, if the design is balanced. A designis balanced if all the groups have a similar number of observations. For example,if we are comparing gene expression in cancer patients and their healthy controls,the design is balanced in relation to samples if both groups have equal numbers ofindividuals and hybridized chips. Balanced designs are so powerful, because theyremove or diminish bias in the data.

Above, we discussed shortly the mixed model ANOVA that can be used forremoving batch effects from DNA microarray data. In practise, it is best to includebatch already in the experiment design, because this enables one to differentiatebetween the batch effect and actual treatment effect. For example, if we want tocompare two groups of samples, it is a good practise to hybridize equal amountsof samples from both groups every day. Also, the samples should appear equallyoften labeled with Cy3 and Cy5.

8.10.7 Common reference and loop designs

Common reference design (Figure 8.10) is to date the most used design in cDNAmicroarray experiments. It is called common reference design, because every sam-ple in the experiment is compared against the same reference sample (”commonreference´´). For example, suppose we wish to study how gene expression changesin a cell culture after treatment by some chemical. Then the first time-point (takenjust after adding the chemical) becomes a natural common reference against which


all the samples are hybridized; Sample RNA extrected at time-point zero and someother time-point are hybridized to every chip. Likewise, if we want to differen-tiate between several tumor subtypes, a sample of tumor RNA and some healthycommon reference RNA are hybridized to every chip. Common reference designis easily extended later on with extra chips, assuming the common reference isretained throughout the study. Also, it does not do great harm if one hybridiza-tion fails, something that can be catastrophic for loop designs. All the samplesare treated exactly the same way, and that can diminish a possibility of laboratorymistakes.

Loop design (Figure 8.10) is statistically more powerful than common refer-ence design. It can be utilized if all the samples are equally interesting and of highquality. The benefit of loop design is that because more replicates per sample arecreated than using the common reference design, the estimates of effect variance ofgene expression are halved. That is assuming that all the chip are good hybridiza-tions. If one of the chips in the loop desing fails or is of low quality, the errorvariance estimates are in fact doubled. This is a serious drawback, and loop designis seldom used in practise. In addition, it is not very easy to add new samples toloop designed experiment. However, there are many efficient (loop) designs, someof which are discussed in an article by Kerr and Churchill in more details. Loopdesign are commonly analyzed by ANOVA.

Figure 8.10: A. A common reference design using three microarrays. Design makespossible a comparison of the three different samples B. A loop design for the same samplesas in A. Comparision of samples is now more powerful, because the same sample is alwayshybridized on two chips.


8.11 Error models

Error model enables one to estimate measurent and sample-to-sample errors fromdata, e.g., DNA microarray measurements. Several different error models havebeen proposed, but here we discuss only the Rocke-Durbin model published in2001. The model is based on an observation that standard deviation of intensitymeasurements is proportional to the expression level, i.e., the higher the expres-sion, the higher the standard deviation. However, this proportionality cannot holdfor the genes that are not expressed at all or are present in very low quantities,because it would imply non-existent measurement error, which is in practise im-possible. Therefore, the error model incorporates two types of errors in the samemodel: absolute error, which dominates at low expression levels and proportional(or relative) error, which dominates at higher expression levels.

In principle, error model allows one to estimate standard errors even for genesthat don’t have any replicate measurements. This, of course, makes it possible toperform statistical testing with a dataset that does not contain any replicates. Inaddition, using error model for estimation of standard deviation for datasets thatcontain replicate measurements makes the estimation more precise.

8.12 Examples

Please see the book web site at http://www.csc.fi/molbio/arraybook/ for ex-tra material on doing some basic statistics using computer software.


1. Benjamini, Y., and Hochberg, Y. (1995) Controlling the false discovery rate:a practical and powerful approach to multiple testing, J. Roy. Statist. Soc.B., 57, 289-300.

2. Hopkins, W. G. (2003) A new View of Statistics, http://www.sportsci.org/resource/stats/index.html.

3. Kerr, M. K., and Churchill, G. (2002) Experimental design for DNA microar-ray experiments, accessible at http://www.cs.cornell.edu/Courses/

cs726/2001fa/papers/kerr.pdf

4. Pezzullo, J. C. (1999) Interactive Statistical Calculation Pages,http://members.aol.com/johnp71/javastat.html.

5. Ranta, E., Rita, H., Kouki, J. (1997) Biometria, Yliopistopaino, Helsinki.

6. Rocke, D. M., and Durbin, B. (2001) A model for measurement error forgene expression arrays, J. Comp. Biol., 8, 557-569.

7. Sokal, R. R., Rohlf, F. J. (1992) Biometry, Freeman and co, New York.

8. Statsoft Inc. (2002) Electronic Statistics Textbook, http://www.statsoft.com/textbook/stathome.html.

Part II

Analysis


9 Preprocessing of data

Jarno Tuimala

9.1 Rationale for preprocessing

Preprocessing includes analytical or transformational procedures that need to beapplied to the data before it is suitable for a detailed analysis.

The black-box approach, where data is fed into a program and the result popsout, is simply erraneous for statistical analysis, because the results coming out fromthe program can be statistically erraneously derived. In such cases, also the biologi-cal conclusions can be wrong. Statistical tests have often strick assumptions, whichneed to be fulfilled. Violation of assumptions can lead to grossly wrong results.

We strongly recommend that the researcher, even if he/she is not himself per-forming the data analysis, gets basic knowledge of the data, because the resultspresented by the bioinformatician or statistician are more easily interpretable, ifone is at least somewhat familiar with the data.

Here we will introduce some methods for checking the data for violation ofstatistical test assumptions. We also present some things to consider before andduring the data analysis. The methods are introduced in the order of applicability,although some methods are needed in several steps.

9.2 Missing values

There are often many missing values in microarray data. As you might recall, miss-ing values are observations (intensities of spots), where the quantification resultsare missing. In the context of microarrays, we define missing values as

• Missing because the spot is empty (intensity = 0).

• Missing because background intensity is higher than the spot intensity (back-ground corrected intensity < 0).

Missing values can lead to problems in the data analysis, because they easilyinterfere with computation of statistical tests and clustering. There are a coupleof options for the treatment of missing values: They may be replaced with esti-mated values in a process called imputation, or they can be deleted from the furtheranalyses.

9 Preprocessing of data 83

The default way of deleting missing data (in most of the software packages),for example while calculating a correlation matrix, is to exclude all cases that havemissing data in at least one of the selected variables; that is, by casewise deletion ofmissing data. However, if missing data are randomly distributed across cases, youcould easily end up with no "valid" cases in the data set, because each of the geneswill highly likely have at least one missing observation on some chip. The mostcommon solution used in such instances is to use the so-called pairwise deletion,where a statistic between each pair of variables is calculated from all cases thathave valid data on those two variables.

Another common method is the so-called mean substitution of missing data(mean imputation, replacing all missing data in a variable by the mean of that vari-able). Its main advantage is that it produces internally consistent sets of results.Mean substitution artificially decreases the variation of scores, and this decreasein individual variables is proportional to the number of missing data. Because itsubstitutes missing data with artificially created average data points, mean substi-tution may considerably change the values of correlations. Imputation is commonlycarried out for intensity ratios, but can also be done for raw data.

Different computer programs manipulate missing values very differently, anddrawing any consensus would be futile. At least statistical programs often offera possibility to define whether to use imputation, pairwise deletion or casewisedeletion.

9.3 Checking the background reading

There is some dispute on whether the background intensities should be subtractedfrom the spot intensities. At the moment, the background adjustion seems to becommonly used, but strictly speaking, it’s applicability should be checked.

Background intensities are assumed to be additive with foreground intensi-ties. Background and foreground intensities together form the spot intensity. Back-ground intensities should not vary multiplicatively with foreground intensities orspot intensities, because such phenomenon occurs usually when there has beensome kind of problems with hybridization. This can be assessed by a scatter plot ofbackground intensities against spot intensities. If the spot intensities are dependenton the background intensities, it is possible either not to apply any background cor-rection to the data (Figure 9.1 and Figure 9.2) or discard the deviating observationsfrom further analyses.

Another common problem with the background correction is that it might pro-duce “pheasant tail” images on the scatter plot (Figure 9.3). Pheasant tails areformed by observations, which have exactly the same intensity value on one chan-nel, but the other channels’ intensities can vary. In such cases long vertical orhorizontal straights of observations are produced, and the resulting data cloud re-sembles a pheasant tail in a scatter plot. This is especially common in the lowerend of the intensity distribution, and can be a sign that the scanner is not very reli-able below a certain cut-off intensity or that the image analysis software calculatesbackground intensities in a peculiar way. When pheasant tails are observed, back-ground uncorrected intensity values can be used for the analyses, or the deviating


0 500 1500 2500

020

000

raw data

mouse1$morphR

rmea

n

10.0 10.5 11.0 11.5

1012

14

log−transformed data

log2(mouse1$morphR)

log2

(rm

ean)

0 500 1000 1500

5000

1500

0

raw data vs. corrected data

r.ba

rmea

n

2 4 6 8 10

11.0

12.5

14.0

log2−data vs. corr. data

r.ba

rmea

n

Figure 9.1: Raw and log-transformed intensity values of the red channel plotted againstits background. Upper row contains uncorrected scatter plots, lower row background cor-rected scatter plots.

observations can be excluded from any further analyses.The background corrected intensity values are calculated spotwise by subtract-

ing the background intensity from the spot (foreground) intensity. The formulas forAffymetrix and two-color data are

Cy3′ = Cy3spot −Cy3background

Cy5′ = Cy5spot −Cy5background

Note, that Affymetrix arrays do not quantify background around the spots, butinstead use a certain mathematical formula (check the Affymetrix manual).

9.4 Calculation of expression change

There are three commonly used measures of expression change. Intensity ratiois the raw expression value, and log ratio and fold change are transformationallyderived from it.


0.5 1.0 1.5 2.0

050

010

00

ratio vs. green background

ratiogr

een.

back

grou

nd

0.5 1.0 1.5 2.0

010

0025

00

ratio vs. red background

ratio

red.

back

grou

nd

0.5 1.0 1.5 2.0

050

015

00

ratio vs. average background

ratio

aver

age.

back

grou

nd

−2 −1 0 1

9.5

10.5

logratio vs. log−background

log2(ratio)lo

g2(a

vera

ge.b

ackg

roun

d)

Figure 9.2: Intensity ratios (expression) plotted against the background of the channels.

Figure 9.3: An A versus M plot, where the pheasant tail is visible in the lower intensityvalues.


9.4.1 Intensity ratio

After background correction, expression change is calculated. The simplest ap-proach is to divide the intensity of a gene in the sample by the intensity level of thesame gene in the control. Intensity ratio can be calculated from background cor-rected or uncorrected data (here we use corrected data) The formula for two-colordata is

Intensity ratio = Cy3′

Cy5′

For Affymetrix data, substitute Cy3’ and Cy5’ with the appropriate intensitiesfrom the sample and control chips.

This intensity ratio is one for an unchanged expression, less than one for down-regulated genes and larger than one for up-regulated genes. The problem withintensity ratio is that its distribution is highly asymmetric or skewed. Up-regulatedgenes can take any values from one to infinity whereas all downregulated genes aresqueezed between zero and one. Such distributions are not very useful for statisticaltesting.

9.4.2 Log ratio

To make the distribution more symmetric (normal-like) log-transformation can beapplied. The most commonly used log-transformation is 2-based (log 2), but it doesnot matter whether you use natural logarithm (log e) or base 10 logarithm (log10) aslong as it is applied consistently to all the samples.

The formula for the calculation of log2-transformation is:

Log ratio = log2(Intensity ratio) = log2

(Cy3′

Cy5′

)

After the log-transformation, unchanged expression is zero, and both up-regulated and down-regulated genes can take values from zero to infinity. Logratiohas some nice properties compared with other measures of expression. It makesskewed distributions more symmetrical, so that the picture of variation becomesmore realistic.

In addition, normalization procedures become additive. For example, we havethe following intensity ratio results for two replicates (A and B) after normalization:

GeneA = 120

60= 2.0

GeneB = 30

60= 0.5

The mean of these replicates is 1.25 instead of 1, which would have beenexpected. If the 2-based logarithmic transformation is applied, the log ratios are:

GeneA = log2

(120

60

)= 1.0


GeneB = log2

(30

60

)= −1.0

The mean of these log ratios is 0, which corresponds to the mean intensity ratioof 1. Although log-transformation is not always the best choice for microarray data,it is used because the other transformations lack this handy additive property. Onedownside of the log-transformation is that it introduces systematic errors in thelower end of the expression value distribution.

9.4.3 Variance stabilizing transformation

Because log-transformation creates extreme differencies between small intensityor expression values, a new variance stabilizing transformations was proposed byHuber et al. Among other things, it uses arsinh-transformation, which is smoothand real-valued even if the expression values are negative. For large intensities thearsinh-transformation is more or less equivalent to log-transformation. Since mostof genes are typically expressed in low quantities in any experiment, the variancestabilizing transformation might prove superior to log-transformation. However,the normalization with arsinh-transformed data is not so straight forward as withlog-transformed data.

9.4.4 Fold change

Another means to make the distribution of intensity ratios more symmetric is tocalculate the fold change. The fold change is equal to the intensity ratio, when theexpression is higher than one. Below one, the fold change is equal to the inversedintensity ratio.

For values >1, fold change = Cy3′Cy5′

For values <1, fold change = 1(Cy3′/Cy5′)

The fold change makes the distribution of the expression values more sym-metric, and both under and over-expressed genes can take values between zero andinfinity. Note that the fold change makes the expression values additive in a similarfashion as the log-transformation.

9.5 Handling of replicates

Replicates are a very powerful way to reduce noisiness of the data. Noisiness (ran-dom errors) is an undesirable feature of the data, because it potentially abolishesinteresting information. It can result from many sources, some of which are hardto deal with.

9.5.1 Types of replicates

There are at least three different kinds of replicates that are potentially useful in thecontext of DNA microarray analysis. For example, if we treat a certain cell linewith a cancer drug, we can set up several culture flasks, and then harvest them as


biological replicates. For every culture flask, the isolated and labeled mRNA pop-ulation can be hybridized to several chips making a number of technical replicates.In addition, every hybridized chip can be quantified several times or using differentimage analysis software (software replicates).

These are all valuable sources of information about the variability or sources ofvariability in the data. In practise, it might be a good idea to make some biologicalreplicates and some technical replicates. Then the biological and technical variationcan be taken into account in the same experiment.

9.5.2 Time series

In a time series experiment expression changes are monitored with samples takenbetween certain time intervals. Although several replicates can be made per everytime point, it should be considered that these replicate chips can possible be madea better use of, if they are added to the time series as sampling points. That is, itshould be weighted whether a high precision in every time point is more valuablethan the additional information of expression changes new sampling points (timepoints) produce.

9.5.3 Case-control studies

For case-control studies, where for instance cancer patients are compared with theirnon-diseased controls, the individuals in both groups can be considered replicatesof that disease state. This is assuming that we are interested in the differencies ofthose two disease states instead of inter-individual variation.

9.5.4 Power analysis

Using power analysis we can easily compute the number of replicates that areneeded to reliably discover the genes that are expressed. For example, we wouldneed 11 replicates to reliably pick up the genes, which are at least 1.41-fold over-expressed at a p-value of 0.05. Similarly, 6 replicates are needed if genes with2-fold over-expression need to be quantified using a p-value cutoff of 0.01.

9.5.5 Averaging replicates

The goal of DNA microarray experiments is to semi-quantitatively analyze the geneexpression level in a certain material. Often the precision of the expression levelestimate is important. The estimate of the true expression necessarily becomesmore precise when more replicates are used for that purpose.

Replicates also produce data on variability between the measured states, forexample cancer patients and healthy controls. More importantly, replicates po-tentially reduce noise in the data. However, if replicates are averaged, we loseinformation about the inter-individual variability in the studied groups.

Assume that we have an experiment, where there are two biological replicatesand three technical replicates of both biological replicates, making a total of 6 repli-cates altogether. The correlation between any of the technical replicates from thetwo biological replicates is, at first, quite low (<0.5). Taking average of the tech-


nical replicates makes the correlation between biological replicates much higher(>0.8). This also hints that the noisiness of the data has been reduced. Of course,bad replicates (chips or spots) should ideally be removed before calculation of av-erages, even though averaging diminishes the influence of outliers on the results.

In practise, replicates are often averaged after normalization, but sometimeseven the raw intensity values are averaged. We feel that it would be better to averagereplicates after normalization, because then it is possible to take chipwise variationinto account in the normalizations.

9.6 Checking the quality of replicates

If the experiment includes replicates, their quality can be checked with simplemethods, using scatter plots and pairwise correlations or hierarchical clusteringtechniques, which are explained in the cluster analysis chapter in more detail. Of-ten the quality of replicates is checked before and after the normalization. If thecorrelation between replicates drops dramatically after certain normalization, theapplicability of the normalization method should be reconsidered.

9.6.1 Quality check of replicate chips

The first task is often to find out, how well two replicate chips resemble each other.In such cases the intensity or log ratios can be plotted along the axes of a scatterplot. Three replicates can be plotted using a three dimentional scatter plot. Ifthe replicates are completely similar, the data points in the scatter plot fall onto aperfectly straight line running through the origin of the plot. If the replicates arenot exactly similar, the observations form a data cloud rather than a straight line.The absolute deviation from a line can be better visualized, if a linear regressionline is fitted to the data.

Correlation measures the linear similarity quite well. It might give misleadinginformation, if the data cloud is not linear. Correlation coupled with a scatter plotgives much more information. In practise, the Pearson’s correlation coefficientbetween two replicate cDNA chips produced from the same mRNA pool (technicalreplicates) are often in the range of 0.6–0.9, and the correlation between similarlyhybridized Affymetrix chips is typically over 0.95. When biological replicates areperformed, the correlation between replicate chips usually drops from these values.For cDNA chips, a correlation of 0.7 between replicates may be considered good,whereas for Affymetrix chips a correlation over 0.9 may be considered a goodresult.

In addition, the scaling factor should be checked for the Affymetrix chips. Ifthe factor is larger than 50, the hybridization is probably bad.

Quality of several replicate chips is most easily checked by hierarchical clus-tering. For example, if the two replicate chips are placed closest to each otherin the dendrogram, they can be expected to be good replicates. Depending on thesimilarity measure, the hierarchical clustering can be applied either after (Euclidiandistance) or before the normalization (Pearson’s correlation).


9.6.2 Quality check of replicate spots

The quality of replicate measurements of one gene (one spot) can be assessed ina similar fashion to replicate chips, using correlation measures. However, this ismore laborous than with replicate chips, but it can sometimes be essential that badreplicate spots are found and removed from the data.

9.6.3 Excluding bad replicates

It is quite common to exclude bad replicates from further analyses. For example,if there are four replicate chips available for the same treatment, and one seems todeviate from every other replicate, then this most deviating one can be removedfrom any further analyses. This often results in a massive loss of data, becausewhole chips are removed from the analysis. Thus, it would be better to study thequality of replicates on a gene-wise basis, after initially studying the quality ofwhole chips using the tools described above.

In practise, there often are replicates which deviate from the others. These canbe results of modified experimental conditions (different mRNA batch, etc.) or justimportant biological variation, if different animals of human subjects are studied.Because of this, some caution should be exercised when removing bad replicates.

9.7 Outliers

Outliers in chip experiments can occur at several levels. There can be entire chips,which deviate from all the other replicates. Or there can be an individual gene,which deviates from the other replicates of the same gene. These deviations canbe caused by chip artifacts like hairs, scratches, and precipitation. Precipitation,among other causes, can also perturb one single spot or probe.

Outliers should consist mainly of quantification errors. In practise, it is oftennot very easy to distinguish quantification errors from true data, especially if thereare no replicate measurements. If the expression ratio is very low (quantificationerrors) or very high (spot intensity saturation), the result can be assumed to be an ar-tifact, and should be removed. Most of the actual outliers should be removed at thefiltering step (those that have too low intensity values), and some ways to identifydeviating observations were presented in the section on checking replicates.

In the absence of replicates, the highest and lowest 0.3% of the data (gene ex-pression values on one chip) is often removed, because assuming normality, suchdata resemble outliers. These values are outside the range of three standard de-viations from the distribution mean. This is often equivalent to a filtering, whereobservations with too low or high intensity values are excluded from further analy-ses.

Formally, a statistical model of the data is needed for the reliable removalof outliers. The simplest model is equality between replicates. If one replicatedeviates several standard deviations from the mean of the other replicates, it canbe considered an outlier and removed. The t-test measures standard deviation andgives genes, where outliers are present among replicates, a low significance.

In practise, it is easiest and best to couple the removal of outliers with filtering,


for example, using the following procedure: First, genes outside the range of threestandard deviations from the chip mean are removed. In the succeeding filteringsteps, the genes whose intensity is too low are removed, if needed. Next, geneswhich do not show any expression changes during the experiment are eliminated(this can be based on either log ratio or standard deviation of log ratios). What isleft in the end, is the good quality data.

9.8 Filtering bad data

Filtering is a process where observations which do not fulfill a preformulated pre-sumption are excluded from the data. For example, there is a certain limit of thescanner below which the intensity values can not be trusted anymore. Typically,the lowest intensity value of the reliable data is about 100–200 for Affymetrix dataand 100–1000 for cDNA microarray data. These cut-offs are likely to change as thescanners get more precise. The values below the cut-off point are usually removed(filtered) from the data, because they are likely to be artifacts.

The idea of filtering is simple. We want to remove all the data we do nothave confidence in, and proceed with the rest of the data. The results based on thetrustworthy data are often biologically more meaningful than the results based onvery confusing or noisy data.

The first step of filtering is flagging. Flagging is performed at the image anal-ysis step. It’s idea is to mark the spots which are, by eye, judged to be “bad”.Then, in the data analysis phase, the spots, which were flagged as “bad” are re-moved from the data. For example, the spots overlapped with a sizable dust particleshould be flagged as bad and removed at the data filtering step.

There are a couple of statistical measures that can be used for filtering. Someimage analysis programs give signal-to-noise or signal-to-background measure-ments for every spot on the array. These quality measurements can be used forfiltering out bad data. Often a cut-off point of 90–95% for signal-to-noise ratio isused, at least on control channel.

Signal-to-background can be calculated after the image analysis, from the in-tensity values:

spot signal on green channel

background signal on green channel

The spots can then be filtered on the basis of this quality statistic. Signal tobackground plots of both channels appear often very much the same as the originaldata (Figure 9.4).

It would also be expected that the signal to background ratio would increase asthe intensity of the channel increases, if the background is approximately the samein all areas of the chip (Figure 9.5). This is one hallmark of nice-quality data.

Filtering by the standard deviation of background intensities if often used, also.The observations that have an intensity value lower than background + 2–6 timesstandard deviation of the background intensities, are often removed from the data.


0 10000 30000

050

100

150

green channel stb

gmean

g.st

b

0 10000 30000

010

2030

red channal stb

rmean

r.st

b

0 10000 30000

010

000

3000

0

intensity>2*sd, both channels

g.qc

r.qc

0 10000 30000

020

000

original data

gmean

rmea

n

Figure 9.4: Raw intensities and log-transformed intensities of different channels plottedagainst their signal-to-background ratios (stb).

9.9 Filtering uninteresting data

After the removal of bad data, we are left with the good quality data, of whichmost is probably uninteresting for us. If the goal of the study is to find a coupleof dozens of genes for further studies of the biologically interesting phenomenon,it is a good idea to remove the uninteresting part of the data before clustering orclassification analyses. Uninteresting data comprises of the genes that do not showany expression changes during the experiment.

For example, assume that we have a time series, and we want to find the genesthat have a most upregulated expression at the end of the sampling. In such cases,genes that remain unchanging during the study, are removed from further analyses.This is because the clustering methods generally work better, if the uninterestingdata is not swamping the changes in the interesting part of the data.

Similarly, if we are interested in finding the genes that differentiate betweentwo groups of patients, the genes that do not show any change between thesegroups, can be removed from the analysis.

Most often the intensity ratio cut-offs for uninteresting data (not-changinggenes) are set at 0.5 and 2.0. These cut-offs also roughly correspond to the thresh-olds of reliable data: All the variation between these cut-offs can be assumed tobe due to random quantification errors of the non-changing genes. The modernchip technology is able to produce better chips, and in practise these cut-offs can


9 10 11 12 13 14 15

050

100

150

log2(gmean)

g.st

b

Figure 9.5: The signal-to-background ratios (stb) increase as the intensity of the spotincreases.

often be set on a lower level, for example to 0.7 and 1.5. The cut-offs can be bettercharacterized using a Volcano plot (see Chapter 11).

9.10 Simple statistics

There are some simple statistical measures of the data, which can be performedbefore or after normalization. They will give hints as to what statistical tests toapply or what is the distribution of the observations. These statistics should bechecked from the distribution of log-transformed data.

9.10.1 Mean and median

When the distribution of the intensity values is skewed, the median characterizesthe central tendency better than the mean. The median and mean can also be usedto check the skewness of the distribution. For symmetrical distributions mean andmedian are approximately equal.

9.10.2 Standard deviation

The standard deviation gives you an idea of how spread your data are. It is alsogood to check the coefficient of variation (standard deviation / mean). This shouldgive you a better idea of the real spread of the data, because the value of CV is


independent of the absolute values of the variable. For example, replicates shouldhave a small CV, although the value of standard deviation can vary a bit.

9.10.3 Variance

Sometimes you need to make a decision of the appropriate statistical test. Forexample, there are different kinds of t-tests for variables with equal or unequalvariances. As a rule of thumb, the variances can be assumed to be equal if thevariance of the first variable if less than three times the variance of the secondvariable.

9.11 Skewness and normality

Skewness and normality should be checked after normalization, but it is usuallyworthwhile, if they are checked before normalization, too. Then the distribution ofthe data (log-transformed intensity ratio) can be compared before and after normal-ization. This can be partly used for checking the success of the normalization.

Unacceptable histogram

Not normally distributedgmean

Fre

quen

cy

0 20000 40000

020

0040

00

Acceptable histogram

Not normally distributedgmean

Fre

quen

cy

0 10000 30000

050

015

00

Unacceptable histogram

Approximately normally distributedlog2(na.gmean)

Fre

quen

cy

3.0 3.2 3.4 3.6 3.8 4.0

010

0020

00

Acceptable histogram

Approximately normally distributedlog2(na.gmean)

Fre

quen

cy

3.2 3.4 3.6 3.8

010

020

030

0

Figure 9.6: Four histograms of the same data. The two upper histograms contain the rawdata, and the two lower histograms the log-tranformed data. The log-transformed data isclearly more normal-like than the non-transformed data.

Normality means how well your data fits to the normal distribution. This


should be checked, because most of the statistical procedures assume that the datais normally distributed. Even the most basic descriptive statistics can be misleadingif the distribution is highly skewed; standard deviation does not bear a meaningfulinterpretation if the distribution significantly deviates from normality.

The easiest method for checking normality and skewness of the distributionis to draw a histogram of the intensities (Figure 9.6). For checking the skewness,comparison of the mean and median is complementary to this graphical method.Note that there should be enough columns in the histogram in order to make theresults reliable.

There is also a more informative way to check for the normality, quantile-quantile plot. This should be used with the histograms as a more diagnostic methodfor observing the deviations from normality.

9.11.1 Linearity

It is also important to check the linearity of the data (log ratio). Linearity meansthat in the scatter plot of channel 1 (red colour) versus channel 2 (green colour), therelationship between the channels is linear. It is often more informative to producea scatter plot of the log-transformed intensities, because then the lowest intensitiesare better represented in the plot (Figure 9.7). In this kind of a plot, the data pointsfit a straight line, if the data is linear.

0 10000 30000

020

000

gmean

rmea

n

9 10 12 14

1012

14

log2(gmean)

log2

(rm

ean)

0 10000 30000

−1.

5−

0.5

0.5

rmean

logr

atio

0 10000 30000

−1.

5−

0.5

0.5

gmean

logr

atio

Figure 9.7: Checking the linearity of the data. Top row, red channel intensities versusgreen channel intensities; bottom row, log ratio versus different channel intensities.


Another way to test linearity is to plot the log ratio versus the intensity ofone channel (Figure 9.7). This should be done for both channels independently.In this plot the data cloud has been tilted 45% to right, and it should be easier toidentify non-linearity. Again, the data points should fit an approximately straightline, which in this case is horizontal.

Checking the linearity of the data helps to pick the right normalization method.It also provides information about the reliability of the data, especially in the lowerintensity range.

9.12 Spatial effects

Often the intensity values (especially background) vary a lot on the same chip de-pending on the location of the spot on the chip. This effect is known as spatialeffect (Figure 9.8).

Figure 9.8: Intensities of spots and their backgrounds spotted in the form of the originalchip in order to check for spatial biases. Background intensities seem to be highly variablein different areas of the chip. Especially the corners have an abnormally high background.These areas might be removed from further analyses.

Spatial effects are often most pronounced in the corners of the chip, wherethe hybridization solution has possibly spread thinner than in the other areas of thechip. Furthermore, corners of the chip can also dry up more easily than other areas


of the chip.Another source of spatial effects is the regional clustering of genes with a

similar expression profile. This is often caused by bad chip design, where theprobes have not been randomized onto the chip. For example, if genes acting in acell cycle regulation are spotted into one corner of the chip, the expression values inthe corner are probably dependent on the cell cycle stage of the sample material. Insuch cases, the spatial effect should not be removed by normalization procedures,because it contains real biological information instead of random noise.

The similar kind of spatial effect can also be checked for from the calculatedexpression values (Figure 9.9). There should not be any suspicious areas on thesubarrays (assuming good chip design), where the intensity values are much higherthan everywhere else on the same chip.

Figure 9.9: The measured gene expression spotted in the form of the original chip in orderto check for spatial biases.

If the spatial effect is identified as random noise, bad hybridization conditionsor other undesirable sources of errors, it can be compensated for by a suitable nor-malization method (subarraywise methods).


9.13 Normalization

Normalization can be regarded as a preprocessing step, but it is so central a methodfor the microarray experimentation that it warrants a chapter of its own. Thuswe will not discuss normalization in more detail here, but instead chapter 10 isdedicated for the issue.

9.14 Similarity of dynamic range, mean and variance

Many statistical procedures assume that across the experiment, the dynamic range(minimum and maximum), mean and variance of the chips are equal. These shouldbe checked after normalization. If they are not, the results of the statistical testsshould be interpreted with caution.

9.15 Examples

Please see the book web site at http://www.csc.fi/molbio/arraybook/ for ex-tra material on preprocessing using computer software.


1. Baldi, P., Hatfield, G. W. (2002) DNA microarrays and gene expression,Cambridge University Press, United Kingdom.

2. Huber W., von Heydebreck A., Sültmann H., Poustka A. and Vingron M.(2002) Variance stabilization applied to microarray data calibration and tothe quantification of differential expression. Bioinformatics, 18, suppl. 1(2002), S96-S104 (ISMB 2002).

3. Knudsen, S. (2002) A biologist guide to analysis of dna microarray data,John Wiley and Sons Inc., New York.

4. Kohane, I. S., Kho, A., Butte, A. J., (2002) Microarrays for an intergrativegenomics, MIT press, Massachusetts.

10 Normalization 99

10 Normalization

Jarno Tuimala, Ilana Saarikko, and M. Minna Laine

10.1 What is normalization?

There are many sources of systematic variation in microarray experiments that af-fect the measured gene expression levels. Normalization is the term used to de-scribe the process of removing such variation [12]. Normalization can also bethought of as an attempt to remove the non-biological influences on biological data.

Sources of systematic variation will affect different microarray experimentsto different extents. Thus, to compare microarrays from one array to another, weneed to try to remove the systematic variation, to bring the data from the differentexperiments onto a level playing field, so to speak. One goal of normalization isalso to make possible the comparison from one microarray platform to another, butthis is clearly a much more complicated problem and is not covered in this section.

The biggest problem with the normalization process is the recognition of thesource of systematic bias. There is a strong possibility that some or even most ofthe biological information will also be removed when normalizing the data. Thusit is good to keep in mind that the amount of normalization should be minimized toavoid losing the real biological information. In the next chapter some of the mostcommon sources of systematic bias will be introduced and also some methods forrecognizing the sources and dealing with the bias will be discussed.

10.2 Sources of systematic bias

10.2.1 Dye effect

Differences in dye (labeling) efficiencies are the most common source of bias andalso easily identifiable. This can be seen when the intensity of one channel on thearray is much higher than the other (see Figure 10.1, B and C). Dye effect can becorrected for by balancing the dyes using the assumption that both channels shouldbe equally “bright”. Extra information for dye balancing can be received from dye-swap experiments. Problems will occur when dye also has interaction effects, i.e.,the labeling efficiency may depend on the gene sequence.


10.2.2 Scanner malfunction

Scanners might show many different failures. When laser or PMT intensity valuesare wrongly adjusted, the scanner can cause the dye effects to show up. But mostof the scanner malfunctions are hard to deal with in silico and the best solution is tofix the scanner and repeat the scanning. For example, when lasers are misalignedthe two channels are slightly out of register. This can cause big problems if imageanalysis software does not allow the user to align images manually.

10.2.3 Uneven hybridization

Sometimes patterns are seen on the slide that can be caused by many differentreasons. The most common ones of these are described in the next three sections.

When spatial bias is seen on the slide, the first impression is usually that it iscaused by uneven hybridization. In most cases this is also true. Uneven hybridiza-tion can be recognized, for example, as lighter areas in the middle of the slide or onthe edges of the slide. This is usually very hard to fix numerically and it is recom-mended to aim for developing more consistent techniques for hybridization. Forsingle cases, spots that are not hybridized properly can be excluded from furtheranalyses. Background difference can also cause bias (see Figure 10.1, D).

10.2.4 Printing tip

Slides are usually printed using more than one pen (2, 4, 8, 16, ...). If any of thesepens work differently from others, for example a pen gets infected by hair or isdefective in some other way, the corresponding subarray could differ from othersubarrays. Quite commonly printing pens also wear out at a different rate. Oneway to see if one pen performs differently is to visualize the data by subarraysusing colors or regression lines so that the faulty subarray stands out. In somecases, printing tip error can be corrected for by applying different normalizationparameters to the subarrays.

Printing can also cause problems that are not categorized as spatial bias. Some-times some or all spots are irregular in shape or they might express as rings insteadof circles. It is likely that the detection of these spots will be inaccurate. Thisshould be taken into account during image analysis if possible.

10.2.5 Plate and reporter effects

These often ignored but very common effects are not artificial but are caused bybiology itself. Sometimes concentration of reporters (probes) on the slide mightdiffer and this can cause patterns on a slide. Usually this is easy to notice if theposition of the plates (assuming that concentration is constant on a plate) is known.Other quality control methods should also be used. Plate effects can be correctedfor by using methods as in the case of different printing tips.

More difficult to recognize and especially to distinguish from other sources ofbias is the bias caused by the biological role of reporters. If reporters are arrangedaccording to their biological function (as is done in many cases) this can also beseen as patterns on a slide. It is very important to note that this effect should not

10 Normalization 101

be normalized! The reporter effect needs to be taken into account when slides arebeing designed. Related reporters should never be grouped together.

10.2.6 Batch effect and array design

When large amounts of slides have been studied, it has been noted that slides fromthe same print run (or batch) often cluster together and also slides from an identicalprint design but different print run. Some of these effects might be due to mistakesin printing. This kind of bias is very hard to notice and the only way to prevent itis to keep track of printing (LIMS) and use of good pre- and post-printing qualitycontrol methods.

10.2.7 Experimenter issues

Equal to batch effect is the systematic bias caused by the experimenter. Experi-ments done by the same experimenter often cluster together more tightly than war-ranted by biology. A survey made at Stanford University showed that the experi-menter was one of the largest sources of systematic bias. The best but not a verypractical solution for the experimenter issue would be to let the same experimenterdo everything. Because this naturally isn’t possible, consistent hybridization tech-niques are needed as well as methods to recognize bias caused by the experimenter.

10.2.8 What might help to track the sources of bias?

If you have the possibility to hybridize one or several arrays where the same sam-ple has been used both as target and control i.e., the same sample labeled with bothdyes on the same array, or to study two replicate arrays as sample and reference,the results of both channels should be equal, a scatter plot should show a straightline, and optimally, your data should follow natural distribution. Studying self-selfhybridized arrays shows you clearly the sources of error, such as uneven incorpo-ration of the dyes, that are not dependent on your sample. It can also help you indeciding which normalization steps would minimize these errors.

Scatter plot (MA plot, see 10.5.5.) may be used to illustrate the different typesof effects due to intensity-dependent variation as shown in Figure 10.1.

10.3 Normalization terminology

There are three broad groups of normalization techniques, each of which can con-tain multiple methods. Classically, normalization means to make the data morenormally distributed. In microarray analyses, this is still the goal, but the method-ology has concealed the original idea.


Figure 10.1: Truncation at the high intensity end (A) is caused by scanner saturation.High variation at the low intensity end (B) is from larger channel- specific additive er-ror. High variation at the high intensity end (C) is from larger channel- specific multi-plicative error. The curvature in (D) is from channel mean background difference. Thecurvature in (E) is from slope difference. The split RI plots in (F) come from hetero-geneity. Figure adopted and modified from a paper entitled "Data transformation forcDNA Microarray Data" by Cui, X., Kerr, M. K., and Churchill, G. A. (Submitted toStatistical Applications in Genetics and Molecular Biology. The manuscript web site is athttp://www.jax.org/staff/churchill/labsite/pubs/index.html.

10.3.1 Normalization, standardization and centralization

In general, normalization means the process of transforming a statistic (e.g., inten-sity ratio) so that it will approximate normal distribution, or be more normal-like(see 7.9). In the microarray field, normalization means also the standardization andcentralization methods.

Usually a log-transformation works quite well as a normalization method formicroarray data. Log-transformation of intensity-ratios is good for several rea-sons. The simple ratio flattens all the underexpressed genes between 1 and 0. Log-


transformation removes this bias. In statistical terms, log-transformed data gives amore realistic sense of variation, and makes the variation of intensities and ratios ofintensities more independent of absolute magnitudes. Log-transformation also sta-bilizes the variance of high intensity spots. It evens out highly skewed distributions(e.g., makes more normal-like), and makes normalization (centralization) additive.After log-transformation the expected mean of the dyes is 0 as opposed to 1 in thecase of plain intensity ratios.

Standardization is the process of expanding or contracting the distribution ofa statistic so that the experimental values can be compared with those from anotherexperiment. In statistical terms standardization means converting the intensity ra-tios to Z-scores (see 8.5.1.), which are distributed as a standard normal distribution.

Most of the methods meant by normalization in the microarray field are cov-ered by the term centralization. It is a process of moving a distribution so thatit is centered over the expected mean (balancing the two channels). For the log-transformed intensity ratios, an intensity dependent centralization (e.g., lowess)might help to correct the dye bias.

10.3.2 Per-chip and per-gene normalization

Per-chip normalization compensates for the experimental biases introduced on theindividual chips. It assumes that the median intensity stays relatively constant dur-ing the experiment. If there are large differences due to the biological phenomenon,you can accidentally remove the relevant variation in your data by normalization.

Per-gene normalization accounts for the difference in the detection efficiencybetween different spots. It also enables us to compare the relative gene expressionlevels, as already pointed out. For two-color data, the per-gene normalization isusually not used by default, because the calculation of intensity ratios corrects forthe same bias.

10.3.3 Global and local normalization

When normalizing the data, we estimate some descriptors of the data (e.g., themean). If only one descriptor is used for normalizing the whole data for one chip,then a global normalization is performed.

Sometimes we have different estimates of means for different subarrays or fordifferent intensity ranges on the same chip. If these different estimates or a function(nonlinear regression, i.e. lowess) is used for the normalization of the concomitantsubarrays or intensity ranges, the local normalization method is applied.

10.4 Performing normalization

10.4.1 Choice of the method

There are several normalization methods and little consensus about which one touse, but here we will try to clarify the choice of methods and their applicabilitywith a few simple rules. The choice of the normalization method is coupled toexperimental design. Optimally, the method should be chosen before the actual


experiment is conducted. However, the selection of the best normalization methoddepends also very much on how the results turn out, e.g., the outcome of quality,success of different steps in printing, hybridization, and scanning can be evaluatedonly after the experiment has actually been done. Therefore, after inspecting thequality of your data in general (see previous chapter), it is also advisable to tryseveral different methods and see which one gives the most realible results for yourexperiment. In any case, inside one experiment the same normalization schemeshould be applied uniformly to all chips.

10.4.2 Basic idea

The basic idea behind all the normalization methods is that the expected meanintensity ratio between the two channels (two-color data) or two chips (one-colordata) is one, because only about 10–20% of all the genes in a cell are expressed atany one time. If the observed mean intensity ratio deviates from one, the data ismathematically processed in such a way that the final observed mean intensity ratiobecomes one. When the mean intensity ratio is adjusted to one, the distribution ofthe gene expression is centered so that different chips (arrays) can be compared.

Often, when a big enough number of genes are plotted on a chip, it can beassumed that most of them actually aren’t changing, and the mean intensity ratiocan be assumed to be one. If only a couple of dozen genes are plotted on the slide,this assumption does not necessarily hold, and some other means of normalizationshould be undertaken. This is especially true, if all the genes plotted on the slideare relevant to the studied hypothesis.

10.4.3 Control genes

If the whole genome of the organism is printed on the microarray, we can assumethat most of the genes are not expressed, or are expressed in very low quantities. Insuch cases all the genes on the chip can be used for normalization. If only a fewdozen or hundred of genes are present on the chip, some normalization standardsneed to be added on the chip. Suitable controls are housekeeping genes and spikedcontrols.

When using spiked controls, a known labeled DNA target is added to the la-beled cDNA samples. This spiked DNA control hybridizes to its probe on the DNAmicroarray chip. The spiked control should optimally be RNA or DNA from an-other species than the studied one that do not cross-hybridize, but behaves similarlyto your sample RNA or DNA. The expected intensity ratio of the spiked controlsis one. If the observed ratio deviates from expected, the observed ratio is used fornormalization.

Housekeeping genes are now known not to be very reliable controls, becausetheir expression often changes during the experiment. However, it is often possibleto find a set of genes that do not change during a certain experiment. Using such aset of non-changing genes may mean that you need to modify your arrays and re-doyour hybridizations after you have found the working set of genes you wish to useas controls.

Controls are needed in microarrays as in any quality-checked laboratory exper-


iment. Ideally, you would have several types of controls in different concentrationsscattered throughout the printed array. If you use Affymetrix products, chips con-tain controls for the experiment and for the hybridization of the correct target (per-fect match and mismatch,Chapter 2). More about controls can be read in Chapters6 (Experimental design) and 9 (Preprocessing of data).

10.4.4 Linearity of data matters

There are several mathematical procedures and different algorithms (median cen-tering, standardization, lowess smoothing) for normalization from which to choosefrom. Before deciding on the method, the linearity of data should be checked (seeChapter 9). If the data is linear, such procedures as median centering can be ap-plied. And the other way around, median centering can only be applied to lineardata. If the data is non-linear, lowess smoothing or an other local method shouldrather be applied. Moreover, if the data is normally distributed, it can be stan-dardized. Also for certain purposes, for example clustering, the spread of the datashould be standardized.

10.4.5 Basic normalization schemes for linear data

In order to make comparisons between chips of the same experiment possible, everysingle chip has to be normalized. In it’s simplest form this per-chip normalizationmeans median centering as described in section 9.5.2 and Table 10.2. In otherwords, the median intensity (or expression) of every chip is brought to the samelevel. Median centering does not change the spread of the data, which also meansthat the original information content of the data is not altered. Median centering isoften a suitable choice for one-color datasets.

In addition to per-chip normalization, per-gene normalization can be applied.This controls the hybridization efficiency differences between the individual probes.For two-color experiments, the calculation of the intensity ratio corrects for differ-ences between spots, and other normalization may not be needed. However, forone-color experiments, per-gene normalization is usually applied.

One-color experiment can also be normalized against another one-color chip(a reference chip), and handled thereafter as two-color experiment. With regard toAffymetrix data, it is good to remember that the data has already been normalizedand scaled to some extent using specific algorithms (to calculate results for perfectmatch and mismatches) (see Chapter 2).

If per-chip and per-gene normalizations are applied, both individual chips andgenes can directly be compared with each other. This is often the desired situation.

10.4.6 Special situations

Sometimes basic normalization schemes don’t perform well enough. If most of thegenes on a chip are likely to change, or there are spatial biases on the chip, moresophisticated methods should be used. Again, global methods should not be used,if the data is nonlinear, if there are spatial bias on the chip, or if the number ofexpressed genes varies a lot between individual chips. In such cases, local methods


should be used. As already mentioned, lowess smoothing is a a recipe for thenormalization of nonlinear data.

If the expression of most of the genes on the chip are expected to change duringthe experiment, a median centering using all the genes is not a viable option. In suchcases spiked controls or housekeeping genes can be used for the normalization (see9.5.8). These methods correct effectively for the differences between the channels,but if the amount of mRNA hybridized to individual chips varies a lot, errors arelikely to occur.

Spatial biases can be corrected in multiple ways. Probably the easiest methodis to use median centering subarraywise. In other words, a new median is calculatedfor every subarray (or a block of about 500 genes on a chip). This subarray specificmedian is then used for the normalization of that subarray. The same principle canbe applied if the printing tip groups deviate from each other. Sometimes the leftside of the chip has lower expression values than the right side of the chip. Suchspatial biases can also be corrected using the median centering inside one probecolumn or row.

10.5 Mathematical calculations

There are several ways to calculate the normalized intensity ratios. Here, we presenta few of the most commonly used ones. If you want to play around with these,please check the logarithmic rules of calculation before proceeding. For example,the mean centering (subtract the mean or median to get a mean or median of zero)with log-transformed data is equivalent to mean scaling (divide with a certain num-ber to get a mean or median of one) with untransformed data. The next formulasare presented for the log ratios. Usually the normalization is calculated using onlya control channel or a reference chip.

10.5.1 Mean centering

Calculate the mean of the log ratios for one microarray. Produce the centered databy subtracting this mean from the log ratio of every gene.

10.5.2 Median centering

Calculate the median of the log ratios for one microarray. Produce the centereddata by subtracting this median from the log ratio of every gene.

10.5.3 Trimmed mean centering

Remove the most deviant observations (5%) from the data for one microarray. Cal-culate the median of the log ratios of the remaining genes. For centered data, sub-tract this median from the log ratios of every gene.

10.5.4 Standardization

Equivalent to the Z-score calculation. See section 8.5.1.


10.5.5 Lowess smoothing

Lowess-regression is often performed in a space of A versus M. This MA plot, alsocalled RI plot, is equivalent to the usual scatter plot, but the data cloud has beentilted 45 degrees to the right. In other words, M is the log-transformed intensityratio, and A is the average intensity of the two channels or chips. Under and over-expressed genes are much easier to find from this scatter plot, and the linearity ofthe data is also easier to check. Thus, MA plots can reveal the intensity dependencyof log ratios, which appears as curvature, as well as extra variation at low intensityspots (see Fig. 9.1.).

M = log2(R)− log2(G)

A = log2(G)+ log2(R)

2After normalization the M and A can be transformed back to the dye intensities as

R =(

22A+M) 1

2

G =(

22A−M) 1

2

Lowess-normalization works as follows. A lowess-function smoothing is ap-plied to the data and a curve is estimated by the sliding window method. The nor-malized log-transformed intensity ratio is calculated by subtracting the estimatedcurve from the original values. (Note: If a linear function is used for the local re-gression, it is called lowess with a “w”. If a quadratic function is used for the localregression, it is called loess without the “w”.)

Figure 10.2 gives an example of how the lowess-regression treats real life data.

Figure 10.2: Centralization with a lowess-regression. On the left, the log2-normalizeddata (centered around zero) in an A versus M plot. On the right, lowess-normalized data.

Linlog transformation can also be used to remove intensity-dependent varia-tion. The linlog transformation combines linear and logarithmic transformationsthrough a smooth transition to take advantage of both methods, because the lin-ear transformation of raw data is more appropriate for low intensity spots and the


logarithmic transformation is more appropriate for high intensity spots. The linlogtransformation can precede lowess smoothing and in that way stabilize the varia-tion due to the additive error dominant at low intensity, and the multiplicative errordominant at high intensity in microarray data.

10.5.6 Ratio statistics

Data is assumed to follow a certain distribution. This distribution can then be usedto normalize the current data. The ratio statistics method is closely related to esti-mating an error model from the data.

10.5.7 Analysis of variance

If there are data from several replications of the same gene, the ANOVA methodcan be applied to centralize the data. With the ANOVA, many sources of error canbe taken into account at the same time.

10.5.8 Spiked controls

We assume that the intensity ratio from the spiked controls is one, if there is nodye-incorporation bias. If the ratio deviates from one, this information can be usedfor the calculation of the centered intensity values for the genes:

green intensity = red intensity

spiked control intensity ratio

The housekeeping genes are used for normalization in a similar manner.

10.5.9 Dye-swap experiments

Dye-swap experiments can be normalized with the lowess-regression, when theaverage of M and the average of A are plotted against each other as in the usuallowess procedure. New labeling methods should remove the dye-incorporationbias, but meanwhile this normalization method can be handy, especially in the casesthat indicate that the mRNA sequence may influence the labeling efficiency.

If we assume that the expression of any one gene in the original and the dye-swap experiment is of equal magnitude but opposite signs, the normalization ofdye-swap experiments can be performed in a similar way as the normalization ofnon-dye-swap experiments.

The normalized values for the dye-swap experiment can be then calculated as:

1

2

(log2

RG ′

G R′

)− c,

where R and G are the original and R’ and G’ are the dye-swap experiment intensi-ties of the red and green channels, respectively. The normalization constant can beestimated as

c = 1

2

(log2

R

G+ log2

R′

G′

)


In other words, the average of the original and dye-swap chip is calculatedfor every gene. Normalized values are calculated by subtracting the average of thechips from the averaged genewise intensities.

10.6 RMA preprocessing for Affymetrix data

Robust Multichip Average or RMA is a novel preprocessing method for Affymetrixchips. It consists of three parts, background adjustment, quantile normalization,and summarization by median polishing. These steps are outlined in more detailbelow.

Intensity values generated by PM probes only are used in RMA. The back-ground is assumed to be additive, so that the intensity of the PM probe is a sumof background and foreground (“spot”) intensities. It is also assumed that errorin intensity values is multiplicative, i.e., the larger the absolute intensity value, thelarger the error. In addition, background intensities are assumed to be normally dis-tributed and that foreground intensities are exponential. The values of backgroundand foreground intensities can then be estimated based on these assumptions, andforeground intensities can be adjusted accordingly.

Next, the background adjusted intensity values are normalized using a quantilenormalization procedure. Quantile normalization method assumes that the distri-bution of intensity values is similar on every chip. This holds well for genes withlow expression, but not necessarily for highly expressed genes. In pratice, quantilenormalization proceeds as follows. First, the intensity values on every chip are ar-ranged in ascending order. Arranged values for every probe are then averaged overall the chips. Last, the intensity values for the probes are replaced with their meanon every chip, and the means are arranged in the original order that is based on thearranged intensity values. Procedure can be illustrated with the following example(Table 10.1).

Expression values are calculated from rearranged mean (normalized) valuesusing median polishing. Median polishing is an iterative method, which aims tocentralize both column medians and row medians to one. In the case of RMAmedian polishing works on a matrix were every row corresponds to one gene andevery column to one chip. The median for every row is calculated and the median issubtracted from the intensity values so that the row median becomes one. Next, themedian for every column is calculated and the median is subtracted from the inten-sity values so that the column median becomes one. These two steps are repeateduntil the result converges, i.e., until the row and column median do not change anymore.

RMA implementations work with log-transformed data, so at the median pol-ishing phase the row and column medians are actually centered to zero. Accord-ingly, the expression values reported by RMA are already log-transformed.

10.7 Some caution is needed

Most of the aforementioned methods are used for per-chip normalization. Per-genenormalization is not worthwhile if you only have a few chips, because then you


Table 10.1: An example of quantile normalization for Affymetrix RMA-files. See text fordetails.

Original values

chip1 chip2 chip3gene1 2 4 4gene2 3 6 8gene3 3 3 9

Arranged values

chip1 chip2 chip3 meangene1 2 3 4 3gene2 3 4 8 5gene3 3 6 9 6

Mean substituted values


Rearranged mean values


can potentially introduce errors to your data. This is because the mean, standarddeviation, and regression curves can not be effectively estimated from a very smallnumber of observations. You can think how reliable it is to take genewise medianor mean from three values, if you have three chips and three genes or sopts in eachchip, compared to the median calculated in per-chip normalization, when one chipcontains thousands of spots. As one guideline, the experiment should consist of atleast five chips, if you want to perform the per-gene normalization using mean ormedian centering. Even more observations (at least 10-25) are needed in order touse standardization, regression or other advanced normalization method.

Furthermore, using very sophisticated normalization methods can lead to aphenomenon called overfitting, in which case the model (e.g., a linear regression)describes the variability of the data too well. This effectively removes biologicallyrelevant variation from the data, but it also introduces biases to the analysis. It iscommon that the correlation of two chips (especially replicates) slightly decreasesafter normalization. Whether this means that the biologically relevant informationhas been removed, and some noise added, is currently unknown.

The mean and median centering do not usually influence the standard deviationvery much, but the regression and other more sophisticated methods do. Usually


they move towards a higher standard deviation, which might mean more noisy data.Therefore, the normalization procedure should be as simple as possible, yet takingthe systematic errors into account.

10.8 Graphical example

A couple of examples how the normalization procedure affects the graphical repre-sentation of the data are presented in Figure 10.3. These examples show, how thenumber of genes affects the results. In the image a diminishing series of house-keeping genes is used for normalization.

10.9 Example of calculations

A short example of mean centering procedures is presented in Table 10.2. Meancentering can be applied either as per-chip normalization or per-gene normalization.With Affymetrix data, both procedures are often applied. If both procedures areapplied, the per-chip centering should be performed before the per gene centering.Note that the standard deviation is not affected by mean centering.

Figure 10.3: The four images represent the same data processed with per-chip meancentering. The number of genes used for the mean calculation has been varied. From topleft to bottom right: 5776 genes, 57 genes, 577 genes, and 6 genes. Note the varying slopeof the linear regression line fitted to the normalized data.


Table 10.2: An example of per-chip and per-gene mean centering. Because of roundingerrors, the results presented in the table are not necessarily products of individual observa-tions and their means.

Uncentered data

chip1 chip2 chip3 chip4 mean standard dev.gene1 2.12 2.01 4.37 2.01 2.63 1.16gene2 2.20 2.06 4.32 2.03 2.65 1.11gene3 2.18 1.90 4.37 1.90 2.59 1.20gene4 2.15 1.92 4.38 1.89 2.59 1.20gene5 2.14 2.00 4.52 1.99 2.66 1.24gene6 1.93 2.02 4.18 2.01 2.54 1.09gene7 2.26 1.96 4.19 1.98 2.60 1.07gene8 2.07 2.00 4.39 2.01 2.62 1.18gene9 2.25 2.06 4.34 2.04 2.67 1.12gene10 1.95 1.76 3.97 1.82 2.37 1.07

mean 2.13 1.97 4.30 1.97standard dev. 0.11 0.09 0.15 0.07

Per-chip mean centering

chip1 chip2 chip3 chip4 mean standard dev.gene1 0.00 0.05 0.07 0.04 0.04 0.03gene2 0.08 0.09 0.02 0.07 0.06 0.03gene3 0.06 -0.07 0.07 -0.07 -0.01 0.08gene4 0.03 -0.05 0.08 -0.08 0.00 0.07gene5 0.01 0.03 0.21 0.02 0.07 0.10gene6 -0.19 0.05 -0.13 0.04 -0.06 0.12gene7 0.13 -0.01 -0.11 0.01 0.00 0.10gene8 -0.05 0.03 0.09 0.05 0.03 0.06gene9 0.13 0.09 0.04 0.07 0.08 0.04gene10 -0.18 -0.21 -0.33 -0.15 -0.22 0.08

mean 0.00 0.00 0.00 0.00standard dev. 0.11 0.09 0.15 0.07

Per-gene mean centering

chip1 chip2 chip3 chip4 mean standard dev.gene1 -0.51 -0.61 1.74 -0.62 0.00 1.16gene2 -0.45 -0.59 1.66 -0.62 0.00 1.11gene3 -0.40 -0.69 1.78 -0.69 0.00 1.20gene4 -0.43 -0.67 1.80 -0.70 0.00 1.20gene5 -0.52 -0.66 1.86 -0.67 0.00 1.24gene6 -0.60 -0.52 1.64 -0.52 0.00 1.09gene7 -0.34 -0.64 1.60 -0.62 0.00 1.07gene8 -0.54 -0.62 1.77 -0.61 0.00 1.18gene9 -0.42 -0.61 1.67 -0.63 0.00 1.12gene10 -0.43 -0.61 1.60 -0.56 0.00 1.07

mean -0.47 -0.62 1.71 -0.62standard dev. 0.08 0.05 0.09 0.06


10.10 Examples

Please see the book web site at http://www.csc.fi/molbio/arraybook/ for ex-tra material on normalization using computer software.


1. Anonymous. Presentations from Normalization working group, MGED4meeting, Boston, February 2002. See latest information from the MGEDData Transformation and Normalization Working Group’s home page at http://genome-www5.stanford.edu/mged/normalization.html.

2. Beißbarth, T., Fellenberg, K., Brors, B., Arribas-Prat, R., Boer, J., Hauser,N. C., Scheideler, M., Hoheisel, J. D., Schutz, G., Poustka, A., and Vingron,M. (2000) Processing and quality control of DNA array hybridization data.Bioinformatics, 16, 1014-22.

3. Bolstad, B. M., Irizarry R. A., Astrand, M., and Speed, T. P. (2003), A Com-parison of Normalization Methods for High Density Oligonucleotide ArrayData Based on Bias and Variance. Bioinformatics, 19, 185-193.

4. Brazma, A., and Vilo, J. (2000) Gene expression data analysis. FEBS Lett.,480, 17-24.

5. Cui, X, Kerr, M. K., and Churchill, G. A.. (2003) Data Transformations forcDNA Microarray Data. Statistical Applications in Genetics and MolecularBiology, 2, Article 4.

6. Dudoit, S., Y. H. Yang, M. J. Callow, and Speed, T. P.. Statistical methodsfor identifying differentially expressed genes in replicated cDNA microarrayexperiments (Statistics, UC Berkeley, Tech Report # 578).

7. Eickhoff, B., Korn, B., Schick, M., Poustka, A., and van der Bosch, J. (1999)Normalization of array hybridization experiments in differential gene expres-sion analysis. Nucleic Acids Res., 27, e33.

8. Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y. D., Antonellis, K.J., Scherf, U., Speed, T.P. (2003), Summaries of Affymetrix GeneChip probelevel data Nucleic Acids Res., 31, e15.

9. Irizarry, RA, Hobbs, B, Collin, F, Beazer-Barclay, YD, Antonellis, KJ, Scherf,U, Speed, TP (2003) Exploration, Normalization, and Summaries of HighDensity Oligonucleotide Array Probe Level Data. Biostatistics, 4, 249-264.

10. Quackenbush J. (2001) Computational analysis of microarray data. Nat. Rev.Genet., 2, 418-427.

11. Schuchhardt, J., Beule, D., Malik, A., Wolski, E., Eickhoff, H., Lehrach, H.,and Hanspeter, H. (2000) Normalization strategies for cDNA microarrays.Nucleic Acids Res., 28, e47.


12. Sherlock, G. (2001) Analysis of large-scale gene expression data. Briefingsin Bioinformatics, 2, 350-362.

13. Yang, Y. H., Dudoit, S., Luu, P. and Speed, T. P. (2001) Normalization forcDNA Microarray Data. Proceedings of Photonics West 2001: Biomedi-cal Optics Symposium (BiOS),The International Society of Optical Engi-neering (SPIE), San Jose, California. http://www.stat.berkeley.edu/

users/terry/zarray/Html/normspie.html

14. Yang, Y. H., Dudoit, P., Luu, P., and Speed, T.P. (2001) Normalization forcDNA microarray data. In M. L. Bittner, Y. Chen, A. N. Dorsel, and E. R.Dougherty (eds), Microarrays: Optical Technologies and Informatics, Vol.4266 of Proceedings of SPIE.

11 Finding differentially expressed genes 115

11 Finding differentiallyexpressed genes

Jarno Tuimala

After removing the bad quality data we are left with reliable data. As we havealready seen, good quality data can be further filtered so that only the genes thatshow some changes in the expression during the experiment are preserved in thedataset. Often the differentially expressed or otherwise interesting genes are storedas simple lists of gene names, i.e., genelists. Most typically genelists are createdby driving the data through a certain filter, but they can also include lists of genes,which have been produced by clustering analyses or other more complex methods.

11.1 Identifying over- and under-expressed genes

Sometimes it is sufficient to get information about genes that are either under- orover-expressed during the experiment. For example, we might be interested ingenes that have an elevated expression because of a drug treatment. Such genesare most easily found by simple filtering. Simple filtering (by absolute expressionchange) can even be used for experiments, where there are no replicates.

11.1.1 Filtering by absolute expression change

If the log-transformed data is used for analysis, the overexpressed genes have anexpression above zero and under-expressed genes have an expression below zero.However, often the experimental errors are of the order of two, and for finding theover- and under-expressed genes the cutoffs of -1 and 1 are used. All the geneswhich fulfill the filter criteria are saved in a new genelist.

The method can also be used for filtering by expression duration. If the ex-pression change has been at least x for a duration of y, the gene names are savedinto a new genelist.

This kind of genelist can be created in any spreadsheet program, and the actualdata-analysis can be very simple to perform.


11.1.2 Statistical single chip methods

The crude filtering by absolute expression change is not always the optimal methodfor finding the differentially expressed genes, because the information about thereliablity of the expression change is lacking. In the absence of replicates, thereare still a few statistical methods that can be used, if we want to have statisticalsignificance values for expression changes. However, using such methods carriesthe risk that the most unstable probes or mRNAs are identified as differentiallyexpressed.

In the next four sections we will go through some of these methods.

11.1.3 Noise envelope

Noise envelope means simply a method where the standard deviation (calculatedwith a sliding window) is used as a cut-off for under- and over-expressed genes.

The variance in the reported gene expression level is a function of the expres-sion level so that the variance is higher in the lower expression values and lowerwhen the expression is high. Thus, the identification of the differentially expressedgenes calls for a statistical model of the variance or noise. The simplest modelcould be constructed on the basis of the standard deviation of expression values.

Such a model can be constructed relatively easily using statistical programs.The method relies on a segmental (sliding window) calculation of standard devia-tion. A datapoint refers to an (x, y) pairing, where x is the absolute intensity valueof a gene from the hybridization, and y is the corresponding intensity ratio value.Using all data points in a given sliding window of expression values, the standarddeviation of the intensity ratios is calculated. The average of expression valueswithin the window are then paired with the average intensity ratio value within thesame window plus the number of standard deviations (usually 2 or 3) specified bythe experimenter. This new pair becomes the candidate upper noise envelope point.A candidate lower noise envelope points are similarly determined.

Using these noise envelopes, the insignificant expression changes are removed:The region between the upper and lower noise envelopes contains genes with in-significant changes.

11.1.4 Sapir and Churchill’s single slide method

After logarithmic transformation and normalization, Sapir and Churchill’s methodfits a linear regression function to a scatter plot of sample intensity versus controlintensity to yield orthogonal residuals. These residues represent the gene expres-sion changes in the experiment. The expectation-maximization (EM) algorithmapplying the Bayes’ rule is used to obtain estimates of the proportion of differen-tially expressed genes. Using the Bayes’ rule and information on the residuals, the(posterior) probabilities of differential expression are calculated.

The method is implemented in the R language package sma. The results can beobtained either as a genelist or in a form of a scatter plot. The scatter plot exampleis given here (Figure 11.1). The residuals are modeled as a uniform distribution, sothe cut offs are constant for all intensity values.


8 10 12 14

−1.

5−

1.0

−0.

50.

00.

51.

01.

5

A

M

Figure 11.1: A scatter plot with Sapir and Churchill’s method with the cut off linesmarked. The vertical lines represent the cut offs for up- and downregulated genes. Notethat the axes of the plot are normalized log ratio (M) and average intensity (A).

11.1.5 Chen’s single slide method

The idea central to Chen’s method is that the gene expression levels are determinedby the intrinsic properties of each gene, which means that the expression levels varywidely among genes. Therefore it is inappropriate to pool statistics on gene expres-sion differences across the microarray. Assuming that the green and red channelintensities are normally distributed and have similar coefficients of variation and asimilar mean, the confidence intervals for actual differential expression are calcu-lated using the data derived from the maximum likelihood estimate of the coeffi-cient of variation and a set of genes, which do not show any expression change onthe chip.

The method is implemented in the R language package sma. The results can beobtained either as a genelist or in a form of a scatter plot. The scatter plot exampleis given here (Figure 11.2).


8 10 12 14

−1.

5−

1.0

−0.

50.

00.

51.

01.

5

A

M

Figure 11.2: A scatter plot with Chen’s method with the 95% (inner, light blue) and 99%(outer, dark blue) cut off lines marked. The vertical lines represent the cut offs for up-and downregulated genes. Note that the axes of the plot are normalized log ratio (M) andaverage intensity (A).

11.1.6 Newton’s single slide method

Newton’s method models the measured expression levels using terms that accountfor the sources of variation. The first obvious source is measurement error. Thesecond source of variation is due to the different genes spotted onto the microarray.Newton’s method can be viewed as a hybrid of Chen’s and Sapir and Churchill’smethods, because it models the variability on the slide very similarly to Chen’smethod, but the mathematical calculations are done with the EM-algorithm similarto the one Sapir and Churchill used. Results are presented as log-odds ratios thatthe gene is differentially expressed.

The method is implemented in the R language package sma. The results canbe obtained either as a genelist or in the form of a scatter plot. The scatter plotexample is given here (Figure 11.3).


8 10 12 14

−1.

5−

1.0

−0.

50.

00.

51.

01.

5

A

M

Figure 11.3: A scatter plot with Newton’s method with the log-odds ratios 0, 1 and 2 cutoff lines marked. The vertical lines represent the cut offs for up- and downregulated genes.Note that the axes of the plot are normalized log ratio (M) and average intensity (A).

11.2 What about the confidence?

Say, we have found a gene that is threefold upregulated after the drug treatment.How do we know that the result is not just an experimental error? To assess thatwe need to determine the statistical significance of the upregulation of the gene.Significance can only be assessed, if replicate measures of the same gene wereperformed during the experiment. The chips (or expression of invidual genes onthem) can be compared by the t-test (two conditions) or ANOVA (more than twoconditions).

11.2.1 Only some treatments have replicates

If the gene expression has been measured for control and treatment, but only treat-ment has been replicated, the experimental error can be crudely estimated fromthe variation between the replicates. The standard deviation is determined for ev-ery gene, and the expression change is compared with the standard deviation. Themore the change exceeds the standard deviation between replicates, the more sig-nificant the gene is. Note that this method does not allow for the calculation ofp-values.


11.2.2 All the treatments have replicates: two-sample t-test

The statistical basis for the t-test has been covered in detail elsewhere (see 6.9), butbriefly, the two-sample t-test looks at the mean and variance of the two distributions(say, control and treatment chip log ratios), and calculates the probability that theywere sampled from the same distribution. The t-test can be applied successfully insituations, where both the control and treatment have been repeated.

Here (Figure 11.4), we present an example of a graphical t-test result, whichhas been produced using R language supplemented with package sma.

Figure 11.4: R-package sma automatically produces these four plots for t-test results.The genes that have an expression pattern most significantly deviating from zero are high-lighted with a colored star.

After calculating the t-test p-values for the replicated genes, the ones with thelowest p-value (marked with a star in the images in Figure 11.4) can be saved intoa new genelist and used in further analyses, for example cluster analysis. Theseare the genes that most significantly differ between two conditions, say control andtreatment mice.


11.2.3 All the treatments have replicates: one-sample t-test

Up- and downregulated genes can be more effectively found, if the chips are repli-cated. In such cases, the one-sample t-test for the deviation from zero (expectedmean) can be performed. Finding the differentially expressed genes is even moreeasier, if the so called Volcano plot is produced (Figure 11.5). This plot of log ratioversus statistical significance shows also which is the lowest expression change wecan be statistically confident in. In our example, 2-fold down-regulated genes havea p-value of 0.01. In contrast, 1.4-fold upregulated genes have a similar p-value.This indicates that new chip technology coupled with an appropriate number ofreplicates can produce reliable data even for very small expression changes.

−4 −2 0 2 4

01

23

4

log2(ratio)

−lo

g10(

p)

Figure 11.5: Volcano plot, where the statistical significance by one-sample t-tests[-log10(p)] is plotted against normalized log ratio [log2(ratio)]. Vertical dashed lines rep-resent a 2-fold difference between the control and the sample. Horizontal dashed linerepresents the t-test p-value of 0.01, and the solid horizontal line represent p=0.001.

11.2.4 Non-parametric tests

Instead of parametric t-test, which assumes that the expression values are nor-mally distributed, non-parametric tests like Mann-Whitney U test (two groups) orKruskal-Wallis test (two or more groups) can be applied, especially if the expres-sion values are not normally distributed.


11.3 Examples

Please see the book web site at http://www.csc.fi/molbio/arraybook/ for ex-tra material on finding differentially expressed genes using computer software.


1. Chen, Y., Dougherty, E. R., and Bittner, M. L. (1997) Ratio-based deci-sions and the quantitative analysis of cDNA microarray images. Journal ofBiomedical Optics, 2, 364-374.

2. Newton, M. N., Kendziorski, C. M., Richmond, C. S., Blattner, F. R., andTsui, K. W. (1999) On differential variability of expression ratios: Improv-ing statistical inference about gene expression changes from microarray data.Technical Report 139, Department of Biostatistics and Medical Informatics,UW Madison, http://www.stat.wisc.edu/∼newton/papers/

abstracts/btr139a.html.

3. Sapir and Churchill (2000), Estimating the posterior probability of differen-tial gene expression from microarray data, http://www.jax.org/research/churchill/.

4. SMA-package for R, http://www.stat.berkeley.edu/users/terry/

zarray/Software/smacode.html.

12 Cluster analysis of microarray information 123

12 Cluster analysis ofmicroarray information

Mauno Vihinen and Jarno Tuimala

12.1 Basic concept of clustering

Microarray experiments generate mountains of data, which has to be stored andanalyzed. For humans it is difficult to handle very large numeric data sets. There-fore a general concept is to try to reduce the dimensionality of the data. A numberof clustering techniques have been used to group genes based on their expressionpatterns. The user has to choose the appropriate method for each task. The basicconcept in clustering is to try to identify and group together similarly expressedgenes and then try to correlate the observations to biology. The idea is that coregu-lated and functionally related genes are grouped into clusters. Clustering providesthe framework for this analysis. The hard part is to analyze the biological processesand consequences. Clustering can be a very useful tool together with visualizationof the information.

Before clustering a large number of requirements have to be met. The data hasto be of good quality, the chip design should be correct, one should have interestinggenes contained in the chip, sample preparation has to be flawless, data has tobe properly treated (outliers, normalization etc). Clustering will not improve thequality of the data. If poor-quality data is used then also the outcome is useless:garbage in, garbage out.

12.2 Principles of clustering

Clustering organizes the data into a small number of (relatively) homogeneousgroups. Unusually normalization of the expression values is used. At this stage,it is of interest to look at the changes in expression patterns, not to follow the ac-tual numeric changes. Thus, the methods are used to find similar expression motifsirrespective of the expression level. Therefore, both low and high expression levelgenes can end up in the same cluster if the expression profiles are correlated byshape.

The majority of clustering methods has been available already for long. Dur-


ing the last few years, some new methods dedicated to microarray analysis havebeen developed, too. The methods can be described and classified in differentways. Here, it is not possible to go into details of all the methods, so only themost commonly used methods are described.

Clustering methods can be grouped as supervised and unsupervised. Super-vised methods assign some predefined classes to a data set, whereas in unsuper-vised methods no prior assumptions are applied.

Hierarchical clustering, K-means, self-organizing maps (SOMs), principal com-ponent analysis (PCA), and K-nearest neighbor (KNN) have been commonly used.There are also other methods, such as multidimensional scaling (MDS), minimumdescription length (MDS), gene shaving (GS), decision trees, and support vectormachines (SVMs).

12.3 Hierarchical clustering

Hierarchical clustering is a statistical method for finding relatively homogeneousclusters. Hierarchical clustering consists of two separate phases. Initially, a dis-tance matrix containing all the pairwise distances between the genes is calculated.Pearson’s correlation or Spearman’s correlation are often used as dissimilarity es-timates, but other methods, like manhattan distance or euclidian distance can alsobe applied. If the genes on a single chip need to be clustered, the euclidian dis-tance is the correct choice, since at least two chips are needed for calculation ofany correlation measures.

After calculation of the initial distance matrix, the hierarchical clustering al-gorithm either iteratively joins the two closest clusters starting from single clusters(agglomerative, bottom-up approach) or iteratively partitions clusters starting fromthe complete set (divisive, top-down approach). After each step, a new distancematrix between the newly formed clusters and the other clusters is recalculated. Ifthere are N cases, N-1 clustering steps are needed. There are several methods ofhierarchical cluster analysis including

• single linkage (minimum method, nearest neighbor)

• complete linkage (maximum method, furthest neighbor)

• average linkage (UPGMA).

For a set of N genes to be clustered, and an NxN distance (or similarity) matrix,the hierarchical clustering is performed as follows:

1. Assign each gene to a cluster of its own.

2. Find the closest pair of clusters and merge them into a single cluster.

3. Compute the distances (similarities) between the new cluster and each of theold clusters using either single, average or complete linkage method.

4. Repeat steps 2 and 3 until all genes are clustered.


Step 3 can be performed in different ways depending on the chosen approach.In single linkage clustering, the distance between one cluster and another is con-sidered to be equal to the shortest distance from any member of one cluster to anymember of the other cluster. In complete linkage clustering, the distance betweenone cluster and another cluster is considered to be equal to the longest distancefrom any member of one cluster to any member of the other cluster. In averagelinkage, the distance between one cluster and another cluster is considered to beequal to the average distance from any member of one cluster to any member of theother cluster.

The hierarchical clustering can be represented as a tree, or a dendrogram.Branch lengths represent the degree of similarity between the genes. The methoddoes not provide clusters as such. Conceptually, different clusters and sizes of clus-ters can be obtained by moving along the trunk or branches of the tree and decidingon which level to put forth branches (cut the tree).

Hierarchical clustering is often applied in the analysis of patient samples toorganize the data based on the cases. Indeed, for patient samples the hierarchicalclustering method is most often the best option. Usually, in addition to patient-based organization, genes are also clustered by applying two-way clustering. Onone axis are the samples (patients) and on the other axis the genes (Figure 12.1).

Figure 12.1: An example of hierarchical clustering.

12.4 Self-organizing map

Kohonen’s self-organizing map (SOM) is a neural net that uses unsupervised learn-ing for which no prior knowledge of classes is required. SOMs are usually usedto visualize and interpret large high-dimensional data sets. In SOM, every input is


connected to every output via connections with variable weights. Also, the outputnodes are highly interconnected. SOM tries to learn to map similar input vec-tors (gene expression profiles) to similar regions of the output array of nodes (Fig-ure 12.2). The method maps the multidimensional distances of the feature space totwo-dimensional distances in the output map. The SOM algorithm is iterative.

The map attempts to represent all the available observations with optimal accu-racy using a restricted set of models. At the same time, the models become orderedon the grid so that similar models are close to each other and dissimilar modelsfar from each other. So the order and organization of the nodes (tentative clusters)contain more information than just the actual partition of genes to clusters.

In SOMs, the number of clusters has to be predetermined. The value is givenby the dimensions of the two-dimensional grid or array. One has to experimentwith the actual number of clusters. Most programs facilitate the analysis of all thegenes within a cluster or clusters. One should find such array dimensions that thereis a minimum number of poorly fitting genes in the clusters. The actual number ofclusters is also difficult to assign, but the square root of the number of genes is agood initial estimate. However, it depends on the data set.

Figure 12.2: An example of self-organizing map.

12.5 K-means clustering

K-means is a least-squares partitioning method for which the number of groups, K,has to be provided. The algorithm computes cluster centroids and uses them as newcluster seeds, and assigns each object to the nearest seed (Figure 12.3). However,it is also possible to estimate K from the data, taking the approach of a mixturedensity estimation problem.

1. The genes are arbitrarily divided into K centroids. The reference vector, i.e.location of each centroid, is calculated.

2. Each gene is examined and assigned to one of the clusters depending on theminimum distance.


3. The centroid’s position is recalculated.

4. Steps 2 and 3 are repeated until all the genes are grouped into the final re-quired number of clusters.

During the course of iterations, the program tries to minimize the sum, overall groups, of the squared within-group residuals, which are the distances of theobjects to the respective group centroids. Convergence is reached when the objec-tive function (i.e. the residual sum-of-squares) cannot be lowered any more. Theobtained groups are geometrically as compact as possible around their respectivecentroids (Figure 12.4).

K-means partitioning is a so-called NP-hard problem (there is no known algo-rithm that would be able to solve the problem in polynomial time), thus there is noguarantee that the absolute minimum of the objective function has been reached.Therefore, it is good practise to repeat the analysis several times using randomlyselected initial group centroids, and check whether these analyses produce compa-rable results.

Figure 12.4: An example of k-means clustering with four clusters.

12.6 KNN classification - a simple supervised method

K-nearest neighbor (KNN) is a simple classification method. It has been usedfor finding a set of genes that differentiates two or more groups of samples (here,chips). For example, Golub et al. (1999) successfully applied the method for dis-tinguishing between acute lymphoblastic and acute myeloid leukemia. Here, themethod introduced by Golub is outlined.

The idea is to build a classifier, i.e., find the set of genes that differentiates thegroups of interest. For optimal performance we would need a training set and a testset. Training set is used for building the classifier, and the test set is used for the


Figure 12.3: An example of K-means algorithm. A Nine genes are to be clustered. BGenes are divided into three clusters (black, red and green) randomly. C Cluster centroids(filled circles) are calculated as an average of gene expression for every cluster, and thedistance of genes to these centroids is calculated. D Genes that are closest to the blackcluster centroid are reassigned to black cluster, and the same is repeated for red and greenclusters. E Cluster centroids are recalculated using the newly assigned genes. F Steps C-Eare repeated until the genes stabilize to their clusters, and the final results is reported forthe user.


validation of the classifier. If no test set is available, the classifier can be validatedusing a leave-one-out cross-validation, but the performance is likely to be inferiorto the case where a test set is available.

The KNN method has three phases, neighborhood analysis, class prediction,and validation. In neighborhood analysis, every gene is represented by an expres-sion vector that consists of the gene’s expression in every tumour sample. Thetumour classes are represented by an idealized expression vector, where one groupis marked with 0, and the other group with 1. A correlation between the gene’sexpression vector, and the idealized expression vector is calculated using some cor-relation measure, and the result is verified using permutation testing. Only thegenes with the best correlation measures are selected for the class prediction phase.

In the class prediction phase, the group is predicted for every sample. Theprediction of the sample group is based on votes. Each gene that was selected inthe neighborhood analysis phase votes for either of the groups. The vote is basedon the similarity (correlation) of the gene’s expression to the mean expression ofthe groups. If the gene’s expression is closer to the group 1, it gives a vote forgroup 1, and is it’s expression is closer to the mean of the group 0, it votes for 0.The votes for each group are summed to obtain total votes. The sample is assignedto the group with a higher number of votes, i.e., a majority rule decision is made.

In the last phase the performance of the classifier is tested or validated. Vali-dation is needed to rule out the effect of sampling error on the construction of theclassifier. The most reliable way to test the classifier is to collect a new group ofsamples, and test how well the classifier performs on them. If the performance isbad, a more reliable classifier might be built, and the whole process starts again.If the performance is good, the classifier might be used, e.g., to construct a clini-cal test for the differentiation of the groups, if needed. If no test set is available,cross-validation can be used. In cross-validation, one by one, the samples are leftout of the training set, the classifier is built, and it’s preformance is tested on thesample that was left out. This leave-one-out procedure goes through every sampleof the set, and stops when all the samples have been left out once. The genes thatperform consistently for every cross-validation sample (or step), are then selectedfor the final classifier.

KNN usually performs quite well, when the number of nearest neighbors (K)used for building the classifier is small (10-20% of the group), otherwise the mis-classification could become a problem if the groups are not well separated.

12.7 Principal component analysis

Objectives of principal component analysis (PCA) are to discover or to reduce thedimensionality of the data set and to identify new meaningful underlying variables.PCA transforms a number of (possibly) correlated variables into a (smaller) num-ber of uncorrelated variables called principal components (Figure 12.5). The basicidea in PCA is to find the components that explain the maximum amount of vari-ance possible by n linearly transformed components. The first principal componentaccounts for as much of the variability in the data as possible, and each succeedingcomponent accounts for as much of the remaining variability as possible.


PCA can be also applied when other information in addition to the actual ex-pression levels is available (this applies to SOM and K-means methods as well).The method provides insight into relationships, but no matter how interesting theresults may be, they can be very difficult if not impossible to interpret in biologicalterms.

Figure 12.5: An example of principal component analysis. The two most significantprincipal components have been selected as the axes of the plot.

12.8 Pros and cons of clustering

There are no objective rules for determining the correct number of gene expressionclusters. Therefore, an extensive manual analysis is required to find out the optimalnumber of clusters. The expression patterns can be visually inspected, and thedistribution of every gene from the centroid analyzed along with the variation fromthe other genes within clusters. In suboptimal clusters, some genes do not followthe general expression pattern within a cluster due to clearly different expressionprofiles. It is important to find out a working solution for the number of clusters,because the subsequent analyses and data mining rely on the partition obtainedby clustering. The only exceptions are methods using supervised learning, i.e., inaddition to clustering information other data is used to organize the genes. Theproblem with supervised methods is that they may force the data to behave in acertain way. After all, we cannot know what is the correct partitioning for a givendata set. The usefulness of the data for interpretation of biological processes is infact the true test for the correctness of clustering.


Clustering methods provide a relatively easy way to organize the cluster in-formation. Together with visualization methods they allow for the user an intuitiveway of looking at, understanding and analyzing the data. Different clustering meth-ods and the same method with different premises produce different end results, sothe user has to try to find out a useful result. In many articles the analysis is finishedat clustering. This is a mistake, since clustering is actually only a starting point formore detailed and interesting studies.

For the clustering methods to work properly, the data should be well separatedmeaning that clusters are easily separable. However, this is usually not the case inmicroarray studies. Diffuse or interpenetrating groups cause problems in the de-termination of cluster boundaries during the clustering, leading to different resultswhen using different methods and even the same method with different parameters.Therefore, the user has to consider clusters as a working partitioning, not as thefinal truth.

Single linkage is not the best approach for hierarchical clustering, because theresult can be remarkably skewed. The method is good for picking outliers that areconnected in the very last steps of the process. Complete linkage tends to producevery tightly packed clusters. The method is very sensitive for the quality of thedata. K-means is one of the simplest and fastest clustering methods. The result isdependent on the initial location of centroids. This problem can be overcome byapplying a number of intialization approaches.

KNN is the simplest classification method, and as such can only fit a linear dis-criminator to the dataset. As DNA microarray data is typically multidimensional,and groups can not necessarily be distinguished using a linear discriminator, KNN’sperformance might not be optimal. If the KNN classification produces many mis-classifications, it might be better to use, e.g., support vector machines that can fitpolynomial discriminators, instead.

12.9 Visualization

Gene expression experiments may include millions of individual data items. It isnot possible to comprehend such an overwhelming amount of information withoutproper visualization methods. Scatter plots are very useful for the initial analysisand comparison of data sets (Figure 12.6).

It is customary to present the clustering results by grouping genes of clustersnext to each other (Figure 12.7). In SOM based analysis, also the order of the clus-ters contains information about the relationships between clusters. For the manualanalysis of the goodness of clustering, one has to look at the expression patternswithin the generated clusters. Genes within a cluster should follow the average ex-pression pattern of the cluster. Because every gene has to be assigned to a cluster,genes with unique expression patterns do not fit well in any group. The reason forthis behavior may be truly different expression profile or it may originate from ex-perimental errors or difficulties. It is impossible to find the reason without furtherexperimental tests unless the behavior is consistent in independent experiments.


Figure 12.6: An example of scatter plot.

Figure 12.7: An example of expression pattern within one cluster.

In microarray studies, usually a red/green color scheme is applied. The dataare shown in a matrix format, with each row representing all the hybridization re-sults for a single gene of the array, and each column representing the measuredexpression level for all genes at a single time point. To visualize the results, theexpression level of each gene is represented by a color, red representing overex-


pression and green underexpression within each row. The color intensity representsthe magnitude of deviation. It is naturally possible to use other colors, too, and ithas been recommended to use a blue/yellow scheme instead, and avoid the prob-lems with color blind persons. Most visulization packages allow free choice of thecolors.

Red/green figures can be found from most microarray analysis articles. Thesefigures provide a general overview of the data. With large data sets, it is not possibleto identify or even mark individual genes into such figures. Often either moredetailed figures of certain types or groups of genes are provided or the whole datais in a (web accessible) appendix. From the beginning of 2003, major journals haverequired the data to be submitted to data repositories (see chapter 7).

12.10 Function prediction

Genes that fall into the same cluster might have a similar transcription response to acertain treatment. It is likely that some common biological function or role is actingin the background. If the cluster has a bunch of genes with known and similar func-tions, the characteristics of unknown genes in the cluster may be inferred. Functionprediction can be assisted with additional data on sequence similarity, phylogeneticinference based on sequences, posttranslational modifications, cellular destinationsignals, and so on. Taken together, a sufficiently large number of common featurescan be used to make fairly accurate predictions of gene functions.

12.11 Examples

Please see the book web site at http://www.csc.fi/molbio/arraybook/ for ex-tra material on clustering and visualization using computer software.


1. Brazma, A., Vilo, J. (2000) Gene expression data analysis, FEBS Lett., 480,17-24.

2. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov,J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C.D., Lander, E. S. (1999) Molecular classification of cancer: class discoveryand class prediction by gene expression monitoring, Science, 286, 531-7.

Part III

Data mining

13 Gene regulatory networks 135

13 Gene regulatorynetworks

Tomi Pasanen, and Mauno Vihinen

13.1 What are gene regulatory networks?

By combining gene expression analysis, perturbations or treatments, and mutationsof genes we can study processes like signal transduction and metabolism yieldinginformation on molecular effects or functions of specific genes. However, geneexpression data permits us to go beyond the traditional line of research and to studyfiner structures of molecular pathways exposing causal regulation relations betweengenes. The finer structures do not only describe the nature of interactions betweengenes, inhibition or activation, but also exemplify direct and indirect effects ofgenes. For example, is gene A regulating gene B or vice versa, is the regulationdirect or indirect where there is a mediating gene C (or maybe many) so that Aregulates C and then C regulates B. Inspecting the finer structures, which arecalled regulatory networks, gives us a more intricate view of molecular interactionsoffering further possibilities for medical interventions.

Inferring regulatory networks from expression data, a process which is calledreverse-engineering, is not a computationally simple problem because an enormousamount of time is needed even when a trivial approach is applied. Therefore, wesketch principles and methods with an example approach applicable for the infer-ence process in next sections.

13.2 Fundamentals

It is customary to visualize regulatory dependencies between genes by a graphG(V , E) with a vertex set V consisting of the genes examined and an edge setE , see Figure 13.1. A directed edge E (i, j ) between vertices Vi and Vj representsregulation between genes attached to vertices Vi and Vj , and the direction of theedge, denoted by Vi → Vj , reflects the order of regulation where gene Vi regulatesgene Vj (up- or downregulation). Node Vi is called a parent of V j and node V j iscalled a child of Vi . To present logical structures of group regulations, which ap-pear extensively in nature, hyper edges should be used, where an edge is adjacent to


several genes, but these are used quite rarely because of the convenience of simpledrawings. A graph like G is called gene regulatory network.

V1

V2

V3 V4

Figure 13.1: Graph G consisting of four nodes V1, V2, V3, V4, and tree edges E(1,2), E(1,3),and E(3,4).

By inspecting a putative regulatory graph, like G, build from earlier experi-ments, we can pose several questions. First of all, is the structure right, or if thereare several possible structures with respect to the experiments done so far, whichone is right? So a graph represents hypotheses based on current knowledge. Bylocating central nodes or nodes of interest we can conduct additional experimentsto confirm the dependencies around nodes or get an idea which of the graphs isthe correct one. Choosing the best node to inspect more carefully is not trivial,at least not, when we try to make distinctions between putative graphs and we tryto minimize experiment costs at the same time. To this category belongs also thequestion: is the order of regulation correct? Manual perturbation, like the deletionof a gene, might give a clear clue on the order of regulation. Finally, if we stickto the obtained structure there is still questions on the strength of each regulation,whether they are they right.

But before all the above questions, we need to know how to construct such agraph directly from data, because, in theory, any kind of an edge is possible betweenany set of vertices, and so the problem is to infer the true connections based on thedata produced by the experiments. This approach is called reverse-engineeringof gene regulatory networks. It is interesting to note that similar to biochemicalreactions, sequences, and three dimensional structures having their own motifs, itis clear that regulatory networks have to have their own regulatory motifs or circuitmotifs based on the lower level motifs.

To infer a regulatory network, several distinct expression samples of the genesof discourse are needed. In a time series analysis, the expression levels are recordedalong fixed time points, while in a perturbation analysis, the expression levels arerecorded separately for each manual perturbation (over/under, deletion) of specificgenes. A fundamental difference between perturbation data and time series data isthat perturbation allows a firm inferring order of regulation while time series datacan only reveal the probable regulation direction. The problem can be overcome bycombining a few manual perturbation experiments with time series data. In bothcases, the genes are monitored at fixed points inducing an expression matrix

{Xi [ j ] | 1 ≤ i ≤ n,1 ≤ j ≤ m} ,

where Xi [ j ] denotes the expression level X i of gene or node Vi at point j , seeTable 13.1. Based on the data, we can try to infer the most prominent regulatory


Table 13.1: Top, part of the table for time series expression data of genes SWI5, CLN2,CLN3, and CLB1 [1]. The mean of whole data was 0. Bottom, the expression levels arediscretized so that values higher than or equal to the mean get a value of 1 and values lowerthan the mean get a value of 0.

gene t1 t2 t3 t4 t5 t6 t7 t8 t91 (SWI5) -0.41 -0.97 -1.46 0.16 0.74 0.72 1 0.77 0.32 (CLN3) 0.49 0.62 0.05 -0.13 0.02 0.04 -0.14 0.24 0.913 (CLB1) 0.6 -0.53 -1.37 1.03 1.13 1.27 1.04 1 0.074 (CLN2) -1.26 1.6 1.54 0.31 -0.14 -0.88 -1.7 -1.88 -1.7

1 (SWI5) 0 0 0 1 1 1 1 1 12 (CLN3) 1 1 1 0 1 1 0 1 13 (CLB1) 1 0 0 1 1 1 1 1 14 (CLN2) 0 1 1 1 0 0 0 0 0

subnetworks, or, if we know the structure of some subnetwork beforehand, we cantry to model mathematically the regulations between nodes in the subnetwork.

Because the number of possible networks increases super-exponentially on thenumber of genes, prominent regulatory networks are usually searched for by usingsupervised approaches, where some intrinsic partial knowledge about regulationsbetween genes are formulated by implicit or explicit rules. For the mathematicalmodeling of regulation inside a network, there are many approaches like Bayesiannetwork, Boolean network and its generalization, ordinary and partial differentialequations, qualitative differential equations, stochastic master equations, Petri nets,transform grammars, process algebra, and rule-based formalisms, to mention some.

We will use Bayesian network as a sample tool for several reasons: its frame-work offers a clear separation of structure and parameter optimization, and addingpredefined rules and information is easy. Moreover, it is widely used for microarraydata.

13.3 Bayesian network

Bayesian network modelling consists of two parts: the qualitative part, where direct(causal) influences between nodes V are depicted as directed edges E in graphG, and the quantitative part, where local conditional probability distributions ofexpression levels are attached to nodes V of G. Thus, expression levels X i ofnodes Vi are considered as random variables and the edges represent conditionaldependencies between distributions of the random variables.

Let us consider a binary case where a gene can be “off”, denoted by 0, or“on”, denoted by 1, but not both (see the bottom values of Table 13.1). Having afixed structure like graph G in Figure 13.1, the conditional probabilities are easy tocalculate, see Table 13.2.

The benefit of this modeling is that we need to store only O(2 kn) parameterswhen each node has at most k parents, i.e., one “success” probability for each rowin each table just calculated.


Table 13.2: Conditional probabilities based on the structure of G and the bottom valuesof Table 13.1.

Pr(X1 = 1)6/9

X1 Pr(X2 = 1)1 4/60 1

X1 Pr(X3 = 1)1 10 1/3

X3 Pr(X4 = 1)1 1/70 1

This is clear advantage over O(2n) parameters when the expression value of anode depends on the expression values of all other nodes. Thus, parameters pertainonly to local interaction.

Underlying the above discussion there are some basic probability rules that areexamined next. The joint probability for random variables X and Y (not necessarilyindependent) is calculated using the conditional probability:

Pr(X ∩Y ) = Pr(X ,Y ) = Pr(Y | X)Pr(X) = Pr(X | Y )Pr(Y ) ,

where notation Pr(X | Y ) denotes the probability of observation X after we haveseen evidence Y , see Figure 13.2. Reorganizing the equation induces Bayes’ rule

Pr(X | Y ) = Pr(Y | X)Pr(X)

Pr(Y )

while generalization of it forms the chain rule

Pr(K , X ,Y , Z ) = Pr(Z | K , X ,Y )Pr(K , X ,Y )

= Pr(Z | K , X ,Y )Pr(Y | K , X)Pr(K , X)

= Pr(Z | K , X ,Y )Pr(Y | K , X)Pr(X | K )Pr(K ) .

If variables X and Y are independent, then

Pr(X ∩Y ) = Pr(X)Pr(Y ) ,

and if variables X and Y are independent for given a value of Z , then

Pr(X | Z ,Y ) = Pr(X | Z ) .

Based on our dependency graph G, probability Pr(X 1, X2, X3, X4) can be simplifiedas follows

Pr(X1, X2, X3, X4) = Pr(X1)Pr(X2 | X1)Pr(X3 | X1)Pr(X4 | X3) .

For example, Pr(X4 | X1, X2, X3) = Pr(X4 | X3) because for given the value of X 3,X4 is independent of the values of X 1 and X2. So, it is enough to consider onlydirect local dependencies from parents when evaluating the probabilities (for X 4 theonly parent is X3). We see that in Bayesian networks the probability calculationsfollow in a natural way the structure of a graph.


�

X YX ∩Y

Figure 13.2: A Venn diagram of the probability space �, where the areas denote proba-bilities, for example, Pr(X), is the left circle area divided by the total area of rectangle �.It is easy to see that Pr(X ∩Y ) = Pr(Y | X)Pr(X).

13.4 Calculating Bayesian network parameters

Having graph G and the expression matrix D = {X i [ j ]}, our aim is to obtain dis-tribution dependency parameters θ = {θ1,θ2, . . .} that are fitted best to the structureof G and data D. This is done by using the maximal likelihood principle where wemaximize the likelihood function L(θ : D):

L(θ : D) = Pr(D : θ ) =m∏

j=1

Pr(X1[ j ], X2[ j ], . . . Xn[ j ] : θ ) .

By using the structure of G (see again Figure 13.1) we can decompose our opti-mization problem to independent subproblems:

L(θ : D) =m∏

j=1

Pr(X1[ j ] : θ1) ·m∏

j=1

Pr(X2[ j ] | X1[ j ] : θ2)

·m∏

j=1

Pr(X3[ j ] | X1[ j ] : θ3) ·m∏

j=1

Pr(X4[ j ] | X3[ j ] : θ4) ,

or, in general, ∏i

Li (θi : D) .

The network structure decomposes the parameter set θ to independent parametersθ1, θ2, θ3 and θ4, which are computationally fast to estimate.

For simplicity, we consider again the binary case, see bottom of Table 13.1.Since V1 has no parents, i.e., it is independent of other nodes, its likelihood functionL1(θ1 : D) corresponds to the binomial probability where the probability of gettingvalue 1 is θ1 while 1− θ1 is the probability of getting value 0. Thus,

L1(θ1 : D) = Pr(D : θ1) =m∏

j=1

Pr(Xi [ j ] : θ1) = (θ1)N(X1=1)(1− θ1)m−N(X1=1) ,

where N(·) denotes the number of each occurrence in data with respect to the re-quirements inside parentheses. The estimator θ̂1 for θ1 maximizing the product is


simply the already calculated probability for Pr(X 1 = 1), i.e.,

θ̂1 = N(X1 = 1)

N(X1 = 1)+ N(X1 = 0)= 6/9.

Similarly, the conditional estimators for parameters θ (2|X1), θ(3|X1) and θ(4|X3) arethe probabilities already calculated:

θ̂(2|X1=1) = 4/6, θ̂(2|X1=0) = 1,θ̂(3|X1=1) = 1, θ̂(3|X1=0) = 1/3,θ̂(4|X3=1) = 1/7, and θ̂(4|X3=0) = 1.

Together these estimators form θ̂ = {θ̂1, θ̂(2|X1), θ̂(3|X1), θ̂(4|X3)}, which is the bestexplanation for the observed data D restricted to the structure of G, and we cancompute

L(θ : D) = (θ̂1)6 · (1− θ̂1)3 · (θ̂(2|X1=1))4 · (1− θ̂(2|X1=1))2

· (θ̂(3|X1=0))1 · (1− θ̂(3|X1=0))2 · (θ̂(4|X3=1))1 · (1− θ̂(4|X3=1))6

= (6/9)6 · (3/9)3 · (4/6)4 · (2/6)2 · (1/3)1 · (2/3)2 · (1/7)1 · (6/7)6

as the maximal likelihood value for graph G. Note that we have dropped out terms1x = 1 and 00 = 1.

Usage of more states for genes like “low”, “medium”, and “high” follows themultinomial theory conveniently generalizing the binomial theory. Using the pa-rameters, we can formulate hypothetical behaviors of genes in different situationsfixing some expression levels and observing others; moreover, updating the param-eters is easy as new data arrives.

13.5 Searching Bayesian network structure

Given a graph G, we know now how to calculate the parameter set θ G maximizingthe likelihood score L(θ G : D). To restrict the search space of possible networks, itis practical to have constrains on which kind of networks are allowed, that is, whichdependencies are possible between genes. This helps a lot because of the super-exponential number of possible networks; for example, the number of possiblegraphs consisting of four nodes is 64 since there are

(42

)possible (undirected) edges,

each of which can be taken to form a graph making 2 (42) different graph in total. For

evaluation of the resulting candidate networks, we use again the likelihood scorecalculation of a network structure using maximum likelihood estimates of θ G whenthe structure is searched

L(G,θG : D) =∏m

Pr(X1[m], X2[m], . . . , Xn[m] : G,θG )

∏m

∏n

Pr(Xn[m] | Parents(Xn)[m] : G,θGn ) .

Here, Parents(Xn) denotes the expression values of all parents of node Vn and θGn

denotes the parameter of node Vn in graph G. Instead of direct calculation it is more


V1V3

V2

V4

Figure 13.3: A different structure explaining conditional dependencies between the ex-pression values of nodes V1, V2, V3, V4.

convenient to use a logarithm of L(G,θ G : D) for comparing different structures.After simplifying we have a surprisingly simple formula

log(L(G,θ G : D)) = mn∑

i=1

(I(Vi ,Parents(Vi ))− H(Vi )) ,

where H(Vi ) is the entropy of expression values X i of node Vi

H(Vi ) =∑

x=0,1

Pr(X1 = x) lg1

Pr(X1 = x)

which can be ignored because it is the same for all network structures. The functionI(Vi ,Parents(Vi )) is the mutual expression information between node Vi and itsparent nodes Parents(Vi ) = {P1, P2, . . .} defined by∑

x=0,1(∀i):xi =0,1

Pr(Xi = x⋂

i Pi = xi ) lgPr(Xi = x

⋂i Pi = xi )

Pr(Xi = x)∏

i Pr(Pi = xi ).

Function I(Vi ,Parents(Vi )) ≥ 0 measures how much information the expression val-ues of nodes Parents(Vi ) provide about Vi . If Vi is independent of parent nodes,then I(Vi ,Parents(Vi )) has the value of zero. On the other hand, when Vi is totallypredictable for given values of Parents(Vi ), then I(Vi ,Parents(Vi )) reduces into theentropy function H(Vi ). It should be noted that in general I(X ,Y ) �= I(Y , X) so thedirection of edges matters.

Is the structure presented in Figure 13.1 optimal with respect to the data? Whatabout the structure in Figure 13.3. Because our data set is small we do not resort tothe above calculations but we calculate the parameter set θ̂ :

θ̂2 = 7/9, θ̂4 = 3/9,θ̂(3|X2=0,X4=0) = 1, θ̂(3|X2=0,X4=1) = 1,θ̂(3|X2=1,X4=0) = 1, θ̂(3|X2=1,X4=1) = 0,θ̂(1|X3=0) = 1/3, and θ̂(1|X3=1) = 1.

The comparison of these figures with the previous ones shows that the new graphis more probable with respect to data.

13.6 Conclusion

Our example data was complete without any missing observations of expressionvalues, which happens quite rarely in reality. Two of the most familiar conventions


to deal with the problem of missing expression values is to omit the data rows miss-ing some values or substituting the most common values for the missing values. Ofcourse, what is the best policy, is an everlasting topic to debate on.

Even a more difficult problem is to decide how the discretization from expres-sion values to logical “on/off” values should be done. Can we really use a globalstep value like in our example or should we use use gene-specific values? On theother hand, can we use a single limit because then a small variation in measure-ments can toss a gene between the “on” and “off” state without any real reason?One cure for the last problem might be to use fuzzy logic, where a gene has anincreasing grade of being “on” and after a fixed point it is completely or fully “on”;similarly for “off” but reversed.


1. D’haeseleer P., Shoudan Liang S., Somogyi R. (2000) Genetic network infer-ence: from co-expression clustering to reverse engineering, Bioinformatics,16, 707-726.

2. Pe’er, D., Regev, A., Elidan, G., Friedman N. (2001) Inferring subnetworksfrom perturbed expression profiles, Bioinformatics, 17, 215S-224.

3. Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen,M. B., Brown, P. O., Botstein, D., Futcher, B. (1998) Comprehensive identi-fication of cell cycle-regulated genes of the yeast Saccharomyces cerevisiaeby microarray hybridization. Mol. Biol. Cell, 9, 3273-97.

14 Data mining for promoter sequences 143

14 Data mining forpromoter sequences

Martti Tolvanen, and Mauno Vihinen

14.1 General introduction

In all gene expression studies there is an underlying element of gene regulation. Inorder to be expressed, the gene has to be transcribed to RNA in a regulated pro-cess. Therefore, many expression experiments can be used to probe the regulationpatterns. This is most obviously the case when you observe gene expression in dif-ferent conditions or gene expression over time in some process, but comparisons ofdifferent cells or tissues or diseased vs. healthy tissues may benefit from analysisof regulation, too. Whatever your study, there will be changes in expression levelsof some genes, either upregulation or downregulation. How the expression levelscan be correlated to mechanisms of regulation is the topic of this chapter.

We start with an introduction about gene regulation and promoters and thenturn to the practical issues of bioinformatics in promoter datamining: where yourdata comes from and how you can analyze it. We pay special attention to select-ing your input data, i.e. finding the promoter region sequences (Input data I). Thisincludes a discussion of the pertinent problem of identifying the genes on yourchip (an obvious prerequisite), which should actually be part of your data prepro-cessing. Input data II discusses selection of matrices which define transcriptionfactor binding sites. The analysis part is shorter and divided into three sections: 1)searching known regulatory sites (“pattern matching”, 2) searching potential regu-latory patterns that are not known a priori (“pattern recognition”) and 3) analyzingalready-assigned regulatory patterns and sites.

14.2 Introduction to gene regulation

Changes in expression levels of genes can be due to a number of reasons. First,there is variable chromatin accessibility: some regions of DNA are packed tootightly to allow strand separation or even binding of various factors. One well-known mechanism of chromatin packing is based on methylation of cytidines inCpG islands to effect gene silencing via heterochromatin formation, and another


one is related to histone acetylation/methylation. Second, there are specific tran-scription factors which may inhibit or activate transcription. The balance betweenactivating and inhibiting transcription factors may determine the expression levelin loosely packed euchromatin. Third, there are some genes that seem to be on “bydefault” without much apparent regulation, the so-called maintenance or householdgenes which are needed in all cell types. “Household gene”, however, does not im-ply constant expression; in any gene the expression level will fluctuate over timedue to the general metabolic activity of the cell, especially the activity and amountof parts in the transcription machinery itself. If maintenance genes are used fornormalization of array data it is necessary to verify the constant expression e.g. byRT-PCR in the system and conditions of the experiment. (See also the Chapter 10,Normalization)

Data mining for gene regulation is currently feasible only in the second caseabove, i.e. in search and study of transcription factor binding sites. The regulationof chromatin structure changes are still poorly understood even in the experimentallevel, so there is no basis for theoretical analyses either. However, analyses ofDNA methylation are possible now, and may soon be extended to global analysesof methylation of the genome (“methylome” analyses) in various conditions.

Most known transcription factor binding sites (TFBS) are located close to thetranscription start site (TSS), particularly in the 500 bp directly upstream (5’) ofTSS. This upstream region is often referred to as the promoter region or proximalpromoter region (but sometimes the binding sites themselves are called promot-ers, and sometimes the word promoter refers to the so-called “basal promoter”, orthe binding region of the RNA-polymerase and other regular components of thetranscription machinery). Other regions that activate transcription (enhancers) mayoccur almost anywhere downstream (3’) or upstream from the gene, even 20 kBaway from the promoter region, or in introns. The methods for the search of reg-ulatory elements that we describe could be used for potential enhancer regions aswell as for the promoter regions, but commonly only promoters are analyzed. Thisis because a larger search space weakens the signal-to-noise ratio, and the TFBSare quite short and vague, i.e. weak signals. Therefore, we discuss mainly methodsand examples of data mining in the proximal promoter regions.

Coexpressed genes are often grouped by different clustering methods. It isreasonable to present the hypothesis that some of the similar changes of expressionare due to the effect of the same or similar transcription factors - therefore it makessense to search for shared sequences that could be TFBS. Of course, depending onyour clustering (cluster sizes and number), you may have several mechanisms af-fecting within one cluster, or you may have the same mechanism affecting membersof several clusters, but in any case, at least some clusters may show enrichment ofsimilar TFBS. It makes even sense to look for over- and/or under-represented TFBSin sets of genes with increased or decreased expression in one experiment only.

Sometimes you may correlate known functional associations of genes to yourpromoter analysis results to make more sense of emerging groups of coexpressedgenes and the biological meaning of the coexpression.

In brief, it is reasonable to presume that coexpression may imply coregulation,and looking for enriched shared patterns, especially known TFBS, in promoter re-


gions allows you to formulate hypotheses of regulation mechanisms for focusedexperimental testing. However, before experimental verification, your results willbe only sets of potential TFBS of potentially coregulated genes.

14.3 Input data I: promoter region sequences

Promoter analyses need the promoter region sequences, but getting them may notbe trivial! In this section, we introduce some tools and data sources that we havefound useful for retrieving promoter regions for human genes.

First of all, it is instrumental to know precisely which genes are contained inthe microarray. The data from the manufacturer is not always systematically pre-sented or complete. There may be a mix of references to GenBank accession num-bers, UniGene clusters, gene names, protein accession numbers, RefSeq mRNA orprotein accession numbers, Ensembl gene codes and EntrezGene ID codes (previ-ously LocusLink), and if you are lucky, even actual sequences that are containedon the array.

Even if we assume that all data provided by the manufacturer is correct, thereare some issues that you should be aware of before you can fully trust your geneassignments:

• gene names have been subject to many changes, so your data may not containthe current official names

• the UniGene clusters of ESTs are dynamic entities that undergo fusions andfissions in every UniGene build as more data becomes available, so yourESTs may be assigned to other genes, or may lack a UniGene association inthe current build. New UniGene cluster numbers appear in every new buildand old ones are made obsolete. Bottom line: save UniGene codes as a lastresort after using other, stable data items.

• the data regarding your filter or chip may have been updated after it left thefactory, so the shipped gene lists may not be as complete and up-to-date asyou can find at the manufacturer’s web site. You should definitely work withthe most recent data available as long as it refers to the same manufacturingbatch that you actually used.

• you may find several RefSeq or Ensembl mRNA and protein codes for asingle gene locus, corresponding to alternatively spliced forms - however,many alternative splice forms are still missing from the databases

• the provided EST code is not always the actual sequence on the array, be-cause the manufacturer may have used a longer and more correct version ofthe corresponding mRNA

• many genes, perhaps 25 to 30 % of human genes, contain alternative tran-scription start sites, and therefore alternative promoters (see Database ofTranscriptional Start Sites, DBTSS, at http://dbtss.hgc.jp/). Currentlyvery few of them are annotated in the genome databases, and there is very


little you can do to make sure you have the correct one for the tissue/systemyou studied. DBTSS (http://dbtss.hgc.jp/) is the best data resourcehere, but there are not any ready-made tools to utilize their data in promoteranalysis.

The ultimate source for the promoter regions is in the genome data. The anno-tations of gene locations are still incomplete, and there are no ready-made tools formaking decisions of the transcription start site and informing the user of the relia-bility of each decision. Currently (late 2005) we have found that human promotersare easiest to retrieve in large scale from two public data sources. They are multi-species resources, so when more genomes become available, the same approacheswork.

• University of California in Santa Cruz (UCSC) provides ready-made collec-tions of upstream sequences of as many genes as they can localize basedon RefSeq mRNA sequences - different sets contain different lengths of 5’-sequences preceding the TSS, called upstream1000, upsteam2000 etc. Thedata from the most recent human genome (hg17 build 35, May 2004) isat http://hgdownload.cse.ucsc.edu/goldenPath/hg17/bigZips/, butyou will always find most recent data at http://genome.ucsc.edu/

downloads.html under Full data set. Similar data is available for mouse andsome other species (but not for all represented species). Start from http://

hgdownload.cse.ucsc.edu/downloads.html and go to “Full data set” foreach species.

• BioMart (http://www.ensembl.org/Multi/martview) allows direct retrie-veal of upstream sequences of any length if you can map your genes to En-sembl gene codes or RefSeq codes (more details follow). Coding sequences,introns and downstream regions can be downloaded as well.

For a smaller number of genes the most reliable data (regarding the place-ment of TSS) is available in the Eukaryotic Promoter Database (EPD, http://www.epd.isb-sib.ch/), which contains promoter regions of 500 bp, only fromgenes with an experimentally verified TSS. For example, you can find 1,871 hu-man entries in EPD 84 (out of 4800 for all species). DBTSS (http://dbtss.hgc.jp/) contains data for over 8,000 human genes as well as thousands from otherspecies. Organism-specific promoter databases may exist to fit your needs, con-sult the genome site of your favourite organism or Michael Zhang’s lab at http://rulai.cshl.org/software/index1.htm.

However, commercial promoter analysis packages may contain ready-madepromoter data sets containg quality evaluations and much else, e.g. as in ElDoradoof the Genomatix package (http://www.genomatix.de/products/ElDorado/).

If these sources fail, or if you feel you cannot trust all of your findings, youcan always resort to using your own similarity searches to place the gene in thegenome assembly, and then pick the region preceding the gene, but this is far fromstraightforward. For example, with bad-quality ESTs you run a risk of findinghomologs as the best hit instead of the actual gene. The same risk applies if your


actual gene is not be represented in a database (such as RefSeq). It may be advisableto drop data for which you cannot get unambigous gene mapping.

If you aim at finding your promoter sequences from the UCSC upstream datasets, you need the corresponding RefSeq mRNA accession codes, because thesecodes are used as identifiers for the promoter sequences. You can also retrieveEnsembl genes with RefSeq codes. Figure 14.1 shows the flow of data and somesuggested procedures for dealing with missing data. It would be a good idea tostore the codes you find a database or a spreadsheet to check the consistency ofyour results.

Figure 14.1: How to retrieve sequences upstream of the genes you study. The dottedblocks indicate consecutive operations which may need iterating before proceeding. 1:Finding usable codes for your genes (the easy part). 2: Dealing with elusive genes/probes(can be skipped as indicated). 3: Actual data retrieval. If you trust your data source,you can use just one of the sources, but a possible scheme for consistency checking issuggested as well.

The tools needed in this procedure include:

• a tool for comparison and interconversion of codes in various databases. Byfar the best one (in the author’s opinion) is IDconverter, which is availableas two independently developed versions at two institutes in Spain: IDcon-verter (CIPF) at http://www.gepas.org/ and IDconverter (CNIO) http://idconverter.bioinfo.cnio.es/. Currently the latter has more output


options and more recent databases available, and therefore gives better re-sults.

• similarity search tools, such as Blast or BLAT (http://genome.ucsc.edu/cgi-bin/hgBlat?command=start). For large-scale searches you probablywant to install the program locally and use text processing tools such as Perlfor extracting relevant information from the result files. BLAT has the ad-vantage of giving directly the coordinates in the genome, and even on-lineBLAT accepts modest batches of sequences (25 at a time) if you just want topatch missing data

• Batch Entrez - mass retrieval of sequences from NCBI (http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=Nucleotide)

If you want to be really certain that you know which genes you are dealingwith, you can follow several paths in the diagram to see if they all lead you to sameEntrezGene IDs or HUGO names. Remember that there may be several RefSeqmRNAs or Ensembl transcipts per gene, whereas EntrezGene IDs are unique.

Our recommended procedure for the above approach is this (numbers as inFigure 14.1):

1. Starting from whatever codes you have, preferably EntrezGene/LocusLinkcodes and HUGO/HGNC gene names, fetch at least once the default outputfrom IDconverter (CNIO). Repeat with several code sets to see if you canimprove the coverage.

2. If you miss more Ensembl gene codes than you would like, you can use Gen-Bank or UniGene codes for some semi-manual detective work. For GenBankaccession numbers, use batch Entrez to retrieve nucleotide sequences, thenuse Blat to find their locations in the genome and see if there is an anno-tated gene right there. In our experience, this approach has a high successrate. If you have a file of UniGene codes, you can use batch Entrez for Uni-Gene (http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi?db=UniGene) toget UniGene links for all of them, and then try to resolve them one by one,for example by copying and pasting a set of representative mRNA codes,which you may apply to nucleotide Batch Entrez and Blat as above. UsingUniGene is more laborious, and in some cases impossible (e.g. for someretired codes, which have been split into a number of new clusters).

3. Use the Ensembl gene codes to retrieve the upstream sequences as describedin detail below.

If you work with the Affymetrix Arabidopsis chips, you may want to lookat VIZARD (http://www.anm.f2s.com/research/vizard/) which includes an-notation and upstream sequence databases for the majority of genes represented onthe Affymetrix Arabidopsis GeneChip R© array. The Broad Institute at MIT pro-vides upstream sequence data for Neurospora crassa and several other organisms,see e.g. http://www.broad.mit.edu/annotation/fgi/.


14.4 Using BioMart to retrieve promoter regions

The preferred option for retrieving upstream sequences is found at the BioMart ser-vice (http://www.ensembl.org/Multi/martview) of the Ensembl project. Thisis limited only by the number of Ensembl genes that are annotated and by the ac-curacy of the annotation. Many genes still have many multiple entries in Ensembl,and some entries are for pseudogenes. However, the service is very easy to use.Note that you will want to use primarily Ensembl data, not Vega (which is manu-ally annotated but still very incomplete data).

For some microarrays (especially Affymetrix chips), BioMart provides directmappings. Such mappings with microarray contents have become more commonin the genome sites, both at Ensembl and at UCSC, and the annotations providedby the manufacturers have improved, too. Therefore, retrieving the upstream se-quences is less of a technical problem now, but the problem of correct transcriptionstart sites remains a serious one, especially in the case of alternative promoters.

Please see the book web site at http://www.csc.fi/molbio/arraybook/for extra material and examples of using BioMart.

14.4.1 Final warnings regarding upstream data

• in some cases the genome assembly may be broken close to the start of yourgene, so that you do not get a full 1000 (or whatever) nucleotides of sequence.In Ensembl this shows as a long string of N’s in the sequence

• the start of a RefSeq mRNA may not always correspond to a true transcrip-tion start site. There are many mRNAs which give evidence of longer tran-script variants than the current RefSeqs. UCSC genome browser relies di-rectly to the RefSeq start sites in their upstream data sets.

• Ensembl tries to extend the 5’-end as far as possible, but in a few cases thisleads erratically long sequences, due to seemingly mis-spliced mRNA ver-sions

14.5 Input data II: the matrices for transcription factor bindingsites

In the “pattern matching” approach we need to know the TFBS sequences in ad-vance. TFBS are never exact sequences without exceptions, but instead, they arepatterns with more or less preferred bases at each nucleotide position in the site.Generally the preferences are represented as a weight matrix, derived from ob-served base frequencies in each position as shown in Figure 14.2.


Figure 14.2: Creation of weight matrices to describe a TFBS (and a scoring example ofa sequence against this matrix). Figure by Eija Korpelainen; includes matrix by WyethWasserman.

The largest collections of TFBS matrices come from commercial sources, withthe Transfac or Genomatix packages. Transfac database ver. 7.0 (with 398 matri-ces) is publicly available (http://www.gene-regulation.com/pub/databases.html#transfac), whereas the current commercial version, Transfac Professional9.4., contains 774 matrices . The Genomatix matrix library 6.0 contains 664 ma-trices for over 2,000 transcription factors. Genomatix offers evaluation accountswith a limited number of searches per month for free (http://www.genomatix.de/shop/evaluation.html).

However, the numbers can not be used for direct comparison of the extentor usability of these products, since both program suites and libraries have theirspecial strengths.

JASPAR (http://jaspar.cgb.ki.se/), is a completely free academic alter-native, a recent (2004) matrix library of 111 matrices, in which great care has beentaken in data selection to arrive at good-quality matrices. A combination of JAS-PAR and the public Transfac might be your best pick if you can not invest in thecommercial full programs.

Another approach to improve the quality of TFBS matrices is to allow corre-lations between base positions in the weight matrices (Zhou and Liu, 2004).

14.5.1 Searching known patterns

Once you have found the genes and their 5’ regions you can proceed to the anal-ysis of patterns in the regions. The simplistic approach is to search single knownsites and estimate the over- or under-representation of the findings. However, be-cause more than 90 % of such findings will have no biological significance, evenwhen you search for known TFBS patterns, you would need additional informa-tion, either from other TFBS in the vicinity (to form “modules” or “frameworks”),or from comparative genomics, studying promoters of orthologous genes from sev-eral species (“phylogenetic footprinting”).

For more details about comparative genomics in TFBS finding, see the ar-


ticle by Prakash and Tompa (2005). For practical implementations you can tryrVista (Loots and Ovcharenko 2004; http://rvista.dcode.org/) and Compare-Prospector (Liu et al. 2004; http://compareprospector.stanford.edu/).

If you want to run a check of known transcription factor binding sites, Tran-scription Regulatory Regions Database (TRRD, http://wwwmgs.bionet.nsc.ru/mgs/dbases/trrd4/trrdintro.html) is a site with lots of information, but onlyfor on-line searching. The complete database is not available. The TranscriptionFactor Database (TFD) and some tools that utilize the data are available at http://www.ifti.org/.

Transfac (http://www.gene-regulation.de/) offers PC software for pro-moter analysis, Transplorer. Their “Transfac Professional” package contains searchtools, too, of course, and many tools, e.g. Match, P-Match and Patch have publicversions available at the Transfac web site (for non-commercial use only).

TransFAT is a recent datamining tool designed to detect under or over-representation of putative TFBS by comparing two sets of genes. This is a promising-looking tool with built-in statistics, but we have not tested it properly yet. It is avail-able at http://babelomics.bioinfo.cnio.es/transfat/transfat.cgi andhttp://babelomics.bioinfo.cipf.es/transfat/cgi-bin/transfat.cgi.

Whatever method you choose to search for TF binding sites, you have to bearin mind that they are short patterns that will be found very frequently by chancealone, so you need to confirm the statistical significance of your findings.

You may improve your results and their significance by searching clusters ofseveral sites, i.e. module searching. Genomatix (see above) offers the programsFrameWorker and ModelInspector for creating and searching combinations of sev-eral TFBS in one promoter. The ModelInspector can also be used for searchingwith the Genomatix library of experimentally verified modules. The makers ofTransFac have made Composite Module Analyst for this purpose (http://www.gene-regulation.com/pub/programs.html#CMAnalyst) and it is freely avail-able for non-commercial use.

14.5.2 Pattern search without prior knowledge of what you search

Several programs and different approaches are available for automated pattern dis-covery in your promoter region sequences. We will mention here a few examplesof different programs and strategies, all of which are capable of finding patterns insequences without their previous alignment. More extensive reviews and compar-isons can be found in Kel et al., 2004, and Tompa et al., 2005.

Expression Profiler is a software suite for many tasks in microarray data analy-sis (http://ep.ebi.ac.uk/EP/). It contains the program SPEXS, Sequence Pat-tern EXhaustive Search, for pattern discovery, which finds regular-expression-likediscrete patterns. As an example, this program was used by Vilo et al. (2000) tofind regulatory patterns from publicly available yeast expression data. To overcomethe fuzziness created by incertainties in clustering, they generated a large number ofdifferent clusters (52,000, with lots of overlapping) and analyzed enriched patternsfrom all of them. After filtering and grouping the patterns, they had 62 groups, 48of which contained patterns matching some sites in the Saccharomyces cerevisiae


Promoter Database.MEME (http://meme.sdsc.edu/meme/website/intro.html) is based on

position-dependent letter-probability matrices which describe the probability ofeach possible letter at each position in the pattern. The WWW interface analyzesyour set of sequences which you believe to contain common patterns, and gives youthe strongest patterns that MEME finds. In addition, you get automatically resultsfrom a companion program MAST which locates the occurences of the patterns inyour set of sequences.

AlignACE (http://arep.med.harvard.edu/mrnadata/mrnasoft.html)provides a Gibbs sampling algorithm for finding patterns in DNA sequences. Theprogram is optimized for finding multiple motifs and it automatically considersboth strands in the sequence. The authors used the program to identify successfullyseveral known transcription factor binding sites and to propose some previously un-known regulatory mechanisms based on the earliest public yeast gene expressiondata sets (Roth et al. 1998).

All of the above methods depend on a clustering that you have made beforepattern discovery, and it is presumed that the clustering is correct. Kimono (http://www.fruitfly.org/∼ihh/kimono/) takes a unified approach that optimizesclustering and pattern discovery together. The approaches described above first findgenes with similar expression patterns, and then see if they have similar promoters.Kimono finds clusters of genes that have similar expression patterns and similarpromoters as described by Holmes and Bruno (2000).

Credo (Cis-Regulatory Element Detection Online, http://mips.gsf.de/

proj/regulomips/) is a recent service which combines the algorithms of Alig-nACE, DIALIGN, FootPrinter, MEME and MotifSampler.

However, none of the pattern recognition programs are directly suitable forlarge-scale comparisons of promoters from microarray data, but need some script-ing skills or manual work for parsing and combining the results.

14.6 Examples

Please see the book web site at http://www.csc.fi/molbio/arraybook/ for ex-tra material on promoter analysis using computer software.


1. Holmes, I., and Bruno, W. J. (2000) Finding regulatory elements using jointlikelihoods for sequence and expression profile data. Proc. Int. Conf. Intell.Syst. Mol. Biol., 8, 202-10.

2. Kel, A., Tikunov, Y., Voss N., and Wingender E. (2004) Recognition of mul-tiple patterns in unaligned sets of sequences: comparison of kernel clusteringmethod with other methods. Bioinformatics, 20, 1512-1516.

3. Liu, Y., Liu, X. S., Liping, W., Russ, B. A., and Batzoglou, S. (2004) Eu-karyotic Regulatory Element Conservation Analysis and Identification UsingComparative Genomics. Genome Res., 14, 451-458.


4. Loots, G. G. and Ovcharenko, I. (2004) rVISTA 2.0: evolutionary analysisof transcription factor binding sites. Nucl. Acids Res., 32, W217-W221.

5. Prakash, A., and Tompa, M. (2005) Discovery of regulatory elements in ver-tebrates through comparative genomics. Nat. Biotechnol., 23, 1249-56.

6. Roth, F. P., Hughes, J. D., Estep, P. W., and Church, G. M. (1998) FindingDNA regulatory motifs within unaligned noncoding sequences clustered bywhole-genome mRNA quantitation. Nat. Biotechnol., 16, 939-45.

7. Tompa, M., Li, N., et al. (2005) Assessing computational tools for the dis-covery of transcription factor binding sites. Nat. Biotechnol., 23, 137-44.

8. Vilo, J., Brazma, A., Jonassen, I., Robinson, A., and Ukkonen, E. (2000)Mining for Putative Regulatory Elements in the Yeast Genome Using GeneExpression Data. Proc. Int. Conf. Intell. Syst. Mol. Biol., 8, 384-94.

9. Zhou, Q. and Liu, J. S. (2004) Modeling within-motif dependence for tran-scription factor binding site predictions. Bioinformatics, 20, 909-916.


15 Biological sequenceannotations

Juha Saharinen

15.1 Annotations in gene expression studies

Sequence annotations are, in general, information attached to the regions in thesequences. The majority of the several hundreds of bioinformatics databases cur-rently publicly available either provide annotations to sequences or refer to certainsequences (see [4]). Basically, annotations can be either experimentally proven orcomputational predictions. Examples of annotations include localization of a genein a genomic sequence, a SNP marker in a gene, a verified translation start sitein mRNA, a predicted transmembrane domain in a protein sequence, an experi-mentally validated transcription factor binding site in a promoter, localization of aprotein in the cytoplasm, involvement of a protein in MAP-kinase signaling cas-cade, association of a mutation in a gene with a disease etc. After the interestingsets of genes have been discovered by various statistical analysis from expressionmicroarray data, searching and applying the annotations to the genes (transcripts)are used to provide the biological meaning for the observed results.

15.2 Using annotations with gene expression microarrays

With expression microarrays, the use of sequence annotations typically involvesthe retrieval of more information concerning the genes that have been identified asdifferentially expressed in the studied system. Often the first step is to locate anidentifier for the genes. Depending on the used chip platform and the informationprovided, a suitable identifier to be used in e.g. genome browsers can be provided.A number of different sequence / gene / transcript identifier systems are currentlyused. The most important are the GenBank, EMBL, DDBJ, RefSeq, UniGene,LocusLink, and Ensembl identifiers [2, 5, 8, 10, 19, 22].

UniGene, which is quite commonly used identifier in microarrays, is basedon a computational system for automatic clustering of the GenBank data into genecentric clusters, where one cluster is aiming to represent a single gene. Unigene isregularly updated from the source data. In the update process, UniGene clusters can

15 Biological sequence annotations 155

be split apart or joined together, which often makes the gene identification difficult,since the used cluster ID may have become obsolete. Better than UniGene in thisrespect, is the RefSeq system, which provides non-redundant, curated and stableidentifiers for mRNAs. MatchMiner is a tool that can be used to translate betweendifferent identifiers for the same gene, it is available as a web-page as well as acommand line tool [3].

Most of the annotation servers are well cross-referenced, i.e. they providelinks to the corresponding genes also in other annotations servers. Good start-ing points for getting more biological information about the selected genes are themajor publicly accessible genome browsers: Ensembl, NCBI Map Viewer, UCSCGenome Browser, and the GeneCards server [8, 11, 20, 22]. All these servers sharethe same actual genome builds, thus the physical coordinates are interchangeable,provided that the same genome build revisions are used. More information andpractical examples of these genome browsers are available in the User’s guide tohuman genome resources [2]. Also see Chapter 14 (Data mining for promoter se-quences), which also describes some aspects of using the Ensembl and EnsMartserver [12]. Using these servers, it is easy to obtain corresponding protein levelinformation (protein domain data provided by pfam, SMART, often accessed byInterPro[1, 14, 16]), expression information in different tissues, typically based onEST or SAGE (serial analysis of gene expression) data, orthologous and paralogousgenes, SNPs in the gene locus (typically from dbSNP [22]) and possible identifiedgenetic disorders (OMIM database [6]) for the gene in question.

If no suitable gene identifier is available, the referenced gene on the array canbe identified with sequence based homology searches. These are provided by Blastor SSAHA in Ensembl, BLAT in UCSC Genome Browser, and Blast or MegaBlastin NCBI [13, 17, 25].

15.3 Using Ensembl server to batch process of annotations

When working with large sets of genes, a programmable interface is preferred to“high-throughput copy-pasting” of www-pages. The majority of the www-basedannotation servers are constructed in an identical way as a precomputed relationaldatabase, which is accessible though the www-interface. However, Ensembl alsoprovides additional ways to interact with the data. The OpenSource MySQL rela-tional database server, serving/providing the Ensembl data, can be connected di-rectly and data queried via SQL commands. For best performance, the Ensembldata can be downloaded and installed to a local MySQL server: The EnsemblMySQL server is available at address ensembledb.ensembl.org. For Finnish users,Biomedicum Bioinformatics Unit (http://www.bioinfo.helsinki.fi) is mir-roring Ensembl databases for variety of organisms, giving substantially better per-formance than the popular ensembledb.ensembl.org server. Also the BioPerl withEnsembl extensions are installed and available for users.

In addition, a comprehensive Perl API (application program interface) to ac-cess the Ensembl data is available. This requires the installation of BioPERL li-braries as well as the Ensembl additions [21]. Documentations as well as examplesof these are available at http://www.ensembl.org/Docs/. Here are two small


examples using these alternative interfaces to the Ensembl data. In example 1 [Seeonline material at http://www.csc.fi/molbio/arraybook/], 10 probe sets fromAffymetrix U133A chip are queried from Ensembl Homo sapiens database andtheir Ensembl GeneStableIDs, and gene’s sequence locations are returned. Exam-ple 2 [See online material at http://www.csc.fi/molbio/arraybook/], shows asmall Perl script, using EnsemblAPI, to retrieve the genomic sequence from a givenorganism, chromosome and physical position.

15.4 Biological pathways and GeneOntologies

In order to get more insight into the observed gene expression changes, it is mean-ingful to relate the microarray results with the possible biological context, in theform of, for example, known metabolic or signal transduction pathways. The KyotoEncyclopedia of Genes and Genomes (KEGG) PATHWAY database has been per-haps the most known resource for mapping genes to known pathways [9]. KEGGpathway and other KEGG databases are available in addition to www-browser alsoas local installations and as a WebService, supporting high-throughput analysisfrom most programming languages using SOAP protocol. Using KEGG over thewww-browser typically returns a graph of a pathway, where the gene of interest islocated as well as possible links to other pathways. Another place for visualizingpathways is BioCarta (http://www.biocarta.com/genes/index.asp).

The GeneOntology (GO) is currently the most comprehensive collaborativeeffort to assign consistent ontologies to gene products in a species independentmanner. In essence, ontologies are controlled vocabularies, providing standardizednames or terms for different biological events or properties. GO ontologies belongto three separate domains: cellular components, biological processes and molecu-lar functions, which are considered independent of each other. Examples of theseinclude biological process: transforming growth factor beta receptor activity; cellu-lar component: extracellular matrix; molecular function: calcium ion binding. Thedata in GO is in a hierarchical format; specifically in a so called diacyclic graph(DAG), which enables a given term to have multiple parents and childrens. Anexample is protein kinase activity, which is a child of kinase activity and parent oftyrosine kinase and serine/threonine kinase. Gene products in GO can be annotatedto any number of GO terms. The evidence for making an association of a gene to aterm can be either experimental or a computational prediction, which informationis also available.

15.5 Using GeneOntologies for gene expression data analysis

Whilst GO is a very useful tool in deciphering the biological processes, that arelinked to the genes identified by microarray as described in other chapters, the or-ganized data in GO can also be used in an analytical manner. Since a large numberof genes have been mapped to GO, one can analyze whether certain gene sets,say, significantly differentially expressed gene clusters, are enriched in some GOgroup. Multiple different software packages exist for this analysis [7, 15, 18, 23,24] (see also http://www.geneontology.org/GO.tools.html). Typically these


tools test the enrichment or depletion of a group of genes in the all possible GOgroups. The input is a list of genes, ordered by e.g. p-value and it is usually testedby using Fisher’s exact test, i.e. asking the question how likely it is to observe xmembers out of y genes in given annotated pathway, within the z first genes in thelist out of w total genes in the array. Differences exist e.g. whether the software candeal with multiple testing problem, the way the results are linked to other databasesand what are the platform requirements (www-page, Windows, Java).

15.6 Concluding remarks

Adding biological annotations to the genes identified with various statistical meansfrom the gene expression data provides the link to the underlying biological phe-nomena and can provide further information about the biological meaning of theexpression changes than often daunting long lists of genes. Especially in the path-way or gene grouping analysis of gene expression data, the GO annotation is veryimportant. In addition, biological annotations often provide the fastest way to getan idea of the many hypothetical genes identified with microarrays, for which noother data is available.

15.7 References

1. Bateman, A., et al. (2004) The Pfam protein families database. NucleicAcids Res, 32, Database issue, D138-41. http://www.sanger.ac.uk/

Software/Pfam/

2. Benson, D.A., et al. (2005) GenBank. Nucleic Acids Res, 33, DatabaseIssue, D34-8. http://www.ncbi.nlm.nih.gov/Genbank/index.html

3. Bussey, K.J., et al.(2003) MatchMiner: a tool for batch navigation amonggene and gene product identifiers. Genome Biol., 4, R27. http://discover.nci.nih.gov/matchminer

4. Database issue (2005), Nucleic Acids Research, 33.

5. DDBJ, http://www.ddbj.nig.ac.jp/.

6. Hamosh, A., et al. (2005) Online Mendelian Inheritance in Man (OMIM), aknowledgebase of human genes and genetic disorders. Nucleic Acids Res.,33, Database Issue., D514-7. http://www.ncbi.nlm.nih.gov/entrez/

query.fcgi?db=OMIM

7. Hosack, D.A., et al. (2003) Identifying biological themes within lists ofgenes with EASE. Genome Biol, 4, R70.

8. Hubbard, T., et al. (2005) Ensembl 2005. Nucleic Acids Res, 33, DatabaseIssue, D447-53. http://www.ensembl.org/

9. Kanehisa, M., et al. (2004) The KEGG resource for deciphering the genome.Nucleic Acids Res., 32. Database issue, D277-80. http://www.genome.

jp/kegg/


10. Kanz, C., et al. (2005) The EMBL Nucleotide Sequence Database. NucleicAcids Res., 33, Database Issue, D29-33. http://www.ebi.ac.uk/embl/

11. Karolchik, D., et al. (2003) The UCSC Genome Browser Database. NucleicAcids Res., 31, 51-4. http://genome.ucsc.edu/

12. Kasprzyk, A., et al. (2004) EnsMart: a generic system for fast and flexibleaccess to biological data. Genome Res., 14, 160-9.

13. Kent, W.J. (2002) BLAT–the BLAST-like alignment tool. Genome Res., 12,656-64.

14. Letunic, I., et al. (2004) SMART 4.0: towards genomic data integration.Nucleic Acids Res., 32, Database issue, D142-4. http://smart.embl-

heidelberg.de/

15. Martin, D., et al. (2004) GOToolBox: functional analysis of gene datasetsbased on Gene Ontology. Genome Biol., 5, R101.

16. Mulder, N.J. et al. (2005) InterPro, progress and status in 2005. NucleicAcids Res., 33, Database Issue, D201-5. http://www.ebi.ac.uk/

interpro/

17. Ning, Z., A.J. Cox, and Mullikin, J.C. (2001) SSAHA: a fast search methodfor large DNA databases. Genome Res., 2001, 11, 1725-9.

18. Osier, M.V., Zhao, H., and Cheung, K.H. (2004) Handling multiple test-ing while interpreting microarrays with the Gene Ontology Database. BMCBioinformatics, 5, 124.

19. Pruitt, K.D., Tatusova, T., and Maglott, D. R. (2005) NCBI Reference Se-quence (RefSeq): a curated non-redundant sequence database of genomes,transcripts and proteins. Nucleic Acids Res, 33, Database Issue, D501-4.http://www.ncbi.nlm.nih.gov/RefSeq/

20. Safran, M., et al. (2002) GeneCards 2002: towards a complete, object-oriented, human gene compendium. Bioinformatics, 18, 1542-3. http://

bioinformatics.weizmann.ac.il/cards/

21. Stajich, J.E., et al. (2002) The Bioperl toolkit: Perl modules for the lifesciences. Genome Res., 12, 1611-8. http://bioperl.org/

22. Wheeler, D.L., et al. (2005) Database resources of the National Center forBiotechnology Information. Nucleic Acids Res., 33, Database Issue, D39-45.

23. Zeeberg, B.R., et al. (2003) GoMiner: a resource for biological interpretationof genomic and proteomic data. Genome Biol., 4, R28.

24. Zhang, B., et al. (2004) GOTree Machine (GOTM): a web-based platformfor interpreting sets of interesting genes using Gene Ontology hierarchies.BMC Bioinformatics, 5, 16.


25. Zhang, Z., et al., (2000) A greedy algorithm for aligning DNA sequences. J.Comput. Biol., 7, 203-14.


16 Software issues

Jarno Tuimala and Teemu Toivanen

Traditionally, biological data, including DNA and amino acid sequences andDNA microarray results, have been stored in a flat file format. Flat files are ba-sically just text files, which have been formatted in such a way that programs canread their contents. Flat files are not the ideal solution for long term storage of largedata sets. They also cause problems when the same data needs to be analyzed usingseveral programs that use a different kind of file format.

The data standardization and storage in central databases (see Chapter 7 formore details) is one solution to the data format issues. However, this solution hasnot yet fully matured, and most of the programs used for the DNA microarraydata analyses are not able to read the data directly from these databases or evenimport the XML-format the databases commonly produce. Databases and programswill surely learn to cooperate better over time. The standardization and storage ofthe data is currently under heavy development, and many software developers areimplementing their own solution to these problems. However, the solutions are notalways MIAME compliant.

16.1 Data format conversions problems

Especially, if freeware tools are used for the analysis, there is often a need to exportthe partly analyzed data from one program to another. More often than not, thedatafile formats are not compatible, which means additional and tedious conver-sion work. Small amounts of data can quite easily be converted between formatsusing Microsoft Excel or any other spreadsheet program. However, the work loadbecomes quickly unbearable when the amount of data increases. Excel has somemacro programming capacities, but there exists also other, and possibly better tools,if one is interested in programming (see section on programming). In addition,Excel does not tolerate huge data files (more than 65 536 rows or 256 columns).Sometimes it becomes necessary to transpose (switch the rows and columns) thedata file, and if there are more than 256 genes, Excel can’t be used. There hasbeen some problems with Excel between its language versions (date fields change,comma and dot are switched in decimal numbers and such). If you run into theseproblems it’s advisable to use some programming language instead of a spreadsheetprogram.

16 Software issues 161

16.2 A standard file format

If the data needs to be imported into several programs at the same time, it is ofteneasier if a “standard file” is first generated from the files written by the chip scanner.Standard files can be produced in Excel or another spreadsheet program easily.Briefly, the first column of the standard file should contain the gene identifiers, forexample their common names. The second column should consist of the sequenceaccession numbers, Genbank accession numbers are probably the most usable ones.The next four columns should contain the spot and background intensities of theused colors. Depending on your analysis needs, it is also possible to use ratios, forexample normalized log ratios, instead of intensity values. The standard file shouldbe saved in a tab-delimited text format (usually “export” or “save as” menu optionand from there “Text (tab delimited)” or similar). It’s sometimes possible to viewdata in some text editor, just to have a peek how the actual data is stored.

16.3 Programming

When the capacities of the spreadsheet program are not sufficient or many similarfiles need processing, some programming tools can be used instead for the datafilemanagement purposes or for data analyses. There are actually several programminglanguages that are especially suited for the data file formatting and other text filemanipulations.

16.3.1 Perl

One of the most common languages is Perl, which has very powerful and easy touse text manipulation tools. Perl is available for free for UNIX, Linux http://

www.perl.org and PC machines http://www.activeperl.com. Perl is a pro-gramming language, which means that getting to know it takes some time and ef-fort. However, if data conversion tools are needed everyday, it would definitely beworthwhile to befriend Perl.

16.3.2 R

R is a free statistical analysis tool and a self-sufficient programming language. It isavailable for UNIX, Linux, Macintosh and PC platforms. In R, scripts for analysesand data file manipulations can be easily constructed. R has many add-on pack-ages for cluster analysis, self-organizing maps, and neural networks, among others.There are also many packages available that have been specifically tailored forDNA microarray data analysis: Bioconductor project develops and updates manyof these pacakges.

16.4 Examples

Please see the book web site at http://www.csc.fi/molbio/arraybook/ forcode examples and computer software presentations.


Index

AAffymetrix

change call, 29change p-value, 29comparison analysis, 28detection call, 25, 26detection p-value, 26discrimination score, 26GCOS, 25Gene Array scanner, 25GeneChip, 25image analysis, 41normalization, 28

RMA, 109robust, 28scaling, 28

signal, 26signal log ratio, 29single array analysis, 25Tau, 26

annotations, 154biological pathways, 156GeneOntology, 156retrieving, 155using, 154

Bbackground subtraction, 83Bayes rule, 116

Ccase-control studies, 88cDNA microarrays, 25centralization, 102, 103Chen’s method, 117clustering

function prediction, 133methods, 124

hierarchical, 124k-means, 126, 131K-nearest neighbor, 127principal component analysis, 129self organizing map, 125

principles, 123supervised, 124unsupervised, 124visualization, 131

coefficient of variation, 60constant, 58correlation, 61

Pearson’s, 61Spearman’s, 61

Ddegrees of freedom, 65distribution, 59

normal, 64, 94skewed, 66, 94

left, 66right, 66

standard normal, 64t, 65

dose response, 47dye-swap, 108dynamic range, 98

EEM-algorithm, 116error

bias, 59random, 59, 69systematic, 59

error models, 80expectation-maximization, 116experimental design, 47

available technology platforms, 49

Index 163

balanced, 78common reference, 78controls, 47

cell lines, 48human, 48mice, 48

gene clustering and classification,49

loop, 78replicates, 48

expression change, 84fold change, 87intensity ratio, 86log ratio, 86

Ffiltering, 91

absolute expression change, 115bad data, 91flagging, 91outliers, 91uninteresting data, 92

flagging, 91

GGene regulatory networks, 135

Bayesian network, 137parameters, 139structure search, 140

fundamentals, 135perturbation data, 136time series data, 136

genotyping, 32

Hhistogram, 66, 67housekeeping genes, 25, 104

Iimage analysis

Affymetrix, 41gridding, 37information extraction, 40quality control, 40segmentation, 38

imputation, 71

Llinear regression, 62

linearity, 95log ratio, 68log_2-transformation, 86log-transformation, 68, 102

MMA plot, 101Mann-Whitney U test, 72mean, 93, 98

arithmetic, 59trimmed, 59

median, 60, 93MIAME, 51microarray

dyesCy3, 18Cy5, 18

hybridization, 20obtaining, 17printing, 16RNA sample preparation, 18scanning, 21typical applications, 21

Minimum information about a microar-ray experiment, 51

missing values, 70, 82casewise deletion, 83mean substitution, 83pairwise deletion, 83

multiple chip methods, 119one-sample t-test, 121standard deviation, 119two-sample t-test, 120

NNewton’s method, 118noise envelope, 116normality, 94normalization, 98, 99, 102

analysis of variance, 108dye-swap, 108global, 103housekeeping genes, 104linearity of data, 105local, 103lowess, 105, 107mean centering, 106median centering, 105, 106per-chip, 103


per-gene, 103ratio statistics, 108RMA, 109spiked controls, 104, 108standardization, 106trimmed mean centering, 106

number of subjects, 59

Ooligonucleotide pairs, 25

mismatch, 25perfect match, 25

outliers, 69, 90quantification error, 90spot saturation, 90statistical modeling, 90

Pp-value, 72pheasant tail, 83plot

normal probability, 67power analysis, 88preprocessing, 82promoter data mining, 143

pattern search, 150, 151retrieving sequences, 145

Biomart, 149TFBS matrices, 149

Rrange, 60replicates

averaging of, 88biological, 88case-control studies, 88chips

checking quality, 89excluding bad replicates, 90handling, 87power analysis, 88software, 88spots

checking quality, 90technical, 88time series, 88

Ssample size, 59

Sapir and Churchill’s method, 116scatter plot, 61

M versus A, 101signal-to-background, 91signal-to-noise, 91single chip methods, 116

Chen’s method, 117Newton’s method, 118noise envelope, 116Sapir and Churchill’s method, 116

single nucleotide polymorphism, 32SNP, 32

genotype calls, 33methods

APEX, 32single base extension, 32

softwarePerl, 161R, 161

spatial effects, 96spiked controls, 104standard deviation, 60, 93standardization, 65, 102, 103statistical testing, 71

ANOVA, 72, 75balanced design, 78completely randomized, 76mixed model, 77

Bonferroni correction, 74choosing the test, 71critical values table, 74empirical p-value, 75False discovery rate, 74hypothesis pair, 72Kruskal-Wallis test, 72one-sample t-test, 72p-value, 72permutation test, 75t-test, 72test statistic, 72two-sample t-test, 73

systematic bias, 99array design, 101batch effect, 101dye effect, 99experimenter issues, 101plate effects, 100printing tip, 100

Index 165

reporter effects, 100scanner malfunction, 100uneven hybridization, 100

systematic variation, 99

Ttime series, 47, 88

UUPGMA, 124

Vvariable, 58

dependent, 58independent, 58qualitative, 58quantitative, 58

variables, 58variance, 60, 94, 98Volcano plot, 121

ZZ-score, 103

a - immunogenomica - dna micro array data analysis 2nd ed

Documents