
MOLECULAR BIOTECHNOLOGY Volume 26, 2004

REVIEW

Molecular Biotechnology 2004 Humana Press Inc. All rights of any nature whatsoever reserved. 1073–6085/2004/26:3/225–232/$25.00

Comparative Evaluation of Microarray Analysis Software

Daniel K. Liu,1 Bin Yao,1 Brian Fayz,1 David D. Womble,1,2 and Stephen A. Krawetz*,1,2,3

*Author to whom all correspondence and reprint requests should be addressed: 1WSU-MCBI Bioinformatics Node, Center for Molecular Medicine and Genetics, 2Institute for Scientific Computing, and 3Department of Obstetrics and Gynecology, Wayne State University School of Medicine, Detroit, MI 48201. E-mail: [email protected]

Abstract

A wide variety of software tools are available to analyze microarray data. To identify the optimum software for any project, it is essential to define specific and essential criteria on which to evaluate the advantages of the key features. In this review we describe the results of our comparison of several software tools. We then conclude with a discussion of the subset of tools that are most commonly used and describe the features that would constitute the “ideal microarray analysis software suite.”

Index Entries: Microarray; software; analysis; comparison; shareware.

1. Introduction

Microarrays have already become an indispensable tool for the investigation of gene-expression profiles and gene polymorphisms. A single microarray affords the ability to simultaneously monitor the behavior of several thousand biological elements from a transcriptome or proteome. Biology is being revolutionized with the application of microarray technology to many areas. These include assessing the safety of drugs and vaccines, high-speed testing of clinical samples, and population-based screening to identify disease carriers (1). Other applications being considered include monitoring the progression of diseases, identifying the influence of environmental factors, and assessing the efficacy of drug therapies (2,3).

Both cDNA and oligonucleotide arrays are commonly employed. The complementary elements on the surface of the array are then identified by hybridization to a fluorescently or radioactively labeled probe (4). After hybridization, the array is illuminated to obtain an image. This image is quantified into expression (or signal) values. Once the expression levels are recorded, the scientist is left with the task of trying to understand the underlying biology of typically tens of thousands of data points (5–7). It is critical to select the appropriate software that will enable one to answer the biological question posed. However, there exists a plethora of algorithms and software packages that can be used to analyze the data, yet there is little guidance regarding which is best suited to each task (8–10). In addition, we are constantly inundated with new analysis tools. This only serves to complicate the selection process. There is no standardized means of biological analysis.

To benefit from the tools, it is important to evaluate each software suite with clearly defined scientific and practical criteria. A combination of software scripted together or even a new software tool specifically written for the study may prove to be the ideal solution. Our software evaluation began with defining the criteria for successful data analysis. The criteria consisted of a list of features that were deemed most critical. This was followed by repeated visits to company Web sites, attendance at on-line and face-to-face product demonstrations, and concluded with a local community survey of “useful” tools. The search was limited to commercial companies that offered products and promotions that together were affordable to an individual investigator. At the time the project was initiated, this included the following software suites: GeneSight (BioDiscovery), GeneSpring (Silicon Genetics), Expressionist (Gene Data), and Vector Expression (Informax). “Expensive software” solutions were excluded. Performance benchmarking was provided by comparison to a series of free software tools from EisenLab, Bioconductor, and TIGR. Table 1 lists commercial and free microarray analysis software solutions that were considered. Our final candidates were selected from the preliminary price–performance screening and then subjected to a thorough evaluation using the clearly defined criteria outlined below.

2. Criteria

It is essential to develop the evaluation criteria in full consideration of the scientific hypothesis that is to be tested. The common criteria we used to evaluate microarray software are summarized in Table 2. Unfortunately, software suites offer many tools and options that are not required or best suited to addressing the hypothesis.

As a point of initiation it was helpful to identify the most commonly used techniques for microarray analysis. From a survey of 15 major journals, we identified a series of most referenced methods. The results are summarized in Table 3. We limited our search to articles published between January 2002 and March 2003. The most common data analysis methods are cluster- and fold-change analysis. Figure 1 shows the number of references found to each of the methods identified. A scoring system used to compare the attributes of the software was then defined using these and other criteria summarized in Table 2.
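Fold-change analysis, named above as one of the two most commonly referenced methods, can be illustrated with a minimal sketch. It assumes each gene has one treated and one control signal value, and it calls a gene changed when the absolute log2 ratio reaches a threshold. The function names, the twofold threshold, the signal floor, and the example values are all illustrative assumptions and are not taken from any of the packages reviewed here.

```python
import math


def log2_fold_change(treated, control, floor=1.0):
    """Return log2(treated / control), flooring both signals to avoid log of zero."""
    return math.log2(max(treated, floor) / max(control, floor))


def call_changes(signals, threshold=1.0):
    """Label each gene 'up', 'down', or 'unchanged' from its (treated, control) pair.

    A threshold of 1.0 on the log2 scale corresponds to a twofold change.
    """
    calls = {}
    for gene, (treated, control) in signals.items():
        lfc = log2_fold_change(treated, control)
        if lfc >= threshold:
            calls[gene] = "up"
        elif lfc <= -threshold:
            calls[gene] = "down"
        else:
            calls[gene] = "unchanged"
    return calls


# Hypothetical (treated, control) signal values for three made-up genes.
example = {
    "geneA": (800.0, 200.0),
    "geneB": (150.0, 600.0),
    "geneC": (310.0, 300.0),
}
print(call_changes(example))
```

A fixed threshold of this kind ignores replicate variability, which is one reason fold change is usually paired with the statistical tests listed in Table 2.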

Table 1. Software

Commercial Microarray Software
Affymetrix          http://www.affymetrix.com
Gene Data           http://www.genedata.com
GeneSight           http://www.biodiscovery.com
GeneSpring          http://www.sigenetics.com
Informax            http://www.informaxinc.com
Omniviz             http://www.omniviz.com
Rosetta Resolver    http://www.rosettabio.com
SAS                 http://www.sas.com
Spotfire            http://www.spotfire.com

Free Microarray Software
BASE                http://base.thep.lu.se
Bioconductor        http://www.bioconductor.org
D-Chip              http://www.biostat.harvard.edu/complab/dchip
DKFZ                http://www.dkfz-heidelberg.de/tbi/services/mchips
EisenLab            http://rana.lbl.gov
TIGR                http://www.tigr.org

Statistics Tools
MatLab              http://www.mathworks.com
Partek              http://www.partek.com
R                   http://www.r-project.org
SAS                 http://www.sas.com
S-Plus              http://www.insightful.com

3. Results and Discussion

Using the criteria described in Table 2, the software was rigorously evaluated to identify what features excelled or were of limited value. The results of our tests that compared the four software suites are summarized in Table 4. All have functions for data preparation, normalization, clustering, and statistical analysis. They all have links to the Web that enable one to download annotations. Only a limited number of data sources are included with GeneSight and Vector Expression. Expressionist was the only software suite that effectively managed slide and data quality control. All of the software had the ability to link to a database system for archiving and data mining.

A list of advantages and limitations for each software package is summarized in Table 5. This exercise revealed several common difficulties. For example, it can be difficult to exchange data between software programs. Although they all accept similar plain text input, the output from each program cannot be directly compared or used for cross-computation without some extra work to convert the data into a particular format. In addition, the stability and reproducibility of results appeared erratic when large data sets were used, as most software tended to hang or crash when performing calculations on them. For example, clustering algorithms seemed to be limited to fewer than 20 Affymetrix microarrays.
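The "extra work" needed to compare output from two programs is typically a small conversion script. The sketch below is one hedged illustration, assuming both programs can export tab-delimited text with one gene per row. The column headers, gene identifiers, and values are invented for the example and do not reflect the actual export layout of any package evaluated here.

```python
import csv
import io


def read_export(text, id_column, value_column):
    """Parse one tab-delimited export into {gene_id: value}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row[id_column].strip(): float(row[value_column]) for row in reader}


def merge_exports(first, second):
    """Keep only genes present in both exports, pairing their values."""
    shared = sorted(first.keys() & second.keys())
    return {gene: (first[gene], second[gene]) for gene in shared}


# Two made-up exports that use different (hypothetical) column headers.
export_a = "Gene ID\tNormalized\nAFFX-1\t1.50\nAFFX-2\t0.40\n"
export_b = "probe\tsignal_ratio\nAFFX-2\t0.45\nAFFX-3\t2.10\n"

a = read_export(export_a, "Gene ID", "Normalized")
b = read_export(export_b, "probe", "signal_ratio")
print(merge_exports(a, b))
```

Mapping each program's column names onto a shared gene identifier is the key step. Once the values sit side by side, they can be passed to any statistics tool.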

Table 2. Definition of Criteria

Normalization

Normalization is a procedure to eliminate effects and systematic errors in a data set. Common systematic errors with microarray experiments come from a wide variety of sources, such as bias in the dyes, location, intensity, and slide. A normalization method should be able to eliminate these errors by transforming the data to the same scale so that the data are directly comparable.

Data Preprocessing

Preprocessing requires custom data transformations to correct background. Often implementations of ratios and logs are used. Some software offers highly customizable preprocessing pipelines, whereas others only allow for certain prepackaged algorithms. Most load or import the raw data and apply the preprocessing functions until the desired effect is achieved.

K-Means Cluster

A cluster can be a group of genes with similar expression or groups of individuals with similar expression profiles within a population. K-means is the fastest clustering algorithm. It is used to cluster genes or experiments into K groups of similar patterns. Some software has a function to help select and set appropriate K values. There are several common distance metrics. The results can vary depending on which settings are used for distance and the initialization of cluster centers (4).

Self-Organizing Map (SOM)

SOM uses a neural-network technique to discover patterns or classes of patterns from input data. This can be visualized as an array of nodes being trained by input data. The SOM is an effective software tool for the visualization of high-dimensional data. It converts complex statistical relationships between high-dimensional data items into simple geometric relationships on a low-dimensional display.

Hierarchical Cluster

The relation and distance of genes or experiments can be displayed in a hierarchical tree structure. There are two approaches to hierarchical clustering. The first, a bottom-up strategy, starts with n data points, then merges the points into clusters, constructing a tree. The second, a top-down strategy, starts with the whole set, then continually divides this set, constructing a tree (4).

Principal Component Analysis (PCA)

The goal of PCA is to reduce the dimensionality of the data. The most common implementations filter out dimensions that have the least variance. Considerations include the stability of the algorithm for large data sets and the graphical representation of the results.

Statistical Analysis

Statistical analysis requires a testable hypothesis. It is a statistical method used to identify gene-expression changes that are statistically significant under different treatments or conditions. This includes “traditional” two or more group comparisons such as the t-test and the analysis of variance (ANOVA). Preparatory theory is still not well developed for microarray analysis. Another statistical feature that is commonly used for hypothesis testing is multiple comparison adjustment. This modifies the p value for the number of tests, which can be in the thousands.

Slide and Data Quality Control

Artifacts (outliers and gradients that reduce the quality of the raw data from the microarray) are sometimes present. These problems can be masked or corrective filters applied. These tools allow the user to zoom in on questionable regions and mask out areas that appear damaged.


Downloaded Annotation Source

Most researchers employ several different literature search strategies to appropriately annotate their data once key genes are identified. This tedious work can be simplified by incorporating appropriate tools into the microarray analysis suite. By locally storing annotation data, researchers can combine microarray data with biological data. This markedly increases the effectiveness of data mining. Connections to Entrez-Nucleotide, UniGene, LocusLink, Gene Ontology, and the KEGG pathway are all examples.

Link to Web

This enables the software to connect with on-line software tools without directly downloading data. Some common examples include connections to Entrez-Nucleotide, GeneCard, UniGene, LocusLink, DDBJ, TIGR, FlyBase, and SwissProt.

Data Export

The software should be able to easily export graphs, gene lists, and gene values into common binary and text formats. This is important when writing custom scripts and algorithms and to exchange data sets between different software packages.

Database

The ability to archive, share, and mine data requires the use of a database system. A set of specifications that should be implemented are the MIAME standards. This promotes data sharing among the various members of the community.

Customizable Code

It is common for free, open-source projects to provide the entire source code. With some time and effort, these projects can be customized to meet specific needs. Many commercial packages support plug-in modules via a well-defined API.

Cost

Prices range from free to millions of dollars. In most cases, it is possible to obtain a free trial for a limited amount of time. Be aware that companies offer subscription-based pricing (per year) and perpetual pricing (per update) options.

Technical Support Community

It is wise to consider the level of support that the company is willing to provide. Telephone and e-mail support are standard, and some companies include on-line training, workshops, and code customization. In addition, some on-line communities have formed chat rooms for beginners, and advanced users discuss their techniques and troubleshoot problems. Directing questions to these groups or reading their on-line archives is a great way to learn about specific software packages.
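The multiple comparison adjustment described under Statistical Analysis in Table 2 can be made concrete with a short sketch. The table does not name a specific procedure, so the two shown here, Bonferroni and Benjamini-Hochberg, are assumptions chosen because they are widely used. The p values are invented for illustration.

```python
def bonferroni(p_values):
    """Multiply each p value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p values from four hypothetical per-gene tests.
p_values = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

Bonferroni controls the chance of any false positive and becomes very strict with thousands of genes. Benjamini-Hochberg instead controls the expected fraction of false positives among the genes called significant, which is often the more practical choice at that scale.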

Fig. 1. Number of references found. (Published from January, 2002 to March, 2003)

Table 3. Number of Methods Found per Journal (a)

Journal                               No. of Methods
EMBO J.                               6
Molecular Cell                        9
American Journal of Human Genetics    2
Nature Genetics                       7
J. Cell Biology                       3
Genes & Development                   8
Nature Medicine                       7
FASEB J.                              8
Cell                                  9
Genome Research                       15
PNAS                                  21
J. National Cancer Institute          2
Immunity                              3
Human Molecular Genetics              5
Science                               11
Total from 15 Journals                116

(a) Published from January, 2002 to March, 2003.

4. The Ideal Tool

The Wayne State University Bioinformatics program serves as a melting pot bringing together biologists, geneticists, computer scientists, and statisticians to work on common biological problems. With such varied frames of reference, it is difficult to identify a microarray analysis suite that will fulfill each discipline’s needs. Because the data analysis methods for microarrays are still evolving, there is certainly no single means to adequately conduct an analysis. Similarly, at this current stage of the field, no individual software package can include optimal implementations of all the algorithms and data formats of interest. Nevertheless, there is common ground shared among the disciplines. Every group expressed a preference toward open-source or application programming interface (API)-customizable software. There was a strong dislike toward using undocumented algorithms or programs with minimal customization capabilities, as additional sophisticated analysis tools are often required to complete the analysis. These include the statistical tools of SAS, R, S-plus, and MatLab. Collectively, it is desirable for the software to seamlessly connect with other statistical analysis software packages. The open-source project Bioconductor is written in R and holds promise for providing the means to develop additional statistical analysis methods (8). It should be noted that these tools require the appropriate knowledge of statistics and some programming skills. Even so, this proves to be easier than trying to develop new algorithms in lower level languages like C or Java.

Interestingly, the lack of a standard data representation remains to be fully resolved. The Microarray Gene Expression Data Society (MGED; 11) is currently working toward implementing a database standard, “Minimum Information About a Microarray Experiment” (MIAME), that describes microarray experiments, and standards that allow for the exchange and comparison of data points, called “Microarray Gene Expression Mark-Up Language” (MAGE-ML). These standards are becoming increasingly important as microarray data sets are archived much like complementary deoxyribonucleic acid (cDNA) libraries and expressed sequence tag (EST) collections. Some examples of implementations include MIAMExpress (http://www.ebi.ac.uk/microarray/MIAMExpress/miamexpress.html) and the BioArray Software Environment (http://base.thep.lu.se) (9).

A critical and neglected component of any ideal microarray software is the ability to biologically mine the data. For example, even the simple question “What is the biological meaning of the classical lists of up- and down-regulated genes that are generated for each experiment?” is often difficult to answer. One strategy to address this question is to classify genes into functional groups to construct their corresponding gene regulatory network or pathway. Unfortunately, the majority of the current data mining software is at a similar stage of development as the data analysis software. There are too few standards and far too many implementations. Most of the software we evaluated had a limited ability to retrieve data from Entrez, UniGene, LocusLink, GO, or KEGG. But these data-mining functions tend to be simple keyword searches and hyperlinks and only provide limited value. However, this may only be a temporary difficulty as SOURCE (http://source.stanford.edu) (12) and Onto Express (http://vortex.cs.wayne.edu/Projects.html) (13) are two examples of Web tools that have merged substantial data from multiple public databases in a useful fashion.

Table 4. Evaluation Comparison

Software compared: GeneSpring 5.0 (Silicon Genetics); GeneSight 3.2 (BioDiscovery); Expressionist 4.0 (Gene Data); Vector Expression (Demon) (Informax)

Normalization
GeneSpring 5.0: By mean, percentile, intensity dependent, regional normalization, housekeeping genes
GeneSight 3.2: By mean, percentile, Z-score, linear regression, piecewise linear
Expressionist 4.0: By logarithmic mean, arithmetic mean, median
Vector Expression: By mean, housekeeping genes, regression, variance equalizing

Data preparation
GeneSpring 5.0: √; GeneSight 3.2: √; Expressionist 4.0: √; Vector Expression: √

K-mean cluster
GeneSpring 5.0: √; GeneSight 3.2: √; Expressionist 4.0: √; Vector Expression: √

SOM
GeneSpring 5.0: √; GeneSight 3.2: √; Expressionist 4.0: √; Vector Expression: √

Hierarchical cluster
GeneSpring 5.0: √; GeneSight 3.2: √; Expressionist 4.0: √; Vector Expression: √

PCA
GeneSpring 5.0: √; GeneSight 3.2: √; Expressionist 4.0: √; Vector Expression: √

Statistical analysis
GeneSpring 5.0: t-test (parametric, nonparametric, unequal variance), one-way ANOVA, cross-gene error model, p-value adjustment
GeneSight 3.2: t-test (parametric, nonparametric, unequal variance), one-way ANOVA, significance analysis (based on noise-estimation), p-value adjustment
Expressionist 4.0: t-test (paired, parametric, nonparametric, unequal variance), one-way ANOVA
Vector Expression: t-test, Latin squares for flip-dye design, p-value adjustment

Slide and data quality control
GeneSpring 5.0: X; GeneSight 3.2: X; Expressionist 4.0: √; Vector Expression: X

Downloaded annotation source
GeneSpring 5.0: Entrez-Nucleotide, UniGene, LocusLink, Gene Ontology, KEGG pathway
GeneSight 3.2: Entrez-Nucleotide
Expressionist 4.0: Gene Ontology and gene description from Affymetrix for GeneChip
Vector Expression: Gene Ontology, Vector PathBlazer

Link to Web
GeneSpring 5.0: Entrez-Nucleotide, GeneCard, UniGene, LocusLink, DDBJ, TIGR, FlyBase, SwissProt
GeneSight 3.2: Entrez-Nucleotide, UniGene, PubMed, SOURCE, User customizable
Expressionist 4.0: Entrez-Nucleotide, NetAffy, GeneCard, LocusLink, OMIM, Gene Ontology, UniGene
Vector Expression: X

Data export
GeneSpring 5.0: Text, graph
GeneSight 3.2: Text, graph
Expressionist 4.0: Text, graph
Vector Expression: Text, graph, Excel

Database
GeneSpring 5.0: √; GeneSight 3.2: √; Expressionist 4.0: √; Vector Expression: √

Customizable code
GeneSpring 5.0: √; GeneSight 3.2: √; Expressionist 4.0: √; Vector Expression: X

Cost
GeneSpring 5.0: Contact company
GeneSight 3.2: Contact company
Expressionist 4.0: Depends on number of users
Vector Expression: Contact company

Technical support community
GeneSpring 5.0: Phone, e-mail, active e-mail lists, on-line training
GeneSight 3.2: Phone, e-mail, on-line training
Expressionist 4.0: Phone, e-mail
Vector Expression: Phone, e-mail, on-line training

5. Conclusion

There exists a wide variety of software for analyzing microarray data. It is impractical to evaluate all of the available software. In this report we have detailed a methodological approach toward comparing various microarray analysis tools. By first defining criteria then concentrating on those packages that meet these requirements, one should be able to identify the appropriate analysis tool that will directly meet one’s needs. However, many issues like data standards and analysis protocols remain. These issues need to be resolved before the use of microarray technology can be confidently applied and compared. At present, meeting these challenges is difficult because many software programs are closed systems with limited customizability.

Table 5. Advantages and Limitations of Software

GeneSpring 5.0
Advantages: A good set of integrated biological data-mining tools; a very active user discussion e-mail list
Limitations: A limited amount of data plot functions; a limited amount of normalization and statistical methods

GeneSight 3.2
Advantages: A comprehensive set of data preprocessing tools; a significance test based on noise sampling for replicate data
Limitations: Unable to handle large data sets; some operations are slow and unstable

Expressionist 4.0
Advantages: Tools to mask and correct some slide defects; is an easily accessible Web-based system
Limitations: Is geared to Affymetrix data sets; has fewer features for spotted arrays; a very limited set of tools for statistics

Vector Expression
Advantages: Integrates well with other Informax products; implements ANOVA model for flip-dye analysis
Limitations: User interface seems confusing to many users; some operations are slow and unstable
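The selection approach described in the Conclusion, defining criteria first and then keeping only the packages that meet them, can be sketched in a few lines. The feature flags below are transcribed from three of the check-mark rows of Table 4 as best they can be read (True for a check mark, False for an X). The key names and helper functions are illustrative assumptions, and the authors' own scoring system is not reproduced here.

```python
# Check-mark rows read from Table 4 (True = check mark, False = X).
FEATURES = {
    "GeneSpring 5.0": {"slide_qc": False, "database": True, "customizable_code": True},
    "GeneSight 3.2": {"slide_qc": False, "database": True, "customizable_code": True},
    "Expressionist 4.0": {"slide_qc": True, "database": True, "customizable_code": True},
    "Vector Expression": {"slide_qc": False, "database": True, "customizable_code": False},
}


def meets_requirements(features, required):
    """True if every required feature is present; unknown features count as absent."""
    return all(features.get(name, False) for name in required)


def shortlist(packages, required):
    """Return the package names that satisfy all required features, in table order."""
    return [
        name
        for name, features in packages.items()
        if meets_requirements(features, required)
    ]


print(shortlist(FEATURES, ["database", "customizable_code"]))
print(shortlist(FEATURES, ["slide_qc"]))
```

A yes/no filter like this only narrows the field. Criteria that are not binary, such as the range of normalization methods or cost, still need the kind of side-by-side judgment summarized in Tables 4 and 5.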


Acknowledgments

This work was supported in part by grant 442000 from the Michigan Life Sciences Corridor to SAK. The authors wish to thank Laila Poisson, Gary Chase, Jill Barnholz-Sloan, and Susan Land for their helpful contributions to the software evaluation trials and for their helpful suggestions throughout the course of this project.

References

1. Beaucage, S. L. (2001) Strategies in the preparation of DNA oligonucleotide arrays for diagnostic applications. Curr. Med. Chem. 8, 1213–1244.

2. Al-Khaldi, S. F., Martin, S. A., Rasooly, A., and Evans, J. D. (2002) DNA microarray technology used for studying foodborne pathogens and microbial habitats: minireview. J. AOAC Int. 85, 906–910.

3. Marton, M. J., DeRisi, J. L., Bennett, H. A., et al. (1998) Drug target validation and identification of secondary drug target effects using DNA microarrays. Nature Med. 4, 1293–1301.

4. Draghici, S. (2003) Data analysis and visualization in DNA microarrays. In Introduction to Bioinformatics (Krawetz, S. and Womble, D., eds.). Humana Press, Totowa, NJ, pp. 665–692.

5. Bittner, M., Meltzer, Y., Chen, Y., et al. (2000) Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 406, 536–540.

6. Eisen, M. B. and Brown, P. O. (1999) DNA arrays for analysis of gene expression. Methods Enzymol. 303, 179–205.

7. Hegde, P., Qi, R., Abernathy, K., et al. (2000) A concise guide to cDNA microarray analysis. Biotechniques 29, 548–556.

8. Gentleman, R. and Carey, V. (2002) Bioconductor. R News 2, 11–16.

9. Saal, L., Troein, C., Vallon-Christersson, J., Gruvberger, S., Borg, A., and Peterson, C. (2002) BioArray Software Environment (BASE): a platform for comprehensive management and analysis of microarray data. Genome Biol. 3, software0003.1–0003.6.

10. Wildsmith, S. E. and Elcock, F. J. (2001) Microarrays under the microscope. Mol. Pathol. 54, 8–16.

11. Spellman, P. T., Miller, M., Stewart, J., et al. (2002) Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biol. 3, RESEARCH0046.

12. Diehn, M., Sherlock, G., Binkley, G., et al. (2003) SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data. Nucleic Acids Res. 31, 219–223.

13. Khatri, P., Draghici, S., Ostermeier, C., and Krawetz, S. (2002) Profiling gene expression using Onto-Express. Genomics 79, 266–270.
