Computational Biology and Chemistry 31 (2007) 92–98

Software Note: Using probe secondary structure information to enhance Affymetrix GeneChip background estimates

Raad Z. Gharaibeh, Anthony A. Fodor, Cynthia J. Gibas*
Bioinformatics Research Center, University of North Carolina at Charlotte, 9201 University City Blvd., Charlotte, NC 28223, USA

Received 13 February 2007; accepted 14 February 2007

* Corresponding author. Tel.: +1 704 687 8378; fax: +1 704 687 6610. E-mail address: [email protected] (C.J. Gibas).
1476-9271/$ – see front matter © 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.compbiolchem.2007.02.008

Abstract

High-density short oligonucleotide microarrays are a primary research tool for assessing global gene expression. Background noise on microarrays comprises a significant portion of the measured raw data. A number of statistical techniques have been developed to correct for this background noise. Here, we demonstrate that probe minimum folding energy and structure can be used to enhance a previously existing model for background noise correction. We estimate that probe secondary structure accounts for up to 3% of all variation on Affymetrix microarrays.

© 2007 Elsevier Ltd. All rights reserved.

Keywords: DNA microarray; Probe secondary structure; Background correction

1. Introduction

Microarray technology holds the promise of capturing global gene expression by providing global molecular snapshots of the cell's transcriptional machinery products (Lockhart et al., 1996). The ultimate goal of gene expression microarrays is to measure the abundance of each known transcript in the sample under investigation. The abundance is inferred from the signal generated by each probe as a result of a hybridization reaction with a labeled target (transcript). However, this signal reflects not only the target abundance but also background noise from non-specific binding and autofluorescence of the chip surface.

In the Affymetrix GeneChip system, each transcript's abundance is measured by a set of 11–20 probe pairs. Each pair is composed of a perfect match probe (PM), which exactly complements a region on the transcript, and a mismatch probe (MM), which is identical to the PM probe except at the 13th base, where the complementary nucleotide is introduced. MM probes were originally introduced by Affymetrix to measure background noise. However, it has been shown by many groups that MM probes contain a significant amount of PM signal and are therefore unreliable as estimators of background noise (Chudin et al., 2001; Forman et al., 1998; Irizarry et al., 2003; Naef et al., 2003). A true estimate of background noise would improve the quality of Affymetrix GeneChip data.

Inconsistency of the signal generated from each probe is a common phenomenon in GeneChip microarray experiments (Li and Wong, 2001; Nielsen et al., 2005). The differences in the signal produced can be attributed to many sources: optical noise, cross-hybridization, dye-related contributions and probe sequence composition. Many algorithms have been developed to attempt to correct for these inconsistencies (Irizarry et al., 2006; Wu and Irizarry, 2005; Zhang et al., 2003). In particular, it has been found that probe sequence composition can significantly affect the intensity of the signal generated from that probe, independent of the concentration of its target. A number of groups have suggested models where the background intensity of probes could be estimated based on their sequence composition (Naef and Magnasco, 2003; Zhang et al., 2003).

The process of nucleic acid hybridization in solution has been well studied and models such as the nearest-neighbor model provide a robust description of hybridization thermodynamics (SantaLucia and Hicks, 2004). Probe-target hybridization on the microarray surface, however, does not follow the solution analogue, and the nearest-neighbor parameters that describe solution hybridization appear to be different from those for microarrays (Zhang et al., 2003). On-chip DNA hybridization is likely to be complicated by the geometric constraints of having one strand (i.e. probe) attached to the surface of the chip (Shchepinov et al., 1997). In addition, many other factors like probe and target secondary structure, effective reaction volume,
electrostatics, diffusion and surface effects, reaction thermodynamics and kinetics, competitive binding effects, hybridization buffer composition and probe–probe interactions are believed to affect microarray DNA hybridization (Lima et al., 1992; Southern et al., 1999).

In this study, we examine the effect of predicted probe secondary structure on background hybridization in Affymetrix microarrays. Although microarray probes are attached to the surface of the chip, they are dynamic molecules that, depending on their sequence composition, can fold onto themselves into stable secondary structure. Such stable secondary structure has the potential to interfere with probe-target hybridization (Lima et al., 1992). Consequently, the signal obtained from such probes may not reflect the actual transcript concentrations. It has been shown, for example, that a stable secondary structure motif in a 20-mer probe dramatically decreases the final signal obtained to a point where the probe is considered insensitive to its intended target (Anthony et al., 2003).

Microarray probes are usually screened for the presence of stable secondary structure either by a simple base complementarity check or using more sophisticated and time consuming energy minimization algorithms (Markham and Zuker, 2005). The base complementarity check is more routinely used for its simplicity and speed. Discrepancies between methods do exist, and there are no guidelines that determine which method is preferable (Koehler and Peyret, 2005). It is therefore likely that, despite these screening procedures, a significant amount of secondary structure is present in probes in microarray experiments.

Here we propose that the background noise of each probe can be modeled as a function of its sequence composition and its minimum folding energy and secondary structure. By incorporating probe secondary structure information into a previously described model of background concentration (Naef and Magnasco, 2003), we improved the fit of that model to microarray data by 1–3% with minimal addition of significant free parameters.

2. Methods

2.1. Data sets

Seven data sets were used in this study (Table 1): the human genome U133 Latin square data set (http://www.affymetrix.com/support/technical/sample_data/datasets.affx), the Choe control data set (Choe et al., 2005), a Leukemia data set (Armstrong et al., 2002), a Malaria PM only data set (Le Roch et al., 2003), an Etoposide response data set (personal communication from Dr. Christine Richardson), a BK potassium channel knockout data set (Meredith et al., 2006; Pyott et al., 2007) and an alternative splicing PM only tiling microarray data set (Sugnet et al., 2006).

2.2. System and software

All the computational work was done on a 73-node Apple cluster. Each node is a dual 2.3 GHz PowerPC G5 with 2 GB RAM running Mac OS X 10.4.

Secondary structure prediction was done using the hybrid-min-ss program of the UNAFold-2.5 software package (Markham and Zuker, 2005). All probes were folded as single DNA strands at 45 °C and 1.0 M sodium concentration. All other options were set to the program defaults.

Simple linear model fitting and p-value calculations were done using the R linear model function (lm) (http://www.r-project.org/). The Naef and Magnasco (2003) model and the position-dependent secondary-structure attenuated affinity model were implemented in Perl. All Perl code is available upon request.

Table 1
R² of the Naef and Magnasco (2003) model (NM) and the position-dependent secondary-structure attenuated affinity model (PSAA) for the seven data sets used in this study

Data set                                                    n(a)  np(b)     Probe  NM             PSAA(c)
Latin square                                                42    248,152   PM     0.17 ± 0.009   0.184 ± 0.010
                                                                  248,152   MM     0.40 ± 0.009   0.416 ± 0.009
Choe (Choe et al., 2005)                                    6     195,994   PM     0.20 ± 0.022   0.216 ± 0.025
                                                                  195,994   MM     0.46 ± 0.017   0.49 ± 0.017
Leukemia (Armstrong et al., 2002)                           72    201,800   PM     0.49 ± 0.063   0.51 ± 0.062
                                                                  201,800   MM     0.60 ± 0.036   0.61 ± 0.035
Etoposide response (C. Richardson, personal communication)  60    496,468   PM     0.05 ± 0.040   0.06 ± 0.040
                                                                  496,468   MM     0.11 ± 0.062   0.12 ± 0.062
BK knockout (Meredith et al., 2006; Pyott et al., 2007)     20    496,468   PM     0.09 ± 0.035   0.10 ± 0.036
                                                                  496,468   MM     0.29 ± 0.050   0.30 ± 0.049
Splicing microarray (Sugnet et al., 2006)                   75    505,916   PM     0.30 ± 0.062   0.31 ± 0.063
Malaria (Le Roch et al., 2003)                              17    173,262   PM     0.36 ± 0.043   0.38 ± 0.043

Results presented as average R² ± S.D.
(a) Number of chips.
(b) Number of probes.
(c) The differences in R² between NM and PSAA are all statistically significant (P < 10⁻³) using paired one-sided Wilcoxon test.
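
The paired one-sided Wilcoxon test named in footnote (c) compares, chip by chip, the R² obtained with the two models. The sketch below is a minimal pure-Python version of an exact signed-rank test, given for illustration only. The function name and the per-chip values are made up, zero and tied differences are deliberately not handled, and the paper does not say which software was used for this test.

```python
def wilcoxon_signed_rank_greater(x, y):
    """Exact one-sided paired Wilcoxon signed-rank test (H1: x tends to exceed y).

    Sketch only: zero differences and tied absolute differences raise an error,
    so the ranks are exactly 1..n and the null distribution can be enumerated.
    Returns (w_plus, p_value).
    """
    diffs = [a - b for a, b in zip(x, y)]
    if any(d == 0 for d in diffs):
        raise ValueError("zero differences are not handled in this sketch")
    ordered = sorted(abs(d) for d in diffs)
    if len(set(ordered)) != len(ordered):
        raise ValueError("tied absolute differences are not handled in this sketch")
    rank = {value: i + 1 for i, value in enumerate(ordered)}
    # W+ is the sum of the ranks that belong to positive differences.
    w_plus = sum(rank[abs(d)] for d in diffs if d > 0)

    n = len(diffs)
    max_sum = n * (n + 1) // 2
    # counts[s] = number of subsets of the ranks {1..n} whose sum is s.
    # Under H0 every subset is equally likely to be the set of positive ranks.
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for r in range(1, n + 1):
        for s in range(max_sum, r - 1, -1):
            counts[s] += counts[s - r]
    p_value = sum(counts[w_plus:]) / 2 ** n
    return w_plus, p_value


# Made-up per-chip R^2 values for ten hypothetical chips (not data from Table 1).
nm = [0.10 + 0.01 * i for i in range(10)]
psaa = [value + 0.01 + 0.001 * i for i, value in enumerate(nm)]
w_plus, p_value = wilcoxon_signed_rank_greater(psaa, nm)
print(w_plus, p_value)
```

With ten pairs in which the second model wins every time, W+ is 55 and the exact one-sided P is 1/1024 (about 0.00098), the smallest value attainable with ten pairs.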

3. Results

3.1. Simple linear models

The signal intensity generated from each probe can be modeled as:

$$I_j = O_j + N_j + S_j \tag{1}$$

where I is the raw intensity value of probe j, O is the optical noise, N is the background noise of non-specific binding, and S is the signal generated from specific binding between probe j and its intended target (Wu and Irizarry, 2005). In this paper, we do not model the signal, and none of our models therefore contains a term for S. Since the S term, which we are ignoring in our models, is significantly higher in the PM probes than the MM probes, each probe type was modeled separately.

Controlling the GC content of the probe is one of the basic principles of microarray probe design. A probe with high GC content tends to hybridize better and to form a stable duplex with both target and non-target sequences. A simple linear model that relates probe intensity to GC content can be written as follows:

$$I_j = B_0 + B_1 \langle GC \rangle_j + \varepsilon_j \tag{2}$$

where I is the raw intensity of probe j, ⟨GC⟩_j the number of GC nucleotides in probe j (which is a number between 0 and 25), B_0 and B_1 are free parameters and ε_j is an error term. The model explains a modest amount of the overall intensity when applied to the Latin square data set; R² ≤ 0.02 for PM and 0.12 for MM (Fig. 1). The model explains more of the MM probe intensity because most of the signal obtained from MM probes is background noise. MM intensity is therefore more independent of the concentration of the target gene.
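
The GC-content model of Eq. (2) is an ordinary least squares fit with a single predictor. The authors fitted it with the R function lm; the sketch below is only a Python stand-in that shows the same calculation in closed form. The helper names, the probe sequences and the intensities are all made up for illustration.

```python
def gc_count(probe):
    """Number of G and C nucleotides in a probe sequence."""
    return sum(base in "GC" for base in probe.upper())


def fit_simple_linear(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x.

    Returns (b0, b1, r_squared). Raises ZeroDivisionError if all x are equal.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return b0, b1, 1.0 - ss_res / ss_tot


# Hypothetical 25-mers with 5, 10, 15 and 20 G bases, and made-up raw intensities.
probes = ["G" * g + "A" * (25 - g) for g in (5, 10, 15, 20)]
intensities = [120.0, 180.0, 260.0, 300.0]
b0, b1, r_squared = fit_simple_linear([gc_count(p) for p in probes], intensities)
print(b0, b1, r_squared)
```

Replacing `gc_count` with the folding energy or with SL as the predictor gives the corresponding fits for the other simple linear models in this section.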

We wondered how much of the background noise probe secondary structure would explain, compared to the GC content, when put into a simple linear model. The free energy of probe secondary structure formation (ΔG_SS) is an indicator of the stability of secondary structure in which the probe folds on itself. The more stable the secondary structure, the less a probe will be able to hybridize to its target or non-target sequences. As a result, one would expect to observe a low signal from such a probe.

How much of all probe variance can be explained directly by secondary structure predictions? A simple linear model is:

$$I_j = B_0 + B_1 \langle \Delta G_{SS} \rangle_j + \varepsilon_j \tag{3}$$

where ⟨ΔG_SS⟩_j is the minimum folding energy of probe j in kcal/mol. If we apply this simple linear model to the Latin square data set, we find very low r-squared values; R² < 10⁻⁴ (Fig. 1). However, the null hypothesis that the B_1 parameter is equal to zero is rejected with high confidence (Fig. 1). These data suggest that there is a statistically significant influence of ΔG_SS on the observed intensity, although this relationship does not explain very much of the overall intensity on the array.

One may argue that the low r-squared values in Eq. (3) are due to the fact that the ΔG_SS value does not reflect the size of the secondary structure motif found in that probe and the number of free bases available for hybridization. The program hybrid-min-ss reports, for the most stable secondary structure of a probe, whether a given nucleotide is involved in secondary structure formation or not. We can define a value, SL, which is the longest stretch of nucleotides that are not involved in secondary structure formation (for example, SL = 10 in Fig. 2). To investigate the relationship between the longest free string of bases of the folded probe (SL) and the observed intensity, we can again apply a simple linear model:

$$I_j = B_0 + B_1 \langle SL \rangle_j + \varepsilon_j \tag{4}$$

where ⟨SL⟩_j is the longest free string of bases in probe j based on its minimum energy structure. This model also has very low r-squared values when applied to the Latin square data set; R² < 10⁻³ (Fig. 1), only slightly higher than Eq. (3), suggesting little direct effect of probe SL on the observed intensity. However, the null hypothesis that the B_1 parameter is equal to zero is also rejected with high confidence (Fig. 1).

Fig. 1. R² distribution for the simple linear models (Eqs. (2)–(4)) for all the U133 Latin square chips. The null hypothesis that the B_1 parameter is equal to zero is rejected with high confidence (P < 10⁻⁴) for all the models.

We see that GC content can explain a modest amount of overall intensity. Models based on secondary structure explain
much less of the intensity data, although they are still highly statistically significant.

Fig. 2. Folded probe showing its sequence, minimum folding energy (ΔG_SS) and minimum energy structure. The longest free string of bases of the folded probe (SL) is shown in gray box; bases involved in hydrogen bonding are shown in green ovals.

3.2. Position-dependent secondary-structure attenuated affinity model

Since the three simple linear models (Eqs. (2)–(4)) all hold significant relationships with the observed intensity (Fig. 1), we wanted to combine them into one model that takes into account the base composition, ΔG_SS and SL of the probe. We found that a simple linear combination of GC, ΔG_SS and SL did not significantly improve on the power of the individual models (data not shown). We reasoned that a model that is aware of each probe's base position and involvement in the overall secondary structure of the probe would outperform models that ignore this information.

The model of Naef and Magnasco (2003) provides a starting point that meets our requirement for individual base information. In this model, probe background is modeled based on sequence composition:

$$\ln\langle B/M \rangle = \sum_{k=1}^{25} \sum_{l \in \{A,T,C,G\}} S_{lk} A_{lk} \tag{5}$$

where B is the raw probe intensity, M is the median intensity of the array, l is the nucleotide index, k is the position of l along the probe, S is a Boolean variable equal to 1 if the probe sequence has l at k and zero otherwise, and A is the per-site per-letter affinity. To clarify, consider the probe shown in Fig. 2; Eq. (5) then reads (subscripts in the expansion are written position first, so A_{1C} is the affinity of C at position 1):

$$\begin{aligned}
\ln\langle B/M \rangle = {} & (S_{1G} \times A_{1G}) + (S_{1A} \times A_{1A}) + (S_{1T} \times A_{1T}) + (S_{1C} \times A_{1C}) \\
& + (S_{2G} \times A_{2G}) + (S_{2A} \times A_{2A}) + (S_{2T} \times A_{2T}) + (S_{2C} \times A_{2C}) \\
& + (S_{3G} \times A_{3G}) + (S_{3A} \times A_{3A}) + (S_{3T} \times A_{3T}) + (S_{3C} \times A_{3C}) \\
& + \cdots + (S_{25G} \times A_{25G}) + (S_{25A} \times A_{25A}) + (S_{25T} \times A_{25T}) + (S_{25C} \times A_{25C})
\end{aligned}$$

$$\ln\langle B/M \rangle = A_{1C} + A_{2G} + A_{3A} + \cdots + A_{25C}$$

Eq. (5) is a simple model that has four free parameters for each probe base (100 free parameters for a 25-base probe). The values of these 100 free parameters are generated by linear least squares fit (Naef and Magnasco, 2003). Given the large number of probes in each chip (about half a million for the human genome U133 chip, for example), over-fitting is not a concern.

In our approach, we add the continuous variable θ to reflect the involvement of the probe nucleotides in secondary structure formation. The model now is written as

$$\ln\langle B/M \rangle = \sum_{k=1}^{25} \sum_{l \in \{A,T,C,G\}} \theta\, S_{lk} A_{lk} \tag{6}$$

where θ is evaluated for the base at position k. The θ term reflects the degree to which an individual probe base participates in secondary structure formation. In our model, it is represented by any value between 0 and 1. There are a large number of ways in which values for θ could be generated. We made the following simplifying assumptions. We begin by considering nucleotides that are not involved in secondary structure formation. In cases where a probe's ΔG_SS > 0 kcal/mol we can set θ for all bases within that probe to 1. Likewise, when a base within a probe is not involved in secondary structure hydrogen bonding (yellow ovals in Fig. 2, for example), we can set θ to 1 for that base. To calculate θ for the remaining bases, we set a ΔG_SS cutoff value, below which θ will be constant (θ = t_b). We assumed that the relationship between θ and ΔG_SS is linear in the region where ΔG_SS is between ΔG_SS-cutoff and 0 (Fig. 3). From the assumption of linearity, we can derive a slope and an
intercept to yield:

$$\theta = \frac{t_b - 1}{\Delta G_{SS\text{-cutoff}}} \langle \Delta G_{SS} \rangle_j + 1 \tag{7}$$

This equation has two unknown parameters, ΔG_SS-cutoff and t_b. To find the best values for these parameters, we tested the effects of changing ΔG_SS-cutoff and t_b on the performance of the model (Eq. (6)) on a single chip from the Latin square data set. We found that the best performance of the model was obtained at ΔG_SS-cutoff = −3.6 kcal/mol and t_b = 0.35 (Fig. 4).

Fig. 3. A model for the relationship between ΔG_SS and θ for the bases involved in secondary structure formation.

Fig. 4. Effects of changing the values of ΔG_SS-cutoff and t_b on the performance of the position-dependent secondary-structure attenuated affinity model using the human genome U133 Latin square experiment 2 replicate 1 PM probes.

To summarize, we define our position-dependent secondary-structure attenuated affinity model (PSAA) as Eq. (6), where B is the raw probe intensity, M is the median intensity of the array, l is the letter index, k is the position of l along the probe, A is the per-site per-letter affinity, S is a Boolean variable equal to 1 if the probe sequence has l at k and zero otherwise, and θ is:

(a) 1, if the probe ΔG_SS > 0 kcal/mol.
(b) 1, if the probe ΔG_SS < 0 kcal/mol and l is not involved in secondary structure hydrogen bonding.
(c) given by

$$\theta = \begin{cases} 0.35 & \text{if } \Delta G_{SS} \le -3.6 \\ \dfrac{0.65}{3.6} \langle \Delta G_{SS} \rangle_j + 1 & \text{if } -3.6 < \Delta G_{SS} \le 0 \end{cases}$$

and l is involved in secondary structure hydrogen bonding.

Here, the involvement of each probe base in secondary structure hydrogen bonding is based on its minimum energy structure.
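
The rules (a)–(c) and the SL measure of Section 3.1 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' Perl implementation: the function names are invented, and the dot-bracket string used to mark bonded bases is an assumed input format, since the text does not describe how the folding program's output was parsed.

```python
DG_CUTOFF = -3.6  # kcal/mol, best value reported in the text
T_B = 0.35        # best value reported in the text


def theta_bonded(dg_ss, dg_cutoff=DG_CUTOFF, t_b=T_B):
    """Theta for a base that is hydrogen bonded in the minimum energy structure."""
    if dg_ss > 0:
        return 1.0                      # rule (a): no stable structure
    if dg_ss <= dg_cutoff:
        return t_b                      # rule (c), constant region
    # rule (c), linear region: Eq. (7)
    return (t_b - 1.0) / dg_cutoff * dg_ss + 1.0


def paired_mask(dot_bracket):
    """Assumed input: '.' marks a free base, any other symbol a bonded base."""
    return [symbol != "." for symbol in dot_bracket]


def theta_per_base(mask, dg_ss):
    """Theta for every position; free bases always get 1 (rule (b))."""
    bonded = theta_bonded(dg_ss)
    return [bonded if is_paired else 1.0 for is_paired in mask]


def longest_free_stretch(mask):
    """SL: length of the longest run of bases not involved in structure."""
    best = run = 0
    for is_paired in mask:
        run = 0 if is_paired else run + 1
        best = max(best, run)
    return best


# Hypothetical 25-base hairpin: a 4-base stem, a 4-base loop and a 13-base free tail.
structure = "((((....))))" + "." * 13
mask = paired_mask(structure)
thetas = theta_per_base(mask, dg_ss=-2.0)
print(longest_free_stretch(mask), round(thetas[0], 2), thetas[4])
```

With the reported parameters, a bonded base in a probe whose folding energy is −2.0 kcal/mol receives θ of about 0.64, while every free base keeps θ = 1.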

When we consider the folded probe presented in Fig. 2, Eq. (6) reads:

$$\begin{aligned}
\ln\langle B/M \rangle = {} & \theta\,((S_{1G} \times A_{1G}) + (S_{1A} \times A_{1A}) + (S_{1T} \times A_{1T}) + (S_{1C} \times A_{1C})) \\
& + \theta\,((S_{2G} \times A_{2G}) + (S_{2A} \times A_{2A}) + (S_{2T} \times A_{2T}) + (S_{2C} \times A_{2C})) \\
& + \theta\,((S_{3G} \times A_{3G}) + (S_{3A} \times A_{3A}) + (S_{3T} \times A_{3T}) + (S_{3C} \times A_{3C})) \\
& + \cdots + \theta\,((S_{25G} \times A_{25G}) + (S_{25A} \times A_{25A}) + (S_{25T} \times A_{25T}) + (S_{25C} \times A_{25C}))
\end{aligned}$$

$$\ln\langle B/M \rangle = A_{1C} + A_{2G} + 0.64\,A_{3A} + \cdots + A_{25C}$$

The model defined in Eq. (6) was fitted to all the data sets (Table 1). The fitting was done on the PM and MM probes separately. Table 1 shows a comparison between the native Naef and Magnasco (2003) model and our position-dependent secondary-structure attenuated affinity model. We see that including probe secondary structure information improved the fit of the native Naef and Magnasco (2003) model by 1–3% (an increase in R² of roughly 0.01–0.03), depending on the chip and probe type. Note that all the models (Eqs. (2)–(6)) perform better on the MM probes due to the higher background noise present in the MM signal.

3.3. Gains in performance cannot be trivially explained by additional free parameters

We note that there are two distinct kinds of free parameters in our model. The 100 free parameters from the original Naef and Magnasco model (Eq. (5)) are calculated for each chip by linear least squares fit. We have added two free parameters in Eq. (6), ΔG_SS-cutoff and t_b. These parameters were determined from one of the Latin square data set chips from the curves shown in Fig. 4 and were held constant for all the data sets in this paper. Given that our fits contain between 173,262 and 496,468 data points (Table 1), it seems unlikely that the improvements in performance could be explained by the addition of the free parameters ΔG_SS-cutoff and t_b. Nonetheless, to further rule out this possibility, we refolded the Latin square data set probes with either a completely random sequence (generated with an equal probability of A, C, G and T) or a shuffled sequence. Then we fed Eq. (6) the original probe sequence (i.e. the right l at k) along with the new ΔG_SS and the new minimum energy structure that resulted from folding the random or shuffled probe sequence. For the random sequence case, the performance of the original Naef and Magnasco (2003) model was severely degraded (Fig. 5). For the shuffled sequences, the probe's base composition is not affected, but the position of each base has been changed due to the shuffling process. For the shuffled sequences, the fit of the model dropped down to that of the original Naef and Magnasco (2003) model. These results on shuffled and random sequences show that the presence of the two additional free parameters ΔG_SS-cutoff and t_b cannot by itself explain the improved performance over the original Naef and Magnasco (2003) model. This strongly supports our argument that the gain in the r-squared values of our model came from including probe secondary structure information and does not arise trivially from the addition of free parameters.

Fig. 5. Effects of changing t_b or ΔG_SS-cutoff on the performance of the PSAA model. Effects of changing (A) the values of t_b while holding ΔG_SS-cutoff = −3.6 or (B) the values of ΔG_SS-cutoff while holding t_b = 0.35 on the performance of PSAA: the position-dependent secondary-structure attenuated affinity model (Eq. (6)). Data shown are for the human genome U133 Latin square experiment 2 replicate 1 PM probes. NM: Naef and Magnasco (2003) model (Eq. (5)). The suffixes (-SH) and (-RD) indicate the R² after generating the minimum folding energy (ΔG_SS) and the minimum energy structure from shuffled and random sequences, respectively (see Section 3.3 for explanation).

4. Discussion

In the absence of a clear understanding of the microarray hybridization mechanisms, and given the frequent use of probes that fold into stable secondary structure under the hybridization conditions on microarrays, a model is needed to explain or approximate the effects of such behavior on microarray signal. Using simple linear models, we saw a modest relationship (R² < 10⁻³) between probe intensity and its ΔG_SS or SL. We propose, as a more powerful alternative to the two-parameter linear models, a modification of the Naef and Magnasco (2003) model to include probe secondary structure effects on the background intensity. Our model works by equating an increase in secondary structure with a decreased contribution to a linear least squares fit. If a particular base is involved in secondary structure hydrogen bonding (Fig. 2), we assign it a low θ score depending on the overall ΔG_SS of the probe (Eq. (7)). Consequently, this base's contribution is attenuated in the Naef and Magnasco (2003) model. Consider, for example, the third adenine base in the folded probe presented in Fig. 2; in the Naef and Magnasco (2003) model its contribution to the brightness is A_{3A}. Based on the predictions of hybrid-min-ss, this base is involved in secondary structure hydrogen bonding and we therefore expect a reduced contribution to the intensity caused by background binding. We therefore attenuate its contribution to the brightness by θ, and its contribution now is 0.64 A_{3A} instead of A_{3A}. Results attenuated by θ have more power than the original Naef and Magnasco (2003) model over a wide range of Affymetrix data sets (Table 1).

The secondary structure information used here is based on the minimum folding energy (ΔG_SS) and the minimum energy structure, as predicted by an energy minimization algorithm (Markham and Zuker, 2005) that uses the nearest-neighbor parameters (SantaLucia, 1998) to predict secondary structure of single-stranded DNA molecules in solution. In the absence of a clear understanding of the effects of the geometric constraints of attaching one end of the DNA probe to the chip surface on its secondary structure, the nearest-neighbor parameters represent a reasonable approximation for microarrays (Held et al., 2003). We are also fully aware that single-stranded DNA molecules are highly dynamic and each molecule is likely to exist in an ensemble of structures. Consequently, the predicted minimum folding energy (ΔG_SS) and minimum energy structure for any single-stranded DNA molecule can differ between prediction algorithms, even when the same folding conditions are used. The results presented here are based on the minimum folding energy (ΔG_SS) and the minimum energy structure calculated using UNAFold (Markham and Zuker, 2005). It has been shown that the differences in the predicted minimum folding energy (ΔG_SS) and the minimum energy structure between different prediction algorithms are small (Ding et al., 2004; Ratushna et al., 2005). Consequently, we would expect similar results no matter which of the currently popular secondary structure prediction algorithms were used.

The results presented in this work suggest that, on average, 1–3% of the variation in intensities on Affymetrix GeneChip microarrays can be explained by probe secondary structure independent of any target information. Given that not all the probes form stable secondary structure (50% of the human genome U133 Latin square data set probes, for example, have predicted ΔG_SS > 0), the 1–3% enhancement over the original model is quite satisfactory, and represents a step forward in understanding the factors that affect the on-chip hybridization process.

The current design of GeneChip microarrays devotes half of the chip to MM probes. The sole purpose of these probes is to estimate the background noise portion present in the PM signal to enhance the chip's ability to detect differentially expressed genes. Advances in the ability to correctly estimate background noise on Affymetrix GeneChip microarrays based on probe sequence information may in the future eliminate the need for MM probes on these arrays, offering more space to interrogate more genes on the same array.

Acknowledgements

This research was supported in part by NIH 1R01GM072619-01 (C.J.G.) and by the UNC-Charlotte GASP program (R.Z.G.). Cel files for the splicing microarray data set were generously provided by Manny Ares.

eferences

nthony, R.M., Schuitema, A.R., Chan, A.B., Boender, P.J., Klatser, P.R.,Oskam, L., 2003. Effect of secondary structure on single nucleotide poly-morphism detection with a porous microarray matrix; implications for probeselection. Biotechniques 34, 1082–1089.

rmstrong, S.A., Staunton, J.E., Silverman, L.B., Pieters, R., den Boer, M.L.,Minden, M.D., Sallan, S.E., Lander, E.S., Golub, T.R., Korsmeyer, S.J.,2002. MLL translocations specify a distinct gene expression profile thatdistinguishes a unique leukemia. Nat. Genet. 30, 41–47.

hoe, S.E., Boutros, M., Michelson, A.M., Church, G.M., Halfon, M.S., 2005.Preferred analysis methods for Affymetrix GeneChips revealed by a whollydefined control dataset. Genome Biol. 6, R16.

Chudin, E., Walker, R., Kosaka, A., Wu, S., Rabert, D., Chang, T., Kreder, D., 2001. Assessment of the relationship between signal intensities and transcript concentration for Affymetrix GeneChip(R) arrays. Genome Biol. 3, research0005.0001–research0005.0010.

Ding, Y., Chan, C.Y., Lawrence, C.E., 2004. Sfold web server for statistical folding and rational design of nucleic acids. Nucleic Acids Res. 32, W135–W141.

Forman, J.E., Walton, I.D., Stern, D., Rava, R.P., Trulson, M.O., 1998. Thermodynamics of duplex formation and mismatch discrimination on photolithographically synthesized oligonucleotide arrays. ACS Symp. Ser. 682, 206–228.

Held, G.A., Grinstein, G., Tu, Y., 2003. Modeling of DNA microarray data by using physical properties of hybridization. Proc. Natl. Acad. Sci. U.S.A. 100, 7575–7580.

Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U., Speed, T.P., 2003. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249–264.

Irizarry, R.A., Wu, Z., Jaffee, H.A., 2006. Comparison of Affymetrix GeneChip expression measures. Bioinformatics 22, 789–794.

Koehler, R.T., Peyret, N., 2005. Effects of DNA secondary structure on oligonucleotide probe binding efficiency. Comput. Biol. Chem. 29, 393–397.

Le Roch, K.G., Zhou, Y., Blair, P.L., Grainger, M., Moch, J.K., Haynes, J.D., De La Vega, P., Holder, A.A., Batalov, S., Carucci, D.J., Winzeler, E.A., 2003. Discovery of gene function by expression profiling of the malaria parasite life cycle. Science 301, 1503–1508.

Li, C., Wong, W.H., 2001. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc. Natl. Acad. Sci. U.S.A. 98, 31–36.

Lima, W.F., Monia, B.P., Ecker, D.J., Freier, S.M., 1992. Implication of RNA structure on antisense oligonucleotide hybridization kinetics. Biochemistry 31, 12055–12061.

Lockhart, D.J., Dong, H., Byrne, M.C., Follettie, M.T., Gallo, M.V., Chee, M.S., Mittmann, M., Wang, C., Kobayashi, M., Horton, H., Brown, E.L., 1996. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol. 14, 1675–1680.

Markham, N.R., Zuker, M., 2005. DINAMelt web server for nucleic acid melting prediction. Nucleic Acids Res. 33, W577–W581.

Meredith, A.L., Wiler, S.W., Miller, B.H., Takahashi, J.S., Fodor, A.A., Ruby, N.F., Aldrich, R.W., 2006. BK calcium-activated potassium channels regulate circadian behavioral rhythms and pacemaker output. Nat. Neurosci. 9, 1041–1049.

Naef, F., Magnasco, M.O., 2003. Solving the riddle of the bright mismatches: labeling and effective binding in oligonucleotide arrays. Phys. Rev. E: Stat. Nonlinear Soft. Matter Phys. 68, 011906.

Naef, F., Socci, N.D., Magnasco, M., 2003. A study of accuracy and precision in oligonucleotide arrays: extracting more signal at large concentrations. Bioinformatics 19, 178–184.

Nielsen, H.B., Gautier, L., Knudsen, S., 2005. Implementation of a gene expression index calculation method based on the PDNN model. Bioinformatics 21, 687–688.

Pyott, S.J., Meredith, A.L., Fodor, A.A., Vazquez, A.E., Yamoah, E.N., Aldrich, R.W., 2007. Cochlear function in mice lacking the BK channel α, β-1, or β-4 Subunits. J. Biol. Chem. 282, 3312–3324.

Ratushna, V.G., Weller, J.W., Gibas, C.J., 2005. Secondary structure in the target as a confounding factor in synthetic oligomer microarray design. BMC Genom. 6, 31.

SantaLucia Jr., J., 1998. A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc. Natl. Acad. Sci. U.S.A. 95, 1460–1465.

SantaLucia Jr., J., Hicks, D., 2004. The thermodynamics of DNA structural motifs. Annu. Rev. Biophys. Biomol. Struct. 33, 415–440.

Shchepinov, M.S., Case-Green, S.C., Southern, E.M., 1997. Steric factors influencing hybridisation of nucleic acids to oligonucleotide arrays. Nucleic Acids Res. 25, 1155–1161.

Southern, E., Mir, K., Shchepinov, M., 1999. Molecular interactions on microarrays. Nat. Genet. 21, 5–9.

Sugnet, C.W., Srinivasan, K., Clark, T.A., Brien, G., Cline, M.S., Wang, H., Williams, A., Kulp, D., Blume, J.E., Haussler, D., Ares, M., 2006. Unusual intron conservation near tissue-regulated exons found by splicing microarrays.

Wu, Z., Irizarry, R.A., 2005. Stochastic models inspired by hybridization theory for short oligonucleotide arrays. J. Comput. Biol. 12, 882–893.

Zhang, L., Miles, M.F., Aldape, K.D., 2003. A model of molecular interactions on short oligonucleotide microarrays. Nat. Biotechnol. 21, 818–821.