isochores merit the prefix ‘iso’

Deposited research article

Isochores Merit the Prefix ‘Iso’

Wentian Li1, Pedro Bernaola-Galvan2, Pedro Carpena2 and Jose L. Oliver3

Addresses: 1 Center for Genomics and Human Genetics, North Shore - LIJ Research Institute, 350 Community Drive, Manhasset, NY 11030, USA. 2 Departmenta de Fisica Aplicada II, Universidad de Malaga, E-39071, Malaga, Spain. 3 Departmento de Genetica, Instituto de Biotechnologia, Universidad de Granada, E-18071 Granada, Spain.

Correspondence: Wentian Li. E-mail: [email protected]

http://genomebiology.com/2002/3/11/preprint/0009.1


AS A SERVICE TO THE RESEARCH COMMUNITY, GENOME BIOLOGY PROVIDES A 'PREPRINT' DEPOSITORY

TO WHICH ANY ORIGINAL RESEARCH CAN BE SUBMITTED AND WHICH ALL INDIVIDUALS CAN ACCESS

FREE OF CHARGE. ANY ARTICLE CAN BE SUBMITTED BY AUTHORS, WHO HAVE SOLE RESPONSIBILITY FOR

THE ARTICLE'S CONTENT. THE ONLY SCREENING IS TO ENSURE RELEVANCE OF THE PREPRINT TO

GENOME BIOLOGY'S SCOPE AND TO AVOID ABUSIVE, LIBELLOUS OR INDECENT ARTICLES. ARTICLES IN THIS SECTION OF

THE JOURNAL HAVE NOT BEEN PEER-REVIEWED. EACH PREPRINT HAS A PERMANENT URL, BY WHICH IT CAN BE CITED.

RESEARCH SUBMITTED TO THE PREPRINT DEPOSITORY MAY BE SIMULTANEOUSLY OR SUBSEQUENTLY SUBMITTED TO

GENOME BIOLOGY OR ANY OTHER PUBLICATION FOR PEER REVIEW; THE ONLY REQUIREMENT IS AN EXPLICIT CITATION

OF, AND LINK TO, THE PREPRINT IN ANY VERSION OF THE ARTICLE THAT IS EVENTUALLY PUBLISHED. IF POSSIBLE, GENOME

BIOLOGY WILL PROVIDE A RECIPROCAL LINK FROM THE PREPRINT TO THE PUBLISHED ARTICLE.

Posted: 23 September 2002

Genome Biology 2002, 3(11):preprint0009.1-0009.15

The electronic version of this article is the complete one and can be found online at http://genomebiology.com/2002/3/11/preprint/0009

© BioMed Central Ltd (Print ISSN 1465-6906; Online ISSN 1465-6914)

Received: 19 September 2002

This is the first version of this article to be made available publicly. A peer-reviewed and modified version is now available in full at Computational Biology and Chemistry 2003, 27:5-10

This information has not been peer-reviewed. Responsibility for the findings rests solely with the author(s).

2 Genome Biology Deposited research (preprint)

Isochores Merit the Prefix ‘Iso’

Wentian Li1,∗, Pedro Bernaola-Galvan2, Pedro Carpena2, Jose L. Oliver3

1. Center for Genomics and Human Genetics, North Shore - LIJ Research Institute

350 Community Drive, Manhasset, NY 11030, USA

2. Departmenta de Física Aplicada II, Universidad de Malaga, E-39071, Malaga, Spain

3. Departmento de Genetica, Instituto de Biotechnología, Universidad de Granada, E-18071 Granada, Spain

Email addresses: [email protected], [email protected], [email protected], [email protected].

* The corresponding author.

Running title: isochores merit iso

Abbreviations and acronyms:

ANOVA: analysis of variance; IHGSC: International Human Genome Sequencing Consortium; MHC: major histocompatibility complex


Abstract

The isochore concept for the human genome sequence was challenged in an analysis by the International Human Genome Sequencing Consortium (IHGSC). We argue here that a statement in the IHGSC analysis concerning the existence of isochores is incorrect, because it relied on an inappropriate statistical test. Testing for the existence of isochores should be equivalent to testing the homogeneity of windowed GC%. The statistical test applied in the IHGSC analysis, the binomial test, is instead a test of whether a sequence is random at the base level. To test for the existence of isochores, i.e. homogeneity in GC%, we propose a different statistical test: the analysis of variance (ANOVA). We show that DNA sequences rejected by the binomial test may not be rejected by the ANOVA test.

Background

The degree of homogeneity in base composition is a fundamental property of the human genome sequence. Not only does it characterize the organization and evolution of the genome, it also provides the context for many practical sequence analyses. Statistical quantities such as GC%, used in analyses such as computational gene recognition, should be sampled from a homogeneous region of the sequence. If they are sampled from an inhomogeneous region, error is introduced, and the quality of the analysis (for example, the performance of gene prediction) could be affected.

It has long been known from the work of Bernardi’s group that the human genome contains compositionally homogeneous regions with sizes of at least 200-300 kb [1, 2]. These homogeneous regions are called “isochores” [3], and the whole genome is a mosaic of isochores. Recently, however, this view of the human genome was questioned in the initial analysis of the human genome draft sequence [4]. That analysis purportedly shows that no 300-kb sequence examined could be claimed to be homogeneous (“... the hypothesis of homogeneity could be rejected for each 300-kb window in the draft genome sequence”, page 877 of [4]), and a stunning statement was made that, essentially, the isochore concept does not hold (“... isochores do not appear to merit the prefix ‘iso’”, page 877 of [4]).

The purpose of this Letter is to show that an incorrect statistical distribution for windowed GC% is assumed in [4], based on an unrealistic condition for DNA sequences. As a result, the statistical test used in [4] is invalid. We present a correct statistical test that assumes a more reasonable statistical distribution of windowed GC%. Under the new test, the conclusion concerning the existence of isochores is drastically altered. Although our test result may still depend on the window size at which GC% is sampled, and possibly on the choice of GC% groups, it is clear that the test in [4] is too biased towards rejecting the null hypothesis of homogeneity: sequences that fail the test in [4] usually do not fail our new test.

Results

For a sequence to be homogeneous in GC%, the mean of windowed GC% values sampled from one region of the sequence should be similar to that from another region, given the amount of variance allowed. In other words, to claim that a sequence is homogeneous, we need to know not only the means of GC% along the sequence but also the variance. Generally speaking, the mean and the variance are two independent parameters of a statistical distribution. In the homogeneity test of [4], however, the variance is assumed to be a function of the mean, and thus is not independently estimated.

In [4], the windowed GC% is assumed to follow a binomial distribution. For the binomial distribution to hold, bases within the window must be uncorrelated, like repeated tosses of a coin. Violating this assumption invalidates the use of the binomial distribution. A more reasonable statistical distribution for GC% is the normal distribution which, unlike the binomial distribution, has two independent parameters (mean and variance). The mean can be estimated from a single window, whereas the variance can be estimated from a group of windows.
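
As a worked illustration of how the binomial assumption ties the variance to the mean, the expected variance for a window of $L$ uncorrelated bases can be evaluated with the MHC class III values reported in Table 1 ($m = 0.5188$, $L = 20000$, observed variance $5.345 \times 10^{-4}$). The standard deviations below are derived here from those reported values and are rounded.

```latex
\[
\sigma_0^2 = \frac{m(1-m)}{L}
           = \frac{0.5188 \times 0.4812}{20000}
           \approx 1.248 \times 10^{-5},
\qquad
\sigma_0 \approx 0.0035
\]
\[
\sigma^2 = 5.345 \times 10^{-4},
\qquad
\sigma \approx 0.023,
\qquad
\frac{\sigma}{\sigma_0} \approx 6.5
\]
```

Under the binomial assumption, 20-kb windows would differ from one another by only about 0.35 percentage points of GC, whereas the observed spread is about 2.3 percentage points.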

To illustrate our point, we analyze two well-known isochore sequences, the Major Histocompatibility Complex (MHC) class III and class II sequences on human chromosome 6 [5, 6, 7, 8], with lengths 642.1 kb and 900.9 kb, respectively. The exact borders of the two isochore sequences are determined by a segmentation procedure [9, 10] and an online resource on isochore mapping [11]. We first repeat the test in [4] of whether these two sequences, viewed as collections of 20-kb windows, are sampled from a binomial distribution. According to [4], rejection in this test is considered evidence for heterogeneity. The test results are given in Table 1, which clearly shows that the variances of GC% values sampled from 20-kb windows are much larger than expected from a binomial distribution, with p-values close to 0 ($< 10^{-50}$).

This result, that the variance of GC% sampled from windows is much larger than expected from a binomial distribution, has been known for a long time [12, 13, 3, 14, 15] (and the references therein). It is therefore not surprising that the binomial distribution assumption is rejected even for isochore sequences, as shown in Table 1. Nevertheless, this rejection only shows that a 20-kb window is not a series of 20000 uncorrelated bases; it is not a rejection of homogeneity of windowed GC% along the sequence.
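
The distinction between randomness and homogeneity can be illustrated with a small simulation. The sketch below is not part of the original analysis: it compares an uncorrelated sequence with a toy correlated one (a symmetric two-state Markov chain, chosen here purely for illustration; real DNA correlations are longer-ranged). The function names, the window size of 2000 and the `stay` parameter are all assumptions made for this example. Both toy sequences have the same composition everywhere along their length.

```python
import random
from statistics import mean, variance


def iid_bases(length, m, rng):
    """Uncorrelated bases: 1 (G or C) with probability m, else 0 (A or T)."""
    return [1 if rng.random() < m else 0 for _ in range(length)]


def markov_bases(length, stay, rng):
    """Toy correlated bases (illustrative assumption): a symmetric two-state
    chain that keeps its previous state with probability `stay`.
    Its long-run GC fraction is 0.5 at every position."""
    state = 1 if rng.random() < 0.5 else 0
    bases = []
    for _ in range(length):
        bases.append(state)
        if rng.random() >= stay:
            state = 1 - state
    return bases


def variance_ratio(bases, window):
    """Observed variance of windowed GC fraction divided by the
    binomial expectation m(1-m)/window."""
    n_win = len(bases) // window
    gc = [sum(bases[i * window:(i + 1) * window]) / window for i in range(n_win)]
    m = mean(gc)
    return variance(gc) / (m * (1 - m) / window)


rng = random.Random(0)          # fixed seed so the run is repeatable
window, n_win = 2000, 200       # small sizes keep the example fast
ratio_iid = variance_ratio(iid_bases(window * n_win, 0.5, rng), window)
ratio_markov = variance_ratio(markov_bases(window * n_win, 0.9, rng), window)
print(f"uncorrelated: {ratio_iid:.2f}  correlated: {ratio_markov:.2f}")
```

For the uncorrelated sequence the ratio should be close to 1. For this particular chain, whose lag-1 correlation is 2 × 0.9 − 1 = 0.8, the ratio is expected to be near (1 + 0.8)/(1 − 0.8) = 9, so a binomial variance test would reject it even though it is statistically uniform along its length.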

To confirm that the binomial test used in [4] is a test of randomness of the sequence rather than of homogeneity, we also test one bacterial sequence (Borrelia burgdorferi, 910.7 kb) and two randomly generated sequences (with the same lengths and base compositions as the MHC class III and class II sequences). Table 1 shows that the null hypothesis cannot be rejected by the binomial test for the two random sequences, but is rejected for Borrelia burgdorferi, even though it is a particularly homogeneous genome, as shown in a recent survey of archaeal and bacterial genome heterogeneity [16].

We suggest that the more reasonable statistical distribution of windowed GC% is the normal (Gaussian) distribution, and that the more appropriate test of homogeneity of these GC% values along a sequence is the analysis of variance (ANOVA). There are at least two reasons to believe that ANOVA is the more appropriate test. First, it is a test of equality between means, which matches the intuitive meaning of homogeneity, i.e., that GC% is the same along the sequence. Second, ANOVA and the normal distribution reflect the real situation of DNA sequences: they are not random sequences, and windowed GC% values exhibit higher variances. ANOVA allows the variance to be estimated from the data, rather than being fixed by the mean value as in the binomial distribution. ANOVA was previously applied to the study of inter-chromosomal homogeneity of the yeast genome [14, 17].

To apply ANOVA as a test of homogeneity, we split a sequence into several super-windows, and each super-window into several windows. The GC% of each window is calculated. The null hypothesis is that the mean of the windowed GC% values is the same in every super-window. The simplest choice of super-windows and windows is to give all windows the same length. To match the discussion in [4], we choose 20-kb windows and 300-kb super-windows. This corresponds to roughly 2 super-windows with 16 windows per super-window for the MHC class III sequence, and 3 super-windows with 15 windows per super-window for the MHC class II sequence. ANOVA test results for these two isochores are listed in Table 2. The p-values are 0.192 and 0.323 for the MHC class III and class II sequences, respectively. The null hypothesis, that the means of GC% in different super-windows are the same, is not rejected.
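
The partitioning described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and two conventions are assumptions made here (ambiguous symbols such as N are ignored when computing GC%, and any trailing partial window or leftover window is dropped).

```python
def gc_fraction(seq):
    """GC fraction of a DNA string; symbols other than A, C, G, T are
    ignored (an assumption of this sketch)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else float("nan")


def windowed_gc(seq, window):
    """GC fraction in consecutive non-overlapping windows of equal length.
    A trailing partial window is dropped."""
    n_win = len(seq) // window
    return [gc_fraction(seq[i * window:(i + 1) * window]) for i in range(n_win)]


def split_into_super_windows(values, n_super):
    """Group window values into n_super contiguous super-windows with the
    same number of windows each; leftover windows at the end are dropped."""
    per_super = len(values) // n_super
    return [values[i * per_super:(i + 1) * per_super] for i in range(n_super)]


# Tiny example with 4-base windows instead of 20-kb windows.
values = windowed_gc("GGCCAATTGCATGGGG", 4)
groups = split_into_super_windows(values, 2)
print(values)
print(groups)
```

With 20-kb windows, a 642.1-kb sequence yields 32 full windows, which split evenly into 2 super-windows of 16 windows, matching the layout used for the MHC class III sequence.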

When the ANOVA test is applied to the Borrelia burgdorferi genome sequence and the two randomly generated sequences, the null hypothesis cannot be rejected, indicating that all three sequences are homogeneous at the respective window and super-window sizes (20 kb and 300 kb). This is a more satisfactory situation than with the binomial test, because a homogeneous bacterial sequence is now indeed confirmed to be homogeneous by the test.

Discussions

Because of the “domains within domains” phenomenon in DNA sequences [18, 19, 20], we should not automatically assume that a homogeneity test result obtained with 20-kb windows and 300-kb super-windows will hold for other window and super-window sizes. To check this, we carry out ANOVA tests on the MHC class III and class II sequences at other window and super-window sizes. Fig. 1 shows the ANOVA test result, $-\log_{10}(p\text{-value})$, for window sizes of around 20 kb, 10 kb, 5 kb, and 2.5 kb, with the sequence partitioned into 2, 3, 5, or 8 super-windows for the MHC class III sequence and into 2, 3, 5, or 9 super-windows for the MHC class II sequence.

Several observations can be made from Fig. 1. First, when GC% values are sampled from windows of a fixed size (e.g. 20 kb), changing the number of super-windows (i.e. the number of partitions of the sequence) does not greatly influence the ANOVA test result. Such a change corresponds to a regrouping of the windowed GC% values. Generally speaking, if the sequence is homogeneous, with all GC% values (taken at a fixed window size) being similar, regrouping these values does not turn an insignificant result into a significant one.

Second, the ANOVA test result becomes more significant as the window size decreases. This is understandable because, at smaller length scales, GC% fluctuations are no longer averaged out. These smaller-length-scale fluctuations could be due to repeats, insertions, foreign elements, etc. For the MHC class II sequence, when the window size is reduced to around 2.5 kb, the ANOVA test result is typically significant (Fig. 1). This is consistent with the definition of isochores as “fairly homogeneous” (as opposed to “strictly homogeneous”) segments above a size of 3 kb [21, 22], and justifies the “coarse graining” procedure used to locate isochore boundaries in [9].


Third, two isochore sequences may look similar at one length scale (e.g. 20 kb) but quite different at another. Fig. 1 shows that the MHC class II sequence is more heterogeneous than the MHC class III sequence when viewed at the 2-10 kb length scales. GC-poor sequences are generally considered to be more homogeneous than GC-rich sequences or, more accurately, a sequence with a GC% closer to 50% is considered more heterogeneous than one whose GC% is far from 50% [13, 3, 15]. Since the GC% of the MHC class III and class II sequences is 51.9% and 41.1%, respectively, we might expect the class II sequence to be more homogeneous than the class III sequence. Interestingly, Fig. 1 shows the contrary.

To conclude, the binomial test used in [4] cannot serve as a test of homogeneity when the expected variance does not reflect the true variance in the sequence. The expected variance in a binomial test, which is derived from the mean GC% rather than being an independent parameter, is unrealistic because the underlying base sequence is not random/uncorrelated. We are naturally led to the ANOVA test once we estimate the variance from the data. With ANOVA tests, it is clear that homogeneous regions of GC% in the human genome do exist; in other words, isochores exist.

Methods

Binomial test: Following [23], a binomial test is applied to the GC% values measured from windows of a fixed size (e.g. 20 kb). For example, if the sequence length is 900 kb, there are $n = 45$ such 20-kb windows and 45 GC% values. The variance of these GC% values, $\sigma^2$, is calculated, and the variance expected from a binomial distribution is $\sigma_0^2 = m(1-m)/20000$, where $m$ is the probability of G or C. The value of $m$ can be estimated by the actual GC% of the sequence. The test statistic is $c^2 = (n-1)\sigma^2/\sigma_0^2$. Under the null hypothesis (that the windowed GC% measurements follow a binomial distribution, which is true when the underlying base sequence is random/uncorrelated within the window), $c^2$ follows the $\chi^2_{df=n-1}$ distribution (e.g. $\chi^2_{df=44}$ in our example). For any given $c^2$ value, the p-value can be determined from the corresponding $\chi^2$ distribution.
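
The binomial test above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used for Table 1: the function names are hypothetical, and the upper-tail probability uses the Wilson-Hilferty normal approximation to the chi-square distribution (a substitute chosen here to stay within the standard library) rather than the exact distribution.

```python
import math
from statistics import mean, variance


def binomial_variance_statistic(n, m, var, window):
    """Return (sigma0_sq, ratio, c2) for n windowed GC fractions with
    mean m and sample variance var, each window holding `window` bases."""
    sigma0_sq = m * (1.0 - m) / window      # variance expected if bases are uncorrelated
    ratio = var / sigma0_sq
    c2 = (n - 1) * ratio                    # test statistic, chi-square with n-1 df under the null
    return sigma0_sq, ratio, c2


def chi2_upper_tail_approx(c2, df):
    """Approximate P(chi-square_df >= c2) with the Wilson-Hilferty
    transformation. This is an approximation, not the exact tail."""
    k = 2.0 / (9.0 * df)
    z = ((c2 / df) ** (1.0 / 3.0) - (1.0 - k)) / math.sqrt(k)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def binomial_test(gc_values, window):
    """Apply the test to a list of windowed GC fractions (values in 0..1)."""
    n = len(gc_values)
    _, ratio, c2 = binomial_variance_statistic(
        n, mean(gc_values), variance(gc_values), window)
    return ratio, c2, chi2_upper_tail_approx(c2, n - 1)


# Summary values for the MHC class III row of Table 1.
sigma0_sq, ratio, c2 = binomial_variance_statistic(32, 0.5188, 0.0005345, 20000)
p_value = chi2_upper_tail_approx(c2, 31)
print(sigma0_sq, ratio, c2, p_value)
```

Because the inputs here are the rounded values printed in Table 1, the statistic agrees with the tabulated one only to within rounding. An exact chi-square tail function from a statistics library could be substituted for the approximation.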

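ANOVA test: The ANOVA test (analysis of variance) is applied to several groups of GC% values (by comparison, the binomial test is applied to only one group of GC% values). The concepts of “group” and “member” in ANOVA become “super-window” and “window” here. The number of super-windows into which a sequence is partitioned is $a$, and the number of windows in super-window $i$ is $n_i$. Two “sums of squares” (SS) are defined: $SS_w = \sum_{i=1}^{a} \sum_{j=1}^{n_i} (\mathrm{GC\%}_{ij} - \overline{\mathrm{GC\%}}_i)^2$ (within groups), and $SS_a = \sum_{i=1}^{a} n_i (\overline{\mathrm{GC\%}}_i - \overline{\mathrm{GC\%}})^2$ (among groups), where $\overline{\mathrm{GC\%}}_i$ is the mean over the windows in super-window $i$ and $\overline{\mathrm{GC\%}}$ is the mean over all windows. The test statistic is $F = \frac{SS_a}{SS_w} \times \frac{\sum_{i=1}^{a}(n_i - 1)}{a - 1}$. The distribution of $F$ under the null hypothesis (i.e., the mean GC% is the same in every super-window, $\mathrm{GC\%}_1 = \mathrm{GC\%}_2 = \cdots = \mathrm{GC\%}_a$) is known, and this distribution can be used to determine the p-value.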
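
The ANOVA computation above can be sketched with the standard library alone. This is an illustrative sketch, not the code used for Table 2: the function names are hypothetical, and the p-value is taken from the F distribution with $(a-1, \sum_i (n_i-1))$ degrees of freedom, evaluated through a textbook continued-fraction routine for the regularized incomplete beta function.

```python
import math


def _beta_cf(a, b, x, max_iter=200, eps=3e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c, d = 1.0, 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def f_upper_tail(f, df_among, df_within):
    """P(F >= f) for the F distribution with (df_among, df_within) df."""
    x = df_within / (df_within + df_among * f)
    return reg_inc_beta(df_within / 2.0, df_among / 2.0, x)


def one_way_anova(groups):
    """groups: one list of windowed GC% values per super-window.
    Returns (F, df_among, df_within, p_value)."""
    a = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_w = sum((v - mu) ** 2 for g, mu in zip(groups, means) for v in g)
    ss_a = sum(len(g) * (mu - grand) ** 2 for g, mu in zip(groups, means))
    df_a, df_w = a - 1, n_total - a          # n_total - a equals sum of (n_i - 1)
    f = (ss_a / df_a) / (ss_w / df_w)
    return f, df_a, df_w, f_upper_tail(f, df_a, df_w)


# Toy input: two super-windows of three windows each.
print(one_way_anova([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]))
```

The expression `(ss_a / df_a) / (ss_w / df_w)` is the same quantity as $F$ in the text, since the within-group degrees of freedom equal $\sum_i (n_i - 1)$.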

Acknowledgments

We would like to acknowledge the financial support from the 5th Anton Dohrn Workshop at Ischia (2001), where some of the ideas presented here were discussed. W.L. acknowledges partial support from NIH contract N01-AR12256; P.B.G., P.C. and J.L.O. acknowledge the grant support BIO99-0651-CO2-01 from the Spanish Government.

References

[1] Bernardi G: The isochore organization of the human genome and its evolutionary history – a review, Gene 1993, 135:57-66.

[2] Bernardi G: The human genome: organization and evolutionary history, Annual Review of Genetics 1995, 23:637-661.

[3] Cuny G, Soriano P, Macaya G, Bernardi G: The major components of the mouse and human genomes. I. preparation, basic properties and compositional heterogeneity, European Journal of Biochemistry 1981, 115:227-233.

[4] Lander ES, Waterston RH, Sulston J, Collins FS, et al. (International human genome sequencing consortium): Initial sequencing and analysis of the human genome, Nature 2001, 409:860-921.

[5] Fukagawa T, Sugaya K, Matsumoto K, Okumura K, Ando A, Inoko H, Ikemura T: A boundary of long-range G+C% mosaic domains in the human MHC locus: pseudoautosomal boundary-like sequence exists near the boundary, Genomics 1995, 25:184-191.

[6] Fukagawa T, Nakamura Y, Okumura K, Nogami M, Ando A, Inoko H, Saito N, Ikemura T: Human pseudoautosomal boundary-like sequences: expression and involvement in evolutionary formation of the present-day pseudoautosomal boundary of human sex chromosomes, Human Molecular Genetics 1996, 5:23-32.

[7] Stephens R, Horton R, Humphray S, Rowen L, Trowsdale J, Beck S: Gene organisation, sequence variation and isochore structure at the centromeric boundary of the human MHC, Journal of Molecular Biology 1999, 291:789-799.

[8] Beck S, Geraghty D, Inoko H, Rowen L, et al. (The MHC sequencing consortium): Complete sequence and gene map of a human major histocompatibility complex, Nature 1999, 401:921-923.

[9] Oliver JL, Bernaola-Galvan P, Carpena P, Roman-Roldan R: Isochore chromosome maps of eukaryotic genomes, Gene 2001, 276:47-56.

[10] Li W: Delineating relative homogeneous G+C domains in DNA sequences, Gene 2001, 276:57-72.

[11] An online resource on isochore mapping: [http://bioinfo2.ugr.es/isochores/]

[12] Sueoka N: A statistical analysis of deoxyribonucleic acid distribution in density gradient centrifugation, Proceedings of the National Academy of Sciences 1959, 45:1480-1490.

[13] Sueoka N: On the genetic basis of variation and heterogeneity of DNA base composition, Proceedings of the National Academy of Sciences 1962, 48(4):582-592.

[14] Li W, Stolovitzky G, Bernaola-Galvan P, Oliver JL: Compositional heterogeneity within, and uniformity between, DNA sequences of yeast chromosomes, Genome Research 1998, 8:916-928.

[15] Clay O, Carel N, Douady C, Macaya G, Bernardi G: Compositional heterogeneity within and among isochores in mammalian genomes. I. CsCl and sequence analyses, Gene 2001, 276:15-24.

[16] Bernaola-Galvan P, Oliver JL, Carpena P, Clay O, Bernardi G: Intragenomic heterogeneity in prokaryotic genomes, Gene 2002, submitted.

[17] Oliver JL, Li W: Quantitative analysis of compositional heterogeneity in long DNA sequences: the two-level segmentation test (abstract), Genome Mapping, Sequencing & Biology (Cold Spring Harbor Laboratory) 1998, page 163.

[18] Li W, Marr T, Kaneko K: Understanding long-range correlations in DNA sequences, Physica D 1994, 75:392-416.

[19] Bernaola-Galvan P, Roman-Roldan R, Oliver JL: Compositional segmentation and long-range fractal correlations in DNA sequences, Physical Review E 1996, 53:5181-5189.

[20] Li W: The study of correlation structures of DNA sequences – a critical review, Computers & Chemistry (special issue on open problems of computational molecular biology) 1997, 21:257-271.

[21] Bettecken Th, Aissani B, Muller CR, Bernardi G: Compositional mapping of the human dystrophin-encoding gene, Gene 1992, 122:329-335.

[22] Bernardi G: Isochores and the evolutionary genomics of vertebrates, Gene 2000, 241:3-17.

[23] Web supplement material for (Lander et al., 2001): [http://www.nature.com/nature/journal/v409/n6822/suppinfo/409860a0.html]


[Figure 1 image not reproduced. Two panels: MHC class III (left; GC% = 51.9%, L = 642 kb) and MHC class II (right; GC% = 41.1%, L = 901 kb). Horizontal axis: window size (kb); vertical axis: $-\log_{10}(p\text{-value})$, with reference levels marked at p = 0.05, p = 0.01 and p = 0.001. Plot symbols give the number of super-windows. Super-window sizes labelled on the lines: 320 kb, 210 kb, 128 kb and 80 kb for class III; 450 kb, 300 kb, 180 kb and 100 kb for class II.]

Figure 1: The $-\log_{10}(p\text{-value})$ of ANOVA tests as a function of the window sizes, for MHC class III (left) and MHC class II (right) sequences. These tests with the same number of super-windows are connected in a line. The size of the super-window and the number of super-windows in the sequence is indicated for each line.


| seq | # win (n) | mean | var $\sigma^2$ | binomial var $\sigma_0^2$ | $\sigma^2/\sigma_0^2$ | $c^2 = (n-1)\sigma^2/\sigma_0^2$ | p-value |
|---|---|---|---|---|---|---|---|
| MHC class III | 32 | 0.5188 | 0.0005345 | 0.00001248 | 42.8215 | 1327.47 | 0 |
| MHC class II | 45 | 0.4105 | 0.0007268 | 0.00001210 | 60.0709 | 2703.19 | 0 |
| random (class III) | 32 | 0.5185 | 0.00001137 | 0.00001248 | 0.9110 | 28.2402 | 0.609 |
| random (class II) | 45 | 0.4106 | 0.00001255 | 0.00001210 | 1.0369 | 45.6244 | 0.404 |
| B. burgdorferi | 45 | 0.2859 | 0.0001515 | 0.00001021 | 14.8432 | 653.099 | 0 |

Table 1: Testing the hypothesis that GC% values sampled from 20-kb windows follow a binomial distribution. Five sequences are tested: the MHC class III and MHC class II isochore sequences, two random sequences similar to these two MHC sequences (same length and same base composition), and the genome sequence of the bacterium Borrelia burgdorferi. Detailed explanation of column headers: 1. Sequence name. 2. Total number of windows in the sequence ($n$), each contributing a GC% value. 3. Mean of the GC% ($m$). 4. Variance of the GC% ($\sigma^2$). 5. Variance of GC% expected from a binomial distribution ($\sigma_0^2 = m(1-m)/20000$). 6. Ratio of the two variances, $\sigma^2/\sigma_0^2$. 7. Test statistic $c^2 = (n-1)\sigma^2/\sigma_0^2$. 8. p-value from the binomial distribution test.


| sequence (sw, w) | source of variation | df | SS | MS | F-value | p-value |
|---|---|---|---|---|---|---|
| MHC class III (sw=2, w=16) | between super-windows | 1 | 0.0009159 | 0.0009159 | 1.781 | 0.192 |
| | within super-windows | 30 | 0.01543 | 0.0005143 | | |
| MHC class II (sw=3, w=15) | between super-windows | 2 | 0.001658 | 0.0008288 | 1.162 | 0.323 |
| | within super-windows | 42 | 0.02997 | 0.0007137 | | |
| random seq similar to class III (sw=2, w=16) | between super-windows | 1 | 0.00000288 | 0.00000288 | 0.247 | 0.623 |
| | within super-windows | 30 | 0.0003496 | 0.00001165 | | |
| random seq similar to class II (sw=3, w=15) | between super-windows | 2 | 0.00004546 | 0.00002273 | 1.884 | 0.165 |
| | within super-windows | 42 | 0.0005066 | 0.00001206 | | |
| B. burgdorferi (sw=3, w=15) | between super-windows | 2 | 0.0002064 | 0.0001032 | 0.671 | 0.517 |
| | within super-windows | 42 | 0.006461 | 0.0001538 | | |

Table 2: ANOVA test results of the five sequences (two MHC isochore sequences, the two corresponding random sequences, and the bacterium Borrelia burgdorferi sequence). df: degrees of freedom; SS: sum of squares; MS: mean squares; F-value: test statistic value; p-value: p-value from the ANOVA test. sw is the number of super-windows and w is the number of windows per super-window.
