

Statistics & Probability Letters 78 (2008) 135–143


On the expectation of the maximum of IID geometric random variables

Bennett Eisenberg

Department of Mathematics #14, Lehigh University, Bethlehem, PA 18015, USA

Received 6 December 2006; received in revised form 11 April 2007; accepted 23 May 2007

Available online 8 June 2007
doi:10.1016/j.spl.2007.05.011
E-mail address: [email protected]

Abstract

A study of the expected value of the maximum of independent, identically distributed (IID) geometric random variables is presented based on the Fourier analysis of the distribution of the fractional part of the maximum of corresponding IID exponential random variables.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Geometric random variables; Exponential random variables; Maximum; Asymptotic expectation; Fractional part; Fourier series; Bioinformatics

Introduction

The probability problem studied in this paper is motivated by a statistical problem in bioinformatics. In bioinformatics one often compares long sequences of nucleotides (denoted A, C, G, T) in DNA for similarities, or sometimes observes a single string for long runs of a given nucleotide or pattern of nucleotides (see Ewens and Grant, 2005, sec. 5.4 and sec. 6.3). In the simplest case of comparing two strands of DNA, one might line up the two DNA strands and count the lengths of matches. Under the null hypothesis of randomness, a match length might be modeled as a modified geometric random variable (the number of matches until the first non-match) with parameter $p = .75$, the probability of a non-match. One might also try to match triples of nucleotides called codons, which play a major role in molecular biology. Here we might use $p = 1 - .25^3 \approx .98$. These values for $p$ are approximations, since the nucleotides do not have equal probabilities in reality. In testing for statistical significance in these procedures, the test statistic is often the maximum of an extremely large number of independent, identically distributed (IID) modified geometric random variables. In doing such tests, approximations are used for the distribution of the test statistic under the null hypothesis. Ewens and Grant state on p. 138, "unless very accurate approximations are used for the mean and variance ..., serious errors in P-value approximations can arise ...". In this paper we analyze the standard approximation used for the mean of the maximum of IID geometric random variables.
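As a rough illustration of this model (a minimal sketch, not from the paper; the strand length and random seed are arbitrary choices), one can simulate two independent random DNA strands and collect the lengths of the match runs terminated by each non-matching position:

```python
import random

random.seed(1)
n = 100_000
s1 = random.choices("ACGT", k=n)
s2 = random.choices("ACGT", k=n)

# Walk the aligned strands; each run of matches is ended by a non-match.
# Under the null hypothesis of randomness, the number of matches before the
# non-match is a modified geometric variable with p = P(non-match) = 0.75.
run_lengths = []
current = 0
for a, b in zip(s1, s2):
    if a == b:
        current += 1
    else:
        run_lengths.append(current)
        current = 0

p = 0.75
print("observed mean run length:", sum(run_lengths) / len(run_lengths))
print("modified geometric mean (1 - p) / p:", (1 - p) / p)
```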

It is well known and easily shown that $E(M_n)$, the expected value of the maximum of $n$ IID exponential random variables with mean $1/\lambda$, is $(1/\lambda)\sum_{k=1}^{n}(1/k)$. This formula is useful for large $n$ since $\sum_{k=1}^{n}(1/k)$ is asymptotic to $\log n + \gamma$, where $\gamma = .577\ldots$ is Euler's constant. There is no such simple expression for $E(M^*_n)$, the expected value of the maximum of $n$ IID geometric random variables with mean $1/p$. There is the infinite series expression from the tail probabilities of the maximum of the random variables, and even a finite sum expression, but these expressions are not so useful.
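As a quick numerical illustration of the approximation $\sum_{k=1}^{n}1/k \approx \log n + \gamma$ (a sketch, not part of the original paper; the values of $n$ are arbitrary):

```python
import math

GAMMA = 0.5772156649015329  # Euler's constant

for n in (10, 100, 1000, 10**6):
    harmonic = sum(1.0 / k for k in range(1, n + 1))
    approx = math.log(n) + GAMMA
    # E(M_n) = harmonic / lam for n IID exponentials with rate lam;
    # the error of the log-approximation is already of order 1/n.
    print(n, harmonic, approx, harmonic - approx)
```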

It is also easily seen that
$$\frac{1}{\lambda}\sum_{k=1}^{n}\frac{1}{k} \le E(M^*_n) < 1 + \frac{1}{\lambda}\sum_{k=1}^{n}\frac{1}{k},$$
where $q = 1-p = e^{-\lambda}$. This is better, but not good enough. The gap of 1 between the upper and lower bounds is significant for moderate values of $\lambda$ and $n$. A finer analysis is needed.

Careful asymptotic approximations are given for $E(M^*_n)$ in Szpankowski and Rego (1990). In our notation the Szpankowski and Rego result for the first moment is as follows:
$$E(M^*_n) = \sum_{k=1}^{n}\frac{1}{\lambda k} + \frac{1}{2} - \frac{1}{\lambda}\sum_{m\neq 0}\Gamma\!\left(\frac{2\pi i m}{\lambda}\right)e^{-2\pi i m \log n/\lambda} + O(n^{-1}). \qquad (1)$$

Since $|\Gamma(2\pi i m/\lambda)|$ is very small for $\lambda < 2$ and all $m\neq 0$, this result has the interpretation that for $\lambda < 2$ and large $n$, $E(M^*_n) - \sum_{k=1}^{n}\frac{1}{\lambda k}$ is close to 1/2. This is the interpretation used in Jeske and Blessinger (2004) and applied to bioinformatics by Ewens and Grant (2005). This interpretation is true, but it can be misleading. First, there is no convergence to 1/2, since the gamma function terms do not go to zero; they are just very small. Also, the error term $O(n^{-1})$ means only that there is an unknown constant $C$ such that the term is bounded by $C/n$. Without knowing the value of $C$, one does not know what the bound really is. In this case the problem is compounded by the fact that for reasonable values of $n$ and $\lambda$, the value $1/n$ is much greater than the value of the gamma function terms. Hence the $O(n^{-1})$ term can easily dominate the infinite sum in determining how close $E(M^*_n) - \sum_{k=1}^{n}\frac{1}{\lambda k}$ is to 1/2.
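The scales involved are easy to compute (a sketch; it uses the standard identity $|\Gamma(iy)|^2 = \pi/(y\sinh(\pi y))$ rather than a complex gamma routine, and the chosen values of $\lambda$ and $n$ are illustrative, with $\lambda \approx 3.9$ corresponding to the codon example):

```python
import math

def abs_gamma_imag(y):
    """|Gamma(i*y)| for real y > 0, via |Gamma(i*y)|^2 = pi / (y * sinh(pi * y))."""
    return math.sqrt(math.pi / (y * math.sinh(math.pi * y)))

for lam in (1.0, 2.0, 3.9):        # lam = 3.9 corresponds to q = e^{-lam} ~ .02 (codons)
    y = 2.0 * math.pi / lam        # argument of Gamma(2*pi*i*m/lam) for m = 1
    print(f"lambda = {lam}: |Gamma(2*pi*i/lambda)| = {abs_gamma_imag(y):.2e}")

# For comparison, the size of the O(1/n) error term at moderate n:
for n in (10, 100, 1000):
    print(f"1/n for n = {n}: {1 / n:.2e}")
```

The gamma terms are indeed negligible for $\lambda \le 2$, but for moderate $n$ they can be much smaller than $1/n$, which is the point made above.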

In this paper we use simple Fourier analysis to show that $E(M^*_n) - \sum_{k=1}^{n}\frac{1}{\lambda k}$ is very close to 1/2 not only for moderate values of $\lambda$ but also for relatively small values of $n$, and that this difference is logarithmically summable to 1/2 for all values of $\lambda$. Moderate $\lambda$ may be interpreted as $\lambda < 2$. This corresponds to $p < .865$, which is almost always the case. We see, however, that the codon example is an exception to this. A key component of this work is the analysis of the distribution of the fractional part of $M_n$.

1. A survey of formulas for $E(M_n)$ and $E(M^*_n)$

Let $X_1, X_2, \ldots, X_n$ be IID exponential random variables with $P(X\le x) = 1-e^{-\lambda x}$ for $x>0$ and let $M_n = \max(X_1,\ldots,X_n)$. We then have $P(M_n\le x) = (1-e^{-\lambda x})^n$. It follows (substituting $u = 1-e^{-\lambda x}$) that
$$E(M_n) = \int_0^\infty P(M_n>x)\,dx = \int_0^\infty \bigl(1-(1-e^{-\lambda x})^n\bigr)dx = \int_0^1\frac{1-u^n}{\lambda(1-u)}\,du = \int_0^1\sum_{k=0}^{n-1}\frac{u^k}{\lambda}\,du = \sum_{k=1}^{n}\frac{1}{\lambda k}.$$

This also follows by decomposing $M_n$ as
$$M_n = X_{(1)} + (X_{(2)}-X_{(1)}) + \cdots + (X_{(n)}-X_{(n-1)}) = Y_1 + Y_2 + \cdots + Y_n, \qquad (2)$$
where $X_{(i)}$ is the $i$th order statistic of $X_1,\ldots,X_n$. It follows from the lack of memory property of exponential random variables, and the fact that the minimum of exponential random variables is exponential, that $Y_1,\ldots,Y_n$ are independent exponential random variables with parameters $n\lambda, (n-1)\lambda, \ldots, \lambda$, respectively. This implies
$$E(M_n) = \sum_{k=1}^{n}\frac{1}{\lambda k}\quad\text{and}\quad \mathrm{Var}(M_n) = \sum_{k=1}^{n}\frac{1}{\lambda^2k^2}. \qquad (3)$$
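A quick Monte Carlo check of (3) (a sketch; the parameter values, seed, and number of replications are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n, reps = 1.5, 20, 200_000

# Maxima of n IID exponential(rate lam) random variables.
samples = rng.exponential(scale=1 / lam, size=(reps, n)).max(axis=1)

mean_formula = sum(1 / (lam * k) for k in range(1, n + 1))
var_formula = sum(1 / (lam * k) ** 2 for k in range(1, n + 1))
print("mean:", samples.mean(), "vs", mean_formula)
print("var :", samples.var(), "vs", var_formula)
```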


Things are not so simple for discrete geometric random variables. Let $X^*_1,\ldots,X^*_n$ be IID geometric random variables with $P(X^*_i = k) = q^{k-1}p$ for $k = 1,2,\ldots$. We then have $P(X^*_i\le k) = 1-q^k$. Now let $M^*_n = \max(X^*_1,\ldots,X^*_n)$. Then $P(M^*_n\le k) = (1-q^k)^n$, where $k$ is a positive integer.

In analogy with the exponential case, we have
$$E(M^*_n) = \sum_{k=0}^{\infty}P(M^*_n>k) = \sum_{k=0}^{\infty}\bigl(1-(1-q^k)^n\bigr).$$

There is no closed form expression for this sum, and we cannot decompose $M^*_n$ into a sum of independent geometric random variables analogous to the decomposition (2) of $M_n$, since the order statistics $X^*_{(1)},\ldots,X^*_{(n)}$ are not necessarily distinct. However, letting $q = e^{-\lambda}$, we can say
$$\int_0^\infty\bigl(1-(1-e^{-\lambda x})^n\bigr)dx < \sum_{k=0}^{\infty}\bigl(1-(1-e^{-\lambda k})^n\bigr) < 1 + \int_0^\infty\bigl(1-(1-e^{-\lambda x})^n\bigr)dx.$$

It follows that
$$\frac{1}{\lambda}\sum_{k=1}^{n}\frac{1}{k} < E(M^*_n) < 1 + \frac{1}{\lambda}\sum_{k=1}^{n}\frac{1}{k}. \qquad (4)$$
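The bounds in (4) and the size of the gap are easy to examine numerically (a sketch; $\lambda = \log 4 \approx 1.386$ corresponds to the base-matching example $q = .25$ and $\lambda \approx 3.9$ to the codon example $q \approx .02$, while the truncation tolerance is an ad hoc choice):

```python
import math

def expected_max_geometric(n, q, tol=1e-15):
    """E(M*_n) = sum_{k>=0} (1 - (1 - q^k)^n), truncated once the terms are negligible."""
    total, k = 0.0, 0
    while True:
        term = 1.0 - (1.0 - q ** k) ** n
        if k > 0 and term < tol:
            return total
        total += term
        k += 1

for lam, n in [(math.log(4), 100), (math.log(4), 10_000), (3.9, 10_000)]:
    q = math.exp(-lam)
    lower = sum(1.0 / (lam * k) for k in range(1, n + 1))
    exact = expected_max_geometric(n, q)
    print(f"lam={lam:.3f}, n={n}: lower={lower:.4f}  E(M*_n)={exact:.4f}  upper={lower + 1:.4f}")
```

The exact value always sits strictly between the bounds, but knowing only that the gap is 1 is too crude for the approximations discussed below.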

Let $K_n$ equal the number of $X^*_1,\ldots,X^*_n$ equal to 1. Szpankowski and Rego (1990) use complicated analysis to solve the recursive relation
$$E(M^*_n) = \sum_{k=0}^{n}P(K_n=k)E(M^*_n\mid K_n=k) = \sum_{k=0}^{n}\binom{n}{k}p^k(1-p)^{n-k}\bigl(1+E(M^*_{n-k})\bigr)$$
to find exact and asymptotic expressions for $E(M^*_n)$. We can easily derive some of their results using the tail probability generating function of $M^*_n$. We have

$$Q_n(s) = \sum_{k=0}^{\infty}s^kP(M^*_n>k) = \sum_{k=0}^{\infty}s^k\bigl(1-(1-q^k)^n\bigr) = \sum_{k=0}^{\infty}s^k\sum_{j=1}^{n}\binom{n}{j}(-1)^{j+1}q^{jk} = \sum_{j=1}^{n}(-1)^{j+1}\binom{n}{j}(1-sq^j)^{-1}.$$

Letting $s=1$, we find $E(M^*_n) = Q_n(1) = \sum_{j=1}^{n}(-1)^{j+1}\binom{n}{j}(1-q^j)^{-1}$. This agrees with the formula (2.6) given in Szpankowski and Rego (1990). Higher moments can be found using derivatives of $Q_n(s)$. However, even these finite sums are not very useful in analyzing the behavior of the moments as $n\to\infty$.
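The remark above concerns asymptotic analysis, but the alternating sum is also delicate in floating point: the binomial coefficients grow quickly and the terms nearly cancel. The sketch below (our illustration, not a point made in the paper; parameter choices are arbitrary) compares it with the numerically stable tail-probability series:

```python
import math

def em_alternating(n, q):
    """E(M*_n) = sum_{j=1}^n (-1)^{j+1} C(n, j) / (1 - q^j); exact, but ill-conditioned."""
    return sum((-1) ** (j + 1) * math.comb(n, j) / (1.0 - q ** j) for j in range(1, n + 1))

def em_tail(n, q, tol=1e-15):
    """E(M*_n) = sum_{k>=0} (1 - (1 - q^k)^n); numerically stable."""
    total, k = 0.0, 0
    while True:
        term = 1.0 - (1.0 - q ** k) ** n
        if k > 0 and term < tol:
            return total
        total += term
        k += 1

q = math.exp(-1.0)
for n in (10, 40, 80):
    print(n, em_alternating(n, q), em_tail(n, q))
```

By $n = 80$ the alternating form has lost essentially all accuracy in double precision, while the tail series remains stable.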

2. The connection between $E(M_n)$ and $E(M^*_n)$

If $x$ is any variable or number, $\lfloor x\rfloor$ denotes the greatest integer less than or equal to $x$ and $\{x\} = x - \lfloor x\rfloor$ denotes the fractional part of $x$. We also say that $\{x\} = x \bmod 1$. If $X$ is exponential with parameter $\lambda$, then for $k$ a non-negative integer and $0\le x<1$, we have
$$P(\lfloor X\rfloor = k, \{X\}\le x) = P(k\le X<k+x) = e^{-\lambda k} - e^{-\lambda(k+x)} = e^{-\lambda k}(1-e^{-\lambda x}).$$
It follows that $\lfloor X\rfloor$ and $\{X\}$ are independent with $P(\lfloor X\rfloor = k) = e^{-\lambda k}(1-e^{-\lambda})$ and $P(\{X\}\le x) = (1-e^{-\lambda x})(1-e^{-\lambda})^{-1}$ for $k=0,1,2,\ldots$ and $0\le x<1$. We thus have $\lfloor X\rfloor = X^* - 1$, where $X^*$ is geometric with $q=e^{-\lambda}$.


Also
$$M_n = \max(X_1,\ldots,X_n) = \max(\lfloor X_1\rfloor,\ldots,\lfloor X_n\rfloor) + \{M_n\} = \max(X^*_1-1,\ldots,X^*_n-1) + \{M_n\} = M^*_n - 1 + \{M_n\},$$
where the $X^*_i$ are IID geometric random variables. Thus
$$E(M^*_n) = E(M_n) + 1 - E(\{M_n\}) = \sum_{k=1}^{n}\frac{1}{\lambda k} + 1 - E(\{M_n\}). \qquad (5)$$

This gives a probabilistic proof of an improved version of (4). It follows from (5) that accurate estimates of $E(M^*_n)$ can be found by analyzing $E(\{M_n\})$, the expected value of the fractional part of the maximum of $n$ IID exponential random variables.
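The identity $M_n = M^*_n - 1 + \{M_n\}$ behind (5) holds pathwise for the coupled samples and is easy to check by simulation (a sketch; the seed and parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
lam, n, reps = 2.0, 50, 100_000

x = rng.exponential(scale=1 / lam, size=(reps, n))
m = x.max(axis=1)                        # M_n
m_star = np.floor(x).max(axis=1) + 1.0   # M*_n: maxima of the coupled geometric sample
frac = m - np.floor(m)                   # {M_n}

assert np.allclose(m, m_star - 1.0 + frac)   # M_n = M*_n - 1 + {M_n} pathwise

harmonic = sum(1.0 / (lam * k) for k in range(1, n + 1))
print("E(M*_n) (Monte Carlo)        :", m_star.mean())
print("sum 1/(lam k) + 1 - E({M_n}) :", harmonic + 1.0 - frac.mean())
```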

3. The asymptotic distribution of $\{M_n\}$

The asymptotic distribution of the fractional part of $M_n$ received some attention in the 1990s. Jagers (1990) solved a problem of Steutel by proving that the limiting distribution does not exist. Jagers proved this using fairly complicated analysis. He remarked that the proof is much easier for very large values of the parameter $\lambda$. In this section we give a simple proof of this result using characteristic functions and sequences. We then use these same characteristic sequences to study the asymptotic expectation.

The proof is based on the well-known and easily proved result that $M_n - \log n/\lambda$ converges in distribution to a random variable $W$ with cumulative distribution function $\exp(-e^{-\lambda w})$ and characteristic function $\Gamma(1-it/\lambda)$ (e.g. Ewens and Grant, 2005, p. 108). The decomposition in (2) implies that $M_n$ has characteristic function $\varphi_n(t) = \prod_{k=1}^{n}\bigl(1-\frac{it}{\lambda k}\bigr)^{-1}$ and $M_n - \log n/\lambda$ has characteristic function $\varphi_n(t)\exp(-it\log n/\lambda)$. It is worth noting here that for geometric random variables not only does $M^*_n - \log n/\lambda$ not converge in distribution, but there do not exist sequences $a_n$ and $b_n$ such that $(M^*_n - a_n)/b_n$ converges in distribution to a non-degenerate random variable (see Arnold et al., 1992, p. 217).

Theorem 1. The limiting distribution of $\{M_n\}$ does not exist.

Proof. Let $\varphi_n(t)$ be the characteristic function of $M_n$. It follows that
$$\varphi_n(2\pi m) = E\bigl(e^{2\pi i m(\lfloor M_n\rfloor + \{M_n\})}\bigr) = E\bigl(e^{2\pi i m\{M_n\}}\bigr).$$
That is, $\varphi_n(2\pi m)$ is the characteristic sequence of $\{M_n\}$, where $m$ is an arbitrary integer. Since $0\le\{M_n\}<1$, this sequence determines its distribution. Furthermore, $\{M_n\}$ converges in distribution if and only if $\varphi_n(2\pi m)$ converges for every $m$.

We know that $M_n - \log n/\lambda$ converges to $W$ with characteristic function $\Gamma(1-it/\lambda)$, which is never 0. Thus for each $m$, $\varphi_n(2\pi m)e^{-2\pi m i\log n/\lambda}$ converges to a non-zero value. If $\varphi_n(2\pi m)$ converged, it would have to converge to a non-zero value as well, since $|e^{2\pi m i\log n/\lambda}| = 1$. This would imply that $e^{2\pi m i\log n/\lambda}$ converges, but for $m\neq 0$ it clearly does not. Therefore $\{M_n\}$ does not converge in distribution. $\square$
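The obstruction in the proof can be watched directly (a sketch; $\lambda = 20$ is chosen large so that $|\varphi_n(2\pi)|$ stays appreciable and the rotation is easy to see). Since $\varphi_n(2\pi m)e^{-2\pi i m\log n/\lambda}$ converges, $\varphi_n(2\pi m)$ behaves like a convergent factor times $e^{2\pi i m\log n/\lambda}$ and therefore keeps rotating:

```python
import cmath
import math

def phi(n, lam, m=1):
    """Characteristic sequence of {M_n}: prod_{k=1}^n (1 - 2*pi*i*m/(lam*k))^{-1}."""
    z = complex(1.0, 0.0)
    for k in range(1, n + 1):
        z /= complex(1.0, -2.0 * math.pi * m / (lam * k))
    return z

lam = 20.0
for n in (10, 100, 1000, 10_000, 100_000, 1_000_000):
    val = phi(n, lam)
    print(f"n={n:>8}  |phi_n(2*pi)|={abs(val):.3f}  arg/(2*pi)={cmath.phase(val) / (2 * math.pi):+.3f}")
```

The modulus settles down while the argument keeps drifting, so the sequence has no limit.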

A direct analysis of $\varphi_n(2\pi m)$ would also show that it does not converge for $m\neq 0$. Jagers notes that
$$\lim_{\lambda\to\infty}\mathrm{Var}(M_n) \le \lim_{\lambda\to\infty}\sum_{k=1}^{\infty}\frac{1}{\lambda^2k^2} = 0,$$
while $E(M_n) = \sum_{k=1}^{n}1/(\lambda k)$. It follows that as $\lambda\to\infty$, for large $n$ the distribution of $\{M_n\}$ is concentrated about $\bigl(\sum_{k=1}^{n}1/(\lambda k)\bigr)\bmod 1$. As $n$ increases, these values are dense in $[0,1]$, so $\{M_n\}$ cannot possibly converge in distribution. It follows that for large enough $\lambda$, $E(\{M_n\})$ will vary almost all the way from 0 to 1 as $n\to\infty$. Combined with (5), we see that for large $\lambda$, $E(M^*_n) - \sum_{k=1}^{n}1/(\lambda k)$ will not stay close to 1/2 as $n\to\infty$.

We now consider the asymptotic behavior of $E(\{M_n\})$. To do that, we find a useful formula for the density of $\{M_n\}$. Note that $\varphi_n(2\pi m) = \prod_{k=1}^{n}\bigl(1-\frac{2\pi m i}{\lambda k}\bigr)^{-1}$ and thus $|\varphi_n(2\pi m)|^2 = \prod_{k=1}^{n}\bigl(1+\frac{4\pi^2m^2}{\lambda^2k^2}\bigr)^{-1}$, which decreases as $n$ increases and as $\lambda$ decreases. Moreover,
$$\lim_{n\to\infty}\prod_{k=1}^{n}\left(1+\frac{4\pi^2m^2}{\lambda^2k^2}\right)^{-1} = \left|\Gamma\!\left(1-\frac{2\pi m i}{\lambda}\right)\right|^2 = \frac{2\pi^2m}{\lambda\sinh(2\pi^2m/\lambda)}.$$
The last equality follows from standard identities (see Silverman, 1967, pp. 310–320).

Theorem 2. The density of $\{M_n\}$ is given by
$$g_n(x) = 1 + \sum_{m\neq 0}\prod_{k=1}^{n}\left(1+\frac{2\pi i m}{\lambda k}\right)^{-1}e^{2\pi i m x}\quad\text{for } 0\le x\le 1.$$

Proof. This follows by noting that $\varphi_n(-2\pi m) = \int_0^1 g_n(x)e^{-2\pi i m x}\,dx$ is the $m$th Fourier coefficient of $g_n(x)$. The expression for $g_n(x)$ is merely its Fourier series expansion. To check the convergence we note that $\sum_m|\varphi_n(2\pi m)|^2 \le \sum_m|\varphi_1(2\pi m)|^2 = \sum_m\bigl(1+4\pi^2m^2/\lambda^2\bigr)^{-1} < \infty$. $\square$

An interesting question is how close $g_n(x)$ is to the uniform density for large $n$. Although we can write $g_n(x) = \sum_{t=0}^{\infty}f_n(x+t)$, where $f_n(x) = n\lambda e^{-\lambda x}(1-e^{-\lambda x})^{n-1}$ is the density of $M_n$, this expression is not very useful for asymptotic analysis. For the asymptotic analysis, we invoke dominated convergence, with $|\varphi_n(2\pi m)|^2\le|\varphi_1(2\pi m)|^2$, to show
$$\lim_{n\to\infty}\sum_{m\neq 0}|\varphi_n(2\pi m)|^2 = \sum_{m\neq 0}\frac{2\pi^2m}{\lambda\sinh(2\pi^2m/\lambda)}. \qquad (6)$$

This gives the corollary.

Corollary 1. If $g_n(x)$ is the density of $\{M_n\}$, then
$$\int_0^1\bigl(1-g_n(x)\bigr)^2dx = 2\sum_{m=1}^{\infty}\prod_{k=1}^{n}\left(1+\frac{4\pi^2m^2}{\lambda^2k^2}\right)^{-1}$$
and
$$\lim_{n\to\infty}\int_0^1\bigl|1-g_n(x)\bigr|^2dx = \sum_{m\neq 0}\frac{2\pi^2m}{\lambda\sinh(2\pi^2m/\lambda)}.$$

Now, $\lim_{\lambda\to\infty}2\pi^2m/\bigl(\lambda\sinh(2\pi^2m/\lambda)\bigr) = 1$, so the sum on the right side of (6) can be large for large values of $\lambda$. However, for reasonable values of $\lambda$, say $\lambda<2$, things are different.

Let $c = 2\pi^2/\lambda$ in (6). Then
$$\lim_{n\to\infty}\sum_{m\neq 0}|\varphi_n(2\pi m)|^2 = \sum_{m\neq 0}\frac{cm}{\sinh(cm)} = \sum_{m=1}^{\infty}\frac{4cme^{-cm}}{1-e^{-2cm}}.$$

It easily follows that
$$4ce^{-c}(1-e^{-c})^{-2} < \sum_{m=1}^{\infty}\frac{4cme^{-cm}}{1-e^{-2cm}} < 4ce^{-c}(1-e^{-2c})^{-1}(1-e^{-c})^{-2}$$
and the bounds are relatively tight for, say, $c>2$ or $\lambda<10$, since the ratio of the lower bound to the upper bound is $1-e^{-2c}$. In particular, we can see from the upper bound that for $\lambda=2$, or $c=\pi^2$, $\lim_{n\to\infty}\sum_{m\neq 0}|\varphi_n(2\pi m)|^2 < .0021$, which is very small. Using Corollary 1, we find the results summarized in Table 1.

We note that $\int_0^1|g_n(x)-1|^2dx$ is always a decreasing function of $n$ and an increasing function of $\lambda$. The most interesting cases in applications are those where $q$ is large and $p$ is very small, so the table shows that in this case, even for $n\ge 10$ and $q\ge .368$, we have $\int_0^1|g_n(x)-1|^2dx < .0000074$. On the other hand, we see that for $q < 2\times 10^{-9}$, no matter how large the sample size $n$ is, we have $\int_0^1|g_n(x)-1|^2dx > 4$. This confirms the earlier observation that for large $\lambda$, the distribution of $\{M_n\}$ is not close to uniform for large $n$.


Table 1. $\int_0^1 (1-g_n(x))^2\,dx$

$\lambda$    $q=e^{-\lambda}$      $n=10$        $n\to\infty$
1            .368                  .0000074      .000000211
2            .135                  .00516        .00204
4            .018                  .185          .144
20           $2\times 10^{-9}$     4.26          4.00
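The table entries can be checked directly from Corollary 1 (a sketch; the truncation at $m\le 50$ is an arbitrary cutoff, far more than needed since the terms decay very rapidly, and $x/\sinh(x)$ is rewritten as $2xe^{-x}/(1-e^{-2x})$ to avoid overflow for large arguments):

```python
import math

def l2_dist_sq(n, lam, mmax=50):
    """int_0^1 (1 - g_n(x))^2 dx = 2 * sum_{m>=1} prod_{k=1}^n (1 + 4 pi^2 m^2/(lam^2 k^2))^{-1}."""
    total = 0.0
    for m in range(1, mmax + 1):
        prod = 1.0
        for k in range(1, n + 1):
            prod /= 1.0 + (2.0 * math.pi * m / (lam * k)) ** 2
        total += prod
    return 2.0 * total

def l2_dist_sq_limit(lam, mmax=50):
    """Limit as n -> infinity: 2 * sum_{m>=1} x/sinh(x), where x = 2 pi^2 m / lam."""
    total = 0.0
    for m in range(1, mmax + 1):
        x = 2.0 * math.pi ** 2 * m / lam
        total += x * 2.0 * math.exp(-x) / (1.0 - math.exp(-2.0 * x))   # = x / sinh(x)
    return 2.0 * total

for lam in (1.0, 2.0, 4.0, 20.0):
    print(f"lambda={lam:>4}:  n=10 -> {l2_dist_sq(10, lam):.3g}   n->inf -> {l2_dist_sq_limit(lam):.3g}")
```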


We can use these formulas to estimate $E(\{M_n\})$. We have
$$\left|E(\{M_n\})-\frac12\right|^2 = \left|\int_0^1\left(x-\frac12\right)\bigl(g_n(x)-1\bigr)dx\right|^2 < \int_0^1\left(x-\frac12\right)^2dx\int_0^1\bigl(g_n(x)-1\bigr)^2dx = \frac{1}{12}\int_0^1\bigl(g_n(x)-1\bigr)^2dx.$$

It then follows, for example, from the table that for $\lambda\le 2$ ($q\ge .135$) and $n\ge 10$ we have $|E(\{M_n\})-1/2| < .021$, and for $\lambda\le 1$ ($q\ge .368$) and $n\ge 10$ we have $|E(\{M_n\})-1/2| < .0008$. Thus from (5), we have that for $q\ge .368$ and $n\ge 10$,
$$\left|E(M^*_n)-\left(\sum_{k=1}^{n}\frac{1}{\lambda k}+\frac12\right)\right| < .0008.$$

Brands et al. (1994) derive bounds on $\bigl|\int_0^t g_n(x)\,dx - t\bigr|$ by different methods. In terms of the results above we can say
$$\left|\int_0^t g_n(x)\,dx - t\right| = \left|\int_0^1 1_{[0,t]}(x)\bigl(g_n(x)-1\bigr)dx\right| < \sqrt{t}\left(\int_0^1\bigl(g_n(x)-1\bigr)^2dx\right)^{1/2}.$$

We can also use the Fourier series to get an exact expression for $E(\{M_n\})$. We have that $\int_0^1 xe^{-2\pi imx}\,dx = (-2\pi mi)^{-1}$ for $m\neq 0$ and $\int_0^1 x\,dx = 1/2$. Thus
$$E(\{M_n\}) = \int_0^1 xg_n(x)\,dx = \frac12 + \sum_{m\neq 0}\frac{1}{2\pi mi}\,\varphi_n(-2\pi m).$$

Substituting this into (5) gives us the following corollary.

Corollary 2.
$$E(M^*_n) = \sum_{k=1}^{n}\frac{1}{\lambda k} + \frac12 - \sum_{m\neq 0}\frac{1}{2\pi mi}\prod_{k=1}^{n}\left(1+\frac{2\pi mi}{k\lambda}\right)^{-1}.$$
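Corollary 2 is easy to evaluate and to compare with the exact tail-probability series (a sketch; the truncation of the Fourier sum at $|m|\le 200$ is an ad hoc cutoff):

```python
import math

def fourier_correction(n, lam, mmax=200):
    """sum over 0 < |m| <= mmax of (1/(2*pi*m*i)) * prod_{k=1}^n (1 + 2*pi*m*i/(k*lam))^{-1}."""
    total = 0.0 + 0.0j
    for m in range(-mmax, mmax + 1):
        if m == 0:
            continue
        prod = 1.0 + 0.0j
        for k in range(1, n + 1):
            prod /= 1.0 + 2.0j * math.pi * m / (k * lam)
        total += prod / (2.0j * math.pi * m)
    return total.real            # the imaginary parts cancel between m and -m

def em_tail(n, q, tol=1e-15):
    """Exact E(M*_n) from the tail probabilities."""
    total, k = 0.0, 0
    while True:
        term = 1.0 - (1.0 - q ** k) ** n
        if k > 0 and term < tol:
            return total
        total += term
        k += 1

lam, n = 1.0, 50
harmonic = sum(1.0 / (lam * k) for k in range(1, n + 1))
print("Corollary 2:", harmonic + 0.5 - fourier_correction(n, lam))
print("tail sum   :", em_tail(n, math.exp(-lam)))
```

The two values agree, and for $\lambda = 1$ the correction term is well below $10^{-3}$, so $E(M^*_n)$ sits very close to $\sum_{k=1}^{n}1/(\lambda k)+1/2$, as claimed.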

The main interest of this formula is that it gives an exact version of the approximation (1). To see the connection, note that
$$\lim_{n\to\infty}\exp\left(\frac{2\pi im\log n}{\lambda}\right)\prod_{k=1}^{n}\left(1+\frac{2\pi im}{\lambda k}\right)^{-1} = \Gamma\!\left(1+\frac{2\pi im}{\lambda}\right).$$

Moreover,
$$\Gamma\!\left(1+\frac{2\pi im}{\lambda}\right) = \frac{2\pi im}{\lambda}\,\Gamma\!\left(\frac{2\pi im}{\lambda}\right).$$


So
$$\frac{1}{2\pi mi}\prod_{k=1}^{n}\left(1+\frac{2\pi mi}{k\lambda}\right)^{-1} \approx \frac{1}{\lambda}\,\Gamma\!\left(\frac{2\pi im}{\lambda}\right)\exp\left(-\frac{2\pi im\log n}{\lambda}\right).$$

4. The limit of the logarithmic means of $\{M_n\}$

In this section we show that although the sequence $\{M_n\}$ does not converge in distribution,
$$\lim_{n\to\infty}\frac{1}{\log n}\sum_{k=1}^{n}\frac{P(\{M_k\}\le x)}{k} = x.$$

Such convergence is called logarithmic summability. It follows that $E(\{M_n\})$ and $E(M^*_n) - \sum_{k=1}^{n}1/(\lambda k)$ are each logarithmically summable to 1/2, and this is true even for large values of $\lambda$, where the distributions of $\{M_n\}$ are concentrated around values all the way from 0 to 1 as $n\to\infty$. Although we have not proved that $E(\{M_n\})$ does not converge for small $\lambda$, it seems highly unlikely that it converges, given the formulas we have derived.

Let $K_n$ equal the number of $\lfloor X_i\rfloor$ equal to $\max(\lfloor X_1\rfloor,\ldots,\lfloor X_n\rfloor)$. Then, using the fact that $\lfloor X_i\rfloor$ and $\{X_i\}$ are independent, it is easily seen that $M_n = \max(X_1,\ldots,X_n) = \lfloor M_n\rfloor + \{M_n\}$, where $\lfloor M_n\rfloor = \max(\lfloor X_1\rfloor,\ldots,\lfloor X_n\rfloor)$ and $\{M_n\} = W_{K_n}$, where $W_{K_n}$ is the maximum of $K_n$ IID random variables equal in distribution to $\{X_i\}$. For example, if $X_1 = 4.5$, $X_2 = 1.8$, and $X_3 = 4.2$, then $M_3 = 4.5$, $\lfloor M_3\rfloor = \max(\lfloor X_1\rfloor,\lfloor X_2\rfloor,\lfloor X_3\rfloor) = \max(4,1,4) = 4$, $K_3 = 2$, and $\{M_3\} = W_2 = \max(\{X_1\},\{X_3\}) = \max(.5,.2) = .5$. We therefore have
$$P(\{M_n\}\le x) = P(W_{K_n}\le x) = \sum_{k=1}^{\infty}P(K_n=k)F^k(x), \qquad (7)$$
where $F$ is the common distribution function of the $\{X_i\}$.

This result is also used in Brands et al. (1994). It is convenient to allow the sum to go to infinity even though $P(K_n = k) = 0$ for $k > n$.

We recall that $\lfloor X_i\rfloor + 1$ is geometric with $q = e^{-\lambda}$. It then follows from (7) and Theorem 1 that for IID geometric random variables, the number of variables taking the maximum value does not converge in distribution. (If it did, $\{M_n\}$ would have to converge in distribution, and it does not.) Eisenberg et al. (1993) used a much more complicated argument to prove this surprising result. However, it is shown in Baryshnikov et al. (1995), and reproved by a different method in Olofsson (1999), that
$$\lim_{N\to\infty}\frac{1}{\log N}\sum_{n=1}^{N}\frac{P(K_n=k)}{n} = \frac{(1-q)^k}{k|\log q|}\quad\text{for } k=1,2,\ldots. \qquad (8)$$

This leads to the following theorem.

Theorem 3.
$$\lim_{N\to\infty}\frac{1}{\log N}\sum_{n=1}^{N}\frac{P(\{M_n\}\le x)}{n} = x \quad\text{for } 0\le x\le 1.$$

Proof. From (7) we have
$$\frac{1}{\log N}\sum_{n=1}^{N}\frac{P(\{M_n\}\le x)}{n} = \frac{1}{\log N}\sum_{n=1}^{N}\sum_{k=1}^{\infty}\frac{P(K_n=k)}{n}F^k(x) = \sum_{k=1}^{\infty}\left(\frac{1}{\log N}\sum_{n=1}^{N}\frac{P(K_n=k)}{n}\right)F^k(x).$$


Thus using (8) and dominated convergence from the summability of $F^k(x)$, we have
$$\lim_{N\to\infty}\frac{1}{\log N}\sum_{n=1}^{N}\frac{P(\{M_n\}\le x)}{n} = \sum_{k=1}^{\infty}\frac{(1-e^{-\lambda})^k}{k|\log(e^{-\lambda})|}\left(\frac{1-e^{-\lambda x}}{1-e^{-\lambda}}\right)^k.$$
The right side immediately simplifies to
$$\sum_{k=1}^{\infty}\frac{(1-e^{-\lambda x})^k}{k\lambda} = \frac{-\log(e^{-\lambda x})}{\lambda} = x. \quad\square$$

Corollary 3.
$$\lim_{N\to\infty}\frac{1}{\log N}\sum_{n=1}^{N}\frac{E(\{M_n\})}{n} = \frac12.$$

Proof.
$$\left|\frac{1}{\log N}\sum_{n=1}^{N}\frac{1-P(\{M_n\}\le x)}{n}\right| < 3$$
for all $N>1$. Therefore from Theorem 3 and dominated convergence
$$\frac{1}{\log N}\sum_{n=1}^{N}\frac{E(\{M_n\})}{n} = \frac{1}{\log N}\sum_{n=1}^{N}\frac{\int_0^1\bigl(1-P(\{M_n\}\le x)\bigr)dx}{n} = \int_0^1\frac{1}{\log N}\sum_{n=1}^{N}\frac{1-P(\{M_n\}\le x)}{n}\,dx \to \int_0^1(1-x)\,dx = \frac12. \quad\square$$

Corollary 3 plus (5) gives us the following corollary.

Corollary 4.
$$\lim_{N\to\infty}\frac{1}{\log N}\sum_{n=1}^{N}\frac{E(M^*_n)-\sum_{k=1}^{n}\frac{1}{\lambda k}}{n} = \frac12.$$
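Corollary 4 can be illustrated numerically, though only roughly, because logarithmic averaging converges very slowly (a sketch; $\lambda = 6$ is chosen large enough that the raw differences oscillate visibly over this range of $n$, and the cap $N = 10^5$ is arbitrary). The raw differences $E(M^*_n)-\sum_{k=1}^{n}1/(\lambda k)$ swing around 1/2, while their logarithmic means vary much less; at these values of $N$ the means are still biased above 1/2 by terms of order $1/\log N$:

```python
import math

def em_tail(n, q, tol=1e-15):
    """E(M*_n) = sum_{k>=0} (1 - (1 - q^k)^n)."""
    total, k = 0.0, 0
    while True:
        term = 1.0 - (1.0 - q ** k) ** n
        if k > 0 and term < tol:
            return total
        total += term
        k += 1

lam = 6.0
q = math.exp(-lam)
N = 100_000

harmonic = 0.0
weighted = 0.0                     # running sum of diff_n / n
for n in range(1, N + 1):
    harmonic += 1.0 / (lam * n)
    diff = em_tail(n, q) - harmonic
    weighted += diff / n
    if n in (100, 1000, 10_000, 100_000):
        print(f"N={n:>6}  diff_N={diff:.3f}  logarithmic mean={weighted / math.log(n):.3f}")
```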

5. A puzzling phenomenon

We next consider the sequence of new record values. Call the $n$th new record value $R_n$. Then $R_{n+1}$ is the value of the first exponential random variable after the time of the $n$th new record taking a greater value than $R_n$. It follows from the lack of memory property of exponential random variables that $R_{n+1}$ is equal in distribution to $R_n + X$, where $X$ is an exponential random variable independent of $R_n$. Thus $R_n$ is equal in distribution to $X_1 + \cdots + X_n$, where the $X_i$'s are IID exponential random variables with parameter $\lambda$ (e.g. Arnold et al., 1992, p. 243). $R_n$ therefore has an Erlang$(n,\lambda)$ distribution and characteristic function $\psi_n(t) = (1-it/\lambda)^{-n}$. $\{R_n\}$ has characteristic sequence $\psi_n(2\pi m)$, which converges to 0 as $n\to\infty$ for $m\neq 0$. It follows that $\{R_n\}$ converges in distribution to a uniform random variable on $[0,1]$.
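A small simulation contrasts the two fractional parts at a large $\lambda$ (a sketch; only the marginal distributions are compared, and the parameter choices are arbitrary). $\{R_n\}$ is already very close to uniform here, while $\{M_n\}$ is far from it:

```python
import numpy as np

rng = np.random.default_rng(7)
lam, n, reps = 20.0, 200, 20_000

x = rng.exponential(scale=1 / lam, size=(reps, n))

frac_r = x.sum(axis=1) % 1.0   # R_n is distributed as a sum of n exponentials (Erlang(n, lam))
frac_m = x.max(axis=1) % 1.0   # M_n is the maximum of n exponentials

# A uniform fractional part has mean 1/2 and variance 1/12 ~ 0.083.
print(f"{{R_n}}: mean {frac_r.mean():.3f}  var {frac_r.var():.3f}")
print(f"{{M_n}}: mean {frac_m.mean():.3f}  var {frac_m.var():.3f}")
```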

This is a bit surprising after what we have seen for $\{M_n\}$. Let $N_n$ denote the number of new records in the first $n$ random variables. It then follows that $\{M_n\} = \{R_{N_n}\}$. That is, the maximum observation up to time $n$ is the value of the most recent new record. Now it is known that for large $n$, $N_n$ is of the order $\log n$ (see Arnold et al., 1992, p. 247), so at first glance one might think that $\{M_n\}$ has approximately the same asymptotic distribution as $\{R_{[\log n]}\}$, namely the uniform distribution. Evidently this is not the case. With some further analysis one can see where this reasoning goes wrong.

We have $P(R_{N_n}\le x\mid N_n=m) = P(M_n\le x\mid N_n=m)$. But clearly $M_n$ and $N_n$ are independent. The number of new records in $n$ observations depends on the order of the observations and not on their absolute magnitudes. So, for example, $P(M_2\le x\mid N_2=1) = P(X_1\le x\mid X_1>X_2)$ and $P(M_2\le x\mid N_2=2) = P(X_2\le x\mid X_2>X_1)$. These two probabilities are the same by symmetry, so we see that $M_2$ is independent of $N_2$. In this way we conclude that

$$P(M_n\le x) = \sum_{m=1}^{n}P(N_n=m)P(R_{N_n}\le x\mid N_n=m) = \sum_{m=1}^{n}P(N_n=m)P(R_{N_n}\le x) = P(R_{N_n}\le x).$$

Thus the relation $\{M_n\} = \{R_{N_n}\}$ appears to be of no help in relating the asymptotic distribution of $\{M_n\}$ to that of $\{R_n\}$. Still, there should be some connection, since both represent sequences of maximum values. A description of the relationship would be of interest.

References

Arnold, B.C., Balakrishnan, N., Nagaraja, H.N., 1992. A First Course in Order Statistics. Wiley, New York.

Baryshnikov, Y., Eisenberg, B., Stengle, G., 1995. A necessary and sufficient condition for the existence of the probability of a tie for first place. Statist. Probab. Lett. 23, 203–209.

Brands, J.J.A.M., Steutel, F.W., Wilms, R.J.G., 1994. On the number of maxima in a discrete sample. Statist. Probab. Lett. 20, 209–217.

Eisenberg, B., Stengle, G., Strang, G., 1993. The asymptotic probability of a tie for first place. Ann. Appl. Probab. 3, 731–745.

Ewens, W.J., Grant, G.R., 2005. Statistical Methods in Bioinformatics: An Introduction, second ed. Springer, New York.

Jagers, A.A., 1990. Solution of problem 247 of F.W. Steutel. Statist. Neerlandica 44, 180.

Jeske, D.R., Blessinger, T., 2004. Tunable approximations for the mean and variance of the maximum of heterogeneous geometrically distributed random variables. Amer. Statist. 58, 322–327.

Olofsson, P., 1999. A Poisson approximation with applications to the number of maxima in a discrete sample. Statist. Probab. Lett. 44, 23–27.

Silverman, R.A., 1967. Introductory Complex Analysis. Prentice-Hall, Englewood Cliffs, NJ.

Szpankowski, W., Rego, V., 1990. Yet another application of binomial recurrence: order statistics. Computing 43, 401–410.