

AN ASYMPTOTICALLY-EXACT EXPRESSION FOR THE VARIANCE OF CLASSIFICATION ERROR FOR THE DISCRETE HISTOGRAM RULE

Ulisses Braga-Neto

Department of Electrical and Computer Engineering
Texas A&M University

[email protected]

ABSTRACT

Discrete classification is fundamental in GSP applications. In a previous publication, we provided analytical expressions for moments of the sampling distribution of the true error, as well as of the resubstitution and leave-one-out error estimators, and their correlation with the true error, for the discrete histogram rule. When the number of samples or the total number of quantization levels is large, computation of these expressions becomes difficult, and approximations must be made. In this paper, we provide an approximate expression for the variance of the classification error, which is shown to be asymptotically exact as the total number of quantization levels increases to infinity, under a mild distributional assumption.

1. INTRODUCTION

Discrete classification is fundamental in GSP applications, in particular in methods for discrete gene expression prediction [2] and the inference of discrete gene expression regulatory networks [3]. The most common discrete classification rule is perhaps the discrete histogram rule. In [1], we found for this rule analytical expressions, valid for any sample size, for the first and second moments of the sampling distribution of the true error, as well as of the resubstitution and leave-one-out error estimators, and their correlation with the true error. This made it possible to compute exactly the bias, variance, and RMS of those estimators. When the number of samples or the total number of quantization levels is large, application of the formulas in [1] becomes problematic due to the high computational complexity involved. The problem is particularly difficult in the case of second moments (needed to determine variance and RMS), which require the computation of the second-order joint probabilities of the cell counts over pairs of cells. In this paper, we present an approximation approach that reduces the computation of second-order probabilities over pairs of cells to first-order probabilities over single cells, which dramatically speeds up computation. We illustrate this approach by deriving an approximate formula for the variance of the classification error. We prove that this approximation is asymptotically exact as the total number of quantization levels increases to infinity, under a mild distributional assumption.

2. DISCRETE HISTOGRAM RULE

Let $X_1, X_2, \ldots, X_d$ be a set of quantized predictor random variables such that each $X_i$ is quantized into a finite number $b_i$ of values, and let $Y$ be a target random variable taking values in $\{0, 1, \ldots, c-1\}$ (we will assume $c = 2$). Since the predictors as a group take on values in a finite space of $b = \prod_{i=1}^{d} b_i$ possible states, and a bijection can be established between this finite state-space and the sequence of integers $1, \ldots, b$, we may alternatively and equivalently assume, without loss of generality, a single predictor variable $X$ taking on values in the set $\{1, \ldots, b\}$. The value $b$ is the total number of quantization levels, or the number of "bins," into which the data is categorized; this parameter provides a direct measure of the complexity of discrete classification.

The discrete classification model is completely specified by the class prior probabilities $c_0 = P(Y = 0)$ and $c_1 = P(Y = 1)$, and the class-conditional probabilities $p_i = P(X = i \mid Y = 0)$ and $q_i = P(X = i \mid Y = 1)$, for $i = 1, \ldots, b$, where $c_0 + c_1 = 1$, $\sum_{i=1}^{b} p_i = 1$, and $\sum_{i=1}^{b} q_i = 1$. Let $S_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ be an i.i.d. sample taken from this probability model, and define the random variables

$$
U_i = \#\{X_j = i \mid Y_j = 0\}, \quad i = 1, \ldots, b,
$$
$$
V_i = \#\{X_j = i \mid Y_j = 1\}, \quad i = 1, \ldots, b, \tag{1}
$$

where $N_0 = \sum_{i=1}^{b} U_i$ and $N_1 = \sum_{i=1}^{b} V_i$ are the total numbers of samples in classes 0 and 1, respectively, with $N_0 + N_1 = n$. The variables $(U_1, \ldots, U_b, V_1, \ldots, V_b)$ are jointly multinomially distributed with parameters $(n, c_0 p_1, \ldots, c_0 p_b, c_1 q_1, \ldots, c_1 q_b)$; hence, $U_i \sim \text{Binomial}(n, c_0 p_i)$ and $V_i \sim \text{Binomial}(n, c_1 q_i)$, for $i = 1, \ldots, b$. Note also that $N_i \sim \text{Binomial}(n, c_i)$, for $i = 0, 1$.

Given observed values $U_i = u_i$ and $V_i = v_i$, for $i = 1, \ldots, b$, the discrete histogram classification rule produces the discrete classifier given by

$$
\psi_n(i) = I_{u_i < v_i} =
\begin{cases}
1, & u_i < v_i \\
0, & \text{otherwise}
\end{cases}, \qquad i = 1, \ldots, b. \tag{2}
$$
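For concreteness, here is a minimal Python sketch of the rule in (1)-(2). All numeric values below (the number of bins, sample size, priors, and the bin probability vectors p and q) are hypothetical choices for illustration; they are not taken from the paper.

```python
# A minimal sketch of the discrete histogram rule of Eqs. (1)-(2),
# under an illustrative (hypothetical) discrete classification model.
import numpy as np

rng = np.random.default_rng(0)

b = 8                          # number of bins (hypothetical)
n = 60                         # sample size (hypothetical)
c0, c1 = 0.5, 0.5              # class priors
p = np.full(b, 1.0 / b)        # P(X = i | Y = 0): uniform, for illustration
q = np.arange(1.0, b + 1.0)    # P(X = i | Y = 1): a tilted distribution
q /= q.sum()

# The 2b cell counts (U_1,...,U_b, V_1,...,V_b) are jointly multinomial
# with parameters (n, c0*p_1, ..., c0*p_b, c1*q_1, ..., c1*q_b).
counts = rng.multinomial(n, np.concatenate([c0 * p, c1 * q]))
U, V = counts[:b], counts[b:]

# Eq. (2): the histogram classifier labels bin i as class 1 iff u_i < v_i.
psi = (U < V).astype(int)
print("U     =", U)
print("V     =", V)
print("psi_n =", psi)
```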


with classification error given by

$$
\varepsilon_n = \sum_{i=1}^{b} \left[\, c_0 p_i \, I_{\psi_n(i)=1} + c_1 q_i \, I_{\psi_n(i)=0} \,\right]
= c_1 + \sum_{i=1}^{b} (c_0 p_i - c_1 q_i) \, I_{U_i < V_i}. \tag{3}
$$

The bin counts $U_i, V_i$ are therefore sufficient statistics for the classification error of the discrete histogram rule.
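To make this concrete, the following self-contained sketch (using the same illustrative, hypothetical model as above) verifies numerically that the two forms of the true error in (3) agree:

```python
# Numerical check of the two equivalent forms of the true error in Eq. (3).
# Model values are the same illustrative (hypothetical) ones used above.
import numpy as np

rng = np.random.default_rng(1)
b, n = 8, 60
c0, c1 = 0.5, 0.5
p = np.full(b, 1.0 / b)
q = np.arange(1.0, b + 1.0)
q /= q.sum()

counts = rng.multinomial(n, np.concatenate([c0 * p, c1 * q]))
U, V = counts[:b], counts[b:]
psi = (U < V).astype(int)

# First form: c0*p_i contributes where the bin is labeled 1, c1*q_i where 0.
eps_direct = np.sum(c0 * p * (psi == 1) + c1 * q * (psi == 0))
# Reduced form: c1 plus the signed mass of the bins labeled 1.
eps_reduced = c1 + np.sum((c0 * p - c1 * q) * (U < V))
assert np.isclose(eps_direct, eps_reduced)
print("true error:", eps_direct)
```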

3. STATISTICS OF THE CLASSIFICATION ERROR

The expected error over the sample is given by

$$
E[\varepsilon_n] = c_1 + \sum_{i=1}^{b} (c_0 p_i - c_1 q_i) \, E[I_{U_i < V_i}]
= c_1 + \sum_{i=1}^{b} (c_0 p_i - c_1 q_i) \, P(U_i < V_i), \tag{4}
$$

where

$$
P(U_i < V_i) = \sum_{k < l} P(U_i = k, V_i = l), \tag{5}
$$

with

$$
P(U_i = k, V_i = l) = \binom{n}{k,\, l,\, n-k-l} (c_0 p_i)^k (c_1 q_i)^l (1 - c_0 p_i - c_1 q_i)^{n-k-l} \tag{6}
$$

for $k, l = 0, \ldots, n$ such that $k + l \leq n$, with $P(U_i = k, V_i = l) = 0$ otherwise.
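As a sketch of how (4)-(6) can be evaluated in practice, the snippet below computes $P(U_i < V_i)$ by direct summation of the trinomial probabilities in (6), and then the expected error of (4); the model values are the same hypothetical ones used in the earlier sketches.

```python
# Exact expected error via Eqs. (4)-(6): per-bin trinomial probabilities only.
import numpy as np
from math import comb

def prob_U_less_V(n, a, c):
    """P(U_i < V_i) where (U_i, V_i, rest) ~ trinomial(n; a, c, 1-a-c),
    with a = c0*p_i and c = c1*q_i, summing Eq. (6) over k < l, k+l <= n."""
    total = 0.0
    for k in range(n + 1):
        for l in range(k + 1, n - k + 1):
            # Multinomial coefficient (n choose k, l, n-k-l).
            total += (comb(n, k) * comb(n - k, l)
                      * a**k * c**l * (1 - a - c)**(n - k - l))
    return total

b, n = 8, 60
c0, c1 = 0.5, 0.5
p = np.full(b, 1.0 / b)
q = np.arange(1.0, b + 1.0)
q /= q.sum()

# Eq. (4): E[eps_n] = c1 + sum_i (c0 p_i - c1 q_i) P(U_i < V_i).
P = np.array([prob_U_less_V(n, c0 * p[i], c1 * q[i]) for i in range(b)])
expected_error = c1 + np.sum((c0 * p - c1 * q) * P)
print("E[eps_n] =", expected_error)
```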

The variance of the classification error is given by

$$
\begin{aligned}
\text{Var}(\varepsilon_n) &= \text{Var}\!\left( \sum_{i=1}^{b} (c_0 p_i - c_1 q_i) \, I_{U_i < V_i} \right) \\
&= \sum_{i=1}^{b} (c_0 p_i - c_1 q_i)^2 \, \text{Var}(I_{U_i < V_i}) \\
&\quad + 2 \sum_{i < j} (c_0 p_i - c_1 q_i)(c_0 p_j - c_1 q_j) \, \text{Cov}(I_{U_i < V_i}, I_{U_j < V_j}). \tag{7}
\end{aligned}
$$

4. APPROXIMATION OF VARIANCE

Equation (7) can be computed analytically; however, it is very computer-intensive, because it involves the calculation of second-order probabilities of the kind $P(U_i = k, U_j = l, V_i = r, V_j = s)$ (see [1], where a slightly different development from the one here is given). But the covariance term in (7), and all second-order probabilities, disappear provided we can assume that the event $\{U_i < V_i\}$ is approximately independent of $\{U_j < V_j\}$. As we will see, this approximation is asymptotically exact in a large number of cases. Under the independence assumption, (7) reduces to a very simple expression, which involves only first-order probabilities as in (4) and (5):

$$
\begin{aligned}
\text{Var}[\varepsilon_n] &= \sum_{i=1}^{b} (c_0 p_i - c_1 q_i)^2 \, \text{Var}(I_{U_i < V_i}) \\
&= \sum_{i=1}^{b} (c_0 p_i - c_1 q_i)^2 \, P(U_i < V_i)\left(1 - P(U_i < V_i)\right). \tag{8}
\end{aligned}
$$
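Under the independence assumption, (8) is evaluated with the same first-order probabilities used for (4); a self-contained sketch, again under the hypothetical model of the earlier snippets:

```python
# Approximate variance of Eq. (8): only first-order probabilities
# P(U_i < V_i) enter. Model values are illustrative (hypothetical).
import numpy as np
from math import comb

def prob_U_less_V(n, a, c):
    """P(U_i < V_i) for (U_i, V_i, rest) ~ trinomial(n; a, c, 1-a-c)."""
    return sum(comb(n, k) * comb(n - k, l)
               * a**k * c**l * (1 - a - c)**(n - k - l)
               for k in range(n + 1) for l in range(k + 1, n - k + 1))

b, n = 8, 60
c0, c1 = 0.5, 0.5
p = np.full(b, 1.0 / b)
q = np.arange(1.0, b + 1.0)
q /= q.sum()

P = np.array([prob_U_less_V(n, c0 * p[i], c1 * q[i]) for i in range(b)])
coef = c0 * p - c1 * q
var_approx = np.sum(coef**2 * P * (1 - P))   # Eq. (8)
print("approx Var(eps_n) =", var_approx, " approx std =", np.sqrt(var_approx))
```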

Suppose that $X^{(k)}$, $k = 1, 2, \ldots$, is a sequence of predictors with numbers of quantization levels $b^{(k)}$ such that $b^{(1)} < b^{(2)} < \cdots$. Let $p_i^{(k)} = P(X^{(k)} = i \mid Y = 0)$ and $q_i^{(k)} = P(X^{(k)} = i \mid Y = 1)$, for $i = 1, \ldots, b^{(k)}$. We have the following result.

Theorem. Under the assumption

$$
\lim_{k \to \infty} p_i^{(k)} = \lim_{k \to \infty} q_i^{(k)} = 0, \quad \text{for } i = 1, \ldots, b^{(k)}, \tag{9}
$$

the expression for the variance of the classification error in (8) is asymptotically exact.

Proof. Define the variables $U_i^{(k)}$, $i = 1, \ldots, b^{(k)}$, as in (1). The distribution of $U_i^{(k)}$ is binomial with parameters $(n_0, p_i^{(k)})$, whereas the distribution of $U_i^{(k)}$ given $U_j^{(k)} = u$ is binomial with parameters $(n_0 - u, p_i^{(k)}/p)$, where $p = \sum_{m \neq j} p_m^{(k)}$. It is clear therefore that $U_i^{(k)}$ is independent of $U_j^{(k)}$ if and only if $p_i^{(k)} = 0$ or $p_j^{(k)} = 0$. By the same token, $V_i^{(k)}$ is independent of $V_j^{(k)}$ if and only if $q_i^{(k)} = 0$ or $q_j^{(k)} = 0$. Therefore, condition (9) guarantees that the event $\{U_i < V_i\}$ becomes independent of the event $\{U_j < V_j\}$ as $k \to \infty$. □

The assumption (9) is not too demanding. It is satisfied, for instance, when the sequence of predictors $X^{(k)}$ is obtained by continually partitioning the sampling intervals into equal-length subintervals, or by continually adding predicting variables, as long as these variables are conditionally independent from the current predictor given the target, and all probabilities are bounded away from 0 and 1. For example, if the sequence of predictors $X^{(k)}$ is obtained by continually adding binary predicting variables which are jointly uniform, then $p_i^{(k)}, p_j^{(k)} = O(2^{-k})$ as $k \to \infty$. Since $\text{Cov}[U_i^{(k)}, U_j^{(k)}] = -n p_i^{(k)} p_j^{(k)}$ (so independence and uncorrelatedness are equivalent in this case), it follows that $\text{Cov}[U_i^{(k)}, U_j^{(k)}] = O(2^{-2k})$ as $k \to \infty$, and the rate of convergence of the approximation to the exact variance is exponential (details to appear in upcoming publications).
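As a rough sanity check of the independence approximation (an illustrative experiment under the same hypothetical model as the earlier sketches, not one from the paper), one can simulate the sampling distribution of $\varepsilon_n$ via (3) and compare its empirical variance with (8):

```python
# Monte Carlo comparison of the empirical Var(eps_n) with the Eq. (8)
# approximation, under an illustrative (hypothetical) model.
import numpy as np

rng = np.random.default_rng(2)
b, n, reps = 8, 60, 100_000
c0, c1 = 0.5, 0.5
p = np.full(b, 1.0 / b)
q = np.arange(1.0, b + 1.0)
q /= q.sum()

probs = np.concatenate([c0 * p, c1 * q])
coef = c0 * p - c1 * q

counts = rng.multinomial(n, probs, size=reps)   # reps x 2b cell counts
label1 = counts[:, :b] < counts[:, b:]          # psi_n(i) = 1 where U_i < V_i
eps = c1 + (label1 * coef).sum(axis=1)          # Eq. (3), one value per replicate

P_hat = label1.mean(axis=0)                     # empirical P(U_i < V_i)
var_approx = np.sum(coef**2 * P_hat * (1 - P_hat))

print("Monte Carlo Var(eps_n):", eps.var())
print("Eq. (8) approximation :", var_approx)
```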

Figures 1 and 2 display the standard deviation of the classification error versus sample size and number of bins, respectively, based on the exact and approximate expressions for the variance, and assuming a Zipf model [1]. We can see that the approximation is quite good for the sample sizes and numbers of bins shown. We remark that asymptotic convergence cannot be appreciated in the second plot, due to the small number of bins considered and the fact that the bin probabilities in the Zipf model of [1] converge slowly to zero as $b$ increases. The good accuracy of the approximation is obtained at a huge savings in computation time. As an example, for $n = 60$ and $b = 16$, it takes more than 30 minutes to compute the exact expression for the variance, but less than 1 second to compute the approximate one, using R on a MacBook Pro Intel Duo 2.33 GHz.

Figure 1. Exact (solid line) and approximate (dashed line) standard deviation of classification error versus sample size.

Figure 2. Exact (solid line) and approximate (dashed line) standard deviation of classification error versus number of bins.

5. REFERENCES

[1] U.M. Braga-Neto and E.R. Dougherty. Exact performance of error estimators for discrete classifiers. Pattern Recognition, 38(11):1799–1814, 2005.

[2] S. Kim, E.R. Dougherty, M.L. Bittner, Y. Chen, K. Sivakumar, P. Meltzer, and J.M. Trent. A general framework for the analysis of multivariate gene interaction via expression arrays. Journal of Biomedical Optics, 5(4):411–424, 2000.

[3] I. Shmulevich, E.R. Dougherty, S. Kim, and W. Zhang. Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics, 18:261–274, 2002.