2011 45th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, November 6–9, 2011
Maximum Likelihood Estimation of the Binary Coefficient of Determination

Ting Chen
Department of Electrical Engineering
Texas A&M University
College Station, Texas 77843
Email: [email protected]

Ulisses Braga-Neto* (Corresponding Author)
Department of Electrical Engineering
Texas A&M University
College Station, Texas 77843
Email: [email protected]
Abstract—The binary Coefficient of Determination (CoD) is a key component of inference methods in Genomic Signal Processing. Assuming a stochastic logic model, we introduce a new sample CoD estimator based upon maximum likelihood (ML) estimation. Experiments have been conducted to assess how the ML CoD estimator performs in recovering predictors in multivariate prediction settings. Its performance is compared with that of the traditional nonparametric CoD estimators based on resubstitution, leave-one-out, bootstrap, and cross-validation. The results show that the ML CoD estimator is the estimator of choice if prior knowledge is available about the logic relationships in the model, even if this knowledge is incomplete.
Index Terms—CoD estimation, maximum likelihood estimation, stochastic logic model, multivariate predictive inference.
I. INTRODUCTION
The binary Coefficient of Determination (CoD) [1] measures
the predictive power of a set of binary predictor variables
X = {X1, X2, ..., Xn} ∈ {0, 1}^n with respect to a binary target
variable Y ∈ {0, 1}, as given by the simple formula:

CoD(X, Y) = (εY − εX,Y) / εY ,    (1)
where εY is the error of the best predictor of Y in the absence
of observations and εX,Y is the error of the best predictor
of Y based on the observation of X. The CoD measures
the nonlinear interaction between predictors and target. The
CoD has had far-reaching applications in Genomic Signal
Processing [2–6].
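As a worked sketch of eq. (1), the CoD can be computed directly from a joint distribution; the probabilities below are illustrative only (a single binary predictor is used for brevity):

```python
# Sketch: computing the CoD of eq. (1) for a hypothetical joint
# distribution over a single binary predictor X and target Y.
# P[x][y] = P(X = x, Y = y); the numbers are illustrative only.
P = [[0.35, 0.05],   # X = 0
     [0.10, 0.50]]   # X = 1

# Error of the best predictor of Y in the absence of observations:
pY1 = P[0][1] + P[1][1]
eps_Y = min(pY1, 1 - pY1)

# Error of the best predictor of Y given the observation of X:
eps_XY = sum(min(P[x][0], P[x][1]) for x in (0, 1))

cod = (eps_Y - eps_XY) / eps_Y
print(round(cod, 4))  # 0.6667
```

Here observing X cuts the prediction error from 0.45 to 0.15, giving a CoD of 2/3.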
The CoD depends on the probability model connecting pre-
dictors and target, which, however, is usually unknown, or only
partially known. Therefore, the problem arises of how to find
a good sample CoD estimator. In a previous publication [7],
we studied nonparametric (“model-free”) CoD estimators such
as resubstitution, leave-one-out, cross-validation and bootstrap
CoD estimators. It was concluded that, provided one has
evidence of moderate to tight regulation between the genes,
and the number of predictors is not too large, one should
use the nonparametric maximum-likelihood estimator, i.e., the
resubstitution CoD estimator.
In Genomic Signal Processing, models of gene regulatory
networks play a significant role [2, 5]. In many cases of
practical interest, such a model must be inferred from noisy
gene-expression sample data, but partial knowledge about
the pathway of interest is already available. This motivates
us to investigate parametric maximum-likelihood (ML) CoD
estimation, where there is partial knowledge of the model from
which sample data are drawn. We carry out numerical exper-
iments to investigate how the ML CoD estimator performs in
inferring predictors by using synthetic data and compare the
results with those obtained by resubstitution, leave-one-out,
bootstrap and cross-validation CoD estimators. Results show
that the ML CoD estimator performs the best in recovering the
true multivariate regulatory relationships, under an incomplete
prior knowledge assumption.
The paper is organized as follows. Section II introduces a
two-input stochastic logic model, and the maximum likelihood
estimators of its parameters are studied. In Section III, we
review nonparametric CoD estimators, whereas Section IV
introduces the parametric ML CoD estimator. Section V com-
pares the performance of ML, resubstitution, leave-one-out,
bootstrap and cross-validation CoD estimators in the inference
of multivariate regulatory relationships. Finally, Section VI
presents concluding remarks.
II. STOCHASTIC LOGIC MODEL
In many applications, and in particular in Genomic Signal
Processing, models of regulatory networks play a prominent
role [2, 5]. Figure 1 displays an example of regulatory network
associated with the cell cycle. The activation and suppression
relationships between the various genetic switches lead to the
activation or not of DNA synthesis. We can see in Figure 1
that this network corresponds to a logic circuit, i.e., a Boolean
function.
In practice, such Boolean circuits must be identified from
sample data of the states of the various genes, which is
represented here in Figure 1. This measurement data is noisy,
and even the Boolean functions themselves can be affected by
uncertainty from unpredictable extraneous factors, which leads
to the need of a stochastic approach. For example, the state
“0 1 0 1” for the predicting genes appears twice in the
sample data, with conflicting values of the target gene, i.e.,
the sample data is “inconsistent” with a deterministic logic
circuit. Even though the final goal is to come up with the best
possible logic circuit model for the pathway, this model will
have an irreducible amount of error due to the noise.
978-1-4673-0323-1/11/$26.00 ©2011 IEEE (Asilomar 2011)
Fig. 1. Example of regulatory network, equivalent logic circuit, and sample data for the DNA synthesis pathway of the cell cycle. Adapted from Shmulevich et al. [5].
In order to address this problem, we employ in this paper
a stochastic logic model [2] in the problem of identification,
or inference, of multivariate predictive logical relationships
among binary elements, such as the ones present in the
regulatory circuits exemplified in Figure 1. The logic gates
in this model are replaced by a joint probability distribution
between predictors and target. In this paper, we consider the
case of two binary predictors X1 and X2 and one binary
target Y , i.e., two-input logic gates (the extension to many-
input logics is fairly involved, and will be treated in a future
publication).
Two-Input Logic Model: For a given Boolean function (logic gate) f : {0, 1}² → {0, 1}, let

P(Y = 1 | X1 = x1, X2 = x2) = p^f(x1,x2) (1 − p)^(1−f(x1,x2)) ,    (2)

P(X1 = x1, X2 = x2) = P1^x1 P2^x2 (1 − P1)^(1−x1) (1 − P2)^(1−x2) + (−1)^(x1+x2) γ ,    (3)

where p is the predictive power, Pi = P(Xi = 1), i = 1, 2, are the predictor “biases” (the value 0.5 being considered unbiased), and γ = E[X1X2] − E[X1]E[X2] is the covariance between the predictors. It is clear that (2) and (3) fully determine the joint distribution P(X1 = x1, X2 = x2, Y = y) = P(Y = y | X1 = x1, X2 = x2) P(X1 = x1, X2 = x2).
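As a minimal sketch, eqs. (2)–(3) assemble the full joint distribution as follows; the parameter values and the `joint` helper are illustrative, not from the paper:

```python
# Sketch: joint distribution P(X1, X2, Y) of the two-input stochastic
# logic model, assembled from eqs. (2)-(3). Values are illustrative.
def joint(p, P1, P2, gamma, f):
    """Return {(x1, x2, y): probability} for gate f with power p."""
    J = {}
    for x1 in (0, 1):
        for x2 in (0, 1):
            # Eq. (3): predictor marginal with covariance term gamma.
            px = (P1**x1 * P2**x2
                  * (1 - P1)**(1 - x1) * (1 - P2)**(1 - x2)
                  + (-1)**(x1 + x2) * gamma)
            # Eq. (2): stochastic logic gate with predictive power p.
            py1 = p if f(x1, x2) == 1 else 1 - p
            J[(x1, x2, 1)] = px * py1
            J[(x1, x2, 0)] = px * (1 - py1)
    return J

AND = lambda x1, x2: x1 & x2
J = joint(p=0.9, P1=0.5, P2=0.5, gamma=0.1, f=AND)
assert abs(sum(J.values()) - 1.0) < 1e-12  # a valid distribution
```

Note that the γ terms carry alternating signs (−1)^(x1+x2), so they cancel when summing over (x1, x2) and the result remains a probability distribution.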
We remark that this stochastic logic model differs from
previous models proposed in the literature. For example, the
“noisy-OR” model, commonly used in Bayesian network in-
ference [9], and the stochastic logic model of [10] both assume
deterministic logics with randomness only at the inputs. The
Probabilistic Boolean Network model [5] involves multiple
deterministic logics selected by a random probability. The
model considered here is most similar to the Boolean Network with perturbation model [11], which assumes a deterministic
logic with an output that can be flipped with a certain small
probability; however, there is no attempt to model the full joint
distribution between predictors and the target.
Note that the Boolean function appears in the two-input
logic model only in the formulation of P(Y = 1 | X1 = x1, X2 = x2) in eq. (2). This conditional probability specifies
the stochastic logic gate corresponding to the Boolean func-
tion f . See Figure 2 for an illustration of stochastic AND,
OR, and XOR logic gates and their corresponding conditional
probability tables. Note also that the parameters P1, P2, and
γ concern only the marginal distribution of the predictors,
whereas p concerns only the stochastic logic gate. We will
assume throughout that p ≥ 1/2, since if p < 1/2 one obtains
the negated logic gate with predictive power 1− p.
Fig. 2. Example of stochastic logic gates and corresponding conditional probability tables.
Proposition 1. Under the two-input logic model, the best predictor of the output Y given the inputs X1, X2, i.e., the function with the maximal prediction accuracy P(ψ(X1, X2) = Y) among all functions ψ : {0, 1}² → {0, 1}, is the underlying Boolean logic function itself, ψ*(X1, X2) = f(X1, X2). Furthermore, the maximal prediction accuracy is P(f(X1, X2) = Y) = p, the predictive power.
PROOF. See Appendix.
From Proposition 1, the predictive power p is the upper
bound on how well the output can be predicted from the inputs,
and it thus measures the amount of uncertainty in the model.
The best predictor is always the underlying ordinary logic
itself. The closer p is to 1, the less uncertainty there is, and
the more the stochastic logic gate operates as the underlying
ordinary logic.
A word about two-input logics: there are clearly 16 such
logics in all. Two are constant logics, for which the output is a
constant 0 or 1; 4 are 1-minterm logics, which have only one
“1” in the output truth table (these include the AND logic);
4 are 3-minterm logics (these include the OR logic), which
are symmetric to the 1-minterm logics via negation; and 6
are 2-minterm logics (these include the XOR logic). Of the 6
two-minterm logics, 4 are of the form X1, ¬X1, X2, ¬X2, which
are 1-input logics. Therefore, there are only 10 two-input logic
gates of interest, which are represented by the prototype logics
AND, OR, and XOR.
A. Maximum-Likelihood Estimation of Model Parameters
In this subsection we study the problem of estimating the
parameters p, P1, P2, and γ of the two-input logic model from
i.i.d. sample data Sn = {(X11, X12, Y1), ..., (Xn1, Xn2, Yn)} drawn from the model. First note that, by definition, P1 = P(X1 = 1) and P2 = P(X2 = 1), whereas, by Proposition 1,
p = P(f(X1, X2) = Y). It is well-known that the maximum-
likelihood estimators of probabilities and expectations are the
associated empirical frequencies and sample-means, respec-
tively, which have several desirable properties [12]. This leads
to the following proposition, which is given without proof.
Proposition 2. The maximum-likelihood estimators of the parameters of the two-input logic model are given by

p̂ = 1 − (1/n) Σ_{i=1}^{n} |f(Xi1, Xi2) − Yi| ,

P̂1 = (1/n) Σ_{i=1}^{n} Xi1 ,   P̂2 = (1/n) Σ_{i=1}^{n} Xi2 ,

γ̂ = (1/n) Σ_{i=1}^{n} Xi1 Xi2 − (1/n²) (Σ_{i=1}^{n} Xi1)(Σ_{i=1}^{n} Xi2) .    (4)
It is easy to show that p̂, P̂1, and P̂2 are minimum-variance
unbiased, with Var[p̂] = (1/n) p(1 − p), Var[P̂1] = (1/n) P1(1 − P1),
and Var[P̂2] = (1/n) P2(1 − P2). However, γ̂ is a biased estimator,
such that E[γ̂] = ((n − 1)/n) γ. As ML estimators, all the previous estimators are asymptotically unbiased, asymptotically
efficient, and consistent [13].
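The estimators in eq. (4) translate directly into code; a minimal sketch on a small hypothetical sample (the data and the `ml_estimates` helper are ours, not the paper's):

```python
# Sketch: ML estimators of eq. (4) computed from an i.i.d. sample of
# (x1, x2, y) triples; the sample below is illustrative.
def ml_estimates(sample, f):
    n = len(sample)
    # p-hat: one minus the fraction of sample points where f disagrees with y.
    p_hat = 1 - sum(abs(f(x1, x2) - y) for x1, x2, y in sample) / n
    # P1-hat, P2-hat: empirical frequencies of X1 = 1 and X2 = 1.
    P1_hat = sum(x1 for x1, _, _ in sample) / n
    P2_hat = sum(x2 for _, x2, _ in sample) / n
    # gamma-hat: sample mean of X1*X2 minus product of sample means.
    g_hat = sum(x1 * x2 for x1, x2, _ in sample) / n - P1_hat * P2_hat
    return p_hat, P1_hat, P2_hat, g_hat

AND = lambda x1, x2: x1 & x2
sample = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1), (1, 1, 1), (0, 0, 0)]
p_hat, P1_hat, P2_hat, g_hat = ml_estimates(sample, AND)
print(p_hat, P1_hat, P2_hat, round(g_hat, 4))  # 1.0 0.5 0.5 0.0833
```

On this sample the AND logic is never contradicted, so p̂ = 1; γ̂ = 1/3 − 1/4 = 1/12 reflects the positive association between the predictors.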
III. THE DISCRETE COEFFICIENT OF DETERMINATION
We examine in this section the formulation of the discrete
CoD, and its estimation, for the assumed case of two predictors
X1, X2 and a target Y. For the general case with more than two predictors, see [2, 7].
It can be easily shown that the best predictor of Y in the absence of variables is ψY = argmax_i P(Y = i), with optimal error

εY = min{P(Y = 0), P(Y = 1)} ,    (5)

whereas the best predictor of Y when X is available is ψY(X) = argmax_i P(Y = i | X), with optimal error [14]:

εX,Y = Σ_{x1=0}^{1} Σ_{x2=0}^{1} min{P(Y = 0, X1 = x1, X2 = x2), P(Y = 1, X1 = x1, X2 = x2)} ,    (6)

where X = (X1, X2) is the predictor vector. This leads to a concept analogous to the CoD of classical regression [1]:

CoD(X, Y) = (εY − εX,Y) / εY = 1 − εX,Y / εY .    (7)
By convention, one assumes 0/0 = 1 in the previous def-
inition. It is clear that 0 ≤ CoD(X, Y ) ≤ 1. This CoD
measures the relative decrease in prediction error of Y when
using X as the predictor vector, relative to using no predictors;
if CoD(X, Y ) = 0, then X carries no information about Y ,
whereas if CoD(X, Y ) = 1, then X predicts Y perfectly. The
discrete CoD in (7) — for simplicity, we will refer to this as
the “CoD”, without qualification — has been applied to the
inference of multivariate discrete predictive relationships [1]
and Boolean networks [5]. The basic idea is to select the
predictors with the best CoD estimated from sample data (as
will be discussed in Section V).
Given the i.i.d. sample data Sn = {(X11, X12, Y1), ..., (Xn1, Xn2, Yn)} drawn from the underlying probability model, a CoD estimator is obtained by estimation of εY and εX,Y in (7):

ĈoD(X, Y) = (ε̂Y − ε̂X,Y) / ε̂Y = 1 − ε̂X,Y / ε̂Y .    (8)

Typically, ε̂Y is given by the empirical frequency estimator,

ε̂Y = min{N0/n, N1/n} ,    (9)

with Ni being the number of sample points in state Y = i, i = 0, 1 (such that N0 + N1 = n), and ε̂X,Y is obtained
by using one of a number of choices of error estimators in
connection with the discrete histogram rule [8]. It can be
shown that the nonparametric ML estimator of εX,Y is given
by the resubstitution error estimator [15]:
ε̂ʳX,Y = (1/n) Σ_{x1=0}^{1} Σ_{x2=0}^{1} min{U(x1, x2), V(x1, x2)} ,    (10)

where U(x1, x2) and V(x1, x2) denote the number of sample points in states (X1 = x1, X2 = x2, Y = 0) and (X1 = x1, X2 = x2, Y = 1), respectively. This leads to the
nonparametric ML CoD estimator
ĈoDʳ(X, Y) = 1 − ε̂ʳX,Y / min{N0/n, N1/n} ,    (11)
i.e., the resubstitution CoD estimator. It is easy to show that
this is a consistent estimator of CoDX,Y . For the formulation
of CoD estimators based on other choices of εX,Y (e.g., re-
substitution, leave-one-out, cross-validation, bootstrap), please
see [7].
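A sketch of the resubstitution CoD estimator of eqs. (9)–(11), computed on illustrative sample data (the `resub_cod` helper is ours):

```python
from collections import Counter

# Sketch: the resubstitution CoD estimator of eqs. (9)-(11), from a
# sample of (x1, x2, y) triples; the data below are illustrative.
def resub_cod(sample):
    n = len(sample)
    # Eq. (9): empirical-frequency estimate of eps_Y.
    N = Counter(y for _, _, y in sample)
    eps_Y = min(N[0], N[1]) / n
    # Eq. (10): per-cell counts U(x1, x2) and V(x1, x2), then
    # the resubstitution error of the discrete histogram rule.
    cell = Counter(((x1, x2), y) for x1, x2, y in sample)
    eps_r = sum(min(cell[((x1, x2), 0)], cell[((x1, x2), 1)])
                for x1 in (0, 1) for x2 in (0, 1)) / n
    # Eq. (11), with the 0/0 = 1 convention.
    return 1.0 if eps_Y == 0 else 1 - eps_r / eps_Y

sample = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1), (1, 1, 0), (1, 1, 1)]
print(resub_cod(sample))  # 0.5
```

Here ε̂Y = 2/6 and ε̂ʳX,Y = 1/6 (the single inconsistency in the (1, 1) cell), giving a sample CoD of 0.5.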
IV. PARAMETRIC ML COD ESTIMATOR
If prior knowledge about the distribution of (X, Y) is available, in the form of the stochastic logic model of Section II, the corresponding parametric ML estimator ĈoDᴹᴸ(X, Y) is
obtained as follows. The true CoD in (7) is a function of the
model parameters p, P1, P2 and γ: CoD = g(p, P1, P2, γ) (for
the sake of simplicity, we will omit henceforth the explicit
reference to (X, Y ) in CoD notation). By the principle of ML
invariance [13], to obtain the ML estimators of the CoD one
plugs in the ML estimators of the model parameters into g.
For example, expressions for the function g were obtained in [2] for the two-input AND and XOR logic models of Section II:

CoD_AND = 1 − (1 − p) / F[P1P2 + γ + (1 − 2P1P2 − 2γ)p] ,    (12)

and

CoD_XOR = 1 − (1 − p) / F[A + (1 − 2A)p] ,    (13)

where F(x) = min{x, 1 − x}, for 0 ≤ x ≤ 1, and A = P1 + P2 − 2P1P2 − 2γ. Hence, the corresponding ML estimators for these logic models are

ĈoDᴹᴸ_AND = 1 − (1 − p̂) / F[P̂1P̂2 + γ̂ + (1 − 2P̂1P̂2 − 2γ̂)p̂] ,

ĈoDᴹᴸ_XOR = 1 − (1 − p̂) / F[Â + (1 − 2Â)p̂] ,    (14)

where Â = P̂1 + P̂2 − 2P̂1P̂2 − 2γ̂, and p̂, P̂1, P̂2, and γ̂ are given in (4). The ML CoD estimators for other two-input
logics can be similarly derived.
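The closed forms in (12)–(14) translate directly into plug-in estimators; a sketch with illustrative parameter values (the function names are ours):

```python
# Sketch: plug-in ML CoD estimators of eq. (14) for the AND and XOR
# logics; pass in the ML parameter estimates from eq. (4).
def F(x):
    return min(x, 1 - x)

def cod_ml_and(p, P1, P2, g):
    return 1 - (1 - p) / F(P1 * P2 + g + (1 - 2 * P1 * P2 - 2 * g) * p)

def cod_ml_xor(p, P1, P2, g):
    A = P1 + P2 - 2 * P1 * P2 - 2 * g
    return 1 - (1 - p) / F(A + (1 - 2 * A) * p)

# Sanity checks: p = 1 gives a perfect CoD, p = 1/2 gives zero
# (at p = 1/2 the argument of F collapses to 1/2 for any P1, P2, g).
print(cod_ml_and(1.0, 0.5, 0.5, 0.0))  # 1.0
print(cod_ml_xor(1.0, 0.5, 0.5, 0.0))  # 1.0
print(cod_ml_and(0.5, 0.5, 0.5, 0.0))  # 0.0
```

The same pattern applies to the other prototype logics once their g functions are derived.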
V. MULTIVARIATE PREDICTOR INFERENCE
This section reports the results of numerical experiments
that were conducted to assess the inference capability of ML
CoD estimators as compared with resubstitution, leave-one-
out, bootstrap and cross-validation CoD estimators, under an
incomplete prior knowledge assumption: namely, that a set
of candidate logics is known to contain the logics that can
appear as predictors of a target in the network. The smaller
this set is, the more prior knowledge about the model is
available.
Considering eight predictor genes, we randomly assign two
of them to regulate each of eight target genes with a stochastic
XOR logic function (this is thus an “XOR” network). We
construct 100 datasets based on one specific multivariate
predictive setting for varying predictive power and sample
size, respectively. Also, we use the same candidate logic set for all
targets; each candidate set contains the true logic, XOR
in our example. We consider three different candidate
logic sets: CL1 = {XOR (0110)}, CL2 = {XOR (0110),
OR (0111), (0100), (0010)}, and CL3 = {XOR (0110), OR (0111),
(0100), (0010), AND (0001), NAND (1110)}.
The inference method is as follows. For the resubstitution,
leave-one-out, bootstrap and cross-validation cases, we use the
largest estimated CoD to pick the best predictor set from the
C(8, 2) = 28 possible pairs, for each target. The best logic is decided by using the
histogram rule. For the parametric ML case, we first choose
the predictor set with the largest ML CoD estimate for each
target by assuming each logic in a given candidate logic set.
Next, we compute the ML estimates of p in (4) associated
with the chosen predictor sets for each target. Finally, we
choose the best predictor set of each target using the largest
one of its computed ML estimates of p, and the corresponding
candidate logic is the best logic. Using the synthetic datasets,
we compute the average percentage of predictors recovered
correctly by each CoD estimator.
Figure 3 describes the average percentage of predictors
correctly inferred via the ML, resubstitution, leave-one-out,
bootstrap, and cross-validation CoD estimators as a function
of sample size for varying predictive power and candidate
logic set. As the sample size increases, the average percentage
increases accordingly, and it converges to 1 faster. Generally,
the ML approach outperforms the others. When we have
complete prior knowledge about the model, i.e. the candidate
logic set consists of a single logic, CL1 = {XOR (0110)},
the ML approach recovers the largest percentage of predictors
correctly inferred. As we are less certain about the model,
that is, the number of candidate logics increases, the average
percentage of predictors correctly recovered decreases.
VI. CONCLUSION
We proposed in this paper, for the first time, to apply the
maximum likelihood principle to the problem of discrete CoD
estimation, in the case of a stochastic two-input logic model.
We showed that, given incomplete prior knowledge about the
model, the ML CoD estimator could recover more predictors
than the resubstitution, leave-one-out, bootstrap and cross-
validation in a multivariate prediction setting. Therefore, we
conclude that, provided one has evidence of particular logic
regulatory relationships, one should use the CoD estimator
based upon maximum likelihood estimation. Our future work
will focus on the extension to many-input stochastic logics and
to dynamic models, as well as its applications to multivariate
predictive inference problems.
ACKNOWLEDGEMENTS
The authors would like to acknowledge the financial support
afforded by NSF Award CCF-0845407.
APPENDIX
Proof of Proposition 1: We need to maximize the prediction accuracy (assuming p ≥ 1/2):

max_ψ P(ψ(X1, X2) = Y)
  = max_ψ Σ_{x1,x2} P(X1 = x1, X2 = x2, Y = ψ(x1, x2))
  = Σ_{x1,x2} max_ψ [ p P(X1 = x1, X2 = x2) 1_{ψ=f} + (1 − p) P(X1 = x1, X2 = x2) 1_{ψ≠f} ]
  = Σ_{x1,x2} p P(X1 = x1, X2 = x2) = p ,    (15)

where the maximum is attained by ψ*(X1, X2) = f(X1, X2).
Fig. 3. The average percentage of predictors correctly recovered vs. sample size, for varying predictive power and candidate logic set, respectively. [Figure: a 3×3 grid of panels, with columns CL1, CL2, CL3 and rows p = 0.85, 0.75, 0.65; each panel plots predictor recovery (%) against sample size for the ML, resubstitution, leave-one-out, cross-validation, and bootstrap CoD estimators.]
REFERENCES
[1] E.R. Dougherty, S. Kim, Y.D. Chen, "Coefficient of determination in nonlinear signal processing," Signal Processing 80 (2000) 2219–2235.
[2] D.C. Martins, U.M. Braga-Neto, R.F. Hashimoto, M.L. Bittner, E.R. Dougherty, "Intrinsically multivariate predictive genes," IEEE Journal of Selected Topics in Signal Processing 2 (3) (2008) 424–439.
[3] S. Kim, E.R. Dougherty, M.L. Bittner, Y. Chen, K. Sivakumar, P. Meltzer, J.M. Trent, "A general framework for the analysis of multivariate gene interaction via expression arrays," Journal of Biomedical Optics 5 (4) (2000) 411–424.
[4] S. Kim, E.R. Dougherty, Y. Chen, K. Sivakumar, P. Meltzer, J.M. Trent, M. Bittner, "Multivariate measurement of gene expression relationships," Genomics 67 (2000) 201–209.
[5] I. Shmulevich, E.R. Dougherty, S. Kim, W. Zhang, "Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks," Bioinformatics 18 (2) (2002) 261–274.
[6] X. Zhou, X. Wang, E.R. Dougherty, "Binarization of microarray data based on a mixture model," Molecular Cancer Therapeutics 2 (7) (2003) 679–684.
[7] T. Chen, U.M. Braga-Neto, "Exact performance of CoD estimators in discrete prediction," EURASIP Journal on Advances in Signal Processing: Special Issue on Genomic Signal Processing (2010).
[8] U.M. Braga-Neto, "Classification and error estimation for discrete data," Current Genomics 10 (7) (2009) 446–462.
[9] K. Murphy, S. Mian, "Modelling gene expression data using dynamic Bayesian networks," Technical Report, University of California, Berkeley, CA (1999).
[10] W. Qian, M.D. Riedel, "The synthesis of robust polynomial arithmetic with stochastic logic," Proceedings of the 45th Design Automation Conference (DAC 2008).
[11] S. Marshall, L. Yu, Y. Xiao, E.R. Dougherty, "Inference of a probabilistic Boolean network from a single observed temporal sequence," EURASIP Journal on Bioinformatics and Systems Biology (2007) 1–15.
[12] S.D. Silvey, Statistical Inference, Chapman and Hall, London, 1975.
[13] G. Casella, R.L. Berger, Statistical Inference, Duxbury Press, 2002.
[14] L. Devroye, L. Györfi, G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, New York, 1996.
[15] U.M. Braga-Neto, E.R. Dougherty, "Exact performance of error estimators for discrete classifiers," Pattern Recognition 38 (11) (2005) 1799–1814.