2011 45th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, November 6–9, 2011
Maximum Likelihood Estimation of the Binary Coefficient of Determination

Ting Chen
Department of Electrical Engineering
Texas A&M University
College Station, Texas 77843
Email: [email protected]

Ulisses Braga-Neto* (Corresponding Author)
Department of Electrical Engineering
Texas A&M University
College Station, Texas 77843
Email: [email protected]
Abstract—The binary Coefficient of Determination (CoD) is a key component of inference methods in Genomic Signal Processing. Assuming a stochastic logic model, we introduce a new sample CoD estimator based upon maximum likelihood (ML) estimation. Experiments have been conducted to assess how the ML CoD estimator performs in recovering predictors in multivariate prediction settings. Its performance is compared with that of the traditional nonparametric CoD estimators based on resubstitution, leave-one-out, bootstrap, and cross-validation. The results show that the ML CoD estimator is the estimator of choice if prior knowledge is available about the logic relationships in the model, even if this knowledge is incomplete.
Index Terms—CoD estimation, maximum likelihood estimation, stochastic logic model, multivariate predictive inference.
I. INTRODUCTION
The binary Coefficient of Determination (CoD) [1] measures
the predictive power of a set of binary predictor variables
X = {X1, X2, ..., Xn} ∈ {0, 1}^n with respect to a binary target
variable Y ∈ {0, 1}, as given by the simple formula:

CoD(X, Y) = (εY − εX,Y) / εY ,    (1)
where εY is the error of the best predictor of Y in the absence
of observations and εX,Y is the error of the best predictor
of Y based on the observation of X. The CoD measures
the nonlinear interaction between predictors and target. The
CoD has had far-reaching applications in Genomic Signal
Processing [2–6].
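As a worked sketch of eq. (1), the CoD can be computed directly from a joint distribution; the probabilities below are illustrative only (a single binary predictor is used for brevity):

```python
# Sketch: computing the CoD of eq. (1) for a hypothetical joint
# distribution over a single binary predictor X and target Y.
# P[x][y] = P(X = x, Y = y); the numbers are illustrative only.
P = [[0.35, 0.05],   # X = 0
     [0.10, 0.50]]   # X = 1

# Error of the best predictor of Y in the absence of observations:
pY1 = P[0][1] + P[1][1]
eps_Y = min(pY1, 1 - pY1)

# Error of the best predictor of Y given the observation of X:
eps_XY = sum(min(P[x][0], P[x][1]) for x in (0, 1))

cod = (eps_Y - eps_XY) / eps_Y
print(round(cod, 4))  # 0.6667
```

Here observing X cuts the prediction error from 0.45 to 0.15, giving a CoD of 2/3.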
The CoD depends on the probability model connecting pre-
dictors and target, which, however, is usually unknown, or only
partially known. Therefore, the problem arises of how to find
a good sample CoD estimator. In a previous publication [7],
we studied nonparametric (“model-free”) CoD estimators such
as resubstitution, leave-one-out, cross-validation and bootstrap
CoD estimators. It was concluded that, provided one has
evidence of moderate to tight regulation between the genes,
and the number of predictors is not too large, one should
use the nonparametric maximum-likelihood estimator, i.e., the
resubstitution CoD estimator.
In Genomic Signal Processing, models of gene regulatory
networks play a significant role [2, 5]. In many cases of
practical interest, such a model must be inferred from noisy
gene-expression sample data, but partial knowledge about
the pathway of interest is already available. This motivates
us to investigate parametric maximum-likelihood (ML) CoD
estimation, where there is partial knowledge of the model from
which sample data are drawn. We carry out numerical exper-
iments to investigate how the ML CoD estimator performs in
inferring predictors by using synthetic data and compare the
results with those obtained by resubstitution, leave-one-out,
bootstrap and cross-validation CoD estimators. Results show
that the ML CoD estimator performs the best in recovering the
true multivariate regulatory relationships, under an incomplete
prior knowledge assumption.
The paper is organized as follows. Section II introduces a
two-input stochastic logic model, and the maximum likelihood
estimators of its parameters are studied. In Section III, we
review nonparametric CoD estimators, whereas Section IV
introduces the parametric ML CoD estimator. Section V com-
pares the performance of ML, resubstitution, leave-one-out,
bootstrap and cross-validation CoD estimators in the inference
of multivariate regulatory relationships. Finally, Section VI
presents concluding remarks.
II. STOCHASTIC LOGIC MODEL
In many applications, and in particular in Genomic Signal
Processing, models of regulatory networks play a prominent
role [2, 5]. Figure 1 displays an example of regulatory network
associated with the cell cycle. The activation and suppression
relationships between the various genetic switches lead to the
activation or not of DNA synthesis. We can see in Figure 1
that this network corresponds to a logic circuit, i.e., a Boolean
function.
In practice, such Boolean circuits must be identified from
sample data of the states of the various genes, which is
represented here in Figure 1. This measurement data is noisy,
and even the Boolean functions themselves can be affected by
uncertainty from unpredictable extraneous factors, which leads
to the need of a stochastic approach. For example, the state
“0 1 0 1” for the predicting genes appears twice in the
sample data, with conflicting values of the target gene, i.e.,
the sample data is “inconsistent” with a deterministic logic
circuit. Even though the final goal is to come up with the best
possible logic circuit model for the pathway, this model will
have an irreducible amount of error due to the noise.
978-1-4673-0323-1/11/$26.00 ©2011 IEEE (Asilomar 2011)
Fig. 1. Example of regulatory network, equivalent logic circuit, and sample data for the DNA synthesis pathway of the cell cycle. Adapted from Shmulevich et al. [5].
In order to address this problem, we employ in this paper
a stochastic logic model [2] in the problem of identification,
or inference, of multivariate predictive logical relationships
among binary elements, such as the ones present in the
regulatory circuits exemplified in Figure 1. The logic gates
in this model are replaced by a joint probability distribution
between predictors and target. In this paper, we consider the
case of two binary predictors X1 and X2 and one binary
target Y , i.e., two-input logic gates (the extension to many-
input logics is fairly involved, and will be treated in a future
publication).
Two-Input Logic Model: For a given Boolean function (logic gate) f : {0, 1}² → {0, 1}, let

P(Y = 1 | X1 = x1, X2 = x2) = p^f(x1,x2) (1 − p)^(1−f(x1,x2)) ,    (2)

P(X1 = x1, X2 = x2) = P1^x1 P2^x2 (1 − P1)^(1−x1) (1 − P2)^(1−x2) + (−1)^(x1+x2) γ ,    (3)

where p is the predictive power, Pi = P(Xi = 1), i = 1, 2, are the predictor “biases” (the value 0.5 being considered unbiased), and γ = E[X1X2] − E[X1]E[X2] is the covariance between the predictors. It is clear that (2) and (3) fully determine the joint distribution P(X1 = x1, X2 = x2, Y = y) = P(Y = y | X1 = x1, X2 = x2) P(X1 = x1, X2 = x2).
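As a minimal sketch, eqs. (2)–(3) assemble the full joint distribution as follows; the parameter values and the `joint` helper are illustrative, not from the paper:

```python
# Sketch: joint distribution P(X1, X2, Y) of the two-input stochastic
# logic model, assembled from eqs. (2)-(3). Values are illustrative.
def joint(p, P1, P2, gamma, f):
    """Return {(x1, x2, y): probability} for gate f with power p."""
    J = {}
    for x1 in (0, 1):
        for x2 in (0, 1):
            # Eq. (3): predictor marginal with covariance term gamma.
            px = (P1**x1 * P2**x2
                  * (1 - P1)**(1 - x1) * (1 - P2)**(1 - x2)
                  + (-1)**(x1 + x2) * gamma)
            # Eq. (2): stochastic logic gate with predictive power p.
            py1 = p if f(x1, x2) == 1 else 1 - p
            J[(x1, x2, 1)] = px * py1
            J[(x1, x2, 0)] = px * (1 - py1)
    return J

AND = lambda x1, x2: x1 & x2
J = joint(p=0.9, P1=0.5, P2=0.5, gamma=0.1, f=AND)
assert abs(sum(J.values()) - 1.0) < 1e-12  # a valid distribution
```

Note that the γ terms carry alternating signs (−1)^(x1+x2), so they cancel when summing over (x1, x2) and the result remains a probability distribution.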
We remark that this stochastic logic model differs from
previous models proposed in the literature. For example, the
“noisy-OR” model, commonly used in Bayesian network in-
ference [9], and the stochastic logic model of [10] both assume
deterministic logics with randomness only at the inputs. The
Probabilistic Boolean Network model [5] involves multiple
deterministic logics selected by a random probability. The
model considered here is most similar to the Boolean Network with perturbation model [11], which assumes a deterministic
logic with an output that can be flipped with a certain small
probability; however, there is no attempt to model the full joint
distribution between predictors and the target.
Note that the Boolean function appears in the two-input
logic model only in the formulation of P(Y = 1 | X1 = x1, X2 = x2) in eq. (2). This conditional probability specifies
the stochastic logic gate corresponding to the Boolean func-
tion f . See Figure 2 for an illustration of stochastic AND,
OR, and XOR logic gates and their corresponding conditional
probability tables. Note also that the parameters P1, P2, and
γ concern only the marginal distribution of the predictors,
whereas p concerns only the stochastic logic gate. We will
assume throughout that p ≥ 1/2, since if p < 1/2 one obtains
the negated logic gate with predictive power 1− p.
Fig. 2. Example of stochastic logic gates and corresponding conditional probability tables.
Proposition 1. Under the two-input logic model, the best predictor of the output Y given the inputs X1, X2, i.e., the function with the maximal prediction accuracy P(ψ(X1, X2) = Y) among all functions ψ : {0, 1}² → {0, 1}, is the underlying Boolean logic function itself, ψ*(X1, X2) = f(X1, X2). Furthermore, the maximal prediction accuracy is P(f(X1, X2) = Y) = p, the predictive power.
PROOF. See Appendix.
From Proposition 1, the predictive power p is the upper
bound on how well the output can be predicted from the inputs,
and it thus measures the amount of uncertainty in the model.
The best predictor is always the underlying ordinary logic
itself. The closer p is to 1, the less uncertainty there is, and
the more the stochastic logic gate operates as the underlying
ordinary logic.
A word about two-input logics: there are clearly 16 such
logics in all. Two are constant logics, for which the output is a
constant 0 or 1; 4 are 1-minterm logics, which have only one
“1” in the output truth table (these include the AND logic);
4 are 3-minterm logics (these include the OR logic), which
are symmetric to the 1-minterm logics via negation; and 6
are 2-minterm logics (these include the XOR logic). Of the 6
two-minterm logics, 4 are of the form X1, ¬X1, X2, ¬X2, which
are 1-input logics. Therefore, there are only 10 two-input logic
gates of interest, which are represented by the prototype logics
AND, OR, and XOR.
A. Maximum-Likelihood Estimation of Model Parameters
In this subsection we study the problem of estimating the
parameters p, P1, P2, and γ of the two-input logic model from
i.i.d. sample data Sn = {(X11, X12, Y1), ..., (Xn1, Xn2, Yn)} drawn from the model. First note that, by definition, P1 = P(X1 = 1) and P2 = P(X2 = 1), whereas, by Proposition 1,
p = P(f(X1, X2) = Y). It is well-known that the maximum-
likelihood estimators of probabilities and expectations are the
associated empirical frequencies and sample-means, respec-
tively, which have several desirable properties [12]. This leads
to the following proposition, which is given without proof.
Proposition 2. The maximum-likelihood estimators of the parameters of the two-input logic model are given by

p̂ = 1 − (1/n) Σ_{i=1}^{n} |f(Xi1, Xi2) − Yi| ,

P̂1 = (1/n) Σ_{i=1}^{n} Xi1 ,   P̂2 = (1/n) Σ_{i=1}^{n} Xi2 ,

γ̂ = (1/n) Σ_{i=1}^{n} Xi1 Xi2 − (1/n²) (Σ_{i=1}^{n} Xi1)(Σ_{i=1}^{n} Xi2) .    (4)
It is easy to show that p̂, P̂1, and P̂2 are minimum-variance
unbiased, with Var[p̂] = (1/n) p(1 − p), Var[P̂1] = (1/n) P1(1 − P1),
and Var[P̂2] = (1/n) P2(1 − P2). However, γ̂ is a biased estimator,
such that E[γ̂] = ((n − 1)/n) γ. As ML estimators, all the previous estimators are asymptotically unbiased, asymptotically
efficient, and consistent [13].
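The estimators in eq. (4) translate directly into code; a minimal sketch on a small hypothetical sample (the data and the `ml_estimates` helper are ours, not the paper's):

```python
# Sketch: ML estimators of eq. (4) computed from an i.i.d. sample of
# (x1, x2, y) triples; the sample below is illustrative.
def ml_estimates(sample, f):
    n = len(sample)
    # p-hat: one minus the fraction of sample points where f disagrees with y.
    p_hat = 1 - sum(abs(f(x1, x2) - y) for x1, x2, y in sample) / n
    # P1-hat, P2-hat: empirical frequencies of X1 = 1 and X2 = 1.
    P1_hat = sum(x1 for x1, _, _ in sample) / n
    P2_hat = sum(x2 for _, x2, _ in sample) / n
    # gamma-hat: sample mean of X1*X2 minus product of sample means.
    g_hat = sum(x1 * x2 for x1, x2, _ in sample) / n - P1_hat * P2_hat
    return p_hat, P1_hat, P2_hat, g_hat

AND = lambda x1, x2: x1 & x2
sample = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1), (1, 1, 1), (0, 0, 0)]
p_hat, P1_hat, P2_hat, g_hat = ml_estimates(sample, AND)
print(p_hat, P1_hat, P2_hat, round(g_hat, 4))  # 1.0 0.5 0.5 0.0833
```

On this sample the AND logic is never contradicted, so p̂ = 1; γ̂ = 1/3 − 1/4 = 1/12 reflects the positive association between the predictors.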
III. THE DISCRETE COEFFICIENT OF DETERMINATION
We examine in this section the formulation of the discrete
CoD, and its estimation, for the assumed case of two predictors
X1, X2 and a target Y. For the general case with more than two predictors, see [2, 7].
It can be easily shown that the best predictor of Y in the absence of variables is ψY = argmax_i P(Y = i), with optimal error

εY = min{P(Y = 0), P(Y = 1)} ,    (5)

whereas the best predictor of Y when X is available is ψY(X) = argmax_i P(Y = i | X), with optimal error [14]:

εX,Y = Σ_{x1=0}^{1} Σ_{x2=0}^{1} min{P(Y = 0, X1 = x1, X2 = x2), P(Y = 1, X1 = x1, X2 = x2)} ,    (6)

where X = (X1, X2) is the predictor vector. This leads to a concept analogous to the CoD of classical regression [1]:

CoD(X, Y) = (εY − εX,Y) / εY = 1 − εX,Y / εY .    (7)
By convention, one assumes 0/0 = 1 in the previous def-
inition. It is clear that 0 ≤ CoD(X, Y ) ≤ 1. This CoD
measures the relative decrease in prediction error of Y when
using X as the predictor vector, relative to using no predictors;
if CoD(X, Y ) = 0, then X carries no information about Y ,
whereas if CoD(X, Y ) = 1, then X predicts Y perfectly. The
discrete CoD in (7) — for simplicity, we will refer to this as
the “CoD”, without qualification — has been applied to the
inference of multivariate discrete predictive relationships [1]
and Boolean networks [5]. The basic idea is to select the
predictors with the best CoD estimated from sample data (as
will be discussed in Section V).
Given the i.i.d. sample data Sn = {(X11, X12, Y1), ..., (Xn1, Xn2, Yn)} drawn from the underlying probability model, a CoD estimator is obtained by estimation of εY and εX,Y in (7):

ĈoD(X, Y) = (ε̂Y − ε̂X,Y) / ε̂Y = 1 − ε̂X,Y / ε̂Y .    (8)

Typically, ε̂Y is given by the empirical frequency estimator,

ε̂Y = min{N0/n, N1/n} ,    (9)

with Ni being the number of sample points in state Y = i, i = 0, 1 (such that N0 + N1 = n), and ε̂X,Y is obtained
by using one of a number of choices of error estimators in
connection with the discrete histogram rule [8]. It can be
shown that the nonparametric ML estimator of εX,Y is given
by the resubstitution error estimator [15]:
ε̂ʳX,Y = (1/n) Σ_{x1=0}^{1} Σ_{x2=0}^{1} min{U(x1, x2), V(x1, x2)} ,    (10)

where U(x1, x2) and V(x1, x2) denote the number of sample points in states (X1 = x1, X2 = x2, Y = 0) and (X1 = x1, X2 = x2, Y = 1), respectively. This leads to the
nonparametric ML CoD estimator
ĈoDʳ(X, Y) = 1 − ε̂ʳX,Y / min{N0/n, N1/n} ,    (11)
i.e., the resubstitution CoD estimator. It is easy to show that
this is a consistent estimator of CoDX,Y . For the formulation
of CoD estimators based on other choices of εX,Y (e.g., re-
substitution, leave-one-out, cross-validation, bootstrap), please
see [7].
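A sketch of the resubstitution CoD estimator of eqs. (9)–(11), computed on illustrative sample data (the `resub_cod` helper is ours):

```python
from collections import Counter

# Sketch: the resubstitution CoD estimator of eqs. (9)-(11), from a
# sample of (x1, x2, y) triples; the data below are illustrative.
def resub_cod(sample):
    n = len(sample)
    # Eq. (9): empirical-frequency estimate of eps_Y.
    N = Counter(y for _, _, y in sample)
    eps_Y = min(N[0], N[1]) / n
    # Eq. (10): per-cell counts U(x1, x2) and V(x1, x2), then
    # the resubstitution error of the discrete histogram rule.
    cell = Counter(((x1, x2), y) for x1, x2, y in sample)
    eps_r = sum(min(cell[((x1, x2), 0)], cell[((x1, x2), 1)])
                for x1 in (0, 1) for x2 in (0, 1)) / n
    # Eq. (11), with the 0/0 = 1 convention.
    return 1.0 if eps_Y == 0 else 1 - eps_r / eps_Y

sample = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1), (1, 1, 0), (1, 1, 1)]
print(resub_cod(sample))  # 0.5
```

Here ε̂Y = 2/6 and ε̂ʳX,Y = 1/6 (the single inconsistency in the (1, 1) cell), giving a sample CoD of 0.5.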
IV. PARAMETRIC ML COD ESTIMATOR
If prior knowledge about the distribution of (X, Y) is available, in the form of the stochastic logic model of Section II, the corresponding parametric ML estimator ĈoDᴹᴸ(X, Y) is
obtained as follows. The true CoD in (7) is a function of the
model parameters p, P1, P2 and γ: CoD = g(p, P1, P2, γ) (for
the sake of simplicity, we will omit henceforth the explicit
reference to (X, Y ) in CoD notation). By the principle of ML
invariance [13], to obtain the ML estimators of the CoD one
plugs in the ML estimators of the model parameters into g.
For example, expressions for the function g were obtained in [2] for the two-input AND and XOR logic models of Section II:

CoD_AND = 1 − (1 − p) / F[P1P2 + γ + (1 − 2P1P2 − 2γ)p] ,    (12)

and

CoD_XOR = 1 − (1 − p) / F[A + (1 − 2A)p] ,    (13)

where F(x) = min{x, 1 − x}, for 0 ≤ x ≤ 1, and A = P1 + P2 − 2P1P2 − 2γ. Hence, the corresponding ML estimators for these logic models are

ĈoDᴹᴸ_AND = 1 − (1 − p̂) / F[P̂1P̂2 + γ̂ + (1 − 2P̂1P̂2 − 2γ̂)p̂] ,

ĈoDᴹᴸ_XOR = 1 − (1 − p̂) / F[Â + (1 − 2Â)p̂] ,    (14)

where Â = P̂1 + P̂2 − 2P̂1P̂2 − 2γ̂, and p̂, P̂1, P̂2, and γ̂ are given in (4). The ML CoD estimators for other two-input
logics can be similarly derived.
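The closed forms in (12)–(14) translate directly into plug-in estimators; a sketch with illustrative parameter values (the function names are ours):

```python
# Sketch: plug-in ML CoD estimators of eq. (14) for the AND and XOR
# logics; pass in the ML parameter estimates from eq. (4).
def F(x):
    return min(x, 1 - x)

def cod_ml_and(p, P1, P2, g):
    return 1 - (1 - p) / F(P1 * P2 + g + (1 - 2 * P1 * P2 - 2 * g) * p)

def cod_ml_xor(p, P1, P2, g):
    A = P1 + P2 - 2 * P1 * P2 - 2 * g
    return 1 - (1 - p) / F(A + (1 - 2 * A) * p)

# Sanity checks: p = 1 gives a perfect CoD, p = 1/2 gives zero
# (at p = 1/2 the argument of F collapses to 1/2 for any P1, P2, g).
print(cod_ml_and(1.0, 0.5, 0.5, 0.0))  # 1.0
print(cod_ml_xor(1.0, 0.5, 0.5, 0.0))  # 1.0
print(cod_ml_and(0.5, 0.5, 0.5, 0.0))  # 0.0
```

The same pattern applies to the other prototype logics once their g functions are derived.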
V. MULTIVARIATE PREDICTOR INFERENCE
This section reports the results of numerical experiments
that were conducted to assess the inference capability of ML
CoD estimators as compared with resubstitution, leave-one-
out, bootstrap and cross-validation CoD estimators, under an
incomplete prior knowledge assumption: namely, that a set
of candidate logics is known to contain the logics that can
appear as predictors of a target in the network. The smaller
this set is, the more prior knowledge about the model is
available.
Considering eight predictor genes, we randomly assign two
of them to regulate each of eight target genes with a stochastic
XOR logic function (this is thus an “XOR” network). We
construct 100 datasets based on one specific multivariate
predictive setting for varying predictive power and sample
size, respectively. Also, we use the same candidate logic set for all
targets; each candidate set contains the true logic, XOR
in our example. We consider three different candidate
logic sets: CL1 = {XOR (0110)}, CL2 = {XOR (0110),
OR (0111), (0100), (0010)}, and CL3 = {XOR (0110), OR (0111),
(0100), (0010), AND (0001), NAND (1110)}.
The inference method is as follows. For the resubstitution,
leave-one-out, bootstrap and cross-validation cases, we use the
largest estimated CoD to pick the best predictor set from the
C(8, 2) = 28 possible pairs, for each target. The best logic is decided by using the
histogram rule. For the parametric ML case, we first choose
the predictor set with the largest ML CoD estimate for each
target by assuming each logic in a given candidate logic set.
Next, we compute the ML estimates of p in (4) associated
with the chosen predictor sets for each target. Finally, we
choose the best predictor set of each target using the largest
one of its computed ML estimates of p, and the corresponding
candidate logic is the best logic. Using the synthetic datasets,
we compute the average percentage of predictors recovered
correctly by each CoD estimator.
Figure 3 describes the average percentage of predictors
correctly inferred via the ML, resubstitution, leave-one-out,
bootstrap, and cross-validation CoD estimators as a function
of sample size for varying predictive power and candidate
logic set. As the sample size increases, the average percentage
increases accordingly, and it converges to 1 faster. Generally,
the ML approach outperforms the others. When we have
complete prior knowledge about the model, i.e. the candidate
logic set consists of a single logic, CL1 = {XOR (0110)},
the ML approach recovers the largest percentage of predictors
correctly inferred. As we are less certain about the model,
that is, the number of candidate logics increases, the average
percentage of predictors correctly recovered decreases.
VI. CONCLUSION
We proposed in this paper, for the first time, to apply the
maximum likelihood principle to the problem of discrete CoD
estimation, in the case of a stochastic two-input logic model.
We showed that, given incomplete prior knowledge about the
model, the ML CoD estimator could recover more predictors
than the resubstitution, leave-one-out, bootstrap and cross-
validation in a multivariate prediction setting. Therefore, we
conclude that, provided one has evidence of particular logic
regulatory relationships, one should use the CoD estimator
based upon maximum likelihood estimation. Our future work
will focus on the extension to many-input stochastic logics and
to dynamic models, as well as its applications to multivariate
predictive inference problems.
ACKNOWLEDGEMENTS
The authors would like to acknowledge the financial support
afforded by NSF Award CCF-0845407.
APPENDIX
Proof of Proposition 1: We need to maximize the prediction accuracy (assuming p ≥ 1/2):

max_ψ P(ψ(X1, X2) = Y)
  = max_ψ Σ_{x1,x2} P(X1 = x1, X2 = x2, Y = ψ(x1, x2))
  = Σ_{x1,x2} max_ψ [ p P(X1 = x1, X2 = x2) 1_{ψ=f} + (1 − p) P(X1 = x1, X2 = x2) 1_{ψ≠f} ]
  = Σ_{x1,x2} p P(X1 = x1, X2 = x2) = p ,    (15)

where the maximum is attained by ψ*(X1, X2) = f(X1, X2).
Fig. 3. The average percentage of predictors correctly recovered vs. sample size, for varying predictive power and candidate logic set, respectively. [Figure: a 3×3 grid of panels, with columns CL1, CL2, CL3 and rows p = 0.85, 0.75, 0.65; each panel plots predictor recovery (%) against sample size for the ML, resubstitution, leave-one-out, cross-validation, and bootstrap CoD estimators.]
REFERENCES
[1] E.R. Dougherty, S. Kim, Y.D. Chen, "Coefficient of determination in nonlinear signal processing," Signal Processing 80 (2000) 2219–2235.
[2] D.C. Martins, U.M. Braga-Neto, R.F. Hashimoto, M.L. Bittner, E.R. Dougherty, "Intrinsically multivariate predictive genes," IEEE Journal of Selected Topics in Signal Processing 2 (3) (2008) 424–439.
[3] S. Kim, E.R. Dougherty, M.L. Bittner, Y. Chen, K. Sivakumar, P. Meltzer, J.M. Trent, "A general framework for the analysis of multivariate gene interaction via expression arrays," Journal of Biomedical Optics 5 (4) (2000) 411–424.
[4] S. Kim, E.R. Dougherty, Y. Chen, K. Sivakumar, P. Meltzer, J.M. Trent, M. Bittner, "Multivariate measurement of gene expression relationships," Genomics 67 (2000) 201–209.
[5] I. Shmulevich, E.R. Dougherty, S. Kim, W. Zhang, "Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks," Bioinformatics 18 (2) (2002) 261–274.
[6] X. Zhou, X. Wang, E.R. Dougherty, "Binarization of microarray data based on a mixture model," Molecular Cancer Therapeutics 2 (7) (2003) 679–684.
[7] T. Chen, U.M. Braga-Neto, "Exact performance of CoD estimators in discrete prediction," EURASIP Journal on Advances in Signal Processing: Special Issue on Genomic Signal Processing (2010).
[8] U.M. Braga-Neto, "Classification and error estimation for discrete data," Current Genomics 10 (7) (2009) 446–462.
[9] K. Murphy, S. Mian, "Modelling gene expression data using dynamic Bayesian networks," Technical Report, University of California, Berkeley, CA (1999).
[10] W. Qian, M.D. Riedel, "The synthesis of robust polynomial arithmetic with stochastic logic," Proceedings of the 45th Design Automation Conference (DAC 2008).
[11] S. Marshall, L. Yu, Y. Xiao, E.R. Dougherty, "Inference of a probabilistic Boolean network from a single observed temporal sequence," EURASIP Journal on Bioinformatics and Systems Biology (2007) 1–15.
[12] S.D. Silvey, Statistical Inference, Chapman and Hall, London, 1975.
[13] G. Casella, R.L. Berger, Statistical Inference, Duxbury Press, 2002.
[14] L. Devroye, L. Györfi, G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, New York, 1996.
[15] U.M. Braga-Neto, E.R. Dougherty, "Exact performance of error estimators for discrete classifiers," Pattern Recognition 38 (11) (2005) 1799–1814.