
Page 1: [IEEE 2011 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS) - San Antonio, TX, USA (2011.12.4-2011.12.6)] 2011 IEEE International Workshop on Genomic

Sample-Based Estimators for the Intrinsically Multivariate Prediction Score

Ting Chen
Department of Electrical Engineering
Texas A&M University
College Station, Texas 77843
Email: [email protected]

Ulisses Braga-Neto∗
Department of Electrical Engineering
Texas A&M University
College Station, Texas 77843
Email: [email protected] (Corresponding Author)

Abstract—Canalizing genes possess broad regulatory power over gene regulatory networks. In a previous publication, the concept of intrinsically multivariate predictive (IMP) genes was introduced and analyzed in the context of stochastic logic models. Furthermore, based on an empirical study of the DUSP1 gene, a canalizing gene in melanoma, it was hypothesized that canalizing genes possess IMP properties. In this paper, we study the problem of sample-based estimation of a gene IMP score. We study nonparametric IMP score estimators based on resubstitution, leave-one-out, cross-validation, and bootstrap, and introduce a maximum-likelihood (ML) IMP score estimator for a many-input stochastic logic model. Assuming two-input, three-input and four-input stochastic AND models, performance metrics of these estimators are calculated by Monte Carlo sampling. Our results show that the ML IMP score estimator outperforms the other estimators in RMS under the assumed stochastic logic model, followed by the resubstitution IMP score estimator. This indicates that, provided one has information about regulatory relationships in the network, the ML IMP score estimator is the estimator of choice, whereas resubstitution is to be preferred in the absence of prior knowledge.

Index Terms—Intrinsically Multivariate Prediction, Maximum Likelihood Estimation, Stochastic Logic, Gene Regulatory Networks, Prediction Error Estimation.

I. INTRODUCTION

The existence of canalizing genes that can constrain a biological system to particular functions was proposed by C. Waddington in [1]. Such canalizing genes are frequently found in signaling pathways, which deliver information from a variety of sources to the machinery that enacts central cellular functions such as cell cycle, survival, apoptosis and metabolism. For example, gene DUSP1 is canalizing in its phosphorylated state, which is a central component of a process-integrating pathway implicated in melanoma metastasis. Martins and collaborators [3] hypothesized that when the controlling gene is active, it cannot be well predicted by subsets of its predictor genes, but it can be predicted by the full set with great accuracy. Such a set of predictor genes is called Intrinsically Multivariate Predictive (IMP) for the target gene in [3], where it was shown that DUSP1 presents a large number of IMP gene sets in its pathway.

The concept of an IMP gene set is defined in terms of the binary Coefficient of Determination (CoD) [2]. As such, IMP depends on the probability model connecting predictors and target, which, however, is usually unknown or only partially known. Therefore, the problem arises of how to find sample-based IMP score estimators and characterize their performance. In [6], we studied the performance of four nonparametric sample-based CoD estimators, based on resubstitution, leave-one-out, cross-validation and bootstrap prediction error estimators. We introduce in this paper the corresponding nonparametric IMP score estimators. In addition, we extend the two-input stochastic logic models studied in [3, 5] to the many-input case and propose the Maximum-Likelihood (ML) IMP score estimator for this class of models, based on the corresponding maximum-likelihood CoD estimator introduced in [5]. Here, we consider the two-input, three-input and four-input stochastic AND models, and calculate approximate performance metrics, namely bias, variance and RMS, of the ML IMP score estimators as a function of predictive power using a Monte-Carlo sampling approach. The results indicate that the ML IMP score estimator is the estimator of choice under the assumed stochastic logic model, whereas resubstitution is to be preferred in the absence of prior knowledge.

The paper is organized as follows. Section II introduces the many-input stochastic logic model and the maximum-likelihood estimators of its parameters. In Section III, we define several IMP score estimators in analogy to the corresponding problem of CoD estimation, and describe how the bias, variance and RMS of these estimators are approximated using a Monte-Carlo sampling approach. Section IV compares the performance metrics of these IMP score estimators. Finally, Section V presents concluding remarks.

II. STOCHASTIC LOGIC MODEL

In Genomic Signal Processing, Boolean (logic) circuits play a prominent role in modeling gene regulatory networks [3, 4]. However, noise in the sample data affects the Boolean functions and causes inconsistency between the sample data and a deterministic logic circuit. To address this problem, we introduce next a many-input stochastic logic model, which extends the two-input logic model introduced in [3]. The logic gates in this class of models are replaced by a joint probability distribution between predictors and target.

Let $X = (X_1, \ldots, X_d)$ be a set of binary predictor variables and $Y$ be the target variable. To formulate


$P(X = x)$, we develop an approach to measure covariance among predictors. For any $\{i_1, i_2, \ldots, i_r\} \subseteq \{1, \ldots, d\}$, we define $\gamma(i_1, i_2, \ldots, i_r) = E[X_{i_1} X_{i_2} \cdots X_{i_r}] - E[X_{i_1}] E[X_{i_2}] \cdots E[X_{i_r}]$. Note that, for $d = 2$, $\gamma(1, 2) = E[X_1 X_2] - E[X_1] E[X_2]$. Based on this definition, the joint probability of $(X, Y)$ for the many-input stochastic logic model is given next without proof due to space limitations.

Many-Input Logic Model: Let $f : \{0, 1\}^d \to \{0, 1\}$ be a given Boolean function (logic gate), and let $S_d = \{1, 2, \ldots, d\}$. Then

$$
P(Y = 1 \mid X_1 = x_1, \ldots, X_d = x_d) = p^{f(x_1, \ldots, x_d)} (1 - p)^{1 - f(x_1, \ldots, x_d)} \qquad (1)
$$

while

$$
P(X_1 = x_1, \ldots, X_d = x_d) = \prod_{i=1}^{d} P_i^{x_i} (1 - P_i)^{1 - x_i} + (-1)^{\sum_{i=1}^{d} x_i} \times \sum_{\{i_1, \ldots, i_r\} \subseteq S_d} \left\{ (-1)^r \gamma(i_1, \ldots, i_r) \prod_{k \in S_d \setminus \{i_1, \ldots, i_r\}} (1 - x_k) \right\}, \qquad (2)
$$

where $p = P(f(X_1, \ldots, X_d) = Y)$ is the predictive power, $P_i = E[X_i] = P(X_i = 1)$, $i = 1, 2, \ldots, d$, are the predictor “biases” (the value 0.5 being considered unbiased), and $r \ge 2$. Eqs. (1) and (2) fully determine the joint distribution $P(X_1 = x_1, \ldots, X_d = x_d, Y = y) = P(Y = y \mid X_1 = x_1, \ldots, X_d = x_d) \, P(X_1 = x_1, \ldots, X_d = x_d)$.
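As a sanity check on Eq. (2), the following minimal Python sketch evaluates the predictor distribution for arbitrary d. The function name `predictor_pmf`, the zero-based index tuples used as keys of `gamma`, and the choice to treat missing γ entries as zero are assumptions of this sketch rather than part of the model definition; the parameter values are those listed for the three-input AND model in the caption of Fig. 1.

```python
from itertools import combinations, product
from math import prod


def predictor_pmf(P, gamma):
    """Evaluate Eq. (2) for every predictor state x in {0,1}^d.

    P     : list of predictor biases P_i = P(X_i = 1)
    gamma : dict mapping zero-based index tuples (length >= 2) to gamma values
            (hypothetical container; missing entries are treated as 0)
    """
    d = len(P)
    pmf = {}
    for x in product((0, 1), repeat=d):
        # Independent part: prod_i P_i^{x_i} (1 - P_i)^{1 - x_i}
        indep = prod(P[i] if x[i] else 1 - P[i] for i in range(d))
        # Correction part: sum over index subsets of size r >= 2
        corr = 0.0
        for r in range(2, d + 1):
            for idx in combinations(range(d), r):
                rest = prod(1 - x[k] for k in range(d) if k not in idx)
                corr += (-1) ** r * gamma.get(idx, 0.0) * rest
        pmf[x] = indep + (-1) ** sum(x) * corr
    return pmf


# Three-input parameters from the caption of Fig. 1 (indices shifted to start at 0).
P = [0.5, 0.6, 0.65]
gamma = {(0, 1): 0.01, (0, 2): 0.025, (1, 2): 0.035, (0, 1, 2): 0.02}
pmf = predictor_pmf(P, gamma)
total = sum(pmf.values())
```

With these values the eight probabilities are nonnegative and sum to one, and for instance `pmf[(1, 1, 1)]` equals P1·P2·P3 + γ(1, 2, 3), as expected from the definition of γ.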

The two-input logic model (i.e., $d = 2$) is a special case of the many-input logic model, and is given next.

Two-Input Logic Model: For a given Boolean function (logic gate) $f : \{0, 1\}^2 \to \{0, 1\}$, let

$$
P(Y = 1 \mid X_1 = x_1, X_2 = x_2) = p^{f(x_1, x_2)} (1 - p)^{1 - f(x_1, x_2)}, \qquad (3)
$$

$$
P(X_1 = x_1, X_2 = x_2) = P_1^{x_1} P_2^{x_2} (1 - P_1)^{1 - x_1} (1 - P_2)^{1 - x_2} + (-1)^{x_1 + x_2} \gamma. \qquad (4)
$$

Note that the predictive power $p$ concerns only the stochastic logic gate, whereas $P_1, \ldots, P_d$ and the $\gamma$'s up to order $d$ concern only the marginal distribution of the $d$ predictors. We will assume throughout that $p \ge 1/2$, since if $p < 1/2$ one obtains the negated logic gate with predictive power $1 - p$.

A. Maximum-Likelihood Estimation of Model Parameters

In the absence of complete distributional knowledge, one must estimate the model parameters from i.i.d. sample data $S_n = \{(X_{11}, \ldots, X_{1d}, Y_1), \ldots, (X_{n1}, \ldots, X_{nd}, Y_n)\}$, which is assumed to be drawn from the probability model. The ML estimators of the model parameters are obtained by substituting sample averages for expectations. This is the basic fact used to obtain the following proposition, which is given without proof.

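Proposition 1. The maximum-likelihood estimators of the parameters of the many-input logic model are given by

$$
\begin{aligned}
\hat{p} &= \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{f(X_{i1}, \ldots, X_{id}) = Y_i\}}, \\
\hat{P}_j &= \frac{1}{n} \sum_{i=1}^{n} X_{ij}, \quad \text{for } j = 1, 2, \ldots, d, \\
\hat{\gamma}(i_1, \ldots, i_r) &= \frac{1}{n} \sum_{j=1}^{n} \left[ \prod_{k=1}^{r} X_{j i_k} \right] - \frac{1}{n^r} \prod_{k=1}^{r} \left[ \sum_{j=1}^{n} X_{j i_k} \right], \quad \text{for } \{i_1, \ldots, i_r\} \subseteq S_d \text{ and } r \ge 2.
\end{aligned}
\qquad (5)
$$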

It is easy to show that $\hat{p}$ and $\hat{P}_i$, for $i = 1, 2, \ldots, d$, are minimum-variance unbiased, with $\mathrm{Var}[\hat{p}] = \frac{1}{n} p (1 - p)$ and $\mathrm{Var}[\hat{P}_i] = \frac{1}{n} P_i (1 - P_i)$, for $i = 1, 2, \ldots, d$. However, $\hat{\gamma}$ is a biased estimator. As ML estimators, all these estimators are asymptotically unbiased, asymptotically efficient, and consistent [7].

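As a special case, the maximum-likelihood estimators of the parameters of the two-input logic model are given by

$$
\hat{p} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{f(X_{i1}, X_{i2}) = Y_i\}}, \quad
\hat{P}_1 = \frac{1}{n} \sum_{i=1}^{n} X_{i1}, \quad
\hat{P}_2 = \frac{1}{n} \sum_{i=1}^{n} X_{i2}, \quad
\hat{\gamma} = \frac{1}{n} \sum_{i=1}^{n} X_{i1} X_{i2} - \frac{1}{n^2} \sum_{i=1}^{n} X_{i1} \sum_{i=1}^{n} X_{i2}.
$$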
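The estimators of Proposition 1 are plain sample averages, which the following minimal Python sketch makes explicit. The function name `ml_estimates`, the representation of a sample as a list of `(x, y)` pairs, and the zero-based index tuples are assumptions of this sketch; the four-point data set is a toy example, not data from the paper.

```python
from itertools import combinations
from math import prod


def ml_estimates(sample, gate):
    """ML estimates (p_hat, P_hat, gamma_hat) for the many-input logic model.

    sample : list of (x, y), with x a tuple of d bits and y a bit
    gate   : Boolean function f mapping the tuple x to 0 or 1
    """
    n = len(sample)
    d = len(sample[0][0])
    # p_hat: fraction of sample points on which the gate output agrees with Y
    p_hat = sum(gate(x) == y for x, y in sample) / n
    # P_hat[j]: sample mean of predictor j
    P_hat = [sum(x[j] for x, _ in sample) / n for j in range(d)]
    # gamma_hat: sample mean of the product minus the product of sample means
    gamma_hat = {}
    for r in range(2, d + 1):
        for idx in combinations(range(d), r):
            joint_mean = sum(prod(x[k] for k in idx) for x, _ in sample) / n
            gamma_hat[idx] = joint_mean - prod(P_hat[k] for k in idx)
    return p_hat, P_hat, gamma_hat


def and_gate(x):
    return int(all(x))


# Toy sample: the last point disagrees with the AND gate.
sample = [((1, 1), 1), ((1, 0), 0), ((0, 1), 0), ((0, 0), 1)]
p_hat, P_hat, gamma_hat = ml_estimates(sample, and_gate)
```

Note that the second term of the γ estimator in Eq. (5) is simply the product of the estimated predictor biases, which is how it is computed here.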

III. INTRINSICALLY MULTIVARIATE PREDICTION

The Coefficient of Determination (CoD) of $X$ with respect to $Y$ [2] is defined to be

$$
\mathrm{CoD}_Y(X) = \frac{\varepsilon_Y - \varepsilon_{X,Y}}{\varepsilon_Y}, \qquad (6)
$$

where $\varepsilon_Y$ is the optimal error of predicting $Y$ in the absence of other observations and $\varepsilon_{X,Y}$ is the optimal error based on the observations of $X$. The CoD measures the nonlinear multivariate relationship between predictors and target. By convention, one assumes $0/0 = 1$ in the above definition.

Given the many-input model (1), the CoD is expressed by

$$
\mathrm{CoD} = 1 - \frac{1 - p}{F\left( \sum_{x} P(Y = 1 \mid X = x) P(X = x) \right)}
= 1 - \frac{1 - p}{F\left( \sum_{x} p^{f(x)} (1 - p)^{1 - f(x)} P(X = x) \right)}, \qquad (7)
$$

where $F(x) = \min(x, 1 - x)$.
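As a quick numerical illustration of Eq. (7), consider the two-input AND model with the parameter values used for Fig. 1 ($P_1 = 0.4$, $P_2 = 0.5$, $\gamma = 0.005$) and a predictive power $p = 0.8$; this value of $p$ is chosen here only for illustration. For the AND gate, $f(x) = 1$ only at $x = (1, 1)$, and Eq. (4) gives $P(X = (1, 1)) = P_1 P_2 + \gamma = 0.205$, so that

$$
\sum_{x} P(Y = 1 \mid X = x) P(X = x) = 0.8 \times 0.205 + 0.2 \times 0.795 = 0.323,
\qquad
\mathrm{CoD} = 1 - \frac{0.2}{\min(0.323,\ 0.677)} \approx 0.381 .
$$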

Martins et al. (2008) introduced the concept of an intrinsically multivariate predictive (IMP) gene set: $X$ is said to be IMP for $Y$ with respect to $\lambda$ and $\delta$, for $0 \le \lambda < \delta \le 1$, if

$$
\max_{Z \subsetneq X} \mathrm{CoD}_Y(Z) \le \lambda \quad \text{and} \quad \mathrm{CoD}_Y(X) \ge \delta. \qquad (8)
$$

Subsequently, [3] defined the IMP score of a pair $(X, Y)$ as

$$
\mathrm{IMP}_Y(X) = \mathrm{CoD}_Y(X) - \max_{Z \subsetneq X} \mathrm{CoD}_Y(Z), \qquad (9)
$$

where $Z \neq \emptyset$. This definition is independent of $\lambda$ and $\delta$; instead one sets a threshold, and if the IMP score exceeds this threshold, then $X$ is said to be IMP for the target $Y$. In the two-predictor case, the IMP score is given by

$$
\mathrm{IMP}_Y(X_1, X_2) = \mathrm{CoD}_Y(X_1, X_2) - \max\{\mathrm{CoD}_Y(X_1), \mathrm{CoD}_Y(X_2)\}. \qquad (10)
$$
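The definitions in Eqs. (6) and (9) can be evaluated directly when the joint distribution of predictors and target is known. The sketch below is one minimal way to do so in Python; the function names, the dictionary representation of the joint distribution, and the zero-based predictor indices are assumptions of this sketch. The example uses a noise-free AND gate with independent, unbiased predictors ($p = 1$, $P_1 = P_2 = 0.5$, $\gamma = 0$), chosen only for illustration.

```python
from itertools import combinations


def bayes_error(joint, subset):
    """Optimal error of predicting Y from the predictors indexed by `subset`.

    joint : dict mapping (x, y) to a probability, x a tuple of bits, y a bit
    """
    cells = {}
    for (x, y), prob in joint.items():
        key = tuple(x[k] for k in subset)
        cell = cells.setdefault(key, [0.0, 0.0])
        cell[y] += prob  # accumulate P(observed predictors, Y = y)
    # In each cell the optimal predictor errs on the less likely label.
    return sum(min(cell) for cell in cells.values())


def cod(joint, subset):
    """Coefficient of Determination, Eq. (6), with the 0/0 = 1 convention."""
    eps_y = bayes_error(joint, ())
    if eps_y == 0:
        return 1.0
    return (eps_y - bayes_error(joint, subset)) / eps_y


def imp_score(joint, d):
    """IMP score, Eq. (9): full-set CoD minus the best proper-subset CoD."""
    full = tuple(range(d))
    proper = [s for r in range(1, d) for s in combinations(full, r)]
    return cod(joint, full) - max(cod(joint, s) for s in proper)


# Noise-free AND gate with independent, unbiased predictors.
joint = {
    ((0, 0), 0): 0.25,
    ((0, 1), 0): 0.25,
    ((1, 0), 0): 0.25,
    ((1, 1), 1): 0.25,
}
score = imp_score(joint, 2)
```

In this example neither predictor alone improves on guessing $Y = 0$, whereas the pair predicts $Y$ perfectly, so the IMP score reaches its maximum value of 1.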


A. Estimation of the IMP Score

In the CoD estimation problem [6], we defined “model-free” CoD estimators based on resubstitution, leave-one-out, 2-fold 10-repeated cross-validation and .632 bootstrap error estimators. Likewise, we introduce here the corresponding IMP score estimators: the resubstitution IMP score estimator ($\widehat{\mathrm{IMP}}_r$), the leave-one-out IMP score estimator ($\widehat{\mathrm{IMP}}_l$), the 2-fold 10-repeated cross-validation IMP score estimator ($\widehat{\mathrm{IMP}}_{cv}$) and the .632 bootstrap IMP score estimator ($\widehat{\mathrm{IMP}}_{b632}$), given by

$$
\widehat{\mathrm{IMP}}_Y(X) = \widehat{\mathrm{CoD}}_Y(X) - \max_{Z \subsetneq X} \widehat{\mathrm{CoD}}_Y(Z), \qquad (11)
$$

where $\widehat{\mathrm{IMP}}$ and $\widehat{\mathrm{CoD}}$ denote any one of the four IMP score estimators and the corresponding CoD estimator, respectively.
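To make Eq. (11) concrete, here is a minimal Python sketch of the resubstitution variant. It assumes that the predictor is the discrete histogram rule (majority vote within each observed predictor state) and that both error terms of the CoD are replaced by their resubstitution estimates; these choices, the function names, and the four-point toy sample are assumptions of this sketch, and the exact estimator definitions should be taken from [6].

```python
from collections import Counter
from itertools import combinations


def resub_error(sample, subset):
    """Resubstitution error of the majority-vote rule on the given predictors.

    sample : list of (x, y), with x a tuple of bits and y a bit
    subset : tuple of zero-based predictor indices (empty tuple: no predictors)
    """
    counts = Counter((tuple(x[k] for k in subset), y) for x, y in sample)
    cells = {key for key, _ in counts}
    # In each cell the minority label is misclassified by majority vote.
    errors = sum(min(counts[(c, 0)], counts[(c, 1)]) for c in cells)
    return errors / len(sample)


def resub_cod(sample, subset):
    """Plug-in CoD estimate, using the 0/0 = 1 convention."""
    eps_y = resub_error(sample, ())
    if eps_y == 0:
        return 1.0
    return (eps_y - resub_error(sample, subset)) / eps_y


def resub_imp(sample):
    """Eq. (11) with resubstitution CoD estimates."""
    d = len(sample[0][0])
    full = tuple(range(d))
    proper = [s for r in range(1, d) for s in combinations(full, r)]
    return resub_cod(sample, full) - max(resub_cod(sample, s) for s in proper)


# Toy sample consistent with a noise-free AND gate.
sample = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
estimate = resub_imp(sample)
```

The leave-one-out, cross-validation and bootstrap variants would follow the same pattern, with `resub_error` replaced by the corresponding error estimator.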

If prior knowledge about the distribution of $(X, Y)$ is available, in the form of the stochastic logic model of Section II, one can obtain a Maximum-Likelihood (ML) IMP score estimator ($\widehat{\mathrm{IMP}}_{\mathrm{ML}}$) as a function of ML CoD estimators [5]. Notice that, taking the two-input logic model as an example, the true IMP score in (9) is a function of the model parameters $p$, $P_1$, $P_2$ and $\gamma$: $\mathrm{IMP} = g(p, P_1, P_2, \gamma)$. By the principle of ML invariance [7], to obtain the ML IMP score estimator one plugs the ML estimators of the model parameters into $g$. For example, by combining (3), (4), (6) and (10), we obtain the true IMP score in the two-input AND model (for the sake of simplicity, we omit from this point on the explicit reference to $(X, Y)$ in the IMP notation):

$$
\mathrm{IMP} = 1 - \frac{F(p)}{F(A)} - \max\left\{ 1 - \frac{F(p)(1 - P_1) + F(B) P_1}{F(A)},\ 1 - \frac{F(p)(1 - P_2) + F(C) P_2}{F(A)} \right\}, \qquad (12)
$$

where $A = P_1 P_2 + \gamma + (1 - 2 P_1 P_2 - 2\gamma) p$, $B = ((P_1 - P_1 P_2 - \gamma) + (2 P_1 P_2 + 2\gamma - P_1) p) / P_1$, $C = ((P_2 - P_1 P_2 - \gamma) + (2 P_1 P_2 + 2\gamma - P_2) p) / P_2$ and $F(x) = \min(x, 1 - x)$, for $0 \le x \le 1$. Hence, the corresponding ML IMP score estimator for this logic model is

$$
\widehat{\mathrm{IMP}}_{\mathrm{ML}} = 1 - \frac{F(\hat{p})}{F(\hat{A})} - \max\left\{ 1 - \frac{F(\hat{p})(1 - \hat{P}_1) + F(\hat{B}) \hat{P}_1}{F(\hat{A})},\ 1 - \frac{F(\hat{p})(1 - \hat{P}_2) + F(\hat{C}) \hat{P}_2}{F(\hat{A})} \right\}, \qquad (13)
$$

where $\hat{A}$, $\hat{B}$ and $\hat{C}$ are obtained by replacing $p$, $P_1$, $P_2$, $\gamma$ with $\hat{p}$, $\hat{P}_1$, $\hat{P}_2$, $\hat{\gamma}$ in the expressions for $A$, $B$ and $C$, respectively. The ML IMP score estimator for the three-input or four-input logic model can be derived in a similar fashion.
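The closed form for the two-input AND model translates almost line by line into code. The following Python sketch evaluates it for given parameter values; passing the ML estimates instead of the true parameters yields the plug-in ML estimate. The function name, the use of the products $F(B) P_1$ and $F(C) P_2$ computed without division (to avoid dividing by a zero bias estimate), and the early return when the denominator vanishes (following the 0/0 = 1 convention for the CoD) are choices of this sketch. The numerical values reuse the Fig. 1 parameters together with an illustrative $p = 0.8$.

```python
def F(x):
    """F(x) = min(x, 1 - x), for 0 <= x <= 1."""
    return min(x, 1 - x)


def imp_and_two_input(p, P1, P2, gamma):
    """IMP score of the two-input stochastic AND model in closed form."""
    q = P1 * P2 + gamma                  # P(X1 = 1, X2 = 1)
    A = q + (1 - 2 * q) * p              # same A as in the text
    eps_y = F(A)                         # optimal error without predictors
    if eps_y == 0:
        return 0.0                       # every CoD equals 1 by convention
    # Numerators of B and C; F(B) * P1 equals min(num_b, P1 - num_b).
    num_b = (P1 - q) + (2 * q - P1) * p
    num_c = (P2 - q) + (2 * q - P2) * p
    cod_full = 1 - F(p) / eps_y
    cod_x1 = 1 - (F(p) * (1 - P1) + min(num_b, P1 - num_b)) / eps_y
    cod_x2 = 1 - (F(p) * (1 - P2) + min(num_c, P2 - num_c)) / eps_y
    return cod_full - max(cod_x1, cod_x2)


imp_true = imp_and_two_input(p=0.8, P1=0.4, P2=0.5, gamma=0.005)
imp_clean = imp_and_two_input(p=1.0, P1=0.5, P2=0.5, gamma=0.0)
```

For the first parameter set the intermediate errors are 0.323 (no predictors), 0.317 and 0.323 (single predictors) and 0.2 (both predictors), giving an IMP score of 0.117/0.323; the noise-free case gives an IMP score of 1.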

B. Performance of IMP Score Estimators

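Regarding the performance of an IMP score estimator $\widehat{\mathrm{IMP}}$, the quantities of interest are the bias, variance, and RMS, given by $\mathrm{Bias}[\widehat{\mathrm{IMP}}] = E[\widehat{\mathrm{IMP}}] - \mathrm{IMP}$, $\mathrm{Var}[\widehat{\mathrm{IMP}}]$, and $\mathrm{RMS}[\widehat{\mathrm{IMP}}] = \sqrt{\mathrm{Bias}[\widehat{\mathrm{IMP}}]^2 + \mathrm{Var}[\widehat{\mathrm{IMP}}]}$, respectively. A good IMP score estimator will display small values for all these metrics.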
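The expression for the RMS is the usual decomposition of the mean-square error. Writing $\hat{\theta}$ for the estimator and $\theta$ for the true IMP score (a shorthand used only in this short derivation), one has

$$
E[(\hat{\theta} - \theta)^2]
= E[(\hat{\theta} - E[\hat{\theta}])^2] + 2 (E[\hat{\theta}] - \theta) \, E[\hat{\theta} - E[\hat{\theta}]] + (E[\hat{\theta}] - \theta)^2
= \mathrm{Var}[\hat{\theta}] + \mathrm{Bias}[\hat{\theta}]^2,
$$

since the cross term vanishes. An estimator with small bias can therefore still have a large RMS if its variance is large.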

We employ a Monte-Carlo sampling approach [8] to approximate the bias, variance and RMS of the IMP score estimators. Assuming one specific logic model with known parameter values, we draw 5000 independent Monte-Carlo sample data sets from the joint probability distribution given by the product of Eq. (1) and Eq. (2). For each sample data set, we calculate the ML, resubstitution, leave-one-out, cross-validation, and bootstrap IMP score estimates. Then, we obtain the mean (and hence the bias), variance, and RMS of the corresponding IMP score estimators.
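The Monte-Carlo procedure can be sketched in a few lines of Python for the ML estimator in the two-input AND model. Everything named here (`imp_and`, `draw_sample`, `ml_imp`, `monte_carlo`, the fixed seed, and the reduced number of repetitions) is an assumption of this sketch; the parameter values are those of the two-input model in Fig. 1 with an illustrative $p = 0.8$, and the sketch is not meant to reproduce the reported curves.

```python
import random
from math import sqrt


def F(x):
    return min(x, 1 - x)


def imp_and(p, P1, P2, g):
    """IMP score of the two-input AND model (closed form, division-free in P1, P2)."""
    q = P1 * P2 + g
    eps_y = F(q + (1 - 2 * q) * p)
    if eps_y == 0:
        return 0.0
    n1 = (P1 - q) + (2 * q - P1) * p
    n2 = (P2 - q) + (2 * q - P2) * p
    eps_1 = F(p) * (1 - P1) + min(n1, P1 - n1)
    eps_2 = F(p) * (1 - P2) + min(n2, P2 - n2)
    # CoD(X1, X2) - max(CoD(X1), CoD(X2)) simplifies to this ratio.
    return (min(eps_1, eps_2) - F(p)) / eps_y


def draw_sample(n, p, P1, P2, g, rng):
    """Draw n i.i.d. points (x1, x2, y) from Eqs. (3) and (4) with an AND gate."""
    states = [(0, 0), (0, 1), (1, 0), (1, 1)]
    weights = [(1 - P1) * (1 - P2) + g, (1 - P1) * P2 - g,
               P1 * (1 - P2) - g, P1 * P2 + g]
    sample = []
    for x1, x2 in rng.choices(states, weights=weights, k=n):
        f = x1 & x2
        y = f if rng.random() < p else 1 - f  # Y agrees with the gate w.p. p
        sample.append((x1, x2, y))
    return sample


def ml_imp(sample):
    """Plug the ML parameter estimates into the closed-form IMP score."""
    n = len(sample)
    p_hat = sum((x1 & x2) == y for x1, x2, y in sample) / n
    P1_hat = sum(x1 for x1, _, _ in sample) / n
    P2_hat = sum(x2 for _, x2, _ in sample) / n
    g_hat = sum(x1 * x2 for x1, x2, _ in sample) / n - P1_hat * P2_hat
    return imp_and(p_hat, P1_hat, P2_hat, g_hat)


def monte_carlo(n, reps, p, P1, P2, g, seed=0):
    """Approximate bias, variance and RMS of the ML IMP score estimator."""
    rng = random.Random(seed)
    true_imp = imp_and(p, P1, P2, g)
    est = [ml_imp(draw_sample(n, p, P1, P2, g, rng)) for _ in range(reps)]
    mean = sum(est) / reps
    bias = mean - true_imp
    var = sum((e - mean) ** 2 for e in est) / reps
    return bias, var, sqrt(bias ** 2 + var)


bias, var, rms = monte_carlo(n=60, reps=500, p=0.8, P1=0.4, P2=0.5, g=0.005)
```

The other estimators can be assessed by swapping `ml_imp` for the corresponding estimator while keeping the sampling loop unchanged, so that all estimators are compared on the same sample data sets.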

IV. NUMERICAL EXPERIMENTS

Assuming a stochastic AND model, we plot the approximate performance metrics of the ML IMP score estimator as a function of predictive power in the two-input, three-input, and four-input cases. We also compare these with the approximate performance metrics for the resubstitution, leave-one-out, cross-validation and bootstrap IMP score estimators, as shown in Figure 1.

Figure 1 shows that, while a clearly superior estimator in bias does not emerge, the ML IMP score estimator is clearly the least variable estimator, whereas leave-one-out is generally the most variable one. Most importantly, we can see in the RMS column that the ML IMP score estimator outperforms all the others. Among the model-free estimators, resubstitution is clearly the superior choice in RMS. Notice also that, as the number of inputs m in the logic model (denoted d in Section II) increases, the amount by which the ML IMP score estimator beats the others in RMS also increases, since the complexity of estimation increases with larger m.

V. CONCLUSION

In this paper, we introduced the estimation problem for the intrinsically multivariate prediction (IMP) score. We proposed resubstitution, leave-one-out, cross-validation and bootstrap IMP score estimators. Furthermore, we developed the maximum-likelihood estimator for the IMP score under a stochastic many-input logic model. Assuming specific stochastic AND models, we compared their performance metrics via Monte-Carlo sampling. We conclude from our results that the ML IMP score estimator is the estimator of choice under the assumed model, whereas resubstitution is to be preferred in the absence of prior knowledge.

Martins et al. (2008) applied the ML CoD estimator for the two-input stochastic logic model to a real melanoma data set, and concluded that the IMP criterion can be applied as a practical tool for the identification of critical canalizing genes in real gene expression data. Our main goal in this paper was to validate the performance of the ML IMP score estimator as compared with the resubstitution, leave-one-out, cross-validation and bootstrap estimators from a theoretical perspective. Further research will focus on investigating and comparing how these IMP score estimators reveal the multivariate relationships in gene regulatory networks and identify canalizing genes in practice. It is hoped that the ML IMP score estimator will provide more accurate biological information than the others, given its advantageous performance shown in our theoretical analysis.

[Figure 1: three rows of panels (m = 2, m = 3, m = 4) by three columns (Bias, Variance, RMS); horizontal axis: predictive power, from 0.5 to 1.0; curves: ML, resub, loo, bootstrap, cv.]

Fig. 1. Bias, deviation variance, and RMS for several IMP score estimators vs. predictive power values in the two-input AND model (fixing P1 = 0.4, P2 = 0.5 and γ = 0.005), three-input AND model (fixing P1 = 0.5, P2 = 0.6, P3 = 0.65, γ(1, 2) = 0.01, γ(1, 3) = 0.025, γ(2, 3) = 0.035, and γ(1, 2, 3) = 0.02) and four-input AND model (fixing P1 = 0.5, P2 = 0.6, P3 = 0.65, γ(1, 2) = 0.01, γ(1, 3) = 0.025, γ(2, 3) = 0.035, and γ(1, 2, 3) = 0.02), assuming sample size n = 60, respectively. All curves are obtained via Monte-Carlo sampling.

REFERENCES

[1] C.H. Waddington, Canalization of development and the inheritance of acquired characters, Nature (1942) 563–565.

[2] E.R. Dougherty, S. Kim, Y.D. Chen, Coefficient of determination in nonlinear signal processing, Signal Processing 80 (2000) 2219–2235.

[3] D.C. Martins, U.M. Braga-Neto, R.F. Hashimoto, M.L. Bittner, E.R. Dougherty, Intrinsically multivariate predictive genes, IEEE Journal of Selected Topics in Signal Processing 2 (3) (2008) 424–439.

[4] I. Shmulevich, E.R. Dougherty, S. Kim, W. Zhang, Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks, Bioinformatics 18 (2) (2002) 261–274.

[5] T. Chen, U.M. Braga-Neto, Maximum Likelihood Estimation of the Binary Coefficient of Determination, Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, November 2011.

[6] T. Chen, U.M. Braga-Neto, Exact performance of CoD estimators in discrete prediction, EURASIP Journal on Advances in Signal Processing: Special Issue on Genomic Signal Processing (2010).

[7] G. Casella, R.L. Berger, Statistical Inference, Duxbury Press, 2002.

[8] C.P. Robert, G. Casella, Monte Carlo Statistical Methods, Springer, New York, 1999.