probabilisc models for interpreng perturba)ons in...

31
Probabilis)c models for interpre)ng perturba)ons in networks: Nested effect models Sushmita Roy [email protected] Computa)onal Network Biology Biosta2s2cs & Medical Informa2cs 826 Computer Sciences 838 hBps://compnetbiocourse.discovery.wisc.edu Dec 6 th 2016

Upload: vannga

Post on 28-Mar-2018

214 views

Category:

Documents


1 download

TRANSCRIPT

Probabilis)cmodelsforinterpre)ngperturba)onsinnetworks:Nested

effectmodels

[email protected]

Computa)onalNetworkBiologyBiosta2s2cs&MedicalInforma2cs826

ComputerSciences838hBps://compnetbiocourse.discovery.wisc.edu

Dec6th2016

Typesofalgorithmsusedtoexamineperturba)onsinnetworks

•  Graphdiffusionfollowedbysubnetworkfindingmethods–  HOTNET–  NETBAG

•  Informa2onflow-basedmethods(alsowidelyusedforintegra2ngdifferenttypesofdata)–  Prizecollec2ngsteinertree– Mincostmaxflow

•  Probabilis2cgraphicalmodel-basedmethods–  Factorgraphs–  NestedEffectModels(NEMs)

Probabilis)cgraphicalmodelsforinterpre)ngnetworkperturba)ons

•  C.-H.H.Yeang,T.Ideker,andT.Jaakkola,"Physicalnetworkmodels."Journalofcomputa.onalbiology:ajournalofcomputa.onalmolecularcellbiology,vol.11,no.2-3,pp.243-262,Mar.2004.

•  F.Markowetz,D.Kostka,O.G.Troyanskaya,andR.Spang,"Nestedeffectsmodelsforhigh-dimensionalphenotypingscreens,"Bioinforma.cs,vol.23,no.13,pp.i305-312,Jul.2007.

•  C.J.Vaske,C.House,T.Luu,B.Frank,C.-H.H.Yeang,N.H.Lee,andJ.M.Stuart,"Afactorgraphnestedeffectsmodeltoiden2fynetworksfromgene2cperturba2ons."PLoScomputa.onalbiology,vol.5,no.1,pp.e1000274+,Jan.2009.

Mo)va)onofnestedeffectmodels

•  Perturba2onofgenesfollowedbyhigh-throughputprofilingofdifferentphenotypescanbeusedtocharacterizefunc2onsofgenes

•  However,mostgenesdonotfunc2onindependentlybutinteractinanetworktodriveapar2cularfunc2on

•  Phenotypicmeasurements(e.g.mRNAlevels)areindirectmeasurementsoftheunderlyingnetworkstructure–  Includesdirectandindirecteffects

•  Givenperturba2ondatafrommul2plegenes,canwemoresystema2callyiden2fythefunc2onsofthesegenesandhowtheyinteractatapathwaylevel?

Problemoverview

•  Given–  globalmeasurementsofgeneexpressionajersinglegenedele2onsofmul2plegenes

•  Do–  Inferinterac2onsbetweengeneswithdele2onstoenablefurthercharacteriza2onofthesegenes

•  NestedEffectModelsareprobabilis2cmodel-basedapproachestosolvethisproblem

NestedEffectModels

Clustering defines groups of genes with similar phenotypicprofiles, but may miss the hierarchy in the observed perturba-tion effects, as is exemplified in Figure 1. Perturbing some genesmay have an influence on a global process, while perturbingothers affects subprocesses of it. Imagine, e.g. a signalingpathway activating several transcription factors (TFs). Blockingthe entire pathway will affect all targets of the TFs, whileperturbing a single downstream TF will only affect its directtargets, which are a subset of the phenotype obtained byblocking the complete pathway. Boutros et al. (2002) show thatby this reasoning non-transcriptional features of signalingpathways can be recovered from gene-expression profiles.However, no previous computational method is applicable toinfer models from biological subset relations on data setsscreening whole pathways.Nested effects models. We will call a model encoding the

(noisy) subset relations between the effects observed afterperturbing the target genes a Nested Effects Model (NEM).It can be seen as a generalization of similarity-based clustering,which orders (clusters of) genes according to subset relation-ships between the sets of phenotypes. In this article, we developa Bayesian method to infer NEM from large-scale data sets.Our method builds on preliminary work by Markowetz et al.,

(2005), which is specifically designed for inference from indirectinformation and also takes the imbalance between spuriousand missed effects into account. Previously, this method waslimited to small-scale scenarios of up to six genes, where modelsearch can be done by exhaustive enumeration. Scaling upmodelsearch to larger numbers of perturbed genes is a non-trivial

problem due to the constraints imposed on the model byhaving only indirect information of the underlying geneticnetwork. Here, we approach the problem of inferring a hierarchyon the set of all perturbed genes by constructing it from smallersub-models containing only pairs or triples of genes. Such ‘divide-and-conquer’-like approaches are regularly used in high-dimensional statistical inference, e.g. for estimating largephylogenetic trees (Strimmer and von Haeseler, 1996) orlearning Gaussian graphical models for regulatory networks(Wille et al., 2004). Our resulting method is the first one to makeinference of NEMs feasible on a pathway-wide scale.The next section introduces our novel methodology in detail.

In Section 3, we demonstrate the applicability of our methodsin a controlled simulation study, and in Section 4 we describeresults for two experimental data sets. We show that the subsetrelations retrieved actually reflect the regulatory functions ofthe genes involved.

2 ALGORITHM

Data. We assume that data is given in the form of a binarymatrix D with columns corresponding to perturbation experi-ments on one of n genes (replicates are possible) and rows toone of m possible effects E1, . . . ,Em. A phenotypic profile Px ofgene x consists of a binary vector of length m with a PxðEiÞ ¼ 1denoting that effect Ei occurred after perturbing gene x, andPxðEiÞ ¼ 0 denoting that it did not.Subset relations between phenotypic profiles. Instead of

similarity, we will consider subset relations between phenotypicprofiles. We say that gene x is upstream of gene y

(c) Nested Effects Model(a) Data (b) Clustering

A B C D GFE H

A B

C D E F

G H

A B

C D

E F

(d) Subset structure

G H

G HA B C D EF

Fig. 1. An introduction to Nested Effects Models. Plot (a) shows a toy dataset consisting of phenotypic profiles for eight perturbed genes (A, . . . ,H).Each profile is binary with black coding for an observed effect and white for an effect not observed. The eight profiles are hierarchically clustered,showing that they fall into four pairs of genes with almost identical phenotypic profiles: (A,B), (C,D), (E,F) and (G,H), as shown in plot (b). Animportant feature of the data missed by clustering is the subset structure visible between the profiles in the data set: the effects observed whenperturbing genes A or B are a superset to the effects observed for all other genes. The effects of perturbing G or H are a subset to all other genes’effects. The pairs (C,D) and (E,F) have different but overlapping effect sets. The directed acyclic graph (DAG) shown in plot (c) represents thesesubset relations, which are shown in plot (d). Compared to the clustering result in plot (b) the NEM additionally elucidates relationships between theclusters and thus describes the dominant features of the data set better.

F.Markowetz et al.

i306

at University of W

isconsin-Madison Libraries on N

ovember 11, 2015

http://bioinformatics.oxfordjournals.org/

Dow

nloaded from

Markowetzetal,2007

NestedEffectModelsKeyproper)es

•  Ageneraliza2onofsimilaritybasedclustering•  Orderstheclustersaccordingtosubsetrela2onships–  AgeneAisupstreamofanothergeneBifB’seffectsareasubsetofA’seffects

•  Buildahierarchyofallperturbedgenesbyconstruc2ngfromsmallersub-modelsofpairsandtripletsofgenes

Subsetrela)onshipstoordergenes

(and write x! y) if the set of effects in Py is a subset of the setof effects in Px:

x! y , fi : PyðEiÞ ¼ 1g $ fi : PxðEiÞ ¼ 1g: ð1Þ

A subset relation is reflexive and transitive, and thus definesa quasi-order on phenotypic profiles. We depict the quasi-orderin a directed graph in which nodes correspond to geneperturbations and edges indicate subset relations according toEquation (1). The reflexive self-loops at nodes are usuallyomitted. Transitivity is the key feature of our model: wheneverthere is a path from one node to another, we also have adirected edge between these two nodes in the graph.

2.1 Bayesian inference for NEM models

Posterior probability A Bayesian score to evaluate how well acandidate NEM fits to the observed data can be obtained intwo steps (Markowetz et al., 2005). First, assume that it isknown which effect is specific for which perturbed gene. We callthis the complete model, and an example is given in Figure 2.A complete model M0 ¼ ðM,!Þ consists of a transitively closedgraph, M, and parameters ! ¼ f!1, . . . , !mg. The nodes of Mcorrespond to perturbed genes, and the parameters ! describethe allocation of specific effects to perturbed genes (i.e. thedashed arrows in the left plot of Fig. 2). The complete modeldefines which effects we expect to observe (see the middle plotof Fig. 2). We can directly compute the complete likelihood ofthe actually observed data D under the model ðM,!Þ by:

PðDjM,!Þ ¼Ym

i¼1

Yl

k¼1PðeikjM, !iÞ, ð2Þ

where, the first product is over all effects E1, . . . ,Em and thesecond over all replicates of gene perturbation experiments. Theprobability PðeikjM, !iÞ depends on two parameters: a FP rateof seeing a spurious effect, " (type-I error rate), and a FN rateof missing an effect, # (type-II error rate).However, in real data, it is not known which effect is specific

for which intervention, i.e. ! is unknown. Thus, in a secondstep, we average over ! to gain the likelihood of the data,which is proportional to the posterior probability of the NEMand can be written as:

PðDjMÞ /Ym

i¼1

Xn

j¼1

Yl

k¼1PðeikjM, !i ¼ jÞ, ð3Þ

where the two products are the same as in Equation (2), and thesum is due to marginalization over !.Size of model space. NEMs are defined in terms of quasi-

orders, i.e. transitively closed graphs. The number of quasi-orders is known for up to 16 nodes (Sloane, 2007, seq. A000798).For n¼ 7, we already have almost 107 possible quasi-orders andfor n¼ 8 the number is > 6 % 109. Thus, exhaustive enumerationis infeasible even for medium-sized studies. For large-scalescreens, we need search heuristics to explore model space. Ourapproach to this problem is to concentrate on small sub-modelsinvolving only pairs or triples of nodes.

2.2 Inference of pairwise relations

The smallest possible sub-model consists of pairs of genes. Weinfer pairwise relations by choosing between four models foreach gene pair (x, y): either x! y (‘‘upstream’’, effects of x area superset of the effects of y), or x y (‘‘downstream’’, effectsof x are a subset of the effects of y), or x$ y (the effects of xand y are undistinguishable) or x % % y (x and y are unrelated).For every pair (x, y), we compute the Bayesian score detailedabove and select the maximum aposteriori (MAP) modelMxy 2 fx y, x! y, x$ y, x % % yg.The greatest advantage of this procedure is the increase in

speed. The number of models we have to score for n genes isn2

! "% 4, which grows quadratically in the number of perturbed

genes and remains feasible even for hundreds of genes.Additionally, building up the final graph is easy, since it isdefined by the set of all pairwise MAP models.These advantages come at a cost. The most serious problem

is that pairwise learning treats all edges independently of eachother. But in a transitive graph, there must be a shortcut x! ywhenever there exists a longer path from x to y. To see howeasily mistakes can be introduced in pairwise inference,consider the example in Figure 2. In the observed data(rightmost plot), the profiles of x and z seem non-overlapping(because of the FNs at E5 and E6), so the edge x! z could bemissed. One can also think of scenarios, where noise inthe data induces spurious edges in pairwise inference. Toaddress these problems, we concentrate on triples of nodes inthe next section.

2.3 Inference of triple relations

Inference from triples of genes instead of pairs is a natural wayto extend our inference method beyond the independence

M′xyz:

X Y Z

Expected Observed

X X

E1 E2 E3 E4 E5 E6E1E1 E2E2 E3E3 E4E4 E5E5 E6E6

FN FN

FNFPY Y

Z Z

Fig. 2. A complete model. The left part of the figure shows a complete model M0xyz consisting of a transitively closed graph between genes andassignments of genes to specific effects (the dashed arrows). Given the complete model, we can formulate a prediction of what effects to expect:perturbing x should cause all effects, while perturbing y should only cause E3–E6, and perturbing z only E5 and E6 (middle plot). In reality, ourobservations will be noisy: there can be false positive (FP) and false negative (FN) effect observations (right plot).

Nested effects models for high-dimensional phenotyping screens

i307

at University of W

isconsin-Madison Libraries on N

ovember 11, 2015

http://bioinformatics.oxfordjournals.org/

Dow

nloaded from

Acompletemodel.ThelejpartofthefigureshowsacompletemodelM’xyzconsis2ngofatransi2velyclosedgraphbetweengenesandassignmentsofgenestospecificeffects(thedashedarrows).Giventhecompletemodel,wecanformulateapredic2onofwhateffectstoexpect:perturbingxshouldcausealleffects,whileperturbingyshouldonlycauseE3–E6,andperturbingzonlyE5andE6(middleplot).Inreality,ourobserva2onswillbenoisy:therecanbefalseposi2ve(FP)andfalsenega2ve(FN)effectobserva2ons(rightplot).

Probabilis)cgraphicalmodelsforinterpre)ngnetworkperturba)ons

•  C.-H.H.Yeang,T.Ideker,andT.Jaakkola,"Physicalnetworkmodels."Journalofcomputa.onalbiology:ajournalofcomputa.onalmolecularcellbiology,vol.11,no.2-3,pp.243-262,Mar.2004.

•  F.Markowetz,D.Kostka,O.G.Troyanskaya,andR.Spang,"Nestedeffectsmodelsforhigh-dimensionalphenotypingscreens,"Bioinforma.cs,vol.23,no.13,pp.i305-312,Jul.2007.

•  C.J.Vaske,C.House,T.Luu,B.Frank,C.-H.H.Yeang,N.H.Lee,andJ.M.Stuart,"Afactorgraphnestedeffectsmodeltoiden2fynetworksfromgene2cperturba2ons."PLoScomputa.onalbiology,vol.5,no.1,pp.e1000274+,Jan.2009.

Keyproper)esofFactorGraph-NEMs(FG-NEMs)

•  NEMsassumethegenesthatareperturbedinteractinabinarymanner

•  Butmanyinterac2onshavesign–  inhibitoryors2mula2ngac2on

•  FG-NEMscaptureabroadersetofinterac2onsamongtheperturbedgenes

•  Formula2onbasedonaFactorGraph–  ProvideanefficientsearchoverthespaceofNEMs

Nota)on

•  S-genes:Setofgenesthathavebeendeletedindividually•  E-genes:Setofeffectorgenesthataremeasured•  Θ:TheaBachmentofaneffectorgenetotheS-genenetwork•  Φ:Theinterac2onmatrixofS-genes•  X:Thephenotypicprofile,eachcolumngivesthedifferencein

expressioninaknockoutcomparedtowildtype–  Rows:E-genes–  Columns:S-genes

•  Y:Hiddeneffectmatrix,eachentryis{-1,0,+1}whichspecifieswhetheranS-geneaffectstheE-gene

are listed from top to bottom according to where they are attachedto the network. Depending on the connections of the S-genes toone another and to the E-genes, a disruption in an S-gene willcause E-genes to either increase or decrease in expression relativeto wild-type. For example, E-gene E7 decreases under DB relativeto wild-type because the wild-type activation by B is absent in thedeletion. On the other hand, the expression of E10 also decreasesunder DB relative to wild-type but as a result of a differentmechanism. In wild-type, E10 is expressed at a baseline levelbecause its repressor, the product of gene D, is inhibited by B’sproduct. However, in the B deletion, D is derepressed, leading toinhibition of E10. This toy example illustrates that the disambig-uation of inhibition and activation, both for S-gene interactionsand E-gene attachments, make it possible to account for an

expanded set of mechanisms leading to the observed expressionchanges.

The E-gene expression changes are available in a data matrix Xwhere each column gives the difference in expression of each E-gene under the deletion of a single S-gene relative to wild-type. Xmay also contain replicates in the form of repeated S-gene knock-downs. The entry XeAr represents e’s expression change under therth replicate of DA. Furthermore, we assume that an unknownexpression ‘‘state’’ for each E-gene under each knock-down,determines its set of expression changes observed across the {XeAr}replicates in the microarray data. The matrix, Y, records a hiddenstate for each E-gene under each knock-down, where entry YeA isthe state of E-gene e under DA. We allow the states to be ternary-valued {+1, 21, 0} representing whether e is up-regulated, down-

Figure 1. Predicting Pair-wise Interaction Using Quantitative Nested Effects. (A) Hypothetical example with four S-genes, A, B, C, and D. Thegraph contains one inhibitory link, BxD (left). A heatmap of E-gene expression under knockdown of each S-gene shows both inhibitory andstimulatory effects (middle). Scatter plots of the C, A, B, and D knock-outs show that expression fits in the shaded preferred regions of each interaction(right). The inhibitory link explains some of the ‘‘observed’’ data: expression changes under DD (bright red or bright green entries in the heatmap)occur in a subset of the E-genes for which the opposite changes occur in DB. (B) Data from a known inhibitory interaction. Expression levels of effectgenes under the DIG1/DIG2 knock-out (y-axis) plotted against their levels under the STE2 knock-out (x-axis) as detected in [17]. Expression changessignificant at a = 0.05 indicated in gray lines. DIG1/DIG2 is known to inhibit STE12. (C) Interaction modes. Observed E-gene expression changes arecompared to five possible types of interactions between two S-genes, A and B (i–v). The top row illustrates the expected nested effects relationshipfor each type of interaction mode: circles represent sets of E-genes with expression changes consistent with either activation (blue circles) orinhibition (yellow circles). Scatter-plots for each interaction mode show the hypothetical expression changes under DA (x-axis) and DB (y-axis) for all E-genes (circles). E-gene levels are either consistent (filled) or inconsistent (open) with the mode. Shaded regions demark expression levels consistentwith each interaction model. The example shows expression changes that most closely match the inhibition mode (indicated by the greatest numberof closed circles).doi:10.1371/journal.pcbi.1000274.g001

Factor Graph Nested Effects Model

PLoS Computational Biology | www.ploscompbiol.org 3 January 2009 | Volume 5 | Issue 1 | e1000274

A->CisreflectedinthescaBerplot.WhenXAisup,XCisup.WhenXAisdown,XCisdownornochange

B-IDisalsoreflectedinthescaBerplot.XDisasubsetofoppositechangesfromXB

S-genes

E-genes

ΦΘ

X

Anexampleof4S-genesand13E-gens

S-geneinterac)onmodesandtheirexpressionsignatures

are listed from top to bottom according to where they are attachedto the network. Depending on the connections of the S-genes toone another and to the E-genes, a disruption in an S-gene willcause E-genes to either increase or decrease in expression relativeto wild-type. For example, E-gene E7 decreases under DB relativeto wild-type because the wild-type activation by B is absent in thedeletion. On the other hand, the expression of E10 also decreasesunder DB relative to wild-type but as a result of a differentmechanism. In wild-type, E10 is expressed at a baseline levelbecause its repressor, the product of gene D, is inhibited by B’sproduct. However, in the B deletion, D is derepressed, leading toinhibition of E10. This toy example illustrates that the disambig-uation of inhibition and activation, both for S-gene interactionsand E-gene attachments, make it possible to account for an

expanded set of mechanisms leading to the observed expressionchanges.

The E-gene expression changes are available in a data matrix Xwhere each column gives the difference in expression of each E-gene under the deletion of a single S-gene relative to wild-type. Xmay also contain replicates in the form of repeated S-gene knock-downs. The entry XeAr represents e’s expression change under therth replicate of DA. Furthermore, we assume that an unknownexpression ‘‘state’’ for each E-gene under each knock-down,determines its set of expression changes observed across the {XeAr}replicates in the microarray data. The matrix, Y, records a hiddenstate for each E-gene under each knock-down, where entry YeA isthe state of E-gene e under DA. We allow the states to be ternary-valued {+1, 21, 0} representing whether e is up-regulated, down-

Figure 1. Predicting Pair-wise Interaction Using Quantitative Nested Effects. (A) Hypothetical example with four S-genes, A, B, C, and D. Thegraph contains one inhibitory link, BxD (left). A heatmap of E-gene expression under knockdown of each S-gene shows both inhibitory andstimulatory effects (middle). Scatter plots of the C, A, B, and D knock-outs show that expression fits in the shaded preferred regions of each interaction(right). The inhibitory link explains some of the ‘‘observed’’ data: expression changes under DD (bright red or bright green entries in the heatmap)occur in a subset of the E-genes for which the opposite changes occur in DB. (B) Data from a known inhibitory interaction. Expression levels of effectgenes under the DIG1/DIG2 knock-out (y-axis) plotted against their levels under the STE2 knock-out (x-axis) as detected in [17]. Expression changessignificant at a = 0.05 indicated in gray lines. DIG1/DIG2 is known to inhibit STE12. (C) Interaction modes. Observed E-gene expression changes arecompared to five possible types of interactions between two S-genes, A and B (i–v). The top row illustrates the expected nested effects relationshipfor each type of interaction mode: circles represent sets of E-genes with expression changes consistent with either activation (blue circles) orinhibition (yellow circles). Scatter-plots for each interaction mode show the hypothetical expression changes under DA (x-axis) and DB (y-axis) for all E-genes (circles). E-gene levels are either consistent (filled) or inconsistent (open) with the mode. Shaded regions demark expression levels consistentwith each interaction model. The example shows expression changes that most closely match the inhibition mode (indicated by the greatest numberof closed circles).doi:10.1371/journal.pcbi.1000274.g001

Factor Graph Nested Effects Model

PLoS Computational Biology | www.ploscompbiol.org 3 January 2009 | Volume 5 | Issue 1 | e1000274

ExpectedtrendinE-genesforspecificinterac2onmodes

Interac2onmode

Consistentwithexpecta2on

Inconsistentwithexpecta2on

Factorgraphs

•  Atypeofgraphicalmodel•  Abi-par2tegraphwithvariablenodesandfactornodes•  Edgesconnectvariablestopoten2alsthatthevariablesare

argumentsof•  Representsaglobalfunc2onasproductofsmallerlocal

func2ons•  Perhapsthemostgeneralgraphicalmodel–  BayesiannetworksandMarkovnetworkshavefactorgraphrepresenta2ons

Examplefactorgraph500 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 47, NO. 2, FEBRUARY 2001

Fig. 1. A factor graph for the product.

of five factors, so that , ,, , , and

. The factor graph that corresponds to (2) is shown inFig. 1.

A. Expression TreesIn many situations (for example, when rep-

resents a joint probability mass function), we are interested incomputing the marginal functions . We can obtain an ex-pression for each marginal function by using (2) and exploitingthe distributive law.To illustrate, we write from Example 1 as

or, in summary notation

(3)

Similarly, we find that

(4)

In computer science, arithmetic expressions like theright-hand sides of (3) and (4) are often represented by or-dered rooted trees [28, Sec. 8.3], here called expression trees,in which internal vertices (i.e., vertices with descendants)represent arithmetic operators (e.g., addition, multiplication,negation, etc.) and leaf vertices (i.e., vertices without descen-dants) represent variables or constants. For example, the tree ofFig. 2 represents the expression . When the operatorsin an expression tree are restricted to those that are completelysymmetric in their operands (e.g., multiplication and addition),

Fig. 2. An expression tree representing .

it is unnecessary to order the vertices to avoid ambiguity ininterpreting the expression represented by the tree.In this paper, we extend expression trees so that the leaf ver-

tices represent functions, not just variables or constants. Sumsand products in such expression trees combine their operands inthe usual (pointwise) manner in which functions are added andmultiplied. For example, Fig. 3(a) unambiguously represents theexpression on the right-hand side of (3), and Fig. 4(a) unambigu-ously represents the expression on the right-hand side of (4). Theoperators shown in these figures are the function product and thesummary, having various local functions as their arguments.Also shown in Figs. 3(b) and 4(b), are redrawings of the factor

graph of Fig. 1 as a rooted tree with and as root vertex,respectively. This is possible because the global function de-fined in (2) was deliberately chosen so that the correspondingfactor graph is a tree. Comparing the factor graphs with the cor-responding trees representing the expression for the marginalfunction, it is easy to note their correspondence. This observa-tion is simple, but key: when a factor graph is cycle-free, thefactor graph not only encodes in its structure the factorizationof the global function, but also encodes arithmetic expressionsby which the marginal functions associated with the global func-tion may be computed.Formally, as we show in Appendix A, to convert a cycle-free

factor graph representing a function to the cor-responding expression tree for , draw the factor graph asa rooted tree with as root. Every node in the factor graphthen has a clearly defined parent node, namely, the neighboringnode through which the unique path from to must pass. Re-place each variable node in the factor graph with a product op-erator. Replace each factor node in the factor graph with a “formproduct and multiply by ” operator, and between a factor nodeand its parent , insert a summary operator. These

local transformations are illustrated in Fig. 5(a) for a variablenode, and in Fig. 5(b) for a factor node with parent . Trivialproducts (those with one or no operand) act as identity opera-tors, or may be omitted if they are leaf nodes in the expressiontree. A summary operator applied to a function with asingle argument is also a trivial operation, andmay be omitted.Applying this transformation to the tree of Fig. 3(b) yields theexpression tree of Fig. 3(a), and similarly for Fig. 4. Trivial op-erations are indicated with dashed lines in these figures.

B. Computing a Single Marginal FunctionEvery expression tree represents an algorithm for computing

the corresponding expression. One might describe the algorithmas a recursive “top-down” procedure that starts at the root vertexand evaluates each subtree descending from the root, combiningthe results as dictated by the operator at the root. Equivalently,we prefer to describe the algorithm as a “bottom-up” procedurethat begins at the leaves of the tree, with each operator vertex

FromKschischang,Frey,Loeliger2001

Variablenodes

500 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 47, NO. 2, FEBRUARY 2001

Fig. 1. A factor graph for the product.

of five factors, so that , ,, , , and

. The factor graph that corresponds to (2) is shown inFig. 1.

A. Expression TreesIn many situations (for example, when rep-

resents a joint probability mass function), we are interested incomputing the marginal functions . We can obtain an ex-pression for each marginal function by using (2) and exploitingthe distributive law.To illustrate, we write from Example 1 as

or, in summary notation

(3)

Similarly, we find that

(4)

In computer science, arithmetic expressions like theright-hand sides of (3) and (4) are often represented by or-dered rooted trees [28, Sec. 8.3], here called expression trees,in which internal vertices (i.e., vertices with descendants)represent arithmetic operators (e.g., addition, multiplication,negation, etc.) and leaf vertices (i.e., vertices without descen-dants) represent variables or constants. For example, the tree ofFig. 2 represents the expression . When the operatorsin an expression tree are restricted to those that are completelysymmetric in their operands (e.g., multiplication and addition),

Fig. 2. An expression tree representing .

it is unnecessary to order the vertices to avoid ambiguity ininterpreting the expression represented by the tree.In this paper, we extend expression trees so that the leaf ver-

tices represent functions, not just variables or constants. Sumsand products in such expression trees combine their operands inthe usual (pointwise) manner in which functions are added andmultiplied. For example, Fig. 3(a) unambiguously represents theexpression on the right-hand side of (3), and Fig. 4(a) unambigu-ously represents the expression on the right-hand side of (4). Theoperators shown in these figures are the function product and thesummary, having various local functions as their arguments.Also shown in Figs. 3(b) and 4(b), are redrawings of the factor

graph of Fig. 1 as a rooted tree with and as root vertex,respectively. This is possible because the global function de-fined in (2) was deliberately chosen so that the correspondingfactor graph is a tree. Comparing the factor graphs with the cor-responding trees representing the expression for the marginalfunction, it is easy to note their correspondence. This observa-tion is simple, but key: when a factor graph is cycle-free, thefactor graph not only encodes in its structure the factorizationof the global function, but also encodes arithmetic expressionsby which the marginal functions associated with the global func-tion may be computed.Formally, as we show in Appendix A, to convert a cycle-free

factor graph representing a function to the cor-responding expression tree for , draw the factor graph asa rooted tree with as root. Every node in the factor graphthen has a clearly defined parent node, namely, the neighboringnode through which the unique path from to must pass. Re-place each variable node in the factor graph with a product op-erator. Replace each factor node in the factor graph with a “formproduct and multiply by ” operator, and between a factor nodeand its parent , insert a summary operator. These

local transformations are illustrated in Fig. 5(a) for a variablenode, and in Fig. 5(b) for a factor node with parent . Trivialproducts (those with one or no operand) act as identity opera-tors, or may be omitted if they are leaf nodes in the expressiontree. A summary operator applied to a function with asingle argument is also a trivial operation, andmay be omitted.Applying this transformation to the tree of Fig. 3(b) yields theexpression tree of Fig. 3(a), and similarly for Fig. 4. Trivial op-erations are indicated with dashed lines in these figures.

B. Computing a Single Marginal FunctionEvery expression tree represents an algorithm for computing

the corresponding expression. One might describe the algorithmas a recursive “top-down” procedure that starts at the root vertexand evaluates each subtree descending from the root, combiningthe results as dictated by the operator at the root. Equivalently,we prefer to describe the algorithm as a “bottom-up” procedurethat begins at the leaves of the tree, with each operator vertex

Factornodes

Probabilis)cmodelforNEMs

•  Goalistofindanetwork,ΦandΘthatbestfittheobserveddata(X)

•  Thisisaninferenceproblem•  UseaMaximumaposterior(MAP)approach

J(X) = max�,✓P (�, ✓|X)

J(X) = max�,✓

X

Y

P (�, ✓, Y |X)

XisanoisymeasurementofY.Yisthequan2tyweneedtosumover

Probabilis)cmodelcon)nued

regulated, or unchanged under DA relative to wild-type respec-tively.

Nested effects models include two sets of parameters. Theparameter set W records all pair-wise interactions among the S-genes and the parameter set H describes how each E-gene isattached to the network of S-genes. In the original NEMformulations [10,15,18] W is a binary matrix with entry wAB setto one if S-gene A acts above S-gene B and zero otherwise. IfwAB = wBA = 1 then the S-genes are assumed to operate at anequivalent position in the pathway. Note that indirect interactionsare also represented in W so that if wAB = 1 and wBC = 1 it implieswAC = 1. A parsimonious network among the S-genes is solved forby computing the transitive reduction of W.

To allow for both stimulatory and inhibitory interactions inour formulation, wAB can assume six possible values for eachunique unordered S-gene pair {A, B}. We refer to these values asinteraction modes. The possible values are: (i) A activates B, ARB;(ii) A inhibits B, AxB; (iii) A is equivalent to B, A = B; (iv) A doesnot interact with B, A?B; (v) B activates A, BRA; and (vi) Binhibits A, BxA.

Plotting the response of E-genes under DA and DB yields ascatter-plot that may provide a signature for the type of interactionbetween A and B. For example, Figure 1B shows a scatter-plot ofgene expression changes from the Hughes et al. (2000) yeastknock-out compendium [17] for a pair of knock-outs of the well-known pheromone-response genes: DSTE12 and the DDIG1/DIG2 double knock-out. Comparing the scatter-plot for thesepheromone-response genes to the patterns in Figure 1C, it can beseen to match the inhibitory interaction mode more closely thanthe other modes, which is consistent with DIG1/DIG2’s knowninhibition of STE12. Figure 1C shows an example of the first fourmodes. Shaded regions denote consistent E-gene responses foreach mode. An interaction mode determines a constraint on theobserved E-gene expression changes. For example, plotting theexpression changes of E-genes that act downstream of either A orB for the generic A.B interaction mode produces points in one ofthe seven shaded regions shown in Figure 1Cv. Figure 1Cii showsan example where the inhibitory interaction mode is the bestmatch to the data because a higher number of E-gene changes fallwithin consistent regions (filled circles in the figure). In thismanner, genomewide expression changes detected on the micro-arrays can be used as quantitative phenotypes to identify a varietyof interactions between pairs of S-genes.

Note that two genes are equivalent if their knock-downs lead tosignificantly similar expression changes, which may predict, forexample, that they form a complex. Figure 1C also illustrates thegeneric interaction mode A.B used in an unsigned version of ourmethod. We compare FG-NEM results to two unsigned variants toestimate the change in predictive power as a function of theintroduction of sign. In effect, both variants consider fourinteraction modes: (i) A.B; (ii) B.A; (iii) A?B; and A = B. Forcomparison purposes, a predicted unsigned interaction was treatedas activation. In the FG-NEM AVT variant, FG-NEM is run onthe absolute value of the data. In the uFG-NEM method, weremove the component of FG-NEM which models inducedexpression, resulting in interaction modes where the top and rightfive regions are disallowed in all interaction modes.

Probabilistic Formulation of NEMsOur goal is to find a structure among the S-genes that provides

a compact description of X. To find a network that best ‘‘fits’’ thedata, we take a maximum a posteriori approach as in [15,18] jointlyidentify W and H that maximize the posterior:

J Xð Þ~maxW,H P W,HjXð Þf g ð1Þ

~maxW,H

X

Y

P W,H,Y jXð Þ

( )

ð2Þ

where we introduce the hidden E-gene states by summing over allpossible configurations of the Y matrix. Applying Bayes’ Rule anddropping P(X), which is constant with respect to the maximization,we obtain:

J Xð Þ~maxW,H P Wð ÞP HjWð ÞX

Y

P Y jW,Hð ÞP X jYð Þ

( )

ð3Þ

&maxW,H P Wð ÞX

Y

P Y jW,Hð ÞP X jYð Þ

( )

ð4Þ

The approximation in the last step uses the assumption that any E-gene attachments are equally likely given a network structure; i.e.P(H|W) is assumed to be uniformly distributed and is ignored inour approach. P(W) represents a prior over S-gene networks.

As in previous NEM formulations, we assume that each E-geneis attached to a single S-gene and that each E-gene observationvector across the knock-downs is independent of other E-geneobservations. The maximization function can then be written:

J Xð Þ~maxW,H P Wð ÞX

Y

Pe[E

P YejW,heð ÞP XejYeð Þ

( )

ð5Þ

~maxW,H P Wð Þ Pe[E

X

Y

P YejW,heð ÞP XejYeð Þ

( )

ð6Þ

~maxW,H P Wð Þ Pe[E

Le

! "ð7Þ

where Xe and Ye are the row vectors of data and hidden states forE-gene e respectively, and he records the attachment pointinformation for E-gene e. After rearranging the products andsums, we introduce the shorthand Le to represent the likelihood ofthe data restricted only to E-gene e.

Previous approaches decompose Le over the knock-downs,which assume the S-gene observations are independent given thenetwork and attachments (see [18] for an example of such aderivation). To facilitate scoring the expanded set of interactionmodes mentioned earlier, we replace Le with a functionproportional to Le, Le9. Le9 is defined as a product of pair-wise S-gene terms:

L’e~X

A,B[S

PYeA,

YeB

P YeA,YeBjwAB,heABð ÞP XeAjYeAð ÞP XeBjYeBð Þ ð8Þ

where heAB represents the attachment of E-gene e relative to thepair of S-genes A and B. Note that both heAB and wAB are indexedby the unordered pair, {A, B}, so that wAB and wBA are referencesfor the same variable. We refer to heAB as e’s local attachment which

Factor Graph Nested Effects Model

PLoS Computational Biology | www.ploscompbiol.org 4 January 2009 | Volume 5 | Issue 1 | e1000274

IndependenceoverallE-genes

HowtocomputeLe?

Re-arrangingtheterms

DigginginsidetheLeterm

•  Note:Ye={YeA,YeB,YeC..YeN},whereNisthetotalnumberofS-genes

•  DefineL’e,propor2onaltoLeusingasetofpairwisepoten2als

regulated, or unchanged under DA relative to wild-type respec-tively.

Nested effects models include two sets of parameters. Theparameter set W records all pair-wise interactions among the S-genes and the parameter set H describes how each E-gene isattached to the network of S-genes. In the original NEMformulations [10,15,18] W is a binary matrix with entry wAB setto one if S-gene A acts above S-gene B and zero otherwise. IfwAB = wBA = 1 then the S-genes are assumed to operate at anequivalent position in the pathway. Note that indirect interactionsare also represented in W so that if wAB = 1 and wBC = 1 it implieswAC = 1. A parsimonious network among the S-genes is solved forby computing the transitive reduction of W.

To allow for both stimulatory and inhibitory interactions inour formulation, wAB can assume six possible values for eachunique unordered S-gene pair {A, B}. We refer to these values asinteraction modes. The possible values are: (i) A activates B, ARB;(ii) A inhibits B, AxB; (iii) A is equivalent to B, A = B; (iv) A doesnot interact with B, A?B; (v) B activates A, BRA; and (vi) Binhibits A, BxA.

Plotting the response of E-genes under DA and DB yields ascatter-plot that may provide a signature for the type of interactionbetween A and B. For example, Figure 1B shows a scatter-plot ofgene expression changes from the Hughes et al. (2000) yeastknock-out compendium [17] for a pair of knock-outs of the well-known pheromone-response genes: DSTE12 and the DDIG1/DIG2 double knock-out. Comparing the scatter-plot for thesepheromone-response genes to the patterns in Figure 1C, it can beseen to match the inhibitory interaction mode more closely thanthe other modes, which is consistent with DIG1/DIG2’s knowninhibition of STE12. Figure 1C shows an example of the first fourmodes. Shaded regions denote consistent E-gene responses foreach mode. An interaction mode determines a constraint on theobserved E-gene expression changes. For example, plotting theexpression changes of E-genes that act downstream of either A orB for the generic A.B interaction mode produces points in one ofthe seven shaded regions shown in Figure 1Cv. Figure 1Cii showsan example where the inhibitory interaction mode is the bestmatch to the data because a higher number of E-gene changes fallwithin consistent regions (filled circles in the figure). In thismanner, genomewide expression changes detected on the micro-arrays can be used as quantitative phenotypes to identify a varietyof interactions between pairs of S-genes.

Note that two genes are equivalent if their knock-downs lead tosignificantly similar expression changes, which may predict, forexample, that they form a complex. Figure 1C also illustrates thegeneric interaction mode A.B used in an unsigned version of ourmethod. We compare FG-NEM results to two unsigned variants toestimate the change in predictive power as a function of theintroduction of sign. In effect, both variants consider fourinteraction modes: (i) A.B; (ii) B.A; (iii) A?B; and A = B. Forcomparison purposes, a predicted unsigned interaction was treatedas activation. In the FG-NEM AVT variant, FG-NEM is run onthe absolute value of the data. In the uFG-NEM method, weremove the component of FG-NEM which models inducedexpression, resulting in interaction modes where the top and rightfive regions are disallowed in all interaction modes.

Probabilistic Formulation of NEMsOur goal is to find a structure among the S-genes that provides

a compact description of X. To find a network that best ‘‘fits’’ thedata, we take a maximum a posteriori approach as in [15,18] jointlyidentify W and H that maximize the posterior:

J Xð Þ~maxW,H P W,HjXð Þf g ð1Þ

~maxW,H

X

Y

P W,H,Y jXð Þ

( )

ð2Þ

where we introduce the hidden E-gene states by summing over allpossible configurations of the Y matrix. Applying Bayes’ Rule anddropping P(X), which is constant with respect to the maximization,we obtain:

J Xð Þ~maxW,H P Wð ÞP HjWð ÞX

Y

P Y jW,Hð ÞP X jYð Þ

( )

ð3Þ

&maxW,H P Wð ÞX

Y

P Y jW,Hð ÞP X jYð Þ

( )

ð4Þ

The approximation in the last step uses the assumption that any E-gene attachments are equally likely given a network structure; i.e.P(H|W) is assumed to be uniformly distributed and is ignored inour approach. P(W) represents a prior over S-gene networks.

As in previous NEM formulations, we assume that each E-geneis attached to a single S-gene and that each E-gene observationvector across the knock-downs is independent of other E-geneobservations. The maximization function can then be written:

J Xð Þ~maxW,H P Wð ÞX

Y

Pe[E

P YejW,heð ÞP XejYeð Þ

( )

ð5Þ

~maxW,H P Wð Þ Pe[E

X

Y

P YejW,heð ÞP XejYeð Þ

( )

ð6Þ

~maxW,H P Wð Þ Pe[E

Le

! "ð7Þ

where Xe and Ye are the row vectors of data and hidden states forE-gene e respectively, and he records the attachment pointinformation for E-gene e. After rearranging the products andsums, we introduce the shorthand Le to represent the likelihood ofthe data restricted only to E-gene e.

Previous approaches decompose Le over the knock-downs,which assume the S-gene observations are independent given thenetwork and attachments (see [18] for an example of such aderivation). To facilitate scoring the expanded set of interactionmodes mentioned earlier, we replace Le with a functionproportional to Le, Le9. Le9 is defined as a product of pair-wise S-gene terms:

L’e~X

A,B[S

PYeA,

YeB

P YeA,YeBjwAB,heABð ÞP XeAjYeAð ÞP XeBjYeBð Þ ð8Þ

where heAB represents the attachment of E-gene e relative to thepair of S-genes A and B. Note that both heAB and wAB are indexedby the unordered pair, {A, B}, so that wAB and wBA are referencesfor the same variable. We refer to heAB as e’s local attachment which

Factor Graph Nested Effects Model

PLoS Computational Biology | www.ploscompbiol.org 4 January 2009 | Volume 5 | Issue 1 | e1000274

•  TheS-geneinterac2on•  ABachmentofgeneewithrespecttoAorB�A,B✓eAB

DigginginsidetheLeterm

•  TheS-geneinterac2on•  ABachmentofgeneewithrespecttoAorB�A,B✓eAB

can take on five possible values from the set {A, 2A, B, 2B, 0}representing that e is either up- or down-regulated by A, attachedand either up- or down-regulated by B, or not affected by either S-gene. wAB defines the mode of interaction between S-genes A andB. Assuming the replicates are independent given the E-genestates, P(XeA | YeA) can be written as a product over replicateterms: P

r[RA

P XeArjYeAð Þ, where P(XeAr | YeA) is modeled with a

Gaussian distribution having mean m:YeA and standard deviations estimated from the data (see Text S1).

Substituting Le9 for Le into Eq. (7) and distributing themaximization over attachment points, we obtain the maximizingfunction used in our approach:

J Xð Þ~maxW P Wð Þ Pe[E,A,B[S

maxheAB

8<

:

X

YeA ,YeB

P YeA,YeBjwAB,heABð ÞP XeAjYeAð ÞP XeAjYeAð Þ

) ð9Þ

The interaction factors P(YeA, YeB | wAB, heAB) have a value of one ifthe E-gene e is attached to either A or B and e’s state is consistentwith the interaction mode between A and B. If e’s state isinconsistent with the interaction and attachment, then the factorhas value zero. While we used hard constraints to model consistentand inconsistent expression changes (corresponding to the rigidboundaries of the regions drawn in Figure 1C), such constraintscould be softened to use factors with belief potentials between zeroand one. Note that, to simplify the example, the interaction modesin Figure 1C show defined regions. However, P(XeA | YeA) ismodeled as a Gaussian distribution and therefore assigns non-zeroprobabilities over all possible expression values rather thanclassifying some as allowed and others disallowed (i.e. probabilityzero).

An Interaction Transitivity PriorThe prior over interactions, P(W), can represent preferences

over specific interactions in the S-gene graph, allowing theincorporation of biologically-motivated constraints to guidenetwork search. For example, the interaction priors for genes ina common pathway or genes whose products have been detectedto interact in protein-protein interaction screens could be sethigher than the priors for arbitrary pairs of S-genes. In this study,we chose to test the approach both with and without externalbiological information. Without external biological information,the prior encodes a basic property of the S-gene graph: that itshould exhibit transitivity to force pair-wise interaction modes tobe consistent among all triples. Using transitivity, all paths betweenany two genes, A and B, are guaranteed to have the same overalleffect; i.e. the product of the signs of individual links along differentpaths between A and B are equal.

In order to preserve the transitivity of identified interactionmodes, the prior is decomposed over interaction configurationsinto transitivity constraints on all triples of S-genes; i.e.:

P Wð Þ! PA,B,C[S

tABC wAB,wBC ,wACð Þ! "

PA,B[S

rAB wABð Þ! "

ð10Þ

where t is zero if the triple of interactions are intransitive, and oneif the interactions are transitive (see Text S1 for full definition).Using transitivity constraints forces the search to find consistentmodels that best explain the observed changes. The transitivity

constraint includes both the direction of interactions and the signof interactions. As S-gene interactions are signed, the transitivityconstraint forces the sign of the product of two edges to equal thesign of the third; e.g. if AxB and BxC, then ARC. A result ofmodeling transitivity is that a directed cycle of stimulatoryinteractions will also imply activation between any pair of S-genesin the cycle, in both directions. Therefore, the method clusterssuch S-genes into equivalence interactions. The product over rfactors in Eq. (10) encode evidence from high-throughput assays,such as protein-protein binding and protein-DNA bindinginteractions (see ‘‘Physical Structure Priors’’ in Text S1).

While network structures are constrained to reflect moreintuitive models, the decomposition introduces interdependenciesamong the interactions, adding complexity to the search for high-scoring networks. Importantly, max-sum message passing in afactor graph [19] provides an efficient means for estimating highlyprobable S-gene configurations. We next describe how theproblem is recoded into message-passing on a factor graph.

Inference on Factor Graphs to Search for Candidate S-Gene Networks

The formulation above provides a definition of the objectivefunction to be maximized but says nothing about how to search fora good network. The search space of networks is very large makingexhaustive search [10] intractable for networks larger than five S-genes. To apply the method to larger networks, we require a fast,heuristic approach. Markowetz et al. (2007) introduced a bottom-up technique to infer an S-gene graph. They identify sub-graphs ofS-genes (pairs and triples) and then merge the sub-graphs togetherinto a final parsimonious graph. Frohlich et al. (2008) [18] usehierarchical clustering to first identify modules, subsets of S-geneswith correlated expression changes. Networks among the modulesare exhaustively searched and a final network is identified bygreedily introducing interactions across modules that increase thelikelihood.

Here, we introduce the use of a graphical model called a factorgraph to represent all possible NEM structures simultaneously.The parameters that determine the S-gene interactions, W, areexplicitly represented as variables in the factor graph. Identifying ahigh-scoring S-gene network is therefore converted to the task ofidentifying likely assignments of the W variables in the factorgraph. A factor graph is a probabilistic graphical model whoselikelihood function can be factorized into smaller terms (factors)representing local constraints or valuations on a set of randomvariables. Other graphical models, such as Bayesian networks andMarkov random fields, have straightforward factor graph analogs.A factor graph can be represented as an undirected, bi-partitegraph with two types of nodes: variables and factors. A variable isadjacent to a factor if the variable appears as an argument of thefactor. Factor graphs generalize probability mass functions as thejoint likelihood function requires no normalization and the factorsneed not be conditional probabilities. Each factor encodes a localconstraint pertaining to a few variables.

The Factor Graph for Nested EffectsFigure 2 shows the factor graph representing the NEM for the

example S-gene network from Figure 1A. Each random variable isrepresented by a circle and each conditional probability term inEqs. (9–10) is represented by a square. The factor graph containsthree types of variables. First, every unique unordered pair of S-genes {A,B} has a corresponding variable, wAB, that takes on valuesequal to one of the previously mentioned interaction modes(Figure 2, ‘‘S-Gene Interactions’’ level). Second, every E-gene-S-gene pair is associated with a variable, YeA for the hidden

Factor Graph Nested Effects Model

PLoS Computational Biology | www.ploscompbiol.org 5 January 2009 | Volume 5 | Issue 1 | e1000274

NowthejointcanbewriBeninamoretractableway

Eachofthesecondi2onaldistribu2onswillcorrespondtoafactor

Definingthefactors

can take on five possible values from the set {A, 2A, B, 2B, 0}representing that e is either up- or down-regulated by A, attachedand either up- or down-regulated by B, or not affected by either S-gene. wAB defines the mode of interaction between S-genes A andB. Assuming the replicates are independent given the E-genestates, P(XeA | YeA) can be written as a product over replicateterms: P

r[RA

P XeArjYeAð Þ, where P(XeAr | YeA) is modeled with a

Gaussian distribution having mean m:YeA and standard deviations estimated from the data (see Text S1).

Substituting Le9 for Le into Eq. (7) and distributing themaximization over attachment points, we obtain the maximizingfunction used in our approach:

J Xð Þ~maxW P Wð Þ Pe[E,A,B[S

maxheAB

8<

:

X

YeA ,YeB

P YeA,YeBjwAB,heABð ÞP XeAjYeAð ÞP XeAjYeAð Þ

) ð9Þ

The interaction factors P(YeA, YeB | wAB, heAB) have a value of one ifthe E-gene e is attached to either A or B and e’s state is consistentwith the interaction mode between A and B. If e’s state isinconsistent with the interaction and attachment, then the factorhas value zero. While we used hard constraints to model consistentand inconsistent expression changes (corresponding to the rigidboundaries of the regions drawn in Figure 1C), such constraintscould be softened to use factors with belief potentials between zeroand one. Note that, to simplify the example, the interaction modesin Figure 1C show defined regions. However, P(XeA | YeA) ismodeled as a Gaussian distribution and therefore assigns non-zeroprobabilities over all possible expression values rather thanclassifying some as allowed and others disallowed (i.e. probabilityzero).

An Interaction Transitivity PriorThe prior over interactions, P(W), can represent preferences

over specific interactions in the S-gene graph, allowing theincorporation of biologically-motivated constraints to guidenetwork search. For example, the interaction priors for genes ina common pathway or genes whose products have been detectedto interact in protein-protein interaction screens could be sethigher than the priors for arbitrary pairs of S-genes. In this study,we chose to test the approach both with and without externalbiological information. Without external biological information,the prior encodes a basic property of the S-gene graph: that itshould exhibit transitivity to force pair-wise interaction modes tobe consistent among all triples. Using transitivity, all paths betweenany two genes, A and B, are guaranteed to have the same overalleffect; i.e. the product of the signs of individual links along differentpaths between A and B are equal.

In order to preserve the transitivity of identified interactionmodes, the prior is decomposed over interaction configurationsinto transitivity constraints on all triples of S-genes; i.e.:

P Wð Þ! PA,B,C[S

tABC wAB,wBC ,wACð Þ! "

PA,B[S

rAB wABð Þ! "

ð10Þ

where t is zero if the triple of interactions are intransitive, and oneif the interactions are transitive (see Text S1 for full definition).Using transitivity constraints forces the search to find consistentmodels that best explain the observed changes. The transitivity

constraint includes both the direction of interactions and the signof interactions. As S-gene interactions are signed, the transitivityconstraint forces the sign of the product of two edges to equal thesign of the third; e.g. if AxB and BxC, then ARC. A result ofmodeling transitivity is that a directed cycle of stimulatoryinteractions will also imply activation between any pair of S-genesin the cycle, in both directions. Therefore, the method clusterssuch S-genes into equivalence interactions. The product over rfactors in Eq. (10) encode evidence from high-throughput assays,such as protein-protein binding and protein-DNA bindinginteractions (see ‘‘Physical Structure Priors’’ in Text S1).

While network structures are constrained to reflect moreintuitive models, the decomposition introduces interdependenciesamong the interactions, adding complexity to the search for high-scoring networks. Importantly, max-sum message passing in afactor graph [19] provides an efficient means for estimating highlyprobable S-gene configurations. We next describe how theproblem is recoded into message-passing on a factor graph.

Inference on Factor Graphs to Search for Candidate S-Gene Networks

The formulation above provides a definition of the objectivefunction to be maximized but says nothing about how to search fora good network. The search space of networks is very large makingexhaustive search [10] intractable for networks larger than five S-genes. To apply the method to larger networks, we require a fast,heuristic approach. Markowetz et al. (2007) introduced a bottom-up technique to infer an S-gene graph. They identify sub-graphs ofS-genes (pairs and triples) and then merge the sub-graphs togetherinto a final parsimonious graph. Frohlich et al. (2008) [18] usehierarchical clustering to first identify modules, subsets of S-geneswith correlated expression changes. Networks among the modulesare exhaustively searched and a final network is identified bygreedily introducing interactions across modules that increase thelikelihood.

Here, we introduce the use of a graphical model called a factorgraph to represent all possible NEM structures simultaneously.The parameters that determine the S-gene interactions, W, areexplicitly represented as variables in the factor graph. Identifying ahigh-scoring S-gene network is therefore converted to the task ofidentifying likely assignments of the W variables in the factorgraph. A factor graph is a probabilistic graphical model whoselikelihood function can be factorized into smaller terms (factors)representing local constraints or valuations on a set of randomvariables. Other graphical models, such as Bayesian networks andMarkov random fields, have straightforward factor graph analogs.A factor graph can be represented as an undirected, bi-partitegraph with two types of nodes: variables and factors. A variable isadjacent to a factor if the variable appears as an argument of thefactor. Factor graphs generalize probability mass functions as thejoint likelihood function requires no normalization and the factorsneed not be conditional probabilities. Each factor encodes a localconstraint pertaining to a few variables.

The Factor Graph for Nested EffectsFigure 2 shows the factor graph representing the NEM for the

example S-gene network from Figure 1A. Each random variable isrepresented by a circle and each conditional probability term inEqs. (9–10) is represented by a square. The factor graph containsthree types of variables. First, every unique unordered pair of S-genes {A,B} has a corresponding variable, wAB, that takes on valuesequal to one of the previously mentioned interaction modes(Figure 2, ‘‘S-Gene Interactions’’ level). Second, every E-gene-S-gene pair is associated with a variable, YeA for the hidden

Factor Graph Nested Effects Model

PLoS Computational Biology | www.ploscompbiol.org 5 January 2009 | Volume 5 | Issue 1 | e1000274

ModeledasGaussiandistribu2onsFourvariablefactor,overdiscretevariablesYeA:binaryvariables

�AB Fourvaluesforeachpossibletypeofinterac2on:inhibitory,ac2va2ng,equivalent,nointerac2on

✓eAB Interac2onofewithAorB:inhibitedorac2vatedbyAorBornoac2on

Thisfactorhasvalue=1iftheE-geneeisaBachedtoeitherAorB ande’sstateisconsistentwiththeinterac2onmodebetweenAandB.

TheprioroverS-genegraph

•  ThepriorP(Φ)canincorporatepriorknowledgeofinterac2onsamonggenesinpathways

•  Atitssimplest,itshouldencodeatransi2vityrela2onshiptoforceallpairwiseinterac2onstobeconsistentamongalltriples

can take on five possible values from the set {A, 2A, B, 2B, 0}representing that e is either up- or down-regulated by A, attachedand either up- or down-regulated by B, or not affected by either S-gene. wAB defines the mode of interaction between S-genes A andB. Assuming the replicates are independent given the E-genestates, P(XeA | YeA) can be written as a product over replicateterms: P

r[RA

P XeArjYeAð Þ, where P(XeAr | YeA) is modeled with a

Gaussian distribution having mean m:YeA and standard deviations estimated from the data (see Text S1).

Substituting Le9 for Le into Eq. (7) and distributing themaximization over attachment points, we obtain the maximizingfunction used in our approach:

J Xð Þ~maxW P Wð Þ Pe[E,A,B[S

maxheAB

8<

:

X

YeA ,YeB

P YeA,YeBjwAB,heABð ÞP XeAjYeAð ÞP XeAjYeAð Þ

) ð9Þ

The interaction factors P(YeA, YeB | wAB, heAB) have a value of one ifthe E-gene e is attached to either A or B and e’s state is consistentwith the interaction mode between A and B. If e’s state isinconsistent with the interaction and attachment, then the factorhas value zero. While we used hard constraints to model consistentand inconsistent expression changes (corresponding to the rigidboundaries of the regions drawn in Figure 1C), such constraintscould be softened to use factors with belief potentials between zeroand one. Note that, to simplify the example, the interaction modesin Figure 1C show defined regions. However, P(XeA | YeA) ismodeled as a Gaussian distribution and therefore assigns non-zeroprobabilities over all possible expression values rather thanclassifying some as allowed and others disallowed (i.e. probabilityzero).

An Interaction Transitivity PriorThe prior over interactions, P(W), can represent preferences

over specific interactions in the S-gene graph, allowing theincorporation of biologically-motivated constraints to guidenetwork search. For example, the interaction priors for genes ina common pathway or genes whose products have been detectedto interact in protein-protein interaction screens could be sethigher than the priors for arbitrary pairs of S-genes. In this study,we chose to test the approach both with and without externalbiological information. Without external biological information,the prior encodes a basic property of the S-gene graph: that itshould exhibit transitivity to force pair-wise interaction modes tobe consistent among all triples. Using transitivity, all paths betweenany two genes, A and B, are guaranteed to have the same overalleffect; i.e. the product of the signs of individual links along differentpaths between A and B are equal.

In order to preserve the transitivity of identified interactionmodes, the prior is decomposed over interaction configurationsinto transitivity constraints on all triples of S-genes; i.e.:

P Wð Þ! PA,B,C[S

tABC wAB,wBC ,wACð Þ! "

PA,B[S

rAB wABð Þ! "

ð10Þ

where t is zero if the triple of interactions are intransitive, and oneif the interactions are transitive (see Text S1 for full definition).Using transitivity constraints forces the search to find consistentmodels that best explain the observed changes. The transitivity

constraint includes both the direction of interactions and the signof interactions. As S-gene interactions are signed, the transitivityconstraint forces the sign of the product of two edges to equal thesign of the third; e.g. if AxB and BxC, then ARC. A result ofmodeling transitivity is that a directed cycle of stimulatoryinteractions will also imply activation between any pair of S-genesin the cycle, in both directions. Therefore, the method clusterssuch S-genes into equivalence interactions. The product over rfactors in Eq. (10) encode evidence from high-throughput assays,such as protein-protein binding and protein-DNA bindinginteractions (see ‘‘Physical Structure Priors’’ in Text S1).

While network structures are constrained to reflect moreintuitive models, the decomposition introduces interdependenciesamong the interactions, adding complexity to the search for high-scoring networks. Importantly, max-sum message passing in afactor graph [19] provides an efficient means for estimating highlyprobable S-gene configurations. We next describe how theproblem is recoded into message-passing on a factor graph.

Inference on Factor Graphs to Search for Candidate S-Gene Networks

The formulation above provides a definition of the objectivefunction to be maximized but says nothing about how to search fora good network. The search space of networks is very large makingexhaustive search [10] intractable for networks larger than five S-genes. To apply the method to larger networks, we require a fast,heuristic approach. Markowetz et al. (2007) introduced a bottom-up technique to infer an S-gene graph. They identify sub-graphs ofS-genes (pairs and triples) and then merge the sub-graphs togetherinto a final parsimonious graph. Frohlich et al. (2008) [18] usehierarchical clustering to first identify modules, subsets of S-geneswith correlated expression changes. Networks among the modulesare exhaustively searched and a final network is identified bygreedily introducing interactions across modules that increase thelikelihood.

Here, we introduce the use of a graphical model called a factorgraph to represent all possible NEM structures simultaneously.The parameters that determine the S-gene interactions, W, areexplicitly represented as variables in the factor graph. Identifying ahigh-scoring S-gene network is therefore converted to the task ofidentifying likely assignments of the W variables in the factorgraph. A factor graph is a probabilistic graphical model whoselikelihood function can be factorized into smaller terms (factors)representing local constraints or valuations on a set of randomvariables. Other graphical models, such as Bayesian networks andMarkov random fields, have straightforward factor graph analogs.A factor graph can be represented as an undirected, bi-partitegraph with two types of nodes: variables and factors. A variable isadjacent to a factor if the variable appears as an argument of thefactor. Factor graphs generalize probability mass functions as thejoint likelihood function requires no normalization and the factorsneed not be conditional probabilities. Each factor encodes a localconstraint pertaining to a few variables.

The Factor Graph for Nested EffectsFigure 2 shows the factor graph representing the NEM for the

example S-gene network from Figure 1A. Each random variable isrepresented by a circle and each conditional probability term inEqs. (9–10) is represented by a square. The factor graph containsthree types of variables. First, every unique unordered pair of S-genes {A,B} has a corresponding variable, wAB, that takes on valuesequal to one of the previously mentioned interaction modes(Figure 2, ‘‘S-Gene Interactions’’ level). Second, every E-gene-S-gene pair is associated with a variable, YeA for the hidden

Factor Graph Nested Effects Model

PLoS Computational Biology | www.ploscompbiol.org 5 January 2009 | Volume 5 | Issue 1 | e1000274

Transi2vityconstraintfortriples Physicalnetworkconstraints

Exampletransi2vity: IfAB, BC,Then,A->C

Factorgraphrepresenta)onofNEMs

expression state of effect gene e under knock-down A, (Figure 2,‘‘E-gene Expression State’’ level). Third, every observed expres-sion value is associated with a continuous variable, XeAr, where rindexes over replications of DA (Figure 2, ‘‘E-gene ExpressionObservation’’ level). Figure 2 also shows the expression factors,interaction factors, and transitivity factors of Eqs. (9–10).

Inference with message passing. The W that maximizesthe posterior is found using max-sum message passing using allterms from Eqs. (9–10) in log space. For acyclic graphs, themarginal, max-marginal and conditional probabilities of single ormultiple variables can be exactly calculated by the max-sumalgorithms [19]. Message-passing algorithms demonstrateexcellent empirical results in various practical problems even ongraphs containing cycles such as feed-forward and feed-back loops[20–23].

Here, the message passing schedule performs inference in twosteps. In the first step, messages from observations nodes XeAr arepassed through the expression factors and hidden E-gene statevariables, to calculate all messages m(YARwAB) in a single upwardpass. In the second step, messages are passed between only theinteraction variables and transitivity factors until convergence (seeText S1). In the example shown in Figure 2, running inferenceresults in assignments of activation for wAB and wBC (shaded red),inhibition for wBD and wAD (shaded green), and non-interaction forwAB and wBC (unshaded), which match the NEM structure fromFigure 1A. For display of inferred S-gene networks, we computethe transitive reduction of W by removing all links for which thereis a longer redundant path [24].

Pathway expansion with FG-NEMs. Once a signalingnetwork is identified using the message passing inferenceprocedure above, the network can be used to search for newgenes that may be part of the pathway. The NEM and FG-NEM

framework predict new members that act in the pathway by‘‘attaching’’ E-genes to S-genes in the network, or leaving themdetached if their expression data does not fit the model. AttachingE-gene, e, to S-gene, s, asserts that the expression changes of e overall knock-downs are best explained by a network in which e isdirectly downstream of s. The E-genes attached to the network arecollectively referred to as the frontier. Frontier genes may be goodcandidates for further characterization (e.g. knock-down andexpression profiling) in subsequent experiments.

To gain a global picture for where e is connected, we use amodified NEM scoring from Markowetz et al. (2005). The pair-wise attachments for a single E-gene connection variable heAB,provide local ‘‘best guesses’’ for e’s attachment. Rather thanaggregate e’s collection of local attachments, we use NEM scoring,modified to incorporate both stimulatory and inhibitory attach-ments, to estimate the attachment point using the full networklearned in the previous step (see Text S1).

We calculate a log-likelihood ratio that measures the degree towhich e’s expression data is explained by the network if it isattached to one of the S-genes compared to being disconnectedfrom the network, i.e. its likelihood was generated entirely by thebackground Gaussian distribution. For E-gene e, we compute thelog-likelihood of attachment ratio (LAR):

LAR eð Þ~logmaxi=0

P XejW,he~ið Þ

P XejW,he~0ð Þ

0

@

1

A,

where he here represents Markowetz et. al’s attachment parameterexpanded to include inhibitory and stimulatory attachments. Werank all of the E-genes according to their LAR scores. Top-scoringgenes have data that is more likely to have arisen from the modelthan a null background. Any E-gene that has a positive LAR scoreis included as a frontier gene.

Experimental Validation Procedure for Newly PredictedCancer Invasion Genes

To validate the involvement of predicted invasiveness frontiergenes, HT29 colon cancer cells were resuspended in DMEMmedium containing 0.1% FBS and seeded into the top wells(26105 per well) containing individual Matrigel inserts (BDBiosciences, San Jose, CA) according to manufacturer’s protocol.The lower wells were filled with 800 ml medium with 10% fetalbovine serum as chemoattractant. Six to ten hours followingseeding, the cells in the upper wells were transfected with theappropriate shRNA-expressing pSuper constructs [25] usingLipofectamine 2000 (Invitrogen, Carlsbad, CA). Final concentra-tion of pSuper constructs was 1.6 mg/ml. The transfected cellswere incubated at 37uC for 48 hours before assaying for invasion.Media was aspirated from the top wells and non-invading cellswere scraped from the upper side of the inserts with a cotton swaband invading cells on the lower side were fixed and stained usingDiffQuick (IMEB, Inc. San Marcos, CA). Total number ofinvading cells was counted for each insert using a light microscope.Invasion was assessed in quadruplicate and independentlyrepeated at least five times. The shRNA-expressing portion ofthe construct was designed using the siRNA Selection Program ofthe Whitehead Institute for Biomedical Research (http://jura.wi.mit.edu/bioc/siRNAext/), synthesized by Invitrogen and sub-cloned into the XhoI and BamHI sites of pSuper plasmid.Sequences for shRNA constructs are available in the Text S1.shRNA construct MYO1G targets the myosin 1G mRNA(GenBank accession number NM _033054). shRNA construct

Figure 2. Structure of the factor graph for network inference.The factor graph consists of three classes of variables (circles) and threeclasses of factors (squares). XeAr is a continuous observation of E-genee’s expression under DA and replicate r. YeA is the hidden state of E-gene e under DA, and is a discrete variable with domain {up, , down}. wAB

is the interaction between two S-genes A and B. Expression Factorsmodel expression as a mixture of Gaussian distributions. InteractionFactors constrain E-gene states to the allowed regions shown inFigure 1C. Transitivity Factors constrain pair-wise interactions to formconsistent triangles. The arrows labeled m and m9 are messagesencoding local belief potentials on wAB and are propagated duringfactor graph inference.doi:10.1371/journal.pcbi.1000274.g002

Factor Graph Nested Effects Model

PLoS Computational Biology | www.ploscompbiol.org 6 January 2009 | Volume 5 | Issue 1 | e1000274

P (YeA, YeB |�A,B , ✓eAB)

P (XeA|YeA)

Prior

Inferenceonthefactorgraph

•  Findmostlikelyconfigura2onsfor•  Useamessagepassingalgorithm(standardforfactorgraphs)•  CalledtheMax-Productalgorithm•  Messagepassinghappensintwosteps– Messagesarepassedfromobserva2onsXeAtothe– Messagesarepassedbetweentheinterac2onandtransi2vityfactorsun2lconvergence

�A,B

�A,B

DoesFG-NEMcaptureac)va)ngandinhibitoryrela)onships?

BMPR1A targets the bone morphogenetic protein receptor, typeIA mRNA (NM_004329). shRNA construct COLEC12 targets thecollectin sub-family member 12 mRNA (NM_130386). shRNAconstruct AA099748 targets an expressed sequence tag mRNA(AA099748). shRNA construct CAPN12 targets the calpain 12mRNA (NM_144691). shRNA construct scrambled serves as anonsense sequence negative control.

Results

Results on Artificial NetworksData. We evaluated FG-NEMs ability to recover artificial

networks from simulated data. Data was generated by propagatingsignals in networks containing simulated knock-downs and thensampling expression data from activated, inhibited, or unaffectedexpression change distributions (see Text S1 and Figure S3). Wefocused on how the FG-NEM approach increased recovery ofnetworks that contain both activation and inhibition. Because FG-NEMs explicitly incorporate inhibition, we hypothesized that theywould recover networks containing an appreciable amount ofinhibition more accurately than an approach lacking separatemodes for inhibition and activation. We implemented a version ofFG-NEM in which inhibition encoded in the FG-NEM model wasremoved (see Methods). We refer to this version as the ‘‘unsigned’’FG-NEM (uFG-NEM). We compared uFG-NEM to the originalNEM approach and found that the results were comparable onsmall synthetic networks of four S-genes and their associated data(see Figure S2). We therefore used uFG-NEMs as a surrogate forNEMs for the tests on larger networks on which NEM was notefficient enough to run.

To make the comparison of FG-NEM to uFG-NEM fair, wemeasured network recovery in two ways. 1) We calculated ameasure of structure recovery: a predicted interaction was calledcorrect if it matched an interaction (of either sign) in the simulated

network. In this case, whether the interaction was inhibitory orstimulatory was ignored. 2) We measured sign recovery: a predictedinteraction was recorded as correct if it matched an interaction inthe simulated network and had the matching sign.

Influence of inhibition extent on network recovery. Wetested the ability of FG-NEMs and uFG-NEMs to recover thestructure of networks simulated with varying fractions ofinhibition, 0#l#0.75, for both the amount of inhibitoryconnections between S-genes and inhibitory attachments of E-genes. We simulated and predicted 500 networks, calculated thearea under the precision-recall curve (AUC) for each predictednetwork (see Text S1), and recorded the mean and standarddeviation of these AUCs. As expected, when no inhibition waspresent, FG-NEM and uFG-NEM were equivalent in terms ofAUC when run on non-transformed data (Figure 3A).Surprisingly, FG-NEM run on the AVT data performs muchworse than FG-NEM even with no inhibition. This may be due toits interpretation of unaffected E-gene changes as affected changeswhich adds noise to its estimates of hierarchical nesting. Asincreasing amounts of inhibition is added into simulated networks,the performance of uFG-NEM degrades precipitously for structurerecovery, underperforming FG-NEM by a margin of more than0.20 units of AUC at the highest levels of simulated inhibition(Figure 3A). Even at moderate levels of inhibition, for example atthe 15% inhibition level, FG-NEM’s AUC is already significantlyhigher than uFG-NEM’s AUC. We also calculated the AUC forrecovering the correct sign of the interactions for the unsignedmodels. In this case, unsigned interactions were interpreted to beactivating interactions. As expected, the AUC decreasesquadratically since both the precision and recall decreaselinearly with increasing fraction of inhibition. Given theseresults, we expect FG-NEMs to have significantly betterperformance on real genetic networks where appreciableamounts of inhibition exist (see Figure S1). We also varied other

Figure 3. Accuracy of artificial network recovery and expansion. (A) Influence of inhibition on network recovery. AUC (y-axis) plotted as afunction of the percent of inhibitory links (x-axis). Four replicate hybridizations were used in all simulations. Points and error bars represent meansand standard deviations computed across 500 synthetically generated networks respectively. Lines in each plot represent the performance of FG-NEM (red) and uFG-NEM run on the original data (green) or on AVT data (blue) for both structure recovery (solid lines) and sign recovery (dottedlines). (B) Accuracy of FG-NEM network expansion compared to Template Matching. The percentile of an S-gene obtained from Template Matchingwas subtracted from the percentile of the LAR score (see Methods) assigned by FG-NEM and uFG-NEM obtained from the leave-one-out expansiontest. A smoothed histogram for FG-NEM (red), uFG-NEM run on the original data (green) and the AVT data (blue) was plotted and shows theproportion of S-genes (y-axis) with a particular difference in method percentile (x-axis). The underlying simulated network had 32 S-genes, eight S-genes were used for network recovery, and twenty E-genes were attached to each S-gene.doi:10.1371/journal.pcbi.1000274.g003

Factor Graph Nested Effects Model

PLoS Computational Biology | www.ploscompbiol.org 7 January 2009 | Volume 5 | Issue 1 | e1000274

FG-NEM:captureinhibitoryandac2va2ngrela2onshipsuFG-NEM:captureonlyunsignedinterac2onsFG-NEMAVT:FG-NEMrunonabsolutevaluedataSolidlines:structurerecoveryDashedlines:signrecovery

DoesFG-NEMexpandpathwaysbeSerthanthebaselineapproach

BMPR1A targets the bone morphogenetic protein receptor, typeIA mRNA (NM_004329). shRNA construct COLEC12 targets thecollectin sub-family member 12 mRNA (NM_130386). shRNAconstruct AA099748 targets an expressed sequence tag mRNA(AA099748). shRNA construct CAPN12 targets the calpain 12mRNA (NM_144691). shRNA construct scrambled serves as anonsense sequence negative control.

Results

Results on Artificial NetworksData. We evaluated FG-NEMs ability to recover artificial

networks from simulated data. Data was generated by propagatingsignals in networks containing simulated knock-downs and thensampling expression data from activated, inhibited, or unaffectedexpression change distributions (see Text S1 and Figure S3). Wefocused on how the FG-NEM approach increased recovery ofnetworks that contain both activation and inhibition. Because FG-NEMs explicitly incorporate inhibition, we hypothesized that theywould recover networks containing an appreciable amount ofinhibition more accurately than an approach lacking separatemodes for inhibition and activation. We implemented a version ofFG-NEM in which inhibition encoded in the FG-NEM model wasremoved (see Methods). We refer to this version as the ‘‘unsigned’’FG-NEM (uFG-NEM). We compared uFG-NEM to the originalNEM approach and found that the results were comparable onsmall synthetic networks of four S-genes and their associated data(see Figure S2). We therefore used uFG-NEMs as a surrogate forNEMs for the tests on larger networks on which NEM was notefficient enough to run.

To make the comparison of FG-NEM to uFG-NEM fair, wemeasured network recovery in two ways. 1) We calculated ameasure of structure recovery: a predicted interaction was calledcorrect if it matched an interaction (of either sign) in the simulated

network. In this case, whether the interaction was inhibitory orstimulatory was ignored. 2) We measured sign recovery: a predictedinteraction was recorded as correct if it matched an interaction inthe simulated network and had the matching sign.

Influence of inhibition extent on network recovery. Wetested the ability of FG-NEMs and uFG-NEMs to recover thestructure of networks simulated with varying fractions ofinhibition, 0#l#0.75, for both the amount of inhibitoryconnections between S-genes and inhibitory attachments of E-genes. We simulated and predicted 500 networks, calculated thearea under the precision-recall curve (AUC) for each predictednetwork (see Text S1), and recorded the mean and standarddeviation of these AUCs. As expected, when no inhibition waspresent, FG-NEM and uFG-NEM were equivalent in terms ofAUC when run on non-transformed data (Figure 3A).Surprisingly, FG-NEM run on the AVT data performs muchworse than FG-NEM even with no inhibition. This may be due toits interpretation of unaffected E-gene changes as affected changeswhich adds noise to its estimates of hierarchical nesting. Asincreasing amounts of inhibition is added into simulated networks,the performance of uFG-NEM degrades precipitously for structurerecovery, underperforming FG-NEM by a margin of more than0.20 units of AUC at the highest levels of simulated inhibition(Figure 3A). Even at moderate levels of inhibition, for example atthe 15% inhibition level, FG-NEM’s AUC is already significantlyhigher than uFG-NEM’s AUC. We also calculated the AUC forrecovering the correct sign of the interactions for the unsignedmodels. In this case, unsigned interactions were interpreted to beactivating interactions. As expected, the AUC decreasesquadratically since both the precision and recall decreaselinearly with increasing fraction of inhibition. Given theseresults, we expect FG-NEMs to have significantly betterperformance on real genetic networks where appreciableamounts of inhibition exist (see Figure S1). We also varied other

Figure 3. Accuracy of artificial network recovery and expansion. (A) Influence of inhibition on network recovery. AUC (y-axis) plotted as afunction of the percent of inhibitory links (x-axis). Four replicate hybridizations were used in all simulations. Points and error bars represent meansand standard deviations computed across 500 synthetically generated networks respectively. Lines in each plot represent the performance of FG-NEM (red) and uFG-NEM run on the original data (green) or on AVT data (blue) for both structure recovery (solid lines) and sign recovery (dottedlines). (B) Accuracy of FG-NEM network expansion compared to Template Matching. The percentile of an S-gene obtained from Template Matchingwas subtracted from the percentile of the LAR score (see Methods) assigned by FG-NEM and uFG-NEM obtained from the leave-one-out expansiontest. A smoothed histogram for FG-NEM (red), uFG-NEM run on the original data (green) and the AVT data (blue) was plotted and shows theproportion of S-genes (y-axis) with a particular difference in method percentile (x-axis). The underlying simulated network had 32 S-genes, eight S-genes were used for network recovery, and twenty E-genes were attached to each S-gene.doi:10.1371/journal.pcbi.1000274.g003

Factor Graph Nested Effects Model

PLoS Computational Biology | www.ploscompbiol.org 7 January 2009 | Volume 5 | Issue 1 | e1000274

Rightshijinpercen2lerankdifferenceofNEMmethods

Pathwayexpansion

•  ABachnewE-genestoS-genenetwork•  AnaBachedgeneetoS-genesassertsthateisdirectly

downstreamofs•  AllE-genesaBachedtotheS-genenetworkarecalledfron2er

genes•  AnE-gene’sconnec2vityisexaminedbasedontheLog-

likelihoodABachmentRa2o

expression state of effect gene e under knock-down A, (Figure 2,‘‘E-gene Expression State’’ level). Third, every observed expres-sion value is associated with a continuous variable, XeAr, where rindexes over replications of DA (Figure 2, ‘‘E-gene ExpressionObservation’’ level). Figure 2 also shows the expression factors,interaction factors, and transitivity factors of Eqs. (9–10).

Inference with message passing. The W that maximizesthe posterior is found using max-sum message passing using allterms from Eqs. (9–10) in log space. For acyclic graphs, themarginal, max-marginal and conditional probabilities of single ormultiple variables can be exactly calculated by the max-sumalgorithms [19]. Message-passing algorithms demonstrateexcellent empirical results in various practical problems even ongraphs containing cycles such as feed-forward and feed-back loops[20–23].

Here, the message passing schedule performs inference in twosteps. In the first step, messages from observations nodes XeAr arepassed through the expression factors and hidden E-gene statevariables, to calculate all messages m(YARwAB) in a single upwardpass. In the second step, messages are passed between only theinteraction variables and transitivity factors until convergence (seeText S1). In the example shown in Figure 2, running inferenceresults in assignments of activation for wAB and wBC (shaded red),inhibition for wBD and wAD (shaded green), and non-interaction forwAB and wBC (unshaded), which match the NEM structure fromFigure 1A. For display of inferred S-gene networks, we computethe transitive reduction of W by removing all links for which thereis a longer redundant path [24].

Pathway expansion with FG-NEMs. Once a signalingnetwork is identified using the message passing inferenceprocedure above, the network can be used to search for newgenes that may be part of the pathway. The NEM and FG-NEM

framework predict new members that act in the pathway by‘‘attaching’’ E-genes to S-genes in the network, or leaving themdetached if their expression data does not fit the model. AttachingE-gene, e, to S-gene, s, asserts that the expression changes of e overall knock-downs are best explained by a network in which e isdirectly downstream of s. The E-genes attached to the network arecollectively referred to as the frontier. Frontier genes may be goodcandidates for further characterization (e.g. knock-down andexpression profiling) in subsequent experiments.

To gain a global picture for where e is connected, we use amodified NEM scoring from Markowetz et al. (2005). The pair-wise attachments for a single E-gene connection variable heAB,provide local ‘‘best guesses’’ for e’s attachment. Rather thanaggregate e’s collection of local attachments, we use NEM scoring,modified to incorporate both stimulatory and inhibitory attach-ments, to estimate the attachment point using the full networklearned in the previous step (see Text S1).

We calculate a log-likelihood ratio that measures the degree towhich e’s expression data is explained by the network if it isattached to one of the S-genes compared to being disconnectedfrom the network, i.e. its likelihood was generated entirely by thebackground Gaussian distribution. For E-gene e, we compute thelog-likelihood of attachment ratio (LAR):

LAR eð Þ~logmaxi=0

P XejW,he~ið Þ

P XejW,he~0ð Þ

0

@

1

A,

where he here represents Markowetz et. al’s attachment parameterexpanded to include inhibitory and stimulatory attachments. Werank all of the E-genes according to their LAR scores. Top-scoringgenes have data that is more likely to have arisen from the modelthan a null background. Any E-gene that has a positive LAR scoreis included as a frontier gene.

Experimental Validation Procedure for Newly PredictedCancer Invasion Genes

To validate the involvement of predicted invasiveness frontiergenes, HT29 colon cancer cells were resuspended in DMEMmedium containing 0.1% FBS and seeded into the top wells(26105 per well) containing individual Matrigel inserts (BDBiosciences, San Jose, CA) according to manufacturer’s protocol.The lower wells were filled with 800 ml medium with 10% fetalbovine serum as chemoattractant. Six to ten hours followingseeding, the cells in the upper wells were transfected with theappropriate shRNA-expressing pSuper constructs [25] usingLipofectamine 2000 (Invitrogen, Carlsbad, CA). Final concentra-tion of pSuper constructs was 1.6 mg/ml. The transfected cellswere incubated at 37uC for 48 hours before assaying for invasion.Media was aspirated from the top wells and non-invading cellswere scraped from the upper side of the inserts with a cotton swaband invading cells on the lower side were fixed and stained usingDiffQuick (IMEB, Inc. San Marcos, CA). Total number ofinvading cells was counted for each insert using a light microscope.Invasion was assessed in quadruplicate and independentlyrepeated at least five times. The shRNA-expressing portion ofthe construct was designed using the siRNA Selection Program ofthe Whitehead Institute for Biomedical Research (http://jura.wi.mit.edu/bioc/siRNAext/), synthesized by Invitrogen and sub-cloned into the XhoI and BamHI sites of pSuper plasmid.Sequences for shRNA constructs are available in the Text S1.shRNA construct MYO1G targets the myosin 1G mRNA(GenBank accession number NM _033054). shRNA construct

Figure 2. Structure of the factor graph for network inference.The factor graph consists of three classes of variables (circles) and threeclasses of factors (squares). XeAr is a continuous observation of E-genee’s expression under DA and replicate r. YeA is the hidden state of E-gene e under DA, and is a discrete variable with domain {up, , down}. wAB

is the interaction between two S-genes A and B. Expression Factorsmodel expression as a mixture of Gaussian distributions. InteractionFactors constrain E-gene states to the allowed regions shown inFigure 1C. Transitivity Factors constrain pair-wise interactions to formconsistent triangles. The arrows labeled m and m9 are messagesencoding local belief potentials on wAB and are propagated duringfactor graph inference.doi:10.1371/journal.pcbi.1000274.g002

Factor Graph Nested Effects Model

PLoS Computational Biology | www.ploscompbiol.org 6 January 2009 | Volume 5 | Issue 1 | e1000274

OneoftheSgenes

FG-NEMbasedpathwayexpansioninyeast

Figure 4. Yeast knock-out compendium predictions. (A) Precision/recall comparison. Each method’s ability to expand a pathway wascompared. Thick lines indicate mean precision and shaded regions represent standard error of mean calculated over the networks with the fivehighest AUCS from any of the tested methods. (B) Network expansion comparison. Networks were predicted for a non-redundant set of GOcategories containing four or more S-genes in the Hughes et al. (2000) compendium and used to predict held-out genes from the same category (seeMethods). The area under the curve (AUC) for each pathway was calculated for each method. AUC ratios (y-axis) were calculated for each methodrelative to the lowest AUC. (C) Compatibility of physical evidence and predicted S-gene interactions. Each point is the margin of compatibility (MOC,see Methods) of a predicted genetic interaction to high-throughput physical interaction data when physical interaction evidence was used (y-axis)and when it was not used (x-axis). Coloring indicates two-dimensional density estimation of points. Inset shows detail of the highest density region.Prediction methods that are significantly better than the lowest performing method, excluding random, at the 0.05 level (*) and 0.01 level (**) weredetermined by a proportions test on the top 30 predictions from each method. (D) Predicted S-gene networks for the ion homeostasis pathway.Shown are predicted networks from the FG-NEM method (Signed) and the uFG-NEM method (Unsigned). Arrows indicate activating interactions andtees indicate inhibiting interactions. The absence of a link between a pair of S-genes indicates the most likely mode for the pair was the non-interaction mode. Equivalence interactions are indicated with double lines and S-genes connected by equivalence are grouped into dashed ovals.doi:10.1371/journal.pcbi.1000274.g004

Factor Graph Nested Effects Model

PLoS Computational Biology | www.ploscompbiol.org 9 January 2009 | Volume 5 | Issue 1 | e1000274

Templatematching:rankEgenesbasedonsimilarityinexpressiontoan“idealizedtemplate”

FG-NEMinfersamoreaccuratenetworkthantheunsignedversioninyeast

•  FG-NEManduFG-NEMnetworksinferredintheion-homeostasispathway

•  FG-NEMinferredmoregenesassociatedwithionhomeosta2scomparedtouFG-NEM

Figure 4. Yeast knock-out compendium predictions. (A) Precision/recall comparison. Each method’s ability to expand a pathway wascompared. Thick lines indicate mean precision and shaded regions represent standard error of mean calculated over the networks with the fivehighest AUCS from any of the tested methods. (B) Network expansion comparison. Networks were predicted for a non-redundant set of GOcategories containing four or more S-genes in the Hughes et al. (2000) compendium and used to predict held-out genes from the same category (seeMethods). The area under the curve (AUC) for each pathway was calculated for each method. AUC ratios (y-axis) were calculated for each methodrelative to the lowest AUC. (C) Compatibility of physical evidence and predicted S-gene interactions. Each point is the margin of compatibility (MOC,see Methods) of a predicted genetic interaction to high-throughput physical interaction data when physical interaction evidence was used (y-axis)and when it was not used (x-axis). Coloring indicates two-dimensional density estimation of points. Inset shows detail of the highest density region.Prediction methods that are significantly better than the lowest performing method, excluding random, at the 0.05 level (*) and 0.01 level (**) weredetermined by a proportions test on the top 30 predictions from each method. (D) Predicted S-gene networks for the ion homeostasis pathway.Shown are predicted networks from the FG-NEM method (Signed) and the uFG-NEM method (Unsigned). Arrows indicate activating interactions andtees indicate inhibiting interactions. The absence of a link between a pair of S-genes indicates the most likely mode for the pair was the non-interaction mode. Equivalence interactions are indicated with double lines and S-genes connected by equivalence are grouped into dashed ovals.doi:10.1371/journal.pcbi.1000274.g004

Factor Graph Nested Effects Model

PLoS Computational Biology | www.ploscompbiol.org 9 January 2009 | Volume 5 | Issue 1 | e1000274

FG-NEMapplica)ontocoloncancer

Cancer invasion network identification. We applied FG-NEMs to recover a network for the second-tier genes. We includedE-genes that demonstrate a robust and significant effect under atleast two of the knock-downs included in the Irby et al. (2005)study. We selected genes whose log2 ratios differ by less than 0.5 inreplicate arrays and had an absolute log2 expression change atleast equal to the mean absolute level of the activated distribution(1.75) in at least two arrays. Using these criteria, we identified 185E-genes to use for model inference. Figure 5A shows theexpression data of these E-genes plotted in order of theirpredicted attachment points as identified by FG-NEMs. For themost part, E-gene expression changes moved in the same directionfollowing knock-down across the panel of five S-genes, indicatingthe presence of mostly stimulatory links among the S-genes(Figure 5A). This is in contrast to Figure 1A, where expressionchanges of a single E-gene move in the opposite directionfollowing knock-down of S-genes connected by an inhibitory link.

The absence of inhibitory links among S-genes is expected since,according to the selection criteria, all of the S-genes were foundpreviously to act in the same direction (invasion promotion). Themethod does find many inhibitory links to E-genes, whichdramatically increases the fit of the model on the data points.These predicted attachment signs provide information about howan E-gene’s involvement in the invasion process can be tested infollow-up experiments. The model predicts that invasion can besuppressed by knocking down genes connected by stimulatoryattachments or by over-expressing genes connected by inhibitoryattachments.

FG-NEM recovered the network shown in Figure 5B. KRT20and RPL32 are predicted to be equivalent. Also, the modelpredicts TFDP1 and DHX32 are downstream of KRT20 andRPL32. The equivalent interaction of KRT20 and RPL32received significantly high likelihoods (P,0.001) as well as astrong excitatory downstream connection to TFDP1 (P,0.001).

Figure 5. Invasive colon cancer network predictions. (A) Expression changes of selected E-genes following targeted S-gene knock-downs inHT29 colon cancer cells. Gene expression was measured in HT29 cells treated with a shRNA specifically targeting an S-gene (column of the matrix)relative to cells treated with a scrambled control shRNA (Irby et al., 2005). Colors indicate putatively inhibited E-genes (rows of the matrix) with up-regulated levels relative to control (red), activated E-genes with down-regulated levels relative to control (green), and unaffected E-genes withexpression levels not significantly different from control (black). Biological replicates were available for KRT20, TFDP1, and GLS knock-downs. Geneswere sorted by their attachment point and then by their LAR scores. (B) Cancer invasion network predicted by FG-NEM. For each pair of S-genes, themost likely interaction mode is shown. The same conventions used for illustrating interactions predicted for the yeast networks were used here. Someinteractions were found to be significant at the 0.05 level (*) or 0.01 level (**) using a permutation test (see Methods). KRT20 and RPL32 werepredicted to be equivalent and are therefore grouped together in a dashed oval. (C) Matrigel invasion assay in HT29 colon cancer cells. Genespredicted to be significantly attached to the network, CAPN12 and expressed sequence tag AA099748, resulted in a loss of the invasivenessphenotype when knocked-down by RNA interference. Genes not significantly attached to the network, MYO1G, BMPR1A, and COLEC12, did not resultin significant loss of the invasive phenotype. A scrambled non-sense sequence also served as a negative control and did not result in a loss of HT29cell invasiveness. Gene knock-downs in HT29 cells were validated by quantitative real time RT-PCR where mRNA levels of targeted genes weredecreased by 70–80% compared to scrambled control shRNA-treated cells (data not shown). Data shown are the mean6S.E. of five independentexperiments performed in quadruplicate. *Significantly different from scrambled control shRNA-treated cells (P,0.05) by ANOVA and post hoc Tukeytest.doi:10.1371/journal.pcbi.1000274.g005

Factor Graph Nested Effects Model

PLoS Computational Biology | www.ploscompbiol.org 11 January 2009 | Volume 5 | Issue 1 | e1000274

Mostinterac2onsinS-genenetworkareac2va2ng

Novelexpandedgenesthathavesignificanteffectontheinvasivephenotype

Summary

•  FG-NEMs:Ageneralapproachtoinferanorderingofgenesfromknock-downphenotypes

•  Strengths–  FG-NEMscouldbeusedinanitera2vecomputa2onal-experimentalframework

–  Handlessignedinterac2onsbetweenS-genes•  Weaknesses–  Computa2onalcomplexityoftheinferenceproceduremightbehigh•  RequiredindependenceamongE-genes•  ModelpairsofS-genesata2me

Concludingremarks

•  Wehaveseenasuiteofproblems,algorithmsandapplica2onsinarealse{ng

•  Theserangedfromnetworkinference,dynamicnetworkinference,networkmodules,networkalignmentandnetwork-basedinterpreta2on

•  Whatwedidnotsee(orsawlessof)–  Integra2onofdifferenttypesofnetworks–  ExperimentaldesignforbeBerlearningofnetworks–  Moretopologicalproper2esofnetworks–  Subnetworknetworkanalysis

•  Ifyouremaininterestedinthesetopicsorwouldliketolearnmore,sendmeanemail