Submitted to the Annals of Statistics

EXACT RECOVERY IN THE ISING BLOCKMODEL

By Quentin Berthet∗, Philippe Rigollet† and Piyush Srivastava‡

University of Cambridge, Massachusetts Institute of Technology and Tata Institute of Fundamental Research

Abstract. We consider the problem of recovering the block structure of an Ising model given independent observations on the binary hypercube. This new model, called the Ising blockmodel, is a perturbation of the mean field approximation of the Ising model known as the Curie–Weiss model: the sites are partitioned into two blocks of equal size and the interaction between those of the same block is stronger than across blocks, to account for more order within each block. We study probabilistic, statistical and computational aspects of this model in the high-dimensional case when the number of sites may be much larger than the sample size.

1. Introduction. The past decades have witnessed an explosion of the amount of data collected routinely. Along with this expansion comes the promise of a better understanding of the nature of interactions between basic entities. Understanding this web of interactions has profound implications in social sciences, structural biology, neuroscience, marketing and finance, for example. Graphical models (a.k.a. Markov random fields) have proved to be a very useful tool to turn raw data into networks that capture such interactions between random variables. Specifically, given observations of a random vector $\sigma = (\sigma_1, \ldots, \sigma_p)^\top \in \mathbb{R}^p$, the goal is to output a graph on $p$ nodes, one for each variable, where edges encode conditional independence between said variables (Lauritzen, 1996). Graphical models have been successfully employed in a variety of applications such as image

∗Supported by an Isaac Newton Trust Early Career Support Scheme and by The Alan Turing Institute under the EPSRC grant EP/N510129/1.

†Supported by Grants NSF CAREER DMS-1541099, NSF DMS-1541100, NSF DMS-1712596, DARPA W911NF-16-1-0551, ONR N00014-17-1-2147 and a grant from the MIT NEC Corporation.

‡Part of this work was performed while PS was at the Center for the Mathematics of Information at the California Institute of Technology, supported in part by United States NSF grant CCF-1319745.

AMS 2000 subject classifications: Primary 62H30; secondary 82B20.
Keywords and phrases: Ising blockmodel, Curie–Weiss, stochastic blockmodel, planted partition, spectral partitioning.


analysis (Besag, 1986), natural language processing (Manning and Schutze, 1999) and genetics (Lauritzen and Sheehan, 2003; Sebastiani et al., 2005), for example. While for many such applications the goal is to perform subsequent clustering of the variables, graphical models with a community structure have seldom been studied. Instead, stochastic blockmodels, introduced in Holland et al. (1983)—see Abbe (2017) for an overview of recent developments—have played a central role in community detection. Despite their apparent simplicity, stochastic blockmodels have proved to be quite rich, and interesting threshold phenomena have been revealed in recent years (Abbe and Sandon, 2015; Banks et al., 2016; Decelle et al., 2011; Massoulie, 2014; Mossel et al., 2013, 2015). However, such models are fundamentally different from graphical models for two reasons. First, in stochastic blockmodels, one observes a single realization of a graph of interactions between basic entities and the goal is to recover the community structure. Second, it is assumed that interactions between entities are independent.

In several problems concerned with inferring community structure from a graph, a common motivation is that this graph can be constructed by connecting individuals with a similar behavior. A natural question is therefore whether it is possible to use this observed behavior as the data itself. In this work, we raise the question of recovering a community structure from independent copies of $\sigma$. Such a question was recently investigated in the context of $G$-latent models (Bunea et al., 2015, 2016), where $\sigma$ is assumed to have a covariance matrix with a block structure similar to the one employed in stochastic blockmodels, chiefly for a Gaussian $\sigma$. It is worth mentioning that $G$-latent models are not graphical models, as the community structure is imposed not on conditional independence relations but rather on the covariance directly, and the question of learning Gaussian graphical models with a community structure is still open.

We focus on binary random variables $\sigma_1, \ldots, \sigma_p \in \{-1, 1\}$, hereafter called spins. Graphical models on such variables are known as Ising models and were introduced in the context of statistical physics as a mathematical model for ferromagnetism (Ising, 1925). Together with Gaussian graphical models, they form the preponderant class of graphical models in the literature, perhaps owing to their shared maximum entropy property. The Ising model has applications reaching far beyond the limits of statistical physics. For example, it has been proposed to model social interactions such as political affinities, where $\sigma_j$ may represent the vote of U.S. senator $j$ on a random bill in Banerjee et al. (2008) (see also the data used in Diaconis et al. (2008) for the U.S. House of Representatives). It has also been used as a model for neural activity in Schneidman et al. (2006) and it is known to be the stationary distribution of the so-called Glauber dynamics, which has been used to model the spread of information in social networks (Montanari and Saberi, 2010). In the context of such applications, much effort has been devoted to estimating the underlying structure of the graphical model (Bresler, 2015; Bresler et al., 2014, 2008; Ravikumar et al., 2010), primarily under sparsity assumptions.

We introduce the Ising blockmodel, which forms a canonical graphical model for binary random variables with a simple community structure: the choice $\sigma_i = 1$ or $-1$ of each individual is influenced by the choices of others, and more markedly by members of its own community, with whom it is more likely to agree. It gives us a framework to translate the common notions of homophily—the tendency to connect or be aligned with others—and of assortativity—a stronger such tendency for members of the same group—from network analysis into a statistical setting. This model lends itself to a variety of possible applications, e.g., identifying groups of elected representatives based on their voting records, or of customers based on their reviews of various products. Unlike $G$-latent models, the community structure is not imposed on the covariance matrix but rather directly on interactions. Translating these structural assumptions to the covariance matrix requires a careful analysis of the ground states of this model that we carry out in Section 4.

From a computational point of view, the Ising blockmodel bears similarity to stochastic blockmodels, where the community structure may be recovered using semidefinite programming as in many problems with the same flavor (Abbe et al., 2016; Bunea et al., 2016; Goemans and Williamson, 1995; Hajek et al., 2016). Our contribution falls into this line of work, and we describe a semidefinite program (SDP) that enjoys optimal rates of structure recovery.

Our contribution. In Section 2, we define the Ising blockmodel with distribution $\mathbb{P}_{\alpha,\beta}$ as a structured perturbation of the Curie–Weiss model, for which $\alpha = \beta$, i.e., the interaction parameter—in this context, the inverse temperature $\beta$—is constant between all spins. In the Ising blockmodel, spins are instead divided into two balanced communities, and the interaction parameter $\beta$ within each community is greater than the interaction parameter $\alpha$ across different communities. Our objective is to recover the two communities based on the observation of $n$ independent draws from this distribution.

We obtain bounds on the sample size needed for an estimator based on an SDP relaxation of the maximum likelihood estimator to recover the two communities with high probability. We prove that these bounds are optimal in a minimax sense, up to constants. Our proof is based on geometric arguments, an approach that we find more transparent than more traditional arguments based on dual certificates—see Section 3.

Our bounds depend on an unknown quantity, called the gap, that may scale with the size $p$ of the system but is not explicitly given in the model description. To get a handle on this quantity, in Section 4 we perform an analysis of the ground states for the mean magnetization in each block, a variation of the asymptotic analysis of the Curie–Weiss model. In particular, we show that these mean magnetizations jointly converge to a mixture of bivariate Gaussian distributions with explicit centers and covariance matrices for all temperature parameters $(\alpha, \beta)$, except critical ones. This asymptotic characterization allows us to quantify the optimal scaling of the sample size—as a function of the system size $p$—required to recover the community structure exactly. As a byproduct, our results exhibit sharp phase transitions that have a significant impact on the optimal rates for exact recovery. Informally, we find that there are two regions for $(\alpha, \beta)$: one in which the optimal sample size is of order $\log p$, and another where it is of order $p \log p$, with a sharp phase transition between them. The formal statements of these results appear in Theorems 3.3 and 3.7 and Corollary 4.6.

Finally, note that the size $p$ of the system has to be large enough to observe interesting phenomena. Such high-dimensional systems are especially pertinent to the applications described above, and our results are valid for large enough $p$, potentially much larger than the number of observations. In particular, we often consider asymptotic statements as $p \to \infty$. However, in the statistical applications of Section 3 we are interested in understanding the scaling of the number of observations as a function of $p$. To that end, we keep track of the first-order terms in $p$ and only let higher-order terms vanish when convenient.

2. The Ising blockmodel. In this work, we introduce the Ising blockmodel in order to combine the notion of stochastic blockmodel and that of graphical model by assuming that we observe independent copies of a vector of spins $\sigma = (\sigma_1, \ldots, \sigma_p) \in \{-1,1\}^p$ distributed according to an Ising model with a block structure analogous to the one arising in the stochastic blockmodel.

2.1. Definition. Let $p \ge 2$ be an even integer and let $S \subset [p] := \{1, \ldots, p\}$ be a subset of size $|S| = m = p/2$. For any partition $(S, \bar S)$, where $\bar S = [p] \setminus S$ denotes the complement of $S$, write $i \sim j$ if $(i,j) \in S^2 \cup \bar S^2$ and $i \nsim j$ if $(i,j) \in [p]^2 \setminus (S^2 \cup \bar S^2)$. Fix $\beta, \alpha \in \mathbb{R}$ and let $\sigma \in \{-1,1\}^p$ have density $f_{S,\alpha,\beta}$ with respect to the counting measure on $\{-1,1\}^p$ given by

(2.1)   $f_{S,\alpha,\beta}(\sigma) = \frac{1}{Z_{\alpha,\beta}} \exp\Big[\frac{\beta}{2p} \sum_{i \sim j} \sigma_i \sigma_j + \frac{\alpha}{2p} \sum_{i \nsim j} \sigma_i \sigma_j\Big],$

where

(2.2)   $Z_{S,\alpha,\beta} := \sum_{\sigma \in \{-1,1\}^p} \exp\Big[\frac{\beta}{2p} \sum_{i \sim j} \sigma_i \sigma_j + \frac{\alpha}{2p} \sum_{i \nsim j} \sigma_i \sigma_j\Big]$

is a normalizing constant traditionally called the partition function. Let $\mathbb{P}_{S,\alpha,\beta}$ denote the probability distribution over $\{-1,1\}^p$ that has density $f_{S,\alpha,\beta}$ with respect to the counting measure on $\{-1,1\}^p$. We call this model the Ising blockmodel (IBM). In the sequel, we write simply $f_{\alpha,\beta}$ and $\mathbb{P}_{\alpha,\beta}$ to emphasize the dependency on $\alpha, \beta$, and simply $\mathbb{P}_S$ to emphasize the dependency on $S$. We study in this work the statistical problem of identifying the partition $(S, \bar S)$ based on $n$ independent draws from $\mathbb{P}_{\alpha,\beta}$, when $\alpha < \beta$.
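For intuition, the distribution (2.1) can be evaluated exactly by brute force when $p$ is small. The following sketch is our own illustration (all function names are ours, not from the paper); it enumerates the hypercube and normalizes by the partition function (2.2):

```python
import itertools
import math

def ibm_exponent(sigma, S, alpha, beta):
    """Exponent in (2.1): beta/(2p) over within-block ordered pairs
    plus alpha/(2p) over across-block ordered pairs."""
    p = len(sigma)
    within = sum(sigma[i] * sigma[j] for i in range(p) for j in range(p)
                 if (i in S) == (j in S))
    across = sum(sigma[i] * sigma[j] for i in range(p) for j in range(p)
                 if (i in S) != (j in S))
    return beta / (2 * p) * within + alpha / (2 * p) * across

def ibm_density(S, alpha, beta, p):
    """Exact probabilities f_{S,alpha,beta} on {-1,1}^p by enumeration."""
    configs = list(itertools.product([-1, 1], repeat=p))
    weights = [math.exp(ibm_exponent(s, S, alpha, beta)) for s in configs]
    Z = sum(weights)  # partition function (2.2)
    return {s: w / Z for s, w in zip(configs, weights)}
```

For instance, with $p = 4$ and $S = \{0, 1\}$ (0-indexed), the block-polarized configuration $(1, 1, -1, -1)$ receives more mass than the alternating one $(1, -1, 1, -1)$ as soon as $\beta > \alpha$, and the density is invariant under the global flip $\sigma \mapsto -\sigma$.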

2.2. Link with the Curie–Weiss model. When $\alpha = \beta > 0$, the model (2.1) is the mean field approximation of the (ferromagnetic) Ising model and is called the Curie–Weiss model (without external field). It can be readily seen from (2.1) that vectors $\sigma \in \{-1,1\}^p$ that present a lot of pairs $(i,j)$ with opposite spins, i.e., $\sigma_i \sigma_j < 0$, receive less probability than vectors where most of the spins agree. However, there are many more vectors $\sigma$ of the former kind in the discrete hypercube $\{-1,1\}^p$. This tension between energy and entropy is responsible for phase transitions in such systems.

When positive, the parameter $\beta > 0$ is called the inverse temperature and it controls the strength of interactions, and therefore the weight given to the energy term. When $\beta \to 0$, the entropy term dominates and $\mathbb{P}_{\beta,\beta}$ tends to the uniform density over $\{-1,1\}^p$. When $\beta \to \infty$, $\mathbb{P}_{\beta,\beta} \to \frac{1}{2}\delta_{\mathbf{1}} + \frac{1}{2}\delta_{-\mathbf{1}}$, where $\delta_x$ denotes the Dirac point mass at $x$ and $\mathbf{1} = (1, \ldots, 1) \in \{-1,1\}^p$ denotes the all-ones vector of dimension $p$. In this case, energy dominates entropy.

The Curie–Weiss model is tractable enough that the above considerations can be made precise. Specifically, the behavior of the system as the temperature $1/\beta$ varies can be described accurately. To that end, let $\mu^{\mathrm{CW}} = \sigma^\top \mathbf{1}/p$ denote the magnetization of $\sigma$. When $\mu^{\mathrm{CW}} \simeq 0$, then $\sigma$ has a balanced number of positive and negative spins (paramagnetic behavior), and when $|\mu^{\mathrm{CW}}| \gg 0$, then $\sigma$ has a large proportion of spins with a given sign (ferromagnetic behavior). When $p$ is large enough, the Curie–Weiss model is known to obey a phase transition from ferromagnetic to paramagnetic


behavior when the temperature crosses a threshold (see Subsection A for details). This striking result indicates that when the temperature decreases ($\beta$ increases), the model changes from that of a disordered system (no preferred inclination towards $-1$ or $+1$) to that of an ordered system (a majority of the spins agree on the same sign). This behavior is interesting in the context of modeling social interactions and indicates that if the strength of interactions is large enough ($\beta > 1$), then a partial consensus may be found. Formally, the Curie–Weiss model may also be defined in the anti-ferromagnetic case $\beta < 0$—we abusively call it "inverse temperature" in this case also—to model the fact that negative interactions are encouraged. For such choices of $\beta$, the distribution is concentrated around balanced configurations $\sigma$ that have magnetization close to 0. Moreover, as $\beta \to -\infty$, $\mathbb{P}_{\beta,\beta}$ converges to the uniform distribution on configurations with zero magnetization (assuming for simplicity that $p$ is even, so that such configurations exist). As a result, the anti-ferromagnetic case arises when no consensus may be found and the spins are evenly split between positive and negative.

In reality though, a collective behavior may be fragmented into communities, and the IBM is meant to reflect this structure. Specifically, since $\beta > \alpha$, the strength $\beta$ of interactions within the blocks $S$ and $\bar S$ is larger than the strength $\alpha$ of interactions across the blocks. As will become clear from our analysis, the case where $\alpha < 0$ presents interesting configurations whereby the two blocks $S$ and $\bar S$ have polarized behaviors, that is, opposite magnetizations.
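The alignment or anti-alignment of the two blocks can be checked numerically. The sketch below (ours; the function name is hypothetical) exploits the fact that the pair of block sums $(\sigma^\top\mathbf{1}_S, \sigma^\top\mathbf{1}_{\bar S})$ is a sufficient statistic, so the expectation $\mathbb{E}[\mu_S\mu_{\bar S}]$ reduces to a sum over $O(m^2)$ terms weighted by binomial coefficients:

```python
import math

def cross_block_correlation(p, alpha, beta):
    """E[mu_S * mu_Sbar] under the IBM, summing over the joint law of the
    block sums x = sigma^T 1_S and y = sigma^T 1_Sbar.  With m = p/2, the
    exponent in (2.1) equals (beta*(x^2 + y^2) + 2*alpha*x*y) / (4m)."""
    m = p // 2
    total = cross = 0.0
    for ks in range(m + 1):          # number of +1 spins in S
        x = 2 * ks - m
        for kt in range(m + 1):      # number of +1 spins in Sbar
            y = 2 * kt - m
            w = math.comb(m, ks) * math.comb(m, kt) * math.exp(
                (beta * (x * x + y * y) + 2 * alpha * x * y) / (4 * m))
            total += w
            cross += w * (x / m) * (y / m)
    return cross / total
```

With $\alpha < 0$ the blocks anti-align (negative cross-correlation), with $\alpha > 0$ they align, and at $\alpha = 0$ the two blocks decouple and the cross-correlation vanishes.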

2.3. Covariance. The covariance matrix $\Sigma = \mathbb{E}_{\alpha,\beta}[\sigma\sigma^\top]$ captures the block structure of the IBM and thus plays a major role in the statistical applications of Section 3. Moreover, the coefficients of $\Sigma$ can be expressed explicitly in terms of two statistics of $\sigma$. For any $A \subset [p]$, define $\mathbf{1}_A \in \{0,1\}^p$ to be the indicator vector of $A$ and let $\mu_A = \sigma^\top \mathbf{1}_A / |A|$ denote the local magnetization. Akin to the Curie–Weiss model, the density $f_{\alpha,\beta}$ puts uniform weights on configurations that have the same local magnetizations $\mu_S$ and $\mu_{\bar S}$. This symmetry can be exploited to obtain the following estimates.

Lemma 2.1. Let $\Sigma = \mathbb{E}_{\alpha,\beta}[\sigma\sigma^\top]$ denote the covariance matrix of a random configuration $\sigma \sim \mathbb{P}_{\alpha,\beta}$. For any $i \ne j \in [p]$, it holds

$\Gamma := \Sigma_{ij} = \frac{m}{2(m-1)}\,\mathbb{E}[\mu_S^2 + \mu_{\bar S}^2] - \frac{1}{m-1}$   if $i \sim j$,

$\Omega := \Sigma_{ij} = \mathbb{E}[\mu_S \mu_{\bar S}]$   if $i \nsim j$.

Proof. In this proof, we rely on the symmetry of the problem: all the spins $\sigma_i$ in a given block, $S$ or $\bar S$, have the same marginal distribution. Fix $i \ne j$. If $i \sim j$, for example if $i, j \in S$, we have by linearity of expectation

$\Sigma_{ij} = \mathbb{E}[\sigma_i \sigma_j] = \frac{1}{m(m-1)}\,\mathbb{E}\Big[\sum_{(i,j) \in S^2} \sigma_i \sigma_j - m\Big] = \frac{m}{m-1}\,\mathbb{E}[\mu_S^2] - \frac{1}{m-1}.$

Since $\mu_S$ and $\mu_{\bar S}$ are identically distributed, we obtain the desired result. For any $i \nsim j$ we have

$\Sigma_{ij} = \mathbb{E}[\sigma_i \sigma_j] = \frac{1}{m^2}\,\mathbb{E}\Big[\sum_{(i,j) \in S \times \bar S} \sigma_i \sigma_j\Big] = \mathbb{E}[\mu_S \mu_{\bar S}].$

Unlike many models in the statistical literature, computing $\Sigma$ exactly is difficult in the IBM. In particular, it is not immediately clear from Lemma 2.1 that $\Gamma > \Omega$, while this should be intuitively true since $\beta > \alpha$ and therefore the spin interactions are stronger within blocks than across blocks. It turns out that this simple fact can be checked by other means. For example, Lemma 3.6 implies that $\Gamma - \Omega$ and $\beta - \alpha$ have the same sign. We derive in Section 4 asymptotic approximations as $m \to \infty$ to prove effective upper and lower bounds on the gap $\Gamma - \Omega$, by analyzing precisely the behavior of $\mu_S$ and $\mu_{\bar S}$, and using the result of Lemma 2.1.
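For small systems, both sides of the identities in Lemma 2.1 can be evaluated exactly, which also lets one check $\Gamma > \Omega$ numerically. A minimal sketch of ours (it assumes $S$ is the first half of the coordinates and uses the sufficient-statistic form of the exponent):

```python
import itertools
import math

def ibm_cov_entries(p, alpha, beta):
    """Return Gamma and Omega computed directly as E[sigma_i sigma_j],
    together with the same quantities via the formulas of Lemma 2.1."""
    m = p // 2
    Z = e_ss = e_st = e_mu2 = e_cross = 0.0
    for sigma in itertools.product([-1, 1], repeat=p):
        x, y = sum(sigma[:m]), sum(sigma[m:])
        w = math.exp((beta * (x * x + y * y) + 2 * alpha * x * y) / (4 * m))
        Z += w
        e_ss += w * sigma[0] * sigma[1]   # i ~ j: both spins in S
        e_st += w * sigma[0] * sigma[m]   # i !~ j: one spin in each block
        e_mu2 += w * ((x / m) ** 2 + (y / m) ** 2)
        e_cross += w * (x / m) * (y / m)
    gamma_direct, omega_direct = e_ss / Z, e_st / Z
    gamma_lemma = m / (2 * (m - 1)) * (e_mu2 / Z) - 1 / (m - 1)
    omega_lemma = e_cross / Z
    return gamma_direct, omega_direct, gamma_lemma, omega_lemma
```

On any small instance with $\beta > \alpha$, the direct and Lemma 2.1 values agree to machine precision and the gap $\Gamma - \Omega$ is positive.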

3. Exact recovery. In this section, we focus on the following clustering task: given $n$ i.i.d. observations drawn from $\mathbb{P}_{\alpha,\beta}$ with $\alpha < \beta$, recover the partition $(S, \bar S)$. We study the properties of an efficient clustering algorithm together with the fundamental limitations associated to this task. Guarantees in terms of necessary sample size are expressed in terms of the parameters of the problem and of $\Gamma - \Omega$. This last term is analyzed in Section 4.

3.1. Maximum likelihood estimation. Fix a sample size $n \ge 1$. Given $n$ independent copies $\sigma^{(1)}, \ldots, \sigma^{(n)}$ of $\sigma \sim \mathbb{P}_{\alpha,\beta}$, the log-likelihood is given by

$L_n(S) = \sum_{t=1}^n \log\big(\mathbb{P}_{\alpha,\beta}(\sigma^{(t)})\big) = -n \log Z_{\alpha,\beta} - \sum_{t=1}^n H^{\mathrm{ibm}}_{\alpha,\beta}(\sigma^{(t)}),$

where $H^{\mathrm{ibm}}_{\alpha,\beta}$ denotes the IBM Hamiltonian defined on $\{-1,1\}^p$ by

(3.1)   $H^{\mathrm{ibm}}_{\alpha,\beta}(\sigma) = -\Big(\frac{\beta}{2p} \sum_{i \sim j} \sigma_i \sigma_j + \frac{\alpha}{2p} \sum_{i \nsim j} \sigma_i \sigma_j\Big),$

and $Z_{\alpha,\beta}$ is the partition function defined in (2.2). While both $Z_{\alpha,\beta}$ and $H^{\mathrm{ibm}}_{\alpha,\beta}$ could depend on the choice of the block $S$, it turns out that $Z_{\alpha,\beta}$ is constant over choices of $S$ such that $|S| = m = p/2$.


Lemma 3.1. The partition function $Z_{\alpha,\beta} = Z_{\alpha,\beta}(S)$ defined in (2.2) is such that $Z_{\alpha,\beta}(S) = Z_{\alpha,\beta}([m])$ for all $S$ of size $|S| = m$. This statement remains true even if $m \ne p/2$.

Proof. Fix $S \subset [p]$ such that $|S| = m$ and denote by $\pi : [p] \to [p]$ any bijection that maps $[m]$ to $S$. By (2.2) and (4.1), it holds

$Z_{\alpha,\beta}(S) = \sum_{\sigma \in \{-1,1\}^p} \exp\Big[\frac{1}{4m}\Big(2\alpha(\sigma^\top \mathbf{1}_S)(\sigma^\top \mathbf{1}_{\bar S}) + \beta\big((\sigma^\top \mathbf{1}_S)^2 + (\sigma^\top \mathbf{1}_{\bar S})^2\big)\Big)\Big]$

$\phantom{Z_{\alpha,\beta}(S)} = \sum_{\substack{\tau = \pi(\sigma)\\ \sigma \in \{-1,1\}^p}} \exp\Big[\frac{1}{4m}\Big(2\alpha(\tau^\top \mathbf{1}_S)(\tau^\top \mathbf{1}_{\bar S}) + \beta\big((\tau^\top \mathbf{1}_S)^2 + (\tau^\top \mathbf{1}_{\bar S})^2\big)\Big)\Big]$

since $\pi$ is a bijection. Moreover, $\tau^\top \mathbf{1}_S = \pi(\sigma)^\top \mathbf{1}_S = \sigma^\top \mathbf{1}_{[m]}$ and $\tau^\top \mathbf{1}_{\bar S} = \sigma^\top \mathbf{1}_{\overline{[m]}}$. Hence $Z_{\alpha,\beta}(S) = Z_{\alpha,\beta}([m])$.

To emphasize the fact that the partition function does not depend on $S$, we simply write $Z_{\alpha,\beta} = Z_{\alpha,\beta}(S)$. It implies that the log-likelihood is a simple function of $S$. Indeed, define the matrix $Q = Q_S \in \mathbb{R}^{p \times p}$ such that $Q_{ij} = \beta/p$ for $i \sim j$ and $Q_{ij} = \alpha/p$ for $i \nsim j$. Observe that (3.1) can be written as

$H^{\mathrm{ibm}}_{\alpha,\beta}(\sigma) = -\frac{1}{2}\sigma^\top Q \sigma = -\frac{1}{2}\mathrm{Tr}(\sigma\sigma^\top Q).$

This in turn implies

$L_n(S) = -n \log Z_{\alpha,\beta} + \frac{n}{2}\mathrm{Tr}[\hat\Sigma Q],$

where $\hat\Sigma := \frac{1}{n}\sum_{t=1}^n \sigma^{(t)}\sigma^{(t)\top}$ denotes the empirical covariance matrix. Since $\alpha < \beta$, the likelihood maximization problem $\max_{S \subset [p], |S| = m} L_n(S)$ is equivalent to

(3.2)   $\max_{V \in \mathcal{P}} \mathrm{Tr}[\hat\Sigma V], \qquad \mathcal{P} = \{vv^\top : v \in \{-1,1\}^p,\ v^\top \mathbf{1}_{[p]} = 0\}.$

In particular, estimating the blocks $(S, \bar S)$ amounts to estimating $v_S v_S^\top \in \mathcal{P}$, where $v_S = \mathbf{1}_S - \mathbf{1}_{\bar S} \in \{-1,1\}^p$. Note that $v_S v_S^\top = v_{\bar S} v_{\bar S}^\top$. For an adjacency matrix $A$, the optimization problem $\max_{V \in \mathcal{P}} \mathrm{Tr}[AV]$ is a special case of the Minimum Bisection problem and it is known to be NP-hard in general (Garey et al., 1976). To overcome this limitation, various approximation algorithms were suggested over the years, culminating with a poly-logarithmic approximation algorithm (Feige and Krauthgamer, 2002). Unfortunately, such approximations are not directly useful in the context of maximum likelihood estimation. Nevertheless, the maximum likelihood estimation problem at hand is not worst case, but rather a random problem. It can be viewed as a variant of the planted partition model (a.k.a. stochastic blockmodel) introduced in Dyer and Frieze (1989). Indeed, the block structure of $\Sigma$ unveiled in Lemma 2.1 can be viewed as similar to the adjacency matrix of a weighted graph with a small bisection. Moreover, $\hat\Sigma$ can be viewed as the matrix $\Sigma$ planted in some noise. It is therefore not surprising that we can use the same methodology in both cases. In particular, we will use the semidefinite relaxation to the MAXCUT problem of Goemans and Williamson (1995) that was already employed in the planted partition model (Abbe et al., 2016; Hajek et al., 2016). Here, unlike the original planted partition problem, the noise is correlated and therefore requires a different analysis. In random matrix terminology, the observed matrix in the stochastic blockmodel is of Wigner type, whereas in the IBM, it is of Wishart type.
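Although Minimum Bisection is NP-hard in general, (3.2) can be solved by exhaustive search for tiny $p$, which provides a convenient correctness reference for the relaxations discussed below. A sketch of ours (names hypothetical), assuming the samples are stacked as the rows of an $n \times p$ array:

```python
import itertools
import numpy as np

def mle_partition(sigmas):
    """Maximize v^T Sigma_hat v over balanced sign vectors v (problem (3.2)).
    Exhaustive over the p-choose-(p/2) balanced subsets: small p only."""
    n, p = sigmas.shape
    sigma_hat = sigmas.T @ sigmas / n               # empirical covariance
    best_val, best_S = -np.inf, None
    for S in itertools.combinations(range(p), p // 2):
        if 0 not in S:                              # fix the global sign flip
            continue
        v = -np.ones(p)
        v[list(S)] = 1.0
        val = v @ sigma_hat @ v                     # = Tr[Sigma_hat v v^T]
        if val > best_val:
            best_val, best_S = val, set(S)
    return best_S
```

Since $v_S v_S^\top = v_{\bar S} v_{\bar S}^\top$, the partition is only identifiable up to swapping the blocks; fixing coordinate $0$ to lie in $S$ removes this ambiguity.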

It can actually be impractical to use the matrix $\hat\Sigma$ directly in the above relaxations, and we apply a pre-processing that amounts to a centering procedure, which simplifies our analysis. Given $\sigma \in \{-1,1\}^p$, define its centered version $\bar\sigma$ by

$\bar\sigma = \sigma - \frac{\mathbf{1}_{[p]}^\top \sigma}{p}\mathbf{1}_{[p]} = P\sigma,$

where $P = I_p - \frac{1}{p}\mathbf{1}_{[p]}\mathbf{1}_{[p]}^\top$ is the projector onto the subspace orthogonal to $\mathbf{1}_{[p]}$. Moreover, let $\bar\Sigma = P\Sigma P$ and $\hat{\bar\Sigma} = P\hat\Sigma P$ respectively denote the covariance and empirical covariance matrices of the vector $\bar\sigma$.

Note that for all $V \in \mathcal{P}$, we have that $\mathrm{Tr}[\hat{\bar\Sigma} V] = \mathrm{Tr}[\hat\Sigma V]$ since $V \mathbf{1}_{[p]}\mathbf{1}_{[p]}^\top = 0$, so that $PVP = V$. It implies that the likelihood function is unchanged over $\mathcal{P}$ when substituting $\hat\Sigma$ by $\hat{\bar\Sigma}$. Moreover, $\mathbb{E}[\hat{\bar\Sigma}] = \bar\Sigma$ and the spectral decomposition of $\bar\Sigma$ is given by

(3.3)   $\bar\Sigma = (1 - \Gamma)P + p\frac{\Gamma - \Omega}{2}u_S u_S^\top,$

where $u_S = v_S/\sqrt{p}$ is a unit vector. Therefore the matrix $\bar\Sigma$ has leading eigenvalue $(1 - \Gamma) + p(\Gamma - \Omega)/2$ with associated unit eigenvector $u_S$. Moreover, its eigengap is $p(\Gamma - \Omega)/2$. It is well known in matrix perturbation theory that the eigengap plays a key role in the stability of the spectral decomposition of $\bar\Sigma$ when observed with noise.
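The decomposition (3.3) is easy to verify numerically. The short check below is our own, with illustrative values $\Gamma = 0.5$, $\Omega = 0.2$ (made up for the example): it builds $\Sigma$ from its block pattern, centers it, and recovers the advertised leading eigenpair:

```python
import numpy as np

p, m, gamma, omega = 8, 4, 0.5, 0.2
v = np.concatenate([np.ones(m), -np.ones(m)])    # v_S for S = first block
Sigma = np.full((p, p), omega)                   # Omega across blocks
Sigma[:m, :m] = gamma                            # Gamma within S
Sigma[m:, m:] = gamma                            # Gamma within S-bar
np.fill_diagonal(Sigma, 1.0)                     # unit variances
P = np.eye(p) - np.ones((p, p)) / p              # projector orthogonal to 1
bar_sigma = P @ Sigma @ P                        # centered covariance (3.3)
eigvals, eigvecs = np.linalg.eigh(bar_sigma)     # ascending eigenvalues
top_val, top_vec = eigvals[-1], eigvecs[:, -1]
```

The leading eigenvalue equals $(1-\Gamma) + p(\Gamma-\Omega)/2$, the leading eigenvector is $\pm v_S/\sqrt{p}$, and the next eigenvalue is $1-\Gamma$, confirming the eigengap $p(\Gamma-\Omega)/2$.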


3.2. Exact recovery via semidefinite programming. In this subsection, we consider the following semidefinite programming (SDP) relaxation of the optimization problem (3.2):

(3.4)   $\max_{V \in \mathcal{E}} \mathrm{Tr}[\hat{\bar\Sigma} V], \qquad \mathcal{E} = \{V \in \mathcal{S}_p : \mathrm{diag}(V) = \mathbf{1}_{[p]},\ V \succeq 0\},$

where $\mathcal{S}_p$ denotes the set of $p \times p$ symmetric real matrices and $V \succeq 0$ denotes that $V$ is positive semidefinite. The set $\mathcal{E}$ is the set of correlation matrices, known as the elliptope. We recall the definition of the vector $v_S = \mathbf{1}_S - \mathbf{1}_{\bar S} \in \{-1,1\}^p$ and note that $v_S v_S^\top \in \mathcal{P} \subset \mathcal{E}$. Moreover, we denote by $V^{\mathrm{SDP}}$ any solution to the above program. Our goal is to show that (3.4) has a unique solution given by $V^{\mathrm{SDP}} = v_S v_S^\top$, i.e., the SDP relaxation is tight. In contrast to the MLE, this estimator can be computed efficiently by interior-point methods (Boyd and Vandenberghe, 2004).

While the dual certificate approach of Abbe et al. (2016) could be used in this case (see also Hajek et al., 2016), we employ a slightly different, more geometric proof technique that we find to be more transparent. This approach is motivated by the idea that the relaxation is tight in the population case, suggesting that it might be tight as well when $\hat{\bar\Sigma}$ is close to $\bar\Sigma$.

Recall that for any $X_0 \in \mathcal{E}$, the normal cone to $\mathcal{E}$ at $X_0$ is denoted by $N_\mathcal{E}(X_0)$ and defined by

$N_\mathcal{E}(X_0) = \{C \in \mathcal{S}_p : \mathrm{Tr}(CX) \le \mathrm{Tr}(CX_0),\ \forall X \in \mathcal{E}\}.$

It is the cone of matrices $C \in \mathcal{S}_p$ such that $\max_{X \in \mathcal{E}} \mathrm{Tr}(CX) = \mathrm{Tr}(CX_0)$. Therefore, $v_S v_S^\top$ is a solution of (3.4), i.e., the SDP relaxation is tight, whenever $\hat{\bar\Sigma} \in N_\mathcal{E}(v_S v_S^\top)$. The normal cone can be described using the following Laplacian operator. For any matrix $C \in \mathcal{S}_p$, define

$L_S(C) := \mathrm{diag}(C v_S v_S^\top) - C,$

and observe that $L_S(C) v_S = 0$. Indeed, since $v_S \in \{-1,1\}^p$, it holds

$\mathrm{diag}(C v_S v_S^\top) v_S = \mathrm{diag}(C v_S \mathbf{1}_{[p]}^\top)\mathbf{1}_{[p]} = C v_S.$
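The identity $L_S(C)v_S = 0$ holds for any symmetric $C$, and a quick numerical check confirms it (our own sketch; the function name is hypothetical):

```python
import numpy as np

def laplacian_op(C, v):
    """L_S(C) = diag(C v v^T) - C, where diag(.) zeroes out the
    off-diagonal entries; v is a sign vector, here v = v_S."""
    return np.diag(np.diag(C @ np.outer(v, v))) - C

rng = np.random.default_rng(0)
p = 6
v = np.concatenate([np.ones(3), -np.ones(3)])
A = rng.standard_normal((p, p))
C = (A + A.T) / 2                 # arbitrary symmetric matrix
residual = laplacian_op(C, v) @ v
```

The cancellation works because $(v_S)_i^2 = 1$: the diagonal term applied to $v_S$ reproduces $Cv_S$ exactly, so the residual vanishes.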

Proposition 3.2. For any matrix $C \in \mathcal{S}_p$, the following are equivalent:

1. $C \in N_\mathcal{E}(v_S v_S^\top)$;
2. $L_S(C) = \mathrm{diag}(C v_S v_S^\top) - C \succeq 0$.

Moreover, if $L_S(C) \succeq 0$ has only one eigenvalue equal to $0$, then $v_S v_S^\top$ is the unique maximizer of $\mathrm{Tr}(CV)$ over $V \in \mathcal{E}$.


Figure 1: The geometric interpretation for the analysis of this convex relaxation. In the population case, the true value of the parameter $V = v_S v_S^\top$ is the unique solution of both the maximum likelihood problem on $\mathcal{P}$ and of the convex relaxation on $\mathcal{E}$, as $\bar\Sigma$ belongs to both normal cones at $V$. The relaxation is therefore tight with $\bar\Sigma$ as input. We show that when the sample size is large enough, the sample matrix $\hat{\bar\Sigma}$ is close enough to $\bar\Sigma$ and is also in both normal cones, making $V$ the solution to both problems.

Proof. It is known (see Laurent and Poljak, 1996) that the normal cone $N_\mathcal{E}(v_S v_S^\top)$ is given by

$N_\mathcal{E}(v_S v_S^\top) = \{C \in \mathcal{S}_p : C = D - M,\ D \text{ diagonal},\ M \succeq 0,\ v_S^\top M v_S = 0\},$

where $M \succeq 0$ denotes that $M$ is a symmetric, positive semidefinite matrix. We are going to make use of the following facts. First, for any diagonal matrix $D$ and any $V \in \mathcal{E}$, we have $\mathrm{diag}(DV) = D$. Second, taking $V = v_S v_S^\top$, we have

$L_S(C) v_S v_S^\top = \mathrm{diag}(C v_S v_S^\top) v_S v_S^\top - C v_S v_S^\top.$

Taking the diagonal on both sides directly yields that

(3.5)   $\mathrm{diag}(L_S(C) v_S v_S^\top) = 0.$

2. $\Rightarrow$ 1. Let $C \in \mathcal{S}_p$ be such that $L_S(C) \succeq 0$. By definition, we have $C = \mathrm{diag}(C v_S v_S^\top) - L_S(C)$ and it remains to check that $v_S^\top L_S(C) v_S = 0$, which follows readily from (3.5).

1. $\Rightarrow$ 2. Let $C = D - M \in N_\mathcal{E}(v_S v_S^\top)$, where $D$ is diagonal and $M \succeq 0$, $v_S^\top M v_S = 0$, which implies that $M v_S = 0$. It yields $C v_S v_S^\top = D v_S v_S^\top$ and $\mathrm{diag}(C v_S v_S^\top) = \mathrm{diag}(D v_S v_S^\top) = D$, so that the decomposition is necessarily
D = \mathrm{diag}(C v_S v_S^\top)$ and $M = L_S(C) = \mathrm{diag}(C v_S v_S^\top) - C$. In particular, $L_S(C) \succeq 0$.

Thus, if $L_S(C) \succeq 0$, then $v_S v_S^\top$ is a maximizer of $\mathrm{Tr}(CV)$ over $V \in \mathcal{E}$. To prove uniqueness, recall that for any maximizer $V \in \mathcal{E}$, we have $\mathrm{Tr}(CV) = \mathrm{Tr}(C v_S v_S^\top)$. Plugging in $C = \mathrm{diag}(C v_S v_S^\top) - L_S(C)$ and using (3.5) yields

$\mathrm{Tr}(\mathrm{diag}(C v_S v_S^\top)V) - \mathrm{Tr}(L_S(C)V) = \mathrm{Tr}(\mathrm{diag}(C v_S v_S^\top)v_S v_S^\top) = \mathrm{Tr}(\mathrm{diag}(C v_S v_S^\top)).$

Recall that $\mathrm{Tr}(\mathrm{diag}(C v_S v_S^\top)V) = \mathrm{Tr}(\mathrm{diag}(C v_S v_S^\top))$, so that the above display yields $\mathrm{Tr}(L_S(C)V) = 0$. Since $V \succeq 0$ and the kernel of the positive semidefinite matrix $L_S(C)$ is spanned by $v_S$, we have that $V = v_S v_S^\top$.

It follows from Proposition 3.2 that if LS

(�) ⌫ 0 and it has only oneeigenvalue equal to zero, then v

S

v>S

is the unique solution to (3.4). In par-ticular, in this case, the SDP allows exact recovery of the block structure(S, S). Observe that the conditions of Proposition 3.2 hold if � is replacedby the population matrix �. Indeed, using (3.3), we obtain

$$L_S(\Sigma) = \Big(1-\Delta+p\,\frac{\Delta-\Omega}{2}\Big)I_p - (1-\Delta)P - p\,\frac{\Delta-\Omega}{2}\,u_Su_S^\top = (1-\Delta)\,\frac{\mathbf{1}_{[p]}\mathbf{1}_{[p]}^\top}{p} - p\,\frac{\Delta-\Omega}{2}\,u_Su_S^\top + p\,\frac{\Delta-\Omega}{2}\,I_p\,,$$

where we used the fact that $I_p - P$ is the projector onto the linear span of $\mathbf{1}_{[p]}$. Therefore, the eigenvalues of $L_S(\Sigma)$ are $0$ and $1-\Delta+p(\Delta-\Omega)/2$, both with multiplicity $1$, and $p(\Delta-\Omega)/2$ with multiplicity $p-2$. In particular, for $p \ge 2$, $L_S(\Sigma) \succeq 0$ and it has only one eigenvalue equal to zero. Extending this result to $L_S(\hat\Sigma)$ yields the following theorem, as illustrated in Figure 1. Let $C_{\alpha,\beta} > 0$ be a positive constant such that $\Delta - \Omega > C_{\alpha,\beta}/p$, whose existence is guaranteed by Proposition 4.4.
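The population spectrum described above can be checked numerically. The sketch below is a minimal illustration, not part of the paper; the values $p = 8$, $\Delta = 0.5$, $\Omega = 0.1$ are arbitrary choices. It builds $\Sigma$ from the decomposition displayed above and verifies that $L_S(\Sigma) = \mathrm{diag}(\Sigma v_Sv_S^\top) - \Sigma$ has eigenvalues $0$, $p(\Delta-\Omega)/2$ (multiplicity $p-2$) and $1-\Delta+p(\Delta-\Omega)/2$:

```python
import numpy as np

# Arbitrary illustrative parameters (any Delta > Omega works).
p, Delta, Omega = 8, 0.5, 0.1
m = p // 2

v_S = np.array([1.0] * m + [-1.0] * m)   # signed block indicator
u_S = v_S / np.sqrt(p)                   # unit-norm version
P = np.eye(p) - np.ones((p, p)) / p      # projector onto the complement of the all-ones vector

# Population covariance, following the decomposition derived from (3.3).
Sigma = (1 - Delta) * P + (p * (Delta - Omega) / 2) * np.outer(u_S, u_S)

# L_S(C) = diag(C v_S v_S^T) - C, evaluated at C = Sigma.
L_S = np.diag(np.diag(Sigma @ np.outer(v_S, v_S))) - Sigma
eigs = np.sort(np.linalg.eigvalsh(L_S))
```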

Theorem 3.3. The SDP relaxation (3.4) has a unique maximum at $V = v_Sv_S^\top$ with probability $1-\delta$ whenever

$$n > 64\Big(3 + \frac{2}{C_{\alpha,\beta}}\Big)\,\frac{\log(4p/\delta)}{\Delta-\Omega}\,(1+o_p(1))\,.$$

In particular, the SDP relaxation recovers exactly the block structure $(S,\bar S)$.
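For very small $p$, this conclusion can be illustrated at the population level without an SDP solver: every cut matrix $vv^\top$ with $v \in \{-1,1\}^p$ is feasible for (3.4), so a brute-force search over sign vectors (a hypothetical stand-in for solving the SDP, feasible only for tiny $p$; parameter values arbitrary) should return $\pm v_S$ when $C$ is the population matrix $\Sigma$:

```python
import itertools
import numpy as np

p, Delta, Omega = 8, 0.5, 0.1   # arbitrary illustrative parameters
m = p // 2
v_S = np.array([1.0] * m + [-1.0] * m)
u_S = v_S / np.sqrt(p)
P = np.eye(p) - np.ones((p, p)) / p
Sigma = (1 - Delta) * P + (p * (Delta - Omega) / 2) * np.outer(u_S, u_S)

# Maximize Tr(Sigma v v^T) = v^T Sigma v over all 2^p sign vectors.
best = max((np.array(v, dtype=float) for v in itertools.product([-1, 1], repeat=p)),
           key=lambda v: v @ Sigma @ v)
```

The maximizer coincides with the signed block indicator up to a global sign flip, matching the exact-recovery statement.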


Proof. Recall that $L_S(\Sigma)v_S = 0$ and, for any $C \in \mathcal{S}_p$, denote by $\lambda_2[C]$ its second smallest eigenvalue. Our goal is to show that $\lambda_2[L_S(\hat\Sigma)] > 0$. To that end, observe that

$$L_S(\hat\Sigma) = L_S(\Sigma) + \mathrm{diag}\big((\hat\Sigma-\Sigma)v_Sv_S^\top\big) - (\hat\Sigma - \Sigma)\,.$$

Therefore, using Weyl's inequality and the fact that $\lambda_2[L_S(\Sigma)] = p(\Delta-\Omega)/2$, we get

$$(3.6)\qquad \lambda_2[L_S(\hat\Sigma)] \ \ge\ p\,\frac{\Delta-\Omega}{2} - \big\|\mathrm{diag}\big((\hat\Sigma-\Sigma)v_Sv_S^\top\big)\big\|_{\mathrm{op}} - \|\hat\Sigma-\Sigma\|_{\mathrm{op}}\,,$$

where $\|\cdot\|_{\mathrm{op}}$ denotes the operator norm. Therefore, it is sufficient to upper bound the above operator norms. This is ensured by the following lemma.

Lemma 3.4. Fix $\delta > 0$ and define

$$R_{n,p}(\delta) = 2p\,\max\Big(\sqrt{\big(1+2/C_{\alpha,\beta}\big)(\Delta-\Omega)\,\frac{\log(4p/\delta)}{n}}\ ,\ \big(6+4/C_{\alpha,\beta}\big)\,\frac{\log(4p/\delta)}{n}\Big)\,.$$

With probability $1-\delta$, it holds simultaneously that

$$(3.7)\qquad \|\hat\Sigma - \Sigma\|_{\mathrm{op}} \le R_{n,p}(\delta)(1+o_p(1))$$

and

$$(3.8)\qquad \big\|\mathrm{diag}\big((\hat\Sigma-\Sigma)v_Sv_S^\top\big)\big\|_{\mathrm{op}} \le R_{n,p}(\delta)(1+o_p(1))\,.$$

Proof. To prove (3.7), we use a matrix Bernstein inequality for sums of independent matrices from Tropp (2015). To that end, note that

$$\hat\Sigma - \Sigma = \frac{1}{n}\sum_{t=1}^n M_t\,,$$

where $M_1,\ldots,M_n$ are i.i.d. random matrices given by $M_t = \sigma(t)\sigma(t)^\top - \Sigma$, $t = 1,\ldots,n$. We have

$$\|M_t\|_{\mathrm{op}} \le \|\sigma(t)\sigma(t)^\top\|_{\mathrm{op}} + \|\Sigma\|_{\mathrm{op}} \le p + \|\Sigma\|_{\mathrm{op}}\,.$$

Furthermore, since $\|\sigma(t)\|^2 = p$, we have

$$\mathbb{E}[M_t^2] = \mathbb{E}\big[\|\sigma(t)\|^2\sigma(t)\sigma(t)^\top - \sigma(t)\sigma(t)^\top\Sigma - \Sigma\sigma(t)\sigma(t)^\top + \Sigma^2\big] = p\,\mathbb{E}[\sigma(t)\sigma(t)^\top] - \Sigma^2 - \Sigma^2 + \Sigma^2 \preceq p\,\Sigma\,.$$


As a consequence, $\sum_{t=1}^n \mathbb{E}[M_t^2] \preceq np\,\Sigma$. By Theorem 1.6.2 in Tropp (2015), this yields

$$(3.9)\qquad \mathbb{P}\big(\|\hat\Sigma-\Sigma\|_{\mathrm{op}} > t\big) \le 2p\,\exp\Big(-\frac{nt^2}{2p\|\Sigma\|_{\mathrm{op}} + 2(p+\|\Sigma\|_{\mathrm{op}})t}\Big)\,.$$

We have $\|\hat\Sigma - \Sigma\|_{\mathrm{op}} \le t$ with probability $1-\delta$ for any $t$ such that

$$\log(2p/\delta) \le \frac{nt^2}{2p\|\Sigma\|_{\mathrm{op}} + 2(p+\|\Sigma\|_{\mathrm{op}})t}\,.$$

This holds for all

$$t \ge \max\Big(\sqrt{\frac{4p\|\Sigma\|_{\mathrm{op}}\log(2p/\delta)}{n}}\ ,\ \frac{4(p+\|\Sigma\|_{\mathrm{op}})\log(2p/\delta)}{n}\Big)\,.$$

To conclude the proof of (3.7), observe that

$$\|\Sigma\|_{\mathrm{op}} = p\,\frac{\Delta-\Omega}{2} + 1 - \Delta \le \Big(1+\frac{1}{C_{\alpha,\beta}}\Big)(\Delta-\Omega)\,p\,,$$

where $C_{\alpha,\beta} > 0$ is the constant defined immediately before the statement of Theorem 3.3.

We now turn to the proof of (3.8). Recall that $v_S \in \{-1,1\}^p$, so that the $i$th diagonal element satisfies

$$\big|\mathrm{diag}\big((\hat\Sigma-\Sigma)v_Sv_S^\top\big)_{ii}\big| = \big|e_i^\top(\hat\Sigma-\Sigma)v_S\big|\,,$$

where $e_i$ denotes the $i$th vector of the canonical basis of $\mathbb{R}^p$. Hence,

$$\big\|\mathrm{diag}\big((\hat\Sigma-\Sigma)v_Sv_S^\top\big)\big\|_{\mathrm{op}} = \max_{i\in[p]}\big|\mathrm{diag}\big((\hat\Sigma-\Sigma)v_Sv_S^\top\big)_{ii}\big| = \max_{i\in[p]}\big|e_i^\top(\hat\Sigma-\Sigma)v_S\big|\,.$$

We bound the right-hand side of the above display by noting that

$$e_i^\top(\hat\Sigma-\Sigma)v_S = \frac{m}{n}\sum_{t=1}^n\Big(\sigma_i(t)\big(\mu_S(t)-\mu_{\bar S}(t)\big) - \mathbb{E}\big[\sigma_i(t)\big(\mu_S(t)-\mu_{\bar S}(t)\big)\big]\Big)\,,$$

where $\mu_S(t) = \mathbf{1}_S^\top\sigma(t)/m \in [-1,1]$ and $\mu_{\bar S}(t)$ is defined analogously. The random variables $\sigma_i(t)(\mu_S(t)-\mu_{\bar S}(t)) - \mathbb{E}[\sigma_i(t)(\mu_S(t)-\mu_{\bar S}(t))]$ are centered, i.i.d., and bounded in absolute value by $2$ for all $t \in [n]$. Moreover, it follows from Lemma 2.1 that the variance of these random variables is bounded, for $p \ge 4$, by

$$\mathbb{E}\big[\big(\mu_S(t)-\mu_{\bar S}(t)\big)^2\big] \le 2(\Delta-\Omega) + \frac{4}{p-2} \le 2(\Delta-\Omega) + \frac{8}{p} =: \nu^2\,.$$
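This variance bound can be checked by exact enumeration for a small system. The sketch below is an illustration only: it uses the magnetization form (4.1) of the Hamiltonian given in Section 4, and the values $p = 8$, $\alpha = 0.2$, $\beta = 1.0$ are arbitrary. It computes $\mathbb{E}[(\mu_S-\mu_{\bar S})^2]$, $\Delta$ and $\Omega$ exactly and verifies the bound $2(\Delta-\Omega)+8/p$:

```python
import itertools
import math

p, alpha, beta = 8, 0.2, 1.0   # arbitrary illustrative parameters
m = p // 2

Z = 0.0
E_dsq = 0.0      # E[(mu_S - mu_Sbar)^2]
E_muS2 = 0.0     # E[mu_S^2]
E_cross = 0.0    # E[mu_S mu_Sbar]
for s in itertools.product([-1, 1], repeat=p):
    mu_S = sum(s[:m]) / m
    mu_T = sum(s[m:]) / m
    # Hamiltonian in magnetization form, as in (4.1).
    H = -(m / 4) * (2 * alpha * mu_S * mu_T + beta * (mu_S**2 + mu_T**2))
    w = math.exp(-H)
    Z += w
    E_dsq += w * (mu_S - mu_T) ** 2
    E_muS2 += w * mu_S**2
    E_cross += w * mu_S * mu_T
E_dsq, E_muS2, E_cross = E_dsq / Z, E_muS2 / Z, E_cross / Z

Delta = (m * E_muS2 - 1) / (m - 1)   # within-block correlation
Omega = E_cross                      # across-block correlation
```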


By a one-dimensional Bernstein inequality and a union bound over the $p$ terms, we therefore have

$$\mathbb{P}\Big(\max_{i\in[p]}\big|e_i^\top(\hat\Sigma-\Sigma)v_S\big| > \frac{pt}{n}\Big) \le 2p\,\exp\Big(-\frac{2t^2}{n\nu^2 + 4t/3}\Big)\,,$$

which yields

$$\max_{i\in[p]}\big|e_i^\top(\hat\Sigma-\Sigma)v_S\big| \le p\,\max\Big(\sqrt{\frac{\nu^2\log(2p/\delta)}{n}}\ ,\ \frac{4\log(2p/\delta)}{3n}\Big)\,,$$

with probability $1-\delta$. This completes the proof of (3.8).

To conclude the proof of Theorem 3.3, note that for the prescribed choice of $n$ we have

$$2R_{n,p}(\delta)(1+o_p(1)) < p\,\frac{\Delta-\Omega}{2}\,,$$

and it follows from (3.6) that $\lambda_2[L_S(\hat\Sigma)] > 0$.

Remark 3.5. We have not attempted to optimize the constant $64(3+2/C_{\alpha,\beta})$ that appears in Theorem 3.3, and it is arguably suboptimal. One way to see how it can be reduced, at least by a factor $2$, is to note that the factor $p$ on the right-hand side of (3.9) is in fact superfluous, thus resulting in an extra logarithmic factor in (3.7). This is because, akin to the stochastic blockmodel analysis in Abbe et al. (2016), the matrix deviation inequality from Tropp (2015) is too coarse for this problem. The extra factor $p$ may be removed using the concentration results of Section 4.3, but at the cost of a much longer argument. Indeed, using Theorem 4.3, we can establish the concentration of the local magnetizations around the ground states and, conditionally on these magnetizations, the configurations are uniformly distributed. These conditional distributions can be shown to exhibit sub-Gaussian concentration, so that $\sigma^\top u$ is sub-Gaussian with constant variance proxy for any unit vector $u \in \mathbb{R}^p$. This result can yield a bound for $\|\hat\Sigma-\Sigma\|_{\mathrm{op}}$ using an $\varepsilon$-net argument that is standard in covariance matrix estimation. With this in mind, we could get an upper bound in (3.7) that is negligible with respect to $R_{n,p}$, thereby removing a factor $2$. Nevertheless, in the absence of a tight control of the constant $C_{\alpha,\beta}$, exact constants are hopeless and beyond the scope of this paper.

3.3. Information theoretic limitations. In this section, we present lower bounds on the sample size needed to recover the partition $(S,\bar S)$ and compare them to the upper bounds of Theorem 3.3. In the sequel, we write $\hat S \asymp S$ if either $(\hat S, \bar{\hat S}) = (S,\bar S)$ or $(\hat S, \bar{\hat S}) = (\bar S, S)$, to indicate that the two partitions are the same, and we write $\hat S \not\asymp S$ to indicate that the two partitions are different.

For any balanced partition $(S,\bar S)$, consider a "neighborhood" $\mathcal{T}_S$ of $(S,\bar S)$ composed of balanced partitions such that, for all $(T,\bar T) \in \mathcal{T}_S$, we have $\rho(S,T) = 1$ and $\rho(\bar S,\bar T) = 1$. We first compute the Kullback–Leibler divergence between the distributions $\mathbb{P}_S$ and $\mathbb{P}_T$.

Lemma 3.6. For any positive $\beta$, $\alpha < \beta$, and $T \in \mathcal{T}_S$, it holds that

$$\mathrm{KL}(\mathbb{P}_T, \mathbb{P}_S) = \frac{p-2}{p}\,(\beta-\alpha)(\Delta-\Omega)\,.$$

Proof. By definition of the divergence and of the distributions, we have that

$$\mathrm{KL}(\mathbb{P}_T,\mathbb{P}_S) = \mathbb{E}_T\Big[\log\frac{\mathbb{P}_T}{\mathbb{P}_S}(\sigma)\Big] = \mathbb{E}_T\big[\mathrm{Tr}\big[(Q_T-Q_S)\sigma\sigma^\top\big]\big] = \mathrm{Tr}\big[(Q_T-Q_S)\Sigma_T\big]\,.$$

Note that most of the coefficients of $Q_T - Q_S$ are equal to $0$. In fact, writing $\{s\} = S\setminus T$ and $\{t\} = \bar S\setminus\bar T$, we have

$$(Q_T - Q_S)_{ij} = \frac{\alpha-\beta}{p}\quad\text{if}\quad \begin{cases} i \in S\setminus\{s\}\,,\ j = s\,,\\ i = s\,,\ j \in S\setminus\{s\}\,,\\ i \in \bar S\setminus\{t\}\,,\ j = t\,,\\ i = t\,,\ j \in \bar S\setminus\{t\}\,,\end{cases}$$

and

$$(Q_T - Q_S)_{ij} = \frac{\beta-\alpha}{p}\quad\text{if}\quad \begin{cases} i \in S\setminus\{s\}\,,\ j = t\,,\\ i = s\,,\ j \in \bar S\setminus\{t\}\,,\\ i \in \bar S\setminus\{t\}\,,\ j = s\,,\\ i = t\,,\ j \in S\setminus\{s\}\,,\end{cases}$$

and $0$ otherwise. There are therefore $p-2$ coefficients of each sign. Furthermore, whenever $(Q_T-Q_S)_{ij} = (\alpha-\beta)/p$, we have $(\Sigma_T)_{ij} = \Omega$, and whenever $(Q_T-Q_S)_{ij} = (\beta-\alpha)/p$, we have $(\Sigma_T)_{ij} = \Delta$. Computing $\mathrm{Tr}[(Q_T-Q_S)\Sigma_T]$ explicitly yields the desired result.
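The first display of the proof can be verified on a toy instance. The sketch below uses a hypothetical normalization: it defines the two Gibbs measures through an explicit coupling matrix $J$ with entries $\beta/p$ within blocks and $\alpha/p$ across, which may differ from the paper's exact convention by a constant factor; the identity $\mathrm{KL}(\mathbb{P}_T,\mathbb{P}_S) = \mathrm{Tr}[(Q_T-Q_S)\Sigma_T]$ it checks is, however, normalization-free because the two balanced partitions share the same partition function:

```python
import itertools
import math
import numpy as np

p, m = 6, 3
alpha, beta = 0.3, 1.0   # arbitrary illustrative parameters

def coupling(block):
    """Pair couplings J_ij: beta/p within blocks, alpha/p across (assumed convention)."""
    J = np.zeros((p, p))
    for i in range(p):
        for j in range(p):
            if i != j:
                J[i, j] = beta / p if block[i] == block[j] else alpha / p
    return J

def gibbs(J):
    """Exact Gibbs distribution P(sigma) proportional to exp(sum_{i<j} J_ij s_i s_j)."""
    configs = [np.array(s, dtype=float) for s in itertools.product([-1, 1], repeat=p)]
    w = np.array([math.exp(0.5 * s @ J @ s) for s in configs])  # 0.5: s@J@s double-counts pairs
    return configs, w / w.sum()

block_S = [0, 0, 0, 1, 1, 1]   # partition S = {0,1,2}
block_T = [0, 0, 1, 0, 1, 1]   # neighboring partition T: sites 2 and 3 swapped
J_S, J_T = coupling(block_S), coupling(block_T)
configs, p_S = gibbs(J_S)
_, p_T = gibbs(J_T)

kl_enum = float(np.sum(p_T * np.log(p_T / p_S)))
Sigma_T = sum(q * np.outer(s, s) for s, q in zip(configs, p_T))
kl_formula = 0.5 * float(np.sum((J_T - J_S) * Sigma_T))  # Tr[(Q_T - Q_S) Sigma_T] over unordered pairs
```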

From this lemma, we derive the following lower bound.


Theorem 3.7. Fix $\delta \in (0, 3/5)$ and $p \ge 6$, and assume that

$$n \le \frac{\log(p/4)}{(\beta-\alpha)(\Delta-\Omega)}\,.$$

Then

$$\inf_{\hat S}\ \max_{S\in\mathcal{S}}\ \mathbb{P}^{\otimes n}_S\big((\hat S,\bar{\hat S}) \not\asymp (S,\bar S)\big) \ \ge\ \frac{p-2}{p}\big(1-\delta-p^{-\delta}\big) > 0\,,$$

where the infimum is taken over all estimators of $S$. Note that the right-hand side of the above inequality goes to $1$ as $p \to \infty$ and $\delta \to 0$.

Proof. First, note that $|\mathcal{T}_S| = (p/2-1)^2$ and that, by Lemma 3.6, for any $T \in \mathcal{T}_S$,

$$\mathrm{KL}\big(\mathbb{P}^{\otimes n}_T, \mathbb{P}^{\otimes n}_S\big) = n\,\mathrm{KL}(\mathbb{P}_T,\mathbb{P}_S) \le n(\beta-\alpha)(\Delta-\Omega) \le \log(p/4) \le \frac{1}{2}\log|\mathcal{T}_S|\,.$$

Thus Theorem 2.5 in Tsybakov (2009) yields

$$\inf_{\hat S}\ \max_{S\in\mathcal{S}}\ \mathbb{P}^{\otimes n}_S\big(\hat S \not\asymp S\big) \ \ge\ \frac{\sqrt{|\mathcal{T}_S|}}{1+\sqrt{|\mathcal{T}_S|}}\Big(1-\delta-\sqrt{\frac{\delta}{\log|\mathcal{T}_S|}}\Big) \ \ge\ \frac{p-2}{p}\big(1-\delta-p^{-\delta}\big) > 0\,,$$

for $\delta \in (0, 3/5)$.

As we will see in the next section, the gap $\Delta-\Omega$ scales with the size $p$ of the system. Since our lower bound of Theorem 3.7 exhibits the same dependence on $\Delta-\Omega$ as the upper bound of Theorem 3.3, we conclude that the SDP relaxation studied in this paper is rate optimal: it achieves exact recovery of the community structure for a sample size that scales optimally with $p$. In the next section, we quantify this scaling in $p$ using an asymptotic analysis of the ground states of the Ising blockmodel.

4. Optimal rates of exact recovery. As seen in Section 3, given $\sigma(1),\ldots,\sigma(n)$, independent copies of $\sigma \sim \mathbb{P}_{\alpha,\beta}$, the sample covariance matrix $\hat\Sigma$ defined by

$$\hat\Sigma = \frac{1}{n}\sum_{t=1}^n \sigma(t)\sigma(t)^\top$$

is a sufficient statistic for $S$. From basic concentration results (see Section 3), we have shown that this matrix concentrates around the true covariance matrix $\Sigma = \mathbb{E}_{\alpha,\beta}[\sigma\sigma^\top]$, where $\mathbb{E}_{\alpha,\beta}$ denotes the expectation associated to $\mathbb{P}_{\alpha,\beta}$.


Unfortunately, computing $\Sigma$ directly is quite challenging. As a consequence, there is no simple expression for the quantity of interest $\Delta-\Omega$ as a function of the parameters $p$, $\alpha$, and $\beta$ of the problem. Instead, we show that $\mathbb{P}_{\alpha,\beta}$ converges to a mixture of bivariate Gaussians with known centers and covariance matrices. In turn, this gives us a handle on quantities of the form $\mathbb{E}_{\alpha,\beta}[\varphi(\sigma)]$ for some test function $\varphi$ and, in particular, allows us to quantify the gap $\Delta-\Omega$. Obtaining an asymptotic expression for these values is therefore important to derive theoretical guarantees for the estimators considered above. Beyond our statistical task, we exhibit phase transitions that are interesting from a probabilistic point of view.
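Although $\Sigma$ has no simple closed form, it can be computed by brute force for a tiny system. The sketch below is an illustration only, using the magnetization form (4.1) of the Hamiltonian with arbitrary parameters $p = 6$, $\alpha = 0.2$, $\beta = 1.0$; it confirms the two-value structure of the covariance: all within-block off-diagonal entries equal a common $\Delta$, all across-block entries a common $\Omega$, with $\Delta > \Omega$ when $\alpha < \beta$:

```python
import itertools
import math
import numpy as np

p, alpha, beta = 6, 0.2, 1.0   # arbitrary illustrative parameters
m = p // 2

Z = 0.0
Sigma = np.zeros((p, p))
for s in itertools.product([-1, 1], repeat=p):
    sa = np.array(s, dtype=float)
    mu_S, mu_T = sa[:m].mean(), sa[m:].mean()
    # Hamiltonian in magnetization form, as in (4.1).
    H = -(m / 4) * (2 * alpha * mu_S * mu_T + beta * (mu_S**2 + mu_T**2))
    w = math.exp(-H)
    Z += w
    Sigma += w * np.outer(sa, sa)
Sigma /= Z

Delta, Omega = Sigma[0, 1], Sigma[0, m]   # within- and across-block entries
```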

4.1. Free energy. Recall that $H^{\mathrm{ibm}}_{\alpha,\beta}$ denotes the IBM Hamiltonian (or "energy") defined on $\{-1,1\}^p$ by

$$H^{\mathrm{ibm}}_{\alpha,\beta}(\sigma) = -\frac{\beta}{2p}\sum_{i\sim j}\sigma_i\sigma_j - \frac{\alpha}{2p}\sum_{i\nsim j}\sigma_i\sigma_j\,,$$

so that

$$f_{\alpha,\beta}(\sigma) = \frac{e^{-H^{\mathrm{ibm}}_{\alpha,\beta}(\sigma)}}{Z_{\alpha,\beta}}\,.$$

As noted above, the density $f_{\alpha,\beta}$ assigns the same probability to configurations that have the same magnetization structure. It follows from elementary computations that

$$(4.1)\qquad H^{\mathrm{ibm}}_{\alpha,\beta}(\sigma) = -\frac{m}{4}\Big(2\alpha\,\mu_S\mu_{\bar S} + \beta\big(\mu_S^2 + \mu_{\bar S}^2\big)\Big)\,,$$

where we recall that $m = p/2$. Moreover, the number of configurations $\sigma$ with local magnetizations $\mu = (\mu_S, \mu_{\bar S}) \in [-1,1]^2$ is given by

$$\binom{m}{m\,\frac{\mu_S+1}{2}}\binom{m}{m\,\frac{\mu_{\bar S}+1}{2}}\,.$$

This quantity can be approximated using Stirling's formula (see Lemma C.2): for any $\mu \in (-1+\varepsilon, 1-\varepsilon)$, there exist two positive constants $c, \bar c$ such that

$$\frac{c}{\sqrt{m}}\,e^{m\,h\left(\frac{\mu+1}{2}\right)} \ \le\ \binom{m}{m\,\frac{\mu+1}{2}} \ \le\ \frac{\bar c}{\sqrt{m}}\,e^{m\,h\left(\frac{\mu+1}{2}\right)}\,,\qquad \forall\, m \ge 1\,,$$

where $h : [0,1] \to \mathbb{R}$ is the binary entropy function defined by $h(0) = h(1) = 0$ and, for any $s \in (0,1)$, by

$$h(s) = -s\log(s) - (1-s)\log(1-s)\,.$$


Thus, the IBM induces a marginal distribution on the local magnetizations that has density

$$(4.2)\qquad \ell_m(\mu_S,\mu_{\bar S}) = \frac{c_m}{m\,Z_{\alpha,\beta}}\exp\Big[-\frac{m}{4}\,g(\mu_S,\mu_{\bar S})\Big]\,,$$

where $c^2 \le c_m \le \bar c^2$ and

$$(4.3)\qquad g(\mu_S,\mu_{\bar S}) = -2\alpha\,\mu_S\mu_{\bar S} - \beta\big(\mu_S^2+\mu_{\bar S}^2\big) - 4h\Big(\frac{\mu_S+1}{2}\Big) - 4h\Big(\frac{\mu_{\bar S}+1}{2}\Big)\,.$$

Note that the support of this density is implicitly the set of possible values for pairs of local magnetizations of vectors in $\{-1,1\}^p$, that is, the set $\mathcal{M}^2$, where

$$(4.4)\qquad \mathcal{M} := \Big\{\frac{s^\top\mathbf{1}_{[m]}}{m}\,,\ s \in \{-1,1\}^m\Big\} \subset [-1,1]\,.$$

We call the function $g$ the free energy of the Ising blockmodel; its structure of minima is known to control the behavior of the system. Indeed, let $g^*$ denote the minimum value of $g$ over $\mathcal{M}^2$. It follows from (4.2) that any local magnetization $(\mu_S,\mu_{\bar S}) \in \mathcal{M}^2$ such that $g(\mu_S,\mu_{\bar S}) > g^*$ has a probability exponentially smaller than any magnetization that minimizes $g$ over $\mathcal{M}^2$. Intuitively, this results in a distribution that is concentrated around its modes. Before quantifying this effect, we study the minima, known as ground states, of the free energy $g$ when it is defined over the continuum $[-1,1]^2$.

4.2. Ground states. Recall that when $\alpha = \beta$, the block structure vanishes and the IBM reduces to the well-known Curie–Weiss model. We gather in the supplementary material the facts about the Curie–Weiss model that we use in this section.

The following proposition characterizes the ground states of the Ising blockmodel. For any $p \in [1,\infty]$, we denote by $\|\cdot\|_p$ the $\ell_p$ norm of $\mathbb{R}^2$ and by $B_p = \{x \in \mathbb{R}^2 : \|x\|_p \le 1\}$ the unit ball with respect to that norm.

Proposition 4.1. For any $b \in \mathbb{R}$, let $\pm x(b) \in (-1,1)$, $x(b) \ge 0$, denote the ground state(s) of the Curie–Weiss model with inverse temperature $b$. The free energy $g_{\alpha,\beta}$ of the IBM defined in (4.3) has the following minima. If $\beta+|\alpha| \le 2$, then $g_{\alpha,\beta}$ has a unique minimum, at $(0,0)$. If $\beta+|\alpha| > 2$, then three cases arise:

1. If $\alpha = 0$, then $g_{\alpha,\beta}$ has four minima, at $\big(\pm x(\beta/2), \pm x(\beta/2)\big)$.
2. If $\alpha > 0$, then $g_{\alpha,\beta}$ has two minima, at $s = \big(x\big(\tfrac{\beta+\alpha}{2}\big), x\big(\tfrac{\beta+\alpha}{2}\big)\big)$ and $-s$.
3. If $\alpha < 0$, then $g_{\alpha,\beta}$ has two minima, at $s = \big(x\big(\tfrac{\beta-\alpha}{2}\big), -x\big(\tfrac{\beta-\alpha}{2}\big)\big)$ and $-s$.


In particular, for all values of the parameters $\alpha$ and $\beta$, all ground states $(\bar x, \bar y)$ satisfy $\bar x^2 = \bar y^2 < 1$.

This result is illustrated in Figure 2, composed of contour plots of the free energy $g_{\alpha,\beta}$ on the square $[-1,1]^2$ for several values of the parameters. The different regions are also represented in Figure 3 below.

Figure 2: Contour plots of the values of the free energy $g_{\alpha,\beta}$, with higher values in red and lower values in blue, the latter corresponding to ground states. Top row: several choices of $\alpha < 0$ ($\alpha = -6, -2.5, -0.5, 0$) with $\beta = 1.5 < 2$. Bottom row: several choices of $\alpha < 0$ ($\alpha = -4, -0.9, -0.2, 0$) with $\beta = 2.5 > 2$. The same plots with $\alpha > 0$ can be obtained by a $90^\circ$ rotation, by symmetry of the function.
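The case structure of Proposition 4.1 can also be checked by minimizing $g$ on a grid. In the sketch below (a minimal numerical check; the parameter choices are arbitrary), for $\alpha = -1$, $\beta = 2$ (region (I)) the grid minimizer is close to $(\bar x, -\bar x)$ with $\bar x$ the Curie–Weiss ground state solving $x = \tanh\big(\tfrac{\beta-\alpha}{2}x\big)$, while for $\beta+|\alpha| < 2$ it sits at the origin:

```python
import numpy as np

def h(s):
    """Binary entropy."""
    return -s * np.log(s) - (1 - s) * np.log(1 - s)

def g(x, y, a, b):
    """Free energy (4.3) of the IBM."""
    return -2 * a * x * y - b * (x**2 + y**2) - 4 * h((x + 1) / 2) - 4 * h((y + 1) / 2)

def grid_argmin(a, b, step=0.005):
    xs = np.arange(-0.995, 0.9951, step)
    X, Y = np.meshgrid(xs, xs)
    G = g(X, Y, a, b)
    i, j = np.unravel_index(np.argmin(G), G.shape)
    return X[i, j], Y[i, j]

# Region (I): alpha < 0 and beta + |alpha| = 3 > 2.
x1, y1 = grid_argmin(-1.0, 2.0)

# Curie-Weiss ground state for inverse temperature (beta - alpha)/2 = 1.5.
xbar = 0.9
for _ in range(200):
    xbar = np.tanh(1.5 * xbar)

# Region (II): beta + |alpha| = 1.3 < 2, unique ground state at the origin.
x2, y2 = grid_argmin(0.3, 1.0)
```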

Proof. Throughout this proof, for any $b \in \mathbb{R}$, we denote by $g^{\mathrm{cw}}_b(x)$, $x \in [-1,1]$, the free energy of the Curie–Weiss model with inverse temperature $b$, and we simply write $g := g_{\alpha,\beta}$ for the free energy of the IBM. Note that

$$(4.5)\qquad g(x,y) = g^{\mathrm{cw}}_{\frac{\beta+\alpha}{2}}(x) + g^{\mathrm{cw}}_{\frac{\beta+\alpha}{2}}(y) + \alpha(x-y)^2\,.$$

We split our analysis according to the sign of $\alpha$. Note first that if $\alpha = 0$, we have

$$g(x,y) = g^{\mathrm{cw}}_{\frac{\beta}{2}}(x) + g^{\mathrm{cw}}_{\frac{\beta}{2}}(y)\,.$$

It yields that:


• If $\beta \le 2$, then $g^{\mathrm{cw}}_{\beta/2}$ has a unique local minimum, at $x = 0$, which implies that $g$ has a unique minimum at $(0,0)$.
• If $\beta > 2$, then $g^{\mathrm{cw}}_{\beta/2}$ has exactly two minima, at $x(\beta/2)$ and $-x(\beta/2)$, where $x(\beta/2) \in (-1,1)$. It implies that $g$ has four minima, at $\big(\pm x(\beta/2), \pm x(\beta/2)\big)$.

Next, if $\alpha > 0$, in view of (4.5) we have

$$g(x,y) \ \ge\ g^{\mathrm{cw}}_{\frac{\beta+\alpha}{2}}(x) + g^{\mathrm{cw}}_{\frac{\beta+\alpha}{2}}(y)\,,$$

with equality if and only if $x = y$. It follows that:

• If $\alpha+\beta \le 2$, then $g$ has a unique minimum, at $(0,0)$.
• If $\alpha+\beta > 2$, then $g$ has two minima, at $\big(x\big(\tfrac{\beta+\alpha}{2}\big), x\big(\tfrac{\beta+\alpha}{2}\big)\big)$ and at $\big(-x\big(\tfrac{\beta+\alpha}{2}\big), -x\big(\tfrac{\beta+\alpha}{2}\big)\big)$.

Finally, note that $(x-y)^2 \le 2x^2 + 2y^2$, with equality if and only if $x = -y$. Thus, if $\alpha < 0$, in view of (4.5) we have

$$(4.6)\qquad g(x,y) \ \ge\ g^{\mathrm{cw}}_{\frac{\beta-\alpha}{2}}(x) + g^{\mathrm{cw}}_{\frac{\beta-\alpha}{2}}(y)\,,$$

with equality if and only if $x = -y$. It implies that:

• If $\beta-\alpha \le 2$, then $g$ has a unique minimum, at $(0,0)$.
• If $\beta-\alpha > 2$, then $g$ has two minima, at $\big(x\big(\tfrac{\beta-\alpha}{2}\big), -x\big(\tfrac{\beta-\alpha}{2}\big)\big)$ and at $\big(-x\big(\tfrac{\beta-\alpha}{2}\big), x\big(\tfrac{\beta-\alpha}{2}\big)\big)$.

Using the localization of the ground states from Lemma A.1, we also get the following local and global behaviors of the free energy of the IBM around the ground states.

Lemma 4.2. Assume that $\beta+|\alpha| \ne 2$. Denote by $(\bar x,\bar y)$ any ground state of the Ising blockmodel and recall that $\bar x^2 = \bar y^2$. Then the following holds:

1. The Hessian $H_{\alpha,\beta}$ of $g_{\alpha,\beta}$ at $(\bar x,\bar y)$ is given by

$$H_{\alpha,\beta} = -2\begin{pmatrix}\beta & \alpha\\ \alpha & \beta\end{pmatrix} + \frac{4}{1-\bar x^2}\,I_2\,.$$

In particular, $H_{\alpha,\beta}$ has eigenvalues $2(\alpha-\beta) + 4/(1-\bar x^2)$ and $-2(\alpha+\beta) + 4/(1-\bar x^2)$, associated with the eigenvectors $(1,-1)$ and $(1,1)$ respectively.

Figure 1: The phase diagram of the Ising blockmodel, with three regions for the parameters $\alpha$ and $\beta > 0$. In region (I), where $\alpha < 0$ and $\beta+|\alpha| > 2$, there are two ground states, of the form $(\bar x,-\bar x)$ and $(-\bar x,\bar x)$, $\bar x > 0$. In region (II), where $\beta+|\alpha| < 2$, there is one ground state, at $(0,0)$. In region (III), where $\alpha > 0$ and $\beta+|\alpha| > 2$, there are two ground states, of the form $(\bar x,\bar x)$ and $(-\bar x,-\bar x)$, $\bar x > 0$. At the boundary between regions (I) and (III), there are four ground states. The dotted line has equation $\alpha = \beta$; we only consider parameters in the region to its left, where $\beta > \alpha$.


Figure 3: Phase diagram of the Ising blockmodel, with three regions for $\alpha$ and $\beta > 0$. In region (I), where $\alpha < 0$ and $\beta+|\alpha| > 2$, there are two ground states, of the form $(\bar x,-\bar x)$ and $(-\bar x,\bar x)$. In region (II), where $\beta+|\alpha| < 2$, there is one ground state, at $(0,0)$. In region (III), where $\alpha > 0$ and $\beta+|\alpha| > 2$, there are two ground states, of the form $(\bar x,\bar x)$ and $(-\bar x,-\bar x)$. The dotted line has equation $\alpha = \beta$; we only consider parameters in the region to its left.

2. There exist positive constants $\bar\varepsilon = \bar\varepsilon(\beta+|\alpha|)$ and $\kappa^2 = \kappa^2(\beta+|\alpha|)$ such that the following holds. For any $(x,y) \in (-1,1)^2$, we have

$$(4.7)\qquad g(x,y) \ \ge\ g(\bar x,\bar y) + \frac{\kappa^2}{2}\Big(\big\|(x,y)-(\bar x,\bar y)\big\|_\infty \wedge \bar\varepsilon\Big)^2\,.$$

Moreover, if $\beta+|\alpha| > 2$, we can take $\bar\varepsilon = e^{-2(\beta+|\alpha|)}\,\dfrac{\beta+|\alpha|-2}{4(\beta+|\alpha|)}$ and $\kappa^2 = 1 - \dfrac{2}{\beta+|\alpha|}$; if $\beta+|\alpha| < 2$, we can take $\bar\varepsilon = \sqrt{\big(2-(\beta+|\alpha|)\big)/6}$ and $\kappa^2 = 2-(\beta+|\alpha|)$.
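The Hessian formula in part 1 can be validated against finite differences. A minimal check (the parameters $\alpha = -1$, $\beta = 2$ are an arbitrary choice in region (I); the ground state is obtained as the Curie–Weiss fixed point $x = \tanh\big(\tfrac{\beta-\alpha}{2}x\big)$):

```python
import numpy as np

a, b = -1.0, 2.0   # arbitrary parameters with beta + |alpha| > 2

def h(s):
    return -s * np.log(s) - (1 - s) * np.log(1 - s)

def g(x, y):
    return -2 * a * x * y - b * (x**2 + y**2) - 4 * h((x + 1) / 2) - 4 * h((y + 1) / 2)

# Ground state (xbar, -xbar), with xbar the Curie-Weiss fixed point for b' = (b - a)/2.
xbar = 0.9
for _ in range(200):
    xbar = np.tanh(((b - a) / 2) * xbar)
x0, y0 = xbar, -xbar

# Central finite differences for the Hessian of g at the ground state.
e = 1e-4
Hxx = (g(x0 + e, y0) - 2 * g(x0, y0) + g(x0 - e, y0)) / e**2
Hyy = (g(x0, y0 + e) - 2 * g(x0, y0) + g(x0, y0 - e)) / e**2
Hxy = (g(x0 + e, y0 + e) - g(x0 + e, y0 - e)
       - g(x0 - e, y0 + e) + g(x0 - e, y0 - e)) / (4 * e**2)
H_num = np.array([[Hxx, Hxy], [Hxy, Hyy]])

# Closed form from Lemma 4.2.
H_formula = -2 * np.array([[b, a], [a, b]]) + (4 / (1 - xbar**2)) * np.eye(2)
```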

Proof. Elementary calculus directly yields

$$H_{\alpha,\beta} = \begin{pmatrix} -2\beta + \dfrac{4}{1-\bar x^2} & -2\alpha\\[2pt] -2\alpha & -2\beta + \dfrac{4}{1-\bar y^2}\end{pmatrix}\,.$$

Moreover, it follows from Proposition 4.1 that all ground states satisfy $\bar x^2 = \bar y^2$. This completes the proof of the first point.


We now turn to the proof of the second point and split the analysis into five cases: (i) $\alpha \ge 0$ and $\beta+\alpha < 2$; (ii) $\alpha > 0$ and $\beta+\alpha > 2$; (iii) $\alpha < 0$ and $\beta-\alpha < 2$; (iv) $\alpha < 0$ and $\beta-\alpha > 2$; (v) $\alpha = 0$ and $\beta > 2$.

Case (i): $\alpha \ge 0$ and $\beta+\alpha < 2$. Recall that in this case $g$ has a unique minimum, at $(0,0)$. Therefore, in view of (4.5) and Lemma A.1, we have

$$g(x,y) - g(0,0) = g^{\mathrm{cw}}_{\frac{\beta+|\alpha|}{2}}(x) - g^{\mathrm{cw}}_{\frac{\beta+|\alpha|}{2}}(0) + g^{\mathrm{cw}}_{\frac{\beta+|\alpha|}{2}}(y) - g^{\mathrm{cw}}_{\frac{\beta+|\alpha|}{2}}(0) + \alpha(x-y)^2$$
$$\ge \frac{1}{2}\big(2-(\beta+|\alpha|)\big)\Big[\big(|x-0|\wedge\varepsilon_0\big)^2 + \big(|y-0|\wedge\varepsilon_0\big)^2\Big] \ \ge\ \frac{1}{2}\big(2-(\beta+|\alpha|)\big)\Big(\big\|(x,y)-(0,0)\big\|_\infty\wedge\varepsilon_0\Big)^2\,,$$

where $\varepsilon_0 = \sqrt{\big(2-(\beta+|\alpha|)\big)/6}$, which concludes this case.

Case (ii): $\alpha > 0$ and $\beta+\alpha > 2$. Recall that in this case $g$ has two minima, denoted generically by $(\bar x,\bar y)$, where $\bar x = \bar y$. Therefore, in view of (4.5) and Lemma A.1, we have

$$g(x,y) - g(\bar x,\bar y) = g^{\mathrm{cw}}_{\frac{\beta+|\alpha|}{2}}(x) - g^{\mathrm{cw}}_{\frac{\beta+|\alpha|}{2}}(\bar x) + g^{\mathrm{cw}}_{\frac{\beta+|\alpha|}{2}}(y) - g^{\mathrm{cw}}_{\frac{\beta+|\alpha|}{2}}(\bar y) + \alpha(x-y)^2$$
$$\ge \frac{1}{2}\Big(1-\frac{2}{\beta+|\alpha|}\Big)\Big[\big(|x-\bar x|\wedge\bar\varepsilon\big)^2 + \big(|y-\bar y|\wedge\bar\varepsilon\big)^2\Big] \ \ge\ \frac{1}{2}\Big(1-\frac{2}{\beta+|\alpha|}\Big)\Big(\big\|(x,y)-(\bar x,\bar y)\big\|_\infty\wedge\bar\varepsilon\Big)^2\,,$$

where $\bar\varepsilon = e^{-2(\beta+|\alpha|)}\,\dfrac{\beta+|\alpha|-2}{4(\beta+|\alpha|)}$, which concludes this case.

Case (iii): $\alpha < 0$ and $\beta-\alpha < 2$. Recall that in this case $g$ has a unique minimum, at $(0,0)$. Moreover, in view of (4.6) and Lemma A.1, it holds that

$$g(x,y) - g(0,0) \ \ge\ g^{\mathrm{cw}}_{\frac{\beta+|\alpha|}{2}}(x) - g^{\mathrm{cw}}_{\frac{\beta+\alpha}{2}}(0) + g^{\mathrm{cw}}_{\frac{\beta+|\alpha|}{2}}(y) - g^{\mathrm{cw}}_{\frac{\beta+\alpha}{2}}(0)$$
$$=\ g^{\mathrm{cw}}_{\frac{\beta+|\alpha|}{2}}(x) - g^{\mathrm{cw}}_{\frac{\beta+|\alpha|}{2}}(0) + g^{\mathrm{cw}}_{\frac{\beta+|\alpha|}{2}}(y) - g^{\mathrm{cw}}_{\frac{\beta+|\alpha|}{2}}(0) \ \ge\ \frac{1}{2}\big(2-(\beta+|\alpha|)\big)\Big(\big\|(x,y)-(0,0)\big\|_\infty\wedge\varepsilon_0\Big)^2\,,$$

where in the second step we used the fact that

$$g^{\mathrm{cw}}_{\frac{\beta+\alpha}{2}}(0) = g^{\mathrm{cw}}_{\frac{\beta+|\alpha|}{2}}(0) = -4h(1/2)\,,$$

and we concluded as in Case (i).


Case (iv): $\alpha < 0$ and $\beta-\alpha > 2$. Recall that in this case $g$ has two minima, denoted generically by $(\bar x,\bar y)$, where $\bar x = -\bar y$. Therefore, in view of (4.5) and (4.6), we have

$$g(x,y) - g(\bar x,\bar y) \ \ge\ g^{\mathrm{cw}}_{\frac{\beta+|\alpha|}{2}}(x) - g^{\mathrm{cw}}_{\frac{\beta+\alpha}{2}}(\bar x) + g^{\mathrm{cw}}_{\frac{\beta+|\alpha|}{2}}(y) - g^{\mathrm{cw}}_{\frac{\beta+\alpha}{2}}(-\bar x) - 4\alpha\bar x^2\,.$$

Next, observe that, from the definition (A.1) of the free energy of the Curie–Weiss model, we have

$$-g^{\mathrm{cw}}_{\frac{\beta+\alpha}{2}}(\bar x) - g^{\mathrm{cw}}_{\frac{\beta+\alpha}{2}}(-\bar x) - 4\alpha\bar x^2 = -g^{\mathrm{cw}}_{\frac{\beta+|\alpha|}{2}}(\bar x) - g^{\mathrm{cw}}_{\frac{\beta+|\alpha|}{2}}(-\bar x)\,.$$

The above two displays yield

$$g(x,y) - g(\bar x,\bar y) \ \ge\ g^{\mathrm{cw}}_{\frac{\beta+|\alpha|}{2}}(x) - g^{\mathrm{cw}}_{\frac{\beta+|\alpha|}{2}}(\bar x) + g^{\mathrm{cw}}_{\frac{\beta+|\alpha|}{2}}(y) - g^{\mathrm{cw}}_{\frac{\beta+|\alpha|}{2}}(-\bar x) \ \ge\ \frac{1}{2}\Big(1-\frac{2}{\beta+|\alpha|}\Big)\Big(\big\|(x,y)-(\bar x,\bar y)\big\|_\infty\wedge\bar\varepsilon\Big)^2\,,$$

where we concluded as in Case (ii).

Case (v): $\alpha = 0$ and $\beta > 2$. This case can be handled by combining the arguments made in Cases (ii) and (iv), but it is easier to proceed directly. Recall that in this case there are four distinct minima of $g$, of the form $(\pm\bar x, \pm\bar x)$, where $\bar x$ is the unique positive minimum of $g^{\mathrm{cw}}_{\beta/2}$. For any such minimum $(\bar x,\bar y)$, we have, in view of (4.5) followed by an application of Lemma A.1,

$$g(x,y) - g(\bar x,\bar y) = g^{\mathrm{cw}}_{\frac{\beta}{2}}(x) - g^{\mathrm{cw}}_{\frac{\beta}{2}}(\bar x) + g^{\mathrm{cw}}_{\frac{\beta}{2}}(y) - g^{\mathrm{cw}}_{\frac{\beta}{2}}(\bar y)$$
$$\ge \frac{1}{2}\Big(1-\frac{2}{\beta}\Big)\Big[\big(|x-\bar x|\wedge\bar\varepsilon\big)^2 + \big(|y-\bar y|\wedge\bar\varepsilon\big)^2\Big] \ \ge\ \frac{1}{2}\Big(1-\frac{2}{\beta}\Big)\Big(\big\|(x,y)-(\bar x,\bar y)\big\|_\infty\wedge\bar\varepsilon\Big)^2\,,$$

where $\bar\varepsilon = e^{-2\beta}\,\dfrac{\beta-2}{4\beta}$, which concludes this case.

4.3. Concentration. As mentioned above, quantities of the form $\mathbb{E}_{\alpha,\beta}[\varphi(\sigma)]$ cannot, in general, be computed explicitly in the IBM. Fortunately, it will be sufficient for us to compute quantities of the form $\mathbb{E}_{\alpha,\beta}[\varphi(\mu)]$, where we recall that $\mu = (\mu_S, \mu_{\bar S})$ denotes the pair of local magnetizations of a random configuration $\sigma \in \{-1,1\}^p$ drawn according to $\mathbb{P}_{\alpha,\beta}$. While exact computation is still a hard problem, these quantities can be well approximated using the fact that $\mathbb{P}_{\alpha,\beta}$ is highly concentrated around its ground states for large enough $p$.


To leverage concentration, we need to consider the "large $m$" (or, equivalently, "large $p$") asymptotic framework. As a result, it will be convenient to write, for two sequences $a_m, b_m$, $a_m \simeq_m b_m$ if $a_m = (1+o_m(1))b_m$.

Our main result hinges on the following theorem, which compares the distribution of $\mu = (\mu_S,\mu_{\bar S}) \in [-1,1]^2$ to a certain mixture of Gaussians centered at the ground states.

Theorem 4.3. Consider the IBM with parameters $\alpha, \beta$ such that $\beta+|\alpha| \ne 2$. Let $\varphi : \mathbb{R}^2 \to \mathbb{R}_{\ge 0}$ be a non-negative test function such that $\varphi([-1,1]^2) \subseteq [0,1]$ and for which there exists a positive constant $\gamma < 3/2$ such that, for any ground state $s$,

$$(4.8)\qquad \mathbb{E}\Big[\varphi\Big(s + \frac{2}{\sqrt{m}}\,H^{-1/2}Z\Big)\Big] \ \ge\ C\,m^{-\gamma}\,,$$

where $Z \sim \mathcal{N}_2(0, I_2)$ and $H = H_{\alpha,\beta}$ denotes the Hessian of the free energy $g_{\alpha,\beta}$ at $s$. Assume further that there exist $D, \epsilon > 0$ and a positive integer $d$ such that $\varphi(x) \le D + D\|x\|_2^{2d}$ for all $x \in \mathbb{R}^2$, and such that, in the $\epsilon$-size $\ell_\infty$-neighborhood $\eta$ of each ground state $s$, $\varphi$ satisfies one of the following two regularity conditions:

1. There exist positive constants $C_1, C_2$ such that $0 < C_1 < \varphi(x)$ and $\varphi$ is $C_2$-Lipschitz in $\eta$, in the sense that $|\varphi(x)-\varphi(y)| \le C_2\|x-y\|_\infty$ for all $x, y \in \eta$; or
2. There exists a positive constant $C_1$ such that

$$|\varphi(x)-\varphi(y)| \le C_1\,\max\big(\|x-s\|_\infty, \|y-s\|_\infty\big)\cdot\|x-y\|_\infty$$

for all $x, y \in \eta$.

Then

$$\mathbb{E}_{\alpha,\beta}[\varphi(\mu)] \ \simeq_m\ \frac{1}{|G|}\sum_{s\in G}\mathbb{E}\Big[\varphi\Big(s + \frac{2}{\sqrt{m}}\,H^{-1/2}Z\Big)\Big]\,,$$

where $G \subset \{(\pm\bar x, \pm\bar x)\}$ denotes the set of ground states of the IBM.

Applying this result to the covariance statistic yields the following.

Proposition 4.4. Let $\Delta$ and $\Omega$ be defined as in Lemma 2.1 and recall that $G$ denotes the set of ground states of the IBM with parameters $\alpha, \beta$ such that $\beta+|\alpha| \ne 2$. Then

$$\Delta - \Omega = \frac{1}{m}\cdot\frac{(\beta-\alpha)(1-\bar x^2)^2}{2-(\beta-\alpha)(1-\bar x^2)} + \frac{1+o_m(1)}{2|G|}\sum_{(x,y)\in G}(x-y)^2 + o_m\Big(\frac{1}{m}\Big)\,,$$

where $\bar x$ is the $x$-coordinate of an arbitrary $(x,y) \in G$. In particular:

• If $\beta+|\alpha| < 2$, then $\Delta-\Omega \simeq_m \dfrac{1}{m}\cdot\dfrac{\beta-\alpha}{2-(\beta-\alpha)}$, which is positive when $\alpha < \beta$.
• If $\beta+|\alpha| > 2$, then three cases arise:

1. if $\alpha = 0$, then $\Delta-\Omega \simeq_m \bar x^2$;
2. if $\alpha > 0$, then $\Delta-\Omega \simeq_m \dfrac{1}{m}\cdot\dfrac{(\beta-\alpha)(1-\bar x^2)^2}{2-(\beta-\alpha)(1-\bar x^2)}$, which is positive when $\alpha < \beta$;
3. if $\alpha < 0$, then $\Delta-\Omega \simeq_m 2\bar x^2$.
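The $\alpha < 0$ prediction can be checked against an exact computation at moderate $m$ by summing over magnetization levels with binomial weights (in log-space to avoid overflow). This is a minimal numerical illustration; the parameter values $m = 100$, $\alpha = -1$, $\beta = 2$ are arbitrary:

```python
import numpy as np
from math import lgamma

m, a, b = 100, -1.0, 2.0   # arbitrary parameters with alpha < 0 and beta - alpha > 2

def logbinom(n, k):
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

mus = 2 * np.arange(m + 1) / m - 1                      # attainable magnetization levels
lb = np.array([logbinom(m, k) for k in range(m + 1)])   # log-counts per level

# Log-weight of (mu_S, mu_Sbar): counting term plus (m/4)(2a mu mu' + b(mu^2 + mu'^2)).
MU1, MU2 = np.meshgrid(mus, mus, indexing="ij")
LW = lb[:, None] + lb[None, :] + (m / 4) * (2 * a * MU1 * MU2 + b * (MU1**2 + MU2**2))
W = np.exp(LW - LW.max())
W /= W.sum()

E_mu2 = float(np.sum(W * MU1**2))
E_cross = float(np.sum(W * MU1 * MU2))
Delta = (m * E_mu2 - 1) / (m - 1)
Omega = E_cross
gap = Delta - Omega

# Ground-state coordinate: fixed point of x = tanh(((b - a)/2) x).
xbar = 0.9
for _ in range(300):
    xbar = np.tanh(((b - a) / 2) * xbar)
```

At $m = 100$ the exact gap is already within a few percent of the asymptotic prediction $2\bar x^2$.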

The proofs of Theorem 4.3 and Proposition 4.4 are deferred to the supplementary material. It follows from Proposition 4.4 that if $\beta+|\alpha| \ne 2$, then the covariance matrix $\Sigma$ takes two values that are separated by a gap of order at least $1/m$, and sometimes even of order $1$. This result makes it possible to express the theoretical guarantees on the sample size of Sections 3.2 and 3.3 in terms of $p$, $\beta$, and $\alpha$ only.

Remark 4.5. Note that Theorem 4.3, and hence Proposition 4.4, does not cover the case $\beta+|\alpha| = 2$, which includes the boundary along which the phase transitions in the model and in the sample complexity occur. The technical reason for this is that the Hessian of the free energy at the ground state becomes singular on this boundary. Thus, the free energy is no longer approximately quadratic in the vicinity of the ground states, and the Gaussian behavior exhibited in Theorem 4.3 is lost.

A similar subtlety arises at the critical boundary in many related contexts. An example is the recent identification, following a long line of work, of phase transitions in the hard-core lattice gas and the antiferromagnetic Ising model with a complexity-theoretic phase transition (Li et al., 2013; Sinclair et al., 2014; Sly and Sun, 2014; Weitz, 2006), where the behavior of the problem on the critical boundary is still not fully resolved. We leave the exact determination of the behavior of the IBM at the phase transition boundary to future work.

4.4. Rates of exact recovery. Combining Proposition 4.4, which quantifies the gap $\Delta-\Omega$ in terms of the dimension $p$, with Theorem 3.3 readily yields the following corollary.

Corollary 4.6. Let $\beta$ and $\alpha$ be parameters of the IBM such that $\beta+|\alpha| \ne 2$ and $\alpha < \beta$. There exist positive constants $C_1$ and $C_2$, depending on $\alpha$ and $\beta$, such that the following holds. The SDP relaxation (3.4) recovers the block structure $(S,\bar S)$ exactly with probability $1-\delta$ whenever

1. $n \ge C_1\, p\log(p/\delta)$ if $\beta+|\alpha| < 2$ or $\alpha > 0$;
2. $n \ge C_2\, \log(p/\delta)$ otherwise.

In particular, if $\beta-\alpha > 2$ and $\alpha \le 0$, a number of observations that is logarithmic in the dimension $p$ is sufficient to recover the blocks exactly.

These results suggest that there is a sharp phase transition in the sample complexity of this problem, depending on the values of the parameters $\alpha$ and $\beta$. We address this question further in Section 5. As noted above, our upper and lower bounds on the sample complexity (in Theorems 3.3 and 3.7, respectively) match up to a numerical constant, and the sample complexity stated in Corollary 4.6 has optimal dependence on the dimension $p$. Note that past work on exact recovery in the stochastic blockmodel (Abbe et al., 2016; Hajek et al., 2016) has shown that some SDP relaxations are also optimal with respect to constants. We do not pursue this question in the present paper.

5. Conclusion and open problems. This paper introduces the Ising blockmodel (IBM) for large binary random vectors with an underlying cluster structure. In this model, we studied the sample complexity of recovering the clusters exactly. Unsurprisingly, this paper bears similarities with the stochastic blockmodel, but also differences. For example, in the stochastic blockmodel one is given only one observation of the graph. In the IBM, given one realization $\sigma(1) \in \{-1,1\}^p$, the maximum likelihood estimator is the trivial clustering that assigns $i \in [p]$ to a cluster according to the sign of $\sigma_i(1)$, up to a trivial reassignment to keep the partition balanced.

Below is a summary of our main findings:

1. The model exhibits three phases, depending on the values taken by its two parameters.
2. In one phase, where the two clusters tend to have opposite behavior, the sample complexity is logarithmic in the dimension; in the other two, it is near-linear. These sample complexities are proved to be optimal in an information-theoretic sense.
3. Akin to the stochastic blockmodel, the optimal sample complexity is achieved using the natural semidefinite relaxation of the MAXCUT problem.

Many questions regarding this model remain open. The first and most natural is the determination of exact constants. Theorems 3.3 and 3.7 suggest that there exists a universal constant $C^\star$ such that the optimal sample complexity is

$$C^\star\,\frac{\log(p)}{(\beta-\alpha)(\Delta-\Omega)}\,(1+o_p(1))\,.$$

Throughout this paper, we have only loosely kept track of the correct dependency of the constants in the upper and lower bounds as functions of the parameters $(\alpha,\beta)$. We have shown that the optimal sample complexity is the product of $\log(p)/(\Delta-\Omega)$ and of a constant term that only becomes arbitrarily large when $\alpha$ is arbitrarily close to $\beta$, with a divergence of order $(\beta-\alpha)^{-1}$, which is consistent with our lower bound. In the spirit of exact thresholds for the stochastic blockmodel (Abbe et al., 2016; Massoulie, 2014; Mossel et al., 2015), we find proving the existence of the constant $C^\star$ and computing it worthy of investigation, but this is beyond the scope of the present paper.

Another possible development is the extension of this model to settings with multiple blocks, possibly of unbalanced sizes. This has been studied in the case of the stochastic blockmodel for graphs in the sparse case (Abbe and Sandon, 2015; Banks et al., 2016) and in the dense case (Gao et al., 2015, 2016; Rohe et al., 2011), as well as in the case of clustering for Gaussian variables (Bunea et al., 2015, 2016). For the Ising blockmodel, the main challenge is that the population covariance matrix cannot be computed directly from the parameters of the problem, and an analysis of the ground states of the free energy is required. Developing a general approach to this task, rather than having to carry out an ad hoc analysis for each case, would be an important step in this direction. In another direction, a possible extension would be to study the impact of an external magnetic field $\mu$ on the Ising blockmodel. Problems related to the influence of a sparse external magnetic field in a Curie–Weiss model have recently been investigated (Mukherjee et al., 2016), and similar phenomena of phase transitions for sample complexities have been exhibited.

As pointed out in the introduction, the Ising model is in general related to a corresponding Glauber dynamics, a natural Markov chain with the desired Gibbs distribution as a stationary distribution. Exact recovery from dependent observations arising from the Glauber dynamics is a natural extension. Tools to study this question have been developed in Bresler et al. (2014).
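For concreteness, a single-site Glauber update for a generic Ising model with symmetric coupling matrix J can be sketched as follows; the Gibbs normalization exp(σᵀJσ/2) and the toy block couplings below are illustrative assumptions, not the paper's exact parametrization.

```python
import numpy as np

def glauber_step(sigma, J, rng):
    """One Glauber update: resample a uniformly chosen site from its
    conditional distribution under the Gibbs measure proportional to
    exp(sigma^T J sigma / 2)."""
    p = len(sigma)
    i = rng.integers(p)
    field = J[i] @ sigma - J[i, i] * sigma[i]       # local field at site i
    prob_plus = 1.0 / (1.0 + np.exp(-2.0 * field))  # P(sigma_i = +1 | rest)
    sigma[i] = 1 if rng.random() < prob_plus else -1
    return sigma

# Toy two-block coupling: stronger interaction within blocks than across
# (beta, alpha are hypothetical values chosen only for illustration).
p, beta, alpha = 6, 1.2, 0.4
s = np.array([1] * (p // 2) + [-1] * (p // 2))      # block labels
J = np.where(np.outer(s, s) > 0, beta, alpha) / p
np.fill_diagonal(J, 0.0)

rng = np.random.default_rng(1)
sigma = rng.choice([-1, 1], size=p)
for _ in range(10_000):
    glauber_step(sigma, J, rng)
```

Each update resamples one site from its exact conditional distribution, so the chain satisfies detailed balance with respect to the Gibbs measure; long runs of this chain produce the kind of dependent observations discussed above.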

We have only analyzed in this work the performance of the positive semidefinite relaxation of the maximum likelihood problem, but other methods can be considered for total or partial recovery. In related problems, belief propagation is used to recover communities (see e.g. Abbe and Sandon, 2016a,b; Lesieur et al., 2017; Moitra et al., 2016; Mossel et al., 2014, and work cited above). In particular, Lesieur et al. (2017) covers Hopfield models,


which are a generalization of our model. It is possible that studying these types of algorithms is necessary in order to obtain sharper rates.

Finally, in view of the simple spectral decomposition (3.3) of Σ, one may wonder about the behavior of a simple method that consists in computing the leading eigenvector of Σ̂ and clustering according to the signs of its entries. Such a method is the basis of the approach to denser graph models in McSherry (2001) or Alon et al. (1998). This approach is easily implemented as follows.

Let û denote a leading unit eigenvector of Σ̂ and consider the following estimate for the partition (S, S̄):

(5.1)  Ŝ = { i ∈ [p] : û_i > 0 }.
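In code, the spectral estimator (5.1) amounts to one eigendecomposition and a sign threshold. The sketch below uses a simple synthetic generator for ±1 observations whose covariance has the within-block/across-block pattern; the generator and its parameters are illustrative assumptions, not the Ising blockmodel's exact sampler.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, t = 40, 2000, 0.7

# True partition: the first half of the sites form S.
v_S = np.concatenate([np.ones(p // 2), -np.ones(p // 2)])

# Surrogate +/-1 data with block-structured covariance: a shared Gaussian
# factor aligned with v_S induces stronger within-block correlation.
X = np.sign(rng.standard_normal((n, p)) + t * rng.standard_normal((n, 1)) * v_S)

Sigma_hat = X.T @ X / n                       # empirical covariance
u_hat = np.linalg.eigh(Sigma_hat)[1][:, -1]   # leading unit eigenvector

S_hat = u_hat > 0                             # estimator (5.1)

S_true = v_S > 0
# Misclassification count minimized over the global label swap.
mis = min(np.sum(S_hat ^ S_true), np.sum(S_hat ^ ~S_true))
print(mis / p)
```

With enough samples relative to the eigengap, the printed misclassification proportion is small, in line with Proposition 5.1 below.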

It follows from the Perron-Frobenius theorem that Ŝ = S whenever the entries of û have the same signs as those of u_S := (1_S − 1_{S̄})/√p. This allows for perfect recovery of S, but it only holds with high probability when n is of order p log(p)/(γ − ω), which is suboptimal. It is however possible to obtain partial recovery guarantees for the spectral estimator. In order to state our result, for any two partitions (S, S̄), (T, T̄), define

  |S ⊖ T| := min( |S △ T|, |S △ T̄| ),

where △ denotes the symmetric difference.

Proposition 5.1. Fix δ ∈ (0, 1) and let Ŝ ⊂ [p] be defined in (5.1). Then there exists a constant C_{α,β} > 0 such that, with probability 1 − δ,

  (1/p) |Ŝ ⊖ S| ≤ C_{α,β} log(4p/δ) / (n(γ − ω)).

Proof. Let û denote the leading unit eigenvector of Σ̂ and let v̂ = √p û. Recall that v_S = 1_S − 1_{S̄} and observe that

  |Ŝ ⊖ S| = min( Σ_{i=1}^{p} 1I(v̂_i (v_S)_i ≤ 0), Σ_{i=1}^{p} 1I(v̂_i (v_S)_i ≥ 0) )
          ≤ min( ‖v̂ − v_S‖², ‖v̂ + v_S‖² )
          = p · min( ‖û − u_S‖², ‖û + u_S‖² ),

where in the inequality we used the fact that v_S ∈ {−1, 1}^p, so that

  1I(v̂_i (v_S)_i ≤ 0) ≤ |v̂_i − (v_S)_i| 1I(v̂_i (v_S)_i ≤ 0) ≤ |v̂_i − (v_S)_i|².


Using a variant of the Davis-Kahan lemma (see, e.g., Wang et al. (2016)), we get

  (1/p) |Ŝ ⊖ S| ≲ ‖Σ̂ − Σ‖²_op / (λ₁(Σ) − λ₂(Σ))²,

and the result follows readily from (3.7) and the fact that the eigengap of Σ is given by p(γ − ω)/2. □

In terms of exact recovery, this result is quite weak as it only gives guarantees for a sample complexity of the order of p log(p/δ)/(γ − ω), which is suboptimal by a factor of p. Moreover, for the bound of Proposition 5.1 to be non-trivial, one already needs the sample size to be of the same order as the one required for exact recovery by semidefinite programming. Nevertheless, Proposition 5.1 raises the question of the optimal rates of estimation of S with respect to the metric |Ŝ ⊖ S|/p. While partial recovery is beyond the scope of this paper, it would be interesting to establish the optimal rate.

Acknowledgements. The authors thank Andrea Montanari for pointing out a connection to the Hopfield model, and the anonymous referees for their very helpful input.

References.

Abbe, E. (2017). Community detection and stochastic block models: recent developments. arXiv:1703.10146.

Abbe, E., Bandeira, A. S. and Hall, G. (2016). Exact recovery in the stochastic block model. IEEE Transactions on Information Theory 62 471–487.

Abbe, E. and Sandon, C. (2015). Detection in the stochastic block model with multiple clusters: proof of the achievability conjectures, acyclic BP, and the information-computation gap. arXiv:1512.09080.

Abbe, E. and Sandon, C. (2016a). Achieving the KS threshold in the general stochastic block model with linearized acyclic belief propagation. In Advances in Neural Information Processing Systems 29 (D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon and R. Garnett, eds.). Curran Associates, Inc., 1334–1342.

Abbe, E. and Sandon, C. (2016b). Crossing the KS threshold in the stochastic block model with information theory. In Proceedings of the 2016 IEEE International Symposium on Information Theory (ISIT). IEEE, 840–844.


Alon, N., Krivelevich, M. and Sudakov, B. (1998). Finding a large hidden clique in a random graph. In Proceedings of the 1998 ACM-SIAM Symposium on Discrete Algorithms. SIAM, Philadelphia, PA, USA, 594–598.

Banerjee, O., El Ghaoui, L. and d'Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 9 485–516.

Banks, J., Moore, C., Neeman, J. and Netrapalli, P. (2016). Information-theoretic thresholds for community detection in sparse networks. arXiv:1601.02658.

Besag, J. (1986). On the statistical analysis of dirty pictures. J. Roy. Statist. Soc. Ser. B 48 259–302.

Boyd, S. and Vandenberghe, L. (2004). Convex optimization. Cambridge University Press, Cambridge.

Bresler, G. (2015). Efficiently learning Ising models on arbitrary graphs [extended abstract]. In Proceedings of the 2015 ACM Symposium on Theory of Computing. ACM, New York, 771–782.

Bresler, G., Gamarnik, D. and Shah, D. (2014). Learning graphical models from the Glauber dynamics. arXiv:1410.7659.

Bresler, G., Mossel, E. and Sly, A. (2008). Reconstruction of Markov random fields from samples: some observations and algorithms. In Approximation, Randomization and Combinatorial Optimization, vol. 5171 of Lecture Notes in Comput. Sci. Springer, Berlin, 343–356.

Bunea, F., Giraud, C. and Luo, X. (2015). Minimax optimal variable clustering in G-models via Cord. arXiv:1508.01939.

Bunea, F., Giraud, C., Royer, M. and Verzelen, N. (2016). PECOK: a convex optimization approach to variable clustering. arXiv:1606.05100.

Decelle, A., Krzakala, F., Moore, C. and Zdeborova, L. (2011). Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E 84 066106.

Diaconis, P., Goel, S. and Holmes, S. (2008). Horseshoes in multidimensional scaling and local kernel methods. Ann. Appl. Stat. 2 777–807.


Dyer, M. E. and Frieze, A. M. (1989). The solution of some random NP-hard problems in polynomial expected time. J. Algorithms 10 451–489.

Feige, U. and Krauthgamer, R. (2002). A polylogarithmic approximation of the minimum bisection. SIAM J. Comput. 31 1090–1118 (electronic).

Gao, C., Ma, Z., Zhang, A. Y. and Zhou, H. H. (2015). Achieving optimal misclassification proportion in stochastic block model. arXiv:1505.03772.

Gao, C., Ma, Z., Zhang, A. Y. and Zhou, H. H. (2016). Community detection in degree-corrected block models. arXiv:1607.06993.

Garey, M. R., Johnson, D. S. and Stockmeyer, L. (1976). Some simplified NP-complete graph problems. Theoret. Comput. Sci. 1 237–267.

Goemans, M. X. and Williamson, D. P. (1995). Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. Assoc. Comput. Mach. 42 1115–1145.

Hajek, B., Wu, Y. and Xu, J. (2016). Achieving exact cluster recovery threshold via semidefinite programming. IEEE Transactions on Information Theory 62 2788–2797.

Holland, P. W., Laskey, K. B. and Leinhardt, S. (1983). Stochastic blockmodels: first steps. Social Networks 5 109–137.

Ising, E. (1925). Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik 31 253–258.

Laurent, M. and Poljak, S. (1996). On the facial structure of the set of correlation matrices. SIAM Journal on Matrix Analysis and Applications 17 530–547.

Lauritzen, S. L. (1996). Graphical models, vol. 17 of Oxford Statistical Science Series. The Clarendon Press, Oxford University Press, New York. Oxford Science Publications.

Lauritzen, S. L. and Sheehan, N. A. (2003). Graphical models for genetic analyses. Statist. Sci. 18 489–514.


Lesieur, T., Krzakala, F. and Zdeborova, L. (2017). Constrained low-rank matrix estimation: phase transitions, approximate message passing and applications. arXiv:1701.00858.

Li, L., Lu, P. and Yin, Y. (2013). Correlation decay up to uniqueness in spin systems. In Proceedings of the 2013 ACM-SIAM Symposium on Discrete Algorithms (SODA). SIAM, 67–84.

Manning, C. D. and Schutze, H. (1999). Foundations of statistical natural language processing. MIT Press, Cambridge, MA.

Massoulie, L. (2014). Community detection thresholds and the weak Ramanujan property. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing. ACM.

McSherry, F. (2001). Spectral partitioning of random graphs. In 42nd IEEE Symposium on Foundations of Computer Science (Las Vegas, NV, 2001). IEEE Computer Soc., Los Alamitos, CA, 529–537.

Moitra, A., Perry, W. and Wein, A. S. (2016). How robust are reconstruction thresholds for community detection? In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, 828–841.

Montanari, A. and Saberi, A. (2010). The spread of innovations in social networks. Proceedings of the National Academy of Sciences 107 20196–20201.

Mossel, E., Neeman, J. and Sly, A. (2013). A proof of the block model threshold conjecture. arXiv:1311.4115.

Mossel, E., Neeman, J. and Sly, A. (2014). Belief propagation, robust reconstruction, and optimal recovery of block models. In Proceedings of the 2014 Conference on Learning Theory (COLT).

Mossel, E., Neeman, J. and Sly, A. (2015). Reconstruction and estimation in the planted partition model. Probability Theory and Related Fields 162 431–461.

Mukherjee, R., Mukherjee, S. and Yuan, M. (2016). Global testing against sparse alternatives under Ising models. arXiv:1611.08293.

Ravikumar, P., Wainwright, M. J. and Lafferty, J. D. (2010). High-dimensional Ising model selection using ℓ₁-regularized logistic regression. Ann. Statist. 38 1287–1319.


Rohe, K., Chatterjee, S. and Yu, B. (2011). Spectral clustering and the high-dimensional stochastic blockmodel. Ann. Statist. 39 1878–1915.

Schneidman, E., Berry, M. J., Segev, R. and Bialek, W. (2006). Weak pairwise correlations imply strongly correlated network states in a neural population. Nature 440 1007–1012.

Sebastiani, P., Ramoni, M. F., Nolan, V., Baldwin, C. T. and Steinberg, M. H. (2005). Genetic dissection and prognostic modeling of overt stroke in sickle cell anemia. Nature Genetics 37 435–440.

Sinclair, A., Srivastava, P. and Thurley, M. (2014). Approximation algorithms for two-state anti-ferromagnetic spin systems on bounded degree graphs. J. Stat. Phys. 155 666–686.

Sly, A. and Sun, N. (2014). Counting in two-spin models on d-regular graphs. Ann. Probab. 42 2383–2416.

Tropp, J. A. (2015). An Introduction to Matrix Concentration Inequalities. Foundations and Trends in Machine Learning.

Tsybakov, A. B. (2009). Introduction to nonparametric estimation. Springer Series in Statistics, Springer, New York. Revised and extended from the 2004 French original, translated by Vladimir Zaiats.

Wang, T., Berthet, Q. and Samworth, R. J. (2016). Statistical and computational trade-offs in estimation of sparse principal components. Ann. Statist. 44 1896–1930.

Weitz, D. (2006). Counting independent sets up to the tree threshold. In Proceedings of the 2006 ACM Symposium on the Theory of Computing. ACM, 140–149.

Quentin Berthet
Department of Pure Mathematics and Mathematical Statistics, University of Cambridge
Wilberforce Road, Cambridge, CB3 0WB, UK
E-mail: [email protected]

Philippe Rigollet
Department of Mathematics, Massachusetts Institute of Technology
77 Massachusetts Avenue, Cambridge, MA 02139-4307, USA
E-mail: [email protected]

Piyush Srivastava
School of Technology and Computer Science, Tata Institute of Fundamental Research
1 Dr. Homi Bhabha Road, Navy Nagar, Colaba, Mumbai, MH 400 005, India
E-mail: [email protected]


Submitted to the Annals of Statistics

SUPPLEMENTARY MATERIAL TO “EXACT RECOVERY IN THE ISING BLOCKMODEL”

By Quentin Berthet, Philippe Rigollet and Piyush Srivastava

University of Cambridge, Massachusetts Institute of Technology and Tata Institute of Fundamental Research

APPENDIX A: FACTS ABOUT THE CURIE-WEISS MODEL

We begin by stating some well-known facts about the Curie-Weiss model. These results are standard in the statistical physics literature and the interested reader can find more details in Ellis (2006); Friedli and Velenik (2016) for example. However, the precise behavior of the free energy that we need for our subsequent analysis does not seem to be readily available in the literature, so we prove below a lemma that suits our purposes.

Recall that the Curie-Weiss model is a special case of the Ising blockmodel when α = β = b. In this case, the free energy takes the form

(A.1)  g^{cw}_b(µ) = −2bµ² − 4h((µ + 1)/2),

where we recall that µ = σ^⊤1/p is the global magnetization of σ and h denotes the binary entropy function of Appendix C. The minima x ∈ (−1, 1) of g^{cw}_b are called ground states and satisfy the first-order optimality condition, also known as the mean field equation:

  log((1 + x)/(1 − x)) = 2bx.

If b ≤ 1, then the unique solution to the mean field equation is x = 0. Moreover, g^{cw}_b is increasing on [0, 1].

If b > 1, then the mean field equation has two nonzero solutions x̄ > 0 and −x̄ in (−1, 1). In any case, these solutions are global minima that are also the only


local minima of g^{cw}_b. In particular, when b > 1, g^{cw}_b is monotone decreasing in the interval (0, x̄) and monotone increasing in the interval (x̄, 1).
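Since artanh(x) = ½ log((1 + x)/(1 − x)), the mean field equation is equivalent to the fixed-point equation x = tanh(bx), which is convenient numerically. The bisection sketch below is an illustration, not part of the paper.

```python
import math

def ground_state(b, iters=200):
    """Largest solution x in [0, 1) of the mean field equation, written in
    the equivalent fixed-point form x = tanh(b * x).
    For b <= 1 the only solution is x = 0."""
    if b <= 1.0:
        return 0.0
    lo, hi = 1e-9, 1.0 - 1e-15        # tanh(b*lo) > lo and tanh(b*hi) < hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if math.tanh(b * mid) > mid:  # still below the nonzero fixed point
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For instance, ground_state(2.0) ≈ 0.9575, so the two ground states of the Curie-Weiss model with b = 2 are approximately ±0.9575, and its square lies between the two bounds of Lemma A.1 below.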

The following lemma is a refinement of these well-known facts that quantifies the curvature of g^{cw}_b around its minima.

Lemma A.1. Fix b > 1 in the Curie-Weiss model and denote by x̄ > 0 and −x̄ the two ground states. Then it holds that

  1 − 2b/(2b² + b − 1) < x̄² < 1 − e^{−2b}.

Moreover, for any x ∈ (0, 1), it holds that

(A.2)  g^{cw}_b(x) ≥ g^{cw}_b(x̄) + ((b − 1)/(2b)) (|x − x̄| ∧ ε)²,

and

(A.3)  g^{cw}_b(−x) ≥ g^{cw}_b(−x̄) + ((b − 1)/(2b)) (|x − x̄| ∧ ε)²,

where ε = (e^{−4b}/4) · (1 − 1/b).

Fix b ≤ 1 in the Curie-Weiss model and recall that x̄ = 0 is the unique ground state. Then for any x ∈ (−1, 1) it holds that

(A.4)  g^{cw}_b(x) ≥ g^{cw}_b(0) + (1 − b)(|x| ∧ ε′)²,

where ε′ = √((1 − b)/3).
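With the explicit form (A.1) of g^{cw}_b, the quadratic lower bound (A.2) can be sanity-checked numerically; the following script is an illustrative verification, not part of the proof.

```python
import math

def h(s):
    """Binary entropy (natural log)."""
    return 0.0 if s <= 0.0 or s >= 1.0 else -s * math.log(s) - (1 - s) * math.log(1 - s)

def g_cw(b, x):
    """Curie-Weiss free energy (A.1): g(x) = -2 b x^2 - 4 h((x + 1) / 2)."""
    return -2.0 * b * x * x - 4.0 * h((x + 1.0) / 2.0)

def ground_state(b):
    """Positive solution of x = tanh(b x) for b > 1, by bisection."""
    lo, hi = 1e-9, 1.0 - 1e-15
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if math.tanh(b * mid) > mid:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

b = 2.0
xb = ground_state(b)
eps = math.exp(-4.0 * b) / 4.0 * (1.0 - 1.0 / b)
# Check (A.2) on a grid of x in (0, 1), with a tiny floating-point slack.
for k in range(1, 1000):
    x = k / 1000.0
    assert g_cw(b, x) >= g_cw(b, xb) + (b - 1) / (2 * b) * min(abs(x - xb), eps) ** 2 - 1e-12
```

The grid check passes silently, consistent with the quadratic growth of g^{cw}_b away from its minimum.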

Proof. When b > 1, x̄ > 0, and hence we have

(A.5)  2bx̄ = log((1 + x̄)/(1 − x̄)) < 2x̄/(1 − x̄²) − λx̄³,  ∀ λ ≤ 1.

Taking λ = 0 implies that x̄ > √(1 − 1/b). Plugging this into (A.5) with λ = 1 yields

  2bx̄ < 2x̄/(1 − x̄²) − x̄(1 − 1/b).

Solving for x̄ once again yields

(A.6)  2/(1 − x̄²) > 2b + 1 − 1/b,


or, equivalently,

  x̄² > 1 − 2b/(2b² + b − 1).

Moreover, the mean field equation yields

  2b > 2bx̄ = log((1 + x̄)/(1 − x̄)) > −log(1 − x̄),

so that

(A.7)  x̄ < 1 − e^{−2b},

which readily yields the desired upper bound on x̄².

We conclude this proof by showing that g^{cw}_b is at least quadratic in a neighborhood of its minima when b ≠ 1. To that end, observe first that the second and third derivatives of g^{cw}_b are given respectively by

  (∂²/∂x²) g^{cw}_b(x) = −4b + 4/(1 − x²),   (∂³/∂x³) g^{cw}_b(x) = 8x/(1 − x²)².

Note that the third derivative is non-negative for x ≥ 0.

We first consider the case b > 1. A Taylor expansion of g^{cw}_b around x̄, together with (A.6) and (A.7) and the fact that the third derivative of g^{cw}_b is non-negative for positive x, yields that for any x ∈ [x̄, 1),

  g^{cw}_b(x) ≥ g^{cw}_b(x̄) + (1 − 1/b)(x − x̄)².

We now consider x ≤ x̄. For x such that

  0 ≤ x̄ − x ≤ ε := (e^{−4b}/4) · (1 − 1/b),

we have, again using a Taylor expansion of g^{cw}_b around x̄ together with (A.6) and (A.7):

  g^{cw}_b(x) ≥ g^{cw}_b(x̄) + (1 − 1/b)(x − x̄)² − (4/(3(1 − x̄²)²)) |x − x̄|³
        ≥ g^{cw}_b(x̄) + (1 − 1/b)(x − x̄)² − (4ε/(3e^{−4b})) (x − x̄)²
        ≥ g^{cw}_b(x̄) + (1/2)(1 − 1/b)(x − x̄)².

Using the fact that g^{cw}_b is monotone decreasing on (0, x̄ − ε), we obtain the claim in (A.2). The lower bound (A.3) follows by symmetry.


Next, we consider the case b < 1. A Taylor expansion of g^{cw}_b around 0 yields that for any x such that |x| < ε′, with ε′ ∈ (0, 1),

  g^{cw}_b(x) > g^{cw}_b(0) + [ 2(1 − b) − 4ε′²/(3(1 − ε′²)²) ] x² ≥ g^{cw}_b(0) + (1 − b)x²

for

  ε′ ≤ √((1 − b)/3).

Using the fact that g^{cw}_b is monotone decreasing on (−1, −ε′] and monotone increasing on [ε′, 1) yields (A.4). □

Remark A.2. When b = 1, the Hessian of g^{cw}_b vanishes at 0. In this case, g^{cw}_b is not lower bounded by a quadratic term.

APPENDIX B: PROOFS OF TECHNICAL RESULTS

Proof of Theorem 4.3. Recall that M₂ defined in (4.4) denotes the set of possible values for pairs of local magnetizations, and observe that

  IE_{α,β}[φ(µ)] = (1/Z_{α,β}) Σ_{µ∈M₂} φ(µ) z_m(µ),

where

(B.1)  z_m(µ) := exp( −(m/4)( −2α µ_S µ_{S̄} − β(µ_S² + µ_{S̄}²) ) ) binom(m, ((µ_S + 1)/2)m) binom(m, ((µ_{S̄} + 1)/2)m).

We split the local magnetizations µ according to their ℓ∞ distance to the closest ground state. Let G ⊂ (−1, 1)² denote the set of ground states and fix Δ := (ρ/κ)√((log m)/m), where ρ > 0 is a constant to be chosen later and κ > 0 is defined in Lemma 4.2. For any ground state s ∈ G, define V_s to be the neighborhood of s defined by

  V_s = { µ ∈ M₂ : ‖µ − s‖_∞ ≤ Δ }.

Moreover, define

  V = ∪_{s∈G} V_s,

and assume that m is large enough so that (i) the above union is a disjoint one and (ii) there exists a constant C > 0 depending on α and β such that


for any (x, y) ∈ V, we have ||x| − 1| ∧ ||y| − 1| ≥ C > 0. Denote by g*_{α,β} the value of the free energy at any of the ground states.

We first treat the magnetizations outside V. Using Lemma 4.2 together with Lemma C.1, we get

(B.2)
  0 ≤ e^{(m/4) g*_{α,β}} Σ_{µ∉V} φ(µ) z_m(µ) ≤ e^{(m/4) g*_{α,β}} Σ_{µ∉V} e^{−(m/4) g_{α,β}(µ)} ≤ m² e^{−(m/8) κ²Δ²} ≤ m^{2 − ρ²/8} = o_m(m^{−γ}),

for ρ > 4√(1 + γ).

Next assume that µ ∈ V. Our starting point is the following approximation, which follows from Lemma C.2: for any µ ∈ V,

(B.3)  z_m(µ) = (1/(πm)) · e^{−(m/4) g_{α,β}(µ_S, µ_{S̄})} / √((1 − µ_S²)(1 − µ_{S̄}²)) · (1 + o_m(1)).

Define V′ = {u : s + u ∈ V_s}. A Taylor expansion around s gives, for any u ∈ V′,

  g_{α,β}(s + u) = g_{α,β}(s) + (1/2) u^⊤ H u + O(Δ³),

where H = H_{α,β} denotes the Hessian of g_{α,β} at the ground state s. Note that since β + |α| ≠ 2, Lemma 4.2 implies that H is positive definite. The above two displays now yield (here, for a point x, we use ⌊x⌋_m to represent the point in V′ obtained by rounding down the coordinates of x)

  e^{(m/4) g*_{α,β}} Σ_{µ∈V_s} φ(µ) z_m(µ)
    = e^{(m/4) g*_{α,β}} Σ_{u∈V′} φ(s + u) z_m(s + u)
    ≃_m (1/(πm(1 − x̄²))) Σ_{u∈V′} φ(s + u) e^{−(m/8) u^⊤Hu}
    ≃_m (m/(4π(1 − x̄²))) ∫_{ΔB_∞} φ(s + ⌊x⌋_m) e^{−(m/8) x^⊤Hx} dx
    = (1/(π(1 − x̄²)√(det H))) ∫_{‖H^{−1/2}z‖_∞ ≤ Δ√m/2} φ(s + ⌊(2/√m) H^{−1/2} z⌋_m) e^{−‖z‖²/2} dz
    = (2/((1 − x̄²)√(det H))) IE[ φ(s + ⌊(2/√m) H^{−1/2} Z⌋_m) · 1I{ ‖H^{−1/2}Z‖_∞ ≤ Δ√m/2 } ].   (B.4)


where Z ∼ N₂(0, I₂). Here, the third equality replaces the sum by a Riemann integral and incurs a factor of 1 + o_m(1) because of the replacement of ⌊x⌋_m by x in the argument of the exponential. From eq. (B.4), we now have

(B.5)
  e^{(m/4) g*_{α,β}} Σ_{µ∈V_s} φ(µ) z_m(µ) ≃_m (2/((1 − x̄²)√(det H))) ( IE[ φ(s + (2/√m) H^{−1/2} Z) ] − ε₁ − ε₂ ),

where

  ε₁ := IE[ φ(s + (2/√m) H^{−1/2} Z) · 1I{ ‖H^{−1/2}Z‖_∞ > Δ√m/2 } ]

and

  ε₂ := IE[ ( φ(s + (2/√m) H^{−1/2} Z) − φ(s + ⌊(2/√m) H^{−1/2} Z⌋_m) ) · 1I{ ‖H^{−1/2}Z‖_∞ ≤ Δ√m/2 } ].

We first estimate ε₁ using the assumed polynomial bound on φ. For some positive constant D′ = D′(D, s), we have

(B.6)
  0 ≤ ε₁ ≤ IE[ φ(s + (2/√m) H^{−1/2} Z) · 1I{ ‖Z‖₂ > ρ√(log m)/(2√(λ_max(H))) } ]
       ≤ IE[ ( D′ + D′( (2/√m) ‖H^{−1/2}Z‖₂ )^{2d} ) · 1I{ ‖Z‖₂ > ρ√(log m)/(2√(λ_max(H))) } ]
       ≤ IE[ ( D′ + D′( 4/(m λ_min(H)) )^{d} ‖Z‖₂^{2d} ) · 1I{ ‖Z‖₂ > ρ√(log m)/(2√(λ_max(H))) } ].

It then follows from Proposition C.3 that |ε₁| = o(m^{−γ}) when ρ is chosen to be a large enough constant (in terms of γ, H, κ, d, and D′).

For estimating ε₂, we proceed based on which of the two regularity conditions in the statement of the theorem is satisfied by φ.

Case 1: When φ satisfies the first regularity condition, we use ‖x − ⌊x⌋_m‖_∞ ≤


2/m to get

  |ε₂| ≤ (2C₂/m^{C₁}) IE[ | φ(s + (2/√m) H^{−1/2} Z) | · 1I{ ‖H^{−1/2}Z‖_∞ ≤ Δ√m/2 } ]
       ≤ (2C₂/m^{C₁}) IE[ | φ(s + (2/√m) H^{−1/2} Z) | ]
       = (2C₂/m^{C₁}) IE[ φ(s + (2/√m) H^{−1/2} Z) ],

where the last equality uses the fact that φ is non-negative. Combining this with the above estimate on ε₁ and eq. (B.5) gives

(B.7)
  e^{(m/4) g*_{α,β}} Σ_{µ∈V_s} φ(µ) z_m(µ) ≃_m (2/((1 − x̄²)√(det H))) IE[ φ(s + (2/√m) H^{−1/2} Z) ]

in this case.

Case 2: When φ satisfies the second regularity condition, we again use ‖x − ⌊x⌋_m‖_∞ ≤ 2/m to get

  |ε₂| ≤ (4C₁/m) IE[ Δ · 1I{ ‖H^{−1/2}Z‖_∞ ≤ Δ√m/2 } ] ≤ (4C₁ρ/κ) √(log m)/m^{3/2} = o(m^{−γ}),

where the first inequality uses the fact that we are restricting the expectation to Z for which ‖(2/√m) H^{−1/2} Z‖_∞ ≤ Δ, and the last equality uses γ < 3/2. The estimate in eq. (B.7) thus follows in this case as well.

Since the same calculations as above hold for all ground states in G, and because the sets V_s, s ∈ G, are disjoint, we get that

  e^{(m/4) g*_{α,β}} Σ_{µ∈V} φ(µ) z_m(µ) ≃_m (2/((1 − x̄²)√(det H))) Σ_{s∈G} IE[ φ(s + (2/√m) H^{−1/2} Z) ].

Together with (B.2), the above display yields

  Σ_{µ∈M₂} φ(µ) z_m(µ) ≃_m (2 e^{−(m/4) g*_{α,β}} / ((1 − x̄²)√(det H))) Σ_{s∈G} IE[ φ(s + (2/√m) H^{−1/2} Z) ],


In particular, this expression yields for φ ≡ 1 (which trivially satisfies the first regularity condition),

  Z_{α,β} ≃_m 2|G| e^{−(m/4) g*_{α,β}} / ((1 − x̄²)√(det H)).

The above two displays yield the desired result.

Proof of Proposition 4.4. It follows from Lemma 2.1 that

  γ − ω = 2 IE_{α,β}[φ(µ)] + (1/(m − 1)) IE_{α,β}[ψ(µ)] − 1/(m − 1),

where φ(x, y) := (x − y)²/4 and ψ(x, y) := (x² + y²)/2. Our goal is to apply Theorem 4.3 to estimate the above expectations. To this end, we first note that both φ and ψ are non-negative functions which map [−1, 1]² to [0, 1], and further satisfy condition (4.8) of Theorem 4.3 with γ = 1, for any point s ∈ G ⊆ (−1, 1)². Next, we note that in a suitably chosen neighborhood of any s ∈ G, φ satisfies condition 1 in Theorem 4.3 if s does not lie on the line y = x, and condition 2 otherwise. Similarly, in a suitably chosen neighborhood of any s ∈ G, ψ satisfies condition 1 in Theorem 4.3 if s ≠ (0, 0) and condition 2 otherwise. Thus, the conditions of Theorem 4.3 apply to both φ and ψ at all s ∈ G. We can therefore apply the theorem to the two expectations in the above display to get

  γ − ω = ((1 + o_m(1))/(2|G|)) Σ_{s∈G} (x − y)² + (2/m) v^⊤H^{−1}v − (1/(m − 1))(1 − x̄²) + o_m(1/m),

where v = (1, −1). (The above calculation also uses the fact that the values of x² = y², and hence H, are the same for all (x, y) ∈ G.) Lemma 4.2 implies that v is an eigenvector of H, and thus of H^{−1}, and

  v^⊤H^{−1}v = 1/(α − β + 2/(1 − x̄²)).

This completes the first part of the proof and it remains only to check the different cases.

• If β + |α| < 2, then x̄ = ȳ = 0 is the unique ground state, which yields the result by substitution. The positivity follows directly when β > α.

• If β + |α| > 2, and

1. if α = 0, then |G| = 4 and there are two ground states (x̄, −x̄) and (−x̄, x̄) (with x̄ ≠ 0) for which (x − y) does not vanish. The term in 1/m is negligible;


2. if α > 0, then for both ground states (x − y)² = 0, so that

  γ − ω ≃_m (1/m) · (β − α)(1 − x̄²)² / (2 − (β − α)(1 − x̄²)).

The fact that this quantity is positive when α < β follows from (A.5) with λ = 0.

3. if α < 0, then there are two ground states (x̄, −x̄) and (−x̄, x̄) (with x̄ ≠ 0) and we can conclude as in the case α = 0, but gain a factor of 2 because all the ground states contribute to the constant term. □

APPENDIX C: INEQUALITIES

C.1. Bounds on binomial coefficients. We need the following well-known information-theoretic estimate. Recall that the binary entropy function h : [0, 1] → IR is defined by h(0) = h(1) = 0 and, for any s ∈ (0, 1), by

  h(s) = −s log(s) − (1 − s) log(1 − s).

Lemma C.1. Let m be a positive integer and let δ ∈ [0, 1] be such that δm is an integer. Then

  binom(m, δm) ≤ exp(m h(δ)).

Proof. Let X ∼ Bin(m, δ) be a binomial random variable. Then

  1 ≥ IP(X = δm) = binom(m, δm) δ^{δm} (1 − δ)^{(1−δ)m} = binom(m, δm) exp(−m h(δ)). □
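Lemma C.1 is easy to check exactly for moderate m; the following is an illustrative verification, not part of the proof.

```python
import math

def h(s):
    """Binary entropy with h(0) = h(1) = 0 (natural log)."""
    return 0.0 if s in (0.0, 1.0) else -s * math.log(s) - (1 - s) * math.log(1 - s)

def log_binom(m, k):
    """log of binom(m, k) via log-gamma, to avoid overflow for large m."""
    return math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)

# binom(m, dm) <= exp(m h(d)) for every integer 0 <= dm <= m.
for m in (1, 10, 100, 1000):
    for k in range(m + 1):
        assert log_binom(m, k) <= m * h(k / m) + 1e-9
```

The loop passes silently for every m tested, matching the lemma.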

The following sharper estimate follows from the Stirling approximation of n! developed in Robbins (1955).

Lemma C.2. Let ε > 0, let m be a positive integer and let δ ∈ [ε, 1 − ε] be such that δm is an integer. We then have

  exp(−1/(12ε²m)) ≤ √(2πmδ(1 − δ)) exp(−m h(δ)) binom(m, δm) ≤ exp(1/(12m)).

Proof. It follows from Robbins (1955) that for any positive integer n,

  1 < exp(1/(12n + 1)) ≤ n! / (√(2πn)(n/e)^n) ≤ exp(1/(12n)).


Applying this to

  binom(m, δm) = m! / ((δm)! ((1 − δ)m)!)

yields the desired bounds. □
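Both sides of Lemma C.2 can also be verified numerically; the script below is an illustrative check with ε = 0.1, not part of the proof.

```python
import math

def h(s):
    return -s * math.log(s) - (1 - s) * math.log(1 - s)

def log_binom(m, k):
    return math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)

eps = 0.1
for m in (10, 50, 200):
    for dm in range(int(math.ceil(eps * m)), int((1 - eps) * m) + 1):
        d = dm / m
        # log of: sqrt(2 pi m d (1-d)) * exp(-m h(d)) * binom(m, dm)
        log_ratio = 0.5 * math.log(2 * math.pi * m * d * (1 - d)) - m * h(d) + log_binom(m, dm)
        assert -1.0 / (12 * eps ** 2 * m) <= log_ratio <= 1.0 / (12 * m)
```

Working on the log scale keeps the check numerically stable even for larger m.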

C.2. Inequalities for the Gaussian tail. We will need the following standard inequalities for the Gaussian tail.

Proposition C.3. Let Z ∼ N₂(0, I₂). Then, for any γ > 0 and positive integer d, there exists c₀ > 0 such that for c > c₀,

  IE[ 1I{ ‖Z‖₂ ≥ √(2c log m) } ] = o_m(m^{−γ}),
  IE[ ‖Z‖₂^{2d} 1I{ ‖Z‖₂ ≥ √(2c log m) } ] = o_m(m^{−γ}).

Proof. We integrate in polar coordinates. For the first inequality we have

  IE[ 1I{ ‖Z‖₂ ≥ √(2c log m) } ] = (1/(2π)) ∫₀^{2π} ∫_{√(2c log m)}^{∞} e^{−r²/2} r dr dθ = ∫_{c log m}^{∞} e^{−x} dx = m^{−c},

where the second equality uses the substitution x = r²/2. Thus, we only need to ensure c₀ ≥ γ for the first inequality. For the second inequality, we again use the same substitution to obtain:

  IE[ ‖Z‖₂^{2d} 1I{ ‖Z‖₂ ≥ √(2c log m) } ] = (1/(2π)) ∫₀^{2π} ∫_{√(2c log m)}^{∞} r^{2d} e^{−r²/2} r dr dθ = 2^d ∫_{c log m}^{∞} x^d e^{−x} dx.

We now use the inequality ∫_a^∞ x^d e^{−x} dx ≤ (d + 1)! · a^d e^{−a}, valid for a ≥ 1 and positive integer d, to conclude (for c and m large enough that c log m ≥ 1) that

  IE[ ‖Z‖₂^{2d} 1I{ ‖Z‖₂ ≥ √(2c log m) } ] ≤ (2c log m)^d · (d + 1)! · m^{−c} = o(m^{−γ}),

when c is chosen to be large enough. □
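Repeated integration by parts gives the closed form ∫_a^∞ x^d e^{−x} dx = e^{−a} Σ_{k=0}^{d} (d!/k!) a^k, which makes the inequality used in the last step easy to check; the script below is an illustrative verification.

```python
import math

def tail_integral(a, d):
    """Closed form of the incomplete-gamma tail: integral over [a, inf)
    of x^d e^{-x} dx, via repeated integration by parts."""
    return math.exp(-a) * sum(math.factorial(d) // math.factorial(k) * a ** k
                              for k in range(d + 1))

# Bound from the proof: tail <= (d+1)! a^d e^{-a}, valid for a >= 1.
for d in (1, 2, 3, 5):
    for a in (1.0, 2.0, 5.0, 20.0):
        assert tail_integral(a, d) <= math.factorial(d + 1) * a ** d * math.exp(-a) + 1e-15
```

The bound is tight at d = 1, a = 1, where both sides equal 2e^{−1}.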


REFERENCES

Ellis, R. S. (2006). Entropy, large deviations, and statistical mechanics. Classics in Mathematics, Springer-Verlag, Berlin. Reprint of the 1985 original.

Friedli, S. and Velenik, Y. (2016). Statistical Mechanics of Lattice Systems: a Concrete Mathematical Introduction. Cambridge University Press.

Robbins, H. (1955). A remark on Stirling's formula. Amer. Math. Monthly 62 26–29.
