

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 59, NO. 11, NOVEMBER 2013 7465

Compressed Sensing Off the Grid

Gongguo Tang, Member, IEEE, Badri Narayan Bhaskar, Parikshit Shah, and Benjamin Recht, Associate Member, IEEE

Abstract—This paper investigates the problem of estimating the frequency components of a mixture of complex sinusoids from a random subset of regularly spaced samples. Unlike previous work in compressed sensing, the frequencies are not assumed to lie on a grid, but can assume any values in the normalized frequency domain. An atomic norm minimization approach is proposed to exactly recover the unobserved samples and identify the unknown frequencies, which is then reformulated as an exact semidefinite program. Even with this continuous dictionary, it is shown that random samples are sufficient to guarantee exact frequency localization with high probability, provided the frequencies are well separated. Extensive numerical experiments are performed to illustrate the effectiveness of the proposed method.

Index Terms—Atomic norm, basis mismatch, compressed sensing, continuous dictionary, line spectral estimation, nuclear norm relaxation, Prony’s method, sparsity.

I. INTRODUCTION

COMPRESSED sensing has demonstrated that data acquisition and compression can often be combined, dramatically reducing the time and space needed to acquire many signals of interest [1]–[4]. Despite the tremendous impact of compressed sensing on signal processing theory and practice, its development thus far has focused on signals with sparse representations in finite discrete dictionaries. However, signals encountered in applications such as radar, array processing, communication, seismology, and remote sensing are usually specified by parameters in a continuous domain [5]–[7]. In order to apply the theory of compressed sensing to such applications, researchers typically adopt a discretization procedure to reduce the continuous parameter space to a finite set of grid points [8]–[16]. While this simple strategy yields state-of-the-art performance for problems where the true parameters lie on the grid, discretization has several significant drawbacks: 1) in cases where the true parameters do not fall onto the finite grid, the signal often cannot be sparsely represented by the discrete dictionary [13], [17], [18]; 2) it is difficult to characterize the performance of discretization using standard compressed sensing analyses, since the dictionary becomes very coherent as we increase the number of grid points; 3) although finer grids may improve the reconstruction error in theory, very fine grids often lead to numerical instability issues.

Manuscript received August 16, 2012; revised April 26, 2013; accepted July 10, 2013. Date of publication August 07, 2013; date of current version October 16, 2013. G. Tang was supported by DARPA under Grant N66001-11-1-4090. B. Bhaskar and P. Shah were supported by NSF award CCF-1139953. B. Recht was supported by ONR Award N00014-11-1-0723, NSF Award CCF-1148243, a Sloan Research Fellowship, and NSF Award CCF-1139953. This paper was presented at the Proceedings of the 50th Annual Allerton Conference, Urbana, IL, USA, 2012.
G. Tang and B. N. Bhaskar are with the Department of Electrical and Computer Engineering, University of Wisconsin-Madison, WI 53715 USA (e-mail: [email protected]; [email protected]).
P. Shah and B. Recht are with the Department of Computer Sciences, University of Wisconsin-Madison, WI 53715 USA (e-mail: [email protected]; [email protected]).
Communicated by M. Elad, Associate Editor for Signal Processing.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIT.2013.2277451

We sidestep the issues arising from discretization by working

directly on the continuous parameter space for estimating the continuous frequencies and amplitudes of a mixture of complex sinusoids from partially observed time samples. In particular, the frequencies are not assumed to lie on a grid, and can instead take arbitrary values across the bandwidth of the signal. With a time-frequency exchange, our model is exactly the same as the one in Candès et al.’s foundational work on compressed sensing [1], except that we do not assume the spikes to lie on an equispaced grid. This major difference presents a significant technical challenge, as the resulting dictionary is no longer an orthonormal Fourier basis, but is an infinite dictionary with continuously many atoms and arbitrarily high correlation between candidate atoms. We demonstrate that a sparse sum of complex sinusoids can be reconstructed exactly from a small sampling of its time samples, provided the frequencies are sufficiently far apart from one another.

Our computational method and theoretical analysis are based upon the atomic norm induced by samples of complex exponentials [19]. Chandrasekaran et al. argue that the atomic norm is the best convex heuristic for underdetermined, structured linear inverse problems, and it generalizes the norm for sparse recovery and the nuclear norm for low-rank matrix completion. The norm is a convex function and, in the case of complex exponentials, can be computed via semidefinite programming. We show how the atomic norm for moment sequences can be derived either from the perspective of sparse approximation or rank minimization [20], illuminating new ties between these related areas of study. Much as was the case in other problems where the atomic norm has been studied, we prove that atomic norm minimization achieves nearly optimal recovery bounds for reconstructing sums of sinusoids from incomplete data.

To be precise, we consider signals whose spectra consist of

spike trains with unknown locations in the normalized interval (we identify 0 and 1). Rather than sampling the signal at all times, we sample the signal at a subset of times, with each . Our main contribution is summarized by the following theorem.

Theorem I.1: Suppose we observe the signal

(I.1)

0018-9448 © 2013 IEEE


with unknown frequencies on an index set of size selected uniformly at random. Additionally, assume are drawn i.i.d. from the uniform distribution on the complex unit circle and

where the distance is understood as the wrap-around distance on the unit circle. If , then there exists a numerical constant such that

is sufficient to guarantee that we can recover and localize the frequencies via a semidefinite program with probability at least with respect to the random samples and signs.

The frequencies may be identified directly using the dual solution of the atomic norm minimization problem we propose in this paper. Alternatively, once the missing entries are recovered exactly, the frequencies can be identified by Prony’s method [21], a matrix pencil approach [22], or other linear prediction methods [23]. After identifying the frequencies, the coefficients can be obtained by solving a linear system.

Remark I.2 (Resolution): An interesting artifact of using

convex optimization methods is the necessity of a particular resolution condition on the spectrum of the underlying signal. For the signal to be recoverable via our methods using random time samples from the set , the spikes in the spectrum need to be separated by roughly . In contrast, if one chose to acquire consecutive samples from this set (equispaced sampling), the required minimum separation would be ; this sampling regime was studied by Candès and Fernandez-Granda [24]. Therefore, in some sense, the resolution is determined by the region over which we take the samples, either in full or in a uniform random manner. We comment that the numerical simulations of Section V suggest that the critical separation is actually . We leave tightening our bounds by the extra constant of 4 to future work.

Remark I.3 (Random Signs): The randomness of the signs

of the coefficients essentially assumes that the sinusoids have random phases. Such a model is practical in many spectrum sensing applications, as argued in [5, Ch. 4.1]. Our proof will reveal that the phases can obey any symmetric distribution on the unit circle, not simply the uniform distribution.

Remark I.4 (Band-Limited Signal Models): Note that any mixture of sinusoids with frequencies bandlimited to , after appropriate normalization, can be assumed to have frequencies in . Consequently, a bandlimited signal of such a form leads to samples of the form (I.1). More precisely, suppose the frequencies lie in , and is a continuous signal of the form

By taking regularly spaced Nyquist samples at , we observe

which is exactly the same as our model (I.1) after a trivial translation of the frequency domain. We emphasize that one does not need to actually acquire all of these Nyquist samples and then discard some of them to obtain the set . In practice, one could predetermine the set and sample only at locations specified by . This procedure allows one to achieve compression at the sensing stage, the central innovation behind the theory of compressed sensing.

Remark I.5 (Basis Mismatch): Finally, we note that our result

completely obviates the basis mismatch conundrum [13], [17], [18] of discretization methods, where the frequencies might well fall off the grid. Since our continuous dictionary is globally coherent, Theorem I.1 shows that the global coherence of the frame is not an obstacle to recovery. What matters more is the local coherence between the atoms composing the true signal, as characterized by the separation between the frequencies.

This paper is organized as follows. First, we specify our reconstruction algorithm as the solution to an atomic norm minimization problem in Section II. We show that this convex optimization problem can be exactly reformulated as a semidefinite program and that our methodology is thus computationally tractable. We outline connections to prior art and the foundations that we build upon in Section III. We then proceed to develop the proofs in Section IV. Our proof requires the construction of an explicit certificate that satisfies certain interpolation conditions. The production of this certificate requires us to consider certain random polynomial kernels and derive concentration inequalities for these kernels that may be of independent interest to the reader. In Section V, we validate our theory by extensive numerical experiments, confirming that random undersampling as a means of compression, coupled with atomic norm minimization as a means of recovery, is a viable, superior alternative to discretization techniques.

II. ATOMIC NORM AND SEMIDEFINITE CHARACTERIZATIONS

Our signal model is a positive combination of complex sinusoids with arbitrary phases. As motivated in [19], a natural regularizer that encourages a sparse combination of such sinusoids is the atomic norm induced by these signals. Precisely, define atoms , and as

and rewrite the signal model (I.1) in vector form

(II.1)


TANG et al.: COMPRESSED SENSING OFF THE GRID 7467

where is an index set with values being either or for some positive integer and , and is the phase of the complex number . In the rest of the paper, we use to denote the unknown set of frequencies. In the representation (II.1), we could also choose to absorb the phase into the coefficient , as we did in (I.1). We will use both representations in the following and explicitly specify that the coefficient is positive when the phase term is in the atom .

The set of atoms are building blocks of the signal , the same way that canonical basis vectors are building blocks for sparse signals, and unit-norm rank-one matrices are building blocks for low-rank matrices. In sparse recovery and matrix completion, the unit balls of the sparsity-enforcing norms, e.g., the norm and the nuclear norm, are exactly the convex hulls of their corresponding building blocks. In a similar spirit, we define an atomic norm by identifying its unit ball with the convex hull of

Roughly speaking, the atomic norm can enforce sparsity in because low-dimensional faces of correspond to signals involving only a few atoms. The idea of using atomic norms to enforce sparsity for a general set of atoms was first proposed and analyzed in [19].

When the phases are all 0, the set

is called the moment curve, which traces out a 1-D variety in . It is well known that the convex hull of this curve is characterizable in terms of linear matrix inequalities, and membership in the convex hull can thus be computed in polynomial time (see [25] for a proof of this result and a discussion of many other algebraic varieties whose convex hulls are characterized by semidefinite programming). When the phases are allowed to range in , a similar semidefinite characterization holds.

Proposition II.1: For any with or ,

(II.2)

In the proposition, we used the superscript to denote conjugate transpose and to denote the Toeplitz matrix whose first column is equal to . The proof of this proposition relies on the following classical Vandermonde decomposition lemma for positive semidefinite Toeplitz matrices.

Lemma II.2 (Caratheodory–Toeplitz, [26]–[28]): Any positive semidefinite Toeplitz matrix can be represented as follows:

where

are real positive numbers, and .

The Vandermonde decomposition can be computed efficiently via root finding or by solving a generalized eigenvalue problem [22].
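As a concrete illustration of the root-finding route, the following sketch (our own code and names, not the authors' implementation, and not the numerically robust generalized-eigenvalue routine of [22]) recovers the frequencies and positive weights of a rank-r positive semidefinite Toeplitz matrix from its first column by a Prony-style annihilating-filter computation:

```python
import numpy as np

def vandermonde_decomposition(t, r):
    """Sketch: recover frequencies f_l and weights d_l of a rank-r positive
    semidefinite Toeplitz matrix from its first column t (length n >= 2r)."""
    n = len(t)
    # Prony step: the sequence t_k = sum_l d_l e^{2*pi*i*f_l*k} is annihilated
    # by a monic degree-r polynomial; solve for its lower-order coefficients.
    H = np.column_stack([t[m:m + n - r] for m in range(r)])
    c = np.linalg.lstsq(H, -t[r:], rcond=None)[0]
    # Roots of z^r + c_{r-1} z^{r-1} + ... + c_0 lie on the unit circle.
    z = np.roots(np.concatenate(([1.0], c[::-1])))
    freqs = np.mod(np.angle(z) / (2 * np.pi), 1.0)
    # Weights from the Vandermonde system t = V d.
    V = np.exp(2j * np.pi * np.outer(np.arange(n), freqs))
    d = np.linalg.lstsq(V, t, rcond=None)[0].real
    return freqs, d
```

On exact data with distinct frequencies, the roots sit on the unit circle and the decomposition is recovered to machine precision; noisy data would call for the more robust methods cited above.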

Proof of Proposition II.1: We prove the case . The other case can be proved in a similar manner. Denote the value of the right-hand side of (II.2) by . Suppose with . Defining

and , we note that

Therefore, the matrix

(II.3)

is positive semidefinite. Now, , so that . Since this holds for any decomposition of , we conclude that .

Conversely, suppose for some and ,

(II.4)

In particular, . Form a Vandermonde decomposition

as promised by Lemma II.2. Since and , we have .

Using this Vandermonde decomposition and the matrix inequality (II.4), it follows that is in the range of , and hence

for some complex coefficient vector . Finally, by the Schur Complement Lemma, we have

Let be any vector such that . Such a vector exists because is full rank. Then,

implying that . By the arithmetic-geometric mean inequality,


implying that , since the previous chain of inequalities holds for any feasible choice of .

There are several other approaches to prove the semidefinite programming characterization of the atomic norm. As we will see below, the dual norm of the atomic norm is related to the maximum modulus of trigonometric polynomials [see equation (II.7)]. Thus, proofs based on Bochner’s Theorem [29], the bounded real lemma [30], [31], or spectral factorization [32] would also provide a tight characterization. It is interesting that the SDP for the continuous case is not very different from an SDP derived for basis pursuit denoising on a grid in [15], [16], and [33]. The work [33] also provides a heuristic to deal with off-grid frequencies, though there is no theory to certify exact recovery.

A. Atomic Norm Minimization for Continuous Compressed Sensing

Recall that we observe only a subset of entries . As prescribed in [19], a natural algorithm for estimating the missing samples of a sparse sum of complex exponentials is the atomic norm minimization problem

(II.5)

or, equivalently, the semidefinite program

(II.6)

The main result of this paper is that this semidefinite program almost always recovers the missing samples and identifies the frequencies, provided the number of measurements is large enough and the frequencies are reasonably well separated. We formalize this statement for the case in the following theorem.

Theorem II.3: Suppose we observe the time samples of on the index set of size selected uniformly at random. Additionally, assume are drawn i.i.d. from a symmetric distribution on the complex unit circle. If , then there exists a numerical constant such that

is sufficient to guarantee that, with probability at least , is the unique optimizer to (II.5).

We prove this theorem in Section IV. Note that Theorem I.1 is a corollary of Theorem II.3 via a simple reformulation. We provide a proof of the equivalence in Appendix A.

B. Duality and Frequency Localization

To every norm, there is an associated dual norm, and the dual of the atomic norm for complex sinusoids has useful structure for both analysis and implementations. In this section, we analyze the structure of the dual problem and show that the dual optimal solution can be used as a method to identify the frequencies that comprise the optimal .

Define the inner product as , and the real inner product as . Then, the dual norm of is defined as

(II.7)

that is, the dual atomic norm is equal to the maximum modulus of the polynomial on the unit circle. The dual problem of (II.5) is thus

(II.8)

which follows from a standard Lagrangian analysis [19]. Note that the dual atomic norm problem is an optimization over polynomials with bounded modulus on the unit disk.

Let be primal-dual feasible to (II.5) and (II.8). We have that

Since the primal is only equality constrained, Slater’s condition naturally holds, implying strong duality [34, Sec. 5.2.3]. By weak duality, we always have

for any primal feasible and any dual feasible . Strong duality implies equality holds if and only if is dual optimal and is primal optimal. A straightforward consequence of strong duality is a certificate of the support of the solution to (II.5).

Proposition II.4: Suppose the atomic set is composed of atoms defined by with being either or . Then, is the unique optimizer to (II.5) if there exists a dual polynomial

(II.9)

satisfying

(II.10)

(II.11)

(II.12)


The polynomial works as a dual certificate to certify that is the primal optimizer. The conditions on are imposed on the values of the dual polynomial [conditions (II.10) and (II.11)] and on the coefficient vector [condition (II.12)]. To prove Theorem II.3, we will construct a dual certificate satisfying the conditions of Proposition II.4 in Section IV.

Proof of Proposition II.4: Any vector that satisfies the conditions of Proposition II.4 is dual feasible. We also have that

where the last inequality is due to the definition of the atomic norm. On the other hand, Hölder’s inequality states , implying . Since is primal-dual feasible, we conclude that is a primal optimal solution and is a dual optimal solution because of strong duality.

For uniqueness, suppose with

is another optimal solution. We then have for the dual certificate :

due to condition (II.11) if is not solely supported on , contradicting strong duality. So all optimal solutions are supported on . Since for both and the atoms with frequencies in are linearly independent, the optimal solution is unique.

A consequence of this proposition is that the dual solution

provides a way to determine the composing frequencies of . One could evaluate the dual trigonometric polynomial

and localize the frequencies by identifying the locations where the polynomial achieves modulus 1. The evaluation of can be performed efficiently using the fast Fourier transform. Once the frequencies are estimated, the coefficients can be obtained by solving a linear system of equations. We illustrate frequency localization from the dual polynomial in Fig. 1.

Fig. 1. Frequency localization from the dual polynomial. Original frequencies in the signal (dashed blue), with their heights representing the coefficient magnitudes, and the dual polynomial modulus (solid red) obtained from a dual optimum. The recovered frequencies are obtained by identifying points where the dual polynomial has modulus one.

We would like to caution the reader that the dual optimal solutions are not unique in general. However, we can show that any dual optimal solution must contain the frequencies in whenever the optimal primal solution is . To see this, let be the set of recovered frequencies, and assume ; then we have

where the strict inequality follows because for . This strict inequality contradicts strong duality. Therefore, we have .

In general, the set might contain spurious frequencies.

However, if we leverage the fact that the atomic norm is representable in terms of linear matrix inequalities, we can show that most semidefinite programming solvers will find good dual certificates. To make this precise, let us study the dual semidefinite program of the atomic norm problem (II.6). This dual problem also has a semidefinite programming formulation

(II.13)

(II.14)

(II.15)

(II.16)

Here, the three linear matrix inequalities (II.14), (II.15), and (II.16) are equivalent to the dual norm constraint .
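The localization step described in Section II-B is simple to implement once a dual optimal vector is in hand: evaluate the dual polynomial on a fine grid with a zero-padded FFT and keep the grid points where its modulus (nearly) attains 1. A minimal sketch, with our own function name and a hypothetical tolerance:

```python
import numpy as np

def localize_frequencies(q, grid=1 << 14, tol=1e-4):
    """Evaluate Q(f) = sum_j q_j e^{-2*pi*i*j*f} on a uniform grid via a
    zero-padded FFT and return the frequencies where |Q| peaks at modulus 1."""
    Q = np.fft.fft(q, grid)              # Q at f = m/grid, m = 0, ..., grid-1
    mag = np.abs(Q)
    # Keep local maxima of |Q| that reach the unit-modulus bound.
    peak = (mag >= np.roll(mag, 1)) & (mag >= np.roll(mag, -1))
    return np.arange(grid)[peak & (mag >= 1 - tol)] / grid
```

The grid density and tolerance are assumptions for illustration; in practice one would refine the candidate peaks locally before solving the linear system for the coefficients.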


We emphasize that most solvers can directly return a dual optimal solution for free when solving the primal problem. So it is not necessary to solve the dual semidefinite program to obtain a dual optimum.

The following proposition addresses when . The proof is given in Appendix B.

Proposition II.5 (Exact Frequency Localization): For signal , denote by the set of optimal solutions of the dual semidefinite program. If there exists one such that the dual polynomial

satisfies

(II.17)

(II.18)

then the following statements hold:
1) all in the relative interior of form strictly complementary pairs with the (unique) primal optimal solution, i.e.,

(II.19)

2) all in the relative interior of satisfy (II.17) and (II.18);
3) the dual central path converges to a point in the relative interior of .

Statement (2) of the proposition implies that we could use any optimal solution in the relative interior of the dual optimal set to localize frequencies, while statement (3) says any primal-dual path-following algorithm for solving semidefinite programs (e.g., SDPT3) will produce such an interior point in the dual optimal set.

Under the conditions of Proposition II.5, the relative interior of excludes dual optimal solutions that contain spurious frequencies. A particular pathological case is when all coefficients in (I.1) are positive and (assume and every index is observed), which is apparently a dual optimal solution. The dual polynomial corresponding to is constant 1 and contains every frequency in . It is easy to show that the only such that satisfies (II.14), (II.15), and (II.16) is . Hence, is not maximally complementary unless contains at least frequencies, in which case the only degree trigonometric polynomial satisfying (II.17) is constant 1.

However, this pathological case will not happen if we have a

few well-separated frequencies and the phases are random.¹ Indeed, as we will see in Section IV, the way we prove Theorems II.3 and I.1 is to explicitly construct a dual polynomial satisfying (II.17) and (II.18). Therefore, combining these theorems and Proposition II.5, we obtain the following corollary.

Corollary II.6: Under the conditions of Theorem I.1 (or Theorem II.3), we can identify the frequencies in from a dual optimal solution in the relative interior of the set of all dual optimal solutions with high probability.

¹We believe the randomness requirement on the phases is merely technical.

C. Power of Rank Minimization

The semidefinite programming characterization of the atomic norm also allows us to draw connections to the study of rank minimization [20], [35]–[37]. A direct way to exploit sparsity in the frequency domain is via minimization of the following “ -norm”-type quantity

This penalty function chooses the sparsest representation of a vector in terms of complex exponentials. This combinatorial quantity is closely related to the rank of positive definite Toeplitz matrices, as delineated by the following proposition.

Proposition II.7: The quantity is equal to the optimal value of the following rank minimization problem:

(II.20)
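The correspondence between spectral sparsity and Toeplitz rank is easy to check numerically. The snippet below (our own hypothetical example) builds the Toeplitz lift of three well-separated sinusoids through the Vandermonde decomposition of Lemma II.2 and confirms that its rank equals the number of frequencies:

```python
import numpy as np

# Hypothetical example: 3 frequencies with positive weights.
n = 16
freqs = np.array([0.11, 0.38, 0.72])
d = np.array([1.0, 2.0, 0.5])

# Vandermonde matrix whose columns are the atoms sampled at 0, ..., n-1.
V = np.exp(2j * np.pi * np.outer(np.arange(n), freqs))
T = V @ np.diag(d) @ V.conj().T      # PSD Toeplitz lift (Lemma II.2)

rank = np.linalg.matrix_rank(T)      # equals the number of sinusoids
print(rank)                          # 3
```

Here T[i, j] depends only on i - j, so T is Hermitian Toeplitz by construction, and distinct frequencies with positive weights give it rank exactly 3.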

Proof: The case for is trivial. For , denote by the optimal value of (II.20). We first show . Suppose . Assume the decomposition with achieves , and set so that . Then, as we saw in (II.3),

This implies that .

We next show . The case is trivial, as we could always expand on a Fourier basis, implying . We focus on . Suppose is an optimal solution of (II.20). Then, if

is a Vandermonde decomposition, positive semidefiniteness implies that is in the range of , which means that can be expressed as a combination of at most atoms, completing the proof.

Hence, for this particular set of atoms, atomic norm minimization is a trace relaxation of a rank minimization problem. The trace relaxation has been proven to be a powerful relaxation for recovering low-rank matrices subject to random linear equations [20], values at a specified set of entries [35], Euclidean distance constraints [38], and partial quantum expectation values [39]. However, our sampling model is far more constrained, and none of the existing theory applies to our problem. Indeed, typical results on trace-norm minimization demand that the number of measurements exceed the rank of the matrix times the number of rows in the matrix. In our case, this would amount to measurements for an sparse signal. We prove

Page 7: IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 59, NO. …€¦ · IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 59, NO. 11, NOVEMBER 2013 7465 Compressed Sensing Off the Grid Gongguo

TANG et al.: COMPRESSED SENSING OFF THE GRID 7471

Fig. 2. Moments and their convex hulls. (a) The real moment curve for the first three moments. (b) The moment curve for the same frequencies, but adding inphase. (c) The convex hull of (a). (d) The convex hull of (b). Whereas all of the secants of (a) are extreme in their convex hull (c), many segments between atomsof (b) lie inside the convex hull (d).

in the sequel that only samples are required,dramatically reducing the dependence on .We close this section by noting that a positive combination

of complex sinusoids with zero phases observed at the first several samples can be recovered via the trace relaxation with no limitation on the resolution [40], [41]. Why does the story change when we bring unknown phases into the picture? A partial answer is provided by Fig. 2. Fig. 2(a) and (b) displays the set of atoms with no phase (i.e., $\phi = 0$) and phase either $0$ or $\pi$, respectively. That is, Fig. 2(a) plots the set

while (b) displays the set

Note that the second set is simply the first together with its negation. Their convex hulls are displayed in Fig. 2(c) and (d), respectively. The convex hull of the first set is neighborly in the sense that every edge between every pair of atoms is an exposed face and every atom is an extreme point. On the other hand, the only secants between atoms of the second set that are faces of its convex hull are those between atoms with far-apart phase angles and frequencies. Problems only worsen if we let the phase range over $[0, 2\pi)$. Thus, our intuition from positive moment curves does not extend to the compressed sensing problem of sinusoids with complex phases. Nonetheless, we are able to demonstrate that under mild resolution assumptions, we can still recover sparse superpositions from very small sampling sets.
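Proposition II.7's link between spectral sparsity and Toeplitz rank is easy to verify numerically. The sketch below is our own illustration (not code from the paper): it forms $\mathrm{Toep}(u) = \sum_k c_k\, a(f_k) a(f_k)^*$ for $r$ distinct frequencies with positive weights and checks that the resulting Hermitian PSD Toeplitz matrix has rank exactly $r$.

```python
import numpy as np

# Our illustration (not the paper's code) of Proposition II.7: a positive
# combination of r atoms yields Toep(u) = sum_k c_k a(f_k) a(f_k)^*, a
# Hermitian PSD Toeplitz matrix whose rank equals the number of frequencies.
n = 32
freqs = np.array([0.1, 0.3, 0.55, 0.8])   # r distinct frequencies in [0, 1)
c = np.array([1.0, 2.0, 0.5, 1.5])        # positive coefficients c_k
r = len(freqs)

# Atoms a(f) = (1, e^{i 2 pi f}, ..., e^{i 2 pi f (n-1)})
V = np.exp(2j * np.pi * np.outer(np.arange(n), freqs))
T = (V * c) @ V.conj().T                  # Toeplitz: entry (a, b) depends on a - b

assert np.allclose(T, T.conj().T)                      # Hermitian
assert np.linalg.eigvalsh(T).min() > -1e-9             # positive semidefinite
print(np.linalg.matrix_rank(T))                        # rank r = 4
```

Conversely, by the Vandermonde decomposition invoked in the proof, any PSD Toeplitz matrix of rank $r < n$ arises in exactly this way.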

III. PRIOR ART AND INSPIRATIONS

Frequency estimation is extensively studied, and techniques for estimating sinusoidal frequencies from time samples date back to the work of Prony [21]. Many linear prediction algorithms based on Prony's method were proposed to estimate the


frequencies from regularly spaced time samples. A survey of these methods can be found in [42], and an extensive list of references is given in [23]. With equispaced samples, these root-finding-based procedures deal with the problem directly on the continuous frequency domain, and can recover the frequencies provided the number of samples is at least twice the number of frequencies, regardless of how closely these frequencies are located [5], [21], [23], [42].

In recent work [24], Candès and Fernandez-Granda studied

this problem from the point of view of convex relaxations and proposed a total-variation norm minimization formulation that provably recovers the spectrum exactly. However, the convex relaxation requires the frequencies to be well separated by the inverse of the number of samples. The proof techniques of this prior work form the foundation of the analysis in the sequel, but many major modifications are required to extend their results to the compressed sensing regime.

In [31], the authors proposed using the atomic norm to denoise

a line spectral signal corrupted with Gaussian noise, and reformulated the resulting atomic norm minimization problem as a semidefinite program using the bounded real lemma [30]. Denoising is important to frequency estimation since the frequencies in a line spectral signal corrupted with moderate noise can be identified by linear prediction algorithms. Since the atomic norm framework in [31] is essentially the same as the total-variation norm framework of [24], the same semidefinite program can also be applied to total-variation norm minimization.

What is common to all aforementioned approaches, including

linear prediction methods, is the reliance on observing uniform or equispaced time samples. In sharp contrast, we show that nonuniform sampling is not only a viable option under which the original spectrum can be recovered exactly in the continuous domain, but is in fact a means of compressive or compressed sampling. Indeed, nonuniform sampling allows us to effectively sample the signal at a sub-Nyquist rate. We point out that in this paper we have handled only the case of undersampling uniform samples, as opposed to arbitrarily nonuniform samples. However, this is still of practical importance. For array signal processing applications, this corresponds to a reduction in the number of sensors required for exact recovery, since each sensor obtains one spatial sample of the field. An extensive justification of the necessity of using randomly located sensor arrays can be found in [43]. To the best of our knowledge, little is known about exact line-spectrum recovery with nonuniform sampling using parametric methods, except sporadic work using $\ell_1$-norm minimization to recover the missing samples [44], or work based on nonlinear least-squares data fitting [45]. Nonparametric methods such as the periodogram and correlogram for nonuniform sampling have gained popularity in recent years [46], [47], but they do not typically provide exact frequency localization even when all of the samples are observed.

An interesting feature related to using convex optimization-based methods for estimation such as [24] is a particular resolvability condition: the separation between frequencies is required to be greater than a constant multiple of the inverse of the number of samples. Linear prediction methods do not have a resolvability limitation, but it is known that in practice the numerical stability of root finding limits how close the frequencies can be. Theorem II.3

can be viewed as an extension of the theory to nonuniform samples. Note that our approach gives an exact semidefinite characterization and is hence computationally tractable. We believe our results have potential impact on two related areas: extending compressed sensing to continuous dictionaries, and extending line spectral estimation to nonuniform sampling, thus providing new insight into sub-Nyquist sampling and superresolution.
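As a concrete reference point for the linear-prediction methods discussed at the start of this section, here is a minimal Prony-style sketch (our own illustration, not the paper's code): with $2r$ noiseless equispaced samples, it recovers $r$ frequencies by solving the linear-prediction system and rooting the resulting polynomial.

```python
import numpy as np

# Minimal Prony-style linear prediction (our illustration, not the paper's
# code): recover r frequencies exactly from 2r noiseless equispaced samples
# x_t = sum_k c_k exp(i 2 pi f_k t).
def prony(x, r):
    # Linear-prediction system: x_t = -sum_{m=1}^{r} a_m x_{t-m}, t = r..2r-1
    A = np.array([[x[t - m] for m in range(1, r + 1)] for t in range(r, 2 * r)])
    a = np.linalg.solve(A, -x[r:2 * r])
    # Frequencies are the root angles of z^r + a_1 z^{r-1} + ... + a_r
    roots = np.roots(np.concatenate(([1.0 + 0j], a)))
    return np.sort(np.angle(roots) / (2 * np.pi) % 1.0)

true_f = np.array([0.12, 0.37, 0.6])
coeffs = np.array([1.0, 0.7 - 0.2j, 2.0])
x = np.exp(2j * np.pi * np.outer(np.arange(6), true_f)) @ coeffs

print(prony(x, 3))   # recovers [0.12, 0.37, 0.6] up to numerical precision
```

Note how nothing here constrains the spacing of the frequencies, matching the no-resolution-limit claim for noiseless equispaced data; the method, however, has no obvious extension to randomly subsampled observations.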

IV. PROOF OF THEOREM II.3

The key to showing that the optimization (II.5) succeeds is to construct a dual variable satisfying the conditions (II.10), (II.11), and (II.12) in Proposition II.4 to certify the optimality of the candidate decomposition. The rest of the paper's proofs focus on the symmetric case.

As shown in Proposition II.4, the dual certificate can be interpreted as a polynomial with bounded modulus on the unit circle. The polynomial is constrained to have most of its coefficients equal to zero. In the case that all of the entries are observed, the polynomial constructed by Candès and Fernandez-Granda [24] suffices to guarantee optimality. Indeed, they write the certificate polynomial via a kernel expansion and show that one can explicitly find appropriate kernel coefficients that certify optimality. We review this construction in Section IV-A. The requirements of the certificate polynomial in our case are far more stringent and require a nontrivial modification of the Candès–Fernandez-Granda construction. We use a random kernel that has nonzero coefficients only at the indices corresponding to observed locations. (The randomness enters because the samples are observed at random.) The expected value of our random kernel is a multiple of the kernel developed in [24].

Using a matrix Bernstein inequality, we show that we can

find suitable coefficients to satisfy most of the optimality conditions. We then write our solution in terms of the deterministic kernel plus a random perturbation. The remainder of the proof is dedicated to showing that this random perturbation is small everywhere. First, we show that the perturbation is small on a fine grid of the circle in Section IV-E. To do so, we emulate the proof of Candès and Romberg for reconstruction from incoherent bases [48]. Finally, in Section IV-F, we complete the proof by estimating the Lipschitz constant of the random polynomial and, in turn, proving that the perturbations are small everywhere. Our proof is based on Bernstein's polynomial inequality, which was used to estimate the noise performance of atomic norm denoising by Bhaskar et al. [31].

A. Detour: When All Entries are Observed

Before we consider the random observation model, we explain how to construct a dual polynomial when all entries are observed. The kernel-based construction method, which was first proposed in [24], inspires our random kernel-based construction in Section IV-C. The results presented in this section are also necessary for our later proofs.

When all entries are observed, the optimization problem (II.5)

has a trivial solution, but we can still apply duality to certify the optimality of a particular decomposition. Indeed, a dual polynomial satisfying the conditions given in Proposition II.4 means that the decomposition achieves the atomic norm. To construct such a dual polynomial, Candès and Fernandez-Granda suggested considering a polynomial of the following form [24]:

$$Q(f) = \sum_{f_k \in T} \big( \alpha_k K(f - f_k) + \beta_k K'(f - f_k) \big) \qquad \text{(IV.1)}$$

Here, $K(f)$ is the squared Fejér kernel

(IV.2)

(IV.3)

with

(IV.4)

the discrete convolution of two triangular functions. The squared Fejér kernel is a good candidate kernel because it attains the value of 1 at its peak and rapidly decays to zero. Provided a separation condition is satisfied by the original signal, a suitable set of coefficients $\alpha_k$ and $\beta_k$ can always be found. We use $K'$, $K''$, and $K'''$ to denote the first three derivatives of $K$. We list some useful facts about the kernel $K$:

For the weighting function , we have

(IV.5)

We require that the dual polynomial (IV.1) satisfies

(IV.6)

(IV.7)

for all $f_k \in T$. The constraint (IV.6) guarantees that $Q$ satisfies the interpolation condition (II.10), and the constraint (IV.7) is used to ensure that $|Q(f)|$ achieves its maximum at frequencies in $T$. Note that the condition (II.12) is absent in this section's setting since the set of unobserved indices is empty.

We rewrite these linear constraints in the matrix-vector form

where

and the right-hand side vector collects the corresponding interpolation values. We have rescaled the system of linear equations such that the system matrix is symmetric, positive semidefinite, and very close to the identity. Positive definiteness follows because the system matrix is a positive combination of outer products. To get an idea of why the system matrix is near the identity, observe that one diagonal block is symmetric with ones on the diagonal, the off-diagonal block is antisymmetric, and the other diagonal block is symmetric with negative entries on the diagonal. We define

(IV.8)

and summarize properties of the system matrix and its submatrices in the following proposition, whose proof is given in Appendix C.

Proposition IV.1: Suppose the separation condition holds. Then, the system matrix is invertible and

(IV.9)

(IV.10)

(IV.11)

Here, $\|\cdot\|$ denotes the matrix operator norm. For notational simplicity, partition the inverse as

where the diagonal blocks are of matching dimensions. Then, solving for the coefficients yields

(IV.12)


Then, the $\ell$th derivative of the dual polynomial (after normalization) is

(IV.13)

where we have defined

(IV.14)

with $K^{(\ell)}$ the $\ell$th derivative of $K$. To certify that the polynomial with the coefficients (IV.12) is bounded uniformly on the unit circle, Candès and Fernandez-Granda divide the domain into regions near to and far from the frequencies of $x$. Define

the near and far regions accordingly. On one region, the polynomial was analyzed directly, while on the other it is bounded by showing that its second-order derivative is negative. The following results are derived in the proofs of Lemmas 2.3 and 2.4 in [24].

Proposition IV.2: Assume the separation condition holds. Then, we have

(IV.15)

and for

(IV.16)

(IV.17)

(IV.18)

(IV.19)

(IV.20)

and as a consequence,

(IV.21)
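To build intuition for the kernel's role, the snippet below evaluates a normalized squared Fejér kernel. The normalization is our own choice for illustration, and the exact constants in (IV.2)–(IV.4) may differ; the point is the property used above: the kernel equals 1 at its peak and decays rapidly away from it.

```python
import math

# A normalized squared Fejér kernel (our normalization, assumed for
# illustration; the constants in (IV.2)-(IV.4) may differ):
#   K(f) = (sin(pi M f) / (M sin(pi f)))^4, with K(0) = 1 by continuity.
def squared_fejer(f, M=64):
    s = math.sin(math.pi * f)
    if abs(s) < 1e-12:          # peak of the kernel (f = 0 mod 1)
        return 1.0
    return (math.sin(math.pi * M * f) / (M * s)) ** 4

print(squared_fejer(0.0))                 # 1.0 at the peak
print(squared_fejer(4.5 / 64) < 1e-4)     # True: tiny a few 1/M from the peak
```

The fourth power (squaring the usual Fejér kernel) is what produces the fast decay exploited in Proposition IV.2.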

B. Bernoulli Observation Model

The uniform sampling model is difficult to analyze directly. However, the same argument used in [1] shows that the probability of recovery failure under the uniform model is at most twice that under a Bernoulli model. Here, by "recovery failure" we mean that (II.5) does not recover the original signal. Therefore, without loss of generality, we focus on the following Bernoulli observation model in our proof.

We observe each entry independently with probability $p$. Let

$\delta_j = 1$ or $0$ indicate whether we observe the $j$th entry. Then, the $\delta_j$ are i.i.d. Bernoulli random variables such that

On average in this model, we will observe $pn$ entries.

C. Random Polynomial Kernels

We now turn to designing a dual certificate for the Bernoulli observation model. As in the case where all entries are observed, the challenge is to construct a dual polynomial satisfying

as well as an additional constraint

(IV.22)

The main difference in our random setting is that the demands on the polynomial are much stricter, as manifested by (IV.22); namely, we require that most of the coefficients of the polynomial are equal to zero. Our approach mimics the construction in the deterministic case to write

(IV.23)

but using a random kernel, which has nonzero coefficients only on the random subset of observed indices and matches the deterministic kernel in expectation up to a constant factor. We will then prove that the random kernel concentrates tightly around its expectation. Our random kernel is simply the expansion (IV.3), but with

each term multiplied by a Bernoulli random variable corresponding to the observation of a component:

As in (IV.4), the coefficient sequence is the convolution of two discrete triangular functions. The $\ell$th derivative of the random kernel is

Both the random kernel and its derivatives are random trigonometric polynomials of bounded degree. More importantly, they contain a given complex exponential only if the corresponding Bernoulli variable is one, or equivalently,

Fig. 3. Plots of the random kernel. (a) The random kernel overlaid on its deterministic counterpart. (b) The derivative of the random kernel overlaid on its deterministic counterpart.

only if the corresponding entry is observed. Hence, the polynomial in (IV.23) is of the form (II.9) and satisfies the support constraint. It is easy to calculate the expected values of the random kernel and its derivatives:

(IV.24)

In Fig. 3, we plot the random kernel and its derivative laid over their expected counterparts, respectively. We see that far away from

the peak, the random coefficients induce bounded oscillations in the kernel. Near 0, however, the random kernel remains sharply peaked.

In order to satisfy the conditions (II.10) and (II.11), we require that the polynomial in (IV.23) satisfies

(IV.25)

(IV.26)

for all $f_k \in T$. As before, the constraint (IV.25) guarantees that the polynomial satisfies the interpolation condition (II.10), and the constraint (IV.26) helps ensure that its modulus achieves its maximum at frequencies in $T$.

We now have the linear constraints (IV.25), (IV.26) on the unknown coefficients. The remainder of the proof consists of three steps.

1) Show that the linear system (IV.25), (IV.26) is invertible with high probability using the matrix Bernstein inequality [49].

2) Show that the random perturbations introduced by the random observation process are small on a set of discrete points with high probability, implying the random dual polynomial satisfies the constraints in Proposition II.4 on the grid. This step is proved using a modification of the idea in [48].

3) Extend the result to the continuous domain using Bernstein's polynomial inequality [50] and eventually show that $|Q(f)| < 1$ for $f \notin T$.

D. Invertibility

In this section, we show that the linear system (IV.25) and (IV.26) is invertible. Rewrite (IV.25) and (IV.26) in the following matrix-vector form:

(IV.27)

where the unknown coefficient vectors and the right-hand side are as in the deterministic case. Note that we still rescale the derivatives using the deterministic quantity rather than its random counterpart.

The expectation computation (IV.24) implies that

where

. Define

where

(IV.28)

Page 12: IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 59, NO. …€¦ · IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 59, NO. 11, NOVEMBER 2013 7465 Compressed Sensing Off the Grid Gongguo

7476 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 59, NO. 11, NOVEMBER 2013

Then, we have

with the deterministic matrix defined in (IV.8). As a consequence, the system matrix decomposes into its expectation plus a zero-mean random self-adjoint perturbation. We will apply the noncommutative Bernstein inequality to show that it concentrates about its mean with high probability.

Lemma IV.3 (Noncommutative Bernstein Inequality, [49, Th. 1.4]): Let $X_1, \ldots, X_K$ be a finite sequence of independent, zero-mean, random self-adjoint matrices of dimension $d$. Suppose that $\|X_k\| \le R$ almost surely and set $\sigma^2 = \big\| \sum_k \mathbb{E}[X_k^2] \big\|$. Then, for all $t \ge 0$,

$$\mathbb{P}\Big\{ \lambda_{\max}\Big( \sum_k X_k \Big) \ge t \Big\} \le d \exp\Big( \frac{-t^2/2}{\sigma^2 + Rt/3} \Big).$$
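The lemma can be sanity-checked on a toy instance. Below (our illustration with hypothetical parameters), each $X_k$ is a random sign times a single diagonal entry, so $\|\sum_k X_k\|$ is a maximum of scalar Rademacher sums; applying the lemma to $\pm X_k$ bounds the operator-norm tail by $2d\exp(-\tfrac{t^2/2}{\sigma^2 + Rt/3})$, and the simulated tail respects the bound.

```python
import math
import random

# Toy instance of the noncommutative Bernstein inequality (our illustration,
# hypothetical parameters): X_k = eps_k * E_{jj} with Rademacher eps_k,
# cycling over the d diagonal positions. Then ||sum_k X_k|| = max_j |S_j|,
# where each S_j is a sum of n/d independent signs, R = 1, sigma^2 = n/d.
# Applying the lemma to +/- X_k bounds the operator-norm tail by
# 2 d exp(-(t^2/2) / (sigma^2 + R t / 3)).
random.seed(0)
d, n, trials, t = 8, 800, 1000, 25.0
sigma2, R = n / d, 1.0

bound = 2 * d * math.exp(-(t * t / 2) / (sigma2 + R * t / 3))

hits = 0
for _ in range(trials):
    maxcoord = max(abs(sum(random.choice((-1, 1)) for _ in range(n // d)))
                   for _ in range(d))
    if maxcoord >= t:
        hits += 1

print(hits / trials, "<=", round(bound, 3))   # empirical tail vs. the bound
```

The dimensional factor $d$ (here small) is what makes the bound useful even for matrix-valued sums, at the price of a logarithmic dependence on the dimension in sample-complexity statements.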

For , define the event

(IV.29)

The following lemma, proved in Appendix D, shows that the event has high probability if the expected number of samples is large enough.

Lemma IV.4: If , then we have

provided

Note that an immediate consequence of Lemma IV.4 is that the system matrix is invertible on the event defined above. Additionally, Lemma IV.4 allows us to control the norms of its submatrices. For that purpose, we partition it as

and obtain the following corollary.

Corollary IV.5: On this event, we have

The proof of this corollary uses elementary matrix analysis and can be found in Appendix E. Since on this event the matrix is invertible, we solve for the coefficients from (IV.27):

(IV.30)

In the next section, we will plug (IV.30) back into (IV.23) and analyze the effect of random perturbations on the polynomial.

E. Random Perturbations

In this section, we show that the random dual polynomial concentrates around its deterministic counterpart on a discrete set of grid points. We introduce a random analog of the vector defined by (IV.14), as

(IV.31)

with the $\ell$th derivative of the random kernel and the matrix defined in (IV.28). The expectation of this random vector is equal to $p$ times its deterministic counterpart defined by (IV.14):

Then, in a similar fashion to (IV.13), we rewrite

We decompose this random vector into three parts:


which induces a decomposition on

(IV.32)

Here, the first term is defined as in (IV.13), and we have defined

and

The goal of the remainder of this section is to show, in Lemmas IV.8 and IV.9, that the two perturbation terms are small on a set of grid points with high probability.

The proof of Lemma IV.8, which shows the first perturbation term is small on the grid, essentially follows that of Candès and Romberg [48]. We include the proof details here for completeness, but very little changes in the argument. Since the perturbation is a weighted sum of independent random variables following a symmetric distribution on the complex unit circle, for fixed $f$ we apply Hoeffding's inequality to control its value. This in turn requires an estimate of the weight vector's norm. In Lemma IV.6, we first use concentration of measure (Lemma F.1) to establish that an intermediate quantity is small with high probability. In Lemma IV.7, we then combine Lemmas IV.6 and IV.4 to show

that the relevant weighted norm is small. The extension from a fixed point to a finite set relies on the union bound.

We start with the following lemma, whose proof, given in Appendix F, is based on an inequality of Talagrand.

Lemma IV.6: Fix a frequency. Let

and fix a positive number

if

otherwise

Then, we have

for some .

The following lemma combines Lemma IV.6 and Corollary IV.5 to show the weighted norm is small with high probability.

Lemma IV.7: Consider a finite set of points. With the same notation as in the last lemma, we have

Proof of Lemma IV.7: Conditioned on the events and

we have

where we have used Proposition IV.1 and Corollary IV.5. The claim of the lemma then follows from the union bound.

Lemma IV.7 together with Hoeffding's inequality allows us to control the size of the first perturbation term.

Lemma IV.8: There exists a numerical constant $C$ such that

if

then we have

The next lemma controls the second perturbation term. Its proof is similar to that of Lemma IV.8.

Lemma IV.9: There exists a numerical constant $C$ such that

if

then we have


Both Lemmas IV.8 and IV.9 are proven in the Appendix. Denote the event as

(IV.33)

Combining the decomposition (IV.32), Lemma IV.8, and Lemma IV.9 with a suitable redefinition of the constants immediately yields the following proposition.

Proposition IV.10: Suppose we are given a finite set of grid points, a parameter that controls the maximal deviation of the random polynomial and its derivatives from their expectations as in (IV.33), and a small constant that controls the failure probability. Then, there exists a constant $C$ such that

(IV.34)

is sufficient to guarantee

F. Extension to Continuous Domain

We have proved that the random dual polynomial and its deterministic counterpart are not far apart on a set of grid points. This section aims at extending this statement to everywhere in the frequency domain, eventually showing that $|Q(f)| < 1$ for $f \notin T$. The key is the following Bernstein polynomial inequality.

Lemma IV.11 (Bernstein's Polynomial Inequality, [50]): Let $p$ be any polynomial of degree $N$ with complex coefficients. Then,

$$\sup_{|z| \le 1} |p'(z)| \le N \sup_{|z| \le 1} |p(z)|.$$
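The inequality is easy to sanity-check numerically. The snippet below (our illustration) evaluates an arbitrary degree-5 polynomial and its derivative on a fine grid of the unit circle and confirms $\max |p'| \le N \max |p|$.

```python
import cmath

# Sanity check of Bernstein's polynomial inequality (our illustration):
# for p of degree N, sup_{|z|=1} |p'(z)| <= N * sup_{|z|=1} |p(z)|.
coeffs = [1, 0, 2, 0, 0, 1]                 # p(z) = 1 + 2 z^2 + z^5, so N = 5

def horner(c, z):
    # Evaluate a polynomial given ascending coefficients c[0] + c[1] z + ...
    acc = 0j
    for a in reversed(c):
        acc = acc * z + a
    return acc

N = len(coeffs) - 1
dcoeffs = [k * c for k, c in enumerate(coeffs)][1:]   # coefficients of p'(z)

grid = [cmath.exp(2j * cmath.pi * k / 4096) for k in range(4096)]
max_p = max(abs(horner(coeffs, z)) for z in grid)
max_dp = max(abs(horner(dcoeffs, z)) for z in grid)

print(max_dp <= N * max_p)   # True
```

Equality is attained only by monomials $cz^N$; for generic polynomials, as here, there is considerable slack, which is part of what the proof exploits when trading grid bounds for uniform bounds.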

Our first proposition verifies that our random dual polynomial is close to the deterministic dual polynomial on all of the frequency domain.

Proposition IV.12: Suppose that

(IV.35)

Then, with probability , we have that

(IV.36)

for all and .

Proof: It suffices to prove (IV.36) on the relevant sets and then modify the lower bound (IV.34). We first give a very rough estimate on the set:

where we have used two elementary bounds. To see the latter, we note

where we have used

and

Viewing the quantity above as a trigonometric polynomial of bounded degree, according to Bernstein's polynomial inequality we obtain


We select a grid such that for any frequency there exists a grid point within a prescribed distance; the size of such a grid is bounded accordingly. With this choice of grid, on the set we have

Finally, we modify the condition (IV.34) according to our choice of grid:

An immediate consequence of Proposition IV.12 and the bound (IV.15) of Proposition IV.2 is the following estimate on the dual polynomial away from the true frequencies.

Lemma IV.13: Suppose

Then, with probability , we have

(IV.37)

Proof: It suffices to choose the grid size appropriately. The rest follows from (IV.36), the triangle inequality, and a modification of the constant in (IV.35).

A similar statement holds near the true frequencies.

Lemma IV.14: Suppose

Then, we have the stated bound for all frequencies in the near region.

Proof: Define the relevant comparison functions. Since the grid estimates hold, with the latter implying

we only need to show the bound on the near region. Take the second-order derivative:

So it suffices to show that for

As a consequence of (IV.36) in Proposition IV.12, the triangle inequality, and (IV.16)–(IV.20) of Proposition IV.2, we have on the set, for any

implying

when the constant assumes a sufficiently small numerical value. With this choice, the sample-size condition becomes

Therefore, $|Q(f)| < 1$ on the near region except at the true frequencies. We actually proved a stronger result: with probability at least

(IV.38)

Proof of Theorem II.3: Finally, if the sample-size and separation conditions hold, then combining Lemmas IV.13 and IV.14, we have proved the claim of Theorem II.3.
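The grid-to-continuum step that closes the proof can be miniaturized as follows. In this sketch (our illustration, with an arbitrary test polynomial), we certify a bound on $\sup_f |q(f)|$ for a trigonometric polynomial from its values on a coarse grid plus a derivative (Lipschitz) bound in the spirit of Bernstein's inequality, then confirm the certificate against a much finer grid.

```python
import cmath

# Our illustration of the covering argument: bound sup_f |q(f)| for a
# trigonometric polynomial q(f) = sum_j c_j exp(i 2 pi j f) using its values
# on a coarse grid plus the Lipschitz bound L = 2*pi*sum_j |j||c_j| on q'.
coeffs = {0: 1.0, 1: 0.5 - 0.3j, -2: 0.8j, 3: -0.4}   # arbitrary test polynomial

def q(f):
    return sum(c * cmath.exp(2j * cmath.pi * j * f) for j, c in coeffs.items())

L = 2 * cmath.pi * sum(abs(j) * abs(c) for j, c in coeffs.items())

coarse = 64                       # grid spacing h = 1/64
h = 1.0 / coarse
grid_max = max(abs(q(k * h)) for k in range(coarse))
certified = grid_max + L * h / 2  # every f is within h/2 of a grid point

fine_max = max(abs(q(k / 8192)) for k in range(8192))
print(fine_max <= certified)      # True: the continuum max obeys the certificate
```

Since $|q|$ is $L$-Lipschitz, any point of $[0, 1)$ within $h/2$ of a grid point can exceed the grid maximum by at most $Lh/2$, which is exactly how the proof converts the finite-grid bounds of Section IV-E into bounds on the whole circle.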

V. NUMERICAL EXPERIMENTS

We conducted a series of numerical experiments to test the performance of (II.5) under various parameter settings


Fig. 4. Frequency Estimation: Blue represents the true frequencies, while red represents the estimated ones.

TABLE I. PARAMETER CONFIGURATIONS

(see Table I). The same setup is used for all numerical experiments.

We compared the performance of two algorithms: the

semidefinite program (II.6) and the basis pursuit obtained through discretization

(V.1)

Here, the dictionary is a DFT matrix of appropriate dimension depending on the grid size. Note that since the components of the signal are complex, this is a second-order cone problem. In the following, we use SDP and BP to label the semidefinite program algorithm and the basis pursuit algorithm, respectively. We solved the SDP with the SDPT3 solver [51] and the basis pursuit (V.1) with CVX [52] coupled with SDPT3. All parameters of the SDPT3 solver were set to default values, and CVX precision was set to "high." For the BP, we used three levels of discretization at 4, 16, and 64 times the signal dimension.

To generate our instances of form (II.1), we sampled

normalized frequencies from $[0, 1)$, either randomly or equispaced. Random frequencies are sampled uniformly with an additional constraint on the minimal separation. Given the number of frequencies, equispaced frequencies are generated with the same separation plus an additional random shift. This random shift ensures that, in most cases, basis mismatch occurs for the discretization method. The signal coefficient magnitudes are either unit, i.e., equal to 1, or fading, i.e., modulated by a zero-mean, unit-variance Gaussian random variable. The signs follow either a Bernoulli distribution, labeled as real, or a uniform distribution on the complex unit circle, labeled as complex. A full-length signal was then formed according to model (II.1). As a final

step, we uniformly sample entries of the resulting signal.

We tested the algorithms on three sets of experiments. In the

first experiment, by running the algorithms on a randomly generated instance with samples selected uniformly at random, we compare SDP's and BP's ability at frequency localization and visually illustrate the effect of discretization. We see from Fig. 4 that SDP recovery, followed by retrieving the frequencies according to Proposition II.5, gives the most accurate result. We also observe that increasing the level of discretization can increase BP's accuracy in locating the frequencies.

In the second set of experiments, we compare the performance of SDP and BP with three levels of discretization in terms of solution accuracy and running time. The parameter configurations are summarized in Table I. Each configuration was repeated ten times, resulting in a total of 1920 valid experiments, excluding invalid instances.

We use the performance profile as a convenient way to compare the performance of different algorithms. The performance profile proposed in [53] visually presents the performance of a set of algorithms under a variety of experimental conditions. More specifically, let $E$ be the set of experiments and let $m(a, e)$ specify the performance of algorithm $a$ on experiment $e$ for some metric (the smaller the better), e.g., running time or solution accuracy. Then, the performance profile is defined as

$$P_a(\beta) = \frac{\big|\{\, e \in E : m(a, e) \le \beta \min_{a'} m(a', e) \,\}\big|}{|E|}.$$

Roughly speaking, $P_a(\beta)$ is the fraction of experiments such that the performance of algorithm $a$ is within a factor $\beta$ of that of the best-performing one.

We show the performance profiles for numerical accuracy and

running times in Fig. 5(a) and (b), respectively. We see that SDP significantly outperforms BP at all tested discretization levels in terms of numerical accuracy. When the discretization levels are higher, e.g., 64×, the running times of BP exceed those of SDP.

To give the reader a better idea of the numerical accuracy and

the running times, in Table II we present their medians and median absolute deviations for the four algorithms. As one would expect, the running time increases as the discretization level increases. We also observe that SDP is very accurate, with a very small median error. Increasing the level of discretization can increase the accuracy of BP; however, at the finest discretization level, while the median accuracy improves, the median running time already exceeds that of SDP.

Fig. 5. Performance profiles for solution accuracy and running times. Note the x-axes are in logarithmic scale for both plots. (a) Solution accuracy. (b) Running times.

TABLE II. MEDIANS AND MEDIAN ABSOLUTE DEVIATIONS (MAD) FOR SOLUTION ACCURACY AND RUNNING TIME

In the third set of experiments, we compiled two phase transition plots. To prepare Fig. 6(a), we fix the signal length and vary the number of observed samples and the sparsity level. For each configuration, we randomly generate frequencies while maintaining the minimal frequency separation. The coefficients are generated with random magnitudes and random phases, and the entries are observed uniformly at random. We then run the SDPT3-SDP algorithm to recover the missing entries. The recovery is considered successful if the relative error falls below a small threshold. This process was repeated ten times and the rate of success was recorded. Fig. 6(a) shows the phase transition results. The x-axis indicates the fraction of observed entries, while the y-axis

Fig. 6. Phase transition: The frequencies were generated randomly with the stated minimal separation. Both signs and magnitudes of the coefficients are random. The separation used in Fig. 6(b) is larger than that used in Fig. 6(a). (a) Phase transition at the smaller separation. (b) Phase transition at the larger separation.

is the sparsity level. The color represents the rate of success, with red corresponding to perfect recovery and blue corresponding to complete failure.

We also plot a reference line. Since a signal of a given number of frequencies has a proportional number of degrees of freedom, including frequency locations and magnitudes, this line serves as the boundary above which any algorithm should have a chance to fail. In particular, Prony's method requires a number of consecutive samples equal to twice the number of frequencies in order to recover the frequencies and the magnitudes.

recovery to complete failure. However, the transition boundaryis not very sharp. In particular, we notice failures below theboundary of the transition where complete success shouldhappen. Examination of the unsuccessful instances show thatthey correspond to instances with minimal frequency separa-tions marginally exceeding . We expect to get cleaner phasetransitions if the frequency separation is increased.To prepare Fig. 6(b), we repeated the same process in

preparing Fig. 6(a), except that the frequency separation was increased. In addition, to respect the minimal separation, we reduced the range of possible sparsity levels. We now see a much sharper phase transition. The boundary is actually very close to the reference line. When the fraction of observed entries is close to 1, we even observe successful recovery above the line.
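For completeness, the performance profile used in the second set of experiments is straightforward to compute. The implementation and the toy metric values below are our own (hypothetical) illustration of the standard definition.

```python
# Our implementation of the performance profile used in this section:
# P_a(beta) = fraction of experiments where algorithm a's metric is within
# a factor beta of the best algorithm on that experiment (smaller = better).
def performance_profile(metrics, beta):
    # metrics: dict mapping algorithm name -> list of per-experiment values
    algs = list(metrics)
    n_exp = len(next(iter(metrics.values())))
    best = [min(metrics[a][e] for a in algs) for e in range(n_exp)]
    return {a: sum(metrics[a][e] <= beta * best[e] for e in range(n_exp)) / n_exp
            for a in algs}

# Hypothetical running times for two solvers over four experiments
metrics = {"SDP": [1.0, 2.0, 4.0, 1.0], "BP": [2.0, 1.0, 1.0, 8.0]}
print(performance_profile(metrics, 1.0))   # {'SDP': 0.5, 'BP': 0.5}
print(performance_profile(metrics, 4.0))   # {'SDP': 1.0, 'BP': 0.75}
```

Plotting $P_a(\beta)$ against $\beta$ (on a log scale, as in Fig. 5) shows at a glance both how often each algorithm is best ($\beta = 1$) and how far it can lag behind the best.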

VI. CONCLUSION AND FUTURE WORK

By leveraging the framework of atomic norm minimization, we were able to solve the problem of compressed sensing of line


spectra. For signals with well-separated frequencies, we show the number of samples needed is roughly proportional to the number of frequencies, up to polylogarithmic factors. This recovery is possible even though our continuous dictionary is not incoherent at all and does not satisfy any sort of restricted isometry conditions.

There are several interesting future directions to be explored

to further expand the scope of this work. First, it would be useful to understand what happens in the presence of noise. We cannot expect exact support recovery in this case, as our dictionary is continuous and any noise will make the exact frequencies unidentifiable. In a similar vein, techniques like those used in [24] that still rely on discretization are not applicable to our current setting. However, since our numerical method is rather stable, we are hopeful that a theoretical stability result is possible.

We show a simple example which provides some evidence for

conjecturing that our proposed technique is stable. The signal was generated with random frequencies, fading amplitudes, and random phases. A total of 18 time samples were chosen uniformly at random. The noisy observation was generated by adding complex noise with bounded norm to the clean samples. We denoised and recovered the signal by solving the following optimization:

(VI.1)

which can be solved by a semidefinite program. We observein Fig. 7 that the frequencies were approximately recovered bythe optimization problem (VI.1) in presence of noise. We leavethe verification of stability with more extensive numerical sim-ulations and theoretical analysis of this phenomenon to futurework.Second, we saw in our numerical experiments that modest

discretization introduces substantial error in signal reconstruction and fine discretization carries significant computational burdens. In this regard, it would be of great interest to speed up our semidefinite programming solvers so that we can scale our algorithms beyond the synthetic experiments of this paper. Our rudimentary experimentation with the first-order methods developed in [31] did not suffice for this problem, as they were unable to achieve the precision necessary for fine frequency localization. So, instead, it would be of interest to explore second-order alternatives, such as active set methods, to speed up our computations.

Finally, we are interested in exploring the class of signals that are semidefinite characterizable in hopes of understanding which signals can be exactly recovered. Our continuous frequency model is an instance of applying compressed sensing to problems with continuous dictionaries. It would be of great interest to see how our techniques may be extended to other continuously parametrized dictionaries. Models involving image manifolds may fall into this category [54]. Fully exploring the space of signals that can be acquired with just a few specially coded samples provides a fruitful and exciting program of future work.
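The semidefinite form of (VI.1) alluded to above was lost in extraction. A plausible sketch, using the atomic-norm characterization (II.6), with Toep(u) the Hermitian Toeplitz matrix generated by u, observed index set J, data y, and an assumed noise bound delta (these symbol names are assumptions, not the paper's exact display):

```latex
\begin{aligned}
\min_{\tilde{x},\,u,\,t}\quad & \tfrac{1}{2}\,t \;+\; \tfrac{1}{2|J|}\operatorname{tr}\bigl(\operatorname{Toep}(u)\bigr) \\
\text{s.t.}\quad & \begin{bmatrix} \operatorname{Toep}(u) & \tilde{x} \\ \tilde{x}^{*} & t \end{bmatrix} \succeq 0, \qquad
\bigl\| \tilde{x}_{J} - y_{J} \bigr\|_{2} \le \delta .
\end{aligned}
```

Minimizing the atomic norm subject to the data-fidelity constraint denoises the observed samples while promoting a sparse mixture of sinusoids.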

Fig. 7. Noisy frequency recovery: (a) real part of true, noisy, and recovered signals; (b) true frequencies (blue) and recovered frequencies (red). The (randomly generated) true signal has three frequencies at with coefficients. The (real parts of) noisy observations at out of indices are indicated by black circles. The noise components were generated according to an i.i.d. complex Gaussian distribution and then scaled to have a norm of 2. The observed signal has a norm of 5.9453.

APPENDIX A
PROOF OF THEOREM I.1

Proof: Assume with and or 4. Suppose the signal has decomposition

...

...

...

The rest of the proof argues that the dual polynomial constructed for the symmetric case can be modified to certify the optimality of for the general case.


If the coefficients have uniform random complex signs, for fixed , also have uniform random complex signs. In addition, the Bernoulli observation model on index set naturally induces a Bernoulli observation model on with . Denote

. Therefore, if and

(A.1)

according to the proof of Theorem II.3, with probability greater than , we could construct a dual polynomial

satisfying

Now define

otherwise

and

Clearly, the polynomial satisfies

where . The theorem then follows from rewriting (A.1) in terms of and Proposition II.4.

APPENDIX B
PROOF OF PROPOSITION II.5

We need the following lemma about maximal rank spectral factorization.

Lemma B.1: Suppose with is a nonnegative trigonometric polynomial, which has zeros on the unit circle. Then, there exists a positive semidefinite matrix with rank such that

where

...

Furthermore, this is the representation with maximal possible rank.

Proof: According to the positive real lemma [30, Th. 2.5], the nonnegativity of

is equivalent to

(B.1)

In this case, we can write

Such a representation is called a Gram matrix representation. The set of all Gram matrices satisfying (B.1) is a compact convex set. By the spectral factorization theorem [30, Th. 1.1], all of the rank one elements of are of the form such that

with the scalar factor determined by . Here, are the zeros of , as a function defined on the complex plane, on the unit circle with multiplicity two. The rest zeros of are divided into conjugate pairs . Choosing as either or yields different . Therefore, there are total such rank-one representations of .

Since has zeros on , we have for any , , implying . Therefore, the null space of has dimension at least , and hence the rank of any Gram matrix is at most . In the following, we construct a maximal rank representation of . For that purpose, define for via

and


We first claim that are linearly independent. Suppose there exist constants such that

Postmultiplying both sides of the above equation by for each , we obtain

implying for . Since , must also be zero. Therefore, the set is linearly independent. Now form the matrix

where , and . Then, the linear independence of implies that is of rank , and
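The Gram matrix identity at the heart of Lemma B.1 can be sanity-checked numerically. The sketch below is illustrative only, using a hypothetical degree-one polynomial (not one from the paper): it forms a rank-one Gram matrix Q = v v* and verifies that a(f)* Q a(f) reproduces the nonnegative trigonometric polynomial |p(e^{2 pi i f})|^2.

```python
import numpy as np

# Hypothetical example (not from the paper): p(z) = 1 - 0.5 z, so
# q(f) = |p(e^{2 pi i f})|^2 is a nonnegative trigonometric polynomial.
v = np.array([1.0, -0.5], dtype=complex)  # coefficients of p
Q = np.outer(v, v.conj())                 # rank-one Gram matrix Q = v v^*

def a(f):
    # Vector of complex exponentials (1, e^{2 pi i f}).
    return np.exp(2j * np.pi * f * np.arange(2))

for f in np.linspace(0.0, 1.0, 7):
    gram_val = (a(f).conj() @ Q @ a(f)).real
    direct_val = abs(1 - 0.5 * np.exp(2j * np.pi * f)) ** 2
    assert abs(gram_val - direct_val) < 1e-12  # a(f)^* Q a(f) = q(f)
    assert gram_val >= 0.0                     # nonnegativity on the circle

assert np.linalg.matrix_rank(Q) == 1           # a rank-one representation
```

Here the polynomial has no zeros on the unit circle, so higher-rank Gram representations also exist, in line with the maximal-rank construction above.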

Proof of Proposition II.5: We prove for the case . Note that our semidefinite program (II.6) and its dual have interior feasible points, a common assumption underlying many results about semidefinite programs, including the ones cited in the following arguments. For example, to obtain an interior feasible point for (II.6), one could take on and zero elsewhere, the Toeplitz matrix equal to , and .

1) For the dual optimal solution given in the proposition, define

The trigonometric polynomial is positive with zeros at since

if
otherwise.

Now, according to Lemma B.1, there exists with such that

Define , which satisfies . In other words, we have a pair such that

and . Therefore, is a dual optimal solution. The unique primal optimal solution has the form (refer to Proposition II.4 for a proof of the uniqueness):

where . Since the following decomposition holds:

the rank of is , and hence is a strictly complementary pair (see [55] for more information about strict complementarity). According to [56, Lemma 3.1], all matrices in the relative interior of the optimal solution set of a semidefinite program share the same range space. Therefore, for any other in the relative interior of , the rank of is also , implying that also form strictly complementary pairs with the primal optimal solution .

2) Since for the in part 1)

the null space of is

(B.2)


Again using [56, Lemma 3.1], we conclude that the null space of any with in the relative interior of is given by (B.2). Therefore, we have for each :

(B.3)

or equivalently

using . For any , if , then a calculation similar to (B.3) concludes that

with is in the null space of . Due to the linear independence of with , the null space would have dimension greater than , contradicting . Therefore, we have

3) Due to the existence of strictly complementary primal-dual optimal solutions, the statement is a direct consequence of [57, Lemma 3.4] (see also [58] and [59]).

APPENDIX C
PROOF OF PROPOSITION IV.1

Proof: Under the assumption that , we cite the results of [24, Proof of Lemma 2.2] as follows:

where is the matrix infinity norm, namely, the maximum absolute row sum. Since is symmetric and has zero diagonal entries, the Geršgorin circle theorem [60] implies that

As a consequence, is invertible and
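The Geršgorin argument above can be illustrated with a small numerical sketch. The matrix E below is hypothetical, chosen only so that the assumed infinity-norm bound holds: symmetric with zero diagonal and maximum absolute row sum below 1, so every eigenvalue of I - E lies in a disc of radius at most that row sum around 1, making I - E invertible with a bounded inverse.

```python
import numpy as np

# Hypothetical symmetric matrix with zero diagonal and ||E||_inf < 1.
E = np.array([[0.0,  0.2, -0.1],
              [0.2,  0.0,  0.3],
              [-0.1, 0.3,  0.0]])

inf_norm = np.abs(E).sum(axis=1).max()      # maximum absolute row sum
assert inf_norm < 1

M = np.eye(3) - E
eigs = np.linalg.eigvalsh(M)                # symmetric, so real eigenvalues
# Gershgorin: all eigenvalues of M lie in [1 - ||E||_inf, 1 + ||E||_inf].
assert eigs.min() > 1 - inf_norm - 1e-12
# Hence M is invertible with ||M^{-1}||_2 <= 1 / (1 - ||E||_inf).
assert np.linalg.norm(np.linalg.inv(M), 2) <= 1 / (1 - inf_norm) + 1e-12
```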

APPENDIX D
PROOF OF LEMMA IV.4

1) Proof of Lemma IV.4: We start by computing the quantities necessary to apply Lemma IV.3

Here, we have used

We continue with bounding :

To further bound , we note

which leads to

Therefore, we have


Invoking the noncommutative Bernstein inequality and setting , we have

if

Consequently, when according to (IV.9), we have , confirming the invertibility of .
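The invertibility phenomenon established here can be illustrated by Monte Carlo. The parameters below are hypothetical, not the paper's: a Bernoulli-subsampled, rescaled Gram matrix of well-separated complex sinusoids concentrates around the full-data Gram matrix, which is close to the identity, so it remains invertible.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 256                                        # hypothetical number of time samples
freqs = np.array([0.1, 0.35, 0.62, 0.9])       # well-separated frequencies
V = np.exp(2j * np.pi * np.outer(np.arange(n), freqs))  # n x 4 sinusoid matrix

# Full-data Gram matrix: approximately the identity thanks to the separation.
G_full = V.conj().T @ V / n

# Bernoulli(1/2) subsampling; the rescaled subsampled Gram matrix stays close.
mask = rng.random(n) < 0.5
G_sub = V[mask].conj().T @ V[mask] / mask.sum()

assert np.linalg.norm(G_sub - G_full, 2) < 0.6  # concentration (loose bound)
assert np.linalg.eigvalsh(G_sub).min() > 0.3    # invertible, well-conditioned
```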

APPENDIX E
PROOF OF COROLLARY IV.5

Assuming is invertible and , we have the following two inequalities:

which are rearrangements of

Therefore, we establish that when on the set :

Since the operator norm of a matrix dominates that of all submatrices, this completes the proof.
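The final step relies on the fact that the operator norm of any submatrix is bounded by that of the full matrix (a submatrix is the full matrix pre- and post-multiplied by norm-one selection matrices). A minimal numerical check, with a hypothetical random matrix and arbitrary index sets:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))             # hypothetical example matrix

rows, cols = [0, 2, 5], [1, 3]              # arbitrary row/column selections
sub = A[np.ix_(rows, cols)]                 # the corresponding submatrix

# Spectral norm of the submatrix never exceeds that of the full matrix.
assert np.linalg.norm(sub, 2) <= np.linalg.norm(A, 2)
```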

APPENDIX F
PROOF OF LEMMA IV.6

The proof uses Talagrand's concentration of measure inequality.

Lemma F.1 ([61, Corollary 7.8]): Let be a finite sequence of independent random variables taking values in a Banach space, and let be defined as

for a countable family of real-valued functions . Assume that and for all and every . Then, for

all

where

and is a numerical constant.

Proof of Lemma IV.6: Based on the definition of in (IV.31) and in (IV.14), we explicitly write as

where is defined in (IV.28) and we have defined as

It is clear that are independent random vectors with zero mean. Define

and

To compute the quantities necessary to apply Lemma F.1, we will extensively use the following elementary bounds:

and


First, we obtain an upper bound on :

The expected value of is upper bounded as follows:

(F.1)

Observe that . We apply Jensen's inequality and combine with (F.1) to obtain

Next, we upper bound :

implying

where is a matrix in whose th column is . Note that

Therefore, we have

In conclusion, Lemma F.1 shows that

Suppose . Then, define

and fix . Then, it follows that

for some provided . The same is true if and if we define

. Therefore, let

and fix obeying

otherwise

Then, we have

for some . Application of the union bound proves the lemma.

APPENDIX G
PROOF OF LEMMA IV.8

The proof of Lemma IV.8 is based on Hoeffding's inequality, presented below.

Lemma G.1 (Hoeffding's Inequality): Let the components of be sampled i.i.d. from a symmetric distribution on the complex unit circle, , and let be a positive real number. Then,
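The displayed tail bound was lost in extraction, but its qualitative content, that the inner product of a fixed vector with i.i.d. symmetric unit-circle signs concentrates at the scale of the vector's Euclidean norm, can be sanity-checked by Monte Carlo. The vector u and all parameters below are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 64, 5000
u = rng.standard_normal(n) + 1j * rng.standard_normal(n)  # fixed vector

# i.i.d. symmetric entries on the complex unit circle (uniform phases).
phases = rng.uniform(0, 2 * np.pi, size=(trials, n))
eps = np.exp(1j * phases)
inner = eps @ u                              # one inner product per trial

# E|<u, eps>|^2 = ||u||_2^2, so the empirical second moment is close to it.
ratio = np.mean(np.abs(inner) ** 2) / np.linalg.norm(u) ** 2
assert 0.8 < ratio < 1.2

# Deviations of a few multiples of ||u||_2 are rare, as a Hoeffding-type
# sub-Gaussian tail predicts.
t = 3.0
freq = np.mean(np.abs(inner) >= t * np.linalg.norm(u))
assert freq < 0.05
```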

Proof of Lemma IV.8: Consider the random inner product , where are i.i.d. symmetric random variables with values on the complex unit circle. Conditioned on a particular realization , where

Hoeffding's inequality and the union bound then imply

An elementary probability calculation shows

Setting

in and applying Lemma IV.7 yields

For the second term to be less than , we choose such that

and assume this value from now on. The first term is less than if

(G.1)

First assume that . The condition in Lemma IV.6 is or equivalently

(G.2)

In this case, we have , leading to

Now suppose that . If , then , which again gives the above lower bound on .

On the other hand, if , then and

Therefore, to verify (G.1), it suffices to take obeying (G.2) and

This analysis shows that the first term is less than if

According to Lemma IV.4, the last term is less than if

Setting , combining all lower bounds on together, and absorbing all constants into one, we obtain

is sufficient to guarantee

with probability at least . The union bound then proves the lemma.

APPENDIX H
PROOF OF LEMMA IV.9

1) Proof of Lemma IV.9: Recall that

On the set defined in (IV.29), we established in Corollary IV.5 that

We use the norm to bound the norm of :

To get a uniform bound on , we need the following bound:


for suitably chosen numerical constants and . The bound over the region is a consequence of the more accurate bound established in [24, Lemma 2.6], while the uniform bound can be obtained by checking the expression of . Consequently, we have

We conclude that on the set

Again, application of Hoeffding's inequality and the union bound gives

To make the first term less than , it suffices to take

To have the second term less than , we require

Another application of the union bound with respect to proves the lemma.

ACKNOWLEDGMENT

The authors would like to thank R. Nowak for many helpful discussions about this work.

REFERENCES

[1] E. J. Candès, J. Romberg, and T. Tao, "Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information," IEEE Trans. Inf. Theory, vol. 52, no. 2, pp. 489–509, Feb. 2006.
[2] E. J. Candès and M. Wakin, "An introduction to compressive sampling," IEEE Signal Process. Mag., vol. 25, no. 2, pp. 21–30, Mar. 2008.
[3] D. L. Donoho, "Compressed sensing," IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289–1306, Apr. 2006.
[4] R. Baraniuk, "Compressive sensing [lecture notes]," IEEE Signal Process. Mag., vol. 24, no. 4, pp. 118–121, Jul. 2007.
[5] P. Stoica and R. L. Moses, Spectral Analysis of Signals, 1st ed. Upper Saddle River, NJ, USA: Prentice-Hall, 2005.
[6] C. Ekanadham, D. Tranchina, and E. P. Simoncelli, "Neural spike identification with continuous basis pursuit," presented at the Comput. Syst. Neurosci., Salt Lake City, UT, USA, Feb. 2011.
[7] C. E. Parrish and R. D. Nowak, "Improved approach to lidar airport obstruction surveying using full-waveform data," J. Surv. Eng., vol. 135, no. 2, pp. 72–82, May 2009.
[8] D. Malioutov, M. Cetin, and A. S. Willsky, "A sparse signal reconstruction perspective for source localization with sensor arrays," IEEE Trans. Signal Process., vol. 53, no. 8, pp. 3010–3022, Aug. 2005.
[9] M. A. Herman and T. Strohmer, "High-resolution radar via compressed sensing," IEEE Trans. Signal Process., vol. 57, no. 6, pp. 2275–2284, Jun. 2009.
[10] A. C. Fannjiang, T. Strohmer, and P. Yan, "Compressed remote sensing of sparse objects," SIAM J. Imag. Sci., vol. 3, no. 3, pp. 595–618, Jan. 2010.
[11] R. Baraniuk and P. Steeghs, "Compressive radar imaging," in Proc. IEEE Radar Conf., Waltham, MA, USA, Apr. 2007, pp. 128–133.
[12] W. U. Bajwa, J. Haupt, A. M. Sayeed, and R. Nowak, "Compressed channel sensing: A new approach to estimating sparse multipath channels," Proc. IEEE, vol. 98, no. 6, pp. 1058–1076, Jun. 2010.
[13] M. F. Duarte and R. Baraniuk, "Spectral compressive sensing," Jul. 2011 [Online]. Available: dsp.rice.edu
[14] H. Rauhut, "Random sampling of sparse trigonometric polynomials," Appl. Comput. Harmon. Anal., vol. 22, no. 1, pp. 16–42, Jan. 2007.
[15] P. Stoica and P. Babu, "SPICE and LIKES: Two hyperparameter-free methods for sparse-parameter estimation," Signal Process., vol. 92, no. 7, pp. 1580–1590, 2012.
[16] P. Stoica, P. Babu, and J. Li, "New method of sparse parameter estimation in separable models and its use for spectral analysis of irregularly sampled data," IEEE Trans. Signal Process., vol. 59, no. 1, pp. 35–47, Jan. 2011.
[17] Y. Chi, L. L. Scharf, A. Pezeshki, and A. R. Calderbank, "Sensitivity to basis mismatch in compressed sensing," IEEE Trans. Signal Process., vol. 59, no. 5, pp. 2182–2195, May 2011.
[18] M. A. Herman and T. Strohmer, "General deviants: An analysis of perturbations in compressed sensing," IEEE J. Sel. Top. Signal Process., vol. 4, no. 2, pp. 342–349, Apr. 2010.
[19] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, "The convex geometry of linear inverse problems," Found. Comput. Math., vol. 12, pp. 805–849, Dec. 2012.
[20] B. Recht, M. Fazel, and P. A. Parrilo, "Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization," SIAM Rev., vol. 52, no. 3, pp. 471–501, Jan. 2010.
[21] B. G. R. de Prony, "Essai expérimental et analytique: Sur les lois de la dilatabilité de fluides élastique et sur celles de la force expansive de la vapeur de l'alkool, à différentes températures," J. de l'École Polytechnique, vol. 1, no. 22, pp. 24–76, 1795.
[22] Y. Hua and T. K. Sarkar, "Matrix pencil method for estimating parameters of exponentially damped/undamped sinusoids in noise," IEEE Trans. Acoust., Speech, Signal Process., vol. 38, no. 5, pp. 814–824, May 1990.
[23] P. Stoica, "List of references on spectral line analysis," Signal Process., vol. 31, no. 3, pp. 329–340, Apr. 1993.
[24] E. J. Candès and C. Fernandez-Granda, "Towards a mathematical theory of super-resolution," Mar. 2012, vol. cs.IT [Online]. Available: arXiv.org
[25] R. Sanyal, F. Sottile, and B. Sturmfels, "Orbitopes," Mathematika, vol. 57, pp. 275–314, 2011.
[26] C. Carathéodory, "Über den Variabilitätsbereich der Fourier'schen Konstanten von positiven harmonischen Funktionen," Rendiconti del Circolo Matematico di Palermo (1884–1940), vol. 32, no. 1, pp. 193–217, 1911.
[27] C. Carathéodory and L. Fejér, "Über den Zusammenhang der Extremen von harmonischen Funktionen mit ihren Koeffizienten und über den Picard-Landau'schen Satz," Rendiconti del Circolo Matematico di Palermo (1884–1940), vol. 32, no. 1, pp. 218–239, 1911.
[28] O. Toeplitz, "Zur Theorie der quadratischen und bilinearen Formen von unendlichvielen Veränderlichen," Math. Ann., vol. 70, no. 3, pp. 351–376, 1911.


[29] A. Megretski, "Positivity of trigonometric polynomials," in Proc. 42nd IEEE Conf. Decision Control, Maui, HI, USA, Dec. 2003, vol. 4, pp. 3814–3817.
[30] B. Dumitrescu, Positive Trigonometric Polynomials and Signal Processing Applications. New York, NY, USA: Springer-Verlag, Feb. 2007.
[31] B. N. Bhaskar, G. Tang, and B. Recht, "Atomic norm denoising with applications to line spectral estimation," Apr. 2012, vol. cs.IT [Online]. Available: arXiv.org
[32] T. Roh and L. Vandenberghe, "Discrete transforms, semidefinite programming, and sum-of-squares representations of nonnegative polynomials," SIAM J. Optim., vol. 16, no. 4, pp. 939–964, Jan. 2006.
[33] P. Stoica and P. Babu, "Sparse estimation of spectral lines: Grid selection problems and their solutions," IEEE Trans. Signal Process., vol. 60, no. 2, pp. 962–967, Feb. 2012.
[34] S. P. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, Mar. 2004.
[35] E. J. Candès and B. Recht, "Exact matrix completion via convex optimization," Found. Comput. Math., vol. 9, no. 6, pp. 717–772, Dec. 2009.
[36] D. Gross, "Recovering low-rank matrices from few coefficients in any basis," IEEE Trans. Inf. Theory, vol. 57, no. 3, pp. 1548–1566, Mar. 2011.
[37] B. Recht, "A simpler approach to matrix completion," J. Mach. Learn. Res., vol. 12, pp. 3413–3430, Jan. 2011.
[38] A. Javanmard and A. Montanari, "Localization from incomplete noisy distance measurements," in Proc. IEEE Int. Symp. Inf. Theory, St. Petersburg, Russia, Jul. 2011, pp. 1584–1588.
[39] D. Gross, Y. K. Liu, S. T. Flammia, S. Becker, and J. Eisert, "Quantum state tomography via compressed sensing," Phys. Rev. Lett., vol. 105, no. 15, p. 150401, Oct. 2010.
[40] D. L. Donoho and J. Tanner, "Sparse nonnegative solution of underdetermined linear equations by linear programming," in Proc. Nat. Acad. Sci. USA, 2005, vol. 102, no. 27, pp. 9446–9451.
[41] J.-J. Fuchs, "Sparsity and uniqueness for some specific under-determined linear systems," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2005, vol. 5, pp. v/729–v/732.
[42] T. Blu, P.-L. Dragotti, M. Vetterli, P. Marziliano, and L. Coulot, "Sparse sampling of signal innovations," IEEE Signal Process. Mag., vol. 25, no. 2, pp. 31–40, Mar. 2008.
[43] L. Carin, D. Liu, and B. Guo, "Coherence, compressive sensing, and random sensor arrays," IEEE Antennas Propag. Mag., vol. 53, no. 4, pp. 28–39, Aug. 2011.
[44] E. R. Dowski, C. A. Whitmore, and S. K. Avery, "Estimation of randomly sampled sinusoids in additive noise," IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 12, pp. 1906–1908, Dec. 1988.
[45] P. Stoica, J. Li, and H. He, "Spectral analysis of nonuniformly sampled data: A new approach versus the periodogram," IEEE Trans. Signal Process., vol. 57, no. 3, pp. 843–858, Mar. 2009.
[46] Y. Wang, J. Li, and P. Stoica, Spectral Analysis of Signals: The Missing Data Case, 1st ed. San Rafael, CA, USA: Morgan & Claypool Publishers, Jul. 2005.
[47] M. Shaghaghi and S. A. Vorobyov, "Spectral estimation from undersampled data: Correlogram and model-based least squares," Feb. 2012, vol. math.ST [Online]. Available: arXiv.org
[48] E. J. Candès and J. Romberg, "Sparsity and incoherence in compressive sampling," Inverse Problems, vol. 23, no. 3, pp. 969–985, Apr. 2007.
[49] J. A. Tropp, "User-friendly tail bounds for sums of random matrices," Found. Comput. Math., vol. 12, no. 4, pp. 389–434, Aug. 2011.
[50] A. C. Schaeffer, "Inequalities of A. Markoff and S. Bernstein for polynomials and related functions," Bull. Amer. Math. Soc., vol. 47, pp. 565–579, Nov. 1941.
[51] K. C. Toh, M. J. Todd, and R. H. Tütüncü, "SDPT3: A MATLAB software package for semidefinite-quadratic-linear programming," [Online]. Available: http://www.math.nus.edu.sg/~mattohkc/sdpt3.html
[52] M. Grant and S. Boyd, "CVX: Matlab software for disciplined convex programming, Version 1.21," Apr. 2011 [Online]. Available: http://cvxr.com/cvx
[53] E. D. Dolan and J. J. Moré, "Benchmarking optimization software with performance profiles," Math. Prog., A, vol. 91, no. 2, pp. 201–213, Mar. 2002.
[54] M. B. Wakin, "A manifold lifting algorithm for multi-view compressive imaging," in Proc. Picture Coding Symp., Chicago, IL, USA, May 2009, pp. 1–4.
[55] F. Alizadeh, J. Haeberly, and M. Overton, "Complementarity and nondegeneracy in semidefinite programming," Math. Program., vol. 77, no. 1, pp. 111–128, 1997.
[56] D. Goldfarb and K. Scheinberg, "Interior point trajectories in semidefinite programming," SIAM J. Optim., vol. 8, no. 4, pp. 871–886, 1998.
[57] Z. Luo, J. Sturm, and S. Zhang, "Superlinear convergence of a symmetric primal-dual path following algorithm for semidefinite programming," SIAM J. Optim., vol. 8, no. 1, pp. 59–81, 1998.
[58] E. de Klerk, C. Roos, and T. Terlaky, "Initialization in semidefinite programming via a self-dual skew-symmetric embedding," Oper. Res. Lett., vol. 20, no. 5, pp. 213–221, 1997.
[59] M. Halická, E. de Klerk, and C. Roos, "On the convergence of the central path in semidefinite optimization," SIAM J. Optim., vol. 12, no. 4, pp. 1090–1099, 2002.
[60] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge, U.K.: Cambridge Univ. Press, Feb. 1990.
[61] M. Ledoux, The Concentration of Measure Phenomenon. Providence, RI, USA: American Mathematical Society, 2001.

Gongguo Tang (S'09–M'11) received the Ph.D. degree in electrical engineering from Washington University in St. Louis in 2011. He is currently a Postdoctoral Research Associate at the Department of Electrical and Computer Engineering, University of Wisconsin-Madison. His research interests are in the area of signal processing, convex optimization, information theory and statistics, and their applications.

Badri Narayan Bhaskar is a student in Electrical Engineering at the University of Wisconsin-Madison.

Parikshit Shah (S'06–M'09) is currently a Member of Research Staff at Philips Research, New York. He was a research associate at the University of Wisconsin (affiliated with the Wisconsin Institutes for Discovery) in 2011–2012. He received his Ph.D. in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology in June 2011. Earlier, he received a Master's degree from Stanford and a Bachelor's degree from the Indian Institute of Technology, Bombay. His research interests lie in the areas of convex optimization, control theory, system identification, and statistics. He has held the School of Engineering Fellowship at Stanford University and a visiting appointment at the Institute for Pure and Applied Mathematics (IPAM) at UCLA.

Benjamin Recht (S'03–A'06) is an Assistant Professor at the University of Wisconsin-Madison and a member of the IEEE.