IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 59, NO. 11, NOVEMBER 2013 7465
Compressed Sensing Off the Grid
Gongguo Tang, Member, IEEE, Badri Narayan Bhaskar, Parikshit Shah, and Benjamin Recht, Associate Member, IEEE
Abstract—This paper investigates the problem of estimating the frequency components of a mixture of complex sinusoids from a random subset of n regularly spaced samples. Unlike previous work in compressed sensing, the frequencies are not assumed to lie on a grid, but can assume any values in the normalized frequency domain [0, 1]. An atomic norm minimization approach is proposed to exactly recover the unobserved samples and identify the unknown frequencies, which is then reformulated as an exact semidefinite program. Even with this continuous dictionary, it is shown that O(s log s log n) random samples are sufficient to guarantee exact frequency localization with high probability, provided the frequencies are well separated. Extensive numerical experiments are performed to illustrate the effectiveness of the proposed method.
Index Terms—Atomic norm, basis mismatch, compressed sensing, continuous dictionary, line spectral estimation, nuclear norm relaxation, Prony's method, sparsity.
I. INTRODUCTION
COMPRESSED sensing has demonstrated that data acquisition and compression can often be combined, dramatically reducing the time and space needed to acquire many signals of interest [1]–[4]. Despite the tremendous impact of compressed sensing on signal processing theory and practice, its development thus far has focused on signals with sparse representations in finite discrete dictionaries. However, signals encountered in applications such as radar, array processing, communication, seismology, and remote sensing are usually specified by parameters in a continuous domain [5]–[7]. In order to apply the theory of compressed sensing to such applications, researchers typically adopt a discretization procedure to reduce the continuous parameter space to a finite set of grid points [8]–[16]. While this simple strategy yields state-of-the-art performance for problems where the true parameters lie on the grid, discretization has several significant drawbacks: 1) In cases where the true parameters do not fall onto the finite grid, the
Manuscript received August 16, 2012; revised April 26, 2013; accepted July 10, 2013. Date of publication August 07, 2013; date of current version October 16, 2013. G. Tang was supported by DARPA under Grant N66001-11-1-4090. B. Bhaskar and P. Shah were supported by NSF Award CCF-1139953. B. Recht was supported by ONR Award N00014-11-1-0723, NSF Award CCF-1148243, a Sloan Research Fellowship, and NSF Award CCF-1139953. This paper was presented at the Proceedings of the 50th Annual Allerton Conference, Urbana, IL, USA, 2012.
G. Tang and B. N. Bhaskar are with the Department of Electrical and Computer Engineering, University of Wisconsin-Madison, Madison, WI 53715 USA (e-mail: [email protected]; [email protected]).
P. Shah and B. Recht are with the Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI 53715 USA (e-mail: [email protected]; [email protected]).
Communicated by M. Elad, Associate Editor for Signal Processing.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIT.2013.2277451
signal often cannot be sparsely represented by the discrete dictionary [13], [17], [18]. 2) It is difficult to characterize the performance of discretization using standard compressed sensing analyses, since the dictionary becomes very coherent as we increase the number of grid points. 3) Although finer grids may improve the reconstruction error in theory, very fine grids often lead to numerical instability issues.
We sidestep the issues arising from discretization by working directly on the continuous parameter space for estimating the continuous frequencies and amplitudes of a mixture of complex sinusoids from partially observed time samples. In particular, the frequencies are not assumed to lie on a grid, and can instead take arbitrary values across the bandwidth of the signal. With a time-frequency exchange, our model is exactly the same as the one in Candès et al.'s foundational work on compressed sensing [1], except that we do not assume the spikes to lie on an equispaced grid. This major difference presents a significant technical challenge: the resulting dictionary is no longer an orthonormal Fourier basis, but an infinite dictionary with continuously many atoms and arbitrarily high correlation between candidate atoms. We demonstrate that a sparse sum of complex sinusoids can be reconstructed exactly from a small sampling of its time samples provided the frequencies are sufficiently far apart from one another.
Our computational method and theoretical analysis are based upon the atomic norm induced by samples of complex exponentials [19]. Chandrasekaran et al. argue that the atomic norm is the best convex heuristic for underdetermined, structured linear inverse problems; it generalizes the ℓ1 norm for sparse recovery and the nuclear norm for low-rank matrix completion. The norm is a convex function and, in the case of complex exponentials, can be computed via semidefinite programming. We show how the atomic norm for moment sequences can be derived either from the perspective of sparse approximation or rank minimization [20], illuminating new ties between these related areas of study. Much as was the case in other problems where the atomic norm has been studied, we prove that atomic norm minimization achieves nearly optimal recovery bounds for reconstructing sums of sinusoids from incomplete data.
To be precise, we consider signals whose spectra consist of spike trains with unknown locations in the normalized interval [0, 1] (we identify 0 and 1). Rather than sampling the signal at all times in {0, 1, ..., n−1}, we sample the signal at a subset of times J = {j_1, ..., j_m}, with each j_ℓ ∈ {0, 1, ..., n−1}. Our main contribution is summarized by the following theorem.
Theorem I.1: Suppose we observe the signal

    x*_j = Σ_{k=1}^{s} c_k e^{i2π f_k j},  j ∈ J    (I.1)
0018-9448 © 2013 IEEE
with unknown frequencies f_1, ..., f_s ∈ [0, 1], on an index set J ⊂ {0, ..., n−1} of size m selected uniformly at random. Additionally, assume the signs c_k/|c_k| are drawn i.i.d. from the uniform distribution on the complex unit circle, and that

    Δ_f := min_{j≠k} d(f_j, f_k) ≥ 1/⌊(n−1)/4⌋,

where the distance d(f_j, f_k) is understood as the wrap-around distance on the unit circle. If n ≥ 257, then there exists a numerical constant C such that

    m ≥ C max{ log²(n/δ), s log(s/δ) log(n/δ) }

is sufficient to guarantee that we can recover x* and localize the frequencies via a semidefinite program with probability at least 1 − δ with respect to the random samples and signs.
The frequencies may be identified directly using the dual solution of the atomic norm minimization problem we propose in this paper. Alternatively, once the missing entries are recovered exactly, the frequencies can be identified by Prony's method [21], a matrix pencil approach [22], or other linear prediction methods [23]. After identifying the frequencies, the coefficients c_k can be obtained by solving a linear system.
Remark I.2 (Resolution): An interesting artifact of using convex optimization methods is the necessity of a particular resolution condition on the spectrum of the underlying signal. For the signal to be recoverable via our methods using m random time samples from the set {0, ..., n−1}, the spikes in the spectrum need to be separated by roughly 4/n. In contrast, if one chose to acquire m consecutive samples from this set (equispaced sampling), the required minimum separation would be 4/m; this sampling regime was studied by Candès and Fernandez-Granda [24]. Therefore, in some sense, the resolution is determined by the region over which we take the samples, either in full or in a uniform random manner. We comment that the numerical simulations of Section V suggest that the critical separation is actually 1/n. We leave tightening our bounds by the extra constant of 4 to future work.
Remark I.3 (Random Signs): The randomness of the signs of the coefficients essentially assumes that the sinusoids have random phases. Such a model is practical in many spectrum sensing applications, as argued in [5, Ch. 4.1]. Our proof will reveal that the phases can obey any symmetric distribution on the unit circle, not simply the uniform distribution.
Remark I.4 (Band-Limited Signal Models): Note that any mixture of sinusoids with frequencies bandlimited to [−W, W], after appropriate normalization, can be assumed to have frequencies in [0, 1]. Consequently, a bandlimited signal of such a form leads to samples of the form (I.1). More precisely, suppose the frequencies w_1, ..., w_s lie in [−W, W], and x*(t) is a continuous signal of the form

    x*(t) = Σ_{k=1}^{s} c_k e^{i2π w_k t}.

By taking regularly spaced Nyquist samples at t ∈ {0, 1/(2W), ..., (n−1)/(2W)}, we observe

    x*(j/(2W)) = Σ_{k=1}^{s} c_k e^{i2π (w_k/(2W)) j},  j = 0, ..., n−1,

which is exactly the same as our model (I.1) after a trivial translation of the frequency domain. We emphasize that one does not need to actually acquire all of these Nyquist samples and then discard some of them to obtain the set J. In practice, one could predetermine the set J and sample only at the locations specified by J. This procedure allows one to achieve compression at the sensing stage, the central innovation behind the theory of compressed sensing.
Remark I.5 (Basis Mismatch): Finally, we note that our result completely obviates the basis mismatch conundrum [13], [17], [18] of discretization methods, where the frequencies might well fall off the grid. Since our continuous dictionary is globally coherent, Theorem I.1 shows that the global coherence of the frame is not an obstacle to recovery. What matters more is the local coherence between the atoms composing the true signal, as characterized by the separation between the frequencies.
This paper is organized as follows. First, we specify our reconstruction algorithm as the solution to an atomic norm minimization problem in Section II. We show that this convex optimization problem can be exactly reformulated as a semidefinite program, and that our methodology is thus computationally tractable. We outline connections to prior art and the foundations that we build upon in Section III. We then proceed to develop the proofs in Section IV. Our proof requires the construction of an explicit certificate that satisfies certain interpolation conditions. The production of this certificate requires us to consider certain random polynomial kernels, and to derive concentration inequalities for these kernels that may be of independent interest to the reader. In Section V, we validate our theory by extensive numerical experiments, confirming that random undersampling as a means of compression, coupled with atomic norm minimization as a means of recovery, is a viable, superior alternative to discretization techniques.
II. ATOMIC NORM AND SEMIDEFINITE CHARACTERIZATIONS
Our signal model is a positive combination of complex sinusoids with arbitrary phases. As motivated in [19], a natural regularizer that encourages a sparse combination of such sinusoids is the atomic norm induced by these signals. Precisely, define atoms a(f, φ) ∈ C^{|J|}, f ∈ [0, 1], and φ ∈ [0, 2π) as

    [a(f, φ)]_j = e^{i(2π f j + φ)},  j ∈ J,

and rewrite the signal model (I.1) in vector form:

    x* = Σ_{k=1}^{s} |c_k| a(f_k, φ_k).    (II.1)
where J is an index set with values being either {0, ..., n−1} or {−2M, ..., 2M} for some positive integer M with n = 4M + 1, and φ_k is the phase of the complex number c_k. In the rest of the paper, we use T = {f_1, ..., f_s} to denote the unknown set of frequencies. In the representation (II.1), we could also choose to absorb the phase e^{iφ_k} into the coefficient as we did in (I.1). We will use both representations in the following, and explicitly specify that the coefficient is positive when the phase term is in the atom.
The set of atoms A = { a(f, φ) : f ∈ [0, 1], φ ∈ [0, 2π) } are building blocks of the signal x*, the same way that canonical basis vectors are building blocks for sparse signals, and unit-norm rank-one matrices are building blocks for low-rank matrices. In sparse recovery and matrix completion, the unit balls of the sparsity-enforcing norms, e.g., the ℓ1 norm and the nuclear norm, are exactly the convex hulls of their corresponding building blocks. In a similar spirit, we define an atomic norm ||·||_A by identifying its unit ball with the convex hull of A:

    ||x||_A = inf { t > 0 : x ∈ t conv(A) } = inf { Σ_k c_k : x = Σ_k c_k a(f_k, φ_k), c_k ≥ 0 }.

Roughly speaking, the atomic norm can enforce sparsity in A because low-dimensional faces of conv(A) correspond to signals involving only a few atoms. The idea of using atomic norms to enforce sparsity for a general set of atoms was first proposed and analyzed in [19].
When the phases are all 0, the set { a(f, 0) : f ∈ [0, 1] } is called the moment curve, which traces out a one-dimensional variety in C^n. It is well known that the convex hull of this curve is characterizable in terms of linear matrix inequalities, and membership in the convex hull can thus be computed in polynomial time (see [25] for a proof of this result and a discussion of many other algebraic varieties whose convex hulls are characterized by semidefinite programming). When the phases are allowed to range in [0, 2π), a similar semidefinite characterization holds.
Proposition II.1: For any x ∈ C^{|J|} with J = {0, ..., n−1} or J = {−2M, ..., 2M},

    ||x||_A = inf_{u, t} { (1/2) ( (1/n) Tr(Toep(u)) + t ) : [ Toep(u)  x ; x*  t ] ⪰ 0 }.    (II.2)
In the proposition, we used the superscript * to denote conjugate transpose and Toep(u) to denote the Toeplitz matrix whose first column is equal to u. The proof of this proposition relies on the following classical Vandermonde decomposition lemma for positive semidefinite Toeplitz matrices.
Lemma II.2 (Caratheodory–Toeplitz, [26]–[28]): Any positive semidefinite Toeplitz matrix Toep(u) can be represented as follows:

    Toep(u) = V D V*,

where

    V = [ a(f_1, 0)  a(f_2, 0)  ···  a(f_r, 0) ],  D = diag(d_1, ..., d_r),

the d_k are real positive numbers, and r = rank(Toep(u)).
The Vandermonde decomposition can be computed efficiently via root finding or by solving a generalized eigenvalue problem [22].
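As a concrete illustration, the following sketch computes the decomposition in the simplest rank-deficient case r = n − 1, where the null space of the Toeplitz matrix is one-dimensional and the roots of its associated polynomial give the frequencies (a Pisarenko-style construction; names and sizes are ours):

```python
import numpy as np

def atom(f, n):
    # a(f, 0)_j = e^{i 2 pi f j}, j = 0, ..., n-1
    return np.exp(2j * np.pi * f * np.arange(n))

def vandermonde_decompose(T):
    """Vandermonde decomposition of a PSD Toeplitz matrix of rank n - 1.

    The null-space vector q of T defines the polynomial q_0 + q_1 z + ...;
    its roots are e^{-i 2 pi f_k}. Amplitudes follow from a linear system
    on the first column of T.
    """
    n = T.shape[0]
    _, V = np.linalg.eigh(T)          # eigenvalues in ascending order
    q = V[:, 0]                       # eigenvector of the (near-)zero eigenvalue
    roots = np.roots(q[::-1])         # np.roots expects highest-degree coefficient first
    f = np.sort(np.mod(-np.angle(roots) / (2 * np.pi), 1.0))
    A = np.column_stack([atom(fk, n) for fk in f])
    d = np.linalg.lstsq(A, T[:, 0], rcond=None)[0].real
    return f, d

# build a rank-3 PSD Toeplitz matrix from three atoms (n = 4)
f_true = np.array([0.15, 0.5, 0.8])
d_true = np.array([1.0, 2.0, 0.5])
n = 4
T = sum(dk * np.outer(atom(fk, n), atom(fk, n).conj())
        for fk, dk in zip(f_true, d_true))
f_hat, d_hat = vandermonde_decompose(T)
```

For general rank r < n − 1, one would instead use a matrix pencil or ESPRIT-type method, as in [22].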
Proof of Proposition II.1: We prove the case J = {0, ..., n−1}. The other case can be proved in a similar manner. Denote the value of the right-hand side of (II.2) by SDP(x). Suppose x = Σ_k c_k a(f_k, φ_k) with c_k > 0. Defining

    Toep(u) = Σ_k c_k a(f_k, 0) a(f_k, 0)*

and t = Σ_k c_k, we note that

    [ Toep(u)  x ; x*  t ] = Σ_k c_k [ a(f_k, φ_k) ; 1 ] [ a(f_k, φ_k) ; 1 ]*.

Therefore, the matrix

    [ Toep(u)  x ; x*  t ]    (II.3)

is positive semidefinite. Now, (1/n) Tr(Toep(u)) = Σ_k c_k and t = Σ_k c_k, so that (1/2)((1/n) Tr(Toep(u)) + t) = Σ_k c_k. Since this holds for any decomposition of x, we conclude that SDP(x) ≤ ||x||_A.
Conversely, suppose for some u and t,

    [ Toep(u)  x ; x*  t ] ⪰ 0.    (II.4)

In particular, Toep(u) ⪰ 0. Form a Vandermonde decomposition

    Toep(u) = V D V*

as promised by Lemma II.2. Since Tr(a(f_k, 0) a(f_k, 0)*) = n and Tr(Toep(u)) = Σ_k d_k Tr(a(f_k, 0) a(f_k, 0)*), we have (1/n) Tr(Toep(u)) = Σ_k d_k.
Using this Vandermonde decomposition and the matrix inequality (II.4), it follows that x is in the range of Toep(u), and hence

    x = V w = Σ_k w_k a(f_k, 0)

for some complex coefficient vector w. Finally, by the Schur Complement Lemma, we have

    Toep(u) ⪰ (1/t) x x*.

Let q be any vector such that V* q = sign(w), where sign(w)_k = w_k/|w_k|. Such a vector exists because V is full rank. Then,

    Σ_k d_k = q* V D V* q ≥ (1/t) |q* V w|² = (1/t) ( Σ_k |w_k| )²,

implying that t Σ_k d_k ≥ ( Σ_k |w_k| )². By the arithmetic–geometric mean inequality,

    (1/2) ( (1/n) Tr(Toep(u)) + t ) = (1/2) ( Σ_k d_k + t ) ≥ ( t Σ_k d_k )^{1/2} ≥ Σ_k |w_k| ≥ ||x||_A,
implying that SDP(x) ≥ ||x||_A, since the previous chain of inequalities holds for any choice of u and t that are feasible.
There are several other approaches to prove the semidefinite programming characterization of the atomic norm. As we will see below, the dual norm of the atomic norm is related to the maximum modulus of trigonometric polynomials [see equation (II.7)]. Thus, proofs based on Bochner's Theorem [29], the bounded real lemma [30], [31], or spectral factorization [32] would also provide a tight characterization. It is interesting that the SDP for the continuous case is not very different from an SDP derived for basis pursuit denoising on a grid in [15], [16], and [33]. The work [33] also provides a heuristic to deal with off-grid frequencies, though there is no theory to certify exact recovery.
A. Atomic Norm Minimization for Continuous Compressed Sensing
Recall that we observe the entries of x* only on a subset J of {0, ..., n−1}. As prescribed in [19], a natural algorithm for estimating the missing samples of a sparse sum of complex exponentials is the atomic norm minimization problem

    minimize_x ||x||_A  subject to  x_j = x*_j,  j ∈ J,    (II.5)

or, equivalently, the semidefinite program

    minimize_{x, u, t} (1/2) ( (1/n) Tr(Toep(u)) + t )
    subject to  [ Toep(u)  x ; x*  t ] ⪰ 0,  x_j = x*_j,  j ∈ J.    (II.6)
The main result of this paper is that this semidefinite program almost always recovers the missing samples and identifies the frequencies, provided the number of measurements is large enough and the frequencies are reasonably well separated. We formalize this statement for the case J ⊂ {−2M, ..., 2M} in the following theorem.
Theorem II.3: Suppose we observe the time samples of

    x*_j = Σ_{k=1}^{s} c_k e^{i2π f_k j}

on the index set J ⊂ {−2M, ..., 2M} of size m selected uniformly at random. Additionally, assume the signs c_k/|c_k| are drawn i.i.d. from a symmetric distribution on the complex unit circle. If Δ_f ≥ 1/M, then there exists a numerical constant C such that

    m ≥ C max{ log²(n/δ), s log(s/δ) log(n/δ) }

is sufficient to guarantee that, with probability at least 1 − δ, x* is the unique optimizer to (II.5).
We prove this theorem in Section IV. Note that Theorem I.1 is a corollary of Theorem II.3 via a simple reformulation. We provide a proof of the equivalence in Appendix A.
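The observation model of the theorem is easy to instantiate. The sketch below (using the {0, ..., n−1} indexing of (I.1)) draws frequencies with wrap-around separation at least 4/n by rejection sampling, attaches random phases, and observes a uniformly random subset of samples; all sizes are ours:

```python
import numpy as np

rng = np.random.default_rng(1)
n, s, m = 64, 4, 24          # window length, number of sinusoids, samples kept

def min_separation(f):
    """Minimum pairwise wrap-around distance between frequencies in [0, 1)."""
    g = np.sort(f)
    gaps = np.diff(np.concatenate([g, [g[0] + 1.0]]))   # includes the wrap gap
    return gaps.min()

# rejection-sample frequencies until they are separated by at least 4/n
while True:
    f = rng.uniform(0.0, 1.0, s)
    if min_separation(f) >= 4.0 / n:
        break

# magnitudes with i.i.d. uniform random phases
c = rng.uniform(0.5, 2.0, s) * np.exp(2j * np.pi * rng.uniform(0.0, 1.0, s))
x_full = np.exp(2j * np.pi * np.outer(np.arange(n), f)) @ c    # samples (I.1)
J = np.sort(rng.choice(n, size=m, replace=False))              # observed index set
x_obs = x_full[J]
```

The pair (J, x_obs) is exactly the input to the semidefinite program (II.6).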
B. Duality and Frequency Localization
To every norm, there is an associated dual norm, and the dual of the atomic norm for complex sinusoids has useful structure for both analysis and implementations. In this section, we analyze the structure of the dual problem and show that the dual optimal solution can be used as a method to identify the frequencies that comprise the optimal x.
Define the inner product as ⟨q, x⟩ = x* q, and the real inner product as ⟨q, x⟩_R = Re(⟨q, x⟩). Then, the dual norm of ||·||_A is defined as

    ||q||_A^* = sup_{||x||_A ≤ 1} ⟨q, x⟩_R = sup_{f ∈ [0,1]} | Σ_{j ∈ J} q_j e^{−i2π f j} |,    (II.7)

that is, the dual atomic norm is equal to the maximum modulus of the trigonometric polynomial ⟨q, a(f, 0)⟩ on the unit circle. The dual problem of (II.5) is thus

    maximize_q ⟨q_J, x*_J⟩_R  subject to  ||q||_A^* ≤ 1,  q_j = 0 for all j ∉ J,    (II.8)

which follows from a standard Lagrangian analysis [19]. Note that the dual atomic norm problem is an optimization over polynomials with bounded modulus on the unit disk.
Let (x, q) be primal-dual feasible for (II.5) and (II.8). We have that

    ⟨q, x⟩_R = ⟨q_J, x*_J⟩_R,

since q is supported on J and x agrees with x* on J. Since the primal is only equality constrained, Slater's condition naturally holds, implying strong duality [34, Sec. 5.2.3]. By weak duality, we always have

    ⟨q, x⟩_R ≤ ||q||_A^* ||x||_A ≤ ||x||_A

for any primal feasible x and any dual feasible q. Strong duality implies equality holds if and only if q is dual optimal and x is primal optimal. A straightforward consequence of strong duality is a certificate of the support of the solution to (II.5).
Proposition II.4: Suppose the atomic set A is composed of atoms a(f, φ) with J being either {0, ..., n−1} or {−2M, ..., 2M}. Then, x* = Σ_{k=1}^{s} c_k a(f_k, 0) is the unique optimizer to (II.5) if there exists a dual polynomial

    Q(f) = Σ_{j ∈ J} q_j e^{−i2π f j}    (II.9)

satisfying

    Q(f_k) = sign(c_k),  for all f_k ∈ T,    (II.10)
    |Q(f)| < 1,  for all f ∉ T,    (II.11)
    q_j = 0,  for all j ∉ J,    (II.12)

where sign(c_k) = c_k/|c_k|.
The polynomial Q(f) works as a dual certificate to certify that x* is the primal optimizer. The conditions on Q(f) are imposed on the values of the dual polynomial [conditions (II.10) and (II.11)] and on the coefficient vector q [condition (II.12)]. To prove Theorem II.3, we will construct a dual certificate satisfying the conditions of Proposition II.4 in Section IV.
Proof of Proposition II.4: Any vector q that satisfies the conditions of Proposition II.4 is dual feasible. We also have that

    ⟨q, x*⟩_R = Re( Σ_k c̄_k Q(f_k) ) = Re( Σ_k c̄_k sign(c_k) ) = Σ_k |c_k| ≥ ||x*||_A,

where the last inequality is due to the definition of the atomic norm. On the other hand, Hölder's inequality states ⟨q, x*⟩_R ≤ ||q||_A^* ||x*||_A ≤ ||x*||_A, implying ⟨q, x*⟩_R = ||x*||_A = Σ_k |c_k|. Since (x*, q) is primal-dual feasible, we conclude that x* is a primal optimal solution and q is a dual optimal solution because of strong duality.
For uniqueness, suppose x̂ = Σ_k ĉ_k a(f̂_k, 0) with { f̂_k } ≠ T is another optimal solution. We then have for the dual certificate Q(f):

    ⟨q, x̂⟩_R = Re( Σ_k ĉ̄_k Q(f̂_k) ) < Σ_k |ĉ_k|

due to condition (II.11) if x̂ is not solely supported on T, contradicting strong duality. So all optimal solutions are supported on T. Since for both J = {0, ..., n−1} and J = {−2M, ..., 2M} the set of atoms with frequencies in T are linearly independent, the optimal solution is unique.
A consequence of this proposition is that the dual solution provides a way to determine the composing frequencies of x*. One could evaluate the dual trigonometric polynomial

    Q(f) = Σ_{j ∈ J} q_j e^{−i2π f j}

and localize the frequencies by identifying the locations where the polynomial achieves modulus 1. The evaluation of Q(f) can be performed efficiently using the Fast Fourier Transform. Once the frequencies are estimated, the coefficients can be obtained by solving a linear system of equations. We illustrate frequency localization from the dual polynomial in Fig. 1.
Fig. 1. Frequency localization from the dual polynomial. Original frequencies in the signal (dashed blue) with their heights representing the coefficient magnitudes, and the dual polynomial modulus (solid red) obtained from a dual optimum. The recovered frequencies are obtained by identifying points where the dual polynomial has modulus one.
We would like to caution the reader that the dual optimal solutions are not unique in general. However, we can show that
any dual optimal solution must contain the frequencies in T whenever the optimal primal solution is x*. To see this, let T̂ = { f : |Q̂(f)| = 1 } be the set of recovered frequencies for a dual optimal Q̂, and assume T ⊄ T̂. Then we have

    ⟨q̂, x*⟩_R = Re( Σ_k c̄_k Q̂(f_k) ) < Σ_k |c_k| = ||x*||_A,

where the strict inequality follows because |Q̂(f_k)| < 1 for some f_k ∈ T. This strict inequality contradicts strong duality. Therefore, we have T ⊆ T̂.
In general, the set T̂ might contain spurious frequencies.
However, if we leverage the fact that the atomic norm is representable in terms of linear matrix inequalities, we can show that most semidefinite programming solvers will find good dual certificates. To make this precise, let us study the dual semidefinite program of the atomic norm problem (II.6). This dual problem also has a semidefinite programming formulation:

    maximize_{q, H} ⟨q_J, x*_J⟩_R,  with q_j = 0 for all j ∉ J,    (II.13)
    subject to  [ H  q ; q*  1 ] ⪰ 0,    (II.14)
    Σ_i H_{i,i} = 1,    (II.15)
    Σ_i H_{i,i+j} = 0,  j = 1, ..., n−1.    (II.16)

Here, H is an n × n Hermitian matrix, and the three linear matrix inequalities (II.14), (II.15), and (II.16) are equivalent to the dual norm constraint ||q||_A^* ≤ 1.
We emphasize that most solvers can directly return a dual optimal solution for free when solving the primal problem. So it is not necessary to solve the dual semidefinite program to obtain a dual optimum.
The following proposition addresses when T̂ = T. The proof is given in Appendix B.
Proposition II.5 (Exact Frequency Localization): For the signal x* = Σ_{k=1}^{s} c_k a(f_k, 0), denote by Λ the set of optimal solutions of the dual semidefinite program. If there exists one q̂ ∈ Λ such that the dual polynomial Q̂(f) = ⟨q̂, a(f, 0)⟩ satisfies

    Q̂(f_k) = sign(c_k),  for all f_k ∈ T,    (II.17)
    |Q̂(f)| < 1,  for all f ∉ T,    (II.18)
then the following statements hold:
1) all q̂ in the relative interior of Λ form strictly complementary pairs with the (unique) primal optimal solution, i.e., the ranks of the primal and dual optimal slack matrices sum to n + 1;    (II.19)
2) all q̂ in the relative interior of Λ satisfy (II.17) and (II.18);
3) the dual central path converges to a point in the relative interior of Λ.
Statement 2) of the proposition implies that we could use any optimal solution in the relative interior of the dual optimal set to localize frequencies, while statement 3) says any primal-dual path following algorithm for solving semidefinite programs (e.g., SDPT3) will produce such an interior point in the dual optimal set.
Under the conditions of Proposition II.5, the relative interior of Λ excludes dual optimal solutions that contain spurious frequencies. A particular pathological case is when all coefficients c_k in (I.1) are positive and q = e_0 = (1, 0, ..., 0) (assume J = {0, ..., n−1} and every index is observed), which is apparently a dual optimal solution. The dual polynomial corresponding to q = e_0 is the constant 1, and so T̂ contains every frequency in [0, 1]. It is easy to show that the only H such that (q, H) satisfies (II.14), (II.15), and (II.16) is H = e_0 e_0*. Hence, (q, H) is not maximal complementary unless T contains at least n frequencies, in which case the only trigonometric polynomial of that degree satisfying (II.17) is the constant 1.
However, this pathological case will not happen if we have a few well-separated frequencies and the phases are random.¹ Indeed, as we will see in Section IV, the way we prove Theorems II.3 and I.1 is to explicitly construct a dual polynomial satisfying (II.17) and (II.18). Therefore, combining these theorems and Proposition II.5, we obtain the following corollary.
Corollary II.6: Under the conditions of Theorem I.1 (or Theorem II.3), we can identify the frequencies in T from a dual optimal solution in the relative interior of the set of all dual optimal solutions with high probability.
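The FFT-based localization step described in this subsection can be sketched as follows. Here q is a random stand-in for a dual optimum (producing a genuine one requires solving the dual SDP); the point is only that a single zero-padded FFT evaluates Q(f) = Σ_{j∈J} q_j e^{−i2πfj} on a fine grid, after which grid points with |Q| ≈ 1 are kept:

```python
import numpy as np

n, grid = 32, 4096
rng = np.random.default_rng(2)
J = np.sort(rng.choice(n, size=12, replace=False))
q = np.zeros(n, dtype=complex)
q[J] = rng.standard_normal(12) + 1j * rng.standard_normal(12)  # stand-in, not a true dual optimum

# one zero-padded FFT evaluates Q at f = k / grid, k = 0, ..., grid-1:
# np.fft.fft computes sum_j q_j e^{-i 2 pi j k / grid}
Q_grid = np.fft.fft(q, grid)
freqs = np.arange(grid) / grid

# frequency localization keeps grid points where the modulus is (near) one
candidates = freqs[np.abs(np.abs(Q_grid) - 1.0) < 1e-2]
```

With a true dual optimum, `candidates` would cluster around the frequencies in T, up to the grid resolution 1/grid.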
¹We believe the randomness requirement on the phases is merely technical.
C. Power of Rank Minimization
The semidefinite programming characterization of the atomic norm also allows us to draw connections to the study of rank minimization [20], [35]–[37]. A direct way to exploit sparsity in the frequency domain is via minimization of the following "ℓ0-norm" type quantity:

    ||x||_{A,0} = inf { s : x = Σ_{k=1}^{s} c_k a(f_k, φ_k), c_k ≥ 0 }.

This penalty function chooses the sparsest representation of a vector in terms of complex exponentials. This combinatorial quantity is closely related to the rank of positive semidefinite Toeplitz matrices, as delineated by the following proposition.
Proposition II.7: The quantity ||x||_{A,0} is equal to the optimal value of the following rank minimization problem:

    minimize_{u, t} rank(Toep(u))  subject to  [ Toep(u)  x ; x*  t ] ⪰ 0.    (II.20)
Proof: The case x = 0 is trivial. For x ≠ 0, denote by r* the optimal value of (II.20). We first show r* ≤ ||x||_{A,0}. Suppose ||x||_{A,0} = s. Assume the decomposition x = Σ_{k=1}^{s} c_k a(f_k, φ_k) with c_k > 0 achieves ||x||_{A,0}, and set Toep(u) = Σ_{k=1}^{s} c_k a(f_k, 0) a(f_k, 0)*, so that rank(Toep(u)) ≤ s. Then, as we saw in (II.3), with t = Σ_k c_k,

    [ Toep(u)  x ; x*  t ] ⪰ 0.

This implies that r* ≤ s = ||x||_{A,0}.
We next show r* ≥ ||x||_{A,0}. The case r* = n is trivial, as we could always expand x on a Fourier basis, implying ||x||_{A,0} ≤ n. We focus on r* < n. Suppose (u, t) is an optimal solution of (II.20). Then, if

    Toep(u) = V D V*

is a Vandermonde decomposition, positive semidefiniteness implies that x is in the range of Toep(u), which means that x can be expressed as a combination of at most r* atoms, completing the proof.
Hence, for this particular set of atoms, atomic norm minimization is a trace relaxation of a rank minimization problem. The trace relaxation has been proven to be a powerful relaxation for recovering low-rank matrices subject to random linear equations [20], values at a specified set of entries [35], Euclidean distance constraints [38], and partial quantum expectation values [39]. However, our sampling model is far more constrained, and none of the existing theory applies to our problem. Indeed, typical results on trace-norm minimization demand that the number of measurements exceed the rank of the matrix times the number of rows in the matrix. In our case, this would amount to O(sn) measurements for an s-sparse signal. We prove
Fig. 2. Moments and their convex hulls. (a) The real moment curve for the first three moments. (b) The moment curve for the same frequencies, but with phases 0 or π added in. (c) The convex hull of (a). (d) The convex hull of (b). Whereas all of the secants of (a) are extreme in their convex hull (c), many segments between atoms of (b) lie inside the convex hull (d).
in the sequel that only O(s log s log n) samples are required, dramatically reducing the dependence on n.
We close this section by noting that a positive combination of complex sinusoids with zero phases observed at the first n samples can be recovered via the trace relaxation with no limitation on the resolution [40], [41]. Why does the story change when we bring unknown phases into the picture? A partial answer is provided by Fig. 2. Fig. 2(a) and (b) displays the set of atoms with no phase (i.e., φ = 0) and phase either 0 or π, respectively. That is, Fig. 2(a) plots the set

    M_0 = { (cos(2πf), cos(4πf), cos(6πf)) : f ∈ [0, 1] },

while (b) displays the set

    M_{0,π} = M_0 ∪ (−M_0).

Note that M_{0,π} is simply the union of M_0 and its reflection through the origin. Their convex hulls are displayed in Fig. 2(c) and (d), respectively. The convex hull of M_0 is neighborly in the sense that every edge between every pair of atoms is an exposed face and every atom is an extreme point. On the other hand, the only secants between atoms in M_{0,π} that are faces of the convex hull of M_{0,π} are those between atoms with far-apart phase angles and frequencies. Problems only worsen if we let the phase range in [0, 2π). Thus, our intuition from positive moment curves does not extend to the compressed sensing problem of sinusoids with complex phases. Nonetheless, we are able to demonstrate that under mild resolution assumptions, we can still recover sparse superpositions from very small sampling sets.
III. PRIOR ART AND INSPIRATIONS
Frequency estimation is extensively studied, and techniques for estimating sinusoidal frequencies from time samples date back to the work of Prony [21]. Many linear prediction algorithms based on Prony's method were proposed to estimate the
frequencies from regularly spaced time samples. A survey of these methods can be found in [42], and an extensive list of references is given in [23]. With equispaced samples, these root-finding-based procedures deal with the problem directly on the continuous frequency domain, and can recover frequencies provided the number of samples is at least twice the number of frequencies, regardless of how closely these frequencies are located [5], [21], [23], [42].
In recent work [24], Candès and Fernandez-Granda studied this problem from the point of view of convex relaxations and proposed a total-variation norm minimization formulation that provably recovers the spectrum exactly. However, the convex relaxation requires the frequencies to be well separated by the inverse of the number of samples. The proof techniques of this prior work form the foundation of the analysis in the sequel, but many major modifications are required to extend their results to the compressed sensing regime.
In [31], the authors proposed using the atomic norm to denoise a line spectral signal corrupted with Gaussian noise, and reformulated the resulting atomic norm minimization problem as a semidefinite program using the bounded real lemma [30]. Denoising is important to frequency estimation since the frequencies in a line spectral signal corrupted with moderate noise can be identified by linear prediction algorithms. Since the atomic norm framework in [31] is essentially the same as the total-variation norm framework of [24], the same semidefinite program can also be applied to total-variation norm minimization.
What is common to all the aforementioned approaches, including linear prediction methods, is the reliance on observing uniform or equispaced time samples. In sharp contrast, we show that nonuniform sampling is not only a viable option from which the original spectrum can be recovered exactly in the continuous domain, but is in fact a means of compressive or compressed sampling. Indeed, nonuniform sampling allows us to effectively sample the signal at a sub-Nyquist rate. We point out that we have only handled the case of undersampling uniform samples, as opposed to arbitrarily nonuniform samples, in this paper. However, this is still of practical importance. For array signal processing applications, this corresponds to a reduction in the number of sensors required for exact recovery, since each sensor obtains one spatial sample of the field. An extensive justification of the necessity of using randomly located sensor arrays can be found in [43]. To the best of our knowledge, little is known about exact line-spectrum recovery with nonuniform sampling using parametric methods, except sporadic work using ℓ1-norm minimization to recover the missing samples [44], or work based on nonlinear least squares data fitting [45]. Nonparametric methods such as the Periodogram and Correlogram for nonuniform sampling have gained popularity in recent years [46], [47], but they do not typically provide exact frequency localization even when all of the samples are observed.
An interesting feature related to using convex optimization-based methods for estimation such as [24] is a particular resolvability condition: the separation between frequencies is required to be greater than roughly the inverse of the number of measurements. Linear prediction methods do not have a resolvability limitation, but it is known that in practice the numerical stability of root finding limits how close the frequencies can be. Theorem II.3 can be viewed as an extension of the theory to nonuniform samples. Note that our approach gives an exact semidefinite characterization and is hence computationally tractable. We believe our results have potential impact on two related areas: extending compressed sensing to continuous dictionaries, and extending line spectral estimation to nonuniform sampling, thus providing new insight into sub-Nyquist sampling and superresolution.
IV. PROOF OF THEOREM II.3
The key to showing that the optimization (II.5) succeeds is to construct a dual variable satisfying the conditions (II.10), (II.11), and (II.12) in Proposition II.4 to certify the optimality of x*. The rest of the paper's proofs focus on the symmetric case J ⊂ {−2M, ..., 2M}.
As shown in Proposition II.4, the dual certificate can be interpreted as a polynomial with bounded modulus on the unit circle. The polynomial is constrained to have most of its coefficients equal to zero. In the case that all of the entries are observed, the polynomial constructed by Candès and Fernandez-Granda [24] suffices to guarantee optimality. Indeed, they write the certificate polynomial via a kernel expansion and show that one can explicitly find appropriate kernel coefficients that certify optimality. We review this construction in Section IV-A. The requirements of the certificate polynomial in our case are far more stringent and require a nontrivial modification of the Candès–Fernandez-Granda construction. We use a random kernel that has nonzero coefficients only in the indices corresponding to observed locations. (The randomness enters because the samples are observed at random.) The expected value of our random kernel is a multiple of the kernel developed in [24].
Using a matrix Bernstein inequality, we show that we can find suitable coefficients to satisfy most of the optimality conditions. We then write our solution in terms of the deterministic kernel plus a random perturbation. The remainder of the proof is dedicated to showing that this random perturbation is small everywhere. First, we show that the perturbation is small on a fine grid of the circle in Section IV-E. To do so, we emulate the proof of Candès and Romberg for reconstruction from incoherent bases [48]. Finally, in Section IV-F, we complete the proof by estimating the Lipschitz constant of the random polynomial and, in turn, proving that the perturbations are small everywhere. Our proof is based on Bernstein's polynomial inequality, which was used to estimate the noise performance of atomic norm denoising by Bhaskar et al. [31].
A. Detour: When All Entries are Observed
Before we consider the random observation model, we explain how to construct a dual polynomial when all entries in are observed, i.e., . The kernel-based construction method, which was first proposed in [24], inspires our random kernel-based construction in Section IV-C. The results presented in this section are also necessary for our later proofs.

When all entries are observed, the optimization problem (II.5) has a trivial solution, but we can still apply duality to certify the optimality of a particular decomposition. Indeed, a dual polynomial satisfying the conditions given in Proposition II.4 with
TANG et al.: COMPRESSED SENSING OFF THE GRID 7473
means that , namely, the decomposition achieves the atomic norm. To construct such a dual polynomial, Candès and Fernandez-Granda suggested considering a polynomial of the following form [24]:
(IV.1)
Here, is the squared Fejér kernel
(IV.2)
(IV.3)
with
(IV.4)

the discrete convolution of two triangular functions. The squared Fejér kernel is a good candidate kernel because it attains the value of 1 at its peak and rapidly decays to zero. Provided a separation condition is satisfied by the original signal, a suitable set of coefficients and can always be found.

We use to denote the first three derivatives of . We list some useful facts about the kernel :
For the weighting function , we have
(IV.5)
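The displayed formulas (IV.2)–(IV.5) were lost in transcription. As a hedged illustration only, the sketch below evaluates a squared Fejér kernel numerically; the normalization (peak value 1, parameter a = m/2 + 1 for even m) is an assumption modeled on the construction in [24], not a verbatim reproduction of the paper's constants.

```python
import numpy as np

def squared_fejer(f, m):
    """Evaluate a squared Fejer kernel with peak value 1 at f = 0.

    Assumed form: K(f) = [sin(pi*a*f) / (a*sin(pi*f))]^4 with a = m/2 + 1
    (m even), modeled on the kernel of [24]; constants are illustrative.
    """
    f = np.asarray(f, dtype=float)
    a = m / 2 + 1
    num = np.sin(np.pi * a * f)
    den = a * np.sin(np.pi * f)
    # Use the limiting value 1 where the denominator vanishes (f an integer).
    small = np.abs(den) < 1e-12
    ratio = np.where(small, 1.0, num / np.where(small, 1.0, den))
    return ratio ** 4

m = 64
f = np.linspace(-0.5, 0.5, 2001)
K = squared_fejer(f, m)
# K peaks at 1 for f = 0 and decays rapidly away from the peak.
```

Since |sin(a·x)| ≤ a·|sin x| for integer a, the kernel stays in [0, 1], matching the "attains the value of 1 at its peak, and rapidly decays to zero" description above.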
We require that the dual polynomial (IV.1) satisfies
(IV.6)
(IV.7)
for all . The constraint (IV.6) guarantees that satisfies the interpolation condition (II.10), and the constraint (IV.7) is used to ensure that achieves its maximum at frequencies in . Note that the condition (II.12) is absent in this section's setting since the set is empty.

We rewrite these linear constraints in the matrix-vector form
where
and is the vector with . We have rescaled the system of linear equations so that the system matrix is symmetric, positive semidefinite, and very close to the identity. Positive definiteness follows because the system matrix is a positive combination of outer products. To get an idea of why the system matrix is near the identity, observe that is symmetric with diagonal entries one, is antisymmetric, and is symmetric with negative diagonal entries . We define
(IV.8)
and summarize properties of the system matrix and its submatrices in the following proposition, whose proof is given in Appendix C.

Proposition IV.1: Suppose . Then, is
invertible and
(IV.9)
(IV.10)
(IV.11)
Here, denotes the matrix operator norm. For notational simplicity, partition the inverse of as
where and are both . Then, solving for and
yields
(IV.12)
Then, the th derivative of the dual polynomial (after normalization) is
(IV.13)
where we have defined
...
...
(IV.14)

with the th derivative of .

To certify that the polynomial with the coefficients (IV.12) is bounded uniformly on the unit circle, Candès and Fernandez-Granda divide the domain into regions near to and far from the frequencies of . Define

with . On , was analyzed directly, while on , is bounded by showing that its second-order derivative is negative. The following results are derived in the proofs of Lemmas 2.3 and 2.4 in [24].

Proposition IV.2: Assume . Then, we have
(IV.15)
and for

(IV.16)
(IV.17)
(IV.18)
(IV.19)
(IV.20)
and as a consequence,
(IV.21)
B. Bernoulli Observation Model
The uniform sampling model is difficult to analyze directly. However, the same argument used in [1] shows that the probability of recovery failure under the uniform model is at most twice that under a Bernoulli model. Here, by "recovery failure" we mean that (II.5) does not recover the original signal . Therefore, without loss of generality, we focus on the following Bernoulli observation model in our proof.

We observe entries in independently with probability . Let or 0 indicate whether we observe the th entry. Then, are i.i.d. Bernoulli random variables such that

On average in this model, we will observe entries. For , we use
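The Bernoulli observation model above can be sketched in a few lines. The signal length n and target sample count m below are illustrative values, and the mask array delta plays the role of the indicator variables in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4096        # ambient signal length (illustrative)
m = 256         # target average number of observed entries (illustrative)
p = m / n       # Bernoulli observation probability

# delta[j] = 1 if entry j is observed; i.i.d. Bernoulli(p) indicators.
delta = rng.random(n) < p
observed = np.flatnonzero(delta)

# The number of observed entries concentrates around its mean n*p = m.
count = int(delta.sum())
```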
C. Random Polynomial Kernels
We now turn to designing a dual certificate for the Bernoulli observation model. As in the case where all entries are observed, the challenge is to construct a dual polynomial satisfying
as well as an additional constraint
(IV.22)
The main difference in our random setting is that the demands on the polynomial are much stricter, as manifested by (IV.22): namely, we require that most of the coefficients of the polynomial be equal to zero. Our approach mimics the construction in the deterministic case to write
(IV.23)
but using a random kernel , which has nonzero coefficients only on the random subset and satisfies . We will then prove that concentrates tightly around .

Our random kernel is simply the expansion (IV.3), but with each term multiplied by a Bernoulli random variable corresponding to the observation of a component

As in (IV.4), is the convolution of two discrete triangular functions. The th derivative of is

Both and are random trigonometric polynomials of degree at most . More importantly, they contain the monomial only if , or equivalently,
TANG et al.: COMPRESSED SENSING OFF THE GRID 7475
Fig. 3. Plots of the random kernel. (a) and . (b) and .
. Hence, in (IV.23) is of the form (II.9) and satisfies . It is easy to calculate the expected values of and its th derivatives
(IV.24)
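The claim that the expected random kernel is times its deterministic counterpart can be checked numerically. In the sketch below, the triangular coefficient profile g, the probability p, and the test frequency are stand-in assumptions; only the masking mechanism (each coefficient multiplied by an independent Bernoulli variable) follows the construction in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 16
p = 0.5                                  # observation probability (illustrative)
idx = np.arange(-M, M + 1)
# Triangular coefficient profile: a stand-in for the convolution of two
# discrete triangles underlying the squared Fejer kernel (values assumed).
g = 1.0 - np.abs(idx) / (M + 1)

f0 = 0.13                                # an arbitrary test frequency
atoms = np.exp(2j * np.pi * idx * f0)    # the monomials exp(i*2*pi*j*f0)

# Random kernel: each coefficient is kept or dropped by a Bernoulli variable.
trials = 20000
delta = rng.random((trials, idx.size)) < p
vals = (g * delta) @ atoms               # one random-kernel value per trial

emp = vals.mean()                        # empirical mean of the random kernel
exact = p * (g @ atoms)                  # p times the deterministic kernel
```

The empirical mean of the masked kernel matches p times the deterministic kernel up to Monte Carlo error, mirroring (IV.24).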
In Fig. 3, we plot and laid over and , respectively. We see that far away from the peak, the random coefficients induce bounded oscillations in the kernel. Near 0, however, the random kernel remains sharply peaked.

In order to satisfy the conditions (II.10) and (II.11), we require that the polynomial in (IV.23) satisfies
(IV.25)
(IV.26)
for all . As for , the constraint (IV.25) guarantees that satisfies the interpolation condition (II.10), and the constraint (IV.26) helps ensure that achieves its maximum at frequencies in .

We now have linear constraints (IV.25), (IV.26) on unknown variables . The remainder of the proof consists of three steps.

1) Show that the linear system (IV.25), (IV.26) is invertible with high probability using the matrix Bernstein inequality [49].
2) Show that , the random perturbations introduced by the random observation process, are small on a set of discrete points with high probability, implying that the random dual polynomial satisfies the constraints in Proposition II.4 on the grid. This step is proved using a modification of the idea in [48].
3) Extend the result to using Bernstein's polynomial inequality [50] and eventually show for .
D. Invertibility
In this section, we show that the linear system (IV.25) and (IV.26) is invertible. Rewrite (IV.25) and (IV.26) in the following matrix-vector form:

(IV.27)

where , and . Note that we still rescale the derivatives using the deterministic quantity rather than the random variable . The expectation computation (IV.24) implies that
where
. Define
where
...
...
(IV.28)
Then, we have
with defined in (IV.8). As a consequence, we have
with a zero-mean random self-adjoint matrix. We will apply the noncommutative Bernstein inequality to show that concentrates about its mean with high probability.

Lemma IV.3 (Noncommutative Bernstein Inequality, [49, Th. 1.4]): Let be a finite sequence of independent, random self-adjoint matrices of dimension . Suppose that
Then, for all ,
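The following toy computation illustrates, rather than proves, the concentration phenomenon behind Lemma IV.3: the operator norm of a sum of independent zero-mean bounded random self-adjoint matrices is far smaller than the triangle-inequality bound. All dimensions and distributions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 8, 2000                           # matrix dimension and number of summands
total = np.zeros((d, d))
for _ in range(n):
    A = rng.standard_normal((d, d))
    X = (A + A.T) / 2                    # random symmetric matrix, mean zero
    X /= np.linalg.norm(X, 2)            # normalize so ||X_k|| = 1 (R = 1)
    total += X

dev = np.linalg.norm(total, 2)           # ||sum_k X_k|| in operator norm
# The triangle inequality only gives dev <= n; Bernstein-type concentration
# predicts the much smaller scale of order sqrt(n * log d).
```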
For , define the event
(IV.29)
The following lemma, proved in Appendix D, shows that has high probability if is large enough.

Lemma IV.4: If , then we have
provided
Note that an immediate consequence of Lemma IV.4 is that is invertible on . Additionally, Lemma IV.4 allows us to control the norms of the submatrices of . For that purpose, we partition as

with and both , and obtain the following corollary.

Corollary IV.5: On the event with , we have
The proof of this corollary uses elementary matrix analysis and can be found in Appendix E. Since on the event with
the matrix is invertible, we solve
for and from (IV.27)
(IV.30)
In the next section, we will plug (IV.30) back into (IV.23) and analyze the effect of random perturbations on the polynomial
.
E. Random Perturbations
In this section, we show that the dual polynomial concentrates around on a discrete set . We introduce a random analog of , defined by (IV.14), as
...
...
(IV.31)
with the th derivative of , and defined in (IV.28). The expectation of is equal to times its deterministic counterpart defined by (IV.14)
Then, in a similar fashion to (IV.13), we rewrite
We decompose into three parts:
which induces a decomposition on
(IV.32)
Here, is as in (IV.13), and we have defined
and
The goal of the remainder of this section is to show, in Lemmas IV.8 and IV.9, that and are small on a set of grid points with high probability.

The proof of Lemma IV.8, which shows is small on , essentially follows that of Candès and Romberg [48]. We include the proof details here for completeness, but very little changes in the argument. Since is a weighted sum of independent random variables following a symmetric distribution on the complex unit circle, for fixed , we apply Hoeffding's inequality to control its value. This in turn requires an estimate of . In Lemma IV.6, we first use concentration of measure (Lemma F.1) to establish that is small with high probability. In Lemma IV.7, we then combine Lemmas IV.6 and IV.4 to show is small. The extension from a fixed to a finite set relies on the union bound.

We start by bounding in the following
lemma. The proof, given in Appendix F, is based on an inequality of Talagrand.

Lemma IV.6: Fix . Let
and fix a positive number
if
otherwise
Then, we have
for some .
The following lemma combines Lemma IV.6 and Corollary IV.5 to show is small with high probability.

Lemma IV.7: Let . Consider a finite set . With the same notation as the last lemma, we have
Proof of Lemma IV.7: Conditioned on the events and
we have
where we have used Proposition IV.1 and Corollary IV.5, and plugged in . The claim of the lemma then follows from the union bound.

Lemma IV.7 together with Hoeffding's inequality allows us to
control the size of .

Lemma IV.8: There exists a numerical constant such that
if
then we have
The next lemma controls . Its proof is similar to that of Lemma IV.8.

Lemma IV.9: There exists a numerical constant such that
if
then we have
Both Lemmas IV.8 and IV.9 are proven in the Appendix. Denote the event as
(IV.33)
Combining the decomposition (IV.32), Lemma IV.8, and Lemma IV.9 with a suitable redefinition of and immediately yields the following proposition.

Proposition IV.10: Suppose is a finite set of points, controls the maximal deviation of the random polynomial and its derivatives from their expectations as in (IV.33), and is a small constant to control the probability. Then, there exists a constant such that
(IV.34)
is sufficient to guarantee
F. Extension to Continuous Domain
We have proved that and
are not far on a set of grid points.
This section aims at extending this statement to everywhere in , and at showing for eventually. The key is the following Bernstein polynomial inequality.

Lemma IV.11 (Bernstein's Polynomial Inequality, [50]): Let be any polynomial of degree with complex coefficients.
Then,
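The (elided) conclusion of Lemma IV.11 is the classical bound sup_{|z|=1} |p'(z)| ≤ N · sup_{|z|=1} |p(z)|. As a quick numerical sanity check, the sketch below evaluates a random polynomial and its derivative on a fine grid of the unit circle; the degree and grid size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 12                                             # polynomial degree (arbitrary)
c = rng.standard_normal(N + 1) + 1j * rng.standard_normal(N + 1)

theta = np.linspace(0.0, 2 * np.pi, 8192, endpoint=False)
z = np.exp(1j * theta)                             # points on the unit circle
powers = z[:, None] ** np.arange(N + 1)            # columns z^0, ..., z^N

p_vals = powers @ c                                # p(z)
dp_vals = powers[:, :N] @ (c[1:] * np.arange(1, N + 1))  # p'(z)

max_p = np.abs(p_vals).max()
max_dp = np.abs(dp_vals).max()
# Bernstein's polynomial inequality: max |p'| <= N * max |p| on |z| = 1.
```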
Our first proposition verifies that our random dual polynomial is close to the deterministic dual polynomial on all of .

Proposition IV.12: Suppose and
(IV.35)
Then, with probability , we have that
(IV.36)
for all and .
Proof: It suffices to prove (IV.36) on and and then modify the lower bound (IV.34). We first give a very rough estimate of on the set :

where we have used and . To see the latter, we note
where we have used
and
Viewing as a trigonometric polynomial in
of degree , according to Bernstein's polynomial inequality, we obtain
We select such that for any , there exists a point satisfying . The size of is less than .

With this choice of , on the set we have

Finally, we modify the condition (IV.34) according to our choice of :
An immediate consequence of Proposition IV.12 and the bound (IV.15) of Proposition IV.2 is the following estimate on for .

Lemma IV.13: Suppose and
Then, with probability , we have
(IV.37)
Proof: It suffices to choose . The rest follows from (IV.36), the triangle inequality, and a modification of the constant in (IV.35).

A similar statement holds for .

Lemma IV.14: Suppose and

Then, we have for all .

Proof: Define and . Since and , with the latter implying
we only need to show on . Take the second-order derivative of :
So it suffices to show that for
As a consequence of (IV.36) in Proposition IV.12, the triangle inequality, and (IV.16)–(IV.20) of Proposition IV.2, we have on the set for any
implying
when assumes a sufficiently small numerical value. With this choice of , the condition of becomes

Therefore, on except for . We actually proved a stronger result: with probability at least
(IV.38)
Proof of Theorem II.3: Finally, if and
combining Lemmas IV.13 and IV.14, we have proved the claim of Theorem II.3.
V. NUMERICAL EXPERIMENTS
We conducted a series of numerical experiments to test the performance of (II.5) under various parameter settings
Fig. 4. Frequency Estimation: Blue represents the true frequencies, while red represents the estimated ones.
TABLE I
PARAMETER CONFIGURATIONS
(see Table I). We use for all numerical experiments.

We compared the performance of two algorithms: the semidefinite program (II.6) and the basis pursuit obtained through discretization

(V.1)

Here, is a DFT matrix of appropriate dimension depending on the grid size. Note that since the components of are complex, this is a second-order cone problem. In the following, we use SDP and BP to label the semidefinite program algorithm and the basis pursuit algorithm, respectively. We solved the SDP with the SDPT3 solver [51] and the basis pursuit (V.1) with CVX [52] coupled with SDPT3. All parameters of the SDPT3 solver were set to default values and the CVX precision was set to "high." For BP, we used three levels of discretization, at 4, 16, and 64 times the signal dimension.

To generate our instances of the form (II.1), we sampled
normalized frequencies from , either randomly or equispaced. Random frequencies are sampled randomly on with an additional constraint on the minimal separation . Given , equispaced frequencies are generated with the same separation and an additional random shift. This random shift ensures that in most cases basis mismatch occurs for the discretization method. The signal coefficient magnitudes are either unit, i.e., equal to 1, or fading, i.e., equal to with a zero-mean, unit-variance Gaussian random variable. The signs follow either a Bernoulli distribution, labeled as real, or a uniform distribution on the complex unit circle, labeled as complex. A length signal was then formed according to model (II.1). As a final step, we uniformly sample entries of the resulting signal.

We tested the algorithms on three sets of experiments. In the
first experiment, by running the algorithms on a randomly generated instance with , and samples selected uniformly at random, we compare the ability of SDP and BP to localize frequencies and visually illustrate the effect of discretization. We see from Fig. 4 that SDP recovery followed by retrieving the frequencies according to Proposition II.5 gives the most accurate result. We also observe that increasing the level of discretization can increase BP's accuracy in locating the frequencies.

In the second set of experiments, we compare the performance of SDP and BP with three levels of discretization in terms of solution accuracy and running time. The parameter configurations are summarized in Table I. Each configuration was repeated ten times, resulting in a total of 1920 valid experiments, excluding those with .

We use the performance profile as a convenient way to compare the performance of different algorithms. The performance profile, proposed in [53], visually presents the performance of a set of algorithms under a variety of experimental conditions. More specifically, let be the set of experiments and specify the performance of algorithm on experiment for some metric (the smaller the better), e.g., running time or solution accuracy. Then, the performance profile is defined as
Roughly speaking, is the fraction of experiments such that the performance of algorithm is within a factor of that of the best-performing one.

We show the performance profiles for numerical accuracy and running times in Fig. 5(a) and (b), respectively. We see that SDP significantly outperforms BP for all tested discretization levels in terms of numerical accuracy. When the discretization levels are higher, e.g., 64×, the running times of BP exceed that of SDP.

Fig. 5. Performance profiles for solution accuracy and running times. Note the axes are in logarithmic scale for both plots. (a) Solution accuracy. (b) Running times.

TABLE II
MEDIANS AND MEDIAN ABSOLUTE DEVIATIONS (MAD) FOR SOLUTION ACCURACY AND RUNNING TIME

To give the reader a better idea of the numerical accuracy and the running times, in Table II we present their medians and median absolute deviations for the four algorithms. As one would expect, the running time increases as the discretization level increases. We also observe that SDP is very accurate, with a median error on the order of . Increasing the level of discretization can increase the accuracy of BP. However, with discretization level , we get a median accuracy on the order of , but the median running time already exceeds that of SDP.
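The accuracy ceiling of discretized BP reflects basis mismatch: an off-grid frequency is not matched exactly by any atom of the DFT grid. The following sketch, with an illustrative signal length, shows that an on-grid sinusoid correlates perfectly with one grid atom, while a half-bin-offset sinusoid correlates imperfectly with every atom, so its energy leaks across many grid coefficients.

```python
import numpy as np

n = 64                                   # number of samples (illustrative)
t = np.arange(n)

def atom(f):
    # Normalized complex sinusoid a(f) with entries exp(i*2*pi*f*t)/sqrt(n).
    return np.exp(2j * np.pi * f * t) / np.sqrt(n)

grid = np.arange(n) / n                  # the n-point DFT grid
A = np.stack([atom(f) for f in grid])    # one grid atom per row

# An on-grid sinusoid matches one atom exactly ...
on_grid = np.abs(A.conj() @ atom(5 / n)).max()
# ... while a half-bin-offset sinusoid matches no atom exactly
# (its best grid correlation is roughly 2/pi, basis mismatch).
off_grid = np.abs(A.conj() @ atom(5.5 / n)).max()
```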
In the third set of experiments, we compiled two phase transition plots. To prepare Fig. 6(a), we pick and vary and . For each fixed , we randomly generate frequencies while maintaining a frequency separation . The coefficients are generated with random magnitudes and random phases, and the entries are observed uniformly at random. We then run the SDPT3-SDP algorithm to recover the missing entries. The recovery is considered successful if the relative error . This process was repeated ten times and the rate of success was recorded.

Fig. 6. Phase transition: The phase transition plots were prepared with , and . The frequencies were generated randomly with minimal separation . Both the signs and magnitudes of the coefficients are random. In Fig. 6(a), the separation and , while in Fig. 6(b), the separation and . (a) Phase transition: . (b) Phase transition: .

Fig. 6(a) shows the phase transition results. The -axis indicates the fraction of observed entries , while the -axis is . The color represents the rate of success, with red corresponding to perfect recovery and blue corresponding to complete failure.

We also plot the line . Since a signal of frequencies has degrees of freedom, including frequency locations and magnitudes, this line serves as the boundary above which any algorithm should have a chance of failing. In particular, Prony's method requires consecutive samples in order to recover the frequencies and the magnitudes.
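Prony's method, mentioned above, can be sketched in a few lines: from 2s consecutive samples of a mixture of s complex sinusoids, solve a linear-prediction system and read the frequencies off the roots of the prediction polynomial. The frequencies and coefficients below are hypothetical; as noted earlier in the paper, the numerical stability of the root-finding step limits how close the frequencies can be in practice.

```python
import numpy as np

def prony_freqs(x, s):
    """Recover s frequencies from 2s samples x[j] = sum_k c_k exp(i*2*pi*f_k*j)."""
    # Linear prediction: x[j+s] = -(a_1 x[j+s-1] + ... + a_s x[j]), j = 0..s-1.
    A = np.array([[x[j + s - l] for l in range(1, s + 1)] for j in range(s)])
    a = np.linalg.solve(A, -x[s:2 * s])
    # The prediction polynomial z^s + a_1 z^(s-1) + ... + a_s has roots
    # exp(i*2*pi*f_k); read the frequencies off the root angles.
    roots = np.roots(np.concatenate(([1.0 + 0.0j], a)))
    return np.sort(np.mod(np.angle(roots) / (2 * np.pi), 1.0))

rng = np.random.default_rng(0)
f_true = np.array([0.12, 0.37, 0.71])    # hypothetical well-separated frequencies
c = rng.standard_normal(3) + 1j * rng.standard_normal(3)
t = np.arange(6)                         # 2s = 6 consecutive samples
x = (c * np.exp(2j * np.pi * np.outer(t, f_true))).sum(axis=1)

f_hat = prony_freqs(x, 3)
```

With noiseless data and distinct frequencies, the prediction system is invertible and the frequencies are recovered to machine precision; the method degrades sharply once the roots cluster.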
recovery to complete failure. However, the transition boundaryis not very sharp. In particular, we notice failures below theboundary of the transition where complete success shouldhappen. Examination of the unsuccessful instances show thatthey correspond to instances with minimal frequency separa-tions marginally exceeding . We expect to get cleaner phasetransitions if the frequency separation is increased.To prepare Fig. 6(b), we repeated the same process in
preparing Fig. 6(a) except that the frequency separation wasincreased from to . In addition, to respect the minimalseparation, we reduced the range of possible sparsity levels to
. We now see a much sharper phase transition.The boundary is actually very close to the line.When is close to 1, we even observe successful recoveryabove the line.
VI. CONCLUSION AND FUTURE WORK
By leveraging the framework of atomic norm minimization, we were able to solve the problem of compressed sensing of line spectra. For signals with well-separated frequencies, we show that the number of samples needed is roughly proportional to the number of frequencies, up to polylogarithmic factors. This recovery is possible even though our continuous dictionary is not incoherent at all and does not satisfy any sort of restricted isometry condition.

There are several interesting future directions to be explored to further expand the scope of this work. First, it would be useful to understand what happens in the presence of noise. We cannot expect exact support recovery in this case, as our dictionary is continuous and any noise will make the exact frequencies unidentifiable. In a similar vein, techniques like those used in [24] that still rely on discretization are not applicable to our current setting. However, since our numerical method is rather stable, we are hopeful that a theoretical stability result is possible.

We show a simple example which provides some evidence for
conjecturing that our proposed technique is stable. The signal was generated with , , random frequencies, fading amplitudes, and random phases. A total of 18 time samples were chosen uniformly from . The noisy observation was generated by adding complex noise with bounded norm to . We denoised and recovered the signal by solving the following optimization:
(VI.1)
which can be solved as a semidefinite program. We observe in Fig. 7 that the frequencies were approximately recovered by the optimization problem (VI.1) in the presence of noise. We leave the verification of stability with more extensive numerical simulations, and the theoretical analysis of this phenomenon, to future work.

Second, we saw in our numerical experiments that modest discretization introduces substantial error in signal reconstruction and fine discretization carries significant computational burdens. In this regard, it would be of great interest to speed up our semidefinite programming solvers so that we can scale our algorithms beyond the synthetic experiments of this paper. Our rudimentary experimentation with the first-order methods developed in [31] did not suffice for this problem, as they were unable to achieve the precision necessary for fine frequency localization. So, instead, it would be of interest to explore second-order alternatives, such as active set methods, to speed up our computations.

Finally, we are interested in exploring the class of signals that are semidefinite characterizable in hopes of understanding which signals can be exactly recovered. Our continuous frequency model is an instance of applying compressed sensing to problems with continuous dictionaries. It would be of great interest to see how our techniques may be extended to other continuously parametrized dictionaries. Models involving image manifolds may fall into this category [54]. Fully exploring the space of signals that can be acquired with just a few specially coded samples provides a fruitful and exciting program of future work.
Fig. 7. Noisy frequency recovery: (a) real part of the true, noisy, and recovered signals; (b) true frequencies (blue) and recovered frequencies (red). The (randomly generated) true signal has three frequencies at with coefficients . The (real parts of the) noisy observations at out of indices are indicated by black circles. The noise components were generated according to an i.i.d. complex Gaussian distribution and then scaled to have an norm of 2. The observed signal has an norm of 5.9453.
APPENDIX A
PROOF OF THEOREM I.1
Proof: Assume with and or 4. Suppose the signal has the decomposition
...
...
...
The rest of the proof argues that the dual polynomial constructed for the symmetric case can be modified to certify the optimality of for the general case.
If the coefficients have uniform random complex signs, then for fixed , also have uniform random complex signs. In addition, the Bernoulli observation model on the index set naturally induces a Bernoulli observation model on with . Denote . Therefore, if and
(A.1)
according to the proof of Theorem II.3, with probability greater than , we can construct a dual polynomial
satisfying
Now define
otherwise
and
Clearly, the polynomial satisfies
where . The theorem then follows from rewriting (A.1) in terms of and Proposition II.4.
APPENDIX B
PROOF OF PROPOSITION II.5
We need the following lemma about maximal-rank spectral factorization.

Lemma B.1: Suppose with is a nonnegative trigonometric polynomial which has zeros on the unit circle. Then, there exists a positive semidefinite matrix with rank such that
where
...
Furthermore, this is the representation with maximal possible rank.
Proof: According to the positive real lemma [30, Th. 2.5], the nonnegativity of
is equivalent to
(B.1)
In this case, we can write
Such a representation is called a Gram matrix representation. The set of all Gram matrices satisfying (B.1) is a compact convex set. By the spectral factorization theorem [30, Th. 1.1], all of the rank-one elements of are of the form such that

with the scalar factor determined by . Here, are the zeros of , as a function defined on the complex plane, on the unit circle with multiplicity two. The rest of the zeros of are divided into conjugate pairs . Choosing as either or yields a different . Therefore, there are in total such rank-one representations of .

Since has zeros on , we have for any ,
, implying. Therefore, the null space of has dimension at least
, and hence the rank of any Gram matrix is at most .

In the following, we construct a maximal-rank representation of . For that purpose, define for via
and
We first claim that are linearly independent. Suppose there exist constants such that

Postmultiplying both sides of the above equation by for each , we obtain

implying for . Since , must also be zero. Therefore, the set is linearly independent. Now form the matrix

where , and . Then, the linear independence of implies that is of rank , and
Proof of Proposition II.5: We prove the case . Note that our semidefinite program (II.6) and its dual have interior feasible points, a common assumption underlying many results about semidefinite programs, including the ones cited in the following arguments. For example, to obtain an interior feasible point for (II.6), one could take on and zero elsewhere, the Toeplitz matrix equal to , and .

1) For the dual optimal solution given in the proposition, define
The trigonometric polynomial is positive with zeros at since

if
otherwise.
Now, according to Lemma B.1, there exists with such that

Define , which satisfies . In other words, we have a
pair such that
and . Therefore, is a dual optimal solution.

The unique primal optimal solution has the form (refer to Proposition II.4 for a proof of the uniqueness):

where . Since the following decomposition holds:

the rank of is , and hence is a strictly complementary pair (see [55] for more information about strict complementarity). According to [56, Lemma 3.1], all matrices in the relative interior of the optimal solution set of a semidefinite program share the same range space. Therefore, for any other in the relative interior of , the rank of
is also , implying that also
form strictly complementary pairs with the primal optimal solution .
2) Since for the in part 1)
the null space of is
(B.2)
Again using [56, Lemma 3.1], we conclude that the null space of any with in the relative interior of is given by (B.2). Therefore, we have for each :
(B.3)
or equivalently
using .

For any , if , then a calculation similar to (B.3) concludes that

with is in the null space of . Due to the linear independence of with , the null space would have dimension greater than , contradicting . Therefore, we have
3) Due to the existence of strictly complementary primal-dual optimal solutions, the statement is a direct consequence of [57, Lemma 3.4] (see also [58] and [59]).
APPENDIX C
PROOF OF PROPOSITION IV.1
Proof: Under the assumption that , we cite the results of [24, Proof of Lemma 2.2] as follows:

where is the matrix infinity norm, namely, the maximum absolute row sum. Since is symmetric and has zero diagonals, the Geršgorin circle theorem [60] implies that
As a consequence, is invertible and
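The Geršgorin argument can be checked numerically on a stand-in matrix: if E is symmetric with zero diagonal and maximum absolute row sum r < 1, then every eigenvalue of I + E lies within r of 1, so I + E is invertible. The dimension and scaling below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 10
E = rng.uniform(-1.0, 1.0, (d, d))
E = (E + E.T) / 2
np.fill_diagonal(E, 0.0)                 # symmetric, zero diagonal
E *= 0.5 / np.abs(E).sum(axis=1).max()   # scale so the max row sum is r = 0.5

D = np.eye(d) + E
r = np.abs(E).sum(axis=1).max()          # Gershgorin radius bound
eig = np.linalg.eigvalsh(D)
# Every Gershgorin disk is centered at 1 with radius <= r < 1, so all
# eigenvalues lie in [1 - r, 1 + r] and D is invertible.
```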
APPENDIX D
PROOF OF LEMMA IV.4
1) Proof of Lemma IV.4: We start by computing the quantities necessary to apply Lemma IV.3
Here, we have used
We continue by bounding :
To further bound , we note
which leads to
Therefore, we have
Invoking the noncommutative Bernstein inequality and setting , we have
if
Consequently, when according to (IV.9), we have , confirming the invertibility of .
APPENDIX E
PROOF OF COROLLARY IV.5
Assuming is invertible and , we have the following two inequalities:
which are rearrangements of
Therefore, we establish that when on the set
:
Since the operator norm of a matrix dominates that of each of its submatrices, this completes the proof.
APPENDIX F
PROOF OF LEMMA IV.6
The proof uses Talagrand's concentration-of-measure inequality.

Lemma F.1 ([61, Corollary 7.8]): Let be a finite sequence of independent random variables taking values in a Banach space, and let be defined as

for a countable family of real-valued functions . Assume that and for all and every . Then, for
all
where
and is a numerical constant.

Proof of Lemma IV.6: Based on the definition of in (IV.31) and in (IV.14), we explicitly write as
where is defined in (IV.28) and we have defined as
It is clear that are independent random vectors with zero mean. Define
and
To compute the quantities necessary to apply Lemma F.1, we will extensively use the following elementary bounds:
and
First, we obtain an upper bound on :
The expected value of is upper bounded as follows:
(F.1)
Observe that . We apply Jensen's inequality and combine with (F.1) to obtain
Next, we upper bound :
implying
where is a matrix in whose th column is . Note that
Therefore, we have
In conclusion, Lemma F.1 shows that
Suppose . Then, define
and fix . Then, it follows that
for some , provided . The same is true if and if we define
. Therefore, let
and fix obeying
otherwise
Then, we have
for some . An application of the union bound proves the lemma.
APPENDIX G
PROOF OF LEMMA IV.8
The proof of Lemma IV.8 is based on Hoeffding's inequality, presented below.

Lemma G.1 (Hoeffding's Inequality): Let the components of be sampled i.i.d. from a symmetric distribution on the
complex unit circle, let , and let be a positive real number. Then,
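A Monte Carlo illustration (not a proof) of the concentration behind Lemma G.1: for i.i.d. phases uniform on the complex unit circle, the inner product with a fixed weight vector is typically of size comparable to the ℓ₂ norm of the weights, far below the deterministic ℓ₁ bound. The vector and trial counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
n, trials = 200, 5000
u = rng.standard_normal(n)               # a fixed weight vector (arbitrary)
l1 = np.abs(u).sum()
l2 = np.linalg.norm(u)

# i.i.d. signs uniform on the complex unit circle, one row per trial.
signs = np.exp(2j * np.pi * rng.random((trials, n)))
inner = np.abs(signs @ u)                # |<s, u>| for each trial

typical = np.median(inner)
# Hoeffding-type concentration: |<s, u>| is typically O(||u||_2),
# far below the deterministic bound ||u||_1.
```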
Proof of Lemma IV.8: Consider the random inner product , where are i.i.d. symmetric random variables with values on the complex unit circle. Conditioned on a particular realization where
Hoeffding’s inequality and the union bound then imply
An elementary probability calculation shows
Setting
in and applying Lemma IV.7 yields
For the second term to be less than , we choose such that
and assume this value from now on. The first term is less than if
(G.1)
First assume that . The condition in Lemma IV.6 is , or equivalently
(G.2)
In this case, we have , leading to
Now suppose that . If , then , which again gives the above lower bound on .
On the other hand if , then and
Therefore, to verify (G.1) it suffices to take obeying (G.2) and
This analysis shows that the first term is less than if
According to Lemma IV.4, the last term is less than if
Setting , combining all lower bounds on together, and absorbing all constants into one, we obtain
is sufficient to guarantee
with probability at least . The union bound then proves the lemma.
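The proof closes by splitting the failure probability across several events and adding the pieces with the union bound. This bookkeeping can be sketched as follows (the number of events $k$ and the budget $\delta$ are illustrative):

```python
# Union-bound bookkeeping as used to finish the proof: if each of k bad
# events occurs with probability at most delta, then all of them are
# avoided with probability at least 1 - k*delta. The values k = 3 and
# delta = 0.01 are illustrative, not the paper's constants.
delta, k = 0.01, 3
failure_bound = k * delta           # P(any bad event) <= sum of the pieces
success_bound = 1.0 - failure_bound
print(round(success_bound, 4))      # 0.97
```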
APPENDIX H
PROOF OF LEMMA IV.9
Proof of Lemma IV.9: Recall that
On the set defined in (IV.29), we established in Corollary IV.5 that
We use the norm to bound the norm of :
To get a uniform bound on , we need the following bound:
for suitably chosen numerical constants and . The bound over the region is a consequence of the more accurate bound established in [24, Lemma 2.6], while the uniform bound can be obtained by checking the expression of . Consequently, we have
We conclude that on the set
Again, application of Hoeffding’s inequality and the union bound gives
To make the first term less than , it suffices to take
To have the second term less than , we require
Another application of the union bound with respect to proves the lemma.
ACKNOWLEDGMENT
The authors would like to thank R. Nowak for many helpful discussions about this work.
REFERENCES
[1] E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Inf. Theory, vol. 52, no. 2, pp. 489–509, Feb. 2006.
[2] E. J. Candès and M. Wakin, “An introduction to compressive sampling,” IEEE Signal Process. Mag., vol. 25, no. 2, pp. 21–30, Mar. 2008.
[3] D. L. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289–1306, Apr. 2006.
[4] R. Baraniuk, “Compressive sensing [lecture notes],” IEEE Signal Process. Mag., vol. 24, no. 4, pp. 118–121, Jul. 2007.
[5] P. Stoica and R. L. Moses, Spectral Analysis of Signals, 1st ed. Upper Saddle River, NJ, USA: Prentice-Hall, 2005.
[6] C. Ekanadham, D. Tranchina, and E. P. Simoncelli, “Neural spike identification with continuous basis pursuit,” presented at the Comput. Syst. Neurosci., Salt Lake City, UT, USA, Feb. 2011.
[7] C. E. Parrish and R. D. Nowak, “Improved approach to lidar airport obstruction surveying using full-waveform data,” J. Surv. Eng., vol. 135, no. 2, pp. 72–82, May 2009.
[8] D. Malioutov, M. Cetin, and A. S. Willsky, “A sparse signal reconstruction perspective for source localization with sensor arrays,” IEEE Trans. Signal Process., vol. 53, no. 8, pp. 3010–3022, Aug. 2005.
[9] M. A. Herman and T. Strohmer, “High-resolution radar via compressed sensing,” IEEE Trans. Signal Process., vol. 57, no. 6, pp. 2275–2284, Jun. 2009.
[10] A. C. Fannjiang, T. Strohmer, and P. Yan, “Compressed remote sensing of sparse objects,” SIAM J. Imag. Sci., vol. 3, no. 3, pp. 595–618, Jan. 2010.
[11] R. Baraniuk and P. Steeghs, “Compressive radar imaging,” in Proc. IEEE Radar Conf., Waltham, MA, USA, Apr. 2007, pp. 128–133.
[12] W. U. Bajwa, J. Haupt, A. M. Sayeed, and R. Nowak, “Compressed channel sensing: A new approach to estimating sparse multipath channels,” Proc. IEEE, vol. 98, no. 6, pp. 1058–1076, Jun. 2010.
[13] M. F. Duarte and R. Baraniuk, “Spectral compressive sensing,” Jul. 2011 [Online]. Available: dsp.rice.edu
[14] H. Rauhut, “Random sampling of sparse trigonometric polynomials,” Appl. Comput. Harmon. Anal., vol. 22, no. 1, pp. 16–42, Jan. 2007.
[15] P. Stoica and P. Babu, “SPICE and LIKES: Two hyperparameter-free methods for sparse-parameter estimation,” Signal Process., vol. 92, no. 7, pp. 1580–1590, 2012.
[16] P. Stoica, P. Babu, and J. Li, “New method of sparse parameter estimation in separable models and its use for spectral analysis of irregularly sampled data,” IEEE Trans. Signal Process., vol. 59, no. 1, pp. 35–47, Jan. 2011.
[17] Y. Chi, L. L. Scharf, A. Pezeshki, and A. R. Calderbank, “Sensitivity to basis mismatch in compressed sensing,” IEEE Trans. Signal Process., vol. 59, no. 5, pp. 2182–2195, May 2011.
[18] M. A. Herman and T. Strohmer, “General deviants: An analysis of perturbations in compressed sensing,” IEEE J. Sel. Top. Signal Process., vol. 4, no. 2, pp. 342–349, Apr. 2010.
[19] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, “The convex geometry of linear inverse problems,” Found. Comput. Math., vol. 12, pp. 805–849, Dec. 2012.
[20] B. Recht, M. Fazel, and P. A. Parrilo, “Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization,” SIAM Rev., vol. 52, no. 3, pp. 471–501, Jan. 2010.
[21] B. G. R. de Prony, “Essai expérimental et analytique: Sur les lois de la dilatabilité de fluides élastique et sur celles de la force expansive de la vapeur de l’alkool, à différentes températures,” J. de l’École Polytechnique, vol. 1, no. 22, pp. 24–76, 1795.
[22] Y. Hua and T. K. Sarkar, “Matrix pencil method for estimating parameters of exponentially damped/undamped sinusoids in noise,” IEEE Trans. Acoust., Speech, Signal Process., vol. 38, no. 5, pp. 814–824, May 1990.
[23] P. Stoica, “List of references on spectral line analysis,” Signal Process., vol. 31, no. 3, pp. 329–340, Apr. 1993.
[24] E. J. Candès and C. Fernandez-Granda, “Towards a mathematical theory of super-resolution,” Mar. 2012, cs.IT [Online]. Available: arXiv.org
[25] R. Sanyal, F. Sottile, and B. Sturmfels, “Orbitopes,” Mathematika, vol. 57, pp. 275–314, 2011.
[26] C. Carathéodory, “Über den Variabilitätsbereich der Fourier’schen Konstanten von positiven harmonischen Funktionen,” Rendiconti del Circolo Matematico di Palermo (1884–1940), vol. 32, no. 1, pp. 193–217, 1911.
[27] C. Carathéodory and L. Fejér, “Über den Zusammenhang der Extremen von harmonischen Funktionen mit ihren Koeffizienten und über den Picard-Landau’schen Satz,” Rendiconti del Circolo Matematico di Palermo (1884–1940), vol. 32, no. 1, pp. 218–239, 1911.
[28] O. Toeplitz, “Zur Theorie der quadratischen und bilinearen Formen von unendlichvielen Veränderlichen,” Math. Ann., vol. 70, no. 3, pp. 351–376, 1911.
[29] A. Megretski, “Positivity of trigonometric polynomials,” in Proc. 42nd IEEE Conf. Decision Control, Maui, HI, USA, Dec. 2003, vol. 4, pp. 3814–3817.
[30] B. Dumitrescu, Positive Trigonometric Polynomials and Signal Processing Applications. New York, NY, USA: Springer-Verlag, Feb. 2007.
[31] B. N. Bhaskar, G. Tang, and B. Recht, “Atomic norm denoising with applications to line spectral estimation,” Apr. 2012, cs.IT [Online]. Available: arXiv.org
[32] T. Roh and L. Vandenberghe, “Discrete transforms, semidefinite programming, and sum-of-squares representations of nonnegative polynomials,” SIAM J. Optim., vol. 16, no. 4, pp. 939–964, Jan. 2006.
[33] P. Stoica and P. Babu, “Sparse estimation of spectral lines: Grid selection problems and their solutions,” IEEE Trans. Signal Process., vol. 60, no. 2, pp. 962–967, Feb. 2012.
[34] S. P. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, Mar. 2004.
[35] E. J. Candès and B. Recht, “Exact matrix completion via convex optimization,” Found. Comput. Math., vol. 9, no. 6, pp. 717–772, Dec. 2009.
[36] D. Gross, “Recovering low-rank matrices from few coefficients in any basis,” IEEE Trans. Inf. Theory, vol. 57, no. 3, pp. 1548–1566, Mar. 2011.
[37] B. Recht, “A simpler approach to matrix completion,” J. Mach. Learn. Res., vol. 12, pp. 3413–3430, Jan. 2011.
[38] A. Javanmard and A. Montanari, “Localization from incomplete noisy distance measurements,” in Proc. IEEE Int. Symp. Inf. Theory, St. Petersburg, Russia, Jul. 2011, pp. 1584–1588.
[39] D. Gross, Y. K. Liu, S. T. Flammia, S. Becker, and J. Eisert, “Quantum state tomography via compressed sensing,” Phys. Rev. Lett., vol. 105, no. 15, p. 150401, Oct. 2010.
[40] D. L. Donoho and J. Tanner, “Sparse nonnegative solution of underdetermined linear equations by linear programming,” in Proc. Nat. Acad. Sci. USA, 2005, vol. 102, no. 27, pp. 9446–9451.
[41] J.-J. Fuchs, “Sparsity and uniqueness for some specific under-determined linear systems,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2005, vol. 5, pp. v/729–v/732.
[42] T. Blu, P.-L. Dragotti, M. Vetterli, P. Marziliano, and L. Coulot, “Sparse sampling of signal innovations,” IEEE Signal Process. Mag., vol. 25, no. 2, pp. 31–40, Mar. 2008.
[43] L. Carin, D. Liu, and B. Guo, “Coherence, compressive sensing, and random sensor arrays,” IEEE Antennas Propag. Mag., vol. 53, no. 4, pp. 28–39, Aug. 2011.
[44] E. R. Dowski, C. A. Whitmore, and S. K. Avery, “Estimation of randomly sampled sinusoids in additive noise,” IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 12, pp. 1906–1908, Dec. 1988.
[45] P. Stoica, J. Li, and H. He, “Spectral analysis of nonuniformly sampled data: A new approach versus the periodogram,” IEEE Trans. Signal Process., vol. 57, no. 3, pp. 843–858, Mar. 2009.
[46] Y. Wang, J. Li, and P. Stoica, Spectral Analysis of Signals: The Missing Data Case, 1st ed. San Rafael, CA, USA: Morgan & Claypool Publishers, Jul. 2005.
[47] M. Shaghaghi and S. A. Vorobyov, “Spectral estimation from undersampled data: Correlogram and model-based least squares,” Feb. 2012, math.ST [Online]. Available: arXiv.org
[48] E. J. Candès and J. Romberg, “Sparsity and incoherence in compressive sampling,” Inverse Problems, vol. 23, no. 3, pp. 969–985, Apr. 2007.
[49] J. A. Tropp, “User-friendly tail bounds for sums of random matrices,” Found. Comput. Math., vol. 12, no. 4, pp. 389–434, Aug. 2011.
[50] A. C. Schaeffer, “Inequalities of A. Markoff and S. Bernstein for polynomials and related functions,” Bull. Amer. Math. Soc., vol. 47, pp. 565–579, Nov. 1941.
[51] K. C. Toh, M. J. Todd, and R. H. Tütüncü, “SDPT3: A MATLAB software package for semidefinite-quadratic-linear programming,” [Online]. Available: http://www.math.nus.edu.sg/~mattohkc/sdpt3.html
[52] M. Grant and S. Boyd, “CVX: Matlab software for disciplined convex programming, Version 1.21,” Apr. 2011 [Online]. Available: http://cvxr.com/cvx
[53] E. D. Dolan and J. J. Moré, “Benchmarking optimization software with performance profiles,” Math. Program., Ser. A, vol. 91, no. 2, pp. 201–213, Mar. 2002.
[54] M. B. Wakin, “A manifold lifting algorithm for multi-view compressive imaging,” in Proc. Picture Coding Symp., Chicago, IL, USA, May 2009, pp. 1–4.
[55] F. Alizadeh, J. Haeberly, and M. Overton, “Complementarity and nondegeneracy in semidefinite programming,” Math. Program., vol. 77, no. 1, pp. 111–128, 1997.
[56] D. Goldfarb and K. Scheinberg, “Interior point trajectories in semidefinite programming,” SIAM J. Optim., vol. 8, no. 4, pp. 871–886, 1998.
[57] Z. Luo, J. Sturm, and S. Zhang, “Superlinear convergence of a symmetric primal-dual path following algorithm for semidefinite programming,” SIAM J. Optim., vol. 8, no. 1, pp. 59–81, 1998.
[58] E. de Klerk, C. Roos, and T. Terlaky, “Initialization in semidefinite programming via a self-dual skew-symmetric embedding,” Oper. Res. Lett., vol. 20, no. 5, pp. 213–221, 1997.
[59] M. Halická, E. de Klerk, and C. Roos, “On the convergence of the central path in semidefinite optimization,” SIAM J. Optim., vol. 12, no. 4, pp. 1090–1099, 2002.
[60] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge, U.K.: Cambridge Univ. Press, Feb. 1990.
[61] M. Ledoux, The Concentration of Measure Phenomenon. Providence, RI, USA: American Mathematical Society, 2001.
Gongguo Tang (S’09–M’11) received the Ph.D. degree in electrical engineering from Washington University in St. Louis in 2011. He is currently a Postdoctoral Research Associate at the Department of Electrical and Computer Engineering, University of Wisconsin-Madison. His research interests are in the area of signal processing, convex optimization, information theory and statistics, and their applications.
Badri Narayan Bhaskar is a student in Electrical Engineering at the University of Wisconsin-Madison.
Parikshit Shah (S’06–M’09) is currently a Member of Research Staff at Philips Research, New York. He was a research associate at the University of Wisconsin (affiliated with the Wisconsin Institutes of Discovery) in 2011–2012. He received his Ph.D. in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology in June 2011. Earlier, he received a Master’s degree from Stanford and a Bachelor’s degree from the Indian Institute of Technology, Bombay. His research interests lie in the areas of convex optimization, control theory, system identification, and statistics. He has held the School of Engineering Fellowship at Stanford University, and a visiting appointment at the Institute for Pure and Applied Mathematics (IPAM) at UCLA.
Benjamin Recht (S’03–A’06) is an Assistant Professor at the University of Wisconsin-Madison and a member of the IEEE.