Bilinear Matrix Factorization Methods for Time-Varying Narrowband Channel Estimation: Exploiting Sparsity and Rank
Sajjad Beygi, Amr Elnakeeb, Sunav Choudhary, and Urbashi Mitra
Ming Hsieh Department of Electrical Engineering Viterbi School of Engineering
University of Southern California, Los Angeles, CA
Abstract—In this paper, the estimation of a narrowband time-varying channel under the practical assumptions of finite block length and finite transmission bandwidth is investigated. It is shown that the signal, after passing through a time-varying narrowband channel, reveals a useful parametric low-rank structure that can be represented as a bilinear form. To estimate the channel, two strategies are developed. The first method exploits the low-rank bilinear structure of the channel via a non-convex strategy based on an alternating direction optimization between the delay and Doppler directions. While prior Wirtinger flow methods have exhibited good performance with proper initialization, this is not true in the current scenario. Due to its non-convex nature, this first approach is sensitive to local minima. Furthermore, the convergence rate of the Wirtinger flow method is shown to be provably modest. Thus, a novel convex approach, based on the minimization of the atomic norm using measurements of the signal in the time domain, is proposed based on a second bilinear parametrization of the channel. For the convex approach, optimality and uniqueness conditions, as well as a theoretical guarantee for noiseless channel estimation with a small number of measurements, are characterized. Numerical results show that the performance of the proposed algorithm is independent of the leakage effect and that the new methods can achieve a 5 to 12 dB improvement, on average, compared to a classical l1-based sparse approximation method that does not consider the leakage effect, and a 2 dB improvement over a basis expansion method that considers the leakage effect.
I. INTRODUCTION
Wireless communications have enabled intelligent traffic safety [1], [2],
automated robotic networks, underwater surveillance systems [3], [4],
and many other useful technologies. In all of these systems,
establishing a reliable, high data
rate communication link between the transmitter and receiver
is essential. To achieve this goal, accurate channel state
information is needed to equalize the received signal and, thus
combat the effects of the wireless channel. One of the well-
known approaches to acquire channel state information is to
probe the channel in time/frequency with known signals, and
reconstruct the channel response from the output signals (see
[5] and references therein). Least-squares (LS) and Wiener
filters are classical examples of this approach. However, these
methods do not take advantage of the rich, intrinsic structure of
wireless communication channels in their estimation process.
In particular, many time-varying channels have sparse representations
in the Doppler-delay domain. The main challenge with classical
approaches, i.e., LS and Wiener filtering, is that they require a
large number of measurements, compared to the number of unknown
parameters in the estimation problem, to perform well.

Copyright (c) 2017 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

This research has been funded in part by the following grants: ONR N00014-15-1-2550, ONR N00014-09-1-07004, NSF CCF-1117896, NSF CNS-1213128, NSF CCF-1410009, AFOSR FA9550-12-1-0215, DOT CA-26-7084-00, NSF CPS-1446901, and the Fulbright Foundation, UK Royal Academy of Engineering, and UK Leverhulme Trust.

Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089. Emails: {beygihar, elnakeeb, ubli}@usc.edu and [email protected].
To combat this challenge, we need to take advantage of
the side information about the structure of the unknown
parameters. By exploiting inherent structure, we can reduce
the size of the feasible set of solutions. This results in the need
for fewer observations. Methods for sparse channel estimation
have existed for some time [4], [6]–[11] and there has been
a recent resurgence using modern signal processing methods
for sparsity [11]–[13], group-sparsity or mixed/hybrid (sparse
and group sparsity) structures [2], [14], [15] and even rank
[4], [16], [17].
The use of practical pulse shapes due to finite block
length and transmission bandwidth constraints results in a
loss of sparsity and degrades performance of the modern
sparse methods [2], [12]. This effect is defined as channel
leakage in [2], [12]. It has been shown that the performance of
compressed sensing (CS) methods is significantly degraded
by the leakage effect in practice [2], [12], [18]–[20]. The
mitigation of leakage effects via sparse basis expansion has
been previously considered in [12], [18]–[20]. In [18], a compressive
method for tracking doubly selective channels within
multicarrier systems, including OFDM systems is proposed.
Using a recently introduced concept of modified compressed
sensing (MOD-CS), the sequential delay-Doppler sparsity of
the channel is exploited to improve estimation performance
through a recursive estimation module. In [19], a compressive
estimator of doubly selective channels for pulse-shaping multicarrier
MIMO systems (including MIMO OFDM as a special
case) is proposed. The use of multichannel compressed sensing
exploits the joint sparsity of the MIMO channel for improved
performance. A multichannel basis optimization for enhancing
joint sparsity is also proposed. In [20], advanced compressive
estimators of doubly dispersive channels within multicarrier
communication systems (including classical OFDM systems)
are considered. The performance of compressive channel estimation
has been shown to be limited by leakage components
impairing the channel’s effective delay-Doppler sparsity.
In [12], the application of compressed sensing to the estimation
of doubly selective channels within pulse-shaping multicarrier
systems (with OFDM systems as a special case)
is considered. By exploiting sparsity in the delay-Doppler
domain, CS-based channel estimation allows for an increase
in spectral efficiency through a reduction of the number of
pilot symbols. For combating leakage effects that limit the
delay-Doppler sparsity, a sparsity-enhancing basis expansion
is considered and a method for optimizing the basis with
or without prior statistical information about the channel is
proposed.
In this work it is shown that with these practical communi-
cation system constraints, the transmitted signal after passing
through a linear, time-varying narrowband channel exhibits a
parametric, low-rank, bilinear form. It is this signal description
that enables methods distinct from the previously described work:
prior approaches are inherently one-dimensional, whereas our
methods are two-dimensional. The rank of this
representation is determined by the number of dominant paths,
which is small in both cellular and underwater environments
[21]. The bilinear form is due to the separability of the pulse
leakage effects in the delay and Doppler domains [2]. Herein,
we propose two methods which directly exploit the bilinear,
low-rank form. Our first approach is a variation of gradient-based
methods, motivated by the strong performance of this
type of approach for other applications [22], [23], due to the
ease of finding a good initialization. To solve the non-convex
optimization problem, we use the alternating direction method
[24] due to the bilinearity of the measurement model. In the
first step of this algorithm, we recover the channel in the delay
direction and in the second step we estimate the channel in
the Doppler direction. We repeat the steps iteratively until
we converge to a stationary point. If the optimum values of
the channel estimates are identifiable, we can prove that the
gradient of the non-convex objective function is zero only
at that point, for the noiseless case. Despite this positive
result, we also show that the gradients have high variance, thus
underscoring the likelihood of finding a local minimum, which
is always a challenge with non-convex objective functions.
Our second approach is based on an alternative parameterization
of the channel. We define a set of atoms to
describe the set of rank-one matrices in our channel estimation
problem. Utilizing this set of atoms, we show that the channel
estimation problem can be stated as a parametric low-rank
matrix recovery problem. Motivated by convex recovery for
inverse problems via the atomic norm heuristic [25], [26], we
develop a recovery algorithm employing the atomic norm to
enforce the channel model and leakage structures via a convex
optimization problem. We show that the solution of our convex
optimization problem is the optimal solution of our channel
estimation problem in the noiseless scenario. Furthermore,
we discuss the conditions under which the solution of this
convex program is unique. Finally, we develop a scaling
law that relates the number of measurements needed to the
probability of correct estimation of leaked channel parameters.
We analyze the algorithm to show that the global optimum can
be recovered in the absence of noise. Numerical results show
that the proposed algorithm provides a 5 dB to 12 dB improvement,
in the SNR sense, for SNRs over 5 dB, compared to an l1-based
sparse approximation method.
Furthermore, the proposed method offers 2 dB improvement,
on average, over the basis expansion method considered in
[12], which takes into account the leakage effects.
Portions of this work have appeared in three prior conference
papers [16], [17], [27]; the non-convex approach based
on Wirtinger flow was introduced in [17] and results relating
to convergence provided; the convex approach based on the
atomic norm was introduced in [16] and the scaling law result
was provided in [27]. The current manuscript contains the complete
proofs of our key technical results, as well as a generalization of
the theorem on convergence in [17]. The current work also has
further numerical results which explore the efficacy of the
root finding for Doppler values in the convex approach as well
as investigating the robustness of the convex approach to not
satisfying the constraints of the scaling law. Additionally, we
have new numerical comparisons to the method considered in
[12].
The rest of this paper is organized as follows. Section II
develops the communication system model which is used in
Section III to derive the discrete time observation model which
captures the bilinear structure of the problem. Section IV
derives the two strategies based on non-convex and convex
programming. Section V is devoted to discussion and numerical
results, and finally Section VI concludes the paper.
Appendices, A, B, and C provide the proofs of the three major
theorems of the work.
Notation: Scalar values are denoted by lower-case letters, x, and column vectors by bold lower-case letters, x. The i-th element of x is given by x[i]. Given a vector x ∈ ℝ^n,

  ‖x‖_2 = √( Σ_{i=1}^{n} x[i]^2 )

denotes the ℓ_2 norm of x. A matrix is denoted by bold capital letters, such as X, and its (i, j)-th element by X[i, j]. The transpose of X is given by X^T and its conjugate transpose by X^H. A diagonal matrix with diagonal elements x is written as diag{x}, and the identity matrix as I. The set of real numbers is denoted by ℝ, and the set of complex numbers by ℂ. The element-wise (Schur) product is denoted by ⊙.
II. SYSTEM MODEL
We assume that the transmitted signal x(t) is generated by
the modulation of a pilot sequence x[n] onto the transmit pulse
pt(t) as given by
  x(t) = Σ_{n=−∞}^{+∞} x[n] p_t(t − nT_s),
where Ts is the sampling period. Note that this signal model
is quite general, and encompasses OFDM signals as well as
single-carrier signals. The signal x(t) is transmitted over a
linear, time-varying channel. The received signal y(t) can be
written as,
  y(t) = ∫_{−∞}^{+∞} h(t, τ) x(t − τ) dτ + z(t). (1)
Here, h(t, τ) is the channel’s time-varying impulse response,
and z(t) is a white Gaussian noise process. A common
model for the narrowband time-varying (TV) channel impulse
response is as follows,
  h(t, τ) = Σ_{k=1}^{p_0} η_k δ(τ − t_k) e^{j2πν_k t}, (2)
where p0 denotes the number of dominant paths in the channel,
η_k, t_k, and ν_k denote the k-th channel path's attenuation gain,
delay, and Doppler shift, respectively. At the receiver, y(t)
is converted into a discrete-time signal using an anti-aliasing
filter p_r(t). That is,
  y[n] = ∫_{−∞}^{+∞} y(t) p_r(nT_s − t) dt.
We assume that pt(t) and pr(t) are causal with support
[0, Tsupp). Under the reasonable assumption νmaxTsupp � 1,
where νmax = max (ν1, . . . , νp0) denotes the Doppler spread
of the channel [1] and defining p(t) = pt(t) ∗ pr(t), we can
write the received signal after filtering and sampling as [2],
  y[n] = Σ_{m=0}^{m_0−1} h^(l)[n, m] x[n − m] + z[n], (3)

where h^(l)[n, m] = Σ_{k=1}^{p_0} h_k^(l)[n, m] and

  h_k^(l)[n, m] = η_k e^{j2πν_k((n−m)T_s − t_k)} p(mT_s − t_k), (4)
for n ∈ ℤ. The superscript l denotes leakage. Here, m_0 = ⌊τ_max/T_s⌋ + 1
denotes the maximum discrete delay spread of the channel, where
τ_max is the maximum delay spread of the channel.
The pulse shaping filter support does not impact
the value of τmax. Without loss of generality, if we assume
that pr(t) has a root-Nyquist spectrum with respect to the
sample duration T_s, then z[n] is a sequence of i.i.d. circularly
symmetric complex Gaussian random variables with a constant
variance σ2z . The pulse leakage effect is due to the non-zero
support of the pulse p(·) in Equation (4). The leakage with
respect to Doppler can be decreased by increasing the pulse
shape duration, while the leakage with respect to delay can
be decreased by increasing the bandwidth of the transmitted
signal. Given the practical constraints of pulse shape duration
and bandwidth, the leakage effect increases the number of
nonzero coefficients of the observed leaked channel at the
receiver (for more details, see [2], [12]).
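The delay-leakage mechanism in Equation (4) is easy to visualize numerically. The sketch below is our own illustration (not the paper's simulation setup): it assumes an ideal sinc for the combined pulse p(·), ignores the Doppler phase factor, and samples p(mT_s − t_k) for an on-grid and an off-grid path delay.

```python
import numpy as np

Ts = 1.0          # sample period
m0 = 8            # discrete delay spread (window of taps)
m = np.arange(m0)

# Assumed combined pulse p = pt * pr; an ideal band-limited (sinc)
# pulse is used here purely for illustration.
def p(t):
    return np.sinc(t / Ts)

taps_int  = p(m * Ts - 2.0 * Ts)   # path delay on the sampling grid
taps_frac = p(m * Ts - 2.5 * Ts)   # path delay between two samples

# On-grid delay: a single significant tap. Off-grid delay: energy
# "leaks" across many taps, destroying exact sparsity.
n_int  = int(np.sum(np.abs(taps_int)  > 0.05))
n_frac = int(np.sum(np.abs(taps_frac) > 0.05))
```

The on-grid delay yields one nonzero tap, while the half-sample offset spreads significant energy over every tap in the window, which is exactly the loss of delay-domain sparsity described above.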
The main goal of channel estimation is the determination
of the channel coefficients, i.e., { h_k^(l)[n, m] | 1 ≤ k ≤ p_0,
0 ≤ m ≤ m_0 − 1 }, at time instance n, in order to
equalize their effect on the transmitted signal. It is clear
that at each time instance, n, there exist m0p0 (unknown)
channel coefficients to be estimated. These coefficients are
estimated via the observations y[n] and the pilot sequence
x[n] which are both known at the receiver during channel
estimation (see the signal model in Equation (3)). Here,
n ∈ {m_0, ..., n_T + m_0 − 1}, where n_T denotes the total
number of training symbols. In the next section, we show
that the channel coefficients exhibit structures that we can
explicitly exploit in our estimation process in order to decrease
the set of feasible solutions for channel coefficients, and thus
improve the estimation fidelity.
III. PARAMETRIC SIGNAL REPRESENTATION
In this section, we exploit the intrinsic structures in the
measurement model in Equation (3). We show that, even
though the leakage effect diminishes the sparsity of the ef-
fective observed channel coefficients, leakage also introduces
an elegant parametric low-rank structure that will enable
high accuracy channel estimation with a limited number of
measurements.
Define g_k(t) = p(t − t_k) e^{−j2πν_k t}. Then, using Equations (3)
and (4), the received signal from the k-th path can be written as

  s_k[n] = Σ_{m=0}^{m_0−1} g_k(mT_s) x[n − m] = x_n^T g_k, (5)

where g_k = [g_k(0·T_s), ..., g_k((m_0 − 1)T_s)]^T and
x_n = [x[n], x[n − 1], ..., x[n − (m_0 − 1)]]^T. The vectors g_k, for
1 ≤ k ≤ p_0, contain only the (shifted) leakage pulse shape
information, and the vector x_n consists of m_0 consecutive
samples from the training signal up to time n. We can represent
samples from the training signal up to time n. We can represent
the (aggregated) received signal in Equation (3) as
  y[n] = Σ_{k=1}^{p_0} η̄_k s_k[n] e^{j2πν̄_k n}, (6)

where η̄_k = η_k e^{−j2πν_k t_k} and ν̄_k = ν_k T_s ∈ [−1/2, 1/2]. If we
stack the sk[n] for m0 ≤ n ≤ nT + m0 − 1 in a vector as
s_k = [s_k[m_0], ..., s_k[n_T + m_0 − 1]]^T, we can write

  s_k = X g_k, (7)
where X is an n_T-by-m_0 matrix with its i-th row equal to
x_{i+m_0−1}^T. In wireless communication systems, typically
n_T > m_0; that is, all s_k live in a common low-dimensional
subspace spanned by the columns of the known n_T × m_0 matrix
X. We assume that ‖g_k‖_2 = 1 without loss of
generality. Using Equation (5), recovery of s_k is guaranteed
if g_k can be recovered. Therefore, the number of degrees of
freedom in Equation (6) becomes O(m_0 p_0), which is smaller
than the number of measurements n_T when p_0, m_0 ≪ n_T.
Applying Equation (5), we can rewrite Equation (6) as
  y[n] = Σ_{k=1}^{p_0} η̄_k e^{j2πnν̄_k} x_n^T g_k. (8)
We define d(ν̄) = [e^{−j2πm_0ν̄}, ..., e^{−j2π(n_T+m_0−1)ν̄}]^T,
which is a vector of all possible Doppler shifts in the channel
representation. Thus, we have
  y[n] = ⟨ Σ_{k=1}^{p_0} η̄_k g_k d(ν̄_k)^H , x_n e_{n−m_0+1}^T ⟩, (9)

for n = m_0, ..., n_T + m_0 − 1, where ⟨X, Y⟩ = trace(Y^H X)
and e_n, 1 ≤ n ≤ n_T, are the canonical basis vectors for
ℝ^{n_T×1}. We observe that the first factor in the matrix inner
product given in Equation (9) has only the (narrowband) time-
varying channel information and the second factor is a function
of the training sequence only. Hereafter, we define the leaked
channel matrix as

  H_l = Σ_{k=1}^{p_0} η̄_k g_k d(ν̄_k)^H. (10)
Since each term in the summation in Equation (10) is a
rank-one matrix, we know that rank(H_l) ≤ p_0. Thus, Equation
(9) leads to a parametrized rank-p_0 matrix recovery problem,
which we write as

  y = Π(H_l), (11)

where the linear operator Π : ℂ^{m_0×n_T} → ℂ^{n_T×1} is defined
as [Π(H_l)]_n = ⟨H_l, x_n e_n^T⟩, and H_l is a low-rank matrix with
the parametric representation given in Equation (10).
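As a quick sanity check on Equations (10)–(11), the following sketch (with illustrative sizes m_0, n_T, p_0 of our own choosing) assembles H_l from p_0 rank-one atoms and confirms that its numerical rank does not exceed the number of paths:

```python
import numpy as np

rng = np.random.default_rng(0)
m0, nT, p0 = 6, 64, 3

def d(nu):
    # Doppler steering vector of Equations (9)-(10), indices m0..nT+m0-1
    return np.exp(-2j * np.pi * nu * np.arange(m0, m0 + nT))

nus  = np.array([-0.30, 0.05, 0.27])   # normalized Dopplers in [-1/2, 1/2]
etas = np.array([1.0, 0.6, 0.3])       # path gains
G = rng.standard_normal((m0, p0)) + 1j * rng.standard_normal((m0, p0))
G /= np.linalg.norm(G, axis=0)         # unit-norm leakage vectors g_k

# Leaked channel matrix H_l = sum_k eta_k g_k d(nu_k)^H, Equation (10)
H = sum(etas[k] * np.outer(G[:, k], d(nus[k]).conj()) for k in range(p0))

sv = np.linalg.svd(H, compute_uv=False)
num_rank = int(np.sum(sv > 1e-9 * sv[0]))   # numerical rank <= p0
```

With distinct Doppler values and generic leakage vectors, the numerical rank equals p_0 exactly, matching rank(H_l) ≤ p_0.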
IV. STRUCTURED ESTIMATION OF TIME-VARYING
NARROWBAND CHANNELS
In this section, we propose two algorithms to estimate
the leaked channel matrix H_l via the measurements y[n], by
exploiting the parametrized low-rank matrix structure described
in Section III. We show that by leveraging the channel
structure in the estimation process, the channel coefficients can
be estimated with high accuracy and with a small number of
measurements. From this point on, we drop the subscript l on
H_l for clarity; it is understood that H always refers to the
leaked channel matrix.
A. Nonconvex Alternating Direction Minimization
Recovering the channel from the linear measurement model
y = Π(H), described in Equation (11), is equivalent to determining
a low-rank matrix H that satisfies this measurement
model. Thus, we seek to solve

  Ĥ_l = argmin_H ‖y − Π(H)‖_2^2
  s.t. rank(H) ≤ p_0, H = Σ_{k=1}^{p_0} η̄_k g_k d(ν̄_k)^H, (12)
where p_0 denotes the maximum number of dominant paths in
the channel. Unfortunately, due to the rank constraint, this
optimization problem is, in general, NP-hard [28]. A
tractable relaxation of the rank constraint is the nuclear
norm of the target matrix [28]. In particular, we can rewrite
the relaxed optimization problem as
  Ĥ_l = argmin_H ‖y − Π(H)‖_2^2 + λ‖H‖_*
  s.t. H = Σ_{k=1}^{p_0} η̄_k g_k d(ν̄_k)^H, (13)
where the parameter λ in Equation (13) determines the trade-off
between the fidelity of the solution to the measurements y
and its conformity to the low-rank model. Furthermore, from
Equation (10), we know that the channel matrix H can be
represented as
  H = Σ_{k=1}^{p_0} η̄_k g_k d(ν̄_k)^H = H_t H_ν, (14)

where

  H_t = [η̄_1 g_1, η̄_2 g_2, ..., η̄_{p_0} g_{p_0}], (15)
  H_ν = [d(ν̄_1), d(ν̄_2), ..., d(ν̄_{p_0})]^H. (16)
As can be seen from Equation (14), the sampled channel
matrix is a function of both the delay and the Doppler values.
One can estimate the sampled channel matrix factors, i.e., g_k
and d(ν̄_k)^H, or attempt to estimate the delay and Doppler
values that contribute to those factors. The Doppler values have a
direct structural relationship with HHH through the Vandermonde
matrix. While the pulse shaping filters are known, endeavoring
to exploit the resulting parametric relationship between the
leakage vector g and the delay (and also the Doppler) did
not provide good performance in our direct estimation of the
delay. This could be due to the highly non-linear nature of the
pulse correlation function or the presence of the Doppler in the
delay leakage vector. Thus, we are able to be parametric with
respect to the Doppler, but not with respect to the delay. We
treat H_t as an effective delay matrix (although it depends
on the Doppler as well) and do not seek to estimate the delays
directly. In developing our equalizer, we only need to use H_t
(as a whole) and H_ν.
Therefore, we can reformulate the optimization problem in
Equation (13), with H = H_t H_ν, as

  argmin_{H_t, H_ν} ( ‖y − Π(H_t H_ν)‖_2^2 + λ‖H_t H_ν‖_* ). (17)
From [29], we know that the minimization of the nuclear
norm of a matrix product can be rewritten as a Frobenius-norm
minimization,

  min_{H_t, H_ν} ‖H_t H_ν‖_* = min_{H_t, H_ν} (1/2)( ‖H_t‖_F^2 + ‖H_ν‖_F^2 ). (18)
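The equivalence in Equation (18) can be checked numerically: for the balanced factorization built from the SVD H = UΣV^H, taking H_t = UΣ^{1/2} and H_ν = Σ^{1/2}V^H attains (1/2)(‖H_t‖_F² + ‖H_ν‖_F²) = ‖H‖_*. A small sketch with a random test matrix (our own check, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.standard_normal((6, 10)) + 1j * rng.standard_normal((6, 10))

U, s, Vh = np.linalg.svd(H, full_matrices=False)
Ht = U * np.sqrt(s)            # U @ diag(sqrt(s)): column j scaled by sqrt(s_j)
Hv = np.sqrt(s)[:, None] * Vh  # diag(sqrt(s)) @ Vh

nuclear  = s.sum()             # ||H||_* = sum of singular values
balanced = 0.5 * (np.linalg.norm(Ht, 'fro')**2 + np.linalg.norm(Hv, 'fro')**2)
# Any factorization H = Ht @ Hv upper-bounds the nuclear norm;
# the balanced SVD factorization attains it exactly.
```

The two quantities agree to machine precision, while an unbalanced factorization (e.g., scaling H_t up and H_ν down) only increases the Frobenius cost.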
With this equivalence in hand, we will consider the following
alternative optimization problem,

  argmin_{H_t, H_ν} ( ‖y − Π(H_t H_ν)‖_2^2 + (λ/2)( ‖H_t‖_F^2 + ‖H_ν‖_F^2 ) ). (19)
It should be noted that the cost function in (19) is an
upper bound on the cost function in (17), due to the
following chain of inequalities:

  min_{H_t,H_ν} ( ‖y − Π(H_t H_ν)‖_2^2 + λ_1 ‖H_t H_ν‖_* )
  (a)≤ min_{H_t,H_ν} ( ‖y − Π(H_t H_ν)‖_2^2 + λ_1 √rank(H_t H_ν) ‖H_t H_ν‖_F )
  (b)≤ min_{H_t,H_ν} ( ‖y − Π(H_t H_ν)‖_2^2 + λ_1 √p_0 ‖H_t H_ν‖_F )
  (c)≤ min_{H_t,H_ν} ( ‖y − Π(H_t H_ν)‖_2^2 + λ_1 √p_0 ‖H_t‖_F ‖H_ν‖_F )
  (d)≤ min_{H_t,H_ν} ( ‖y − Π(H_t H_ν)‖_2^2 + λ_1 (√p_0 / 2)( ‖H_t‖_F^2 + ‖H_ν‖_F^2 ) )
    = min_{H_t,H_ν} ( ‖y − Π(H_t H_ν)‖_2^2 + (λ/2)( ‖H_t‖_F^2 + ‖H_ν‖_F^2 ) ),

where λ = λ_1 √p_0; (a) holds since ‖A‖_* ≤ √rank(A) ‖A‖_F,
(b) holds since √rank(H_t H_ν) = √rank(H) ≤ √p_0, (c)
holds since ‖AB‖_F ≤ ‖A‖_F ‖B‖_F, and (d) holds since
‖A‖_F ‖B‖_F ≤ (1/2)( ‖A‖_F^2 + ‖B‖_F^2 ).
The form of this upper bound facilitates optimization,
although it is still non-convex and thus our methods will yield
local optima only.
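Each of the steps (a)–(d) rests on a standard norm inequality; the following sketch (random matrices of our own choosing, not from the paper) verifies them numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 4))
B = rng.standard_normal((4, 7))

nuc  = np.linalg.svd(A, compute_uv=False).sum()  # ||A||_*
froA = np.linalg.norm(A, 'fro')
froB = np.linalg.norm(B, 'fro')
r    = np.linalg.matrix_rank(A)

ok_a = nuc <= np.sqrt(r) * froA + 1e-12                       # step (a)
ok_c = np.linalg.norm(A @ B, 'fro') <= froA * froB + 1e-12    # step (c)
ok_d = froA * froB <= 0.5 * (froA**2 + froB**2) + 1e-12       # step (d), AM-GM
```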
From Equation (16), we see that the matrix H_ν is a partial
Vandermonde matrix and is fully determined by the Doppler
parameters ν̄ = [ν̄_1, ..., ν̄_{p_0}]. Therefore, instead of optimizing
the matrix H_ν over ℂ^{p_0×n_T}, we perform the optimization over
the set of Doppler parameters ν̄ ∈ [−1/2, 1/2]^{p_0} as follows:

  argmin_{H_t, ν̄} ‖y − Π(H_t H_ν(ν̄))‖_2^2 + (λ/2)‖H_t‖_F^2 + (λ/2)‖H_ν(ν̄)‖_F^2.
Note that the above optimization problem is non-convex [27],
due to the product of unknowns, i.e., Π(H_t H_ν). Due to the
separability of the objective function in the delay and Doppler
directions (see Section V in [2]), we can employ the alternating
projections algorithm [24], a space-efficient technique that stores
the iterates in factored form. The algorithm is extraordinarily
simple and easy to interpret: in the first step, we fix either
H_t or H_ν and optimize the other. Then, in the second
step, we substitute the matrix computed in the first step and
optimize over the second matrix. We iterate over these two
steps until the algorithm converges to the optimal solution (or a
stationary point). Given the current estimates H_t^k
and H_ν^k, the updating rules can be summarized as
  H_ν^{k+1} = argmin_{H_ν(ν̄)} ‖y − Π(H_t^{k+1} H_ν(ν̄))‖_2^2, (20)

  H_t^{k+1} = argmin_{H_t} ‖y − Π(H_t H_ν(ν̄^k))‖_2^2 + (λ/2)‖H_t‖_F^2. (21)
We can further simplify these iterations using Lemma 1.
Lemma 1 (see, e.g., [30]). Suppose that Π : ℂ^{m_0×n_T} → ℂ^{n_T×1} is a linear operator with

  [Π(H)]_n = Tr(X_n H) = Σ_{i=1}^{m_0} Σ_{j=1}^{n_T} X_n[j, i] H[i, j],

where X_n = x_n e_n^T. Then, for H = H_t H_ν, we can write

  Π(H) = A_t vec(H_ν) = A_ν vec(H_t), (22)

where vec(·) stacks the columns of its matrix argument into a single column vector. The matrices A_t and A_ν are defined as

  A_t[k, l + p_0(j − 1)] = Σ_{i=1}^{m_0} X_k[j, i] H_t[i, l],

and

  A_ν[k, i + m_0(l − 1)] = Σ_{j=1}^{n_T} X_k[j, i] H_ν[l, j],

where l ∈ {1, ..., p_0}.
Applying Lemma 1, we can rewrite each iteration of the
alternating projection algorithm in Equations (20) and (21) as

  H_ν^{k+1} = argmin_{H_ν(ν̄)} ‖y − A_t^{k+1} vec(H_ν(ν̄))‖_2^2, (23)

  H_t^{k+1} = argmin_{H_t} ‖y − A_ν^k vec(H_t)‖_2^2 + (λ/2)‖vec(H_t)‖_2^2. (24)
In the rest of this work, we refer to this method of channel
estimation as the non-convex alternating direction minimization
(NADM) approach. From Equation (24), we see that the update
in the delay direction is simply a ridge estimator, or Tikhonov
regularization [31], while in the Doppler direction we need to find
the roots of an (exponential) polynomial using a root-finding
algorithm [32]. One of the challenges with this approach is
its sensitivity to the initialization.
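A stripped-down sketch of the alternating iteration is given below. It is our own simplification of NADM for illustration only: the data are real-valued and noiseless, and H_ν is updated as an unstructured matrix, column by column, rather than through its Vandermonde parametrization and root finding.

```python
import numpy as np

rng = np.random.default_rng(3)
m0, nT, p0, lam = 4, 40, 2, 1e-3

X = rng.choice([-1.0, 1.0], size=(nT, m0))     # BPSK pilot rows x_n^T
Ht_true = rng.standard_normal((m0, p0))
Hv_true = rng.standard_normal((p0, nT))
y = np.einsum('ni,in->n', X, Ht_true @ Hv_true)  # y[n] = x_n^T H e_n

Ht = rng.standard_normal((m0, p0))             # random initialization
Hv = rng.standard_normal((p0, nT))
res = []
for _ in range(5):
    # Delay-direction step (ridge / Tikhonov, cf. Equation (24)):
    # row n of the design matrix is kron(Hv[:, n], x_n).
    A = np.stack([np.kron(Hv[:, n], X[n]) for n in range(nT)])
    vec_Ht = np.linalg.solve(A.T @ A + lam * np.eye(m0 * p0), A.T @ y)
    Ht = vec_Ht.reshape(p0, m0).T              # undo column-major vec
    # Doppler-direction step: per-column minimum-norm least squares.
    a = Ht.T @ X.T                             # a[:, n] = Ht^T x_n
    Hv = a * (y / np.maximum(np.sum(a * a, axis=0), 1e-12))
    res.append(np.linalg.norm(y - np.einsum('ni,in->n', X, Ht @ Hv)))
```

In this unconstrained toy version, the Doppler-direction step interpolates the measurements exactly, so the residual collapses immediately; the hard part of the actual algorithm is precisely the Vandermonde constraint on H_ν (and the resulting root finding), which this sketch omits.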
Remark 1. If we know the exact number of dominant
paths in the channel, i.e., rank(H) = p_0, then enforcing the
separability of the channel into the leakage and Doppler matrices
in Equation (17) implicitly enforces the low-rank
structure of the channel matrix. Therefore, when the number
of dominant paths in the channel is known exactly, we can set
λ = 0 in Equation (17) for the channel estimation.
Considering the case in Remark 1, we state a theorem
(Theorem 1): under no noise and a sufficiently small
stopping resolution, the output (H_t^opt, H_ν^opt) of the NADM
algorithm is the global optimum (H_t^*, H_ν^*) of the channel
estimation problem whenever the global optimum is uniquely
identifiable¹.

Theorem 1. Define Δ ≜ H_t H_ν^T − H_t^* H_ν^{*T} and
J(H_t, H_ν) = ‖y − Π(H_t H_ν)‖_2^2. In the absence of noise, Δ ≠ 0 implies
that ∂J(H_t, H_ν)/∂H_t ≠ 0 and ∂J(H_t, H_ν)/∂H_ν ≠ 0, if the global optimum
(H_t^opt, H_ν^opt) is uniquely identifiable.

The proof is given in Appendix A.
Remark 2. The unique identifiability of (H_t^*, H_ν^*) as the
solution of

  (H_t^*, H_ν^*) = argmin_{H_t, H_ν} J(H_t, H_ν) (25)

is necessary for any recovery guarantee. To see this, note that if
there is a second global optimum (H_t^{**}, H_ν^{**}) to (25), then there
is no way to determine which of the two solutions is the correct
one. In this case, we have Δ^{**} = H_t^{**} H_ν^{**T} − H_t^* H_ν^{*T} ≠ 0,
but global optimality of (H_t^{**}, H_ν^{**}) implies ∂J(H_t^{**}, H_ν^{**})/∂H_t = 0,
thus contradicting the conclusion of Theorem 1 if the unique
identifiability clause were removed from the statement of the
theorem.
While Theorem 1 excludes the existence of local optima,
it does not quantify the convergence rate of the algorithm.
Doing so involves knowing the mean and the variance of the
partial derivative ∂J(H_t, H_ν)/∂H_t over the random training sequence
(Theorem 2). We analyze the output of the proposed algorithm
for the channel estimation problem when
the elements of the pilot sequence x are drawn i.i.d. from a
random BPSK constellation, i.e., {−1, +1} with
equal probability.
Theorem 2. Let the elements of x ∈ ℝ^{n_T+m_0−1} be generated
by a sequence of random BPSK symbols, i.e., {−1, +1} with
equal probability, and define Δ ≜ H_t H_ν^T − H_t^* H_ν^{*T}
= [δ_0, ..., δ_{n_T−1}] ∈ ℂ^{m_0×n_T}. In the absence of noise, we have

  E{ ‖∂J(H_t, H_ν)/∂H_t‖_F^2 } = ‖E{ ∂J(H_t, H_ν)/∂H_t }‖_F^2
    + (m_0 − 1) p_0 Σ_{n=0}^{n_T−1} ‖δ_n‖_2^2, (26)
¹Identifiability here means that, given sufficient information, we can unambiguously determine the pair of inputs (H_t, H_ν) that generated the observation.
and

  ‖E{ ∂J(H_t, H_ν)/∂H_t }‖_F^2 = vec(Δ)^H B vec(Δ), (27)

where (using ⊗ to denote the Kronecker product of matrices)
B = H_ν H_ν^H ⊗ I ∈ ℂ^{m_0 n_T × m_0 n_T} is a Hermitian positive
semidefinite matrix satisfying the eigenvalue bounds

  0 ⪯ B ⪯ ‖H_ν‖_2^2 I. (28)
The proof is given in Appendix B. In our simulations (see
Section V), we did observe the impact of bad initializations;
i.e., local optima were found, which resulted in poor performance
[23]. This challenge motivates us to investigate whether a
convex programming approach can enforce the parametric
bilinear low-rank structure. In the next section, we employ the
atomic norm heuristic, introduced in [33], to promote the
sparsity of the number of dominant paths in the channel as well as
the parametric low-rank structure in the channel model.
B. Parametric Low-rank Atomic Norm
We know that the number of dominant paths in a narrow-
band time-varying wireless communication system is small.
This means that the number of terms in Equation (8), p0,
is quite small compared to the number of training signal
measurements, i.e., p0 � nT . In other words, the channel can
be described as a summation of rank-one matrices of the form
g d(ν̄)^H in Equation (10). This representation motivates the
use of the atomic norm heuristic to promote this structure. For
this purpose, we define an atom as A(g, ν̄) = g d(ν̄)^H, where
ν̄ ∈ [−1/2, 1/2] and g ∈ ℂ^{m_0×1}. Without loss of generality, we
consider ‖g‖_2 = 1. Then, we define the set of all atoms as

  𝒜 = { A(g, ν̄) | ν̄ ∈ [−1/2, 1/2], ‖g‖_2 = 1, g ∈ ℂ^{m_0×1} }. (29)
Our goal here is to find a representation of the channel with
a small number of dominant paths, i.e.,

  ‖H‖_{𝒜,0} = inf_p { p : H = Σ_{k=1}^{p} η̄_k g_k d(ν̄_k)^H }. (30)
Due to the combinatorial nature of the norm defined in Equation
(30), the above optimization problem is NP-hard. Thus,
we instead consider the convex relaxation of the above norm,
namely the atomic norm associated with the set of atoms defined
in Equation (29):

  ‖H‖_𝒜 = inf { t > 0 : H ∈ t·conv(𝒜) } (31)
        = inf_{η̄_k, ν̄_k, ‖g_k‖_2=1} { Σ_k |η̄_k| : H = Σ_k η̄_k g_k d(ν̄_k)^H },
where conv denotes the convex hull. This relaxation is similar
to the relaxation of l0-norm of a vector by its l1-norm, which
is the prevalent relaxation in the compressed sensing literature
to avoid the combinatorial nature of l0-norm in the recovery
algorithm.
Remark 3. The atomic representation in Equation (31) for the
matrix H, i.e., H = Σ_k η̄_k A(g_k, ν̄_k) = Σ_k η̄_k g_k d(ν̄_k)^H,
not only captures the functional forms of its elements, but
also enforces the rank-one constraint on each term in the
summation, i.e., rank(A(g_k, ν̄_k)) = 1.
To enforce the sparsity of the atomic representation (low
rank) of the received signal, we solve

  minimize_H ‖H‖_𝒜  s.t.  y = Π(H). (32)
From [25], we know that due to the Vandermonde decompo-
sition, the convex hull of the set of atoms A can be charac-
terized by a semidefinite program (SDP). Therefore, ‖H‖_𝒜 in
Equation (32) admits an equivalent SDP representation.
Proposition 1 (see, e.g., [25], [34]). For any H ∈ ℂ^{m_0×n_T},

  ‖H‖_𝒜 = inf_{z, W} (1/(2n_T)) trace( Toep(z) + n_T W )
          s.t.  [ Toep(z)  H^H ; H  W ] ⪰ 0, (33)

where z is a complex vector whose first element is real,
Toep(z) denotes the n_T × n_T Hermitian Toeplitz matrix whose
first column is z, and W is a Hermitian m_0 × m_0 matrix.
Therefore, we can use any efficient SDP solver such as CVX
[35], to solve the optimization problem in Equation (32). For
noisy measurements, we consider
  minimize_H ‖H‖_𝒜  s.t.  ‖y − Π(H)‖_2 ≤ σ_z^2. (34)
The above problem can similarly be cast as an SDP by adding
the noisy convex constraint. As in the noiseless version,
the noisy version can be solved using CVX.
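The SDP characterization in Equation (33) can be spot-checked without a solver. For a single atom H = η g d(ν̄)^H, the choice Toep(z) = |η| d(ν̄)d(ν̄)^H and W = |η| g g^H is feasible (the block matrix is |η| times a rank-one outer product, hence PSD) and attains the objective value |η| = ‖H‖_𝒜. A numpy sketch of this feasibility check (our own construction; a full solve would use CVX as noted above):

```python
import numpy as np

rng = np.random.default_rng(4)
m0, nT = 5, 16
nu  = 0.17
eta = 0.8 * np.exp(1j * 0.6)                  # complex path gain, |eta| = 0.8
g = rng.standard_normal(m0) + 1j * rng.standard_normal(m0)
g /= np.linalg.norm(g)                        # unit-norm leakage vector

d = np.exp(-2j * np.pi * nu * np.arange(m0, m0 + nT))
H = eta * np.outer(g, d.conj())               # single atom, ||H||_A = |eta|

T = abs(eta) * np.outer(d, d.conj())          # Toep(z): Toeplitz since d is exponential
W = abs(eta) * np.outer(g, g.conj())

# Block matrix of Equation (33): [[Toep(z), H^H], [H, W]]
M = np.block([[T, H.conj().T], [H, W]])
min_eig = np.linalg.eigvalsh(M).min()         # feasibility: M is PSD
obj = (np.trace(T).real + nT * np.trace(W).real) / (2 * nT)
```

The minimum eigenvalue is zero up to round-off and the objective equals |η|, consistent with the SDP value for a single-atom channel.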
Next, we show that the solution of the optimization problem
in Equation (32) is the optimal solution of our channel
estimation problem in the noiseless scenario. Furthermore, we
discuss the conditions under which the solution of this convex
program is unique. Finally, we develop a scaling law that
relates the number of measurements needed to the probability
of correct estimation of leaked channel parameters.
1) Optimality and Uniqueness: The dual of the optimization
problem in Equation (32), obtained via standard Lagrangian
analysis, can be written as

  maximize_λ Re{⟨λ, y⟩}  s.t.  ‖Π^*(λ)‖_𝒜^* ≤ 1, (35)

where Π^*(λ) = Σ_k λ(k) x_k e_{k−m_0+1}^T is the adjoint operator
of Π and ‖·‖_𝒜^* denotes the dual norm of the atomic norm,
which is defined through an inner product. Therefore, we have

  ‖Π^*(λ)‖_𝒜^* = sup_{‖Θ‖_𝒜 ≤ 1} Re{⟨Π^*(λ), Θ⟩} (36)
              = sup_{ν̄ ∈ [−1/2, 1/2], ‖g‖_2 = 1} Re{⟨Π^*(λ), g d(ν̄)^H⟩}.
The second equality in Equation (36) holds since the set {g d(ν̄)^H}_{ν̄, g}
covers all the extremal points of the atomic-norm unit ball,
i.e., {Θ : ‖Θ‖_𝒜 ≤ 1}. If we define the vector-valued function
μ(ν̄) = Π^*(λ) d(ν̄), then by the Cauchy-Schwarz inequality, we
have

  ‖Π^*(λ)‖_𝒜^* = sup_{ν̄ ∈ [−1/2, 1/2], ‖g‖_2 = 1} Re{ g^H μ(ν̄) }
              ≤ sup_{ν̄ ∈ [−1/2, 1/2]} ‖μ(ν̄)‖_2. (37)
Now, if we impose the condition

‖μ(ν)‖_2 ≤ 1  for all ν ∈ [−1/2, 1/2],   (C-1)

then we can rewrite the optimization problem in Equation (35) as

maximize_λ Re{⟨λ, y⟩}  subject to  ‖μ(ν)‖_2 ≤ 1,   (38)

where μ(ν) = Π*(λ) d(ν), or, explicitly,

μ(ν) = Σ_{n=m0}^{nT+m0−1} λ(n−m0+1) e^{j2πnν} x_n^*.   (39)
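The dual-norm machinery in (36)–(39) can be illustrated numerically. The sketch below (illustrative sizes; real Rademacher training vectors x_n, so the conjugation in (39) is immaterial) evaluates μ(ν) for a random dual vector λ and checks that the Cauchy–Schwarz bound in (37) is attained at g = μ(ν)/‖μ(ν)‖_2:

```python
import numpy as np

# Evaluate mu(nu) = sum_n lambda(n-m0+1) e^{j 2 pi n nu} x_n for a random dual
# vector lambda and Rademacher training x_n, and verify that the supremum of
# Re{g^H mu(nu)} over unit-norm g equals ||mu(nu)||_2 (tight Cauchy-Schwarz).
rng = np.random.default_rng(1)
m0, nT = 4, 32
n = np.arange(m0, nT + m0)                     # time indices m0 .. nT+m0-1
x = rng.choice([-1.0, 1.0], size=(m0, nT))     # Rademacher training vectors x_n
lam = rng.standard_normal(nT) + 1j * rng.standard_normal(nT)

def mu(nu):
    # mu(nu): m0-dimensional dual polynomial evaluated at Doppler nu
    return (x * (lam * np.exp(2j * np.pi * n * nu))).sum(axis=1)

nu0 = 0.13
m = mu(nu0)
g_opt = m / np.linalg.norm(m)                  # maximizing unit-norm g
val = np.real(g_opt.conj() @ m)                # equals ||mu(nu0)||_2
```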
Similarly, we have Re{⟨λ, y⟩} = Re{⟨Π*(λ), H⟩} = Σ_k Re{η_k^* g_k^H μ(ν_k)}. If we further assume

μ(ν_k) = sign(η_k) g_k,   (C-2)

for k ∈ {1, ..., p0}, then we have Re{⟨λ, y⟩} = Σ_k |η_k| ≥ ‖H‖_A. Moreover, using the Hölder inequality, we know that Re{⟨λ, y⟩} ≤ ‖Π*(λ)‖_A^* ‖H‖_A ≤ ‖H‖_A. Therefore, if conditions (C-1) and (C-2) hold, then Re{⟨λ, y⟩} = ‖H‖_A. In other words, under conditions (C-1) and (C-2), the solutions of the primal (Equation (32)) and dual (Equation (35)) optimization problems exhibit a zero duality gap. Thus, H and λ are optimal solutions of the primal and dual optimization problems, respectively. Furthermore, using a proof by contradiction, we can see that condition (C-2) ensures the uniqueness of the optimal
solution. Suppose Ĥ = Σ_k η̂_k ĝ_k d(ν̂_k)^H is another optimal solution. Since H and Ĥ are different, there are some ν̂_k that are not in the support of H. Define T_ν = {ν_1, ν_2, ..., ν_{p0}} as the Doppler-shift support of H. Then, we have

Re{⟨λ, y⟩} = Re{⟨Π*(λ), Ĥ⟩}
  = Σ_{k ∈ T_ν} Re{η̂_k^* ĝ_k^H μ(ν̂_k)} + Σ_{k ∉ T_ν} Re{η̂_k^* ĝ_k^H μ(ν̂_k)}
  < Σ_{k ∈ T_ν} |η̂_k| + Σ_{k ∉ T_ν} |η̂_k|,
which contradicts the optimality of Ĥ. Thus, we have shown that if we can guarantee the existence of a dual polynomial μ(ν) = Π*(λ) d(ν) meeting the two key conditions (C-1) and (C-2), then the optimization problem in Equation (32) will find the optimal solution of the channel estimation problem. In Theorem 3, we show that a proper dual polynomial μ(ν) exists under two main conditions: 1) a minimum Doppler separation and 2) a sufficient number of measurements.
Theorem 3. Suppose nT ≥ 64 (see [34]) and the training sequence {x[n] | m0 ≤ n ≤ nT + m0 − 1} is generated using an i.i.d. random source with Rademacher distribution.² Assume that min_{1≤i<j≤p0} |ν_i − ν_j| ≥ 4/nT. Then there exists a constant c such that, for

nT ≥ c p0 m0 log³(nT p0 m0 / δ),   (40)

the proposed optimization problem in Equation (32) can recover H with probability at least 1 − δ.
The proof is given in Appendix C.
2) Channel Estimation Algorithm: After we evaluate the dual parameters λ by solving the optimization problem in Equation (32), we can construct the function μ(ν) in Equation (39). Then, we can employ μ(ν) to estimate the Doppler parameters by enforcing condition (C-2), since we know that ‖μ(ν_k)‖_2 = 1 for k ∈ {1, ..., p0}. Towards this goal, we need to find the roots of the following polynomial:

Q(ν) = 1 − ‖μ(ν)‖_2² = 1 − μ(ν)^H μ(ν),   (41)
which are equal to {ν_k}_{k=1}^{p0}. After the estimation of {ν_k}_{k=1}^{p0}, we can substitute them into Equation (8) to obtain a linear system of equations from which {η_k g_k}_{k=1}^{p0} is evaluated. Note that we do not need to compute the values of η_k and g_k separately in order to equalize the channel distortion. As seen in Equation (8), to construct an equalizer, we just require η_k g_k for 1 ≤ k ≤ p0. Using Equation (4), we can rewrite the channel coefficients as

h^(l)[n, m] = Σ_{k=1}^{p0} η_k g_k[m] e^{j2πν_k n};   (42)

thus, to design a channel equalizer, one need only evaluate the coefficients in Equation (42). We denote our convex channel estimation approach as the parametric low-rank atomic norm (PLAN) method.
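The final reconstruction step of PLAN can be sketched as follows: given estimated Doppler shifts ν_k and the products η_k g_k (which in the paper come from the linear system in Equation (8)), Equation (42) rebuilds the time-varying channel coefficients. Sizes and values below are illustrative:

```python
import numpy as np

# Rebuild channel coefficients h[n, m] = sum_k (eta_k g_k)[m] e^{j 2 pi nu_k n},
# as in (42), from hypothetical estimates of nu_k and eta_k * g_k.
rng = np.random.default_rng(2)
p0, m0, nT = 3, 5, 20
nu = np.array([-0.3, 0.05, 0.4])                       # estimated Doppler shifts
eta_g = rng.standard_normal((p0, m0)) + 1j * rng.standard_normal((p0, m0))

n = np.arange(nT)[:, None]                             # time index n (column)
phases = np.exp(2j * np.pi * n * nu[None, :])          # nT x p0 Doppler phases
h = phases @ eta_g                                     # nT x m0 channel matrix
```

At n = 0 all phases equal one, so h[0] is simply the sum of the η_k g_k vectors, which gives an easy consistency check.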
V. NUMERICAL SIMULATIONS
In this section, we perform several numerical experiments
to validate the performance of the proposed channel estimation
algorithms. Furthermore, we compare the performance of
our proposed algorithms, namely NADM and PLAN, with
sparsity-based method, see e.g., [2], [13] and the references
therein. More details about the model of the channel coeffi-
cients and measurements for the element-wise sparsity-based
method, i.e., LASSO method, can be found in Section V in [2].
Additionally, we compare the performance of our proposed
algorithms with the work [12] that considers an iterative basis-
optimization approach which exploits prior statistics and spar-
sity to mitigate the leakage effect. From [12], basis expansion
optimization is coupled with OMP, namely OMP (optimized).
A. Signal Parameters
We construct a narrowband time-varying channel based on
the model given in Equation (2). We first generate the channel
delay, Doppler, and attenuation parameters randomly. In our
experiments, the (normalized) delay, in (0, 1], and Doppler
²The Rademacher distribution is a discrete probability distribution where x_i[j] = ±1 with probability 1/2.
Fig. 1: ±2σ normalized mean-squared-error (NMSE) curves of the estimation strategies vs. signal-to-noise ratio (SNR) at the receiver side (leaked channel estimation). Curves shown: NADM, PLAN, LASSO, OMP (optimized), and the NADM worst and best cases.
parameters, in [−1/2, 1/2], are generated via uniform random variables, and the channel attenuation parameters are generated using a Rayleigh random variable, unless otherwise stated. The transmit training signal x = [x[1], x[2], ..., x[nT + n0 − 1]]^T is generated by random BPSK modulation, i.e., {−1, +1} with equal probability. Moreover, the transmit and receive pulse shapes are Gaussian pulses whose window support is 50% of a single symbol interval.
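The parameter generation just described can be sketched as follows (sizes follow the first experiment of Section V-B; the pulse-shaping step is omitted):

```python
import numpy as np

# Generate the random simulation parameters of Section V-A: uniform normalized
# delays and Dopplers, Rayleigh attenuations, and a BPSK (Rademacher) training
# sequence with entries +/-1 with equal probability.
rng = np.random.default_rng(3)
p0, m0, nT = 5, 10, 64

delays = rng.uniform(0.0, 1.0, size=p0)          # normalized delays in (0, 1]
dopplers = rng.uniform(-0.5, 0.5, size=p0)       # normalized Doppler shifts
attens = rng.rayleigh(scale=1.0, size=p0)        # Rayleigh channel attenuations
x = rng.choice([-1.0, 1.0], size=nT + m0 - 1)    # BPSK training sequence
```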
B. Performance Comparisons
In the first numerical simulation, we compare the normalized mean-squared-error (NMSE) estimation performance of the different algorithms. We define the NMSE as

NMSE = ‖θ − θ̂‖ / ‖θ‖,   (43)

where θ is the true value of the target parameter and θ̂ denotes its estimated value. Results in Fig. 1 depict the NMSE, averaged over the total number of runs (10000 runs), of the estimation of the leaked channel matrix H using the proposed NADM and PLAN approaches, the sparsity-based method, and the OMP (optimized) method, with nT = 64 measurements of the training signal, m0 = 10, and p0 = 5. From Fig. 1,
we observe that both the convex and non-convex approaches
perform better than the method using the l1-norm alone to
promote the sparsity of the channel coefficients. This is due
to the fact that the new methods directly exploit the leakage
effects presented in Sections III and IV, while the sparsity-
based method ignores the leakage effect. Furthermore, our methods offer a 2 dB improvement, on average, over the basis expansion method proposed in [12], which does explicitly
consider leakage. To explore the sensitivity of the non-convex
NADM, we include ±2σ bars on each plot. The error bars
underscore that the NADM method has much higher variance
Fig. 2: Normalized mean-squared-error (NMSE) vs. signal-to-noise ratio (SNR) at the receiver side: channel Doppler shift estimation using the PLAN approach (effect of nT).
Fig. 3: Normalized mean-squared-error (NMSE) vs. signal-to-noise ratio (SNR) at the receiver side: channel Doppler shift estimation using the PLAN approach (effect of m0 and p0). Curves shown: (p0, m0) = (10, 4), (10, 6), (10, 8), (15, 4), and (20, 4).
than the other considered methods. Specifically, NADM has approximately twice the variance of PLAN. The best and the worst curves are determined from the best- and worst-performance cases over all the runs. From Fig. 1, we conclude that the NADM approach is very sensitive to initialization. Finding a good strategy for proper initialization of the NADM approach is an interesting research problem and is left for future research.
C. Doppler Estimation and Resolution Constraint
As discussed in Section IV-B2, in our proposed PLAN approach, we compute the Doppler shift parameters ν_k, 1 ≤ k ≤ p0, directly by finding the roots of the polynomial Q(ν) defined in Equation (41); we then use the evaluated Doppler shifts to estimate the leaked channel gains using Equation (8).
Fig. 4: Doppler shifts are the roots of Q(ν). This figure depicts 1 − Q(ν) vs. ν, for nT = 100, p0 = 5, and m0 = 10.
In Fig. 2, with m0 = 10 and p0 = 5, the NMSE of our proposed convex algorithm, PLAN, for the estimation of the Doppler shift parameters is shown as a function of SNR. In this figure, for SNR ≥ 5 dB, we see that, for all values of nT, the proposed algorithm can estimate the Doppler parameters with at most 0.01 (normalized) error. Furthermore, the accuracy of the Doppler shift estimation improves as the number of measurements increases. Fig. 3 shows the effect of changing m0 and p0 for nT = 64. As can clearly be seen from Fig. 3, as m0 or p0 increases, the accuracy of the Doppler shift estimation does not degrade much as long as the number of measurements is sufficient, i.e., on the order of m0 p0; however, the performance degrades much more if the quantity m0 p0 exceeds nT by a large factor. From Fig. 2 and Fig. 3, we conclude that our PLAN approach can perform quite well with a number of measurements only on the order of a constant factor of m0 p0.
Fig. 4 and Fig. 5 illustrate the Doppler shift recovery using the (dual) function μ(ν). The channel Doppler shift parameters used to generate these figures were ν_k ∈ {−0.4, −0.2, 0, 0.2, 0.3}. It is clear that, by increasing nT from 100 to 200 measurements, the spurious peaks in the curve in Fig. 5 become much smaller than the peaks at the locations of the Doppler shifts of the channel. In other words, by increasing the number of measurements nT, the resolution constraint in Theorem 3 becomes smaller, and we are able to estimate the Doppler shifts with higher accuracy.
D. Bit Error Performance Comparison
In this section, we compare the performance of our PLAN method and the element-sparsity-only (sparse approximation) approach based on the l1-norm for data recovery during the data transmission phase. For this comparison, we build a minimum mean-square error (MMSE) equalizer from the channel matrix estimated by each of the two algorithms to equalize the channel distortion (see, for example, Chapter 16.2 in [21]). We
Fig. 5: Doppler shifts are the roots of Q(ν). This figure depicts 1 − Q(ν) vs. ν, for nT = 200, p0 = 5, and m0 = 10.
consider m0 = 10, p0 = 5, and nT = 150 during the channel
estimation phase. Then, we generate n = 1000 random bits and modulate them by BPSK signaling to compute the bit-error-rate (BER) during data transmission. The results in Fig. 6 show the BER vs. SNR performance of these algorithms. The label "PLAN-not satisfied" indicates that the resolution constraint is not met in the channel for the Doppler shift realizations, while "PLAN-satisfied" indicates that the Doppler shifts are well-separated, as suggested in Theorem 3. For both cases (well-separated and not), we generate random values of the Doppler shifts: we first select ν uniformly from [−0.5, 0.5] and, for each realization, check whether the resolution constraint is met. If it is, the set of values is labeled "well-separated"; if it is not, it is labeled "not well-separated." Simulations/computations are then done for these two sets of values and the performance is averaged accordingly. The performance of the
sparse approximation method does not depend on the Doppler shift separation constraint; thus, in Fig. 6, only one curve for this method is provided. From Fig. 6, we see that PLAN offers significantly superior performance over the sparsity-only method even if the resolution constraint is not met. Furthermore, additional improvement in performance is achieved if the constraint is met, i.e., if the Doppler shifts are sufficiently separated. Additionally, a slight performance improvement over [12] is observed if the resolution constraint is not met, especially at low SNR values, whereas a much stronger improvement is observed if the resolution constraint is met.
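The MMSE equalization step used above can be sketched for a generic linear observation model y = H s + z. The matrix H below is a random stand-in, not the estimated leaked channel matrix of the paper, and all sizes are illustrative:

```python
import numpy as np

# Hedged sketch of an MMSE equalizer (see, e.g., Chapter 16.2 in [21]):
# for y = H s + z with noise variance sigma2, apply
#   W = (H^H H + sigma2 I)^{-1} H^H
# and detect BPSK symbols by taking the sign of the equalized output.
rng = np.random.default_rng(4)
n_sym, n_obs, sigma2 = 50, 60, 1e-4

H = rng.standard_normal((n_obs, n_sym)) / np.sqrt(n_obs)   # stand-in channel matrix
s = rng.choice([-1.0, 1.0], size=n_sym)                    # BPSK data symbols
z = np.sqrt(sigma2) * rng.standard_normal(n_obs)           # additive noise
y = H @ s + z                                              # observations

W = np.linalg.inv(H.T @ H + sigma2 * np.eye(n_sym)) @ H.T  # MMSE equalizer
s_hat = np.sign(W @ y)                                     # symbol decisions
```

At this SNR the regularized least-squares inverse recovers the transmitted BPSK symbols essentially error-free, which is the regime the BER curves approach at high SNR.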
VI. CONCLUSIONS
In this work, we develop new parametric descriptions
of narrow-band time-varying channels which lead to novel
channel estimation strategies. In particular, the representations
admit a low-rank, bilinear form as functions of the Doppler,
delay, and pulse shape functions. These forms suggest matrix
factorization methods for channel estimation. Despite the
popularity of Wirtinger flow methods for solving non-convex
optimizations with good initializations, we show that such
Fig. 6: BER vs. SNR performance comparison. The label "PLAN-not satisfied" indicates that the resolution constraint is not enforced in the channel realization, while "PLAN-satisfied" indicates that the Doppler shifts are well-separated, as given in Theorem 3.
a strategy does not work well for our channel estimation
application. Thus, the alternating/gradient descent approach
(NADM) for the non-convex objective function suffers from
local minima. If the channel is identifiable, it can be shown that, in the noiseless case, the gradient is zero only at the true channel; however, it is also proven that the variance of the gradient is high, implying slow convergence. In contrast, a
different parametric representation leads to a convex optimiza-
tion wherein the atomic norm heuristic can be employed to
promote sparsity/low-rank. The parametric, low rank, atomic
norm (PLAN) approach can be implemented via a semi-
definite program. Uniqueness conditions are established for
PLAN as well as optimality conditions. Finally, a scaling
law is developed that is dependent on a practically achiev-
able resolution constraint (how close two paths can be in
Doppler and delay). A challenge for the future is to fully
incorporate the functional form of both Doppler and delay.
As true channel sparsity is lost when practical pulse shapes
are employed, the proposed methods offer strong improvement
over a classical estimation strategy which assumes a truly
sparse observation, i.e., LASSO. Simulation results show that the proposed methods can offer anywhere from 5 to 12 dB improvement over LASSO. Furthermore, a 2 dB improvement
is observed, on average, over the basis expansion method that
considers leakage effects.
APPENDIX A: PROOF OF THEOREM 1

We will show the contrapositive, i.e., that ∂J(H_t, H_ν)/∂H_t = 0 implies Δ = 0. Standard linear algebra, Wirtinger calculus [36], and the definition of the matrices X_n imply

∂J(H_t, H_ν)/∂H_t = ∂( Σ_{n=0}^{nT−1} | y[n] − Tr(X_n H_t H_ν^T) |² ) / ∂H_t
  = − Σ_{n=0}^{nT−1} ( Tr(X_n H_t^* H_ν^{*T}) − Tr(X_n H_t H_ν^T) )^* X_n H_ν
  = Σ_{n=0}^{nT−1} Tr(X_n^* Δ^H) x_n H_ν[n, :].   (44)

Clearly, if ∂J(H_t, H_ν)/∂H_t = 0, then we must have Tr(X_n^* Δ^H) x_n H_ν[n, :] = 0 for every 0 ≤ n ≤ nT − 1. Since x_n ≠ 0 (by domain restriction) and H_ν[n, :] ≠ 0^T (by definition), we must have Tr(X_n^* Δ^H) = 0 for every 0 ≤ n ≤ nT − 1, or equivalently Π(Δ) = 0 ⟺ Π(H_t^* H_ν^{*T}) = Π(H_t H_ν^T). The unique identifiability of the global optimum then implies H_t^* H_ν^{*T} = H_t H_ν^T, or equivalently Δ = 0. □
APPENDIX B: PROOF OF THEOREM 2

Before starting the proof of this theorem, we state Lemma 2, which is useful in the rest of the proof.

Lemma 2 (e.g., see [37]). Let X ∈ C^{m×n} and C ∈ C^{n×n} be arbitrary matrices, and let ⊗ denote the matrix Kronecker product operation. We have

Tr{X C X^H} = vec(X)^H (C ⊗ I) vec(X).   (45) □
Proof of Theorem 2: Let us define the shorthand D = H_ν H_ν^H, so that A = D ⊗ I. From (44) and the definition of the matrices X_n, we get

∂J(H_t, H_ν)/∂H_t = Σ_{n=0}^{nT−1} Tr(X_n^* Δ^H) x_n H_ν[n, :]
  = Σ_{n=0}^{nT−1} x_n x_n^H Δ^*[:, n] H_ν[n, :].
Since the training pilots are generated using a random BPSK sequence, we have E{x_n x_n^H} = I for every 0 ≤ n ≤ nT − 1 and, therefore,

E{∂J(H_t, H_ν)/∂H_t} = Σ_{n=0}^{nT−1} Δ^*[:, n] H_ν[n, :] = Δ^* H_ν.
Using Lemma 2, we have

‖E{∂J(H_t, H_ν)/∂H_t}‖_F² = Tr(Δ^* D Δ^T) = vec(Δ^*)^H (D ⊗ I) vec(Δ^*).   (46)
Next, we invoke some standard properties of the Kronecker matrix product [37]. We have (Y_1 ⊗ Y_2)^H = Y_1^H ⊗ Y_2^H for arbitrary matrices Y_1 and Y_2, so that B is Hermitian by the Hermitian symmetry of both D and I. Further, if Y_1 ∈ C^{p×p} and Y_2 ∈ C^{q×q} respectively have eigenvalues λ_i, 1 ≤ i ≤ p, and μ_j, 1 ≤ j ≤ q (listed with multiplicities), then Y_1 ⊗ Y_2 has the pq eigenvalues (with multiplicities) λ_i μ_j, (i, j) ∈ {1, ..., p} × {1, ..., q}. Thus, the minimum and maximum eigenvalues of B are (by definition) the same as those of D. Since

x^H D x = ‖H_ν^H x‖_2² ≤ ‖H_ν^H‖_2² ‖x‖_2²,

with equality achieved when x is the leading eigenvector of H_ν H_ν^H, we have 0 ⪯ D ⪯ ‖H_ν^H‖_2² I. Therefore, B is positive semidefinite with maximum eigenvalue ‖H_ν^H‖_2² = ‖H_ν‖_2², and (28) is proved. From (46) and standard linear algebra, we have

‖∂J(H_t, H_ν)/∂H_t‖_F²
  = Tr( [ Σ_{n=0}^{nT−1} x_n x_n^H Δ^*[:, n] H_ν[n, :] ]^H [ Σ_{n′=0}^{nT−1} x_{n′} x_{n′}^H Δ^*[:, n′] H_ν[n′, :] ] )
  = Σ_{n=0}^{nT−1} Σ_{n′=0}^{nT−1} Tr( H_ν[n, :]^H Δ^T[:, n] x_n x_n^H x_{n′} x_{n′}^H Δ^*[:, n′] H_ν[n′, :] ).
Since the training pilots are generated using a random BPSK sequence, we get

E{ ‖∂J(H_t, H_ν)/∂H_t‖_F² }
  = Σ_{n=0}^{nT−1} Σ_{n′=0}^{nT−1} Tr( H_ν[n, :]^H Δ^T[:, n] Δ^*[:, n′] H_ν[n′, :] )
    + (m0 − 1) Σ_{n=0}^{nT−1} Tr( H_ν[n, :]^H Δ^T[:, n] Δ^*[:, n] H_ν[n, :] )
  = ‖ Σ_{n=0}^{nT−1} Δ^*[:, n] H_ν[n, :] ‖_F² + (m0 − 1) Σ_{n=0}^{nT−1} ‖Δ[:, n]‖_2² ‖H_ν[n, :]‖_2²
  = ‖ E{∂J(H_t, H_ν)/∂H_t} ‖_F² + (m0 − 1) p0 Σ_{n=0}^{nT−1} ‖Δ[:, n]‖_2²,

where the expression for the expected gradient E{∂J(H_t, H_ν)/∂H_t} in the last equality follows from (46), and

‖H_ν[n, :]‖_2² = Σ_{l=1}^{p0} |H_ν[n, l]|² = Σ_{l=1}^{p0} 1 = p0.

This completes the proof of Theorem 2. □
APPENDIX C: PROOF OF THEOREM 3

According to our analysis in Section IV-B1, if we can design a vector-valued function (polynomial) μ(ν) that satisfies conditions (C-1) and (C-2), then the optimization problem in Equation (32) will recover the optimal solution of our channel estimation problem. In this proof, we use a technique called dual certifier construction [25], [26], [34]. Based on this technique, we construct a randomized dual polynomial μ(ν), i.e., the dual certifier, using the Fejér kernel [34], and show that, given a sufficient number of measurements, the constructed dual certifier satisfies both conditions (C-1) and (C-2) with high probability.

Consider the squared Fejér kernel [34]

f_r(ν) = (1/nT) Σ_{n=m0}^{nT+m0−1} f_n e^{−j2πnν},   (47)

where f_n = (1/nT) Σ_{k=n−nT}^{nT} (1 − |k|/nT)(1 − |n−k|/nT). Define the randomized matrix-valued version of the squared Fejér kernel as [25], [26], [34]

F_r(ν) = (1/nT) Σ_{n=m0}^{nT+m0−1} f_n e^{−j2πnν} x_n x_n^H.   (48)
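As a numerical illustration of the kernel's concentration (using the symmetric, unshifted index range for simplicity, i.e., ignoring the m0 modulation), the squared-kernel coefficients are the autocorrelation of the triangular Fejér coefficients, and the resulting kernel peaks at ν = 0:

```python
import numpy as np

# Squared Fejer kernel on the symmetric index range: the Fejer coefficients
# are triangular, squaring the kernel convolves them, and the resulting
# trigonometric polynomial concentrates its mass at nu = 0.
M = 16
k = np.arange(-M, M + 1)
tri = 1.0 - np.abs(k) / M                      # Fejer (triangular) coefficients
coef = np.convolve(tri, tri) / M               # squared-kernel coefficients f_n
n = np.arange(-2 * M, 2 * M + 1)               # index range after convolution

nu = np.linspace(-0.5, 0.5, 1001)
F = (np.exp(-2j * np.pi * np.outer(nu, n)) @ coef).real  # kernel values f_r(nu)
```

Since all coefficients are nonnegative, the maximum of F over the grid occurs at ν = 0, which is the localization property the dual certifier construction relies on.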
Since the training signal is generated using an i.i.d. random source with Rademacher distribution, it is clear that E{F_r(ν)} = f_r(ν) I_{m0} and, for its derivative, E{F_r′(ν)} = f_r′(ν) I_{m0}, where I_{m0} denotes the m0 × m0 identity matrix. Now, we define a candidate vector-valued dual certifier polynomial μ(ν) as

μ(ν) = Σ_{k=1}^{p0} F_r(ν − ν_k) α_k + F_r′(ν − ν_k) β_k,   (49)

where α_k = [α_{k,1}, ..., α_{k,m0}]^T and β_k = [β_{k,1}, ..., β_{k,m0}]^T are constant coefficient vectors. Clearly, the candidate μ(ν) defined in Equation (49) follows the valid form of μ(ν) given in Equation (39). The coefficients α_k and β_k are selected such that the candidate μ(ν) satisfies condition (C-2) and part of (C-1), namely

μ(ν_k) = sign(η_k) g_k,   (50)
μ′(ν_k) = 0, i.e., a maximum occurs at ν_k.   (51)

We can summarize the above equations as

Γ [α_1^T, ..., α_{p0}^T, γβ_1^T, ..., γβ_{p0}^T]^T = g,   (52)

where g = [sign(η_1) g_1^T, ..., sign(η_{p0}) g_{p0}^T, 0^T, ..., 0^T]^T, γ = √|f_r″(0)|, and the matrix Γ can be written as

Γ = (1/nT) Σ_{n=m0}^{nT+m0−1} (ν_n ν_n^H) ⊗ (x_n x_n^H) f_n,   (53)

where ⊗ is the Kronecker product and ν_n = [e^{−j2πnν_1}, ..., e^{−j2πnν_{p0}}, (j2πn/γ) e^{−j2πnν_1}, ..., (j2πn/γ) e^{−j2πnν_{p0}}]^H. Thus, if we show that Γ is invertible, then we can easily evaluate α_k and β_k from the system of equations in Equation (52), and accordingly μ(ν) will satisfy both (50) and (51). Lemma 3 shows that, for a sufficient number of measurements and well-separated Doppler shift parameters, the matrix Γ is invertible with high probability.
Lemma 3 (see Proposition 16 in [25] and Lemma 2.2 in [34]). Define the event E_ε = {‖Γ − E{Γ}‖ ≤ ε} for the generated i.i.d. random sequence x_n with Rademacher distribution. Then:
1) Let 0 < δ < 1 and |ν_i − ν_j| ≥ 1/nT for all i ≠ j. Then, for any ε ∈ (0, 0.5], as long as

nT ≥ (80 m0 p0 / ε²) log(4 m0 p0 / δ),

the event E_ε occurs with probability at least 1 − δ.
2) Define Γ̄ = E{Γ}. Let |ν_i − ν_j| ≥ 1/nT for all i ≠ j. Then Γ̄ is invertible.
3) Given that E_ε holds for an ε ∈ (0, 0.25], we have ‖Γ^{−1} − Γ̄^{−1}‖ ≤ 2ε ‖Γ̄^{−1}‖ and ‖Γ^{−1}‖ ≤ 2 ‖Γ̄^{−1}‖. □
Thus, the construction of μ(ν) in Equation (49) ensures condition (C-2) and μ′(ν_k) = 0 for all k. To complete the proof, we need to show that ‖μ(ν)‖_2 < 1 for all ν ∈ [−0.5, 0.5] \ T_ν, which guarantees condition (C-1). We show that this condition is satisfied by the proposed μ(ν) in Lemma 5 and Lemma 6. But before stating these lemmas, let us define some notation and state Lemma 4, which we use in the proofs of Lemmas 5 and 6. Let Γ^{−1} = [L R], where L ∈ C^{2m0p0×m0p0} and R ∈ C^{2m0p0×m0p0}; then, using (52), we have

[α_1^T, ..., α_{p0}^T, γβ_1^T, ..., γβ_{p0}^T]^T = L g.

If we multiply both sides of the above equation by

Ω^(m)(ν) = (1/γ^m) [F_r^(m)(ν − ν_1), ..., F_r^(m)(ν − ν_{p0}), (1/γ) F_r^(m+1)(ν − ν_1), ..., (1/γ) F_r^(m+1)(ν − ν_{p0})]^H,   (54)

where Ω^(m)(ν) denotes the mth-order derivative of the function Ω(ν) for m = 0, 1, 2, ..., then we can express the mth-order entry-wise derivative of μ(ν) as

(1/γ^m) μ^(m)(ν) = Ω^(m)(ν)^H L g.   (55)
Similarly, if we define μ̄(ν) = E{μ(ν)} and Γ̄^{−1} = [L̄ R̄], then we have

(1/γ^m) μ̄^(m)(ν) = [E{Ω^(m)(ν)}]^H (L̄ ⊗ I_{m0}) g.   (56)

Furthermore, we can write

Ω^(m)(ν) = (1/nT) Σ_{n=m0}^{nT+m0−1} (j2πn/γ)^m f_n e^{j2πnν} ν_n ⊗ x_n x_n^H,   (57)

and E{Ω^(m)(ν)} = ω^(m)(ν) ⊗ I, where

ω^(m)(ν) = (1/γ^m) [f_r^(m)(ν − ν_1), ..., f_r^(m)(ν − ν_{p0}), (1/γ) f_r^(m+1)(ν − ν_1), ..., (1/γ) f_r^(m+1)(ν − ν_{p0})]^H,

so that

E{Ω^(m)(ν)} = (1/nT) Σ_{n=m0}^{nT+m0−1} (j2πn/γ)^m f_n e^{j2πnν} (ν_n ⊗ I).   (58)

Now, using these relationships, we can use Lemma 4 to show that μ^(m)(ν) is concentrated around μ̄^(m)(ν) with high probability.
Lemma 4 (see the proof of Theorem 3 in [38]). Assume |ν_i − ν_j| ≥ 1/nT for all i ≠ j and let δ ∈ (0, 1). Then, for m = 0, 1, 2, ..., we have

(1/γ^m) ‖μ^(m)(ν) − μ̄^(m)(ν)‖_2 ≤ ε   (59)

for nT ≥ (c m0 p0 / ε²) log³(nT m0 p0 / (δε)), where c is a constant, with probability at least 1 − δ. □

Let us define T_ν^near = ∪_{k=1}^{p0} [ν_k − ν_ε, ν_k + ν_ε] and T_ν^far = [−0.5, 0.5] \ T_ν^near, where ν_ε = O(1/nT), e.g., ν_ε = 0.1/nT.
Lemma 5. Assume |ν_i − ν_j| ≥ 1/nT for all i ≠ j and let δ ∈ (0, 1). Then,

‖μ(ν)‖_2 < 1  for all ν ∈ T_ν^far,

with probability at least 1 − δ for nT ≥ c m0 p0 log³(nT m0 p0 / δ). □

Proof of Lemma 5: We start with a relationship that results from the triangle inequality,

‖μ(ν)‖_2 ≤ ‖μ(ν) − μ̄(ν)‖_2 + ‖μ̄(ν)‖_2,   (60)

where μ̄(ν) = E{μ(ν)}. Since, from Lemma 4, we know that ‖μ(ν) − μ̄(ν)‖_2 approaches zero with high probability, to complete the proof we just need to show that ‖μ̄(ν)‖_2 < 1 for ν ∈ T_ν^far. From (56), we have

‖μ̄(ν)‖_2 = sup_{x: ‖x‖_2=1} x^H μ̄(ν)   (61)
  = sup_{x: ‖x‖_2=1} x^H [E{Ω(ν)}]^H (L̄ ⊗ I) g   (62)
  = sup_{x: ‖x‖_2=1} Σ_{k=1}^{p0} [ω(ν)^H L̄]_k (x^H sign(η_k) g_k) < 0.99992.   (63)

The last inequality follows from the fact that |x^H sign(η_k) g_k| ≤ 1 and from the proof of Lemma 2.4 in [34] for ν ∈ T_ν^far (or see Lemma 10 in [26]).
Lemma 6. Assume |ν_i − ν_j| ≥ 1/nT for all i ≠ j and let δ ∈ (0, 1). Then,

‖μ(ν)‖_2 < 1  for all ν ∈ T_ν^near \ T_ν,

with probability at least 1 − δ for nT ≥ c m0 p0 log³(nT m0 p0 / δ). □

Proof of Lemma 6: Our choice of the coefficients implies that

d‖μ(ν)‖_2²/dν |_{ν=ν_k} = Re{2 μ(ν)^H dμ(ν)/dν} |_{ν=ν_k} = 0.

Thus, for ν ∈ [ν_k − ν_ε, ν_k + ν_ε], to prove the claim of the lemma, it is sufficient to show that d²‖μ(ν)‖_2²/dν² < 0.
Note that

(1/2) d²‖μ(ν)‖_2²/dν² = ‖μ′(ν)‖_2² + Re{μ″(ν)^H μ(ν)},   (64)

for ν ∈ T_ν^near. Using Lemma 4, we can write

(1/γ²) ‖μ′(ν)‖_2² = ‖(1/γ)(μ′(ν) − μ̄′(ν) + μ̄′(ν))‖_2² ≤ ε² + 2ε ‖μ̄′(ν)‖_2 / γ + ‖μ̄′(ν)‖_2² / γ².   (65)

Similar to the calculations in Lemmas 2.3 and 2.4 in [34], we have ‖μ̄′(ν)‖_2 ≤ 1.6 nT and γ > π nT / √3 for nT ≥ 2. Therefore, we have

(1/γ²) ‖μ′(ν)‖_2² ≤ ε² + 1.75ε + ‖μ̄′(ν)‖_2² / γ².   (66)
Similarly, we have ‖μ̄(ν)‖_2 ≤ 1 and ‖μ̄″(ν)‖_2 ≤ 21.15 nT² for ν ∈ T_ν^near; thus,

(1/γ²) Re{μ″(ν)^H μ(ν)} = (1/γ²) Re{(μ″(ν) − μ̄″(ν) + μ̄″(ν))^H (μ(ν) − μ̄(ν) + μ̄(ν))}
  ≤ ε² + 4.25ε + Re{μ̄″(ν)^H μ̄(ν)} / γ².   (67)

Therefore, substituting (66) and (67) into the equality (64), we have

(1/(2γ²)) d²‖μ(ν)‖_2²/dν² < 2ε² + 6ε + (1/γ²)( ‖μ̄′(ν)‖_2² + Re{μ̄″(ν)^H μ̄(ν)} ).   (68)
Similar to the argument in Lemma 2.3 in [34] (see Equation (2.19) and onward), we can conclude that

(1/γ²)( ‖μ̄′(ν)‖_2² + Re{μ̄″(ν)^H μ̄(ν)} ) ≤ −0.029.

Therefore, we have (1/(2γ²)) d²‖μ(ν)‖_2²/dν² ≤ 2ε² + 6ε − 0.029 < 0 for ε small enough, e.g., ε ≤ 10⁻⁵. This completes the proof. □

Putting the results of Lemma 5 and Lemma 6 together, Theorem 3 is proved, since μ(ν) is verified to satisfy both conditions (C-1) and (C-2) with high probability for a sufficient number of measurements. □
REFERENCES
[1] G. Matz, H. Bolcskei, and F. Hlawatsch, "Time-frequency foundations of communications: Concepts and tools," IEEE Signal Processing Magazine, vol. 30, pp. 87–96, Nov. 2013.
[2] S. Beygi, U. Mitra, and E. G. Ström, "Nested sparse approximation: Structured estimation of V2V channels using geometry-based stochastic channel model," IEEE Transactions on Signal Processing, vol. 63, pp. 4940–4955, Sept. 2015.
[3] G. A. Hollinger, S. Choudhary, P. Qarabaqi, C. Murphy, U. Mitra, G. S. Sukhatme, M. Stojanovic, H. Singh, and F. Hover, "Underwater data collection using robotic sensor networks," IEEE Journal on Selected Areas in Communications, vol. 30, pp. 899–911, June 2012.
[4] S. Beygi and U. Mitra, "Multi-scale multi-lag channel estimation using low rank approximation for OFDM," IEEE Transactions on Signal Processing, vol. 63, pp. 4744–4755, Sept. 2015.
[5] P. Bello, "Characterization of randomly time-variant linear channels," IEEE Transactions on Communications Systems, vol. 11, pp. 360–393, Dec. 1963.
[6] E. Aktas and U. Mitra, "Single-user sparse channel acquisition in multiuser DS-CDMA systems," IEEE Transactions on Communications, vol. 51, pp. 682–693, Apr. 2003.
[7] C. Carbonelli, S. Vedantam, and U. Mitra, "Sparse channel estimation with zero tap detection," IEEE Transactions on Wireless Communications, vol. 6, pp. 1743–1763, May 2007.
[8] S. F. Cotter and B. D. Rao, "Sparse channel estimation via matching pursuit with application to equalization," IEEE Transactions on Communications, vol. 50, pp. 374–377, Mar. 2002.
[9] J. Homer, I. Mareels, R. R. Bitmead, B. Wahlberg, and A. Gustafsson, "LMS estimation via structural detection," IEEE Transactions on Signal Processing, vol. 46, pp. 2651–2663, Oct. 1998.
[10] M. Kocic, D. Brady, and M. Stojanovic, "Sparse equalization for real-time digital underwater acoustic communications," in Proc. OCEANS '95 MTS/IEEE, vol. 3, pp. 1417–1422, Oct. 1995.
[11] C. R. Berger, S. Zhou, J. C. Preisig, and P. Willett, "Sparse channel estimation for multicarrier underwater acoustic communication: From subspace methods to compressed sensing," IEEE Transactions on Signal Processing, vol. 58, pp. 1708–1721, Mar. 2010.
[12] G. Tauböck, F. Hlawatsch, D. Eiwen, and H. Rauhut, "Compressive estimation of doubly selective channels in multicarrier systems: Leakage effects and sparsity-enhancing processing," IEEE Journal of Selected Topics in Signal Processing, vol. 4, pp. 255–271, Apr. 2010.
[13] W. U. Bajwa, J. Haupt, A. M. Sayeed, and R. Nowak, "Compressed channel sensing: A new approach to estimating sparse multipath channels," Proceedings of the IEEE, vol. 98, pp. 1058–1076, June 2010.
[14] N. Michelusi, U. Mitra, A. F. Molisch, and M. Zorzi, "UWB sparse/diffuse channels, Part I: Channel models and Bayesian estimators," IEEE Transactions on Signal Processing, vol. 60, pp. 5307–5319, Oct. 2012.
[15] N. Michelusi, U. Mitra, A. F. Molisch, and M. Zorzi, "UWB sparse/diffuse channels, Part II: Estimator analysis and practical channels," IEEE Transactions on Signal Processing, vol. 60, pp. 5320–5333, Oct. 2012.
[16] S. Beygi and U. Mitra, "Structured estimation of time-varying narrowband wireless communication channels," in Proc. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3529–3533, Mar. 2017.
[17] S. Beygi and U. Mitra, "Time-varying narrowband channel estimation: Exploiting low-rank and sparsity via bilinear representation," in Proc. 2016 50th Asilomar Conference on Signals, Systems and Computers, pp. 1235–1239, Nov. 2016.
[18] D. Eiwen, G. Tauböck, F. Hlawatsch, and H. G. Feichtinger, "Compressive tracking of doubly selective channels in multicarrier systems based on sequential delay-Doppler sparsity," in Proc. 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2928–2931, May 2011.
[19] D. Eiwen, G. Tauböck, F. Hlawatsch, H. Rauhut, and N. Czink, "Multichannel-compressive estimation of doubly selective channels in MIMO-OFDM systems: Exploiting and enhancing joint sparsity," in Proc. 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3082–3085, Mar. 2010.
[20] D. Eiwen, G. Tauböck, F. Hlawatsch, and H. G. Feichtinger, "Group sparsity methods for compressive channel estimation in doubly dispersive multicarrier systems," in Proc. 2010 IEEE 11th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), pp. 1–5, June 2010.
[21] A. F. Molisch, Wireless Communications, vol. 34. John Wiley & Sons, 2012.
[22] S. Oymak, B. Recht, and M. Soltanolkotabi, "Sharp time–data tradeoffs for linear inverse problems," IEEE Transactions on Information Theory, vol. 64, pp. 4129–4158, June 2018.
[23] S. Tu, R. Boczar, M. Simchowitz, M. Soltanolkotabi, and B. Recht, "Low-rank solutions of linear matrix equations via Procrustes flow," in Proc. ICML'16, pp. 964–973, JMLR.org, 2016.
[24] M. R. Hestenes, "Multiplier and gradient methods," Journal of Optimization Theory and Applications, vol. 4, pp. 303–320, Nov. 1969.
[25] G. Tang, B. N. Bhaskar, P. Shah, and B. Recht, "Compressed sensing off the grid," IEEE Transactions on Information Theory, vol. 59, pp. 7465–7490, Nov. 2013.
[26] Y. Chi, "Guaranteed blind sparse spikes deconvolution via lifting and convex optimization," IEEE Journal of Selected Topics in Signal Processing, vol. 10, pp. 782–794, June 2016.
[27] S. Choudhary, S. Beygi, and U. Mitra, "Delay-Doppler estimation via structured low-rank matrix recovery," in Proc. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3786–3790, Mar. 2016.
[28] M. Fazel, H. Hindi, and S. Boyd, "Rank minimization and applications in system theory," in Proc. 2004 American Control Conference, vol. 4, pp. 3273–3278, June 2004.
[29] B. Recht, M. Fazel, and P. A. Parrilo, "Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization," SIAM Review, vol. 52, pp. 471–501, Aug. 2010.
[30] J. P. Haldar and D. Hernando, "Rank-constrained solutions to linear matrix equations using PowerFactorization," IEEE Signal Processing Letters, vol. 16, pp. 584–587, July 2009.
[31] R. Willoughby, "Solutions of Ill-Posed Problems (A. N. Tikhonov and V. Y. Arsenin)," SIAM Review, vol. 21, no. 2, pp. 266–267, 1979.
[32] W. H. Press, B. P. Flannery, S. A. Teukolsky, W. T. Vetterling, and P. B. Kramer, Numerical Recipes: The Art of Scientific Computing. AIP, 1987.
[33] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, "The convex geometry of linear inverse problems," Foundations of Computational Mathematics, vol. 12, pp. 805–849, Dec. 2012.
[34] E. J. Candès and C. Fernandez-Granda, "Towards a mathematical theory of super-resolution," Communications on Pure and Applied Mathematics, vol. 67, pp. 906–956, June 2014.
[35] M. Grant and S. Boyd, "CVX: Matlab software for disciplined convex programming, version 2.1." http://cvxr.com/cvx, Mar. 2014.
[36] R. Remmert, Theory of Complex Functions, vol. 122. Springer Science & Business Media, 2012.
[37] K. B. Petersen, M. S. Pedersen, et al., "The Matrix Cookbook," Technical University of Denmark, vol. 7, p. 15, 2008.
[38] R. Heckel and M. Soltanolkotabi, "Generalized line spectral estimation via convex optimization," IEEE Transactions on Information Theory, vol. 64, pp. 4001–4023, June 2018.