Bilinear Matrix Factorization Methods for Time-Varying Narrowband Channel Estimation: Exploiting Sparsity and Rank
Sajjad Beygi, Amr Elnakeeb, Sunav Choudhary, and Urbashi Mitra
Ming Hsieh Department of Electrical Engineering Viterbi School of Engineering
University of Southern California, Los Angeles, CA
Abstract—In this paper, the estimation of a narrowband time-varying channel under the practical assumptions of finite block length and finite transmission bandwidth is investigated. It is shown that the signal, after passing through a time-varying narrowband channel, reveals a useful parametric low-rank structure that can be represented as a bilinear form. To estimate the channel, two strategies are developed. The first method exploits the low-rank bilinear structure of the channel via a non-convex strategy based on an alternating direction optimization between the delay and Doppler directions. While prior Wirtinger flow methods have exhibited good performance with proper initialization, this is not true in the current scenario. Due to its non-convex nature, this first approach is sensitive to local minima. Furthermore, the convergence rate of the Wirtinger flow method is shown to be provably modest. Thus, a novel convex approach, based on the minimization of the atomic norm using measurements of the signal in the time domain, is proposed based on a second bilinear parametrization of the channel. For the convex approach, optimality and uniqueness conditions, as well as a theoretical guarantee for noiseless channel estimation with a small number of measurements, are characterized. Numerical results show that the performance of the proposed algorithm is independent of the leakage effect and that the new methods can achieve a 5 to 12 dB improvement, on average, compared to a classical l1-based sparse approximation method that does not consider the leakage effect, and a 2 dB improvement over a basis expansion method that considers the leakage effect.
I. INTRODUCTION
Wireless communications have enabled intelligent traffic safety [1], [2],
automated robotic networks, underwater surveillance systems [3], [4],
and many other useful technologies. In all of these systems,
establishing a reliable, high data
rate communication link between the transmitter and receiver
is essential. To achieve this goal, accurate channel state
information is needed to equalize the received signal and, thus
combat the effects of the wireless channel. One of the well-
known approaches to acquire channel state information is to
probe the channel in time/frequency with known signals, and
reconstruct the channel response from the output signals (see
[5] and references therein). Least-squares (LS) and Wiener
filters are classical examples of this approach. However, these
methods do not take advantage of the rich, intrinsic structure of
wireless communication channels in their estimation process.
In particular, many time-varying channels have sparse representations
in the Doppler-delay domain. The main challenge with classical
approaches, i.e., LS and Wiener filtering, is that they require a
large number of measurements, compared to the number of unknown
parameters in the estimation problem, to perform well.

Copyright (c) 2017 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

This research has been funded in part by the following grants: ONR N00014-15-1-2550, ONR N00014-09-1-07004, NSF CCF-1117896, NSF CNS-1213128, NSF CCF-1410009, AFOSR FA9550-12-1-0215, DOT CA-26-7084-00, NSF CPS-1446901, and the Fulbright Foundation, UK Royal Academy of Engineering, and UK Leverhulme Trust.

Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089. Emails: {beygihar, elnakeeb, ubli}@usc.edu and [email protected].
To combat this challenge, we need to take advantage of
the side information about the structure of the unknown
parameters. By exploiting inherent structure, we can reduce
the size of the feasible set of solutions. This results in the need
for fewer observations. Methods for sparse channel estimation
have existed for some time [4], [6]–[11] and there has been
a recent resurgence using modern signal processing methods
for sparsity [11]–[13], group-sparsity or mixed/hybrid (sparse
and group sparsity) structures [2], [14], [15] and even rank
[4], [16], [17].
The use of practical pulse shapes due to finite block
length and transmission bandwidth constraints results in a
loss of sparsity and degrades performance of the modern
sparse methods [2], [12]. This effect is defined as channel
leakage in [2], [12]. It has been shown that the performance of
compressed sensing (CS) methods is significantly degraded
by the leakage effect in practice [2], [12], [18]–[20]. The
mitigation of leakage effects via sparse basis expansion has
been previously considered in [12], [18]–[20]. In [18], a compressive
method for tracking doubly selective channels within
multicarrier systems, including OFDM systems is proposed.
Using a recently introduced concept of modified compressed
sensing (MOD-CS), the sequential delay-Doppler sparsity of
the channel is exploited to improve estimation performance
through a recursive estimation module. In [19], a compressive
estimator of doubly selective channels for pulse-shaping multicarrier
MIMO systems (including MIMO OFDM as a special
case) is proposed. The use of multichannel compressed sensing
exploits the joint sparsity of the MIMO channel for improved
performance. A multichannel basis optimization for enhancing
joint sparsity is also proposed. In [20], advanced compressive
estimators of doubly dispersive channels within multicarrier
communication systems (including classical OFDM systems)
are considered. The performance of compressive channel estimation
has been shown to be limited by leakage components
impairing the channel’s effective delay-Doppler sparsity.
In [12], the application of compressed sensing to the estimation
of doubly selective channels within pulse-shaping multicarrier
systems (with OFDM systems as a special case)
is considered. By exploiting sparsity in the delay-Doppler
domain, CS-based channel estimation allows for an increase
in spectral efficiency through a reduction of the number of
pilot symbols. For combating leakage effects that limit the
delay-Doppler sparsity, a sparsity-enhancing basis expansion
is considered and a method for optimizing the basis with
or without prior statistical information about the channel is
proposed.
In this work it is shown that with these practical communi-
cation system constraints, the transmitted signal after passing
through a linear, time-varying narrowband channel exhibits a
parametric, low-rank, bilinear form. It is this signal description
that enables methods distinct from the previously described work:
prior approaches are inherently one-dimensional, whereas our
methods are two-dimensional. The rank of this
representation is determined by the number of dominant paths,
which is small in both cellular and underwater environments
[21]. The bilinear form is due to the separability of the pulse
leakage effects in the delay and Doppler domains [2]. Herein,
we propose two methods which directly exploit the bilinear,
low-rank form. Our first approach is a variation of gradient-based
methods, motivated by the strong performance of this
type of approach for other applications [22], [23], due to the
ease of finding a good initialization. To solve the non-convex
optimization problem, we use the alternating direction method
[24] due to the bilinearity of the measurement model. In the
first step of this algorithm, we recover the channel in the delay
direction and in the second step we estimate the channel in
the Doppler direction. We repeat the steps iteratively until
we converge to a stationary point. If the optimum values of
the channel estimates are identifiable, we can prove that the
gradient of the non-convex objective function is zero only
at that point, for the noiseless case. Despite this positive
result, we also show that the gradients have high variance, thus
underscoring the likelihood of finding a local minimum, which
is always a challenge with non-convex objective functions.
Our second approach is based on an alternative parameterization
of the channel. We define a set of atoms to
describe the set of rank-one matrices in our channel estimation
problem. Utilizing this set of atoms, we show that the channel
estimation problem can be stated as a parametric low-rank
matrix recovery problem. Motivated by convex recovery for
inverse problems via the atomic norm heuristic [25], [26], we
develop a recovery algorithm employing the atomic norm to
enforce the channel model and leakage structures via a convex
optimization problem. We show that the solution of our convex
optimization problem is the optimal solution of our channel
estimation problem in the noiseless scenario. Furthermore,
we discuss the conditions under which the solution of this
convex program is unique. Finally, we develop a scaling
law that relates the number of measurements needed to the
probability of correct estimation of leaked channel parameters.
We analyze the algorithm to show that the global optimum can
be recovered in the absence of noise. Numerical results show
that the proposed algorithm provides a 5 dB to 12 dB improvement,
in the SNR sense, for SNRs over 5 dB, compared to an l1-based
sparse approximation method.
Furthermore, the proposed method offers 2 dB improvement,
on average, over the basis expansion method considered in
[12], which takes into account the leakage effects.
Portions of this work have appeared in three prior conference
papers [16], [17], [27]; the non-convex approach based
on Wirtinger flow was introduced in [17] and results relating
to convergence provided; the convex approach based on the
atomic norm was introduced in [16] and the scaling law result
was provided in [27]. The current manuscript contains the complete
proofs of our key technical results, as well as a generalization of
the theorem on convergence in [17]. The current work also has
further numerical results which explore the efficacy of the
root finding for Doppler values in the convex approach as well
as investigating the robustness of the convex approach to not
satisfying the constraints of the scaling law. Additionally, we
have new numerical comparisons to the method considered in
[12].
The rest of this paper is organized as follows. Section II
develops the communication system model which is used in
Section III to derive the discrete time observation model which
captures the bilinear structure of the problem. Section IV
derives the two strategies based on non-convex and convex
programming. Section V is devoted to discussion and numerical
results, and finally Section VI concludes the paper.
Appendices, A, B, and C provide the proofs of the three major
theorems of the work.
Notation: Scalar values are denoted by lower-case letters, x, and column vectors by bold lower-case letters, x. The i-th element of x is given by x[i]. Given a vector x ∈ ℝ^n,

  ‖x‖_2 = √( Σ_{i=1}^{n} x[i]^2 )

denotes the ℓ_2 norm of x. A matrix is denoted by bold capital letters, such as X, and its (i, j)-th element by X[i, j]. The transpose of X is given by X^T and its conjugate transpose by X^H. A diagonal matrix with diagonal elements x is written as diag{x}, and the identity matrix as I. The set of real numbers is denoted by ℝ, and the set of complex numbers by ℂ. The element-wise (Schur) product is denoted by ⊙.
II. SYSTEM MODEL
We assume that the transmitted signal x(t) is generated by
the modulation of a pilot sequence x[n] onto the transmit pulse
pt(t) as given by
  x(t) = Σ_{n=−∞}^{+∞} x[n] p_t(t − nT_s),
where Ts is the sampling period. Note that this signal model
is quite general, and encompasses OFDM signals as well as
single-carrier signals. The signal x(t) is transmitted over a
linear, time-varying channel. The received signal y(t) can be
written as,
  y(t) = ∫_{−∞}^{+∞} h(t, τ) x(t − τ) dτ + z(t). (1)
Here, h(t, τ) is the channel’s time-varying impulse response,
and z(t) is a white Gaussian noise process. A common
model for the narrowband time-varying (TV) channel impulse
response is as follows,
  h(t, τ) = Σ_{k=1}^{p_0} η_k δ(τ − t_k) e^{j2πν_k t}, (2)
where p0 denotes the number of dominant paths in the channel,
η_k, t_k, and ν_k denote the k-th channel path's attenuation gain,
delay, and Doppler shift, respectively. At the receiver, y(t)
is converted into a discrete-time signal using an anti-aliasing
filter p_r(t). That is,
  y[n] = ∫_{−∞}^{+∞} y(t) p_r(nT_s − t) dt.
We assume that pt(t) and pr(t) are causal with support
[0, Tsupp). Under the reasonable assumption νmaxTsupp � 1,
where νmax = max (ν1, . . . , νp0) denotes the Doppler spread
of the channel [1] and defining p(t) = pt(t) ∗ pr(t), we can
write the received signal after filtering and sampling as [2],
  y[n] = Σ_{m=0}^{m_0−1} h^(l)[n, m] x[n − m] + z[n], (3)

where h^(l)[n, m] = Σ_{k=1}^{p_0} h_k^(l)[n, m] and

  h_k^(l)[n, m] = η_k e^{j2πν_k((n−m)T_s − t_k)} p(mT_s − t_k), (4)
for n ∈ ℤ. The superscript l denotes leakage. Here, m_0 = ⌊τ_max/T_s⌋ + 1
denotes the maximum discrete delay spread of the channel, where
τ_max is the maximum delay spread of the channel.
The pulse shaping filter support does not impact
the value of τmax. Without loss of generality, if we assume
that pr(t) has a root-Nyquist spectrum with respect to the
sample duration T_s, then z[n] is a sequence of i.i.d. circularly
symmetric complex Gaussian random variables with a constant
variance σ2z . The pulse leakage effect is due to the non-zero
support of the pulse p(·) in Equation (4). The leakage with
respect to Doppler can be decreased by increasing the pulse
shape duration, while the leakage with respect to delay can
be decreased by increasing the bandwidth of the transmitted
signal. Given the practical constraints of pulse shape duration
and bandwidth, the leakage effect increases the number of
nonzero coefficients of the observed leaked channel at the
receiver (for more details, see [2], [12]).
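The delay-leakage mechanism in Equation (4) is easy to visualize numerically. The sketch below is our own illustration (not the paper's simulation setup): it assumes an ideal sinc for the combined pulse p(·), ignores the Doppler phase factor, and samples p(mT_s − t_k) for an on-grid and an off-grid path delay.

```python
import numpy as np

Ts = 1.0          # sample period
m0 = 8            # discrete delay spread (window of taps)
m = np.arange(m0)

# Assumed combined pulse p = pt * pr; an ideal band-limited (sinc)
# pulse is used here purely for illustration.
def p(t):
    return np.sinc(t / Ts)

taps_int  = p(m * Ts - 2.0 * Ts)   # path delay on the sampling grid
taps_frac = p(m * Ts - 2.5 * Ts)   # path delay between two samples

# On-grid delay: a single significant tap. Off-grid delay: energy
# "leaks" across many taps, destroying exact sparsity.
n_int  = int(np.sum(np.abs(taps_int)  > 0.05))
n_frac = int(np.sum(np.abs(taps_frac) > 0.05))
```

The on-grid delay yields one nonzero tap, while the half-sample offset spreads significant energy over every tap in the window, which is exactly the loss of delay-domain sparsity described above.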
The main goal of channel estimation is the determination
of the channel coefficients, i.e., { h_k^(l)[n, m] | 1 ≤ k ≤ p_0,
0 ≤ m ≤ m_0 − 1 }, at time instance n, in order to
equalize their effect on the transmitted signal. It is clear
that at each time instance, n, there exist m0p0 (unknown)
channel coefficients to be estimated. These coefficients are
estimated via the observations y[n] and the pilot sequence
x[n] which are both known at the receiver during channel
estimation (see the signal model in Equation (3)). Here,
n ∈ {m_0, ..., n_T + m_0 − 1}, where n_T denotes the total
number of training symbols. In the next section, we show
that the channel coefficients exhibit structures that we can
explicitly exploit in our estimation process in order to decrease
the set of feasible solutions for channel coefficients, and thus
improve the estimation fidelity.
III. PARAMETRIC SIGNAL REPRESENTATION
In this section, we exploit the intrinsic structures in the
measurement model in Equation (3). We show that, even
though the leakage effect diminishes the sparsity of the ef-
fective observed channel coefficients, leakage also introduces
an elegant parametric low-rank structure that will enable
high accuracy channel estimation with a limited number of
measurements.
Define g_k(t) = p(t − t_k) e^{−j2πν_k t}. Then, using Equations (3)
and (4), the received signal from the k-th path can be written as

  s_k[n] = Σ_{m=0}^{m_0−1} g_k(mT_s) x[n − m] = x_n^T g_k, (5)

where g_k = [g_k(0·T_s), ..., g_k((m_0 − 1)T_s)]^T and
x_n = [x[n], x[n − 1], ..., x[n − (m_0 − 1)]]^T. The vectors g_k, for
1 ≤ k ≤ p_0, contain only the (shifted) leakage pulse shape
information, and the vector x_n consists of m_0 consecutive
samples from the training signal up to time n. We can represent
samples from the training signal up to time n. We can represent
the (aggregated) received signal in Equation (3) as
  y[n] = Σ_{k=1}^{p_0} η̄_k s_k[n] e^{j2πν̄_k n}, (6)

where η̄_k = η_k e^{−j2πν_k t_k} and ν̄_k = ν_k T_s ∈ [−1/2, 1/2]. If we
stack the sk[n] for m0 ≤ n ≤ nT + m0 − 1 in a vector as
s_k = [s_k[m_0], ..., s_k[n_T + m_0 − 1]]^T, we can write

  s_k = X g_k, (7)
where X is an n_T-by-m_0 matrix with its i-th row equal to
x_{i+m_0−1}^T. In wireless communication systems, typically
n_T > m_0; that is, all s_k live in a common low-dimensional
subspace spanned by the columns of the known n_T × m_0 matrix
X. We assume that ‖g_k‖_2 = 1 without loss of
generality. Using Equation (5), recovery of s_k is guaranteed
if g_k can be recovered. Therefore, the number of degrees of
freedom in Equation (6) becomes O(m_0 p_0), which is smaller
than the number of measurements n_T when p_0, m_0 ≪ n_T.
Applying Equation (5), we can rewrite Equation (6) as
  y[n] = Σ_{k=1}^{p_0} η̄_k e^{j2πnν̄_k} x_n^T g_k. (8)
We define d(ν̄) = [e^{−j2πm_0ν̄}, ..., e^{−j2π(n_T+m_0−1)ν̄}]^T,
which is a vector of all possible Doppler shifts in the channel
representation. Thus, we have
  y[n] = ⟨ Σ_{k=1}^{p_0} η̄_k g_k d(ν̄_k)^H , x_n e_{n−m_0+1}^T ⟩, (9)

for n = m_0, ..., n_T + m_0 − 1, where ⟨X, Y⟩ = trace(Y^H X)
and e_n, 1 ≤ n ≤ n_T, are the canonical basis vectors for
ℝ^{n_T×1}. We observe that the first factor in the matrix inner
product given in Equation (9) has only the (narrowband) time-
varying channel information and the second factor is a function
of the training sequence only. Hereafter, we define the leaked
channel matrix as

  H_l = Σ_{k=1}^{p_0} η̄_k g_k d(ν̄_k)^H. (10)
Since each term in the summation in Equation (10) is a
rank-one matrix, we know that rank(H_l) ≤ p_0. Thus, Equation
(9) leads to a parametrized rank-p_0 matrix recovery problem,
which we write as

  y = Π(H_l), (11)

where the linear operator Π : ℂ^{m_0×n_T} → ℂ^{n_T×1} is defined
as [Π(H_l)]_n = ⟨H_l, x_n e_n^T⟩, and H_l is a low-rank matrix with
the parametric representation given in Equation (10).
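As a quick sanity check on Equations (10)–(11), the following sketch (with illustrative sizes m_0, n_T, p_0 of our own choosing) assembles H_l from p_0 rank-one atoms and confirms that its numerical rank does not exceed the number of paths:

```python
import numpy as np

rng = np.random.default_rng(0)
m0, nT, p0 = 6, 64, 3

def d(nu):
    # Doppler steering vector of Equations (9)-(10), indices m0..nT+m0-1
    return np.exp(-2j * np.pi * nu * np.arange(m0, m0 + nT))

nus  = np.array([-0.30, 0.05, 0.27])   # normalized Dopplers in [-1/2, 1/2]
etas = np.array([1.0, 0.6, 0.3])       # path gains
G = rng.standard_normal((m0, p0)) + 1j * rng.standard_normal((m0, p0))
G /= np.linalg.norm(G, axis=0)         # unit-norm leakage vectors g_k

# Leaked channel matrix H_l = sum_k eta_k g_k d(nu_k)^H, Equation (10)
H = sum(etas[k] * np.outer(G[:, k], d(nus[k]).conj()) for k in range(p0))

sv = np.linalg.svd(H, compute_uv=False)
num_rank = int(np.sum(sv > 1e-9 * sv[0]))   # numerical rank <= p0
```

With distinct Doppler values and generic leakage vectors, the numerical rank equals p_0 exactly, matching rank(H_l) ≤ p_0.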
IV. STRUCTURED ESTIMATION OF TIME-VARYING
NARROWBAND CHANNELS
In this section, we propose two algorithms to estimate
the leaked channel matrix H_l via the measurements y[n], by
exploiting the parametrized low-rank matrix structure described
in Section III. We show that by leveraging the channel
structure in the estimation process, the channel coefficients can
be estimated with high accuracy and with a small number of
measurements. From this point on, we drop the subscript l on
H_l for clarity; it is understood that H always refers to the
leaked channel matrix.
A. Nonconvex Alternating Direction Minimization
Recovering the channel from the linear measurement model
y = Π(H), described in Equation (11), is equivalent to determining
a low-rank matrix H that satisfies this measurement
model. Thus, we seek to solve

  Ĥ_l = argmin_H ‖y − Π(H)‖_2^2
  s.t. rank(H) ≤ p_0, H = Σ_{k=1}^{p_0} η̄_k g_k d(ν̄_k)^H, (12)
where p_0 denotes the maximum number of dominant paths in
the channel. Unfortunately, due to the rank constraint, this
optimization problem is, in general, NP-hard [28]. A
tractable relaxation of the rank constraint is the nuclear
norm of the target matrix [28]. In particular, we can rewrite
the relaxed optimization problem as
  Ĥ_l = argmin_H ‖y − Π(H)‖_2^2 + λ‖H‖_*
  s.t. H = Σ_{k=1}^{p_0} η̄_k g_k d(ν̄_k)^H, (13)
where the parameter λ in Equation (13) determines the trade-off
between the fidelity of the solution to the measurements y
and its conformity to the low-rank model. Furthermore, from
Equation (10), we know that the channel matrix H can be
represented as
  H = Σ_{k=1}^{p_0} η̄_k g_k d(ν̄_k)^H = H_t H_ν, (14)

where

  H_t = [η̄_1 g_1, η̄_2 g_2, ..., η̄_{p_0} g_{p_0}], (15)
  H_ν = [d(ν̄_1), d(ν̄_2), ..., d(ν̄_{p_0})]^H. (16)
As can be seen from Equation (14), the sampled channel
matrix is a function of both the delay and the Doppler values.
One can estimate the sampled channel matrix factors, i.e., g_k
and d(ν̄_k)^H, or attempt to estimate the delay and Doppler
values that contribute to those factors. The Doppler values have a
direct structural relationship with HHH through the Vandermonde
matrix. While the pulse shaping filters are known, endeavoring
to exploit the resulting parametric relationship between the
leakage vector g and the delay (and also the Doppler) did
not provide good performance in our direct estimation of the
delay. This could be due to the highly non-linear nature of the
pulse correlation function or the presence of the Doppler in the
delay leakage vector. Thus, we are able to be parametric with
respect to the Doppler, but not with respect to the delay. We
treat H_t as an effective delay matrix (although it depends
on the Doppler as well) and do not seek to estimate the delays
directly. In developing our equalizer, we only need to use H_t
(as a whole) and H_ν.
Therefore, we can reformulate the optimization problem in
Equation (13), with H = H_t H_ν, as

  argmin_{H_t, H_ν} ( ‖y − Π(H_t H_ν)‖_2^2 + λ‖H_t H_ν‖_* ). (17)
From [29], we know that the minimization of the nuclear
norm of a matrix product can be rewritten as a Frobenius-norm
minimization,

  min_{H_t, H_ν} ‖H_t H_ν‖_* = min_{H_t, H_ν} (1/2)( ‖H_t‖_F^2 + ‖H_ν‖_F^2 ). (18)
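The equivalence in Equation (18) can be checked numerically: for the balanced factorization built from the SVD H = UΣV^H, taking H_t = UΣ^{1/2} and H_ν = Σ^{1/2}V^H attains (1/2)(‖H_t‖_F² + ‖H_ν‖_F²) = ‖H‖_*. A small sketch with a random test matrix (our own check, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.standard_normal((6, 10)) + 1j * rng.standard_normal((6, 10))

U, s, Vh = np.linalg.svd(H, full_matrices=False)
Ht = U * np.sqrt(s)            # U @ diag(sqrt(s)): column j scaled by sqrt(s_j)
Hv = np.sqrt(s)[:, None] * Vh  # diag(sqrt(s)) @ Vh

nuclear  = s.sum()             # ||H||_* = sum of singular values
balanced = 0.5 * (np.linalg.norm(Ht, 'fro')**2 + np.linalg.norm(Hv, 'fro')**2)
# Any factorization H = Ht @ Hv upper-bounds the nuclear norm;
# the balanced SVD factorization attains it exactly.
```

The two quantities agree to machine precision, while an unbalanced factorization (e.g., scaling H_t up and H_ν down) only increases the Frobenius cost.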
With this equivalence in hand, we will consider the following
alternative optimization problem,

  argmin_{H_t, H_ν} ( ‖y − Π(H_t H_ν)‖_2^2 + (λ/2)( ‖H_t‖_F^2 + ‖H_ν‖_F^2 ) ). (19)
It should be noted that the cost function in (19) is an
upper bound on the cost function in (17), due to the
following chain of inequalities:

  min_{H_t,H_ν} ( ‖y − Π(H_t H_ν)‖_2^2 + λ_1 ‖H_t H_ν‖_* )
  (a)≤ min_{H_t,H_ν} ( ‖y − Π(H_t H_ν)‖_2^2 + λ_1 √rank(H_t H_ν) ‖H_t H_ν‖_F )
  (b)≤ min_{H_t,H_ν} ( ‖y − Π(H_t H_ν)‖_2^2 + λ_1 √p_0 ‖H_t H_ν‖_F )
  (c)≤ min_{H_t,H_ν} ( ‖y − Π(H_t H_ν)‖_2^2 + λ_1 √p_0 ‖H_t‖_F ‖H_ν‖_F )
  (d)≤ min_{H_t,H_ν} ( ‖y − Π(H_t H_ν)‖_2^2 + λ_1 (√p_0 / 2)( ‖H_t‖_F^2 + ‖H_ν‖_F^2 ) )
    = min_{H_t,H_ν} ( ‖y − Π(H_t H_ν)‖_2^2 + (λ/2)( ‖H_t‖_F^2 + ‖H_ν‖_F^2 ) ),

where λ = λ_1 √p_0; (a) holds since ‖A‖_* ≤ √rank(A) ‖A‖_F,
(b) holds since √rank(H_t H_ν) = √rank(H) ≤ √p_0, (c)
holds since ‖AB‖_F ≤ ‖A‖_F ‖B‖_F, and (d) holds since
‖A‖_F ‖B‖_F ≤ (1/2)( ‖A‖_F^2 + ‖B‖_F^2 ).
The form of this upper bound facilitates optimization,
although it is still non-convex and thus our methods will yield
local optima only.
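Each of the steps (a)–(d) rests on a standard norm inequality; the following sketch (random matrices of our own choosing, not from the paper) verifies them numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 4))
B = rng.standard_normal((4, 7))

nuc  = np.linalg.svd(A, compute_uv=False).sum()  # ||A||_*
froA = np.linalg.norm(A, 'fro')
froB = np.linalg.norm(B, 'fro')
r    = np.linalg.matrix_rank(A)

ok_a = nuc <= np.sqrt(r) * froA + 1e-12                       # step (a)
ok_c = np.linalg.norm(A @ B, 'fro') <= froA * froB + 1e-12    # step (c)
ok_d = froA * froB <= 0.5 * (froA**2 + froB**2) + 1e-12       # step (d), AM-GM
```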
From Equation (16), we see that the matrix H_ν is a partial
Vandermonde matrix and is fully determined by the Doppler
parameters ν̄ = [ν̄_1, ..., ν̄_{p_0}]. Therefore, instead of optimizing
the matrix H_ν over ℂ^{p_0×n_T}, we perform the optimization over
the set of Doppler parameters ν̄ ∈ [−1/2, 1/2]^{p_0} as follows:

  argmin_{H_t, ν̄} ‖y − Π(H_t H_ν(ν̄))‖_2^2 + (λ/2)‖H_t‖_F^2 + (λ/2)‖H_ν(ν̄)‖_F^2.
Note that the above optimization problem is non-convex [27],
due to the product of unknowns, i.e., Π(H_t H_ν). Due to the
separability of the objective function in the delay and Doppler
directions (see Section V in [2]), we can employ the alternating
projections algorithm [24], a space-efficient technique that stores
the iterates in factored form. The algorithm is extraordinarily
simple and easy to interpret: in the first step, we fix either
H_t or H_ν and optimize the other. Then, in the second
step, we substitute the matrix computed in the first step and
optimize over the second matrix. We iterate over these two
steps until the algorithm converges to the optimal solution (or a
stationary point). Given the current estimates H_t^k
and H_ν^k, the updating rules can be summarized as
  H_ν^{k+1} = argmin_{H_ν(ν̄)} ‖y − Π(H_t^{k+1} H_ν(ν̄))‖_2^2, (20)

  H_t^{k+1} = argmin_{H_t} ‖y − Π(H_t H_ν(ν̄^k))‖_2^2 + (λ/2)‖H_t‖_F^2. (21)
We can further simplify these iterations using Lemma 1.
Lemma 1 (see, e.g., [30]). Suppose that Π : ℂ^{m_0×n_T} → ℂ^{n_T×1} is a linear operator with

  [Π(H)]_n = Tr(X_n H) = Σ_{i=1}^{m_0} Σ_{j=1}^{n_T} X_n[j, i] H[i, j],

where X_n = x_n e_n^T. Then, for H = H_t H_ν, we can write

  Π(H) = A_t vec(H_ν) = A_ν vec(H_t), (22)

where vec(·) stacks the columns of its matrix argument into a single column vector. The matrices A_t and A_ν are defined as

  A_t[k, l + p_0(j − 1)] = Σ_{i=1}^{m_0} X_k[j, i] H_t[i, l],

and

  A_ν[k, i + m_0(l − 1)] = Σ_{j=1}^{n_T} X_k[j, i] H_ν[l, j],

where l ∈ {1, ..., p_0}.
Applying Lemma 1, we can rewrite each iteration of the
alternating projection algorithm in Equations (20) and (21) as

  H_ν^{k+1} = argmin_{H_ν(ν̄)} ‖y − A_t^{k+1} vec(H_ν(ν̄))‖_2^2, (23)

  H_t^{k+1} = argmin_{H_t} ‖y − A_ν^k vec(H_t)‖_2^2 + (λ/2)‖vec(H_t)‖_2^2. (24)
In the rest of this work, we refer to this method of channel
estimation as the non-convex alternating direction minimization
(NADM) approach. From Equation (24), we see that the update
in the delay direction is simply a ridge estimator, or Tikhonov
regularization [31], while in the Doppler direction we need to find
the roots of an (exponential) polynomial using a root-finding
algorithm [32]. One of the challenges with this approach is
its sensitivity to the initialization.
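A stripped-down sketch of the alternating iteration is given below. It is our own simplification of NADM for illustration only: the data are real-valued and noiseless, and H_ν is updated as an unstructured matrix, column by column, rather than through its Vandermonde parametrization and root finding.

```python
import numpy as np

rng = np.random.default_rng(3)
m0, nT, p0, lam = 4, 40, 2, 1e-3

X = rng.choice([-1.0, 1.0], size=(nT, m0))     # BPSK pilot rows x_n^T
Ht_true = rng.standard_normal((m0, p0))
Hv_true = rng.standard_normal((p0, nT))
y = np.einsum('ni,in->n', X, Ht_true @ Hv_true)  # y[n] = x_n^T H e_n

Ht = rng.standard_normal((m0, p0))             # random initialization
Hv = rng.standard_normal((p0, nT))
res = []
for _ in range(5):
    # Delay-direction step (ridge / Tikhonov, cf. Equation (24)):
    # row n of the design matrix is kron(Hv[:, n], x_n).
    A = np.stack([np.kron(Hv[:, n], X[n]) for n in range(nT)])
    vec_Ht = np.linalg.solve(A.T @ A + lam * np.eye(m0 * p0), A.T @ y)
    Ht = vec_Ht.reshape(p0, m0).T              # undo column-major vec
    # Doppler-direction step: per-column minimum-norm least squares.
    a = Ht.T @ X.T                             # a[:, n] = Ht^T x_n
    Hv = a * (y / np.maximum(np.sum(a * a, axis=0), 1e-12))
    res.append(np.linalg.norm(y - np.einsum('ni,in->n', X, Ht @ Hv)))
```

In this unconstrained toy version, the Doppler-direction step interpolates the measurements exactly, so the residual collapses immediately; the hard part of the actual algorithm is precisely the Vandermonde constraint on H_ν (and the resulting root finding), which this sketch omits.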
Remark 1. If we know the exact number of dominant
paths in the channel, i.e., rank(H) = p_0, then enforcing the
separability of the channel into the leakage and Doppler matrices
in Equation (17) implicitly enforces the low-rank
structure of the channel matrix. Therefore, when the number
of dominant paths in the channel is known exactly, we can set
λ = 0 in Equation (17) for the channel estimation.
Considering the case in Remark 1, we state a theorem
(Theorem 1): under no noise and a sufficiently small
stopping resolution, the output (H_t^opt, H_ν^opt) of the NADM
algorithm is the global optimum (H_t^*, H_ν^*) of the channel
estimation problem whenever the global optimum is uniquely
identifiable¹.

Theorem 1. Define Δ ≜ H_t H_ν^T − H_t^* H_ν^{*T} and
J(H_t, H_ν) = ‖y − Π(H_t H_ν)‖_2^2. In the absence of noise, Δ ≠ 0 implies
that ∂J(H_t, H_ν)/∂H_t ≠ 0 and ∂J(H_t, H_ν)/∂H_ν ≠ 0, if the global optimum
(H_t^opt, H_ν^opt) is uniquely identifiable.

The proof is given in Appendix A.
Remark 2. The unique identifiability of (H_t^*, H_ν^*) as the
solution of

  (H_t^*, H_ν^*) = argmin_{H_t, H_ν} J(H_t, H_ν) (25)

is necessary for any recovery guarantee. To see this, note that if
there is a second global optimum (H_t^{**}, H_ν^{**}) to (25), then there
is no way to determine which of the two solutions is the correct
one. In this case, we have Δ^{**} = H_t^{**} H_ν^{**T} − H_t^* H_ν^{*T} ≠ 0,
but global optimality of (H_t^{**}, H_ν^{**}) implies ∂J(H_t^{**}, H_ν^{**})/∂H_t = 0,
thus contradicting the conclusion of Theorem 1 if the unique
identifiability clause were removed from the statement of the
theorem.
While Theorem 1 excludes the existence of local optima,
it does not quantify the convergence rate of the algorithm.
Doing so involves knowing the mean and the variance of the
partial derivative ∂J(H_t, H_ν)/∂H_t over the random training sequence
(Theorem 2). We analyze the output of the proposed algorithm
for the channel estimation problem when
the elements of the pilot sequence x are drawn i.i.d. from a
random BPSK constellation, i.e., {−1, +1} with
equal probability.
Theorem 2. Let the elements of x ∈ ℝ^{n_T+m_0−1} be generated
by a sequence of random BPSK symbols, i.e., {−1, +1} with
equal probability, and define Δ ≜ H_t H_ν^T − H_t^* H_ν^{*T}
= [δ_0, ..., δ_{n_T−1}] ∈ ℂ^{m_0×n_T}. In the absence of noise, we have

  E{ ‖∂J(H_t, H_ν)/∂H_t‖_F^2 } = ‖E{ ∂J(H_t, H_ν)/∂H_t }‖_F^2
    + (m_0 − 1) p_0 Σ_{n=0}^{n_T−1} ‖δ_n‖_2^2, (26)
¹Identifiability here means that, given sufficient information, we can unambiguously determine the pair of inputs (H_t, H_ν) that generated the observation.
and

  ‖E{ ∂J(H_t, H_ν)/∂H_t }‖_F^2 = vec(Δ)^H B vec(Δ), (27)

where (using ⊗ to denote the Kronecker product of matrices)
B = H_ν H_ν^H ⊗ I ∈ ℂ^{m_0 n_T × m_0 n_T} is a Hermitian positive
semidefinite matrix satisfying the eigenvalue bounds

  0 ⪯ B ⪯ ‖H_ν‖_2^2 I. (28)
The proof is given in Appendix B. In our simulations (see
Section V), we did observe the impact of bad initializations;
i.e., local optima were found, which resulted in poor performance
[23]. This challenge motivates us to investigate whether a
convex programming approach can enforce the parametric
bilinear low-rank structure. In the next section, we employ the
atomic norm heuristic, introduced in [33], to promote the
sparsity of the number of dominant paths in the channel as well as
the parametric low-rank structure in the channel model.
B. Parametric Low-rank Atomic Norm
We know that the number of dominant paths in a narrow-
band time-varying wireless communication system is small.
This means that the number of terms in Equation (8), p0,
is quite small compared to the number of training signal
measurements, i.e., p0 � nT . In other words, the channel can
be described as a summation of rank-one matrices of the form
g d(ν̄)^H in Equation (10). This representation motivates the
use of the atomic norm heuristic to promote this structure. For
this purpose, we define an atom as A(g, ν̄) = g d(ν̄)^H, where
ν̄ ∈ [−1/2, 1/2] and g ∈ ℂ^{m_0×1}. Without loss of generality, we
consider ‖g‖_2 = 1. Then, we define the set of all atoms as

  𝒜 = { A(g, ν̄) | ν̄ ∈ [−1/2, 1/2], ‖g‖_2 = 1, g ∈ ℂ^{m_0×1} }. (29)
Our goal here is to find a representation of the channel with
a small number of dominant paths, i.e.,

  ‖H‖_{𝒜,0} = inf_p { p : H = Σ_{k=1}^{p} η̄_k g_k d(ν̄_k)^H }. (30)
Due to the combinatorial nature of the norm defined in Equation
(30), the above optimization problem is NP-hard. Thus,
we instead consider the convex relaxation of the above norm,
namely the atomic norm associated with the set of atoms defined
in Equation (29):

  ‖H‖_𝒜 = inf { t > 0 : H ∈ t·conv(𝒜) } (31)
        = inf_{η̄_k, ν̄_k, ‖g_k‖_2=1} { Σ_k |η̄_k| : H = Σ_k η̄_k g_k d(ν̄_k)^H },
where conv denotes the convex hull. This relaxation is similar
to the relaxation of l0-norm of a vector by its l1-norm, which
is the prevalent relaxation in the compressed sensing literature
to avoid the combinatorial nature of l0-norm in the recovery
algorithm.
Remark 3. The atomic representation in Equation (31) for the
matrix H, i.e., H = Σ_k η̄_k A(g_k, ν̄_k) = Σ_k η̄_k g_k d(ν̄_k)^H,
not only captures the functional forms of its elements, but
also enforces the rank-one constraint on each term in the
summation, i.e., rank(A(g_k, ν̄_k)) = 1.
To enforce the sparsity of the atomic representation (low
rank) of the received signal, we solve

  minimize_H ‖H‖_𝒜  s.t.  y = Π(H). (32)
From [25], we know that due to the Vandermonde decompo-
sition, the convex hull of the set of atoms A can be charac-
terized by a semidefinite program (SDP). Therefore, ‖H‖_𝒜 in
Equation (32) admits an equivalent SDP representation.
Proposition 1 (see, e.g., [25], [34]). For any H ∈ ℂ^{m_0×n_T},

  ‖H‖_𝒜 = inf_{z, W} (1/(2n_T)) trace( Toep(z) + n_T W )
          s.t.  [ Toep(z)  H^H ; H  W ] ⪰ 0, (33)

where z is a complex vector whose first element is real,
Toep(z) denotes the n_T × n_T Hermitian Toeplitz matrix whose
first column is z, and W is a Hermitian m_0 × m_0 matrix.
Therefore, we can use any efficient SDP solver such as CVX
[35], to solve the optimization problem in Equation (32). For
noisy measurements, we consider
  minimize_H ‖H‖_𝒜  s.t.  ‖y − Π(H)‖_2 ≤ σ_z^2. (34)
The above problem can similarly be cast as an SDP by adding
the noisy convex constraint. As in the noiseless version,
the noisy version can be solved using CVX.
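The SDP characterization in Equation (33) can be spot-checked without a solver. For a single atom H = η g d(ν̄)^H, the choice Toep(z) = |η| d(ν̄)d(ν̄)^H and W = |η| g g^H is feasible (the block matrix is |η| times a rank-one outer product, hence PSD) and attains the objective value |η| = ‖H‖_𝒜. A numpy sketch of this feasibility check (our own construction; a full solve would use CVX as noted above):

```python
import numpy as np

rng = np.random.default_rng(4)
m0, nT = 5, 16
nu  = 0.17
eta = 0.8 * np.exp(1j * 0.6)                  # complex path gain, |eta| = 0.8
g = rng.standard_normal(m0) + 1j * rng.standard_normal(m0)
g /= np.linalg.norm(g)                        # unit-norm leakage vector

d = np.exp(-2j * np.pi * nu * np.arange(m0, m0 + nT))
H = eta * np.outer(g, d.conj())               # single atom, ||H||_A = |eta|

T = abs(eta) * np.outer(d, d.conj())          # Toep(z): Toeplitz since d is exponential
W = abs(eta) * np.outer(g, g.conj())

# Block matrix of Equation (33): [[Toep(z), H^H], [H, W]]
M = np.block([[T, H.conj().T], [H, W]])
min_eig = np.linalg.eigvalsh(M).min()         # feasibility: M is PSD
obj = (np.trace(T).real + nT * np.trace(W).real) / (2 * nT)
```

The minimum eigenvalue is zero up to round-off and the objective equals |η|, consistent with the SDP value for a single-atom channel.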
Next, we show that the solution of the optimization problem
in Equation (32) is the optimal solution of our channel
estimation problem in the noiseless scenario. Furthermore, we
discuss the conditions under which the solution of this convex
program is unique. Finally, we develop a scaling law that
relates the number of measurements needed to the probability
of correct estimation of leaked channel parameters.
1) Optimality and Uniqueness: The dual of the optimization
problem in Equation (32), obtained via standard Lagrangian
analysis, can be written as

  maximize_λ Re{⟨λ, y⟩}  s.t.  ‖Π^*(λ)‖_𝒜^* ≤ 1, (35)

where Π^*(λ) = Σ_k λ(k) x_k e_{k−m_0+1}^T is the adjoint operator
of Π and ‖·‖_𝒜^* denotes the dual norm of the atomic norm,
which is defined through an inner product. Therefore, we have

  ‖Π^*(λ)‖_𝒜^* = sup_{‖Θ‖_𝒜 ≤ 1} Re{⟨Π^*(λ), Θ⟩} (36)
              = sup_{ν̄ ∈ [−1/2, 1/2], ‖g‖_2 = 1} Re{⟨Π^*(λ), g d(ν̄)^H⟩}.
The second equality in Equation (36) holds since the set {g d(ν̄)^H}_{ν̄, g}
covers all the extremal points of the atomic-norm unit ball,
i.e., {Θ : ‖Θ‖_𝒜 ≤ 1}. If we define the vector-valued function
μ(ν̄) = Π^*(λ) d(ν̄), then by the Cauchy-Schwarz inequality, we
have

  ‖Π^*(λ)‖_𝒜^* = sup_{ν̄ ∈ [−1/2, 1/2], ‖g‖_2 = 1} Re{ g^H μ(ν̄) }
              ≤ sup_{ν̄ ∈ [−1/2, 1/2]} ‖μ(ν̄)‖_2. (37)
Now, if we impose the condition

‖μ(ν)‖_2 ≤ 1  for all ν ∈ [−1/2, 1/2],   (C-1)

then we can rewrite the optimization problem in Equation (35) as

maximize_λ Re{⟨λ, y⟩}  subject to  ‖μ(ν)‖_2 ≤ 1,   (38)

where μ(ν) = Π*(λ) d(ν), or, explicitly,

μ(ν) = Σ_{n=m0}^{nT+m0−1} λ(n−m0+1) e^{j2πnν} x_n^*.   (39)
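The dual-norm machinery in (36)–(39) can be illustrated numerically. The sketch below (illustrative sizes; real Rademacher training vectors x_n, so the conjugation in (39) is immaterial) evaluates μ(ν) for a random dual vector λ and checks that the Cauchy–Schwarz bound in (37) is attained at g = μ(ν)/‖μ(ν)‖_2:

```python
import numpy as np

# Evaluate mu(nu) = sum_n lambda(n-m0+1) e^{j 2 pi n nu} x_n for a random dual
# vector lambda and Rademacher training x_n, and verify that the supremum of
# Re{g^H mu(nu)} over unit-norm g equals ||mu(nu)||_2 (tight Cauchy-Schwarz).
rng = np.random.default_rng(1)
m0, nT = 4, 32
n = np.arange(m0, nT + m0)                     # time indices m0 .. nT+m0-1
x = rng.choice([-1.0, 1.0], size=(m0, nT))     # Rademacher training vectors x_n
lam = rng.standard_normal(nT) + 1j * rng.standard_normal(nT)

def mu(nu):
    # mu(nu): m0-dimensional dual polynomial evaluated at Doppler nu
    return (x * (lam * np.exp(2j * np.pi * n * nu))).sum(axis=1)

nu0 = 0.13
m = mu(nu0)
g_opt = m / np.linalg.norm(m)                  # maximizing unit-norm g
val = np.real(g_opt.conj() @ m)                # equals ||mu(nu0)||_2
```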
Similarly, we have Re{⟨λ, y⟩} = Re{⟨Π*(λ), H⟩} = Σ_k Re{η_k^* g_k^H μ(ν_k)}. If we further assume

μ(ν_k) = sign(η_k) g_k,   (C-2)

for k ∈ {1, ..., p0}, then we have Re{⟨λ, y⟩} = Σ_k |η_k| ≥ ‖H‖_A. Moreover, using the Hölder inequality, we know that Re{⟨λ, y⟩} ≤ ‖Π*(λ)‖_A^* ‖H‖_A ≤ ‖H‖_A. Therefore, if conditions (C-1) and (C-2) hold, then Re{⟨λ, y⟩} = ‖H‖_A. In other words, under conditions (C-1) and (C-2), the solutions of the primal (Equation (32)) and dual (Equation (35)) optimization problems exhibit a zero duality gap. Thus, H and λ are optimal solutions of the primal and dual optimization problems, respectively. Furthermore, using a proof by contradiction, we can see that condition (C-2) ensures the uniqueness of the optimal
solution. Suppose Ĥ = Σ_k η̂_k ĝ_k d(ν̂_k)^H is another optimal solution. Since H and Ĥ are different, there are some ν̂_k that are not in the support of H. Define T_ν = {ν_1, ν_2, ..., ν_{p0}} as the Doppler-shift support of H. Then, we have

Re{⟨λ, y⟩} = Re{⟨Π*(λ), Ĥ⟩}
  = Σ_{k ∈ T_ν} Re{η̂_k^* ĝ_k^H μ(ν̂_k)} + Σ_{k ∉ T_ν} Re{η̂_k^* ĝ_k^H μ(ν̂_k)}
  < Σ_{k ∈ T_ν} |η̂_k| + Σ_{k ∉ T_ν} |η̂_k|,
which contradicts the optimality of Ĥ. Thus, we have shown that if we can guarantee the existence of a dual polynomial μ(ν) = Π*(λ) d(ν) meeting the two key conditions (C-1) and (C-2), then the optimization problem in Equation (32) will find the optimal solution of the channel estimation problem. In Theorem 3, we show that a proper dual polynomial μ(ν) exists under two main conditions: 1) a minimum Doppler separation and 2) a sufficient number of measurements.
Theorem 3. Suppose nT ≥ 64 (see [34]) and the training sequence {x[n] | m0 ≤ n ≤ nT + m0 − 1} is generated using an i.i.d. random source with Rademacher distribution.² Assume that min_{1≤i<j≤p0} |ν_i − ν_j| ≥ 4/nT. Then there exists a constant c such that, for

nT ≥ c p0 m0 log³(nT p0 m0 / δ),   (40)

the proposed optimization problem in Equation (32) can recover H with probability at least 1 − δ.
The proof is given in Appendix C.
2) Channel Estimation Algorithm: After we evaluate the dual parameters λ by solving the optimization problem in Equation (32), we can construct the function μ(ν) in Equation (39). Then, we can employ μ(ν) to estimate the Doppler parameters by enforcing condition (C-2), since we know that ‖μ(ν_k)‖_2 = 1 for k ∈ {1, ..., p0}. Towards this goal, we need to find the roots of the following polynomial:

Q(ν) = 1 − ‖μ(ν)‖_2² = 1 − μ(ν)^H μ(ν),   (41)
which are equal to {ν_k}_{k=1}^{p0}. After the estimation of {ν_k}_{k=1}^{p0}, we can substitute them into Equation (8) to obtain a linear system of equations from which {η_k g_k}_{k=1}^{p0} is evaluated. Note that we do not need to compute the values of η_k and g_k separately in order to equalize the channel distortion. As seen in Equation (8), to construct an equalizer, we just require η_k g_k for 1 ≤ k ≤ p0. Using Equation (4), we can rewrite the channel coefficients as

h^(l)[n, m] = Σ_{k=1}^{p0} η_k g_k[m] e^{j2πν_k n};   (42)

thus, to design a channel equalizer, one need only evaluate the coefficients in Equation (42). We denote our convex channel estimation approach as the parametric low-rank atomic norm (PLAN) method.
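The final reconstruction step of PLAN can be sketched as follows: given estimated Doppler shifts ν_k and the products η_k g_k (which in the paper come from the linear system in Equation (8)), Equation (42) rebuilds the time-varying channel coefficients. Sizes and values below are illustrative:

```python
import numpy as np

# Rebuild channel coefficients h[n, m] = sum_k (eta_k g_k)[m] e^{j 2 pi nu_k n},
# as in (42), from hypothetical estimates of nu_k and eta_k * g_k.
rng = np.random.default_rng(2)
p0, m0, nT = 3, 5, 20
nu = np.array([-0.3, 0.05, 0.4])                       # estimated Doppler shifts
eta_g = rng.standard_normal((p0, m0)) + 1j * rng.standard_normal((p0, m0))

n = np.arange(nT)[:, None]                             # time index n (column)
phases = np.exp(2j * np.pi * n * nu[None, :])          # nT x p0 Doppler phases
h = phases @ eta_g                                     # nT x m0 channel matrix
```

At n = 0 all phases equal one, so h[0] is simply the sum of the η_k g_k vectors, which gives an easy consistency check.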
V. NUMERICAL SIMULATIONS
In this section, we perform several numerical experiments
to validate the performance of the proposed channel estimation
algorithms. Furthermore, we compare the performance of
our proposed algorithms, namely NADM and PLAN, with
sparsity-based method, see e.g., [2], [13] and the references
therein. More details about the model of the channel coeffi-
cients and measurements for the element-wise sparsity-based
method, i.e., LASSO method, can be found in Section V in [2].
Additionally, we compare the performance of our proposed
algorithms with the work [12] that considers an iterative basis-
optimization approach which exploits prior statistics and spar-
sity to mitigate the leakage effect. From [12], basis expansion
optimization is coupled with OMP, namely OMP (optimized).
A. Signal Parameters
We construct a narrowband time-varying channel based on
the model given in Equation (2). We first generate the channel
delay, Doppler, and attenuation parameters randomly. In our
experiments, the (normalized) delay, in (0, 1], and Doppler
²The Rademacher distribution is a discrete probability distribution where x_i[j] = ±1 with probability 1/2.
Fig. 1: ±2σ normalized mean-squared-error (NMSE) curves of the estimation strategies vs. signal-to-noise ratio (SNR) at the receiver side (leaked channel estimation). Curves shown: NADM, PLAN, LASSO, OMP (optimized), and the NADM worst and best cases.
parameters, in [−1/2, 1/2], are generated via uniform random variables, and the channel attenuation parameters are generated using a Rayleigh random variable, unless otherwise stated. The transmit training signal x = [x[1], x[2], ..., x[nT + n0 − 1]]^T is generated by random BPSK modulation, i.e., {−1, +1} with equal probability. Moreover, the transmit and receive pulse shapes are Gaussian pulses whose window support is 50% of a single symbol interval.
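The parameter generation just described can be sketched as follows (sizes follow the first experiment of Section V-B; the pulse-shaping step is omitted):

```python
import numpy as np

# Generate the random simulation parameters of Section V-A: uniform normalized
# delays and Dopplers, Rayleigh attenuations, and a BPSK (Rademacher) training
# sequence with entries +/-1 with equal probability.
rng = np.random.default_rng(3)
p0, m0, nT = 5, 10, 64

delays = rng.uniform(0.0, 1.0, size=p0)          # normalized delays in (0, 1]
dopplers = rng.uniform(-0.5, 0.5, size=p0)       # normalized Doppler shifts
attens = rng.rayleigh(scale=1.0, size=p0)        # Rayleigh channel attenuations
x = rng.choice([-1.0, 1.0], size=nT + m0 - 1)    # BPSK training sequence
```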
B. Performance Comparisons
In the first numerical simulation, we compare the normalized mean-squared-error (NMSE) estimation performance of the different algorithms. We define the NMSE as

NMSE = ‖θ − θ̂‖ / ‖θ‖,   (43)

where θ is the true value of the target parameter and θ̂ denotes its estimated value. Results in Fig. 1 depict the NMSE, averaged over the total number of runs (10000 runs), of the estimation of the leaked channel matrix H using the proposed NADM and PLAN approaches, the sparsity-based method, and the OMP (optimized) method, with nT = 64 measurements of the training signal, m0 = 10, and p0 = 5. From Fig. 1,
we observe that both the convex and non-convex approaches
perform better than the method using the l1-norm alone to
promote the sparsity of the channel coefficients. This is due
to the fact that the new methods directly exploit the leakage
effects presented in Sections III and IV, while the sparsity-
based method ignores the leakage effect. Furthermore, our methods offer a 2 dB improvement, on average, over the basis expansion method proposed in [12], which does explicitly
consider leakage. To explore the sensitivity of the non-convex
NADM, we include ±2σ bars on each plot. The error bars
underscore that the NADM method has much higher variance
Fig. 2: Normalized mean-squared-error (NMSE) vs. signal-to-noise ratio (SNR) at the receiver side: channel Doppler shift estimation using the PLAN approach (effect of nT).
Fig. 3: Normalized mean-squared-error (NMSE) vs. signal-to-noise ratio (SNR) at the receiver side: channel Doppler shift estimation using the PLAN approach (effect of m0 and p0). Curves shown: (p0, m0) = (10, 4), (10, 6), (10, 8), (15, 4), and (20, 4).
than the other considered methods. Specifically, NADM has approximately twice the variance of PLAN. The best and the worst curves are determined from the best- and worst-performance cases over all the runs. From Fig. 1, we conclude that the NADM approach is very sensitive to initialization. Finding a good strategy for proper initialization of the NADM approach is an interesting research problem and is left for future research.
C. Doppler Estimation and Resolution Constraint
As discussed in Section IV-B2, in our proposed PLAN approach, we compute the Doppler shift parameters ν_k, 1 ≤ k ≤ p0, directly by finding the roots of the polynomial Q(ν) defined in Equation (41); we then use the evaluated Doppler shifts to estimate the leaked channel gains using Equation (8).
Fig. 4: Doppler shifts are the roots of Q(ν). This figure depicts 1 − Q(ν) vs. ν, for nT = 100, p0 = 5, and m0 = 10.
In Fig. 2, with m0 = 10 and p0 = 5, the NMSE of our proposed convex algorithm, PLAN, for the estimation of the Doppler shift parameters is shown as a function of SNR. In this figure, for SNR ≥ 5 dB, we see that, for all values of nT, the proposed algorithm can estimate the Doppler parameters with at most 0.01 (normalized) error. Furthermore, the accuracy of the Doppler shift estimation improves as the number of measurements increases. Fig. 3 shows the effect of changing m0 and p0 for nT = 64. As can clearly be seen from Fig. 3, as m0 or p0 increases, the accuracy of the Doppler shift estimation does not degrade much as long as the number of measurements is sufficient, i.e., on the order of m0 p0; however, the performance degrades much more if the quantity m0 p0 exceeds nT by a large factor. From Fig. 2 and Fig. 3, we conclude that our PLAN approach can perform quite well with a number of measurements only on the order of a constant factor of m0 p0.
Fig. 4 and Fig. 5 illustrate the Doppler shift recovery using the (dual) function μ(ν). The channel Doppler shift parameters used to generate these figures were ν_k ∈ {−0.4, −0.2, 0, 0.2, 0.3}. It is clear that, by increasing nT from 100 to 200 measurements, the spurious peaks in the curve in Fig. 5 become much smaller than the peaks at the locations of the Doppler shifts of the channel. In other words, by increasing the number of measurements nT, the resolution constraint in Theorem 3 becomes smaller, and we are able to estimate the Doppler shifts with higher accuracy.
D. Bit Error Performance Comparison
In this section, we compare the performance of our PLAN method and the element-sparsity-only (sparse approximation) approach based on the l1-norm for data recovery during the data transmission phase. For this comparison, we build a minimum mean-square error (MMSE) equalizer from the channel matrix estimated by each of the two algorithms to equalize the channel distortion (see, for example, Chapter 16.2 in [21]). We
Fig. 5: Doppler shifts are the roots of Q(ν). This figure depicts 1 − Q(ν) vs. ν, for nT = 200, p0 = 5, and m0 = 10.
consider m0 = 10, p0 = 5, and nT = 150 during the channel
estimation phase. Then, we generate n = 1000 random bits and modulate them by BPSK signaling to compute the bit-error-rate (BER) during data transmission. The results in Fig. 6 show the BER vs. SNR performance of these algorithms. The label "PLAN-not satisfied" indicates that the resolution constraint is not met in the channel for the Doppler shift realizations, while "PLAN-satisfied" indicates that the Doppler shifts are well-separated, as suggested in Theorem 3. For both cases (well-separated and not), we generate random values of the Doppler shifts: we first select ν uniformly from [−0.5, 0.5] and, for each realization, check whether the resolution constraint is met. If it is, the set of values is labeled "well-separated"; if it is not, it is labeled "not well-separated." Simulations/computations are then done for these two sets of values and the performance is averaged accordingly. The performance of the
sparse approximation method does not depend on the Doppler shift separation constraint; thus, in Fig. 6, only one curve for this method is provided. From Fig. 6, we see that PLAN offers significantly superior performance over the sparsity-only method even if the resolution constraint is not met. Furthermore, additional improvement in performance is achieved if the constraint is met, i.e., if the Doppler shifts are sufficiently separated. Additionally, a slight performance improvement over [12] is observed if the resolution constraint is not met, especially at low SNR values, whereas a much stronger improvement is observed if the resolution constraint is met.
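The MMSE equalization step used above can be sketched for a generic linear observation model y = H s + z. The matrix H below is a random stand-in, not the estimated leaked channel matrix of the paper, and all sizes are illustrative:

```python
import numpy as np

# Hedged sketch of an MMSE equalizer (see, e.g., Chapter 16.2 in [21]):
# for y = H s + z with noise variance sigma2, apply
#   W = (H^H H + sigma2 I)^{-1} H^H
# and detect BPSK symbols by taking the sign of the equalized output.
rng = np.random.default_rng(4)
n_sym, n_obs, sigma2 = 50, 60, 1e-4

H = rng.standard_normal((n_obs, n_sym)) / np.sqrt(n_obs)   # stand-in channel matrix
s = rng.choice([-1.0, 1.0], size=n_sym)                    # BPSK data symbols
z = np.sqrt(sigma2) * rng.standard_normal(n_obs)           # additive noise
y = H @ s + z                                              # observations

W = np.linalg.inv(H.T @ H + sigma2 * np.eye(n_sym)) @ H.T  # MMSE equalizer
s_hat = np.sign(W @ y)                                     # symbol decisions
```

At this SNR the regularized least-squares inverse recovers the transmitted BPSK symbols essentially error-free, which is the regime the BER curves approach at high SNR.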
VI. CONCLUSIONS
In this work, we develop new parametric descriptions
of narrow-band time-varying channels which lead to novel
channel estimation strategies. In particular, the representations
admit a low-rank, bilinear form as functions of the Doppler,
delay, and pulse shape functions. These forms suggest matrix
factorization methods for channel estimation. Despite the
popularity of Wirtinger flow methods for solving non-convex
optimizations with good initializations, we show that such
Fig. 6: BER vs. SNR performance comparison. The label "PLAN-not satisfied" indicates that the resolution constraint is not enforced in the channel realization, while "PLAN-satisfied" indicates that the Doppler shifts are well-separated, as given in Theorem 3.
a strategy does not work well for our channel estimation
application. Thus, the alternating/gradient descent approach
(NADM) for the non-convex objective function suffers from
local minima. If the channel is identifiable, it can be shown that, in the noiseless case, the gradient is zero only at the true channel; however, it is also proven that the variance of the gradient is high, implying slow convergence. In contrast, a
different parametric representation leads to a convex optimiza-
tion wherein the atomic norm heuristic can be employed to
promote sparsity/low-rank. The parametric, low rank, atomic
norm (PLAN) approach can be implemented via a semi-
definite program. Uniqueness conditions are established for
PLAN as well as optimality conditions. Finally, a scaling
law is developed that is dependent on a practically achiev-
able resolution constraint (how close two paths can be in
Doppler and delay). A challenge for the future is to fully
incorporate the functional form of both Doppler and delay.
As true channel sparsity is lost when practical pulse shapes
are employed, the proposed methods offer strong improvement
over a classical estimation strategy which assumes a truly
sparse observation, i.e., LASSO. Simulation results show that the proposed methods can offer anywhere from 5 to 12 dB improvement over LASSO. Furthermore, a 2 dB improvement
is observed, on average, over the basis expansion method that
considers leakage effects.
APPENDIX A: PROOF OF THEOREM 1

We will show the contrapositive, i.e., that ∂J(H_t, H_ν)/∂H_t = 0 implies Δ = 0. Standard linear algebra, Wirtinger calculus [36], and the definition of the matrices X_n imply

∂J(H_t, H_ν)/∂H_t = ∂( Σ_{n=0}^{nT−1} | y[n] − Tr(X_n H_t H_ν^T) |² ) / ∂H_t
  = − Σ_{n=0}^{nT−1} ( Tr(X_n H_t^* H_ν^{*T}) − Tr(X_n H_t H_ν^T) )^* X_n H_ν
  = Σ_{n=0}^{nT−1} Tr(X_n^* Δ^H) x_n H_ν[n, :].   (44)

Clearly, if ∂J(H_t, H_ν)/∂H_t = 0, then we must have Tr(X_n^* Δ^H) x_n H_ν[n, :] = 0 for every 0 ≤ n ≤ nT − 1. Since x_n ≠ 0 (by domain restriction) and H_ν[n, :] ≠ 0^T (by definition), we must have Tr(X_n^* Δ^H) = 0 for every 0 ≤ n ≤ nT − 1, or equivalently Π(Δ) = 0 ⟺ Π(H_t^* H_ν^{*T}) = Π(H_t H_ν^T). The unique identifiability of the global optimum then implies H_t^* H_ν^{*T} = H_t H_ν^T, or equivalently Δ = 0. □
APPENDIX B: PROOF OF THEOREM 2

Before starting the proof of this theorem, we state Lemma 2, which is useful in the rest of the proof.

Lemma 2 (e.g., see [37]). Let X ∈ C^{m×n} and C ∈ C^{n×n} be arbitrary matrices, and let ⊗ denote the matrix Kronecker product operation. We have

Tr{X C X^H} = vec(X)^H (C ⊗ I) vec(X).   (45) □
Proof of Theorem 2: Let us define the shorthand D = H_ν H_ν^H, so that A = D ⊗ I. From (44) and the definition of the matrices X_n, we get

∂J(H_t, H_ν)/∂H_t = Σ_{n=0}^{nT−1} Tr(X_n^* Δ^H) x_n H_ν[n, :]
  = Σ_{n=0}^{nT−1} x_n x_n^H Δ^*[:, n] H_ν[n, :].
Since the training pilots are generated using a random BPSK sequence, we have E{x_n x_n^H} = I for every 0 ≤ n ≤ nT − 1 and, therefore,

E{∂J(H_t, H_ν)/∂H_t} = Σ_{n=0}^{nT−1} Δ^*[:, n] H_ν[n, :] = Δ^* H_ν.
Using Lemma 2, we have

‖E{∂J(H_t, H_ν)/∂H_t}‖_F² = Tr(Δ^* D Δ^T) = vec(Δ^*)^H (D ⊗ I) vec(Δ^*).   (46)
Next, we invoke some standard properties of the Kronecker matrix product [37]. We have (Y_1 ⊗ Y_2)^H = Y_1^H ⊗ Y_2^H for arbitrary matrices Y_1 and Y_2, so that B is Hermitian by the Hermitian symmetry of both D and I. Further, if Y_1 ∈ C^{p×p} and Y_2 ∈ C^{q×q} respectively have eigenvalues λ_i, 1 ≤ i ≤ p, and μ_j, 1 ≤ j ≤ q (listed with multiplicities), then Y_1 ⊗ Y_2 has the pq eigenvalues (with multiplicities) λ_i μ_j, (i, j) ∈ {1, ..., p} × {1, ..., q}. Thus, the minimum and maximum eigenvalues of B are (by definition) the same as those of D. Since

x^H D x = ‖H_ν^H x‖_2² ≤ ‖H_ν^H‖_2² ‖x‖_2²,

with equality achieved when x is the leading eigenvector of H_ν H_ν^H, we have 0 ⪯ D ⪯ ‖H_ν^H‖_2² I. Therefore, B is positive semidefinite with maximum eigenvalue ‖H_ν^H‖_2² = ‖H_ν‖_2², and (28) is proved. From (46) and standard linear algebra, we have

‖∂J(H_t, H_ν)/∂H_t‖_F²
  = Tr( [ Σ_{n=0}^{nT−1} x_n x_n^H Δ^*[:, n] H_ν[n, :] ]^H [ Σ_{n′=0}^{nT−1} x_{n′} x_{n′}^H Δ^*[:, n′] H_ν[n′, :] ] )
  = Σ_{n=0}^{nT−1} Σ_{n′=0}^{nT−1} Tr( H_ν[n, :]^H Δ^T[:, n] x_n x_n^H x_{n′} x_{n′}^H Δ^*[:, n′] H_ν[n′, :] ).
Since the training pilots are generated using a random BPSK sequence, we get

E{ ‖∂J(H_t, H_ν)/∂H_t‖_F² }
  = Σ_{n=0}^{nT−1} Σ_{n′=0}^{nT−1} Tr( H_ν[n, :]^H Δ^T[:, n] Δ^*[:, n′] H_ν[n′, :] )
    + (m0 − 1) Σ_{n=0}^{nT−1} Tr( H_ν[n, :]^H Δ^T[:, n] Δ^*[:, n] H_ν[n, :] )
  = ‖ Σ_{n=0}^{nT−1} Δ^*[:, n] H_ν[n, :] ‖_F² + (m0 − 1) Σ_{n=0}^{nT−1} ‖Δ[:, n]‖_2² ‖H_ν[n, :]‖_2²
  = ‖ E{∂J(H_t, H_ν)/∂H_t} ‖_F² + (m0 − 1) p0 Σ_{n=0}^{nT−1} ‖Δ[:, n]‖_2²,

where the expression for the expected gradient E{∂J(H_t, H_ν)/∂H_t} in the last equality follows from (46), and

‖H_ν[n, :]‖_2² = Σ_{l=1}^{p0} |H_ν[n, l]|² = Σ_{l=1}^{p0} 1 = p0.

This completes the proof of Theorem 2. □
APPENDIX C: PROOF OF THEOREM 3

According to our analysis in Section IV-B1, if we can design a vector-valued function (polynomial) μ(ν) that satisfies conditions (C-1) and (C-2), then the optimization problem in Equation (32) will recover the optimal solution of our channel estimation problem. In this proof, we use a technique called dual certifier construction [25], [26], [34]. Based on this technique, we construct a randomized dual polynomial μ(ν), i.e., the dual certifier, using the Fejér kernel [34], and show that, given a sufficient number of measurements, the constructed dual certifier satisfies both conditions (C-1) and (C-2) with high probability.

Consider the squared Fejér kernel [34]

f_r(ν) = (1/nT) Σ_{n=m0}^{nT+m0−1} f_n e^{−j2πnν},   (47)

where f_n = (1/nT) Σ_{k=n−nT}^{nT} (1 − |k|/nT)(1 − |n−k|/nT). Define the randomized matrix-valued version of the squared Fejér kernel as [25], [26], [34]

F_r(ν) = (1/nT) Σ_{n=m0}^{nT+m0−1} f_n e^{−j2πnν} x_n x_n^H.   (48)
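As a numerical illustration of the kernel's concentration (using the symmetric, unshifted index range for simplicity, i.e., ignoring the m0 modulation), the squared-kernel coefficients are the autocorrelation of the triangular Fejér coefficients, and the resulting kernel peaks at ν = 0:

```python
import numpy as np

# Squared Fejer kernel on the symmetric index range: the Fejer coefficients
# are triangular, squaring the kernel convolves them, and the resulting
# trigonometric polynomial concentrates its mass at nu = 0.
M = 16
k = np.arange(-M, M + 1)
tri = 1.0 - np.abs(k) / M                      # Fejer (triangular) coefficients
coef = np.convolve(tri, tri) / M               # squared-kernel coefficients f_n
n = np.arange(-2 * M, 2 * M + 1)               # index range after convolution

nu = np.linspace(-0.5, 0.5, 1001)
F = (np.exp(-2j * np.pi * np.outer(nu, n)) @ coef).real  # kernel values f_r(nu)
```

Since all coefficients are nonnegative, the maximum of F over the grid occurs at ν = 0, which is the localization property the dual certifier construction relies on.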
Since the training signal is generated using an i.i.d. random source with Rademacher distribution, it is clear that E{F_r(ν)} = f_r(ν) I_{m0} and, for its derivative, E{F_r′(ν)} = f_r′(ν) I_{m0}, where I_{m0} denotes the m0 × m0 identity matrix. Now, we define a candidate vector-valued dual certifier polynomial μ(ν) as

μ(ν) = Σ_{k=1}^{p0} F_r(ν − ν_k) α_k + F_r′(ν − ν_k) β_k,   (49)

where α_k = [α_{k,1}, ..., α_{k,m0}]^T and β_k = [β_{k,1}, ..., β_{k,m0}]^T are constant coefficient vectors. Clearly, the candidate μ(ν) defined in Equation (49) follows the valid form of μ(ν) given in Equation (39). The coefficients α_k and β_k are selected such that the candidate μ(ν) satisfies condition (C-2) and part of (C-1), namely

μ(ν_k) = sign(η_k) g_k,   (50)
μ′(ν_k) = 0, i.e., a maximum occurs at ν_k.   (51)

We can summarize the above equations as

Γ [α_1^T, ..., α_{p0}^T, γβ_1^T, ..., γβ_{p0}^T]^T = g,   (52)

where g = [sign(η_1) g_1^T, ..., sign(η_{p0}) g_{p0}^T, 0^T, ..., 0^T]^T, γ = √|f_r″(0)|, and the matrix Γ can be written as

Γ = (1/nT) Σ_{n=m0}^{nT+m0−1} (ν_n ν_n^H) ⊗ (x_n x_n^H) f_n,   (53)

where ⊗ is the Kronecker product and ν_n = [e^{−j2πnν_1}, ..., e^{−j2πnν_{p0}}, (j2πn/γ) e^{−j2πnν_1}, ..., (j2πn/γ) e^{−j2πnν_{p0}}]^H. Thus, if we show that Γ is invertible, then we can easily evaluate α_k and β_k from the system of equations in Equation (52), and accordingly μ(ν) will satisfy both (50) and (51). Lemma 3 shows that, for a sufficient number of measurements and well-separated Doppler shift parameters, the matrix Γ is invertible with high probability.
Lemma 3 (see Proposition 16 in [25] and Lemma 2.2 in [34]). Define the event E_ε = {‖Γ − E{Γ}‖ ≤ ε} for the generated i.i.d. random sequence x_n with Rademacher distribution. Then:
1) Let 0 < δ < 1 and |ν_i − ν_j| ≥ 1/nT for all i ≠ j. Then, for any ε ∈ (0, 0.5], as long as

nT ≥ (80 m0 p0 / ε²) log(4 m0 p0 / δ),

the event E_ε occurs with probability at least 1 − δ.
2) Define Γ̄ = E{Γ}. Let |ν_i − ν_j| ≥ 1/nT for all i ≠ j. Then Γ̄ is invertible.
3) Given that E_ε holds for an ε ∈ (0, 0.25], we have ‖Γ^{−1} − Γ̄^{−1}‖ ≤ 2ε ‖Γ̄^{−1}‖ and ‖Γ^{−1}‖ ≤ 2 ‖Γ̄^{−1}‖. □
Thus, the construction of μ(ν) in Equation (49) ensures condition (C-2) and μ′(ν_k) = 0 for all k. To complete the proof, we need to show that ‖μ(ν)‖_2 < 1 for all ν ∈ [−0.5, 0.5] \ T_ν, which guarantees condition (C-1). We show that this condition is satisfied by the proposed μ(ν) in Lemma 5 and Lemma 6. But before stating these lemmas, let us define some notation and state Lemma 4, which we use in the proofs of Lemmas 5 and 6. Let Γ^{−1} = [L R], where L ∈ C^{2m0p0×m0p0} and R ∈ C^{2m0p0×m0p0}; then, using (52), we have

[α_1^T, ..., α_{p0}^T, γβ_1^T, ..., γβ_{p0}^T]^T = L g.

If we multiply both sides of the above equation by

Ω^(m)(ν) = (1/γ^m) [F_r^(m)(ν − ν_1), ..., F_r^(m)(ν − ν_{p0}), (1/γ) F_r^(m+1)(ν − ν_1), ..., (1/γ) F_r^(m+1)(ν − ν_{p0})]^H,   (54)

where Ω^(m)(ν) denotes the mth-order derivative of the function Ω(ν) for m = 0, 1, 2, ..., then we can express the mth-order entry-wise derivative of μ(ν) as

(1/γ^m) μ^(m)(ν) = Ω^(m)(ν)^H L g.   (55)
Similarly, if we define μ̄(ν) = E{μ(ν)} and Γ̄^{−1} = [L̄ R̄], then we have

(1/γ^m) μ̄^(m)(ν) = [E{Ω^(m)(ν)}]^H (L̄ ⊗ I_{m0}) g.   (56)

Furthermore, we can write

Ω^(m)(ν) = (1/nT) Σ_{n=m0}^{nT+m0−1} (j2πn/γ)^m f_n e^{j2πnν} ν_n ⊗ x_n x_n^H,   (57)

and E{Ω^(m)(ν)} = ω^(m)(ν) ⊗ I, where

ω^(m)(ν) = (1/γ^m) [f_r^(m)(ν − ν_1), ..., f_r^(m)(ν − ν_{p0}), (1/γ) f_r^(m+1)(ν − ν_1), ..., (1/γ) f_r^(m+1)(ν − ν_{p0})]^H,

so that

E{Ω^(m)(ν)} = (1/nT) Σ_{n=m0}^{nT+m0−1} (j2πn/γ)^m f_n e^{j2πnν} (ν_n ⊗ I).   (58)

Now, using these relationships, we can use Lemma 4 to show that μ^(m)(ν) is concentrated around μ̄^(m)(ν) with high probability.
Lemma 4 (see the proof of Theorem 3 in [38]). Assume |ν_i − ν_j| ≥ 1/nT for all i ≠ j and let δ ∈ (0, 1). Then, for m = 0, 1, 2, ..., we have

(1/γ^m) ‖μ^(m)(ν) − μ̄^(m)(ν)‖_2 ≤ ε   (59)

for nT ≥ (c m0 p0 / ε²) log³(nT m0 p0 / (δε)), where c is a constant, with probability at least 1 − δ. □

Let us define T_ν^near = ∪_{k=1}^{p0} [ν_k − ν_ε, ν_k + ν_ε] and T_ν^far = [−0.5, 0.5] \ T_ν^near, where ν_ε = O(1/nT), e.g., ν_ε = 0.1/nT.
Lemma 5. Assume |ν_i − ν_j| ≥ 1/nT for all i ≠ j and let δ ∈ (0, 1). Then,

‖μ(ν)‖_2 < 1  for all ν ∈ T_ν^far,

with probability at least 1 − δ for nT ≥ c m0 p0 log³(nT m0 p0 / δ). □

Proof of Lemma 5: We start with a relationship that results from the triangle inequality,

‖μ(ν)‖_2 ≤ ‖μ(ν) − μ̄(ν)‖_2 + ‖μ̄(ν)‖_2,   (60)

where μ̄(ν) = E{μ(ν)}. Since, from Lemma 4, we know that ‖μ(ν) − μ̄(ν)‖_2 approaches zero with high probability, to complete the proof we just need to show that ‖μ̄(ν)‖_2 < 1 for ν ∈ T_ν^far. From (56), we have

‖μ̄(ν)‖_2 = sup_{x: ‖x‖_2=1} x^H μ̄(ν)   (61)
  = sup_{x: ‖x‖_2=1} x^H [E{Ω(ν)}]^H (L̄ ⊗ I) g   (62)
  = sup_{x: ‖x‖_2=1} Σ_{k=1}^{p0} [ω(ν)^H L̄]_k (x^H sign(η_k) g_k) < 0.99992.   (63)

The last inequality follows from the fact that |x^H sign(η_k) g_k| ≤ 1 and from the proof of Lemma 2.4 in [34] for ν ∈ T_ν^far (or see Lemma 10 in [26]).
Lemma 6. Assume |ν_i − ν_j| ≥ 1/nT for all i ≠ j and let δ ∈ (0, 1). Then,

‖μ(ν)‖_2 < 1  for all ν ∈ T_ν^near \ T_ν,

with probability at least 1 − δ for nT ≥ c m0 p0 log³(nT m0 p0 / δ). □

Proof of Lemma 6: Our choice of the coefficients implies that

d‖μ(ν)‖_2²/dν |_{ν=ν_k} = Re{2 μ(ν)^H dμ(ν)/dν} |_{ν=ν_k} = 0.

Thus, for ν ∈ [ν_k − ν_ε, ν_k + ν_ε], to prove the claim of the lemma, it is sufficient to show that d²‖μ(ν)‖_2²/dν² < 0.
Note that

(1/2) d²‖μ(ν)‖_2²/dν² = ‖μ′(ν)‖_2² + Re{μ″(ν)^H μ(ν)},   (64)

for ν ∈ T_ν^near. Using Lemma 4, we can write

(1/γ²) ‖μ′(ν)‖_2² = ‖(1/γ)(μ′(ν) − μ̄′(ν) + μ̄′(ν))‖_2² ≤ ε² + 2ε ‖μ̄′(ν)‖_2 / γ + ‖μ̄′(ν)‖_2² / γ².   (65)

Similar to the calculations in Lemmas 2.3 and 2.4 in [34], we have ‖μ̄′(ν)‖_2 ≤ 1.6 nT and γ > π nT / √3 for nT ≥ 2. Therefore, we have

(1/γ²) ‖μ′(ν)‖_2² ≤ ε² + 1.75ε + ‖μ̄′(ν)‖_2² / γ².   (66)
Similarly, we have ‖μ̄(ν)‖_2 ≤ 1 and ‖μ̄″(ν)‖_2 ≤ 21.15 nT² for ν ∈ T_ν^near; thus,

(1/γ²) Re{μ″(ν)^H μ(ν)} = (1/γ²) Re{(μ″(ν) − μ̄″(ν) + μ̄″(ν))^H (μ(ν) − μ̄(ν) + μ̄(ν))}
  ≤ ε² + 4.25ε + Re{μ̄″(ν)^H μ̄(ν)} / γ².   (67)

Therefore, substituting (66) and (67) into the equality (64), we have

(1/(2γ²)) d²‖μ(ν)‖_2²/dν² < 2ε² + 6ε + (1/γ²)( ‖μ̄′(ν)‖_2² + Re{μ̄″(ν)^H μ̄(ν)} ).   (68)
Similar to the argument in Lemma 2.3 in [34] (see Equation (2.19) and onward), we can conclude that

(1/γ²)( ‖μ̄′(ν)‖_2² + Re{μ̄″(ν)^H μ̄(ν)} ) ≤ −0.029.

Therefore, we have (1/(2γ²)) d²‖μ(ν)‖_2²/dν² ≤ 2ε² + 6ε − 0.029 < 0 for ε small enough, e.g., ε ≤ 10⁻⁵. This completes the proof. □

Putting the results of Lemma 5 and Lemma 6 together, Theorem 3 is proved, since μ(ν) is verified to satisfy both conditions (C-1) and (C-2) with high probability for a sufficient number of measurements. □
REFERENCES
[1] G. Matz, H. Bolcskei, and F. Hlawatsch, "Time-frequency foundations of communications: Concepts and tools," IEEE Signal Processing Magazine, vol. 30, pp. 87–96, Nov. 2013.
[2] S. Beygi, U. Mitra, and E. G. Ström, "Nested sparse approximation: Structured estimation of V2V channels using geometry-based stochastic channel model," IEEE Transactions on Signal Processing, vol. 63, pp. 4940–4955, Sept. 2015.
[3] G. A. Hollinger, S. Choudhary, P. Qarabaqi, C. Murphy, U. Mitra, G. S. Sukhatme, M. Stojanovic, H. Singh, and F. Hover, "Underwater data collection using robotic sensor networks," IEEE Journal on Selected Areas in Communications, vol. 30, pp. 899–911, June 2012.
[4] S. Beygi and U. Mitra, "Multi-scale multi-lag channel estimation using low rank approximation for OFDM," IEEE Transactions on Signal Processing, vol. 63, pp. 4744–4755, Sept. 2015.
[5] P. Bello, "Characterization of randomly time-variant linear channels," IEEE Transactions on Communications Systems, vol. 11, pp. 360–393, Dec. 1963.
[6] E. Aktas and U. Mitra, "Single-user sparse channel acquisition in multiuser DS-CDMA systems," IEEE Transactions on Communications, vol. 51, pp. 682–693, Apr. 2003.
[7] C. Carbonelli, S. Vedantam, and U. Mitra, "Sparse channel estimation with zero tap detection," IEEE Transactions on Wireless Communications, vol. 6, pp. 1743–1763, May 2007.
[8] S. F. Cotter and B. D. Rao, "Sparse channel estimation via matching pursuit with application to equalization," IEEE Transactions on Communications, vol. 50, pp. 374–377, Mar. 2002.
[9] J. Homer, I. Mareels, R. R. Bitmead, B. Wahlberg, and A. Gustafsson, "LMS estimation via structural detection," IEEE Transactions on Signal Processing, vol. 46, pp. 2651–2663, Oct. 1998.
[10] M. Kocic, D. Brady, and M. Stojanovic, "Sparse equalization for real-time digital underwater acoustic communications," in Proc. OCEANS '95 MTS/IEEE, vol. 3, pp. 1417–1422, Oct. 1995.
[11] C. R. Berger, S. Zhou, J. C. Preisig, and P. Willett, "Sparse channel estimation for multicarrier underwater acoustic communication: From subspace methods to compressed sensing," IEEE Transactions on Signal Processing, vol. 58, pp. 1708–1721, Mar. 2010.
[12] G. Tauböck, F. Hlawatsch, D. Eiwen, and H. Rauhut, "Compressive estimation of doubly selective channels in multicarrier systems: Leakage effects and sparsity-enhancing processing," IEEE Journal of Selected Topics in Signal Processing, vol. 4, pp. 255–271, Apr. 2010.
[13] W. U. Bajwa, J. Haupt, A. M. Sayeed, and R. Nowak, "Compressed channel sensing: A new approach to estimating sparse multipath channels," Proceedings of the IEEE, vol. 98, pp. 1058–1076, June 2010.
[14] N. Michelusi, U. Mitra, A. F. Molisch, and M. Zorzi, "UWB sparse/diffuse channels, Part I: Channel models and Bayesian estimators," IEEE Transactions on Signal Processing, vol. 60, pp. 5307–5319, Oct. 2012.
[15] N. Michelusi, U. Mitra, A. F. Molisch, and M. Zorzi, "UWB sparse/diffuse channels, Part II: Estimator analysis and practical channels," IEEE Transactions on Signal Processing, vol. 60, pp. 5320–5333, Oct. 2012.
[16] S. Beygi and U. Mitra, "Structured estimation of time-varying narrowband wireless communication channels," in Proc. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3529–3533, Mar. 2017.
[17] S. Beygi and U. Mitra, "Time-varying narrowband channel estimation: Exploiting low-rank and sparsity via bilinear representation," in Proc. 2016 50th Asilomar Conference on Signals, Systems and Computers, pp. 1235–1239, Nov. 2016.
[18] D. Eiwen, G. Tauböck, F. Hlawatsch, and H. G. Feichtinger, "Compressive tracking of doubly selective channels in multicarrier systems based on sequential delay-Doppler sparsity," in Proc. 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2928–2931, May 2011.
[19] D. Eiwen, G. Tauböck, F. Hlawatsch, H. Rauhut, and N. Czink, "Multichannel-compressive estimation of doubly selective channels in MIMO-OFDM systems: Exploiting and enhancing joint sparsity," in Proc. 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3082–3085, Mar. 2010.
[20] D. Eiwen, G. Tauböck, F. Hlawatsch, and H. G. Feichtinger, "Group sparsity methods for compressive channel estimation in doubly dispersive multicarrier systems," in Proc. 2010 IEEE 11th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), pp. 1–5, June 2010.
[21] A. F. Molisch, Wireless Communications, vol. 34. John Wiley & Sons, 2012.
[22] S. Oymak, B. Recht, and M. Soltanolkotabi, "Sharp time–data tradeoffs for linear inverse problems," IEEE Transactions on Information Theory, vol. 64, pp. 4129–4158, June 2018.
[23] S. Tu, R. Boczar, M. Simchowitz, M. Soltanolkotabi, and B. Recht, "Low-rank solutions of linear matrix equations via Procrustes flow," in Proc. ICML'16, pp. 964–973, JMLR.org, 2016.
[24] M. R. Hestenes, "Multiplier and gradient methods," Journal of Optimization Theory and Applications, vol. 4, pp. 303–320, Nov. 1969.
[25] G. Tang, B. N. Bhaskar, P. Shah, and B. Recht, "Compressed sensing off the grid," IEEE Transactions on Information Theory, vol. 59, pp. 7465–7490, Nov. 2013.
[26] Y. Chi, "Guaranteed blind sparse spikes deconvolution via lifting and convex optimization," IEEE Journal of Selected Topics in Signal Processing, vol. 10, pp. 782–794, June 2016.
[27] S. Choudhary, S. Beygi, and U. Mitra, "Delay-Doppler estimation via structured low-rank matrix recovery," in Proc. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3786–3790, Mar. 2016.
[28] M. Fazel, H. Hindi, and S. Boyd, "Rank minimization and applications in system theory," in Proc. 2004 American Control Conference, vol. 4, pp. 3273–3278, June 2004.
[29] B. Recht, M. Fazel, and P. A. Parrilo, "Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization," SIAM Review, vol. 52, pp. 471–501, Aug. 2010.
[30] J. P. Haldar and D. Hernando, "Rank-constrained solutions to linear matrix equations using PowerFactorization," IEEE Signal Processing Letters, vol. 16, pp. 584–587, July 2009.
[31] R. Willoughby, "Solutions of Ill-Posed Problems (A. N. Tikhonov and V. Y. Arsenin)," SIAM Review, vol. 21, no. 2, pp. 266–267, 1979.
[32] W. H. Press, B. P. Flannery, S. A. Teukolsky, W. T. Vetterling, and P. B. Kramer, Numerical Recipes: The Art of Scientific Computing. AIP, 1987.
[33] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, "The convex geometry of linear inverse problems," Foundations of Computational Mathematics, vol. 12, pp. 805–849, Dec. 2012.
[34] E. J. Candès and C. Fernandez-Granda, "Towards a mathematical theory of super-resolution," Communications on Pure and Applied Mathematics, vol. 67, pp. 906–956, June 2014.
[35] M. Grant and S. Boyd, "CVX: Matlab software for disciplined convex programming, version 2.1." http://cvxr.com/cvx, Mar. 2014.
[36] R. Remmert, Theory of Complex Functions, vol. 122. Springer Science & Business Media, 2012.
[37] K. B. Petersen, M. S. Pedersen, et al., "The Matrix Cookbook," Technical University of Denmark, vol. 7, p. 15, 2008.
[38] R. Heckel and M. Soltanolkotabi, "Generalized line spectral estimation via convex optimization," IEEE Transactions on Information Theory, vol. 64, pp. 4001–4023, June 2018.