
    From Denoising to Compressed Sensing
    Christopher A. Metzler, Arian Maleki, and Richard G. Baraniuk

    Abstract— A denoising algorithm seeks to remove noise, errors, or perturbations from a signal. Extensive research has been devoted to this arena over the last several decades, and as a result, today's denoisers can effectively remove large amounts of additive white Gaussian noise. A compressed sensing (CS) reconstruction algorithm seeks to recover a structured signal acquired using a small number of randomized measurements. Typical CS reconstruction algorithms can be cast as iteratively estimating a signal from a perturbed observation. This paper answers a natural question: How can one effectively employ a generic denoiser in a CS reconstruction algorithm? In response, we develop an extension of the approximate message passing (AMP) framework, called denoising-based AMP (D-AMP), that can integrate a wide class of denoisers within its iterations. We demonstrate that, when used with a high-performance denoiser for natural images, D-AMP offers state-of-the-art CS recovery performance while operating tens of times faster than competing methods. We explain the exceptional performance of D-AMP by analyzing some of its theoretical features. A key element in D-AMP is the use of an appropriate Onsager correction term in its iterations, which coerces the signal perturbation at each iteration to be very close to the white Gaussian noise that denoisers are typically designed to remove.

    Index Terms— Compressed sensing, denoiser, approximate message passing, Onsager correction.

    I. INTRODUCTION

    A. Compressed Sensing

    THE fundamental challenge faced by a compressed sensing (CS) reconstruction algorithm is to reconstruct a high-dimensional signal from a small number of measurements. The process of taking compressive measurements can be thought of as a linear mapping of a length-n signal vector x_o to a length-m, m ≪ n, measurement vector y. Because this process is linear, it can be modeled by a measurement matrix Φ ∈ C^{m×n}. The matrix Φ can take on a variety of physical interpretations: In a compressively sampled MRI, Φ might be sampled rows of an n × n Fourier matrix [1], [2]. In a single pixel camera, Φ might be a sequence of 1s and 0s representing the modulation of a micromirror array [3].

    Manuscript received June 16, 2014; revised March 23, 2016; accepted March 24, 2016. Date of publication April 20, 2016; date of current version August 16, 2016. C. A. Metzler was supported in part by the National Science Foundation (NSF) through the Graduate Research Fellowship Program, in part by the Department of Defense (DOD) through the National Defense Science and Engineering Graduate Program, in part by NSF under Grant CCF1527501, in part by the Air Force Office of Scientific Research under Grant FA9550-14-1-0088, in part by the Army Research Office under Grant W911NF-15-1-0316, in part by the Office of Naval Research under Grant N00014-12-1-0579, and in part by DOD under Grant HM04761510007. A. Maleki was supported by NSF under Grant CCF-1420328. R. G. Baraniuk was supported in part by NSF under Grant CCF1527501, in part by the Air Force Office of Scientific Research under Grant FA9550-14-1-0088, in part by the Army Research Office under Grant W911NF-15-1-0316, in part by the Office of Naval Research under Grant N00014-12-1-0579, and in part by DOD under Grant HM04761510007. This paper was presented at the 2015 International Conference on Image Processing and the 2015 Sampling Theory and its Applications Workshop.

    C. A. Metzler and R. G. Baraniuk are with the Department of Electrical and Computer Engineering, Rice University, Houston, TX 77023 USA (e-mail: [email protected]; [email protected]).

    A. Maleki is with the Department of Statistics, Columbia University in the City of New York, New York, NY 10023 USA (e-mail: [email protected]).

    Communicated by H. Pfister, Associate Editor for Coding Theory. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIT.2016.2556683

    Oftentimes a signal x_o is sparse (or approximately sparse) in some transform domain, i.e., x_o = Ψu with sparse u, where Ψ represents the inverse transform matrix. In this case we lump the measurement and transformation into a single measurement matrix A = ΦΨ. When a sparsifying basis is not used, A = Φ. Future references to the measurement matrix refer to A.

    The compressed sensing reconstruction problem is to determine which signal x_o produced y when sampled according to y = A x_o + w, where w represents measurement noise. Because A ∈ R^{m×n} and m ≪ n, the problem is severely under-determined.¹ Therefore, to recover x_o one first assumes that x_o possesses a certain structure and then searches, among all the vectors x that satisfy y ≈ Ax, for one that also exhibits the given structure. In the case of sparse x_o, one recovery method is to solve the convex problem

    minimize_x ‖x‖_1 subject to ‖y − Ax‖_2^2 ≤ λ,   (1)

    which is known formally as basis pursuit denoising (BPDN). It was first shown in [4] and [5] that if x_o is sufficiently sparse and A satisfies certain properties, then (1) can accurately recover x_o.

    The initial work in CS solved (1) using convex programming methods. However, when dealing with large signals, such as images, these convex programs are extremely computationally demanding. Therefore, lower cost iterative algorithms were developed, including matching pursuit [6], orthogonal matching pursuit [7], iterative hard-thresholding [8], compressive sampling matching pursuit [9], approximate message passing [10], and iterative soft-thresholding [11]–[16], to name just a few. See [17] and [18] for a complete set of references.

    Iterative thresholding (IT) algorithms generally take the form

    x^{t+1} = η_τ(A* z^t + x^t),
    z^t = y − A x^t,   (2)

    where η_τ(y) is a shrinkage/thresholding non-linearity, x^t is the estimate of x_o at iteration t, and z^t denotes the estimate of the residual y − A x_o at iteration t.

    ¹Note that for notational simplicity in our current derivations and algorithms we restrict A to be in R^{m×n}. However, an extension to C^{m×n} is also possible.



    Fig. 1. QQplot comparing the distributions of the effective noise of the IST and AMP algorithms at iteration 5 while reconstructing a 50% sampled Barbara test image. Notice the heavy tailed distribution of IST. AMP remains Gaussian because of the Onsager correction term.

    When η_τ(y) = (|y| − τ)_+ sign(y), the algorithm is known as iterative soft-thresholding (IST).

    AMP extends iterative soft-thresholding by adding an extra term to the residual known as the Onsager correction term:

    x^{t+1} = η_τ(A* z^t + x^t),
    z^t = y − A x^t + (1/δ) z^{t−1} ⟨η′_τ(A* z^{t−1} + x^{t−1})⟩.   (3)

    Here, δ = m/n is a measure of the under-determinacy of the problem, ⟨·⟩ denotes the average of a vector, and (1/δ) z^{t−1} ⟨η′_τ(A* z^{t−1} + x^{t−1})⟩, where η′_τ represents the derivative of η_τ, is the Onsager correction term. The role of this term is illustrated in Figure 1. This figure compares the QQplot² of x^t + A* z^t − x_o for IST and AMP. We call this quantity the effective noise of the algorithm at iteration t. As is clear from the figure, the QQplot of the effective noise in AMP is a straight line. This means that the noise is approximately Gaussian. This important feature enables the accurate analysis of the algorithm [10], [19], the optimal tuning of the parameters [20], and leads to the linear convergence of x^t to the final solution [21]. We will employ this important feature of AMP in our work as well.
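    To make the role of the Onsager correction concrete, the following sketch contrasts the IST iteration (2) with the AMP iteration (3) for the soft-thresholding nonlinearity. It is an illustrative toy implementation only: the fixed threshold, iteration count, and NumPy-based setup are our own choices for the example, not the tuned algorithms studied later in the paper.

        import numpy as np

        def soft(u, tau):
            # Soft-thresholding nonlinearity: eta_tau(u) = (|u| - tau)_+ sign(u).
            return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

        def ist_amp(y, A, tau, iters=30, onsager=True):
            # Runs IST (onsager=False) or AMP (onsager=True), following Eqs. (2)-(3).
            m, n = A.shape
            delta = m / n
            x, z = np.zeros(n), y.copy()
            for _ in range(iters):
                pseudo_data = A.T @ z + x                     # A* z^t + x^t
                x_new = soft(pseudo_data, tau)                # x^{t+1}
                z_new = y - A @ x_new
                if onsager:
                    # (1/delta) z^t <eta'_tau(A* z^t + x^t)>; eta' is an indicator of
                    # the coordinates that survive the threshold.
                    z_new = z_new + z * np.mean(np.abs(pseudo_data) > tau) / delta
                x, z = x_new, z_new
            return x

    Plotting the entries of pseudo_data − x_o at some iteration for the two settings reproduces the qualitative behavior in Figure 1: with the correction the effective noise looks Gaussian, without it the distribution is heavy tailed.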

    B. Main Contributions

    A sparsity model is accurate for many signals and has been the focus of the majority of CS research. Unfortunately, sparsity-based methods are less appropriate for many imaging applications. The reason for this failure is that natural images do not have an exactly sparse representation in any known basis (DCT, wavelet, curvelet, etc.). Figure 2 shows the wavelet coefficients of the classic signal processing image Barbara. The majority of the coefficients are non-zero and many are far from zero. As a result, algorithms that seek only wavelet-sparsity fail to recover the signal.

    In response to this failure, researchers have considered more elaborate structures for CS recovery. These include minimal total variation [1], [22], block sparsity [23], wavelet tree sparsity [24], [25], hidden Markov mixture models [26]–[28], non-local self-similarity [29]–[31], and simple representations in adaptive bases [32], [33]. Many of these approaches have led to significant improvements in imaging tasks.

    ²A QQplot is a visual inspection tool for checking the Gaussianity of the data. In a QQplot, deviation from a straight line is evidence of non-Gaussianity.

    Fig. 2. Histogram of the Daubechies 4 wavelet coefficients of the Barbara test image. Notice the non-sparse distribution of the coefficients. Sparsity-based compressed sensing algorithms fail because of this distribution.

    In this paper, we take a complementary approach to enhancing the performance of CS recovery of non-sparse signals [34]. Rather than focusing on developing new signal models, we demonstrate how the existing rich literature on signal denoising can be leveraged for enhanced CS recovery.³ The idea is simple: Signal denoising algorithms (whether based on an explicit or implicit model) have been developed and optimized for decades. Hence, any CS recovery scheme that employs such denoising algorithms should be able to capture complicated structures that have heretofore not been captured by existing CS recovery schemes.

    The approximate message passing algorithm (AMP) [21], [35] presents a natural way to employ denoising algorithms for CS recovery. We call the version of AMP that employs a denoiser D "D-AMP". D-AMP assumes that x_o belongs to a class of signals C ⊂ R^n, such as the class of natural images of a certain size, for which a family of denoisers {D_σ : σ > 0} exists. Each denoiser D_σ can be applied to x_o + σz with z ∼ N(0, I) and will return an estimate of x_o that is hopefully closer to x_o than x_o + σz. These denoisers may employ simple structures such as sparsity or much more complicated structures, which we will discuss in Section VII-A. In this paper we treat each denoiser as a black box; it receives a signal plus Gaussian noise and returns an estimate of x_o. Hence, we do not assume any knowledge of the signal structure/information the denoising algorithm is employing to achieve its goal. This makes our derivations applicable to a wide variety of signal classes and a wide variety of denoisers.

    ³In this paper, denoising refers to any algorithm that receives x_o + σz, where σz ∼ N(0, σ² I) denotes the noise, as its input and returns an estimate of x_o as its output. Refer to Sections III-B and VII-A for more information on denoisers.


    Fig. 3. Reconstructions of a piecewise constant signal that was sampled at a rate of δ = 1/3. Notice that NLM-AMP successfully reconstructs the piecewise constant signal whereas AMP, which is based on wavelet thresholding, does not.

    D-AMP has several advantages over existing CS recovery algorithms: (i) It can be easily applied to many different signal classes. (ii) It outperforms existing algorithms and is extremely robust to measurement noise (our simulation results are summarized in Section VII). (iii) It comes with an analysis framework that not only characterizes its fundamental limits, but also suggests how we can best use the framework in practice.

    D-AMP employs a denoiser in the following iteration:

    x^{t+1} = D_{σ̂^t}(x^t + A* z^t),
    z^t = y − A x^t + z^{t−1} div D_{σ̂^{t−1}}(x^{t−1} + A* z^{t−1}) / m,
    (σ̂^t)² = ‖z^t‖_2^2 / m.   (4)

    Here, x^t is the estimate of x_o at iteration t and z^t is an estimate of the residual. As we will show later, x^t + A* z^t can be written as x_o + v^t, where v^t can be considered as i.i.d. Gaussian noise.⁴ σ̂^t is an estimate of the standard deviation of that noise. div D_{σ̂^{t−1}} denotes the divergence of the denoiser.⁵

    The term z^{t−1} div D_{σ̂^{t−1}}(x^{t−1} + A* z^{t−1})/m is the Onsager correction term. We will show later that this term has a major impact on the performance of the algorithm. The explicit calculation of this term is not always straightforward: many popular denoisers do not have explicit formulations. However, we will show that it may be approximately calculated without requiring the explicit form of the denoiser.
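    One standard way to approximate the divergence of a black-box denoiser, in the spirit of Monte Carlo SURE, is to probe the denoiser with a small random perturbation and difference the outputs. The sketch below is only illustrative: the single probe vector, the probe scale eps, and the denoiser(x, sigma) interface are assumptions made for the example, not necessarily the approach detailed in Section V.

        import numpy as np

        def mc_divergence(denoiser, x, sigma, eps=1e-3, rng=None):
            # Monte Carlo estimate of div D_sigma(x) with a single N(0, I) probe b:
            #   div D(x)  ~=  b^T (D(x + eps*b) - D(x)) / eps.
            rng = np.random.default_rng() if rng is None else rng
            b = rng.standard_normal(x.shape)
            return float(np.sum(b * (denoiser(x + eps * b, sigma) - denoiser(x, sigma))) / eps)

    Averaging over a few independent probes reduces the variance of the estimate; for high-dimensional images a single probe is often adequate.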

    D-AMP applies an existing denoising algorithm to vectors that are generated from compressive measurements. The intuition is that at every iteration D-AMP obtains a better estimate of x_o and that this sequence of estimates eventually converges to x_o.

    To predict the performance of D-AMP, we will employ a novel state evolution framework to theoretically track the standard deviation of the noise, σ̂^t, at each iteration of D-AMP.

    ⁴This conjecture has been validated empirically elsewhere [21], [35], [36] for simpler denoisers. For known σ^t the conjecture has been proven for scalar denoisers in [37]. By combining the proof of [37] with the proof technique developed in [38] we can prove the above conjecture for the scalar denoisers. Since scalar denoisers are not of our main concern in this paper, we do not include a proof here. We will present empirical evidence that it holds for many of the state-of-the-art image denoising algorithms.

    ⁵In the context of this work the divergence div D(x) is simply the sum of the partial derivatives with respect to each element of x, i.e., div D(x) = ∑_{i=1}^{n} ∂D_i(x)/∂x_i, where x_i is the i-th element of x.

    Our framework extends and validates the state evolution framework proposed in [10] and [21]. Through extensive simulations we show that in high-dimensional settings (for the subset of denoisers that we consider in this paper) our state evolution predicts the mean square error (MSE) of D-AMP accurately. Based on the state evolution we characterize the performance of D-AMP and connect the number of measurements D-AMP requires to the performance of the denoiser. We also employ the state evolution to address practical concerns such as the tuning of the parameters of denoisers and the sensitivity of the algorithm to measurement noise. Furthermore, we use the state evolution to explore the optimality of D-AMP. We postpone a detailed discussion to Section III.

    Figure 3 compares the performance of the original AMP (which employs sparsity in the wavelet domain) with that of D-AMP based on the non-local means denoising algorithm [39], called NLM-AMP here. Since NLM is a better denoiser than wavelet thresholding for piecewise constant functions, NLM-AMP dramatically outperforms the original AMP. The details of our simulations are given in Section VII-D.

    C. Related Work

    1) Approximate Message Passing and Extensions: In the last five years, message passing and approximate message passing algorithms have been the subject of extensive research in the field of compressed sensing [10], [19], [21], [28], [36]–[38], [40]–[52]. Most previously published papers consider a Bayesian framework in which a signal prior p_x is defined on the class of signals C to which x_o belongs. Message passing algorithms are then considered as heuristic approaches to calculating the posterior mean, E(x_o | y, A). Message passing has been simplified to approximate message passing (AMP) by employing the high dimensionality of the data [10]. The state evolution framework has been proposed as a way of analyzing the AMP algorithm [10]. The main difference between this line of work and our work is that we do not assume any signal prior on the signal space C. This distinction introduces a difference between the state evolution framework we develop in this paper and the ones that have been developed elsewhere. We will highlight the connection between these two different state evolutions in Section IV.

    Note that in the development of D-AMP we are not concerned with whether or not the algorithm is approximating a posterior distribution for a certain prior.


    Nor are we concerned with whether or not the denoisers used within D-AMP's iterations are tied to any prior. Instead, we rely on only one important feature of AMP: that x^t + A* z^t − x_o behaves similar to i.i.d. Gaussian noise. Our analysis is based on this assumption. We validate this assumption with extensive simulations that are presented in Section VII-C.

    Donoho et al. [35] also extended the AMP framework based upon the fact that x^t + A* z^t − x_o behaves similar to i.i.d. Gaussian noise. In their framework the denoiser D can be any scale-invariant function. There are several major differences between our work and theirs: (i) We do not impose scale-invariance on the denoiser, because this assumption does not hold for many practical denoisers. (ii) We present a far broader validation of our method and state evolution: The empirical validation presented in [35] is concerned with very specific simple denoisers and has remained at the level of the maximin phase transition [21]. Likewise, the state evolution they employed in their empirical validation is based on the Bayesian framework described above and the validations are restricted to simple distributions. In this paper we consider a deterministic version of the state evolution and for the first time present evidence that such a state evolution can in fact predict the performance of D-AMP. Note that the evidence we present goes far beyond the match in the maximin phase transition. This development is important because the maximin framework employed in [35] is not useful in most practical applications that deal with naturally occurring signals. (iii) We present a signal-dependent parameter tuning strategy for AMP and show that our deterministic state evolution can cope with those situations as well. (iv) We show how practical denoisers whose explicit functional form is not given can be employed in AMP. (v) We investigate the optimality of D-AMP as a means to employ different denoisers in the AMP algorithm.

    While writing this paper, we became aware of another relevant paper about extensions to the AMP algorithm [53]. In this work, the authors employ AMP with scalar denoisers that are better adapted to the statistics of natural images. By doing so, they have obtained a major improvement over existing algorithms. In this paper, we consider a much broader class of denoisers. We not only show how the AMP algorithm can be adapted to such denoisers; we also explore the theoretical properties of our recovery algorithms.

    2) Model-Based CS Imaging: Many researchers have noticed the weakness of sparsity-based methods for imaging applications and have therefore explored the use of more complicated signal models. These models can be enforced explicitly, by constraining the solution space, or implicitly, by using penalty functionals to encourage solutions of a certain form.

    Initially these model-based methods were restricted to simple concepts like minimal total variation [22] and block sparsity [23], but they have since been extended to structures such as wavelet trees [24], [25] and mixture models [26]–[28]. Furthermore, some researchers have employed more complicated signal models through non-local regularization [29]–[31] and the use of adaptive over-complete dictionaries [32], [33]. A non-local regularization method, NLR-CS [31], represents the current state-of-the-art in CS recovery. Through the use of denoisers, rather than explicit models or penalty functionals, our algorithm outperforms these methods on standard test images.

    An additional reconstruction algorithm does not fit into any of the above categories but in many ways relates closely to our own. Egiazarian et al. developed a denoising-based CS recovery algorithm [54] that uses the same research group's BM3D denoising algorithm [55] to impose a non-parametric model on the reconstructed signal. This method solves the CS problem when the measurement matrix is a subsampled DFT matrix. The method iteratively adds noise to the missing part of the spectrum and then applies BM3D to the result. In [31] it was shown that the BM3D-based algorithm performed considerably worse than NLR-CS. Therefore it is not tested here.

    Finally, we should emphasize another major difference between our work and other approaches designed for imaging applications. D-AMP comes with an accurate analysis that explains the behavior of the algorithm, its optimality properties, and its limitations. Such an accurate analysis does not exist for other methods.

    D. Structure of the Paper

    The remainder of this paper is structured as follows: Section II introduces our D-AMP algorithm and some of its main features. Section III is devoted to the theoretical analysis of D-AMP and its optimality properties. Section IV establishes a connection between our state evolution and existing state evolutions. Section V explains two different approaches to calculating the Onsager correction term. Section VI explains how to smooth poorly behaved denoisers so that they can be used within our framework. Section VII summarizes our main simulation results: it provides evidence on the validity of our state evolution framework; it provides a detailed guideline on setting and tuning of different parameters of the algorithms; and it compares the performance of our D-AMP algorithm with the state-of-the-art algorithms in compressive imaging.

    II. DENOISING-BASED APPROXIMATE MESSAGE PASSING

    Consider a family of denoising algorithms D_σ for a class of signals C. Our goal is to employ these denoisers to obtain a good estimate of x_o ∈ C from y = A x_o + w, where w ∼ N(0, σ_w² I). We start with the following approach that is inspired by the iterative hard-thresholding algorithm [8] and its extensions for block-based compressive imaging [56]–[58]. To better understand this approach, consider the noiseless setting in which y = A x_o, and assume that the denoiser is a projection onto C. The affine subspace defined by y = Ax and the set C are illustrated in Figure 4. We assume that the point x_o is the unique point in the intersection of y = Ax and C.

    We know that the solution lies in the affine subspace {x | y = Ax}. Therefore, starting from x^0 = 0, we move in the direction that is orthogonal to the subspace, i.e., A*y. A*y is closer to the subspace; however, it is not necessarily close to C. Hence, we employ denoising (or projection in the figure) to obtain an estimate that satisfies the structure of our signal class C.


    Fig. 4. Reconstruction behavior of denoising-based iterative thresholding algorithm.

    Fig. 5. QQplot comparing the distribution of the effective noise of D-IT and D-AMP at iteration 5 while reconstructing a 50% sampled Barbara test image. Notice the highly Gaussian distribution of D-AMP and the slight deviation from Gaussianity at the ends of the D-IT QQplot. This difference is due to D-AMP's use of an Onsager correction term. The denoiser that is employed in these simulations is BM3D. BM3D will be reviewed in Section VII-A.

    After these two steps we obtain D(A*y). As is also clear in the figure, by repeating these two steps, i.e., moving in the direction of the gradient and then projecting onto C, our estimate may eventually converge to the correct solution x_o. This leads us to the following iterative algorithm⁶:

    x^{t+1} = D_{σ̂}(A* z^t + x^t),
    z^t = y − A x^t.   (5)

    For ease of notation, we have introduced the vector of the estimated residual as z^t. We call this algorithm denoising-based iterative thresholding (D-IT). Note that if we replace D (that was assumed to be a projection onto the set C in Figure 4) with a denoising algorithm, we implicitly assume that x^t + A* z^t can be modeled as x_o + v^t, where v^t ∼ N(0, (σ^t)² I) and is independent of x_o. Hence, by applying a denoiser we obtain a signal that is closer to x_o. Unfortunately, as is shown in Figure 5(a) (we will show stronger evidence in Section III-C), this assumption is not true for D-IT. This is the same phenomenon that we observed in Section I-A for iterative soft-thresholding.

    ⁶Note that if D is a projection operator onto C and C is a convex set, then this algorithm is known as projected gradient descent and is known to converge to the correct answer x_o.

    Our proposed solution to avoid the non-Gaussianity of the noise in the case of the iterative thresholding algorithms was to employ message passing/approximate message passing. Following the same path, we propose the following message passing algorithm:

    x^{t+1}_{·→a} = D_{σ̂^t}([∑_{b≠a} A_{b1} z^t_{b→1}, ∑_{b≠a} A_{b2} z^t_{b→2}, …, ∑_{b≠a} A_{bn} z^t_{b→n}]^T),
    z^t_{a→i} = y_a − ∑_{j≠i} A_{aj} x^t_{j→a}.   (6)

    Here, x^t_{·→a} = [x^t_{1→a}, x^t_{2→a}, …, x^t_{n→a}]^T provides an estimate of x_o. σ̂^t denotes the standard deviation of the vector

    v^t_{·→a} = [∑_{b≠a} A_{b1} z^t_{b→1}, ∑_{b≠a} A_{b2} z^t_{b→2}, …, ∑_{b≠a} A_{bn} z^t_{b→n}]^T − x_o.

    Our empirical findings, summarized in Section VII-C, show that v^t_{·→a} closely resembles i.i.d. Gaussian noise in high-dimensional settings (both m and n are large). This result has been rigorously proved for a class of scalar denoisers and can also be proved for a class of block-wise denoisers [35], [36].⁷

    Despite their advantage in avoiding the non-Gaussianity of the effective noise vector v, message passing algorithms have m (the number of measurements) different estimates of x_o; each x^t_{·→a} is an estimate of x_o. Similarly, they have n different estimates of the residual y − A x_o. The update of all these messages is computationally demanding. Fortunately, if the problem is high dimensional, we can approximate a message passing algorithm's iterations and obtain the denoising-based approximate message passing algorithm (D-AMP):

    x^{t+1} = D_{σ̂^t}(x^t + A* z^t),
    z^t = y − A x^t + z^{t−1} div D_{σ̂^{t−1}}(x^{t−1} + A* z^{t−1}) / m.   (7)

    The only difference between D-AMP and D-IT is again in the Onsager correction term z^{t−1} div D_{σ̂^{t−1}}(x^{t−1} + A* z^{t−1})/m. The derivation of D-AMP from the Denoising-based Message Passing (D-MP) algorithm is similar to the derivation of AMP from Message Passing (MP), which can be found in [21, Ch. 5]. Similar to D-IT, D-AMP relies on the assumption that the effective noise v^t = x^t + A* z^t − x_o resembles i.i.d. Gaussian noise (independent of the signal x_o) at every iteration. Our empirical findings confirm this assumption: Figure 5(b) displays the effective noise at iteration 5 of D-AMP with BM3D denoising, which will be briefly explained in Section VII-A (we call this algorithm BM3D-AMP). Notice the clearly Gaussian distribution.

    ⁷There are some subtle differences between our claim regarding the Gaussianity of v^t_{·→a} and the claims presented in other works. Our claim is made in a deterministic setting, while in existing works the Gaussianity claim is made in regards to stochastic settings. This point will be clarified in Section IV.


    Based on this observation, and stronger evidence that we will provide in Section VII-C, we conjecture that v^t indeed behaves as additive white Gaussian noise for high dimensional problems. The proof of this property is left for future work. In this paper we not only provide strong empirical evidence to support our conjecture, but also explore its theoretical implications.
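    For readers who prefer pseudocode, the following is a minimal NumPy sketch of the D-AMP iteration (7) with a black-box denoiser. The denoiser(x_noisy, sigma) interface, the single-probe Monte Carlo divergence, and the fixed number of iterations are illustrative assumptions for the sketch, not the implementation in the D-AMP toolbox.

        import numpy as np

        def d_amp(y, A, denoiser, iters=30, rng=None):
            # Sketch of D-AMP, Eq. (7). `denoiser(x_noisy, sigma)` is any black-box denoiser.
            rng = np.random.default_rng() if rng is None else rng
            m, n = A.shape
            x, z = np.zeros(n), y.copy()
            for _ in range(iters):
                sigma_hat = np.linalg.norm(z) / np.sqrt(m)        # (sigma_hat^t)^2 = ||z^t||^2 / m
                pseudo_data = x + A.T @ z                         # x^t + A* z^t ~= x_o + Gaussian noise
                x_new = denoiser(pseudo_data, sigma_hat)          # x^{t+1}
                # Monte Carlo estimate of div D_{sigma_hat}(pseudo_data) with one probe.
                b = rng.standard_normal(n)
                div = np.sum(b * (denoiser(pseudo_data + 1e-3 * b, sigma_hat)
                                  - denoiser(pseudo_data, sigma_hat))) / 1e-3
                z = y - A @ x_new + z * div / m                   # residual with Onsager correction
                x = x_new
            return x

    Any denoiser with this interface can be plugged in, from simple wavelet thresholding to BM3D-style routines; dropping the divergence-based term turns the sketch into D-IT (5).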

    III. THEORETICAL ANALYSIS OF D-AMP

    The main objective of this section is to characterize the theoretical properties of the D-AMP framework. In this section (and also in our simulations) we start the algorithm with x^0 = 0 and z^0 = y. Our analysis is under the high dimensional setting: n and m are very large while m/n = δ < 1 is a fixed number. δ is called the under-determinacy of the system of equations.

    A. Notation

    We use boldfaced capital letters such as A for matrices. For a matrix A; A_i, A_{i,j}, and A* denote its i-th column, ij-th element, and its transpose, respectively. Small letters such as x are reserved for vectors and scalars. For a vector x, x_i and ‖x‖_p denote the i-th element of the vector and its p-norm, respectively. The notations E and P denote the expected value of a random variable (or a random vector) and the probability of an event, respectively. If the expected value is with respect to two random variables X and Z, then E_X (or E_Z) denotes the expectation with respect to X (or Z) and E denotes the expected value with respect to both X and Z.

    B. Denoiser Properties

    The role of a denoiser is to estimate a signal x_o belonging to a class of signals C ⊂ R^n from the noisy observation x_o + σε, where ε ∼ N(0, I) and σ > 0 denotes the standard deviation of the noise. We let D_σ denote a family of denoisers indexed with the standard deviation of the noise. At every value of σ, D_σ takes x_o + σε as the input and returns an estimate of x_o.

    To analyze D-AMP, we require the denoiser family to be (near) proper, monotone, and Lipschitz continuous (proper and monotone are defined below). Because most denoisers easily satisfy these first two properties, and can be modified to satisfy the third (see Section VI), the requirements do not overly restrict our analysis.

    Definition 1: D_σ is called a proper family of denoisers of level κ (κ ∈ (0, 1)) for the class of signals C if

    sup_{x_o∈C} E‖D_σ(x_o + σε) − x_o‖_2^2 / n ≤ κσ²,   (8)

    for every σ > 0. Note that the expectation is with respect to ε ∼ N(0, I).

    To clarify the above definition, we consider the following examples:

    Example 1: Let C denote a k-dimensional subspace of R^n (k < n). Also, let D_σ(y) be the projection of y onto the subspace C, denoted by P_C(y). Then,

    E‖D_σ(x_o + σε) − x_o‖_2^2 / n = (k/n) σ²,

    for every x_o ∈ C and every σ². Hence, this family of denoisers is proper of level k/n.

    Proof: First note that since the projection onto a subspace is a linear operator and since P_C(x_o) = x_o, we have

    E‖P_C(x_o + σε) − x_o‖_2^2 = E‖x_o + σ P_C(ε) − x_o‖_2^2 = σ² E‖P_C(ε)‖_2^2.

    Also note that since P_C² = P_C, all the eigenvalues of P_C are either zero or one. Furthermore, since the null space of P_C is (n − k)-dimensional, the rank of P_C is k. Hence, P_C has k eigenvalues equal to 1 and the rest are zero. Hence ‖P_C(ε)‖_2^2 follows a χ² distribution with k degrees of freedom and E‖P_C(x_o + σε) − x_o‖_2^2 = kσ². ∎
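    As a quick numerical sanity check of Example 1 (purely illustrative; the dimensions, noise level, and random subspace are arbitrary choices):

        import numpy as np

        rng = np.random.default_rng(0)
        n, k, sigma = 200, 20, 0.5
        Q, _ = np.linalg.qr(rng.standard_normal((n, k)))   # orthonormal basis of a k-dim subspace
        P = Q @ Q.T                                        # orthogonal projector onto the subspace
        x_o = Q @ rng.standard_normal(k)                   # a signal lying in the subspace
        trials = 2000
        risk = np.mean([np.sum((P @ (x_o + sigma * rng.standard_normal(n)) - x_o) ** 2)
                        for _ in range(trials)]) / n
        print(risk, (k / n) * sigma ** 2)                  # the two numbers should nearly match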

    Next we consider a slightly more complicated example that has been popular in signal processing for the last twenty-five years. Let Γ_k denote the set of k-sparse vectors.

    Example 2: Let η(y; τσ) = (|y| − τσ)_+ sign(y) denote the family of soft-thresholding denoisers. Then

    sup_{x_o∈Γ_k} E‖η(x_o + σε; τσ) − x_o‖_2^2 / n = [ (1 + τ²) k/n + ((n − k)/n) E(η(ε_1; τ))² ] σ².

    Similar results can be found in other papers, including [19]. But since the proof is short and the result is slightly different from similar existing results, we include the proof here.

    Proof: For notational simplicity we assume that the first k coordinates of x_o are non-zero and the rest are equal to zero. Then

    E‖η(x_o + σε; τσ) − x_o‖_2^2 / (nσ²)
      = ∑_{i=1}^{k} E(η(x_{o,i} + σε_i; τσ) − x_{o,i})² / (nσ²) + ((n − k)/(nσ²)) E(η(σε_n; τσ))²
      = ∑_{i=1}^{k} E(η(x_{o,i}/σ + ε_i; τ) − x_{o,i}/σ)² / n + ((n − k)/n) E(η(ε_n; τ))².   (9)

    Note that E(η(x_{o,i}/σ + ε_i; τ) − x_{o,i}/σ)² is an increasing function of x_{o,i}/σ [59]. Therefore, it is straightforward to see that

    E(η(x_{o,i}/σ + ε_i; τ) − x_{o,i}/σ)² ≤ lim_{x_{o,i}→∞} E(η(x_{o,i}/σ + ε_i; τ) − x_{o,i}/σ)² = 1 + τ²,   (10)

    where the last step swaps the lim and E (by the dominated convergence theorem). We obtain the desired result by combining (9) and (10). ∎

    Note that the optimal threshold τ to use within soft-thresholding depends on the sparsity k/n of the signal being denoised. One can optimize the parameter τ for every value of k/n and obtain an optimized family of denoisers. Figure 6 displays the level κ of the optimized soft-thresholding in terms of k/n. Note that for sparse signals (k/n small) soft-thresholding is an effective denoiser and thus κ is small.
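    A curve of this kind can be computed directly from the expression in Example 2. The sketch below is illustrative only: the threshold grid and the use of SciPy's standard normal tail functions are our own choices, and it evaluates the worst-case expression of Example 2 over a grid of thresholds rather than any exact minimax quantity.

        import numpy as np
        from scipy.stats import norm

        def soft_threshold_level(rho, taus=np.linspace(0.0, 4.0, 401)):
            # Level kappa of soft-thresholding at sparsity rho = k/n, optimized over tau.
            # Uses E[eta(eps; tau)^2] = 2[(1 + tau^2) Q(tau) - tau * phi(tau)] for eps ~ N(0, 1).
            q, phi = norm.sf(taus), norm.pdf(taus)
            risk = (1 + taus ** 2) * rho + (1 - rho) * 2 * ((1 + taus ** 2) * q - taus * phi)
            return risk.min()

        print(soft_threshold_level(0.05), soft_threshold_level(0.2))   # smaller rho gives smaller kappa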

    The previous denoisers both utilized prior knowledge about the structure of the signal (its dimensionality and its sparsity) in order to denoise x_o. When nothing is known about x_o, a proper denoiser might be too much to ask for. For instance, consider the maximum likelihood estimator.


    Fig. 6. The level κ of the optimal soft-thresholding method as a function of normalized sparsity k/n. For sparse signals, soft-thresholding is a high performance denoiser.

    Example 3: If D_σ(x_o + σε) is the maximum likelihood estimate of x_o from x_o + σε, then

    E‖D_σ(x_o + σε) − x_o‖_2^2 / (nσ²) = 1.

    So, this family of denoisers is not proper of level κ for any κ < 1. The proof of this statement is straightforward and hence is skipped here. It has been shown ([61, Ch. 5]) that for any denoiser D̃_σ we have

    sup_{x_o∈R^n} E‖D̃_σ(x_o + σε) − x_o‖_2^2 / (nσ²) = 1.

    In this example the class of signals we have considered is generic and hence the denoiser cannot employ any specific structure in x_o.

    There are occasions when we want to deal with denoisers that are not proper because of an error/bias term that is independent of the noise level. To deal with scenarios such as these, we introduce the notion of a near proper family of denoisers.

    Definition 2: D_σ is called a near proper family of denoisers of levels κ (κ ∈ (0, 1)) and B (B ∈ R_+) for the class of signals C if

    sup_{x_o∈C} E‖D_σ(x_o + σε) − x_o‖_2^2 / n ≤ κσ² + B,   (11)

    for every σ > 0. Note that the expectation is with respect to ε ∼ N(0, I).

    As in Definition 1, the constants κ and B determine the quality of the denoiser family. Better denoisers have smaller constants.

    Example 4: Let C_p = {x ∈ R^n : ‖x‖_p ≤ 1} for some 0 < p ≤ 1.⁸ For a fixed k, let D_σ denote a denoiser that, through oracle information, knows the indices of the k largest elements of x and projects the noisy observation x_o + σε onto those coordinates. Then

    sup_{x_o∈C_p} E‖D_σ(x_o + σε) − x_o‖_2^2 / n ≤ (k/n) σ² + k^{1−2/p} / (n(2/p − 1)),

    for every x_o ∈ C_p and every σ². Hence, this family of denoisers is near proper with κ = k/n and B = k^{1−2/p} / (n(2/p − 1)).

    ⁸For every 0 < p ≤ 1, ‖x‖_p^p = ∑_{i=1}^{n} |x_i|^p.

    Proof: Let Ω denote the set of indices of the k largest coefficients of x_o. For a vector x, define x_Ω in the following way: x_{Ω,i} = x_i if i ∈ Ω and otherwise x_{Ω,i} = 0. Note that x_{o,Ω} is the best k-term approximation of x_o. We have

    E‖D_σ(x_o + σε) − x_o‖_2^2 = E‖x_{o,Ω} + σε_Ω − x_o‖_2^2 = ‖x_{o,Ω} − x_o‖_2^2 + σ² E‖ε_Ω‖_2^2.   (12)

    Following the same logic as used in Example 1 we see that

    E‖ε_Ω‖_2^2 = k.   (13)

    The term ‖x_{o,Ω} − x_o‖_2^2 above is simply the squared 2-norm of the smallest n − k values of x_o. Below we obtain an upper bound for this quantity. Note that since x_o ∈ C_p, we have

    ∑_{i=1}^{n} |x_{o,i}|^p ≤ 1.   (14)

    Let x_{o,(j)} denote the j-th largest element in absolute value of x_o. It is clear that |x_{o,(1)}| ≥ |x_{o,(2)}| ≥ … ≥ |x_{o,(j−1)}| ≥ |x_{o,(j)}|. Combining this fact with (14) we obtain j|x_{o,(j)}|^p ≤ 1, which in turn implies |x_{o,(j)}| ≤ j^{−1/p}. Returning to (12), we see that

    ‖x_{o,Ω} − x_o‖_2^2 ≤ ∑_{j=k+1}^{n} |x_{o,(j)}|² ≤ ∑_{j=k+1}^{n} j^{−2/p} ≤ ∫_k^∞ γ^{−2/p} dγ = γ^{1−2/p}/(1 − 2/p) |_k^∞ = k^{1−2/p}/(2/p − 1).   (15)

    Substituting (13) and (15) into (12) gives the desired result. ∎

    In subsequent sections we assume our signal belongs to a class C for which we have a proper or near proper family of denoisers D_σ. The class and denoiser can be very general. For instance, we may assume C to be the class of natural images and D_σ to denote the BM3D algorithm⁹ at different noise levels [55].

    Definition 3: We call a denoiser monotone if for every x_o its risk function

    R(σ², x_o) = E‖D_σ(x_o + σε) − x_o‖_2^2 / n,

    is a non-decreasing function of σ².

    We make a few remarks regarding monotone denoisers.

    Remark 1: Monotonicity is a natural property to expect from denoisers. Many standard denoisers such as soft-thresholding and group soft-thresholding are monotone if we optimize over the threshold parameter. See [20, Lemma 4.4] for more information.

    Remark 2: If a family of denoisers D_σ is not monotone, then it is straightforward to construct a new denoiser that outperforms D_σ. Here is a simple proof. Suppose that for σ_1 < σ_2 we have

    R(σ_1², x_o) > R(σ_2², x_o).

    ⁹We will review this algorithm briefly in Section VII-A.


    Then construct a new denoiser for noise level σ_1 in the following way:

    D̃_{σ_1}(y) = E_{ε̃} D_{σ_2}(y + √(σ_2² − σ_1²) ε̃),

    where ε̃ ∼ N(0, I) is independent of y and E_{ε̃} denotes the expected value with respect to ε̃. Let σ̃_2 = √(σ_2² − σ_1²). A simple application of Jensen's inequality shows that

    E_ε‖D̃_{σ_1}(x_o + σ_1 ε) − x_o‖_2^2 / n = E_ε‖E_{ε̃} D_{σ_2}(x_o + σ_1 ε + σ̃_2 ε̃) − x_o‖_2^2 / n ≤ E_{ε,ε̃}‖D_{σ_2}(x_o + σ_1 ε + σ̃_2 ε̃) − x_o‖_2^2 / n.

    Note that since ε̃ and ε are independent, E_{ε̃,ε}‖D_{σ_2}(x_o + σ_1 ε + σ̃_2 ε̃) − x_o‖_2^2 / n = R(σ_2², x_o). Therefore, D̃_{σ_1} improves D_{σ_1} and does not violate the monotone property. Hence, as is clear from this argument, non-monotone denoisers are not desirable in general, since we can easily improve them.

    In the rest of the paper we consider only monotone denoisers.

    C. State Evolution

    A key ingredient in our analysis of D-AMP is the state evolution: a series of equations that predict the intermediate MSE of AMP algorithms at each iteration. Here we introduce a new "deterministic" state evolution to predict the performance of D-AMP. Starting from θ^0 = ‖x_o‖_2^2 / n, the state evolution generates a sequence of numbers through the following iterations:

    θ^{t+1}(x_o, δ, σ_w²) = (1/n) E‖D_{σ^t}(x_o + σ^t ε) − x_o‖_2^2,   (16)

    where (σ^t)² = θ^t(x_o, δ, σ_w²)/δ + σ_w² and the expectation is with respect to ε ∼ N(0, I). Note that our notation θ^{t+1}(x_o, δ, σ_w²) is set to emphasize that θ^t may depend on the signal x_o, the under-determinacy δ, and the measurement noise. Consider the iterations of D-AMP and let x^t denote its estimate at iteration t. Our empirical findings show that the MSE of D-AMP is predicted accurately by the state evolution. We formally state our finding.

    Finding 1: If the D-AMP algorithm starts from x^0 = 0, then for large values of m and n, the state evolution predicts the mean square error of D-AMP, i.e.,

    θ^t(x_o, δ, σ_w²) ≈ (1/n) ‖x^t − x_o‖_2^2.

    Based on extensive simulations, we believe that this finding is true if the following properties are satisfied: (i) The elements of the matrix A are i.i.d. Gaussian (or sub-Gaussian) with mean zero and standard deviation 1/m. (ii) The noise w is also i.i.d. Gaussian. (iii) The denoiser D is Lipschitz continuous.¹⁰

    Fig. 7. The MSE of the intermediate estimate versus the iteration count for BM3D-AMP and BM3D-IT alongside their predicted state evolution. Notice that BM3D-AMP is well predicted by the state evolution whereas BM3D-IT is not.

    In all our simulations the elements of A are i.i.d. Gaussian. The same is true for the elements of w.

    Figure 7 compares the state evolution predictions of D-AMP (based on the BM3D denoising algorithm [55]) with the empirical performance of D-AMP and D-IT. As is clear from this figure, the state evolution is accurate for D-AMP but not for D-IT. We have checked the validity of the above finding for the following denoising algorithms: (i) BM3D [55], (ii) BLS-GSM [61], (iii) non-local means [39], (iv) AMP with soft-wavelet-thresholding [10], [62]. We report some of our simulations on this phenomenon in Section VII-C. We have posted our code online¹¹ to enable other researchers to verify our findings in more general settings and explore the validity of this conjecture on a wider range of denoisers.

    In the following sections we assume that the state evolution is accurate for D-AMP and derive some of the main features of D-AMP based on this assumption.
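    The recursion (16) is straightforward to evaluate numerically for any black-box denoiser by replacing the expectation with a Monte Carlo average. The sketch below is illustrative: the single-sample estimate of the expectation, the fixed iteration count, and the denoiser(x_noisy, sigma) interface are our own assumptions.

        import numpy as np

        def state_evolution(x_o, delta, sigma_w2, denoiser, iters=30, rng=None):
            # Tracks theta^t of Eq. (16) for a given signal, sampling rate, and denoiser.
            rng = np.random.default_rng() if rng is None else rng
            n = x_o.size
            theta = np.sum(x_o ** 2) / n                      # theta^0 = ||x_o||^2 / n
            history = [theta]
            for _ in range(iters):
                sigma = np.sqrt(theta / delta + sigma_w2)     # (sigma^t)^2 = theta^t / delta + sigma_w^2
                noisy = x_o + sigma * rng.standard_normal(n)  # x_o + sigma^t * eps
                theta = np.sum((denoiser(noisy, sigma) - x_o) ** 2) / n
                history.append(theta)
            return history

    Averaging each step over several independent noise draws gives a sharper estimate of the expectation; curves of this kind are what Figure 7 compares against the empirical MSE of D-AMP.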

    D. Analysis of D-AMP in the Absence of Measurement Noise

    In this section we consider the noiseless setting σ_w² = 0 and characterize the number of measurements D-AMP requires (under the validity of the state evolution framework) to recover the signal x_o exactly. We consider monotone denoisers, as defined in Section III-B. Consider the state evolution equation under the noiseless setting σ_w² = 0:

    θ^{t+1}(x_o, δ, 0) = (1/n) E‖D_{σ^t}(x_o + σ^t ε) − x_o‖_2^2,

    where (σ^t)² = θ^t(x_o, δ, 0)/δ. Starting with θ^0(x_o, δ, 0) = ‖x_o‖_2^2 / n, depending on the value of δ there are two conceivable scenarios for the state evolution equation:

    (i) θ^t(x_o, δ, 0) → 0 as t → ∞.
    (ii) θ^t(x_o, δ, 0) ↛ 0 as t → ∞.

    ¹⁰A denoiser is said to be L-Lipschitz continuous if for every x_1, x_2 ∈ C we have ‖D(x_1) − D(x_2)‖_2^2 ≤ L‖x_1 − x_2‖_2^2. Many advanced image denoisers have no closed form expression, thus it is very hard to verify whether or not they are Lipschitz continuous. That said, every advanced denoiser we tested was found to closely follow our state evolution equations (Finding 1), suggesting they are in fact Lipschitz. In Section VI we show examples in which Lipschitz continuity is violated and propose a simple approach for dealing with discontinuous denoisers.

    ¹¹http://dsp.rice.edu/software/DAMP-toolbox


    θ^t(x_o, δ, 0) → 0 implies the success of the D-AMP algorithm, while θ^t(x_o, δ, 0) ↛ 0 implies its failure in recovering x_o. The main goal of this section is to study the success and failure regions.

    Lemma 1: For monotone denoisers, if for δ_0, θ^t(x_o, δ_0, 0) → 0, then for any δ > δ_0, θ^t(x_o, δ, 0) → 0 as well.

    Proof: Define (σ^t)² = θ^t(x_o, δ, σ_w²)/δ. Clearly, since θ^t(x_o, δ_0, σ_w²) → 0, so does σ^t. Our first claim is that for every σ² ≤ ‖x_o‖_2^2/(nδ_0) = (σ^0)² (this is where D-AMP is initialized) we have

    (1/(nδ_0)) E‖D_σ(x_o + σε) − x_o‖_2^2 < σ².

    Suppose that this is not true and define

    σ_*² = sup_{σ² ≤ ‖x_o‖_2^2/(nδ_0)} { σ² : (1/(nδ_0)) E‖D_σ(x_o + σε) − x_o‖_2^2 ≥ σ² }.

    We claim that if (1/(nδ_0)) E‖D_{σ^0}(x_o + σ^0 ε) − x_o‖_2^2 < (σ^0)², then σ^t → σ_* as t → ∞. First, it is straightforward to see that (1/(nδ_0)) E‖D_{σ_*}(x_o + σ_* ε) − x_o‖_2^2 = σ_*². For σ > σ_* we know that

    (1/(nδ_0)) E‖D_σ(x_o + σε) − x_o‖_2^2 < σ².

    By using the monotonicity of the denoiser we have, for every σ ≥ σ_*,

    (1/(nδ_0)) E‖D_σ(x_o + σε) − x_o‖_2^2 ≥ (1/(nδ_0)) E‖D_{σ_*}(x_o + σ_* ε) − x_o‖_2^2 = σ_*².

    This (through simple induction) implies that for every t,

    (σ^t)² ≥ (σ_*)².

    Furthermore, according to the definition of σ_*² and the fact that σ^t > σ_*, we have

    (σ^{t+1})² = (1/(nδ_0)) E‖D_{σ^t}(x_o + σ^t ε) − x_o‖_2^2 < (σ^t)².

    Therefore, (σ^t)² is a decreasing sequence with lower bound σ_*². Hence, σ^t converges to some σ^∞ ≥ σ_*. The last step is to show that σ^∞ = σ_*. If this is not the case, then σ^∞ > σ_*. But according to the definition of σ_* and the supposition that σ^∞ > σ_*, we have

    (1/(nδ_0)) E‖D_{σ^∞}(x_o + σ^∞ ε) − x_o‖_2^2 < (σ^∞)²,

    which contradicts σ^∞ being a fixed point. Hence σ^∞ = σ_*. Since σ^∞ = 0, we conclude that σ_* = 0 and we have

    (1/(nδ_0)) E‖D_σ(x_o + σε) − x_o‖_2^2 < σ², ∀σ² > 0.

    Since δ > δ_0, we can conclude that

    (1/(nδ)) E‖D_σ(x_o + σε) − x_o‖_2^2 < σ², ∀σ² > 0.

    Hence the only fixed point of this equation is also at zero, and hence θ^t(x_o, δ, 0) → 0. Note that all of the above argument is based on the assumption that (1/(nδ_0)) E‖D_{σ^0}(x_o + σ^0 ε) − x_o‖_2^2 < (σ^0)². What if this assumption is violated? Using a similar argument, it is straightforward to show that if (1/(nδ_0)) E‖D_σ(x_o + σε) − x_o‖_2^2 has a fixed point above (σ^0)², then the algorithm converges to the closest fixed point above σ^0, which is a contradiction again. Also, if the algorithm does not have any fixed point above σ^0, then it will diverge to infinity, which is again a contradiction. ∎

    Note that for very small values of δ, it is straightforward to see that θ^t(x_o, δ, 0) ↛ 0 as t → ∞. If we combine this result with Lemma 1 we conclude the following simple result: for small values of δ, D-AMP fails to recover x_o; as δ increases, beyond a certain value of δ, D-AMP successfully recovers x_o from its undersampled measurements. Define

    δ_*(x_o) = inf_{δ∈(0,1)} { δ : θ^t(x_o, δ, 0) → 0 as t → ∞ }.

    δ_*(x_o) denotes the minimum number of measurements required for the successful recovery of x_o. Our goal is to characterize δ_*(x_o) in terms of the performance (we will clarify what we mean by performance) of the denoising algorithm. However, since the number of measurements δ_*(x_o) depends on the signal x_o, a more natural question in the design of a system is the following: How many measurements does D-AMP require to recover every signal x_o ∈ C? The following result addresses this question.

    Proposition 1: Suppose that for the signal class C the denoiser D_σ is proper at level κ. Then

    sup_{x_o∈C} δ_*(x_o) ≤ κ.

    Proof: The proof of this proposition is a simple application of the state evolution equation. Similar to the proof of Lemma 1, define

    (σ^t(x_o, δ, σ_w²))² = θ^t(x_o, δ, σ_w²) / δ.

    Also, for notational simplicity we use the notation σ^t instead of σ^t(x_o, δ, 0) in the equations below. According to the state evolution we have

    (σ^{t+1})² = (1/(nδ)) E‖D_{σ^t}(x_o + σ^t ε) − x_o‖_2^2
      = ((σ^t)²/(nδ(σ^t)²)) E‖D_{σ^t}(x_o + σ^t ε) − x_o‖_2^2
      ≤ ((σ^t)²/δ) sup_{x_o∈C} E‖D_{σ^t}(x_o + σ^t ε) − x_o‖_2^2 / (n(σ^t)²)
      ≤ κ(σ^t)²/δ.   (17)

    It is straightforward to see that

    (σ^t(x_o, δ, 0))² ≤ (κ/δ)^t (σ^0(x_o, δ, 0))².

    Hence, if δ > κ, then (σ^t(x_o, δ, 0))² → 0 as t → ∞. ∎

    We can apply Proposition 1 to the examples of Section III-B and derive some well-known results, such as the phase transition of AMP with the soft-threshold denoiser [10].

    If our denoiser is only nearly proper, perfect recovery may not be possible. However, we can use the same technique to bound the recovery error of D-AMP.


    Lemma 2: Let D_σ denote a near proper family of denoisers with levels κ and B, as defined in Definition 2. Then, if δ > κ, the error of D-AMP is upper bounded by

    lim_{t→∞} (σ^t(x_o, δ, 0))² ≤ B/(δ − κ).

    Proof: The proof of this result is much like the one used for proper denoisers. Again define (σ^t(x_o, δ, σ_w²))² = θ^t(x_o, δ, σ_w²)/δ. Using the state evolution and the definition of near proper we have

    (σ^{t+1}(x_o, δ, 0))² = (1/(nδ)) E‖D_{σ^t(x_o,δ,0)}(x_o + σ^t(x_o, δ, 0) ε) − x_o‖_2^2 ≤ (κ(σ^t(x_o, δ, 0))² + B)/δ.

    Hence

    (σ^t(x_o, δ, 0))² ≤ (κ/δ)^t ‖x_o‖_2^2/n + ((1 − (κ/δ)^t)/(1 − κ/δ)) (B/δ).

    For δ > κ, the limit of this sequence satisfies

    lim_{t→∞} (σ^t(x_o, δ, 0))² ≤ B/(δ − κ).

    ∎

    Note that the proof technique employed above was first developed in [10] and was later employed to establish the phase transition of AMP extensions [35]. There are some minor differences between our derivation and the derivations presented in the other papers since we have not adopted the minimax setting.

    E. Noise Sensitivity of D-AMP

    In Section III-D we considered the performance of D-AMP in the noiseless setting where σ_w² = 0. This section is devoted to the analysis of D-AMP in the presence of measurement noise. Here we assume that the denoiser is near proper at levels κ and B, i.e.,

    sup_{σ²} sup_{x_o∈C} E‖D_σ(x_o + σε) − x_o‖_2^2 / n ≤ κσ² + B.   (18)

    The following result shows that D-AMP is robust to measurement noise. Let θ^∞(x_o, δ, σ_w²) denote the fixed point of the state evolution equation. Since there is measurement noise, θ^∞(x_o, δ, σ_w²) ≠ 0, i.e., D-AMP will not recover x_o exactly. We define the noise sensitivity of D-AMP as

    NS(σ_w², δ) = sup_{x_o∈C} θ^∞(x_o, δ, σ_w²).

    The following proposition provides an upper bound for the noise sensitivity as a function of the number of measurements and the variance of the measurement noise.

    Proposition 2: Let D_σ denote a near proper family of denoisers at levels κ and B. Then, for δ > κ, the noise sensitivity of D-AMP satisfies

    NS(σ_w², δ) ≤ (κσ_w² + B) / (1 − κ/δ).   (19)

    Fig. 8. The MSE of BM3D-AMP reconstructions of the 128 × 128 Barbara test image with varying amounts of measurement noise at different sampling rates (δ).

    Proof: Note that θ^∞(x_o, δ, σ_w²) is a fixed point of the state evolution equation and hence it satisfies

    θ^∞(x_o, δ, σ_w²) = (1/n) E‖D_{σ^∞}(x_o + σ^∞(x_o, δ, σ_w²) ε) − x_o‖_2^2,

    where σ^∞(x_o, δ, σ_w²) = √(θ^∞(x_o, δ, σ_w²)/δ + σ_w²). Therefore,

    NS(σ_w², δ) = sup_{x_o∈C} θ^∞(x_o, δ, σ_w²)
      = sup_{x_o∈C} (1/n) E‖D_{σ^∞}(x_o + σ^∞(x_o, δ, σ_w²) ε) − x_o‖_2^2
      ≤ sup_{x_o∈C} κ(σ^∞(x_o, δ, σ_w²))² + B
      = sup_{x_o∈C} κ(θ^∞(x_o, δ, σ_w²)/δ + σ_w²) + B
      = (κ/δ) NS(σ_w², δ) + κσ_w² + B.

    A simple calculation completes the proof. ∎

    Substituting B = 0 into the above result gives the noise sensitivity for proper denoisers:

    NS(σ_w², δ) ≤ κσ_w² / (1 − κ/δ).   (20)
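    As an illustrative numerical instance (the values are arbitrary and chosen only to make the bound concrete): for a proper denoiser with κ = 0.2 used at sampling rate δ = 0.4 with measurement noise variance σ_w² = 0.01, the bound (20) gives NS(σ_w², δ) ≤ 0.2 × 0.01 / (1 − 0.2/0.4) = 0.004, i.e., the worst-case reconstruction MSE is at most 0.4 σ_w².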

    There are several interesting features of this proposition that we would like to emphasize.

    Remark 3: The bound we presented in Proposition 2 is a worst case analysis. The bound may be achieved for certain signals in C and certain noise variances. However, for most signals in C and most noise variances D-AMP will perform better than what is predicted by the bound. Figure 8 shows the performance of BM3D-AMP in terms of the standard deviation of the noise.


    The technique we employed above was first developed in [19]. The result we derived in Proposition 2 can be considered as a generalization of the result of [19] to a much broader class of denoisers.

    As an aside, upper and lower bounds were recently derived for the minimax noise sensitivity of any recovery algorithm when the measurement matrix is i.i.d. Gaussian and the compressively sampled signal is sparse [48]. Note that while our results can be applied to sparse signals, they have been derived under a much more general setting. In this section we discussed upper bounds on the noise sensitivity. See Section III-G for some preliminary results on the lower bound.
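
    As a quick numerical illustration of (19) and (20), the short Python sketch below evaluates the noise-sensitivity upper bound on a grid of noise variances; the values of κ, B, and δ are illustrative assumptions and are not tied to any particular denoiser.

    # Evaluate the noise-sensitivity bound of Proposition 2,
    #   NS(sigma_w^2, delta) <= (kappa * sigma_w^2 + B) / (1 - kappa/delta),
    # together with the proper-denoiser special case B = 0 from (20).
    kappa, B, delta = 0.2, 0.01, 0.5      # illustrative assumptions; the bound needs delta > kappa

    def ns_bound(sigma_w_sq, B=B):
        assert delta > kappa, "the bound only holds for delta > kappa"
        return (kappa * sigma_w_sq + B) / (1.0 - kappa / delta)

    for sigma_w_sq in [0.0, 0.01, 0.1, 1.0]:
        print(sigma_w_sq, ns_bound(sigma_w_sq), ns_bound(sigma_w_sq, B=0.0))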

    F. Tuning the Parameters of D-AMP

    Practical denoisers typically have a few free parameters, and the denoisers' performance relies on the effective tuning of these parameters. One of the simplest examples of a denoiser with parameters is soft-thresholding (introduced in Example 2), for which the threshold can be regarded as a parameter. There exists extensive literature on tuning the free parameters of denoisers [63], [64]. Diverse and powerful algorithms such as SURE (Stein's Unbiased Risk Estimation) have been proposed for this purpose [65].

    D-AMP can employ any of these tuning schemes. However, once we use a denoising algorithm in the D-AMP framework, the problem of tuning the free parameters of the denoiser seems to become dramatically more difficult: to produce good performance from D-AMP the parameters must be tuned jointly across different iterations. To state this challenge we overload our notation of a denoiser to D_{σ,τ}, where τ denotes the denoiser's parameters. According to this notation the state evolution is given by

    o^{t+1}(τ⁰, τ¹, ..., τ^t) = (1/n) E‖D_{σ^t,τ^t}(x_o + σ^t ε) − x_o‖₂²,

    where (σ^t)² = o^t(τ⁰, τ¹, ..., τ^{t−1})/δ + σ_w². Note that we have changed our notation for the state evolution variables to emphasize the dependence of o^{t+1} on the choice of the parameters we pick at the previous iterations. The first question that we ask is the following: What does the optimality of τ⁰, τ¹, ..., τ^t mean? Suppose that the sequence of parameters τ^t is bounded.

    Definition 4: A sequence of parameters τ⁰_*, ..., τ^t_* is called optimal at iteration t + 1 if

    o^{t+1}(τ⁰_*, ..., τ^t_*) = min_{τ⁰,τ¹,...,τ^t} o^{t+1}(τ⁰, τ¹, ..., τ^t).

    Note that τ⁰_*, ..., τ^t_* are optimal in the sense that they produce the smallest mean square error D-AMP can achieve after t iterations. This definition was first given in [20] for the AMP algorithm based on soft-thresholding.

    It seems from our formulation that we should solve a joint optimization over τ⁰, ..., τ^t to obtain the optimal values of these parameters. However, it turns out that in D-AMP the optimal parameters can be found much more easily. Consider the following greedy algorithm for setting the parameters:

    (i) Tune τ⁰ such that o¹(τ⁰) is minimized. Call the optimal value τ⁰_*.

    (ii) If τ⁰, ..., τ^{t−1} are set to τ⁰_*, ..., τ^{t−1}_*, then set τ^t such that it minimizes o^{t+1}(τ⁰_*, ..., τ^{t−1}_*, τ^t). (A short numerical sketch of this loop follows below.)
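
    A minimal Python sketch of this greedy loop is given here, using soft-thresholding as the parameterized denoiser D_{σ,τ} and a Monte Carlo estimate of the state-evolution risk. The test signal, problem sizes, and parameter grid are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    n, delta, sigma_w_sq = 1000, 0.4, 0.0          # illustrative dimension, sampling ratio, noise level
    x_o = np.zeros(n); x_o[:50] = 1.0              # illustrative sparse test signal

    def soft(v, tau):                              # parameterized denoiser D_{sigma,tau}
        return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

    def risk(sigma, tau, trials=20):               # Monte Carlo estimate of (1/n) E||D(x_o + sigma*eps) - x_o||_2^2
        return np.mean([np.mean((soft(x_o + sigma * rng.standard_normal(n), tau) - x_o) ** 2)
                        for _ in range(trials)])

    taus = np.linspace(0.0, 3.0, 31)               # candidate values for tau^t (assumption)
    theta = np.mean(x_o ** 2)                      # o^0 = ||x_o||_2^2 / n
    for t in range(10):                            # greedily pick tau^t to minimize o^{t+1}
        sigma = np.sqrt(theta / delta + sigma_w_sq)
        risks = [risk(sigma, tau) for tau in taus]
        best = int(np.argmin(risks))
        theta = risks[best]
        print("iteration", t, "tau* =", round(float(taus[best]), 2), "o^(t+1) =", round(float(theta), 5))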

    Note that the above strategy is a greedy parameter selection. The following result proves that in the context of D-AMP this greedy strategy is optimal:

    Lemma 3: Suppose that the denoiser D_{σ,τ} is monotone in the sense that inf_τ E‖D_{σ,τ}(x_o + σε) − x_o‖₂² is a non-decreasing function of σ. If τ⁰_*, ..., τ^t_* is generated according to the greedy tuning algorithm described above, then

    o^{t+1}(τ⁰_*, ..., τ^t_*) ≤ o^{t+1}(τ⁰, ..., τ^t), ∀τ⁰, ..., τ^t,

    for every t.

    Proof: Our proof is based on induction. According to the first step of our procedure we know that

    o¹(τ⁰_*) ≤ o¹(τ⁰), ∀τ⁰.

    Now suppose that the claim of the theorem is true for every t ≤ T. We would like to prove that the result also holds for t = T + 1, i.e.,

    o^{T+1}(τ⁰_*, ..., τ^T_*) ≤ o^{T+1}(τ⁰, ..., τ^T), ∀τ⁰, ..., τ^T.

    Suppose that this is not true and that for τ⁰_o, ..., τ^T_o we have

    o^{T+1}(τ⁰_*, ..., τ^T_*) > o^{T+1}(τ⁰_o, ..., τ^T_o). (21)

    Clearly,

    o^{T+1}(τ⁰_*, τ¹_*, ..., τ^T_*) = (1/n) E‖D_{σ^T_*,τ^T_*}(x_o + σ^T_* ε) − x_o‖₂²,

    where (σ^T_*)² = o^T(τ⁰_*, ..., τ^{T−1}_*)/δ + σ_w². If we define (σ^T_o)² = o^T(τ⁰_o, ..., τ^{T−1}_o)/δ + σ_w², then according to the induction assumption σ^T_* ≤ σ^T_o. Therefore, according to the monotonicity of the denoiser

    o^{T+1}(τ⁰_*, τ¹_*, ..., τ^T_*) = inf_{τ^T} (1/n) E‖D_{σ^T_*,τ^T}(x_o + σ^T_* ε) − x_o‖₂²
                                    ≤ inf_{τ^T} (1/n) E‖D_{σ^T_o,τ^T}(x_o + σ^T_o ε) − x_o‖₂²
                                    ≤ (1/n) E‖D_{σ^T_o,τ^T_o}(x_o + σ^T_o ε) − x_o‖₂²
                                    = o^{T+1}(τ⁰_o, τ¹_o, ..., τ^T_o).

    This is in contradiction with (21). Hence,

    o^{T+1}(τ⁰_*, ..., τ^T_*) ≤ o^{T+1}(τ⁰, ..., τ^T), ∀τ⁰, ..., τ^T. □

    To summarize the above discussion, greedy parameter tuning is optimal for D-AMP; thus the tuning of D-AMP is as simple (or as difficult) as the tuning of the denoising algorithm that is employed in D-AMP. Many researchers in the area of signal denoising have optimized the parameters of state-of-the-art denoisers, such as BM3D. Lemma 3 implies that optimally tuned denoisers will induce the best possible performance from D-AMP. Therefore, the tuning of D-AMP has already been thoroughly addressed in the denoising literature [63]–[65].


    G. Optimality of D-AMP

    1) Problem Definition: D-AMP is a framework by which to employ denoisers to solve linear inverse problems. But is D-AMP optimal? In other words, given a family of denoisers, D_σ, for a set C, can we come up with an algorithm for recovering x_o from y = Ax_o + w that outperforms D-AMP? Note that this problem is ill-posed in the following sense: the denoising algorithm might not capture all the structures that are present in the signal class C. Hence, a recovery algorithm that employs extra structure not used by the denoiser (and thus not used by D-AMP) clearly might outperform D-AMP. In the following sections we use two different approaches to analyze the optimality of D-AMP.

    2) Uniform Optimality: Let E_κ denote the set of all classes of signals C for which there exists a family of denoisers D^C_σ that satisfies

    sup_{σ²} sup_{x_o∈C} (1/(nσ²)) E‖D^C_σ(x_o + σε) − x_o‖₂² ≤ κ. (22)

    We know from Proposition 1 that for any C ∈ E_κ, D-AMP recovers all the signals in C whenever δ > κ.

    We now ask our uniform optimality question: Does there exist any other signal recovery algorithm that can recover all the signals in all these classes with fewer measurements than D-AMP? If the answer is affirmative, then D-AMP is sub-optimal in the uniform sense, meaning there exists an approach that outperforms D-AMP uniformly over all classes in E_κ. The following proposition shows that any recovery algorithm requires at least m = κn measurements for accurate recovery, i.e., D-AMP is optimal in this sense.

    Proposition 3: If m*(C) denotes the minimum number of measurements required (by any recovery algorithm) for a set C ∈ E_κ, then

    sup_{C∈E_κ} m*(C)/n ≥ κ.

    Proof: First note that according to Example 1 any κn-dimensional subspace of R^n belongs to E_κ (assume that κn is an integer). From the fundamental theorem of linear algebra we know that to recover the vectors in a k-dimensional subspace we require at least k measurements. Hence

    sup_{C∈E_κ} m*(C)/n ≥ κn/n = κ. □

    According to this simple result, D-AMP is optimal for at least certain classes of signals and certain denoisers. Hence, it cannot be uniformly improved.

    3) Single Class Optimality: The uniform optimality framework we introduced above considers a set of signal classes and measures the performance of an algorithm on every class in this set. However, in many applications such as imaging we are interested in the performance of D-AMP on a specific class of signals, such as images. Unfortunately, the uniform optimality framework does not provide any conclusion in such cases. Therefore, in this section we introduce another framework for evaluating the optimality of D-AMP that we call single class optimality.

    Let C denote a class of signals. Instead of assuming that we are given a family of denoisers for the signals in class C, we assume that we can find the denoiser that brings about the best performance from D-AMP. This ensures that D-AMP employs as much information as it can about C. Let θ^∞_D(x_o, δ, σ_w²) denote the fixed point of the state evolution equation given in (16). Note that we have added a subscript D to our notation for θ to indicate the dependence of this quantity on the choice of the denoiser. The best denoiser for D-AMP is a denoiser that minimizes θ^∞_D(x_o, δ, σ_w²). Note that according to Finding 1, θ^∞_D(x_o, δ, σ_w²) corresponds to the mean square error of the final estimate that D-AMP returns.

    Definition 5: A family of denoisers D*_σ is called minimax optimal for D-AMP at noise level σ_w² if it achieves

    inf_{D_σ} sup_{x_o∈C} θ^∞_D(x_o, δ, σ_w²).

    Note that according to our definition, the optimal denoiser may depend on both σ_w² and δ, and it is not necessarily unique. We call the version of D-AMP that employs D*_σ, D*-AMP.

    Armed with this definition, we formally ask the single class optimality question: Can we provide a new algorithm that can recover signals in class C with fewer measurements than D*-AMP? If the answer is negative, it means that if we employ the optimal denoiser for the D-AMP algorithm, no other algorithm can outperform D-AMP. Unfortunately, we will show that there are signal classes for which D-AMP is not optimal in this sense. Our proof requires the following standard definition from statistics textbooks [60]:

    Definition 6: The minimax risk of a set of signals C at the noise level σ² is defined as

    R_MM(C, σ²) = inf_D sup_{x_o∈C} E‖D(x_o + σε) − x_o‖₂²,

    where the expected value is with respect to ε ∼ N(0, I). If D^M_σ achieves R_MM(C, σ²), then it will be called the family of minimax denoisers for the set C under the square loss.

    Proposition 4: The family of minimax denoisers for C is a family of optimal denoisers for D-AMP. Furthermore, in order to recover every x_o ∈ C, D*-AMP requires at least nκ_MM measurements:

    κ_MM = sup_{σ²>0} R_MM(C, σ²)/(nσ²).

    Proof: Since the proof of this result is slightly more involved, we postpone it to Appendix A. □

    Based on this result, we can simplify the single class optimality question: Does there exist any recovery algorithm that can recover every x_o ∈ C from fewer observations than nκ_MM? Unfortunately, the answer is affirmative.

    Consider the following extreme example. Let B^n_k denote the class of signals that consist of k ones and n − k zeros. Define ρ = k/n and let φ(z) denote the density function of a standard normal random variable.

    Proposition 5: For very high dimensional problems, there are recovery algorithms that can recover signals in B^n_k accurately from 1 measurement. On the other hand, D*-AMP requires at least n(κ_MM − o(1)) measurements to recover signals from this class, where

    κ_MM = sup_{σ²>0} (1/σ²) [ E_{z₁∼φ}( ρφ_σ(z₁)/(ρφ_σ(z₁) + (1 − ρ)φ_σ(z₁ + 1)) − 1 )² ρ
            + E_{z₁∼φ}( ρφ_σ(z₁ − 1)/(ρφ_σ(z₁ − 1) + (1 − ρ)φ_σ(z₁)) )² (1 − ρ) ],

    where φ_σ(z) = φ(z/σ). The proof of this result is slightly more involved and hence is postponed to Appendix B. According to this proposition, since κ_MM is non-zero, the number of measurements D*-AMP requires is proportional to the ambient dimension n, while the actual number of measurements that is required for recovery is equal to 1. Hence, in such cases D*-AMP is sub-optimal.

    However, it is also important to note that while D-AMP is sub-optimal for this class, according to Proposition 3 D-AMP is optimal for other classes. Characterizing the classes of signals for which D-AMP is optimal is left as an open direction for future research. Despite this sub-optimality result, we will show in Section VII that D-AMP provides impressive results for the class of natural images and outperforms state-of-the-art recovery algorithms.

    H. Additional Miscellaneous Properties of D-AMP

    1) Better Denoisers Lead to Better Recovery: This intuitive result is a key feature of D-AMP. We formalize it below.

    Theorem 1: Let a family of denoisers D¹_σ be a better denoiser than a family D²_σ for signal x_o in the following sense:

    E‖D¹_σ(x_o + σε) − x_o‖₂²/(nσ²) ≤ E‖D²_σ(x_o + σε) − x_o‖₂²/(nσ²), ∀σ² > 0. (23)

    Also, let θ^∞_{D_i}(x_o, δ, σ_w²) denote the fixed point of state evolution for denoiser D^i. Then,

    θ^∞_{D¹}(x_o, δ, σ_w²) ≤ θ^∞_{D²}(x_o, δ, σ_w²).

    Proof: The proof of this result is straightforward. Since the state evolution of D¹ is uniformly lower than that of D², its fixed point is lower as well. □

    2) D-AMP as a Regularization Technique: Explicit regularization is a popular technique to recover signals from an undersampled set of linear measurements [5], [22], [29]–[33], [66]. In these approaches a cost function, J(x), also known as a regularizer, is considered on R^n. This function returns large values for x ∉ C and small values for x ∈ C. Regularized techniques recover x_o from measurements y by setting up and solving the following optimization problem:

    x̂ = argmin_x (1/2)‖y − Ax‖₂² + λJ(x). (24)

    Since in many cases J(x) is non-convex and non-differentiable, iterative heuristic methods have been proposed for solving the above optimization problem.¹² D-AMP provides another heuristic approach for solving (24). It has

    ¹²Many of these methods solve (24) accurately when J is convex.

    two main advantages over the other heuristics: (i) D-AMP can be analyzed theoretically via the state evolution. Hence, we can theoretically predict the number of measurements required and the noise sensitivity of D-AMP. (ii) The performance of most heuristic methods depends on their free parameters. As discussed in Section III-F, there are efficient approaches for tuning the parameters of D-AMP optimally. Below we briefly review the application of D-AMP to solving (24).

    Assume that there exists a computationally efficient scheme for solving the optimization problem χ_J(u; λ) = argmin_x (1/2)‖u − x‖₂² + λJ(x). χ_J(u; λ) is called the proximal operator for the function J. The D-AMP algorithm for solving (24) is given by

    x^{t+1} = χ_J(x^t + A*z^t; λ_t),
    z^t = y − Ax^t + z^{t−1} div χ_J(x^{t−1} + A*z^{t−1}; λ_{t−1})/m. (25)

    Considering χ_J as a denoiser, this algorithm has exactly the same interpretation as our generic D-AMP algorithm. Furthermore, if the explicit calculation of the Onsager correction term is challenging, we can employ the Monte Carlo technique that will be discussed in Section V-B.
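
    To make (25) concrete, the Python sketch below runs the iteration with the ℓ₁ proximal operator (soft-thresholding) playing the role of χ_J, so that the divergence in the Onsager term has the closed form given in Section V-A. The problem sizes, sparsity level, and threshold schedule are illustrative assumptions, not the settings used in our experiments.

    import numpy as np

    rng = np.random.default_rng(1)
    n, m = 500, 250                                   # illustrative dimensions (delta = m/n)
    x_true = np.zeros(n)
    x_true[rng.choice(n, 25, replace=False)] = rng.standard_normal(25)
    A = rng.standard_normal((m, n)) / np.sqrt(m)      # i.i.d. Gaussian measurement matrix
    y = A @ x_true                                    # noiseless measurements

    def prox_l1(u, lam):                              # chi_J for J(x) = ||x||_1 (soft-thresholding)
        return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

    def div_prox_l1(u, lam):                          # closed-form divergence of soft-thresholding
        return np.sum(np.abs(u) > lam)

    x, z = np.zeros(n), y.copy()
    for t in range(30):
        pseudo_data = x + A.T @ z                     # x^t + A^* z^t
        lam = 1.5 * np.sqrt(np.mean(z ** 2))          # illustrative threshold rule (assumption)
        x_new = prox_l1(pseudo_data, lam)
        onsager = z * div_prox_l1(pseudo_data, lam) / m   # Onsager correction term
        z = y - A @ x_new + onsager
        x = x_new
    print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))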

    IV. CONNECTION WITH OTHER STATE EVOLUTIONS

    In Section III we introduced a new, "deterministic" state evolution (SE) and used it to analyze D-AMP. Here we review this SE and compare it with AMP's Bayesian SE, which was first introduced in [10] and [17].

    A. Deterministic State Evolution

    The deterministic SE assumes that x_o is an arbitrary but fixed vector in C. Starting from θ⁰ = ‖x_o‖₂²/n, the deterministic SE generates a sequence of numbers through the following iterations:

    θ^{t+1}(x_o, δ, σ_w²) = (1/n) E_ε‖D_{σ^t}(x_o + σ^t ε) − x_o‖₂², (26)

    where (σ^t)² = (1/δ)θ^t(x_o, δ, σ_w²) + σ_w² and ε ∼ N(0, I).
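
    Because (26) only involves an expectation over ε, the deterministic SE can be simulated directly for any denoiser and any fixed test signal by replacing the expectation with a Monte Carlo average. In the Python sketch below, the denoiser, test signal, and parameters are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(2)
    n, delta, sigma_w_sq = 2000, 0.4, 0.0
    x_o = np.zeros(n); x_o[:100] = 1.0               # illustrative fixed test signal in C

    def denoiser(noisy, sigma):                      # stand-in for D_sigma; soft-thresholding here
        tau = 1.5 * sigma                            # illustrative threshold rule
        return np.sign(noisy) * np.maximum(np.abs(noisy) - tau, 0.0)

    theta = np.mean(x_o ** 2)                        # theta^0 = ||x_o||_2^2 / n
    for t in range(15):
        sigma = np.sqrt(theta / delta + sigma_w_sq)  # (sigma^t)^2 = theta^t / delta + sigma_w^2
        # Monte Carlo estimate of (1/n) E_eps || D_sigma(x_o + sigma*eps) - x_o ||_2^2
        theta = np.mean([np.mean((denoiser(x_o + sigma * rng.standard_normal(n), sigma) - x_o) ** 2)
                         for _ in range(10)])
        print("t =", t, "theta =", round(float(theta), 5))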

    B. Bayesian State Evolution

    The Bayesian SE assumes that x_o is a vector drawn from a probability density function (pdf) p_x, where the support of p_x is a subset of C. Starting from θ̄⁰ = ‖x_o‖₂²/n, the Bayesian SE generates a sequence of numbers through the following iterations:

    θ̄^{t+1}(p_x, δ, σ_w²) = (1/n) E_{x_o,ε}‖D_{σ̄^t}(x_o + σ̄^t ε) − x_o‖₂², (27)

    where (σ̄^t)² = (1/δ)θ̄^t(p_x, δ, σ_w²) + σ_w². We have used the notation θ̄ to distinguish the Bayesian SE from its deterministic counterpart. In Definition 6 we presented a definition of the optimal denoiser under the deterministic framework. One can do the same for the Bayesian framework.

    Definition 7: A family of denoisers D̃_σ is called Bayes-optimal for D-AMP at noise level σ_w² if it achieves

    inf_{D_σ} θ̄^∞_D(p_x, δ, σ_w²).


    It is straightforward to see that the family D̃_σ(x_o + σε) = E(x_o | x_o + σε) is Bayes-optimal for D-AMP.
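
    For simple priors the conditional expectation E(x_o | x_o + σε) can be written in closed form. The Python sketch below evaluates it coordinate-wise for an illustrative binary prior with P(x_i = 1) = ρ; the prior and its parameters are assumptions used only to illustrate what a Bayes-optimal denoiser looks like.

    import numpy as np

    def bayes_denoiser_binary(y, sigma, rho):
        # Posterior mean E(x | y) for i.i.d. entries with P(x_i = 1) = rho and y = x + sigma * eps.
        # Gaussian likelihoods; the common normalizing constant cancels in the ratio.
        lik1 = rho * np.exp(-(y - 1.0) ** 2 / (2.0 * sigma ** 2))
        lik0 = (1.0 - rho) * np.exp(-(y ** 2) / (2.0 * sigma ** 2))
        return lik1 / (lik1 + lik0)                  # equals the posterior probability that x_i = 1

    rng = np.random.default_rng(3)
    n, rho, sigma = 10000, 0.1, 0.3                  # illustrative parameters
    x = (rng.random(n) < rho).astype(float)
    y = x + sigma * rng.standard_normal(n)
    print("per-coordinate MSE:", np.mean((bayes_denoiser_binary(y, sigma, rho) - x) ** 2))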

    While the deterministic and Bayesian SEs are different, we can establish a connection between them by employing standard results in theoretical statistics regarding the connection between the minimax risk and the Bayesian risk. The next section briefly discusses this connection.

    C. Connection Between the Two State Evolutions

    In this section we would like to establish a connection between the fixed points of the Bayes-optimal denoisers and the minimax-optimal denoisers for D-AMP. Let θ̄^∞(p_x, δ, σ_w²) denote the fixed point of the Bayesian SE (27) associated with the family of Bayes-optimal denoisers from Definition 7. Also, let θ^∞(x_o, δ, σ_w²) denote the fixed point of the deterministic SE (16) for the family of minimax denoisers from Definition 5.

    Theorem 2: Let P denote the set of all distributions whose support is a subset of C. Then,

    sup_{p_x∈P} θ̄^∞(p_x, δ, σ_w²) ≤ sup_{x_o∈C} θ^∞(x_o, δ, σ_w²).

    Proof: For an arbitrary family of denoisers D_σ we have

    E_{x_o,ε}‖D_σ(x_o + σε) − x_o‖₂² ≤ sup_{x_o∈C} E_ε‖D_σ(x_o + σε) − x_o‖₂². (28)

    If we take the minimum with respect to D_σ on both sides of (28), we obtain the following inequality:

    E_{x_o,ε}‖D̃_σ(x_o + σε) − x_o‖₂² ≤ sup_{x_o∈C} E_ε‖D^MM(x_o + σε) − x_o‖₂²,

    where D^MM denotes the minimax denoiser and D̃_σ(x_o + σε) denotes E(x_o | x_o + σε). Let (σ̄^∞)² = θ̄^∞(p_x, δ, σ_w²)/δ + σ_w² and (σ^∞_mm)² = θ^∞(x_o, δ, σ_w²)/δ + σ_w². Also, for notational simplicity assume that sup_{x_o∈C} E_ε‖D^MM(x_o + σ̄^∞ε) − x_o‖₂² is achieved at a certain value x_mm. We then have

    θ̄^∞(p_x, δ, σ_w²) = (1/n) E_{x_o,ε}‖D̃_{σ̄^∞}(x_o + σ̄^∞ε) − x_o‖₂²
                       ≤ (1/n) E_ε‖D^MM_{σ̄^∞}(x_mm + σ̄^∞ε) − x_mm‖₂². (29)

    This inequality implies that θ̄^∞(p_x, δ, σ_w²) is below the fixed point of the deterministic SE using D^MM at x_mm. Therefore, because sup_{x_o∈C} θ^∞(x_o, δ, σ_w²) will be equal to or above the fixed point of D^MM at x_mm, it will satisfy sup_{p_x∈P} θ̄^∞(p_x, δ, σ_w²) ≤ sup_{x_o∈C} θ^∞(x_o, δ, σ_w²). □

    Under some general conditions it is possible to prove that

    sup_{p_x∈P} inf_{D_σ} E_{x_o,ε}‖D_σ(x_o + σε) − x_o‖₂² = inf_{D_σ} sup_{x_o∈C} E_ε‖D_σ(x_o + σε) − x_o‖₂². (30)

    For instance, if we have

    sup_{p_x∈P} inf_{D_σ} E_{x_o,ε}‖D_σ(x_o + σε) − x_o‖₂² = inf_{D_σ} sup_{p_x∈P} E_{x_o,ε}‖D_σ(x_o + σε) − x_o‖₂²,

    then (30) holds as well. Since we work with square loss in the SE, swapping the infimum and supremum is permitted under quite general conditions on P. For more information, see [68, Appendix A]. If (30) holds, then we can follow similar steps as in the proof of Theorem 2 to prove that under the same set of conditions we have

    sup_{p_x∈P} inf_{D_σ} θ̄^∞(p_x, δ, σ_w²) = inf_{D_σ} sup_{x_o∈C} θ^∞(x_o, δ, σ_w²).

    In words, the supremum of the fixed point of the Bayesian SE with the Bayes-optimal denoiser is equivalent to the supremum of the fixed point of the deterministic SE with the minimax denoiser.

    D. Why Bother?

    Considering that the deterministic and Bayesian SEs look so similar, and under certain conditions have the same suprema, it is natural to ask why we developed the deterministic SE at all. That is, what is gained by using the deterministic SE (26) rather than the Bayesian SE (27)?

    The deterministic SE is useful because it enables us to deal with signals with poorly understood distributions. Take, for instance, natural images. To use the Bayesian SE on imaging problems, we would first need to characterize all images according to some generalized, almost assuredly inaccurate, pdf. In contrast, the deterministic SE deals with specific signals, not distributions. Thus, even without knowledge of the underlying distribution, so long as we can come up with representative test signals, we can use the deterministic SE. Because the SE shows up in the parameter tuning, noise sensitivity, and performance guarantees of AMP algorithms, being able to deal with arbitrary signals is invaluable.

    V. CALCULATION OF THE ONSAGER CORRECTION TERM

    So far, we have emphasized that the key to the success of approximate message passing algorithms is the Onsager correction term, z^{t−1} div D_{σ̂^{t−1}}(x^{t−1} + A*z^{t−1})/m, but we have not yet addressed how one can calculate it for an arbitrary denoiser. In this section we provide some guidelines on the calculation of this term. If the input-output relation of the denoiser is known explicitly, then calculating the divergence, div D(x), and thus the Onsager correction term, is usually straightforward.¹³ We will review some popular denoisers and calculate the corresponding Onsager correction terms in the next section. However, most state-of-the-art denoisers are complicated algorithms for which the input-output relation is not explicitly known. In Section V-B we show that, even without an explicit input-output relationship, we can calculate a good approximation for the Onsager correction term.

    A. Soft-Thresholding for Sparse and Low-Rank Signals

    Three of the most popular signal classes in the literature are sparse, group-sparse, and low-rank signals (when the signal has a matrix form). The most popular denoisers for

    ¹³In the context of this work the divergence div D(x) is simply the sum of the partial derivatives with respect to each element of x, i.e., div D(x) = Σ_{i=1}^n ∂D_i(x)/∂x_i.



    these signals are soft-thresholding, block soft-thresholding, and singular value soft-thresholding, respectively. The goal of this section is to derive the Onsager correction term for each of these denoisers. Most of the results mentioned in this section have been derived elsewhere. We summarize these results to help the reader understand the steps involved in explicitly computing the Onsager correction term.

    1) Soft-Thresholding: Let η_τ denote the soft-threshold function. For x ∈ R^n, η_τ(x) denotes the component-wise application of the soft-threshold function to the elements of x. In this case we have div η_τ(x) = Σ_{i=1}^n I(|x_i| > τ), where I denotes the indicator function.
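
    This closed form is easy to verify numerically: the Python sketch below compares it with a brute-force sum of finite-difference partial derivatives (the test vector and step size are illustrative assumptions).

    import numpy as np

    def soft(x, tau):                                # component-wise soft-thresholding eta_tau
        return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

    rng = np.random.default_rng(4)
    x, tau, eps = rng.standard_normal(200), 0.5, 1e-6

    closed_form = np.sum(np.abs(x) > tau)            # div eta_tau(x) = sum_i I(|x_i| > tau)

    fd = 0.0                                         # sum of per-coordinate finite-difference derivatives
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = eps
        fd += (soft(x + e, tau)[i] - soft(x, tau)[i]) / eps

    print(closed_form, round(fd, 3))                 # agree unless some |x_i| lies within eps of tau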

    2) Block Soft-Thresholding: For a vector x_B ∈ R^B, block soft-thresholding is defined as η^B_τ(x_B) = (‖x_B‖₂ − τ)₊ x_B/‖x_B‖₂. In other words, the threshold function retains the phase of the vector x_B and shrinks its magnitude toward zero. Let n = MB and x = [(x¹_B)ᵀ, (x²_B)ᵀ, ..., (x^M_B)ᵀ]ᵀ. The notation η^B_τ(x) is defined as the block soft-thresholding function applied to each individual block. The divergence of block soft-thresholding can then be calculated according to

    div η^B_τ(x) = Σ_{ℓ=1}^M ( B − (B − 1)τ/‖x^ℓ_B‖₂ ) I(‖x^ℓ_B‖₂ ≥ τ).

    This result was derived in [35], [36], and [69].

    3) Singular Value Thresholding: Let X_o ∈ R^{n×n} denote our signal of interest. If X_o is low-rank, then it can be estimated accurately from its noisy version Y = X_o + σW, where the W_ij denote i.i.d. N(0, 1) random variables. If the singular value decomposition of Y is given by USVᵀ, with S = diag(σ₁, ..., σ_n), where the σ_i denote the singular values of Y, then the estimate of X_o has the form

    X̂ = SVT_λ(Y) = U diag((σ₁ − λ)₊, (σ₂ − λ)₊, ..., (σ_n − λ)₊) Vᵀ,

    in which λ is a regularization parameter that can be optimized for the best performance. Again this denoiser can be employed in the D-AMP framework to recover low-rank matrices from an underdetermined set of linear equations. To calculate the Onsager correction term we should compute div SVT_λ(Y). According to [69] the divergence of singular value thresholding is given by

    div SVT_λ(Y) = Σ_{i=1}^n I(σ_i > λ) + 2 Σ_{i,j=1, i≠j}^n σ_i(σ_i − λ)₊/(σ_i² − σ_j²).

    B. Monte Carlo Method

    While simple denoisers often yield a closed form for their divergence, high-performance denoisers are often data dependent, making it very difficult to characterize their input-output relation explicitly. Here we explain how a good approximation of the divergence can be obtained in such cases. This method relies on a Monte Carlo technique first developed in [65]. The authors of that work showed that given a denoiser D_{σ,τ}(x), using an i.i.d. random vector b ∼ N(0, I), we can estimate

    Fig. 9. Predicted and observed intermediate MSEs of hard-thresholding-based AMP, with and without smoothing. Notice that the smoothed version is well predicted by the state evolution whereas hard-thresholding-based AMP without smoothing is not. This discrepancy is due to the fact that the hard-thresholding denoiser is not continuous.

    the divergence with

    div D_{σ,τ} = lim_{ε→0} E_b { b*(D_{σ,τ}(x + εb) − D_{σ,τ}(x))/ε }
                ≈ E_b ( (1/ε) b*(D_{σ,τ}(x + εb) − D_{σ,τ}(x)) ),

    for very small ε.

    The only challenge in using this formula is calculating the expected value. This can be done efficiently using Monte Carlo simulation. We generate M i.i.d. N(0, I) vectors b₁, b₂, ..., b_M. For each vector b_i we obtain an estimate of the divergence, d̂iv_i. We then obtain a good estimate of the divergence by averaging:

    d̂iv D_{σ,τ} = (1/M) Σ_{i=1}^M d̂iv_i.

    According to the weak law of large numbers, as M → ∞ this estimate converges to E_b( (1/ε) b*(D_{σ,τ}(x + εb) − D_{σ,τ}(x)) ).

    When dealing with images, due to the high dimensionality of the signal, we can accurately approximate the expected value using only a single random sample. That is, we can let M = 1.¹⁴ Note that in this case the calculation of the Onsager correction term is quite efficient and requires only one additional application of the denoising algorithm. In all of the simulations in this paper we have used either the explicit calculation of the Onsager correction term or the Monte Carlo method with M = 1.
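
    In code, the Monte Carlo estimate costs one extra application of the denoiser per probe vector, so with M = 1 it adds a single denoiser call per D-AMP iteration. A minimal Python sketch is given below; the black-box denoiser used for the check is a stand-in assumption (soft-thresholding, whose divergence is known in closed form).

    import numpy as np

    def mc_divergence(denoiser, x, eps=1e-3, M=1, rng=np.random.default_rng(6)):
        # Monte Carlo estimate of div D(x) ~ E_b[ b^T (D(x + eps*b) - D(x)) / eps ],  b ~ N(0, I).
        Dx = denoiser(x)
        estimates = []
        for _ in range(M):
            b = rng.standard_normal(x.shape)
            estimates.append(b @ (denoiser(x + eps * b) - Dx) / eps)
        return float(np.mean(estimates))

    # quick sanity check against a denoiser whose divergence is known in closed form
    soft = lambda v: np.sign(v) * np.maximum(np.abs(v) - 0.5, 0.0)
    x = np.random.default_rng(7).standard_normal(10000)
    print(mc_divergence(soft, x, M=1), np.sum(np.abs(x) > 0.5))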

    VI. SMOOTHING A DENOISER

    The denoiser used within D-AMP can take on almost any form. However, the state evolution predictions are not necessarily accurate if the denoiser is not Lipschitz continuous. This requirement seems to disallow some popular denoisers with discontinuities, such as hard-thresholding. Figure 9 compares the state evolution predictions for the hard-thresholding denoiser alongside the actual performance of D-AMP based on hard-thresholding; the state evolution predictions fail

    ¹⁴When dealing with short signals (n < 1000), rather than images, we found that using additional Monte Carlo samples produced more consistent results.



    entirely. One simple idea to resolve this issue is to "smooth" the denoisers. The smoothed version should behave nearly the same as the original denoiser but, because it has no discontinuities, should satisfy the state evolution equations. The concept of smoothing simple denoisers and this process's effects on their performance have been analyzed in [71] and [72]. Here we explain how smoothing can be applied in practice.

    Let η(x) be a discontinuous denoiser. Now define a new denoiser η̃(x) as follows:

    η̃(x) = ∫_{ζ∈R^n} η(x − ζ) (1/((2π)^{n/2} r^n)) e^{−‖ζ‖₂²/2r²} dζ, (31)

    where dζ = dζ₁ dζ₂ ... dζ_n. The denoiser η̃(x) is simply the convolution of η(x) with a Gaussian kernel with standard deviation r. Note that the width r dictates the amount of smoothness we apply to η. Larger values of r lead to a smoother η̃. Below we present a simple lemma that proves η̃(x) is in fact smooth.

    Suppose that η(x₁, x₂, ..., x_n) satisfies the following condition:

    ∫ |η_i(ζ̃)ζ̃_i| e^{−‖ζ̃‖₂²/4r²} dζ̃ < ∞. (32)

    Note that this condition implies that η_i is not growing very fast as ζ₁, ζ₂, ..., ζ_n → ∞.

    Lemma 4: If η satisfies (32), then η̃(x) defined in (31) is continuously differentiable with a bounded derivative and is thus Lipschitz continuous.

    Proof: To prove η̃ is continuously differentiable with bounded derivative, we prove that all the partial derivatives exist, are bounded, and are continuous. Let η = (η₁, η₂, ..., η_n) and η̃ = (η̃₁, η̃₂, ..., η̃_n). By a simple change of integration variables we obtain

    η̃(x) = ∫_{ζ̃∈R^n} η(ζ̃) (1/((2π)^{n/2} r^n)) e^{−‖x−ζ̃‖₂²/2r²} dζ̃. (33)

    Now we calculate the j-th partial derivative of η̃_i. Let b_j ∈ R^n denote a vector whose elements are all zero except for the j-th element, which is equal to one. Then,

    ∂η̃_i(x)/∂x_j = lim_{γ→0} (η̃_i(x + γb_j) − η̃_i(x))/γ
                 = lim_{γ→0} ∫_{ζ̃∈R^n} (η_i(ζ̃)/((2π)^{n/2} r^n)) ((e^{−‖x+γb_j−ζ̃‖₂²/2r²} − e^{−‖x−ζ̃‖₂²/2r²})/γ) dζ̃. (34)

    From the mean value theorem we conclude that there exists γ̃ between 0 and γ such that

    (e^{−‖x+γb_j−ζ̃‖₂²/2r²} − e^{−‖x−ζ̃‖₂²/2r²})/γ = ((ζ̃_j − γ̃ − x_j)/r²) e^{−‖x+γ̃b_j−ζ̃‖₂²/2r²}, (35)

    where the last equality is due to the fact that the j-th element of b_j is equal to one. Also, note that for |γ| < 1 we have

    |((ζ̃_j − γ̃ − x_j)/r²) e^{−‖x+γ̃b_j−ζ̃‖₂²/2r²}| ≤ ((|ζ̃_j| + 1 + |x_j|)/r²) e^{−Σ_{k≠j}(x_k−ζ̃_k)²/2r²} e^{(−ζ̃_j² + 2(|x_j|+1)|ζ̃_j|)/2r²},

    where to obtain the last inequality we used the fact that all the elements of b_j except the j-th one are zero. Define

    h(ζ̃) = ((|ζ̃_j| + 1 + |x_j|)/r²) e^{−Σ_{k≠j}(x_k−ζ̃_k)²/2r²} e^{(−ζ̃_j² + 2(|x_j|+1)|ζ̃_j|)/2r²}.

    It is straightforward to use (32) and check that

    ∫ |η_i(ζ̃) (1/((2π)^{n/2} r^n)) h(ζ̃)| dζ̃ < ∞.

    So far we have proved that the absolute value of the integrand in (34) is less than or equal to |η_i(ζ̃)| h(ζ̃)/((2π)^{n/2} r^n), which is an integrable function. Hence, we can employ the dominated convergence theorem to show that

    ∂η̃_i(x)/∂x_j = lim_{γ→0} (η̃_i(x + γb_j) − η̃_i(x))/γ
                 = ∫_{ζ̃∈R^n} (η_i(ζ̃)/((2π)^{n/2} r^n)) lim_{γ→0} ((e^{−‖x+γb_j−ζ̃‖₂²/2r²} − e^{−‖x−ζ̃‖₂²/2r²})/γ) dζ̃
                 = ∫_{ζ̃∈R^n} (η_i(ζ̃)/((2π)^{n/2} r^n)) ((ζ̃_j − x_j)/r²) e^{−‖x−ζ̃‖₂²/2r²} dζ̃. (36)

    It is straightforward to conclude that this derivative is bounded. Proving the continuity of the derivative employs the same lines of reasoning and hence we skip it. □

    Calculating η̃(x) from η(x) is not straightforward for the following two reasons: (i) Equation (31) dictates that we integrate over all of R^n, and (ii) we usually do not have access to the explicit form of η. To get around these problems we again turn to Monte Carlo sampling.

    To approximately calculate (31) using Monte Carlo sampling, first generate a series of M random vectors h₁, h₂, ..., h_M, each with i.i.d. Gaussian elements with standard deviation r. Next, for each h_i, pass h_i + x through the discontinuous denoiser η(x) and then average the results to get a smooth denoiser ˆ̃η(x). That is, approximate η̃(x) with

    ˆ̃η(x) = (1/M) Σ_{i=1}^M η(x + h_i), (37)

    where h_i ∼ N(0, r²I) for all i. Figure 11 compares the input-output relationship of the hard-thresholding denoiser before and after it has been smoothed using this method. Notice the smoothing process completely removes the discontinuities but otherwise leaves the function intact.
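
    A minimal Python implementation of (37) simply averages the discontinuous denoiser over a few Gaussian perturbations of its input; here hard-thresholding plays the role of η, and the values of r and M are illustrative assumptions.

    import numpy as np

    def hard(x, tau=1.0):                             # discontinuous denoiser eta(x): hard-thresholding
        return x * (np.abs(x) > tau)

    def smoothed(eta, x, r=0.05, M=8, rng=np.random.default_rng(8)):
        # Monte Carlo approximation of eta_tilde(x) = E_h[ eta(x + h) ], h ~ N(0, r^2 I); see (37)
        return np.mean([eta(x + r * rng.standard_normal(x.shape)) for _ in range(M)], axis=0)

    x = np.linspace(-2.0, 2.0, 9)
    print(hard(x))
    print(smoothed(hard, x))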

    The above discussion does not provide any suggestion on how we should pick the smoothing parameter r. In fact, a rigorous study of the effect of r in AMP requires the evaluation of the difference |‖x^t − x_o‖₂²/N − θ^t(x_o, δ, σ_w²)| in terms of the dimension. We leave it as an open problem for future research. Nevertheless, from a practical perspective r can be considered as just another denoiser parameter. The problem of optimizing denoiser parameters has been extensively studied in the field of image processing [63]–[65].

    Fig. 10. Reconstructions of a sparse signal that was sampled at a rate of δ = 1/3. Notice that D-AMP based on smoothed-hard-thresholding successfully reconstructs the signal whereas D-AMP based on hard-thresholding does not. The failure of hard-thresholding-based D-AMP is due to the discontinuity of the hard-thresholding denoiser.

    Fig. 11. Hard-thresholding and smoothed-hard-thresholding denoisers. Note how the smoothing process has removed the discontinuities.

    Figures 9 and 10 demonstrate the benefits of smoothing a denoiser using this approach. Unlike D-AMP using the original hard-thresholding denoiser, D-AMP using the smoothed denoiser closely follows the state evolution. This change is significant because it allows us to take advantage of the theory and tuning strategies developed in Section III. More importantly, Figure 10 illustrates how D-AMP based on smoothed-hard-thresholding dramatically outperforms its discontinuous counterpart.

    Before proceeding, we would like to emphasize that the above process is not needed for any of the advanced denoisers that we explored in this paper. We found that advanced denoisers satisfy the state evolution and perform exceptionally in D-AMP without any smoothing. We believe this finding implies they are sufficiently smooth to begin with.

    VII. SIMULATION RESULTS FOR IMAGING APPLICATIONS

    To demonstrate the efficacy of the D-AMP framework, we evaluate its performance on imaging applications.

    A. A Menagerie of Image Denoising Algorithms

    As we have discussed so far, D-AMP employs a denoising algorithm for signal recovery problems. In this section, we briefly review some well-known image denoising algorithms that we would like to use in D-AMP. We later demonstrate that any of these denoisers, as well as many others, can be used within our D-AMP algorithm in order to reconstruct various compressively sampled signals. As we discussed in Section III-H, theory says that if denoising algorithm M outperforms denoising algorithm N, then D-AMP based on M will outperform D-AMP based on N. We will see this behavior in our simulations as well.

    Below we represent a noisy image with the vector f; f = x + σz, where x is the noise-free version of the image, σ is the standard deviation of the noise, and the elements of z follow an i.i.d. Gaussian distribution.

    1) Gaussian kernel regression: One of the simplest and oldest denoisers is Gaussian kernel regression, which is implemented via a Gaussian filter. As the name suggests, it simply applies a filter whose coefficients follow a Gaussian distribution to the noisy image. It takes the form:

    x̂ = f ∗ G, (38)

    where G and ∗ denote the Gaussian kernel and the convolution operator, respectively. The Gaussian filter operates under the model that a signal is smooth. That is, neighboring pixels should have similar values. Note that this assumption is violated on image edges and hence this denoiser tends to over-smooth them. Compared to other approaches Gaussian kernel regression has a very low implementation cost. However, it does not remove noise as well as other denoisers.

    2) Bilateral filter: Similar to kernel regression, the bilateral filter [72] sets each pixel value according to a weighted average of neighboring