

arXiv:1412.8161v1 [math.ST] 28 Dec 2014

    Posterior Concentration Properties of a General Class of

    Shrinkage Estimators around Nearly Black Vectors

    Prasenjit Ghosh and Arijit Chakrabarti

    Applied Statistics Unit, Indian Statistical Institute, Kolkata, India

    December 30, 2014

    Abstract

We consider the problem of estimating a high-dimensional multivariate normal mean vector when it is sparse in the sense of being nearly black. Optimality of Bayes estimates corresponding to a very general class of continuous shrinkage priors on the mean vector is studied in this work. The class of priors considered is rich enough to include a wide variety of heavy-tailed priors, including the horseshoe, which are in extensive use in sparse high-dimensional problems. In particular, the three parameter beta normal mixture priors, the generalized double Pareto priors, the inverse gamma priors and the normal-exponential-gamma priors fall inside this class. We work under the frequentist setting where the data are generated according to a multivariate normal distribution with a fixed unknown mean vector. Under the assumption that the number of non-zero components of the mean vector is known, we show that the Bayes estimators corresponding to this general class of priors attain the minimax risk (possibly up to a multiplicative constant) corresponding to the $\ell_2$ loss. Further, an upper bound on the rate of contraction of the posterior distribution around the estimators under study is established. We also provide a lower bound to the posterior variance for an important subclass of this general class of shrinkage priors that includes the generalized double Pareto priors with shape parameter $1/2$, the three parameter beta normal mixtures with parameters $a = 1/2$ and $b > 0$ (including the horseshoe in particular), the inverse gamma prior with shape parameter $1/2$ and many other shrinkage priors. This work is inspired by the recent work of van der Pas et al (2014) on the posterior contraction properties of the horseshoe prior under the present set-up. We extend their results to this general class of priors and come up with novel unifying proofs using properties of slowly varying functions. This work shows that the general scheme of arguments in van der Pas et al (2014) can be used in greater generality.

    1 Introduction

With rapid advancements in modern technology and computing facilities, high-throughput data have become commonplace in real-life problems across diverse scientific fields such as genomics, biology, medicine, cosmology, finance, economics and climate studies. As a result, inferential problems involving a large number of unknown parameters are coming to the fore. Problems where the number of unknown parameters grows at least as fast as the number of observations are typically called high-dimensional. In such problems, it is often also true that only a few of these parameters are of real importance. For example, in a high-dimensional regression problem, it is often true that the proportion of non-zero regressors, or regressors with large magnitude, is quite small compared to the total number of candidate regressors. This is called the phenomenon of sparsity. A common Bayesian approach to model sparse high-dimensional data is to use a two-component point mass mixture prior for the parameters, which puts a positive mass at zero (to induce sparsity) and a heavy-tailed continuous distribution (to identify the non-zero coefficients). These are also referred to as spike and slab priors or two-groups priors. This is a very natural way of modelling data of this kind from a Bayesian viewpoint. See Johnstone and Silverman (2004) and Efron (2004) in this context.

1 Email for correspondence: prasenjit [email protected]; Arijit Chakrabarti: [email protected]

Use of the two-groups prior, although very natural, poses a very daunting task computationally. Note that the cardinality of the model space becomes $2^p$, where $p$ is the number of parameters involved, and even for moderately large $p$, like 50, it is practically impossible to study the posterior probabilities of the different models. Sometimes it is also possible that most of the parameters are very close to zero, but not exactly equal to zero, and in such a case a continuous prior may be able to capture sparsity in a more flexible manner. Due to these reasons, significant efforts have gone into modeling sparse high-dimensional data in recent times through hierarchical one-group continuous priors, which are also called one-group shrinkage priors. Bayesian analysis is computationally much more tractable than with the two-groups prior in such cases and is easily implementable through standard MCMC techniques. But more importantly, these priors are suitable to capture sparsity, since they accord a significant chunk of probability around zero while having tails heavy enough to ensure a priori large probabilities for large parameter values. In general, such priors are expressed as multivariate scale mixtures of normals that mix over two levels of parameters appearing in the scale, referred to as a global shrinkage parameter and local shrinkage parameters. While the global shrinkage parameter accounts for the overall sparsity in the data by shrinking the noise observations towards the origin, the local shrinkage parameters are helpful in detecting the obvious signals by leaving the large observations mostly unshrunk.
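As a rough numerical illustration of this global-local behaviour (our own sketch, not taken from the paper), the following snippet draws from a horseshoe-type global-local prior and from a Laplace prior of comparable typical magnitude, and compares how much mass each places very close to zero and far out in the tails; the global scale and the thresholds are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
tau = 0.1  # illustrative global shrinkage scale (assumption)

# Global-local draw: lambda_i ~ half-Cauchy(0, 1), theta_i | lambda_i ~ N(0, lambda_i^2 * tau^2)
lam = np.abs(rng.standard_cauchy(n))
theta_hs = rng.normal(0.0, lam * tau)

# Laplace draws matched to the same median absolute value, for a crude comparison
b = np.median(np.abs(theta_hs)) / np.log(2.0)  # Laplace(0, b) has median |theta| = b * log(2)
theta_lap = rng.laplace(0.0, b, size=n)

for name, th in [("horseshoe-type", theta_hs), ("Laplace", theta_lap)]:
    print(f"{name:14s}  P(|theta| < 0.01) = {np.mean(np.abs(th) < 0.01):.3f}"
          f"   P(|theta| > 5) = {np.mean(np.abs(th) > 5):.4f}")
```

Despite having the same median magnitude, the scale-mixture draws show noticeably more mass both in a small neighbourhood of zero and far out in the tails, which is exactly the combination described above.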

A great variety of one-group shrinkage priors have appeared in the literature over the years. Notable early examples are the t-prior in Tipping (2001), the double-exponential prior in Park and Casella (2008) and Hans (2009), and the normal-exponential-gamma priors in Griffin and Brown (2005). Very recently, Carvalho et al (2009, 2010) introduced the horseshoe prior, which has very appealing properties. Subsequently, many other one-group priors have been proposed in the literature, e.g., in Polson and Scott (2011, 2012), Armagan et al (2011), Armagan et al (2012) and Griffin and Brown (2010, 2012, 2013). The class of three parameter beta normal mixture priors was introduced in Armagan et al (2011), and the generalized double Pareto class of priors was introduced by Armagan et al (2012). The three parameter beta normal mixture family of priors encompasses, among others, the horseshoe, the Strawderman-Berger and the normal-exponential-gamma priors. Very recently, a different class of one-group priors named Dirichlet-Laplace (DL) priors has been introduced in Bhattacharya et al (2014). They investigated its various theoretical properties and demonstrated its good performance through extensive simulations.

As commented in Castillo and van der Vaart (2012), the Bayesian approach to sparsity is not driven by the ultimate goal of producing estimators that attain the minimax rate or, for that matter, posterior distributions with rate of contraction the same as the minimax rate. However, for theoretical investigations, the minimax rate can be taken as a benchmark, and this is a motivation to study this kind of optimality property for the Bayesian approach to sparsity. In an important article, Johnstone and Silverman (2004) focused on the case where a two-groups prior is used to model the mean parameters. They showed that if the unknown proportion of non-zero means is estimated by marginal maximum likelihood and a co-ordinatewise posterior median estimate is used, the resulting estimator attains the minimax rate with respect to the $\ell_q$ loss, $q \in (0, 2]$. In Castillo and van der Vaart (2012), the full Bayes approach was studied, and they found conditions on the two-groups prior that ensure contraction of the posterior distribution at the minimax rate. In recent times, researchers have started to investigate the optimality properties of estimators and testing rules, and posterior contraction rates, where a one-group shrinkage prior has been used instead. Amongst various one-group shrinkage priors, the horseshoe prior has acquired a prominent place in the Bayesian literature and it has been used extensively in inferential problems involving sparsity. Carvalho et al (2010) have theoretically shown good performance of the horseshoe estimator (the Bayes estimate corresponding to the horseshoe prior) in terms of the Kullback-Leibler risk when the true mean is zero. Datta and Ghosh (2013) showed a near oracle optimality property of multiple testing rules based on the horseshoe estimator in the context of multiple testing. Ghosh et al (2014) extended their work by theoretically showing that the multiple testing rules based on a general class of tail-robust shrinkage priors enjoy similar optimality properties as the horseshoe. This general class of shrinkage priors is rich enough to include, among others, the three parameter beta normal priors, the generalized double Pareto priors, the inverse gamma priors, the normal-exponential-gamma priors, the horseshoe prior and the Strawderman-Berger prior in particular. In an important recent article, van der Pas et al (2014) showed that for the problem of estimation of a sparse normal mean vector, the horseshoe estimator asymptotically achieves the minimax risk with respect to the $\ell_2$ loss, possibly up to a multiplicative constant, and the corresponding posterior distribution contracts at least as fast as the minimax rate around the posterior mean. This was shown assuming that the number of non-zero means is known and that the global shrinkage parameter tends to zero at an appropriate rate as the dimension grows to infinity. They also provide conditions under which the horseshoe estimator, combined with an empirical Bayes estimate of the global variance component, still attains the minimax quadratic risk even when the number of non-zero means is unknown. In a beautiful recent article, Bhattacharya et al (2014) showed that for the estimation of a sparse multivariate normal mean vector, under the quadratic risk function, the posterior arising from the Dirichlet-Laplace prior attains a minimax optimal rate of posterior contraction, that is, the corresponding posterior distribution contracts at the minimax rate for an appropriate choice of the underlying Dirichlet concentration parameter. See also Bickel, Ritov and Tsybakov (2009) for the minimax risk properties of the Lasso estimator, which is the least squares estimator of the regression coefficients with an $L_1$ constraint on the regression coefficients. It was later shown by Castillo et al (2014) that the corresponding entire posterior distribution contracts at a much slower rate, thus indicating an inadequate measure of uncertainty in the estimate.

A natural question to ask, and also posed in Section 6 of van der Pas et al (2014), is what aspects of the shrinkage priors are essential for obtaining optimal posterior concentration properties of the kind obtained for the horseshoe prior. As mentioned earlier, in the context of simultaneous testing of a large number of independent normal means, Ghosh et al (2014) considered a general class of heavy-tailed shrinkage priors and showed some optimality properties of the multiple testing rules induced by the corresponding Bayes estimates. Polson and Scott (2011) suggested that in sparse problems, one should choose the prior distribution corresponding to the local shrinkage parameter to be appropriately heavy-tailed so that large signals can escape the gravitational pull of the corresponding global variance component and are left almost unshrunk, which is essential for the recovery of large signals when the data is sparse. It is to be mentioned in this context that priors with exponential or lighter tails, such as the Laplace or double-exponential prior and the normal prior, fail to meet this condition. Motivated by this, we consider in this article the problem of estimating a sparse multivariate normal mean vector based on a very general class of tail-robust shrinkage priors that is rich enough to include a wide variety of shrinkage priors, such as the three parameter beta normal mixtures (which generalize the horseshoe prior in particular), the generalized double Pareto priors, the inverse-gamma priors, the half-t priors and many more. It is shown that when the underlying multivariate normal mean vector is sparse in the nearly-black sense, the Bayes estimates corresponding to this general class of priors asymptotically attain the minimax quadratic risk, possibly up to a multiplicative factor, and the entire posterior distribution contracts at least as fast as the minimax rate around the posterior mean, assuming the number of non-zero means is known and that we are free to choose the global shrinkage parameter, which tends to zero at an appropriate rate as the dimension grows to infinity. An important contribution of our theoretical investigation is showing that shrinkage priors which are appropriately heavy-tailed (to be defined in Section 2) and have sufficient mass around the origin are good enough to attain the minimax optimal rate of contraction, provided that the global tuning parameter is carefully chosen. We also provide a lower bound to the corresponding posterior variance for an important subclass of this general class of shrinkage priors that includes the generalized double Pareto priors with shape parameter $1/2$, the three parameter beta normal mixtures with parameters $a = 1/2$ and $b > 0$ (including the horseshoe in particular), the inverse gamma prior with shape parameter $1/2$ and many other shrinkage priors. We provide a general unifying argument that works for the general class under consideration and thus extends the work of van der Pas et al (2014).

We organize the paper as follows. In Section 2, we describe the problem and the general class of shrinkage priors under consideration. Section 3 contains the main theoretical results, namely that estimators arising out of this general class of shrinkage priors attain the minimax quadratic risk up to some multiplicative constant and that the corresponding posterior distribution achieves a minimax optimal rate of posterior contraction. Proofs of the main theorems and other theoretical results essential for their derivation are given in Section 4 (Appendix), followed by a discussion in Section 5.

Notations: In this paper, we adopt the same notational conventions as in van der Pas et al (2014). Let $\{A_n\}$ and $\{B_n\}$ be two sequences of positive real numbers indexed by $n$. We write $A_n \asymp B_n$ to denote $0 < \liminf_{n \to \infty} A_n/B_n \le \limsup_{n \to \infty} A_n/B_n < \infty$, and we write $A_n \lesssim B_n$ to denote that there exists a constant $c > 0$, independent of $n$, such that $A_n \le c\, B_n$; similarly, $A_n \gtrsim B_n$ denotes $B_n \lesssim A_n$.

    2 A General Class of Tail Robust Shrinkage Priors

Let us suppose that we observe an $n$-component random observation $(X_1, \ldots, X_n) \in \mathbb{R}^n$ such that
$$X_i = \theta_i + \epsilon_i, \quad i = 1, \ldots, n, \qquad (2.1)$$
where the unknown parameters $\theta_1, \ldots, \theta_n$ denote the effects under investigation and $\epsilon = (\epsilon_1, \ldots, \epsilon_n) \sim N_n(0, I_n)$.

Let $l_0[p_n]$ denote the subset of $\mathbb{R}^n$ given by
$$l_0[p_n] = \big\{\theta \in \mathbb{R}^n : \#(1 \le j \le n : \theta_j \ne 0) \le p_n\big\}. \qquad (2.2)$$

Suppose we want to estimate the true mean vector $\theta_0 = (\theta_{01}, \ldots, \theta_{0n})$ when $\theta_0$ is known to be sparse in the nearly black sense, that is, $\theta_0 \in l_0[p_n]$ with $p_n = o(n)$ as $n \to \infty$. The corresponding squared minimax rate for estimating $\theta_0$ is known to be $2 p_n \log(n/p_n)(1 + o(1))$ as $n \to \infty$ (see Donoho et al (1992)), that is,
$$\inf_{\widehat{\theta}}\; \sup_{\theta_0 \in l_0[p_n]} E_{\theta_0} \|\widehat{\theta} - \theta_0\|^2 \;=\; 2\, p_n \log\frac{n}{p_n}\,\big(1 + o(1)\big). \qquad (2.3)$$

In (2.3) above, and throughout this paper, $E_{\theta_0}$ denotes expectation with respect to the $N_n(\theta_0, I_n)$ distribution. Our goal is to obtain an estimate of $\theta_0$ from a Bayesian viewpoint with some good theoretical properties. As stated already in the introduction, a natural Bayesian approach to model (2.1) is to use a two-component point mass mixture prior for the $\theta_i$'s, given by
$$\theta_i \overset{i.i.d.}{\sim} (1 - \pi)\, \delta_{\{0\}} + \pi\, f, \quad i = 1, \ldots, n, \qquad (2.4)$$

where $\delta_{\{0\}}$ denotes the distribution having probability mass 1 at the point 0, and $f$ denotes an absolutely continuous distribution over $\mathbb{R}$. See Mitchell and Beauchamp (1988) and Johnstone and Silverman (2004) in this context. It is usually recommended to choose a heavy-tailed absolutely continuous distribution $f$ over $\mathbb{R}$ so that large observations can be recovered with a higher degree of accuracy. Johnstone and Silverman (2004) used a t distribution in this context, used an empirical Bayes approach in order to estimate the unknown mixing proportion via the method of marginal maximum likelihood, and showed that if the co-ordinatewise posterior median estimate is used, the resulting estimator of $\theta_0$ attains the minimax rate with respect to the $\ell_q$ loss, $q \in (0, 2]$. Castillo and van der Vaart (2012) studied the full Bayes approach, where they found conditions on the two-groups prior that ensure contraction of the posterior distribution at the minimax rate. A detailed list of other empirical Bayes approaches to the two-groups model can be found in Castillo and van der Vaart (2012), Efron (2008), Jiang and Zhang (2009), Yuan and Lin (2005) and references therein.
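For concreteness, here is a small sampling sketch (ours, with illustrative values of the dimension, the mixing proportion and the slab) of the two-groups prior (2.4) together with data generated from model (2.1); it also prints the size of the associated model space, which drives the computational difficulty discussed next.

```python
import numpy as np

rng = np.random.default_rng(1)
n, mix_prop = 1000, 0.05       # dimension and mixing proportion (illustrative assumptions)

# theta_i ~ (1 - mix_prop) * delta_0 + mix_prop * f, with a heavy-tailed slab f (standard Cauchy here)
is_signal = rng.random(n) < mix_prop
theta = np.where(is_signal, rng.standard_cauchy(n), 0.0)

# Observations from the sequence model X_i = theta_i + eps_i, eps_i ~ N(0, 1)
x = theta + rng.normal(size=n)

print("non-zero means drawn:", int(is_signal.sum()))
print("2^n (size of the model space) has roughly", int(n * np.log10(2)) + 1, "decimal digits")
```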

As already mentioned in the introduction, although the two-groups prior (2.4) is considered to be the most natural formulation for handling sparsity from a Bayesian viewpoint, it poses a daunting computational challenge in high-dimensional problems because of the enormously large model space (in this case of size $2^n$). Due to this reason, the one-group formulation to model sparse data has received considerable attention from researchers over the years, mostly due to the ease of its computational tractability. Polson and Scott (2011) showed that almost all such shrinkage priors can be expressed as multivariate scale mixtures of normals, which makes computation based on these one-group shrinkage priors much easier compared to the corresponding two-groups formulation. Standard Markov chain Monte Carlo techniques are available in the Bayesian literature for the computation of the corresponding Bayes estimates of the underlying model parameters.
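To make the MCMC remark concrete, the sketch below gives a minimal Gibbs sampler for one member of the family introduced next, the horseshoe prior with the global parameter held fixed, using a standard inverse-gamma auxiliary-variable representation of the half-Cauchy. This is our own illustration (the paper itself is purely theoretical); the data, the value of tau and the iteration counts are assumptions.

```python
import numpy as np

def inv_gamma(rng, shape, scale):
    """Draw from InvGamma(shape, scale), i.e. density proportional to x^(-shape-1) * exp(-scale/x)."""
    return scale / rng.gamma(shape, size=np.shape(scale))

def horseshoe_gibbs(x, tau, n_iter=3000, burn=1000, seed=0):
    """Gibbs sampler for theta | x under the horseshoe prior with fixed global parameter tau.

    Augmentation: lambda_i^2 | nu_i ~ InvGamma(1/2, 1/nu_i), nu_i ~ InvGamma(1/2, 1),
    which marginally makes lambda_i half-Cauchy(0, 1)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    lam2, nu = np.ones(n), np.ones(n)
    theta_sum, kept = np.zeros(n), 0
    for it in range(n_iter):
        # theta_i | x_i, lambda_i^2 ~ N(v_i * x_i, v_i) with v_i = lam2*tau^2 / (1 + lam2*tau^2)
        v = lam2 * tau**2 / (1.0 + lam2 * tau**2)
        theta = rng.normal(v * x, np.sqrt(v))
        # lambda_i^2 | theta_i, nu_i ~ InvGamma(1, 1/nu_i + theta_i^2 / (2 tau^2))
        lam2 = inv_gamma(rng, 1.0, 1.0 / nu + theta**2 / (2.0 * tau**2))
        # nu_i | lambda_i^2 ~ InvGamma(1, 1 + 1/lambda_i^2)
        nu = inv_gamma(rng, 1.0, 1.0 + 1.0 / lam2)
        if it >= burn:
            theta_sum += theta
            kept += 1
    return theta_sum / kept  # approximate posterior means E(theta_i | x_i, tau)

# Tiny demonstration on assumed data: three large signals among pure noise.
rng = np.random.default_rng(42)
x = rng.normal(size=20)
x[:3] += np.array([8.0, -7.0, 6.0])
print(np.round(horseshoe_gibbs(x, tau=0.05)[:6], 2))  # large x_i barely shrunk, small ones pulled to 0
```

Averaging the retained draws approximates the posterior mean estimator studied in this paper, for this particular member of the class.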

In this article, we consider Bayes estimators based on a general class of one-group shrinkage priors given through the following hierarchical one-group formulation:
$$X_i \mid \theta_i \sim N(\theta_i, 1), \quad \text{independently for } i = 1, \ldots, n,$$
$$\theta_i \mid (\lambda_i^2, \tau^2) \sim N(0, \lambda_i^2 \tau^2), \quad \text{independently for } i = 1, \ldots, n,$$
$$\lambda_i^2 \sim \pi(\lambda_i^2), \quad \text{independently for } i = 1, \ldots, n,$$
with $\pi(\lambda_i^2)$ being given by
$$\pi(\lambda_i^2) = K\, (\lambda_i^2)^{-a-1}\, L(\lambda_i^2), \qquad (2.5)$$

where $K \in (0, \infty)$ is the constant of proportionality, $a$ is a positive real number and $L : (0, \infty) \to (0, \infty)$ is a measurable, non-constant, slowly varying function satisfying the following:

Assumption 2.1.
1. $\lim_{t \to \infty} L(t) \in (0, \infty)$, that is, there exists some positive real number $c_0$ such that $L(t) > c_0$ for all $t \ge t_0$, for some finite positive real number $t_0$ depending on $L$ and $c_0$. Choose $t_0 > 0$ to be the minimum of all such $t$'s for which $L(t) > c_0$.
2. There exists some $0 < M < \infty$ such that $\sup_{t \in (0, \infty)} L(t) \le M$.

Recall that a measurable function $L : (0, \infty) \to (0, \infty)$ is said to be slowly varying if, for each fixed $\alpha > 0$, $L(\alpha x)/L(x) \to 1$ as $x \to \infty$. A simple sufficient condition for a function $L : (0, \infty) \to (0, \infty)$ to be slowly varying is that $\lim_{x \to \infty} L(x) \in (0, \infty)$. Hence, every constant function is a slowly varying function. However, since we assume the prior given in (2.5) to be proper, the possibility of $L(\cdot)$ being a constant function is immediately ruled out.

Each $\lambda_i^2$ is referred to as a local shrinkage parameter and the parameter $\tau^2$ is called the global shrinkage parameter. For the theoretical treatment of this paper, we assume the global shrinkage parameter $\tau^2$ to be known. We would like to mention here that a very broad class of one-group shrinkage priors actually falls inside the above general class. For example, it can easily be seen that the celebrated horseshoe prior is a member of this general class under study by simply taking $a = 0.5$ and $L(t) = t/(1+t)$ in (2.5), which satisfies both conditions of Assumption 2.1. Ghosh et al (2014) observed that the three parameter beta normal mixtures (which include the horseshoe and the normal-exponential-gamma priors as special cases) and the generalized double Pareto priors can be expressed in the above general form, by showing that the corresponding prior distribution of the local shrinkage parameters can be written in the form given in (2.5) with the corresponding $L(\cdot)$ satisfying Assumption 2.1. It is easy to verify that some other well-known shrinkage priors, such as the families of inverse-gamma priors and the half-t priors, are also covered by this general class of prior distributions under consideration. We would like to mention in this context that the above general class excludes priors such as the double-exponential or Laplace prior or the normal prior, which have exponential or lighter tails.
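As a quick numerical sanity check (our own, not from the paper), the snippet below verifies that the density of $\lambda^2$ induced by a standard half-Cauchy prior on $\lambda$ coincides with the form (2.5) for $a = 1/2$, $L(t) = t/(1+t)$ and $K = 1/\pi$, and that this $L$ is bounded above and tends to a positive finite limit, as required by Assumption 2.1.

```python
import numpy as np

def halfcauchy_lambda2_density(u):
    """Density of u = lambda^2 when lambda ~ half-Cauchy(0, 1): f(u) = 1 / (pi * sqrt(u) * (1 + u))."""
    return 1.0 / (np.pi * np.sqrt(u) * (1.0 + u))

def general_form_density(u, a=0.5, K=1.0 / np.pi):
    """The form (2.5) with L(t) = t / (1 + t): pi(u) = K * u^(-a-1) * L(u)."""
    return K * u ** (-a - 1.0) * (u / (1.0 + u))

u = np.logspace(-6, 6, 13)
print("max density discrepancy:", np.max(np.abs(halfcauchy_lambda2_density(u) - general_form_density(u))))

t = np.logspace(-3, 9, 1000)
L = t / (1.0 + t)
print("sup of L on the grid:", L.max(), "   L(t) for very large t:", L[-1])  # bounded by 1, limit 1
```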

From Theorem 1 of Polson and Scott (2011) it follows that the above general class of one-group priors will be tail-robust, in the sense that for any given $\tau > 0$, $E(\theta_i \mid X_i, \tau^2) \approx X_i$ for large $X_i$'s, which means that for such priors large observations will be left almost unshrunk even when the global shrinkage parameter $\tau^2$ is very small. We shall elucidate this fact in greater detail in the forthcoming sections using properties of slowly varying functions. It was suggested in Polson and Scott (2011) that the global shrinkage parameter $\tau^2$ should be very small so that small $X_i$'s, or the noise observations, can be shrunk towards the origin, while the prior distribution of the local shrinkage parameters $\lambda_i^2$ should have heavy tails so that large signals can escape the effect of $\tau^2$ and remain almost unshrunk. Thus (2.5) should result in a prior distribution for the $\theta_i$'s which has a high concentration of mass near the origin but has thick tails at the extremes to accommodate large signals. Polson and Scott (2011) also showed that for priors having exponential or lighter tails, such as the Laplace or double-exponential prior, even the large $X_i$'s will always be shrunk towards the origin by some non-diminishing amount for small values of $\tau$, which is certainly not desirable for the recovery of large signals in sparse situations.

Now, for a general global-local scale mixture of normals we have
$$\theta_i \mid (X_i, \lambda_i^2, \tau^2) \sim N\big((1 - \kappa_i) X_i,\; (1 - \kappa_i)\big), \qquad \kappa_i = \frac{1}{1 + \lambda_i^2 \tau^2},$$
independently for $i = 1, \ldots, n$, so that for each $i$ the posterior mean of $\theta_i$ is given by
$$E(\theta_i \mid X_i, \lambda_i^2, \tau^2) = (1 - \kappa_i)\, X_i. \qquad (2.6)$$
Next, using the iterated expectation formula, it follows that
$$E(\theta_i \mid X_i, \tau^2) = \big(1 - E(\kappa_i \mid X_i, \tau^2)\big)\, X_i. \qquad (2.7)$$
The corresponding posterior mean $E(\theta \mid X, \tau) = \big(E(\theta_1 \mid X_1, \tau^2), \ldots, E(\theta_n \mid X_n, \tau^2)\big)$ will be the estimator arising out of the general class of shrinkage priors (2.5) and will be denoted by $T_\tau(X)$. It will be shown in the next section that when $\theta_0$ is sparse in the nearly black sense and $a \in [\tfrac12, 1)$, the estimator $T_\tau(X)$ of $\theta_0$ will asymptotically attain the minimax rate (2.3) up to some multiplicative constant, assuming $\tau = p_n/n$, and that the posterior distribution contracts at least as fast as the minimax rate around the posterior mean.
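The shrinkage behaviour of the resulting estimator can be seen numerically. The sketch below (ours, for the horseshoe member $a = 1/2$, $L(t) = t/(1+t)$, with an assumed small value of $\tau$) computes $E(\kappa \mid x, \tau)$ in (2.7) by one-dimensional quadrature, using the fact that the change of variables from $\lambda^2$ to $\kappa = 1/(1+\lambda^2\tau^2)$ turns the prior (2.5) into a posterior density on $(0,1)$ proportional to $\kappa^{a-\frac12}(1-\kappa)^{-a-1} L\big(\tfrac{1-\kappa}{\kappa\tau^2}\big) e^{-\kappa x^2/2}$.

```python
import numpy as np
from scipy.integrate import quad

def kappa_post_unnorm(k, x, tau, a=0.5):
    """Unnormalized posterior density of kappa = 1/(1 + lambda^2 tau^2) given x, horseshoe case."""
    lam2 = (1.0 - k) / (k * tau**2)
    L = lam2 / (1.0 + lam2)
    return k**(a - 0.5) * (1.0 - k)**(-a - 1.0) * L * np.exp(-k * x**2 / 2.0)

def T_tau(x, tau):
    """Posterior mean (2.7): T_tau(x) = (1 - E(kappa | x, tau)) * x, by quadrature over kappa."""
    num = quad(lambda k: k * kappa_post_unnorm(k, x, tau), 0.0, 1.0, limit=200)[0]
    den = quad(lambda k: kappa_post_unnorm(k, x, tau), 0.0, 1.0, limit=200)[0]
    return (1.0 - num / den) * x

tau = 0.01  # small global parameter (assumption)
for x in [0.5, 1.0, 2.0, 4.0, 8.0]:
    print(f"x = {x:4.1f}   T_tau(x) = {T_tau(x, tau):7.3f}")
# Small observations are shrunk essentially to zero, while large ones are left almost untouched.
```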

    3 Theoretical Results

In this section, we first state two optimality results for the general class of shrinkage estimators when $\frac12 \le a < 1$, assuming that the number of non-zero parameters $p_n$ is known. Theorem 3.1 states that the general class of heavy-tailed shrinkage priors attains the minimax risk under the $\ell_2$-norm, when $\frac12 \le a < 1$, possibly up to a multiplicative constant. Theorem 3.2 provides an upper bound on the variance of the posterior distribution based on the chosen class of heavy-tailed priors. Theorem 3.3 provides an upper bound on the rate of posterior contraction, which equals the corresponding squared error minimax risk up to a multiplicative constant. Theorem 3.4 provides a lower bound on the posterior variance for an important subclass of this general class of shrinkage priors, which gives more insight about the spread of the posterior distribution around these estimators for various choices of $\tau$. Our proofs are based on novel unifying arguments crucially exploiting properties of slowly varying functions. We however followed the broad architecture of the proofs of the main theorems of van der Pas et al (2014). Lemmas 4.3, 4.4 and 4.5, given in the appendix, on which Theorems 3.1 through 3.3 crucially hinge, are completely independent of the work of van der Pas et al (2014). However, proofs of Lemma 4.6 and Theorem 3.4 have been derived following some key arguments of van der Pas et al (2014). This shows that the general scheme of arguments in van der Pas et al (2014) can be used in greater generality, as will be evident in the next section.


Theorem 3.1. Suppose $X \sim N_n(\theta_0, I_n)$. Then the estimator $T_\tau(X)$ based on the general class of shrinkage priors (2.5), with $\frac12 \le a < 1$, satisfies
$$\sup_{\theta_0 \in l_0[p_n]} E_{\theta_0} \| T_\tau(X) - \theta_0 \|^2 \;\lesssim\; p_n \log\frac{n}{p_n},$$
provided $\tau = (p_n/n)^{\alpha}$ with $\alpha \ge 1$, as $n \to \infty$, $p_n \to \infty$ and $p_n = o(n)$.
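To give a rough feel for this risk bound, the following Monte Carlo sketch (our own illustration, not a substitute for the proof) evaluates the total squared error of $T_\tau(X)$ for the horseshoe member of the class with $\tau = p_n/n$ on one nearly black vector, and prints it next to $p_n \log(n/p_n)$; the dimension, sparsity level and signal size are assumptions.

```python
import numpy as np
from scipy.integrate import quad

def T_tau(x, tau, a=0.5):
    """Posterior mean for the horseshoe member (L(t) = t/(1+t)) of the class (2.5), via quadrature."""
    def dens(k):  # unnormalized posterior density of kappa = 1/(1 + lambda^2 tau^2)
        lam2 = (1.0 - k) / (k * tau**2)
        return k**(a - 0.5) * (1.0 - k)**(-a - 1.0) * (lam2 / (1.0 + lam2)) * np.exp(-k * x**2 / 2.0)
    num = quad(lambda k: k * dens(k), 0.0, 1.0, limit=200)[0]
    den = quad(dens, 0.0, 1.0, limit=200)[0]
    return (1.0 - num / den) * x

rng = np.random.default_rng(3)
n, p_n = 200, 10                      # illustrative dimension and sparsity (assumptions)
theta0 = np.zeros(n)
theta0[:p_n] = 7.0                    # a nearly black vector with p_n moderately large signals
tau = p_n / n                         # the choice of tau appearing in the theory

losses = []
for _ in range(10):                   # a few replicates of ||T_tau(X) - theta_0||^2
    X = theta0 + rng.normal(size=n)
    losses.append(sum((T_tau(x, tau) - t0) ** 2 for x, t0 in zip(X, theta0)))

print("average squared error :", round(float(np.mean(losses)), 1))
print("p_n * log(n / p_n)    :", round(p_n * np.log(n / p_n), 1))
```

The two numbers should come out of the same order, in line with the theorem; the exact ratio depends on the hidden constants and on the signal strength chosen here.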


In order to get a better understanding of the spread of the posterior distribution around these estimators, we confine our attention to the case when $L(\cdot)$ given in (2.5) is non-decreasing over $(0, \infty)$, with $a = 0.5$. This subclass includes the generalized double Pareto priors with shape parameter $1/2$, the three parameter beta normal mixtures with parameters $a = 1/2$ and $b > 0$ (including the horseshoe in particular), the inverse gamma prior with shape parameter $1/2$ and many more (see Ghosh et al (2014) in this context). The next theorem gives a lower bound on the posterior variance corresponding to this restricted subclass that provides more insight into the effect of the choice of $\tau$.

Theorem 3.4. Suppose $X \sim N_n(\theta_0, I_n)$ and $\theta_0 \in l_0[p_n]$. Further assume that the function $L(\cdot)$ given by (2.5) satisfies Assumption 2.1 and is non-decreasing over $(0, \infty)$. Then, for $a = \tfrac12$, the variance of the posterior distribution corresponding to the general class of shrinkage priors satisfies
$$\sum_{i=1}^{n} E_{\theta_0}\, \mathrm{Var}(\theta_i \mid X_i) \;\gtrsim\; p_n \Big(\frac{n}{p_n}\Big)^{1-\alpha} \sqrt{\log\frac{n}{p_n}} \qquad (3.5)$$
if $\tau = (p_n/n)^{\alpha}$, $\alpha > 0$, as $n \to \infty$, $p_n \to \infty$ and $p_n = o(n)$.

Proof. See Section 4.

Following the line of arguments given in the discussion at the end of Theorem 3.4 of van der Pas et al (2014), it follows that, while for $0 < \alpha < 1$ the posterior distribution corresponding to this restricted subclass contracts at a sub-optimal rate, it contracts too quickly, resulting in an inadequate measure of uncertainty about the corresponding Bayes estimates, when $\alpha > 1$. On the other hand, the choice $\alpha = 1$ seems to be optimal in the following sense: the lower bound obtained in Theorem 3.4 is then of the order $p_n \sqrt{\log(n/p_n)}$, which misses the minimax rate by a factor of $\sqrt{\log(n/p_n)}$, and this suggests that the posterior distribution corresponding to this restricted subclass concentrates around the corresponding Bayes estimates at a rate close to the minimax rate (2.3).

    4 Appendix

    4.1 Appendix A: Proofs

Lemma 4.1. For each fixed $\tau > 0$ and each fixed $\epsilon, \delta \in (0, 1)$, the posterior distribution of the shrinkage coefficient $\kappa = 1/(1 + \lambda^2 \tau^2)$ based on the general class of shrinkage priors (2.5) satisfying Assumption 2.1, with $a > 0$, satisfies
$$\Pr\big(1 - \kappa > \epsilon \mid x, \tau\big) \;\le\; H(a, \epsilon, \delta)\, \frac{e^{(1-\delta)x^2/2}\, \tau^{2a}}{\rho(\tau^2, \epsilon, \delta)} \quad \text{uniformly in } x \in \mathbb{R},$$
where $\rho(\tau^2, \epsilon, \delta) = \psi(\tau^2, \epsilon, \delta)\, L\!\big(\tfrac{1}{\tau^2}\big(\tfrac{1}{\epsilon} - 1\big)\big)$, and $\psi(\tau^2, \epsilon, \delta)$ and $H(a, \epsilon, \delta)$ are positive quantities that do not depend on $x$.


Note that $0 < u < x^2$ implies $0 < u/x^2 < 1$, whence for every $x \ne 0$,
$$| T_\tau(x) - x | \;\le\; h(x, \tau). \qquad (4.5)$$

Now observe that the function $h_1(x, \tau)$ is strictly decreasing in $|x|$. Therefore, for any fixed $\eta > 0$ and every $\tau > 0$,
$$\sup_{|x| > \sqrt{\eta \log(1/\tau^{2a})}} h_1(x, \tau) \;\le\; h_1\big(\sqrt{\eta \log(1/\tau^{2a})},\, \tau\big),$$
implying that
$$\lim_{\tau \to 0}\; \sup_{|x| > \sqrt{\eta \log(1/\tau^{2a})}} h_1(x, \tau) = 0. \qquad (4.6)$$

Again, the function $h_2(x, \tau)$ is eventually decreasing in $|x|$. Therefore, for all sufficiently small $\tau > 0$,
$$\sup_{|x| > \sqrt{\eta \log(1/\tau^{2a})}} h_2(x, \tau) \;\le\; h_2\big(\sqrt{\eta \log(1/\tau^{2a})},\, \tau\big).$$
Let $\psi_{\infty} = \lim_{\tau \to 0} \psi(\tau^2, \epsilon, \delta)$ for every fixed $\epsilon, \delta \in (0, 1)$. Then $0 < \psi_{\infty} < \infty$, whence it follows that
$$\lim_{\tau \to 0}\; \sup_{|x| > \sqrt{\eta \log(1/\tau^{2a})}} h_2(x, \tau) = \begin{cases} 0 & \text{if } \eta > \dfrac{2}{(1-\epsilon)\delta}, \\[4pt] \infty & \text{otherwise}. \end{cases} \qquad (4.7)$$

Combining (4.6) and (4.7), together with the fact that
$$\lim_{\tau \to 0}\; \sup_{|x| > \sqrt{\eta \log(1/\tau^{2a})}} h(x, \tau) \;\le\; \lim_{\tau \to 0}\; \sup_{|x| > \sqrt{\eta \log(1/\tau^{2a})}} h_1(x, \tau) \;+\; \lim_{\tau \to 0}\; \sup_{|x| > \sqrt{\eta \log(1/\tau^{2a})}} h_2(x, \tau),$$
it immediately follows that
$$\lim_{\tau \to 0}\; \sup_{|x| > \sqrt{\eta \log(1/\tau^{2a})}} h(x, \tau) = \begin{cases} 0 & \text{if } \eta > \dfrac{2}{(1-\epsilon)\delta}, \\[4pt] \infty & \text{otherwise}. \end{cases} \qquad (4.8)$$

Observe that, by choosing $\delta$ appropriately close to 1 and $\epsilon$ sufficiently close to 0, any real number larger than 2 can be expressed in the form $\frac{2}{(1-\epsilon)\delta}$. For example, choosing $\delta = \frac{5}{6}$ and $\epsilon = \frac{1}{5}$ we obtain $\frac{2}{(1-\epsilon)\delta} = 3$. Hence, given $c > 2$, let us choose $0 < \epsilon, \delta < 1$ such that $\frac{2}{(1-\epsilon)\delta} \le c$. Using the preceding arguments, it therefore follows that, given any $c > 2$, the absolute difference between the posterior mean $T_\tau(x)$ and an observation $x$ can be bounded above by a function $h(x, \tau)$, depending on $c$, such that $\lim_{\tau \to 0} \sup_{|x| > \sqrt{\eta \log(1/\tau^{2a})}} h(x, \tau) = 0$ for all $\eta > c$.

Remark 4.1. Observe that the function $h$ defined in the proof of Lemma 4.3 also satisfies the following: for each fixed $\tau > 0$, $\lim_{|x| \to \infty} h(x, \tau) = 0$.

    Proof of Theorem 3.1

Proof. Suppose that $X \sim N_n(\theta, I_n)$, $\theta \in l_0[p_n]$ and $\tilde{p}_n = \#\{i : \theta_i \ne 0\}$. Note that $\tilde{p}_n \le p_n$. Assume without any loss of generality that $\theta_i \ne 0$ for $i = 1, \ldots, \tilde{p}_n$, while $\theta_i = 0$ for $i = \tilde{p}_n + 1, \ldots, n$. We split up the expectation $E_\theta \|T_\tau(X) - \theta\|^2$ into the two corresponding parts:
$$\sum_{i=1}^{n} E_{\theta_i}\big(T_\tau(X_i) - \theta_i\big)^2 \;=\; \sum_{i=1}^{\tilde{p}_n} E_{\theta_i}\big(T_\tau(X_i) - \theta_i\big)^2 \;+\; \sum_{i=\tilde{p}_n+1}^{n} E_{\theta_i}\big(T_\tau(X_i) - \theta_i\big)^2.$$
We will now show that these two terms can be bounded by $\tilde{p}_n\big(1 + 2\log(1/\tau^{2a})\big)$ and $(n - \tilde{p}_n)\, \tau^{2a}\sqrt{\log(1/\tau^{2a})}$ respectively, up to multiplicative constants, for any choice of $\tau \in (0, 1)$.

Non-zero $\theta_i$'s: Let $\zeta_\tau = \sqrt{2\log(1/\tau^{2a})}$. Then
$$E_{\theta_i}\big(T_\tau(X_i) - \theta_i\big)^2 = E_{\theta_i}\big(T_\tau(X_i) - X_i + X_i - \theta_i\big)^2 \;\le\; 2\,E_{\theta_i}\big(T_\tau(X_i) - X_i\big)^2 + 2\,E_{\theta_i}\big(X_i - \theta_i\big)^2 \;\le\; 2 + 2\sup_x |T_\tau(x) - x|^2.$$


Using Lemma 4.3, given any $c > 1$, one can obtain a non-negative real-valued function $h(x, \tau)$, depending on $c$, which satisfies the following:
$$\lim_{\tau \to 0}\; \sup_{|x| > \sqrt{\eta \log(1/\tau^{2a})}} h(x, \tau) = 0 \quad \text{for all } \eta > c. \qquad (4.9)$$
Claim: As $\tau \to 0$,
$$\arg\max_x | T_\tau(x) - x | \;\to\; \infty. \qquad (4.10)$$
Proof of Claim: Let $x_0(\tau) = \arg\max_x | T_\tau(x) - x |$. Using the observation in Remark 4.1, it can easily be established that $|x_0(\tau)| < \infty$. On the contrary, let us now assume the Claim to be false. Then, for some $c > 0$, $|x_0(\tau)| \le c$ infinitely often. Let us fix any $x \in \mathbb{R} \setminus \{0\}$, any $c > 1$ and any $\eta > c$. Then, by definition, we have
$$|x|\, E(\kappa \mid x, \tau) = |T_\tau(x) - x| \;\le\; |T_\tau(x_0(\tau)) - x_0(\tau)| \;\le\; \sup_{|x| > \sqrt{\eta \log(1/\tau^{2a})}} h(x, \tau),$$
which would be a contradiction because, as $\tau \to 0$, $\sup_{|x| > \sqrt{\eta \log(1/\tau^{2a})}} h(x, \tau) \to 0$ along a subsequence (using (4.9)), whereas $|x|\, E(\kappa \mid x, \tau) \to |x|$ as $\tau \to 0$, which follows as an immediate consequence of Corollary 4.1.

Equation (4.10), together with the fact that $| T_\tau(x) | \le | x |$, immediately leads to the following:
$$\sup_x | T_\tau(x) - x |^2 \;\lesssim\; \zeta_\tau^2, \qquad (4.11)$$
whence it follows that
$$E_{\theta_i}\big(T_\tau(X_i) - \theta_i\big)^2 \;\lesssim\; 1 + \zeta_\tau^2. \qquad (4.12)$$

Parameters equal to zero: We split up the term for the zero means into two parts:
$$E_0 T_\tau(X)^2 = E_0\big[T_\tau(X)^2\, 1\{|X| \le \zeta_\tau\}\big] + E_0\big[T_\tau(X)^2\, 1\{|X| > \zeta_\tau\}\big], \qquad (4.13)$$
where $\zeta_\tau = \sqrt{2\log(1/\tau^{2a})}$. Next, using Lemma 4.1, the first term, $E_0\big[T_\tau(X)^2\, 1\{|X| \le \zeta_\tau\}\big]$, can be bounded; this yields (4.14).


For the second term we have:
$$E_0\big[T_\tau(X)^2\, 1\{|X| > \zeta_\tau\}\big] \;\le\; 2\int_{\zeta_\tau}^{\infty} x^2 \varphi(x)\, dx = 2\big[\zeta_\tau \varphi(\zeta_\tau) + \big(1 - \Phi(\zeta_\tau)\big)\big] \;\le\; 2\zeta_\tau \varphi(\zeta_\tau) + \frac{2\varphi(\zeta_\tau)}{\zeta_\tau} = \sqrt{\tfrac{2}{\pi}}\, \zeta_\tau\, \tau^{2a}\, \big(1 + o(1)\big) \text{ as } \tau \to 0 \;\lesssim\; \zeta_\tau\, \tau^{2a}. \qquad (4.15)$$

Combining equations (4.13), (4.14) and (4.15), it follows that, for all sufficiently small $\tau$,
$$\sum_{i=1}^{n} E_{\theta_i}\big(T_\tau(X_i) - \theta_i\big)^2 \;\lesssim\; \tilde{p}_n\Big(1 + 2\log\tfrac{1}{\tau^{2a}}\Big) + (n - \tilde{p}_n)\, \tau^{2a}\sqrt{\log\tfrac{1}{\tau^{2a}}}. \qquad (4.16)$$

Putting $\tau = (p_n/n)^{\alpha}$ and taking the supremum over all $\theta \in l_0[p_n]$ on both sides of (4.16), we obtain
$$\sup_{\theta \in l_0[p_n]} \sum_{i=1}^{n} E_{\theta_i}\big(T_\tau(X_i) - \theta_i\big)^2 \;\lesssim\; p_n + p_n\log\frac{n}{p_n} + (n - p_n)\Big(\frac{p_n}{n}\Big)^{2a\alpha}\sqrt{\log\frac{n}{p_n}},$$
which, for $\alpha \ge 1$ and $\frac12 \le a < 1$, will be at most of the order $p_n\log(n/p_n)$ if $p_n = o(n)$, because $\tilde{p}_n \le p_n$. Since the minimax quadratic risk (2.3) for this problem is always smaller than $\sup_{\theta \in l_0[p_n]} \sum_{i=1}^{n} E_{\theta_i}\big(T_\tau(X_i) - \theta_i\big)^2$, the stated result follows immediately.

Lemma 4.4. The posterior variance arising out of the general class of shrinkage priors (2.5) can be represented by the following identity:
$$\mathrm{Var}(\theta \mid x, \tau) = \frac{T_\tau(x)}{x} - \big(T_\tau(x) - x\big)^2 + x^2\, \frac{\int_0^\infty (1 + t\tau^2)^{-5/2}\, t^{-a-1} L(t)\, e^{-\frac{x^2}{2(1+t\tau^2)}}\, dt}{\int_0^\infty (1 + t\tau^2)^{-1/2}\, t^{-a-1} L(t)\, e^{-\frac{x^2}{2(1+t\tau^2)}}\, dt},$$
which can be bounded from above by
1. $\mathrm{Var}(\theta \mid x, \tau) \le 1 + x^2$;
2. $\mathrm{Var}(\theta \mid x, \tau) \le \big(\tfrac{1}{x} + x\big)\, T_\tau(x) - T_\tau(x)^2$.

Proof. By the law of iterated variance it follows that
$$\begin{aligned}
\mathrm{Var}(\theta \mid x, \tau) &= E\big[\mathrm{Var}(\theta \mid x, \lambda, \tau)\big] + \mathrm{Var}\big[E(\theta \mid x, \lambda, \tau)\big] \\
&= E\big[(1-\kappa) \mid x, \tau\big] + \mathrm{Var}\big[x(1-\kappa) \mid x, \tau\big] \\
&= E\big[(1-\kappa) \mid x, \tau\big] + x^2\, \mathrm{Var}\big[\kappa \mid x, \tau\big] \\
&= E\big[(1-\kappa) \mid x, \tau\big] + x^2\, E\big[\kappa^2 \mid x, \tau\big] - x^2\, \big(E[\kappa \mid x, \tau]\big)^2 \\
&= \frac{T_\tau(x)}{x} - \big(T_\tau(x) - x\big)^2 + x^2\, \frac{\int_0^\infty (1 + t\tau^2)^{-5/2}\, t^{-a-1} L(t)\, e^{-\frac{x^2}{2(1+t\tau^2)}}\, dt}{\int_0^\infty (1 + t\tau^2)^{-1/2}\, t^{-a-1} L(t)\, e^{-\frac{x^2}{2(1+t\tau^2)}}\, dt},
\end{aligned} \qquad (4.17)$$
which can equivalently be represented by the following identity as well:
$$\begin{aligned}
\mathrm{Var}(\theta \mid x, \tau) &= E\big[(1-\kappa) \mid x, \tau\big] + x^2\, E\big[(1-\kappa)^2 \mid x, \tau\big] - x^2\, \big(E[1-\kappa \mid x, \tau]\big)^2 \\
&= \frac{T_\tau(x)}{x} - T_\tau^2(x) + x^2\, \frac{\int_0^\infty \frac{(t\tau^2)^2}{(1 + t\tau^2)^{5/2}}\, t^{-a-1} L(t)\, e^{-\frac{x^2}{2(1+t\tau^2)}}\, dt}{\int_0^\infty (1 + t\tau^2)^{-1/2}\, t^{-a-1} L(t)\, e^{-\frac{x^2}{2(1+t\tau^2)}}\, dt}.
\end{aligned} \qquad (4.18)$$
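As a numerical cross-check of the identity (our own sketch for the horseshoe member $a = 1/2$, $L(t) = t/(1+t)$, with assumed values of $x$ and $\tau$), the code below evaluates $\mathrm{Var}(\theta \mid x, \tau)$ once through the law-of-total-variance representation $E[(1-\kappa)\mid x] + x^2\,\mathrm{Var}(\kappa \mid x)$ used in the proof, and once by direct Monte Carlo from $\theta \mid x, \kappa \sim N((1-\kappa)x,\, 1-\kappa)$; the two values should agree up to simulation error.

```python
import numpy as np
from scipy.integrate import quad

a, tau, x = 0.5, 0.05, 3.0             # assumed illustrative values

def dens(k):                            # unnormalized posterior density of kappa given x (horseshoe case)
    lam2 = (1.0 - k) / (k * tau**2)
    return k**(a - 0.5) * (1.0 - k)**(-a - 1.0) * (lam2 / (1.0 + lam2)) * np.exp(-k * x**2 / 2.0)

Z = quad(dens, 0.0, 1.0, limit=200)[0]
Ek = quad(lambda k: k * dens(k), 0.0, 1.0, limit=200)[0] / Z
Ek2 = quad(lambda k: k**2 * dens(k), 0.0, 1.0, limit=200)[0] / Z
var_formula = (1.0 - Ek) + x**2 * (Ek2 - Ek**2)   # law of total variance, as in the proof

# Direct Monte Carlo: draw kappa from its posterior via an inverse-CDF grid, then theta | kappa.
rng = np.random.default_rng(0)
grid = np.linspace(1e-6, 1.0 - 1e-6, 200_001)
cdf = np.cumsum(dens(grid))
cdf /= cdf[-1]
kappa = np.interp(rng.random(400_000), cdf, grid)
theta = rng.normal((1.0 - kappa) * x, np.sqrt(1.0 - kappa))

print("identity:", round(var_formula, 4), "   Monte Carlo:", round(float(np.var(theta)), 4))
```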

Lemma 4.5. Suppose
$$J(x, \tau) = x^2\, \frac{\int_0^\infty \frac{(t\tau^2)^2}{(1 + t\tau^2)^{5/2}}\, t^{-a-1} L(t)\, e^{-\frac{x^2}{2(1+t\tau^2)}}\, dt}{\int_0^\infty (1 + t\tau^2)^{-1/2}\, t^{-a-1} L(t)\, e^{-\frac{x^2}{2(1+t\tau^2)}}\, dt}.$$
Then, for all sufficiently small $\tau > 0$,
$$\int_{-\infty}^{\infty} J(x, \tau)\, \varphi(x)\, dx \;\lesssim\; \tau^{2a},$$
where the function $L$ in the above expression is as defined in (2.5) and satisfies Assumption 2.1, with $\frac12 \le a < 1$, $A_0 > 0$ is such that $L$ is bounded over every compact subset of $[A_0, \infty)$, and
$$\lim_{z \to \infty} \frac{\int_{A_0}^{z} t^{-a} L(t)\, dt}{z^{1-a} L(z)} = \frac{1}{1-a} \quad \text{for } 0 < a < 1.$$


    Proof of Theorem 3.2

Proof. Suppose that $X \sim N_n(\theta, I_n)$, $\theta \in l_0[p_n]$ and $\tilde{p}_n = \#\{i : \theta_i \ne 0\}$. Note that $\tilde{p}_n \le p_n$. Assume without any loss of generality that $\theta_i \ne 0$ for $i = 1, \ldots, \tilde{p}_n$, while $\theta_i = 0$ for $i = \tilde{p}_n + 1, \ldots, n$. Let $\zeta_\tau = \sqrt{2\log(1/\tau^{2a})}$.

Non-zero means:
By applying the same reasoning as in Lemma 4.3 to the final term of $\mathrm{Var}(\theta_i \mid x)$ in (4.17), there exists a non-negative real-valued function $h(x, \tau)$ such that $\mathrm{Var}(\theta_i \mid x) \le h(x, \tau)$, where $h(x, \tau) \to 1$ as $x \to \infty$ for any fixed $\tau \in (0, 1)$. As $\tau \to 0$, the function $h(x, \tau)$ satisfies the following for any $c > 1$:
$$\lim_{\tau \to 0}\; \sup_{|x| > \sqrt{\eta \log(1/\tau^{2a})}} h(x, \tau) = 1 \quad \text{for all } \eta > c.$$
Hence $\mathrm{Var}(\theta_i \mid x) \lesssim 1$ for any $x > \zeta_\tau$ as $\tau \to 0$. Now suppose $x \le \zeta_\tau$. Then, by the bound $\mathrm{Var}(\theta_i \mid x) \le 1 + x^2$ from Lemma 4.4, we find
$$\mathrm{Var}(\theta_i \mid x) \le 1 + \zeta_\tau^2. \qquad (4.23)$$
Therefore,
$$\sum_{i=1}^{\tilde{p}_n} E_{\theta_i}\, \mathrm{Var}(\theta_i \mid X_i) \;\lesssim\; \tilde{p}_n\,(1 + \zeta_\tau^2). \qquad (4.24)$$

Zero means:
By the bound $\mathrm{Var}(\theta \mid x) \le 1 + x^2$, we find, for any $\zeta_\tau \ge 1$,
$$E_0\big[\mathrm{Var}(\theta \mid X)\, 1\{|X| > \zeta_\tau\}\big] \;\le\; 2\int_{\zeta_\tau}^{\infty} (1 + x^2)\, \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx \;\lesssim\; \zeta_\tau\, \tau^{2a} + \tau^{2a}. \qquad (4.25)$$
When $|x| < \zeta_\tau$, we consider the upper bound $\mathrm{Var}(\theta \mid x) \le T_\tau(x)/x + J(x, \tau)$ from Lemma 4.4, where the term $J(x, \tau)$ denotes simply the third term on the right-hand side of (4.18). Again using the upper bound obtained in Lemma 4.1, it follows that
$$\int_{-\zeta_\tau}^{\zeta_\tau} \frac{T_\tau(x)}{x}\, \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx \;\lesssim\; \tau^{2a}. \qquad (4.26)$$
Therefore, using equations (4.25) and (4.26) and Lemma 4.5, it follows that
$$\sum_{i=\tilde{p}_n+1}^{n} E_0\, \mathrm{Var}(\theta_i \mid X_i) \;\lesssim\; (n - \tilde{p}_n)\,(\zeta_\tau + 1)\, \tau^{2a}. \qquad (4.27)$$
From equations (4.24) and (4.27) it finally follows that
$$E_\theta \sum_{i=1}^{n} \mathrm{Var}(\theta_i \mid X_i) \;\lesssim\; \tilde{p}_n\,(1 + \zeta_\tau^2) + (n - \tilde{p}_n)\,(\zeta_\tau + 1)\, \tau^{2a}.$$

Putting $\tau = (p_n/n)^{\alpha}$, we obtain
$$E_\theta \sum_{i=1}^{n} \mathrm{Var}(\theta_i \mid X_i) \;\lesssim\; p_n\Big(1 + \log\frac{n}{p_n}\Big) + (n - p_n)\Big(\frac{p_n}{n}\Big)^{2a\alpha}\Big(\sqrt{\log\frac{n}{p_n}} + 1\Big),$$
which will be at most of the order $p_n\log(n/p_n)$ as $n \to \infty$, for $\frac12 \le a < 1$ and $\alpha \ge 1$, if $p_n = o(n)$.


    4.2 Appendix B: Some Properties of Slowly Varying Functions

Lemma 4.7. If $L$ is any slowly varying function, then there exists $A_0 > 0$ such that $L$ is locally bounded on $[A_0, \infty)$, that is, $L$ is bounded on all compact subsets of $[A_0, \infty)$.

Proof. See Lemma 1.3.2 and the subsequent discussion in Bingham et al (1987).

Lemma 4.8. If $L$ is any slowly varying function, $A_0$ is so large that $L$ is locally bounded on $[A_0, \infty)$, and $\beta > -1$, then
$$\frac{\int_{A_0}^{x} t^{\beta} L(t)\, dt}{x^{\beta+1} L(x)} \;\to\; \frac{1}{1 + \beta} \quad \text{as } x \to \infty.$$

Proof. See Proposition 1.5.8 of Bingham et al (1987).

Lemma 4.9. If $L$ is any slowly varying function and $\beta < -1$, then
$$\frac{\int_{x}^{\infty} t^{\beta} L(t)\, dt}{x^{\beta+1} L(x)} \;\to\; -\frac{1}{\beta + 1} \quad \text{as } x \to \infty.$$

Proof. See Proposition 1.5.10 of Bingham et al (1987).
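A quick numerical illustration of Lemma 4.8 (our own, with $L(t) = \log t$ as the slowly varying function and $\beta = -1/2$, $A_0 = 2$ as assumptions): the ratio below creeps up towards the limit $1/(1+\beta) = 2$ as $x$ grows, though the convergence is slow, as is typical with slowly varying functions.

```python
import numpy as np
from scipy.integrate import quad

L = np.log                    # a slowly varying function
beta, A0 = -0.5, 2.0          # beta > -1 and a point beyond which L is positive and locally bounded

for x in [1e2, 1e4, 1e6, 1e8]:
    integral = quad(lambda t: t**beta * L(t), A0, x, limit=1000)[0]
    ratio = integral / (x**(beta + 1.0) * L(x))
    print(f"x = {x:.0e}   ratio = {ratio:.3f}   (limit = {1.0 / (1.0 + beta):.1f})")
```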

    5 Discussion

We studied in this paper various theoretical properties of a general class of heavy-tailed shrinkage priors in terms of the quadratic minimax risk for estimating a multivariate normal mean vector which is known to be sparse in the sense of being nearly black. It is shown that Bayes estimators arising out of this general class asymptotically attain the minimax risk in the $\ell_2$ norm, possibly up to some multiplicative constants. An optimal rate of posterior contraction of these prior distributions in terms of the corresponding quadratic minimax rate has also been established. We provided a unifying theoretical treatment, exploiting properties of slowly varying functions, that holds for a very broad class of shrinkage priors, including some well-known prior distributions such as the horseshoe prior, the normal-exponential-gamma priors, the three parameter beta normal priors, the generalized double Pareto priors, the inverse gamma priors and many others. Another major contribution of this work is to show that shrinkage priors which are heavy-tailed and have sufficient mass around the origin are good enough to attain the minimax optimal rate of contraction, and that one does not require a pole at the origin, provided the global tuning parameter is carefully chosen, a question that was posed in van der Pas et al (2014). We observed (though this is not reported in this paper) that when the number of non-zero means is unknown, the Bayes estimators based on this general class of one-group priors, combined with the empirical Bayes estimate of the global shrinkage parameter as suggested in van der Pas et al (2014), still attain the minimax risk up to a multiplicative constant in the $\ell_2$ norm. In this sense, our work can be considered a full extension of the posterior concentration properties obtained for the horseshoe prior by van der Pas et al (2014) to a very large class of global-local scale mixtures of normals. Moreover, the optimal range $[\tfrac12, 1)$ of the hyperparameter $a$ used in the definition of this general class is in accordance with that obtained in the context of multiple testing considered in Ghosh et al (2014). Therefore, the results obtained in this paper can be thought of as another formal theoretical justification for the use of such priors. However, a more interesting question would be to investigate whether the posterior contraction properties of such prior distributions still hold when a hyperprior is placed over the global shrinkage parameter. We hope to address this problem elsewhere in the future.

Acknowledgement: The authors would like to thank Professor Jayanta Kumar Ghosh for making them aware of the recent work of van der Pas et al (2014) on the posterior contraction properties of the horseshoe prior.


    References

Armagan, A., Dunson, D. B., and Clyde, M. (2011). Generalized Beta Mixtures of Gaussians. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F. C. N., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 24, 523-531.

Armagan, A., Dunson, D. B., and Lee, J. (2012). Generalized double Pareto shrinkage. Statistica Sinica, 23(1): 119-143.

Armagan, A., Dunson, D. B., Lee, J., Bajwa, W. U. and Strawn, N. (2013). Posterior consistency in linear models under shrinkage priors. Biometrika, 100(4): 1011-1018.

Bhattacharya, A., Pati, D., Pillai, N., and Dunson, D. (2014). Dirichlet-Laplace priors for optimal shrinkage. arXiv preprint arXiv:1401.5398v1.

Bingham, N. H., Goldie, C. M., and Teugels, J. L. (1987). Regular Variation. In Rota, G.-C. (ed.), Encyclopedia of Mathematics and its Applications, vol. 27. Cambridge University Press, Cambridge, Great Britain.

Carvalho, C., Polson, N., and Scott, J. (2009). Handling sparsity via the horseshoe. Journal of Machine Learning Research W&CP, 5: 73-80.

Carvalho, C., Polson, N., and Scott, J. (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2): 465-480.

Castillo, I., Schmidt-Hieber, J., and van der Vaart, A. (2014). Bayesian linear regression with sparse priors. arXiv:1403.0735.

Castillo, I. and van der Vaart, A. (2012). Needles and straw in a haystack: Posterior concentration for possibly sparse sequences. The Annals of Statistics, 40(4): 2069-2101.

Datta, J. and Ghosh, J. K. (2013). Asymptotic properties of Bayes risk for the horseshoe prior. Bayesian Analysis, 8(1): 111-132.

Donoho, D. L., Johnstone, I. M., Hoch, J. C. and Stern, A. S. (1992). Maximum entropy and the nearly black object (with discussion). Journal of the Royal Statistical Society, Series B (Methodological), 54(1): 41-81.

Efron, B. (2004). Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. Journal of the American Statistical Association, 99(465): 96-104.

Efron, B. (2008). Microarrays, empirical Bayes and the two-groups model. Statistical Science, 23(1): 1-22.

Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis, 1(3): 515-533.

Ghosal, S., Ghosh, J. K., and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. The Annals of Statistics, 28(2): 500-531.

Ghosh, P., Tang, X., Ghosh, M. and Chakrabarti, A. (2014). Asymptotic properties of Bayes risk for a general class of shrinkage priors in multiple hypothesis testing under sparsity. http://arxiv.org/pdf/1310.7462.pdf (Submitted).

Griffin, J. E. and Brown, P. J. (2005). Alternative prior distributions for variable selection with very many more variables than observations. Technical report, University of Warwick.

Griffin, J. E. and Brown, P. J. (2010). Inference with normal-gamma prior distributions in regression problems. Bayesian Analysis, 5(1): 171-188.

Griffin, J. E. and Brown, P. J. (2012). Structuring shrinkage: some correlated priors for regression. Biometrika, 99(2): 481-487.

Griffin, J. E. and Brown, P. J. (2013). Some priors for sparse regression modeling. Bayesian Analysis, 8(3): 691-702.

Hans, C. (2009). Bayesian lasso regression. Biometrika, 96(4): 835-845.

Jiang, W. and Zhang, C. H. (2009). General maximum likelihood empirical Bayes estimation of normal means. The Annals of Statistics, 37: 1647-1684.

Johnstone, I. and Silverman, B. W. (2004). Needles and straw in haystacks: Empirical Bayes estimates of possibly sparse sequences. The Annals of Statistics, 32(4): 1594-1649.

Mitchell, T. and Beauchamp, J. (1988). Bayesian variable selection in linear regression (with discussion). Journal of the American Statistical Association, 83(404): 1023-1036.

Park, T. and Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association, 103(482): 681-686.

van der Pas, S. L., Kleijn, B. J. K. and van der Vaart, A. W. (2014). The horseshoe estimator: Posterior concentration around nearly black vectors.

Polson, N. G. and Scott, J. G. (2010). Large scale simultaneous testing with hypergeometric inverted beta priors. Technical report.

Polson, N. G. and Scott, J. G. (2011). Shrink globally, act locally: Sparse Bayesian regularization and prediction. In Bernardo, J. M., Bayarri, M. J., Berger, J. O., Dawid, A. P., Heckerman, D., Smith, A. F. M., and West, M. (eds.), Bayesian Statistics 9, Proceedings of the 9th Valencia International Meeting, 501-538. Oxford University Press.

Polson, N. G. and Scott, J. G. (2012). On the half-Cauchy prior for a global scale parameter. Bayesian Analysis, 7(2): 1-16.

Scott, J. G. (2011). Bayesian estimation of intensity surfaces on the sphere via needlet shrinkage and selection. Bayesian Analysis, 6(2): 307-327.

Scott, J. and Berger, J. O. (2006). An exploration of aspects of Bayesian multiple testing. Journal of Statistical Planning and Inference, 136(7): 2144-2162.

Scott, J. and Berger, J. O. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics, 38(5): 2587-2619.

Tipping, M. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1: 211-244.

Yuan, M. and Lin, Y. (2005). Efficient empirical Bayes variable selection and estimation in linear models. Journal of the American Statistical Association, 100: 1215-1225.
