
International Journal of Forecasting 26 (2010) 231–247    www.elsevier.com/locate/ijforecast

Bayesian forecasting of Value at Risk and Expected Shortfall using adaptive importance sampling

Lennart Hoogerheide a,b,∗, Herman K. van Dijk a,b

a Econometric Institute, Erasmus School of Economics, Erasmus University Rotterdam, Burgemeester Oudlaan 50, Rotterdam, The Netherlands

b Tinbergen Institute, Erasmus School of Economics, Erasmus University Rotterdam, Burgemeester Oudlaan 50, Rotterdam, The Netherlands

Abstract

An efficient and accurate approach is proposed for forecasting the Value at Risk (VaR) and Expected Shortfall (ES) measures in a Bayesian framework. This consists of a new adaptive importance sampling method for the Quick Evaluation of Risk using Mixture of t approximations (QERMit). As a first step, the optimal importance density is approximated, after which multi-step 'high loss' scenarios are efficiently generated. Numerical standard errors are compared in simple illustrations and in an empirical GARCH model with Student-t errors for daily S&P 500 returns. The results indicate that the proposed QERMit approach outperforms alternative approaches, in the sense that it produces more accurate VaR and ES estimates given the same amount of computing time, or, equivalently, that it requires less computing time for the same numerical accuracy.
© 2010 International Institute of Forecasters. Published by Elsevier B.V. All rights reserved.

Keywords: Value at Risk; Expected Shortfall; Numerical standard error; Importance sampling; Mixture of Student-t distributions; Variance reduction technique


1. Introduction

The issue that is considered in this paper is the efficient computation of accurate estimates of two risk measures, Value at Risk (VaR) and Expected Shortfall (ES), using simulations, given a chosen model. There are several reasons why it is important to compute

∗ Corresponding author at: Econometric Institute, Erasmus School of Economics, Erasmus University Rotterdam, Burgemeester Oudlaan 50, Rotterdam, The Netherlands.

E-mail address: [email protected] (L. Hoogerheide).


accurate VaR and ES estimates. An underestimation of the risk could obviously cause immense problems for banks and other participants in financial markets (e.g. bankruptcy). On the other hand, an overestimation of the risk may cause one to allocate too much capital as a cushion for risk exposure, which may have a negative effect on profits. Therefore, precise estimates of risk measures are obviously desirable. For simulation-based estimates of VaR and ES, there are also several other issues that play a role. For 'backtesting' or model choice, it is important that this model



choice be based on the quality of the model, rather than the 'quality' of the simulation run. For example, one should not choose a model merely because the simulation noise stemming from pseudo-random draws caused its historical VaR or ES estimates to be preferable. Next, risk measures should stay approximately constant when the actual risk level stays about constant over time. If changes in risk measures over time are merely caused by simulation noise, this leads to useless fluctuations in positions, leading to extra costs (e.g. transaction costs). Also, when choosing between different risky investment strategies based on a risk-return tradeoff, it is important that the computed risk measures be accurate. Decision-making on portfolios should not be misled by simulation noise. Moreover, the total volume of invested capital may obviously be huge, so that small percentage differences may correspond to huge amounts of money.

A typical disadvantage of computing simulation-based VaR and ES estimates with high precision is that this requires a huge amount of computing time. In practice, such computing times are often too long for 'real time' decision-making. Then, one typically faces the choice between a lower numerical accuracy — using a smaller number of draws or an approximating method — or a lower 'modeling accuracy' — using an alternative, computationally easier, and typically less realistic model. In this paper we propose a simulation method that requires less computing time to reach a certain numerical accuracy, so that the latter choice between suboptimal alternatives may not be necessary.

The approaches for computing VaR and ES estimates can be divided into three groups (as indicated by McNeil & Frey, 2000, p. 272): non-parametric historical simulation, fully parametric methods based on an econometric model with explicit assumptions on the volatility dynamics and conditional distribution, and methods based on extreme value theory. In this paper we focus on the second group, although some of the ideas could be useful in simulation-based approaches of the third group. We compute VaR and ES in a Bayesian framework: we consider the Bayesian predictive density. A specific focus is on the 99% quantile of a loss distribution for a 10-day-ahead horizon. This particular VaR measure is accepted by the Basel Committee on Banking Supervision at the Bank for International Settlements (Basel Committee on Banking Supervision, 1995). The issues of model choice and

'backtesting' the VaR model or ES model are not addressed directly. However, as was mentioned before, the numerical accuracy of the estimates can be indirectly important in the model choice or 'backtesting' procedure, because simulation noise may misdirect the model selection process.

The contributions of this paper are as follows. First, we consider the numerical standard errors of VaR and ES estimates. Since VaR and ES are not simply unconditional expectations of (a function of) a random variable, the numerical standard errors do not fit directly within the importance sampling estimator's numerical standard error formula from Geweke (1989). We consider the optimal importance sampling density that maximizes the numerical accuracy for a given number of draws, as derived by Geweke (1989), for the case of VaR estimation. Second, we propose a particular 'hybrid' mixture density that provides an approximation to this optimal importance density. The proposed importance density is also useful, perhaps even more so, as an importance density for ES estimation. This 'hybrid' mixture approximation makes use of two mixtures of Student-t distributions, as well as the distribution of future asset prices (or returns) for given parameter values and historical asset prices (or returns). It is flexible, so that it can provide useful approximations in a wide range of situations, and is easy to simulate from. The main contribution of this paper is the development of an iterative approach for constructing this 'hybrid' mixture approximation. Just like the traditional approach discussed below, it is 'automatic' in the sense that it only requires a posterior density kernel — rather than the exact posterior density — and the distribution of future prices/returns given the parameters and historical prices/returns. We name the proposed two-step method, which involves first constructing an approximation to the optimal importance density and subsequently using this for importance sampling estimation of the VaR or ES, the Quick Evaluation of Risk using Mixture of t approximations (QERMit). The QERMit procedure makes use of the Adaptive Mixture of t (AdMit) approach (see Hoogerheide, Kaashoek, & Van Dijk, 2007), which constructs an approximating mixture of Student-t distributions, given only a kernel of a target density. Hoogerheide et al. (2007) apply the AdMit approach in order to approximate and simulate from a non-elliptical posterior of


the parameters in an Instrumental Variable (IV) regression model. In this paper we consider the joint distribution of parameters and future returns instead of merely the parameters. Moreover, our goal is not to approximate this distribution of parameters and future returns but to approximate the optimal importance density, in which 'high loss' scenarios are generated more often, and subsequently 'corrected' by giving these lower importance weights. Hence, the AdMit approach is merely one of the ingredients of the proposed QERMit approach.

The outline of the paper is as follows. In Section 2 we discuss the computation of numerical standard errors for VaR and ES estimates. Furthermore, we consider the optimal importance sampling density (due to Geweke, 1989), which minimizes the numerical standard error (given a certain number of draws) for the case of VaR estimation. In Section 3, we briefly reconsider the AdMit approach (Hoogerheide et al., 2007). Section 4 describes the proposed QERMit method. In Section 5 we illustrate the possible usefulness of the QERMit approach in an empirical example of estimating the 99% VaR and ES in a GARCH model with Student-t innovations for S&P 500 log-returns. Finally, Section 6 concludes.

2. Bayesian estimation of Value at Risk and Expected Shortfall using importance sampling

In the literature, the VaR is referred to in several different ways. The quoted VaR is either a percentage or an amount of money, referring to either a future portfolio value or a future portfolio value in deviation from its expected value or current value. In this paper we refer to the 100α% VaR as the 100(1 − α)% quantile of the percentage return's distribution, and to ES as the expected percentage return given that the loss exceeds the 100α% quantile. With these definitions, VaR and ES are typically values between −100% and 0%.¹

The VaR is a risk measure with several advantages: it is relatively easy to estimate and it is easy to explain to non-experts. The specific VaR measure of the

1 For certain derivatives, e.g. options or futures, it may not be natural or even possible to quote profit or loss as a certain percentage. The quality of our proposed method is not affected by which particular VaR or ES definition is used.

99% quantile for a horizon of two weeks (10 trading days) is acceptable to the Basel Committee on Banking Supervision at the Bank for International Settlements (Basel Committee on Banking Supervision, 1995). This is motivated by the fear of a liquidity crisis in which a financial institution might not be able to liquidate its holdings for a two-week period. Even though the VaR has become a standard tool in financial risk management, it has several disadvantages. First, the VaR does not give any information about the potential size of losses that exceed the VaR level. This may lead to overly risky investment strategies that optimize the expected profit under the restriction that the VaR is not beyond a certain threshold, since the potential 'rare event' losses exceeding the VaR may be extreme. Second, the VaR is not a coherent measure, as was indicated by Artzner, Delbaen, Eber, and Heath (1999); that is, it lacks the property of sub-additivity. The ES has clear advantages over the VaR: it does say something about losses exceeding the VaR level, and it is a sub-additive, coherent measure. This property of sub-additivity means that the ES of a portfolio (firm) cannot exceed the sum of the ES measures of its sub-portfolios (departments). Adding these individual ES measures yields a conservative risk measure for the whole portfolio (firm). Because of these advantages of ES over VaR, we consider ES in this paper as well as VaR. For a concise and clear discussion of the VaR and ES measures, refer to Ardia (2008).

As was mentioned in the introduction, there are several approaches to computing VaR and ES estimates. In this paper we use a Bayesian approach in an econometric model with explicit assumptions on the volatility dynamics and conditional distribution. We focus on the VaR and ES for long positions, and use the following notation. The m-dimensional vector y_t consists of the returns on (or prices of) m assets at time t. Throughout this paper, m is equal to 1. Applications to portfolios with m ≥ 2 assets are left as a topic for further research. Our data set of T historical returns is y ≡ {y_1, . . . , y_T}. We consider h-step-ahead (h = 1, 2, . . .) forecasting of VaR and ES, where we define the vector of future returns y∗ ≡ {y_{T+1}, . . . , y_{T+h}}. The model has a k-dimensional parameter vector θ. Finally, we have a (scalar-valued) profit and loss function PL(y∗) that is positive for profits and negative for losses.


When estimating the h-step-ahead 100α% VaR or ES in a Bayesian framework, one can obviously use the following straightforward approach, which we will refer to as the 'direct approach' of Bayesian VaR/ES estimation:

Step 1. Simulate a set of draws θ^i (i = 1, . . . , n) from the posterior distribution, e.g. using Gibbs sampling (Geman & Geman, 1984) or the Metropolis-Hastings algorithm (Hastings, 1970; Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953).

Step 2. Simulate corresponding future paths y^{*i} ≡ {y^i_{T+1}, . . . , y^i_{T+h}} (i = 1, . . . , n) from the model, given parameter values θ^i and historical values y ≡ {y_1, . . . , y_T}, i.e. from the density p(y∗|θ^i, y).

Step 3. Order the values PL(y^{*i}) in ascending order as PL^{(j)} (j = 1, . . . , n). The VaR and ES are then estimated as

$$\widehat{VaR}_{DA} \equiv PL^{(n(1-\alpha))} \tag{1}$$

and

$$\widehat{ES}_{DA} \equiv \frac{1}{n(1-\alpha)} \sum_{j=1}^{n(1-\alpha)} PL^{(j)}, \tag{2}$$

the (n(1 − α))-th sorted loss and the average of the first n(1 − α) sorted losses, respectively.

For example, one may generate 10,000 profit/loss values, sort these in ascending order, and take the 100th sorted value as the 99% VaR estimate. In order to show intuitively that this 'direct approach' is not optimal, consider the simple example of the standard normal distribution with PL ∼ N(0, 1). If we then estimate the 99% VaR by simulating 10,000 standard normal variates and taking the 100th sorted value, then this estimate is mostly 'based' on only 100 out of 10,000 draws. There is no specific focus on the 'high loss' subspace, the left tail. Roughly speaking, if we are only interested in the VaR or ES, then a large subset of the draws seems to be 'wasted' on a subspace that we are not particularly interested in. An alternative simulation approach that allows one to specifically focus on an important subspace is importance sampling.
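To make the 'direct approach' concrete, here is a minimal Python sketch (not from the paper; the function name and the standard normal profit/loss stand-in are purely illustrative) of the estimators in Eqs. (1) and (2):

```python
import numpy as np

def direct_var_es(pl_draws, alpha=0.99):
    """'Direct approach' estimates of the 100*alpha% VaR and ES (Eqs. (1)-(2)):
    sort the simulated profit/loss values in ascending order and take the
    n(1-alpha)-th sorted loss and the average of the first n(1-alpha) sorted losses."""
    pl_sorted = np.sort(pl_draws)               # ascending: largest losses first
    k = int(len(pl_sorted) * (1.0 - alpha))     # e.g. 100 for n = 10,000 and alpha = 0.99
    var_da = pl_sorted[k - 1]                   # Eq. (1): the n(1-alpha)-th sorted value
    es_da = pl_sorted[:k].mean()                # Eq. (2): average of the k largest losses
    return var_da, es_da

# Illustration with the standard normal profit/loss example from the text.
rng = np.random.default_rng(0)
pl = rng.standard_normal(10_000)                # stand-in for simulated PL(y*) values
print(direct_var_es(pl))                        # roughly (-2.33, -2.67)
```

With 10,000 draws this reproduces the situation described above: the VaR estimate rests on the 100th sorted value only, and the ES estimate on the average of the 100 largest losses.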

In importance sampling (IS), due to Hammersley and Handscomb (1964) and introduced to Bayesian econometrics by Kloek and Van Dijk (1978), the expectation E[g(X)] of a certain function g(.) of the random variable X ∈ R^r is estimated as

$$\widehat{E[g(X)]}_{IS} = \frac{\frac{1}{n}\sum_{i=1}^{n} w(X_i)\, g(X_i)}{\frac{1}{n}\sum_{j=1}^{n} w(X_j)} = \frac{\sum_{i=1}^{n} w(X_i)\, g(X_i)}{\sum_{j=1}^{n} w(X_j)}, \tag{3}$$

where X_1, . . . , X_n are independent realizations from the candidate distribution with density (= importance function) q(x), and w(X_1), . . . , w(X_n) are the corresponding weights w(X) = p(X)/q(X), where we only know a kernel p(x) of the target density p∗(x) of X: p(x) ∝ p∗(x). The consistency of the IS estimator in Eq. (3) is easily seen from

$$E[g(X)] = \frac{\int g(x)\, p(x)\, \mathrm{d}x}{\int p(x)\, \mathrm{d}x} = \frac{\int g(x)\, w(x)\, q(x)\, \mathrm{d}x}{\int w(x)\, q(x)\, \mathrm{d}x} = \frac{E[w(X)\, g(X)]}{E[w(X)]}. \tag{4}$$

The IS estimator $\widehat{VaR}_{IS}$ of the 100α% VaR is computed by solving $\widehat{E[g(X)]}_{IS} = 1 - \alpha$ with indicator function $g(X) = I\{PL(X) \le \widehat{VaR}_{IS}\}$, since Pr[PL(X) ≤ c] = E[I{PL(X) ≤ c}].² This amounts to sorting the profit/loss values of the candidate draws PL(X_i) (i = 1, . . . , n) in ascending order as PL(X^{(j)}) (j = 1, . . . , n), and finding the value PL(X^{(k)}) such that S_k = 1 − α, where

$$S_k \equiv \sum_{j=1}^{k} \bar{w}(X^{(j)})$$

is the cumulative sum of the scaled weights

$$\bar{w}(X^{(j)}) \equiv \frac{w(X^{(j)})}{\sum_{i=1}^{n} w(X^{(i)})}$$

(scaled to add to 1) corresponding to the ascending profit/loss values. In general there will be no k such that S_k = 1 − α, so that one interpolates between the values of PL(X^{(k)}) and PL(X^{(k+1)}), where S_{k+1} is the smallest value with S_{k+1} > 1 − α.

2 In our IS approach, X contains the model parameters θ and the future error terms of the process of asset returns (or prices), which together determine the future returns (or prices). This choice will be explained in Section 4.


The IS estimator $\widehat{ES}_{IS}$ of the 100α% ES is subsequently computed as

$$\widehat{ES}_{IS} = \sum_{j=1}^{k} w^{*}(X^{(j)})\, PL(X^{(j)}),$$

the weighted average of the k values PL(X^{(j)}) (j = 1, . . . , k), with weights

$$w^{*}(X^{(j)}) \equiv \frac{w(X^{(j)})}{\sum_{i=1}^{k} w(X^{(i)})}$$

(adding to 1).
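The weighted-quantile and weighted-tail-average computations described above can be sketched in Python as follows; the function assumes independent candidate draws with their (unnormalized) importance weights already available, and the Student-t example at the bottom is an illustration rather than output reported in the paper:

```python
import numpy as np
from scipy import stats

def is_var_es(pl, w, alpha=0.99):
    """IS estimators of the 100*alpha% VaR and ES from candidate draws:
    pl[i] = PL(X_i) and w[i] = p(X_i)/q(X_i) (unnormalized importance weights)."""
    order = np.argsort(pl)                     # ascending profit/loss values
    pl_s, w_s = pl[order], w[order]
    w_bar = w_s / w_s.sum()                    # scaled weights, adding to 1
    S = np.cumsum(w_bar)                       # cumulative sums S_k
    k = np.searchsorted(S, 1.0 - alpha)        # first (0-based) index with S_k >= 1 - alpha
    if k == 0:
        var_is = pl_s[0]
    else:                                      # interpolate between PL(X^(k)) and PL(X^(k+1))
        frac = (1.0 - alpha - S[k - 1]) / (S[k] - S[k - 1])
        var_is = pl_s[k - 1] + frac * (pl_s[k] - pl_s[k - 1])
    w_star = w_s[:k + 1] / w_s[:k + 1].sum()   # tail weights, renormalized to add to 1
    es_is = np.dot(w_star, pl_s[:k + 1])       # weighted average of the tail losses
    return var_is, es_is

# Illustration: Student-t(10) importance density for a N(0,1) profit/loss target.
rng = np.random.default_rng(1)
x = stats.t.rvs(df=10, size=10_000, random_state=rng)
weights = stats.norm.pdf(x) / stats.t.pdf(x, df=10)
print(is_var_es(x, weights))                   # roughly (-2.33, -2.67)
```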

2.1. Numerical standard errors

Geweke (1989) provides formulas for the numerical accuracy of the IS estimator $\widehat{E[g(X)]}_{IS}$ in Eq. (3). See also Hoogerheide, Van Dijk, and Van Oest (2009, chap. 7) for a discussion of the numerical accuracy of $\widehat{E[g(X)]}_{IS}$. It holds for large numbers of draws n and under the mild regularity conditions reported by Geweke (1989) that $\widehat{E[g(X)]}_{IS}$ has approximately the normal distribution $N(E[g(X)], \sigma^2_{IS})$. The accuracy of the estimate $\widehat{E[g(X)]}_{IS}$ for E[g(X)] is reflected by the numerical standard error σ_IS, and the 95% confidence interval for E[g(X)] can be constructed as $(\widehat{E[g(X)]}_{IS} - 1.96\,\sigma_{IS},\ \widehat{E[g(X)]}_{IS} + 1.96\,\sigma_{IS})$.

The numerical standard error (NSE) σ_{IS,VaR} of the IS estimator of the VaR or ES does not follow directly from the NSE for $\widehat{E[g(X)]}_{IS}$, as neither VaR nor ES is an unconditional expectation E[g(X)] of a random variable X of which we know the density kernel.³ For the NSE of the VaR estimator, we make use of the delta rule. We have

$$\Pr[PL(X) \le VaR] \approx \Pr[PL(X) \le \widehat{VaR}\,] + \left. \frac{\partial \Pr[PL(X) \le c]}{\partial c} \right|_{c=\widehat{VaR}} \left( VaR - \widehat{VaR} \right) \;\Rightarrow \tag{5}$$

$$1 - \alpha \approx \widehat{\Pr}[PL(X) \le \widehat{VaR}\,] + \hat{p}_{PL}(\widehat{VaR}\,) \left( VaR - \widehat{VaR} \right) \;\Rightarrow \tag{6}$$

$$\mathrm{var}(\widehat{VaR}\,) \approx \frac{\mathrm{var}\!\left( \widehat{\Pr}[PL(X) \le \widehat{VaR}\,] \right)}{\left( \hat{p}_{PL}(\widehat{VaR}\,) \right)^{2}}, \tag{7}$$

3 Only if we knew the true value of the VaR with certainty would the estimation of the ES reduce to the 'standard' situation of IS estimation of the expectation of PL(X), where X has the target density kernel p_target(x) ∝ p(x) I{PL(x) ≤ VaR}. For an estimated $\widehat{VaR}$ value, the uncertainty on the ES estimator is larger than that. This uncertainty has two sources: (1) the variation of the draws X_i with PL(X_i) ≤ $\widehat{VaR}$ for $\widehat{VaR}$ = VaR; and (2) the variation in $\widehat{VaR}$.

Fig. 1. Illustration of the numerical standard error of the IS estimator for a VaR, a quantile of a profit/loss PL(X) function of a random vector X. The uncertainty on Pr[PL(X) ≤ c] for c = $\widehat{VaR}_{IS}$ (on the vertical axis) is translated to the uncertainty on $\widehat{VaR}_{IS}$ (on the horizontal axis) by a factor 1/p_{PL}(c), the inverse of the density function, which is the steepness of the displayed cumulative distribution function [CDF] of the profit/loss distribution.

where Eq. (6) results from Eq. (5) by substituting estimates for Pr[PL(X) ≤ $\widehat{VaR}$] and p_{PL}($\widehat{VaR}$), where p_{PL}($\widehat{VaR}$) is the density of PL(X) evaluated at $\widehat{VaR}$, and by using Pr[PL(X) ≤ VaR] = 1 − α, since this equality defines the VaR. Substituting the realized value of $\widehat{VaR}_{IS}$ for $\widehat{VaR}$ into Eq. (7) and taking the square root yields the numerical standard error for $\widehat{VaR}_{IS}$:

$$\sigma_{IS,VaR} = \frac{\sigma_{IS,\Pr[PL \le \widehat{VaR}_{IS}]}}{\hat{p}_{PL}(\widehat{VaR}_{IS})}. \tag{8}$$

The numerical standard error $\sigma_{IS,\Pr[PL \le \widehat{VaR}_{IS}]}$ for the IS estimator of the probability Pr[PL(X) ≤ c] for c = $\widehat{VaR}_{IS}$ follows directly from the NSE for $\widehat{E[g(X)]}_{IS}$ of Geweke (1989) with g(x) = I{PL(x) ≤ c}. In general we do not have an explicit formula for the density p_{PL}(c) of PL(X), but this is easily estimated by

$$\hat{p}_{PL}(c) = \frac{\widehat{\Pr}[PL(X) \le c + \varepsilon] - \widehat{\Pr}[PL(X) \le c - \varepsilon]}{2\varepsilon}.$$

One can compute this for several ε values, and use the ε that leads to the smallest estimate $\hat{p}_{PL}(c)$, and hence the largest (conservative) value for σ_{IS,VaR}. Alternatively, one can use a kernel estimator of the profit/loss density at c = $\widehat{VaR}_{IS}$. Fig. 1 provides an illustration of the numerical standard error for an IS estimator of a VaR, or, more generally, a quantile.
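A minimal sketch of Eq. (8) in Python, assuming independent IS draws (for the serially correlated draws of the 'direct approach', a HAC estimator such as that of Andrews (1991) would replace the simple variance estimate); the helper names are illustrative:

```python
import numpy as np

def nse_is_var(pl, w, var_is, eps=0.05):
    """Delta-rule NSE of the IS VaR estimator (Eq. (8)): the NSE of the IS estimate
    of Pr[PL <= VaR_IS], divided by a finite-difference estimate of the profit/loss
    density at VaR_IS."""
    w_bar = w / w.sum()
    g = (pl <= var_is).astype(float)
    p_hat = np.dot(w_bar, g)                                  # IS estimate of Pr[PL <= VaR_IS]
    nse_prob = np.sqrt(np.sum((w_bar * (g - p_hat)) ** 2))    # delta-method NSE of the ratio estimator
    cdf = lambda c: np.dot(w_bar, (pl <= c).astype(float))
    density = (cdf(var_is + eps) - cdf(var_is - eps)) / (2.0 * eps)
    return nse_prob / density
```

Trying several eps values and keeping the smallest density estimate, as suggested above, makes the resulting NSE conservative.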

For the numerical standard error of the ES, we use the fact that if the VaR were known with certainty, we would be in a 'standard' situation of IS estimation of


the expectation of a variable PL(X), where X has the target density kernel p_target(x) ∝ p(x) I{PL(x) ≤ VaR}, for which the NSE σ_{IS,ES|VaR} and the (asymptotically valid) normal density follow directly from Geweke (1989). Since we do have the (asymptotically valid) normal density $N(\widehat{VaR}_{IS}, \hat{\sigma}^2_{IS,VaR})$ of the VaR estimator (as derived above), we can proceed as follows to estimate the density of the ES estimator.

Step 1. Construct a grid of VaR values, e.g. on the interval $[\widehat{VaR}_{IS} - 4\hat{\sigma}_{IS,VaR},\ \widehat{VaR}_{IS} + 4\hat{\sigma}_{IS,VaR}]$.
Step 2. For each VaR value on the grid, evaluate the NSE σ_{IS,ES|VaR} of the ES estimator given the VaR value, and evaluate the (asymptotically valid) normal density p($\widehat{ES}_{IS}$|VaR) of the ES estimator on a grid.
Step 3. Estimate the ES estimator's density p($\widehat{ES}_{IS}$) as the weighted average of the densities p($\widehat{ES}_{IS}$|VaR) in step 2, with weights from the estimated density of the VaR estimator p($\widehat{VaR}_{IS}$).

The numerical standard error of $\widehat{ES}_{IS}$ is now obtained as the standard deviation of the estimated density p($\widehat{ES}_{IS}$). Fig. 2 illustrates the procedure for the estimation of the ES estimator's density in the case of N(0, 1) distributed profit/loss. The left panels show the case of direct sampling, where the density p($\widehat{ES}_{IS}$|VaR) clearly has a higher variance for more negative, more extreme VaR values, resulting in a skewed density p($\widehat{ES}_{IS}$). The reason for this is that for these extreme VaR values the estimate of the ES given the VaR is based on only a few draws. The right panels show the case of IS with a Student-t importance density (with 10 degrees of freedom), where the density p($\widehat{ES}_{IS}$|VaR) has a variance which is hardly any higher for more extreme VaR values. The Student-t distribution's fat tails ensure that the estimated ES is based on many draws for extreme VaR values as well. This example already reflects one advantage of IS (with an importance density having fatter tails than the target distribution) over a 'direct approach': IS results in a lower NSE, and especially less downward uncertainty on the ES — with a correspondingly lower risk of substantially underestimating the risk.
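The three-step procedure can be sketched in Python as follows; this is an illustration under the assumption that even the most extreme VaR values on the grid leave enough draws in the 'high loss' region (which a fat-tailed importance density is meant to guarantee), and it returns only the mean and standard deviation of the resulting normal mixture, the latter being the NSE of the ES estimator:

```python
import numpy as np
from scipy import stats

def is_ratio_and_nse(w_norm, g):
    """Self-normalized IS estimate of E[g(X)] and its delta-method NSE
    (assuming independent draws); w_norm are weights adding to 1."""
    est = np.dot(w_norm, g)
    nse = np.sqrt(np.sum((w_norm * (g - est)) ** 2))
    return est, nse

def es_estimator_moments(pl, w, var_is, nse_var, n_var=41):
    """Steps 1-3 above: mix the normal densities p(ES_IS | VaR) over a grid of VaR
    values, with weights from the approximately normal density of VaR_IS.
    Returns the mean and standard deviation (= NSE) of the resulting mixture."""
    w_norm = w / w.sum()
    var_grid = np.linspace(var_is - 4 * nse_var, var_is + 4 * nse_var, n_var)  # step 1
    grid_w = stats.norm.pdf(var_grid, loc=var_is, scale=nse_var)
    grid_w /= grid_w.sum()

    means, sds = [], []
    for v in var_grid:                          # step 2: ES estimate and NSE given each VaR
        tail = pl <= v                          # 'high loss' draws for this VaR value
        w_tail = w_norm[tail] / w_norm[tail].sum()
        m, s = is_ratio_and_nse(w_tail, pl[tail])
        means.append(m)
        sds.append(s)
    means, sds = np.array(means), np.array(sds)

    mean_es = np.dot(grid_w, means)             # step 3: moments of the normal mixture
    var_es = np.dot(grid_w, sds ** 2 + means ** 2) - mean_es ** 2
    return mean_es, np.sqrt(var_es)
```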

2.2. Optimal importance density

The optimal importance distribution for the IS estimation of $\bar{g} \equiv E[g(X)]$ for a given kernel p(x) of the target density p∗(x) and function g(x), which minimizes the numerical standard error for a given (large) number of draws, is given by Geweke (1989, Theorem 3). This optimal importance density has the kernel $q_{opt}(x) \propto |g(x) - \bar{g}|\, p(x)$ (under the condition that $E[|g(X) - \bar{g}|]$ is finite). Geweke (1989) mentions three practical disadvantages of this optimal importance distribution. First, it is different for different functions g(x). Second, a preliminary estimate of $\bar{g} = E[g(X)]$ is required. Third, methods for simulating from it would need to be devised. Geweke (1989) further notes that this result reflects the fact that importance sampling densities with fatter tails may be more efficient than the target density itself, as is the case in the example above. In such cases, the relative numerical efficiency (RNE), which is the ratio of (an estimate of) the variance of an estimator based on direct sampling to the IS estimator's estimated variance (with the same number of draws), exceeds 1.⁴

An interesting result is the case where g(x) is an indicator function I{x ∈ S} for a subspace S, so that E[g(X)] = Pr[X ∈ S] = p. Then the optimal importance density is given by

$$q_{opt}(x) \propto \begin{cases} (1-p)\, p(x) & \text{for } x \in S \\ p\, p(x) & \text{for } x \notin S \end{cases}$$

or

$$q_{opt}(x) = \begin{cases} c\,(1-p)\, p^{*}(x) & \text{for } x \in S \\ c\, p\, p^{*}(x) & \text{for } x \notin S, \end{cases}$$

with c being a constant. Then $\int_{x \in S} q_{opt}(x)\,\mathrm{d}x = \int_{x \notin S} q_{opt}(x)\,\mathrm{d}x = 1/2$.⁵ That is, half of the draws should be made in S and half outside S, in proportion to the target kernel p(x) in both cases.

From Eq. (8) it can be seen that the NSE of the IS estimator for the 100α% VaR is proportional to the NSE of the IS estimator of E[g(X)] with

4 The RNE is an indicator of the efficiency of the chosen importance function; if the target and importance densities coincide, the RNE equals one, whereas a very poor importance density will have an RNE close to zero. The inverse of the RNE is known as the inefficiency factor (IF).

5 This is easily seen from $\int_{x \in S} q_{opt}(x)\,\mathrm{d}x + \int_{x \notin S} q_{opt}(x)\,\mathrm{d}x = 1$ and the equalities

$$\int_{x \in S} q_{opt}(x)\,\mathrm{d}x = c\,(1-p) \int_{x \in S} p^{*}(x)\,\mathrm{d}x = c\,(1-p)\,p$$

and

$$\int_{x \notin S} q_{opt}(x)\,\mathrm{d}x = c\,p \int_{x \notin S} p^{*}(x)\,\mathrm{d}x = c\,p\,(1-p).$$


Fig. 2. Example of a standard normally distributed profit/loss PL: illustration of the estimation of the ES estimator's density using either a N(0, 1) importance density (left) — corresponding to the case of direct sampling — or a Student-t importance density (right). The top-left panel gives the densities p($\widehat{ES}_{IS}$|VaR) for several VaR values. The top-right panel shows the density p($\widehat{VaR}_{IS}$). The bottom panel gives the density p($\widehat{ES}_{IS}$).

Fig. 3. Example of a standard normally distributed profit/loss PL (on the horizontal axis): the profit/loss density (top left) and the importance density for IS estimation of the 99% VaR that is optimal in the typical Bayesian case with only the target density kernel known (bottom left). Also given is a mixture of three Student-t distributions, providing an approximation to the optimal importance density (right).

g(x) = I{x ∈ S}, where S is the subspace with the 100(1 − α)% lowest values of PL(X); hence the optimal importance density for VaR estimation results from Geweke (1989): half of the draws should be made in the 'high loss' subspace S and half of the draws outside S, in proportion to the target kernel p(x). Intuitively, a precise numerator of Eq. (3) requires draws from S, whereas the denominator requires draws from the whole space of X. If the exact target density p∗(x) were known, rather than merely a kernel p(x), then the middle expression of the IS estimator (Eq. (3)) would consist of only the numerator, so that it would be optimal to only simulate draws in the 'high loss' subspace S.

For the case of N(0, 1) distributed profit/loss, the bottom left panel of Fig. 3 shows the optimal importance density for IS estimation of the 99% VaR. Note the bimodality.⁶ A mixture of Student-t distributions can approximate such shapes, see e.g. Hoogerheide

6 The optimal importance density can also have more than 2 modes. For example, if one shorts a straddle of options, one has high losses for both large decreases and large increases of the underlying asset's price, and the optimal importance density is then trimodal. Especially in higher dimensions, where one may not directly have a good 'overview' of the target distribution, it is important to use a flexible method such as the AdMit approach.


et al. (2007). The right panel of Fig. 3 shows a mixture of three Student-t distributions providing a reasonable approximation.

The VaR estimation approach proposed in this paper — Quick Evaluation of Risk using Mixture of t approximations (QERMit) — consists of two steps: (1) approximate the optimal importance density by a certain mixture density $\hat{q}_{opt}(.)$, for which we must first compute a preliminary (less precise) estimate of the VaR; and (2) apply IS using $\hat{q}_{opt}(.)$. Step (1) should be seen as an 'investment' of computing time that will easily be 'profitable', since far fewer draws from the importance density are required in step (2).

The optimal importance density for IS estimation of the ES does not follow from Geweke (1989). We only mention that it will generally have fatter tails than the optimal importance density for VaR estimation, just as the optimal importance density for the estimation of the mean has fatter tails than the target distribution itself (which is optimal for estimating the median). Since we make use of a fat-tailed importance density in any case — being 'conservative' in the sense of ensuring that our importance density does not 'miss' relevant parts of the parameter space — we simply use our approximation $\hat{q}_{opt}(.)$ of the optimal importance density for VaR estimation. In the examples this will be shown to work well.

In the extreme case of a Student-t profit/loss distribution with 2 degrees of freedom, the direct sampling estimator of the ES has no finite variance — just like the distribution itself — whereas the IS estimator using a Student-t importance density with 1 degree of freedom, a Cauchy density, does have a finite variance. This example shows that the relative gain in precision from performing IS rather than a direct simulation in the estimation of the 100α% ES can be infinite (for any α ∈ (0, 1))!

On the other hand, for VaR estimation the relative gain in precision from using IS rather than direct simulation (of the same number of independent draws from the target distribution) is limited (for a given α ∈ (0, 1)). From Geweke (1989, Theorem 3) we know that (for a large number of draws n) the variance of $\widehat{E[g(X)]}_{IS}$ with the optimal importance density $q_{opt}(x)$ is approximately

$$\sigma^2_{IS,opt} \approx \frac{1}{n} \left( E[|g(X) - \bar{g}|] \right)^2.$$

For g(X) = I{X ∈ S} with Pr[X ∈ S] = 1 − α, we have

$$\sigma^2_{IS,opt} \approx \frac{1}{n} \left[ \alpha(1-\alpha) + (1-\alpha)\alpha \right]^2 = \frac{4}{n}\, \alpha^2 (1-\alpha)^2.$$

For direct simulation (of independent draws), the variance of the estimator $\widehat{E[g(X)]}_{DS}$ results from the binomial distribution: $\sigma^2_{DS} = \frac{1}{n}\, \alpha(1-\alpha)$. The gain from using IS rather than direct simulation is therefore

$$\frac{\sigma^2_{DS}}{\sigma^2_{IS,opt}} \approx \frac{1}{4\,\alpha\,(1-\alpha)}, \tag{9}$$

which is also the relative gain in the VaR estimator's precision (from Eqs. (7)–(8)). For α = 1/2, Eq. (9) reduces to 1: for the estimation of the median, the optimal importance density is the target density itself. For α = 0.99, the α value that is the specific focus of this paper, the relative gain in Eq. (9) is equal to 25.25. For α = 0.95 and α = 0.995 it is equal to 5.26 and 50.25, respectively. It is intuitively clear that the more extreme the quantile, the larger the potential gain from focusing on the smaller subspace of interest using the IS method. Eq. (9) gives an upper boundary for the (theoretical) RNE in IS-based estimation of the 100α% VaR.⁷ However, one should not interpret Eq. (9) as an upper boundary of the gain from the QERMit approach over the method that we name the 'direct approach', since the 'direct approach' typically yields serially correlated draws. If the serial correlation is high, due to a low Metropolis-Hastings acceptance rate in the case of non-elliptical shapes or simply due to high correlations between parameters in the case of the Gibbs sampler, the relative gain can be much larger than the boundary of Eq. (9). In such cases, the RNE of the 'direct approach' may be far below 1.

3. The Adaptive Mixture of t (AdMit) method

In this section we briefly consider the AdMit approach, which is an important ingredient of our QERMit method. The AdMit approach consists of two steps. First, it constructs a mixture of Student-t distributions which approximates a target distribution of interest. The fitting procedure relies only on a kernel of the target density, so that the normalizing constant is not required. In a second step,

7 This is an asymptotic result. For finite numbers of draws, the quoted estimated RNE values may exceed this boundary due to estimation error.


this approximation is used as an importance function in importance sampling (or as a candidate density in the independence chain Metropolis-Hastings algorithm) for estimating the characteristics of the target density. The estimation procedure is fully automatic, and thus avoids the difficult task, especially for non-experts, of tuning a sampling algorithm. In the standard case of importance sampling the candidate density is unimodal. If the target distribution is multimodal then some draws may have huge importance weights or some modes may even be completely missed. Thus, an important problem is the choice of the importance density, especially when little is known about the shape of the target density a priori. The importance density should be close to the target density, and it is especially important that the tails of the candidate should not be thinner than those of the target. Hoogerheide et al. (2007) mention several reasons why mixtures of Student-t distributions are natural candidate densities. First, they can provide accurate approximations of a wide variety of target densities, including densities with substantial skewness and high kurtosis. Furthermore, they can deal with multi-modality and with non-elliptical shapes due to asymptotes. Second, this approximation can be constructed by a quick, iterative procedure, and a mixture of Student-t distributions is easy to sample from. Third, the Student-t distribution has fatter tails than the normal distribution; the risk that the tails of the candidate will be thinner than those of the target distribution is small, especially if one specifies Student-t distributions with few degrees of freedom. Finally, Zeevi and Meir (1997) showed that under certain conditions any density function may be approximated to an arbitrary level of accuracy using a convex combination of basis densities, and the mixture of Student-t distributions falls within their framework.

The AdMit approach determines the number of mixture components, the mixing probabilities, and the modes and scale matrices of the components in such a way that the mixture density approximates the target density p∗(θ), of which we only know a kernel function p(θ) with θ ∈ R^k. Typically, p(θ) will be a posterior density kernel for a vector of model parameters θ. The AdMit strategy consists of the following steps:

Step 0. Initialization: compute the mode and scale matrix of the first component, and draw a sample from this Student-t distribution;

Step 1. Iterate on the number of components: add a new component that covers a part of the space of θ where the previous mixture density was relatively small, relative to p(θ);
Step 2. Optimize the mixing probabilities;
Step 3. Draw a sample from the new mixture;
Step 4. Evaluate the importance sampling weights: if the coefficient of variation of the weights, the standard deviation divided by the mean, has converged, then stop. Otherwise, return to step 1.

For more details, refer to Hoogerheide et al. (2007). The R package AdMit is available online (Ardia, Hoogerheide, & Van Dijk, 2008).
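The AdMit procedure itself is available as the R package cited above; as a rough illustration of two of its building blocks, the following Python sketch (hypothetical helper names, univariate case only, not the full fitting procedure) evaluates and samples a mixture of Student-t densities and computes the coefficient of variation of the IS weights used as the stopping criterion in step 4:

```python
import numpy as np
from scipy import stats

def mix_t_pdf(x, probs, locs, scales, df=5):
    """Density of a univariate mixture of Student-t components (the multivariate
    AdMit candidate uses modes and scale matrices instead of locs and scales)."""
    return sum(p * stats.t.pdf(x, df, loc=m, scale=s)
               for p, m, s in zip(probs, locs, scales))

def mix_t_rvs(n, probs, locs, scales, df=5, seed=None):
    """Draw n realizations from the mixture of Student-t distributions."""
    rng = np.random.default_rng(seed)
    comp = rng.choice(len(probs), size=n, p=probs)
    return stats.t.rvs(df, loc=np.take(locs, comp), scale=np.take(scales, comp),
                       size=n, random_state=rng)

def coef_of_variation(target_kernel, draws, probs, locs, scales, df=5):
    """Step 4 above: coefficient of variation of the IS weights w = p(theta)/q(theta);
    the AdMit iteration stops once this quantity has converged."""
    w = target_kernel(draws) / mix_t_pdf(draws, probs, locs, scales, df)
    return w.std() / w.mean()
```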

4. Quick Evaluation of Risk using Mixture of t approximations (QERMit)

The QERMit approach basically consists of two steps. First, the optimal importance or candidate density of Geweke (1989), $q_{opt}(.)$, is approximated by a 'hybrid' mixture of densities $\hat{q}_{opt}(.)$. Second, this candidate density is used for the importance sampling estimation of the VaR or ES. In order to estimate the h-step-ahead 100α% VaR or ES, the QERMit algorithm proceeds as follows:

Step 1. Construct an approximation of the optimal importance density:
Step 1a. Obtain a mixture of Student-t densities q_{1,Mit}(θ) that approximates the posterior density — given only the posterior density kernel — using the AdMit approach.
Step 1b. Simulate a set of draws θ^i (i = 1, . . . , n) from the posterior distribution using the independence chain MH algorithm with candidate q_{1,Mit}(θ). Simulate corresponding future paths y^{*i} ≡ {y^i_{T+1}, . . . , y^i_{T+h}} (i = 1, . . . , n) given the parameter values θ^i and historical values y ≡ {y_1, . . . , y_T}, i.e. from the density p(y∗|θ^i, y). Compute a preliminary estimate $\widehat{VaR}_{prelim}$ as the 100(1 − α)% quantile of the profit/loss values PL(y^{*i}) (i = 1, . . . , n).
Step 1c. Obtain a mixture of Student-t densities q_{2,Mit}(θ, y∗) that approximates the conditional joint density of the parameters θ and future returns y∗, given that PL(y∗) ≤ $\widehat{VaR}_{prelim}$, using the AdMit approach.


Step 2. Estimate the VaR or ES using importance sampling with the following mixture candidate density for (θ, y∗):

$$\hat{q}_{opt}(\theta, y^{*}) = 0.5\; q_{1,Mit}(\theta)\, p(y^{*}|\theta, y) + 0.5\; q_{2,Mit}(\theta, y^{*}), \tag{10}$$

where the weights 0.5 reflect the optimal 50%–50% division between 'high loss' draws and other draws.

The reason for the particular term q_{1,Mit}(θ) p(y∗|θ, y) in this candidate (Eq. (10)) is that the 50% of draws corresponding to the 'whole' distribution of (y∗, θ) can be generated more efficiently by using the density p(y∗|θ, y) that is specified by the model and only approximating the posterior of θ by q_{1,Mit}(θ) than by approximating the joint distribution of (y∗, θ). This reduces the dimensions of the approximation process, which has a positive effect on the computing time. In step 1b we actually compute a somewhat 'conservative', not-too-negative estimate $\widehat{VaR}_{prelim}$ of the VaR, since an overly extreme, negative $\widehat{VaR}_{prelim}$ may yield an approximation of the distribution in step 1c that does not cover all of the 'high loss' region (with PL ≤ VaR). This conservative $\widehat{VaR}_{prelim}$ can be based on its NSE, or one can simply use a somewhat less extreme quantile than the level of interest.⁸
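A minimal sketch of step 2, i.e. of drawing from the hybrid candidate in Eq. (10) and forming the corresponding importance weights; all function arguments (the two mixture approximations, the model's predictive simulator and density, and the posterior kernel) are user-supplied placeholders, not routines from the paper or from AdMit:

```python
import numpy as np

def draw_hybrid_candidate(n, q1_rvs, q2_rvs, simulate_future, seed=None):
    """Draw (theta, y*) pairs from the hybrid candidate of Eq. (10): with probability
    0.5 draw theta from q1_Mit and y* from the model density p(y*|theta, y), and with
    probability 0.5 draw (theta, y*) jointly from the 'high loss' mixture q2_Mit."""
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n):
        if rng.random() < 0.5:
            theta = q1_rvs(rng)                      # theta ~ q1_Mit(theta)
            y_star = simulate_future(theta, rng)     # y* ~ p(y*|theta, y), from the model
        else:
            theta, y_star = q2_rvs(rng)              # (theta, y*) ~ q2_Mit(theta, y*)
        draws.append((theta, y_star))
    return draws

def hybrid_is_weight(theta, y_star, post_kernel, pred_pdf, q1_pdf, q2_pdf):
    """Unnormalized IS weight: target kernel p(theta|y) p(y*|theta, y) divided by the
    hybrid candidate density of Eq. (10)."""
    target = post_kernel(theta) * pred_pdf(y_star, theta)
    cand = 0.5 * q1_pdf(theta) * pred_pdf(y_star, theta) + 0.5 * q2_pdf(theta, y_star)
    return target / cand
```

The resulting pairs and weights can then be fed into the weighted-quantile VaR and weighted-tail-average ES estimators of Section 2.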

The QERMit algorithm proceeds in an automatic fashion, in the sense that it only requires the posterior kernel of θ, profit/loss as a function of y∗, and (evaluation of and simulation from) the density of y∗ given θ to be programmed. The generation of draws (θ^i, y^{*i}) only requires simulation from Student-t distributions and the model itself, which is performed easily and quickly. Notice that we focus on the distribution of (θ, y∗), whereas the loss only depends on y∗. The obvious reason for this is that we typically do not have the predictive density of the future path y∗ as an explicit density kernel, so that we have to aim at (θ, y∗), of which we know the density kernel

$$p(\theta, y^{*}|y) = p(\theta|y)\, p(y^{*}|\theta, y) \propto \pi(\theta)\, p(y|\theta)\, p(y^{*}|\theta, y)$$

with prior density kernel π(θ).

8 For this reason it does not make sense to use mixing probabilities 0.5/α and (α − 0.5)/α, which would lead to an exact 50%–50% division of 'high loss' draws and other draws if $\widehat{VaR}_{prelim}$ and q_{2,Mit}(θ, y∗) were perfect, rather than 0.5 and 0.5, in Eq. (10). Since $\widehat{VaR}_{prelim}$ is chosen 'conservatively' and q_{2,Mit}(θ, y∗) is an approximation, not all of the candidate probability mass in q_{2,Mit}(θ, y∗) will be focused on the 'high loss' region in any case.

We will now discuss the QERMit method in a simple, illustrative example of an ARCH(1) model.

We consider the 1-day-ahead 99% VaR and ES for the S&P 500. That is, we assume that during a given day, one will keep a constant long position in the S&P 500 index. We use daily observations y_t (t = 1, . . . , T) on the log-return, 100 × the change in the logarithm of the closing price, from January 2, 1998, to April 14, 2000. See Fig. 4, in which April 14, 2000, corresponds to the second negative 'shock' of approximately −6%. This particular day is chosen for illustrative purposes. We consider the ARCH model (Engle, 1982) for the demeaned series y_t:

$$y_t = \varepsilon_t\, h_t^{1/2}, \qquad \varepsilon_t \sim N(0, 1), \qquad h_t = \alpha_0 + \alpha_1 y_{t-1}^2. \tag{11}$$

We further impose the variance targeting constraint α_0 = S²(1 − α_1), with S² being the sample variance of the y_t (t = 1, . . . , T), so that we have a model with only one parameter, α_1. We assume a flat prior on the interval [0, 1).
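For this illustration, the only model-specific input that the AdMit step below needs is a posterior density kernel for α_1. A minimal Python sketch, assuming demeaned log-returns in a NumPy array and initializing the first conditional variance at the sample variance (an assumption made for this sketch), is:

```python
import numpy as np

def arch1_log_posterior_kernel(alpha1, y):
    """Log posterior kernel of alpha_1 in the ARCH(1) model of Eq. (11) with variance
    targeting alpha_0 = S^2 (1 - alpha_1) and a flat prior on [0, 1); y holds the
    demeaned log-returns."""
    if not 0.0 <= alpha1 < 1.0:
        return -np.inf                               # flat prior restricted to [0, 1)
    s2 = y.var()
    h = s2 * (1.0 - alpha1) + alpha1 * np.concatenate(([s2], y[:-1] ** 2))
    return -0.5 * np.sum(np.log(h) + y ** 2 / h)     # Gaussian log-likelihood, up to a constant
```

The exponential of this kernel (up to a constant) plays the role of p(α_1) that the AdMit procedure approximates in step 1a.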

Step 1a of the QERMit method is illustrated in Fig. 5. The AdMit method constructs a mixture of Student-t distributions that approximates the posterior density, given only its kernel. It starts with a Student-t density around the posterior mode, q_1(α_1), then searches for the maximum of the weight function w(α_1) = p(α_1)/q_1(α_1), where a new Student-t component q_2(α_1) for the mixture distribution is specified. The mixing probabilities are chosen to minimize the coefficient of variation of the IS weights, in this case yielding q_{1,Mit}(α_1) = 0.955 q_1(α_1) + 0.045 q_2(α_1), which only provides a minor improvement — a slightly more skewed importance density — over the original Student-t density q_1(α_1).⁹ Therefore, convergence is achieved after two steps. Note that we do not need an (almost) perfect approximation to the posterior kernel, which would generally require a huge amount of computing time. A reasonably good

9 See Hoogerheide et al. (2007) for examples in which this improvement is huge. For the QERMit approach to be useful, it is not necessary for the posterior to have non-elliptical shapes.


Fig. 4. S&P 500 log-returns (100 × change of log-index): daily observations from the period 1998–2007.

Fig. 5. The QERMit method in an illustrative ARCH(1) model for the S&P 500. Step 1a: the AdMit method iteratively constructs a mixture of t approximation q_{1,Mit}(.) to the posterior density, given only its kernel.

approximation is good enough. In this simple example, QERMit step 1a took only 1.2 s.¹⁰

In the QERMit method's step 1b, we generate a set of n = 10,000 draws α_1^i (i = 1, . . . , n) using the independence chain MH algorithm with candidate q_{1,Mit}(α_1), and simulate n corresponding draws y^i_{T+1} from the distribution N(0, S² + α_1^i (y_T² − S²)) = N(0, 1.62 + 35.13 α_1^i), since y_T = −6.06.

10 An Intel Centrino Duo Core processor was used.

The 100th of the ascendingly sorted percentage loss values, PL(y^{*i}) = 100[exp(y^i_{T+1}/100) − 1] (since y_{T+1} is 100 × the log-return), or a 'conservatively' chosen less negative value, is then the preliminary VaR estimate $\widehat{VaR}_{prelim}$. In this simple example QERMit step 1b took only 3.4 s.
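Step 1b for this example can be sketched as follows, assuming the posterior draws of α_1 from the MH step are available; the percentage profit/loss transform and the choice of the 100(1 − α)% empirical quantile follow the description above:

```python
import numpy as np

def step_1b_prelim_var(alpha1_draws, y_T, s2, alpha=0.99, seed=None):
    """Step 1b in the ARCH(1) illustration: simulate y_{T+1} for each posterior draw of
    alpha_1, convert the log-return to a percentage profit/loss, and take the
    100(1-alpha)% empirical quantile as the preliminary VaR estimate."""
    rng = np.random.default_rng(seed)
    h_next = s2 + alpha1_draws * (y_T ** 2 - s2)   # = alpha_0 + alpha_1 y_T^2 under variance targeting
    y_next = rng.normal(0.0, np.sqrt(h_next))      # y_{T+1} | alpha_1, y
    pl = 100.0 * (np.exp(y_next / 100.0) - 1.0)    # percentage profit/loss
    return np.sort(pl)[int(len(pl) * (1.0 - alpha)) - 1]
```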

The left panels of Fig. 6 depict QERMit step 1c. We approximate the joint 'high loss' distribution of (α_1, ε_{T+1}) rather than (α_1, y_{T+1}). The reason for this is that in general it is easier to approximate the 'high loss' distribution of (θ, ε∗), where ε∗ ≡


Table 1
Estimates of 1-day-ahead 99% VaR and ES for the S&P 500 in the ARCH(1) model (for the demeaned series under 'variance targeting', given daily data from January 1, 1998 to April 14, 2000). The 'direct approach' uses Metropolis-Hastings (Student-t candidate) for the parameter draws plus direct sampling of the future returns paths given the parameter draws; the QERMit approach uses adaptive importance sampling with a mixture approximation of the optimal candidate distribution.

                                   'Direct approach'                    QERMit approach
                                   Estimate    NSE       RNE            Estimate    NSE       RNE
99% VaR                            −5.744%     0.099%    0.92           −5.658%     0.020%    22.1
99% ES                             −6.592%     0.132%    0.86           −6.566%     0.024%    24.9

Time: total                        3.3 s                                10.1 s
      construction candidate                                            6.6 s
      sampling                     3.3 s                                3.5 s
Draws                              10,000                               10,000
Time/draw                          0.33 ms                              0.35 ms

Required for a % estimate with 1 digit of precision (with 95% confidence):
For 99% VaR: number of draws       151,216                              6,408
             computing time        49.9 s                               8.8 s
For 99% ES:  number of draws       268,150                              9,036
             computing time        88.5 s                               9.8 s

{ε_{T+1}, . . . , ε_{T+h}}, by a mixture of Student-t distributions than the 'high loss' distribution of (θ, y∗). This makes step 1c much faster, especially in GARCH-type models, where the dependencies (of the clustered volatility) between future values y_{T+1}, . . . , y_{T+h} are obviously much more complex than between the independent future values ε_{T+1}, . . . , ε_{T+h}. The 'high loss' subspace of parameters θ and future errors ε∗ is somewhat more complex than for θ and y∗. For example, in Fig. 6 the border line is described by $\varepsilon_{T+1} = c / \sqrt{1.62 + 35.13\,\alpha_1}$ instead of simply y_{T+1} = c (for c = 100 log(1 + $\widehat{VaR}_{prelim}$/100)), but it is still preferable to focusing directly on the parameters and future realizations y∗. The middle and bottom left panels show the contour plots of the joint 'high loss' density of (α_1, ε_{T+1}) and its mixture of t approximation. This shows that a two-component mixture can provide useful approximations of the highly skewed shapes that are typically present in such tail distributions. In this simple example, QERMit step 1c took only 2.0 s.

The right panels of Fig. 6 show the result of QERMit step 1, a 'hybrid' mixture approximation $\hat{q}_{opt}(\alpha_1, \varepsilon_{T+1})$ to the optimal importance density $q_{opt}(\alpha_1, \varepsilon_{T+1})$. Table 1 shows the results of QERMit step 2, and compares the QERMit procedure to the 'direct approach'. For the 'direct approach', the series of 10,000 profit/loss values is serially correlated, since we use the Metropolis-Hastings algorithm. Therefore, the numerical standard errors make use of the method of Andrews (1991), using a quadratic spectral kernel and pre-whitening, as suggested by Andrews and Monahan (1992). Note the huge difference between the NSEs. The RNE for the 'direct approach' is somewhat smaller than 1, due to the serial correlation in the Metropolis-Hastings draws, whereas the RNE for the QERMit importance density is far above 1 — in fact, it is not far from its theoretical boundary of 25.25 (for α = 0.99). Notice that the fat-tailed approximation to the optimal importance density for VaR estimation works even better for ES estimation, with an even higher RNE. For a precision of 1 digit (with 95% confidence), i.e. 1.96 NSE < 0.05, we require far fewer draws and much less computing time using the QERMit approach than using the 'direct approach'. In more complicated models the construction of a suitable importance density will obviously require more time. However, this bigger 'investment' of computing time may obviously still be profitable, possibly even more so, as the 'direct approach' will then also require more computing time. In the next section we consider a GARCH model with Student-t errors.


Fig. 6. The QERMit method in an illustrative ARCH(1) model for the S&P 500. Step 1c: the AdMit method constructs a mixture of t approximation q_{2,Mit}(.) to the joint 'high loss' density of the parameters and the future errors (contour plots in the left panels). Step 2: use the approximation $\hat{q}_{opt}(.)$ (bottom right panel) to the optimal importance density $q_{opt}(.)$ (middle right panel) for VaR or ES estimation.


Table 2
Simulation results for the GARCH model with Student-t innovations (Eqs. (12)–(16)) for S&P 500 log-returns: estimated posterior means, posterior standard deviations and serial correlations (S.c.) in the Markov chains of draws.

                               MH (candidate = Student-t)        AdMit-MH (candidate = mixture of 2 Student-t)
                               Mean      St.dev.   S.c.          Mean      St.dev.   S.c.
µ                              0.0483    0.0169    0.4322        0.0489    0.0177    0.4458
α0                             0.0086    0.0034    0.5463        0.0080    0.0033    0.5288
α1                             0.0713    0.0114    0.5081        0.0697    0.0108    0.4896
β                              0.9243    0.0118    0.5157        0.9262    0.0114    0.5106
ν                              10.0953   1.9717    0.6632        9.8086    1.6801    0.4791

Total time                     47.2 s                            65.4 s
Time construction candidate                                      16.1 s
Time sampling                  47.2 s                            49.3 s
Draws                          10,000                            10,000
Time/draw                      4.7 ms                            4.9 ms
Acceptance rate                53.9%                             56.2%

5. Student-t GARCH model for S&P 500

In this section we consider the 10-day-ahead 99% VaR and ES for the S&P 500. We use T = 2514 daily observations y_t (t = 1, . . . , T) on log-returns from January 2, 1998, to December 31, 2007; see Fig. 4. We consider the GARCH model (Bollerslev, 1986; Engle, 1982) with Student-t innovations:¹¹

$$y_t = \mu + u_t \tag{12}$$
$$u_t = \varepsilon_t\, (\varrho\, h_t)^{1/2} \tag{13}$$
$$\varepsilon_t \sim \text{Student-}t(\nu) \tag{14}$$
$$\varrho \equiv \frac{\nu - 2}{\nu} \tag{15}$$
$$h_t = \alpha_0 + \alpha_1 u_{t-1}^2 + \beta h_{t-1} \tag{16}$$

where Student-t(ν) is the standard Student-t distribution with ν degrees of freedom, with variance ν/(ν − 2) for ν > 2. The scaling factor ϱ normalizes the variance of the Student-t distribution such that the innovation u_t has variance h_t. We specify flat priors for µ, α_0, α_1, and β on the parameter subspace with α_0 > 0,

11 We also considered the GJR model (Glosten, Jagannathan, & Runkle, 1993) with Student-t innovations. However, the results suggested a negative α_1 parameter for positive error values, suggesting that large positive shocks lead to a decrease in volatility relative to modest positive innovations. This result may be considered counterintuitive, and is a separate topic which does not fit within the scope of the current paper.

0 ≤ α_1 ≤ 1, 0 ≤ β ≤ 1. These restrictions guarantee that the conditional variance will be positive. We use a proper yet uninformative exponential prior distribution for ν − 2; the restriction ν > 2 ensures that the conditional variance is finite.¹²
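For this model, the p(y∗|θ, y) ingredient of the QERMit candidate amounts to iterating the GARCH recursion forward and drawing Student-t innovations. A minimal sketch, assuming the last in-sample innovation u_T and conditional variance h_T have been computed from the data for the given parameter draw, is:

```python
import numpy as np

def simulate_garch_t_path(theta, u_last, h_last, horizon=10, seed=None):
    """Simulate a future path y_{T+1}, ..., y_{T+h} from the Student-t GARCH model of
    Eqs. (12)-(16), given a parameter draw theta = (mu, alpha0, alpha1, beta, nu) and
    the last in-sample innovation u_T and conditional variance h_T."""
    mu, a0, a1, beta, nu = theta
    rng = np.random.default_rng(seed)
    rho = (nu - 2.0) / nu                        # Eq. (15): scaling so that var(u_t) = h_t
    u, h = u_last, h_last
    path = np.empty(horizon)
    for j in range(horizon):
        h = a0 + a1 * u ** 2 + beta * h          # Eq. (16)
        eps = rng.standard_t(nu)                 # Eq. (14)
        u = eps * np.sqrt(rho * h)               # Eq. (13)
        path[j] = mu + u                         # Eq. (12)
    return path
```

The 10-day percentage profit/loss then follows from the sum of the simulated log-returns, analogous to the 1-day transform in the ARCH(1) illustration.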

For the model (12)–(16), simulation results are in Table 2. Computing times refer to computations on an Intel Centrino Duo Core processor. The first MH approach uses a Student-t candidate distribution around the maximum likelihood estimator. The AdMit-MH approach in step 1a of the QERMit algorithm requires 16.1 s to construct a candidate distribution, which is a mixture of two Student-t distributions in this example. The AdMit-MH draws have a slightly higher acceptance rate and, for all parameters but µ, a somewhat lower serial correlation in the Markov chain of draws. However, the differences are small, reflecting the fact that the contours of the posterior are rather close to the elliptical shapes of the simple Student-t candidate.

We now compare the results of the 'direct approach' and the QERMit method. Fig. 7 shows the estimated profit/loss density, the density of the percentage 10-day change in the S&P 500 index. Simulation results are in Table 3. In the QERMit approach the construction of the candidate distribution

12 Under a flat prior for ν the posterior would be improper, as for ν → ∞ the likelihood does not tend to 0, but to the likelihood under Gaussian innovations — see Bauwens and Lubrano (1998).


Table 3
Estimates of 10-day-ahead 99% VaR and ES for the S&P 500 in the Student-t GARCH model (given daily data from the period January 1998–December 2007). The 'direct approach' uses Metropolis-Hastings (Student-t candidate) for the parameter draws plus direct sampling of the future returns paths given the parameter draws; the QERMit approach uses adaptive importance sampling with a mixture approximation of the optimal candidate distribution.

                                   'Direct approach'                     QERMit approach
                                   Estimate    NSE      RNE              Estimate    NSE      RNE
99% VaR                            −7.92%      0.19%    0.76             −8.27%      0.06%    7.34
99% ES                             −9.51%      0.26%    0.58             −9.97%      0.07%    8.11

Time: total                        51.9 s                                165.9 s
      construction candidate                                             103.8 s
      sampling                     51.9 s                                62.1 s
Draws                              10,000                                10,000
Time/draw                          5.2 ms                                6.2 ms

Required for a % estimate with 1 digit of precision (with 95% confidence):
For 99% VaR: number of draws       567,648                               58,498
             computing time        2946 s (= 49 min 6 s)                 467 s (= 7 min 47 s)
For 99% ES:  number of draws       1,033,980                             74,010
             computing time        5366 s (= 89 min 26 s)                563 s (= 9 min 23 s)

requires 103.8 s. This 'investment' can be considered quite 'profitable', as the NSEs of the VaR and ES estimators — both based on 10,000 draws — are much smaller than the NSEs of the estimators using the 'direct approach'. Suppose that we want to compute estimates of the VaR and ES (in %) with a precision of 1 digit (with 95% confidence), i.e. 1.96 NSE ≤ 0.05, so that we can quote (for example) −8.3% and −10.0% as the VaR and ES estimates from this model. In the 'direct approach' we would then require over 500,000 draws (over 49 minutes) for the VaR and over 1,000,000 draws (over 89 minutes) for the ES. However, in the QERMit approach we would require fewer than 60,000 (75,000) draws (under 8 (10) min) for the VaR (ES). Fig. 8 illustrates that the investment of computing time in an appropriate candidate distribution is indeed very profitable if one desires estimates of VaR and ES with a reasonable level of precision.

Finally, notice that the RNE is much higher than 1 for the QERMit approach, whereas it is somewhat below 1 for the 'direct approach'. The reason for the latter is again the serial correlation in the MH sequence of parameter draws.¹³ The first phenomenon is in

13 One could consider using only one in k draws, e.g. k = 5. However, this 'thinning' makes no sense in this application,

Fig. 7. Estimated profit/loss density: estimated density of the 10-day % change in the S&P 500 (for the first 10 working days of January 2008) based on the GARCH model with Student-t errors estimated using 1998–2007 data.

sharp contrast to the potential 'struggle' in importance sampling based Bayesian inference (for the estimation

since generating a parameter draw (evaluating the posterior density kernel) takes at least as much time as generating a path of 10 future log-returns. The quality of the draws, i.e. the RNE, would increase slightly, but the computing time required per draw would increase substantially.


Fig. 8. Precision (1/var) of the estimated VaR and ES, as a function of the amount of computing time, for the 'direct approach' (- -) and the QERMit approach (—). The horizontal line corresponds to a precision of 1 digit (1.96 NSE ≤ 0.05). Here the QERMit approach requires 103.8 s to construct an appropriate importance density, and after that soon generates far more precise VaR and ES estimates.

of posterior moments of non-elliptical distributions) to have an RNE not too far below 1.

A phenomenon that may cause problems for the QERMit approach is the so-called 'curse of dimensionality of importance sampling'. In this situation, the relative performance of the QERMit method may decrease as the prediction horizon h increases. In our example, the clear victory witnessed for h = 10 is still observed for h = 20. For very long horizons, e.g. h = 100, the preference for the QERMit approach vanishes. An adaptation of the QERMit algorithm, in which one adapts, in real time, a 'prefabricated' importance function that has been extensively trained on either historical data or one of a set of canonical examples, rather than starting from scratch, may provide better results in such situations. This is left as an issue for future research.

6. Concluding remarks

We conclude that the proposed QERMit approach can yield far more accurate VaR and ES estimates given the same amount of computing time, or, equivalently, requires less computing time to achieve the same numerical accuracy. This enables ‘real time’ decision-making on the basis of these risk measures, computed with higher accuracy, in a simulation-based Bayesian framework. The proposed method can also be useful in the case of 1-step-ahead forecasting with a portfolio of several assets, since the simulation of the future realizations is then typically also required. Thus, the sensible application of the QERMit method is not restricted to multi-step-ahead forecasting of VaR and ES.

One merit of the traditional ‘direct approach’ is that one can use the same algorithm for all risk levels (α) and prediction horizons (h) in which one is simultaneously interested. In the QERMit approach, a different importance function has to be designed for each α and h one wishes to consider. However, as an alternative, one can also approximate the optimal importance function for the case with the largest relevant values of α and h, and use this importance function for smaller values of α and h as well. We intend to report on cases with multiple values of interest for α and h in the future.

The examples in this paper only consider the case of a single asset, the S&P 500 index. In that sense, the application is 1-dimensional. However, the 10-day-ahead forecasting of the price of a single asset has similarities with 1-day-ahead forecasting for a portfolio of 10 assets. Further, the subadditivity of the ES measure implies that ES measures of subportfolios may already be useful: adding them yields a conservative risk measure for the whole portfolio, as the inequality below makes explicit. Nonetheless, we intend to investigate portfolios of several assets and report on this in the near future. The application to portfolios of several assets whose returns’ distributions are captured in a multivariate GARCH model or a copula is also of interest, as is an application to electricity prices, whose features clearly differ from those of the S&P 500 index.
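In notation not taken from the paper (L_A and L_B denote the losses on two subportfolios and ES_α the Expected Shortfall at level α), subadditivity is the statement

    \mathrm{ES}_{\alpha}(L_A + L_B) \;\le\; \mathrm{ES}_{\alpha}(L_A) + \mathrm{ES}_{\alpha}(L_B),

so the sum of the subportfolio ES figures is an upper bound for, and hence a conservative stand-in for, the ES of the combined portfolio; VaR does not satisfy this property in general.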

As another topic for further research, we mention the application of the approach for efficient simulation-based computations in extreme value theory, e.g. efficient risk evaluation in the case of Pareto distributions.


Acknowledgements

A preliminary version of this paper was presented at the 2008 ESEM Conference in Milan, at the New York Camp Econometrics IV at Lake Placid, and at seminars at the European University Institute in Florence and the University of Pennsylvania. Helpful comments from several participants have led to substantial improvements. The authors further thank David Ardia, Luc Bauwens, Tomasz Wozniak and two anonymous referees for multiple useful suggestions. Some results were obtained during a students’ research project at Erasmus University Rotterdam in 2008 by Jesper Boer, Rutger Brinkhuis and Jaap van Dam, supervised by Sander Scheerders of SNS REAAL and the first author. The second author gratefully acknowledges financial assistance from the Netherlands Organization of Research (grant 400-07-703).


Lennart Hoogerheide is Assistant Professor of Econometrics at the Econometric Institute of Erasmus University Rotterdam. His research is mainly focused on flexible models and computational methods for the analysis of risk and treatment effects in finance and macroeconomics. His publications have appeared in international journals such as Journal of Econometrics, International Journal of Forecasting, Journal of Forecasting, and Journal of Statistical Software.

Herman K. van Dijk is Professor of Econometrics at the Econometric Institute of Erasmus University Rotterdam, and director of the Tinbergen Institute. His recent research has been related to simulation-based Bayesian inference, with applications in macroeconomics and finance. His publications have appeared in international journals such as International Journal of Forecasting, Journal of Forecasting, Journal of Econometrics, Econometrica, European Economic Review, Journal of Applied Econometrics, Journal of Business and Economic Statistics, and Journal of Statistical Software.