
Deep Learning for Macroeconomists

Jesus Fernandez-Villaverde (University of Pennsylvania)

May 21, 2021

Background

• Presentation based on joint work with different coauthors:

1. Solving High-Dimensional Dynamic Programming Problems using Deep Learning, with Galo Nuno, George Sorg-Langhans, and Maximilian Vogler.

2. Exploiting Symmetry in High-Dimensional Dynamic Programming, with Mahdi Ebrahimi Kahou, Jesse Perla, and Arnav Sood.

3. Financial Frictions and the Wealth Distribution, with Galo Nuno and Samuel Hurtado.

4. Structural Estimation of Dynamic Equilibrium Models with Unstructured Data, with Sara Casella and Stephen Hansen.

• These slides are available at https://www.sas.upenn.edu/~jesusfv/deep-learning.pdf

• Examples and code at https://github.com/jesusfv/financial-frictions


Deep learning for estimation

• Most economists have seen deep learning as applied to statistics/econometrics problems.

• For example, given some data x = {x1, x2, ..., xN}, we want to estimate a function:

y = h(x)

such as a conditional expectation, a quantile function, an index, a classifier, ...

• Why deep learning?

1. We do not know much about the functional form of h(·). In particular, it can be highly nonlinear.

2. x is high-dimensional.

3. x is non-structured (photos, social media, text,...).


Deep learning for solving models

• Solving models in macro (and in I.O., international trade, finance, game theory, ...) shares a very

similar structure.

• For example, given some states x = {x1, x2, ..., xN}, we want to compute a function (or, more

generally, an operator):

y = h(x)

such as a value function, a best response function, a pricing kernel, an equilibrium allocation, ...

• Again, we might not know much about the functional form of h(·), or x might be high-dimensional.

• Common idea: we need to find a function/operator that satisfies some equilibrium conditions and

deep learning is a great functional approximation tool.

• For example, it can tackle multiplicity of equilibria.


A basic model

• Take the stochastic neoclassical growth model:

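$$\max \mathbb{E}_0 \sum_{t=0}^{\infty} \beta^t u(c_t)$$

$$c_t + k_{t+1} = e^{z_t} k_t^{\alpha} + (1-\delta) k_t, \quad \forall\, t > 0$$

$$z_t = \rho z_{t-1} + \sigma \varepsilon_t, \quad \varepsilon_t \sim N(0,1)$$

• The first-order condition:

$$u'(c_t) = \beta \mathbb{E}_t \left\{ u'(c_{t+1}) \left( 1 + \alpha e^{z_{t+1}} k_{t+1}^{\alpha-1} - \delta \right) \right\}$$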

Example I: Decision rules

• There is a decision rule (a.k.a. policy function) that gives the optimal choice of consumption and

capital tomorrow given the states today:

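$$h = \begin{cases} h^1(k_t, z_t) = c_t \\ h^2(k_t, z_t) = k_{t+1} \end{cases}$$

• Then:

$$u'\left(h^1(k_t, z_t)\right) = \beta \mathbb{E}_t \left\{ u'\left(h^1\left(h^2(k_t, z_t), z_{t+1}\right)\right) \left( 1 + \alpha e^{z_{t+1}} \left(h^2(k_t, z_t)\right)^{\alpha-1} - \delta \right) \right\}$$

• If we find h, and check that a transversality condition is satisfied, we are done!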
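The Euler equation written in terms of the decision rules can be checked numerically. The sketch below is an illustration, not the author's code: it assumes log utility and full depreciation (δ = 1), a special case in which the closed-form rule k' = αβ e^z k^α is known to solve the model, and the calibration values are made up. The expectation is computed with Gauss-Hermite quadrature.

```python
import numpy as np

# Assumed calibration, for illustration only (not taken from the slides).
ALPHA, BETA, DELTA, RHO, SIGMA = 0.33, 0.96, 1.0, 0.9, 0.02

def h2(k, z):
    """Capital rule k' = alpha*beta*exp(z)*k^alpha (exact for log utility, delta = 1)."""
    return ALPHA * BETA * np.exp(z) * k**ALPHA

def h1(k, z):
    """Consumption implied by the resource constraint."""
    return np.exp(z) * k**ALPHA + (1.0 - DELTA) * k - h2(k, z)

def euler_residual(k, z, n_nodes=7):
    """u'(c_t) - beta * E_t{u'(c_{t+1}) (1 + alpha e^{z'} k'^(alpha-1) - delta)}, u = log."""
    # Gauss-Hermite rule rescaled so that the weights integrate against N(0, 1).
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    eps = np.sqrt(2.0) * nodes
    weights = weights / np.sqrt(np.pi)
    k_next = h2(k, z)
    z_next = RHO * z + SIGMA * eps
    c_next = h1(k_next, z_next)
    gross_return = 1.0 + ALPHA * np.exp(z_next) * k_next**(ALPHA - 1.0) - DELTA
    expectation = np.sum(weights * gross_return / c_next)
    return 1.0 / h1(k, z) - BETA * expectation
```

For a candidate rule that is not exact, the same function returns a nonzero residual, which is the quantity a solution method drives toward zero.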

Example II: Conditional expectations

• Let us go back to our Euler equation:

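$$u'(c_t) = \beta \mathbb{E}_t \left\{ u'(c_{t+1}) \left( 1 + \alpha e^{z_{t+1}} k_{t+1}^{\alpha-1} - \delta \right) \right\}$$

• Define now:

$$h = \begin{cases} h^1(k_t, z_t) = c_t \\ h^2(k_t, z_t) = \mathbb{E}_t \left\{ u'(c_{t+1}) \left( 1 + \alpha e^{z_{t+1}} k_{t+1}^{\alpha-1} - \delta \right) \right\} \end{cases}$$

• Why? Example: ZLB.

• Then:

$$u'\left(h^1(k_t, z_t)\right) = \beta h^2(k_t, z_t)$$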

Example III: Value functions

• There is a recursive problem associated with the previous sequential problem:

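$$V(k_t, z_t) = \max_{k_{t+1}} \left\{ u(c_t) + \beta \mathbb{E}_t V(k_{t+1}, z_{t+1}) \right\}$$

$$c_t = e^{z_t} k_t^{\alpha} + (1-\delta) k_t - k_{t+1}, \quad \forall\, t > 0$$

$$z_t = \rho z_{t-1} + \sigma \varepsilon_t, \quad \varepsilon_t \sim N(0,1)$$

• Then:

$$h(k_t, z_t) = V(k_t, z_t)$$

and

$$h(k_t, z_t) = \max_{k_{t+1}} \left\{ u(c_t) + \beta \mathbb{E}_t h(k_{t+1}, z_{t+1}) \right\}$$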

A parameterized solution

• General idea: substitute h(x) by $h^j(x, \theta)$, where θ is a vector of coefficients to be determined by satisfying some criterion indexed by j.

• For example, in econometrics, if we want to approximate:

$$\mathbb{E}_t(y_t \mid x_t)$$

we can specify the (linear) polynomial:

$$y_t \approx \theta_0 + \theta_1 x_t$$

and find the values θ0 and θ1 that minimize the squared errors of the approximation.

• Similarly, when we solve a model, we can approximate a decision rule

$$y_t = h(x_t)$$

by specifying the (linear) polynomial:

$$y_t \approx \theta_0 + \theta_1 x_t$$

and find the values θ0 and θ1 that minimize the squared errors between the left-hand and right-hand sides of an Euler equation.

Classical approaches

• Two classical approaches based on addition of functions:

1. Perturbation methods:

$$h^{PE}(x, \theta) = \theta_0 + \tilde{\theta}_1 (x - x_0) + (x - x_0)' \tilde{\theta}_2 (x - x_0) + \text{H.O.T.}$$

We use implicit-function theorems to find θ.

2. Projection methods:

$$h^{PR}(x, \theta) = \theta_0 + \sum_{m=1}^{M} \theta_m \varphi_m(x)$$

where $\varphi_m$ is, for example, a Chebyshev polynomial.

We pick a basis $\{\varphi_m(x)\}_{m=0}^{\infty}$ and “project” the equilibrium conditions against that basis.
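To make the projection form concrete, here is a minimal sketch that fits the coefficients θ of a Chebyshev expansion by interpolation at Chebyshev nodes. It approximates a known function rather than an equilibrium condition, and the function names, interval, and degree are assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

def fit_chebyshev(h, a, b, degree):
    """Coefficients theta of sum_m theta_m T_m(s) that interpolate h on [a, b]."""
    n = degree + 1
    # Chebyshev nodes on [-1, 1], then mapped to [a, b].
    s = np.cos(np.pi * (2 * np.arange(n) + 1) / (2 * n))
    x = 0.5 * (b - a) * (s + 1.0) + a
    basis = cheb.chebvander(s, degree)  # columns T_0(s), ..., T_degree(s)
    return np.linalg.solve(basis, h(x))

def eval_chebyshev(theta, x, a, b):
    """Evaluate the fitted expansion at points x in [a, b]."""
    s = 2.0 * (x - a) / (b - a) - 1.0
    return cheb.chebval(s, theta)

# Usage: approximate k^0.33 on [0.5, 2] with a degree-12 expansion.
theta = fit_chebyshev(lambda k: k**0.33, 0.5, 2.0, 12)
grid = np.linspace(0.5, 2.0, 201)
max_error = np.max(np.abs(eval_chebyshev(theta, grid, 0.5, 2.0) - grid**0.33))
```

In a model-solving application, the interpolation conditions would be replaced by the requirement that the equilibrium conditions hold at the nodes.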

A neural network

• An artificial neural network (ANN) is an approximation to h(x) built as a linear combination of M generalized linear models of x of the form:

$$y \cong h^{NN}(x; \theta) = \theta_0 + \sum_{m=1}^{M} \theta_m \phi\left( \theta_{0,m} + \sum_{n=1}^{N} \theta_{n,m} x_n \right)$$

where $\phi(\cdot)$ is an arbitrary activation function.

• M is known as the width of the model.

• We can select θ such that $h^{NN}(x; \theta)$ is as close to h(x) as possible given some relevant metric.

• This is known as “training” the network (a lot of details need to be filled in here!).
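The single-hidden-layer formula maps directly into a few lines of array code. The sketch below is illustrative; the argument names (`theta0`, `theta_out`, `bias_hidden`, `weights_hidden`) are hypothetical labels for θ0, θm, θ0,m, and θn,m.

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def softplus(a):
    return np.logaddexp(0.0, a)  # log(1 + e^a), computed stably

def h_nn(x, theta0, theta_out, bias_hidden, weights_hidden, phi=softplus):
    """theta0 + sum_m theta_out[m] * phi(bias_hidden[m] + sum_n weights_hidden[n, m] * x[n])."""
    hidden = phi(bias_hidden + x @ weights_hidden)  # shape (M,)
    return theta0 + hidden @ theta_out

# Usage with N = 2 inputs and width M = 2.
x = np.array([1.0, 2.0])
W = np.array([[1.0, -1.0], [0.5, 1.0]])
y = h_nn(x, 0.5, np.array([1.0, 2.0]), np.zeros(2), W, phi=relu)
```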

[Figure: Examples of activation functions (ReLU and Softplus), plotted from −5 to 5.]

Comparison

• Compare:

$$y \cong h^{NN}(x; \theta) = \theta_0 + \sum_{m=1}^{M} \theta_m \phi\left( \theta_{0,m} + \sum_{n=1}^{N} \theta_{n,m} x_n \right)$$

with a standard Chebyshev polynomial projection:

$$y \cong h^{CP}(x; \theta) = \theta_0 + \sum_{m=1}^{M} \theta_m \varphi_m(x)$$

• Thus, we could think about an ANN as a nonlinear projection where we exchange the rich parameterization of coefficients for the parsimony of basis functions: composition vs. addition.

• How we determine the coefficients will also be different (Monte Carlo vs. quadrature), but this is somewhat less important.

• Why is this often a good idea? Theoretical and practical reasons.

Two classic (yet remarkable) results

Universal approximation theorem: Hornik, Stinchcombe, and White (1989)

A neural network with at least one hidden layer can approximate any Borel measurable function mapping

finite-dimensional spaces to any desired degree of accuracy.

• Assume, as well, that we are dealing with the class of functions for which the Fourier transform of

their gradient is integrable.

Breaking the curse of dimensionality: Barron (1993)

A one-layer neural network achieves integrated square errors of order $O(1/M)$, where M is the number of nodes. In comparison, for series approximations, the integrated square error is of order $O(1/M^{2/N})$, where N is the dimension of the function to be approximated.

• We actually rely on more general theorems by Leshno et al. (1993) and Bach (2017).


Practical reasons

• Algorithms that are:

1. Easy to code.

2. State of the art libraries.

3. Stable.

4. Scalable through massive parallelization.

5. Dedicated hardware.


Deep learning

• A deep learning network is an acyclic multilayer composition of J > 1 neural networks:

$$y \cong h^{DL}(x; \theta) = h^{NN(1)}\left( h^{NN(2)}\left( \ldots; \theta^{(2)} \right); \theta^{(1)} \right)$$

where the $M^{(1)}, M^{(2)}, \ldots$ and $\phi_1(\cdot), \phi_2(\cdot), \ldots$ are possibly different across each layer of the network.

• Sometimes known as deep feedforward neural networks or multilayer perceptrons.

• “Feedforward” comes from the fact that the composition of neural networks can be represented as a

directed acyclic graph, which lacks feedback. We can have more general recurrent structures.

• J is known as the depth of the network. The case J = 1 is a standard neural network.

• As before, we can select θ such that hDL (x; θ) approximates a target h(x) as closely as possible under

a relevant metric.
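The composition of layers can be sketched as a loop over (weights, bias, activation) triples. This is a minimal illustration with hypothetical names; the list is ordered from the input side outward, so its last entry plays the role of the outermost network in the formula.

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def identity(a):
    return a

def h_dl(x, layers):
    """Feedforward pass: apply each (weights, bias, phi) layer in turn."""
    out = np.asarray(x, dtype=float)
    for weights, bias, phi in layers:
        out = phi(weights @ out + bias)
    return out

# Usage: depth 2, with a ReLU hidden layer of width 2 and a linear output layer.
layers = [
    (np.array([[1.0, -1.0], [2.0, 0.0]]), np.array([0.0, -1.0]), relu),
    (np.array([[1.0, 1.0]]), np.array([0.5]), identity),
]
y = h_dl([1.0, 3.0], layers)
```

Because each layer may have its own width and activation, the list can hold matrices of different shapes as long as consecutive dimensions match.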


Dynamic programming

Solving high-dimensional dynamic programming problems using Deep Learning

• Solving High-Dimensional Dynamic Programming Problems using Deep Learning, with Galo Nuno,

George Sorg-Langhans, and Maximilian Vogler.

• Our goal is to solve the recursive continuous-time Hamilton-Jacobi-Bellman (HJB) equation globally:

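$$\rho V(x) = \max_{\alpha} \; r(x, \alpha) + \nabla_x V(x) f(x, \alpha) + \frac{1}{2} \mathrm{tr}\left( \sigma(x)^T \Delta_x V(x) \sigma(x) \right)$$

$$\text{s.t. } G(x, \alpha) \le 0 \text{ and } H(x, \alpha) = 0$$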

• Think about the cases where we have many state variables.

• Why continuous time?

• Alternatives for this solution?


Neural networks

• We define four neural networks:

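1. $V(x; \Theta^V): \mathbb{R}^N \to \mathbb{R}$ to approximate the value function V(x).

2. $\alpha(x; \Theta^\alpha): \mathbb{R}^N \to \mathbb{R}^M$ to approximate the policy function α.

3. $\mu(x; \Theta^\mu): \mathbb{R}^N \to \mathbb{R}^{L_1}$ and $\lambda(x; \Theta^\lambda): \mathbb{R}^N \to \mathbb{R}^{L_2}$ to approximate the Karush-Kuhn-Tucker (KKT) multipliers µ and λ.

• To simplify notation, we accumulate all weights in the matrix $\Theta = (\Theta^V, \Theta^\alpha, \Theta^\mu, \Theta^\lambda)$.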

• We could think about the approach as just one large neural network with multiple outputs.


Error criterion I

• The HJB error:

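$$\mathrm{err}_{HJB}(x; \Theta) \equiv r(x, \alpha(x; \Theta^\alpha)) + \nabla_x V(x; \Theta^V) f(x, \alpha(x; \Theta^\alpha)) + \frac{1}{2} \mathrm{tr}\left[ \sigma(x)^T \Delta_x V(x; \Theta^V) \sigma(x) \right] - \rho V(x; \Theta^V)$$

• The policy function error:

$$\mathrm{err}_{\alpha}(x; \Theta) \equiv \frac{\partial r(x, \alpha(x; \Theta^\alpha))}{\partial \alpha} + D_\alpha f(x, \alpha(x; \Theta^\alpha))^T \nabla_x V(x; \Theta^V) - D_\alpha G(x, \alpha(x; \Theta^\alpha))^T \mu(x; \Theta^\mu) - D_\alpha H(x, \alpha(x; \Theta^\alpha))^T \lambda(x; \Theta^\lambda),$$

where $D_\alpha G \in \mathbb{R}^{L_1 \times M}$, $D_\alpha H \in \mathbb{R}^{L_2 \times M}$, and $D_\alpha f \in \mathbb{R}^{N \times M}$ are the submatrices of the Jacobian matrices of G, H, and f, respectively, containing the derivatives with respect to α.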

Error criterion II

• The constraint error is itself composed of the primal feasibility errors:

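$$\mathrm{err}_{PF_1}(x; \Theta) \equiv \max\{0, G(x, \alpha(x; \Theta^\alpha))\}$$

$$\mathrm{err}_{PF_2}(x; \Theta) \equiv H(x, \alpha(x; \Theta^\alpha)),$$

the dual feasibility error:

$$\mathrm{err}_{DF}(x; \Theta) = \max\{0, -\mu(x; \Theta^\mu)\},$$

and the complementary slackness error:

$$\mathrm{err}_{CS}(x; \Theta) = \mu(x; \Theta^\mu)^T G(x, \alpha(x; \Theta^\alpha)).$$

• We combine these four errors by using the squared error as our loss criterion:

$$\mathcal{E}(x; \Theta) \equiv \left\| \mathrm{err}_{HJB}(x; \Theta) \right\|_2^2 + \left\| \mathrm{err}_{\alpha}(x; \Theta) \right\|_2^2 + \left\| \mathrm{err}_{PF_1}(x; \Theta) \right\|_2^2 + \left\| \mathrm{err}_{PF_2}(x; \Theta) \right\|_2^2 + \left\| \mathrm{err}_{DF}(x; \Theta) \right\|_2^2 + \left\| \mathrm{err}_{CS}(x; \Theta) \right\|_2^2$$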
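As an illustration of how the constraint errors enter the loss at a single state x, the sketch below takes the values of G, H, and µ at that point as plain arrays. The function names are hypothetical, and the HJB and policy errors are passed in as given numbers rather than computed from a network.

```python
import numpy as np

def kkt_errors(G, H, mu):
    """Primal feasibility, dual feasibility, and complementary slackness errors."""
    G, H, mu = (np.asarray(v, dtype=float) for v in (G, H, mu))
    return {
        "PF1": np.maximum(0.0, G),      # violated inequality constraints
        "PF2": H,                        # equality constraints
        "DF": np.maximum(0.0, -mu),      # negative multipliers
        "CS": np.atleast_1d(mu @ G),     # mu' G
    }

def total_loss(err_hjb, err_alpha, G, H, mu):
    """Sum of squared Euclidean norms of all error terms at one state."""
    errors = [np.atleast_1d(err_hjb), np.atleast_1d(err_alpha)]
    errors += list(kkt_errors(G, H, mu).values())
    return float(sum(np.sum(np.square(e)) for e in errors))

# Usage with made-up numbers: two inequality constraints, one equality constraint.
loss = total_loss(0.1, [0.3, 0.4], G=[-1.0, 0.5], H=[0.2], mu=[2.0, -1.0])
```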

Training

• We train our neural networks by minimizing the above error criterion through mini-batch gradient

descent over points drawn from the ergodic distribution of the state vector.

• The efficient implementation of this last step is the key to the success of our algorithm.

• We start by initializing our network weights and we perform K learning steps called epochs, where K

can be chosen in a variety of ways.

• For each epoch, we draw I points from the state space by simulating from the ergodic distribution.

• Then, we randomly split this sample into B mini-batches of size S . For each mini-batch, we define

the mini-batch error, by averaging the loss function over the batch.

• Finally, we perform mini-batch gradient descent for all network weights, with ηk being the learning

rate in the k-th epoch.
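The epoch and mini-batch structure described above can be sketched on a toy problem. Everything specific here is an assumption for illustration: the "network" is a two-parameter linear function with a hand-coded gradient, the target is known, standard normal draws stand in for simulation from the ergodic distribution, and the learning-rate schedule is arbitrary.

```python
import numpy as np

def target(x):
    """Known function the toy model should recover (illustrative)."""
    return 1.0 + 2.0 * x

def train(K=200, I=256, S=32, eta0=0.1, seed=0):
    """K epochs; each draws I points and splits them into I // S mini-batches of size S."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    for k in range(K):
        eta_k = eta0 / (1.0 + 0.01 * k)       # learning rate in epoch k
        x = rng.normal(size=I)                # stand-in for ergodic-distribution draws
        for batch in np.split(rng.permutation(x), I // S):
            resid = theta[0] + theta[1] * batch - target(batch)
            # Gradient of the mini-batch mean squared error.
            grad = 2.0 * np.array([resid.mean(), (resid * batch).mean()])
            theta -= eta_k * grad
    return theta

theta_hat = train()
```

With a neural network, the hand-coded gradient would be replaced by automatic differentiation of the mini-batch loss.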


[Figure: (a) Value with closed-form policy.]

[Figure: (c) Consumption with closed-form policy.]

[Figure: (e) HJB error with closed-form policy.]

A coda

• Exploiting Symmetry in High-Dimensional Dynamic Programming, with Mahdi Ebrahimi Kahou,

Jesse Perla, and Arnav Sood.

• We introduce the idea of permutation-invariant dynamic programming.

• Intuition.

• Solution has a symmetry structure we can easily exploit using representation theorems, concentration

of measure, and neural networks.

• More generally: how do we tackle models with heterogeneous agents?

Models with heterogeneous agents

Models with heterogeneous agents

• Economic agents (individuals, firms, ...) are heterogeneous along important dimensions:

1. Age.

2. Locations (spatial or economic).

3. Productivity.

4. Wealth.

5. Information.

6. Beliefs and expectations.

7. ....

• Even within narrowly defined subgroups, we observe large individual heterogeneity in behavior.


When is heterogeneity important? I

• Questions that are inherently about heterogeneity:

1. What mechanisms account for changes in income and wealth inequality?

2. What are the consequences of changes in tax progressivity?

3. What are the consequences of changes in social security and welfare programs?

4. What are the consequences of changes in educational policies?

5. What are the consequences of bankruptcy regulations?

6. What are the consequences of skill-biased technological change?

7. What are the consequences of non-convex investment adjustment costs?

8. What are the consequences of entry-exit in models of industry dynamics?

9. Political-economy of all previous questions.


When is heterogeneity important? II

• Questions about aggregation bias:

1. Does heterogeneity matter for aggregate quantities and prices along the balanced growth path?

2. Does heterogeneity matter for aggregate quantities and prices over the business cycles?

3. And for the welfare cost of business cycles?

4. What is the relation of heterogeneity and asset prices?

5. What are the (aggregate and distributional) effects of temporary tax cuts?

6. What are the (aggregate and distributional) effects of monetary policy?

7. How does price stickiness matter for the business cycle?

8. What is the relation between wealth inequality and financial frictions?

9. Political-economy of all previous questions.


The challenge

• To compute and take to the data models with heterogeneous agents, we need to deal with:

1. The distribution of agents Gt .

2. The operator H(·) that characterizes how Gt evolves:

$$G_{t+1} = H(G_t, S_t)$$

or

$$\frac{\partial G_t}{\partial t} = H(G_t, S_t)$$

given the other aggregate states of the economy St .

• Why?

• The problems above define a Chapman-Kolmogorov equation (discrete time) or a Kolmogorov

forward equation (continuous time).


Discrete types

• With N discrete types, Gt is a vector and H(·,St) is determined by the equilibrium dynamics of the

economy.

• We need to keep track of N − 1 weights.

• When N is small (e.g., N = 2 or N = 3), this is not particularly difficult.

• However, we need to solve for the associated H(·, St), which can be a highly nonlinear function of St .

• When we have a large number of discrete types (e.g., N > 3), we run into the “curse of

dimensionality”: even tracking N − 1 weights can become prohibitively costly.

• Here, and later on, speed and memory requirements are more important than they seem.


Continuous types

• Situation is more challenging when we have a continuous (or mixed) distribution: infinite-dimensional

object that we cannot store in a computer.

• Furthermore, since H(·) is a nonlinear operator, we can only manipulate the Chapman-Kolmogorov or

Kolmogorov forward equation numerically.

• Resorting to ad hoc behavioral rules is often unsatisfactory, if only because we want to know the RE

solution as a benchmark.


A common approach

• If we are dealing with N discrete types, we keep track of N − 1 weights.

• If we are dealing with continuous types, we extract a finite number of features from distribution Gt :

1. Moments.

2. Q-quantiles.

3. Weights in a mixture of normals...

• We stack either the weights or features of the distribution in a vector µt .

• We assume µt follows the operator h(µt ,St).

• We parametrize h(µt ,St) as h(µt ,St ; θ).

• We determine the unknown coefficients θ such that an economy where µt follows h(µt ,St ; θ) replicates as well as possible the behavior of an economy where Gt follows H(·).

Example: Basic Krusell-Smith model

• Neoclassical growth model with a finite number of aggregate productivity shocks $s_t$ and heterogeneous households due to idiosyncratic labor shocks and incomplete markets.

• Two aggregate variables: aggregate productivity shock and household distribution $G_t(a, z)$:

$$\int G_t(a, z)\, da = K_t$$

$$\int G_t(a, z)\, dh = Z$$

where the last line comes from a law of large numbers regarding idiosyncratic labor shocks.

• We summarize $G_t(a, \cdot)$ with the log of its mean: $\mu_t = \log K_t$ (extending to higher moments is simple, but tedious).

• We parametrize:

$$\underbrace{\log K_{t+1}}_{\mu_{t+1}} = \underbrace{\theta_0(s_t) + \theta_1(s_t) \log K_t}_{h(\mu_t, s_t; \theta)}$$

• We determine $\{\theta_0(s_t), \theta_1(s_t)\}$ by simulation.
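Given a simulated series for log K and the aggregate state, the state-dependent coefficients can be recovered by least squares. The sketch below is illustrative only: in the actual method the series comes from simulating the household cross-section, whereas here a synthetic noise-free series with made-up coefficients stands in for it.

```python
import numpy as np

def fit_law_of_motion(log_K, s, n_states):
    """OLS of log K_{t+1} on (1, log K_t), run separately for each aggregate state s_t."""
    theta = np.zeros((n_states, 2))
    for state in range(n_states):
        mask = s[:-1] == state
        X = np.column_stack([np.ones(mask.sum()), log_K[:-1][mask]])
        theta[state], *_ = np.linalg.lstsq(X, log_K[1:][mask], rcond=None)
    return theta

# Synthetic stand-in for a simulated economy (coefficients are assumptions).
rng = np.random.default_rng(0)
true_theta = np.array([[0.10, 0.95], [0.12, 0.95]])
T = 500
s = rng.integers(0, 2, size=T)
log_K = np.empty(T)
log_K[0] = 2.0
for t in range(T - 1):
    log_K[t + 1] = true_theta[s[t], 0] + true_theta[s[t], 1] * log_K[t]

theta_hat = fit_law_of_motion(log_K, s, n_states=2)
```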

Problems

• Not much guidance regarding feature and parameterization selection in general cases.

• Yes, keeping track of the log of the mean and a linear functional form work well for the basic model.

But, what about an arbitrary model?

• Method suffers from “curse of dimensionality”: difficult to implement with many state variables or

higher moments.

• Lack of theoretical foundations (Does it converge? Under which metric?).


How can deep learning help?

• Deep learning addresses challenges:

1. How to extract features from an infinite-dimensional object efficiently.

2. How to parametrize the non-linear operator mapping how distributions evolve.

3. How to tackle the “curse of dimensionality.”

• Given time limitations, today I will discuss the last two points.


The general idea

• Recall that we want to approximate an unknown function:

y = h(x)

where x = {x1, x2, ..., xN} is a vector.

• In our notation:

1. y = µt+1. To save on notation, we will assume µt is a scalar.

2. x = (µt , St).


The algorithm in “Financial Frictions and the Wealth Distribution”

• Model with distribution of households over asset holding and labor productivity.

• We keep track, as in Krusell-Smith, of moments of distribution.

• Algorithm:

1) Start with h0, an initial guess for h.

2) Using the current guess hn, solve the agent’s problem.

3) Construct time series x = {x1, x2, ..., xJ} and y = {y1, y2, ..., yJ} for aggregate variables by simulating the cross-sectional distribution of agents for J periods (starting at a steady state and with a burn-in).

4) Use (y, x) to train hn+1, a new guess for h.

5) Iterate steps 2)-4) until hn+1 is sufficiently close to hn.
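The five steps form a fixed-point iteration on the perceived law of motion. The skeleton below is a sketch of that loop only; `solve_agent`, `simulate`, `train`, and `distance` are hypothetical callables supplied by the user, and the toy versions at the bottom are stand-ins with a known fixed point, not an economic model.

```python
import numpy as np

def iterate_law_of_motion(h0, solve_agent, simulate, train, distance,
                          tol=1e-8, max_iter=100):
    """Repeat: solve agent's problem, simulate, retrain h, until h stops changing."""
    h = h0
    for n in range(max_iter):
        policy = solve_agent(h)       # step 2
        x, y = simulate(policy)       # step 3
        h_new = train(x, y)           # step 4
        if distance(h_new, h) < tol:  # step 5
            return h_new, n + 1
        h = h_new
    raise RuntimeError("no convergence within max_iter iterations")

# Toy stand-ins: h is a scalar slope whose fixed point solves h = 0.5 * h + 0.3.
def toy_solve(h):
    return 0.5 * h + 0.3

def toy_simulate(policy):
    x = np.linspace(1.0, 2.0, 50)
    return x, policy * x

def toy_train(x, y):
    return float(np.sum(x * y) / np.sum(x * x))  # least-squares slope

h_star, n_iter = iterate_law_of_motion(
    0.0, toy_solve, toy_simulate, toy_train, lambda a, b: abs(a - b))
```

In practice a damped update, mixing hn and the newly trained guess, is a common way to stabilize this kind of iteration.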


Structural estimation with unstructured data

New data

• Unstructured data: Newspaper articles, business reports, congressional speeches, FOMC meetings

transcripts, satellite data, ...


Motivation

• Unstructured data carries information on:

1. Current state of the economy (Thorsrud, 2017, Bybee et al., 2019).

2. Beliefs about current and future states of the economy.

• An example:

From the minutes of the FOMC meeting of September 17-18, 2019

Participants agreed that consumer spending was increasing at a strong pace. They also expected that, in

the period ahead, household spending would likely remain on a firm footing, supported by strong labor

market conditions, rising incomes, and accommodative financial conditions. [...]

Participants judged that trade uncertainty and global developments would continue to affect firms’

investment spending, and that this uncertainty was discouraging them from investing in their businesses.

[...]


Our goal

• Since:

1. This information might go over and above observable macro series (e.g., agents’ expectations and

sentiment).

2. It might also go further back in history, be available for developing countries, or be available in real time.

• How do we incorporate unstructured data in the estimation of structural models?

• Potential rewards:

1. Determine more accurately the latent structural states.

2. Reconcile agents’ behavior and macro time series.

3. Could change parameter values (medium-scale DSGE models are typically poorly identified).

Application and data

• Our application:

Text Data: Federal Open Market Committee (FOMC) meeting transcripts.

Model: New Keynesian dynamic stochastic general equilibrium (NK-DSGE) model.

• Our strategy:

Right now:

1. Latent Dirichlet Allocation (LDA) for dimensionality reduction ⇒ from words to topic shares.

2. Cast the linearized DSGE solution in a state-space form.

3. Use LDA output as additional observables in the measurement equation.

4. Estimation with Bayesian techniques.

Going forward: Model the data generating process for text and macroeconomic data jointly.

Preliminary findings

1. Using FOMC data for estimation sharpens the likelihood.

2. Posterior distributions more concentrated.

3. Especially true for parameters related to the hidden states of the economy and to fiscal policy.

4. FOMC data carries extra information about fiscal policy and government intentions.

Latent Dirichlet Allocation (LDA)

• How does it work?

• LDA is a Bayesian statistical model.

• Idea: (i) each document is described by a distribution of K (latent) topics and (ii) each topic is

described by a distribution of words.

• Use word co-occurrence patterns + priors to assign probabilities.

• Key output of LDA dimensionality reduction: topic shares ϕt ⇒ amount of time a document spends talking about each topic k.

• Why do we like it?

• Tracks well the attention people devote to different topics.

• Automated and easily scalable.

• Bayesian model natural to combine with structural models.


DSGE state space representation

The log-linearized DSGE model solution has the form of a generic state-space model:

• Transition equation:

$$\underbrace{s_{t+1}}_{\text{Structural States}} = \Phi_1(\theta) s_t + \Phi_\varepsilon(\theta) \varepsilon_t, \quad \varepsilon_t \sim N(0, I)$$

• Measurement equation:

$$\underbrace{Y_t}_{\text{Macroeconomic Observables}} = \Psi_0(\theta) + \Psi_1(\theta) s_t$$

• θ is a vector that stacks all the structural parameters.

Topic dynamic factor model

• Allow the topic time series ϕt to depend on the model states:

$$\underbrace{\phi_t}_{\text{Topic Shares}} = T_0 + T_1 \underbrace{s_t}_{\text{Structural States}} + \Sigma \underbrace{u_t}_{\text{Measurement Error}}, \quad u_t \sim N(0, I)$$

• Interpretable as a dynamic factor model in which the structure of the DSGE model is imposed on the latent factors.

• Akin to Boivin and Giannoni (2006) and Kryshko (2011).

Augmented measurement equation

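• Augmented measurement equation:

$$\underbrace{\begin{pmatrix} Y_t \\ \phi_t \end{pmatrix}}_{\text{New vector of observables}} = \begin{pmatrix} \Psi_0(\theta) \\ T_0 \end{pmatrix} + \begin{pmatrix} \Psi_1(\theta) \\ T_1 \end{pmatrix} s_t + \begin{pmatrix} 0_{4 \times 4} & 0_{4 \times k} \\ 0_{k \times 4} & \Sigma \end{pmatrix} u_t, \quad u_t \sim N(0, I)$$

• If text data carries relevant information, should make the estimation more efficient.

• General approach: any numerical machine learning output and structural model (DSGE, IO, labor...) will work.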
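Stacking the macro and topic blocks amounts to concatenating the system matrices. The sketch below builds the augmented constant, loading, and noise-loading matrices from given inputs; the function name and the random placeholder matrices are assumptions for illustration, not estimated objects.

```python
import numpy as np

def augment_measurement(Psi0, Psi1, T0, T1, Sigma):
    """Stack (Y_t, phi_t): constant, state loadings, and measurement-error loading."""
    n_y, k = Psi0.shape[0], T0.shape[0]
    const = np.concatenate([Psi0, T0])
    loading = np.vstack([Psi1, T1])
    # Macro observables carry no measurement error; topic shares load on u_t via Sigma.
    noise = np.block([
        [np.zeros((n_y, n_y)), np.zeros((n_y, k))],
        [np.zeros((k, n_y)), Sigma],
    ])
    return const, loading, noise

# Usage with placeholder matrices: 4 macro observables, 4 states, k = 3 topics.
rng = np.random.default_rng(0)
const, loading, noise = augment_measurement(
    Psi0=rng.normal(size=4), Psi1=rng.normal(size=(4, 4)),
    T0=np.zeros(3), T1=rng.normal(size=(3, 4)), Sigma=0.2 * np.eye(3))
```

The three outputs can then be passed to any linear Gaussian filtering routine in place of the original measurement matrices.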

FOMC transcripts data

• Available for download from the Federal Reserve website.

• Provide a nearly complete account of every FOMC meeting from the mid-1970s onward.

• Transcripts are divided into two parts:

• FOMC1: members talk about their reading of the current economic situation.

• FOMC2: talk about monetary policy strategy.

• We are interested in the information on the current state of the economy and the beliefs that policymakers have about it ⇒ focus on FOMC1 section from 1987 to 2009.

• Total of 180 meetings.

Topic composition

• Example of two topics with K = 20.

[Figure: composition of (a) Topic 8 and (b) Topic 14.]

Topic shares

[Figure: normalized topic shares over time, 1990 to 2005, for Topic 14 "Inflation" and Topic 18 "Fiscal Policy".]

New Keynesian DSGE model

Conventional New Keynesian DSGE model with price rigidities:

Agents:

• Representative household.

• Perfectly competitive final good producer.

• Continuum of intermediate good producers with market power.

• Fiscal authority that taxes and spends.

• Monetary authority that sets the nominal interest rate.

States:

• 4 exogenous states: demand, productivity, government expenditure, monetary policy.

• Modelled as AR(1)s.

Parameters:

• 12 structural parameters to estimate.

Bayesian estimation

• We estimate two models for comparison:

• New Keynesian DSGE model alone (standard).

• New Keynesian DSGE Model + measurement equation with topic shares (new).

• Total parameters to estimate: 12 structural parameters (θ) + 120 topic dynamic factor model

parameters (T0, T1, Σ assuming covariances are 0).

• Pick standard priors on structural parameters.

• Priors on topic related parameters?

• Use MLE to get an idea of where they are.

• Use conservative approach: center all parameters quite tightly around 0.

• Random-walk Metropolis Hastings to obtain draws from the posterior.


Likelihood comparison

[Figure: likelihood comparison across the structural parameters. No Topics (left), With Topics (right).]

Posterior distributions for structural parameters

[Figure: posterior distributions of the structural parameters, No Topics vs. With Topics.]

Posterior distributions for selected topic parameters

[Figure: prior, prior mean, posterior, and posterior mean of the topic parameters linking the topics International Trade, Consumer Credit, Inflation, and Fiscal Policy to the demand shock, TFP, government expenditure, and monetary shock.]

Going forward: Joint model

• Extend the model in two ways:

1. Model the dependence of latent topics on the hidden structural states directly.

2. Allow for autocorrelation among topic shares.

• Instead of first creating ϕt and then using it for estimation, want to model the generating process of

both text and macroeconomic observables together.

• Why a joint model?

• Conjecture the topic composition and topic share will be more precise and more interpretable as a result.

• Properly take into account the uncertainty around the topic shares.


Going forward: Joint model

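• New vector of augmented states:

$$\tilde{s}_t = \begin{pmatrix} s_t \\ \phi_{t-1} \end{pmatrix}$$

• New vector of observables:

$$\tilde{Y}_t = \begin{pmatrix} Y_t \\ w_t \end{pmatrix}$$

• Stacks both the “traditional” observables $Y_t$ and the text $w_t$.

Joint state space representation

• Transition equation (linear):

$$s_{t+1} = \Phi_1(\theta) s_t + \Phi_\varepsilon(\theta) \varepsilon_t \quad \leftarrow \text{same as before}$$

$$\phi_t = T_s s_t + T_\phi \phi_{t-1} + \Sigma_e e_t \quad \leftarrow \text{new}$$

• Measurement equation (non-linear):

$$\tilde{Y}_t \sim p(\tilde{Y}_t \mid \tilde{s}_t) = \begin{pmatrix} \Psi_1(\theta) s_t \\ p(w_t \mid s_t) \end{pmatrix}$$

• Challenges: algorithm for estimation, impact of choice of priors.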

LDA details

LDA assumes the following generative process for each document W of length N:

1. Choose topic proportions $\phi \sim \mathrm{Dir}(\vartheta)$. The dimensionality K of the Dirichlet distribution, and thus the dimensionality of the topic variable, is assumed to be known and fixed.

2. For each word n = 1, ..., N:

2.1 Choose one of K topics $z_n \sim \mathrm{Multinomial}(\phi)$.

2.2 Choose a term $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$. β is a K × V matrix where $\beta_{i,j} = p(w^j = 1 \mid z^i = 1)$. We assign a symmetric Dirichlet prior with V dimensions and hyperparameter η to each $\beta_k$.

Posterior:

$$p(\phi, z, \beta \mid W, \vartheta, \eta) = \frac{p(\phi, z, W, \beta \mid \vartheta, \eta)}{p(W \mid \vartheta, \eta)}$$
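The generative process can be simulated directly, which is a useful check on one's reading of the notation. The sketch below draws one document; the sizes K, V, N and the hyperparameter values are assumptions, and words are stored as integer indices into a vocabulary of size V.

```python
import numpy as np

def generate_document(n_words, vartheta, beta, rng):
    """Draw topic proportions, then a topic and a term for each word."""
    phi = rng.dirichlet(vartheta)                          # step 1
    z = rng.choice(len(vartheta), size=n_words, p=phi)     # step 2.1
    w = np.array([rng.choice(beta.shape[1], p=beta[k]) for k in z])  # step 2.2
    return phi, z, w

# Usage: K = 3 topics, V = 8 vocabulary terms, N = 50 words (illustrative sizes).
rng = np.random.default_rng(0)
K, V, N = 3, 8, 50
beta = rng.dirichlet(np.full(V, 0.5), size=K)  # symmetric Dirichlet prior, eta = 0.5
phi, z, w = generate_document(N, np.full(K, 1.0), beta, rng)
```

Estimation runs this logic in reverse: given the observed words, it infers the posterior over ϕ, z, and β.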

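Model equilibrium equations (log-linearized)

$$c_t - d_t = \mathbb{E}_t\{ c_{t+1} - d_{t+1} - R_t + \Pi_{t+1} \}$$

$$\Pi_t = \kappa \left( (1 + \varphi c)\, c_t + \varphi g\, g_t - (1 + \varphi) A_t \right) + \beta \mathbb{E}_t \Pi_{t+1}$$

$$R_t = \gamma \Pi_t + m_t$$

$$l_t = y_t - A_t = c\, c_t + g\, g_t - A_t$$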

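Exogenous States Processes

$$\log d_t = \rho_d \log d_{t-1} + \sigma_d \varepsilon_{d,t}$$

$$\log A_t = \rho_A \log A_{t-1} + \sigma_A \varepsilon_{A,t}$$

$$\log g_t = \rho_g \log g_{t-1} + \sigma_g \varepsilon_{g,t}$$

$$m_t = \sigma_m \varepsilon_{m,t}$$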

Parameters Recap

Param   Description

Steady-state-related parameters
β       Discount factor
g       SS govt expenditure/GDP

Endogenous propagation parameters
φ       Inverse Frisch elasticity
ζP      Fraction of fixed prices
γ       Taylor rule elasticity

Exogenous shocks parameters
ρD      Persistence demand shock
ρA      Persistence TFP
ρG      Persistence govt expenditure
σD      s.d. demand shock innovation
σA      s.d. TFP shock
σG      s.d. govt expenditure shock
σM      s.d. monetary shock

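State space matrices

Law of motion for the states of the economy is:

$$\underbrace{\begin{pmatrix} d_{t+1} \\ A_{t+1} \\ g_{t+1} \\ m_{t+1} \end{pmatrix}}_{s_{t+1}} = \underbrace{\begin{pmatrix} \rho_d & 0 & 0 & 0 \\ 0 & \rho_A & 0 & 0 \\ 0 & 0 & \rho_g & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}}_{\Phi_1(\theta)} \underbrace{\begin{pmatrix} d_t \\ A_t \\ g_t \\ m_t \end{pmatrix}}_{s_t} + \underbrace{\begin{pmatrix} \sigma_d & 0 & 0 & 0 \\ 0 & \sigma_A & 0 & 0 \\ 0 & 0 & \sigma_g & 0 \\ 0 & 0 & 0 & \sigma_m \end{pmatrix}}_{\Phi_\varepsilon(\theta)} \underbrace{\begin{pmatrix} \varepsilon_{d,t+1} \\ \varepsilon_{A,t+1} \\ \varepsilon_{g,t+1} \\ \varepsilon_{m,t+1} \end{pmatrix}}_{\varepsilon_{t+1}}$$

and for observables:

$$\underbrace{\begin{pmatrix} \log c_t \\ \log \Pi_t \\ \log R_t \\ \log l_t \end{pmatrix}}_{Y_t} = \underbrace{\begin{pmatrix} \log(1 - g) \\ 0 \\ -\log \beta \\ 0 \end{pmatrix}}_{\Psi_0(\theta)} + \underbrace{\begin{pmatrix} a_1 & a_2 & a_3 & a_4 \\ b_1 & b_2 & b_3 & b_4 \\ \gamma b_1 & \gamma b_2 & \gamma b_3 & 1 + b_4 \\ c a_1 & c a_2 - 1 & 1 + c a_3 & c a_4 \end{pmatrix}}_{\Psi_1(\theta)} \underbrace{\begin{pmatrix} d_t \\ A_t \\ g_t \\ m_t \end{pmatrix}}_{s_t}$$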

Prior distribution I

Param   Description                     Domain   Density    Param 1   Param 2

Steady-state-related parameters
β       Discount factor                 (0, 1)   Beta       0.95      0.05
g       SS govt expenditure/GDP         (0, 1)   Beta       0.35      0.05

Endogenous propagation parameters
φ       Inverse Frisch elasticity       R+       Gamma      1         0.1
ζP      Fraction of fixed prices        (0, 1)   Beta       0.4       0.1
γ       Taylor rule elasticity          R+       Gamma      1.5       0.25

Prior distribution II

Exogenous shocks parameters
ρD      Persistence demand shock        (0, 1)   Uniform    0         1
ρA      Persistence TFP                 (0, 1)   Uniform    0         1
ρG      Persistence govt expenditure    (0, 1)   Uniform    0         1
σD      s.d. demand shock innovation    R+       InvGamma   0.05      0.2
σA      s.d. TFP shock                  R+       InvGamma   0.05      0.2
σG      s.d. govt expenditure shock     R+       InvGamma   0.05      0.2
σM      s.d. monetary shock             R+       InvGamma   0.05      0.2

Topic parameters
T0,k    Topic baseline                  R        Normal     0         0.1
T1,k,s  Topic elasticity to states      R        Normal     0         0.4
σk      s.d. measurement error          R+       InvGamma   0.2       0.1

Topics ranked by pro-cyclicality

Details on $p(w_t \mid s_t)$

We assume the following generative process for a document:

1. Draw $\phi_t \mid \phi_{t-1}, s_t, T_\phi, T_s, \Sigma_e \sim N(\phi_0 + T_\phi \phi_{t-1} + T_s s_t, \Sigma_e)$.

2. To map this representation of topic shares into the simplex, use the softmax function: $f(\phi_t)_i = \exp \phi_{t,i} / \sum_j \exp \phi_{t,j}$.

3. For each word n = 1, ..., N:

3.1 Draw topic $z_{n,t} \sim \mathrm{Mult}(f(\phi_t))$.

3.2 Draw term $w_{n,t} \sim \mathrm{Mult}(\beta_{t,z})$.

Then, the distribution of text conditional on the states, $p(w_t \mid s_t)$, is given by:

$$p(w_t \mid s_t) = \int p(\phi_t \mid s_t) \left[ \prod_{n=1}^{N} \sum_{z_{n,t}} p(z_{n,t} \mid \phi_t)\, p(w_{n,t} \mid z_{n,t}, \beta_t) \right] d\phi_t$$
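A forward simulation of this process for one period can be sketched as follows. The names, dimensions, and parameter values are assumptions for illustration, and the topic-word matrix is held fixed over time here for simplicity, whereas the slides index it by t.

```python
import numpy as np

def softmax(phi):
    """Map an unconstrained vector into the simplex (shifted for numerical stability)."""
    e = np.exp(phi - np.max(phi))
    return e / e.sum()

def draw_document(phi_prev, s, phi0, T_phi, T_s, Sigma_e, beta, n_words, rng):
    """One draw of (phi_t, topic shares, words) given phi_{t-1} and the states s_t."""
    mean = phi0 + T_phi @ phi_prev + T_s @ s
    phi = rng.multivariate_normal(mean, Sigma_e)             # step 1
    shares = softmax(phi)                                     # step 2
    z = rng.choice(len(shares), size=n_words, p=shares)       # step 3.1
    w = np.array([rng.choice(beta.shape[1], p=beta[k]) for k in z])  # step 3.2
    return phi, shares, w

# Usage: K = 3 topics, 4 structural states, V = 6 terms (illustrative sizes).
rng = np.random.default_rng(0)
K, n_s, V = 3, 4, 6
beta = rng.dirichlet(np.ones(V), size=K)
phi, shares, w = draw_document(
    phi_prev=np.zeros(K), s=rng.normal(size=n_s), phi0=np.zeros(K),
    T_phi=0.5 * np.eye(K), T_s=0.1 * rng.normal(size=(K, n_s)),
    Sigma_e=0.05 * np.eye(K), beta=beta, n_words=40, rng=rng)
```

The integral over ϕt in the likelihood has no closed form under the softmax link, which is one reason the estimation algorithm is listed as a challenge.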

Conclusions

• Deep learning has tremendous potential for macroeconomics.

• Great theoretical and practical reasons for it.

• Many potential applications.
