


Machine Learning for semi linear PDEs

Quentin Chan-Wai-Nam, Joseph Mikael, Xavier Warin

September 17, 2018

Abstract

Some recent machine learning algorithms from the literature for solving semi-linear PDEs are improved by using new parametrizations and special neural networks. These algorithms are compared to a new machine learning algorithm combining features of several recent methodologies. This new algorithm appears to be competitive in terms of accuracy with the best existing machine learning algorithms.

Key words. Semilinear PDEs, Monte-Carlo methods, Machine Learning.

1 Introduction

This paper is devoted to the resolution in high dimension of equations of the form

$$ -\partial_t u - Lu = f(t, x, u, Du); \qquad u_T = g, \quad t < T, \ x \in \mathbb{R}^d, \qquad (1) $$

with a non-linearity $f(t, x, y, z)$ in the solution and its gradient, $g$ a bounded terminal condition and $L$ a diffusion generator satisfying

$$ Lu := \frac{1}{2}\,\mathrm{Tr}\big(\sigma\sigma^{\top}(t, x)\, D^2 u(t, x)\big) + \mu(t, x)\cdot Du(t, x). \qquad (2) $$

$\mu$ is a function defined on $\mathbb{R}\times\mathbb{R}^d$ with values in $\mathbb{R}^d$, $\sigma$ is a function defined on $\mathbb{R}\times\mathbb{R}^d$ with values in $\mathcal{M}_d$, the set of $d \times d$ matrices, and $L$ is the generator associated to the forward process

$$ X_t = x + \int_0^t \mu(s, X_s)\,ds + \int_0^t \sigma(s, X_s)\,dW_s, \qquad (3) $$

with $W_t$ a $d$-dimensional Brownian motion.

Traditional deterministic methods for solving non-linear Partial Differential Equations numerically suffer from the curse of dimensionality, and one cannot hope to solve equations of dimension greater than 4 or 5.

Until recently, the most widely used method for solving non-linear PDEs in dimension above 4 was based on the resolution of the BSDE associated to the PDE, first exhibited in [PP90]: using the time discretization scheme proposed in [BT04], some effective algorithms based on regression have been developed [G+05; L+06], which has led to a lot of research, as shown for example by the references in [GT16]. This regression-based resolution has been generalized to fully non-linear equations in [FTW11] using the Second Order Backward Equation framework proposed in [Che+07]. However, this regression technique uses basis functions that can be either global polynomials as in [LS01] or local polynomials as proposed in [BW12]: therefore this methodology still faces the curse of dimensionality and can only solve problems in dimension below 7 or 8.

A first step towards solving high dimensional PDEs was achieved recently in [Hen+16; Bou+17; BTW17; War17] by the use of a branching method and a time step randomization applied to the Feynman-Kac representation of the PDE. In the case of semi-linear PDEs, a differentiation technique using Malliavin weights as proposed in [Fou+99] permits the estimation of the gradient $Du$ of the solution. However, branching techniques are limited to small maturities, small non-linearities, and mainly to non-linearities polynomial in $u$ and $Du$.

Even more recently, three other methods have been developed to solve this difficult problem:

∗EDF R&D†[email protected]‡EDF R&D§[email protected]¶EDF R&D & FiME, Laboratoire de Finance des Marchés de l’Energie‖[email protected]


• A very simple technique based only on nested Monte Carlo applied to the Feynman-Kac representation of the PDE has been developed and shown to be convergent for solving equation (1) in [War; War18]. As shown by the estimators in [War18], this technique is far more effective if the Lipschitz coefficients associated to the non-linearity and the maturity of the problem are not too high.

• In [E+17; E+16; HK17], the authors developed an algorithm based on Picard iterations, multi-level techniques and automatic differentiation, once again applied to the Feynman-Kac representation of the PDE. The time integration appearing in this representation is achieved by quadrature, and the authors have been able to solve some PDEs in very high dimension. The tuning of the algorithm may however be difficult due to the number of methodologies involved in the resolution. In a very recent article [Hut+18] combining the ideas of [E+17; E+16; HK17] and [Hen+16; War18], the authors show that a modified Picard iteration algorithm for non-linearities in $u$ applied to the heat equation can be solved with a complexity polynomial both in the dimension and in the reciprocal of the required accuracy. However, no numerical results are given to confirm the result.

• At last, deep learning techniques dubbed "Deep BSDE" (DBSDE) have recently been proposed to solve semi-linear PDEs in [HJW17; EHJ17], and the methodology has been extended to fully non-linear equations in [BEJ17a]. This approach is based on an Euler discretization of the forward underlying SDE with solution $X_t$ and of the BSDE associated to the problem. The algorithm tries to learn the values $u$ and $z = \sigma^\top Du$ at each time step of the Euler schemes so that a forward simulation until maturity $T$ matches the target $g(X_T)$. This algorithm has been modified in [Rai18] to solve forward backward stochastic equations: in this version of the algorithm, the network tries to learn $u$, calculating $Du$ by automatic differentiation, and incorporates the constraints associated to the Euler discretization of the BSDE in the loss function. These deep techniques seem to be very effective, but no current result justifies their convergence. It is then difficult to know their limitations.

The different methods described above are all interesting, but we will focus on machine learning algorithms and nested Monte Carlo. The objectives of this article are:

• to give an improved version of the Deep BSDE algorithm using different networks and parametrizations,

• to develop a new Deep Learning algorithm mixing features coming from [War18] and [HK17].

We numerically show that the new algorithm is competitive in terms of accuracy with the best Deep BSDE algorithm found.

2 Existing Deep BSDE solver algorithms

The DBSDE algorithm proposed in [HJW17] starts from the BSDE representation of (1) first proposed in [PP90]:

$$ u(t, X_t) = u(0, X_0) - \int_0^t f(s, X_s, u(s, X_s), Du(s, X_s))\,ds + \int_0^t Du(s, X_s)^\top \sigma(s, X_s)\,dW_s. \qquad (4) $$

For a set of time steps $t_0 = 0 < t_1 < \ldots < t_N = T$, we use an Euler scheme to approximate the solution $(X_{t_i})_{i=1\ldots N}$ of equation (3) by:

$$ X_{t_{i+1}} \approx X_{t_i} + \mu(t_i, X_{t_i})(t_{i+1} - t_i) + \sigma(t_i, X_{t_i})(W_{t_{i+1}} - W_{t_i}). $$

In the same way, an approximation of equation (4) is obtained by the Euler scheme:

$$ u(t_{i+1}, X_{t_{i+1}}) = u(t_i, X_{t_i}) - f(t_i, X_{t_i}, u(t_i, X_{t_i}), Du(t_i, X_{t_i}))(t_{i+1} - t_i) + Du(t_i, X_{t_i})^\top \sigma(t_i, X_{t_i})(W_{t_{i+1}} - W_{t_i}). $$
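To make the data-generation step concrete, here is a minimal NumPy sketch of how such training paths and Brownian increments could be simulated with the Euler scheme above; for brevity it assumes constant coefficients $\mu \in \mathbb{R}^d$ and $\sigma \in \mathcal{M}_d$ (state-dependent coefficients would simply be evaluated inside the loop), and the function name is ours, not the authors'.

```python
import numpy as np

def simulate_paths(x0, mu, sigma, T, N, n_paths, seed=0):
    """Euler scheme for the forward SDE (3), assuming constant coefficients:
    mu is a vector in R^d and sigma a d x d matrix (a simplification for this
    sketch).  Returns the paths X, shape (n_paths, N+1, d), and the Brownian
    increments dW, shape (n_paths, N, d)."""
    rng = np.random.default_rng(seed)
    d = x0.shape[0]
    dt = T / N
    dW = rng.normal(scale=np.sqrt(dt), size=(n_paths, N, d))
    X = np.empty((n_paths, N + 1, d))
    X[:, 0, :] = x0
    for i in range(N):
        # X_{t_{i+1}} = X_{t_i} + mu dt + sigma (W_{t_{i+1}} - W_{t_i})
        X[:, i + 1, :] = X[:, i, :] + mu * dt + dW[:, i, :] @ sigma.T
    return X, dW
```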

In machine learning language, these independent realizations of $(X_{t_i})_{i=1\ldots N}$ represent the training data. In DBSDE algorithms, neural networks are supposed to output an approximation of

$$ \kappa_{t_i} := \sigma(t_i, X_{t_i})^\top Du(t_i, X_{t_i}) $$

from the feature $X_{t_i}$. The parameters $\theta$ of the neural networks are estimated by a stochastic gradient descent whose objective is to minimize the loss

$$ \ell(\theta) := \mathbb{E}\big[(u(T, X_T) - g(X_T))^2\big], $$

as $g(X_T)$ corresponds to the target of $u(T, X_T)$ due to the terminal condition $u(T, X_T) = g(X_T)$. This architecture, described in [HJW17; EHJ17] and shown in Figure 1, needs to build $N - 1$ feed forward neural networks to estimate $(\kappa_{t_i})_{i=1,\ldots,N-1}$. This gives rise to a number of weights to be estimated of order $N \times \text{nb layers} \times \text{layer size}$. By construction, there is no link between the gradients of two successive and possibly close time steps. We will see in Section 3 how we can add a global structure to the architecture ensuring consistency between the gradients of two close time steps.
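To fix ideas, the following TensorFlow 2 sketch reproduces the structure we understand from [HJW17; EHJ17]: $u(0, X_0)$ and $\kappa_{t_0}$ are plain trainable variables, one small network per interior time step outputs $\kappa_{t_i}$, the Euler recursion of the BSDE is hard-coded in the forward pass, and only the terminal mismatch enters the loss. Layer sizes, the treatment of the driver directly as a function of $\kappa$, and all names are simplifying assumptions of ours, not the authors' code (which used the 2018-era TensorFlow API).

```python
import tensorflow as tf

def step_network(d, hidden=20, n_hidden=2):
    """One fully-connected network per time step, mapping X_{t_i} to kappa_{t_i}."""
    layers = [tf.keras.layers.Dense(hidden, activation="relu") for _ in range(n_hidden)]
    layers.append(tf.keras.layers.Dense(d, activation=None))
    return tf.keras.Sequential(layers)

class DeepBSDE(tf.keras.Model):
    """u(0, X_0) and kappa_{t_0} are trainable variables; one network per interior
    time step outputs kappa_{t_i}; the BSDE Euler recursion is hard-coded."""
    def __init__(self, d, N, f):
        super().__init__()
        self.f = f                                     # driver, here a function of kappa
        self.y0 = tf.Variable(tf.zeros([1, 1]))        # u(0, X_0)
        self.kappa0 = tf.Variable(tf.zeros([1, d]))    # kappa_{t_0}
        self.nets = [step_network(d) for _ in range(N - 1)]

    def call(self, X, dW, t_grid):
        # X: (batch, N+1, d), dW: (batch, N, d) float32 tensors; t_grid: N+1 floats
        batch = tf.shape(X)[0]
        y = tf.tile(self.y0, [batch, 1])
        kappa = tf.tile(self.kappa0, [batch, 1])
        for i in range(len(t_grid) - 1):
            dt = t_grid[i + 1] - t_grid[i]
            # Euler step of the BSDE (hard constraint)
            y = (y - self.f(t_grid[i], X[:, i], y, kappa) * dt
                 + tf.reduce_sum(kappa * dW[:, i], axis=1, keepdims=True))
            if i < len(self.nets):
                kappa = self.nets[i](X[:, i + 1])
        return y                                       # approximation of u(T, X_T)

# terminal-matching loss, minimized by stochastic gradient descent:
#   loss = tf.reduce_mean((model(X, dW, t_grid) - g(X[:, -1])) ** 2)
```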


Figure 1: [HJW17] original graph: the chain goes from $X_{t_0}$ and $u(t_0, X_{t_0})$ through the per-step gradients $\kappa_{t_0}, \ldots, \kappa_{t_{N-1}}$ (each produced by its own network, Network 1 to Network $N-1$) and the successive values $u(t_1, X_{t_1}), \ldots, u(t_N, X_{t_N})$, up to the terminal loss. The parameters of the graph are represented in red.

To help the neural network, [FTT17] proposes to incorporate a guess of the gradient to be learned. The guess is derived from a first-order asymptotic expansion.

In [Rai18], the author proposes to approximate the function $u$ directly with a neural network, enforcing the Euler discretization scheme softly in the loss function:

$$ \ell(\theta) := \mathbb{E}\left[\sum_{i=0}^{N-1} \Phi(t_i, t_{i+1}, X_{t_i}, Y_{t_i}, Y_{t_{i+1}}, W_{t_{i+1}} - W_{t_i}) + (g(X_T) - Y_T)^2\right] \qquad (5) $$

$$ \Phi(t_i, t_{i+1}, X_{t_i}, Y_{t_i}, Y_{t_{i+1}}, W_{t_{i+1}} - W_{t_i}) = \Big(Y_{t_{i+1}} - Y_{t_i} + f\big(t_i, X_{t_i}, Y_{t_i}, \sigma^\top(t_i, X_{t_i}) Z_{t_i}\big)(t_{i+1} - t_i) - Z_{t_i}^\top \sigma(t_i, X_{t_i})(W_{t_{i+1}} - W_{t_i})\Big)^2 $$

where $Y_{t_i}$ should approximate $u(t_i, X_{t_i})$, with $Y_{t_i} := \mathcal{N}(t_i, X_{t_i})$ where $\mathcal{N}$ is a neural network, and $Z_{t_i} := D Y_{t_i}$ where $D$ is the automatic differentiation operator relative to $X_{t_i}$ in TensorFlow, such that $Z_{t_i}$ should approximate $Du(t_i, X_{t_i})$.
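As an illustration of this formulation, a hedged TensorFlow 2 sketch of the loss (5) could read as follows: $Y$ comes from a single network in $(t, x)$ and $Z$ from automatic differentiation. The constant matrix `sigma` and the callables `net`, `f`, `g` are placeholders of ours.

```python
import tensorflow as tf

def soft_constraint_loss(net, f, g, sigma, t_grid, X, dW):
    """Loss (5): net maps (t, x) to Y; Z is obtained by automatic differentiation
    of Y with respect to x.  X, dW are float32 tensors of shape (batch, N+1, d)
    and (batch, N, d); sigma is a constant d x d matrix (a simplification)."""
    N = len(t_grid) - 1
    Y, Z = [], []
    for i in range(N + 1):
        x = X[:, i]
        t = tf.ones_like(x[:, :1]) * t_grid[i]
        with tf.GradientTape() as tape:
            tape.watch(x)
            y = net(tf.concat([t, x], axis=1))          # Y_{t_i}, shape (batch, 1)
        Y.append(y)
        Z.append(tape.gradient(y, x))                   # Z_{t_i} ~ Du(t_i, X_{t_i})
    loss = tf.reduce_mean((g(X[:, -1]) - Y[-1]) ** 2)   # terminal term of (5)
    for i in range(N):
        dt = t_grid[i + 1] - t_grid[i]
        sig_z = tf.matmul(Z[i], sigma)                  # sigma^T Z_{t_i} (row convention)
        phi = (Y[i + 1] - Y[i]
               + f(t_grid[i], X[:, i], Y[i], sig_z) * dt
               - tf.reduce_sum(sig_z * dW[:, i], axis=1, keepdims=True))
        loss += tf.reduce_mean(phi ** 2)                # soft Euler constraint Phi
    return loss
```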

Overall:

1. The algorithms [EHJ17; Rai18] can solve semilinear PDEs along trajectories with a fixed initial condition $X_0$. In [HJW17], the authors point out that the algorithm can be adapted to cope with variable $X_0$'s by adding a supplementary neural network $X_0 \mapsto Y_0$ at the beginning of the algorithm.

2. These algorithms are capable of returning values of $Y_{t_i}$ and $Z_{t_i}$ on the discretization time steps, given trajectories $W_{t_0}, \cdots, W_{t_{N-1}}$, as we show in the following.

3. The algorithm [EHJ17] is not capable of returning $u(t, x)$, $Du(t, x)$ for any $t, x$ outside of a given trajectory. The algorithm in [Rai18] could provide $u(t, x)$, $Du(t, x)$ for any $(t, x)$, but we have little guarantee that the value would be exact, for instance for a $t$ that does not correspond to any of the $t_i$'s, or a combination $(t, x)$ never seen by the algorithm along the trajectories.

4. Finally, all these algorithms use the Euler discretization scheme to approximate $Y_{t_i}$, either as hard constraints [EHJ17] or soft constraints [Rai18], as shown in (5).

5. A discretized scheme might not be necessary for computing $X_{t_i}$ from the $W_{t_i}$'s, when the SDE (3) can be solved exactly.

The algorithms [EHJ17] and [Rai18] solve the same problem with the same discretized integration scheme, but with different formulations that could lead to different results. We chose to use the formulation in [EHJ17], since the integration constraints are enforced and thus the neural network might learn a less complex function. Moreover, we find the convergence to be overall faster. Finally, this formulation allows us to use recurrent schemes, as described in the following Section.

3 Motivations for some other neural network architectures

Most breakthroughs in machine learning using neural networks have been made with the use of specific network architectures. For instance, the use of convolutional neural networks [LBH15] started the wide success of neural networks in computer vision, and the use of recurrent neural networks with memory enabled neural networks to be efficiently used in various fields such as natural language processing. The use of specific structures makes it possible to limit the growth of the number of parameters in the networks when the depth increases, as well as to capture specific properties such as invariance and symmetry. Moreover, these specific architectures are designed so that the optimization algorithms work better, by reducing the occurrence of numerical instabilities known in machine learning as "vanishing gradients" or "exploding gradients".


Other recent works dramatically increase the efficiency of the learning process: indeed, as used in [HJW17; EHJ17], batch normalization [IS15] enables networks to learn faster and better, but this technique is difficult to adapt to recurrent neural networks (or if so, it only works for stationary signals by not sharing the moving statistics through time, as in [Coo+16] for instance). The ELU (exponential linear unit) activation function [CUH15] improves the learning process and also reduces the need for regularization or batch normalization. Finally, residual learning [He+16] consists in making identity shortcut connections between several hidden layers, and proved to efficiently reduce numerical instabilities in the learning process.

To our knowledge, the DBSDE solver has only been used with standard fully-connected (FC) feed forward neural networks. We believe that using specific structures may yield substantial improvements to this algorithm. We would also like to point out that the functions we wish to approximate with our neural networks seem reasonably simple compared to other tasks in deep learning: thus, we believe that a small number of layers could prove sufficient in most cases, and increasing the number of layers further might rather degrade the convergence of our algorithms.

In the following, our neural networks will learn $\kappa_{t_i} \simeq Du(t_i, X_{t_i})$ instead of $\sigma^\top(t_i, X_{t_i}) Du(t_i, X_{t_i})$ as in [HJW17; EHJ17]. We found that this does not change the behaviour of the algorithm significantly, while enabling some equations to be formulated more conveniently.

3.1 One different network for every timestep

The DBSDE solvers in [HJW17; EHJ17; BEJ17b] use a different fully-connected feed forward neural network at each time step $t_i$ to estimate the gradients. Each of these networks takes $X_{t_i}$ as input. Formally, each $\kappa_{t_i}$ is estimated with a specific network $\mathcal{N}_i : x \in \mathbb{R}^d \mapsto \kappa \in \mathbb{R}^d$ with parameters $\theta_i$:

$$ \kappa_{t_i} := \mathcal{N}_i^{\theta_i}(X_{t_i}). \qquad (6) $$

The corresponding networks in our comparative study are presented in Figure 2.

Figure 2: Using one different network for every timestep (a. FC DBSDE, b. FC ELU, c. FC Residual). Each network is composed of $h$ hidden layers and an output layer with identity activation. Network a. is the one used in [HJW17; EHJ17], with fully-connected (FC) hidden layers with ReLU activation function and batch normalization (BN), but we found that not using BN on the output layer yielded better results. Network b. is the same but replaces the combination of ReLU activation and batch normalization with the ELU activation function alone. Finally, network c. also takes $g(X_t)$ as input and adds residual connections every 2 hidden layers starting from the output of the first hidden layer (if the number of hidden layers $h$ is even, then the last residual connection only skips one hidden layer). Adding $Y_t$ as an input to these networks made the optimization algorithm unstable.

3.2 Sharing parameters through time

We acknowledge the fact that the gradient is not stationary in time; yet we believe that in most cases, the gradient between adjacent time steps could be similar for a given $x$. Thus, similarly to [Rai18] (but on the gradient), we propose to share the parameters of the network across time steps, i.e. we use a single network $\mathcal{N}^\theta : (t, x) \in \mathbb{R}^{d+1} \mapsto \kappa \in \mathbb{R}^d$, as shown in Figure 3.

$$ \kappa_{t_i} := \mathcal{N}^\theta(t_i, X_{t_i}). \qquad (7) $$

This architecture has less expressiveness than the previous one, but should be easier to optimize, since the parameters are linked more closely to the loss function. Note that we cannot use batch normalization with this formalism, as the distribution of $X$ is likely to be non-stationary. We will refer to this architecture as the Merged Deep BSDE.


Figure 3: Graph with common parameters: the same forward chain as in Figure 1, but a single network produces all the $\kappa_{t_i}$. The parameters of the graph are represented in red.

Moreover, we find it helpful to feed the neural network not only with $X_t$, but also with other variables known at instant $t$, such as $Y_t$ and $g(X_t)$: thus, we write

$$ \kappa_{t_i} := \mathcal{N}^\theta(t_i, X_{t_i}, Y_{t_i}, g(X_{t_i})). \qquad (8) $$

Note that if we consider $Y_{t_i}$ to be the output of the neural structure at time $t_i$, then the previous formulation is a recurrent neural network: $Y_{t_i}$ depends directly on the output of a previous call to the neural network and is fed as input to the following call of the neural network.
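A minimal Keras sketch of such a shared network, with the inputs of eq. (8) and ELU activations, could look as follows; the layer sizes and names are indicative assumptions, and the trailing comment shows how the same network would be reused inside the forward Euler chain.

```python
import tensorflow as tf

def merged_network(d, hidden=20, n_hidden=2):
    """Single network shared by all time steps, eq. (8): input (t, X_t, Y_t, g(X_t)),
    output kappa_t.  Sizes and activations are indicative only."""
    inp = tf.keras.Input(shape=(1 + d + 1 + 1,))
    h = inp
    for _ in range(n_hidden):
        h = tf.keras.layers.Dense(hidden, activation="elu")(h)
    out = tf.keras.layers.Dense(d, activation=None)(h)
    return tf.keras.Model(inp, out)

# inside the forward Euler chain, the same network is reused at every step:
#   kappa = net(tf.concat([t_i * ones, X_i, Y_i, g(X_i)], axis=1))
# and Y_{i+1} is obtained from Y_i, kappa and dW_i exactly as in the DBSDE loop.
```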

The corresponding networks in our comparative study are presented in Figure 4.

Figure 4: Sharing the same network parameters for every timestep (d. FC Merged, e. FC Merged Shortcut, f. FC Merged Residual). Each network is composed of $h$ hidden layers and an output layer with identity activation. Network d. takes as input $t$, $X_t$, $g(X_t)$ and $Y_t$, and uses ELU activation functions. Network e. is the same but injects the input layer as a supplementary input to each hidden layer; we call these shortcut connections. Finally, network f. is the same as e. but adds residual connections every few hidden layers, as in network c.

3.3 Adding a temporal structure with LSTM networks

Due to the Euler discretization, the target $g(X_T)$ cannot be reached and the loss function cannot be zeroed. By searching for values of $\kappa_t$ that do not only depend on variables at date $t$ but take long term dependencies into account, we may counteract the effects of discretization and find strategies with a smaller loss than with a simple fully-connected feed forward network.

Recurrent neural networks with memory, LSTM networks for instance [HS97; Ola15], are networks that use an internal state to build long-term dependencies when applied to a sequence. These networks proved very efficient in performing tasks on sequences [Kar15]. Formally, if $m_t \in \mathbb{R}^p$ is the state at time $t$, the equation would write

$$ (\kappa_{t_i}, m_{t_i}) := \mathcal{N}^\theta(t_i, X_{t_i}, m_{t_{i-1}}) \qquad (9) $$

where $m_0$ is a training variable. Using these types of networks might enable the network to build its own input feature through $m_t$, as well as to compensate for long term effects such as the discretization error. Similarly, we can feed the neural network with other variables at instant $t_i$:

$$ (\kappa_{t_i}, m_{t_i}) := \mathcal{N}^\theta(t_i, X_{t_i}, Y_{t_i}, g(X_{t_i}), m_{t_{i-1}}). \qquad (10) $$

The corresponding networks in our comparative study are presented in Figure 5.
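The following Keras sketch illustrates the shape of such a recurrent mapping, eq. (10): stacked LSTM cells carry a state from one time step to the next and a final linear layer produces $\kappa_t$. Sizes and names are our assumptions; in particular, the initial state is simply set to zero here, whereas the paper treats $m_0$ as a trainable variable.

```python
import tensorflow as tf

class LSTMKappa(tf.keras.Model):
    """Stacked LSTM cells carrying a memory m_t from one time step to the next,
    as in eq. (10); a final linear layer outputs kappa_t."""
    def __init__(self, d, hidden=20, n_layers=2):
        super().__init__()
        self.hidden = hidden
        self.cells = [tf.keras.layers.LSTMCell(hidden) for _ in range(n_layers)]
        self.head = tf.keras.layers.Dense(d, activation=None)

    def initial_state(self, batch_size):
        # one (h, c) pair of zeros per stacked cell (the paper trains m_0 instead)
        return [[tf.zeros([batch_size, self.hidden]), tf.zeros([batch_size, self.hidden])]
                for _ in self.cells]

    def call(self, t, x, y, gx, states):
        h = tf.concat([t, x, y, gx], axis=1)     # inputs known at time t_i
        new_states = []
        for cell, state in zip(self.cells, states):
            h, state = cell(h, state)            # propagate through the stacked cells
            new_states.append(state)
        return self.head(h), new_states          # (kappa_{t_i}, m_{t_i})
```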


Figure 5: LSTM-based architectures (g. LSTM, h. Augmented LSTM, i. Hybrid LSTM, j. Residual LSTM). Network g. is composed of $h$ stacked LSTMs, which take $t$ and $X_t$ as input, with a state $m_t = (m_t^{(1)}, \cdots, m_t^{(h)})$, and an output layer with identity activation function. Network h. is the same network but takes $Y_t$ and $g(X_t)$ as supplementary inputs. Network i. is a combination with the Merged network f., but replaces the first hidden layer by an LSTM; thus the network is composed of 1 hidden LSTM layer and $h - 1$ hidden FC (ELU) layers. Finally, network j. is the same as h. but adds residual connections every few hidden layers, as described for network f.

4 Some new machine learning algorithms

The previously developed algorithms are based on an Euler scheme with a time step discretization and try to estimate the function value and its derivative at these discrete dates. We propose some algorithms that do not rely on an Euler scheme for the BSDE but try to estimate the global function $u$ as a functional of $t \in [0, T]$ and $x \in \mathbb{R}^d$. Note that the Euler scheme is still necessary to calculate the forward process (3) when it cannot be simulated exactly.

In this section, $\rho(x) = \lambda e^{-\lambda x}$ is the density of a random variable with an exponential law. Denote

$$ \bar F(t) := \int_t^{\infty} \rho(s)\,ds = e^{-\lambda t} = 1 - F(t), $$

so that $F$ is the cumulative distribution function of a random variable with density $\rho$. Denoting by $\mathbb{E}_{t,x}$ the expectation operator conditional on $X_t = x$ at time $t$, the Feynman-Kac formula gives the following representation of the solution $u$, valid under regularity assumptions on the terminal function and the coefficients of equation (3):

$$ u(t, x) = \mathbb{E}_{t,x}\left[\bar F(T - t)\,\frac{g(X_T)}{\bar F(T - t)} + \int_t^T \rho(s - t)\,\frac{f(s, X_s, u(s, X_s), Du(s, X_s))}{\rho(s - t)}\,ds\right] \qquad (11) $$

Introducing a random variable τ of density ρ we get

$$ u(t, x) = \mathbb{E}_{t,x}\big[\phi\big(t,\, t + \tau,\, X_{t+\tau},\, u(t + \tau, X_{t+\tau}),\, Du(t + \tau, X_{t+\tau})\big)\big], $$

with

$$ \phi(s, t, x, y, z) := \mathbf{1}_{\{t \ge T\}}\,\frac{g(x)}{\bar F(T - s)} + \mathbf{1}_{\{t < T\}}\,\frac{f(t, x, y, z)}{\rho(t - s)}. $$
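For reference, a direct NumPy transcription of $\rho$, $\bar F$ and $\phi$ for this exponential sampling could read as follows; `f` and `g` stand for the driver and the terminal condition of the PDE and are placeholders.

```python
import numpy as np

lam = 1.0                                # intensity of the exponential density rho

def rho(t):                              # density of tau
    return lam * np.exp(-lam * t)

def F_bar(t):                            # survival function of tau
    return np.exp(-lam * t)

def phi(s, t, x, y, z, f, g, T):
    """phi(s, t, x, y, z) of the randomized representation; f and g are the
    driver and terminal condition of the PDE (placeholders)."""
    if t >= T:
        return g(x) / F_bar(T - s)
    return f(t, x, y, z) / rho(t - s)
```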

We then propose two schemes to solve the problem. An informative diagram is presented in Figure 6.

4.1 A first scheme

Following the idea of [Hen+16; War], we explain how to calculate $Du(t, x)$ using Malliavin weights. We isolate two cases:

• If the coefficients $\mu$ and $\sigma$ are constant, then the process $X_{t+\tau}$ solution of equation (3) is given by

$$ X_{t+\tau} = x + \mu\tau + \sigma(W_{t+\tau} - W_t) \qquad (12) $$

and we define the antithetic variable:

$$ \tilde X_{t+\tau} = x + \mu\tau - \sigma(W_{t+\tau} - W_t). \qquad (13) $$

• When the coefficients are not constant, an Euler scheme with a step $\Delta t$ is still necessary. In this case, we denote $J = \lfloor \tau / \Delta t \rfloor$. If $\tau < \Delta t$, then equations (12) and (13) are used, otherwise

$$ \begin{aligned} X_{t+(i+1)\Delta t} &= X_{t+i\Delta t} + \mu(t + i\Delta t, X_{t+i\Delta t})\Delta t + \sigma(t + i\Delta t, X_{t+i\Delta t})(W_{t+(i+1)\Delta t} - W_{t+i\Delta t}), \quad i = 0, \ldots, J - 1 \\ X_{t+\tau} &= X_{t+J\Delta t} + \mu(t + J\Delta t, X_{t+J\Delta t})(\tau - J\Delta t) + \sigma(t + J\Delta t, X_{t+J\Delta t})(W_{t+\tau} - W_{t+J\Delta t}) \end{aligned} \qquad (14) $$


and $\tilde X$ is defined by

$$ \begin{aligned} \tilde X_{t+\Delta t} &= x + \mu(t, x)\Delta t - \sigma(t, x)(W_{t+\Delta t} - W_t), \qquad (15) \\ \tilde X_{t+(i+1)\Delta t} &= \tilde X_{t+i\Delta t} + \mu(t + i\Delta t, \tilde X_{t+i\Delta t})\Delta t + \sigma(t + i\Delta t, \tilde X_{t+i\Delta t})(W_{t+(i+1)\Delta t} - W_{t+i\Delta t}), \quad i = 1, \ldots, J - 1 \\ \tilde X_{t+\tau} &= \tilde X_{t+J\Delta t} + \mu(t + J\Delta t, \tilde X_{t+J\Delta t})(\tau - J\Delta t) + \sigma(t + J\Delta t, \tilde X_{t+J\Delta t})(W_{t+\tau} - W_{t+J\Delta t}). \qquad (16) \end{aligned} $$

As defined in [War18], an estimator of the gradient of u is given by:

$$ Du(t, x) = \mathbb{E}_{t,x}\bigg[\sigma^{-\top}\,\frac{W_{(t+(\tau\wedge\Delta t))\wedge T} - W_t}{\tau \wedge (T - t) \wedge \Delta t}\,\frac{1}{2}\Big(\phi\big(t, t + \tau, X_{t+\tau}, u(t + \tau, X_{t+\tau}), Du(t + \tau, X_{t+\tau})\big) - \phi\big(t, t + \tau, \tilde X_{t+\tau}, u(t + \tau, \tilde X_{t+\tau}), Du(t + \tau, \tilde X_{t+\tau})\big)\Big)\bigg]. \qquad (17) $$

Remark 4.1 When the process can be simulated exactly, the Euler scheme can be avoided: the Malliavin weights have to be modified accordingly, as shown for example in the numerical example in [War18] for an Ornstein-Uhlenbeck process. In this case, the previous equation (17) is adapted by taking $\Delta t = \tau$.

For two metric spaces $E$, $F$, we introduce $\mathrm{Lip}(E, F)$, the set of Lipschitz continuous functions defined on $E$ with values in $F$. Following the ideas of [HK17], we introduce the operator $T : \mathrm{Lip}([0, T]\times\mathbb{R}^d, \mathbb{R}^{d+1}) \longrightarrow \mathrm{Lip}([0, T]\times\mathbb{R}^d, \mathbb{R}^{d+1})$, such that to $(u, v) \in \mathrm{Lip}([0, T]\times\mathbb{R}^d, \mathbb{R}) \times \mathrm{Lip}([0, T]\times\mathbb{R}^d, \mathbb{R}^d)$ is associated $(\bar u, \bar v)$, solution of

$$ \begin{aligned} \bar u &= \frac{1}{2}\,\mathbb{E}_{t,x}\Big[\phi\big(t, t + \tau, X_{t+\tau}, u(t + \tau, X_{t+\tau}), v(t + \tau, X_{t+\tau})\big) + \phi\big(t, t + \tau, \tilde X_{t+\tau}, u(t + \tau, \tilde X_{t+\tau}), v(t + \tau, \tilde X_{t+\tau})\big)\Big] \\ \bar v &= \mathbb{E}_{t,x}\Big[\sigma^{-\top}\,\frac{W_{(t+(\tau\wedge\Delta t))\wedge T} - W_t}{\tau \wedge (T - t) \wedge \Delta t}\,\frac{1}{2}\Big(\phi\big(t, t + \tau, X_{t+\tau}, u(t + \tau, X_{t+\tau}), v(t + \tau, X_{t+\tau})\big) - \phi\big(t, t + \tau, \tilde X_{t+\tau}, u(t + \tau, \tilde X_{t+\tau}), v(t + \tau, \tilde X_{t+\tau})\big)\Big)\Big]. \end{aligned} \qquad (18) $$

Instead of trying to solve the problem with a fixed point iteration as in [HK17], we propose to solve it with a machine learning technique by defining the loss function $\ell$ for $U \in \mathrm{Lip}([0, T]\times\mathbb{R}^d, \mathbb{R}^{d+1})$ by:

$$ \ell(U) = \mathbb{E}\big[\|U - TU\|^2\big] \qquad (19) $$

Notice that the operator $T$ requires the calculation of an expectation involved in the loss function. This expectation may be calculated with only a few thousand samples $n_{\text{inner}}$ by a Monte Carlo approximation. The $\tau$ and the Brownian increments $W$ involved in the Euler schemes (14) and (16) being sampled once and for all, we get the discrete version of equation (18):

$$ \bar u = \frac{1}{2}\,\frac{1}{n_{\text{inner}}} \sum_{i=1}^{n_{\text{inner}}} \Big[\phi\big(t, t + \tau^i, X^i_{t+\tau^i}, u(t + \tau^i, X^i_{t+\tau^i}), v(t + \tau^i, X^i_{t+\tau^i})\big) + \phi\big(t, t + \tau^i, \tilde X^i_{t+\tau^i}, u(t + \tau^i, \tilde X^i_{t+\tau^i}), v(t + \tau^i, \tilde X^i_{t+\tau^i})\big)\Big] \qquad (20) $$

$$ \bar v = \frac{1}{2}\,\frac{1}{n_{\text{inner}}} \sum_{i=1}^{n_{\text{inner}}} \Big[\sigma^{-\top}\,\frac{W^i_{(t+(\tau^i\wedge\Delta t))\wedge T} - W^i_t}{\tau^i \wedge (T - t) \wedge \Delta t}\Big(\phi\big(t, t + \tau^i, X^i_{t+\tau^i}, u(t + \tau^i, X^i_{t+\tau^i}), v(t + \tau^i, X^i_{t+\tau^i})\big) - \phi\big(t, t + \tau^i, \tilde X^i_{t+\tau^i}, u(t + \tau^i, \tilde X^i_{t+\tau^i}), v(t + \tau^i, \tilde X^i_{t+\tau^i})\big)\Big)\Big] \qquad (21) $$

The number of samples is limited because the number of terms in the loss function grows linearly with the number of samples, which leads to an increase of the computing time with TensorFlow at each iteration: the cost of the automatic differentiation used to calculate the gradient grows at least linearly with the number of terms. However, this representation of the solution yields a solution $u, Du$ on $[0, T] \times \mathbb{R}^d$. Once convergence of the machine learning algorithm is achieved, a post-processing step is carried out with a very high number of particles $n_{\text{eval}} \gg n_{\text{inner}}$, calculating $u(0, x)$ and its derivative $Du(0, x)$ by replacing $n_{\text{inner}}$ with $n_{\text{eval}}$ in equation (19). Because of the number of terms involved in the loss function, the computing time cannot compete with the ones obtained in [H+17], but we hope it may be more accurate. In all the examples, we will use a feed forward network.
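To make the inner Monte Carlo step concrete, here is a hedged NumPy sketch of the estimator (20) at a single training point $(t, x)$, in the constant-coefficient case where (12)-(13) apply; `u`, `v`, `f`, `g`, `mu` and `sigma` are placeholder callables and arrays, and evaluating both branches of $\phi$ everywhere is a simplification of ours.

```python
import numpy as np

def u_bar_estimate(t, x, u, v, f, g, mu, sigma, T, lam=1.0, n_inner=10000, seed=0):
    """Monte Carlo estimate of u-bar in (20) at one training point (t, x), in the
    constant-coefficient case where (12)-(13) give X_{t+tau} exactly.  u, v, f, g
    are placeholder callables (vectorized over samples); the antithetic path uses
    the same tau and Brownian increment with the opposite sign."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    tau = rng.exponential(scale=1.0 / lam, size=n_inner)          # samples of density rho
    dW = rng.normal(size=(n_inner, d)) * np.sqrt(tau)[:, None]    # W_{t+tau} - W_t
    s = t + tau
    acc = 0.0
    for sign in (+1.0, -1.0):                                     # path (12) and antithetic (13)
        Xs = x + mu * tau[:, None] + sign * (dW @ sigma.T)
        # phi of the randomized representation (both branches evaluated for simplicity)
        acc = acc + np.where(s >= T,
                             g(Xs) * np.exp(lam * (T - t)),                    # g / F_bar(T - t)
                             f(s, Xs, u(s, Xs), v(s, Xs)) / (lam * np.exp(-lam * tau)))
    return 0.5 * np.mean(acc)
```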

4.2 A second scheme

In the second scheme, the gradient is no longer calculated by equation (17) but directly by differentiating the estimated function $u$ using TensorFlow. Denoting by $Du$ the TensorFlow differentiation operator applied to a function $u$, the operator $T : \mathrm{Lip}([0, T]\times\mathbb{R}^d, \mathbb{R}) \longrightarrow \mathrm{Lip}([0, T]\times\mathbb{R}^d, \mathbb{R})$ associates to $u \in \mathrm{Lip}([0, T]\times\mathbb{R}^d, \mathbb{R})$ the function $\bar u$, solution of

$$ \bar u = \frac{1}{2}\,\mathbb{E}_{t,x}\Big[\phi\big(t, t + \tau, X_{t+\tau}, u(t + \tau, X_{t+\tau}), Du(t + \tau, X_{t+\tau})\big) + \phi\big(t, t + \tau, \tilde X_{t+\tau}, u(t + \tau, \tilde X_{t+\tau}), Du(t + \tau, \tilde X_{t+\tau})\big)\Big] \qquad (22) $$


As in the first scheme, we try to solve equation (19) using a feed forward network. As in the previous algorithm, equation (22) is discretized with a given number of samples of $\tau$ and $W_t$, chosen once and for all. Once the loss function is minimized and an estimation of $u$ is obtained, a more accurate estimation of $u$ at date 0 is achieved by solving equation (22) with a very high number of simulations, as in the first algorithm.
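A hedged TensorFlow 2 sketch of this second scheme at a single training point could read as follows; the constant coefficients, the vectorized `phi` and all names are assumptions of ours, and all arguments are assumed to be float32 tensors.

```python
import tensorflow as tf

def fixed_point_residual(net, phi, t, x, tau, dW, mu, sigma):
    """Second scheme (section 4.2) at one training point (t, x): the network gives u,
    automatic differentiation gives Du, and the inner Monte Carlo average over the
    pre-sampled (tau, dW) gives (T u)(t, x) as in (22).  The loss (19) averages this
    squared residual over a batch of points (t, x)."""
    tx = tf.concat([tf.reshape(t, [1, 1]), tf.reshape(x, [1, -1])], axis=1)
    u_tx = net(tx)[0, 0]                                    # u(t, x)

    def u_and_Du(tt, xx):                                   # tt: (n, 1), xx: (n, d)
        with tf.GradientTape() as tape:
            tape.watch(xx)
            uu = net(tf.concat([tt, xx], axis=1))
        return uu, tape.gradient(uu, xx)                    # Du by automatic differentiation

    t_tau = t + tau                                         # (n_inner, 1)
    sig_dW = tf.matmul(dW, sigma, transpose_b=True)         # sigma (W_{t+tau} - W_t)
    X_plus = x + tau * mu + sig_dW                          # eq. (12)
    X_minus = x + tau * mu - sig_dW                         # eq. (13), antithetic
    u_p, Du_p = u_and_Du(t_tau, X_plus)
    u_m, Du_m = u_and_Du(t_tau, X_minus)
    Tu = tf.reduce_mean(0.5 * (phi(t, t_tau, X_plus, u_p, Du_p)
                               + phi(t, t_tau, X_minus, u_m, Du_m)))
    return (u_tx - Tu) ** 2
```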

Figure 6: Our new machine learning algorithm. First, some $(t, x)$ are given as inputs. During training, the discretized operator $T$ is computed using $n_{\text{inner}}$ realizations of $\tau, W$; the loss is the norm of the difference between $(\bar u, \bar v)$ and $(u, Du)$. During evaluation, the discretized operator $T$ is computed using $n_{\text{eval}}$ realizations of $\tau, W$ and the corresponding $(\bar u, \bar v)$ are output.

4.3 Solving our new fixed point problem in practice

The networks we tested are represented in Figure 7. In practice, we find that choosing C. and keeping the loss function (19) with

$$ \bar v = \mathbb{E}_{t,x}\Big[\sigma^{-\top}\,\frac{W_{(t+(\tau\wedge\Delta t))\wedge T} - W_t}{\tau \wedge (T - t) \wedge \Delta t}\,\frac{1}{2}\Big(\phi\big(t, t + \tau, X_{t+\tau}, u(t + \tau, X_{t+\tau}), Du(t + \tau, X_{t+\tau})\big) - \phi\big(t, t + \tau, \tilde X_{t+\tau}, u(t + \tau, \tilde X_{t+\tau}), Du(t + \tau, \tilde X_{t+\tau})\big)\Big)\Big] \qquad (23) $$

as an estimator for $Du$ yields the best results. Similarly, we investigated the influence of the parameters $n_{\text{inner}}$ and $\lambda$. It follows that increasing $n_{\text{eval}}$ does indeed improve the precision of the results, but increasing $n_{\text{inner}}$ over several thousands and changing $\lambda$ do not have a significant effect. In the following, we choose $n_{\text{inner}} = 10000$ and $\lambda = 1.0$ or $\lambda = 0.5$.

Figure 7: Networks used for our new algorithm (A. Separated, B. Shared, C. Automatic differentiation). Network A. uses two separate FC (tanh) networks taking $(t, x)$ as input and outputting $u(t, x)$ and $Du(t, x)$ respectively; network B. uses a single FC (tanh) network outputting $(u(t, x), Du(t, x))$; network C. uses a single FC (tanh) network outputting $u(t, x)$, from which $Du(t, x)$ is obtained by automatic differentiation. A. and B. correspond to the first scheme and C. to the second scheme.


5 Experiments & results on DBSDE algorithms

We chose to test our algorithms on the following PDEs, whose details can be found in the Appendix:

• The Hamilton-Jacobi-Bellman equation A.1.3 corresponding to a control problem, presented in [EHJ17; HJW17], with a non-bounded terminal condition $g(x) = 0.5 \log(1 + \|x\|^2)$.

• An equation A.1.5 similar to the previous one, from [Ric10], with a non-Lipschitz terminal condition $g(x) = \sum_{i=1}^d (\max\{0, \min[1, x_i]\})^\alpha$.

• A Black-Scholes Barenblatt equation A.1.2 from [Rai18].

• A toy example A.1.4 with an oscillating solution $u(t, x) = \exp(a(T - t)) \cos\big(\sum_{i=1}^d x_i\big)$ and a non-linearity in $\big(y \sum_{i=1}^d z_i\big)^2$.

• A toy example A.1.7 with an oscillating solution and a non-linearity in $y / \big(\sum_{i=1}^d z_i\big)$.

• A toy example A.1.6 with an oscillating solution and a CIR model for $X$.

In the whole section δt stands for the size of the time step.

5.1 Influence of the hyper parameters and methodology

5.1.1 Learning parameters

Batch size As in the previous works [HJW17; EHJ17; Rai18], we decided to use the Adam optimizer, a variant of batch gradient descent, which proves most efficient for neural network training compared to other algorithms [Rud16]. The batch size is a key parameter of this algorithm. The intuition is that plain gradient descent is not adapted to the problem since the loss function is not convex, and thus the algorithm would very easily get stuck in local minima. Stochastic gradient descent (SGD) tackles this issue by computing the gradient on each training sample separately, but learning is slow. Batch SGD computes gradients on small batches, so that the good properties of SGD are kept and the algorithm is faster overall thanks to vectorized operations. Recent work [Goy+17] advocates the usage of small batches in classical machine learning problems; we found that in our case, using large batches speeds the algorithm up without giving up much learning efficiency. In the following, we choose a batch size of $M = 300$.

Learning rate As described in [Ben12], the learning rate is arguably the training hyper parameter that has the biggest influence on training. Theoretical work indicates that the optimal learning rate for a given problem is close to "the biggest value before the algorithm diverges", up to a factor 2. One also advocates the use of learning rate schedules, i.e. changing the learning rate during training, to achieve lower losses when a loss floor is reached. Moreover, as shown in [HE16], lower learning rates enable reaching lower losses and stabilizing the learning process in its final stages. We chose not to tune the learning rate for each of our networks and hyper parameter choices, but rather to use an adaptive strategy: we initially choose a learning rate of $\eta = 0.01$, which we find to be a reasonable starting value, albeit 10 times larger than the one proposed in the original article on Adam [KB14]; some of our algorithms would diverge during training for $\eta = 0.02$. Working with periods of 1000 iterations (gradient descent updates), we keep track of the mean test loss over each period. Between two periods, we check whether the percentage of decrease of the mean of the losses is less than 5%. If so, we assume that we have reached a loss plateau, and we divide the learning rate by 2 for the next period. This adaptive strategy enables us to explore various learning rate scales during training, achieving lower losses and avoiding fluctuations at the end of training.
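A minimal, framework-free sketch of this schedule follows, assuming for simplicity that the test loss is recorded at every iteration (in our runs it is only evaluated every 100 iterations); the function name and signature are ours.

```python
def update_learning_rate(test_losses, lr, period=1000, threshold=0.05):
    """Adaptive schedule sketched from the description above: once a full period of
    losses is available, compare its mean with the previous period's mean and halve
    the learning rate if the relative decrease is below the threshold."""
    if len(test_losses) < 2 * period or len(test_losses) % period != 0:
        return lr
    prev = sum(test_losses[-2 * period:-period]) / period
    last = sum(test_losses[-period:]) / period
    if (prev - last) / prev < threshold:
        lr = lr / 2.0
    return lr

# usage sketch: after iteration j, append the test loss to test_losses and call
#   lr = update_learning_rate(test_losses, lr)
# before the next gradient update.
```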

Initialization In our neural networks, we use Xavier initialization [GB10] for the weights and a normal initialization for the biases. We find that a good initialization of $Y_0$ is also necessary: if the initial guess is far from the optimum, the algorithm converges very slowly or gets stuck in local optima, especially for our LSTMs. In order to make a reasonable guess, we initialize with $Y_0 := \mathbb{E}[g(X_T)]$, which is the solution corresponding to $f := 0$.

Regularization In machine learning, regularizing the optimization process by adding a term to the loss penalizing high weights often helps the network. We find that in our case, adding L2 regularization on the weights of the neurons (not on the biases) degrades the convergence and the precision of the algorithm: we explain this by the fact that our data is neither noisy nor redundant, and thus the network does not experience overfitting. We therefore do not regularize our networks.

Centering and rescaling Neural networks experience convergence issues when their inputs are not scaled and centered, and perform best when the inputs follow a normal distribution, especially with our LSTMs and tanh activation functions. In our case, the inputs $t$ and $Y$ are not Gaussian, contrary to $X$. Since we use the same network for each time step, we scale and center all the inputs $(t_i, X_{t_i}, Y_{t_i}, g(X_{t_i}))$ for all $t_i$'s with the same coefficients, so that they take values in roughly $[-1, 1]$, as described below.


Number of hidden layers and hidden layer sizes We investigated the influence of the number of hidden layers and of the hidden layer sizes on the convergence of the algorithms and on the precision of the results. We denote by $h$ the number of hidden layers and by $w$ the size of the hidden layers. In practice, the optimal $w$ and $h$ depend on the equation under consideration and on the network used. We find that for most networks, for a fixed $w$, setting $h$ over 2 greatly increases the difficulty of the problem and degrades the convergence speed; moreover, this does not significantly increase the precision of the results. Increasing the hidden layer size $w$ does not have any impact over a certain value; we find that suitable values are $h = 2$ and $w \simeq d$ or $2d$.

5.1.2 Standard training procedure

When hyper parameters are not mentioned, we use the following during the training procedure (a small sketch of the input scalers follows this list):

• If applicable, initialize $Y_0$ to $\mathbb{E}[g(X_T)]$, the weights with Xavier initialization, and the biases with a normal initialization.

• Use centered and rescaled neural network inputs $\bar t$, $\bar X$, $\bar Y$, $\overline{g(X)}$, defined as
$$ \bar t = \frac{t - (T - \delta t)/2}{(T - \delta t)/2}, \quad \bar X = \frac{X - X_{\text{mean}}}{X_{\text{std}}}, \quad \bar Y = \frac{Y - Y_{\text{mean}}}{|Y_{\text{mean}}|}, \quad \overline{g(X)} = \frac{g(X) - Y_{\text{mean}}}{|Y_{\text{mean}}|}, $$
where, if $M$ is a number of samples (we took $M = 10000$), we compute beforehand:
$$ X_{\text{mean}} = \underset{\substack{i = 0..N \\ m = 1..M}}{\mathrm{mean}}\ X^{(m)}_{t_i}, \quad X_{\text{std}} = \underset{\substack{i = 0..N \\ m = 1..M}}{\mathrm{std}}\ X^{(m)}_{t_i}, \quad Y_{\text{mean}} = \underset{\substack{i = 0..N \\ m = 1..M}}{\mathrm{mean}}\ g\big(X^{(m)}_{t_i}\big). $$

• Use the Adam optimizer [KB14] with the recommended parameters, but with a decreasing learning rate: set initially $\eta \leftarrow \eta_0 = 10^{-2}$, and if $\ell_j$ is the test loss evaluated at iteration $j$ and
$$ \bar\ell_k = \underset{j = 1000k\,..\,1000(k+1) - 1}{\mathrm{mean}}\ \ell_j, $$
then, if at a step $j = 1000(k + 1)$,
$$ \frac{\bar\ell_k - \bar\ell_{k+1}}{\bar\ell_k} < 5\%, $$
then $\eta \leftarrow \eta / 2$.

• Use a batch size of 300 for the Adam optimizer, and evaluate the test loss every 100 iterations during training with a separate test set of size 1000.

• For our new algorithm, we also use $\lambda = 1.0$ and $n_{\text{inner}} = 10000$ during training.
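As announced above, here is a hedged NumPy sketch of these scalers, assuming the paths have already been simulated and that `g` is vectorized over its first axis; the names are ours.

```python
import numpy as np

def input_scalers(X_paths, g, T, dt):
    """Centering/rescaling constants of section 5.1.2, computed from M pre-simulated
    paths X_paths of shape (M, N+1, d); g is the (vectorized) terminal function."""
    X_mean = X_paths.mean(axis=(0, 1))                     # mean over paths and time steps
    X_std = X_paths.std(axis=(0, 1))
    Y_mean = g(X_paths.reshape(-1, X_paths.shape[-1])).mean()

    def scale(t, X, Y, gX):
        t_bar = (t - (T - dt) / 2) / ((T - dt) / 2)
        return (t_bar,
                (X - X_mean) / X_std,
                (Y - Y_mean) / abs(Y_mean),
                (gX - Y_mean) / abs(Y_mean))
    return scale
```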

Error measures After 16000 iterations, we retain the set of parameters that generated the lowest test loss. Finally, we compute the deterministic quantities:
$$ \text{Relative error on } Y_0 = \frac{|Y_0 - Y_{0,\text{ref}}|}{|Y_{0,\text{ref}}|}, \qquad (24) $$
$$ \text{Relative error on } Z_0 = \frac{\|Z_0 - Z_{0,\text{ref}}\|_2^2}{\|Z_{0,\text{ref}}\|_2^2}, \qquad (25) $$
and we compute a final test loss and the means of the following quantities using a final test set of size 1500:
$$ \text{Integral error on } Y = \mathbb{E}\left[\delta t\left(\frac{|Y_0 - Y_{0,\text{ref}}| + |Y_T - Y_{T,\text{ref}}|}{2} + \sum_{i=1}^{N-1} |Y_{t_i} - Y_{t_i,\text{ref}}|\right)\right], \qquad (26) $$
$$ \text{Integral error on } Z = \mathbb{E}\left[\delta t\left(\frac{\|Z_0 - Z_{0,\text{ref}}\|_2^2 + \|Z_T - Z_{T,\text{ref}}\|_2^2}{2} + \sum_{i=1}^{N-1} \|Z_{t_i} - Z_{t_i,\text{ref}}\|_2^2\right)\right]. \qquad (27) $$

We also compute the errors on $Y$ and on $Z$ for each time step. For our new algorithm, we use $n_{\text{eval}} = 100000$ during evaluation to compute the integral errors and $n_{\text{eval}} = 1000000$ during evaluation to compute the relative errors on $Y_0$ and $Z_0$.
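As an illustration, a possible NumPy implementation of the integral error (26) could look as follows; the trajectory layout (paths along the first axis, time steps along the second) is an assumption of ours.

```python
import numpy as np

def integral_error_Y(Y, Y_ref, dt):
    """Integral error on Y, eq. (26): Y and Y_ref have shape (n_paths, N+1);
    the expectation is taken as the mean over the test paths."""
    abs_err = np.abs(Y - Y_ref)
    per_path = dt * (0.5 * (abs_err[:, 0] + abs_err[:, -1]) + abs_err[:, 1:-1].sum(axis=1))
    return per_path.mean()
```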

In the following results, all means and quantiles are computed with 5 independent runs for each simulation.


5.2 Numerical results: Deep BSDE

5.2.1 Influence of the number of time steps

We investigate the influence of the number of time steps on the convergence and on the precision of the results. In the following, we fix $h = 2$ and $w = 2d$ for all networks.

On equation A.1.5 ($d = 10$, Figure 8), our LSTM networks and Merged networks perform better overall than the other networks. Increasing the number of time steps benefits networks with shared parameters, especially LSTMs, while it strongly degrades the convergence of networks a., b., c.

On equation A.1.2 ($d = 100$, Figure 9), the networks a., b., c., g. and i. fail to converge (the loss remains constant at a high value). Overall, the LSTM with a residual structure j. and the merged network with a shortcut structure e. perform noticeably better than the other networks, and their precision increases with the number of time steps.

On equation A.1.6 ($d = 100$, Figure 10), the networks a., b., c. perform worse than the Merged and LSTM networks. The other networks show very similar performance, while the network i. shows some instabilities.

Overall, our Merged and LSTM networks show a significant improvement on the results of Deep BSDE compared to the other architectures. Increasing the number of time steps generally improves the results of the Merged and LSTM networks, while degrading or having no effect on the standard FC networks.

Network                  A.1.5 (d = 10)   A.1.2 (d = 100)   A.1.6 (d = 100)
a. FC DBSDE              60               –                 –
b. FC ELU                60               –                 –
c. FC Residual           60               20                60
d. FC Merged             150              100               200+
e. FC Merged Shortcut    200+             200               200+
f. FC Merged Residual    200+             200               200+
g. LSTM                  200+             –                 –
h. Augmented LSTM        200+             100               200+
i. Hybrid LSTM           200+             –                 200+
j. Residual LSTM         200+             200               200+

Table 1: Number of time steps that yielded the best results (among 20, 40, 60, 80, 100, 150, 200, or 200+ for larger values). In the original colored table, a cell is filled in red if the algorithm did not converge (marked "–" above), in green if it yielded the best results, in orange if it is unstable.


Figure 8: Influence of the number of time steps on equation A.1.5 ($d = 10$) on 5 different runs, with the standard learning hyper parameters (see 5.1.2). 1., 2.: we represent the mean of the integral error on $Y$ (26) and of the integral error on $Z$ (27) with the 5-95% confidence intervals; our LSTM-based networks perform better than the other networks, and Merged networks perform better than FCs. 3., 4.: we represent the final relative error on $Y_0$ (24) and the relative error on $Z_0$ (25) (corresponding to the lowest test loss obtained during training). 5.: we represent the final test loss. Our LSTM-based networks and the merged networks perform better when the number of time steps increases, whereas this hurts the convergence of the non-merged networks, which tend to perform worse: this is shown in 6. and 7., where the LSTM-based network achieves lower losses with a higher number of time steps, whereas the other network cannot.


Figure 9: Influence of the number of time steps on equation A.1.2 ($d = 100$) on 5 different runs, with the standard learning hyper parameters (see 5.1.2). We do not represent networks a., b., g., i., whose losses are much higher and did not decrease during training. 1., 2.: we represent the final relative error on $Y_0$ (24) and the relative error on $Z_0$ (25) (corresponding to the lowest test loss obtained during training). 3., 4.: we represent the mean of the integral error on $Y$ (26) and of the integral error on $Z$ (27) with the 5-95% confidence intervals. 5.: we represent the final test loss. The test losses for c. and e. are shown in 6. and 7., and show that the merged network sees better convergence with a higher number of time steps, while the non-merged network does not.


Figure 10: Influence of the number of time steps on equation A.1.6 ($d = 100$) on 5 different runs, with the standard learning hyper parameters (see 5.1.2). We do not represent network g., whose losses are much higher and did not decrease during training. Note that networks f. and i. proved unstable during training with a large number of time steps. 1., 2.: we represent the final relative error on $Y_0$ (24) and the relative error on $Z_0$ (25) (corresponding to the lowest test loss obtained during training). 3., 4.: we represent the mean of the integral error on $Y$ (26) and of the integral error on $Z$ (27) (on the whole trajectory) with the 5-95% confidence intervals. 5.: we represent the final test loss. The test losses for c. and j. are shown in 6. and 7., where the network with shared parameters converges better with a higher number of time steps, while the other does not.


5.2.2 Influence of the nonlinearity in the driver

In this section, we investigate the influence of the nonlinearity in the driver on the results of our algorithms. We use the equations A.1.4 and A.1.7, in which the parameter $r$ is the rescale factor in front of $f$: the nonlinearity increases with $r$.

Generally speaking, the final test loss and the integral error on $Y$ tend to decrease with $r$, while the other errors increase with $r$.

On equation A.1.4 ($d = 10$, Figure 11), the networks a. and g. did not converge (they remained at a high loss during training). The Merged networks perform better than the other networks overall; the LSTM networks yield similar performance but show slightly higher errors for high $r$'s. The final losses of the standard FC networks increase greatly with $r$, showing that these networks have difficulties converging.

On equation A.1.7 ($d = 10$, Figure 12), the network a. does not converge; the standard FC networks b. and c. have lower errors on the initial conditions and greater integral errors than the Merged and LSTM networks, which show similar performance.

Overall, our new architectures seem to be more resilient to the increased non-linearity.

Network                  A.1.4 (d = 10)                                  A.1.7 (d = 10)
a. FC DBSDE
b. FC ELU                The final loss increases exponentially with r
c. FC Residual           The final loss increases exponentially with r
d. FC Merged
e. FC Merged Shortcut
f. FC Merged Residual
g. LSTM
h. Augmented LSTM        Higher error for r ≥ 2.0
i. Hybrid LSTM           Higher error for r ≥ 2.0
j. Residual LSTM         Higher error for r ≥ 2.0

Table 2: Influence of the non-linearity $r$ on the results. In the original colored table, the cells are filled in red if the algorithm did not converge, in green if it yielded the best stable results.


Figure 11: Influence of the non-linearity of $f$ (parameter $r$) on equation A.1.4 ($d = 10$, $N = 100$). We represent 1. the relative error on $Y_0$ (24), 2. the relative error on $Z_0$ (25), 3. the integral error on $Y$ (26), 4. the integral error on $Z$ (27), 5. the final test loss. The mean and the 5% and 95% quantiles are computed on 5 independent runs and represented with the lines and error bars. Networks a. and g. do not converge on this example (their losses do not decrease during training and remain significantly higher than those of the other networks) and are not represented here. Interestingly, the final test loss decreases with $r$ while the other error measures tend to increase.


Figure 12: Influence of the non-linearity of $f$ (parameter $r$) on equation A.1.7 ($d = 10$, $N = 100$). We represent 1. the relative error on $Y_0$ (24), 2. the relative error on $Z_0$ (25), 3. the integral error on $Y$ (26), 4. the integral error on $Z$ (27), 5. the final test loss. The mean and the 5% and 95% quantiles are computed on 5 independent runs and represented with the lines and error bars. The network a. does not converge on this example (its loss does not decrease during training and remains significantly higher than those of the other networks) and is not represented here.


5.2.3 Influence of the maturity

In order to assess the influence of the maturity, we kept a constant time step $\delta t = 0.1$, i.e. we take $N = 100 \times T$.

Conclusions are similar on equations A.1.4 and A.1.7 ($d = 10$): the errors seem to increase linearly with $T$, as shown in Figures 13 and 14. We point out that networks a. and g. fail to converge, and networks b. and c. perform slightly worse than the other networks. Finally, all the networks show some instabilities for $T \ge 2.5$ on equation A.1.4 (with both constant $N$ and constant $\delta t$) with our standard training procedure. This phenomenon is less visible on equation A.1.7, except for network h. These instabilities could be solved by tuning the training hyper parameters further.

Figure 13: Influence of the maturity $T$ using equation A.1.4 ($d = 10$, $N = 100$) on 1. the relative error on $Y_0$ (24), 2. the relative error on $Z_0$ (25), 3. the integral error on $Y$ (26), 4. the integral error on $Z$ (27), 5. the final test loss. Convergence losses for networks c. and j. are represented in 6. and 7. The mean and the 5% and 95% quantiles are computed on 5 independent runs and represented with the lines and error bars. Networks a. and g. do not converge on this example (their losses do not decrease during training and remain significantly higher than those of the other networks) and are not represented here. Above $T = 2.5$ ($N = 250$), only networks e. and h. are stable using the standard learning parameters.


[Figure 14: plots omitted; panels 1.-5. show the errors and the final test loss as functions of T for networks b.-j., panels 6. and 7. show the test loss vs. the number of gradient descent updates for networks c. and j.; see caption below.]

Figure 14: Influence of the maturity T using equation A.1.7 (d = 10, N = 100) on 1. the relative error on Y0 (24), 2. the relative error on Z0 (25), 3. the integral error on Y (26), 4. the integral error on Z (27), 5. the final test loss. We represent the convergence losses for networks c. and j. in 6. and 7.. The mean and the 5% and 95% quantiles are computed on 5 independent runs and represented with the lines and error bars. Networks a. and g. do not converge on this example (their losses do not decrease during training and remain significantly higher than those of the other networks) and are not represented here. Above T = 2.5 (N = 250), network h. shows instabilities.


5.3 Numerical results: our new algorithm

In this section we compare the second version of our algorithm, denoted C. AutoDiff in Section 4.2, with the best neural networks developed in Section 3. The memory needed by the algorithm prevents us from using it on GPU for high values of ninner in high dimension, so CPUs with large memory are sometimes used instead; this has a direct impact on the computational time, as shown in Tables 3 and 4. As a consequence, the integrals on Y (26) and Z (27) are computed with only a few samples.

Method         Architecture   Time in seconds
f.             GPU            1700
j.             GPU            2640
C. AutoDiff    GPU            21000

Table 3: Computational time for 16000 iterations for test case A.1.4 in dimension 10. C. AutoDiff uses ninner = 10000.

Method         Architecture   Time in seconds
f.             GPU            1800
j.             GPU            4500
C. AutoDiff    CPU            115000

Table 4: Computational time for 16000 iterations for test case A.1.2 in dimension 100. C. AutoDiff uses ninner = 4000.

5.3.1 Influence of the nonlinearity in the driver

We investigate the influence of r on equation A.1.4 and compare the results with Deep BSDE using networks f. and j.. The results are presented in Figure 15. Increasing neval helps to achieve lower errors during evaluation, but the gain decreases as r increases. Overall, our new algorithm yields a precision similar to Deep BSDE on the initial condition, with a better integral error.

5.3.2 Influence of the maturity

We investigate the influence of T on equation A.1.4. We compare the results with Deep BSDE using N = 100 × T; for consistency, we compute the integral error for our new algorithm using the same rule for the number of time steps, even though this is not a parameter of the algorithm. The results are presented in Figure 16. The error on the initial condition is similar to Deep BSDE, slightly higher on Z0, whereas our new algorithm shows better integral errors.

5.3.3 Other examples

We ran our new algorithm on some other equations: A.1.5 (square root terminal condition) in dimension d = 10, and A.1.3 (Hamilton-Jacobi-Bellman problem) and A.1.2 (Black-Scholes-Barenblatt) in dimension d = 100. Typical results are presented in Table 5; they show comparable performance overall. Note that the error on Y0 is higher than the one obtained by network f..

Finally, we compared qualitatively the shape of the errors along the trajectories using equations A.1.3 and A.1.2 in dimension d = 100. These are represented in Figure 17. The shapes of the errors differ: the error increases with t for Deep BSDE, whereas this does not seem to be the case for our new fixed point algorithm.


[Figure 15: plots omitted; panels 1.-4. show the errors as functions of r for C. AutoDiff and networks f. and j., panels 5. and 6. show a sample trajectory of Y and Z (first dimension) at r = 3.0 for f. FC Merged Residual and for C. AutoDiff; see caption below.]

Figure 15: Influence of the non-linearity r using our new algorithm C. AutoDiff on equation A.1.4 (d = 10, λ = 0.5) compared to networks f. and j. (our best algorithms using Deep BSDE). We represent 1. the relative error on Y0 (24), 2. the relative error on Z0 (25), 3. the integral error on Y (26), 4. the integral error on Z (27). Integral errors are computed using N = 100 time steps. The final errors increase as r increases, but remain acceptable. The convergence curves (not represented) are similar for all r, with final losses between 10^-3 and 10^-4. We represent a sample trajectory in 5. and 6..


[Figure 16: plots omitted; panels 1.-4. show the errors as functions of T for C. AutoDiff and networks f. and j., panels 5. and 6. show a sample trajectory of Y and Z (first dimension) at T = 3.0 for f. FC Merged Residual and for C. AutoDiff; see caption below.]

Figure 16: Influence of the maturity T using our new algorithm C. AutoDiff on equation A.1.4 (d = 10, λ = 0.5) compared to networks f. and j. (our best algorithms using Deep BSDE). We represent 1. the relative error on Y0 (24), 2. the relative error on Z0 (25), 3. the integral error on Y (26), 4. the integral error on Z (27). Integral errors are computed using N = 100 × T time steps. The final errors increase as T increases, but remain acceptable. The final loss (not represented) increases with T from roughly 10^-4 to 10^-3. We represent a sample trajectory in 5. and 6..


f.                Abs error Y0       Norm error Z0      Integral error Y   Integral error Z
A.1.5 (d = 10)    6.821155 × 10^-3   4.986833 × 10^-2   4.484242 × 10^-2   1.621381 × 10^-2
A.1.3 (d = 100)   6.856918 × 10^-4   1.510367 × 10^-3   9.022277 × 10^-3   2.065553 × 10^-3
A.1.2 (d = 100)   7.350158 × 10^-2   2.887233 × 10^-1   2.009426 × 10^-1   3.74562 × 10^0

C. AutoDiff       Abs error Y0       Norm error Z0      Integral error Y   Integral error Z
A.1.5 (d = 10)    1.054191 × 10^-2   2.332632 × 10^-2   2.162184 × 10^-2   4.105479 × 10^-3
A.1.3 (d = 100)   9.320735 × 10^-3   9.937924 × 10^-4   1.322869 × 10^-2   1.715425 × 10^-3
A.1.2 (d = 100)   2.002487 × 10^-1   1.114723 × 10^0    2.545520 × 10^-1   1.114723 × 10^0

Table 5: Comparison of Deep BSDE and our new fixed point algorithm on other examples, for one typical run. We measure the initial errors |Y0 − Y0,ref|, ‖Z0 − Z0,ref‖, and the mean integral errors on Y (26) and Z (27). It should be noted that the latter integral measures are computed using 1500 simulations for f. and 10 simulations for C. AutoDiff. For C. AutoDiff, in dimension d = 10 we used ninner = 10000 and stopped the training process after 16000 iterations; in dimension d = 100 we used ninner = 4000 and stopped the training process after 10000 iterations. We then evaluated the errors using neval = 1000000 for the initial errors and neval = 100000 for the integral errors.

[Figure 17: plots omitted; for each of the two equations, the panels compare the absolute errors on Y and the norm errors on Z over time for C. AutoDiff and f. FC Merged Residual; see caption below.]

Figure 17: We represent |Y − Yref| and ‖Z − Zref‖ as functions of time, for a typical run of our algorithms C. AutoDiff and f., with N = 100 time steps. We represent the mean and the 5% and 95% quantiles on 1500 trajectories for f. and 10 trajectories for C. AutoDiff. We used equations 1. A.1.3 (d = 100) and 2. A.1.2 (d = 100). For C. AutoDiff, we used ninner = 4000 and stopped the training process after 10000 iterations. We then computed the error measures using neval = 100000.


6 Discussion

In our tests, we could not find a case in which our two algorithms fail, i.e. a case in which the loss would seem small while the solution would be incorrect. In the most difficult cases, the algorithms would rather diverge or explicitly not converge, while in the other cases they would give results and computation times comparable to the state of the art. Thus, the loss function seems to be a reasonably robust indicator of precision, as lower losses are likely to correspond to lower errors on the trajectories. Most divergence cases are explained by using too high a learning rate, initializing parameters incorrectly, or encountering "vanishing" or "exploding" gradient issues when using very large numbers of time steps.

Deep BSDE type algorithm We found that using Merged or LSTM architectures in Deep BSDE significantly improves the precision of the results and the stability of the algorithms, especially since they make it possible to use more time steps to further increase the performance, and to use generic learning rate strategies and batch size values. In our tests, we also found that the number and width of the hidden layers could be set to the typical values of 2 layers of size 2d. Otherwise, tuning these hyper-parameters further has to be done for each combination of network and equation.

We found that the algorithms are resilient to increases in the non-linearity r and the maturity T. Increasing these parameters makes the resolution more difficult, but we found that the solutions remain very acceptable.

Our new algorithm We found our new algorithm to be capable of solving the same range of problems as Deep BSDE, in dimensions d = 10 and d = 100, with tractable computation times (roughly 1 to 10 hours on a consumer grade computer). A potential limitation is that this algorithm's memory usage grows with ninner and d, yet we found that a typical value of ninner = 4000 yields good results while fitting in a standard computer's memory. Our algorithm does not discretize the time horizon but rather discretizes a conditional expectation operator, which leads to different error shapes. We found these errors to be of the same magnitude as those of our best algorithms based on Deep BSDE. Finally, we found our algorithm to be slightly more resilient than our best Deep BSDE based algorithms to increases in r and T, while keeping the hyper-parameters constant (we did not increase ninner when increasing T, for instance). As is the case with Deep BSDE methods, this algorithm could be generalized to second order BSDEs.

Special cases Finally, we would like to point out that we found equations with ill-defined derivatives of the terminal condition, for instance equation A.1.1, to be problematic. The terminal condition is

g(x) = min_i x_i

and its derivative as computed by TensorFlow is Dg(x)_j = 1 if j is the argmin and 0 otherwise. In this case, we do not have an analytical solution; the algorithm and parameters used in [HJW17] lead to a slow convergence and a final loss of about 26, showing that the algorithm could not find a way to replicate the input flow exactly. In fact, we found that the neural networks did not learn well in this case (some κti remaining constant for every ti). We solved the same equation with our Merged network f. and our new algorithm C. AutoDiff; the results presented in Figure 18 show that the solutions found are quite different, yet the solution from our new algorithm C. AutoDiff seems more coherent. Further analysis of such cases remains to be conducted.
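As an illustration of this behaviour, the following minimal sketch (TensorFlow 2 in eager mode; not the code used for our experiments) shows the gradient returned by automatic differentiation for g(x) = min_i x_i:

```python
import tensorflow as tf

# g(x) = min_i x_i is not differentiable where the argmin changes;
# automatic differentiation returns the indicator of the argmin.
x = tf.constant([100.0, 95.0, 101.0])

with tf.GradientTape() as tape:
    tape.watch(x)
    g = tf.reduce_min(x)

print(tape.gradient(g, x).numpy())   # [0. 1. 0.]: 1 at the argmin, 0 elsewhere
```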

[Figure 18: plot omitted; comparison of the distributions of Z (first dimension) over time for C. AutoDiff and f. FC Merged Residual; see caption below.]

Figure 18: Comparison of the distributions of Z (first dimension) on a sample run of our algorithms using f. and C. on equation A.1.1. The means and the 5% and 95% quantiles (computed using 1500 trajectories for f. and 10 trajectories for C.) are represented. The final losses (not comparable) were 16.84 for f. and 3.87 for C.. The resulting Y0 were 57.11 for f. and 57.37 for C..


A Supplementary materials

A.1 Some test PDEs

We give the examples of semilinear PDEs used:

−∂_t u(t, x) − Lu(t, x) = f(t, x, u(t, x), σ^T(t, x)∇u(t, x)),    u(T, x) = g(x),

where Lu(t, x) := (1/2) Tr(σσ^T(t, x)∇²u(t, x)) + μ(t, x)^T ∇u(t, x). For each example, we thus give the corresponding μ, σ, f and g.

A.1.1 A Black-Scholes equation with default risk

From [EHJ17; HJW17]. If not otherwise stated, the parameters take the values: μ = 0.02, σ = 0.2, δ = 2/3, R = 0.02, γ^h = 0.2, γ^l = 0.02, v^h = 50, v^l = 70. We use the initial condition X0 = (100, ..., 100).

μ : (t, x) ↦ μ x
σ : (t, x) ↦ σ diag({x_i}_{i=1..d})
f : (t, x, y, z) ↦ −(1 − δ) min{ γ^h, max{ γ^l, ((γ^h − γ^l)/(v^h − v^l)) (y − v^h) + γ^h } } y − R y
g : x ↦ min_{i=1..d} x_i

We used the closed formula for the SDE dynamics:

X_t = X_s exp[ (μ − σ²/2)(t − s) + σ(W_t − W_s) ]    for all t > s.

Baseline (from [HJW17]): Y0 = 47.300
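A minimal NumPy sketch of this test case (our own illustration; the function names are ours and this is not the implementation used for the experiments):

```python
import numpy as np

# Parameters of test case A.1.1 (Black-Scholes equation with default risk)
mu, sigma, delta, R = 0.02, 0.2, 2.0 / 3.0, 0.02
gamma_h, gamma_l, v_h, v_l = 0.2, 0.02, 50.0, 70.0
d, T = 100, 1.0

def g(x):
    # terminal condition g(x) = min_i x_i
    return np.min(x, axis=-1)

def f(t, x, y, z):
    # driver: -(1 - delta) * min{gamma_h, max{gamma_l, slope*(y - v_h) + gamma_h}} * y - R*y
    slope = (gamma_h - gamma_l) / (v_h - v_l)
    q = np.minimum(gamma_h, np.maximum(gamma_l, slope * (y - v_h) + gamma_h))
    return -(1.0 - delta) * q * y - R * y

def sample_X(x0, s, t, n_paths, rng):
    # exact lognormal transition X_t = X_s exp((mu - sigma^2/2)(t - s) + sigma (W_t - W_s))
    dw = rng.normal(scale=np.sqrt(t - s), size=(n_paths, x0.shape[-1]))
    return x0 * np.exp((mu - 0.5 * sigma**2) * (t - s) + sigma * dw)

rng = np.random.default_rng(0)
X0 = np.full(d, 100.0)
XT = sample_X(X0, 0.0, T, n_paths=5, rng=rng)
print(g(XT))                    # terminal payoffs on a few sampled paths
print(f(T, XT, g(XT), 0.0))     # driver evaluated at the terminal values (z is unused here)
```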

A.1.2 A Black-Scholes-Barenblatt equation

From [Rai18]. If not otherwise stated, the parameters take the values: σ = 0.4, r = 0.05. We use the initial condition X0 = (1.0, 0.5, 1.0, ...).

μ : (t, x) ↦ 0
σ : (t, x) ↦ σ diag({x_i}_{i=1..d})
f : (t, x, y, z) ↦ −r ( y − (1/σ) Σ_{i=1..d} z_i )
g : x ↦ ‖x‖²

We used the closed formula:

X_t = X_s exp[ −(σ²/2)(t − s) + σ(W_t − W_s) ]    for all t > s.

Exact solution:

u(t, x) = exp((r + σ²)(T − t)) g(x)
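As an illustration, a short NumPy sketch (our own, for this write-up only) evaluating this exact solution at the initial condition:

```python
import numpy as np

# Parameters and initial condition of test case A.1.2 (Black-Scholes-Barenblatt)
sigma, r, T, d = 0.4, 0.05, 1.0, 100
X0 = np.array([1.0, 0.5] * (d // 2))   # X0 = (1.0, 0.5, 1.0, 0.5, ...)

def u_exact(t, x):
    # u(t, x) = exp((r + sigma^2)(T - t)) * ||x||^2
    return np.exp((r + sigma**2) * (T - t)) * np.sum(x**2, axis=-1)

print(u_exact(0.0, X0))   # reference value for Y0 in this test case
```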

A.1.3 A Hamilton-Jacobi-Bellman equation

From [EHJ17; HJW17]. If not otherwise stated: λ = 1.0 and X0 = (0, ..., 0).

μ : (t, x) ↦ 0
σ : (t, x) ↦ √2 I_d
f : (t, x, y, z) ↦ −0.5 λ ‖z‖²
g : x ↦ log( 0.5 [1 + ‖x‖²] )

Monte-Carlo solution:

u(t, X_t) = −(1/λ) log( E[ exp(−λ g(X_t + √2 B_{T−t})) ] )
∀j, ∇_j u(t, X_t) = ( E[ exp{−λ g(X_t + √2 B_{T−t})} ] )^{−1} E[ (∂g/∂x_j)(X_t + √2 B_{T−t}) exp{−λ g(X_t + √2 B_{T−t})} ]

where (∂g/∂x_j)(x) = 2 x_j / (1 + ‖x‖²).

Baseline (computed using 10 million Monte-Carlo realizations, d = 100):

Y0 = 4.590119548591171217
Z0 = −0.00006060718806111 for all dimensions.
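The baseline above can be approximated by a direct Monte-Carlo transcription of the two formulas; the sketch below is ours, uses an arbitrary seed and a much smaller sample size than the 10 million realizations quoted above, and estimates ∇u(0, X0) (not σ^T ∇u):

```python
import numpy as np

# Test case A.1.3 (Hamilton-Jacobi-Bellman), Monte-Carlo representation at t = 0
lam, T, d = 1.0, 1.0, 100
X0 = np.zeros(d)

def g(x):
    return np.log(0.5 * (1.0 + np.sum(x**2, axis=-1)))

def dg(x):
    # dg/dx_j = 2 x_j / (1 + ||x||^2)
    return 2.0 * x / (1.0 + np.sum(x**2, axis=-1, keepdims=True))

rng = np.random.default_rng(0)
n_mc = 200_000                                   # the paper's baseline uses 10 million realizations
X = X0 + np.sqrt(2.0) * rng.normal(scale=np.sqrt(T), size=(n_mc, d))
w = np.exp(-lam * g(X))                          # exp(-lambda g(X0 + sqrt(2) B_T))

Y0 = -np.log(np.mean(w)) / lam                   # u(0, X0)
gradU0 = (w[:, None] * dg(X)).mean(axis=0) / np.mean(w)   # grad_j u(0, X0)
print(Y0, gradU0[0])                             # approx. 4.59 and approx. 0 by symmetry
```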

A.1.4 An oscillating example with a square non-linearity

From [War]. If not otherwise stated, the parameters take the values: μ0 = 0.2, σ0 = 1.0, a = 0.5, r = 0.1. The intended effect of the min and max in f is to make f Lipschitz. We used the initial condition X0 = (1.0, 0.5, 1.0, ...).

μ : (t, x) ↦ μ0/d
σ : (t, x) ↦ (σ0/√d) I_d
f : (t, x, y, z) ↦ φ(t, x) + r ( max[ −exp(2a(T − t)), min{ (1/(σ0 √d)) y Σ_{i=1..d} z_i, exp(2a(T − t)) } ] )²

where φ : (t, x) ↦ cos(Σ_{i=1..d} x_i) (a + σ0²/2) exp(a(T − t)) + sin(Σ_{i=1..d} x_i) μ0 exp(a(T − t))
                   − r ( cos(Σ_{i=1..d} x_i) sin(Σ_{i=1..d} x_i) exp(2a(T − t)) )²

g : x ↦ cos(Σ_{i=1..d} x_i)

Exact solution:

u(t, x) = cos(Σ_{i=1..d} x_i) exp(a(T − t))
∀j, (∂u/∂x_j)(t, x) = −sin(Σ_{i=1..d} x_i) exp(a(T − t))
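Since A.1.4 is the main test case of Section 5, a NumPy transcription of its driver and exact solution may help reproduction; this sketch is ours and works on a single point or on a batch of points:

```python
import numpy as np

# Parameters of test case A.1.4 (oscillating example with a square non-linearity)
mu0, sigma0, a, r = 0.2, 1.0, 0.5, 0.1
d, T = 10, 1.0

def g(x):
    return np.cos(np.sum(x, axis=-1))

def u_exact(t, x):
    return np.cos(np.sum(x, axis=-1)) * np.exp(a * (T - t))

def du_exact(t, x):
    # du/dx_j = -sin(sum_i x_i) exp(a(T - t)), identical for every j
    return -np.sin(np.sum(x, axis=-1, keepdims=True)) * np.exp(a * (T - t)) * np.ones_like(x)

def phi(t, x):
    s = np.sum(x, axis=-1)
    e = np.exp(a * (T - t))
    return (np.cos(s) * (a + 0.5 * sigma0**2) * e
            + np.sin(s) * mu0 * e
            - r * (np.cos(s) * np.sin(s) * np.exp(2.0 * a * (T - t)))**2)

def f(t, x, y, z):
    # driver with the clipped (Lipschitz) square non-linearity
    cap = np.exp(2.0 * a * (T - t))
    q = np.clip(y * np.sum(z, axis=-1) / (sigma0 * np.sqrt(d)), -cap, cap)
    return phi(t, x) + r * q**2

# sanity check at (0, X0) using z = sigma^T grad u
x0 = np.array([1.0, 0.5] * (d // 2))
z0 = sigma0 / np.sqrt(d) * du_exact(0.0, x0)
print(u_exact(0.0, x0), f(0.0, x0, u_exact(0.0, x0), z0))
```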

A.1.5 A non-Lipschitz terminal condition

From [Ric10]. If not otherwise stated: α = 0.5, X0 = (0, ..., 0).

μ : (t, x) ↦ 0
σ : (t, x) ↦ I_d
f : (t, x, y, z) ↦ −0.5 ‖z‖²
g : x ↦ Σ_{i=1..d} ( max{0, min[1, x_i]} )^α

Monte-Carlo solution:

u(t, X_t) = log( E[ exp(g(X_t + B_{T−t})) ] )
∀j, ∇_j u(t, X_t) = ( E[ exp{g(X_t + √2 B_{T−t})} ] )^{−1} E[ g'(X_t + √2 B_{T−t}) exp{g(X_t + √2 B_{T−t})} ]

where g'(x) = 0 if x ≤ 0 or x ≥ 1, and α x^{α−1} otherwise.

Baseline (computed using 10 million Monte-Carlo realizations, d = 10):

Y0 = 4.658493663928657
Z0 = 0.3795303954478772

A.1.6 An oscillating example with Cox-Ingersoll-Ross propagation

From [War18]. If not otherwise stated: a = 0.1, α = 0.2, T = 1.0, k = 0.1, m = 0.3, σ = 0.2. We used the initial condition X0 = (0.3, ..., 0.3). Note that we have 2km > σ², so that X remains positive.

μ : (t, x) ↦ k (m − x)
σ : (t, x) ↦ σ diag{√x}
f : (t, x, y, z) ↦ φ(t, x) + a y ( Σ_{i=1..d} z_i )

where φ : (t, x) ↦ cos(Σ_{i=1..d} x_i) (−α + σ²/2) exp(−α(T − t)) + sin(Σ_{i=1..d} x_i) exp(−α(T − t)) Σ_{i=1..d} k (m − x_i)
                   + a cos(Σ_{i=1..d} x_i) sin(Σ_{i=1..d} x_i) exp(−2α(T − t)) Σ_{i=1..d} σ √x_i

g : x ↦ cos(Σ_{i=1..d} x_i)

Exact solution:

u(t, x) = cos(Σ_{i=1..d} x_i) exp(−α(T − t))
∀j, ∇_j u(t, x) = −sin(Σ_{i=1..d} x_i) exp(−α(T − t))
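No closed-form transition is given for this forward process; the sketch below simulates it with a full-truncation Euler scheme, which is our own choice here and not necessarily the discretization used in the experiments:

```python
import numpy as np

# Parameters of test case A.1.6 (Cox-Ingersoll-Ross propagation)
k, m, sigma, T = 0.1, 0.3, 0.2, 1.0
d, N = 10, 100
dt = T / N

def simulate_cir(x0, n_paths, rng):
    # full-truncation Euler scheme for dX = k(m - X) dt + sigma sqrt(X) dW
    x = np.tile(x0, (n_paths, 1)).astype(float)
    for _ in range(N):
        dw = rng.normal(scale=np.sqrt(dt), size=x.shape)
        xp = np.maximum(x, 0.0)                 # truncation keeps the square root well defined
        x = x + k * (m - xp) * dt + sigma * np.sqrt(xp) * dw
    return x

rng = np.random.default_rng(0)
X0 = np.full(d, 0.3)
XT = simulate_cir(X0, n_paths=5, rng=rng)
print(np.cos(XT.sum(axis=-1)))                  # terminal condition g(X_T) = cos(sum_i X_T,i)
```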

A.1.7 An oscillating example with inverse non-linearity

If not stated otherwise, we took the parameters μ0 = 0.2, σ0 = 1.0, a = 0.5, r = 0.1 and the initial condition X0 = (1.0, 0.5, 1.0, ...).

μ : (t, x) ↦ (μ0/d) 1_d
σ : (t, x) ↦ σ0 I_d
f : (t, x, y, z) ↦ φ(t, x) + r d y / ( Σ_{i=1..d} z_i )
g : x ↦ 2 Σ_{i=1..d} x_i + cos(Σ_{i=1..d} x_i)

where φ : (t, x) ↦ 2a Σ_{i=1..d} x_i exp(a(T − t)) + cos(Σ_{i=1..d} x_i) (a + d σ0²/2) exp(a(T − t))
                   − μ0 [2 − sin(Σ_{i=1..d} x_i)] exp(a(T − t))
                   − r ( 2 Σ_{i=1..d} x_i + cos(Σ_{i=1..d} x_i) ) / ( σ0 [2 − sin(Σ_{i=1..d} x_i)] )

Exact solution:

u(t, x) = [ 2 Σ_{i=1..d} x_i + cos(Σ_{i=1..d} x_i) ] exp(a(T − t))
∀j, ∇_j u(t, x) = [ 2 − sin(Σ_{i=1..d} x_i) ] exp(a(T − t))


References

[PP90] Etienne Pardoux and Shige Peng. "Adapted solution of a backward stochastic differential equation". In: Systems & Control Letters 14.1 (1990), pp. 55–61.

[HS97] Sepp Hochreiter and Jürgen Schmidhuber. "Long Short-term Memory". In: Neural Computation 9 (Dec. 1997), pp. 1735–1780.

[Fou+99] Eric Fournié et al. "Applications of Malliavin calculus to Monte Carlo methods in finance". In: Finance and Stochastics 3.4 (1999), pp. 391–412.

[LS01] Francis A Longstaff and Eduardo S Schwartz. "Valuing American options by simulation: a simple least-squares approach". In: The Review of Financial Studies 14.1 (2001), pp. 113–147.

[BT04] Bruno Bouchard and Nizar Touzi. "Discrete-time approximation and Monte-Carlo simulation of backward stochastic differential equations". In: Stochastic Processes and their Applications 111.2 (2004), pp. 175–206.

[G+05] Emmanuel Gobet, Jean-Philippe Lemor, Xavier Warin, et al. "A regression-based Monte Carlo method to solve backward stochastic differential equations". In: The Annals of Applied Probability 15.3 (2005), pp. 2172–2202.

[L+06] Jean-Philippe Lemor, Emmanuel Gobet, Xavier Warin, et al. "Rate of convergence of an empirical regression method for solving generalized backward stochastic differential equations". In: Bernoulli 12.5 (2006), pp. 889–916.

[Che+07] Patrick Cheridito et al. "Second-order backward stochastic differential equations and fully nonlinear parabolic PDEs". In: Communications on Pure and Applied Mathematics 60.7 (2007), pp. 1081–1110.

[GB10] Xavier Glorot and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks". In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 2010, pp. 249–256.

[Ric10] Adrien Richou. "Étude théorique et numérique des équations différentielles stochastiques rétrogrades". PhD thesis, supervised by Ying Hu and Philippe Briand, Mathématiques et applications, Rennes 1, 2010.

[FTW11] Arash Fahim, Nizar Touzi, and Xavier Warin. "A probabilistic numerical method for fully nonlinear parabolic PDEs". In: The Annals of Applied Probability (2011), pp. 1322–1364.

[Ben12] Yoshua Bengio. "Practical Recommendations for Gradient-Based Training of Deep Architectures". In: Neural Networks: Tricks of the Trade: Second Edition. Ed. by Grégoire Montavon, Geneviève B. Orr, and Klaus-Robert Müller. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 437–478. isbn: 978-3-642-35289-8. doi: 10.1007/978-3-642-35289-8_26. url: https://doi.org/10.1007/978-3-642-35289-8_26.

[BW12] Bruno Bouchard and Xavier Warin. "Monte-Carlo valuation of American options: facts and new algorithms to improve existing methods". In: Numerical Methods in Finance. Springer, 2012, pp. 215–255.

[KB14] Diederik P Kingma and Jimmy Ba. "Adam: A method for stochastic optimization". In: arXiv preprint arXiv:1412.6980 (2014).

[CUH15] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. "Fast and accurate deep network learning by exponential linear units (ELUs)". In: arXiv preprint arXiv:1511.07289 (2015).

[IS15] Sergey Ioffe and Christian Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". In: Proceedings of the 32nd International Conference on Machine Learning (ICML'15). Lille, France: JMLR.org, 2015, pp. 448–456. url: http://dl.acm.org/citation.cfm?id=3045118.3045167.

[Kar15] Andrej Karpathy. The Unreasonable Effectiveness of Recurrent Neural Networks. Blog. 2015. url: http://karpathy.github.io/2015/05/21/rnn-effectiveness/.

[LBH15] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. "Deep learning". In: Nature 521.7553 (2015), p. 436.

[Ola15] Christopher Olah. Understanding LSTM Networks. Blog. 2015. url: http://colah.github.io/posts/2015-08-Understanding-LSTMs/.


[Coo+16] Tim Cooijmans et al. "Recurrent batch normalization". In: arXiv preprint arXiv:1603.09025 (2016).

[E+16] Weinan E et al. "On multilevel Picard numerical approximations for high-dimensional nonlinear parabolic partial differential equations and high-dimensional nonlinear backward stochastic differential equations". In: arXiv preprint arXiv:1607.03295 46 (2016).

[GT16] Emmanuel Gobet and Plamen Turkedjiev. "Linear regression MDP scheme for discrete backward stochastic differential equations under general conditions". In: Mathematics of Computation 85.299 (2016), pp. 1359–1391.

[HE16] Jiequn Han and Weinan E. "Deep Learning Approximation for Stochastic Control Problems". In: arXiv preprint arXiv:1611.07422 (2016).

[He+16] K. He et al. "Deep Residual Learning for Image Recognition". In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2016, pp. 770–778. doi: 10.1109/CVPR.2016.90.

[Hen+16] Pierre Henry-Labordere et al. "Branching diffusion representation of semilinear PDEs and Monte Carlo approximation". In: arXiv preprint arXiv:1603.01727 (2016).

[Rud16] Sebastian Ruder. "An overview of gradient descent optimization algorithms". In: arXiv preprint arXiv:1609.04747 (2016).

[BEJ17a] Christian Beck, Weinan E, and Arnulf Jentzen. "Machine learning approximation algorithms for high-dimensional fully nonlinear partial differential equations and second-order backward stochastic differential equations". In: arXiv preprint arXiv:1709.05963 (2017).

[BEJ17b] Christian Beck, Weinan E, and Arnulf Jentzen. "Machine learning approximation algorithms for high-dimensional fully nonlinear partial differential equations and second-order backward stochastic differential equations". In: CoRR abs/1709.05963 (2017).

[BTW17] Bruno Bouchard, Xiaolu Tan, and Xavier Warin. "Numerical approximation of general Lipschitz BSDEs with branching processes". In: arXiv preprint arXiv:1710.10933 (2017).

[Bou+17] Bruno Bouchard et al. "Numerical approximation of BSDEs using local polynomial drivers and branching processes". In: Monte Carlo Methods and Applications 23.4 (2017), pp. 241–263.

[EHJ17] Weinan E, Jiequn Han, and Arnulf Jentzen. "Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations". In: Communications in Mathematics and Statistics 5.4 (2017), pp. 349–380.

[E+17] Weinan E et al. "Linear scaling algorithms for solving high-dimensional nonlinear parabolic differential equations". In: SAM Research Report 2017 (2017).

[FTT17] Masaaki Fujii, Akihiko Takahashi, and Masayuki Takahashi. "Asymptotic Expansion as Prior Knowledge in Deep Learning Method for high dimensional BSDEs". In: (2017).

[Goy+17] Priya Goyal et al. "Accurate, large minibatch SGD: training ImageNet in 1 hour". In: arXiv preprint arXiv:1706.02677 (2017).

[HJW17] Jiequn Han, Arnulf Jentzen, and Weinan E. "Overcoming the curse of dimensionality: Solving high-dimensional partial differential equations using deep learning". In: arXiv preprint arXiv:1707.02568 (2017).

[H+17] Jiequn Han, Arnulf Jentzen, et al. "Solving high-dimensional partial differential equations using deep learning". In: arXiv preprint arXiv:1707.02568 (2017).

[HK17] Martin Hutzenthaler and Thomas Kruse. "Multi-level Picard approximations of high-dimensional semilinear parabolic differential equations with gradient-dependent nonlinearities". In: arXiv preprint arXiv:1711.01080 (2017).

[War17] Xavier Warin. "Variations on branching methods for non linear PDEs". In: arXiv preprint arXiv:1701.07660 (2017).

[Hut+18] Martin Hutzenthaler et al. "Overcoming the curse of dimensionality in the numerical approximation of semilinear parabolic partial differential equations". In: arXiv preprint arXiv:1807.01212 (2018).

[Rai18] Maziar Raissi. "Forward-Backward Stochastic Neural Networks: Deep Learning of High-dimensional Partial Differential Equations". In: arXiv preprint arXiv:1804.07010 (2018).


[War18] Xavier Warin. "Monte Carlo for high-dimensional degenerated Semi Linear and Full Non Linear PDEs". In: arXiv preprint arXiv:1805.05078 (2018).

[War] Xavier Warin. "Nesting Monte Carlo for high-dimensional Non Linear PDEs". To appear in Monte Carlo Methods and Applications.
