Myth Buster: Geometric Interpretation of Deep Learning
The Neural Networks Survival Kit for Quants
Matthew Dixon
Illinois Tech, Chicago
September 2018
Overview
Which of the following statements are true?
A Machine learning or AI is different from statistics: many machine learning methods are 'black-boxes'
B The outputs from neural network classifiers are the probabilities of an input belonging to a class
C We need multiple hidden layers in the neural network to capture non-linearity
D Time series models (ARIMA, smoothing, etc.) are completely unrelated to neural networks
E Using tools like TensorFlow, combined with best practices in Silicon Valley, we can be confident that neural network tools work
Traditional Statistical Modeling
Figure: Statistical models assume that a model generated the data. What model generates alternative data?
Stats versus ML
Property        | Statistical Inference                | Supervised Machine Learning
Goal            | Causal models with explanatory power | Prediction performance, often with limited explanatory power
Data            | The data is generated by a model     | The data generation process is unknown
Framework       | Probabilistic                        | Algorithmic
Expressability  | Typically linear                     | Non-linear
Model selection | Based on information criteria        | Numerical optimization
Scalability     | Limited to lower dimensional data    | Scales to higher dimensional input data
Robustness      | Prone to over-fitting                | Designed for out-of-sample performance

Supervised machine learning is a generalization of statistics to more general data representations.
Taxonomy of Most Popular Neural Network Architectures
feed forward | auto-encoder | convolution
recurrent | long / short term memory | neural Turing machines

Figure: Most commonly used deep learning architectures for modeling. Source: http://www.asimovinstitute.org/neural-network-zoo
Geometric Interpretation of Neural Networks
Figure: Decision boundaries in the (x1, x2) plane: no hidden layers, one hidden layer, and two hidden layers.
Geometric Interpretation of Neural Networks
Figure: Classification of the half-moon dataset: no hidden layers, one hidden layer, and two hidden layers.
Why do we need more Neurons?
Geometric Interpretation of Neural Networks
Figure: 25 hidden units, 50 hidden units, 75 hidden units. The number of hidden units is adjusted according to the requirements of the classification problem and can be very high for data sets which are difficult to separate.
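The half-moon experiment above can be reproduced in a few lines. The sketch below uses scikit-learn (the talk does not prescribe a library, so the API choice is an assumption) to contrast a linear classifier with a single hidden layer of ReLU units on the same data:

```python
# Sketch: half-moon classification with and without a hidden layer.
# Library choice (scikit-learn) is an assumption, not from the talk.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

# No hidden layers: a linear decision boundary cannot wrap around the moons
linear_acc = LogisticRegression().fit(X, y).score(X, y)

# One hidden layer of 25 ReLU units bends the boundary around each moon
mlp = MLPClassifier(hidden_layer_sizes=(25,), max_iter=2000, random_state=0)
nonlinear_acc = mlp.fit(X, y).score(X, y)
```

On this dataset the single hidden layer already separates the two moons almost perfectly, while the linear model cannot; this is the geometric point of the figure.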
Geometric Interpretation of Neural Networks
Figure: Hyperplanes defined by three neurons in the hidden layer, each with ReLU activation functions.
• Consider a one-layer neural network binary classifier with 'probabilistic' output:

  P(Y|X) = σ(u) = 1 / (1 + e^{-u}),  u = WX + b,  Y ∈ {0, 1},  X ∈ R^p

• By Bayes' law, we know that our posteriors must be given by the likelihood, prior, and evidence:

  P(Y|X) = P(X|Y)P(Y) / P(X) = 1 / (1 + e^{-(log(P(X|Y)/P(X|Y')) + log(P(Y)/P(Y')))})

• So the outputs are only really 'true' posterior probabilities when the weights and bias are

  w_j = log(P(x_j|Y) / P(x_j|Y')),  ∀j ∈ {1, ..., p},  b = log(P(Y)/P(Y'))

  and the x_j's must be conditionally independent of each other; otherwise the outputs from the network are not posterior probabilities.
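The identity above is easy to verify numerically. The sketch below uses made-up likelihood and prior values (purely illustrative numbers, not from the talk) and checks that the Bayes posterior equals the sigmoid of the log-likelihood ratio plus the log-prior ratio:

```python
import math

# Illustrative (made-up) class-conditional likelihoods and priors
p_x_y1, p_x_y0 = 0.30, 0.10     # P(X|Y), P(X|Y')
p_y1, p_y0 = 0.40, 0.60         # P(Y), P(Y')

# Posterior by Bayes' law: P(Y|X) = P(X|Y)P(Y) / P(X)
posterior = p_x_y1 * p_y1 / (p_x_y1 * p_y1 + p_x_y0 * p_y0)

# Same posterior as a sigmoid of log-likelihood ratio plus log-prior ratio
u = math.log(p_x_y1 / p_x_y0) + math.log(p_y1 / p_y0)
sigmoid = 1.0 / (1.0 + math.exp(-u))

assert abs(posterior - sigmoid) < 1e-12
```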
So Why Deep Learning?
• One hidden layer, with many neurons, provides strong (and likely enough) discriminatory power
• Extra hidden layers are only needed when the inputs are conditionally dependent: extra layers give posteriors even when the inputs are not conditionally independent
• Big data, i.e. high dimensional data, typically has conditionally correlated inputs
• Hence deep learning is a great tool for big and alternative data, but not just for the reasons we've been told
Recurrent Neural Networks (p=5)
Figure: An unfolded recurrent neural network with p = 5 lags. Each input X_s, for s = t−5, ..., t, feeds a hidden layer of units Z^1_s, ..., Z^j_s, ..., Z^H_s, which also receives the previous hidden state; the final hidden state produces the output Y_t.
Non-linear Predictors
• Input-output pairs D = {X_t, Y_t}_{t=1}^N are auto-correlated observations of X and Y at times t = 1, ..., N
• Construct a non-linear time series predictor, Ŷ_t(X_t), of an output, Y, using a high dimensional input matrix of length-(p+1) sub-sequences X_t:

  Ŷ_t = F(X_t) where X_t = seq_p(X_t) = (X_{t−p}, ..., X_t)

• X_{t−j} is the j-th lagged observation of X_t, X_{t−j} = L^j[X_t], for j = 0, ..., p.
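The sub-sequence operator can be sketched as a small NumPy helper (the function name follows the slide's seq_p notation; the implementation is an illustration, not from the talk):

```python
import numpy as np

def seq_p(x, p):
    """Stack the length-(p+1) sub-sequences (X_{t-p}, ..., X_t) for t = p, ..., N-1."""
    x = np.asarray(x)
    return np.stack([x[t - p:t + 1] for t in range(p, len(x))])

X = np.arange(10.0)
subs = seq_p(X, p=3)    # shape (7, 4); first row is (X_0, X_1, X_2, X_3)
```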
Recurrent Neural Networks
• For each time step s = t−p, ..., t, a function F_h generates a hidden state Z_s:

  Z_s = F_h(X_s, Z_{s−1}) := σ(W_h X_s + U_h Z_{s−1} + b_h),  W_h ∈ R^{H×d},  U_h ∈ R^{H×H}

• When the output is continuous, the model output from the final hidden state is given by:

  Ŷ_t = F_y(Z_t) = W_y Z_t + b_y,  W_y ∈ R^{K×H}

• When the output is categorical, the output is given by:

  Ŷ_t = F_y(Z_t) = softmax(W_y Z_t + b_y)

• Goal: find the weight matrices W = (W_h, U_h, W_y) and biases b = (b_h, b_y).
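A minimal forward pass with randomly initialised weights illustrates the recursion (a NumPy sketch; the talk leaves σ generic, so tanh is assumed here, and in practice the weights are fitted rather than random):

```python
import numpy as np

rng = np.random.default_rng(0)
H, d, K, p = 4, 1, 1, 5    # hidden units, input dim, output dim, lags

# Randomly initialised weights (illustrative; normally found by training)
Wh, Uh, bh = rng.normal(size=(H, d)), rng.normal(size=(H, H)), np.zeros(H)
Wy, by = rng.normal(size=(K, H)), np.zeros(K)

def rnn_forward(xs):
    """Z_s = sigma(Wh X_s + Uh Z_{s-1} + bh); continuous output Y_t = Wy Z_t + by."""
    Z = np.zeros(H)
    for x in xs:                           # s = t-p, ..., t
        Z = np.tanh(Wh @ np.atleast_1d(x) + Uh @ Z + bh)
    return Wy @ Z + by

y = rnn_forward(np.ones(p + 1))            # output from the final hidden state
```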
Univariate Example: Recurrent Neural Networks are Non-linear Autoregressive Models
• Consider the univariate time series prediction Ŷ_t = F(X_{t−1}), using the p previous observations {Y_{t−i}}_{i=1}^p.
• This is a special case in which no input is available at time t (since we are predicting it), so we form the hidden states up to Z_{t−1}.
• The simplest case is an RNN with one hidden unit, H = 1, no activation function, and input dimension d = 1.
Univariate Example
• If W_h = U_h = φ, |φ| < 1, W_y = 1 and b_h = b_y = 0,
• then we can show that Ŷ_t = F(X_{t−1}) is a zero-drift auto-regressive, AR(p), model with geometrically decaying weights:

  Z_{t−p} = φ Y_{t−p},
  Z_{t−p+1} = φ(Z_{t−p} + Y_{t−p+1}),
  ...
  Z_{t−1} = φ(Z_{t−2} + Y_{t−1}),
  Ŷ_t = Z_{t−1},

where

  Ŷ_t = (φL + φ²L² + ... + φ^p L^p)[Y_t].
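The unrolled recursion and the closed-form AR(p) weights can be checked to agree numerically (a sketch with arbitrary lag values and an arbitrary φ):

```python
import numpy as np

phi, p = 0.5, 4
rng = np.random.default_rng(1)
lags = rng.normal(size=p)            # (Y_{t-p}, ..., Y_{t-1}), oldest first

# Unrolled linear RNN: Z_s = phi * (Z_{s-1} + Y_s); prediction Yhat_t = Z_{t-1}
Z = 0.0
for y_s in lags:
    Z = phi * (Z + y_s)

# AR(p) with geometrically decaying weights: sum_i phi^i Y_{t-i}
ar = sum(phi ** (p - j) * lags[j] for j in range(p))

assert abs(Z - ar) < 1e-12
```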
Quick Multiple Choice Question
Identify the following correct statements:
1. A linear recurrent neural network with a memory of p lags is an autoregressive model AR(p) with non-parametric error.
2. Recurrent neural networks, as time series models, are guaranteed to be stationary, for any choice of weights.
3. The amount of memory in a shallow recurrent network corresponds to the number of times a single perceptron layer is unfolded.
4. The amount of memory in a deep recurrent network corresponds to the number of perceptron layers.
Quick Multiple Choice Question: Answers
Answer: 1, 3.
High Frequency Trading
Figure: A space-time diagram showing the limit order book. The contemporaneous depth imbalances at each price level, x_{i,t}, are represented by the color scale: red denotes a high value of the depth imbalance and yellow the converse. The limit order book is observed to polarize prior to a price movement.
Predictive Performance Comparisons
Features            Method         Y = -1                    Y = 0                     Y = 1
                                   prec   recall  f1         prec   recall  f1         prec   recall  f1
Liquidity Imbalance Logistic       0.010  0.603   0.019      0.995  0.620   0.764      0.013  0.588   0.025
                    Kalman Filter  0.051  0.540   0.093      0.998  0.682   0.810      0.055  0.557   0.100
                    RNN            0.037  0.636   0.070      0.996  0.673   0.803      0.040  0.613   0.075
Order Flow          Logistic       0.042  0.711   0.079      0.991  0.590   0.740      0.047  0.688   0.088
                    Kalman Filter  0.068  0.594   0.122      0.996  0.615   0.751      0.071  0.661   0.128
                    RNN            0.064  0.739   0.118      0.995  0.701   0.823      0.066  0.728   0.121
Spatio-temporal     Elastic Net    0.063  0.754   0.116      0.986  0.483   0.649      0.058  0.815   0.108
                    RNN            0.084  0.788   0.153      0.999  0.729   0.843      0.075  0.818   0.137
                    FFWD NN        0.066  0.758   0.121      0.999  0.657   0.795      0.065  0.796   0.120
White Noise                        0.004  0.333   0.007      0.993  0.333   0.499      0.003  0.333   0.007
Algorithmic Trading Example: Predicting portfolio returns
• Consider a portfolio of positions in n stocks
• The portfolio is assumed to be equally weighted, so that the returns are

  r_P(t) = Σ_{i=1}^n w_i r_i(t),  w_i = 1/n

• Goal: learn the relationship between all the previous stock returns and the directional change in the portfolio returns
• Key idea: maximize the capacity to predict the next-day portfolio returns given cross-sectional market data
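The equally weighted portfolio return can be sketched in a couple of lines (the stock returns here are simulated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 5, 250
r = rng.normal(0.0005, 0.01, size=(T, n))   # simulated daily stock returns r_i(t)

w = np.full(n, 1.0 / n)                     # equal weights w_i = 1/n
r_P = r @ w                                 # portfolio return r_P(t)
```

With equal weights the portfolio return is just the cross-sectional mean of the stock returns at each date.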
Algorithmic Trading Example: Prediction formulation
• The observed data consists of historical returns {X_t}_{t=1}^T, where X_t = (r_1(t), ..., r_n(t))
• The data is labeled with a categorical variable

  Y_t = 1 if r_P(t+1) > ε;  0 if |r_P(t+1)| ≤ ε;  −1 if r_P(t+1) < −ε

• Goal is to learn the map Y_t = F_{W,b}(X_t)
• ε is a threshold chosen from the training data to balance the observations across the three classes.
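The labeling rule can be sketched directly (the example returns and the threshold value are illustrative, not from the talk):

```python
import numpy as np

def label_returns(r_next, eps):
    """Y_t = 1 if r_P(t+1) > eps, 0 if |r_P(t+1)| <= eps, -1 if r_P(t+1) < -eps."""
    return np.where(r_next > eps, 1, np.where(r_next < -eps, -1, 0))

r_next = np.array([0.02, -0.001, -0.03, 0.004])
y = label_returns(r_next, eps=0.005)        # labels: 1, 0, -1, 0
```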
Algorithmic Trading Example: Performance of logistic regression

            in-sample                           out-of-sample
label       precision  recall  f1-score  support  precision  recall  f1-score  support
-1          0.91       0.88    0.89      1222     0.26       0.32    0.29      82
0           0.90       0.90    0.90      1347     0.38       0.35    0.36      121
1           0.89       0.92    0.91      1271     0.37       0.34    0.35      97
avg / total 0.90       0.90    0.90      3840     0.34       0.34    0.34      300

Table: The bias-variance tradeoff is characterized by the difference between the in-sample and out-of-sample performance.
Algorithmic Trading Example: Performance of deep neural networks

            in-sample                           out-of-sample
label       precision  recall  f1-score  support  precision  recall  f1-score  support
-1          0.94       0.95    0.94      1222     0.35       0.35    0.35      82
0           0.97       0.93    0.95      1347     0.44       0.52    0.48      121
1           0.93       0.94    0.93      1271     0.41       0.32    0.36      97
avg / total 0.95       0.94    0.94      3840     0.41       0.41    0.41      300

Table: The bias-variance tradeoff is characterized by the difference between the in-sample and out-of-sample performance.
Summary
A Neural networks aren't themselves 'black-boxes', although they do treat the data generation process as a black box
B The outputs from neural network classifiers are only probabilities if the features are conditionally independent (or there are enough layers)
C One layer is typically sufficient to capture the non-linearity in most financial applications (but multiple layers are needed for probabilistic output)
D Recurrent neural networks are non-parametric, non-linear extensions of classical time series methods
E TensorFlow doesn't check that fitted recurrent neural networks are stationary
References
• M.F. Dixon, A High Frequency Trade Execution Model for Supervised Learning, High Frequency, arXiv:1710.03870, 2018.
• M.F. Dixon, N. Polson and V. Sokolov, Deep Learning for Spatio-Temporal Modeling: Dynamic Traffic Flows and High Frequency Trading, Applied Stochastic Models in Business and Industry, arXiv:1705.09851, 2018.
• M.F. Dixon, Sequence Classification of the Limit Order Book using Recurrent Neural Networks, J. Computational Science, Special Issue on Topics in Computational and Algorithmic Finance, arXiv:1707.05642, 2018.
Extra Slides
Autoregressive Processes AR(p)
• The p-th order autoregressive process of a variable y depends only on the previous values of the variable plus a white noise disturbance term:

  y_t = μ + Σ_{i=1}^p φ_i y_{t−i} + u_t

• Defining the polynomial function φ(L) := 1 − φ_1 L − φ_2 L² − ... − φ_p L^p, the AR(p) process can be expressed in the more compact form

  φ(L) y_t = μ + u_t

Key point: an AR(p) process has a geometrically decaying acf.
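The geometric decay of the acf is easy to see on a simulated AR(1) (a sketch; φ = 0.8 and the sample size are arbitrary choices): the sample autocorrelations at lags 1, 2, 3 sit near φ, φ², φ³.

```python
import numpy as np

rng = np.random.default_rng(3)
phi, n = 0.8, 200_000

# Simulate a zero-drift AR(1): y_t = phi * y_{t-1} + u_t
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + rng.normal()

def acf(x, k):
    """Sample autocorrelation at lag k."""
    x = x - x.mean()
    return (x[:-k] @ x[k:]) / (x @ x)

rhos = [acf(y, k) for k in (1, 2, 3)]   # close to phi, phi**2, phi**3
```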
Autoregressive Classifiers
• Suppose we have conditionally i.i.d. Bernoulli r.v.s X_t with p_t := P[X_t = 1 | Ω_t] representing a binary event
• E[X_t | Ω_t] = 0·(1 − p_t) + 1·p_t = p_t
• V[X_t | Ω_t] = p_t(1 − p_t)
• Under an ARMA model:

  ln(p_t / (1 − p_t)) = φ^{−1}(L)(μ + θ(L) u_t)
High Frequency Trading Example: Prediction Model
• The response is

  Y = Δp_t^{t+h}   (1)

• Δp_t^{t+h} is the forecast of discrete mid-price changes from time t to t + h, given measurement of the predictors up to time t.
• The predictors are embedded:

  x = x_t = vec([x_{1,t−k} ... x_{1,t}; ... ; x_{n,t−k} ... x_{n,t}])   (2)

• n is the number of quoted price levels, k is the number of lagged observations, and x_{i,t} ∈ [0, 1] is the relative depth, representing liquidity imbalance, at quote level i:

  x_{i,t} = d^a_{i,t} / (d^a_{i,t} + d^b_{i,t})   (3)
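Equation (3) and the embedding in (2) can be sketched as follows (the depths are simulated, and row-major ordering for the vec operator is an assumption):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 3, 2                                  # quoted price levels, lags

# Simulated ask and bid depths d^a_{i,t}, d^b_{i,t} over k+1 time points
d_ask = rng.integers(1, 100, size=(n, k + 1)).astype(float)
d_bid = rng.integers(1, 100, size=(n, k + 1)).astype(float)

x_levels = d_ask / (d_ask + d_bid)           # x_{i,t} in [0, 1], eq. (3)
x = x_levels.ravel()                         # vec of the n x (k+1) matrix, eq. (2)
```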
Algorithmic Trading Example: Strategy
• A simple strategy S(Y_t) chooses whether to hold a long, short or neutral position in all stocks over the next period:

  w_{t+1} = S(Y_t) = 1/n if Y_t = 1;  0 if Y_t = 0;  −1/n if Y_t = −1
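Since Y_t ∈ {−1, 0, 1}, the three cases collapse to the single expression Y_t / n per stock, which a sketch makes explicit:

```python
import numpy as np

def strategy_weights(y_t, n):
    """w_{t+1} = S(Y_t): 1/n long per stock if Y_t = 1, flat if 0, -1/n short if -1."""
    return np.full(n, y_t / n)

w = strategy_weights(1, n=5)     # fully long, equally weighted
```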
Algorithmic Trading Example: Portfolio returns
     r_P    r_GSPC
µ    0.064  1.330
σ    0.016  1.317

The selected portfolio consists of a subset of the S&P 500 and outperforms it over a 15-year period.
Algorithmic Trading Example: Data preparation
• Use all symbols of the S&P 500 on 2013-7-3.
• Extract historical daily adjusted close prices from Yahoo Finance from 1998-1-2 to 2013-7-3.
• Remove symbols which have missing prices rather than truncate the time series.
• Use a training horizon of 3840 days, a test horizon of 30 days, and retrain every 30 days over 300 test observations.
• Avoid look-ahead bias by only normalizing the input with moment estimates from the training data.
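The last point can be sketched as follows: the normalization moments are estimated on the training window only and then applied to both windows, so no information from the test period leaks into the inputs.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(400, 10))               # simulated feature matrix
train, test = X[:300], X[300:]

# Moments come from the training window only, preventing look-ahead bias
mu, sigma = train.mean(axis=0), train.std(axis=0)
train_n = (train - mu) / sigma
test_n = (test - mu) / sigma
```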