
The Neural Networks Survival Kit for Quants

Matthew Dixon

Illinois Tech, Chicago

September 2018


Overview

Which of the following statements are true?

A. Machine learning or AI is different from statistics: many machine learning methods are 'black boxes'

B. The outputs from neural network classifiers are the probabilities of an input belonging to a class

C. We need multiple hidden layers in the neural network to capture non-linearity

D. Time series modeling (ARIMA, smoothing, etc.) is completely unrelated to neural networks

E. Using tools like TensorFlow, combined with best practices in Silicon Valley, we can be confident that neural network tools work


Traditional Statistical Modeling

Figure: Statistical models assume that a model generated the data. What model generates alternative data?


Stats versus ML

Property        | Statistical Inference                 | Supervised Machine Learning
Goal            | Causal models with explanatory power  | Prediction performance, often with limited explanatory power
Data            | The data is generated by a model      | The data generation process is unknown
Framework       | Probabilistic                         | Algorithmic
Expressibility  | Typically linear                      | Non-linear
Model selection | Based on information criteria         | Numerical optimization
Scalability     | Limited to lower-dimensional data     | Scales to higher-dimensional input data
Robustness      | Prone to over-fitting                 | Designed for out-of-sample performance

Supervised machine learning is a generalization of statistics to more general data representations.


Taxonomy of Most Popular Neural Network Architectures

Figure: Most commonly used deep learning architectures for modeling: feed-forward, auto-encoder, convolutional, recurrent, long short-term memory, and neural Turing machines. Source: http://www.asimovinstitute.org/neural-network-zoo


Geometric Interpretation of Neural Networks

Figure: Decision boundaries in the (x1, x2) plane for classifiers with no hidden layers, one hidden layer, and two hidden layers.


Geometric Interpretation of Neural Networks

Half-Moon Dataset

Figure: Decision boundaries on the half-moon dataset for networks with no hidden layers, one hidden layer, and two hidden layers.


Why Do We Need More Neurons?

Figure: Input data in the (x1, x2) plane.


Geometric Interpretation of Neural Networks

Figure: Decision boundaries for networks with 25, 50 and 75 hidden units. The number of hidden units is adjusted according to the requirements of the classification problem and can be very high for data sets which are difficult to separate.
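For intuition, the half-moon experiment is easy to reproduce. The sketch below is my own code, not part of the original deck, and assumes scikit-learn is available; it fits one-hidden-layer classifiers of increasing width and reports the training accuracy.

```python
# Sketch (assumes scikit-learn): fit one-hidden-layer classifiers of
# increasing width on the half-moon data to see how many hidden units
# are needed before the two classes are cleanly separated.
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)
for width in (2, 25, 50, 75):
    clf = MLPClassifier(hidden_layer_sizes=(width,), max_iter=2000, random_state=0)
    clf.fit(X, y)
    print(width, "hidden units -> training accuracy", round(clf.score(X, y), 3))
```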


Geometric Interpretation of Neural Networks

Figure: Hyperplanes defined by three neurons in the hidden layer, each with a ReLU activation function.


• Consider a one-layer neural network binary classifier with 'probabilistic' output:

  P(Y | X) = σ(u) = 1 / (1 + e^{−u}),   u = WX + b,   Y ∈ {0, 1},   X ∈ R^p

• By Bayes' Law, we know that our posteriors must be given by the likelihood, prior and evidence:

  P(Y | X) = P(X | Y) P(Y) / P(X) = 1 / ( 1 + e^{−( log(P(X | Y) / P(X | Y′)) + log(P(Y) / P(Y′)) )} )

• So the outputs are only really 'true' posterior probabilities when the weights and bias are

  w_j = log( P(x_j | Y) / P(x_j | Y′) ),  for all j ∈ {1, ..., p},   b = log( P(Y) / P(Y′) )

  The x_j's must be conditionally independent of each other; otherwise the outputs from the network are not posterior probabilities.
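As a concrete check of this claim, here is a small numerical sketch (my own code, not from the slides): it builds a naive Bayes model with conditionally independent Bernoulli features, computes P(Y = 1 | x) directly from Bayes' Law, and confirms it equals a one-layer sigmoid classifier σ(w·x + b). Note that for binary features the exact weights also pick up a complement term log((1−θ1)/(1−θ0)), which the sketch includes.

```python
# Sketch: for conditionally independent Bernoulli features, the Bayes
# posterior P(Y=1|x) is exactly a sigmoid of a linear score w.x + b,
# i.e. a one-layer 'neural network' classifier.
import numpy as np

rng = np.random.default_rng(0)
p = 4                                   # number of features
prior = 0.3                             # P(Y = 1)
theta1 = rng.uniform(0.1, 0.9, p)       # P(x_j = 1 | Y = 1)
theta0 = rng.uniform(0.1, 0.9, p)       # P(x_j = 1 | Y = 0)

# Exact logistic weights implied by the naive-Bayes factorisation
w = np.log(theta1 / theta0) - np.log((1 - theta1) / (1 - theta0))
b = np.log(prior / (1 - prior)) + np.log((1 - theta1) / (1 - theta0)).sum()

def posterior_bayes(x):
    """P(Y=1 | x) computed directly from Bayes' Law."""
    lik1 = np.prod(theta1**x * (1 - theta1)**(1 - x)) * prior
    lik0 = np.prod(theta0**x * (1 - theta0)**(1 - x)) * (1 - prior)
    return lik1 / (lik1 + lik0)

def posterior_sigmoid(x):
    """Same posterior written as a one-layer sigmoid classifier."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

for x in (rng.integers(0, 2, p) for _ in range(5)):
    assert np.isclose(posterior_bayes(x), posterior_sigmoid(x))
print("sigmoid(Wx + b) reproduces the Bayes posterior under conditional independence")
```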


So Why Deep Learning?

• One hidden layer, with many neurons, provides strong (and likely enough) discriminatory power

• Extra hidden layers are only needed when the inputs are conditionally dependent: extra layers give posteriors even when the inputs are not conditionally independent

• Big data, i.e. high-dimensional data, typically has conditionally correlated inputs

• Hence deep learning is a great tool for big and alternative data, but not just for the reasons we've been told


Recurrent Neural Networks (p=5)

Figure: A recurrent neural network unfolded over p = 5 time steps. At each step s = t−5, ..., t the input X_s feeds the hidden units Z^1_s, ..., Z^j_s, ..., Z^H_s, and the final hidden state Z_t produces the output Y_t.


Non-linear Predictors

• Input-output pairs D = {X_t, Y_t}_{t=1}^N are auto-correlated observations of X and Y at times t = 1, ..., N

• Construct a non-linear time series predictor, Y_t(X_t), of an output, Y, using a high-dimensional input of length-(p + 1) sub-sequences X_t:

  Y_t = F(X_t), where X_t = seq_p(X_t) = (X_{t−p}, ..., X_t)

• X_{t−j} is the j-th lagged observation of X_t, X_{t−j} = L^j[X_t], for j = 0, ..., p (a data-preparation sketch follows below).
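To make the sub-sequence construction concrete, here is a minimal sketch (my own helper, with made-up names, not from the slides) that windows a univariate series into length-(p + 1) inputs and next-step targets.

```python
# Minimal sketch: build the length-(p+1) sub-sequences X_t = (X_{t-p}, ..., X_t)
# used as inputs, with the next observation as the prediction target.
import numpy as np

def make_sequences(x, p):
    """Return (inputs, targets): inputs[i] = (x[i], ..., x[i+p]), targets[i] = x[i+p+1]."""
    inputs = np.stack([x[i : i + p + 1] for i in range(len(x) - p - 1)])
    targets = x[p + 1 :]
    return inputs, targets

x = np.arange(10.0)                 # toy univariate series
X, y = make_sequences(x, p=3)
print(X.shape, y.shape)             # (6, 4) (6,)
print(X[0], y[0])                   # [0. 1. 2. 3.] 4.0
```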


Recurrent Neural Networks

• For each time step s = t − p, ..., t, a function F_h generates a hidden state Z_s:

  Z_s = F_h(X_s, Z_{s−1}) := σ(W_h X_s + U_h Z_{s−1} + b_h),   W_h ∈ R^{H×d},  U_h ∈ R^{H×H}

• When the output is continuous, the model output from the final hidden state is given by:

  Y_t = F_y(Z_t) = W_y Z_t + b_y,   W_y ∈ R^{K×H}

• When the output is categorical, the output is given by

  Y_t = F_y(Z_t) = softmax(W_y Z_t + b_y)

• Goal: find the weight matrices W = (W_h, U_h, W_y) and biases b = (b_h, b_y) (a minimal forward-pass sketch follows below).
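The forward pass described above can be written in a few lines of numpy. The sketch below is my own illustration, not the author's implementation; it assumes a tanh activation and a continuous output.

```python
# Minimal sketch of the RNN forward pass: hidden recursion
# Z_s = sigma(W_h X_s + U_h Z_{s-1} + b_h), then linear read-out Y_t = W_y Z_t + b_y.
import numpy as np

def rnn_forward(X_seq, W_h, U_h, b_h, W_y, b_y):
    """X_seq has shape (p + 1, d); returns the prediction from the final hidden state."""
    H = W_h.shape[0]
    Z = np.zeros(H)                              # initial hidden state set to zero
    for X_s in X_seq:                            # s = t - p, ..., t
        Z = np.tanh(W_h @ X_s + U_h @ Z + b_h)   # hidden state update
    return W_y @ Z + b_y                         # continuous output Y_t

rng = np.random.default_rng(0)
d, H, K, p = 3, 8, 1, 5
params = (rng.normal(size=(H, d)), rng.normal(size=(H, H)) * 0.1,
          np.zeros(H), rng.normal(size=(K, H)), np.zeros(K))
print(rnn_forward(rng.normal(size=(p + 1, d)), *params))
```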


Univariate Example: Recurrent Neural Networks are Non-linear Autoregressive Models

• Consider the univariate time series prediction Y_t = F(X_{t−1}), using the p previous observations {Y_{t−i}}_{i=1}^p.

• Because this is a special case in which no input is available at time t (since we are predicting it), we form the hidden states only up to Z_{t−1}.

• Consider the simplest case of an RNN with one hidden unit, H = 1, no activation function, and input dimension d = 1.


Univariate Example

• If W_h = U_h = φ with |φ| < 1, W_y = 1 and b_h = b_y = 0,

• then we can show that the predictor Ŷ_t = F(X_{t−1}) is a zero-drift auto-regressive, AR(p), model with geometrically decaying weights (verified numerically in the sketch below):

  Z_{t−p} = φ Y_{t−p},
  Z_{t−p+1} = φ(Z_{t−p} + Y_{t−p+1}),
  ...
  Z_{t−1} = φ(Z_{t−2} + Y_{t−1}),
  Ŷ_t = Z_{t−1},

  where

  Ŷ_t = (φL + φ²L² + ... + φ^p L^p)[Y_t].
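The derivation is easy to verify numerically; the sketch below (my own code) unrolls the linear recursion and checks it against the geometrically decaying AR weights.

```python
# Sketch: check that a linear RNN with W_h = U_h = phi, W_y = 1 and zero
# biases reproduces the AR(p) prediction Y_hat_t = sum_{j=1}^p phi^j * Y_{t-j}.
import numpy as np

def linear_rnn_prediction(y_lags, phi):
    """y_lags = (Y_{t-p}, ..., Y_{t-1}); unrolls Z_s = phi * (Z_{s-1} + Y_s)."""
    Z = 0.0
    for y in y_lags:
        Z = phi * (Z + y)
    return Z                                    # Y_hat_t = Z_{t-1}

rng = np.random.default_rng(1)
phi, p = 0.7, 5
y_lags = rng.normal(size=p)                     # (Y_{t-p}, ..., Y_{t-1})
ar_weights = phi ** np.arange(p, 0, -1)         # (phi^p, ..., phi^1), aligned with y_lags
assert np.isclose(linear_rnn_prediction(y_lags, phi), ar_weights @ y_lags)
```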


Quick Multiple Choice Question

Identify which of the following statements are correct:

1. A linear recurrent neural network with a memory of p lags is an autoregressive model AR(p) with non-parametric error.

2. Recurrent neural networks, as time series models, are guaranteed to be stationary, for any choice of weights.

3. The amount of memory in a shallow recurrent network corresponds to the number of times a single perceptron layer is unfolded.

4. The amount of memory in a deep recurrent network corresponds to the number of perceptron layers.


Quick Multiple Choice Question: Answers

Identify which of the following statements are correct:

1. A linear recurrent neural network with a memory of p lags is an autoregressive model AR(p) with non-parametric error.

2. Recurrent neural networks, as time series models, are guaranteed to be stationary, for any choice of weights.

3. The amount of memory in a shallow recurrent network corresponds to the number of times a single perceptron layer is unfolded.

4. The amount of memory in a deep recurrent network corresponds to the number of perceptron layers.

Answer: 1, 3.


High Frequency Trading

Figure: A space-time diagram showing the limit order book. The contemporaneous depth imbalances at each price level, x_{i,t}, are represented by the color scale: red denotes a high value of the depth imbalance and yellow the converse. The limit order book is observed to polarize prior to a price movement.


Predictive Performance Comparisons

Features            | Method        | Y = -1: precision / recall / f1 | Y = 0: precision / recall / f1 | Y = 1: precision / recall / f1
Liquidity Imbalance | Logistic      | 0.010 / 0.603 / 0.019           | 0.995 / 0.620 / 0.764          | 0.013 / 0.588 / 0.025
Liquidity Imbalance | Kalman Filter | 0.051 / 0.540 / 0.093           | 0.998 / 0.682 / 0.810          | 0.055 / 0.557 / 0.100
Liquidity Imbalance | RNN           | 0.037 / 0.636 / 0.070           | 0.996 / 0.673 / 0.803          | 0.040 / 0.613 / 0.075
Order Flow          | Logistic      | 0.042 / 0.711 / 0.079           | 0.991 / 0.590 / 0.740          | 0.047 / 0.688 / 0.088
Order Flow          | Kalman Filter | 0.068 / 0.594 / 0.122           | 0.996 / 0.615 / 0.751          | 0.071 / 0.661 / 0.128
Order Flow          | RNN           | 0.064 / 0.739 / 0.118           | 0.995 / 0.701 / 0.823          | 0.066 / 0.728 / 0.121
Spatio-temporal     | Elastic Net   | 0.063 / 0.754 / 0.116           | 0.986 / 0.483 / 0.649          | 0.058 / 0.815 / 0.108
Spatio-temporal     | RNN           | 0.084 / 0.788 / 0.153           | 0.999 / 0.729 / 0.843          | 0.075 / 0.818 / 0.137
Spatio-temporal     | FFWD NN       | 0.066 / 0.758 / 0.121           | 0.999 / 0.657 / 0.795          | 0.065 / 0.796 / 0.120
White Noise         |               | 0.004 / 0.333 / 0.007           | 0.993 / 0.333 / 0.499          | 0.003 / 0.333 / 0.007


Algorithmic Trading Example: Predicting portfolio returns

• Consider a portfolio of positions in n stocks

• The portfolio is assumed to be equally weighted, so that the returns are

  r_P(t) = Σ_{i=1}^n w_i r_i(t),   w_i = 1/n

• Goal: learn the relationship between all the previous stock returns and the directional change in the portfolio returns

• The key idea is to maximize the capacity to predict the next-day portfolio returns given cross-sectional market data


Algorithmic Trading Example: Prediction formulation

• The observed data consists of historical returns {X_t}_{t=1}^T, where X_t = (r_1(t), ..., r_n(t))

• The data is labeled with a categorical variable

  Y_t =   1,  if r_P(t + 1) > ε
          0,  if |r_P(t + 1)| ≤ ε
         −1,  if r_P(t + 1) < −ε

• The goal is to learn the map Y_t = F_{W,b}(X_t)

• ε is a threshold chosen from the training data to balance the observations (a labeling sketch is shown below).
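A minimal labeling sketch follows. The helper names are my own, and it assumes roughly symmetric returns so that the 1/3 quantile of |r_P| balances the three classes; the slides do not specify how ε was chosen in practice.

```python
# Sketch: label next-period portfolio returns into {-1, 0, 1} with a threshold
# eps chosen on the training data so the three classes are roughly balanced.
import numpy as np

def choose_eps(train_returns):
    """eps = 1/3 quantile of |r|: the middle class and each tail get ~1/3 of the data."""
    return np.quantile(np.abs(train_returns), 1.0 / 3.0)

def label(next_returns, eps):
    y = np.zeros(len(next_returns), dtype=int)
    y[next_returns > eps] = 1
    y[next_returns < -eps] = -1
    return y

rng = np.random.default_rng(2)
r_next = rng.normal(scale=0.01, size=1000)       # toy next-day portfolio returns
eps = choose_eps(r_next)
y = label(r_next, eps)
print(eps, np.bincount(y + 1) / len(y))          # roughly [1/3, 1/3, 1/3]
```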


Algorithmic Trading Example: Performance of logistic regression

label       | in-sample: precision / recall / f1 / support | out-of-sample: precision / recall / f1 / support
-1          | 0.91 / 0.88 / 0.89 / 1222                    | 0.26 / 0.32 / 0.29 / 82
0           | 0.90 / 0.90 / 0.90 / 1347                    | 0.38 / 0.35 / 0.36 / 121
1           | 0.89 / 0.92 / 0.91 / 1271                    | 0.37 / 0.34 / 0.35 / 97
avg / total | 0.90 / 0.90 / 0.90 / 3840                    | 0.34 / 0.34 / 0.34 / 300

Table: The bias-variance tradeoff is characterized by the difference between the in-sample and out-of-sample performance.


Algorithmic Trading Example: Performance of deep neural networks

label       | in-sample: precision / recall / f1 / support | out-of-sample: precision / recall / f1 / support
-1          | 0.94 / 0.95 / 0.94 / 1222                    | 0.35 / 0.35 / 0.35 / 82
0           | 0.97 / 0.93 / 0.95 / 1347                    | 0.44 / 0.52 / 0.48 / 121
1           | 0.93 / 0.94 / 0.93 / 1271                    | 0.41 / 0.32 / 0.36 / 97
avg / total | 0.95 / 0.94 / 0.94 / 3840                    | 0.41 / 0.41 / 0.41 / 300

Table: The bias-variance tradeoff is characterized by the difference between the in-sample and out-of-sample performance.


Summary

A. Neural networks aren't themselves 'black boxes', although they do treat the data generation process as a black box

B. The outputs from neural network classifiers are only probabilities if the features are conditionally independent (or there are enough layers)

C. One layer is typically sufficient to capture the non-linearity in most financial applications (but multiple layers are needed for probabilistic output)

D. Recurrent neural networks are non-parametric, non-linear extensions of classical time series methods

E. TensorFlow doesn't check that fitted recurrent neural networks are stationary (a simple check is sketched below)
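On the last point, a simple diagnostic one can run by hand is sketched below. This is my own suggestion rather than anything from the slides: it uses the spectral radius of the recurrent weight matrix as a stability proxy, which is conservative for saturating activations such as tanh.

```python
# Sketch of a stationarity check a framework won't do for you: for the recursion
# Z_s = sigma(W_h X_s + U_h Z_{s-1} + b_h), a spectral radius of U_h below 1 is
# a simple proxy for a stable (non-exploding) hidden state.
import numpy as np

def spectral_radius(U_h):
    return np.max(np.abs(np.linalg.eigvals(U_h)))

U_h = np.random.default_rng(3).normal(scale=0.2, size=(8, 8))  # stand-in for a fitted recurrent weight
rho = spectral_radius(U_h)
print(f"spectral radius = {rho:.3f} -> {'stable' if rho < 1 else 'potentially non-stationary'}")
```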


References

• M.F. Dixon, A High Frequency Trade Execution Model for Supervised Learning, High Frequency, arXiv:1710.03870, 2018.

• M.F. Dixon, N. Polson and V. Sokolov, Deep Learning for Spatio-Temporal Modeling: Dynamic Traffic Flows and High Frequency Trading, Applied Stochastic Models in Business and Industry, arXiv:1705.09851, 2018.

• M.F. Dixon, Sequence Classification of the Limit Order Book using Recurrent Neural Networks, Journal of Computational Science, Special Issue on Topics in Computational and Algorithmic Finance, arXiv:1707.05642, 2018.


Extra Slides

Autoregressive Processes AR(p)

• The p-th order autoregressive process of a variable y depends only on the previous values of the variable plus a white noise disturbance term

  y_t = μ + Σ_{i=1}^p φ_i y_{t−i} + u_t

• Defining the polynomial function φ(L) := (1 − φ_1 L − φ_2 L² − ... − φ_p L^p), the AR(p) process can be expressed in the more compact form

  φ(L) y_t = μ + u_t

Key point: an AR(p) process has a geometrically decaying autocorrelation function (ACF), as the sketch below illustrates.
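A quick simulation (my own sketch) makes the geometric decay visible by comparing the sample ACF of an AR(1) with φ^k.

```python
# Sketch: simulate a stationary AR(1) and check that its sample ACF decays
# geometrically (approximately phi^k).
import numpy as np

rng = np.random.default_rng(8)
phi, T = 0.8, 20000
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi * y[t - 1] + rng.normal()        # AR(1): y_t = phi*y_{t-1} + u_t

def acf(y, k):
    y = y - y.mean()
    return (y[k:] @ y[:-k]) / (y @ y)

for k in (1, 2, 3, 4):
    print(k, round(acf(y, k), 3), "vs", round(phi**k, 3))
```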


Autoregressive Classifiers

• Suppose we have conditionally i.i.d. Bernoulli r.v.s X_t with p_t := P[X_t = 1 | Ω_t] representing a binary event

• E[X_t | Ω_t] = 0 · (1 − p_t) + 1 · p_t = p_t

• V[X_t | Ω_t] = p_t (1 − p_t)

• Under an ARMA model:

  ln( p_t / (1 − p_t) ) = φ^{−1}(L)(μ + θ(L) u_t)
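A small simulation sketch (my own code, not from the slides) illustrates the idea using an AR(1) on the log-odds, the simplest special case of the ARMA form above, and then draws the Bernoulli events.

```python
# Sketch: let the log-odds follow an AR(1) process and draw a Bernoulli
# event X_t from the implied probability p_t at each step.
import numpy as np

rng = np.random.default_rng(4)
T, mu, phi, sigma = 500, 0.0, 0.9, 0.3
logit = np.zeros(T)
for t in range(1, T):
    logit[t] = mu + phi * logit[t - 1] + sigma * rng.normal()   # AR(1) log-odds
p = 1.0 / (1.0 + np.exp(-logit))                                # p_t = P[X_t = 1 | Omega_t]
X = rng.binomial(1, p)                                          # binary events
print(p[:5].round(3), X[:5])
```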


High Frequency Trading Example: Prediction Model

• The response is

  Y = Δp_{t,t+h}    (1)

• Δp_{t,t+h} is the forecast of discrete mid-price changes from time t to t + h, given measurement of the predictors up to time t.

• The predictors are embedded as

  x = x_t = vec( [ x_{1,t−k} ... x_{1,t} ; ... ; x_{n,t−k} ... x_{n,t} ] )    (2)

• n is the number of quoted price levels, k is the number of lagged observations, and x_{i,t} ∈ [0, 1] is the relative depth, representing liquidity imbalance, at quote level i:

  x_{i,t} = d^a_{i,t} / (d^a_{i,t} + d^b_{i,t}).    (3)
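To make the feature construction concrete, here is a toy sketch (my own arrays and names, not the paper's data pipeline) that computes the relative depths of (3) and stacks the lagged levels as in (2).

```python
# Sketch: build liquidity-imbalance features x_{i,t} = d^a_{i,t} / (d^a_{i,t} + d^b_{i,t})
# at each of n quote levels and stack the last k+1 observations into the vectorised input.
import numpy as np

rng = np.random.default_rng(5)
n, k, T = 5, 3, 100                         # price levels, lags, observations
ask_depth = rng.integers(1, 100, size=(n, T)).astype(float)
bid_depth = rng.integers(1, 100, size=(n, T)).astype(float)

x = ask_depth / (ask_depth + bid_depth)     # relative depth in [0, 1], eq. (3)

t = 50
x_t = x[:, t - k : t + 1].reshape(-1)       # vec of the n x (k+1) block, eq. (2)
print(x_t.shape)                            # (n * (k + 1),) = (20,)
```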


Algorithmic Trading Example: Strategy

• A simple strategy S(Y_t) chooses whether to hold a long, short or neutral position in all stocks over the next period (a sketch follows below):

  w_{t+1} = S(Y_t) =   1/n,  if Y_t = 1
                       0,    if Y_t = 0
                      −1/n,  if Y_t = −1
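A minimal sketch of the strategy mechanics (my own toy data; the predictions are random stand-ins for the model output) is:

```python
# Sketch: map the predicted label into next-period weights and accumulate
# the resulting strategy returns (equal weights across the n stocks).
import numpy as np

def strategy_weights(y_pred, n):
    """w_{t+1} = +1/n, 0 or -1/n per stock depending on the predicted label."""
    return (y_pred / n) * np.ones(n)

rng = np.random.default_rng(6)
n, T = 10, 250
returns = rng.normal(scale=0.01, size=(T, n))        # toy per-stock returns r_i(t+1)
labels = rng.choice([-1, 0, 1], size=T)              # stand-in for model predictions Y_t
strat = np.array([strategy_weights(y, n) @ r for y, r in zip(labels, returns)])
print("mean daily strategy return:", strat.mean())
```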


Algorithmic Trading Example: Portfolio returns

    | r_P   | r_GSPC
μ   | 0.064 | 1.330
σ   | 0.016 | 1.317

The selected portfolio consists of a subset of the S&P 500 and outperforms it over a 15-year period.


Algorithmic Trading Example: Data preparation

• Use all symbols of the S&P 500 on 2013-7-3.

• Extract historical daily adjusted close prices from Yahoo Finance from 1998-1-2 to 2013-7-3.

• Remove symbols which have missing prices rather than truncate the time series.

• Use a training horizon of 3840 days, a test horizon of 30 days, and retrain every 30 days over 300 test observations.

• Avoid look-ahead bias by normalizing the inputs only with moment estimates from the training data (sketched below).
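The last point is worth spelling out; a minimal sketch (my own helper, toy data) estimates the scaling moments on the training window only and then applies the same transform to the test window.

```python
# Sketch of look-ahead-safe normalisation: fit the mean and standard deviation
# on the training window only, then reuse them on the test window.
import numpy as np

def fit_scaler(train):
    mu, sd = train.mean(axis=0), train.std(axis=0)
    return lambda x: (x - mu) / sd             # transform using training moments only

rng = np.random.default_rng(7)
returns = rng.normal(size=(3870, 450))         # toy daily returns (T x n_symbols)
train, test = returns[:3840], returns[3840:]   # 3840-day training, 30-day test window
scale = fit_scaler(train)
X_train, X_test = scale(train), scale(test)    # no test-set statistics leak into training
```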