Myth Buster: Geometric Interpretation of Deep Learning
The Neural Networks Survival Kit for Quants
Matthew Dixon
Illinois Tech, Chicago
September 2018
Overview
Which of the following statements are true?
A Machine learning or AI is different from statistics: many machine learning methods are 'black-boxes'
B The outputs from neural network classifiers are the probabilities of an input belonging to a class
C We need multiple hidden layers in the neural network to capture non-linearity
D Time series models (ARIMA, smoothing, etc.) are completely unrelated to neural networks
E Using tools like TensorFlow, combined with best practices in Silicon Valley, we can be confident that neural network tools work
Traditional Statistical Modeling
Figure: Statistical models assume that a model generated the data. What model generates alternative data?
Stats versus ML
Property        | Statistical Inference                | Supervised Machine Learning
Goal            | Causal models with explanatory power | Prediction performance, often with limited explanatory power
Data            | The data is generated by a model     | The data generation process is unknown
Framework       | Probabilistic                        | Algorithmic
Expressability  | Typically linear                     | Non-linear
Model selection | Based on information criteria        | Numerical optimization
Scalability     | Limited to lower dimensional data    | Scales to higher dimensional input data
Robustness      | Prone to over-fitting                | Designed for out-of-sample performance

Supervised machine learning is a generalization of statistics to more general data representations.
Taxonomy of Most Popular Neural Network Architectures
feed forward | auto-encoder | convolution
recurrent | long / short term memory | neural Turing machines

Figure: Most commonly used deep learning architectures for modeling. Source: http://www.asimovinstitute.org/neural-network-zoo
Geometric Interpretation of Neural Networks
Figure: Decision boundaries in the (x1, x2) plane: no hidden layers, one hidden layer, and two hidden layers.
Geometric Interpretation of Neural Networks
Figure: Classification of the half-moon dataset: no hidden layers, one hidden layer, and two hidden layers.
Why do we need more Neurons?
Geometric Interpretation of Neural Networks
Figure: 25 hidden units, 50 hidden units, 75 hidden units. The number of hidden units is adjusted according to the requirements of the classification problem and can be very high for data sets which are difficult to separate.
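The half-moon experiment above can be reproduced in a few lines. The sketch below uses scikit-learn (the talk does not prescribe a library, so the API choice is an assumption) to contrast a linear classifier with a single hidden layer of ReLU units on the same data:

```python
# Sketch: half-moon classification with and without a hidden layer.
# Library choice (scikit-learn) is an assumption, not from the talk.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

# No hidden layers: a linear decision boundary cannot wrap around the moons
linear_acc = LogisticRegression().fit(X, y).score(X, y)

# One hidden layer of 25 ReLU units bends the boundary around each moon
mlp = MLPClassifier(hidden_layer_sizes=(25,), max_iter=2000, random_state=0)
nonlinear_acc = mlp.fit(X, y).score(X, y)
```

On this dataset the single hidden layer already separates the two moons almost perfectly, while the linear model cannot; this is the geometric point of the figure.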
Geometric Interpretation of Neural Networks
Figure: Hyperplanes defined by three neurons in the hidden layer, each with ReLU activation functions.
• Consider a one-layer neural network binary classifier with 'probabilistic' output:

  P(Y|X) = σ(u) = 1 / (1 + e^{-u}),  u = WX + b,  Y ∈ {0, 1},  X ∈ R^p

• By Bayes' law, we know that our posteriors must be given by the likelihood, prior, and evidence:

  P(Y|X) = P(X|Y)P(Y) / P(X) = 1 / (1 + e^{-(log(P(X|Y)/P(X|Y')) + log(P(Y)/P(Y')))})

• So the outputs are only really 'true' posterior probabilities when the weights and bias are

  w_j = log(P(x_j|Y) / P(x_j|Y')),  ∀j ∈ {1, ..., p},  b = log(P(Y)/P(Y'))

  and the x_j's must be conditionally independent of each other; otherwise the outputs from the network are not posterior probabilities.
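The identity above is easy to verify numerically. The sketch below uses made-up likelihood and prior values (purely illustrative numbers, not from the talk) and checks that the Bayes posterior equals the sigmoid of the log-likelihood ratio plus the log-prior ratio:

```python
import math

# Illustrative (made-up) class-conditional likelihoods and priors
p_x_y1, p_x_y0 = 0.30, 0.10     # P(X|Y), P(X|Y')
p_y1, p_y0 = 0.40, 0.60         # P(Y), P(Y')

# Posterior by Bayes' law: P(Y|X) = P(X|Y)P(Y) / P(X)
posterior = p_x_y1 * p_y1 / (p_x_y1 * p_y1 + p_x_y0 * p_y0)

# Same posterior as a sigmoid of log-likelihood ratio plus log-prior ratio
u = math.log(p_x_y1 / p_x_y0) + math.log(p_y1 / p_y0)
sigmoid = 1.0 / (1.0 + math.exp(-u))

assert abs(posterior - sigmoid) < 1e-12
```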
So Why Deep Learning?
• One hidden layer, with many neurons, provides strong (and likely enough) discriminatory power
• Extra hidden layers are only needed when the inputs are conditionally dependent: extra layers give posteriors even when the inputs are not conditionally independent
• Big data, i.e. high dimensional data, typically has conditionally correlated inputs
• Hence deep learning is a great tool for big and alternative data, but not just for the reasons we've been told
Recurrent Neural Networks (p=5)
Figure: An unfolded recurrent neural network with p = 5 lags. Each input X_s, for s = t−5, ..., t, feeds a hidden layer of units Z^1_s, ..., Z^j_s, ..., Z^H_s, which also receives the previous hidden state; the final hidden state produces the output Y_t.
Non-linear Predictors
• Input-output pairs D = {X_t, Y_t}_{t=1}^N are auto-correlated observations of X and Y at times t = 1, ..., N
• Construct a non-linear time series predictor, Ŷ_t(X_t), of an output, Y, using a high dimensional input matrix of length-(p+1) sub-sequences X_t:

  Ŷ_t = F(X_t) where X_t = seq_p(X_t) = (X_{t−p}, ..., X_t)

• X_{t−j} is the j-th lagged observation of X_t, X_{t−j} = L^j[X_t], for j = 0, ..., p.
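The sub-sequence operator can be sketched as a small NumPy helper (the function name follows the slide's seq_p notation; the implementation is an illustration, not from the talk):

```python
import numpy as np

def seq_p(x, p):
    """Stack the length-(p+1) sub-sequences (X_{t-p}, ..., X_t) for t = p, ..., N-1."""
    x = np.asarray(x)
    return np.stack([x[t - p:t + 1] for t in range(p, len(x))])

X = np.arange(10.0)
subs = seq_p(X, p=3)    # shape (7, 4); first row is (X_0, X_1, X_2, X_3)
```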
Recurrent Neural Networks
• For each time step s = t−p, ..., t, a function F_h generates a hidden state Z_s:

  Z_s = F_h(X_s, Z_{s−1}) := σ(W_h X_s + U_h Z_{s−1} + b_h),  W_h ∈ R^{H×d},  U_h ∈ R^{H×H}

• When the output is continuous, the model output from the final hidden state is given by:

  Ŷ_t = F_y(Z_t) = W_y Z_t + b_y,  W_y ∈ R^{K×H}

• When the output is categorical, the output is given by:

  Ŷ_t = F_y(Z_t) = softmax(W_y Z_t + b_y)

• Goal: find the weight matrices W = (W_h, U_h, W_y) and biases b = (b_h, b_y).
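A minimal forward pass with randomly initialised weights illustrates the recursion (a NumPy sketch; the talk leaves σ generic, so tanh is assumed here, and in practice the weights are fitted rather than random):

```python
import numpy as np

rng = np.random.default_rng(0)
H, d, K, p = 4, 1, 1, 5    # hidden units, input dim, output dim, lags

# Randomly initialised weights (illustrative; normally found by training)
Wh, Uh, bh = rng.normal(size=(H, d)), rng.normal(size=(H, H)), np.zeros(H)
Wy, by = rng.normal(size=(K, H)), np.zeros(K)

def rnn_forward(xs):
    """Z_s = sigma(Wh X_s + Uh Z_{s-1} + bh); continuous output Y_t = Wy Z_t + by."""
    Z = np.zeros(H)
    for x in xs:                           # s = t-p, ..., t
        Z = np.tanh(Wh @ np.atleast_1d(x) + Uh @ Z + bh)
    return Wy @ Z + by

y = rnn_forward(np.ones(p + 1))            # output from the final hidden state
```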
Univariate Example: Recurrent Neural Networks are Non-linear Autoregressive Models
• Consider the univariate time series prediction Ŷ_t = F(X_{t−1}), using the p previous observations {Y_{t−i}}_{i=1}^p.
• This is a special case in which no input is available at time t (since we are predicting it), so we form the hidden states up to Z_{t−1}.
• The simplest case is an RNN with one hidden unit, H = 1, no activation function, and input dimension d = 1.
Univariate Example
• If W_h = U_h = φ, |φ| < 1, W_y = 1 and b_h = b_y = 0,
• then we can show that Ŷ_t = F(X_{t−1}) is a zero-drift auto-regressive, AR(p), model with geometrically decaying weights:

  Z_{t−p} = φ Y_{t−p},
  Z_{t−p+1} = φ(Z_{t−p} + Y_{t−p+1}),
  ...
  Z_{t−1} = φ(Z_{t−2} + Y_{t−1}),
  Ŷ_t = Z_{t−1},

where

  Ŷ_t = (φL + φ²L² + ... + φ^p L^p)[Y_t].
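The unrolled recursion and the closed-form AR(p) weights can be checked to agree numerically (a sketch with arbitrary lag values and an arbitrary φ):

```python
import numpy as np

phi, p = 0.5, 4
rng = np.random.default_rng(1)
lags = rng.normal(size=p)            # (Y_{t-p}, ..., Y_{t-1}), oldest first

# Unrolled linear RNN: Z_s = phi * (Z_{s-1} + Y_s); prediction Yhat_t = Z_{t-1}
Z = 0.0
for y_s in lags:
    Z = phi * (Z + y_s)

# AR(p) with geometrically decaying weights: sum_i phi^i Y_{t-i}
ar = sum(phi ** (p - j) * lags[j] for j in range(p))

assert abs(Z - ar) < 1e-12
```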
Quick Multiple Choice Question
Identify the following correct statements:
1. A linear recurrent neural network with a memory of p lags is an autoregressive model AR(p) with non-parametric error.
2. Recurrent neural networks, as time series models, are guaranteed to be stationary, for any choice of weights.
3. The amount of memory in a shallow recurrent network corresponds to the number of times a single perceptron layer is unfolded.
4. The amount of memory in a deep recurrent network corresponds to the number of perceptron layers.
Quick Multiple Choice Question: Answers
Answer: 1, 3.
High Frequency Trading
Figure: A space-time diagram showing the limit order book. The contemporaneous depth imbalances at each price level, x_{i,t}, are represented by the color scale: red denotes a high value of the depth imbalance and yellow the converse. The limit order book is observed to polarize prior to a price movement.
Predictive Performance Comparisons
Features            Method         Y = -1                    Y = 0                     Y = 1
                                   prec   recall  f1         prec   recall  f1         prec   recall  f1
Liquidity Imbalance Logistic       0.010  0.603   0.019      0.995  0.620   0.764      0.013  0.588   0.025
                    Kalman Filter  0.051  0.540   0.093      0.998  0.682   0.810      0.055  0.557   0.100
                    RNN            0.037  0.636   0.070      0.996  0.673   0.803      0.040  0.613   0.075
Order Flow          Logistic       0.042  0.711   0.079      0.991  0.590   0.740      0.047  0.688   0.088
                    Kalman Filter  0.068  0.594   0.122      0.996  0.615   0.751      0.071  0.661   0.128
                    RNN            0.064  0.739   0.118      0.995  0.701   0.823      0.066  0.728   0.121
Spatio-temporal     Elastic Net    0.063  0.754   0.116      0.986  0.483   0.649      0.058  0.815   0.108
                    RNN            0.084  0.788   0.153      0.999  0.729   0.843      0.075  0.818   0.137
                    FFWD NN        0.066  0.758   0.121      0.999  0.657   0.795      0.065  0.796   0.120
White Noise                        0.004  0.333   0.007      0.993  0.333   0.499      0.003  0.333   0.007
Algorithmic Trading Example: Predicting portfolio returns
• Consider a portfolio of positions in n stocks
• The portfolio is assumed to be equally weighted, so that the returns are

  r_P(t) = Σ_{i=1}^n w_i r_i(t),  w_i = 1/n

• Goal: learn the relationship between all the previous stock returns and the directional change in the portfolio returns
• Key idea: maximize the capacity to predict the next-day portfolio returns given cross-sectional market data
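The equally weighted portfolio return can be sketched in a couple of lines (the stock returns here are simulated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 5, 250
r = rng.normal(0.0005, 0.01, size=(T, n))   # simulated daily stock returns r_i(t)

w = np.full(n, 1.0 / n)                     # equal weights w_i = 1/n
r_P = r @ w                                 # portfolio return r_P(t)
```

With equal weights the portfolio return is just the cross-sectional mean of the stock returns at each date.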
Algorithmic Trading Example: Prediction formulation
• The observed data consists of historical returns {X_t}_{t=1}^T, where X_t = (r_1(t), ..., r_n(t))
• The data is labeled with a categorical variable

  Y_t = 1 if r_P(t+1) > ε;  0 if |r_P(t+1)| ≤ ε;  −1 if r_P(t+1) < −ε

• Goal is to learn the map Y_t = F_{W,b}(X_t)
• ε is a threshold chosen from the training data to balance the observations across the three classes.
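The labeling rule can be sketched directly (the example returns and the threshold value are illustrative, not from the talk):

```python
import numpy as np

def label_returns(r_next, eps):
    """Y_t = 1 if r_P(t+1) > eps, 0 if |r_P(t+1)| <= eps, -1 if r_P(t+1) < -eps."""
    return np.where(r_next > eps, 1, np.where(r_next < -eps, -1, 0))

r_next = np.array([0.02, -0.001, -0.03, 0.004])
y = label_returns(r_next, eps=0.005)        # labels: 1, 0, -1, 0
```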
Algorithmic Trading Example: Performance of logistic regression

            in-sample                           out-of-sample
label       precision  recall  f1-score  support  precision  recall  f1-score  support
-1          0.91       0.88    0.89      1222     0.26       0.32    0.29      82
0           0.90       0.90    0.90      1347     0.38       0.35    0.36      121
1           0.89       0.92    0.91      1271     0.37       0.34    0.35      97
avg / total 0.90       0.90    0.90      3840     0.34       0.34    0.34      300

Table: The bias-variance tradeoff is characterized by the difference between the in-sample and out-of-sample performance.
Algorithmic Trading Example: Performance of deep neural networks

            in-sample                           out-of-sample
label       precision  recall  f1-score  support  precision  recall  f1-score  support
-1          0.94       0.95    0.94      1222     0.35       0.35    0.35      82
0           0.97       0.93    0.95      1347     0.44       0.52    0.48      121
1           0.93       0.94    0.93      1271     0.41       0.32    0.36      97
avg / total 0.95       0.94    0.94      3840     0.41       0.41    0.41      300

Table: The bias-variance tradeoff is characterized by the difference between the in-sample and out-of-sample performance.
Summary
A Neural networks aren't themselves 'black-boxes', although they do treat the data generation process as a black box
B The outputs from neural network classifiers are only probabilities if the features are conditionally independent (or there are enough layers)
C One layer is typically sufficient to capture the non-linearity in most financial applications (but multiple layers are needed for probabilistic output)
D Recurrent neural networks are non-parametric, non-linear extensions of classical time series methods
E TensorFlow doesn't check that fitted recurrent neural networks are stationary
References
• M.F. Dixon, A High Frequency Trade Execution Model for Supervised Learning, High Frequency, arXiv:1710.03870, 2018.
• M.F. Dixon, N. Polson and V. Sokolov, Deep Learning for Spatio-Temporal Modeling: Dynamic Traffic Flows and High Frequency Trading, Applied Stochastic Models in Business and Industry, arXiv:1705.09851, 2018.
• M.F. Dixon, Sequence Classification of the Limit Order Book using Recurrent Neural Networks, J. Computational Science, Special Issue on Topics in Computational and Algorithmic Finance, arXiv:1707.05642, 2018.
Extra Slides
Autoregressive Processes AR(p)
• The p-th order autoregressive process of a variable y depends only on the previous values of the variable plus a white noise disturbance term:

  y_t = μ + Σ_{i=1}^p φ_i y_{t−i} + u_t

• Defining the polynomial function φ(L) := 1 − φ_1 L − φ_2 L² − ... − φ_p L^p, the AR(p) process can be expressed in the more compact form

  φ(L) y_t = μ + u_t

Key point: an AR(p) process has a geometrically decaying acf.
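The geometric decay of the acf is easy to see on a simulated AR(1) (a sketch; φ = 0.8 and the sample size are arbitrary choices): the sample autocorrelations at lags 1, 2, 3 sit near φ, φ², φ³.

```python
import numpy as np

rng = np.random.default_rng(3)
phi, n = 0.8, 200_000

# Simulate a zero-drift AR(1): y_t = phi * y_{t-1} + u_t
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + rng.normal()

def acf(x, k):
    """Sample autocorrelation at lag k."""
    x = x - x.mean()
    return (x[:-k] @ x[k:]) / (x @ x)

rhos = [acf(y, k) for k in (1, 2, 3)]   # close to phi, phi**2, phi**3
```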
Autoregressive Classifiers
• Suppose we have conditionally i.i.d. Bernoulli r.v.s X_t with p_t := P[X_t = 1 | Ω_t] representing a binary event
• E[X_t | Ω_t] = 0·(1 − p_t) + 1·p_t = p_t
• V[X_t | Ω_t] = p_t(1 − p_t)
• Under an ARMA model:

  ln(p_t / (1 − p_t)) = φ^{−1}(L)(μ + θ(L) u_t)
High Frequency Trading Example: Prediction Model
• The response is

  Y = Δp_t^{t+h}   (1)

• Δp_t^{t+h} is the forecast of discrete mid-price changes from time t to t + h, given measurement of the predictors up to time t.
• The predictors are embedded:

  x = x_t = vec([x_{1,t−k} ... x_{1,t}; ... ; x_{n,t−k} ... x_{n,t}])   (2)

• n is the number of quoted price levels, k is the number of lagged observations, and x_{i,t} ∈ [0, 1] is the relative depth, representing liquidity imbalance, at quote level i:

  x_{i,t} = d^a_{i,t} / (d^a_{i,t} + d^b_{i,t})   (3)
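Equation (3) and the embedding in (2) can be sketched as follows (the depths are simulated, and row-major ordering for the vec operator is an assumption):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 3, 2                                  # quoted price levels, lags

# Simulated ask and bid depths d^a_{i,t}, d^b_{i,t} over k+1 time points
d_ask = rng.integers(1, 100, size=(n, k + 1)).astype(float)
d_bid = rng.integers(1, 100, size=(n, k + 1)).astype(float)

x_levels = d_ask / (d_ask + d_bid)           # x_{i,t} in [0, 1], eq. (3)
x = x_levels.ravel()                         # vec of the n x (k+1) matrix, eq. (2)
```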
Algorithmic Trading Example: Strategy
• A simple strategy S(Y_t) chooses whether to hold a long, short or neutral position in all stocks over the next period:

  w_{t+1} = S(Y_t) = 1/n if Y_t = 1;  0 if Y_t = 0;  −1/n if Y_t = −1
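Since Y_t ∈ {−1, 0, 1}, the three cases collapse to the single expression Y_t / n per stock, which a sketch makes explicit:

```python
import numpy as np

def strategy_weights(y_t, n):
    """w_{t+1} = S(Y_t): 1/n long per stock if Y_t = 1, flat if 0, -1/n short if -1."""
    return np.full(n, y_t / n)

w = strategy_weights(1, n=5)     # fully long, equally weighted
```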
Algorithmic Trading Example: Portfolio returns
     r_P    r_GSPC
µ    0.064  1.330
σ    0.016  1.317

The selected portfolio consists of a subset of the S&P 500 and outperforms it over a 15-year period.
Algorithmic Trading Example: Data preparation
• Use all symbols of the S&P 500 on 2013-7-3.
• Extract historical daily adjusted close prices from Yahoo Finance from 1998-1-2 to 2013-7-3.
• Remove symbols which have missing prices rather than truncate the time series.
• Use a training horizon of 3840 days, a test horizon of 30 days, and retrain every 30 days over 300 test observations.
• Avoid look-ahead bias by only normalizing the input with moment estimates from the training data.
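The last point can be sketched as follows: the normalization moments are estimated on the training window only and then applied to both windows, so no information from the test period leaks into the inputs.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(400, 10))               # simulated feature matrix
train, test = X[:300], X[300:]

# Moments come from the training window only, preventing look-ahead bias
mu, sigma = train.mean(axis=0), train.std(axis=0)
train_n = (train - mu) / sigma
test_n = (test - mu) / sigma
```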