
Recurrent Network Acoustic Models

Steve Renals

Automatic Speech Recognition – ASR Lecture 13, 29 February 2016


Sequential Data

[Figure: a feed-forward network with 2 frames of context – inputs (x1, x2, x3) from times t-2, t-1 and t are concatenated and fed through a hidden layer to the output]

[Figure: DNN for acoustic modelling – MFCC inputs (39 × 9 = 351), 3–8 hidden layers of 2000 hidden units each, 12000 CD phone outputs]

Modelling sequential data with time dependences between feature vectors

Can model fixed context with a feed-forward network with previous time input vectors added to the network input (in signal processing this is called FIR – finite impulse response)


Sequential Data

[Figure: a recurrent network – inputs (x1, x2, x3) at time t feed a recurrent hidden layer, which feeds the output]

Don't use an input context window – context is learned by the recurrent hidden (state) units

Modelling sequential data with time dependences between feature vectors

Can model fixed context with a feed-forward network with previous time input vectors added to the network input (in signal processing this is called FIR – finite impulse response)

Model sequential inputs using recurrent connections to learn a time-dependent state (in signal processing this is called IIR – infinite impulse response)


Recurrent networks

Can think of recurrent networks in terms of the dynamics of the recurrent hidden state:

Settle to a fixed point – a stable representation for a sequence (e.g. machine translation)

Regular oscillation ("limit cycle") – learn some kind of repetition

Chaotic dynamics (non-repetitive) – theoretically interesting ("computation at the edge of chaos")

Useful behaviours of recurrent networks

Recurrent state as memory – remember things for (potentially) an infinite time

Recurrent state as information compression – compress a sequence into a state representation


Simplest recurrent network

y_k(t) = softmax( Σ_{r=0}^{H} w^{(2)}_{kr} h_r(t) )

h_j(t) = sigmoid( Σ_{s=0}^{d} w^{(1)}_{js} x_s(t) + Σ_{r=0}^{H} w^{(R)}_{jr} h_r(t-1) )

(the second sum is the recurrent part)

[Figure: Input (t) and Hidden (t-1) feed Hidden (t) via weights w(1) and w(R); Hidden (t) feeds Output (t) via w(2)]
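To make these two equations concrete, here is a minimal NumPy sketch of the forward pass of this simplest recurrent network (a sketch only: weight shapes and names are illustrative, and the bias terms implied by the sums starting at index 0 are omitted):

    import numpy as np

    def softmax(a):
        e = np.exp(a - a.max())
        return e / e.sum()

    def rnn_forward(X, W1, WR, W2, h0):
        """Forward pass of the simplest recurrent network.
        X: (T, d) input sequence; W1: (H, d); WR: (H, H); W2: (K, H); h0: (H,)."""
        h = h0
        hiddens, outputs = [], []
        for x_t in X:
            # h_j(t) = sigmoid( sum_s w(1)_js x_s(t) + sum_r w(R)_jr h_r(t-1) )
            h = 1.0 / (1.0 + np.exp(-(W1 @ x_t + WR @ h)))
            # y_k(t) = softmax( sum_r w(2)_kr h_r(t) )
            y = softmax(W2 @ h)
            hiddens.append(h)
            outputs.append(y)
        return np.array(hiddens), np.array(outputs)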


Recurrent network unfolded in time

[Figure: the recurrent network unfolded in time – copies of the input, hidden and output layers at t-1, t and t+1, with the weights w(1), w(2) and w(R) shared across all timesteps]

An RNN for a sequence of T inputs can be viewed as a deep T-layer network with shared weights

We can train an RNN by doing backprop through this unfolded network, making sure we share the weights.

Weight sharing: if two weights are constrained to be equal (w1 = w2) then they will stay equal if the weight changes are equal (∆w1 = ∆w2); achieve this by updating both with (∂E/∂w1 + ∂E/∂w2) (cf. ConvNets)
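A two-line sketch of this tied-weight update rule (illustrative names only):

    # Tied weights w1 = w2: update both with the *sum* of their gradients,
    # so if they start equal they stay equal after every update.
    def update_tied(w, dE_dw1, dE_dw2, lr=0.1):
        return w - lr * (dE_dw1 + dE_dw2)   # new value used for both w1 and w2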


Back-propagation through time (BPTT)

We can train a network by unfolding and back-propagating through time, summing the derivatives for each weight as we go through the sequence

More efficiently, run as a recurrent network:

cache the unit outputs at each timestep
cache the output errors at each timestep
then backprop from the final timestep to zero, computing the derivatives at each step
compute the weight updates by summing the derivatives across time

Expensive – backprop for a 1000-item sequence is equivalent to a 1000-layer feed-forward network

Truncated BPTT – backprop through just a few time steps (e.g. 20)
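A minimal sketch of truncated BPTT for the simple recurrent network above (assuming the cached hidden states and softmax outputs from rnn_forward, one-hot targets and a cross-entropy loss; names are illustrative):

    import numpy as np

    def truncated_bptt(X, targets, hiddens, outputs, W1, WR, W2, h0, k=20):
        """Backprop through (at most) the last k timesteps, summing the
        derivatives for each shared weight across time."""
        dW1, dWR, dW2 = np.zeros_like(W1), np.zeros_like(WR), np.zeros_like(W2)
        dh_next = np.zeros(WR.shape[0])
        T = len(X)
        for t in reversed(range(max(0, T - k), T)):
            dy = outputs[t] - targets[t]                 # softmax + cross-entropy gradient
            dW2 += np.outer(dy, hiddens[t])
            dh = W2.T @ dy + dh_next                     # error from the output and from step t+1
            dpre = dh * hiddens[t] * (1 - hiddens[t])    # back through the sigmoid
            dW1 += np.outer(dpre, X[t])
            h_prev = hiddens[t - 1] if t > 0 else h0
            dWR += np.outer(dpre, h_prev)
            dh_next = WR.T @ dpre                        # pass the error back through time
        return dW1, dWR, dW2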


Vanishing and exploding gradients

BPTT involves taking the product of many gradients (as in a very deep network) – this can lead to vanishing (component gradients less than 1) or exploding (greater than 1) gradients

This can prevent effective training

Modified optimisation algorithms

RMSProp (normalise the gradient for each weight by a running average of its magnitude, giving a per-weight learning rate; see the sketch after this list)
Hessian-free – an approximation to second-order approaches, which use curvature information

Modified hidden unit transfer functions:
Long short-term memory (LSTM)

Linear self-recurrence for each hidden unit (long-term memory)
Gates – dynamic weights which are a function of the inputs

ReLUs
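A minimal sketch of the RMSProp idea mentioned above (hyperparameter values are illustrative):

    import numpy as np

    def rmsprop_update(w, grad, cache, lr=1e-3, rho=0.9, eps=1e-8):
        """Keep a running average of the squared gradient for each weight and
        divide the gradient by its square root, giving each weight its own
        effective learning rate."""
        cache = rho * cache + (1 - rho) * grad ** 2
        w = w - lr * grad / (np.sqrt(cache) + eps)
        return w, cache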


Recurrent networks in speech recognition

1990s – Hybrid RNN/HMM speech recognition (Robinson et al)

2009 onwards – RNN language models (Mikolov – new state of the art; next week's lecture)

2013 onwards – RNN/LSTM models at state of the art for acoustic modelling (Graves, Sak et al)

2015 onwards – RNN sequence modelling also replaces the HMM – "HMM-free" ASR (next week also)



Speech recognition with recurrent networks (1990s)

[Figure: spectrogram of the speech acoustics (freq in Hz vs time in ms) fed to a recurrent neural network, which outputs phoneme probabilities]

Robinson et al (1996)


1990s RNN

Features

MEL+ filter-bank outputs + voicing parameters (23 features/frame)
PLP cepstral coefficients (13 features)
Coefficients normalized to zero mean and unit variance

256 hidden (state) units

79 context-independent phone classes (outputs)

Output training target delayed by 5 frames

Use RNN scaled likelihoods in hybrid RNN/HMM

About 100k trainable parameters

Trained using stochastic BPTT (using a method similar to Rprop/RMSprop) on WSJ0 (3M training examples)

In 1994, training took five days on a specially designed parallel computer (the RAP)


Combined RNN system

[Figure: combined RNN system – speech waveform → preprocessor (Mel+ and PLP features) → four recurrent nets (each with a time delay, producing delayed outputs y(t-4)) → Markov model (phone sequence, e.g. "sh ow iy m")]

Individual systems: WER of 14–15% on WSJ "spoke 6" data; interpolated in log domain: WER = 11%. (Best context-dependent GMM/HMM system in 1995: WER = 8%)
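One simple way to realise the log-domain interpolation of the individual systems' phone probabilities is a weighted geometric mean; this sketch assumes equal weights, which the slide does not specify:

    import numpy as np

    def combine_log_domain(list_of_posteriors, weights=None):
        """Interpolate several systems' frame posteriors in the log domain
        and renormalise."""
        P = np.stack(list_of_posteriors)              # shape (n_systems, n_classes)
        w = np.full(len(P), 1.0 / len(P)) if weights is None else np.asarray(weights)
        log_mix = (w[:, None] * np.log(P + 1e-10)).sum(axis=0)
        mix = np.exp(log_mix - log_mix.max())
        return mix / mix.sum()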


LSTM v1

[Figure: a basic sigmoid unit – inputs I(t) and H(t-1), output H(t)]


LSTM v1

[Figure: the unit with a linear self-recurrence (weight 1) added to its internal state]


LSTM v1

[Figure: the unit with input and output gates added around the linear self-recurrence]

S. Hochreiter and J. Schmidhuber (1997), "Long Short-Term Memory", Neural Computation, 9:1735–1780


LSTM v2

[Figure: the memory block with a forget gate added]

F.A. Gers et al (2000), "Learning to Forget: Continual Prediction with LSTM", Neural Computation, 12:2451–2471


LSTM v3

[Figure: the memory block with peephole connections from the cell state C(t-1), C(t) to the gates]


LSTM equations

2 LSTM ARCHITECTURES

In the standard architecture of LSTM networks, there are an input layer, a recurrent LSTM layer and an output layer. The input layer is connected to the LSTM layer. The recurrent connections in the LSTM layer are directly from the cell output units to the cell input units, input gates, output gates and forget gates. The cell output units are connected to the output layer of the network. The total number of parameters W in a standard LSTM network with one cell in each memory block, ignoring the biases, can be calculated as follows:

W = n_c × n_c × 4 + n_i × n_c × 4 + n_c × n_o + n_c × 3

where n_c is the number of memory cells (and the number of memory blocks in this case), n_i is the number of input units and n_o is the number of output units. The computational complexity of learning LSTM models per weight and time step with the stochastic gradient descent (SGD) optimization technique is O(1). Therefore the learning computational complexity per time step is O(W). The learning time for a network with a relatively small number of inputs is dominated by the n_c × (4 × n_c + n_o) factor. For tasks requiring a large number of output units and a large number of memory cells to store temporal contextual information, learning LSTM models becomes computationally expensive.

As an alternative to the standard architecture, we propose two novel architectures to address the computational complexity of learning LSTM models. The two architectures are shown in the same Figure 1. In one of them, we connect the cell output units to a recurrent projection layer which connects to the cell input units and gates for recurrency, in addition to the network output units for the prediction of the outputs. Hence the number of parameters in this model is n_c × n_r × 4 + n_i × n_c × 4 + n_r × n_o + n_c × n_r + n_c × 3, where n_r is the number of units in the recurrent projection layer. In the other one, in addition to the recurrent projection layer, we add another non-recurrent projection layer which is directly connected to the output layer. This model has n_c × n_r × 4 + n_i × n_c × 4 + (n_r + n_p) × n_o + n_c × (n_r + n_p) + n_c × 3 parameters, where n_p is the number of units in the non-recurrent projection layer, and it allows us to increase the number of units in the projection layers without increasing the number of parameters in the recurrent connections (n_c × n_r × 4). Note that having two projection layers with regard to output units is effectively equivalent to having a single projection layer with n_r + n_p units.
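As a worked example of these parameter counts (the network sizes below are only illustrative, loosely based on the Google experiments reported later in these notes):

    def lstm_params(ni, nc, no):
        # Standard LSTM, one cell per memory block, biases ignored:
        # 4*nc*nc recurrent + 4*ni*nc input + nc*no output + 3*nc peephole weights.
        return nc * nc * 4 + ni * nc * 4 + nc * no + nc * 3

    def lstmp_params(ni, nc, nr, no, np_=0):
        # With a recurrent projection layer of nr units and an optional
        # non-recurrent projection layer of np_ units feeding the output.
        return (nc * nr * 4 + ni * nc * 4 + (nr + np_) * no
                + nc * (nr + np_) + nc * 3)

    # 40-d input and 14247 output states (figures taken from the paper excerpts below):
    print(lstm_params(40, 750, 14247))          # about 13M: 750 cells already costs a lot
    print(lstmp_params(40, 1024, 512, 14247))   # about 10M: more cells, fewer parameters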

An LSTM network computes a mapping from an input sequence x = (x_1, ..., x_T) to an output sequence y = (y_1, ..., y_T) by calculating the network unit activations using the following equations iteratively from t = 1 to T:

i_t = σ(W_{ix} x_t + W_{im} m_{t-1} + W_{ic} c_{t-1} + b_i)   (1)
f_t = σ(W_{fx} x_t + W_{mf} m_{t-1} + W_{cf} c_{t-1} + b_f)   (2)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g(W_{cx} x_t + W_{cm} m_{t-1} + b_c)   (3)
o_t = σ(W_{ox} x_t + W_{om} m_{t-1} + W_{oc} c_t + b_o)   (4)
m_t = o_t ⊙ h(c_t)   (5)
y_t = W_{ym} m_t + b_y   (6)

where the W terms denote weight matrices (e.g. W_{ix} is the matrix of weights from the input gate to the input), the b terms denote bias vectors (b_i is the input gate bias vector), σ is the logistic sigmoid function, and i, f, o and c are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the cell output activation vector m; ⊙ is the element-wise product

[Fig. 1: LSTM based RNN architectures with a recurrent projection layer and an optional non-recurrent projection layer. A single memory block is shown for clarity.]

of the vectors, and g and h are the cell input and cell output activation functions, generally tanh.
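A NumPy sketch of one timestep of equations (1)–(6) (parameter names are illustrative; the peephole weights are stored as vectors, i.e. diagonal matrices, as the later excerpt notes):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def lstm_step(x, m_prev, c_prev, P):
        """One LSTM timestep. P is a dict of weight matrices W_*, peephole
        vectors w_*, and bias vectors b_*."""
        i = sigmoid(P['W_ix'] @ x + P['W_im'] @ m_prev + P['w_ic'] * c_prev + P['b_i'])
        f = sigmoid(P['W_fx'] @ x + P['W_mf'] @ m_prev + P['w_cf'] * c_prev + P['b_f'])
        c = f * c_prev + i * np.tanh(P['W_cx'] @ x + P['W_cm'] @ m_prev + P['b_c'])   # g = tanh
        o = sigmoid(P['W_ox'] @ x + P['W_om'] @ m_prev + P['w_oc'] * c + P['b_o'])
        m = o * np.tanh(c)                       # h = tanh; m is the cell output
        y = P['W_ym'] @ m + P['b_y']             # network output (pre-softmax here)
        return y, m, c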

With the proposed LSTM architecture, with both recurrent and non-recurrent projection layers, the equations are as follows:

i_t = σ(W_{ix} x_t + W_{ir} r_{t-1} + W_{ic} c_{t-1} + b_i)   (7)
f_t = σ(W_{fx} x_t + W_{rf} r_{t-1} + W_{cf} c_{t-1} + b_f)   (8)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g(W_{cx} x_t + W_{cr} r_{t-1} + b_c)   (9)
o_t = σ(W_{ox} x_t + W_{or} r_{t-1} + W_{oc} c_t + b_o)   (10)
m_t = o_t ⊙ h(c_t)   (11)
r_t = W_{rm} m_t   (12)
p_t = W_{pm} m_t   (13)
y_t = W_{yr} r_t + W_{yp} p_t + b_y   (14)

where the r and p denote the recurrent and optional non-recurrent unit activations.
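And a matching sketch of the projected variant, equations (7)–(14): only the recurrent input changes (the low-dimensional r replaces m), plus the two projections at the output (names are again illustrative; numpy and the sigmoid helper from the previous sketch are reused):

    def lstmp_step(x, r_prev, c_prev, P):
        """One timestep of the LSTM with recurrent (r) and optional
        non-recurrent (p) projection layers."""
        i = sigmoid(P['W_ix'] @ x + P['W_ir'] @ r_prev + P['w_ic'] * c_prev + P['b_i'])
        f = sigmoid(P['W_fx'] @ x + P['W_rf'] @ r_prev + P['w_cf'] * c_prev + P['b_f'])
        c = f * c_prev + i * np.tanh(P['W_cx'] @ x + P['W_cr'] @ r_prev + P['b_c'])
        o = sigmoid(P['W_ox'] @ x + P['W_or'] @ r_prev + P['w_oc'] * c + P['b_o'])
        m = o * np.tanh(c)
        r = P['W_rm'] @ m                        # recurrent projection (n_r < n_c)
        p = P['W_pm'] @ m                        # optional non-recurrent projection
        y = P['W_yr'] @ r + P['W_yp'] @ p + P['b_y']
        return y, r, c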

2.1 Implementation

We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU. The decision was based on CPU's relatively simpler implementation complexity and ease of debugging. CPU implementation also allows easier distributed implementation on a large cluster of machines if the learning time of large networks becomes a major bottleneck on a single machine [14]. For matrix operations we use the Eigen matrix library [15]. This templated C++ library provides efficient implementations for matrix operations on CPU using vectorized instructions (SIMD – single instruction, multiple data). We implemented activation functions and gradient calculations on matrices using SIMD instructions to benefit from parallelization.

We use the asynchronous stochastic gradient descent (ASGD) optimization technique. The update of the parameters with the gradients is done asynchronously from multiple threads on a multi-core machine. Each thread operates on a batch of sequences in parallel, for computational efficiency – for instance, we can do matrix-matrix multiplications rather than vector-matrix multiplications – and for more stochasticity, since model parameters can be updated from multiple input sequences at the same time. In addition to batching of sequences in a single thread, training with multiple threads effectively


Google LSTM experiments (2014–date)

Google voice search data – 1900h training set

Input features 40-d log mel filterbank energies

No input context: a single 40-d frame presented each timestep

LSTM networks

400–6000 LSTM cells per layer
1–7 layers
13M–37M trainable parameters
14727 context-dependent states

Delay output targets by 5 frames

Use RNN to estimate scaled likelihoods

Train using BPTT


LSTM acoustic modelling architectures

[Figure 1: LSTMP RNN architecture. A single memory block is shown for clarity.]

memory cell. The output gate controls the output flow of cell activations into the rest of the network. Later, the forget gate was added to the memory block [18]. This addressed a weakness of LSTM models preventing them from processing continuous input streams that are not segmented into subsequences. The forget gate scales the internal state of the cell before adding it as input to the cell through the self-recurrent connection of the cell, therefore adaptively forgetting or resetting the cell's memory. In addition, the modern LSTM architecture contains peephole connections from its internal cells to the gates in the same cell to learn precise timing of the outputs [19].

An LSTM network computes a mapping from an input sequence x = (x_1, ..., x_T) to an output sequence y = (y_1, ..., y_T) by calculating the network unit activations using the following equations iteratively from t = 1 to T:

i_t = σ(W_{ix} x_t + W_{im} m_{t-1} + W_{ic} c_{t-1} + b_i)   (1)
f_t = σ(W_{fx} x_t + W_{fm} m_{t-1} + W_{fc} c_{t-1} + b_f)   (2)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g(W_{cx} x_t + W_{cm} m_{t-1} + b_c)   (3)
o_t = σ(W_{ox} x_t + W_{om} m_{t-1} + W_{oc} c_t + b_o)   (4)
m_t = o_t ⊙ h(c_t)   (5)
y_t = φ(W_{ym} m_t + b_y)   (6)

where the W terms denote weight matrices (e.g. W_{ix} is the matrix of weights from the input gate to the input), W_{ic}, W_{fc}, W_{oc} are diagonal weight matrices for peephole connections, the b terms denote bias vectors (b_i is the input gate bias vector), σ is the logistic sigmoid function, and i, f, o and c are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the cell output activation vector m; ⊙ is the element-wise product of the vectors; g and h are the cell input and cell output activation functions, generally and in this paper tanh; and φ is the network output activation function, softmax in this paper.

2.2 Deep LSTM

As with DNNs with deeper architectures, deep LSTM RNNs have been successfully used for speech recognition [11, 17, 2]. Deep LSTM RNNs are built by stacking multiple LSTM layers. Note that LSTM RNNs are already deep architectures in the sense that they can be considered as a feed-forward neural network unrolled in time where each layer shares the same model parameters. One can see that the inputs to the model go through multiple non-linear layers as in DNNs; however, the features from a given time instant are only processed by a single nonlinear layer before contributing to the output for that time

[Figure 2: LSTM RNN architectures – (a) LSTM, (b) DLSTM (stacked LSTM layers), (c) LSTMP (LSTM + recurrent projection), (d) DLSTMP (stacked LSTM + projection layers)]

instant. Therefore the depth in deep LSTM RNNs has an additional meaning. The input to the network at a given time step goes through multiple LSTM layers in addition to propagation through time and LSTM layers. It has been argued that deep layers in RNNs allow the network to learn at different time scales over the input [20]. Deep LSTM RNNs offer another benefit over standard LSTM RNNs: they can make better use of parameters by distributing them over the space through multiple layers. For instance, rather than increasing the memory size of a standard model by a factor of 2, one can have 4 layers with approximately the same number of parameters. This results in inputs going through more non-linear operations per time step.

2.3 LSTMP – LSTM with Recurrent Projection Layer

The standard LSTM RNN architecture has an input layer, a recurrent LSTM layer and an output layer. The input layer is connected to the LSTM layer. The recurrent connections in the LSTM layer are directly from the cell output units to the cell input units, input gates, output gates and forget gates. The cell output units are also connected to the output layer of the network. The total number of parameters N in a standard LSTM network with one cell in each memory block, ignoring the biases, can be calculated as N = n_c × n_c × 4 + n_i × n_c × 4 + n_c × n_o + n_c × 3, where n_c is the number of memory cells (and number of memory blocks in this case), n_i is the number of input units and n_o is the number of output units. The computational complexity of learning LSTM models per weight and time step with the stochastic gradient descent (SGD) optimization technique is O(1). Therefore the learning computational complexity per time step is O(N). The learning time for a network with a moderate number of inputs is dominated by the n_c × (4 × n_c + n_o) factor. For tasks requiring a large number of output units and a large number of memory cells to store temporal contextual information, learning LSTM models becomes computationally expensive.

As an alternative to the standard architecture, we proposed the Long Short-Term Memory Projected (LSTMP) architecture to address the computational complexity of learning LSTM models [3]. This architecture, shown in Figure 1, has a separate linear projection layer after the LSTM layer. The recurrent connections now connect from this recurrent projection layer to the input of the LSTM layer. The network output units are connected to this recurrent layer. The number of parameters in this model is n_c × n_r × 4 + n_i × n_c × 4 + n_r × n_o + n_c × n_r + n_c × 3.

(Sak 2014)


LSTM with projection layer (LSTMP)

[Figure 1: LSTMP RNN architecture. A single memory block is shown for clarity.]


(Sak 2014)


LSTM ASR Results (Google Voice Search task)

Table 1: Experiments with LSTM and LSTMP RNN architectures, showing test set WERs and frame accuracies on development and training sets. L indicates the number of layers, for shallow (1L) and deep (2, 4, 5, 7L) networks. C indicates the number of memory cells, P the number of recurrent projection units, and N the total number of parameters.

C      P     Depth   N     Dev (%)   Train (%)   WER (%)
840    -     5L      37M   67.7      70.7        10.9
440    -     5L      13M   67.6      70.1        10.8
600    -     2L      13M   66.4      68.5        11.3
385    -     7L      13M   66.2      68.5        11.2
750    -     1L      13M   63.3      65.5        12.4
6000   800   1L      36M   67.3      74.9        11.8
2048   512   2L      22M   68.8      72.0        10.8
1024   512   3L      20M   69.3      72.5        10.7
1024   512   2L      15M   69.0      74.0        10.7
800    512   2L      13M   69.0      72.7        10.7
2048   512   1L      13M   67.3      71.8        11.3

where n_r is the number of units in the recurrent projection layer. In this case the model size and the learning computational complexity are dominated by the n_r × (4 × n_c + n_o) factor. Hence, this allows us to reduce the number of parameters by the ratio n_r/n_c.

By setting n_r < n_c we can increase the model memory (n_c) and still be able to control the number of parameters in the recurrent connections and output layer.

With the proposed LSTMP architecture, the equations for the activations of network units change slightly: the m_{t-1} activation vector is replaced with r_{t-1}, and the following is added:

r_t = W_{rm} m_t   (7)
y_t = φ(W_{yr} r_t + b_y)   (8)

where the r denote the recurrent unit activations.

2.4 Deep LSTMP

Similar to deep LSTM, we propose deep LSTMP, where multiple LSTM layers, each with a separate recurrent projection layer, are stacked. LSTMP allows the memory of the model to be increased independently from the output layer and recurrent connections. However, we noticed that increasing the memory size makes the model more prone to overfitting by memorizing the input sequence data. We know that DNNs generalize better to unseen examples with increasing depth. The depth makes the models harder to overfit to the training data, since the inputs to the network need to go through many non-linear functions. With this motivation we have experimented with deep LSTMP architectures, where the aim is increasing the memory size and generalization power of the model.

3. Distributed Training: Scaling up to Large Models with Parallelization

We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU. The decision was based on CPU's relatively simpler implementation complexity, ease of debugging and the ability to use clusters made from commodity hardware. For matrix operations we used the Eigen matrix library [21]. This templated C++ library provides efficient implementations for matrix operations on CPU using vectorized instructions. We implemented activation functions and gradient calculations on matrices using SIMD instructions to benefit from parallelization.

We use the truncated backpropagation through time (BPTT) learning algorithm [22] to compute parameter gradients on short subsequences of the training utterances. Activations are forward propagated for a fixed step time T_bptt (e.g. 20). Cross-entropy gradients are computed for this subsequence and backpropagated to its start. For computational efficiency each thread operates on subsequences of four utterances at a time, so matrix multiplies can operate in parallel on four frames at a time. We use asynchronous stochastic gradient descent (ASGD) [23] to optimize the network parameters, updating the parameters asynchronously from multiple threads on a multi-core machine. This effectively increases the batch size and reduces the correlation of the frames in a given batch. After a thread has updated the parameters, it continues with the next subsequence in each utterance, preserving the LSTM state, or starts new utterances with reset state when one finishes. Note that the last subsequence of each utterance can be shorter than T_bptt, but is padded to the full length, though no gradient is generated for these padding frames.

This highly parallel single-machine ASGD framework described in [3] proved slow for training models of the scale we have used for large scale ASR with DNNs (many millions of parameters). To scale further, we replicate the single-machine workers on many (e.g. 500) separate machines, each with three synchronized computation threads. Each worker communicates with a shared distributed parameter server [23] which stores the LSTM parameters. When a worker has computed the parameter gradient on a minibatch (of 3 × 4 × T_bptt frames), the gradient vector is partitioned and sent to the parameter server shards, which each add the gradients to their parameters and respond with the new parameters. The parameter server shards aggregate parameter updates completely asynchronously. For instance, gradient updates from workers may arrive in different orders at different shards of the parameter server. Despite the asynchrony, we observe stable convergence, though the learning rate must be reduced, as would be expected because of the increase in the effective batch size from the greater parallelism.
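A toy sketch of the asynchronous update pattern described above, using Python threads on one machine (this is not the paper's implementation, which is multi-threaded C++ with Eigen and, in the distributed case, a sharded parameter server; grad_fn and the queue of utterance batches are hypothetical):

    import threading
    import queue

    def asgd_train(params, batch_queue, grad_fn, lr=1e-3, n_threads=3):
        """Hogwild-style ASGD: each thread pulls a batch of subsequences,
        computes truncated-BPTT gradients and updates the shared parameters
        without locking."""
        def worker():
            while True:
                batch = batch_queue.get()
                if batch is None:                    # sentinel: no more data
                    return
                grads = grad_fn(params, batch)       # dict of gradients, same keys as params
                for k, g in grads.items():           # asynchronous, lock-free update
                    params[k] -= lr * g
        threads = [threading.Thread(target=worker) for _ in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()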

4. Experiments

We evaluate and compare the performance of LSTM RNN architectures on a large vocabulary speech recognition task – the Google Voice Search task. We use a hybrid approach [24] for acoustic modeling with LSTM RNNs, wherein the neural networks estimate hidden Markov model (HMM) state posteriors. We scale the state posteriors by the state priors, estimated as the relative state frequency from the training data, to obtain the acoustic frame likelihoods. We deweight the silence state counts by a factor of 2.7 when estimating the state frequencies.
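A sketch of the posterior-to-scaled-likelihood conversion described in this paragraph (array names are illustrative):

    import numpy as np

    def scaled_log_likelihoods(log_posteriors, state_counts, silence_ids, deweight=2.7):
        """Hybrid NN/HMM decoding: p(x|s)/p(x) is proportional to p(s|x)/p(s),
        so divide the network's state posteriors by priors estimated from
        training-alignment state counts, with the silence counts deweighted first."""
        counts = state_counts.astype(float).copy()
        counts[silence_ids] /= deweight               # deweight silence states (factor 2.7)
        log_priors = np.log(counts / counts.sum())
        return log_posteriors - log_priors            # division in the log domain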

4.1 Systems & Evaluation

All the networks are trained on a 3 million utterance (about 1900 hours) dataset consisting of anonymized and hand-transcribed utterances. The dataset is represented with 25 ms frames of 40-dimensional log-filterbank energy features computed every 10 ms. The utterances are aligned with a 85 million parameter DNN with 14247 CD states. The weights in all the networks are initialized to the range (-0.02, 0.02) with a uniform distribution. We try to set the learning rate specific to a network architecture and its configuration to the largest value

(Sak 2014)

Reading

S. Hochreiter and J. Schmidhuber (1997), "Long Short-Term Memory", Neural Computation, 9:1735–1780. http://www.mitpressjournals.org/doi/abs/10.1162/neco.1997.9.8.1735

F.A. Gers et al (2000), "Learning to Forget: Continual Prediction with LSTM", Neural Computation, 12:2451–2471. http://www.mitpressjournals.org/doi/abs/10.1162/089976600300015015

T. Robinson et al (1996), "The use of recurrent networks in continuous speech recognition", in Automatic Speech and Speaker Recognition: Advanced Topics, Lee et al (eds), Kluwer, 233–258. http://www.cstr.ed.ac.uk/downloads/publications/1996/rnn4csr96.pdf

H. Sak et al (2014), "LSTM recurrent neural network architectures for large scale acoustic modeling", Interspeech 2014. http://research.google.com/pubs/archive/43905.pdf



(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20

Reading

S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798

1735VtN93Mehb5k

FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162

089976600300015015VtN9ncehb5k

T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http

wwwcstredacukdownloadspublications1996rnn4csr96pdf

H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf

ASR Lecture 13 Recurrent Network Acoustic Models 21

Page 3: Recurrent Network Acoustic Models - School of Informatics

Back-propagation through time (BPTT)

We can train a network by unfolding and back-propagating through time, summing the derivatives for each weight as we go through the sequence

More efficiently, run as a recurrent network:

cache the unit outputs at each timestep
cache the output errors at each timestep
then backprop from the final timestep to zero, computing the derivatives at each step
compute the weight updates by summing the derivatives across time

Expensive – backprop for a 1000 item sequence is equivalent to a 1000-layer feed-forward network

Truncated BPTT – backprop through just a few time steps (e.g. 20); a sketch follows below
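As an illustration only (not from the lecture), here is a minimal NumPy sketch of truncated BPTT for a single layer of sigmoid hidden units; the per-frame gradients arriving from the output layer are taken as given, and all names are made up.

import numpy as np

def truncated_bptt(W_in, W_rec, xs, dL_dh, h0):
    # xs: list of input frames; dL_dh: per-frame gradients from the output layer.
    # Returns summed weight gradients and the final hidden state (carried to the next subsequence).
    h, hs = h0, [h0]
    for x in xs:                                   # forward: cache the unit outputs at each timestep
        h = 1.0 / (1.0 + np.exp(-(W_in @ x + W_rec @ h)))
        hs.append(h)
    dW_in, dW_rec = np.zeros_like(W_in), np.zeros_like(W_rec)
    dh_next = np.zeros_like(h0)
    for t in reversed(range(len(xs))):             # backprop from the final timestep to zero
        dh = dL_dh[t] + dh_next                    # error from the output plus error from later timesteps
        da = dh * hs[t + 1] * (1.0 - hs[t + 1])    # through the sigmoid
        dW_in += np.outer(da, xs[t])               # sum the derivatives across time
        dW_rec += np.outer(da, hs[t])
        dh_next = W_rec.T @ da                     # gradient passed to the previous timestep
    return dW_in, dW_rec, hs[-1]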

ASR Lecture 13 Recurrent Network Acoustic Models 7

Vanishing and exploding gradients

BPTT involves taking the product of many gradients (as in a very deep network) – this can lead to vanishing (component gradients less than 1) or exploding (greater than 1) gradients

This can prevent effective training

Modified optimisation algorithms

RMSProp (normalise the gradient for each weight by an average of its magnitude – effectively a per-weight learning rate; a sketch follows below)
Hessian-free – an approximation to second-order approaches which use curvature information

Modified hidden unit transfer functions
Long short-term memory (LSTM)

Linear self-recurrence for each hidden unit (long-term memory)
Gates – dynamic weights which are a function of the inputs

ReLUs
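A toy sketch (not from the lecture) of the RMSProp-style update mentioned above: each weight's step is scaled by a running average of its squared gradient, giving an effective per-weight learning rate. The hyperparameter values are illustrative.

import numpy as np

def rmsprop_update(w, grad, mean_sq, lr=1e-3, decay=0.9, eps=1e-8):
    # mean_sq is the running average of the squared gradient, same shape as w
    mean_sq = decay * mean_sq + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(mean_sq) + eps)    # normalise each gradient by its average magnitude
    return w, mean_sq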

ASR Lecture 13 Recurrent Network Acoustic Models 8

Recurrent networks in speech recognition

1990s – Hybrid RNN/HMM speech recognition (Robinson et al)

2009 onwards – RNN language models (Mikolov – new state of the art; next week's lecture)

2013 onwards – RNN/LSTM models at state of the art for acoustic modelling (Graves, Sak et al)

2015 onwards – RNN sequence modelling also replaces the HMM – "HMM-free" ASR (next week also)

ASR Lecture 13 Recurrent Network Acoustic Models 9

Speech recognition with recurrent networks (1990s)

[Figure: spectrogram of the speech acoustics (frequency in Hz against time in ms) fed to a recurrent neural network, which outputs phoneme probabilities]

Robinson et al (1996)

ASR Lecture 13 Recurrent Network Acoustic Models 10

1990s RNN

Features

MEL filter-bank outputs + voicing parameters (23 features/frame)
PLP cepstral coefficients (13 features)
Coefficients normalized to zero mean and unit variance

256 hidden (state) units

79 context-independent phone classes (outputs)

Output training target delayed by 5 frames

Use RNN scaled likelihoods in a hybrid RNN/HMM

About 100k trainable parameters

Trained using stochastic BPTT (using a method similar to Rprop/RMSprop) on WSJ0 (3M training examples)

In 1994, training took five days on a specially designed parallel computer (the RAP)

ASR Lecture 13 Recurrent Network Acoustic Models 11

Combined RNN system

[Figure: combined system – speech waveform → preprocessor (Mel + PLP features) → recurrent nets with time delay (outputs y(t-4)) → Markov model]

Individual systems: WER of 14–15% on WSJ "spoke 6 data"
Interpolated in log domain: WER = 11%
(Best context-dependent GMM/HMM system in 1995: WER = 8%)
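A small sketch (my own illustration, not the original recipe) of log-domain interpolation of the frame posteriors from two of the systems; post_a and post_b would be (frames x phones) arrays of network outputs.

import numpy as np

def interpolate_log_domain(post_a, post_b, weight=0.5):
    # average the log probabilities, then renormalise per frame
    log_mix = weight * np.log(post_a) + (1.0 - weight) * np.log(post_b)
    log_mix -= log_mix.max(axis=1, keepdims=True)        # for numerical stability
    mix = np.exp(log_mix)
    return mix / mix.sum(axis=1, keepdims=True)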

ASR Lecture 13 Recurrent Network Acoustic Models 12

LSTM v1

[Figure: a basic sigmoid unit with inputs I(t) and H(t-1) and output H(t)]

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v1

[Figure: the same unit with a linear self-recurrence of weight 1 added]

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v1

[Figure: input and output gates added around the linear self-recurrence]

S Hochreiter and J Schmidhuber (1997), "Long Short-Term Memory", Neural Computation 9:1735–1780

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v2

[Figure: a forget gate added to scale the cell's self-recurrent state]

F A Gers et al (2000), "Learning to Forget: Continual Prediction with LSTM", Neural Computation 12:2451–2471

ASR Lecture 13 Recurrent Network Acoustic Models 14

LSTM v3

[Figure: peephole connections from the cell state C(t-1)/C(t) to the gates]

ASR Lecture 13 Recurrent Network Acoustic Models 15

LSTM equations

2. LSTM ARCHITECTURES

In the standard architecture of LSTM networks, there are an input layer, a recurrent LSTM layer and an output layer. The input layer is connected to the LSTM layer. The recurrent connections in the LSTM layer are directly from the cell output units to the cell input units, input gates, output gates and forget gates. The cell output units are connected to the output layer of the network. The total number of parameters W in a standard LSTM network with one cell in each memory block, ignoring the biases, can be calculated as follows:

W = n_c × n_c × 4 + n_i × n_c × 4 + n_c × n_o + n_c × 3

where n_c is the number of memory cells (and number of memory blocks in this case), n_i is the number of input units, and n_o is the number of output units. The computational complexity of learning LSTM models per weight and time step with the stochastic gradient descent (SGD) optimization technique is O(1). Therefore the learning computational complexity per time step is O(W). The learning time for a network with a relatively small number of inputs is dominated by the n_c × (4 × n_c + n_o) factor. For the tasks requiring a large number of output units and a large number of memory cells to store temporal contextual information, learning LSTM models becomes computationally expensive.

As an alternative to the standard architecture, we propose two novel architectures to address the computational complexity of learning LSTM models. The two architectures are shown in the same Figure 1. In one of them, we connect the cell output units to a recurrent projection layer which connects to the cell input units and gates for recurrency, in addition to network output units for the prediction of the outputs. Hence the number of parameters in this model is n_c × n_r × 4 + n_i × n_c × 4 + n_r × n_o + n_c × n_r + n_c × 3, where n_r is the number of units in the recurrent projection layer. In the other one, in addition to the recurrent projection layer, we add another non-recurrent projection layer which is directly connected to the output layer. This model has n_c × n_r × 4 + n_i × n_c × 4 + (n_r + n_p) × n_o + n_c × (n_r + n_p) + n_c × 3 parameters, where n_p is the number of units in the non-recurrent projection layer, and it allows us to increase the number of units in the projection layers without increasing the number of parameters in the recurrent connections (n_c × n_r × 4). Note that having two projection layers with regard to output units is effectively equivalent to having a single projection layer with n_r + n_p units.

An LSTM network computes a mapping from an input sequence x = (x_1, ..., x_T) to an output sequence y = (y_1, ..., y_T) by calculating the network unit activations using the following equations iteratively from t = 1 to T:

i_t = σ(W_ix x_t + W_im m_{t-1} + W_ic c_{t-1} + b_i)   (1)
f_t = σ(W_fx x_t + W_mf m_{t-1} + W_cf c_{t-1} + b_f)   (2)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g(W_cx x_t + W_cm m_{t-1} + b_c)   (3)
o_t = σ(W_ox x_t + W_om m_{t-1} + W_oc c_t + b_o)   (4)
m_t = o_t ⊙ h(c_t)   (5)
y_t = W_ym m_t + b_y   (6)

where the W terms denote weight matrices (e.g. W_ix is the matrix of weights from the input gate to the input), the b terms denote bias vectors (b_i is the input gate bias vector), σ is the logistic sigmoid function, and i, f, o and c are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the cell output activation vector m. ⊙ is the element-wise product of the vectors, and g and h are the cell input and cell output activation functions, generally tanh.

[Fig 1: LSTM based RNN architectures with a recurrent projection layer and an optional non-recurrent projection layer. A single memory block is shown for clarity.]
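As a concrete (illustrative, not the paper's code) reading of equations (1)–(6), one time step in NumPy might look as follows; W and b are dicts of weight matrices and bias vectors with made-up keys.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, m_prev, c_prev, W, b):
    i = sigmoid(W['ix'] @ x + W['im'] @ m_prev + W['ic'] @ c_prev + b['i'])   # input gate     (1)
    f = sigmoid(W['fx'] @ x + W['mf'] @ m_prev + W['cf'] @ c_prev + b['f'])   # forget gate    (2)
    c = f * c_prev + i * np.tanh(W['cx'] @ x + W['cm'] @ m_prev + b['c'])     # cell state     (3)
    o = sigmoid(W['ox'] @ x + W['om'] @ m_prev + W['oc'] @ c + b['o'])        # output gate    (4)
    m = o * np.tanh(c)                                                        # cell output    (5)
    y = W['ym'] @ m + b['y']                                                  # network output (6)
    return y, m, c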

With the proposed LSTM architecture with both recurrent and non-recurrent projection layers, the equations are as follows:

i_t = σ(W_ix x_t + W_ir r_{t-1} + W_ic c_{t-1} + b_i)   (7)
f_t = σ(W_fx x_t + W_rf r_{t-1} + W_cf c_{t-1} + b_f)   (8)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g(W_cx x_t + W_cr r_{t-1} + b_c)   (9)
o_t = σ(W_ox x_t + W_or r_{t-1} + W_oc c_t + b_o)   (10)
m_t = o_t ⊙ h(c_t)   (11)
r_t = W_rm m_t   (12)
p_t = W_pm m_t   (13)
y_t = W_yr r_t + W_yp p_t + b_y   (14)

where the r and p denote the recurrent and optional non-recurrent unit activations.
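Continuing the sketch above (reusing np and sigmoid from it), the projected variant only changes what is fed back and what drives the output, per equations (11)–(14); again the names are illustrative.

def lstmp_step(x, r_prev, c_prev, W, b):
    i = sigmoid(W['ix'] @ x + W['ir'] @ r_prev + W['ic'] @ c_prev + b['i'])
    f = sigmoid(W['fx'] @ x + W['rf'] @ r_prev + W['cf'] @ c_prev + b['f'])
    c = f * c_prev + i * np.tanh(W['cx'] @ x + W['cr'] @ r_prev + b['c'])
    o = sigmoid(W['ox'] @ x + W['or'] @ r_prev + W['oc'] @ c + b['o'])
    m = o * np.tanh(c)
    r = W['rm'] @ m                      # recurrent projection, fed back at the next timestep (12)
    p = W['pm'] @ m                      # optional non-recurrent projection                   (13)
    y = W['yr'] @ r + W['yp'] @ p + b['y']                                                   # (14)
    return y, r, c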

2.1 Implementation

We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU. The decision was based on CPU's relatively simpler implementation complexity and ease of debugging. CPU implementation also allows easier distributed implementation on a large cluster of machines if the learning time of large networks becomes a major bottleneck on a single machine [14]. For matrix operations, we use the Eigen matrix library [15]. This templated C++ library provides efficient implementations for matrix operations on CPU using vectorized instructions (SIMD – single instruction, multiple data). We implemented activation functions and gradient calculations on matrices using SIMD instructions to benefit from parallelization.

We use the asynchronous stochastic gradient descent (ASGD) optimization technique. The update of the parameters with the gradients is done asynchronously from multiple threads on a multi-core machine. Each thread operates on a batch of sequences in parallel for computational efficiency – for instance, we can do matrix-matrix multiplications rather than vector-matrix multiplications – and for more stochasticity, since model parameters can be updated from multiple input sequences at the same time. In addition to batching of sequences in a single thread, training with multiple threads effectively ...

ASR Lecture 13 Recurrent Network Acoustic Models 16

Google LSTM experiments (2014–date)

Google voice search data – 1900h training set

Input features: 40-d log mel filterbank energies

No input context – a single 40-d frame presented each timestep

LSTM networks

400–6000 LSTM cells per layer
1–7 layers
13M–37M trainable parameters
14,247 context-dependent states

Delay output targets by 5 frames

Use RNN to estimate scaled likelihoods

Train using BPTT

ASR Lecture 13 Recurrent Network Acoustic Models 17

LSTM acoustic modelling architectures

[Figure 1: LSTMP RNN architecture. A single memory block is shown for clarity.]

... memory cell. The output gate controls the output flow of cell activations into the rest of the network. Later, the forget gate was added to the memory block [18]. This addressed a weakness of LSTM models, preventing them from processing continuous input streams that are not segmented into subsequences. The forget gate scales the internal state of the cell before adding it as input to the cell through the self-recurrent connection of the cell, therefore adaptively forgetting or resetting the cell's memory. In addition, the modern LSTM architecture contains peephole connections from its internal cells to the gates in the same cell to learn precise timing of the outputs [19].

An LSTM network computes a mapping from an input sequence x = (x_1, ..., x_T) to an output sequence y = (y_1, ..., y_T) by calculating the network unit activations using the following equations iteratively from t = 1 to T:

i_t = σ(W_ix x_t + W_im m_{t-1} + W_ic c_{t-1} + b_i)   (1)
f_t = σ(W_fx x_t + W_fm m_{t-1} + W_fc c_{t-1} + b_f)   (2)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g(W_cx x_t + W_cm m_{t-1} + b_c)   (3)
o_t = σ(W_ox x_t + W_om m_{t-1} + W_oc c_t + b_o)   (4)
m_t = o_t ⊙ h(c_t)   (5)
y_t = φ(W_ym m_t + b_y)   (6)

where the W terms denote weight matrices (e.g. W_ix is the matrix of weights from the input gate to the input), W_ic, W_fc, W_oc are diagonal weight matrices for peephole connections, the b terms denote bias vectors (b_i is the input gate bias vector), σ is the logistic sigmoid function, and i, f, o and c are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the cell output activation vector m. ⊙ is the element-wise product of the vectors, g and h are the cell input and cell output activation functions (generally, and in this paper, tanh), and φ is the network output activation function, softmax in this paper.

2.2 Deep LSTM

As with DNNs with deeper architectures, deep LSTM RNNs have been successfully used for speech recognition [11, 17, 2]. Deep LSTM RNNs are built by stacking multiple LSTM layers. Note that LSTM RNNs are already deep architectures in the sense that they can be considered as a feed-forward neural network unrolled in time where each layer shares the same model parameters. One can see that the inputs to the model go through multiple non-linear layers as in DNNs; however, the features from a given time instant are only processed by a single nonlinear layer before contributing to the output for that time instant. Therefore, the depth in deep LSTM RNNs has an additional meaning. The input to the network at a given time step goes through multiple LSTM layers in addition to propagation through time and LSTM layers. It has been argued that deep layers in RNNs allow the network to learn at different time scales over the input [20]. Deep LSTM RNNs offer another benefit over standard LSTM RNNs: they can make better use of parameters by distributing them over the space through multiple layers. For instance, rather than increasing the memory size of a standard model by a factor of 2, one can have 4 layers with approximately the same number of parameters. This results in inputs going through more non-linear operations per time step.

[Figure 2: LSTM RNN architectures – (a) LSTM, (b) DLSTM (stacked LSTM layers), (c) LSTMP (LSTM with a recurrent projection layer), (d) DLSTMP (stacked LSTMP layers)]

2.3 LSTMP - LSTM with Recurrent Projection Layer

The standard LSTM RNN architecture has an input layer, a recurrent LSTM layer and an output layer. The input layer is connected to the LSTM layer. The recurrent connections in the LSTM layer are directly from the cell output units to the cell input units, input gates, output gates and forget gates. The cell output units are also connected to the output layer of the network. The total number of parameters N in a standard LSTM network with one cell in each memory block, ignoring the biases, can be calculated as N = n_c × n_c × 4 + n_i × n_c × 4 + n_c × n_o + n_c × 3, where n_c is the number of memory cells (and number of memory blocks in this case), n_i is the number of input units, and n_o is the number of output units. The computational complexity of learning LSTM models per weight and time step with the stochastic gradient descent (SGD) optimization technique is O(1). Therefore the learning computational complexity per time step is O(N). The learning time for a network with a moderate number of inputs is dominated by the n_c × (4 × n_c + n_o) factor. For the tasks requiring a large number of output units and a large number of memory cells to store temporal contextual information, learning LSTM models becomes computationally expensive.

As an alternative to the standard architecture, we proposed the Long Short-Term Memory Projected (LSTMP) architecture to address the computational complexity of learning LSTM models [3]. This architecture, shown in Figure 1, has a separate linear projection layer after the LSTM layer. The recurrent connections now connect from this recurrent projection layer to the input of the LSTM layer. The network output units are connected to this recurrent layer. The number of parameters in this model is n_c × n_r × 4 + n_i × n_c × 4 + n_r × n_o + n_c × n_r + n_c × 3.
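To make the parameter counts concrete, here is a small illustrative calculation (mine, not the paper's) of the two formulas; the input and output sizes are taken from the experiments section (40 features, 14247 CD states).

def lstm_params(nc, ni, no):
    # standard LSTM, one cell per memory block, biases ignored
    return nc * nc * 4 + ni * nc * 4 + nc * no + nc * 3

def lstmp_params(nc, nr, ni, no):
    # LSTM with a recurrent projection layer of nr units, biases ignored
    return nc * nr * 4 + ni * nc * 4 + nr * no + nc * nr + nc * 3

print(lstmp_params(2048, 512, 40, 14247))   # ~12.9M, matching the 13M single-layer LSTMP row in Table 1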

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 18

LSTM with projection layer (LSTMP)

[Figure 1: LSTMP RNN architecture. A single memory block is shown for clarity.]

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 19

LSTM ASR Results (Google Voice Search task)

Table 1: Experiments with LSTM and LSTMP RNN architectures showing test set WERs and frame accuracies on development and training sets. L indicates the number of layers, for shallow (1L) and deep (2, 4, 5, 7L) networks. C indicates the number of memory cells, P the number of recurrent projection units, and N the total number of parameters.

C      P     Depth   N      Dev (%)   Train (%)   WER (%)

840    -     5L      37M    67.7      70.7        10.9
440    -     5L      13M    67.6      70.1        10.8
600    -     2L      13M    66.4      68.5        11.3
385    -     7L      13M    66.2      68.5        11.2
750    -     1L      13M    63.3      65.5        12.4

6000   800   1L      36M    67.3      74.9        11.8
2048   512   2L      22M    68.8      72.0        10.8
1024   512   3L      20M    69.3      72.5        10.7
1024   512   2L      15M    69.0      74.0        10.7
800    512   2L      13M    69.0      72.7        10.7

2048   512   1L      13M    67.3      71.8        11.3

where n_r is the number of units in the recurrent projection layer. In this case, the model size and the learning computational complexity are dominated by the n_r × (4 × n_c + n_o) factor. Hence, this allows us to reduce the number of parameters by the ratio n_r / n_c.

By setting n_r < n_c, we can increase the model memory (n_c) and still be able to control the number of parameters in the recurrent connections and output layer.

With the proposed LSTMP architecture, the equations for the activations of network units change slightly: the m_{t-1} activation vector is replaced with r_{t-1}, and the following is added:

r_t = W_rm m_t   (7)
y_t = φ(W_yr r_t + b_y)   (8)

where the r denote the recurrent unit activations

2.4 Deep LSTMP

Similar to deep LSTM, we propose deep LSTMP, where multiple LSTM layers, each with a separate recurrent projection layer, are stacked. LSTMP allows the memory of the model to be increased independently from the output layer and recurrent connections. However, we noticed that increasing the memory size makes the model more prone to overfitting by memorizing the input sequence data. We know that DNNs generalize better to unseen examples with increasing depth. The depth makes the models harder to overfit to the training data, since the inputs to the network need to go through many non-linear functions. With this motivation, we have experimented with deep LSTMP architectures, where the aim is increasing the memory size and generalization power of the model.

3. Distributed Training: Scaling up to Large Models with Parallelization

We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU. The decision was based on CPU's relatively simpler implementation complexity, ease of debugging, and the ability to use clusters made from commodity hardware. For matrix operations, we used the Eigen matrix library [21]. This templated C++ library provides efficient implementations for matrix operations on CPU using vectorized instructions. We implemented activation functions and gradient calculations on matrices using SIMD instructions to benefit from parallelization.

We use the truncated backpropagation through time (BPTT) learning algorithm [22] to compute parameter gradients on short subsequences of the training utterances. Activations are forward propagated for a fixed step time T_bptt (e.g. 20). Cross entropy gradients are computed for this subsequence and backpropagated to its start. For computational efficiency, each thread operates on subsequences of four utterances at a time, so matrix multiplies can operate in parallel on four frames at a time. We use asynchronous stochastic gradient descent (ASGD) [23] to optimize the network parameters, updating the parameters asynchronously from multiple threads on a multi-core machine. This effectively increases the batch size and reduces the correlation of the frames in a given batch. After a thread has updated the parameters, it continues with the next subsequence in each utterance, preserving the LSTM state, or starts new utterances with reset state when one finishes. Note that the last subsequence of each utterance can be shorter than T_bptt but is padded to the full length, though no gradient is generated for these padding frames.
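A toy sketch (assumptions mine) of the batching described here: four utterances are stepped through in parallel in chunks of T_bptt frames; padding of a final short chunk and the replacement of finished utterances are omitted for brevity.

def subsequence_batches(utterances, t_bptt=20, n_streams=4):
    # utterances: list of per-utterance frame lists; yields one chunk per stream per step
    streams = utterances[:n_streams]
    pos = 0
    while any(pos < len(u) for u in streams):
        yield [u[pos:pos + t_bptt] for u in streams]   # truncated-BPTT subsequences, processed together
        pos += t_bptt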

This highly parallel single machine ASGD framework described in [3] proved slow for training models of the scale we have used for large scale ASR with DNNs (many millions of parameters). To scale further, we replicate the single-machine workers on many (e.g. 500) separate machines, each with three synchronized computation threads. Each worker communicates with a shared distributed parameter server [23] which stores the LSTM parameters. When a worker has computed the parameter gradient on a minibatch (of 3 × 4 × T_bptt frames), the gradient vector is partitioned and sent to the parameter server shards, which each add the gradients to their parameters and respond with the new parameters. The parameter server shards aggregate parameter updates completely asynchronously. For instance, gradient updates from workers may arrive in different orders at different shards of the parameter server. Despite the asynchrony, we observe stable convergence, though the learning rate must be reduced, as would be expected because of the increase in the effective batch size from the greater parallelism.

4. Experiments

We evaluate and compare the performance of LSTM RNN architectures on a large vocabulary speech recognition task – the Google Voice Search task. We use a hybrid approach [24] for acoustic modeling with LSTM RNNs, wherein the neural networks estimate hidden Markov model (HMM) state posteriors. We scale the state posteriors by the state priors, estimated as the relative state frequency from the training data, to obtain the acoustic frame likelihoods. We deweight the silence state counts by a factor of 2.7 when estimating the state frequencies.
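A minimal sketch of this scaling (one plausible reading, not the paper's code): divide the state posteriors by priors estimated from alignment counts, with the silence counts deweighted by the stated factor.

import numpy as np

def scaled_likelihoods(posteriors, state_counts, silence_states, deweight=2.7):
    # posteriors: (frames, states) network outputs; state_counts: counts from the training alignment
    counts = state_counts.astype(float).copy()
    counts[silence_states] /= deweight          # deweight the silence state counts
    priors = counts / counts.sum()              # relative state frequencies
    return posteriors / priors                  # scaled likelihoods used as acoustic frame likelihoods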

4.1 Systems & Evaluation

All the networks are trained on a 3 million utterance (about 1900 hours) dataset consisting of anonymized and hand-transcribed utterances. The dataset is represented with 25 ms frames of 40-dimensional log-filterbank energy features computed every 10 ms. The utterances are aligned with an 85 million parameter DNN with 14247 CD states. The weights in all the networks are initialized to the range (-0.02, 0.02) with a uniform distribution. We try to set the learning rate specific to a network architecture and its configuration to the largest value ...

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 20

Reading

S Hochreiter and J Schmidhuber (1997), "Long Short-Term Memory", Neural Computation 9:1735–1780.
http://www.mitpressjournals.org/doi/abs/10.1162/neco.1997.9.8.1735

F A Gers et al (2000), "Learning to Forget: Continual Prediction with LSTM", Neural Computation 12:2451–2471.
http://www.mitpressjournals.org/doi/abs/10.1162/089976600300015015

T Robinson et al (1996), "The use of recurrent networks in continuous speech recognition", in Automatic Speech and Speaker Recognition – Advanced Topics, Lee et al (eds), Kluwer, 233–258.
http://www.cstr.ed.ac.uk/downloads/publications/1996/rnn4csr96.pdf

H Sak et al (2014), "LSTM recurrent neural network architectures for large scale acoustic modeling", Interspeech-2014.
http://research.google.com/pubs/archive/43905.pdf

ASR Lecture 13 Recurrent Network Acoustic Models 21

Page 4: Recurrent Network Acoustic Models - School of Informatics · store temporal contextual information, learning LSTM models be-come computationally expensive. As an alternative to the

Recurrent networks

Can think of recurrent networks in terms of the dynamics of therecurrent hidden state

Settle to a fixed point ndash stable representation for a sequence(eg machine translation)

Regular oscillation (ldquolimit cyclerdquo) ndash learn some kind ofrepetition

Chaotic dynamics (non-repetitive) ndash theoretically interesting(ldquocomputation at the edge of chaosrdquo)

Useful behaviours of recurrent networks

Recurrent state as memory ndash remember things for(potentially) an infinite time

Recurrent state as information compression ndash compress asequence into a state representation

ASR Lecture 13 Recurrent Network Acoustic Models 4

Simplest recurrent network

yk(t) = softmax

(Hsumr=0

w(2)kr hr (t)

)

hj(t) = sigmoid

dsum

s=0

w(1)js xs(t) +

Hsumr=0

w(R)jr hr (t minus 1)︸ ︷︷ ︸

Recurrent part

Hidden (t)

Output (t)

Input (t) Hidden (t-1)

w(1)

w(2)

w(R)

ASR Lecture 13 Recurrent Network Acoustic Models 5

Recurrent network unfolded in time

Hidden (t)

Output (t)

Input (t)

Hidden (t-1)

w(1)

w(2)

w(R)

Hidden (t+1)

Input (t-1)

Output (t-1)

Input (t+1)

Output (t+1)

w(2)w(2)

w(1)w(1)

w(R)w(R) w(R)

An RNN for a sequence of T inputs can be viewed as a deepT -layer network with shared weights

We can train an RNN by doing backprop through thisunfolded network making sure we share the weightsWeight sharing

if two weights are constrained to be equal (w1 = w2) then theywill stay equal if the weight changes are equal(partEpartw1 = partEpartw2)achieve this by updating with (partEpartw1 + partEpartw2) (cf ConvNets)

ASR Lecture 13 Recurrent Network Acoustic Models 6

Back-propagation through time (BPTT)

We can train a network by unfolding and back-propagatingthrough time summing the derivatives for each weight as wego through the sequence

More efficiently run as a recurrent network

cache the unit outputs at each timestepcache the output errors at each timestepthen backprop from the final timestep to zero computing thederivatives at each stepcompute the weight updates by summing the derivatives acrosstime

Expensive ndash backprop for a 1000 item sequence equivalent toa 1000-layer feed-forward network

Truncated BPTT ndash backprop through just a few time steps(eg 20)

ASR Lecture 13 Recurrent Network Acoustic Models 7

Vanishing and exploding gradients

BPTT involves taking the product of many gradients (as in avery deep network) ndash this can lead to vanishing (componentgradients less than 1) or exploding (greater than 1) gradients

This can prevent effective training

Modified optimisation algorithms

RMSProp (normalise the gradient for each weight by averageof it magnitude learning rate for each weight)Hessian-free ndash an approximation to second-order approacheswhich use curvature information

Modified hidden unit transfer functionsLong short term memory (LSTM)

Linear self-recurrence for each hidden unit (long-term memory)Gates - dynamic weights which are a function of the inputs

ReLUs

ASR Lecture 13 Recurrent Network Acoustic Models 8

Recurrent networks in speech recognition

1990s ndash Hybrid RNNHMM speech recognition (Robinson etal)

2009 onwards ndash RNN language models (Mikolov ndash new stateof the art next weekrsquos lecture)

2013 onwards ndash RNNLSTM models at state of the art foracoustic modelling (Graves Sak et al)

2015 onwards ndash RNN sequence modelling also replaces theHMM ndash ldquoHMM-freerdquo ASR (next week also)

ASR Lecture 13 Recurrent Network Acoustic Models 9

Recurrent networks in speech recognition

1990s ndash Hybrid RNNHMM speech recognition(Robinson et al)

2009 onwards ndash RNN language models (Mikolov ndash new stateof the art next weekrsquos lecture)

2013 onwards ndash RNNLSTM models at state of the artfor acoustic modelling (Graves Sak et al)

2015 onwards ndash RNN sequence modelling also replaces theHMM ndash ldquoHMM-freerdquo ASR (next week also)

ASR Lecture 13 Recurrent Network Acoustic Models 9

Speech recognition with recurrent networks (1990s)

time (ms)

freq (

Hz)

0 200 400 600 800 1000 1200 14000

2000

4000

6000

8000

RecurrentNeuralNetwork

SpeechAcoustics

Phoneme Probabilities

Robinson et al (1996)

ASR Lecture 13 Recurrent Network Acoustic Models 10

1990s RNN

Features

MEL+ Filter-bank outputs + voicing parameters (23featuresframe)PLP cepstral coefficients (13 features)Coefficients normalized to zero mean and unit variance

256 hidden (state) units

79 context-independent phone classes (outputs)

Output training target delayed by 5 frames

Use RNN scaled likelihoods in hybrid RNNHMM

About 100k trainable parameters

Trained using stochastic BPTT (using method similar toRpropRMSprop) on WSJ0 (3M training examples)

In 1994 training took five days on a specially designed parallelcomputer (the RAP)

ASR Lecture 13 Recurrent Network Acoustic Models 11

Combined RNN system

u(t)

x(t+1)x(t)

Timedelay

y(t-4)

u(t)

x(t+1)x(t)

Timedelay

y(t-4)

u(t)

x(t+1)x(t)

Timedelay

y(t-4)

u(t)

x(t+1)x(t)

Timedelay

y(t-4)

Mel +

PLP

Mel +

PLP

sh ow

iym

Speech waveform Preprocessor Recurrent net Markov mode

Individual systems WER of 14ndash15 on WSJ ldquospoke 6 datardquoInterpolated in log domain WER=11(Best context dependent GMMHMM system in 1995 WER=8)

ASR Lecture 13 Recurrent Network Acoustic Models 12

LSTM v1

I(t) H(t-1)

H(t)

BasicSigmoid

Unit

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v1

I(t) H(t-1)

H(t)

1

LinearRecurrence

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v1

I(t) H(t-1)

H(t)

1

OutputGate

InputGate

S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v2

I(t) H(t-1)

H(t)

ForgetGate

FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471

ASR Lecture 13 Recurrent Network Acoustic Models 14

LSTM v3

I(t) H(t-1)

H(t)

C(t)

C(t-1)

PeepholeConnections

ASR Lecture 13 Recurrent Network Acoustic Models 15

LSTM equations

2 LSTM ARCHITECTURES

In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows

W = nc nc 4 + ni nc 4 + nc no + nc 3

where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive

As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units

An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T

it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)

ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)

mt = ot h(ct) (5)yt = Wymmt + by (6)

where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product

inpu

t

g ct1 h

it

ft

ct

ot recu

rren

tpr

ojec

tion

outp

ut

xt

mt

pt

rt

rt1

yt

memory blocks

Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity

of the vectors and g and h are the cell input and cell output activationfunctions generally tanh

With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows

it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)

ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)

mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)

yt = Wyrrt + Wyppt + by (14)(15)

where the r and p denote the recurrent and optional non-recurrentunit activations

21 Implementation

We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization

We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively

ASR Lecture 13 Recurrent Network Acoustic Models 16

Google LSTM experiments (2014ndashdate)

Google voice search data ndash 1900h training set

Input features 40-d log mel filterbank energies

No input context single frame 40-d frame presented eachtimestep

LSTM networks

400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states

Delay output targets by 5 frames

Use RNN to estimated scaled likelihoods

Train using BPTT

ASR Lecture 13 Recurrent Network Acoustic Models 17

LSTM acoustic modelling architectures

inpu

t

g cell h

it

ft

ct

ot

recu

rren

t

outp

ut

xt

mt

rt

rt1

yt

LSTM memory blocks

Figure 1 LSTMP RNN architecture A single memory block isshown for clarity

memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]

An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T

it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)

ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)

mt = ot h(ct) (5)yt = (Wymmt + by) (6)

where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc

are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper

22 Deep LSTM

As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time

input

LSTM

output

(a) LSTM

input

LSTM

LSTM

output

(b) DLSTM

input

LSTM

recurrent

output

(c) LSTMP

input

LSTM

recurrent

LSTM

recurrent

output

(d) DLSTMP

Figure 2 LSTM RNN architectures

instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step

23 LSTMP - LSTM with Recurrent Projection Layer

The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive

As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 18

LSTM with projection layer (LSTMP)

inpu

t

g cell h

it

ft

ct

ot

recu

rren

t

outp

ut

xt

mt

rt

rt1

yt

LSTM memory blocks

Figure 1 LSTMP RNN architecture A single memory block isshown for clarity

memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]

An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T

it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)

ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)

mt = ot h(ct) (5)yt = (Wymmt + by) (6)

where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc

are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper

22 Deep LSTM

As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time

input

LSTM

output

(a) LSTM

input

LSTM

LSTM

output

(b) DLSTM

input

LSTM

recurrent

output

(c) LSTMP

input

LSTM

recurrent

LSTM

recurrent

output

(d) DLSTMP

Figure 2 LSTM RNN architectures

instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step

23 LSTMP - LSTM with Recurrent Projection Layer

The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive

As an alternative to the standard architecture, we proposed the Long Short-Term Memory Projected (LSTMP) architecture to address the computational complexity of learning LSTM models [3]. This architecture, shown in Figure 1, has a separate linear projection layer after the LSTM layer. The recurrent connections now connect from this recurrent projection layer to the input of the LSTM layer. The network output units are connected to this recurrent layer. The number of parameters in this model is n_c × n_r × 4 + n_i × n_c × 4 + n_r × n_o + n_c × n_r + n_c × 3,

(Sak 2014)
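As a quick sanity check of the two parameter counts quoted above, the sketch below (an editorial illustration, not from the paper) evaluates N for a standard LSTM layer and an LSTMP layer, using n_i = 40 and n_o = 14247 from the experimental setup described later:

```python
def lstm_params(n_c, n_i, n_o):
    """Standard LSTM layer, one cell per memory block, ignoring biases."""
    return n_c * n_c * 4 + n_i * n_c * 4 + n_c * n_o + n_c * 3

def lstmp_params(n_c, n_r, n_i, n_o):
    """LSTM layer with a recurrent projection layer of n_r units, ignoring biases."""
    return n_c * n_r * 4 + n_i * n_c * 4 + n_r * n_o + n_c * n_r + n_c * 3

n_i, n_o = 40, 14247   # 40-d log-filterbank input, 14247 CD-state outputs
print(lstm_params(2048, n_i, n_o))         # 46288896: about 46M parameters
print(lstmp_params(2048, 512, n_i, n_o))   # 12871168: about 13M, the 2048C/512P row of Table 1
```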

ASR Lecture 13 Recurrent Network Acoustic Models 19

LSTM ASR Results (Google Voice Search task)

Table 1: Experiments with LSTM and LSTMP RNN architectures, showing test set WERs and frame accuracies on development and training sets. L indicates the number of layers, for shallow (1L) and deep (2, 4, 5, 7L) networks. C indicates the number of memory cells, P the number of recurrent projection units, and N the total number of parameters.

C      P     Depth   N     Dev FA (%)   Train FA (%)   WER (%)
840    -     5L      37M   67.7         70.7           10.9
440    -     5L      13M   67.6         70.1           10.8
600    -     2L      13M   66.4         68.5           11.3
385    -     7L      13M   66.2         68.5           11.2
750    -     1L      13M   63.3         65.5           12.4
6000   800   1L      36M   67.3         74.9           11.8
2048   512   2L      22M   68.8         72.0           10.8
1024   512   3L      20M   69.3         72.5           10.7
1024   512   2L      15M   69.0         74.0           10.7
800    512   2L      13M   69.0         72.7           10.7
2048   512   1L      13M   67.3         71.8           11.3

where n_r is the number of units in the recurrent projection layer. In this case the model size and the learning computational complexity are dominated by the n_r × (4 × n_c + n_o) factor. Hence this allows us to reduce the number of parameters by the ratio n_r / n_c.

By setting n_r < n_c we can increase the model memory (n_c) and still be able to control the number of parameters in the recurrent connections and output layer.

With the proposed LSTMP architecture, the equations for the activations of network units change slightly: the m_{t-1} activation vector is replaced with r_{t-1}, and the following is added:

r_t = W_rm m_t   (7)
y_t = φ(W_yr r_t + b_y)   (8)

where r denotes the recurrent unit activations.
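A corresponding sketch of one LSTMP step, with the same illustrative conventions and the sigmoid/softmax helpers from before (the 'ir'-style recurrent weight names are assumptions for this sketch): the recurrent input is now the projected vector r_{t-1}, and Eq. (7) adds a single extra matrix multiply.

```python
def lstmp_step(x_t, r_prev, c_prev, W, b):
    """One LSTMP step: as lstm_step, but the recurrence (and the output layer)
    uses the projected activations r_t = W_rm m_t, Eqs. (7)-(8)."""
    i_t = sigmoid(W['ix'] @ x_t + W['ir'] @ r_prev + W['ic'] * c_prev + b['i'])
    f_t = sigmoid(W['fx'] @ x_t + W['fr'] @ r_prev + W['fc'] * c_prev + b['f'])
    c_t = f_t * c_prev + i_t * np.tanh(W['cx'] @ x_t + W['cr'] @ r_prev + b['c'])
    o_t = sigmoid(W['ox'] @ x_t + W['or'] @ r_prev + W['oc'] * c_t + b['o'])
    m_t = o_t * np.tanh(c_t)
    r_t = W['rm'] @ m_t                     # (7): n_r-dimensional projection
    y_t = softmax(W['yr'] @ r_t + b['y'])   # (8)
    return y_t, r_t, c_t
```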

2.4 Deep LSTMP

Similar to deep LSTM, we propose deep LSTMP, where multiple LSTM layers, each with a separate recurrent projection layer, are stacked. LSTMP allows the memory of the model to be increased independently from the output layer and recurrent connections. However, we noticed that increasing the memory size makes the model more prone to overfitting by memorizing the input sequence data. We know that DNNs generalize better to unseen examples with increasing depth. The depth makes the models harder to overfit to the training data, since the inputs to the network need to go through many non-linear functions. With this motivation we have experimented with deep LSTMP architectures, where the aim is increasing the memory size and generalization power of the model.

3. Distributed Training: Scaling up to Large Models with Parallelization

We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU. The decision was based on CPU's relatively simpler implementation complexity, ease of debugging, and the ability to use clusters made from commodity hardware. For matrix operations we used the Eigen matrix library [21]. This templated C++ library provides efficient implementations for matrix operations on CPU using vectorized instructions. We implemented activation functions and gradient calculations on matrices using SIMD instructions to benefit from parallelization.

We use the truncated backpropagation through time (BPTT) learning algorithm [22] to compute parameter gradients on short subsequences of the training utterances. Activations are forward propagated for a fixed step time T_bptt (e.g. 20). Cross-entropy gradients are computed for this subsequence and backpropagated to its start. For computational efficiency each thread operates on subsequences of four utterances at a time, so matrix multiplies can operate in parallel on four frames at a time. We use asynchronous stochastic gradient descent (ASGD) [23] to optimize the network parameters, updating the parameters asynchronously from multiple threads on a multi-core machine. This effectively increases the batch size and reduces the correlation of the frames in a given batch. After a thread has updated the parameters, it continues with the next subsequence in each utterance, preserving the LSTM state, or starts new utterances with reset state when one finishes. Note that the last subsequence of each utterance can be shorter than T_bptt but is padded to the full length, though no gradient is generated for these padding frames.
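A minimal single-threaded sketch of that loop (illustrative Python; forward_backward and apply_gradients are hypothetical stand-ins for the real BPTT and ASGD update code): it walks an utterance in chunks of T_bptt frames, carries the LSTM state across chunks, and starts from a fresh state for each new utterance.

```python
T_BPTT = 20   # truncation length in frames, as in the paper

def train_utterance(model, frames, targets, forward_backward, apply_gradients):
    """Truncated BPTT over one utterance: forward-propagate T_BPTT frames,
    backpropagate the cross-entropy gradients to the start of the chunk,
    update, and carry the LSTM state into the next chunk."""
    state = model.initial_state()               # reset state at the start of each utterance
    for start in range(0, len(frames), T_BPTT):
        chunk_x = frames[start:start + T_BPTT]  # the last chunk may be shorter
        chunk_y = targets[start:start + T_BPTT]
        # Gradients are computed only within the chunk; the returned state is
        # carried forward so context is preserved beyond the truncation horizon.
        grads, state = forward_backward(model, chunk_x, chunk_y, state)
        apply_gradients(model, grads)           # an ASGD update in the paper, from many threads
```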

This highly parallel single-machine ASGD framework described in [3] proved slow for training models of the scale we have used for large scale ASR with DNNs (many millions of parameters). To scale further, we replicate the single-machine workers on many (e.g. 500) separate machines, each with three synchronized computation threads. Each worker communicates with a shared, distributed parameter server [23], which stores the LSTM parameters. When a worker has computed the parameter gradient on a minibatch (of 3 × 4 × T_bptt frames), the gradient vector is partitioned and sent to the parameter server shards, which each add the gradients to their parameters and respond with the new parameters. The parameter server shards aggregate parameter updates completely asynchronously. For instance, gradient updates from workers may arrive in different orders at different shards of the parameter server. Despite the asynchrony, we observe stable convergence, though the learning rate must be reduced, as would be expected because of the increase in the effective batch size from the greater parallelism.
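Schematically, the worker / parameter-server exchange looks like the following (a rough sketch only; the real system uses the distributed infrastructure of [23], and the shard update rule is a stand-in for whatever optimizer the server actually applies):

```python
import numpy as np

class ParameterShard:
    """One shard of the parameter server, owning a slice of the LSTM parameters."""
    def __init__(self, params_slice, lr):
        self.params = params_slice
        self.lr = lr

    def apply_gradient(self, grad):
        # Each shard applies incoming gradients independently and asynchronously;
        # there is no ordering guarantee across shards or across workers.
        self.params -= self.lr * grad
        return self.params

def worker_exchange(shards, grad_parts):
    """A worker partitions its minibatch gradient, sends one part to each shard,
    and receives the freshly updated parameter slices in return."""
    return [shard.apply_gradient(g) for shard, g in zip(shards, grad_parts)]
```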

4. Experiments
We evaluate and compare the performance of LSTM RNN architectures on a large vocabulary speech recognition task – the Google Voice Search task. We use a hybrid approach [24] for acoustic modeling with LSTM RNNs, wherein the neural networks estimate hidden Markov model (HMM) state posteriors. We scale the state posteriors by the state priors, estimated as the relative state frequency from the training data, to obtain the acoustic frame likelihoods. We deweight the silence state counts by a factor of 2.7 when estimating the state frequencies.
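The posterior-to-likelihood conversion in this hybrid setup is just a division by the state prior, normally done in the log domain. A minimal sketch, with the silence deweighting applied to the silence state counts before normalisation (the state indexing is an illustrative assumption):

```python
import numpy as np

def state_priors(state_counts, silence_states, deweight=2.7):
    """Relative state frequencies from the training alignments, with the
    silence state counts deweighted by the given factor before normalising."""
    counts = np.asarray(state_counts, dtype=float).copy()
    counts[silence_states] /= deweight
    return counts / counts.sum()

def log_scaled_likelihoods(log_posteriors, priors, eps=1e-10):
    """Hybrid acoustic scores: log p(x|s) = log p(s|x) - log p(s) + const."""
    return log_posteriors - np.log(priors + eps)
```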

4.1 Systems & Evaluation

All the networks are trained on a 3 million utterance (about 1900 hours) dataset, consisting of anonymized and hand-transcribed utterances. The dataset is represented with 25 ms frames of 40-dimensional log-filterbank energy features, computed every 10 ms. The utterances are aligned with an 85 million parameter DNN with 14247 CD states. The weights in all the networks are initialized to the range (-0.02, 0.02) with a uniform distribution. We try to set the learning rate, specific to a network architecture and its configuration, to the largest value that results in stable convergence.

(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 20

Reading

S Hochreiter and J Schmidhuber (1997), "Long Short-Term Memory", Neural Computation, 9:1735-1780.
http://www.mitpressjournals.org/doi/abs/10.1162/neco.1997.9.8.1735

FA Gers et al (2000), "Learning to Forget: Continual Prediction with LSTM", Neural Computation, 12:2451-2471.
http://www.mitpressjournals.org/doi/abs/10.1162/089976600300015015

T Robinson et al (1996), "The use of recurrent networks in continuous speech recognition", in Automatic Speech and Speaker Recognition: Advanced Topics, Lee et al (eds), Kluwer, 233-258.
http://www.cstr.ed.ac.uk/downloads/publications/1996/rnn4csr96.pdf

H Sak et al (2014), "LSTM recurrent neural network architectures for large scale acoustic modeling", Interspeech-2014.
http://research.google.com/pubs/archive/43905.pdf

ASR Lecture 13 Recurrent Network Acoustic Models 21

Page 6: Recurrent Network Acoustic Models - School of Informatics · store temporal contextual information, learning LSTM models be-come computationally expensive. As an alternative to the

Recurrent network unfolded in time

Hidden (t)

Output (t)

Input (t)

Hidden (t-1)

w(1)

w(2)

w(R)

Hidden (t+1)

Input (t-1)

Output (t-1)

Input (t+1)

Output (t+1)

w(2)w(2)

w(1)w(1)

w(R)w(R) w(R)

An RNN for a sequence of T inputs can be viewed as a deepT -layer network with shared weights

We can train an RNN by doing backprop through thisunfolded network making sure we share the weightsWeight sharing

if two weights are constrained to be equal (w1 = w2) then theywill stay equal if the weight changes are equal(partEpartw1 = partEpartw2)achieve this by updating with (partEpartw1 + partEpartw2) (cf ConvNets)

ASR Lecture 13 Recurrent Network Acoustic Models 6

Back-propagation through time (BPTT)

We can train a network by unfolding and back-propagatingthrough time summing the derivatives for each weight as wego through the sequence

More efficiently run as a recurrent network

cache the unit outputs at each timestepcache the output errors at each timestepthen backprop from the final timestep to zero computing thederivatives at each stepcompute the weight updates by summing the derivatives acrosstime

Expensive ndash backprop for a 1000 item sequence equivalent toa 1000-layer feed-forward network

Truncated BPTT ndash backprop through just a few time steps(eg 20)

ASR Lecture 13 Recurrent Network Acoustic Models 7

Vanishing and exploding gradients

BPTT involves taking the product of many gradients (as in avery deep network) ndash this can lead to vanishing (componentgradients less than 1) or exploding (greater than 1) gradients

This can prevent effective training

Modified optimisation algorithms

RMSProp (normalise the gradient for each weight by averageof it magnitude learning rate for each weight)Hessian-free ndash an approximation to second-order approacheswhich use curvature information

Modified hidden unit transfer functionsLong short term memory (LSTM)

Linear self-recurrence for each hidden unit (long-term memory)Gates - dynamic weights which are a function of the inputs

ReLUs

ASR Lecture 13 Recurrent Network Acoustic Models 8

Recurrent networks in speech recognition

1990s ndash Hybrid RNNHMM speech recognition (Robinson etal)

2009 onwards ndash RNN language models (Mikolov ndash new stateof the art next weekrsquos lecture)

2013 onwards ndash RNNLSTM models at state of the art foracoustic modelling (Graves Sak et al)

2015 onwards ndash RNN sequence modelling also replaces theHMM ndash ldquoHMM-freerdquo ASR (next week also)

ASR Lecture 13 Recurrent Network Acoustic Models 9

Recurrent networks in speech recognition

1990s ndash Hybrid RNNHMM speech recognition(Robinson et al)

2009 onwards ndash RNN language models (Mikolov ndash new stateof the art next weekrsquos lecture)

2013 onwards ndash RNNLSTM models at state of the artfor acoustic modelling (Graves Sak et al)

2015 onwards ndash RNN sequence modelling also replaces theHMM ndash ldquoHMM-freerdquo ASR (next week also)

ASR Lecture 13 Recurrent Network Acoustic Models 9

Speech recognition with recurrent networks (1990s)

time (ms)

freq (

Hz)

0 200 400 600 800 1000 1200 14000

2000

4000

6000

8000

RecurrentNeuralNetwork

SpeechAcoustics

Phoneme Probabilities

Robinson et al (1996)

ASR Lecture 13 Recurrent Network Acoustic Models 10

1990s RNN

Features

MEL+ Filter-bank outputs + voicing parameters (23featuresframe)PLP cepstral coefficients (13 features)Coefficients normalized to zero mean and unit variance

256 hidden (state) units

79 context-independent phone classes (outputs)

Output training target delayed by 5 frames

Use RNN scaled likelihoods in hybrid RNNHMM

About 100k trainable parameters

Trained using stochastic BPTT (using method similar toRpropRMSprop) on WSJ0 (3M training examples)

In 1994 training took five days on a specially designed parallelcomputer (the RAP)

ASR Lecture 13 Recurrent Network Acoustic Models 11

Combined RNN system

u(t)

x(t+1)x(t)

Timedelay

y(t-4)

u(t)

x(t+1)x(t)

Timedelay

y(t-4)

u(t)

x(t+1)x(t)

Timedelay

y(t-4)

u(t)

x(t+1)x(t)

Timedelay

y(t-4)

Mel +

PLP

Mel +

PLP

sh ow

iym

Speech waveform Preprocessor Recurrent net Markov mode

Individual systems WER of 14ndash15 on WSJ ldquospoke 6 datardquoInterpolated in log domain WER=11(Best context dependent GMMHMM system in 1995 WER=8)

ASR Lecture 13 Recurrent Network Acoustic Models 12

LSTM v1

I(t) H(t-1)

H(t)

BasicSigmoid

Unit

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v1

I(t) H(t-1)

H(t)

1

LinearRecurrence

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v1

I(t) H(t-1)

H(t)

1

OutputGate

InputGate

S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v2

I(t) H(t-1)

H(t)

ForgetGate

FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471

ASR Lecture 13 Recurrent Network Acoustic Models 14

LSTM v3

I(t) H(t-1)

H(t)

C(t)

C(t-1)

PeepholeConnections

ASR Lecture 13 Recurrent Network Acoustic Models 15

LSTM equations

2 LSTM ARCHITECTURES

In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows

W = nc nc 4 + ni nc 4 + nc no + nc 3

where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive

As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units

An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T

it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)

ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)

mt = ot h(ct) (5)yt = Wymmt + by (6)

where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product

inpu

t

g ct1 h

it

ft

ct

ot recu

rren

tpr

ojec

tion

outp

ut

xt

mt

pt

rt

rt1

yt

memory blocks

Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity

of the vectors and g and h are the cell input and cell output activationfunctions generally tanh

With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows

it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)

ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)

mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)

yt = Wyrrt + Wyppt + by (14)(15)

where the r and p denote the recurrent and optional non-recurrentunit activations

21 Implementation

We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization

We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively

ASR Lecture 13 Recurrent Network Acoustic Models 16

Google LSTM experiments (2014ndashdate)

Google voice search data ndash 1900h training set

Input features 40-d log mel filterbank energies

No input context single frame 40-d frame presented eachtimestep

LSTM networks

400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states

Delay output targets by 5 frames

Use RNN to estimated scaled likelihoods

Train using BPTT

ASR Lecture 13 Recurrent Network Acoustic Models 17

LSTM acoustic modelling architectures

inpu

t

g cell h

it

ft

ct

ot

recu

rren

t

outp

ut

xt

mt

rt

rt1

yt

LSTM memory blocks

Figure 1 LSTMP RNN architecture A single memory block isshown for clarity

memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]

An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T

it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)

ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)

mt = ot h(ct) (5)yt = (Wymmt + by) (6)

where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc

are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper

22 Deep LSTM

As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time

input

LSTM

output

(a) LSTM

input

LSTM

LSTM

output

(b) DLSTM

input

LSTM

recurrent

output

(c) LSTMP

input

LSTM

recurrent

LSTM

recurrent

output

(d) DLSTMP

Figure 2 LSTM RNN architectures

instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step

23 LSTMP - LSTM with Recurrent Projection Layer

The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive

As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 18

LSTM with projection layer (LSTMP)

inpu

t

g cell h

it

ft

ct

ot

recu

rren

t

outp

ut

xt

mt

rt

rt1

yt

LSTM memory blocks

Figure 1 LSTMP RNN architecture A single memory block isshown for clarity

memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]

An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T

it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)

ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)

mt = ot h(ct) (5)yt = (Wymmt + by) (6)

where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc

are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper

22 Deep LSTM

As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time

input

LSTM

output

(a) LSTM

input

LSTM

LSTM

output

(b) DLSTM

input

LSTM

recurrent

output

(c) LSTMP

input

LSTM

recurrent

LSTM

recurrent

output

(d) DLSTMP

Figure 2 LSTM RNN architectures

instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step

23 LSTMP - LSTM with Recurrent Projection Layer

The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive

As an alternative to the standard architecture, we proposed the Long Short-Term Memory Projected (LSTMP) architecture to address the computational complexity of learning LSTM models [3]. This architecture, shown in Figure 1, has a separate linear projection layer after the LSTM layer. The recurrent connections now connect from this recurrent projection layer to the input of the LSTM layer. The network output units are connected to this recurrent layer. The number of parameters in this model is $n_c \times n_r \times 4 + n_i \times n_c \times 4 + n_r \times n_o + n_c \times n_r + n_c \times 3$,

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 19

LSTM ASR Results (Google Voice Search task)

Table 1: Experiments with LSTM and LSTMP RNN architectures, showing test set WERs and frame accuracies on development and training sets. L indicates the number of layers, for shallow (1L) and deep (2, 4, 5, 7L) networks; C indicates the number of memory cells, P the number of recurrent projection units, and N the total number of parameters.

C      P    Depth  N    Dev (%)  Train (%)  WER (%)
840    -    5L     37M  67.7     70.7       10.9
440    -    5L     13M  67.6     70.1       10.8
600    -    2L     13M  66.4     68.5       11.3
385    -    7L     13M  66.2     68.5       11.2
750    -    1L     13M  63.3     65.5       12.4
6000   800  1L     36M  67.3     74.9       11.8
2048   512  2L     22M  68.8     72.0       10.8
1024   512  3L     20M  69.3     72.5       10.7
1024   512  2L     15M  69.0     74.0       10.7
800    512  2L     13M  69.0     72.7       10.7
2048   512  1L     13M  67.3     71.8       11.3

where $n_r$ is the number of units in the recurrent projection layer. In this case the model size and the learning computational complexity are dominated by the $n_r \times (4 \times n_c + n_o)$ factor. Hence this allows us to reduce the number of parameters by the ratio $n_r / n_c$.

By setting $n_r < n_c$ we can increase the model memory ($n_c$) and still be able to control the number of parameters in the recurrent connections and output layer.
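The two parameter-count formulas are easy to check numerically. The helper below is an illustrative sketch: the input and output sizes n_i and n_o are made-up values for the example (not taken from Table 1), and it covers a single layer without biases, exactly as in the formulas above.

def n_params_lstm(n_i, n_c, n_o):
    # N = n_c*n_c*4 + n_i*n_c*4 + n_c*n_o + n_c*3 (one LSTM layer, biases ignored)
    return n_c * n_c * 4 + n_i * n_c * 4 + n_c * n_o + n_c * 3

def n_params_lstmp(n_i, n_c, n_r, n_o):
    # n_c*n_r*4 + n_i*n_c*4 + n_r*n_o + n_c*n_r + n_c*3 (LSTM layer with recurrent projection)
    return n_c * n_r * 4 + n_i * n_c * 4 + n_r * n_o + n_c * n_r + n_c * 3

n_i, n_o = 40, 14000                          # illustrative sizes only
print(n_params_lstm(n_i, 2048, n_o))          # ~45.8M: the 4*n_c^2 recurrent term dominates
print(n_params_lstmp(n_i, 2048, 512, n_o))    # ~12.7M: reduced roughly by the ratio n_r/n_c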

With the proposed LSTMP architecture, the equations for the activations of network units change slightly: the $m_{t-1}$ activation vector is replaced with $r_{t-1}$ and the following is added:

\begin{align}
r_t &= W_{rm} m_t \tag{7} \\
y_t &= \phi(W_{yr} r_t + b_y) \tag{8}
\end{align}

where the $r$ terms denote the recurrent unit activations.

2.4 Deep LSTMP

Similar to deep LSTM, we propose deep LSTMP, where multiple LSTM layers, each with a separate recurrent projection layer, are stacked. LSTMP allows the memory of the model to be increased independently from the output layer and recurrent connections. However, we noticed that increasing the memory size makes the model more prone to overfitting by memorizing the input sequence data. We know that DNNs generalize better to unseen examples with increasing depth. The depth makes the models harder to overfit to the training data, since the inputs to the network need to go through many non-linear functions. With this motivation we have experimented with deep LSTMP architectures, where the aim is increasing the memory size and generalization power of the model.

3. Distributed Training: Scaling up to Large Models with Parallelization

We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU. The decision was based on CPU's relatively simpler implementation complexity, ease of debugging and the ability to use clusters made from commodity hardware. For matrix operations we used the Eigen matrix library [21]. This templated C++ library provides efficient implementations for matrix operations on CPU using vectorized instructions. We implemented activation functions and gradient calculations on matrices using SIMD instructions to benefit from parallelization.

We use the truncated backpropagation through time (BPTT) learning algorithm [22] to compute parameter gradients on short subsequences of the training utterances. Activations are forward propagated for a fixed step time $T_{bptt}$ (e.g. 20). Cross-entropy gradients are computed for this subsequence and backpropagated to its start. For computational efficiency each thread operates on subsequences of four utterances at a time, so matrix multiplies can operate in parallel on four frames at a time. We use asynchronous stochastic gradient descent (ASGD) [23] to optimize the network parameters, updating the parameters asynchronously from multiple threads on a multi-core machine. This effectively increases the batch size and reduces the correlation of the frames in a given batch. After a thread has updated the parameters, it continues with the next subsequence in each utterance, preserving the LSTM state, or starts new utterances with reset state when one finishes. Note that the last subsequence of each utterance can be shorter than $T_{bptt}$ but is padded to the full length, though no gradient is generated for these padding frames.
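The subsequence bookkeeping described in this paragraph can be sketched as a pure scheduling function: which stream processes which T_bptt-frame chunk next, and when the LSTM state should be reset. This is an illustrative reconstruction, not the authors' code; the real trainer would also run the forward/backward passes and the ASGD updates for each yielded chunk.

def bptt_schedule(utt_lengths, t_bptt=20, n_streams=4):
    # Yields (stream, utt_index, start, end, new_utterance): each parallel stream
    # advances t_bptt frames at a time; when an utterance finishes, the stream
    # takes the next one and the LSTM state is reset (new_utterance=True).
    # The final, shorter chunk of an utterance would be padded with no gradient.
    queue = list(range(len(utt_lengths)))
    active = [queue.pop(0) if queue else None for _ in range(n_streams)]
    offset = [0] * n_streams
    fresh = [True] * n_streams
    while any(u is not None for u in active):
        for k in range(n_streams):
            u = active[k]
            if u is None:
                continue
            start, end = offset[k], min(offset[k] + t_bptt, utt_lengths[u])
            yield k, u, start, end, fresh[k]
            fresh[k], offset[k] = False, end
            if end == utt_lengths[u]:
                active[k] = queue.pop(0) if queue else None
                offset[k], fresh[k] = 0, True

for step in bptt_schedule([45, 20, 70], t_bptt=20, n_streams=2):
    print(step)   # e.g. (0, 0, 0, 20, True), (1, 1, 0, 20, True), ...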

This highly parallel single machine ASGD framework described in [3] proved slow for training models of the scale we have used for large scale ASR with DNNs (many millions of parameters). To scale further, we replicate the single-machine workers on many (e.g. 500) separate machines, each with three synchronized computation threads. Each worker communicates with a shared distributed parameter server [23] which stores the LSTM parameters. When a worker has computed the parameter gradient on a minibatch (of $3 \times 4 \times T_{bptt}$ frames), the gradient vector is partitioned and sent to the parameter server shards, which each add the gradients to their parameters and respond with the new parameters. The parameter server shards aggregate parameter updates completely asynchronously. For instance, gradient updates from workers may arrive in different orders at different shards of the parameter server. Despite the asynchrony, we observe stable convergence, though the learning rate must be reduced, as would be expected because of the increase in the effective batch size from the greater parallelism.

4. Experiments

We evaluate and compare the performance of LSTM RNN architectures on a large vocabulary speech recognition task – the Google Voice Search task. We use a hybrid approach [24] for acoustic modeling with LSTM RNNs, wherein the neural networks estimate hidden Markov model (HMM) state posteriors. We scale the state posteriors by the state priors, estimated as the relative state frequency from the training data, to obtain the acoustic frame likelihoods. We deweight the silence state counts by a factor of 2.7 when estimating the state frequencies.
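The prior scaling and silence de-weighting described above amount to a few lines; the sketch below assumes log posteriors from the network and training-alignment state counts, with the 2.7 factor taken from the text and everything else (names, shapes) illustrative.

import numpy as np

def scaled_log_likelihoods(log_posteriors, state_counts, silence_states, deweight=2.7):
    # log_posteriors: (T, n_states) log P(state | x_t) from the network
    # state_counts:   occupancy count of each CD state in the training alignment
    # silence_states: indices whose counts are divided by `deweight` before
    #                 estimating the priors as relative state frequencies
    counts = np.asarray(state_counts, dtype=float).copy()
    counts[silence_states] /= deweight
    log_priors = np.log(counts / counts.sum())
    return log_posteriors - log_priors        # log p(x_t | state), up to a constant

These scaled likelihoods then take the place of GMM likelihoods during HMM decoding.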

4.1 Systems & Evaluation

All the networks are trained on a 3 million utterance (about 1900 hours) dataset consisting of anonymized and hand-transcribed utterances. The dataset is represented with 25 ms frames of 40-dimensional log-filterbank energy features computed every 10 ms. The utterances are aligned with an 85 million parameter DNN with 14247 CD states. The weights in all the networks are initialized to the range (-0.02, 0.02) with a uniform distribution. We try to set the learning rate specific to a network architecture and its configuration to the largest value

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 20

Reading

S. Hochreiter and J. Schmidhuber (1997), "Long Short-Term Memory", Neural Computation, 9:1735–1780.
http://www.mitpressjournals.org/doi/abs/10.1162/neco.1997.9.8.1735

F.A. Gers et al (2000), "Learning to Forget: Continual Prediction with LSTM", Neural Computation, 12:2451–2471.
http://www.mitpressjournals.org/doi/abs/10.1162/089976600300015015

T. Robinson et al (1996), "The use of recurrent networks in continuous speech recognition", in Automatic Speech and Speaker Recognition: Advanced Topics, Lee et al (eds), Kluwer, 233–258.
http://www.cstr.ed.ac.uk/downloads/publications/1996/rnn4csr96.pdf

H. Sak et al (2014), "LSTM recurrent neural network architectures for large scale acoustic modeling", Interspeech-2014.
http://research.google.com/pubs/archive/43905.pdf

ASR Lecture 13 Recurrent Network Acoustic Models 21


Back-propagation through time (BPTT)

We can train a network by unfolding and back-propagating through time, summing the derivatives for each weight as we go through the sequence

More efficiently run as a recurrent network

cache the unit outputs at each timestep
cache the output errors at each timestep
then backprop from the final timestep to zero, computing the derivatives at each step
compute the weight updates by summing the derivatives across time

Expensive – backprop for a 1000 item sequence is equivalent to a 1000-layer feed-forward network

Truncated BPTT – backprop through just a few time steps (e.g. 20)
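For a simple tanh recurrent network with a linear output and squared error, the cache-and-sum recipe above looks like the following NumPy sketch (illustrative, written for clarity rather than efficiency); truncating BPTT just shortens the backward loop.

import numpy as np

def rnn_bptt(xs, ts, W_in, W_rec, W_out):
    # xs: list of input vectors, ts: list of target vectors (both of length T)
    T, H = len(xs), W_rec.shape[0]
    hs, ys = [np.zeros(H)], []
    for t in range(T):                                 # forward: cache unit outputs
        hs.append(np.tanh(W_in @ xs[t] + W_rec @ hs[-1]))
        ys.append(W_out @ hs[-1])
    dW_in, dW_rec, dW_out = (np.zeros_like(W) for W in (W_in, W_rec, W_out))
    dh_next = np.zeros(H)
    for t in reversed(range(T)):                       # backprop from final timestep to zero
        dy = ys[t] - ts[t]                             # squared-error output gradient
        dW_out += np.outer(dy, hs[t + 1])
        dh = W_out.T @ dy + dh_next                    # error from the output and from the future
        da = (1.0 - hs[t + 1] ** 2) * dh               # back through tanh
        dW_in += np.outer(da, xs[t])                   # weight updates: sum derivatives
        dW_rec += np.outer(da, hs[t])                  #   across time (shared weights)
        dh_next = W_rec.T @ da
    return dW_in, dW_rec, dW_out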

ASR Lecture 13 Recurrent Network Acoustic Models 7

Vanishing and exploding gradients

BPTT involves taking the product of many gradients (as in a very deep network) – this can lead to vanishing (component gradients less than 1) or exploding (greater than 1) gradients

This can prevent effective training

Modified optimisation algorithms

RMSProp (normalise the gradient for each weight by an average of its magnitude – an effective learning rate for each weight; see the sketch below)
Hessian-free – an approximation to second-order approaches which use curvature information

Modified hidden unit transfer functions
Long short term memory (LSTM)

Linear self-recurrence for each hidden unit (long-term memory)
Gates – dynamic weights which are a function of the inputs

ReLUs
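A minimal sketch of the RMSProp update mentioned above: each weight's gradient is divided by a running average of its magnitude, so every weight effectively gets its own learning rate. The decay and epsilon values here are typical defaults, not taken from the slides.

import numpy as np

def rmsprop_update(w, grad, ms, lr=1e-3, decay=0.9, eps=1e-8):
    # ms is the running mean of the squared gradient, one entry per weight
    ms = decay * ms + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(ms) + eps)
    return w, ms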

ASR Lecture 13 Recurrent Network Acoustic Models 8

Recurrent networks in speech recognition

1990s – Hybrid RNN/HMM speech recognition (Robinson et al)

2009 onwards – RNN language models (Mikolov – new state of the art; next week's lecture)

2013 onwards – RNN/LSTM models at state of the art for acoustic modelling (Graves, Sak et al)

2015 onwards – RNN sequence modelling also replaces the HMM – "HMM-free" ASR (next week also)

ASR Lecture 13 Recurrent Network Acoustic Models 9


Speech recognition with recurrent networks (1990s)

[Figure: spectrogram (frequency in Hz vs time in ms) of speech acoustics feeding a recurrent neural network, which outputs phoneme probabilities]

Robinson et al (1996)

ASR Lecture 13 Recurrent Network Acoustic Models 10

1990s RNN

Features

MEL filter-bank outputs + voicing parameters (23 features/frame)
PLP cepstral coefficients (13 features)
Coefficients normalized to zero mean and unit variance

256 hidden (state) units

79 context-independent phone classes (outputs)

Output training target delayed by 5 frames

Use RNN scaled likelihoods in hybrid RNN/HMM

About 100k trainable parameters

Trained using stochastic BPTT (using a method similar to Rprop/RMSprop) on WSJ0 (3M training examples)

In 1994 training took five days on a specially designed parallel computer (the RAP)

ASR Lecture 13 Recurrent Network Acoustic Models 11

Combined RNN system

[Figure: combined system – speech waveform → preprocessor (MEL + PLP features) → four recurrent nets (state x(t), input u(t), time-delayed outputs y(t-4)) → Markov model over phones (e.g. sh, ow, iy, m)]

Individual systems: WER of 14–15% on WSJ "spoke 6 data"
Interpolated in log domain: WER = 11%
(Best context dependent GMM/HMM system in 1995: WER = 8%)

ASR Lecture 13 Recurrent Network Acoustic Models 12

LSTM v1

[Figure: a basic sigmoid unit – inputs I(t) and H(t-1), output H(t)]

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v1

[Figure: the unit with linear self-recurrence (weight 1) added – inputs I(t) and H(t-1), output H(t)]

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v1

[Figure: LSTM v1 – the linearly self-recurrent cell wrapped with an input gate and an output gate; inputs I(t) and H(t-1), output H(t)]

S. Hochreiter and J. Schmidhuber (1997), "Long Short-Term Memory", Neural Computation, 9:1735–1780

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v2

[Figure: LSTM v2 – adds a forget gate on the cell's self-recurrent connection; inputs I(t) and H(t-1), output H(t)]

F.A. Gers et al (2000), "Learning to Forget: Continual Prediction with LSTM", Neural Computation, 12:2451–2471

ASR Lecture 13 Recurrent Network Acoustic Models 14

LSTM v3

[Figure: LSTM v3 – adds peephole connections from the cell state C(t-1), C(t) to the gates; inputs I(t) and H(t-1), output H(t)]

ASR Lecture 13 Recurrent Network Acoustic Models 15

LSTM equations

2. LSTM ARCHITECTURES

In the standard architecture of LSTM networks, there are an input layer, a recurrent LSTM layer and an output layer. The input layer is connected to the LSTM layer. The recurrent connections in the LSTM layer are directly from the cell output units to the cell input units, input gates, output gates and forget gates. The cell output units are connected to the output layer of the network. The total number of parameters W in a standard LSTM network with one cell in each memory block, ignoring the biases, can be calculated as follows:

$$W = n_c \times n_c \times 4 + n_i \times n_c \times 4 + n_c \times n_o + n_c \times 3$$

where $n_c$ is the number of memory cells (and number of memory blocks in this case), $n_i$ is the number of input units, and $n_o$ is the number of output units. The computational complexity of learning LSTM models per weight and time step with the stochastic gradient descent (SGD) optimization technique is O(1). Therefore the learning computational complexity per time step is O(W). The learning time for a network with a relatively small number of inputs is dominated by the $n_c \times (4 \times n_c + n_o)$ factor. For tasks requiring a large number of output units and a large number of memory cells to store temporal contextual information, learning LSTM models becomes computationally expensive.

As an alternative to the standard architecture, we propose two novel architectures to address the computational complexity of learning LSTM models. The two architectures are shown in the same Figure 1. In one of them, we connect the cell output units to a recurrent projection layer which connects to the cell input units and gates for recurrency, in addition to network output units for the prediction of the outputs. Hence the number of parameters in this model is $n_c \times n_r \times 4 + n_i \times n_c \times 4 + n_r \times n_o + n_c \times n_r + n_c \times 3$, where $n_r$ is the number of units in the recurrent projection layer. In the other one, in addition to the recurrent projection layer, we add another non-recurrent projection layer which is directly connected to the output layer. This model has $n_c \times n_r \times 4 + n_i \times n_c \times 4 + (n_r + n_p) \times n_o + n_c \times (n_r + n_p) + n_c \times 3$ parameters, where $n_p$ is the number of units in the non-recurrent projection layer, and it allows us to increase the number of units in the projection layers without increasing the number of parameters in the recurrent connections ($n_c \times n_r \times 4$). Note that having two projection layers with regard to output units is effectively equivalent to having a single projection layer with $n_r + n_p$ units.

An LSTM network computes a mapping from an input sequence $x = (x_1, \ldots, x_T)$ to an output sequence $y = (y_1, \ldots, y_T)$ by calculating the network unit activations using the following equations iteratively from $t = 1$ to $T$:

\begin{align}
i_t &= \sigma(W_{ix} x_t + W_{im} m_{t-1} + W_{ic} c_{t-1} + b_i) \tag{1} \\
f_t &= \sigma(W_{fx} x_t + W_{mf} m_{t-1} + W_{cf} c_{t-1} + b_f) \tag{2} \\
c_t &= f_t \odot c_{t-1} + i_t \odot g(W_{cx} x_t + W_{cm} m_{t-1} + b_c) \tag{3} \\
o_t &= \sigma(W_{ox} x_t + W_{om} m_{t-1} + W_{oc} c_t + b_o) \tag{4} \\
m_t &= o_t \odot h(c_t) \tag{5} \\
y_t &= W_{ym} m_t + b_y \tag{6}
\end{align}

where the W terms denote weight matrices (e.g. $W_{ix}$ is the matrix of weights from the input gate to the input), the b terms denote bias vectors ($b_i$ is the input gate bias vector), $\sigma$ is the logistic sigmoid function, and $i$, $f$, $o$ and $c$ are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the cell output activation vector $m$; $\odot$ is the element-wise product

[Fig. 1: LSTM based RNN architectures with a recurrent projection layer and an optional non-recurrent projection layer. A single memory block is shown for clarity.]

of the vectors, and $g$ and $h$ are the cell input and cell output activation functions, generally tanh.

With the proposed LSTM architecture with both recurrent and non-recurrent projection layers, the equations are as follows:

\begin{align}
i_t &= \sigma(W_{ix} x_t + W_{ir} r_{t-1} + W_{ic} c_{t-1} + b_i) \tag{7} \\
f_t &= \sigma(W_{fx} x_t + W_{rf} r_{t-1} + W_{cf} c_{t-1} + b_f) \tag{8} \\
c_t &= f_t \odot c_{t-1} + i_t \odot g(W_{cx} x_t + W_{cr} r_{t-1} + b_c) \tag{9} \\
o_t &= \sigma(W_{ox} x_t + W_{or} r_{t-1} + W_{oc} c_t + b_o) \tag{10} \\
m_t &= o_t \odot h(c_t) \tag{11} \\
r_t &= W_{rm} m_t \tag{12} \\
p_t &= W_{pm} m_t \tag{13} \\
y_t &= W_{yr} r_t + W_{yp} p_t + b_y \tag{14}
\end{align}

where the $r$ and $p$ terms denote the recurrent and optional non-recurrent unit activations.
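For the projected variants, the per-step computation changes only in what is fed back and how the output is formed. Below is a NumPy sketch of equations (7)-(14), with the same caveats as the earlier block: illustrative parameter names, peephole weights stored as vectors, and not the authors' code.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstmp_step(x_t, r_prev, c_prev, p):
    i_t = sigmoid(p['W_ix'] @ x_t + p['W_ir'] @ r_prev + p['w_ic'] * c_prev + p['b_i'])  # (7)
    f_t = sigmoid(p['W_fx'] @ x_t + p['W_rf'] @ r_prev + p['w_cf'] * c_prev + p['b_f'])  # (8)
    c_t = f_t * c_prev + i_t * np.tanh(p['W_cx'] @ x_t + p['W_cr'] @ r_prev + p['b_c'])  # (9)
    o_t = sigmoid(p['W_ox'] @ x_t + p['W_or'] @ r_prev + p['w_oc'] * c_t + p['b_o'])     # (10)
    m_t = o_t * np.tanh(c_t)                                                             # (11)
    r_t = p['W_rm'] @ m_t              # (12) recurrent projection, fed back at t+1
    p_t = p['W_pm'] @ m_t              # (13) optional non-recurrent projection
    y_t = p['W_yr'] @ r_t + p['W_yp'] @ p_t + p['b_y']                                   # (14)
    return y_t, r_t, c_t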

2.1 Implementation

We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU. The decision was based on CPU's relatively simpler implementation complexity and ease of debugging. CPU implementation also allows easier distributed implementation on a large cluster of machines if the learning time of large networks becomes a major bottleneck on a single machine [14]. For matrix operations, we use the Eigen matrix library [15]. This templated C++ library provides efficient implementations for matrix operations on CPU using vectorized instructions (SIMD – single instruction, multiple data). We implemented activation functions and gradient calculations on matrices using SIMD instructions to benefit from parallelization.

We use the asynchronous stochastic gradient descent (ASGD) optimization technique. The update of the parameters with the gradients is done asynchronously from multiple threads on a multi-core machine. Each thread operates on a batch of sequences in parallel, for computational efficiency – for instance, we can do matrix-matrix multiplications rather than vector-matrix multiplications – and for more stochasticity, since model parameters can be updated from multiple input sequences at the same time. In addition to batching of sequences in a single thread, training with multiple threads effectively
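The batching argument is simply that one weight matrix applied to several parallel sequences becomes a single matrix-matrix product; a toy NumPy illustration of the shapes (the sizes are made up):

import numpy as np

W = np.random.randn(512, 256)        # weights feeding one gate
x_one = np.random.randn(256)         # one frame from a single sequence
x_four = np.random.randn(256, 4)     # the same time step from 4 parallel sequences
print((W @ x_one).shape)             # (512,)   vector-matrix product
print((W @ x_four).shape)            # (512, 4) one matrix-matrix product covers all 4 streams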

ASR Lecture 13 Recurrent Network Acoustic Models 16

Google LSTM experiments (2014–date)

Google voice search data – 1900h training set

Input features 40-d log mel filterbank energies

No input context: a single 40-d frame presented each timestep

LSTM networks

400–6000 LSTM cells per layer
1–7 layers
13M–37M trainable parameters
14727 context-dependent states

Delay output targets by 5 frames (see the sketch below)

Use RNN to estimate scaled likelihoods

Train using BPTT
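The 5-frame output delay in the list above just shifts the pairing of frames and labels, so the network has seen 5 future frames before it has to emit a label. A tiny illustrative helper (not from the papers); here the first 5 frames of each utterance simply get no target:

import numpy as np

def delay_targets(features, labels, delay=5):
    # pair input frame t with the label of frame t - delay
    return features[delay:], labels[:len(labels) - delay]

# x: (T, 40) log mel filterbank frames, y: (T,) CD-state targets (made-up shapes)
x, y = delay_targets(np.zeros((100, 40)), np.zeros(100, dtype=int))
print(x.shape, y.shape)   # (95, 40) (95,)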

ASR Lecture 13 Recurrent Network Acoustic Models 17

LSTM acoustic modelling architectures

[Figure 1: LSTMP RNN architecture. A single memory block is shown for clarity.]

memory cell. The output gate controls the output flow of cell activations into the rest of the network. Later, the forget gate was added to the memory block [18]. This addressed a weakness of LSTM models, preventing them from processing continuous input streams that are not segmented into subsequences. The forget gate scales the internal state of the cell before adding it as input to the cell through the self-recurrent connection of the cell, therefore adaptively forgetting or resetting the cell's memory. In addition, the modern LSTM architecture contains peephole connections from its internal cells to the gates in the same cell to learn precise timing of the outputs [19].


(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 18



Page 8: Recurrent Network Acoustic Models - School of Informatics · store temporal contextual information, learning LSTM models be-come computationally expensive. As an alternative to the

Vanishing and exploding gradients

BPTT involves taking the product of many gradients (as in avery deep network) ndash this can lead to vanishing (componentgradients less than 1) or exploding (greater than 1) gradients

This can prevent effective training

Modified optimisation algorithms

RMSProp (normalise the gradient for each weight by averageof it magnitude learning rate for each weight)Hessian-free ndash an approximation to second-order approacheswhich use curvature information

Modified hidden unit transfer functionsLong short term memory (LSTM)

Linear self-recurrence for each hidden unit (long-term memory)Gates - dynamic weights which are a function of the inputs

ReLUs

ASR Lecture 13 Recurrent Network Acoustic Models 8

Recurrent networks in speech recognition

1990s ndash Hybrid RNNHMM speech recognition (Robinson etal)

2009 onwards ndash RNN language models (Mikolov ndash new stateof the art next weekrsquos lecture)

2013 onwards ndash RNNLSTM models at state of the art foracoustic modelling (Graves Sak et al)

2015 onwards ndash RNN sequence modelling also replaces theHMM ndash ldquoHMM-freerdquo ASR (next week also)

ASR Lecture 13 Recurrent Network Acoustic Models 9

Recurrent networks in speech recognition

1990s ndash Hybrid RNNHMM speech recognition(Robinson et al)

2009 onwards ndash RNN language models (Mikolov ndash new stateof the art next weekrsquos lecture)

2013 onwards ndash RNNLSTM models at state of the artfor acoustic modelling (Graves Sak et al)

2015 onwards ndash RNN sequence modelling also replaces theHMM ndash ldquoHMM-freerdquo ASR (next week also)

ASR Lecture 13 Recurrent Network Acoustic Models 9

Speech recognition with recurrent networks (1990s)

time (ms)

freq (

Hz)

0 200 400 600 800 1000 1200 14000

2000

4000

6000

8000

RecurrentNeuralNetwork

SpeechAcoustics

Phoneme Probabilities

Robinson et al (1996)

ASR Lecture 13 Recurrent Network Acoustic Models 10

1990s RNN

Features

MEL+ Filter-bank outputs + voicing parameters (23featuresframe)PLP cepstral coefficients (13 features)Coefficients normalized to zero mean and unit variance

256 hidden (state) units

79 context-independent phone classes (outputs)

Output training target delayed by 5 frames

Use RNN scaled likelihoods in hybrid RNNHMM

About 100k trainable parameters

Trained using stochastic BPTT (using method similar toRpropRMSprop) on WSJ0 (3M training examples)

In 1994 training took five days on a specially designed parallelcomputer (the RAP)

ASR Lecture 13 Recurrent Network Acoustic Models 11

Combined RNN system

u(t)

x(t+1)x(t)

Timedelay

y(t-4)

u(t)

x(t+1)x(t)

Timedelay

y(t-4)

u(t)

x(t+1)x(t)

Timedelay

y(t-4)

u(t)

x(t+1)x(t)

Timedelay

y(t-4)

Mel +

PLP

Mel +

PLP

sh ow

iym

Speech waveform Preprocessor Recurrent net Markov mode

Individual systems WER of 14ndash15 on WSJ ldquospoke 6 datardquoInterpolated in log domain WER=11(Best context dependent GMMHMM system in 1995 WER=8)

ASR Lecture 13 Recurrent Network Acoustic Models 12

LSTM v1

I(t) H(t-1)

H(t)

BasicSigmoid

Unit

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v1

I(t) H(t-1)

H(t)

1

LinearRecurrence

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v1

I(t) H(t-1)

H(t)

1

OutputGate

InputGate

S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v2

I(t) H(t-1)

H(t)

ForgetGate

FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471

ASR Lecture 13 Recurrent Network Acoustic Models 14

LSTM v3

I(t) H(t-1)

H(t)

C(t)

C(t-1)

PeepholeConnections

ASR Lecture 13 Recurrent Network Acoustic Models 15

LSTM equations

2 LSTM ARCHITECTURES

In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows

W = nc nc 4 + ni nc 4 + nc no + nc 3

where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive

As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units

An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T

it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)

ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)

mt = ot h(ct) (5)yt = Wymmt + by (6)

where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product

inpu

t

g ct1 h

it

ft

ct

ot recu

rren

tpr

ojec

tion

outp

ut

xt

mt

pt

rt

rt1

yt

memory blocks

Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity

of the vectors and g and h are the cell input and cell output activationfunctions generally tanh

With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows

it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)

ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)

mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)

yt = Wyrrt + Wyppt + by (14)(15)

where the r and p denote the recurrent and optional non-recurrentunit activations

21 Implementation

We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization

We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively

ASR Lecture 13 Recurrent Network Acoustic Models 16

Google LSTM experiments (2014ndashdate)

Google voice search data ndash 1900h training set

Input features 40-d log mel filterbank energies

No input context single frame 40-d frame presented eachtimestep

LSTM networks

400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states

Delay output targets by 5 frames

Use RNN to estimated scaled likelihoods

Train using BPTT

ASR Lecture 13 Recurrent Network Acoustic Models 17

LSTM acoustic modelling architectures

inpu

t

g cell h

it

ft

ct

ot

recu

rren

t

outp

ut

xt

mt

rt

rt1

yt

LSTM memory blocks

Figure 1 LSTMP RNN architecture A single memory block isshown for clarity

memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]

An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T

it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)

ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)

mt = ot h(ct) (5)yt = (Wymmt + by) (6)

where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc

are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper

22 Deep LSTM

As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time

input

LSTM

output

(a) LSTM

input

LSTM

LSTM

output

(b) DLSTM

input

LSTM

recurrent

output

(c) LSTMP

input

LSTM

recurrent

LSTM

recurrent

output

(d) DLSTMP

Figure 2 LSTM RNN architectures

instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step

23 LSTMP - LSTM with Recurrent Projection Layer

The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive

As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 18

LSTM with projection layer (LSTMP)

inpu

t

g cell h

it

ft

ct

ot

recu

rren

t

outp

ut

xt

mt

rt

rt1

yt

LSTM memory blocks

Figure 1 LSTMP RNN architecture A single memory block isshown for clarity

memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]

An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T

it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)

ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)

mt = ot h(ct) (5)yt = (Wymmt + by) (6)

where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc

are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper

22 Deep LSTM

As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time

input

LSTM

output

(a) LSTM

input

LSTM

LSTM

output

(b) DLSTM

input

LSTM

recurrent

output

(c) LSTMP

input

LSTM

recurrent

LSTM

recurrent

output

(d) DLSTMP

Figure 2 LSTM RNN architectures

instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step

23 LSTMP - LSTM with Recurrent Projection Layer

The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive

As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 19

LSTM ASR Results (Google Voice Search task)

Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters

C P Depth N Dev Train WER() () ()

840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124

6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107

2048 512 1L 13M 673 718 113

where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc

By setting n_r < n_c we can increase the model memory (n_c) and still be able to control the number of parameters in the recurrent connections and output layer.

With the proposed LSTMP architecture, the equations for the activations of network units change slightly: the m_{t-1} activation vector is replaced with r_{t-1} and the following is added:

r_t = W_rm m_t    (7)
y_t = φ(W_yr r_t + b_y)    (8)

where the r_t denote the recurrent unit activations.

2.4 Deep LSTMP

Similar to deep LSTM, we propose deep LSTMP where multiple LSTM layers, each with a separate recurrent projection layer, are stacked. LSTMP allows the memory of the model to be increased independently from the output layer and recurrent connections. However, we noticed that increasing the memory size makes the model more prone to overfitting by memorizing the input sequence data. We know that DNNs generalize better to unseen examples with increasing depth. The depth makes the models harder to overfit to the training data, since the inputs to the network need to go through many non-linear functions. With this motivation we have experimented with deep LSTMP architectures, where the aim is increasing the memory size and generalization power of the model.

3. Distributed Training: Scaling up to Large Models with Parallelization

We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU. The decision was based on CPU's relatively simpler implementation complexity, ease of debugging and the ability to use clusters made from commodity hardware. For matrix operations we used the Eigen matrix library [21]. This templated C++ library provides efficient implementations for matrix operations on CPU using vectorized instructions. We implemented activation functions and gradient calculations on matrices using SIMD instructions to benefit from parallelization.

We use the truncated backpropagation through time (BPTT) learning algorithm [22] to compute parameter gradients on short subsequences of the training utterances. Activations are forward propagated for a fixed step time T_bptt (e.g. 20). Cross entropy gradients are computed for this subsequence and backpropagated to its start. For computational efficiency each thread operates on subsequences of four utterances at a time, so matrix multiplies can operate in parallel on four frames at a time. We use asynchronous stochastic gradient descent (ASGD) [23] to optimize the network parameters, updating the parameters asynchronously from multiple threads on a multi-core machine. This effectively increases the batch size and reduces the correlation of the frames in a given batch. After a thread has updated the parameters, it continues with the next subsequence in each utterance, preserving the LSTM state, or starts new utterances with reset state when one finishes. Note that the last subsequence of each utterance can be shorter than T_bptt but is padded to the full length, though no gradient is generated for these padding frames.
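A minimal sketch of the subsequence handling described above, under assumptions of my own (the function name and the decision to stop once the utterance list is exhausted are not from the paper): each lane of the batch walks one utterance in T_bptt-frame chunks, flags where the LSTM state should be reset, and masks the zero-padded tail of the last chunk so it generates no gradient.

```python
import numpy as np

def tbptt_chunks(utterances, t_bptt=20, batch=4):
    """Yield (frames, mask, reset) for truncated-BPTT training.
    frames: (batch, t_bptt, dim) subsequences, zero-padded at utterance ends;
    mask:   1.0 for real frames, 0.0 for padding (no gradient on padding);
    reset:  per-lane flag, True when a new utterance starts (clear LSTM state)."""
    queue = list(utterances)
    active = [queue.pop(0) for _ in range(batch)]
    offsets = [0] * batch
    reset = [True] * batch
    dim = active[0].shape[1]
    while True:
        frames = np.zeros((batch, t_bptt, dim), dtype=np.float32)
        mask = np.zeros((batch, t_bptt), dtype=np.float32)
        for b in range(batch):
            chunk = active[b][offsets[b]:offsets[b] + t_bptt]
            frames[b, :len(chunk)] = chunk
            mask[b, :len(chunk)] = 1.0
            offsets[b] += t_bptt
        yield frames, mask, list(reset)
        reset = [False] * batch
        for b in range(batch):
            if offsets[b] >= len(active[b]):       # this lane's utterance ended
                if not queue:                      # simplification: stop here
                    return
                active[b], offsets[b], reset[b] = queue.pop(0), 0, True

# usage: four toy "utterances" of 40-d features, two lanes per thread-batch
utts = [np.random.randn(n, 40).astype(np.float32) for n in (45, 30, 62, 50)]
for frames, mask, reset in tbptt_chunks(utts, t_bptt=20, batch=2):
    pass   # forward t_bptt steps, backprop to the chunk start, update, keep state
```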

This highly parallel single machine ASGD framework described in [3] proved slow for training models of the scale we have used for large scale ASR with DNNs (many millions of parameters). To scale further, we replicate the single-machine workers on many (e.g. 500) separate machines, each with three synchronized computation threads. Each worker communicates with a shared distributed parameter server [23] which stores the LSTM parameters. When a worker has computed the parameter gradient on a minibatch (of 3 × 4 × T_bptt frames), the gradient vector is partitioned and sent to the parameter server shards, which each add the gradients to their parameters and respond with the new parameters. The parameter server shards aggregate parameter updates completely asynchronously. For instance, gradient updates from workers may arrive in different orders at different shards of the parameter server. Despite the asynchrony, we observe stable convergence, though the learning rate must be reduced, as would be expected because of the increase in the effective batch size from the greater parallelism.
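The update cycle each parameter-server shard applies can be pictured with the toy, single-process sketch below (names and sizes are made up; the real system runs the shards as separate asynchronous services):

```python
import numpy as np

params = np.random.randn(1000).astype(np.float32)   # "global" parameter vector
n_shards = 4
shards = np.array_split(params, n_shards)           # each shard owns one slice

def apply_worker_gradient(worker_grad, lr=0.01):
    """Partition a worker's gradient, let each shard apply its slice to its
    own parameters, and return the reassembled fresh parameters to the worker."""
    new_slices = []
    for shard, g in zip(shards, np.array_split(worker_grad, n_shards)):
        shard -= lr * g                  # shard folds the gradient into its slice
        new_slices.append(shard.copy())  # and replies with the updated values
    return np.concatenate(new_slices)

local_copy = apply_worker_gradient(np.random.randn(1000).astype(np.float32))
```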

4. Experiments

We evaluate and compare the performance of LSTM RNN architectures on a large vocabulary speech recognition task: the Google Voice Search task. We use a hybrid approach [24] for acoustic modeling with LSTM RNNs, wherein the neural networks estimate hidden Markov model (HMM) state posteriors. We scale the state posteriors by the state priors, estimated as the relative state frequency from the training data, to obtain the acoustic frame likelihoods. We deweight the silence state counts by a factor of 2.7 when estimating the state frequencies.
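In code, the posterior-to-likelihood scaling amounts to a subtraction of log priors, with the silence counts deweighted before the priors are normalised. A sketch under assumed inputs (the state-count array, silence state ids and function name are hypothetical):

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, state_counts, silence_ids, deweight=2.7):
    """p(x|s) is proportional to p(s|x) / p(s): subtract log state priors, where the
    priors are relative state frequencies with silence counts deweighted by 2.7."""
    counts = state_counts.astype(np.float64)
    counts[silence_ids] /= deweight                # deweight silence state counts
    log_priors = np.log(counts / counts.sum())     # relative state frequencies
    return log_posteriors - log_priors             # divide by priors in log domain

# usage: one frame of posteriors over 14247 CD states, a few silence states
counts = np.random.randint(1, 1000, size=14247)
log_post = np.log(np.random.dirichlet(np.ones(14247)))
loglik = scaled_log_likelihoods(log_post, counts, silence_ids=[0, 1, 2])
```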

4.1 Systems & Evaluation

All the networks are trained on a 3 million utterance (about 1900 hours) dataset consisting of anonymized and hand-transcribed utterances. The dataset is represented with 25 ms frames of 40-dimensional log-filterbank energy features computed every 10 ms. The utterances are aligned with an 85 million parameter DNN with 14247 CD states. The weights in all the networks are initialized to the range (-0.02, 0.02) with a uniform distribution. We try to set the learning rate specific to a network architecture and its configuration to the largest value

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 20

Reading

S. Hochreiter and J. Schmidhuber (1997), "Long Short-Term Memory", Neural Computation 9:1735-1780.
http://www.mitpressjournals.org/doi/abs/10.1162/neco.1997.9.8.1735

F.A. Gers et al (2000), "Learning to Forget: Continual Prediction with LSTM", Neural Computation 12:2451-2471.
http://www.mitpressjournals.org/doi/abs/10.1162/089976600300015015

T. Robinson et al (1996), "The use of recurrent networks in continuous speech recognition", in Automatic Speech and Speaker Recognition: Advanced Topics, Lee et al (eds), Kluwer, 233-258.
http://www.cstr.ed.ac.uk/downloads/publications/1996/rnn4csr96.pdf

H. Sak et al (2014), "LSTM recurrent neural network architectures for large scale acoustic modeling", Interspeech-2014.
http://research.google.com/pubs/archive/43905.pdf

ASR Lecture 13 Recurrent Network Acoustic Models 21


Recurrent networks in speech recognition

1990s - Hybrid RNN/HMM speech recognition (Robinson et al)

2009 onwards - RNN language models (Mikolov - new state of the art; next week's lecture)

2013 onwards - RNN/LSTM models at state of the art for acoustic modelling (Graves, Sak, et al)

2015 onwards - RNN sequence modelling also replaces the HMM: "HMM-free" ASR (next week also)

ASR Lecture 13 Recurrent Network Acoustic Models 9


Speech recognition with recurrent networks (1990s)

[Figure: spectrogram (freq/Hz vs time/ms) of speech acoustics fed to a recurrent neural network producing phoneme probabilities]

Robinson et al (1996)

ASR Lecture 13 Recurrent Network Acoustic Models 10

1990s RNN

Features

MEL filter-bank outputs + voicing parameters (23 features/frame)
PLP cepstral coefficients (13 features)
Coefficients normalized to zero mean and unit variance

256 hidden (state) units

79 context-independent phone classes (outputs)

Output training target delayed by 5 frames

Use RNN scaled likelihoods in hybrid RNN/HMM

About 100k trainable parameters

Trained using stochastic BPTT (using a method similar to Rprop/RMSprop) on WSJ0 (3M training examples)

In 1994, training took five days on a specially designed parallel computer (the RAP)

ASR Lecture 13 Recurrent Network Acoustic Models 11

Combined RNN system

[Figure: combined RNN system: speech waveform → preprocessor (Mel + PLP features) → four recurrent nets (input u(t), state x(t) → x(t+1) through a time delay, delayed output y(t-4)) → Markov model]

Individual systems: WER of 14-15% on WSJ "spoke 6 data"; interpolated in log domain: WER = 11% (best context-dependent GMM/HMM system in 1995: WER = 8%)
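Interpolating "in the log domain" just means averaging the systems' log phone probabilities frame by frame before decoding; a minimal sketch (the equal weights are an assumption):

```python
import numpy as np

def log_interpolate(log_p1, log_p2, w=0.5):
    """Weighted average of two systems' log probabilities, renormalised so
    each frame is again a distribution over the phone classes."""
    log_p = w * log_p1 + (1.0 - w) * log_p2
    return log_p - np.logaddexp.reduce(log_p, axis=-1, keepdims=True)

# usage: two systems' outputs over 79 phone classes for 100 frames
a = np.log(np.random.dirichlet(np.ones(79), size=100))
b = np.log(np.random.dirichlet(np.ones(79), size=100))
combined = log_interpolate(a, b)
```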

ASR Lecture 13 Recurrent Network Acoustic Models 12

LSTM v1

[Figure: basic sigmoid unit: inputs I(t) and H(t-1), output H(t)]

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v1

[Figure: hidden unit with a linear recurrence (weight 1) added: inputs I(t) and H(t-1), output H(t)]

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v1

[Figure: linear recurrence wrapped with input and output gates: inputs I(t) and H(t-1), output H(t)]

S. Hochreiter and J. Schmidhuber (1997), "Long Short-Term Memory", Neural Computation 9:1735-1780

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v2

[Figure: LSTM cell with a forget gate added: inputs I(t) and H(t-1), output H(t)]

F.A. Gers et al (2000), "Learning to Forget: Continual Prediction with LSTM", Neural Computation 12:2451-2471

ASR Lecture 13 Recurrent Network Acoustic Models 14

LSTM v3

[Figure: LSTM cell with peephole connections from the cell state C(t-1), C(t) to the gates: inputs I(t) and H(t-1), output H(t)]

ASR Lecture 13 Recurrent Network Acoustic Models 15

LSTM equations

2. LSTM ARCHITECTURES

In the standard architecture of LSTM networks, there are an input layer, a recurrent LSTM layer and an output layer. The input layer is connected to the LSTM layer. The recurrent connections in the LSTM layer are directly from the cell output units to the cell input units, input gates, output gates and forget gates. The cell output units are connected to the output layer of the network. The total number of parameters W in a standard LSTM network with one cell in each memory block, ignoring the biases, can be calculated as follows:

W = n_c × n_c × 4 + n_i × n_c × 4 + n_c × n_o + n_c × 3

where n_c is the number of memory cells (and number of memory blocks in this case), n_i is the number of input units, and n_o is the number of output units. The computational complexity of learning LSTM models per weight and time step with the stochastic gradient descent (SGD) optimization technique is O(1). Therefore the learning computational complexity per time step is O(W). The learning time for a network with a relatively small number of inputs is dominated by the n_c × (n_c + n_o) factor. For the tasks requiring a large number of output units and a large number of memory cells to store temporal contextual information, learning LSTM models become computationally expensive.

As an alternative to the standard architecture, we propose two novel architectures to address the computational complexity of learning LSTM models. The two architectures are shown in the same Figure 1. In one of them, we connect the cell output units to a recurrent projection layer which connects to the cell input units and gates for recurrency, in addition to network output units for the prediction of the outputs. Hence the number of parameters in this model is n_c × n_r × 4 + n_i × n_c × 4 + n_r × n_o + n_c × n_r + n_c × 3, where n_r is the number of units in the recurrent projection layer. In the other one, in addition to the recurrent projection layer, we add another non-recurrent projection layer which is directly connected to the output layer. This model has n_c × n_r × 4 + n_i × n_c × 4 + (n_r + n_p) × n_o + n_c × (n_r + n_p) + n_c × 3 parameters, where n_p is the number of units in the non-recurrent projection layer, and it allows us to increase the number of units in the projection layers without increasing the number of parameters in the recurrent connections (n_c × n_r × 4). Note that having two projection layers with regard to output units is effectively equivalent to having a single projection layer with n_r + n_p units.

An LSTM network computes a mapping from an input sequence x = (x_1, ..., x_T) to an output sequence y = (y_1, ..., y_T) by calculating the network unit activations using the following equations iteratively from t = 1 to T:

i_t = σ(W_ix x_t + W_im m_{t-1} + W_ic c_{t-1} + b_i)    (1)
f_t = σ(W_fx x_t + W_mf m_{t-1} + W_cf c_{t-1} + b_f)    (2)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g(W_cx x_t + W_cm m_{t-1} + b_c)    (3)
o_t = σ(W_ox x_t + W_om m_{t-1} + W_oc c_t + b_o)    (4)
m_t = o_t ⊙ h(c_t)    (5)
y_t = W_ym m_t + b_y    (6)

where the W terms denote weight matrices (e.g. W_ix is the matrix of weights from the input gate to the input), the b terms denote bias vectors (b_i is the input gate bias vector), σ is the logistic sigmoid function, and i, f, o and c are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the cell output activation vector m. ⊙ is the element-wise product of the vectors, and g and h are the cell input and cell output activation functions, generally tanh.

[Fig. 1: LSTM based RNN architectures with a recurrent projection layer and an optional non-recurrent projection layer. A single memory block is shown for clarity.]

With the proposed LSTM architecture with both recurrent and non-recurrent projection layers, the equations are as follows:

i_t = σ(W_ix x_t + W_ir r_{t-1} + W_ic c_{t-1} + b_i)    (7)
f_t = σ(W_fx x_t + W_rf r_{t-1} + W_cf c_{t-1} + b_f)    (8)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g(W_cx x_t + W_cr r_{t-1} + b_c)    (9)
o_t = σ(W_ox x_t + W_or r_{t-1} + W_oc c_t + b_o)    (10)
m_t = o_t ⊙ h(c_t)    (11)
r_t = W_rm m_t    (12)
p_t = W_pm m_t    (13)
y_t = W_yr r_t + W_yp p_t + b_y    (14)

where the r_t and p_t denote the recurrent and optional non-recurrent unit activations.
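A numpy sketch of one time step of the projected architecture, following equations (7)-(14) as reconstructed above (the dictionary layout, weight shapes and sizes are my own assumptions, and the softmax on the output is omitted):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstmp_step(x, r_prev, c_prev, W, b):
    """One step of an LSTM layer with peephole connections, a recurrent
    projection r_t and an optional non-recurrent projection p_t."""
    i = sigmoid(W['ix'] @ x + W['ir'] @ r_prev + W['ic'] * c_prev + b['i'])  # (7)
    f = sigmoid(W['fx'] @ x + W['rf'] @ r_prev + W['cf'] * c_prev + b['f'])  # (8)
    c = f * c_prev + i * np.tanh(W['cx'] @ x + W['cr'] @ r_prev + b['c'])    # (9)
    o = sigmoid(W['ox'] @ x + W['or'] @ r_prev + W['oc'] * c + b['o'])       # (10)
    m = o * np.tanh(c)                                                       # (11)
    r = W['rm'] @ m                                                          # (12)
    p = W['pm'] @ m                                                          # (13)
    y = W['yr'] @ r + W['yp'] @ p + b['y']                                   # (14), pre-softmax
    return y, r, c

# usage with tiny illustrative sizes: ni inputs, nc cells, nr/np_ projection units, no outputs
ni, nc, nr, np_, no = 40, 8, 4, 4, 10
rng = np.random.default_rng(0)
full = lambda rows, cols: 0.1 * rng.standard_normal((rows, cols))
diag = lambda size: 0.1 * rng.standard_normal(size)    # peephole weights are diagonal
W = {'ix': full(nc, ni), 'ir': full(nc, nr), 'ic': diag(nc),
     'fx': full(nc, ni), 'rf': full(nc, nr), 'cf': diag(nc),
     'cx': full(nc, ni), 'cr': full(nc, nr),
     'ox': full(nc, ni), 'or': full(nc, nr), 'oc': diag(nc),
     'rm': full(nr, nc), 'pm': full(np_, nc),
     'yr': full(no, nr), 'yp': full(no, np_)}
b = {k: np.zeros(nc) for k in 'ifco'}
b['y'] = np.zeros(no)
y, r, c = lstmp_step(rng.standard_normal(ni), np.zeros(nr), np.zeros(nc), W, b)
```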

2.1 Implementation

We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU. The decision was based on CPU's relatively simpler implementation complexity and ease of debugging. CPU implementation also allows easier distributed implementation on a large cluster of machines if the learning time of large networks becomes a major bottleneck on a single machine [14]. For matrix operations, we use the Eigen matrix library [15]. This templated C++ library provides efficient implementations for matrix operations on CPU using vectorized instructions (SIMD - single instruction, multiple data). We implemented activation functions and gradient calculations on matrices using SIMD instructions to benefit from parallelization.

We use the asynchronous stochastic gradient descent (ASGD) optimization technique. The update of the parameters with the gradients is done asynchronously from multiple threads on a multi-core machine. Each thread operates on a batch of sequences in parallel for computational efficiency (for instance, we can do matrix-matrix multiplications rather than vector-matrix multiplications) and for more stochasticity, since model parameters can be updated from multiple input sequences at the same time. In addition to batching of sequences in a single thread, training with multiple threads effectively

ASR Lecture 13 Recurrent Network Acoustic Models 16

Google LSTM experiments (2014-date)

Google voice search data - 1900h training set

Input features: 40-d log mel filterbank energies

No input context: a single 40-d frame presented each timestep

LSTM networks

400-6000 LSTM cells per layer
1-7 layers
13M-37M trainable parameters
14,247 context-dependent states

Delay output targets by 5 frames

Use RNN to estimate scaled likelihoods

Train using BPTT

ASR Lecture 13 Recurrent Network Acoustic Models 17

LSTM acoustic modelling architectures

[Figure 1: LSTMP RNN architecture. A single memory block is shown for clarity.]

memory cell. The output gate controls the output flow of cell activations into the rest of the network. Later, the forget gate was added to the memory block [18]. This addressed a weakness of LSTM models preventing them from processing continuous input streams that are not segmented into subsequences. The forget gate scales the internal state of the cell before adding it as input to the cell through the self-recurrent connection of the cell, therefore adaptively forgetting or resetting the cell's memory. In addition, the modern LSTM architecture contains peephole connections from its internal cells to the gates in the same cell to learn precise timing of the outputs [19].

An LSTM network computes a mapping from an input sequence x = (x_1, ..., x_T) to an output sequence y = (y_1, ..., y_T) by calculating the network unit activations using the following equations iteratively from t = 1 to T:

i_t = σ(W_ix x_t + W_im m_{t-1} + W_ic c_{t-1} + b_i)    (1)
f_t = σ(W_fx x_t + W_fm m_{t-1} + W_fc c_{t-1} + b_f)    (2)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g(W_cx x_t + W_cm m_{t-1} + b_c)    (3)
o_t = σ(W_ox x_t + W_om m_{t-1} + W_oc c_t + b_o)    (4)
m_t = o_t ⊙ h(c_t)    (5)
y_t = φ(W_ym m_t + b_y)    (6)

where the W terms denote weight matrices (e.g. W_ix is the matrix of weights from the input gate to the input), W_ic, W_fc, W_oc are diagonal weight matrices for peephole connections, the b terms denote bias vectors (b_i is the input gate bias vector), σ is the logistic sigmoid function, and i, f, o and c are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the cell output activation vector m. ⊙ is the element-wise product of the vectors, g and h are the cell input and cell output activation functions, generally and in this paper tanh, and φ is the network output activation function, softmax in this paper.

2.2 Deep LSTM

As with DNNs with deeper architectures, deep LSTM RNNs have been successfully used for speech recognition [11, 17, 2]. Deep LSTM RNNs are built by stacking multiple LSTM layers. Note that LSTM RNNs are already deep architectures in the sense that they can be considered as a feed-forward neural network unrolled in time where each layer shares the same model parameters. One can see that the inputs to the model go through multiple non-linear layers, as in DNNs; however, the features from a given time instant are only processed by a single nonlinear layer before contributing the output for that time instant.

[Figure 2: LSTM RNN architectures: (a) LSTM, (b) DLSTM, (c) LSTMP, (d) DLSTMP]


(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 18

LSTM with projection layer (LSTMP)

[Figure 1: LSTMP RNN architecture. A single memory block is shown for clarity.]

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 19


Page 10: Recurrent Network Acoustic Models - School of Informatics · store temporal contextual information, learning LSTM models be-come computationally expensive. As an alternative to the

Recurrent networks in speech recognition

1990s ndash Hybrid RNNHMM speech recognition(Robinson et al)

2009 onwards ndash RNN language models (Mikolov ndash new stateof the art next weekrsquos lecture)

2013 onwards ndash RNNLSTM models at state of the artfor acoustic modelling (Graves Sak et al)

2015 onwards ndash RNN sequence modelling also replaces theHMM ndash ldquoHMM-freerdquo ASR (next week also)

ASR Lecture 13 Recurrent Network Acoustic Models 9

Speech recognition with recurrent networks (1990s)

time (ms)

freq (

Hz)

0 200 400 600 800 1000 1200 14000

2000

4000

6000

8000

RecurrentNeuralNetwork

SpeechAcoustics

Phoneme Probabilities

Robinson et al (1996)

ASR Lecture 13 Recurrent Network Acoustic Models 10

1990s RNN

Features

MEL+ Filter-bank outputs + voicing parameters (23featuresframe)PLP cepstral coefficients (13 features)Coefficients normalized to zero mean and unit variance

256 hidden (state) units

79 context-independent phone classes (outputs)

Output training target delayed by 5 frames

Use RNN scaled likelihoods in hybrid RNNHMM

About 100k trainable parameters

Trained using stochastic BPTT (using method similar toRpropRMSprop) on WSJ0 (3M training examples)

In 1994 training took five days on a specially designed parallelcomputer (the RAP)

ASR Lecture 13 Recurrent Network Acoustic Models 11

Combined RNN system

u(t)

x(t+1)x(t)

Timedelay

y(t-4)

u(t)

x(t+1)x(t)

Timedelay

y(t-4)

u(t)

x(t+1)x(t)

Timedelay

y(t-4)

u(t)

x(t+1)x(t)

Timedelay

y(t-4)

Mel +

PLP

Mel +

PLP

sh ow

iym

Speech waveform Preprocessor Recurrent net Markov mode

Individual systems WER of 14ndash15 on WSJ ldquospoke 6 datardquoInterpolated in log domain WER=11(Best context dependent GMMHMM system in 1995 WER=8)

ASR Lecture 13 Recurrent Network Acoustic Models 12

LSTM v1

I(t) H(t-1)

H(t)

BasicSigmoid

Unit

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v1

I(t) H(t-1)

H(t)

1

LinearRecurrence

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v1

I(t) H(t-1)

H(t)

1

OutputGate

InputGate

S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v2

I(t) H(t-1)

H(t)

ForgetGate

FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471

ASR Lecture 13 Recurrent Network Acoustic Models 14

LSTM v3

I(t) H(t-1)

H(t)

C(t)

C(t-1)

PeepholeConnections

ASR Lecture 13 Recurrent Network Acoustic Models 15

LSTM equations

2 LSTM ARCHITECTURES

In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows

W = nc nc 4 + ni nc 4 + nc no + nc 3

where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive

As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units

An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T

it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)

ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)

mt = ot h(ct) (5)yt = Wymmt + by (6)

where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product

inpu

t

g ct1 h

it

ft

ct

ot recu

rren

tpr

ojec

tion

outp

ut

xt

mt

pt

rt

rt1

yt

memory blocks

Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity

of the vectors and g and h are the cell input and cell output activationfunctions generally tanh

With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows

it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)

ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)

mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)

yt = Wyrrt + Wyppt + by (14)(15)

where the r and p denote the recurrent and optional non-recurrentunit activations

21 Implementation

We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization

We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively

ASR Lecture 13 Recurrent Network Acoustic Models 16

Google LSTM experiments (2014ndashdate)

Google voice search data ndash 1900h training set

Input features 40-d log mel filterbank energies

No input context single frame 40-d frame presented eachtimestep

LSTM networks

400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states

Delay output targets by 5 frames

Use RNN to estimated scaled likelihoods

Train using BPTT

ASR Lecture 13 Recurrent Network Acoustic Models 17

LSTM acoustic modelling architectures

inpu

t

g cell h

it

ft

ct

ot

recu

rren

t

outp

ut

xt

mt

rt

rt1

yt

LSTM memory blocks

Figure 1 LSTMP RNN architecture A single memory block isshown for clarity

memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]

An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T

it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)

ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)

mt = ot h(ct) (5)yt = (Wymmt + by) (6)

where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc

are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper

22 Deep LSTM

As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time

input

LSTM

output

(a) LSTM

input

LSTM

LSTM

output

(b) DLSTM

input

LSTM

recurrent

output

(c) LSTMP

input

LSTM

recurrent

LSTM

recurrent

output

(d) DLSTMP

Figure 2 LSTM RNN architectures

instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step

23 LSTMP - LSTM with Recurrent Projection Layer

The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive

As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 18

LSTM with projection layer (LSTMP)

inpu

t

g cell h

it

ft

ct

ot

recu

rren

t

outp

ut

xt

mt

rt

rt1

yt

LSTM memory blocks

Figure 1 LSTMP RNN architecture A single memory block isshown for clarity

memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]

An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T

it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)

ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)

mt = ot h(ct) (5)yt = (Wymmt + by) (6)

where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc

are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper

22 Deep LSTM

As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time

input

LSTM

output

(a) LSTM

input

LSTM

LSTM

output

(b) DLSTM

input

LSTM

recurrent

output

(c) LSTMP

input

LSTM

recurrent

LSTM

recurrent

output

(d) DLSTMP

Figure 2 LSTM RNN architectures

instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step

23 LSTMP - LSTM with Recurrent Projection Layer

The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive

As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 19

LSTM ASR Results (Google Voice Search task)

Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters

C P Depth N Dev Train WER() () ()

840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124

6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107

2048 512 1L 13M 673 718 113

where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc

By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer

With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added

rt = Wrmmt (7)yt = (Wyrrt + by) (8)

where the r denote the recurrent unit activations


Speech recognition with recurrent networks (1990s)

[Figure: spectrogram of the speech acoustics (frequency in Hz against time in ms) feeding a recurrent neural network, which outputs phoneme probabilities]

Robinson et al (1996)

ASR Lecture 13 Recurrent Network Acoustic Models 10

1990s RNN

Features

MEL+ filter-bank outputs + voicing parameters (23 features/frame)
PLP cepstral coefficients (13 features)
Coefficients normalized to zero mean and unit variance

256 hidden (state) units

79 context-independent phone classes (outputs)

Output training target delayed by 5 frames

Use RNN scaled likelihoods in hybrid RNN/HMM

About 100k trainable parameters

Trained using stochastic BPTT (using a method similar to Rprop/RMSprop) on WSJ0 (3M training examples)

In 1994, training took five days on a specially designed parallel computer (the RAP)

ASR Lecture 13 Recurrent Network Acoustic Models 11

Combined RNN system

[Figure: combined RNN system – speech waveform, preprocessor producing Mel+ and PLP features, four recurrent nets each with a time delay (outputs y(t−4)), and a Markov model decoding phone labels such as "sh", "ow", "iy", "m"]

Individual systems: WER of 14–15% on WSJ "spoke 6 data"
Interpolated in log domain: WER = 11%
(Best context-dependent GMM/HMM system in 1995: WER = 8%)
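The log-domain combination of the individual systems' frame posteriors can be sketched as follows (a minimal NumPy illustration, not the original 1990s implementation; the weight lam and the renormalization step are assumptions):

```python
import numpy as np

def interpolate_log_domain(post_a, post_b, lam=0.5, eps=1e-10):
    """Combine two systems' frame-level phone posteriors in the log domain.

    post_a, post_b: arrays of shape (num_frames, num_phones), each row
    summing to 1. lam weights system A; 0.5 gives equal weight.
    """
    log_comb = lam * np.log(post_a + eps) + (1.0 - lam) * np.log(post_b + eps)
    comb = np.exp(log_comb)
    # renormalize so each frame's combined scores again sum to one
    return comb / comb.sum(axis=1, keepdims=True)
```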

ASR Lecture 13 Recurrent Network Acoustic Models 12

LSTM v1

[Figure: a basic sigmoid unit with inputs I(t) and H(t−1) and output H(t)]

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v1

[Figure: the unit with a self-loop of weight 1, giving a linear recurrence]

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v1

[Figure: input and output gates added around the weight-1 linear recurrence]

S. Hochreiter and J. Schmidhuber (1997), "Long Short-Term Memory", Neural Computation, 9:1735–1780

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v2

[Figure: a forget gate added to scale the recurrent (internal state) connection]

F. A. Gers et al (2000), "Learning to Forget: Continual Prediction with LSTM", Neural Computation, 12:2451–2471

ASR Lecture 13 Recurrent Network Acoustic Models 14

LSTM v3

[Figure: peephole connections from the cell state C(t−1), C(t) to the gates]

ASR Lecture 13 Recurrent Network Acoustic Models 15

LSTM equations

2. LSTM ARCHITECTURES

In the standard architecture of LSTM networks, there are an input layer, a recurrent LSTM layer and an output layer. The input layer is connected to the LSTM layer. The recurrent connections in the LSTM layer are directly from the cell output units to the cell input units, input gates, output gates and forget gates. The cell output units are connected to the output layer of the network. The total number of parameters W in a standard LSTM network with one cell in each memory block, ignoring the biases, can be calculated as follows:

W = nc × nc × 4 + ni × nc × 4 + nc × no + nc × 3

where nc is the number of memory cells (and the number of memory blocks in this case), ni is the number of input units, and no is the number of output units. The computational complexity of learning LSTM models per weight and time step with the stochastic gradient descent (SGD) optimization technique is O(1). Therefore, the learning computational complexity per time step is O(W). The learning time for a network with a relatively small number of inputs is dominated by the nc × (nc + no) factor. For tasks requiring a large number of output units and a large number of memory cells to store temporal contextual information, learning LSTM models becomes computationally expensive.

As an alternative to the standard architecture, we propose two novel architectures to address the computational complexity of learning LSTM models. The two architectures are shown in the same Figure 1. In one of them, we connect the cell output units to a recurrent projection layer which connects to the cell input units and gates for recurrency, in addition to network output units for the prediction of the outputs. Hence, the number of parameters in this model is nc × nr × 4 + ni × nc × 4 + nr × no + nc × nr + nc × 3, where nr is the number of units in the recurrent projection layer. In the other one, in addition to the recurrent projection layer, we add another non-recurrent projection layer which is directly connected to the output layer. This model has nc × nr × 4 + ni × nc × 4 + (nr + np) × no + nc × (nr + np) + nc × 3 parameters, where np is the number of units in the non-recurrent projection layer, and it allows us to increase the number of units in the projection layers without increasing the number of parameters in the recurrent connections (nc × nr × 4). Note that having two projection layers with regard to output units is effectively equivalent to having a single projection layer with nr + np units.

An LSTM network computes a mapping from an input sequence x = (x1, ..., xT) to an output sequence y = (y1, ..., yT) by calculating the network unit activations using the following equations, iteratively from t = 1 to T:

it = σ(Wix xt + Wim mt−1 + Wic ct−1 + bi)    (1)
ft = σ(Wfx xt + Wmf mt−1 + Wcf ct−1 + bf)    (2)
ct = ft ⊙ ct−1 + it ⊙ g(Wcx xt + Wcm mt−1 + bc)    (3)
ot = σ(Wox xt + Wom mt−1 + Woc ct + bo)    (4)
mt = ot ⊙ h(ct)    (5)
yt = Wym mt + by    (6)

where the W terms denote weight matrices (e.g. Wix is the matrix of weights from the input gate to the input), the b terms denote bias vectors (bi is the input gate bias vector), σ is the logistic sigmoid function, and i, f, o and c are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the cell output activation vector m. ⊙ is the element-wise product of the vectors, and g and h are the cell input and cell output activation functions, generally tanh.

[Fig. 1: LSTM-based RNN architectures with a recurrent projection layer and an optional non-recurrent projection layer. A single memory block is shown for clarity.]

With the proposed LSTM architecture, with both recurrent and non-recurrent projection layers, the equations are as follows:

it = σ(Wix xt + Wir rt−1 + Wic ct−1 + bi)    (7)
ft = σ(Wfx xt + Wrf rt−1 + Wcf ct−1 + bf)    (8)
ct = ft ⊙ ct−1 + it ⊙ g(Wcx xt + Wcr rt−1 + bc)    (9)
ot = σ(Wox xt + Wor rt−1 + Woc ct + bo)    (10)
mt = ot ⊙ h(ct)    (11)
rt = Wrm mt    (12)
pt = Wpm mt    (13)
yt = Wyr rt + Wyp pt + by    (14)

where r and p denote the recurrent and optional non-recurrent unit activations.
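As a concrete illustration of equations (7)–(14), here is a minimal NumPy sketch of one time step of the LSTMP cell with both projection layers. The dictionary-of-matrices layout and the function name are my own, not from the paper; the peephole terms are applied as full matrix products exactly as the equations are written.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstmp_step(x_t, r_prev, c_prev, W, b):
    """One LSTMP time step following equations (7)-(14).

    W is a dict of weight matrices keyed by the subscripts in the equations
    (e.g. W['ix'] is Wix); b is a dict of bias vectors.
    """
    i_t = sigmoid(W['ix'] @ x_t + W['ir'] @ r_prev + W['ic'] @ c_prev + b['i'])    # (7)  input gate
    f_t = sigmoid(W['fx'] @ x_t + W['rf'] @ r_prev + W['cf'] @ c_prev + b['f'])    # (8)  forget gate
    c_t = f_t * c_prev + i_t * np.tanh(W['cx'] @ x_t + W['cr'] @ r_prev + b['c'])  # (9)  cell state, g = tanh
    o_t = sigmoid(W['ox'] @ x_t + W['or'] @ r_prev + W['oc'] @ c_t + b['o'])       # (10) output gate
    m_t = o_t * np.tanh(c_t)                                                       # (11) cell output, h = tanh
    r_t = W['rm'] @ m_t                                                            # (12) recurrent projection
    p_t = W['pm'] @ m_t                                                            # (13) non-recurrent projection
    y_t = W['yr'] @ r_t + W['yp'] @ p_t + b['y']                                   # (14) network output
    return y_t, r_t, c_t
```

Only rt (of size nr < nc) and ct are carried to the next time step, and only rt enters the recurrent weight matrices, which is where the reduction in parameters comes from.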

2.1 Implementation

We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU. The decision was based on CPU's relatively simpler implementation complexity and ease of debugging. CPU implementation also allows easier distributed implementation on a large cluster of machines if the learning time of large networks becomes a major bottleneck on a single machine [14]. For matrix operations we use the Eigen matrix library [15]. This templated C++ library provides efficient implementations for matrix operations on CPU using vectorized instructions (SIMD – single instruction, multiple data). We implemented activation functions and gradient calculations on matrices using SIMD instructions to benefit from parallelization.

We use the asynchronous stochastic gradient descent (ASGD) optimization technique. The update of the parameters with the gradients is done asynchronously from multiple threads on a multi-core machine. Each thread operates on a batch of sequences in parallel, for computational efficiency – for instance, we can do matrix-matrix multiplications rather than vector-matrix multiplications – and for more stochasticity, since model parameters can be updated from multiple input sequences at the same time. In addition to batching of sequences in a single thread, training with multiple threads effectively [...]

ASR Lecture 13 Recurrent Network Acoustic Models 16

Google LSTM experiments (2014–date)

Google voice search data – 1900h training set

Input features 40-d log mel filterbank energies

No input context: a single 40-d frame presented each timestep

LSTM networks

400–6000 LSTM cells per layer
1–7 layers
13M–37M trainable parameters
14727 context-dependent states

Delay output targets by 5 frames

Use RNN to estimate scaled likelihoods

Train using BPTT

ASR Lecture 13 Recurrent Network Acoustic Models 17

LSTM acoustic modelling architectures

[Figure 1: LSTMP RNN architecture. A single memory block is shown for clarity.]

[...] the memory cell. The output gate controls the output flow of cell activations into the rest of the network. Later, the forget gate was added to the memory block [18]. This addressed a weakness of LSTM models preventing them from processing continuous input streams that are not segmented into subsequences. The forget gate scales the internal state of the cell before adding it as input to the cell through the self-recurrent connection of the cell, therefore adaptively forgetting or resetting the cell's memory. In addition, the modern LSTM architecture contains peephole connections from its internal cells to the gates in the same cell to learn precise timing of the outputs [19].

An LSTM network computes a mapping from an input sequence x = (x1, ..., xT) to an output sequence y = (y1, ..., yT) by calculating the network unit activations using the following equations, iteratively from t = 1 to T:

it = σ(Wix xt + Wim mt−1 + Wic ct−1 + bi)    (1)
ft = σ(Wfx xt + Wfm mt−1 + Wfc ct−1 + bf)    (2)
ct = ft ⊙ ct−1 + it ⊙ g(Wcx xt + Wcm mt−1 + bc)    (3)
ot = σ(Wox xt + Wom mt−1 + Woc ct + bo)    (4)
mt = ot ⊙ h(ct)    (5)
yt = φ(Wym mt + by)    (6)

where the W terms denote weight matrices (e.g. Wix is the matrix of weights from the input gate to the input), Wic, Wfc, Woc are diagonal weight matrices for peephole connections, the b terms denote bias vectors (bi is the input gate bias vector), σ is the logistic sigmoid function, and i, f, o and c are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the cell output activation vector m. ⊙ is the element-wise product of the vectors, g and h are the cell input and cell output activation functions (generally, and in this paper, tanh), and φ is the network output activation function (softmax in this paper).
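Since Wic, Wfc and Woc are diagonal, they are usually stored as vectors and applied with an element-wise multiply rather than a full matrix product. A small sketch of the equivalence (variable names are illustrative):

```python
import numpy as np

n_c = 4
c_prev = np.random.randn(n_c)

# A diagonal peephole matrix has only n_c free parameters, so store it as a
# vector w_ic and apply it element-wise: equivalent to np.diag(w_ic) @ c_prev.
w_ic = np.random.randn(n_c)
peephole_term = w_ic * c_prev

assert np.allclose(peephole_term, np.diag(w_ic) @ c_prev)
```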

2.2 Deep LSTM

As with DNNs with deeper architectures, deep LSTM RNNs have been successfully used for speech recognition [11, 17, 2]. Deep LSTM RNNs are built by stacking multiple LSTM layers. Note that LSTM RNNs are already deep architectures in the sense that they can be considered as a feed-forward neural network unrolled in time, where each layer shares the same model parameters. One can see that the inputs to the model go through multiple non-linear layers, as in DNNs; however, the features from a given time instant are only processed by a single nonlinear layer before contributing to the output for that time instant. Therefore, the depth in deep LSTM RNNs has an additional meaning. The input to the network at a given time step goes through multiple LSTM layers, in addition to propagation through time and LSTM layers. It has been argued that deep layers in RNNs allow the network to learn at different time scales over the input [20]. Deep LSTM RNNs offer another benefit over standard LSTM RNNs: they can make better use of parameters by distributing them over the space through multiple layers. For instance, rather than increasing the memory size of a standard model by a factor of 2, one can have 4 layers with approximately the same number of parameters. This results in inputs going through more non-linear operations per time step.

[Figure 2: LSTM RNN architectures – (a) LSTM, (b) DLSTM (stacked LSTM layers), (c) LSTMP (LSTM with a recurrent projection layer), (d) DLSTMP (stacked LSTMP layers)]
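The stacking described here amounts to feeding each layer's projected output rt to the layer above at the same time step. A minimal self-contained sketch (peepholes and the output layer are omitted for brevity; the names are illustrative, not from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstmp_layer_step(x_t, r_prev, c_prev, W, b):
    """One LSTMP layer (gates + cell + recurrent projection), no output layer."""
    i_t = sigmoid(W['ix'] @ x_t + W['ir'] @ r_prev + b['i'])
    f_t = sigmoid(W['fx'] @ x_t + W['fr'] @ r_prev + b['f'])
    c_t = f_t * c_prev + i_t * np.tanh(W['cx'] @ x_t + W['cr'] @ r_prev + b['c'])
    o_t = sigmoid(W['ox'] @ x_t + W['or'] @ r_prev + b['o'])
    m_t = o_t * np.tanh(c_t)
    r_t = W['rm'] @ m_t              # projected output, fed both forward and back
    return r_t, c_t

def dlstmp_step(x_t, states, layers):
    """Run one frame through a stack of LSTMP layers (DLSTMP).

    states: list of (r_prev, c_prev) per layer; layers: list of (W, b).
    The projected output of layer l is the input of layer l+1.
    """
    h = x_t
    new_states = []
    for (r_prev, c_prev), (W, b) in zip(states, layers):
        r_t, c_t = lstmp_layer_step(h, r_prev, c_prev, W, b)
        new_states.append((r_t, c_t))
        h = r_t                      # the deeper layer sees the projection, not x_t
    return h, new_states             # h would feed the (softmax) output layer
```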

2.3 LSTMP – LSTM with Recurrent Projection Layer

The standard LSTM RNN architecture has an input layer, a recurrent LSTM layer and an output layer. The input layer is connected to the LSTM layer. The recurrent connections in the LSTM layer are directly from the cell output units to the cell input units, input gates, output gates and forget gates. The cell output units are also connected to the output layer of the network. The total number of parameters N in a standard LSTM network with one cell in each memory block, ignoring the biases, can be calculated as N = nc × nc × 4 + ni × nc × 4 + nc × no + nc × 3, where nc is the number of memory cells (and number of memory blocks in this case), ni is the number of input units, and no is the number of output units. The computational complexity of learning LSTM models per weight and time step with the stochastic gradient descent (SGD) optimization technique is O(1). Therefore, the learning computational complexity per time step is O(N). The learning time for a network with a moderate number of inputs is dominated by the nc × (4 × nc + no) factor. For tasks requiring a large number of output units and a large number of memory cells to store temporal contextual information, learning LSTM models becomes computationally expensive.

As an alternative to the standard architecture, we proposed the Long Short-Term Memory Projected (LSTMP) architecture to address the computational complexity of learning LSTM models [3]. This architecture, shown in Figure 1, has a separate linear projection layer after the LSTM layer. The recurrent connections now connect from this recurrent projection layer to the input of the LSTM layer. The network output units are connected to this recurrent layer. The number of parameters in this model is nc × nr × 4 + ni × nc × 4 + nr × no + nc × nr + nc × 3.

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 18

LSTM with projection layer (LSTMP)

[Figure 1 (LSTMP RNN architecture, single memory block) and the accompanying paper text, repeated from the previous slide]

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 19

LSTM ASR Results (Google Voice Search task)

Table 1. Experiments with LSTM and LSTMP RNN architectures, showing test set WERs and frame accuracies on development and training sets. L indicates the number of layers, for shallow (1L) and deep (2L, 4L, 5L, 7L) networks; C indicates the number of memory cells, P the number of recurrent projection units, and N the total number of parameters.

C      P    Depth  N    Dev (%)  Train (%)  WER (%)
840    -    5L     37M  67.7     70.7       10.9
440    -    5L     13M  67.6     70.1       10.8
600    -    2L     13M  66.4     68.5       11.3
385    -    7L     13M  66.2     68.5       11.2
750    -    1L     13M  63.3     65.5       12.4
6000   800  1L     36M  67.3     74.9       11.8
2048   512  2L     22M  68.8     72.0       10.8
1024   512  3L     20M  69.3     72.5       10.7
1024   512  2L     15M  69.0     74.0       10.7
800    512  2L     13M  69.0     72.7       10.7
2048   512  1L     13M  67.3     71.8       11.3

where nr is the number of units in the recurrent projection layer. In this case, the model size and the learning computational complexity are dominated by the nr × (4 × nc + no) factor. Hence, this allows us to reduce the number of parameters by the ratio nr/nc.

By setting nr < nc we can increase the model memory (nc) and still be able to control the number of parameters in the recurrent connections and output layer.
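The parameter counts in Table 1 can be reproduced from the formulas above. A small sketch, assuming no = 14247 output states (the figure given in Section 4.1 below) and ni = 40 input features; the helper names are mine:

```python
def lstm_params(n_c, n_i, n_o):
    """Standard LSTM layer + output layer, biases ignored (Section 2.3)."""
    return n_c * n_c * 4 + n_i * n_c * 4 + n_c * n_o + n_c * 3

def lstmp_params(n_c, n_r, n_i, n_o):
    """LSTMP layer + output layer, biases ignored (Section 2.3)."""
    return n_c * n_r * 4 + n_i * n_c * 4 + n_r * n_o + n_c * n_r + n_c * 3

n_i, n_o = 40, 14247   # 40-d log-filterbank input, CD-state outputs
print(lstm_params(750, n_i, n_o))         # ~13.1M: the 750-cell, 1-layer LSTM row
print(lstmp_params(2048, 512, n_i, n_o))  # ~12.9M: the 2048C/512P, 1-layer LSTMP row
```

Both come out close to the 13M figures reported for those configurations in Table 1.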

With the proposed LSTMP architecture, the equations for the activations of network units change slightly: the mt−1 activation vector is replaced with rt−1, and the following is added:

rt = Wrm mt    (7)
yt = φ(Wyr rt + by)    (8)

where the r denote the recurrent unit activations

2.4 Deep LSTMP

Similar to deep LSTM, we propose deep LSTMP, where multiple LSTM layers, each with a separate recurrent projection layer, are stacked. LSTMP allows the memory of the model to be increased independently from the output layer and recurrent connections. However, we noticed that increasing the memory size makes the model more prone to overfitting by memorizing the input sequence data. We know that DNNs generalize better to unseen examples with increasing depth. The depth makes the models harder to overfit to the training data, since the inputs to the network need to go through many non-linear functions. With this motivation, we have experimented with deep LSTMP architectures, where the aim is increasing the memory size and generalization power of the model.

3. Distributed Training: Scaling up to Large Models with Parallelization

We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU. The decision was based on CPU's relatively simpler implementation complexity, ease of debugging and the ability to use clusters made from commodity hardware. For matrix operations we used the Eigen matrix library [21]. This templated C++ library provides efficient implementations for matrix operations on CPU using vectorized instructions. We implemented activation functions and gradient calculations on matrices using SIMD instructions to benefit from parallelization.

We use the truncated backpropagation through time (BPTT) learning algorithm [22] to compute parameter gradients on short subsequences of the training utterances. Activations are forward propagated for a fixed step time Tbptt (e.g. 20). Cross entropy gradients are computed for this subsequence and backpropagated to its start. For computational efficiency, each thread operates on subsequences of four utterances at a time, so matrix multiplies can operate in parallel on four frames at a time. We use asynchronous stochastic gradient descent (ASGD) [23] to optimize the network parameters, updating the parameters asynchronously from multiple threads on a multi-core machine. This effectively increases the batch size and reduces the correlation of the frames in a given batch. After a thread has updated the parameters, it continues with the next subsequence in each utterance, preserving the LSTM state, or starts new utterances with reset state when one finishes. Note that the last subsequence of each utterance can be shorter than Tbptt, but is padded to the full length, though no gradient is generated for these padding frames.
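A simplified sketch of this subsequence scheduling for a single training thread (model.truncated_bptt, model.zero_state and next_utterance are placeholders for illustration, not the paper's implementation):

```python
import numpy as np

T_BPTT = 20     # frames forward-propagated before each truncated backprop
N_STREAMS = 4   # utterances processed in parallel by one thread

def train_thread(next_utterance, model, apply_gradient):
    """Cut T_BPTT-frame subsequences from four utterances, carrying LSTM state."""
    utts   = [next_utterance() for _ in range(N_STREAMS)]   # (T_i, 40) arrays
    pos    = [0] * N_STREAMS
    states = [model.zero_state() for _ in range(N_STREAMS)]
    while True:
        chunks, masks = [], []
        for k in range(N_STREAMS):
            frames = utts[k][pos[k]:pos[k] + T_BPTT]
            pad = T_BPTT - len(frames)
            # the last subsequence of an utterance may be shorter than T_BPTT:
            # pad it to full length, and mask the padding so it gets no gradient
            chunks.append(np.pad(frames, ((0, pad), (0, 0))))
            masks.append(np.concatenate([np.ones(len(frames)), np.zeros(pad)]))
        grads, states = model.truncated_bptt(chunks, masks, states)
        apply_gradient(grads)             # asynchronous update of shared parameters
        for k in range(N_STREAMS):
            pos[k] += T_BPTT
            if pos[k] >= len(utts[k]):    # utterance finished: new one, reset state
                utts[k], pos[k], states[k] = next_utterance(), 0, model.zero_state()
```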

This highly parallel single machine ASGD framework, described in [3], proved slow for training models of the scale we have used for large scale ASR with DNNs (many millions of parameters). To scale further, we replicate the single-machine workers on many (e.g. 500) separate machines, each with three synchronized computation threads. Each worker communicates with a shared distributed parameter server [23], which stores the LSTM parameters. When a worker has computed the parameter gradient on a minibatch (of 3 × 4 × Tbptt frames), the gradient vector is partitioned and sent to the parameter server shards, which each add the gradients to their parameters and respond with the new parameters. The parameter server shards aggregate parameter updates completely asynchronously. For instance, gradient updates from workers may arrive in different orders at different shards of the parameter server. Despite the asynchrony, we observe stable convergence, though the learning rate must be reduced, as would be expected, because of the increase in the effective batch size from the greater parallelism.
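The worker/parameter-server exchange can be pictured with a toy sketch (an illustration of the idea only, not the actual distributed implementation; the class and function names are invented):

```python
import numpy as np

class ParamShard:
    """One shard of the distributed parameter server."""
    def __init__(self, values, lr=0.01):
        self.values = values                  # this shard's slice of the model
        self.lr = lr
    def push_pull(self, grad):
        # add the incoming gradient, reply with the fresh parameters;
        # different shards do this independently and asynchronously
        self.values -= self.lr * grad
        return self.values.copy()

def worker_update(shards, slices, full_gradient):
    """Partition a worker's gradient, send each piece to its shard, and
    overwrite the worker's local copy with the returned parameters."""
    new_local = [shard.push_pull(full_gradient[sl])
                 for shard, sl in zip(shards, slices)]
    return np.concatenate(new_local)
```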

4. Experiments

We evaluate and compare the performance of LSTM RNN architectures on a large vocabulary speech recognition task – the Google Voice Search task. We use a hybrid approach [24] for acoustic modeling with LSTM RNNs, wherein the neural networks estimate hidden Markov model (HMM) state posteriors. We scale the state posteriors by the state priors, estimated as the relative state frequency from the training data, to obtain the acoustic frame likelihoods. We deweight the silence state counts by a factor of 2.7 when estimating the state frequencies.
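A minimal sketch of this prior estimation and posterior scaling (the alignment format, the placement of the 2.7 deweighting and the epsilon floor are assumptions based on the description above):

```python
import numpy as np

def estimate_priors(alignments, num_states, silence_states, deweight=2.7):
    """State priors as relative frequencies over the training alignments,
    with silence state counts deweighted (divided) by the given factor."""
    counts = np.zeros(num_states)
    for utt_states in alignments:            # utt_states: array of HMM state ids
        np.add.at(counts, utt_states, 1.0)
    counts[list(silence_states)] /= deweight
    return counts / counts.sum()

def scaled_likelihoods(log_posteriors, priors, eps=1e-10):
    """Hybrid NN/HMM: acoustic score proportional to posterior / prior (log domain)."""
    return log_posteriors - np.log(priors + eps)
```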

4.1 Systems & Evaluation

All the networks are trained on a 3 million utterance (about 1900 hours) dataset consisting of anonymized and hand-transcribed utterances. The dataset is represented with 25ms frames of 40-dimensional log-filterbank energy features computed every 10ms. The utterances are aligned with an 85 million parameter DNN with 14247 CD states. The weights in all the networks are initialized to the range (−0.02, 0.02) with a uniform distribution. We try to set the learning rate, specific to a network architecture and its configuration, to the largest value [...]

(Sak 2014) ASR Lecture 13 Recurrent Network Acoustic Models 20

Reading

S. Hochreiter and J. Schmidhuber (1997), "Long Short-Term Memory", Neural Computation, 9:1735–1780.
http://www.mitpressjournals.org/doi/abs/10.1162/neco.1997.9.8.1735

F. A. Gers et al (2000), "Learning to Forget: Continual Prediction with LSTM", Neural Computation, 12:2451–2471.
http://www.mitpressjournals.org/doi/abs/10.1162/089976600300015015

T. Robinson et al (1996), "The use of recurrent networks in continuous speech recognition", in Automatic Speech and Speaker Recognition: Advanced Topics, Lee et al (eds), Kluwer, 233–258.
http://www.cstr.ed.ac.uk/downloads/publications/1996/rnn4csr96.pdf

H. Sak et al (2014), "LSTM recurrent neural network architectures for large scale acoustic modeling", Interspeech-2014.
http://research.google.com/pubs/archive/43905.pdf

ASR Lecture 13 Recurrent Network Acoustic Models 21

Page 12: Recurrent Network Acoustic Models - School of Informatics · store temporal contextual information, learning LSTM models be-come computationally expensive. As an alternative to the

1990s RNN

Features

MEL+ Filter-bank outputs + voicing parameters (23featuresframe)PLP cepstral coefficients (13 features)Coefficients normalized to zero mean and unit variance

256 hidden (state) units

79 context-independent phone classes (outputs)

Output training target delayed by 5 frames

Use RNN scaled likelihoods in hybrid RNNHMM

About 100k trainable parameters

Trained using stochastic BPTT (using method similar toRpropRMSprop) on WSJ0 (3M training examples)

In 1994 training took five days on a specially designed parallelcomputer (the RAP)

ASR Lecture 13 Recurrent Network Acoustic Models 11

Combined RNN system

u(t)

x(t+1)x(t)

Timedelay

y(t-4)

u(t)

x(t+1)x(t)

Timedelay

y(t-4)

u(t)

x(t+1)x(t)

Timedelay

y(t-4)

u(t)

x(t+1)x(t)

Timedelay

y(t-4)

Mel +

PLP

Mel +

PLP

sh ow

iym

Speech waveform Preprocessor Recurrent net Markov mode

Individual systems WER of 14ndash15 on WSJ ldquospoke 6 datardquoInterpolated in log domain WER=11(Best context dependent GMMHMM system in 1995 WER=8)

ASR Lecture 13 Recurrent Network Acoustic Models 12

LSTM v1

I(t) H(t-1)

H(t)

BasicSigmoid

Unit

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v1

I(t) H(t-1)

H(t)

1

LinearRecurrence

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v1

I(t) H(t-1)

H(t)

1

OutputGate

InputGate

S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v2

I(t) H(t-1)

H(t)

ForgetGate

FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471

ASR Lecture 13 Recurrent Network Acoustic Models 14

LSTM v3

I(t) H(t-1)

H(t)

C(t)

C(t-1)

PeepholeConnections

ASR Lecture 13 Recurrent Network Acoustic Models 15

LSTM equations

2 LSTM ARCHITECTURES

In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows

W = nc nc 4 + ni nc 4 + nc no + nc 3

where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive

As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units

An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T

it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)

ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)

mt = ot h(ct) (5)yt = Wymmt + by (6)

where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product

inpu

t

g ct1 h

it

ft

ct

ot recu

rren

tpr

ojec

tion

outp

ut

xt

mt

pt

rt

rt1

yt

memory blocks

Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity

of the vectors and g and h are the cell input and cell output activationfunctions generally tanh

With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows

it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)

ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)

mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)

yt = Wyrrt + Wyppt + by (14)(15)

where the r and p denote the recurrent and optional non-recurrentunit activations

21 Implementation

We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization

We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively

ASR Lecture 13 Recurrent Network Acoustic Models 16

Google LSTM experiments (2014ndashdate)

Google voice search data ndash 1900h training set

Input features 40-d log mel filterbank energies

No input context single frame 40-d frame presented eachtimestep

LSTM networks

400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states

Delay output targets by 5 frames

Use RNN to estimated scaled likelihoods

Train using BPTT

ASR Lecture 13 Recurrent Network Acoustic Models 17

LSTM acoustic modelling architectures

inpu

t

g cell h

it

ft

ct

ot

recu

rren

t

outp

ut

xt

mt

rt

rt1

yt

LSTM memory blocks

Figure 1 LSTMP RNN architecture A single memory block isshown for clarity

memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]

An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T

it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)

ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)

mt = ot h(ct) (5)yt = (Wymmt + by) (6)

where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc

are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper

22 Deep LSTM

As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time

input

LSTM

output

(a) LSTM

input

LSTM

LSTM

output

(b) DLSTM

input

LSTM

recurrent

output

(c) LSTMP

input

LSTM

recurrent

LSTM

recurrent

output

(d) DLSTMP

Figure 2 LSTM RNN architectures

instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step

23 LSTMP - LSTM with Recurrent Projection Layer

The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive

As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 18

LSTM with projection layer (LSTMP)

inpu

t

g cell h

it

ft

ct

ot

recu

rren

t

outp

ut

xt

mt

rt

rt1

yt

LSTM memory blocks

Figure 1 LSTMP RNN architecture A single memory block isshown for clarity

memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]

An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T

it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)

ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)

mt = ot h(ct) (5)yt = (Wymmt + by) (6)

where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc

are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper

22 Deep LSTM

As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time

input

LSTM

output

(a) LSTM

input

LSTM

LSTM

output

(b) DLSTM

input

LSTM

recurrent

output

(c) LSTMP

input

LSTM

recurrent

LSTM

recurrent

output

(d) DLSTMP

Figure 2 LSTM RNN architectures

instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step

23 LSTMP - LSTM with Recurrent Projection Layer

The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive

As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 19

LSTM ASR Results (Google Voice Search task)

Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters

C P Depth N Dev Train WER() () ()

840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124

6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107

2048 512 1L 13M 673 718 113

where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc

By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer

With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added

rt = Wrmmt (7)yt = (Wyrrt + by) (8)

where the r denote the recurrent unit activations

24 Deep LSTMP

Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model

3 Distributed Training Scaling up toLarge Models with Parallelization

We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized

instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization

We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes

This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism

4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies

41 Systems amp Evaluation

All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value

(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20

Reading

S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798

1735VtN93Mehb5k

FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162

089976600300015015VtN9ncehb5k

T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http

wwwcstredacukdownloadspublications1996rnn4csr96pdf

H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf

ASR Lecture 13 Recurrent Network Acoustic Models 21

Page 13: Recurrent Network Acoustic Models - School of Informatics · store temporal contextual information, learning LSTM models be-come computationally expensive. As an alternative to the

Combined RNN system

u(t)

x(t+1)x(t)

Timedelay

y(t-4)

u(t)

x(t+1)x(t)

Timedelay

y(t-4)

u(t)

x(t+1)x(t)

Timedelay

y(t-4)

u(t)

x(t+1)x(t)

Timedelay

y(t-4)

Mel +

PLP

Mel +

PLP

sh ow

iym

Speech waveform Preprocessor Recurrent net Markov mode

Individual systems WER of 14ndash15 on WSJ ldquospoke 6 datardquoInterpolated in log domain WER=11(Best context dependent GMMHMM system in 1995 WER=8)

ASR Lecture 13 Recurrent Network Acoustic Models 12

LSTM v1

I(t) H(t-1)

H(t)

BasicSigmoid

Unit

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v1

I(t) H(t-1)

H(t)

1

LinearRecurrence

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v1

I(t) H(t-1)

H(t)

1

OutputGate

InputGate

S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v2

I(t) H(t-1)

H(t)

ForgetGate

FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471

ASR Lecture 13 Recurrent Network Acoustic Models 14

LSTM v3

I(t) H(t-1)

H(t)

C(t)

C(t-1)

PeepholeConnections

ASR Lecture 13 Recurrent Network Acoustic Models 15

LSTM equations

2 LSTM ARCHITECTURES

In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows

W = nc nc 4 + ni nc 4 + nc no + nc 3

where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive

As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units

An LSTM network computes a mapping from an input sequence x = (x_1, ..., x_T) to an output sequence y = (y_1, ..., y_T) by calculating the network unit activations using the following equations iteratively from t = 1 to T:

i_t = σ(W_{ix} x_t + W_{im} m_{t-1} + W_{ic} c_{t-1} + b_i)    (1)
f_t = σ(W_{fx} x_t + W_{mf} m_{t-1} + W_{cf} c_{t-1} + b_f)    (2)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g(W_{cx} x_t + W_{cm} m_{t-1} + b_c)    (3)
o_t = σ(W_{ox} x_t + W_{om} m_{t-1} + W_{oc} c_t + b_o)    (4)
m_t = o_t ⊙ h(c_t)    (5)
y_t = W_{ym} m_t + b_y    (6)

where the W terms denote weight matrices (e.g. W_{ix} is the matrix of weights from the input gate to the input), the b terms denote bias vectors (b_i is the input gate bias vector), σ is the logistic sigmoid function, and i, f, o and c are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the cell output activation vector m; ⊙ is the element-wise product


Fig. 1. LSTM based RNN architectures with a recurrent projection layer and an optional non-recurrent projection layer. A single memory block is shown for clarity.

of the vectors, and g and h are the cell input and cell output activation functions, generally tanh.
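A minimal numpy sketch of one time step of equations (1)-(6) above (my own transcription: parameters are kept in a plain dict, g and h are taken as tanh, and the peephole weights W_ic, W_cf, W_oc are treated as ordinary matrices rather than the diagonal form used in some descriptions):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, m_prev, c_prev, p):
    # p is a dict of weight matrices W_* and bias vectors b_*.
    i_t = sigmoid(p['W_ix'] @ x_t + p['W_im'] @ m_prev + p['W_ic'] @ c_prev + p['b_i'])
    f_t = sigmoid(p['W_fx'] @ x_t + p['W_mf'] @ m_prev + p['W_cf'] @ c_prev + p['b_f'])
    c_t = f_t * c_prev + i_t * np.tanh(p['W_cx'] @ x_t + p['W_cm'] @ m_prev + p['b_c'])
    o_t = sigmoid(p['W_ox'] @ x_t + p['W_om'] @ m_prev + p['W_oc'] @ c_t + p['b_o'])
    m_t = o_t * np.tanh(c_t)                 # cell output, eq. (5)
    y_t = p['W_ym'] @ m_t + p['b_y']         # linear output, eq. (6)
    return y_t, m_t, c_t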

With the proposed LSTM architecture with both recurrent and non-recurrent projection layers, the equations are as follows:

i_t = σ(W_{ix} x_t + W_{ir} r_{t-1} + W_{ic} c_{t-1} + b_i)    (7)
f_t = σ(W_{fx} x_t + W_{rf} r_{t-1} + W_{cf} c_{t-1} + b_f)    (8)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g(W_{cx} x_t + W_{cr} r_{t-1} + b_c)    (9)
o_t = σ(W_{ox} x_t + W_{or} r_{t-1} + W_{oc} c_t + b_o)    (10)
m_t = o_t ⊙ h(c_t)    (11)
r_t = W_{rm} m_t    (12)
p_t = W_{pm} m_t    (13)
y_t = W_{yr} r_t + W_{yp} p_t + b_y    (14)

where the r and p denote the recurrent and optional non-recurrent unit activations.
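The corresponding sketch for equations (7)-(14): the recurrent input is the projection r_{t-1} instead of m_{t-1}, and the output is formed from the recurrent and non-recurrent projections (reusing numpy and the sigmoid helper from the previous sketch; the same simplifications apply):

def lstmp_step(x_t, r_prev, c_prev, p):
    i_t = sigmoid(p['W_ix'] @ x_t + p['W_ir'] @ r_prev + p['W_ic'] @ c_prev + p['b_i'])
    f_t = sigmoid(p['W_fx'] @ x_t + p['W_rf'] @ r_prev + p['W_cf'] @ c_prev + p['b_f'])
    c_t = f_t * c_prev + i_t * np.tanh(p['W_cx'] @ x_t + p['W_cr'] @ r_prev + p['b_c'])
    o_t = sigmoid(p['W_ox'] @ x_t + p['W_or'] @ r_prev + p['W_oc'] @ c_t + p['b_o'])
    m_t = o_t * np.tanh(c_t)
    r_t = p['W_rm'] @ m_t                    # recurrent projection, eq. (12)
    p_t = p['W_pm'] @ m_t                    # non-recurrent projection, eq. (13)
    y_t = p['W_yr'] @ r_t + p['W_yp'] @ p_t + p['b_y']   # eq. (14)
    return y_t, r_t, c_t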

2.1. Implementation

We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU. The decision was based on CPU's relatively simpler implementation complexity and ease of debugging. CPU implementation also allows easier distributed implementation on a large cluster of machines if the learning time of large networks becomes a major bottleneck on a single machine [14]. For matrix operations we use the Eigen matrix library [15]. This templated C++ library provides efficient implementations for matrix operations on CPU using vectorized instructions (SIMD – single instruction multiple data). We implemented activation functions and gradient calculations on matrices using SIMD instructions to benefit from parallelization.

We use the asynchronous stochastic gradient descent (ASGD) optimization technique. The update of the parameters with the gradients is done asynchronously from multiple threads on a multi-core machine. Each thread operates on a batch of sequences in parallel, for computational efficiency – for instance, we can do matrix-matrix multiplications rather than vector-matrix multiplications – and for more stochasticity, since model parameters can be updated from multiple input sequences at the same time. In addition to batching of sequences in a single thread, training with multiple threads effectively
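A toy sketch of the asynchronous update idea only (this is not the paper's implementation; the objective, sizes and learning rate are invented). Several threads update a shared parameter vector without any locking, Hogwild-style:

import threading
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(10)                      # shared parameters
target = rng.normal(size=10)              # toy objective: minimise ||theta - target||^2

def worker(num_steps, lr=0.05):
    global theta
    for _ in range(num_steps):
        grad = 2.0 * (theta - target)     # gradient of the toy objective
        theta = theta - lr * grad         # asynchronous, lock-free update

threads = [threading.Thread(target=worker, args=(200,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(np.allclose(theta, target, atol=1e-2))   # converges despite the races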

ASR Lecture 13 Recurrent Network Acoustic Models 16

Google LSTM experiments (2014–date)

Google voice search data – 1900h training set

Input features 40-d log mel filterbank energies

No input context: a single 40-d frame presented each timestep

LSTM networks

400–6000 LSTM cells per layer
1–7 layers
13M–37M trainable parameters
14,727 context-dependent states

Delay output targets by 5 frames

Use RNN to estimate scaled likelihoods (see the sketch after this list)

Train using BPTT
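A minimal sketch of the "scaled likelihood" step in the hybrid setup (function and variable names are mine): the network's state posteriors are divided by the state priors, which in the log domain is a subtraction:

import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors):
    # log_posteriors: (T, num_states) from the RNN; log_priors: (num_states,)
    # p(x_t | s) is proportional to p(s | x_t) / p(s), so subtract log priors.
    return log_posteriors - log_priors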

ASR Lecture 13 Recurrent Network Acoustic Models 17

LSTM acoustic modelling architectures


Figure 1: LSTMP RNN architecture. A single memory block is shown for clarity.

memory cell. The output gate controls the output flow of cell activations into the rest of the network. Later, the forget gate was added to the memory block [18]. This addressed a weakness of LSTM models preventing them from processing continuous input streams that are not segmented into subsequences. The forget gate scales the internal state of the cell before adding it as input to the cell through the self-recurrent connection of the cell, therefore adaptively forgetting or resetting the cell's memory. In addition, the modern LSTM architecture contains peephole connections from its internal cells to the gates in the same cell to learn precise timing of the outputs [19].

An LSTM network computes a mapping from an input sequence x = (x_1, ..., x_T) to an output sequence y = (y_1, ..., y_T) by calculating the network unit activations using the following equations iteratively from t = 1 to T:

i_t = σ(W_{ix} x_t + W_{im} m_{t-1} + W_{ic} c_{t-1} + b_i)    (1)
f_t = σ(W_{fx} x_t + W_{fm} m_{t-1} + W_{fc} c_{t-1} + b_f)    (2)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g(W_{cx} x_t + W_{cm} m_{t-1} + b_c)    (3)
o_t = σ(W_{ox} x_t + W_{om} m_{t-1} + W_{oc} c_t + b_o)    (4)
m_t = o_t ⊙ h(c_t)    (5)
y_t = φ(W_{ym} m_t + b_y)    (6)

where the W terms denote weight matrices (e.g. W_{ix} is the matrix of weights from the input gate to the input), W_{ic}, W_{fc}, W_{oc} are diagonal weight matrices for peephole connections, the b terms denote bias vectors (b_i is the input gate bias vector), σ is the logistic sigmoid function, and i, f, o and c are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the cell output activation vector m; ⊙ is the element-wise product of the vectors; g and h are the cell input and cell output activation functions, generally and in this paper tanh; and φ is the network output activation function, softmax in this paper.

2.2. Deep LSTM

As with DNNs with deeper architectures, deep LSTM RNNs have been successfully used for speech recognition [11, 17, 2]. Deep LSTM RNNs are built by stacking multiple LSTM layers. Note that LSTM RNNs are already deep architectures in the sense that they can be considered as a feed-forward neural network unrolled in time where each layer shares the same model parameters. One can see that the inputs to the model go through multiple non-linear layers as in DNNs; however, the features from a given time instant are only processed by a single nonlinear layer before contributing the output for that time

[Figure 2 diagrams: (a) LSTM: input → LSTM → output; (b) DLSTM: input → LSTM → LSTM → output; (c) LSTMP: input → LSTM → recurrent projection → output; (d) DLSTMP: input → LSTM → recurrent projection → LSTM → recurrent projection → output]

Figure 2: LSTM RNN architectures.

instant. Therefore the depth in deep LSTM RNNs has an additional meaning. The input to the network at a given time step goes through multiple LSTM layers in addition to propagation through time and LSTM layers. It has been argued that deep layers in RNNs allow the network to learn at different time scales over the input [20]. Deep LSTM RNNs offer another benefit over standard LSTM RNNs: they can make better use of parameters by distributing them over the space through multiple layers. For instance, rather than increasing the memory size of a standard model by a factor of 2, one can have 4 layers with approximately the same number of parameters. This results in inputs going through more non-linear operations per time step.
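A sketch of the stacking just described, reusing the lstm_step function from the earlier sketch (the list-of-dicts parameter layout and the 'size' field are my own conventions): each frame passes through every LSTM layer before the top-layer output is emitted.

def deep_lstm_forward(x_seq, layers):
    # layers: one parameter dict per LSTM layer; layers[l]['size'] is its cell count.
    states = [(np.zeros(l['size']), np.zeros(l['size'])) for l in layers]   # (m, c) per layer
    outputs = []
    for x_t in x_seq:
        h = x_t
        for l, layer in enumerate(layers):
            m_prev, c_prev = states[l]
            _, m_t, c_t = lstm_step(h, m_prev, c_prev, layer)   # per-layer y ignored
            states[l] = (m_t, c_t)
            h = m_t                      # feed this layer's cell outputs to the next layer
        outputs.append(h)                # top-layer output for this time step
    return outputs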

2.3. LSTMP - LSTM with Recurrent Projection Layer

The standard LSTM RNN architecture has an input layer, a recurrent LSTM layer and an output layer. The input layer is connected to the LSTM layer. The recurrent connections in the LSTM layer are directly from the cell output units to the cell input units, input gates, output gates and forget gates. The cell output units are also connected to the output layer of the network. The total number of parameters N in a standard LSTM network with one cell in each memory block, ignoring the biases, can be calculated as N = n_c × n_c × 4 + n_i × n_c × 4 + n_c × n_o + n_c × 3, where n_c is the number of memory cells (and number of memory blocks in this case), n_i is the number of input units and n_o is the number of output units. The computational complexity of learning LSTM models per weight and time step with the stochastic gradient descent (SGD) optimization technique is O(1). Therefore the learning computational complexity per time step is O(N). The learning time for a network with a moderate number of inputs is dominated by the n_c × (4 × n_c + n_o) factor. For the tasks requiring a large number of output units and a large number of memory cells to store temporal contextual information, learning LSTM models become computationally expensive.

As an alternative to the standard architecture, we proposed the Long Short-Term Memory Projected (LSTMP) architecture to address the computational complexity of learning LSTM models [3]. This architecture, shown in Figure 1, has a separate linear projection layer after the LSTM layer. The recurrent connections now connect from this recurrent projection layer to the input of the LSTM layer. The network output units are connected to this recurrent layer. The number of parameters in this model is n_c × n_r × 4 + n_i × n_c × 4 + n_r × n_o + n_c × n_r + n_c × 3.

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 18

LSTM with projection layer (LSTMP)


(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 19

LSTM ASR Results (Google Voice Search task)

Table 1: Experiments with LSTM and LSTMP RNN architectures showing test set WERs and frame accuracies on development and training sets. L indicates the number of layers, for shallow (1L) and deep (2, 4, 5, 7L) networks. C indicates the number of memory cells, P the number of recurrent projection units, and N the total number of parameters.

C     P    Depth  N    Dev (%)  Train (%)  WER (%)
840   -    5L     37M  67.7     70.7       10.9
440   -    5L     13M  67.6     70.1       10.8
600   -    2L     13M  66.4     68.5       11.3
385   -    7L     13M  66.2     68.5       11.2
750   -    1L     13M  63.3     65.5       12.4
6000  800  1L     36M  67.3     74.9       11.8
2048  512  2L     22M  68.8     72.0       10.8
1024  512  3L     20M  69.3     72.5       10.7
1024  512  2L     15M  69.0     74.0       10.7
800   512  2L     13M  69.0     72.7       10.7
2048  512  1L     13M  67.3     71.8       11.3

where n_r is the number of units in the recurrent projection layer. In this case the model size and the learning computational complexity are dominated by the n_r × (4 × n_c + n_o) factor. Hence this allows us to reduce the number of parameters by the ratio n_r/n_c.

By setting n_r < n_c we can increase the model memory (n_c) and still be able to control the number of parameters in the recurrent connections and output layer.

With the proposed LSTMP architecture, the equations for the activations of network units change slightly: the m_{t-1} activation vector is replaced with r_{t-1} and the following is added:

r_t = W_{rm} m_t    (7)
y_t = φ(W_{yr} r_t + b_y)    (8)

where the r denote the recurrent unit activations.

2.4. Deep LSTMP

Similar to deep LSTM, we propose deep LSTMP, where multiple LSTM layers, each with a separate recurrent projection layer, are stacked. LSTMP allows the memory of the model to be increased independently from the output layer and recurrent connections. However, we noticed that increasing the memory size makes the model more prone to overfitting by memorizing the input sequence data. We know that DNNs generalize better to unseen examples with increasing depth. The depth makes the models harder to overfit to the training data since the inputs to the network need to go through many non-linear functions. With this motivation we have experimented with deep LSTMP architectures, where the aim is increasing the memory size and generalization power of the model.

3. Distributed Training: Scaling up to Large Models with Parallelization

We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU. The decision was based on CPU's relatively simpler implementation complexity, ease of debugging and the ability to use clusters made from commodity hardware. For matrix operations we used the Eigen matrix library [21]. This templated C++ library provides efficient implementations for matrix operations on CPU using vectorized instructions. We implemented activation functions and gradient calculations on matrices using SIMD instructions to benefit from parallelization.

We use the truncated backpropagation through time (BPTT) learning algorithm [22] to compute parameter gradients on short subsequences of the training utterances. Activations are forward propagated for a fixed step time T_bptt (e.g. 20). Cross entropy gradients are computed for this subsequence and backpropagated to its start. For computational efficiency each thread operates on subsequences of four utterances at a time, so matrix multiplies can operate in parallel on four frames at a time. We use asynchronous stochastic gradient descent (ASGD) [23] to optimize the network parameters, updating the parameters asynchronously from multiple threads on a multi-core machine. This effectively increases the batch size and reduces the correlation of the frames in a given batch. After a thread has updated the parameters, it continues with the next subsequence in each utterance, preserving the LSTM state, or starts new utterances with reset state when one finishes. Note that the last subsequence of each utterance can be shorter than T_bptt but is padded to the full length, though no gradient is generated for these padding frames.
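A sketch of how training data might be cut into fixed-length subsequences for truncated BPTT as described above (the chunking helper and the use of None for padding are my own conventions; the actual forward/backward computation is left as a comment):

def truncated_bptt_batches(utterances, t_bptt=20, batch_size=4):
    # Yield batch_size x t_bptt chunks; pad short tails with None so the
    # trainer can skip gradients for padding frames. Carrying LSTM state
    # across successive chunks of the same utterance is left to the caller.
    for start in range(0, len(utterances), batch_size):
        group = utterances[start:start + batch_size]
        max_len = max(len(u) for u in group)
        for t0 in range(0, max_len, t_bptt):
            chunk = []
            for u in group:
                frames = list(u[t0:t0 + t_bptt])
                frames += [None] * (t_bptt - len(frames))
                chunk.append(frames)
            yield chunk

# Toy usage with integer "frames":
utts = [list(range(45)), list(range(30)), list(range(70))]
for batch in truncated_bptt_batches(utts):
    pass   # forward t_bptt frames, compute CE gradients, backprop to the chunk start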

This highly parallel single machine ASGD framework described in [3] proved slow for training models of the scale we have used for large scale ASR with DNNs (many millions of parameters). To scale further, we replicate the single-machine workers on many (e.g. 500) separate machines, each with three synchronized computation threads. Each worker communicates with a shared, distributed parameter server [23] which stores the LSTM parameters. When a worker has computed the parameter gradient on a minibatch (of 3 × 4 × T_bptt frames), the gradient vector is partitioned and sent to the parameter server shards, which each add the gradients to their parameters and respond with the new parameters. The parameter server shards aggregate parameter updates completely asynchronously. For instance, gradient updates from workers may arrive in different orders at different shards of the parameter server. Despite the asynchrony, we observe stable convergence, though the learning rate must be reduced, as would be expected because of the increase in the effective batch size from the greater parallelism.

4. Experiments

We evaluate and compare the performance of LSTM RNN architectures on a large vocabulary speech recognition task – the Google Voice Search task. We use a hybrid approach [24] for acoustic modeling with LSTM RNNs, wherein the neural networks estimate hidden Markov model (HMM) state posteriors. We scale the state posteriors by the state priors, estimated as the relative state frequency from the training data, to obtain the acoustic frame likelihoods. We deweight the silence state counts by a factor of 2.7 when estimating the state frequencies.
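A small sketch of the prior estimation with silence deweighting described above (the state-count vector and the set of silence state indices are assumed to come from the training alignments):

import numpy as np

def estimate_log_priors(state_counts, silence_states, silence_deweight=2.7):
    # Relative state frequencies from alignment counts, with silence counts
    # divided by the deweighting factor before normalisation.
    counts = np.asarray(state_counts, dtype=float).copy()
    counts[list(silence_states)] /= silence_deweight
    return np.log(counts / counts.sum())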

4.1. Systems & Evaluation

All the networks are trained on a 3 million utterance (about 1900 hours) dataset consisting of anonymized and hand-transcribed utterances. The dataset is represented with 25 ms frames of 40-dimensional log-filterbank energy features computed every 10 ms. The utterances are aligned with an 8.5 million parameter DNN with 14247 CD states. The weights in all the networks are initialized to the range (-0.02, 0.02) with a uniform distribution. We try to set the learning rate specific to a network architecture and its configuration to the largest value

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 20

Reading

S. Hochreiter and J. Schmidhuber (1997), "Long Short-Term Memory", Neural Computation, 9:1735–1780.
http://www.mitpressjournals.org/doi/abs/10.1162/neco.1997.9.8.1735

F. A. Gers et al. (2000), "Learning to Forget: Continual Prediction with LSTM", Neural Computation, 12:2451–2471.
http://www.mitpressjournals.org/doi/abs/10.1162/089976600300015015

T. Robinson et al. (1996), "The use of recurrent networks in continuous speech recognition", in Automatic Speech and Speaker Recognition: Advanced Topics, Lee et al. (eds), Kluwer, 233–258.
http://www.cstr.ed.ac.uk/downloads/publications/1996/rnn4csr96.pdf

H. Sak et al. (2014), "LSTM recurrent neural network architectures for large scale acoustic modeling", Interspeech 2014.
http://research.google.com/pubs/archive/43905.pdf

ASR Lecture 13 Recurrent Network Acoustic Models 21


Page 15: Recurrent Network Acoustic Models - School of Informatics · store temporal contextual information, learning LSTM models be-come computationally expensive. As an alternative to the

LSTM v1

I(t) H(t-1)

H(t)

1

LinearRecurrence

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v1

I(t) H(t-1)

H(t)

1

OutputGate

InputGate

S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v2

I(t) H(t-1)

H(t)

ForgetGate

FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471

ASR Lecture 13 Recurrent Network Acoustic Models 14

LSTM v3

I(t) H(t-1)

H(t)

C(t)

C(t-1)

PeepholeConnections

ASR Lecture 13 Recurrent Network Acoustic Models 15

LSTM equations

2 LSTM ARCHITECTURES

In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows

W = nc nc 4 + ni nc 4 + nc no + nc 3

where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive

As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units

An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T

it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)

ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)

mt = ot h(ct) (5)yt = Wymmt + by (6)

where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product

inpu

t

g ct1 h

it

ft

ct

ot recu

rren

tpr

ojec

tion

outp

ut

xt

mt

pt

rt

rt1

yt

memory blocks

Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity

of the vectors and g and h are the cell input and cell output activationfunctions generally tanh

With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows

it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)

ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)

mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)

yt = Wyrrt + Wyppt + by (14)(15)

where the r and p denote the recurrent and optional non-recurrentunit activations

21 Implementation

We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization

We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively

ASR Lecture 13 Recurrent Network Acoustic Models 16

Google LSTM experiments (2014ndashdate)

Google voice search data ndash 1900h training set

Input features 40-d log mel filterbank energies

No input context single frame 40-d frame presented eachtimestep

LSTM networks

400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states

Delay output targets by 5 frames

Use RNN to estimated scaled likelihoods

Train using BPTT

ASR Lecture 13 Recurrent Network Acoustic Models 17

LSTM acoustic modelling architectures

inpu

t

g cell h

it

ft

ct

ot

recu

rren

t

outp

ut

xt

mt

rt

rt1

yt

LSTM memory blocks

Figure 1 LSTMP RNN architecture A single memory block isshown for clarity

memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]

An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T

it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)

ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)

mt = ot h(ct) (5)yt = (Wymmt + by) (6)

where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc

are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper

22 Deep LSTM

As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time

input

LSTM

output

(a) LSTM

input

LSTM

LSTM

output

(b) DLSTM

input

LSTM

recurrent

output

(c) LSTMP

input

LSTM

recurrent

LSTM

recurrent

output

(d) DLSTMP

Figure 2 LSTM RNN architectures

instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step

23 LSTMP - LSTM with Recurrent Projection Layer

The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive

As an alternative to the standard architecture, we proposed the Long Short-Term Memory Projected (LSTMP) architecture to address the computational complexity of learning LSTM models [3]. This architecture, shown in Figure 1, has a separate linear projection layer after the LSTM layer. The recurrent connections now connect from this recurrent projection layer to the input of the LSTM layer. The network output units are connected to this recurrent layer. The number of parameters in this model is n_c × n_r × 4 + n_i × n_c × 4 + n_r × n_o + n_c × n_r + n_c × 3,

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 18

LSTM with projection layer (LSTMP)

Figure 1: LSTMP RNN architecture. A single memory block is shown for clarity.

… memory cell. The output gate controls the output flow of cell activations into the rest of the network. Later, the forget gate was added to the memory block [18]. This addressed a weakness of LSTM models preventing them from processing continuous input streams that are not segmented into subsequences. The forget gate scales the internal state of the cell before adding it as input to the cell through the self-recurrent connection of the cell, therefore adaptively forgetting or resetting the cell's memory. In addition, the modern LSTM architecture contains peephole connections from its internal cells to the gates in the same cell to learn precise timing of the outputs [19].

An LSTM network computes a mapping from an input sequence x = (x_1, ..., x_T) to an output sequence y = (y_1, ..., y_T) by calculating the network unit activations using equations (1)-(6) above, iteratively from t = 1 to T.


(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 19

LSTM ASR Results (Google Voice Search task)

Table 1: Experiments with LSTM and LSTMP RNN architectures showing test set WERs and frame accuracies on development and training sets. L indicates the number of layers, for shallow (1L) and deep (2L, 4L, 5L, 7L) networks. C indicates the number of memory cells, P the number of recurrent projection units, and N the total number of parameters.

C      P     Depth   N      Dev (%)   Train (%)   WER (%)
840    -     5L      37M    67.7      70.7        10.9
440    -     5L      13M    67.6      70.1        10.8
600    -     2L      13M    66.4      68.5        11.3
385    -     7L      13M    66.2      68.5        11.2
750    -     1L      13M    63.3      65.5        12.4

6000   800   1L      36M    67.3      74.9        11.8
2048   512   2L      22M    68.8      72.0        10.8
1024   512   3L      20M    69.3      72.5        10.7
1024   512   2L      15M    69.0      74.0        10.7
800    512   2L      13M    69.0      72.7        10.7

2048   512   1L      13M    67.3      71.8        11.3

where n_r is the number of units in the recurrent projection layer. In this case the model size and the learning computational complexity are dominated by the n_r × (4 × n_c + n_o) factor. Hence this allows us to reduce the number of parameters by the ratio n_r / n_c.

By setting n_r < n_c we can increase the model memory (n_c) and still be able to control the number of parameters in the recurrent connections and output layer.
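As a quick check of the two parameter-count formulas, the sketch below (function names are mine) evaluates them for a configuration similar to the 2048-cell, 512-projection rows of Table 1, using 40-dimensional inputs and the 14247 output states described in Section 4: the projection brings a roughly 46M-parameter layer down to roughly 13M.

```python
def lstm_params(n_c, n_i, n_o):
    # N = n_c*n_c*4 + n_i*n_c*4 + n_c*n_o + n_c*3  (biases ignored)
    return n_c * n_c * 4 + n_i * n_c * 4 + n_c * n_o + n_c * 3

def lstmp_params(n_c, n_i, n_o, n_r):
    # n_c*n_r*4 + n_i*n_c*4 + n_r*n_o + n_c*n_r + n_c*3
    return n_c * n_r * 4 + n_i * n_c * 4 + n_r * n_o + n_c * n_r + n_c * 3

print(lstm_params(2048, 40, 14247))        # ~46M without a projection layer
print(lstmp_params(2048, 40, 14247, 512))  # ~13M with a 512-unit projection
```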

With the proposed LSTMP architecture the equations for the activations of network units change slightly: the m_{t-1} activation vector is replaced with r_{t-1}, and the following is added:

r_t = W_rm m_t    (7)
y_t = φ(W_yr r_t + b_y)    (8)

where the r denote the recurrent unit activations.
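A corresponding sketch of one LSTMP time step, in the same illustrative style as the earlier LSTM sketch: the recurrent input to the gates is now the projected vector r_{t-1}, and equations (7)-(8) are appended. Weight names such as W_ir and W_rm are assumptions for the purpose of the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lstmp_step(x_t, r_prev, c_prev, p):
    """One LSTMP time step: as the plain LSTM step, but the recurrent
    input is the projected vector r_{t-1}."""
    i_t = sigmoid(p["W_ix"] @ x_t + p["W_ir"] @ r_prev + p["w_ic"] * c_prev + p["b_i"])
    f_t = sigmoid(p["W_fx"] @ x_t + p["W_fr"] @ r_prev + p["w_fc"] * c_prev + p["b_f"])
    c_t = f_t * c_prev + i_t * np.tanh(p["W_cx"] @ x_t + p["W_cr"] @ r_prev + p["b_c"])
    o_t = sigmoid(p["W_ox"] @ x_t + p["W_or"] @ r_prev + p["w_oc"] * c_t + p["b_o"])
    m_t = o_t * np.tanh(c_t)
    r_t = p["W_rm"] @ m_t                       # (7) linear recurrent projection
    y_t = softmax(p["W_yr"] @ r_t + p["b_y"])   # (8)
    return y_t, r_t, c_t
```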

2.4 Deep LSTMP

Similar to deep LSTM, we propose deep LSTMP where multiple LSTM layers, each with a separate recurrent projection layer, are stacked. LSTMP allows the memory of the model to be increased independently from the output layer and recurrent connections. However, we noticed that increasing the memory size makes the model more prone to overfitting by memorizing the input sequence data. We know that DNNs generalize better to unseen examples with increasing depth. The depth makes the models harder to overfit to the training data since the inputs to the network need to go through many non-linear functions. With this motivation we have experimented with deep LSTMP architectures, where the aim is increasing the memory size and generalization power of the model.

3. Distributed Training: Scaling up to Large Models with Parallelization

We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU. The decision was based on CPU's relatively simpler implementation complexity, ease of debugging and the ability to use clusters made from commodity hardware. For matrix operations we used the Eigen matrix library [21]. This templated C++ library provides efficient implementations for matrix operations on CPU using vectorized instructions. We implemented activation functions and gradient calculations on matrices using SIMD instructions to benefit from parallelization.

We use the truncated backpropagation through time (BPTT) learning algorithm [22] to compute parameter gradients on short subsequences of the training utterances. Activations are forward propagated for a fixed step time Tbptt (e.g. 20). Cross entropy gradients are computed for this subsequence and backpropagated to its start. For computational efficiency each thread operates on subsequences of four utterances at a time, so matrix multiplies can operate in parallel on four frames at a time. We use asynchronous stochastic gradient descent (ASGD) [23] to optimize the network parameters, updating the parameters asynchronously from multiple threads on a multi-core machine. This effectively increases the batch size and reduces the correlation of the frames in a given batch. After a thread has updated the parameters, it continues with the next subsequence in each utterance, preserving the LSTM state, or starts new utterances with reset state when one finishes. Note that the last subsequence of each utterance can be shorter than Tbptt but is padded to the full length, though no gradient is generated for these padding frames.
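A rough sketch of how the data might be laid out for this truncated BPTT scheme: one thread's batch of four utterances is cut into successive Tbptt-frame chunks, with padding frames masked out so they generate no gradient. The function and variable names are mine, the forward/backward passes and the switch to a new utterance when one finishes are omitted.

```python
import numpy as np

def tbptt_chunks(utterances, T_bptt=20):
    """Yield successive T_bptt-frame chunks from a batch of utterances.
    Chunks of the same utterance arrive in order, so LSTM state can be
    carried across them; short final chunks are zero-padded and the mask
    marks real frames. Data-layout sketch only."""
    max_len = max(len(u) for u in utterances)
    dim = utterances[0].shape[1]
    for start in range(0, max_len, T_bptt):
        chunk = np.zeros((len(utterances), T_bptt, dim))
        mask = np.zeros((len(utterances), T_bptt))
        for b, u in enumerate(utterances):
            seg = u[start:start + T_bptt]
            chunk[b, :len(seg)] = seg
            mask[b, :len(seg)] = 1.0
        yield chunk, mask

# e.g. one thread's batch of four utterances of 40-d frames
utts = [np.random.randn(np.random.randint(30, 90), 40) for _ in range(4)]
for chunk, mask in tbptt_chunks(utts):
    pass  # forward for T_bptt steps, backprop to the chunk start, update
```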

This highly parallel single machine ASGD framework described in [3] proved slow for training models of the scale we have used for large scale ASR with DNNs (many millions of parameters). To scale further, we replicate the single-machine workers on many (e.g. 500) separate machines, each with three synchronized computation threads. Each worker communicates with a shared distributed parameter server [23] which stores the LSTM parameters. When a worker has computed the parameter gradient on a minibatch (of 3 × 4 × Tbptt frames), the gradient vector is partitioned and sent to the parameter server shards, which each add the gradients to their parameters and respond with the new parameters. The parameter server shards aggregate parameter updates completely asynchronously. For instance, gradient updates from workers may arrive in different orders at different shards of the parameter server. Despite the asynchrony we observe stable convergence, though the learning rate must be reduced, as would be expected because of the increase in the effective batch size from the greater parallelism.
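The sketch below is only a toy illustration of the parameter-server idea, not Google's implementation: a flat parameter vector is partitioned across shards, each shard folds incoming gradients into its slice in whatever order they arrive, and workers get the updated slices back. All names and the fixed learning rate are assumptions.

```python
import numpy as np

class ParamShard:
    """Toy parameter-server shard: holds one slice of the parameters."""
    def __init__(self, values, lr=0.01):
        self.values = values
        self.lr = lr
    def apply_gradient(self, grad):
        self.values -= self.lr * grad   # add the (scaled) gradient to the slice
        return self.values              # respond with the new parameters

params = np.random.randn(10_000)
shards = [ParamShard(piece.copy()) for piece in np.array_split(params, 4)]

def worker_push(gradient):
    # a worker partitions its gradient vector and sends one piece per shard
    pieces = np.array_split(gradient, 4)
    return [shard.apply_gradient(piece) for shard, piece in zip(shards, pieces)]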

4. Experiments

We evaluate and compare the performance of LSTM RNN architectures on a large vocabulary speech recognition task – the Google Voice Search task. We use a hybrid approach [24] for acoustic modeling with LSTM RNNs, wherein the neural networks estimate hidden Markov model (HMM) state posteriors. We scale the state posteriors by the state priors, estimated as the relative state frequency from the training data, to obtain the acoustic frame likelihoods. We deweight the silence state counts by a factor of 2.7 when estimating the state frequencies.
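A small sketch of this prior-scaling step, under the assumption that "deweighting by a factor of 2.7" means dividing the silence state counts by 2.7 before normalizing; the function names are mine.

```python
import numpy as np

def state_priors(alignment_counts, silence_ids, deweight=2.7):
    """Estimate state priors as relative frequencies from alignment counts,
    deweighting the silence state counts by the given factor."""
    counts = alignment_counts.astype(float).copy()
    counts[silence_ids] /= deweight
    return counts / counts.sum()

def scaled_log_likelihoods(log_posteriors, priors, floor=1e-10):
    """Hybrid decoding: divide state posteriors by priors (subtract in the
    log domain) to obtain acoustic frame likelihoods up to a constant."""
    return log_posteriors - np.log(np.maximum(priors, floor))
```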

4.1 Systems & Evaluation

All the networks are trained on a 3 million utterance (about 1900 hours) dataset consisting of anonymized and hand-transcribed utterances. The dataset is represented with 25ms frames of 40-dimensional log-filterbank energy features computed every 10ms. The utterances are aligned with an 85 million parameter DNN with 14247 CD states. The weights in all the networks are initialized to the range (-0.02, 0.02) with a uniform distribution. We try to set the learning rate specific to a network architecture and its configuration to the largest value

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 20

Reading

S Hochreiter and J Schmidhuber (1997), "Long Short-Term Memory", Neural Computation, 9:1735–1780. http://www.mitpressjournals.org/doi/abs/10.1162/neco.1997.9.8.1735

F A Gers et al (2000), "Learning to Forget: Continual Prediction with LSTM", Neural Computation, 12:2451–2471. http://www.mitpressjournals.org/doi/abs/10.1162/089976600300015015

T Robinson et al (1996), "The use of recurrent networks in continuous speech recognition", in Automatic Speech and Speaker Recognition: Advanced Topics, Lee et al (eds), Kluwer, 233–258. http://www.cstr.ed.ac.uk/downloads/publications/1996/rnn4csr96.pdf

H Sak et al (2014), "LSTM recurrent neural network architectures for large scale acoustic modeling", Interspeech-2014. http://research.google.com/pubs/archive/43905.pdf

ASR Lecture 13 Recurrent Network Acoustic Models 21


LSTM v1

Figure: LSTM v1 memory block, with input gate and output gate; inputs I(t) and H(t-1), output H(t).

S Hochreiter and J Schmidhuber (1997), "Long Short-Term Memory", Neural Computation, 9:1735–1780

ASR Lecture 13 Recurrent Network Acoustic Models 13

LSTM v2

Figure: LSTM v2 memory block, adding a forget gate; inputs I(t) and H(t-1), output H(t).

F A Gers et al (2000), "Learning to Forget: Continual Prediction with LSTM", Neural Computation, 12:2451–2471

ASR Lecture 13 Recurrent Network Acoustic Models 14

LSTM v3

Figure: LSTM v3 memory block, adding peephole connections from the cell state (C(t-1), C(t)) to the gates.

ASR Lecture 13 Recurrent Network Acoustic Models 15

LSTM equations

2. LSTM ARCHITECTURES

In the standard architecture of LSTM networks, there are an input layer, a recurrent LSTM layer and an output layer. The input layer is connected to the LSTM layer. The recurrent connections in the LSTM layer are directly from the cell output units to the cell input units, input gates, output gates and forget gates. The cell output units are connected to the output layer of the network. The total number of parameters W in a standard LSTM network with one cell in each memory block, ignoring the biases, can be calculated as follows:

W = n_c × n_c × 4 + n_i × n_c × 4 + n_c × n_o + n_c × 3

where n_c is the number of memory cells (and number of memory blocks in this case), n_i is the number of input units, and n_o is the number of output units. The computational complexity of learning LSTM models per weight and time step with the stochastic gradient descent (SGD) optimization technique is O(1). Therefore the learning computational complexity per time step is O(W). The learning time for a network with a relatively small number of inputs is dominated by the n_c × (n_c + n_o) factor. For the tasks requiring a large number of output units and a large number of memory cells to store temporal contextual information, learning LSTM models become computationally expensive.

As an alternative to the standard architecture, we propose two novel architectures to address the computational complexity of learning LSTM models. The two architectures are shown in the same Figure 1. In one of them, we connect the cell output units to a recurrent projection layer which connects to the cell input units and gates for recurrency, in addition to network output units for the prediction of the outputs. Hence the number of parameters in this model is n_c × n_r × 4 + n_i × n_c × 4 + n_r × n_o + n_c × n_r + n_c × 3, where n_r is the number of units in the recurrent projection layer. In the other one, in addition to the recurrent projection layer, we add another non-recurrent projection layer which is directly connected to the output layer. This model has n_c × n_r × 4 + n_i × n_c × 4 + (n_r + n_p) × n_o + n_c × (n_r + n_p) + n_c × 3 parameters, where n_p is the number of units in the non-recurrent projection layer, and it allows us to increase the number of units in the projection layers without increasing the number of parameters in the recurrent connections (n_c × n_r × 4). Note that having two projection layers with regard to output units is effectively equivalent to having a single projection layer with n_r + n_p units.

An LSTM network computes a mapping from an input sequence x = (x_1, ..., x_T) to an output sequence y = (y_1, ..., y_T) by calculating the network unit activations using the following equations iteratively from t = 1 to T:

i_t = σ(W_ix x_t + W_im m_{t-1} + W_ic c_{t-1} + b_i)    (1)
f_t = σ(W_fx x_t + W_mf m_{t-1} + W_cf c_{t-1} + b_f)    (2)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g(W_cx x_t + W_cm m_{t-1} + b_c)    (3)
o_t = σ(W_ox x_t + W_om m_{t-1} + W_oc c_t + b_o)    (4)
m_t = o_t ⊙ h(c_t)    (5)
y_t = W_ym m_t + b_y    (6)

where the W terms denote weight matrices (e.g. W_ix is the matrix of weights from the input gate to the input), the b terms denote bias vectors (b_i is the input gate bias vector), σ is the logistic sigmoid function, and i, f, o and c are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the cell output activation vector m. ⊙ is the element-wise product of the vectors.

Fig. 1: LSTM based RNN architectures with a recurrent projection layer and an optional non-recurrent projection layer. A single memory block is shown for clarity.

g and h are the cell input and cell output activation functions, generally tanh.

With the proposed LSTM architecture with both recurrent and non-recurrent projection layers, the equations are as follows:

i_t = σ(W_ix x_t + W_ir r_{t-1} + W_ic c_{t-1} + b_i)    (7)
f_t = σ(W_fx x_t + W_rf r_{t-1} + W_cf c_{t-1} + b_f)    (8)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g(W_cx x_t + W_cr r_{t-1} + b_c)    (9)
o_t = σ(W_ox x_t + W_or r_{t-1} + W_oc c_t + b_o)    (10)
m_t = o_t ⊙ h(c_t)    (11)
r_t = W_rm m_t    (12)
p_t = W_pm m_t    (13)
y_t = W_yr r_t + W_yp p_t + b_y    (14)

where the r and p denote the recurrent and optional non-recurrent unit activations.
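For the variant with both projections, the output side (equations (12)-(14)) can be sketched as below; the argument names mirror the W matrices in the equations, and the output activation that would normally follow is not shown.

```python
import numpy as np

def dual_projection_output(m_t, W_rm, W_pm, W_yr, W_yp, b_y):
    """Output side of the dual-projection architecture: r_t is fed back to
    the gates at the next time step, p_t only feeds the output layer."""
    r_t = W_rm @ m_t                      # (12) recurrent projection
    p_t = W_pm @ m_t                      # (13) non-recurrent projection
    y_t = W_yr @ r_t + W_yp @ p_t + b_y   # (14)
    return y_t, r_t
```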

2.1 Implementation

We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU. The decision was based on CPU's relatively simpler implementation complexity and ease of debugging. CPU implementation also allows easier distributed implementation on a large cluster of machines if the learning time of large networks becomes a major bottleneck on a single machine [14]. For matrix operations we use the Eigen matrix library [15]. This templated C++ library provides efficient implementations for matrix operations on CPU using vectorized instructions (SIMD – single instruction multiple data). We implemented activation functions and gradient calculations on matrices using SIMD instructions to benefit from parallelization.

We use the asynchronous stochastic gradient descent (ASGD) optimization technique. The update of the parameters with the gradients is done asynchronously from multiple threads on a multi-core machine. Each thread operates on a batch of sequences in parallel for computational efficiency – for instance we can do matrix-matrix multiplications rather than vector-matrix multiplications – and for more stochasticity, since model parameters can be updated from multiple input sequences at the same time. In addition to batching of sequences in a single thread, training with multiple threads effectively

ASR Lecture 13 Recurrent Network Acoustic Models 16

Google LSTM experiments (2014–date)

Google voice search data – 1900h training set

Input features 40-d log mel filterbank energies

No input context: a single 40-d frame presented each timestep

LSTM networks

400–6000 LSTM cells per layer
1–7 layers
13M–37M trainable parameters
14727 context-dependent states

Delay output targets by 5 frames

Use RNN to estimate scaled likelihoods

Train using BPTT

ASR Lecture 13 Recurrent Network Acoustic Models 17

LSTM acoustic modelling architectures

inpu

t

g cell h

it

ft

ct

ot

recu

rren

t

outp

ut

xt

mt

rt

rt1

yt

LSTM memory blocks

Figure 1 LSTMP RNN architecture A single memory block isshown for clarity

memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]

An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T

it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)

ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)

mt = ot h(ct) (5)yt = (Wymmt + by) (6)

where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc

are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper

22 Deep LSTM

As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time

input

LSTM

output

(a) LSTM

input

LSTM

LSTM

output

(b) DLSTM

input

LSTM

recurrent

output

(c) LSTMP

input

LSTM

recurrent

LSTM

recurrent

output

(d) DLSTMP

Figure 2 LSTM RNN architectures

instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step

23 LSTMP - LSTM with Recurrent Projection Layer

The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive

As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 18

LSTM with projection layer (LSTMP)

inpu

t

g cell h

it

ft

ct

ot

recu

rren

t

outp

ut

xt

mt

rt

rt1

yt

LSTM memory blocks

Figure 1 LSTMP RNN architecture A single memory block isshown for clarity

memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]

An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T

it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)

ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)

mt = ot h(ct) (5)yt = (Wymmt + by) (6)

where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc

are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper

22 Deep LSTM

As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time

input

LSTM

output

(a) LSTM

input

LSTM

LSTM

output

(b) DLSTM

input

LSTM

recurrent

output

(c) LSTMP

input

LSTM

recurrent

LSTM

recurrent

output

(d) DLSTMP

Figure 2 LSTM RNN architectures

instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step

23 LSTMP - LSTM with Recurrent Projection Layer

The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive

As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 19

LSTM ASR Results (Google Voice Search task)

Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters

C P Depth N Dev Train WER() () ()

840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124

6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107

2048 512 1L 13M 673 718 113

where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc

By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer

With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added

rt = Wrmmt (7)yt = (Wyrrt + by) (8)

where the r denote the recurrent unit activations

24 Deep LSTMP

Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model

3 Distributed Training Scaling up toLarge Models with Parallelization

We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized

instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization

We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes

This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism

4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies

41 Systems amp Evaluation

All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value

(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20

Reading

S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798

1735VtN93Mehb5k

FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162

089976600300015015VtN9ncehb5k

T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http

wwwcstredacukdownloadspublications1996rnn4csr96pdf

H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf

ASR Lecture 13 Recurrent Network Acoustic Models 21

Page 17: Recurrent Network Acoustic Models - School of Informatics · store temporal contextual information, learning LSTM models be-come computationally expensive. As an alternative to the

LSTM v2

I(t) H(t-1)

H(t)

ForgetGate

FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471

ASR Lecture 13 Recurrent Network Acoustic Models 14

LSTM v3

I(t) H(t-1)

H(t)

C(t)

C(t-1)

PeepholeConnections

ASR Lecture 13 Recurrent Network Acoustic Models 15

LSTM equations

2 LSTM ARCHITECTURES

In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows

W = nc nc 4 + ni nc 4 + nc no + nc 3

where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive

As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units

An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T

it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)

ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)

mt = ot h(ct) (5)yt = Wymmt + by (6)

where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product

inpu

t

g ct1 h

it

ft

ct

ot recu

rren

tpr

ojec

tion

outp

ut

xt

mt

pt

rt

rt1

yt

memory blocks

Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity

of the vectors and g and h are the cell input and cell output activationfunctions generally tanh

With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows

it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)

ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)

mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)

yt = Wyrrt + Wyppt + by (14)(15)

where the r and p denote the recurrent and optional non-recurrentunit activations

21 Implementation

We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization

We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively

ASR Lecture 13 Recurrent Network Acoustic Models 16

Google LSTM experiments (2014ndashdate)

Google voice search data ndash 1900h training set

Input features 40-d log mel filterbank energies

No input context single frame 40-d frame presented eachtimestep

LSTM networks

400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states

Delay output targets by 5 frames

Use RNN to estimated scaled likelihoods

Train using BPTT

ASR Lecture 13 Recurrent Network Acoustic Models 17

LSTM acoustic modelling architectures

inpu

t

g cell h

it

ft

ct

ot

recu

rren

t

outp

ut

xt

mt

rt

rt1

yt

LSTM memory blocks

Figure 1 LSTMP RNN architecture A single memory block isshown for clarity

memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]

An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T

it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)

ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)

mt = ot h(ct) (5)yt = (Wymmt + by) (6)

where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc

are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper

22 Deep LSTM

As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time

input

LSTM

output

(a) LSTM

input

LSTM

LSTM

output

(b) DLSTM

input

LSTM

recurrent

output

(c) LSTMP

input

LSTM

recurrent

LSTM

recurrent

output

(d) DLSTMP

Figure 2 LSTM RNN architectures

instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step

23 LSTMP - LSTM with Recurrent Projection Layer

The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive

As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 18

LSTM with projection layer (LSTMP)

inpu

t

g cell h

it

ft

ct

ot

recu

rren

t

outp

ut

xt

mt

rt

rt1

yt

LSTM memory blocks

Figure 1 LSTMP RNN architecture A single memory block isshown for clarity

memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]

An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T

it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)

ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)

mt = ot h(ct) (5)yt = (Wymmt + by) (6)

where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc

are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper

22 Deep LSTM

As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time

input

LSTM

output

(a) LSTM

input

LSTM

LSTM

output

(b) DLSTM

input

LSTM

recurrent

output

(c) LSTMP

input

LSTM

recurrent

LSTM

recurrent

output

(d) DLSTMP

Figure 2 LSTM RNN architectures

instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step

23 LSTMP - LSTM with Recurrent Projection Layer

The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive

As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 19

LSTM ASR Results (Google Voice Search task)

Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters

C P Depth N Dev Train WER() () ()

840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124

6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107

2048 512 1L 13M 673 718 113

where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc

By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer

With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added

rt = Wrmmt (7)yt = (Wyrrt + by) (8)

where the r denote the recurrent unit activations

24 Deep LSTMP

Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model

3 Distributed Training Scaling up toLarge Models with Parallelization

We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized

instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization

We use the truncated backpropagation through time (BPTT) learning algorithm [22] to compute parameter gradients on short subsequences of the training utterances. Activations are forward propagated for a fixed step time T_bptt (e.g. 20). Cross entropy gradients are computed for this subsequence and backpropagated to its start. For computational efficiency, each thread operates on subsequences of four utterances at a time, so matrix multiplies can operate in parallel on four frames at a time. We use asynchronous stochastic gradient descent (ASGD) [23] to optimize the network parameters, updating the parameters asynchronously from multiple threads on a multi-core machine. This effectively increases the batch size and reduces the correlation of the frames in a given batch. After a thread has updated the parameters, it continues with the next subsequence in each utterance, preserving the LSTM state, or starts new utterances with reset state when one finishes. Note that the last subsequence of each utterance can be shorter than T_bptt but is padded to the full length, though no gradient is generated for these padding frames.
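The subsequence handling can be sketched as follows (helper name and zero-padding are my choices): each utterance is cut into T_bptt-frame chunks, the last chunk is padded, and a mask marks padded frames so they generate no gradient. In training, a thread would step through four such streams in lock-step, carrying LSTM state across chunks of the same utterance and resetting it when a new utterance begins.

    import numpy as np

    T_BPTT = 20   # forward-propagation span used in the description above

    def bptt_chunks(frames, t_bptt=T_BPTT):
        """Yield (chunk, mask) pairs of shape (t_bptt, dim) / (t_bptt,) for one utterance."""
        n, d = frames.shape
        for start in range(0, n, t_bptt):
            chunk = frames[start:start + t_bptt]
            mask = np.ones(len(chunk))
            if len(chunk) < t_bptt:                          # last, shorter subsequence
                pad = t_bptt - len(chunk)
                chunk = np.vstack([chunk, np.zeros((pad, d))])
                mask = np.concatenate([mask, np.zeros(pad)])  # no gradient for padding
            yield chunk, mask

    # Example: a 45-frame utterance of 40-d features gives chunks of 20, 20 and 5 (+15 padded) frames.
    for chunk, mask in bptt_chunks(np.random.randn(45, 40)):
        print(chunk.shape, int(mask.sum()))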

This highly parallel single machine ASGD framework described in [3] proved slow for training models of the scale we have used for large scale ASR with DNNs (many millions of parameters). To scale further, we replicate the single-machine workers on many (e.g. 500) separate machines, each with three synchronized computation threads. Each worker communicates with a shared, distributed parameter server [23] which stores the LSTM parameters. When a worker has computed the parameter gradient on a minibatch (of 3 × 4 × T_bptt frames), the gradient vector is partitioned and sent to the parameter server shards, which each add the gradients to their parameters and respond with the new parameters. The parameter server shards aggregate parameter updates completely asynchronously. For instance, gradient updates from workers may arrive in different orders at different shards of the parameter server. Despite the asynchrony, we observe stable convergence, though the learning rate must be reduced, as would be expected because of the increase in the effective batch size from the greater parallelism.
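A toy sketch of the shard-side update described here (class name, plain SGD step and locking are assumptions, not the actual implementation): each shard owns one partition of the parameters, applies gradient slices in whatever order they arrive, and replies with its current values.

    import numpy as np
    from threading import Lock

    class ParameterShard:
        """One shard of a distributed parameter server (illustrative only)."""
        def __init__(self, values, lr=0.01):
            self.values = np.asarray(values, dtype=np.float64).copy()
            self.lr = lr
            self._lock = Lock()

        def apply_gradient(self, grad):
            # Called asynchronously by many workers; updates may interleave freely.
            with self._lock:
                self.values -= self.lr * grad
                return self.values.copy()   # worker continues with fresh parameters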

4. Experiments

We evaluate and compare the performance of LSTM RNN architectures on a large vocabulary speech recognition task – the Google Voice Search task. We use a hybrid approach [24] for acoustic modeling with LSTM RNNs, wherein the neural networks estimate hidden Markov model (HMM) state posteriors. We scale the state posteriors by the state priors, estimated as the relative state frequency from the training data, to obtain the acoustic frame likelihoods. We deweight the silence state counts by a factor of 2.7 when estimating the state frequencies.
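The posterior-to-likelihood conversion follows directly from this description; below is a sketch under the stated recipe (priors as relative state frequencies, silence counts divided by 2.7), with the set of silence state indices as a hypothetical input.

    import numpy as np

    def scaled_log_likelihoods(log_posteriors, state_counts, silence_states,
                               silence_deweight=2.7):
        """log p(x|s) up to a constant: log-posteriors minus log-priors,
        with silence state counts deweighted before estimating the priors."""
        counts = np.asarray(state_counts, dtype=np.float64).copy()
        counts[list(silence_states)] /= silence_deweight
        log_priors = np.log(counts / counts.sum())
        return log_posteriors - log_priors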

4.1. Systems & Evaluation

All the networks are trained on a 3 million utterance (about 1900 hours) dataset consisting of anonymized and hand-transcribed utterances. The dataset is represented with 25 ms frames of 40-dimensional log-filterbank energy features computed every 10 ms. The utterances are aligned with an 85 million parameter DNN with 14247 CD states. The weights in all the networks are initialized to the range (−0.02, 0.02) with a uniform distribution. We try to set the learning rate, specific to a network architecture and its configuration, to the largest value
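The uniform initialisation is a one-liner; the shapes below are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    def init_uniform(shape, half_range=0.02):
        # All weights drawn uniformly from (-0.02, 0.02), as described above.
        return rng.uniform(-half_range, half_range, size=shape)

    W_ix = init_uniform((2048, 40))   # e.g. input weights for a 2048-cell layer on 40-d features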

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 20

Reading

S Hochreiter and J Schmidhuber (1997), "Long Short-Term Memory", Neural Computation, 9:1735–1780.
http://www.mitpressjournals.org/doi/abs/10.1162/neco.1997.9.8.1735

F A Gers et al (2000), "Learning to Forget: Continual Prediction with LSTM", Neural Computation, 12:2451–2471.
http://www.mitpressjournals.org/doi/abs/10.1162/089976600300015015

T Robinson et al (1996), "The use of recurrent networks in continuous speech recognition", in Automatic Speech and Speaker Recognition: Advanced Topics, Lee et al (eds), Kluwer, 233–258.
http://www.cstr.ed.ac.uk/downloads/publications/1996/rnn4csr96.pdf

H Sak et al (2014), "LSTM recurrent neural network architectures for large scale acoustic modeling", Interspeech-2014.
http://research.google.com/pubs/archive/43905.pdf

ASR Lecture 13 Recurrent Network Acoustic Models 21


LSTM v3

[Figure: LSTM cell with peephole connections; inputs I(t) and H(t−1), cell state C(t−1) to C(t), output H(t)]

ASR Lecture 13 Recurrent Network Acoustic Models 15

LSTM equations

2. LSTM ARCHITECTURES

In the standard architecture of LSTM networks, there are an input layer, a recurrent LSTM layer and an output layer. The input layer is connected to the LSTM layer. The recurrent connections in the LSTM layer are directly from the cell output units to the cell input units, input gates, output gates and forget gates. The cell output units are connected to the output layer of the network. The total number of parameters W in a standard LSTM network with one cell in each memory block, ignoring the biases, can be calculated as follows:

W = n_c × n_c × 4 + n_i × n_c × 4 + n_c × n_o + n_c × 3

where n_c is the number of memory cells (and number of memory blocks in this case), n_i is the number of input units, and n_o is the number of output units. The computational complexity of learning LSTM models per weight and time step with the stochastic gradient descent (SGD) optimization technique is O(1). Therefore the learning computational complexity per time step is O(W). The learning time for a network with a relatively small number of inputs is dominated by the n_c × (n_c + n_o) factor. For tasks requiring a large number of output units and a large number of memory cells to store temporal contextual information, learning LSTM models becomes computationally expensive.

As an alternative to the standard architecture, we propose two novel architectures to address the computational complexity of learning LSTM models. The two architectures are shown in the same Figure 1. In one of them, we connect the cell output units to a recurrent projection layer which connects to the cell input units and gates for recurrency, in addition to network output units for the prediction of the outputs. Hence, the number of parameters in this model is n_c × n_r × 4 + n_i × n_c × 4 + n_r × n_o + n_c × n_r + n_c × 3, where n_r is the number of units in the recurrent projection layer. In the other one, in addition to the recurrent projection layer, we add another non-recurrent projection layer which is directly connected to the output layer. This model has n_c × n_r × 4 + n_i × n_c × 4 + (n_r + n_p) × n_o + n_c × (n_r + n_p) + n_c × 3 parameters, where n_p is the number of units in the non-recurrent projection layer, and it allows us to increase the number of units in the projection layers without increasing the number of parameters in the recurrent connections (n_c × n_r × 4). Note that having two projection layers with regard to output units is effectively equivalent to having a single projection layer with n_r + n_p units.

An LSTM network computes a mapping from an input sequence x = (x_1, ..., x_T) to an output sequence y = (y_1, ..., y_T) by calculating the network unit activations using the following equations, iteratively from t = 1 to T:

i_t = σ(W_{ix} x_t + W_{im} m_{t−1} + W_{ic} c_{t−1} + b_i)    (1)
f_t = σ(W_{fx} x_t + W_{mf} m_{t−1} + W_{cf} c_{t−1} + b_f)    (2)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g(W_{cx} x_t + W_{cm} m_{t−1} + b_c)    (3)
o_t = σ(W_{ox} x_t + W_{om} m_{t−1} + W_{oc} c_t + b_o)    (4)
m_t = o_t ⊙ h(c_t)    (5)
y_t = W_{ym} m_t + b_y    (6)

where the W terms denote weight matrices (e.g. W_{ix} is the matrix of weights from the input gate to the input), the b terms denote bias vectors (b_i is the input gate bias vector), σ is the logistic sigmoid function, and i, f, o and c are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the cell output activation vector m. ⊙ is the element-wise product of the vectors, and g and h are the cell input and cell output activation functions, generally tanh.

[Fig. 1: LSTM based RNN architectures with a recurrent projection layer and an optional non-recurrent projection layer. A single memory block is shown for clarity.]
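As a concrete reading of equations (1)-(6), a minimal numpy sketch of one time step; the weight names mirror the equations, the peephole terms are applied element-wise, and g and h are taken to be tanh as stated above. Names and shapes are illustrative.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, m_prev, c_prev, W, b):
        """One standard LSTM step following eqs. (1)-(6); W is a dict of matrices
        (peephole entries 'ic', 'cf', 'oc' are vectors), b a dict of bias vectors."""
        i = sigmoid(W['ix'] @ x + W['im'] @ m_prev + W['ic'] * c_prev + b['i'])   # (1)
        f = sigmoid(W['fx'] @ x + W['mf'] @ m_prev + W['cf'] * c_prev + b['f'])   # (2)
        c = f * c_prev + i * np.tanh(W['cx'] @ x + W['cm'] @ m_prev + b['c'])     # (3)
        o = sigmoid(W['ox'] @ x + W['om'] @ m_prev + W['oc'] * c + b['o'])        # (4)
        m = o * np.tanh(c)                                                        # (5)
        y = W['ym'] @ m + b['y']                                                  # (6)
        return m, c, y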

With the proposed LSTM architecture with both recurrent and non-recurrent projection layers, the equations are as follows:

i_t = σ(W_{ix} x_t + W_{ir} r_{t−1} + W_{ic} c_{t−1} + b_i)    (7)
f_t = σ(W_{fx} x_t + W_{rf} r_{t−1} + W_{cf} c_{t−1} + b_f)    (8)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g(W_{cx} x_t + W_{cr} r_{t−1} + b_c)    (9)
o_t = σ(W_{ox} x_t + W_{or} r_{t−1} + W_{oc} c_t + b_o)    (10)
m_t = o_t ⊙ h(c_t)    (11)
r_t = W_{rm} m_t    (12)
p_t = W_{pm} m_t    (13)
y_t = W_{yr} r_t + W_{yp} p_t + b_y    (14)

where r and p denote the recurrent and optional non-recurrent unit activations.
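For the variant with both projection layers, only the tail of the computation changes; a sketch of equations (12)-(14), assuming m_t has already been computed (with r_{t−1} in place of m_{t−1} in the gates, as in the lstm_step sketch above) and that the inputs are numpy arrays:

    def project_and_output(m, W, b):
        """Recurrent and non-recurrent projections, eqs. (12)-(14)."""
        r = W['rm'] @ m                           # fed back to the gates at t+1   (12)
        p = W['pm'] @ m                           # feeds only the output layer    (13)
        y = W['yr'] @ r + W['yp'] @ p + b['y']    # network output                 (14)
        return r, p, y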

2.1. Implementation

We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU. The decision was based on the CPU's relatively simpler implementation complexity and ease of debugging. A CPU implementation also allows easier distributed implementation on a large cluster of machines if the learning time of large networks becomes a major bottleneck on a single machine [14]. For matrix operations, we use the Eigen matrix library [15]. This templated C++ library provides efficient implementations for matrix operations on CPU using vectorized instructions (SIMD – single instruction, multiple data). We implemented activation functions and gradient calculations on matrices using SIMD instructions to benefit from parallelization.

We use the asynchronous stochastic gradient descent (ASGD) optimization technique. The update of the parameters with the gradients is done asynchronously from multiple threads on a multi-core machine. Each thread operates on a batch of sequences in parallel, for computational efficiency – for instance, we can do matrix-matrix multiplications rather than vector-matrix multiplications – and for more stochasticity, since model parameters can be updated from multiple input sequences at the same time. In addition to batching of sequences in a single thread, training with multiple threads effectively

ASR Lecture 13 Recurrent Network Acoustic Models 16

Google LSTM experiments (2014–date)

Google voice search data – 1900h training set

Input features: 40-d log mel filterbank energies

No input context: a single 40-d frame presented each timestep

LSTM networks

400–6000 LSTM cells per layer
1–7 layers
13M–37M trainable parameters
14,727 context-dependent states

Delay output targets by 5 frames (see the sketch after this list)

Use RNN to estimate scaled likelihoods

Train using BPTT
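A sketch of the 5-frame target delay (repeating the first label to fill the initial frames is my assumption; the slide does not specify the padding):

    import numpy as np

    def delay_targets(targets, delay=5):
        """Shift CD-state targets 5 frames later, so the frame at time t is
        labelled with the state originally aligned to time t - 5."""
        targets = np.asarray(targets)
        return np.concatenate([np.repeat(targets[:1], delay), targets[:-delay]])

    print(delay_targets(np.arange(10)))   # [0 0 0 0 0 0 1 2 3 4]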

ASR Lecture 13 Recurrent Network Acoustic Models 17

LSTM acoustic modelling architectures

[Figure 1: LSTMP RNN architecture. A single memory block is shown for clarity.]

memory cell. The output gate controls the output flow of cell activations into the rest of the network. Later, the forget gate was added to the memory block [18]. This addressed a weakness of LSTM models, preventing them from processing continuous input streams that are not segmented into subsequences. The forget gate scales the internal state of the cell before adding it as input to the cell through the self-recurrent connection of the cell, therefore adaptively forgetting or resetting the cell's memory. In addition, the modern LSTM architecture contains peephole connections from its internal cells to the gates in the same cell to learn precise timing of the outputs [19].

An LSTM network computes a mapping from an input sequence x = (x_1, ..., x_T) to an output sequence y = (y_1, ..., y_T) by calculating the network unit activations using the following equations, iteratively from t = 1 to T:

i_t = σ(W_{ix} x_t + W_{im} m_{t−1} + W_{ic} c_{t−1} + b_i)    (1)
f_t = σ(W_{fx} x_t + W_{fm} m_{t−1} + W_{fc} c_{t−1} + b_f)    (2)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g(W_{cx} x_t + W_{cm} m_{t−1} + b_c)    (3)
o_t = σ(W_{ox} x_t + W_{om} m_{t−1} + W_{oc} c_t + b_o)    (4)
m_t = o_t ⊙ h(c_t)    (5)
y_t = φ(W_{ym} m_t + b_y)    (6)

where the W terms denote weight matrices (e.g. W_{ix} is the matrix of weights from the input gate to the input), W_{ic}, W_{fc}, W_{oc} are diagonal weight matrices for peephole connections, the b terms denote bias vectors (b_i is the input gate bias vector), σ is the logistic sigmoid function, and i, f, o and c are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the cell output activation vector m. ⊙ is the element-wise product of the vectors, g and h are the cell input and cell output activation functions, generally and in this paper tanh, and φ is the network output activation function, softmax in this paper.

2.2. Deep LSTM

As with DNNs with deeper architectures, deep LSTM RNNs have been successfully used for speech recognition [11, 17, 2]. Deep LSTM RNNs are built by stacking multiple LSTM layers. Note that LSTM RNNs are already deep architectures in the sense that they can be considered as a feed-forward neural network unrolled in time, where each layer shares the same model parameters. One can see that the inputs to the model go through multiple non-linear layers, as in DNNs; however, the features from a given time instant are only processed by a single nonlinear layer before contributing to the output for that time

[Figure 2: LSTM RNN architectures: (a) LSTM, (b) DLSTM, (c) LSTMP, (d) DLSTMP]

instant. Therefore the depth in deep LSTM RNNs has an additional meaning. The input to the network at a given time step goes through multiple LSTM layers, in addition to propagation through time and LSTM layers. It has been argued that deep layers in RNNs allow the network to learn at different time scales over the input [20]. Deep LSTM RNNs offer another benefit over standard LSTM RNNs: they can make better use of parameters by distributing them over the space through multiple layers. For instance, rather than increasing the memory size of a standard model by a factor of 2, one can have 4 layers with approximately the same number of parameters. This results in inputs going through more non-linear operations per time step.
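A quick numeric illustration of this point, consistent with the 13M rows of Table 1 above; it assumes the first layer sees the 40-d input, each higher layer sees the n_c-dimensional output of the layer below, a single 14247-way output layer sits on top, and biases are ignored.

    def deep_lstm_params(nc, n_layers, ni=40, no=14247):
        total = 4 * nc * nc + 4 * ni * nc + 3 * nc          # first LSTM layer
        total += (n_layers - 1) * (8 * nc * nc + 3 * nc)    # higher layers: nc-d input
        total += nc * no                                    # output layer
        return total

    print(deep_lstm_params(750, 1))   # about 13.1M: one wide layer    (750C, 1L row)
    print(deep_lstm_params(440, 5))   # about 13.3M: five narrow layers (440C, 5L row)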

2.3. LSTMP - LSTM with Recurrent Projection Layer


(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 18

LSTM with projection layer (LSTMP)

[Figure 1: LSTMP RNN architecture. A single memory block is shown for clarity.]


(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 19

LSTM ASR Results (Google Voice Search task)

Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters

C P Depth N Dev Train WER() () ()

840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124

6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107

2048 512 1L 13M 673 718 113

where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc

By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer

With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added

rt = Wrmmt (7)yt = (Wyrrt + by) (8)

where the r denote the recurrent unit activations

24 Deep LSTMP

Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model

3 Distributed Training Scaling up toLarge Models with Parallelization

We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized

instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization

We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes

This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism

4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies

41 Systems amp Evaluation

All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value

(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20

Reading

S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798

1735VtN93Mehb5k

FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162

089976600300015015VtN9ncehb5k

T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http

wwwcstredacukdownloadspublications1996rnn4csr96pdf

H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf

ASR Lecture 13 Recurrent Network Acoustic Models 21

Page 19: Recurrent Network Acoustic Models - School of Informatics · store temporal contextual information, learning LSTM models be-come computationally expensive. As an alternative to the

LSTM equations

2 LSTM ARCHITECTURES

In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows

W = nc nc 4 + ni nc 4 + nc no + nc 3

where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive

As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units

An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T

it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)

ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)

mt = ot h(ct) (5)yt = Wymmt + by (6)

where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product

inpu

t

g ct1 h

it

ft

ct

ot recu

rren

tpr

ojec

tion

outp

ut

xt

mt

pt

rt

rt1

yt

memory blocks

Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity

of the vectors and g and h are the cell input and cell output activationfunctions generally tanh

With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows

it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)

ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)

mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)

yt = Wyrrt + Wyppt + by (14)(15)

where the r and p denote the recurrent and optional non-recurrentunit activations

21 Implementation

We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization

We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively

ASR Lecture 13 Recurrent Network Acoustic Models 16

Google LSTM experiments (2014ndashdate)

Google voice search data ndash 1900h training set

Input features 40-d log mel filterbank energies

No input context single frame 40-d frame presented eachtimestep

LSTM networks

400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states

Delay output targets by 5 frames

Use RNN to estimated scaled likelihoods

Train using BPTT

ASR Lecture 13 Recurrent Network Acoustic Models 17

LSTM acoustic modelling architectures

inpu

t

g cell h

it

ft

ct

ot

recu

rren

t

outp

ut

xt

mt

rt

rt1

yt

LSTM memory blocks

Figure 1 LSTMP RNN architecture A single memory block isshown for clarity

memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]

An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T

it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)

ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)

mt = ot h(ct) (5)yt = (Wymmt + by) (6)

where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc

are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper

22 Deep LSTM

As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time

input

LSTM

output

(a) LSTM

input

LSTM

LSTM

output

(b) DLSTM

input

LSTM

recurrent

output

(c) LSTMP

input

LSTM

recurrent

LSTM

recurrent

output

(d) DLSTMP

Figure 2 LSTM RNN architectures

instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step

23 LSTMP - LSTM with Recurrent Projection Layer

The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive

As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 18

LSTM with projection layer (LSTMP)

inpu

t

g cell h

it

ft

ct

ot

recu

rren

t

outp

ut

xt

mt

rt

rt1

yt

LSTM memory blocks

Figure 1 LSTMP RNN architecture A single memory block isshown for clarity

memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]

An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T

it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)

ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)

mt = ot h(ct) (5)yt = (Wymmt + by) (6)

where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc

are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper

22 Deep LSTM

As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time

input

LSTM

output

(a) LSTM

input

LSTM

LSTM

output

(b) DLSTM

input

LSTM

recurrent

output

(c) LSTMP

input

LSTM

recurrent

LSTM

recurrent

output

(d) DLSTMP

Figure 2 LSTM RNN architectures

instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step

23 LSTMP - LSTM with Recurrent Projection Layer

The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive

As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 19

LSTM ASR Results (Google Voice Search task)

Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters

C P Depth N Dev Train WER() () ()

840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124

6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107

2048 512 1L 13M 673 718 113

where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc

By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer

With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added

rt = Wrmmt (7)yt = (Wyrrt + by) (8)

where the r denote the recurrent unit activations

24 Deep LSTMP

Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model

3 Distributed Training Scaling up toLarge Models with Parallelization

We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized

instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization

We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes

This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism

4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies

41 Systems amp Evaluation

All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value

(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20

Reading

S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798

1735VtN93Mehb5k

FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162

089976600300015015VtN9ncehb5k

T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http

wwwcstredacukdownloadspublications1996rnn4csr96pdf

H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf

ASR Lecture 13 Recurrent Network Acoustic Models 21

Page 20: Recurrent Network Acoustic Models - School of Informatics · store temporal contextual information, learning LSTM models be-come computationally expensive. As an alternative to the

Google LSTM experiments (2014ndashdate)

Google voice search data ndash 1900h training set

Input features 40-d log mel filterbank energies

No input context single frame 40-d frame presented eachtimestep

LSTM networks

400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states

Delay output targets by 5 frames

Use RNN to estimated scaled likelihoods

Train using BPTT

ASR Lecture 13 Recurrent Network Acoustic Models 17

LSTM acoustic modelling architectures

inpu

t

g cell h

it

ft

ct

ot

recu

rren

t

outp

ut

xt

mt

rt

rt1

yt

LSTM memory blocks

Figure 1 LSTMP RNN architecture A single memory block isshown for clarity

memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]

An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T

it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)

ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)

mt = ot h(ct) (5)yt = (Wymmt + by) (6)

where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc

are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper

22 Deep LSTM

As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time

input

LSTM

output

(a) LSTM

input

LSTM

LSTM

output

(b) DLSTM

input

LSTM

recurrent

output

(c) LSTMP

input

LSTM

recurrent

LSTM

recurrent

output

(d) DLSTMP

Figure 2 LSTM RNN architectures

instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step

23 LSTMP - LSTM with Recurrent Projection Layer

The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive

As an alternative to the standard architecture, we proposed the Long Short-Term Memory Projected (LSTMP) architecture to address the computational complexity of learning LSTM models [3]. This architecture, shown in Figure 1, has a separate linear projection layer after the LSTM layer. The recurrent connections now connect from this recurrent projection layer to the input of the LSTM layer. The network output units are connected to this recurrent layer. The number of parameters in this model is N = 4 n_c n_r + 4 n_i n_c + n_r n_o + n_c n_r + 3 n_c,
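
The two parameter counts can be checked with a few lines of Python; the function names and the example sizes below are hypothetical, chosen only to show the effect of the projection layer.

def n_params_lstm(n_c, n_i, n_o):
    # Standard LSTM (one cell per block, biases ignored): 4*nc^2 + 4*ni*nc + nc*no + 3*nc
    return 4 * n_c * n_c + 4 * n_i * n_c + n_c * n_o + 3 * n_c

def n_params_lstmp(n_c, n_r, n_i, n_o):
    # LSTMP (biases ignored): 4*nc*nr + 4*ni*nc + nr*no + nc*nr + 3*nc
    return 4 * n_c * n_r + 4 * n_i * n_c + n_r * n_o + n_c * n_r + 3 * n_c

print(n_params_lstm(2048, 40, 14000))        # about 45.8M parameters
print(n_params_lstmp(2048, 512, 40, 14000))  # about 12.7M, roughly n_r/n_c of the recurrent/output terms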

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 18

LSTM with projection layer (LSTMP)

[Figure 1: LSTMP RNN architecture. A single memory block is shown for clarity: the input x_t feeds the cell input g and the gates i_t, f_t, o_t; the cell state c_t passes through h to give the cell output m_t, which is projected to the recurrent units r_t (fed back as r_{t-1}) and onward to the output y_t.]

The input gate controls the flow of input activations into the memory cell. The output gate controls the output flow of cell activations into the rest of the network. Later, the forget gate was added to the memory block [18]. This addressed a weakness of LSTM models preventing them from processing continuous input streams that are not segmented into subsequences. The forget gate scales the internal state of the cell before adding it as input to the cell through the self-recurrent connection of the cell, therefore adaptively forgetting or resetting the cell's memory. In addition, the modern LSTM architecture contains peephole connections from its internal cells to the gates in the same cell to learn precise timing of the outputs [19].

An LSTM network computes a mapping from an input sequence x = (x_1, ..., x_T) to an output sequence y = (y_1, ..., y_T) by calculating the network unit activations with equations (1)-(6) above, applied iteratively from t = 1 to T.


(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 19

LSTM ASR Results (Google Voice Search task)

Table 1: Experiments with LSTM and LSTMP RNN architectures showing test set WERs and frame accuracies on development and training sets. L indicates the number of layers, for shallow (1L) and deep (2L, 4L, 5L, 7L) networks, C indicates the number of memory cells, P the number of recurrent projection units, and N the total number of parameters.

C     P    Depth  N    Dev (%)  Train (%)  WER (%)
840   -    5L     37M  67.7     70.7       10.9
440   -    5L     13M  67.6     70.1       10.8
600   -    2L     13M  66.4     68.5       11.3
385   -    7L     13M  66.2     68.5       11.2
750   -    1L     13M  63.3     65.5       12.4
6000  800  1L     36M  67.3     74.9       11.8
2048  512  2L     22M  68.8     72.0       10.8
1024  512  3L     20M  69.3     72.5       10.7
1024  512  2L     15M  69.0     74.0       10.7
800   512  2L     13M  69.0     72.7       10.7
2048  512  1L     13M  67.3     71.8       11.3

where n_r is the number of units in the recurrent projection layer. In this case, the model size and the learning computational complexity are dominated by the n_r \times (4 n_c + n_o) factor. Hence, this allows us to reduce the number of parameters by the ratio n_r / n_c.

By setting n_r < n_c we can increase the model memory (n_c) and still be able to control the number of parameters in the recurrent connections and output layer.

With the proposed LSTMP architecture, the equations for the activations of network units change slightly: the m_{t-1} activation vector is replaced with r_{t-1}, and the following is added:

r_t = W_{rm} m_t    (7)
y_t = \phi(W_{yr} r_t + b_y)    (8)

where the r_t denote the recurrent unit activations.
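
Following equations (7)-(8), the earlier LSTM step sketch changes only in which vector is fed back and in how the output is computed. The version below is again an added illustration with hypothetical parameter names; it omits the peephole terms for brevity.

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

def lstmp_step(x_t, r_prev, c_prev, p):
    """One LSTMP time step: the recurrent input is the projection r_{t-1}, not m_{t-1}."""
    i_t = sigmoid(p['W_ix'] @ x_t + p['W_ir'] @ r_prev + p['b_i'])
    f_t = sigmoid(p['W_fx'] @ x_t + p['W_fr'] @ r_prev + p['b_f'])
    c_t = f_t * c_prev + i_t * np.tanh(p['W_cx'] @ x_t + p['W_cr'] @ r_prev + p['b_c'])
    o_t = sigmoid(p['W_ox'] @ x_t + p['W_or'] @ r_prev + p['b_o'])
    m_t = o_t * np.tanh(c_t)
    r_t = p['W_rm'] @ m_t                      # (7) linear recurrent projection, with n_r < n_c
    y_t = softmax(p['W_yr'] @ r_t + p['b_y'])  # (8) output computed from the projection
    return r_t, c_t, y_t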

2.4 Deep LSTMP

Similar to deep LSTM, we propose deep LSTMP, where multiple LSTM layers, each with a separate recurrent projection layer, are stacked. LSTMP allows the memory of the model to be increased independently from the output layer and recurrent connections. However, we noticed that increasing the memory size makes the model more prone to overfitting by memorizing the input sequence data. We know that DNNs generalize better to unseen examples with increasing depth. The depth makes the models harder to overfit to the training data, since the inputs to the network need to go through many non-linear functions. With this motivation, we have experimented with deep LSTMP architectures, where the aim is increasing the memory size and generalization power of the model.

3. Distributed Training: Scaling up to Large Models with Parallelization

We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU. The decision was based on CPU's relatively simpler implementation complexity, ease of debugging, and the ability to use clusters made from commodity hardware. For matrix operations, we used the Eigen matrix library [21]. This templated C++ library provides efficient implementations for matrix operations on CPU using vectorized instructions. We implemented activation functions and gradient calculations on matrices using SIMD instructions to benefit from parallelization.

We use the truncated backpropagation through time (BPTT) learning algorithm [22] to compute parameter gradients on short subsequences of the training utterances. Activations are forward-propagated for a fixed step time T_bptt (e.g. 20). Cross-entropy gradients are computed for this subsequence and backpropagated to its start. For computational efficiency, each thread operates on subsequences of four utterances at a time, so matrix multiplies can operate in parallel on four frames at a time. We use asynchronous stochastic gradient descent (ASGD) [23] to optimize the network parameters, updating the parameters asynchronously from multiple threads on a multi-core machine. This effectively increases the batch size and reduces the correlation of the frames in a given batch. After a thread has updated the parameters, it continues with the next subsequence in each utterance, preserving the LSTM state, or starts new utterances with reset state when one finishes. Note that the last subsequence of each utterance can be shorter than T_bptt, but it is padded to the full length, though no gradient is generated for these padding frames.
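
The subsequence handling described here can be sketched as follows. This is a simplified, single-thread illustration: the constants T_bptt = 20 and four utterances per batch follow the text, while the function name and the zero-padding layout are assumptions of the sketch.

import numpy as np

T_BPTT = 20          # frames forward-propagated before gradients are backpropagated to the chunk start
UTTS_PER_BATCH = 4   # each thread advances four utterances in lock-step

def subsequences(utterance, t_bptt=T_BPTT):
    """Yield fixed-length chunks of one utterance (a T x d array); the last chunk is zero-padded."""
    T, d = utterance.shape
    for start in range(0, T, t_bptt):
        chunk = utterance[start:start + t_bptt]
        n_pad = t_bptt - len(chunk)
        if n_pad:
            chunk = np.vstack([chunk, np.zeros((n_pad, d))])
        yield chunk, n_pad   # no gradient would be computed for the n_pad padding frames

The LSTM state would be carried across chunks of the same utterance and reset whenever a new utterance starts.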

This highly parallel single-machine ASGD framework described in [3] proved slow for training models of the scale we have used for large scale ASR with DNNs (many millions of parameters). To scale further, we replicate the single-machine workers on many (e.g. 500) separate machines, each with three synchronized computation threads. Each worker communicates with a shared, distributed parameter server [23], which stores the LSTM parameters. When a worker has computed the parameter gradient on a minibatch (of 3 x 4 x T_bptt frames), the gradient vector is partitioned and sent to the parameter server shards, which each add the gradients to their parameters and respond with the new parameters. The parameter server shards aggregate parameter updates completely asynchronously. For instance, gradient updates from workers may arrive in different orders at different shards of the parameter server. Despite the asynchrony, we observe stable convergence, though the learning rate must be reduced, as would be expected because of the increase in the effective batch size from the greater parallelism.
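
As a toy illustration of the shard behaviour just described (thread-level only; the class and method names are hypothetical, and this is nothing like the scale of the distributed system in the paper), each shard simply adds incoming gradients to its slice of the parameters and responds with the new values:

import threading
import numpy as np

class ParamShard:
    """One shard of a parameter server: applies gradients as they arrive, asynchronously."""
    def __init__(self, values, lr=0.01):
        self.values = values.astype(float)
        self.lr = lr
        self.lock = threading.Lock()

    def apply_gradient(self, grad):
        # Each shard updates independently; gradients from different workers may
        # arrive in different orders at different shards.
        with self.lock:
            self.values -= self.lr * grad
            return self.values.copy()  # respond with the new parameters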

4. Experiments

We evaluate and compare the performance of LSTM RNN architectures on a large vocabulary speech recognition task - the Google Voice Search task. We use a hybrid approach [24] for acoustic modeling with LSTM RNNs, wherein the neural networks estimate hidden Markov model (HMM) state posteriors. We scale the state posteriors by the state priors, estimated as the relative state frequency from the training data, to obtain the acoustic frame likelihoods. We deweight the silence state counts by a factor of 2.7 when estimating the state frequencies.
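
The posterior-to-likelihood conversion in this hybrid setup divides the network's state posteriors by the state priors; a minimal log-domain sketch, with hypothetical names and the silence deweighting factor of 2.7 from the text:

import numpy as np

def scaled_log_likelihoods(log_posteriors, state_counts, silence_ids=(), deweight=2.7):
    """log p(x|s) up to a constant in s: log p(s|x) - log p(s), with priors from training counts."""
    counts = np.asarray(state_counts, dtype=float).copy()
    counts[list(silence_ids)] /= deweight       # deweight the silence state counts
    log_priors = np.log(counts / counts.sum())  # relative state frequencies
    return log_posteriors - log_priors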

4.1 Systems & Evaluation

All the networks are trained on a 3 million utterance (about 1900 hours) dataset consisting of anonymized and hand-transcribed utterances. The dataset is represented with 25 ms frames of 40-dimensional log-filterbank energy features computed every 10 ms. The utterances are aligned with an 85 million parameter DNN with 14247 CD states. The weights in all the networks are initialized to the range (-0.02, 0.02) with a uniform distribution. We try to set the learning rate specific to a network architecture and its configuration to the largest value

(Sak 2014)

ASR Lecture 13 Recurrent Network Acoustic Models 20

Reading

S. Hochreiter and J. Schmidhuber (1997), "Long Short-Term Memory", Neural Computation, 9:1735-1780. http://www.mitpressjournals.org/doi/abs/10.1162/neco.1997.9.8.1735

F. A. Gers et al (2000), "Learning to Forget: Continual Prediction with LSTM", Neural Computation, 12:2451-2471. http://www.mitpressjournals.org/doi/abs/10.1162/089976600300015015

T. Robinson et al (1996), "The use of recurrent networks in continuous speech recognition", in Automatic Speech and Speaker Recognition: Advanced Topics, Lee et al (eds), Kluwer, 233-258. http://www.cstr.ed.ac.uk/downloads/publications/1996/rnn4csr96.pdf

H. Sak et al (2014), "LSTM recurrent neural network architectures for large scale acoustic modeling", Interspeech-2014. http://research.google.com/pubs/archive/43905.pdf

ASR Lecture 13 Recurrent Network Acoustic Models 21

