# A Tutorial on Training Recurrent Neural Networks, Covering BPTT, RTRL, EKF and the "Echo State Network" Approach


**Herbert Jaeger**
Fraunhofer Institute for Autonomous Intelligent Systems (AIS); since 2003: International University Bremen

First published: Oct. 2002. First revision: Feb. 2004. Second revision: March 2005.

**Abstract:** This tutorial is a worked-out version of a 5-hour course originally held at AIS in September/October 2002. It has two distinct components. First, it contains a mathematically oriented crash course on traditional training methods for recurrent neural networks, covering back-propagation through time (BPTT), real-time recurrent learning (RTRL), and extended Kalman filtering (EKF) approaches. This material is covered in Sections 2-5. The remaining Sections 1 and 6-9 are much gentler, more detailed, and illustrated with simple examples. They are intended to be useful as a stand-alone tutorial on the echo state network (ESN) approach to recurrent neural network training. The author apologizes for the poor layout of this document: it was transformed from an html file into a Word file...

This manuscript was first printed in October 2002 as: H. Jaeger (2002), Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach. GMD Report 159, German National Research Center for Information Technology (48 pp.).

Revision history:

- 01/04/2004: corrected several serious typos/errors in Sections 3 and 5
- 03/05/2004: corrected numerous typos
- 21/03/2005: corrected errors in Section 8.1, updated some URLs

## Index

1. Recurrent neural networks
    - 1.1 First impression
    - 1.2 Supervised training: basic scheme
    - 1.3 Formal description of RNNs
    - 1.4 Example: a little timer network
2. Standard training techniques for RNNs
    - 2.1 Backpropagation revisited
    - 2.2 Backpropagation through time
3. Real-time recurrent learning
4. Higher-order gradient descent techniques
5. Extended Kalman-filtering approaches
    - 5.1 The extended Kalman filter
    - 5.2 Applying EKF to RNN weight estimation
6. Echo state networks
    - 6.1 Training echo state networks
        - 6.1.1 First example: a sinewave generator
        - 6.1.2 Second example: a tuneable sinewave generator
    - 6.2 Training echo state networks: mathematics of echo states
    - 6.3 Training echo state networks: algorithm
    - 6.4 Why echo states?
    - 6.5 Liquid state machines
7. Short term memory in ESNs
    - 7.1 First example: training an ESN as a delay line
    - 7.2 Theoretical insights
8. ESNs with leaky integrator neurons
    - 8.1 The neuron model
    - 8.2 Example: slow sinewave generator
9. Tricks of the trade

References

## 1. Recurrent neural networks

### 1.1 First impression

There are two major types of neural networks, feedforward and recurrent. In feedforward networks, activation is "piped" through the network from input units to output units (from left to right in the left drawing in Fig. 1.1).

*Figure 1.1: Typical structure of a feedforward network (left) and a recurrent network (right).*
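The two architectures of Fig. 1.1 behave quite differently when driven with input: a feedforward layer computes a fixed mapping, while a recurrent layer carries state from step to step. A minimal numerical sketch of this contrast (the weights and sizes are arbitrary illustrative choices, not from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(42)
W = rng.uniform(-0.5, 0.5, size=(3, 3))  # an arbitrary 3x3 weight matrix

def feedforward(u):
    """A feedforward layer: a static map; output depends only on current input."""
    return np.tanh(W @ u)

def recurrent_step(x):
    """A recurrent layer: iterating it makes the state depend on its own past."""
    return np.tanh(W @ x)

u = np.array([1.0, 0.0, 0.0])
print(feedforward(u))          # the same input always yields the same output

x = u.copy()
for n in range(3):             # iterating the recurrent map yields a trajectory
    x = recurrent_step(x)
    print(n, x)
```

The point of the sketch: calling `feedforward` twice on the same input gives the same answer, whereas the recurrent unit's activation keeps changing as the map is iterated, which is why RNNs implement dynamical systems rather than functions.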
Short characterization of feedforward networks:

- typically, activation is fed forward from input to output through "hidden layers" ("Multi-Layer Perceptrons", MLPs), though many other architectures exist
- mathematically, they implement static input-output mappings (functions)
- basic theoretical result: MLPs can approximate arbitrary (the term needs some qualification) nonlinear maps with arbitrary precision ("universal approximation property")
- most popular supervised training algorithm: the backpropagation algorithm
- huge literature; 95% of neural network publications concern feedforward nets (my estimate)
- have proven useful in many practical applications as approximators of nonlinear functions and as pattern classifiers
- are not the topic considered in this tutorial

By contrast, a recurrent neural network (RNN) has at least one cyclic path of synaptic connections. Basic characteristics:

- all biological neural networks are recurrent
- mathematically, RNNs implement dynamical systems
- basic theoretical result: RNNs can approximate arbitrary (the term needs some qualification) dynamical systems with arbitrary precision ("universal approximation property")
- several types of training algorithms are known, with no clear winner
- theoretical and practical difficulties have by and large prevented practical applications so far
- not covered in most neuroinformatics textbooks, absent from engineering textbooks
- this tutorial is all about them

Because biological neuronal systems are recurrent, RNN models abound in the biological and biocybernetical literature. Standard types of research papers include:

- bottom-up, detailed neurosimulation:
    - compartment models of small (even single-unit) systems
    - complex biological network models (e.g. Freeman's olfactory bulb models)
- top-down investigation of principles:
    - complete mathematical studies of few-unit networks (at AIS: Pasemann, Giannakopoulos)
    - universal properties of dynamical systems as "explanations" for cognitive neurodynamics, e.g. "concept ~ attractor state", "learning ~ parameter change", "jumps in learning and development ~ bifurcations"
    - demonstrations of dynamical working principles
    - synaptic learning dynamics and conditioning
    - synfire chains

This tutorial does not enter this vast area. It is about algorithmical RNNs, intended as blackbox models for engineering and signal processing. The general picture is given in Fig. 1.2.

*Figure 1.2: Principal moves in the blackbox modeling game: a physical system is observed, yielding empirical time series data; an RNN model is fitted to this data ("learn", "estimate", "identify") such that model-generated data has a similar distribution.*

Types of tasks for which RNNs can, in principle, be used:

- system identification and inverse system identification
- filtering and prediction
- pattern classification
- stochastic sequence modeling
- associative memory
- data compression

Some relevant application areas:

- telecommunication
- control of chemical plants
- control of engines and generators
- fault monitoring, biomedical diagnostics and monitoring
- speech recognition
- robotics, toys and edutainment
- video data analysis
- man-machine interfaces

State of usage in applications: RNNs are (not often) proposed in technical articles as "in principle promising" solutions for difficult tasks. Demo prototypes exist for simulated or clean laboratory tasks, but RNNs are not economically relevant yet. Why? Supervised training of RNNs is (or was) extremely difficult. This is the topic of this tutorial.

### 1.2 Supervised training: basic scheme

There are two basic classes of "learning": supervised and unsupervised (plus unclear cases, e.g. reinforcement learning). This tutorial considers only supervised training. In supervised training of RNNs, one starts with teacher data (or training data): empirically observed or artificially constructed input-output time series which represent examples of the desired model behavior.

*Figure 1.3: Supervised training scheme. A. Training: the model is fitted to teacher input-output pairs. B. Exploitation: the model receives fresh input and should produce the correct (unknown) output.*

The teacher data is used to train an RNN such that it more or less precisely reproduces (fits) the teacher data, in the hope that the RNN then generalizes to novel inputs. That is, when the trained RNN receives an input sequence which is somehow similar to the training input sequence, it should generate an output which resembles the output of the original system.

A fundamental issue in supervised training is overfitting: if the model fits the training data too well (extreme case: the model duplicates the teacher data exactly), it has only "learnt the training data by heart" and will not generalize well. This is particularly important with small training samples. Statistical learning theory addresses this problem. For RNN training, however, overfitting has tended to be a non-issue, because known training methods have a hard time fitting the training data well in the first place.

### 1.3 Formal description of RNNs

The elementary building blocks of an RNN are neurons (we will use the term units) connected by synaptic links (connections) whose synaptic strength is coded by a weight. One typically distinguishes input units, internal (or hidden) units, and output units. At a given time, a unit has an activation. We denote the activations of input units by u(n), of internal units by x(n), and of output units by y(n). Sometimes we ignore the input/internal/output distinction and then use x(n) in a metonymical fashion.

*Figure 1.4: A typology of RNN models (incomplete).*
*(Content of Figure 1.4: the typology contrasts discrete-time models, e.g. $x_i(n+1) = f\big(\sum_j w_{ij}\, x_j(n)\big)$, with continuous-time models, e.g. $\dot{x}_i = -x_i + \sum_j w_{ij}\, f(x_j)$, each optionally extended with input, bias, noise, output feedback, ...; further distinctions are spiking models and spatially organized models.)*

There are many types of formal RNN models (see Fig. 1.4). Discrete-time models are mathematically cast as maps iterated over discrete time steps $n = 1, 2, 3, \ldots$. Continuous-time models are defined through differential equations whose solutions are defined over a continuous time $t$. Especially for purposes of biological modeling, continuous dynamical models can be quite involved and describe activation signals on the level of individual action potentials (spikes). Often the model incorporates a specification of a spatial topology, most often of a 2D surface where units are locally connected in retina-like structures. In this tutorial we will only consider a particular kind of discrete-time model without spatial organization.

Our model consists of $K$ input units with an activation (column) vector

$$\mathbf{u}(n) = (u_1(n), \ldots, u_K(n))^t, \qquad (1.1)$$

of $N$ internal units with an activation vector

$$\mathbf{x}(n) = (x_1(n), \ldots, x_N(n))^t, \qquad (1.2)$$

and of $L$ output units with an activation vector

$$\mathbf{y}(n) = (y_1(n), \ldots, y_L(n))^t, \qquad (1.3)$$

where $^t$ denotes transpose. The input / internal / output connection weights are collected in $N \times K$ / $N \times N$ / $L \times (K+N+L)$ weight matrices

$$\mathbf{W}^{in} = (w^{in}_{ij}), \quad \mathbf{W} = (w_{ij}), \quad \mathbf{W}^{out} = (w^{out}_{ij}). \qquad (1.4)$$

The output units may optionally project back to internal units with connections whose weights are collected in an $N \times L$ backprojection weight matrix

$$\mathbf{W}^{back} = (w^{back}_{ij}). \qquad (1.5)$$

*Figure 1.5: The basic network architecture used in this tutorial: $K$ input units, $N$ internal units, and $L$ output units. Shaded arrows indicate optional connections. Dotted arrows mark connections which are trained in the "echo state network" approach (in other approaches, all connections can be trained). A zero weight value can be interpreted as "no connection".*
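As a concrete illustration of this architecture, here is a minimal NumPy sketch that builds weight matrices with the shapes of Eqs. (1.4)-(1.5) and performs the network update of Eqs. (1.6)-(1.7). The unit counts and the uniform random weights are arbitrary illustrative choices, not a recipe from the tutorial (in particular, a usable ESN reservoir needs properly scaled weights):

```python
import numpy as np

K, N, L = 2, 20, 1                        # numbers of input, internal, output units
rng = np.random.default_rng(0)

# Weight matrices with the shapes of Eqs. (1.4)-(1.5). The random values are
# placeholders for illustration only, not a trained or properly scaled network.
W_in   = rng.uniform(-1, 1, (N, K))          # input -> internal, N x K
W      = rng.uniform(-1, 1, (N, N))          # internal -> internal, N x N
W_out  = rng.uniform(-1, 1, (L, K + N + L))  # (input, internal, output) -> output
W_back = rng.uniform(-1, 1, (N, L))          # output -> internal backprojection, N x L

def step(x, y, u_next):
    """One network update: Eq. (1.6) for the internal state,
    Eq. (1.7) for the output, with f = f_out = tanh."""
    x_next = np.tanh(W_in @ u_next + W @ x + W_back @ y)
    y_next = np.tanh(W_out @ np.concatenate([u_next, x_next, y]))
    return x_next, y_next

# Drive the untrained network with a constant input for a few steps.
x, y = np.zeros(N), np.zeros(L)
for n in range(5):
    x, y = step(x, y, np.array([0.5, -0.3]))
```

Note how the concatenated vector fed to `W_out` has length $K+N+L$, matching the argument $(\mathbf{u}(n+1), \mathbf{x}(n+1), \mathbf{y}(n))$ of the output equation.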
Note that output units may have connections not only from internal units but also (often) from input units and (rarely) from output units. The activation of internal units is updated according to

$$\mathbf{x}(n+1) = f(\mathbf{W}^{in}\mathbf{u}(n+1) + \mathbf{W}\mathbf{x}(n) + \mathbf{W}^{back}\mathbf{y}(n)), \qquad (1.6)$$

where $\mathbf{u}(n+1)$ is the externally given input, and $f$ denotes the component-wise application of the individual unit's transfer function $f$ (also known as activation function, unit output function, or squashing function). We will mostly use the sigmoid function $f = \tanh$, but sometimes also consider linear networks with $f = 1$. The output is computed according to

$$\mathbf{y}(n+1) = f^{out}(\mathbf{W}^{out}(\mathbf{u}(n+1), \mathbf{x}(n+1), \mathbf{y}(n))), \qquad (1.7)$$

where $(\mathbf{u}(n+1), \mathbf{x}(n+1), \mathbf{y}(n))$ denotes the concatenated vector made from the input, internal, and output activation vectors. We will use output transfer functions $f^{out} = \tanh$ or $f^{out} = 1$; in the latter case we have linear output units.

### 1.4 Example: a little timer network

Consider the input-output task of timing. The input signal has two components. The first component $u_1(n)$ is 0 most of the time, but sometimes jumps to 1. The second input $u_2(n)$ can take values between 0.1 and 1.0 in increments of 0.1, and assumes a new (random) one of these values each time $u_1(n)$ jumps to 1. The desired output is 0.5 for $10 \times u_2(n)$ time steps after $u_1(n)$ was 1, and 0 otherwise. This amounts to implementing a timer: $u_1(n)$ gives the "go" signal for the timer, $u_2(n)$ gives the desired duration.

*Figure 1.6: Schema of the timer network. Input 1: start signals; input 2: duration setting; output: rectangular signals of the desired duration.*

The following figure shows traces of input and output generated by an RNN trained on this task according to the ESN approach:

*Figure 1.7: Performance of an RNN trained on the timer task. Solid line in last graph: desired (teacher) output. Dotted line: network output.*

Clearly this task r...
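Teacher data for the timer task can be generated directly from the description above. The following sketch is illustrative only: the function name, sequence length, and start-signal probability are free choices not taken from the tutorial, and it assumes that $u_2$ holds its value between start signals and that the 0.5 output phase begins at the step of the go signal.

```python
import numpy as np

def make_timer_data(T, start_prob=0.05, seed=1):
    """Generate (input, desired output) teacher data for the timer task.

    T, start_prob, and seed are illustrative parameters, not from the tutorial.
    """
    rng = np.random.default_rng(seed)
    u = np.zeros((T, 2))        # column 0: go signal u1, column 1: duration u2
    d = np.zeros(T)             # desired (teacher) output
    duration = 0.1              # u2 holds its last value between go signals
    remaining = 0               # steps left in the current 0.5-output phase
    for n in range(T):
        if rng.random() < start_prob:
            u[n, 0] = 1.0                          # "go" signal
            duration = rng.integers(1, 11) / 10.0  # new random value in 0.1 ... 1.0
            remaining = int(round(10 * duration))  # output stays 0.5 this long
        u[n, 1] = duration
        if remaining > 0:
            d[n] = 0.5
            remaining -= 1
    return u, d
```

A training set for supervised RNN learning would then simply be a call like `make_timer_data(1000)`, giving an input time series and the rectangular teacher output it should reproduce.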
