# Reinforcement Learning and Artificial Neural Nets


Reinforcement Learning and Artificial Neural Nets

Pierre de Lacaze, Director, Machine Intelligence

Shareablee Inc.

Lisp NYC, September 20th, 2016

[email protected]@shareablee.com

Overview

• Reinforcement Learning (RL)

• Artificial Neural Nets (ANN)

• Deep Learning (DL)

• Deep Reinforcement Learning (DRL)

Part 1

Reinforcement Learning

Most of the material is drawn from Tom Mitchell’s book Machine Learning (Chpt. 13)

Reinforcement Learning Scenario

• Similar to a Markov Decision Process (Bellman, 1957)

• However, in Reinforcement Learning the agent starts with no prior knowledge of the environment

Reinforcement Learning Problem

• An agent that can observe and act upon an environment

• Learn a policy (a mapping from states to actions) for achieving particular goals.

• State transitions and rewards are provided by the environment

• Agent typically has no domain knowledge

• Differences with other ML approaches:
  – Delayed reward
  – Involves exploration
  – Partially observable states
  – Training instances are generated by the agent

The Reinforcement Learning Task

• Markov Decision Process
  – The agent has a set S of observable states and a set A of actions
  – At each step the agent observes the current state and selects an action
  – The environment responds with a next state and a reward:

    s_{t+1} = δ(s_t, a_t),  r_t = r(s_t, a_t)

  – Learn a policy π: S → A for selecting the next action

– A solution is to choose at each state the action with the largest discounted cumulative reward:

  V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + … ,  where 0 ≤ γ ≤ 1

– In fact we would like to learn an optimal such policy:

  π*(s) = argmax_π V^π(s), for all states s
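As a minimal sketch (not from the talk), the discounted cumulative reward above can be computed for a finite reward sequence; the function name is my own:

```python
def discounted_return(rewards, gamma):
    """V = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for a finite episode."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# With gamma = 0.9 and rewards [0, 0, 100] (the goal reward arrives on the
# third step), the discounted return is 0.9^2 * 100 = 81.
```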

Q-Learning Motivation

• Cannot learn π* directly because there is no training data of the form ⟨s, a⟩

• Recall V* is the discounted cumulative reward

• Prefer s1 to s2 when V*(s1) > V*(s2)

• π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]

• The problem is that this requires perfect knowledge of the reward function r and the state-transition function δ

The Q Function

• Q(s,a) = r(s, a) + γ V*(δ(s, a))

• π*(s) = argmax_a Q(s, a)

• By learning Q instead of V* we no longer need to have complete knowledge of r and δ.

• We will need to estimate training values for Q

Q Learning Algorithm

• Initialize the state/action rewards table to 0
• Start in initial state s
• Repeat forever:
  – Select an action a and execute it
  – Observe the immediate reward r and the new state s′
  – Update the table entry for ⟨s, a⟩:

    Q(s, a) ← r + γ max_{a′} Q(s′, a′)

  – s ← s′
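The talk's own code is in Clojure (rl.clj); as an illustrative sketch, the update loop can be re-implemented in Python on the deterministic six-state grid world shown in the surrounding slides (γ = 0.9, reward 100 for entering the absorbing goal state s3). Episode counts and random action selection are my own choices:

```python
import random

# Deterministic transitions from the talk's grid example: (state, action) -> (next state, reward)
TRANSITIONS = {
    ("s1", "east"): ("s2", 0),   ("s1", "south"): ("s4", 0),
    ("s2", "east"): ("s3", 100), ("s2", "west"): ("s1", 0), ("s2", "south"): ("s5", 0),
    ("s4", "east"): ("s5", 0),   ("s4", "north"): ("s1", 0),
    ("s5", "east"): ("s6", 0),   ("s5", "west"): ("s4", 0), ("s5", "north"): ("s2", 0),
    ("s6", "east"): ("s5", 0),   ("s6", "north"): ("s3", 100),
}
GAMMA = 0.9

def q_learn(episodes=500, seed=0):
    rng = random.Random(seed)
    q = {sa: 0.0 for sa in TRANSITIONS}              # initialize table to 0
    for _ in range(episodes):
        s = rng.choice(["s1", "s2", "s4", "s5", "s6"])
        while s != "s3":                             # episode ends at the goal
            a = rng.choice([act for (st, act) in TRANSITIONS if st == s])
            s_next, r = TRANSITIONS[(s, a)]
            best_next = max((q[sa] for sa in q if sa[0] == s_next), default=0.0)
            q[(s, a)] = r + GAMMA * best_next        # Q(s,a) <- r + gamma max Q(s',a')
            s = s_next
    return q

q = q_learn()
# Reproduces the "Final Q Value Estimates" table below, e.g.
# Q(s1, east) = 90.0, Q(s1, south) = 72.9, Q(s2, east) = 100.0
```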

Simple Grid Example

Initial Q Value Estimates Table

```clojure
{:s1 {:east [:s2 0], :south [:s4 0]}
 :s2 {:east [:s3 100], :west [:s1 0], :south [:s5 0]}
 :s3 {:east [:s3 0], :west [:s3 0], :north [:s3 0], :south [:s3 0]}
 :s4 {:east [:s5 0], :north [:s1 0]}
 :s5 {:east [:s6 0], :west [:s4 0], :north [:s2 0]}
 :s6 {:east [:s5 0], :north [:s3 100]}}
```

Final Q Value Estimates Table

```clojure
{:s1 {:east [:s2 90.0], :south [:s4 72.9]}
 :s2 {:east [:s3 100.0], :west [:s1 81.0], :south [:s5 81.0]}
 :s3 {:east [:s3 0], :west [:s3 0], :north [:s3 0], :south [:s3 0]}
 :s4 {:east [:s5 81.0], :north [:s1 81.0]}
 :s5 {:east [:s6 90.0], :west [:s4 72.9], :north [:s2 90.0]}
 :s6 {:east [:s5 81.0], :north [:s3 100.0]}}
```

Optimal Policy: [:s1 :east :s2 :east :s3]

[Grid-world diagram: states s1–s6; all immediate rewards are 0 except entering the goal state s3, which yields 100; discount factor γ = 0.9]

Q-Learning Convergence

• Under the following criteria Q-Learning is provably convergent:

• Deterministic MDP

• Bounded immediate reward values

• Every state–action pair is visited infinitely often (actions selected with nonzero probability)

Reminder

Show code examples from rl.clj

Nondeterministic Q-Learning

• Nondeterministic MDP: a single action can yield one of several states according to some probability distribution.

• Augment deterministic Q function with expected reward and weighted average of estimated Q values.

Q(s, a) = E[r(s, a)] + γ Σ_{s′} P(s′ | s, a) max_{a′} Q(s′, a′)

• To ensure convergence we need a decaying weighted average:

  Q_n(s, a) ← (1 − α_n) Q_{n−1}(s, a) + α_n [r + γ max_{a′} Q_{n−1}(s′, a′)]

  where α_n = 1 / (1 + visits_n(s, a))
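The decaying-average update above can be sketched as follows (a minimal illustration; function and parameter names are my own):

```python
# `visits` counts how often each (s, a) pair has been updated, so
# alpha_n = 1 / (1 + visits_n(s, a)) shrinks over time and the Q estimate
# becomes a weighted running average of the sampled targets.
def nd_q_update(q, visits, s, a, r, s_next, actions, gamma=0.9):
    visits[(s, a)] = visits.get((s, a), 0) + 1
    alpha = 1.0 / (1 + visits[(s, a)])
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
    q[(s, a)] = ((1 - alpha) * q.get((s, a), 0.0)
                 + alpha * (r + gamma * best_next))
    return q[(s, a)]

q, visits = {}, {}
nd_q_update(q, visits, "s", "a", 10.0, "terminal", ["a"])
# first update: alpha = 1/2, so Q(s, a) = 0.5 * 0 + 0.5 * 10 = 5.0
```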

Temporal Difference Learning

• Q-Learning learns from the immediate successor state.
• TD-Learning learns from more distant states as well.

• Q^(1)(s_t, a_t) = r_t + γ max_{a′} Q(s_{t+1}, a′)

• Q^(2)(s_t, a_t) = r_t + γ r_{t+1} + γ² max_{a′} Q(s_{t+2}, a′)

• Q^(n)(s_t, a_t) = r_t + γ r_{t+1} + … + γ^{n−1} r_{t+n−1} + γ^n max_{a′} Q(s_{t+n}, a′)

• Sutton, 1988: TD(λ) blends the n-step estimates with geometrically decaying weights:

  Q^λ(s_t, a_t) = (1 − λ) [ Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ² Q^(3)(s_t, a_t) + … ]
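A small illustrative sketch of the n-step estimate and a truncated TD(λ) mixture (helper names are my own; the real TD(λ) sum extends to the end of the episode, so truncating it here is only for illustration):

```python
def n_step_q(rewards, gamma, bootstrap_max_q):
    """Q^(n): n observed rewards, then bootstrap with gamma^n * max_a Q(s_{t+n}, a)."""
    n = len(rewards)
    return sum(gamma ** i * r for i, r in enumerate(rewards)) + gamma ** n * bootstrap_max_q

def td_lambda_q(n_step_estimates, lam):
    """Truncated TD(lambda) mixture: (1 - lam) * (Q^(1) + lam Q^(2) + lam^2 Q^(3) + ...)."""
    return (1 - lam) * sum(lam ** i * q for i, q in enumerate(n_step_estimates))

# With n = 1 this reduces to the ordinary Q-learning target r + gamma * max_a Q(s', a):
# n_step_q([5.0], 0.9, 10.0) -> 14.0
```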

TD-Gammon

• TD-Gammon, Gerald Tesauro, 1995

• Used Temporal Difference Learning
• Used a Neural Net to learn the value function

• Trained with 1.5 million games

• Effectively the first time Neural Nets were successfully used in conjunction with Reinforcement Learning

Issues in Reinforcement Learning

• Exploration vs. Exploitation
  – Open research area
  – ε-greedy action selection
  – Softmax action selection

• Generalization
  – Too many state–action pairs in real-world problems
  – Learn state abstractions
  – Transference

• Q-function approximators
  – Neural Nets as universal approximators (DRL)
  – Using the Q-value estimates as training data
  – Cost of exploration in non-simulation settings

• Partial episodes
  – Estimate the reward that would be obtained by completing the episode
  – Trace functions

• Parallelism
  – Multiple agents learning a shared global policy
  – DeepMind's asynchronous RL library

Reminder

Show videos from Pieter Abbeel's (UC Berkeley) IJCAI 2016 presentation

Part 2

Artificial Neural Networks

Most of the material is drawn from Tom Mitchell’s book Machine Learning (Chpt. 4)

Neural Nets in a Nutshell

• Neural Nets are networks of layers of units
• Input layer, hidden layers, output layer
• Different types of units:
  – Linear Unit, Perceptron, Sigmoid Unit, ReLU

• The task is to learn weights for the different units
• Typically given a set of training examples
• Gradient descent is used to train individual units
• Backpropagation is used to train the network

Properties of ANNs

• A general method for learning real-valued, discrete-valued, and vector-valued functions from examples

• Backpropagation uses gradient-descent to adjust weights to best fit training data (input/output pairs)

• Robust to noisy data

• Successfully used in image recognition & speech recognition
  – Yann LeCun, 1989 (handwritten characters)
  – Gary Cottrell, 1990 (face recognition)

Appropriateness of ANNs

• Many training instances available
• Target function is real-valued, discrete-valued, or vector-valued
• Training data may contain errors/noise
• Long training times are acceptable
• Fast evaluation of the learned function is important
• Human understanding of the learned target function is not important

Linear Units and Perceptrons

• Linear Unit: a linear combination of weighted inputs (real-valued output)
• Perceptron: a thresholded linear unit (discrete-valued output)

Note: w_0 is a bias whose purpose is to shift the threshold of the activation function.

Representational Capacity

• A single perceptron can represent all primitive Boolean functions: AND, OR, NAND, and NOR, all of which are linearly separable.

• Every Boolean function can be represented by some network of perceptrons.

• XOR cannot be represented by a single perceptron because its decision surface is not linearly separable.
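A small sketch of these claims (illustrative; the weight values are my own). Hard-coded weights give AND, OR, and NAND, and a brute-force search over a grid of weights finds none that computes XOR. The grid search only illustrates the point; the actual proof is the linear-separability argument above:

```python
from itertools import product

def perceptron(w0, w1, w2):
    """Thresholded linear unit over two Boolean inputs (w0 is the bias)."""
    return lambda x1, x2: 1 if w0 + w1 * x1 + w2 * x2 > 0 else 0

AND  = perceptron(-1.5, 1, 1)    # fires only when both inputs are 1
OR   = perceptron(-0.5, 1, 1)    # fires when at least one input is 1
NAND = perceptron(1.5, -1, -1)   # negation of AND

def matches_xor(unit):
    return all(unit(x1, x2) == (x1 ^ x2) for x1, x2 in product([0, 1], repeat=2))

grid = [i / 2 for i in range(-8, 9)]   # weights from -4.0 to 4.0 in steps of 0.5
xor_possible = any(matches_xor(perceptron(w0, w1, w2))
                   for w0, w1, w2 in product(grid, repeat=3))
# xor_possible is False: no single perceptron in the grid computes XOR
```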

Linear Unit Error Function

• Recall the output of a linear unit: o = w_0 + w_1 x_1 + … + w_n x_n

• Use the squared error to measure the error of the unit:

  E(w) = ½ Σ_{d∈D} (t_d − o_d)²

• where D is the set of training examples, t_d is the target output, and o_d is the unit's output

• The goal is to learn a weight vector that minimizes the error

Weight Space Error Surface

Training Rules for Single Units

1. Delta Rule for linear units, based on the derivative of the error function
   – Converges asymptotically to the minimum-error hypothesis
   – May take unbounded time to converge
   – Convergence is independent of linear separability

2. Perceptron Training Rule
   – Converges to a hypothesis that perfectly fits the training data
   – Takes bounded time to converge if the learning rate is small enough
   – Convergence depends on linear separability
   – Convergence proved by Minsky & Papert, 1969

1. Initialize each weight w_i to a small random value in ±0.5

2. Repeat until the termination criterion is reached:

   a. Set each Δw_i to zero

   b. For each training example ([x_1, …, x_n], t):
      • Compute the output o of the unit
      • For each unit weight, set Δw_i = Δw_i + η (t − o) x_i

   c. For each unit weight, set w_i = w_i + Δw_i

Note: η is the learning rate, typically 0.5 or less.
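The steps above can be sketched as a batch delta-rule trainer for a single linear unit (illustrative; the training set, function name, and hyperparameters are my own, here a noiseless target t = 2·x + 1):

```python
import random

def train_linear_unit(examples, eta=0.05, epochs=500, seed=0):
    rng = random.Random(seed)
    n = len(examples[0][0])
    w = [rng.uniform(-0.5, 0.5) for _ in range(n + 1)]    # w[0] is the bias
    for _ in range(epochs):
        delta = [0.0] * (n + 1)                           # step a: zero the deltas
        for x, t in examples:                             # step b: accumulate updates
            o = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            delta[0] += eta * (t - o)                     # bias sees a constant input of 1
            for i, xi in enumerate(x, start=1):
                delta[i] += eta * (t - o) * xi
        w = [wi + dwi for wi, dwi in zip(w, delta)]       # step c: apply the updates
    return w

examples = [([x], 2 * x + 1) for x in [-1.0, -0.5, 0.0, 0.5, 1.0]]
w = train_linear_unit(examples)
# w converges toward [1.0, 2.0] (bias 1, slope 2)
```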

Standard vs. Stochastic

• Standard gradient descent increments the weights after summing over all training examples

• Incremental or stochastic gradient descent increments the weights for each training example.

• Both approaches are used in practice.
• The stochastic version can help avoid local minima.

Absolute Basics of ANNs

• A network of units

• The network is organized into layers:
  – An input layer
  – One or more hidden layers of units
  – An output layer of units

• A standard ANN is fully connected:
  – All outputs of a previous layer are connected to all units of the next layer.

• Outputs of the network are computed by feeding the inputs into the next layer, computing the outputs of that layer, feeding those into the subsequent layer, etc.

• A Feed-Forward ANN forms an acyclic graph.

Sigmoid Units

We need to redefine the error function E to sum over all output units:

  E(w) = ½ Σ_{d∈D} Σ_{k∈outputs} (t_kd − o_kd)²

The gradient of this new error function E drives the weight updates below.

The sigmoid function σ(x) has the nice property that its derivative is σ(x)(1 − σ(x)).
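The derivative identity can be checked numerically; a quick sketch (function names are my own):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # The property quoted above: sigma'(x) = sigma(x) * (1 - sigma(x))
    return sigmoid(x) * (1 - sigmoid(x))

# Compare against a central finite difference at a few sample points.
h = 1e-6
max_gap = max(abs((sigmoid(x + h) - sigmoid(x - h)) / (2 * h) - sigmoid_prime(x))
              for x in [-3.0, -1.0, 0.0, 0.5, 2.0])
# max_gap is tiny (finite-difference error only)
```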

1. Initialize weights to small random numbers

2. Until the termination criterion is met, for each training example:

   a. Compute the network outputs for the training example

   b. For each output unit k compute its error term:

      δ_k = o_k (1 − o_k) (t_k − o_k)

   c. For each hidden unit h compute its error term:

      δ_h = o_h (1 − o_h) Σ_k (w_kh δ_k)

   d. Update each network weight:

      w_ij ← w_ij + η δ_j x_ij
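The steps above can be sketched as a tiny dependency-free implementation (illustrative only; the talk's own examples live in annv.clj, and the class and layer sizes here are my own):

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# One hidden layer of sigmoid units. W_h[j] holds the weights of hidden
# unit j (bias first); W_o[k] likewise for output unit k.
class Net:
    def __init__(self, n_in, n_hid, n_out, seed=0):
        rng = random.Random(seed)
        self.W_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                    for _ in range(n_hid)]
        self.W_o = [[rng.uniform(-0.5, 0.5) for _ in range(n_hid + 1)]
                    for _ in range(n_out)]

    def forward(self, x):
        # step a: feed the input forward through both layers
        h = [sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
             for w in self.W_h]
        o = [sigmoid(w[0] + sum(wi * hi for wi, hi in zip(w[1:], h)))
             for w in self.W_o]
        return h, o

    def train_step(self, x, t, eta=0.5):
        h, o = self.forward(x)
        # step b: output-unit errors  delta_k = o_k (1 - o_k)(t_k - o_k)
        d_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
        # step c: hidden-unit errors  delta_h = o_h (1 - o_h) sum_k w_kh delta_k
        d_h = [hj * (1 - hj) * sum(self.W_o[k][j + 1] * d_o[k]
                                   for k in range(len(d_o)))
               for j, hj in enumerate(h)]
        # step d: weight updates  w_ij <- w_ij + eta * delta_j * x_ij
        for k, w in enumerate(self.W_o):
            w[0] += eta * d_o[k]
            for j in range(len(h)):
                w[j + 1] += eta * d_o[k] * h[j]
        for j, w in enumerate(self.W_h):
            w[0] += eta * d_h[j]
            for i in range(len(x)):
                w[i + 1] += eta * d_h[j] * x[i]
```

Repeated `train_step` calls on a pattern set drive the squared error down; a toy identity task (reproduce the input at the output, as in the annv.clj demos) works well as a smoke test.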

Vectorizing Backpropagation

• Each layer can be represented as:
  – A matrix of weights, where the ith row holds the weights of the ith unit
  – A bias vector whose ith element is the bias of the ith unit

• Alternative error functions
  – Adding a momentum term
  – Adding a penalty term for weight magnitude
  – Adding terms for errors in the slope (Mitchell & Thrun, 1993)

• Decoupling distance and direction
  – Line search
  – Conjugate gradient method

• Dynamically altering network structure
  – Removing the least salient connections (LeCun, 1990)

• Using different activation functions
  – Hyperbolic tangent: tanh
  – Rectified Linear Units (ReLU)
    • Widely adopted only in the past couple of years
    • f(x) = max(0, x)

ALVINN: Early Self-Driving Car

• Drove at up to 70 mph on a sectioned-off California highway.

• Network architecture:
  – 960 input units
  – 4 hidden units
  – 30 output units

• Pomerleau, 1993

Reminder

Show code examples from annv.clj

1. Identity Function
2. MNIST Data Set

References

• Reinforcement Learning
  – http://personal.disco.unimib.it/Vanneschi/McGrawHill_-_Machine_Learning_-Tom_Mitchell.pdf
  – https://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.html

• Neural Nets & Deep Learning
  – http://personal.disco.unimib.it/Vanneschi/McGrawHill_-_Machine_Learning_-Tom_Mitchell.pdf
  – http://neuralnetworksanddeeplearning.com/chap2.html
  – https://en.wikipedia.org/wiki/Recurrent_neural_network
  – http://deeplearning.net/tutorial/deeplearning.pdf

• Convolutional Neural Networks
  – http://cs231n.github.io/convolutional-networks/
  – http://neuralnetworksanddeeplearning.com/chap6.html

• Deep Reinforcement Learning
  – http://www0.cs.ucl.ac.uk/staff/d.silver/web/Resources_files/deep_rl.pdf
  – http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/