ECE 517: Reinforcement Learning in Artificial Intelligence
Lecture 17: TRTRL, Implementation Considerations, Apprenticeship Learning
Dr. Itamar Arel
College of Engineering, Department of Electrical Engineering and Computer Science
The University of Tennessee
Fall 2010
November 3, 2010


Page 1:

ECE 517: Reinforcement Learning in Artificial Intelligence
Lecture 17: TRTRL, Implementation Considerations, Apprenticeship Learning

Dr. Itamar Arel

College of Engineering
Department of Electrical Engineering and Computer Science
The University of Tennessee
Fall 2010

November 3, 2010

Page 2:

Outline

Recap on RNNs

Implementation and usage issues with RTRL
Computational complexity and resources required

Vanishing gradient problem

Apprenticeship learning

Page 3:

Recap on RNNs

RNNs are potentially much stronger than FFNNs
Can capture temporal dependencies
Embed a complex state representation (i.e., memory)
Models of discrete-time dynamic systems

They are (very) complex to train
TDNN: limited performance, based on a finite input window
RTRL: calculates a dynamic gradient on-line

Page 4:

RTRL reviewed

RTRL is a gradient-descent based method

It relies on sensitivities p^k_{ij}, expressing the impact of any weight w_ij on the activation of neuron k

The algorithm then consists of computing the weight changes from these sensitivities

Let's look at the resources involved ...

p^k_{ij}(t+1) = f'(net_k(t)) \left[ \sum_{l \in U} w_{kl} \, p^l_{ij}(t) + \delta_{ik} \, z_j(t) \right]

\Delta w_{ij}(t) = \alpha \sum_{k \in U} e_k(t) \, p^k_{ij}(t)
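To make the update concrete, here is a minimal NumPy sketch of a single RTRL step for a small, fully connected tanh network. The sizes, the learning rate, and the helper name rtrl_step are illustrative assumptions, not something given in the lecture.

import numpy as np

# Minimal sketch of one RTRL step: N recurrent units, M external inputs.
N, M = 5, 3
rng = np.random.default_rng(0)

W = rng.normal(scale=0.1, size=(N, N + M))   # w_ij: unit i <- source j (units first, then inputs)
y = np.zeros(N)                              # unit activations y_k(t)
p = np.zeros((N, N, N + M))                  # sensitivities p[k, i, j] = dy_k / dw_ij

def rtrl_step(x, e, W, y, p, alpha=0.01):
    """One forward pass plus sensitivity and weight updates.
    x: external input vector, e: per-unit error vector."""
    z = np.concatenate([y, x])               # z_j(t): previous activations and current inputs
    net = W @ z                              # net_k(t)
    y_new = np.tanh(net)
    fprime = 1.0 - y_new ** 2                # f'(net_k(t)) for tanh

    # p^k_ij(t+1) = f'(net_k(t)) [ sum_l w_kl p^l_ij(t) + delta_ik z_j(t) ]
    p_new = np.einsum('kl,lij->kij', W[:, :N], p)
    p_new[np.arange(N), np.arange(N), :] += z
    p_new *= fprime[:, None, None]

    # Delta w_ij(t) = alpha * sum_k e_k(t) p^k_ij(t)
    W = W + alpha * np.einsum('k,kij->ij', e, p_new)
    return W, y_new, p_new

W, y, p = rtrl_step(rng.normal(size=M), rng.normal(size=N), W, y, p)

Note how the N x N x (N+M) sensitivity array already dominates both storage and computation; this is exactly the scaling issue examined on the next slides.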

Page 5:

Implementing RTRL – computations involved

The key component in RTRL is the sensitivities matrix

Must be calculated for each neuron

RTRL, however, is NOT local ...

Can the calculations be efficiently distributed?

p^k_{ij}(t+1) = f'(net_k(t)) \left[ \sum_{l \in U} w_{kl} \, p^l_{ij}(t) + \delta_{ik} \, z_j(t) \right]

The sensitivity matrix holds N^3 entries, and updating each entry requires a sum over N units, i.e. O(N^4) operations per time step.

Page 6:

Implementing RTRL – storage requirements

Let's assume a fully-connected network of N neurons

Memory resources:
Weights matrix w_ij: N^2
Activations y_k: N
Sensitivity matrix: N^3

Total memory requirements: O(N^3)

Let's go over an example:
Assume we have 1000 neurons in the system
Each value requires 20 bits to represent
~20 Gb of storage!
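The storage figure can be checked directly; the snippet below just evaluates the arithmetic quoted on the slide (N = 1000 neurons, 20 bits per stored value).

# Back-of-the-envelope check: the N^3 sensitivity matrix dominates RTRL storage.
N = 1000
bits_per_value = 20
values = N**3 + N**2 + N                  # sensitivities + weights + activations
print(values * bits_per_value / 1e9)      # ~20 (gigabits), matching the slide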

Page 7:

Possible solutions – static subgrouping

Zipser et al. (1989) suggested static grouping of neurons

Relaxing the "fully-connected" requirement
Has backing in neuroscience
Average "branching factor" in the brain ~ 1000

Reduced the complexity by simply leaving out elements of the sensitivity matrix based upon a subgrouping of neurons

Neurons are subgrouped arbitrarily
Sensitivities between groups are ignored
All connections still exist in the forward path

If g is the number of subgroups, then (evaluated numerically in the sketch below):
Storage is O(N^3 / g^2)
Computational speedup is g^3
Communications: each node communicates with N/g nodes
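A quick numerical sketch of how these expressions scale, assuming the same N = 1000 as the earlier example and a few candidate values of g:

# Effect of Zipser-style static subgrouping for an assumed network size.
N = 1000
for g in (1, 4, 10, 20):
    storage = N**3 / g**2      # O(N^3 / g^2) sensitivity storage
    speedup = g**3             # computational speedup over full RTRL
    fan_out = N / g            # nodes each unit communicates with
    print(f"g={g:3d}  storage={storage:.2e}  speedup={speedup:5d}  comm={fan_out:.0f}")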

Page 8:

Possible solutions – static subgrouping (cont.)

Zipser's empirical tests indicate that these networks can solve many of the problems full RTRL solves

One caveat of subgrouped RTRL training is that each subnet must have at least one unit for which a target exists (since gradient information is not exchanged between groups)

Others have proposed dynamic subgrouping
Subgrouping based on maximal gradient information
Not realistic for hardware realization

Open research question: how to calculate the gradient without the O(N^3) storage requirement?

Page 9:

Truncated Real Time Recurrent Learning (TRTRL)

Motivation: to obtain a scalable version of the RTRL algorithm while minimizing performance degradation

How? Limit the sensitivities of each neuron to its ingress (incoming) and egress (outgoing) links

Page 10:

Performing Sensitivity Calculations in TRTRL

For all nodes that are not in the output set, the egress sensitivity values for node i are calculated by imposing k = j in the original RTRL sensitivity equation, such that

p^i_{ji}(t+1) = f'(s_i(t)) \left[ w_{ij} \, p^j_{ji}(t) + \delta_{ij} \, y_i(t) \right]

Similarly, the ingress sensitivity values (for the link arriving from node j) are given by

p^i_{ij}(t+1) = f'(s_i(t)) \left[ w_{ij} \, p^j_{ij}(t) + z_j(t) \right]

For output neurons, a nonzero sensitivity element must exist in order to update the weights:

p^k_{ij}(t+1) = f'(s_k(t)) \left[ w_{ki} \, p^i_{ij}(t) + w_{kj} \, p^j_{ij}(t) + \delta_{ik} \, z_j(t) \right]
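The following is a minimal NumPy sketch of the truncated updates as written above, assuming a small fully connected tanh network. Only the ingress and egress sensitivities of non-output units are tracked (the output-neuron equation and the weight update are omitted for brevity), and the array names and sizes are illustrative assumptions.

import numpy as np

# Truncated (TRTRL-style) sensitivities: two N x N arrays instead of an N^3 tensor.
N = 6
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(N, N))   # w_ij: unit i <- unit j
y = np.zeros(N)
p_in = np.zeros((N, N))                  # p_in[i, j]  ~ p^i_ij  (ingress: effect of w_ij on unit i)
p_out = np.zeros((N, N))                 # p_out[i, j] ~ p^i_ji  (egress: effect of w_ji on unit i)

def trtrl_step(x, W, y, p_in, p_out):
    """One forward pass plus truncated sensitivity updates for non-output units."""
    s = W @ y + x                        # s_i(t); external input added directly for simplicity
    y_new = np.tanh(s)
    fp = 1.0 - y_new ** 2                # f'(s_i(t))

    # Ingress: p^i_ij(t+1) = f'(s_i) [ w_ij p^j_ij(t) + z_j(t) ]; here p^j_ij is p_out[j, i]
    p_in_new = fp[:, None] * (W * p_out.T + y[None, :])

    # Egress: p^i_ji(t+1) = f'(s_i) [ w_ij p^j_ji(t) + delta_ij y_i(t) ]; here p^j_ji is p_in[j, i]
    p_out_new = fp[:, None] * (W * p_in.T) + np.diag(fp * y)

    return y_new, p_in_new, p_out_new

y, p_in, p_out = trtrl_step(rng.normal(size=N), W, y, p_in, p_out)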

Page 11:

Resource Requirements of TRTRL

The network structure remains the same with TRTRL; only the calculation of sensitivities is reduced

Significant reduction in resource requirements ...
Computational load for each neuron drops from O(N^3) to O(2KN), where K denotes the number of output neurons
Total computational complexity is now O(2KN^2)
Storage requirements drop from O(N^3) to O(N^2)

Example revisited: for N = 100 and 10 outputs, ~100k multiplications and only ~20 kB of storage!

Page 12:

Further TRTRL Improvements – Clustering of Neurons

TRTRL introduced localization and memory improvements

Clustered TRTRL adds scalability by reducing the number of long connection lines between processing elements

[Figure: clustered TRTRL topology, showing input and output connections between neuron clusters]

Page 13:

Test case #1: Frequency Doubler

Input: sin(x), target output: sin(2x)
Both networks had 12 neurons
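The training data for this task is easy to reproduce; a possible generation sketch, where the sampling range and sequence length are arbitrary assumptions:

import numpy as np

# One training sequence for the frequency doubler: input sin(x), target sin(2x).
steps = 500
x = np.linspace(0.0, 20.0 * np.pi, steps)
inputs = np.sin(x)          # network input at each time step
targets = np.sin(2.0 * x)   # desired network output at each time step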

Page 14:

Vanishing Gradient Problem

Recap on goals:
Find temporal dependencies in the data with an RNN
The idea behind RTRL: when an error value is found, apply it to inputs seen an indefinite number of epochs ago

In 1994, Bengio et al. showed that both BPTT and RTRL suffer from the problem of vanishing gradient information

When using gradient-based training rules, the "error signal" that is applied to previous inputs tends to vanish

Because of this, long-term dependencies in the data are often overlooked

Short-term memory is OK; long-term memory (>10 epochs) is lost

Page 15:

Vanishing Gradient Problem (cont.)

A learning error yields gradients on the outputs, and therefore on the state variables s_t

Since the weights (parameters) are shared across time:

[Figure: an RNN block with input x_t, output y_t and internal state s_t, where s_t = f(s_{t-1}, x_t)]

\frac{\partial E_t}{\partial W} = \sum_{\tau \le t} \frac{\partial E_t}{\partial s_t} \, \frac{\partial s_t}{\partial s_\tau} \, \frac{\partial s_\tau}{\partial W}

\frac{\partial s_t}{\partial s_\tau} = \frac{\partial s_t}{\partial s_{t-1}} \cdots \frac{\partial s_{\tau+1}}{\partial s_\tau} = f'(s_{t-1}) \, f'(s_{t-2}) \cdots f'(s_\tau) \approx 0 \quad \text{for large } (t - \tau)
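The decay is easy to see numerically: the factor linking s_t to an earlier s_tau is a product of per-step terms w * f'(s), which shrinks geometrically whenever their magnitude is below one. A toy scalar example, with an arbitrary recurrent weight and input noise:

import numpy as np

# Toy illustration of the vanishing gradient in a scalar tanh "network".
rng = np.random.default_rng(0)
w = 0.9                     # recurrent weight (assumed)
s, grad = 0.0, 1.0          # state and running product ds_t / ds_0
for t in range(1, 31):
    s = np.tanh(w * s + rng.normal(scale=0.5))   # scalar state update
    grad *= w * (1.0 - s ** 2)                   # chain-rule factor w * f'(.) for this step
    if t % 10 == 0:
        print(f"t={t:2d}  ds_t/ds_0 ~ {grad:.2e}")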

Page 16:

What is Apprenticeship Learning?

Many times we want to train an agent based on a reference controller
Riding a bicycle
Flying a plane

Starting from scratch may take a very long time
Particularly for large state/action spaces
May cost a lot (e.g. a helicopter crashing)

Process (a code sketch follows below):
Train the agent on the reference controller
Evaluate the trained agent
Improve the trained agent

Note: the reference controller can be anything (e.g. a heuristic controller for the Car Race problem)
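A minimal sketch of this process on a toy tabular problem is shown below. The environment, the reference controller, and the use of a TD (Q-learning style) update to learn value estimates are all stand-ins chosen for illustration, not the specific setup used in the lecture.

import numpy as np

# Apprenticeship-learning loop on a toy chain environment:
# 1) roll out a reference controller, 2) learn value estimates from its
# transitions, 3) act greedily with respect to the learned values.
n_states, n_actions, gamma, alpha = 10, 2, 0.95, 0.1
rng = np.random.default_rng(0)

def env_step(s, a):
    """Toy dynamics: action 1 moves right, action 0 moves left; reward at the right end."""
    s2 = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

def reference_policy(s):
    """Stand-in reference controller: mostly moves right."""
    return 1 if rng.random() < 0.9 else 0

Q = np.zeros((n_states, n_actions))
for episode in range(200):                  # steps 1-2: train on the reference controller
    s = 0
    for _ in range(50):
        a = reference_policy(s)
        s2, r = env_step(s, a)
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])   # TD update
        s = s2

improved = Q.argmax(axis=1)                 # step 3: greedy improvement over the learned values
print(improved)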

Page 17:

Formalizing Apprenticeship Learning

Let's assume we have a reference policy \pi from which we want our agent to learn

We would first like to learn the (approximate) value function V^\pi

Once we have V^\pi, we can try to improve it based on the policy improvement theorem (see the equations below): by following the original policy greedily we obtain a better policy!

In practice, many issues should be considered, such as state-space coverage and exploration/exploitation

Train with zero exploration, then explore gradually ...

Policy improvement:

\pi'(s) = \arg\max_a Q^{\pi}(s, a)

V^{\pi'}(s) \ge V^{\pi}(s)
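A small numerical check of this improvement step on a randomly generated MDP; the MDP itself, its sizes, and the discount factor are assumptions made only for the check.

import numpy as np

# Evaluate a reference policy pi exactly, act greedily w.r.t. Q^pi, and verify
# that V^pi'(s) >= V^pi(s) holds in every state.
rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # transition probabilities P[s, a, s']
R = rng.normal(size=(nS, nA))                   # expected rewards R[s, a]
pi = rng.integers(nA, size=nS)                  # deterministic reference policy

def evaluate(policy):
    """Solve V = R_pi + gamma * P_pi V for a deterministic policy."""
    P_pi = P[np.arange(nS), policy]
    R_pi = R[np.arange(nS), policy]
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)

V = evaluate(pi)
Q = R + gamma * P @ V                 # Q^pi(s, a)
pi_new = Q.argmax(axis=1)             # pi'(s) = argmax_a Q^pi(s, a)
assert np.all(evaluate(pi_new) >= V - 1e-9)   # the improved policy is at least as good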