Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability



TRANSCRIPT

  • Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

    By Keith Bush and Joelle Pineau, NIPS 2009

    Presented by Chenghui Cai, Duke University, ECE

    February 19, 2010

  • Outline

    I Introduction of RL
        I Theory and Philosophy Behind RL
        I Markov Models and RL
        I Moving from Theory Study to Real Applications

    I Background of This Paper

    I Methods

    I Experiments

    I Conclusions and Discussions


  • RL

    I Simple philosophy: the agent receives rewards for good behaviors and punishments for bad behaviors.

    I General training data format: state (situation), action (decision), reward (label).

    I Purpose of RL: learning a behavior policy.

    I Fundamental: the optimal Bellman equation (see the value-iteration sketch after this list),

      V∗(s_t) = max_{a_t} [ E(r(s_{t+1}) | s_t, a_t) + γ E(V∗(s_{t+1}) | s_t, a_t) ]

    I The most general learning framework, viewed along the degree of label clearness: supervised learning has explicit labels, unsupervised learning has no labels, and RL sits in between with only a reward signal.
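    The Bellman optimality equation above is usually solved by iterating it to a fixed point. Below is a minimal value-iteration sketch in Python on a hypothetical two-state, two-action MDP; the transition table P is invented for illustration and is not from the paper or the slides.

    # Minimal value iteration for V*(s) = max_a [ E(r|s,a) + gamma * E(V*(s')|s,a) ].
    # P[s][a] is a list of (probability, next_state, reward) triples for a toy MDP.
    import numpy as np

    P = {
        0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
        1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
    }
    gamma = 0.9
    V = np.zeros(len(P))

    for _ in range(500):  # repeatedly apply the Bellman optimality backup
        V_new = np.array([
            max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s])
            for s in P
        ])
        if np.max(np.abs(V_new - V)) < 1e-9:
            break
        V = V_new

    print(V)  # approximate optimal state values V*(0), V*(1)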


  • Markov Models and RL

    I Markov chain, HMM, MDP and POMDP:

    The POMDP Page

    Partially Observable Markov Decision Processes


    What is a POMDP?

    POMDP is an acronym for a partially observable Markov decision process. This is a mathematical model that can capture domain dynamics that include uncertainty in action effects and uncertainty in perceptual stimuli. Once a problem is captured as a POMDP, it then becomes more amenable to solution using optimization techniques.

    Does this have anything to do with Markov Chains or HMMs?

    Sure does. These are all stochastic, discrete state, discrete time models. Borrowing from Michael Littman's nifty explanatory grid:

    Markov Models

                                       Do we have control over the state transitions?
                                       NO                            YES
    Are the states       YES           Markov Chain                  MDP (Markov Decision Process)
    completely
    observable?          NO            HMM (Hidden Markov Model)     POMDP (Partially Observable
                                                                     Markov Decision Process)


    (adopted from pomdp.org)

    I Two ways to learn the behavior policy:

        I Model-based: learn the dynamics, then solve the Markov model;
        I Model-free: directly learn the policy, e.g., RPR, Q-learning.

    I Applications: robotics, decision making under uncertainty.


  • Background of This Paper

    I Barriers to application: (1) the goal (reward) is not well defined; (2) exploration is expensive; (3) the data do not preserve the Markov property.

    I Solution 1: For many domains, particularly those governed by differential equations, leverage the induction of locality (nearest neighborhoods, e.g., s(t + 1) and s(t)) during function approximation to satisfy the Markov property.

    I Solution 2: Reconstruct the state-spaces of partially observable systems: convert a high-order Markov property into a first-order Markov property while preserving locality.

    I Example: use manifold embeddings to reconstruct locally Euclidean state-spaces of forced, partially observable systems; the embedding can be found non-parametrically.


  • Summary of the method

    An offline RL procedure in two phases:

    I Part 1, modeling phase: identify the appropriate embedding and define the local model.

    I Part 2, learning phase: leverage (use) the resultant locality and perform RL.


  • Modeling: Manifold Embeddings for RL 1/2

    Purpose: use nonlinear dynamic systems theory to reconstruct complete state observability from incomplete state observations via delayed embeddings (representations).

    I Assume a real-valued vector space R^M; action a; state dynamics function f; and a deterministic policy function a(t) = π(s(t)), where s(t) is the state.

    s(t + 1) = f(s(t), a(t)) = f(s(t), π(s(t))) = φ(s(t))   (1)

    I If the system is observed via a function y, s.t.

    s̃(t) = y(s(t)) (2)


  • Modeling: Manifold Embeddings for RL 2/2

    I Construct the vector sE(t) below s.t. sE lies on a subset of R^E which is an embedding of s:

      sE(t) = [s̃(t), s̃(t − 1), · · · , s̃(t − (E − 1))],   E > 2M   (3)

    I Because embeddings preserve the connectivity of the original vector space R^M, in the context of RL the following mapping ψ, with

      sE(t + 1) = ψ(sE(t)),   (4)

      may be substituted for f, and the vectors sE(t) may be substituted for the corresponding vectors s(t). A construction sketch follows.
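    For concreteness, here is a minimal sketch of the delay-embedding construction of Equation (3), assuming a one-dimensional observation stream and a fixed lag τ; the function name delay_embed and the toy sine-wave data are illustrative, not from the paper.

    # Stack E lagged copies of the observation series s~(t), spaced tau samples
    # apart, so each row is sE(t) = [s~(t), s~(t - tau), ..., s~(t - (E-1) tau)].
    import numpy as np

    def delay_embed(s_tilde, E, tau=1):
        s_tilde = np.asarray(s_tilde, dtype=float)
        start = (E - 1) * tau                    # first index with a full history
        rows = [s_tilde[t - np.arange(E) * tau] for t in range(start, len(s_tilde))]
        return np.vstack(rows)

    # Example: embed a partially observed oscillation (only one coordinate seen).
    t = np.linspace(0.0, 20.0, 400)
    observations = np.sin(t)                     # s~(t): a scalar observation stream
    S_E = delay_embed(observations, E=3, tau=5)  # rows are reconstructed 3-D states
    print(S_E.shape)                             # (390, 3)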


  • Modeling: Nonparametric Identification of Manifold Embeddings 1/2

    Remaining problem: how to compute E, the embedding dimension. Solution: an algorithm based on the Singular Value Decomposition (SVD).

    I Given a sequence of state observations s̃ of length S̃, choose a sufficiently large fixed embedding dimension, Ê.

    I For each candidate embedding window size T̂min ∈ {Ê, · · · , S̃}:

      1. Define a matrix SÊ of row vectors sÊ(t), t ∈ {T̂min, · · · , S̃}, by the rule

         sÊ(t) = [s̃(t), s̃(t − τ), · · · , s̃(t − (Ê − 1)τ)],   (5)

      2. Compute the SVD of the matrix SÊ: SÊ = UΣW∗.

      3. Record the vector of singular values, σ(T̂min), from Σ.


  • Modeling: Nonparametric Identification of Manifold Embeddings 2/2

    I Estimate the embedding parameters, Tmin and E, of s by analysis of the second singular value, σ2(T̂min):

      1. The approximate window size, Tmin, of s is the T̂min value at the first local maximum of the sequence of σ2(T̂min), for T̂min ∈ {Ê, · · · , S̃}.

      2. The approximate embedding dimension, E, is the number of non-trivial singular values in σ(Tmin).

    I Embedding s via the parameters Tmin and E yields the matrix SE of row vectors sE(t), t ∈ {Tmin, · · · , S̃}. A sketch of the whole identification procedure follows.
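    The two slides above describe a single sweep-and-inspect procedure. The sketch below is one possible reading of it, reusing the delay_embed helper from the earlier sketch; the window sweep, the lag choice, and the threshold used for "non-trivial" singular values are assumptions made for illustration.

    # Sweep candidate window sizes, embed with a fixed large dimension E_hat,
    # record each singular-value spectrum, then pick T_min at the first local
    # maximum of sigma_2 and E as the number of non-trivial singular values.
    import numpy as np

    def identify_embedding(s_tilde, E_hat=5, windows=range(5, 200, 5), trivial_frac=0.05):
        spectra = {}
        for T in windows:
            tau = max(1, T // (E_hat - 1))              # lag so the window spans ~T samples
            S = delay_embed(s_tilde, E_hat, tau)
            spectra[T] = np.linalg.svd(S, compute_uv=False)
        Ts = sorted(spectra)
        sigma2 = np.array([spectra[T][1] for T in Ts])  # second singular value vs. window
        local_max = [i for i in range(1, len(Ts) - 1)
                     if sigma2[i - 1] < sigma2[i] > sigma2[i + 1]]
        T_min = Ts[local_max[0]] if local_max else Ts[int(np.argmax(sigma2))]
        sigma_at_Tmin = spectra[T_min]
        E = int(np.sum(sigma_at_Tmin > trivial_frac * sigma_at_Tmin[0]))
        return T_min, E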


  • Modeling: Generative Local Models from Embeddings 1/2

    Purpose: generate a local model, i.e., a trajectory of the underlying system, and prepare the observed “state” for RL.

    I Consider a dataset D consisting of temporally aligned sequences of s(t), a(t), and reward r(t), t ∈ {1, · · · , S̃}.

    I Apply the above spectral embedding method to D, yielding a sequence of vectors sE(t), t ∈ {Tmin, · · · , S̃}.

    I A local model M of D is the set of 3-tuples m(t) = {sE(t), a(t), r(t)}, t ∈ {Tmin, · · · , S̃}.

    I Define operations on these tuples: A(m(t)) = a(t), S(m(t)) = sE(t), and Z(m(t)) = sz(t), where sz(t) = [sE(t), a(t)]; also U(M, a) = Ma, where Ma is the subset of tuples in M containing action a. A data-structure sketch follows this list.
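    As a concrete picture of the local model, the sketch below stores M as a list of tuples and mirrors the operations A, S, Z, and U; the dataclass layout and the helper names are illustrative choices, not the authors' implementation.

    # The local model M as plain data: one tuple per time step of the dataset D.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class ModelTuple:
        sE: np.ndarray   # embedded state vector sE(t)
        a: float         # action a(t)
        r: float         # reward r(t)

    def A(m): return m.a                            # action of a tuple
    def S(m): return m.sE                           # embedded state of a tuple
    def Z(m): return np.append(m.sE, m.a)           # sz(t) = [sE(t), a(t)]
    def U(M, a): return [m for m in M if m.a == a]  # subset of M with action a

    def build_local_model(S_E, actions, rewards):
        # S_E: rows from the embedding; actions, rewards: aligned sequences.
        return [ModelTuple(sE, a, r) for sE, a, r in zip(S_E, actions, rewards)]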


  • Modeling: Generative Local Models from Embeddings 2/2

    Consider a state vector x(i) in R^E indexed by simulation time i, and compute its locality, i.e., its nearest neighbor.

    I The model's nearest neighbor of x(i) when taking action a(i), defined for a discrete action set and for a continuous action space, respectively:

      m(t_x(i)) = arg min_{m(t) ∈ U(M, a(i))} ‖S(m(t)) − x(i)‖,   a ∈ A   (6)

      m(t_x(i)) = arg min_{m(t) ∈ M} ‖Z(m(t)) − [x(i), ω a(i)]‖,   a ∈ A,   (7)

      where ω is a scaling parameter.

    I The model gradient and numerical integration are defined as:

      ∇x(i) = S(m(t_x(i) + 1)) − S(m(t_x(i)))   (8)

      x(i + 1) = x(i) + ∆i (∇x(i) + η)   (9)

      where η is a vector of noise and ∆i is the integration step-size. A simulation-step sketch follows this list.
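    Equations (6), (8), and (9) amount to a simple simulated step through the local model. The sketch below is one hedged reading of that step for a discrete action set, reusing S and ModelTuple from the previous sketch; the step size, the noise scale, and the choice to read the reward off the successor tuple are assumptions.

    # One simulated step: find the nearest stored tuple for the commanded action
    # (Eq. 6), follow the model gradient (Eq. 8), and integrate with noise (Eq. 9).
    import numpy as np

    def simulate_step(M, x, a, dt=0.05, noise_std=1e-3, rng=np.random.default_rng(0)):
        idx_a = [k for k, m in enumerate(M) if m.a == a]           # tuples with action a
        k = min(idx_a, key=lambda k: np.linalg.norm(S(M[k]) - x))  # Eq. (6)
        if k + 1 >= len(M):                 # no recorded successor: step back one
            k -= 1
        grad = S(M[k + 1]) - S(M[k])                               # Eq. (8)
        eta = rng.normal(0.0, noise_std, size=x.shape)             # noise vector eta
        x_next = x + dt * (grad + eta)                             # Eq. (9)
        return x_next, M[k + 1].r           # next simulated state and its reward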


  • Q-learning

    Consider the obtained set of sequences x(i) in R^E, a(i), and r(i). Learn the optimal policy, π∗, that maximizes the expected sum of future rewards, characterized by the optimal action-value (Q-) function, Q∗, s.t.

    Q∗(x(i), a(i)) = r(i + 1) + γ max_a Q∗(x(i + 1), a)   (10)

    Iteratively construct an approximation Q of Q∗:

    δ(i) = r(i + 1) + γ max_a Q(x(i + 1), a) − Q(x(i), a(i))   (11)

    where δ(i) is the temporal-difference error used to improve the approximation,

    Q(x(i), a(i)) = Q(x(i), a(i)) + α δ(i)   (12)

    where α is the learning rate. A tabular Q-learning sketch over the local model follows.
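    A minimal tabular sketch of Q-learning over trajectories drawn from the local model, using Equations (11) and (12) as the update and the simulate_step helper from the previous sketch; the state discretization, the epsilon-greedy exploration, and all hyperparameter values are illustrative assumptions, not the paper's settings.

    # Roll out episodes through the local model and apply the TD update (11)-(12).
    import numpy as np
    from collections import defaultdict

    def q_learning(M, actions, x0, episodes=200, steps=500,
                   gamma=0.9, alpha=0.1, epsilon=0.1, rng=np.random.default_rng(0)):
        Q = defaultdict(float)                    # Q[(discretized state, action)]
        disc = lambda x: tuple(np.round(x, 1))    # crude state discretization
        for _ in range(episodes):
            x = np.array(x0, dtype=float)
            for _ in range(steps):
                s = disc(x)
                if rng.random() < epsilon:        # epsilon-greedy action choice
                    a = actions[rng.integers(len(actions))]
                else:
                    a = max(actions, key=lambda b: Q[(s, b)])
                x_next, r = simulate_step(M, x, a)            # model-based transition
                s_next = disc(x_next)
                delta = r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)]
                Q[(s, a)] += alpha * delta                    # Eqs. (11)-(12)
                x = x_next
        return Q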


  • Mountain Car: parking on the hill 1/2

    A second-order nonlinear dynamic system:

    [Mountain Car illustration, adopted from library.rl-community.org]


  • Mountain Car: parking on the hill 2/2

    [Figure 1 plots: embedding spectra (singular values σ1, σ2, σ3, σ5 vs. Tmin), reconstructed trajectories (x(t) vs. x(t − τ)) under the random and learned policies, and path-to-goal length vs. training samples for embeddings of dimension E = 2 and E = 3, with Best, Max, and Random reference curves.]

    Figure 1: Learning experiments on Mountain Car under partial observability. (a) Embedding spectrum and accompanying trajectory (E = 3, Tmin = 0.70 sec.) under random policy. (b) Learning performance as a function of embedding parameters and quantity of training data. (c) Embedding spectrum and accompanying trajectory (E = 3, Tmin = 0.70 sec.) for the learned policy.

    We use the Mountain Car dynamics and boundaries of Sutton and Barto [23]. We fix the initial state for all experiments (and resets) to be the lowest point of the mountain domain with zero velocity, which requires the longest path-to-goal in the optimal policy. Only the position element of the state is observable. During the modeling phase, we record this domain under a random control policy for 10,000 time-steps (∆t = 0.05 seconds), where the action is changed every ∆t = 0.20 seconds. We then compute the spectral embedding of the observations (Tmin = [0.20, 9.95] sec., ∆Tmin = 0.25 sec., and Ê = 5). The resulting spectrum is presented in Figure 1(a). We conclude that the embedding of Mountain Car under the random policy requires dimension E = 3 with a maximum embedding window of Tmin = 1.70 seconds.
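    A sketch of this modeling-phase data collection, assuming the standard Sutton and Barto Mountain Car dynamics, a random three-action policy held fixed for 0.20 s (four 0.05 s steps), and recording only the observable position; the reset handling is an assumption for illustration.

    # Drive Mountain Car with a random policy and record only the position signal,
    # which is the partially observed stream later fed to the spectral embedding.
    import numpy as np

    def collect_random_observations(n_steps=10_000, hold=4, rng=np.random.default_rng(0)):
        pos, vel = -0.5235, 0.0              # lowest point of the valley, zero velocity
        observations, actions = [], []
        a = 0.0
        for i in range(n_steps):
            if i % hold == 0:                # re-draw the action every `hold` steps
                a = rng.choice([-1.0, 0.0, 1.0])
            # Sutton & Barto Mountain Car dynamics and boundaries
            vel = np.clip(vel + 0.001 * a - 0.0025 * np.cos(3 * pos), -0.07, 0.07)
            pos = np.clip(pos + vel, -1.2, 0.6)
            if pos <= -1.2:
                vel = 0.0
            if pos >= 0.5:                   # goal reached: reset to the valley bottom
                pos, vel = -0.5235, 0.0
            observations.append(pos)         # only the position is observable
            actions.append(a)
        return np.array(observations), np.array(actions)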

    To evaluate learning-phase outcomes with respect to modeling-phase outcomes, we perform an experiment where we model the randomly collected observations using embedding parameters drawn from the product of the sets Tmin = {0.20, 0.70, 1.20, 1.70, 2.20} seconds and E = {2, 3}. While we fix the size of the local model to 10,000 elements, we vary the total amount of training samples observed from 10,000 to 200,000 at intervals of 10,000. We use batch Q-learning to identify the optimal policy in a model-based way: in Equation 5 the transition between a state-action pair and the resulting state-reward pair is drawn from the model (η = 0.001). After learning converges, we execute the learned policy on the real system for 10,000 time-steps, recording the mean path-to-goal length over all goals reached. Each configuration is executed 30 times.

    We summarize the results of these experiments by log-scale plots, Figures 1(b) and (c), for embeddings of dimension two and three, respectively. We compare learning performance against three measures: the maximum performing policy achievable given the dynamics of the system (path-to-goal = 63 steps), the best (99th percentile) learned policy for each quantity of training data for each embedding dimension, and the random policy. Learned performance is plotted as linear regression fits of the data.

    Policy performance results of Figures 1(b) and (c) may be summarized by the following observations. Performance positively relates to the quantity of off-line training data for all embedding parameters. Except for the configuration (E = 2, Tmin = 0.20), the influence of Tmin on learning performance relative to E is small. Learning performance of 3-dimensional embeddings dominates ...


    (adopted from the presented NIPS paper)


  • Neurostimulation Treatment of Epilepsy 1/2

    [Figure 2 plots: (a) example field potential traces (control, 0.5 Hz, 1 Hz, 2 Hz stimulation; scale bars 200 sec and 1 mV), (b) the neurostimulation embedding spectrum (singular values σ1, σ2, σ3 vs. Tmin, with a detail view), (c) the neurostimulation model plotted over its first three principal components.]

    Figure 2: Graphical summary of the modeling phase of our adaptive neurostimulation study. (a) Sample observations from the fixed-frequency stimulation dataset. Seizures are labeled with horizontal lines. (b) The embedding spectrum of the fixed-frequency stimulation dataset. The large maximum of σ2 at approximately 100 sec. is an artifact of the periodicity of seizures in the dataset. *Detail of the embedding spectrum for Tmin = [0.05, 2.0] depicting a maximum of σ2 at the time-scale of individual stimulation events. (c) The resultant neurostimulation model constructed from embedding the dataset with parameters (E = 3, Tmin = 1.05 sec.). Note, the model has been desampled 5× in the plot.

    In this complex domain we apply the spectral method differently than described in Section 2. Rather than building the model directly from the embedding (E = 3, Tmin = 1.05), we perform a change of basis on the embedding (Ê = 15, Tmin = 1.05), using the first three columns of the right singular vectors, analogous to projecting onto the principal components. This embedding is plotted in Figure 2(c). Also, unlike the previous case study, we convert stimulation events in the training data from discrete frequencies to a continuous scale of time-elapsed-since-stimulation. This allows us to combine all of the data into a single state-action space and then simulate any arbitrary frequency. Based on exhaustive closed-loop simulations of fixed-frequency suppression efficacy across a spectrum of [0.001, 2.0] Hz, we constrain the model's action set to discrete frequencies a = {2.0, 0.25} Hz in the hopes of easing the learning problem. We then perform batch Q-learning over the model (∆t = 0.05, ω = 0.1, and η = 0.00001), using discount factor γ = 0.9. We structure the reward function to penalize each electrical stimulation by −1 and each visited seizure state by −20.

    Without stimulation, seizure states comprise 25.6% of simulation states. Under a 1.0 Hz fixed-frequency policy, stimulation events comprise 5.0% and seizures comprise 6.8% of the simulation states. The policy learned by the agent also reduces the percent of seizure states to 5.2% of simulation states while stimulating only 3.1% of the time (effective frequency equals 0.62 Hz). In simulation, therefore, the learned policy achieves the goal.
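    The reward structure described above is simple enough to state directly in code; the boolean flags below are placeholders, since the seizure detector and state representation are not specified in this excerpt.

    # Reward shaping from the text: -1 per electrical stimulation, -20 per seizure state.
    def reward(stimulated: bool, in_seizure_state: bool) -> float:
        r = 0.0
        if stimulated:
            r -= 1.0     # penalty for each electrical stimulation
        if in_seizure_state:
            r -= 20.0    # penalty for each visited seizure state
        return r

    # Example: stimulating while the slice is in a seizure state costs -21.
    assert reward(True, True) == -21.0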

    We then deployed the learned policy on real brain slices to test on-line seizure suppression performance. The policy was tested over four trials on two unique brain slices extracted from the same animal. The effective frequencies of these four trials were {0.65, 0.64, 0.66, 0.65} Hz. In all trials seizures were effectively suppressed after a short transient period, during which the policy and slice achieved equilibrium. (Note: seizures occurring at the onset of stimulation are common artifacts of neurostimulation). Figure 3 displays two of these trials spaced over four sequential phases: (a) a control (no stimulation) phase used to determine baseline seizure activity, (b) a learned policy trial lasting 1,860 seconds, (c) a recovery phase to ensure slice viability after stimulation and to recompute baseline seizure activity, and (d) a learned policy trial lasting 2,130 seconds.


    (adopted from the presented NIPS paper)


  • Neurostimulation Treatment of Epilepsy 2/2

    [Figure 3 traces: (a) Control Phase, (b) Policy Phase 1, (c) Recovery Phase, (d) Policy Phase 2; scale bars 60 sec and 2 mV; stimulation events marked below the traces.]

    Figure 3: Field potential trace of a real seizure suppression experiment using a policy learned from simulation. Seizures are labeled as horizontal lines above the traces. Stimulation events are marked by vertical bars below the traces. (a) A control phase used to determine baseline seizure activity. (b) The initial application of the learned policy. (c) A recovery phase to ensure slice viability after stimulation and recompute baseline seizure activity. (d) The second application of the learned policy. *10 minutes of trace are omitted while the algorithm was reset.

    5 Discussion and Related Work

    The RL community has long studied low-dimensional representations to capture complex domains. Approaches for efficient function approximation, basis function construction, and discovery of embeddings have been the topic of significant investigations [3, 11, 20, 15, 13]. Most of this work has been limited to the fully observable (MDP) case and has not been extended to partially observable environments. The question of state space representation in partially observable domains was tackled under the POMDP framework [14] and recently in the PSR framework [19]. These methods address a similar problem but have been limited primarily to discrete action and observation spaces. The PSR framework was extended to continuous (nonlinear) domains [25]. This method is significantly different from our work, both in terms of the class of representations it considers and in the criteria used to select the appropriate representation. Furthermore, it has not yet been applied to real-world domains. An empirical comparison with our approach is left for future consideration.

    The contribution of our work is to integrate embeddings with model-based RL to solve real-world problems. We do this by leveraging the locality-preserving qualities of embeddings to construct dynamic models of the system to be controlled. While not improving the quality of off-line learning that is possible, these models permit embedding validation and reasoning over the domain, either to constrain the learning problem or to anticipate the effects of the learned policy on the dynamics of the controlled system. To demonstrate our approach, we applied it to learn a neurostimulation treatment of epilepsy, a challenging real-world domain. We showed that the policy learned off-line from an embedding-based, local model can be successfully transferred on-line. This is a promising step toward widespread application of RL in real-world domains. Looking to the future, we anticipate the ability to adjust the embedding a priori using a non-parametric policy gradient approach over the local model. An empirical investigation into the benefits of this extension is also left for future consideration.

    Acknowledgments

    The authors thank Dr. Gabriella Panuccio and Dr. Massimo Avoli of the Montreal Neurological Institute for generating the time-series described in Section 4. The authors also thank Arthur Guez, Robert Vincent, Jordan Frank, and Mahdi Milani Fard for valuable comments and suggestions. The authors gratefully acknowledge financial support by the Natural Sciences and Engineering Research Council of Canada and the Canadian Institutes of Health Research.


    (adopted from the presented NIPS paper)


  • Conclusions

    I An integration of existing methods with a new application.

    I Many potentially interesting real-world applications ahead.

