Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability



TRANSCRIPT

  • Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

    By Keith Bush and Joelle Pineau, NIPS 2009

    Presented by Chenghui Cai, Duke University, ECE

    February 19, 2010

  • Outline

    I Introduction of RL
        I Theory and Philosophy Behind RL
        I Markov Models and RL
        I Moving from Theory Study to Real Applications

    I Background of This Paper

    I Methods

    I Experiments

    I Conclusions and Discussions


  • RL

    I Simple philosophy: the agent receives rewards for good behaviors and punishments for bad behaviors.

    I General training data format: state (situation), action (decision), reward (label).

    I Purpose of RL: learning a behavior policy.

    I Fundamental: the optimal Bellman equation (see the value-iteration sketch after this list),

      V∗(s_t) = max_{a_t} [ E(r(s_{t+1}) | s_t, a_t) + γ E(V∗(s_{t+1}) | s_t, a_t) ]

    I The most general learning framework, viewed along the degree of label clearness: supervised learning has explicit labels, unsupervised learning has no labels, and RL sits in between with only a reward signal.
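    The Bellman optimality equation above is usually solved by iterating it to a fixed point. Below is a minimal value-iteration sketch in Python on a hypothetical two-state, two-action MDP; the transition table P is invented for illustration and is not from the paper or the slides.

    # Minimal value iteration for V*(s) = max_a [ E(r|s,a) + gamma * E(V*(s')|s,a) ].
    # P[s][a] is a list of (probability, next_state, reward) triples for a toy MDP.
    import numpy as np

    P = {
        0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
        1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
    }
    gamma = 0.9
    V = np.zeros(len(P))

    for _ in range(500):  # repeatedly apply the Bellman optimality backup
        V_new = np.array([
            max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s])
            for s in P
        ])
        if np.max(np.abs(V_new - V)) < 1e-9:
            break
        V = V_new

    print(V)  # approximate optimal state values V*(0), V*(1)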


  • Markov Models and RL

    I Markov chain, HMM, MDP and POMDP:

    The POMDP Page

    Partially Observable Markov Decision Processes


    What is a POMDP?

    POMDP is an acronym for a partially observable Markov decision process. This is a mathematical model that can capture domain dynamics that include uncertainty in action effects and uncertainty in perceptual stimuli. Once a problem is captured as a POMDP, it then becomes more amenable to solution using optimization techniques.

    Does this have anything to do with Markov Chains or HMMs?

    Sure does. These are all stochastic, discrete state, discrete time models. Borrowing from Michael Littman's nifty explanatory grid:

    Markov Models

                                       Do we have control over the state transitions?
                                       NO                            YES
    Are the states       YES           Markov Chain                  MDP (Markov Decision Process)
    completely
    observable?          NO            HMM (Hidden Markov Model)     POMDP (Partially Observable
                                                                     Markov Decision Process)


    (adopted from pomdp.org)

    I Two ways to learn the behavior policy:

        I Model-based: learn the dynamics, then solve the Markov model;
        I Model-free: directly learn the policy, e.g., RPR, Q-learning.

    I Applications: robotics, decision making under uncertainty.


  • Background of This Paper

    I Barriers to application: (1) the goal (reward) is not well defined; (2) exploration is expensive; (3) the data do not preserve the Markov property.

    I Solution 1: For many domains, particularly those governed by differential equations, leverage the induction of locality (nearest neighborhoods, e.g., s(t + 1) and s(t)) during function approximation to satisfy the Markov property.

    I Solution 2: Reconstruct the state-spaces of partially observable systems: convert a high-order Markov property into a first-order Markov property while preserving locality.

    I Example: use manifold embeddings to reconstruct locally Euclidean state-spaces of forced, partially observable systems; the embedding can be found non-parametrically.


  • Summary of the method

    An offline RL procedure in two phases:

    I Part 1, modeling phase: identify the appropriate embedding and define the local model.

    I Part 2, learning phase: leverage (use) the resultant locality and perform RL.


  • Modeling: Manifold Embeddings for RL 1/2

    Purpose: use nonlinear dynamic systems theory to reconstruct complete state observability from incomplete state observations via delayed embeddings (representations).

    I Assume a real-valued vector space R^M; action a; state dynamics function f; and a deterministic policy function a(t) = π(s(t)), where s(t) is the state.

    s(t + 1) = f(s(t), a(t)) = f(s(t), π(s(t))) = φ(s(t))   (1)

    I If the system is observed via a function y, s.t.

    s̃(t) = y(s(t)) (2)


  • Modeling: Manifold Embeddings for RL 2/2

    I Construct the vector sE(t) below s.t. sE lies on a subset of R^E which is an embedding of s:

      sE(t) = [s̃(t), s̃(t − 1), · · · , s̃(t − (E − 1))],   E > 2M   (3)

    I Because embeddings preserve the connectivity of the original vector space R^M, in the context of RL the following mapping ψ, with

      sE(t + 1) = ψ(sE(t)),   (4)

      may be substituted for f, and the vectors sE(t) may be substituted for the corresponding vectors s(t). A construction sketch follows.
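    For concreteness, here is a minimal sketch of the delay-embedding construction of Equation (3), assuming a one-dimensional observation stream and a fixed lag τ; the function name delay_embed and the toy sine-wave data are illustrative, not from the paper.

    # Stack E lagged copies of the observation series s~(t), spaced tau samples
    # apart, so each row is sE(t) = [s~(t), s~(t - tau), ..., s~(t - (E-1) tau)].
    import numpy as np

    def delay_embed(s_tilde, E, tau=1):
        s_tilde = np.asarray(s_tilde, dtype=float)
        start = (E - 1) * tau                    # first index with a full history
        rows = [s_tilde[t - np.arange(E) * tau] for t in range(start, len(s_tilde))]
        return np.vstack(rows)

    # Example: embed a partially observed oscillation (only one coordinate seen).
    t = np.linspace(0.0, 20.0, 400)
    observations = np.sin(t)                     # s~(t): a scalar observation stream
    S_E = delay_embed(observations, E=3, tau=5)  # rows are reconstructed 3-D states
    print(S_E.shape)                             # (390, 3)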


  • Modeling: Nonparametric Identification of Manifold Embeddings 1/2

    Remaining problem: how to compute E, the embedding dimension. Solution: an algorithm based on the Singular Value Decomposition (SVD).

    I Given a sequence of state observations s̃ of length S̃, choose a sufficiently large fixed embedding dimension, Ê.

    I For each candidate embedding window size T̂min ∈ {Ê, · · · , S̃}:

      1. Define a matrix SÊ of row vectors sÊ(t), t ∈ {T̂min, · · · , S̃}, by the rule

         sÊ(t) = [s̃(t), s̃(t − τ), · · · , s̃(t − (Ê − 1)τ)],   (5)

      2. Compute the SVD of the matrix SÊ: SÊ = UΣW∗.

      3. Record the vector of singular values, σ(T̂min), from Σ.


  • Modeling: Nonparametric Identification of Manifold Embeddings 2/2

    I Estimate the embedding parameters, Tmin and E, of s by analysis of the second singular value, σ2(T̂min):

      1. The approximate window size, Tmin, of s is the T̂min value at the first local maximum of the sequence of σ2(T̂min), for T̂min ∈ {Ê, · · · , S̃}.

      2. The approximate embedding dimension, E, is the number of non-trivial singular values in σ(Tmin).

    I Embedding s via the parameters Tmin and E yields the matrix SE of row vectors sE(t), t ∈ {Tmin, · · · , S̃}. A sketch of the whole identification procedure follows.
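    The two slides above describe a single sweep-and-inspect procedure. The sketch below is one possible reading of it, reusing the delay_embed helper from the earlier sketch; the window sweep, the lag choice, and the threshold used for "non-trivial" singular values are assumptions made for illustration.

    # Sweep candidate window sizes, embed with a fixed large dimension E_hat,
    # record each singular-value spectrum, then pick T_min at the first local
    # maximum of sigma_2 and E as the number of non-trivial singular values.
    import numpy as np

    def identify_embedding(s_tilde, E_hat=5, windows=range(5, 200, 5), trivial_frac=0.05):
        spectra = {}
        for T in windows:
            tau = max(1, T // (E_hat - 1))              # lag so the window spans ~T samples
            S = delay_embed(s_tilde, E_hat, tau)
            spectra[T] = np.linalg.svd(S, compute_uv=False)
        Ts = sorted(spectra)
        sigma2 = np.array([spectra[T][1] for T in Ts])  # second singular value vs. window
        local_max = [i for i in range(1, len(Ts) - 1)
                     if sigma2[i - 1] < sigma2[i] > sigma2[i + 1]]
        T_min = Ts[local_max[0]] if local_max else Ts[int(np.argmax(sigma2))]
        sigma_at_Tmin = spectra[T_min]
        E = int(np.sum(sigma_at_Tmin > trivial_frac * sigma_at_Tmin[0]))
        return T_min, E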


  • Modeling: Generative Local Models from Embeddings 1/2

    Purpose: generate a local model, i.e., a trajectory of the underlying system, and prepare the observed “state” for RL.

    I Consider a dataset D consisting of temporally aligned sequences of s(t), a(t), and reward r(t), t ∈ {1, · · · , S̃}.

    I Apply the above spectral embedding method to D, yielding a sequence of vectors sE(t), t ∈ {Tmin, · · · , S̃}.

    I A local model M of D is the set of 3-tuples m(t) = {sE(t), a(t), r(t)}, t ∈ {Tmin, · · · , S̃}.

    I Define operations on these tuples: A(m(t)) = a(t), S(m(t)) = sE(t), and Z(m(t)) = sz(t), where sz(t) = [sE(t), a(t)]; also U(M, a) = Ma, where Ma is the subset of tuples in M containing action a. A data-structure sketch follows this list.
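    As a concrete picture of the local model, the sketch below stores M as a list of tuples and mirrors the operations A, S, Z, and U; the dataclass layout and the helper names are illustrative choices, not the authors' implementation.

    # The local model M as plain data: one tuple per time step of the dataset D.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class ModelTuple:
        sE: np.ndarray   # embedded state vector sE(t)
        a: float         # action a(t)
        r: float         # reward r(t)

    def A(m): return m.a                            # action of a tuple
    def S(m): return m.sE                           # embedded state of a tuple
    def Z(m): return np.append(m.sE, m.a)           # sz(t) = [sE(t), a(t)]
    def U(M, a): return [m for m in M if m.a == a]  # subset of M with action a

    def build_local_model(S_E, actions, rewards):
        # S_E: rows from the embedding; actions, rewards: aligned sequences.
        return [ModelTuple(sE, a, r) for sE, a, r in zip(S_E, actions, rewards)]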


  • Modeling: Generative Local Models from Embeddings 2/2

    Consider a state vector x(i) in R^E indexed by simulation time i, and compute its locality, i.e., its nearest neighbor.

    I The model's nearest neighbor of x(i) when taking action a(i), defined for a discrete action set and for a continuous action space, respectively:

      m(t_x(i)) = arg min_{m(t) ∈ U(M, a(i))} ‖S(m(t)) − x(i)‖,   a ∈ A   (6)

      m(t_x(i)) = arg min_{m(t) ∈ M} ‖Z(m(t)) − [x(i), ω a(i)]‖,   a ∈ A,   (7)

      where ω is a scaling parameter.

    I The model gradient and numerical integration are defined as:

      ∇x(i) = S(m(t_x(i) + 1)) − S(m(t_x(i)))   (8)

      x(i + 1) = x(i) + ∆i (∇x(i) + η)   (9)

      where η is a vector of noise and ∆i is the integration step-size. A simulation-step sketch follows this list.
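    Equations (6), (8), and (9) amount to a simple simulated step through the local model. The sketch below is one hedged reading of that step for a discrete action set, reusing S and ModelTuple from the previous sketch; the step size, the noise scale, and the choice to read the reward off the successor tuple are assumptions.

    # One simulated step: find the nearest stored tuple for the commanded action
    # (Eq. 6), follow the model gradient (Eq. 8), and integrate with noise (Eq. 9).
    import numpy as np

    def simulate_step(M, x, a, dt=0.05, noise_std=1e-3, rng=np.random.default_rng(0)):
        idx_a = [k for k, m in enumerate(M) if m.a == a]           # tuples with action a
        k = min(idx_a, key=lambda k: np.linalg.norm(S(M[k]) - x))  # Eq. (6)
        if k + 1 >= len(M):                 # no recorded successor: step back one
            k -= 1
        grad = S(M[k + 1]) - S(M[k])                               # Eq. (8)
        eta = rng.normal(0.0, noise_std, size=x.shape)             # noise vector eta
        x_next = x + dt * (grad + eta)                             # Eq. (9)
        return x_next, M[k + 1].r           # next simulated state and its reward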


  • Q-learning

    Consider the obtained set of sequences x(i) in R^E, a(i), and r(i). Learn the optimal policy, π∗, that maximizes the expected sum of future rewards, characterized by the optimal action-value (Q-) function, Q∗, s.t.

    Q∗(x(i), a(i)) = r(i + 1) + γ max_a Q∗(x(i + 1), a)   (10)

    Iteratively construct an approximation Q of Q∗:

    δ(i) = r(i + 1) + γ max_a Q(x(i + 1), a) − Q(x(i), a(i))   (11)

    where δ(i) is the temporal-difference error used to improve the approximation,

    Q(x(i), a(i)) = Q(x(i), a(i)) + α δ(i)   (12)

    where α is the learning rate. A tabular Q-learning sketch over the local model follows.
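    A minimal tabular sketch of Q-learning over trajectories drawn from the local model, using Equations (11) and (12) as the update and the simulate_step helper from the previous sketch; the state discretization, the epsilon-greedy exploration, and all hyperparameter values are illustrative assumptions, not the paper's settings.

    # Roll out episodes through the local model and apply the TD update (11)-(12).
    import numpy as np
    from collections import defaultdict

    def q_learning(M, actions, x0, episodes=200, steps=500,
                   gamma=0.9, alpha=0.1, epsilon=0.1, rng=np.random.default_rng(0)):
        Q = defaultdict(float)                    # Q[(discretized state, action)]
        disc = lambda x: tuple(np.round(x, 1))    # crude state discretization
        for _ in range(episodes):
            x = np.array(x0, dtype=float)
            for _ in range(steps):
                s = disc(x)
                if rng.random() < epsilon:        # epsilon-greedy action choice
                    a = actions[rng.integers(len(actions))]
                else:
                    a = max(actions, key=lambda b: Q[(s, b)])
                x_next, r = simulate_step(M, x, a)            # model-based transition
                s_next = disc(x_next)
                delta = r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)]
                Q[(s, a)] += alpha * delta                    # Eqs. (11)-(12)
                x = x_next
        return Q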


  • Mountain Car: parking on the hill 1/2

    A second-order nonlinear dynamic system:

    [Mountain Car illustration, adopted from library.rl-community.org]


  • Mountain Car: parking on the hill 2/2

    [Figure 1 plots: embedding spectra (singular values σ1, σ2, σ3, σ5 vs. Tmin), reconstructed trajectories (x(t) vs. x(t − τ)) under the random and learned policies, and path-to-goal length vs. training samples for embeddings of dimension E = 2 and E = 3, with Best, Max, and Random reference curves.]

    Figure 1: Learning experiments on Mountain Car under partial observability. (a) Embedding spectrum and accompanying trajectory (E = 3, Tmin = 0.70 sec.) under random policy. (b) Learning performance as a function of embedding parameters and quantity of training data. (c) Embedding spectrum and accompanying trajectory (E = 3, Tmin = 0.70 sec.) for the learned policy.

    We use the Mountain Car dynamics and boundaries of Sutton and Barto [23]. We fix the initial state for all experiments (and resets) to be the lowest point of the mountain domain with zero velocity, which requires the longest path-to-goal in the optimal policy. Only the position element of the state is observable. During the modeling phase, we record this domain under a random control policy for 10,000 time-steps (∆t = 0.05 seconds), where the action is changed every ∆t = 0.20 seconds. We then compute the spectral embedding of the observations (Tmin = [0.20, 9.95] sec., ∆Tmin = 0.25 sec., and Ê = 5). The resulting spectrum is presented in Figure 1(a). We conclude that the embedding of Mountain Car under the random policy requires dimension E = 3 with a maximum embedding window of Tmin = 1.70 seconds.
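    A sketch of this modeling-phase data collection, assuming the standard Sutton and Barto Mountain Car dynamics, a random three-action policy held fixed for 0.20 s (four 0.05 s steps), and recording only the observable position; the reset handling is an assumption for illustration.

    # Drive Mountain Car with a random policy and record only the position signal,
    # which is the partially observed stream later fed to the spectral embedding.
    import numpy as np

    def collect_random_observations(n_steps=10_000, hold=4, rng=np.random.default_rng(0)):
        pos, vel = -0.5235, 0.0              # lowest point of the valley, zero velocity
        observations, actions = [], []
        a = 0.0
        for i in range(n_steps):
            if i % hold == 0:                # re-draw the action every `hold` steps
                a = rng.choice([-1.0, 0.0, 1.0])
            # Sutton & Barto Mountain Car dynamics and boundaries
            vel = np.clip(vel + 0.001 * a - 0.0025 * np.cos(3 * pos), -0.07, 0.07)
            pos = np.clip(pos + vel, -1.2, 0.6)
            if pos <= -1.2:
                vel = 0.0
            if pos >= 0.5:                   # goal reached: reset to the valley bottom
                pos, vel = -0.5235, 0.0
            observations.append(pos)         # only the position is observable
            actions.append(a)
        return np.array(observations), np.array(actions)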

    To evaluate learning-phase outcomes with respect to modeling-phase outcomes, we perform an experiment where we model the randomly collected observations using embedding parameters drawn from the product of the sets Tmin = {0.20, 0.70, 1.20, 1.70, 2.20} seconds and E = {2, 3}. While we fix the size of the local model to 10,000 elements, we vary the total amount of training samples observed from 10,000 to 200,000 at intervals of 10,000. We use batch Q-learning to identify the optimal policy in a model-based way: in Equation 5 the transition between a state-action pair and the resulting state-reward pair is drawn from the model (η = 0.001). After learning converges, we execute the learned policy on the real system for 10,000 time-steps, recording the mean path-to-goal length over all goals reached. Each configuration is executed 30 times.

    We summarize the results of these experiments by log-scale plots, Figures 1(b) and (c), for embeddings of dimension two and three, respectively. We compare learning performance against three measures: the maximum performing policy achievable given the dynamics of the system (path-to-goal = 63 steps), the best (99th percentile) learned policy for each quantity of training data for each embedding dimension, and the random policy. Learned performance is plotted as linear regression fits of the data.

    Policy performance results of Figures 1(b) and (c) may be summarized by the following observations. Performance positively relates to the quantity of off-line training data for all embedding parameters. Except for the configuration (E = 2, Tmin = 0.20), the influence of Tmin on learning performance relative to E is small. Learning performance of 3-dimensional embeddings dominates ...


    (adopted from the presented NIPS paper)


  • Neurostimulation Treatment of Epilepsy 1/2

    [Figure 2 plots: (a) example field potential traces (control, 0.5 Hz, 1 Hz, 2 Hz stimulation; scale bars 200 sec and 1 mV), (b) the neurostimulation embedding spectrum (singular values σ1, σ2, σ3 vs. Tmin, with a detail view), (c) the neurostimulation model plotted over its first three principal components.]

    Figure 2: Graphical summary of the modeling phase of our adaptive neurostimulation study. (a) Sample observations from the fixed-frequency stimulation dataset. Seizures are labeled with horizontal lines. (b) The embedding spectrum of the fixed-frequency stimulation dataset. The large maximum of σ2 at approximately 100 sec. is an artifact of the periodicity of seizures in the dataset. *Detail of the embedding spectrum for Tmin = [0.05, 2.0] depicting a maximum of σ2 at the time-scale of individual stimulation events. (c) The resultant neurostimulation model constructed from embedding the dataset with parameters (E = 3, Tmin = 1.05 sec.). Note, the model has been desampled 5× in the plot.

    In this complex domain we apply the spectral method differently than described in Section 2. Rather than building the model directly from the embedding (E = 3, Tmin = 1.05), we perform a change of basis on the embedding (Ê = 15, Tmin = 1.05), using the first three columns of the right singular vectors, analogous to projecting onto the principal components. This embedding is plotted in Figure 2(c). Also, unlike the previous case study, we convert stimulation events in the training data from discrete frequencies to a continuous scale of time-elapsed-since-stimulation. This allows us to combine all of the data into a single state-action space and then simulate any arbitrary frequency. Based on exhaustive closed-loop simulations of fixed-frequency suppression efficacy across a spectrum of [0.001, 2.0] Hz, we constrain the model's action set to discrete frequencies a = {2.0, 0.25} Hz in the hopes of easing the learning problem. We then perform batch Q-learning over the model (∆t = 0.05, ω = 0.1, and η = 0.00001), using discount factor γ = 0.9. We structure the reward function to penalize each electrical stimulation by −1 and each visited seizure state by −20.

    Without stimulation, seizure states comprise 25.6% of simulation states. Under a 1.0 Hz fixed-frequency policy, stimulation events comprise 5.0% and seizures comprise 6.8% of the simulation states. The policy learned by the agent also reduces the percent of seizure states to 5.2% of simulation states while stimulating only 3.1% of the time (effective frequency equals 0.62 Hz). In simulation, therefore, the learned policy achieves the goal.
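    The reward structure described above is simple enough to state directly in code; the boolean flags below are placeholders, since the seizure detector and state representation are not specified in this excerpt.

    # Reward shaping from the text: -1 per electrical stimulation, -20 per seizure state.
    def reward(stimulated: bool, in_seizure_state: bool) -> float:
        r = 0.0
        if stimulated:
            r -= 1.0     # penalty for each electrical stimulation
        if in_seizure_state:
            r -= 20.0    # penalty for each visited seizure state
        return r

    # Example: stimulating while the slice is in a seizure state costs -21.
    assert reward(True, True) == -21.0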

    We then deployed the learned policy on real brain slices to test on-line seizure suppression performance. The policy was tested over four trials on two unique brain slices extracted from the same animal. The effective frequencies of these four trials were {0.65, 0.64, 0.66, 0.65} Hz. In all trials seizures were effectively suppressed after a short transient period, during which the policy and slice achieved equilibrium. (Note: seizures occurring at the onset of stimulation are common artifacts of neurostimulation). Figure 3 displays two of these trials spaced over four sequential phases: (a) a control (no stimulation) phase used to determine baseline seizure activity, (b) a learned policy trial lasting 1,860 seconds, (c) a recovery phase to ensure slice viability after stimulation and to recompute baseline seizure activity, and (d) a learned policy trial lasting 2,130 seconds.


    (adopted from the presented NIPS paper)


  • Neurostimulation Treatment of Epilepsy 2/2

    [Figure 3 traces: (a) Control Phase, (b) Policy Phase 1, (c) Recovery Phase, (d) Policy Phase 2; scale bars 60 sec and 2 mV; stimulation events marked below the traces.]

    Figure 3: Field potential trace of a real seizure suppression experiment using a policy learned from simulation. Seizures are labeled as horizontal lines above the traces. Stimulation events are marked by vertical bars below the traces. (a) A control phase used to determine baseline seizure activity. (b) The initial application of the learned policy. (c) A recovery phase to ensure slice viability after stimulation and recompute baseline seizure activity. (d) The second application of the learned policy. *10 minutes of trace are omitted while the algorithm was reset.

    5 Discussion and Related Work

    The RL community has long studied low-dimensional representations to capture complex domains. Approaches for efficient function approximation, basis function construction, and discovery of embeddings have been the topic of significant investigations [3, 11, 20, 15, 13]. Most of this work has been limited to the fully observable (MDP) case and has not been extended to partially observable environments. The question of state space representation in partially observable domains was tackled under the POMDP framework [14] and recently in the PSR framework [19]. These methods address a similar problem but have been limited primarily to discrete action and observation spaces. The PSR framework was extended to continuous (nonlinear) domains [25]. This method is significantly different from our work, both in terms of the class of representations it considers and in the criteria used to select the appropriate representation. Furthermore, it has not yet been applied to real-world domains. An empirical comparison with our approach is left for future consideration.

    The contribution of our work is to integrate embeddings with model-based RL to solve real-world problems. We do this by leveraging the locality-preserving qualities of embeddings to construct dynamic models of the system to be controlled. While not improving the quality of off-line learning that is possible, these models permit embedding validation and reasoning over the domain, either to constrain the learning problem or to anticipate the effects of the learned policy on the dynamics of the controlled system. To demonstrate our approach, we applied it to learn a neurostimulation treatment of epilepsy, a challenging real-world domain. We showed that the policy learned off-line from an embedding-based, local model can be successfully transferred on-line. This is a promising step toward widespread application of RL in real-world domains. Looking to the future, we anticipate the ability to adjust the embedding a priori using a non-parametric policy gradient approach over the local model. An empirical investigation into the benefits of this extension is also left for future consideration.

    Acknowledgments

    The authors thank Dr. Gabriella Panuccio and Dr. Massimo Avoli of the Montreal Neurological Institute for generating the time-series described in Section 4. The authors also thank Arthur Guez, Robert Vincent, Jordan Frank, and Mahdi Milani Fard for valuable comments and suggestions. The authors gratefully acknowledge financial support by the Natural Sciences and Engineering Research Council of Canada and the Canadian Institutes of Health Research.


    (adopted from the presented NIPS paper)


  • Conclusions

    I An integration of existing methods with a new application.

    I Many potentially interesting real-world applications ahead.

