Reinforcement Learning: Dynamic Programming

  • Reinforcement Learning: Dynamic Programming
    Csaba Szepesvári, University of Alberta

    Kioloa, MLSS'08

    Slides: http://www.cs.ualberta.ca/~szepesva/MLSS08/

  • Reinforcement Learning
    RL = sampling-based methods to solve optimal control problems

    Contents: Defining AI, Markovian Decision Problems, Dynamic Programming, Approximate Dynamic Programming, Generalizations

    (Rich Sutton)

  • Literature
    Books: Richard S. Sutton, Andrew G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998; Dimitri P. Bertsekas, John Tsitsiklis: Neuro-Dynamic Programming, Athena Scientific, 1996
    Journals: JMLR, MLJ, JAIR, AI
    Conferences: NIPS, ICML, UAI, AAAI, COLT, ECML, IJCAI

  • Some More Books
    Martin L. Puterman: Markov Decision Processes, Wiley, 1994
    Dimitri P. Bertsekas: Dynamic Programming and Optimal Control, Athena Scientific, Vol. I (2005), Vol. II (2007)
    James S. Spall: Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, Wiley, 2003

  • Resources
    RL-Glue: http://rlai.cs.ualberta.ca/RLBB/top.html
    RL-Library: http://rlai.cs.ualberta.ca/RLR/index.html
    The RL Toolbox 2.0: http://www.igi.tugraz.at/ril-toolbox/general/overview.html
    OpenDP: http://opendp.sourceforge.net
    RL-Competition (2008)! http://rl-competition.org/ (June 1st, 2008: test runs begin!)
    Related fields: operations research (MOR, OR), control theory (IEEE TAC, Automatica, IEEE CDC, ECC), simulation optimization (Winter Simulation Conference)

  • Abstract Control Model
    [Diagram: the controller (agent) and the environment form a perception-action loop; the controller sends actions, the environment returns sensations (and reward).]

  • Zooming in...
    [Diagram: inside the agent, external sensations, internal sensations, and reward feed a memory/state, which drives the actions.]

  • A Mathematical Model
    Plant (controlled object):
      xt+1 = f(xt, at, vt)   (xt: state, vt: noise)
      zt = g(xt, wt)         (zt: sensation/observation, wt: noise)
    State: a sufficient statistic for the future, either independently of what we measure, or relative to the measurements
    Controller: at = F(z1, z2, ..., zt)   (at: action/control)
    ⇒ perception-action loop, closed-loop control
    Design problem: F = ?
    Goal: Σt=1..T r(zt, at) → max

    Objective state vs. subjective state

  • A Classification of Controllers
    Feedforward: a1, a2, ... is designed ahead of time ???
    Feedback:
      Purely reactive systems: at = F(zt). Why is this bad?
      Feedback with memory: mt = M(mt-1, zt, at-1) ~ interpreting sensations; at = F(mt): decision making (deliberative vs. reactive)

  • Feedback Controllers
    Plant:
      xt+1 = f(xt, at, vt)
      zt+1 = g(xt, wt)
    Controller:
      mt = M(mt-1, zt, at-1)
      at = F(mt)
    mt ≈ xt: state estimation, filtering; difficulties: noise, unmodelled parts
    How do we compute at?
      With a model (f): model-based control ... assumes (some kind of) state estimation
      Without a model: model-free control
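
To make the closed-loop structure concrete, here is a tiny runnable sketch of the perception-action loop with a memory-based feedback controller. The plant dynamics f, g, the memory update M, and the policy F below are made-up illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Plant (made-up scalar dynamics): x_{t+1} = f(x_t, a_t, v_t), z_t = g(x_t, w_t).
def f(x, a, v):
    return 0.9 * x + a + v

def g(x, w):
    return x + w                          # noisy observation of the state

# Feedback controller with memory: m_t = M(m_{t-1}, z_t, a_{t-1}), a_t = F(m_t).
def M(m, z, a_prev):
    return 0.8 * m + 0.2 * z              # a crude filter: m_t tracks x_t

def F(m):
    return -0.5 * m                       # act to push the estimated state toward 0

x, m, a = 5.0, 0.0, 0.0
for t in range(20):
    z = g(x, rng.normal(scale=0.1))       # observe
    m = M(m, z, a)                        # update memory (state estimate)
    a = F(m)                              # choose action
    x = f(x, a, rng.normal(scale=0.1))    # plant evolves
print("final state (driven toward 0):", x)
```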

  • Markovian Decision Problems

  • Markovian Decision Problems
    (X, A, p, r, γ)
    X: set of states
    A: set of actions (controls)
    p: transition probabilities p(y|x,a)
    r: rewards r(x,a,y), or r(x,a), or r(x)
    γ: discount factor, 0 ≤ γ < 1

  • The Process View
    (Xt, At, Rt)
    Xt: state at time t
    At: action at time t
    Rt: reward at time t
    Laws:
      Xt+1 ~ p(·|Xt, At)
      At ~ π(·|Ht), π: policy
      Ht = (Xt, At-1, Rt-1, ..., A1, R1, X0): history
      Rt = r(Xt, At, Xt+1)
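
As a concrete illustration (not part of the slides), here is a minimal sketch of how a small finite MDP (X, A, p, r, γ) and its process view might be represented and simulated; the NumPy array layout P[a, x, y] = p(y|x,a), R[a, x, y] = r(x,a,y) and the two-state example are assumptions made for the sketch.

```python
import numpy as np

# Made-up 2-state, 2-action MDP: P[a, x, y] = p(y | x, a), R[a, x, y] = r(x, a, y).
n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.9, 0.1],     # action 0 from state 0
               [0.2, 0.8]],    # action 0 from state 1
              [[0.3, 0.7],     # action 1 from state 0
               [0.5, 0.5]]])   # action 1 from state 1
R = np.zeros((n_actions, n_states, n_states))
R[1, :, 1] = 1.0               # reward 1 for landing in state 1 under action 1

rng = np.random.default_rng(0)

def step(x, a):
    """Sample X_{t+1} ~ p(.|x, a) and return (next state, reward R_t = r(x, a, X_{t+1}))."""
    y = rng.choice(n_states, p=P[a, x])
    return y, R[a, x, y]

# Roll out a fixed stationary policy pi: X -> A and accumulate the discounted return.
pi = np.array([1, 0])
x, ret = 0, 0.0
for t in range(100):
    x, r = step(x, pi[x])
    ret += gamma**t * r
print("sampled discounted return from state 0:", ret)
```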

  • The Control Problem
    Value functions: Vπ(x) = E[ Σt=0..∞ γ^t Rt | X0 = x ]

    Optimal value function: V*(x) = supπ Vπ(x)

    Optimal policy: π* is optimal if Vπ* = V*

  • Applications of MDPs
    Areas: operations research, econometrics, control, statistics, games, AI

    Examples: optimal investments, replacement problems, option pricing, logistics and inventory management, active vision, production scheduling, dialogue control, bioreactor control, robotics (RoboCup soccer), driving, real-time load balancing, design of experiments (medical tests)

  • Variants of MDPs
    Discounted; undiscounted (stochastic shortest path); average reward; multiple criteria; minimax; games

  • MDP Problems
    Planning: the MDP (X, A, p, r, γ) is known. Find an optimal policy π*!
    Learning: the MDP is unknown, but you are allowed to interact with it. Find an optimal policy π*!
    Optimal learning: while interacting with the MDP, minimize the loss due to not using an optimal policy from the beginning

  • Solving MDPs: Dimensions
    Which problem? (planning, learning, optimal learning)
    Exact or approximate?
    Uses samples? Incremental?
    Uses value functions?
      Yes: value-function based methods
        Planning: DP, random discretization method, FVI, ...
        Learning: Q-learning, actor-critic, ...
      No: policy search methods
        Planning: Monte-Carlo tree search, likelihood ratio methods (policy gradient), sample-path optimization (Pegasus), ...
    Representation
      Structured state: factored states, logical representation, ...
      Structured policy space: hierarchical methods

  • Dynamic Programming

  • Richard Bellman (1920-1984)
    Control theory, systems analysis
    Dynamic Programming: RAND Corporation, 1949-1955

    Bellman equation, Bellman-Ford algorithm, Hamilton-Jacobi-Bellman equation, curse of dimensionality, invariant imbedding, Grönwall-Bellman inequality

  • Bellman Operators
    Let π: X → A be a stationary policy.
    B(X) = { V | V: X → R, ||V||∞ < ∞ }, the space of bounded value functions.
    The policy evaluation operator Tπ: B(X) → B(X) is defined by (Tπ V)(x) = Σy p(y|x,π(x)) { r(x,π(x),y) + γ V(y) }.
  • Proof of Tπ Vπ = Vπ
    What you need to know:
      Linearity of expectation: E[A+B] = E[A] + E[B]
      Law of total expectation: E[Z] = Σx P(X=x) E[Z | X=x], and E[Z | U=u] = Σx P(X=x | U=u) E[Z | U=u, X=x]
      Markov property: E[ f(X1, X2, ...) | X1=y, X0=x ] = E[ f(X1, X2, ...) | X1=y ]
    Vπ(x) = E[ Σt=0..∞ γ^t Rt | X0 = x ]
          = Σy P(X1=y | X0=x) E[ Σt=0..∞ γ^t Rt | X0=x, X1=y ]            (by the law of total expectation)
          = Σy p(y|x,π(x)) E[ Σt=0..∞ γ^t Rt | X0=x, X1=y ]               (since X1 ~ p(·|X0, π(X0)))
          = Σy p(y|x,π(x)) { E[ R0 | X0=x, X1=y ] + γ E[ Σt=0..∞ γ^t Rt+1 | X0=x, X1=y ] }   (by the linearity of expectation)
          = Σy p(y|x,π(x)) { r(x,π(x),y) + γ Vπ(y) }                      (using the definitions of r and Vπ, and the Markov property)
          = (Tπ Vπ)(x)                                                    (using the definition of Tπ)

  • The Banach Fixed-Point Theorem
    B = (B, ||·||) Banach space.
    T: B1 → B2 is L-Lipschitz (L > 0) if for any U, V: ||T U - T V|| ≤ L ||U - V||.
    T is a contraction if B1 = B2 and L < 1.
    Theorem: If T: B → B is a contraction, then T has a unique fixed point V, and for any V0 the iterates Vk+1 = T Vk converge to V at a geometric rate.
  • An Algebra for Contractions
    Prop: If T1: B1 → B2 is L1-Lipschitz and T2: B2 → B3 is L2-Lipschitz, then T2 ∘ T1 is L1 L2-Lipschitz.
    Def: If T is 1-Lipschitz, T is called a non-expansion.
    Prop: M: B(X × A) → B(X), M(Q)(x) = maxa Q(x,a) is a non-expansion.
    Prop: Mulc: B → B, Mulc V = c V is |c|-Lipschitz.
    Prop: Addr: B → B, Addr V = r + V is a non-expansion.
    Prop: K: B(X) → B(X), (K V)(x) = Σy K(x,y) V(y) is a non-expansion if K(x,y) ≥ 0 and Σy K(x,y) = 1.

  • Policy Evaluations are Contractions
    Def: ||V||∞ = maxx |V(x)|, the supremum norm; written ||·|| below.
    Theorem: Let Tπ be the policy evaluation operator of some policy π. Then Tπ is a γ-contraction.
    Corollary: Vπ is the unique fixed point of Tπ; Vk+1 = Tπ Vk → Vπ for every V0 ∈ B(X), and ||Vk - Vπ|| = O(γ^k).
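
A sketch of policy evaluation for a finite MDP: iterate Vk+1 = Tπ Vk and compare against the exact fixed point obtained by solving the linear system (I - γ Pπ) Vπ = rπ. The array layout P[a, x, y], R[a, x, y] and the two-state example are illustrative assumptions.

```python
import numpy as np

gamma = 0.9
# Made-up 2-state, 2-action MDP: P[a, x, y] = p(y|x, a), R[a, x, y] = r(x, a, y).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.5, 0.5]]])
R = np.zeros_like(P); R[1, :, 1] = 1.0
pi = np.array([1, 0])                       # a stationary policy

n = P.shape[1]
P_pi = P[pi, np.arange(n)]                  # P_pi[x, y] = p(y | x, pi(x))
r_pi = np.einsum('xy,xy->x', P_pi, R[pi, np.arange(n)])    # expected one-step reward

def T_pi(V):
    """Policy evaluation operator: (T^pi V)(x) = sum_y p(y|x,pi(x)) { r(x,pi(x),y) + gamma V(y) }."""
    return r_pi + gamma * P_pi @ V

V_exact = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)  # unique fixed point of T^pi

V = np.zeros(n)
for k in range(50):
    V = T_pi(V)
print("||V_50 - V^pi|| =", np.max(np.abs(V - V_exact)))    # shrinks like O(gamma^k)
```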

  • The Bellman Optimality Operator
    Let T: B(X) → B(X) be defined by (T V)(x) = maxa Σy p(y|x,a) { r(x,a,y) + γ V(y) }.
    Def: π is greedy w.r.t. V if Tπ V = T V.
    Prop: T is a γ-contraction.
    Theorem (Bellman optimality equation): T V* = V*.
    Proof: Let V be the fixed point of T. Since Tπ ≤ T, Vπ ≤ V for every policy π, hence V* ≤ V. Let π be greedy w.r.t. V. Then Tπ V = T V = V, so Vπ = V. Hence V ≤ V*, and therefore V = V*.
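
A minimal sketch of the Bellman optimality operator T and greedy-policy extraction for a finite MDP; the array layout P[a, x, y], R[a, x, y] and the example numbers are illustrative assumptions.

```python
import numpy as np

gamma = 0.9
# Made-up MDP: P[a, x, y] = p(y|x, a), R[a, x, y] = r(x, a, y).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.5, 0.5]]])
R = np.zeros_like(P); R[1, :, 1] = 1.0

def lookahead(V):
    """Q[x, a] = sum_y p(y|x,a) { r(x,a,y) + gamma V(y) }."""
    return np.einsum('axy,axy->xa', P, R + gamma * V[None, None, :])

def T(V):
    """Bellman optimality operator: (T V)(x) = max_a sum_y p(y|x,a) { r(x,a,y) + gamma V(y) }."""
    return lookahead(V).max(axis=1)

def greedy(V):
    """A policy pi is greedy w.r.t. V if it picks a maximizing action in every state."""
    return lookahead(V).argmax(axis=1)

V = np.zeros(2)
print(T(V), greedy(V))
```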

  • Value Iteration
    Theorem: For any V0 ∈ B(X), with Vk+1 = T Vk, Vk → V* and in particular ||Vk - V*|| = O(γ^k).
    What happens when we stop early?
    Theorem: Let π be greedy w.r.t. V. Then ||Vπ - V*|| ≤ 2 ||T V - V|| / (1-γ).
    Proof: ||Vπ - V*|| ≤ ||Vπ - V|| + ||V - V*||, and each term is at most ||T V - V|| / (1-γ) by the contraction property.
    Corollary: In a finite MDP the number of policies is finite, so we can stop when ||Vk - T Vk|| < ε(1-γ)/2, where ε = min{ ||V* - Vπ|| : Vπ ≠ V* }.
    Pseudo-polynomial complexity.
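
A sketch of value iteration with the Bellman-residual stopping rule just described; the MDP, the array layout, and the tolerance are illustrative assumptions.

```python
import numpy as np

gamma, tol = 0.9, 1e-6
# Made-up MDP: P[a, x, y] = p(y|x, a), R[a, x, y] = r(x, a, y).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.5, 0.5]]])
R = np.zeros_like(P); R[1, :, 1] = 1.0

def lookahead(V):
    return np.einsum('axy,axy->xa', P, R + gamma * V[None, None, :])

V = np.zeros(P.shape[1])
while True:
    TV = lookahead(V).max(axis=1)                  # V_{k+1} = T V_k
    # Stopping rule from the slide: once ||V_k - T V_k|| < eps(1-gamma)/2 for eps below the
    # gap min{ ||V* - V^pi|| : V^pi != V* }, the greedy policy is optimal. Here tol plays eps.
    if np.max(np.abs(TV - V)) < tol * (1 - gamma) / 2:
        break
    V = TV

pi = lookahead(V).argmax(axis=1)                   # greedy policy w.r.t. the final V
print(V, pi)
```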

  • Policy Improvement [Howard 60]
    Def: For U, V ∈ B(X), V ≥ U if V(x) ≥ U(x) holds for all x ∈ X.
    Def: For U, V ∈ B(X), V > U if V ≥ U and ∃ x ∈ X s.t. V(x) > U(x).
    Theorem (policy improvement): Let π' be greedy w.r.t. Vπ. Then Vπ' ≥ Vπ. If T Vπ > Vπ, then Vπ' > Vπ.

  • Policy Iteration
    PolicyIteration(π)
      V ← Vπ
      Do {improvement}
        V' ← V
        Let π be such that Tπ V' = T V'   (π greedy w.r.t. V')
        V ← Vπ
      While (V > V')
      Return π
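
The same loop as a runnable sketch (finite MDP in the assumed P[a, x, y] / R[a, x, y] layout; exact policy evaluation is done by solving the linear system, and the loop stops when the greedy policy no longer changes):

```python
import numpy as np

gamma = 0.9
# Made-up MDP: P[a, x, y] = p(y|x, a), R[a, x, y] = r(x, a, y).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.5, 0.5]]])
R = np.zeros_like(P); R[1, :, 1] = 1.0
n = P.shape[1]

def evaluate(pi):
    """V^pi as the solution of (I - gamma P^pi) V = r^pi."""
    P_pi = P[pi, np.arange(n)]
    r_pi = np.einsum('xy,xy->x', P_pi, R[pi, np.arange(n)])
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def greedy(V):
    """A policy that is greedy w.r.t. V."""
    return np.einsum('axy,axy->xa', P, R + gamma * V[None, None, :]).argmax(axis=1)

pi = np.zeros(n, dtype=int)            # arbitrary initial policy
while True:
    V = evaluate(pi)                   # evaluation step
    pi_new = greedy(V)                 # improvement step
    if np.array_equal(pi_new, pi):     # no further improvement => optimal (finite MDP)
        break
    pi = pi_new
print(pi, V)
```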

  • Policy Iteration Theorem
    Theorem: In a finite, discounted MDP, policy iteration stops after a finite number of steps and returns an optimal policy.
    Proof: Follows from the policy improvement theorem.

  • Linear Programming
    V ≥ T V ⟹ V ≥ V* = T V*.
    Hence, V* is the smallest V that satisfies V ≥ T V.
    V ≥ T V ⟺ (*) V(x) ≥ Σy p(y|x,a) { r(x,a,y) + γ V(y) }, ∀ x, a.
    LinProg(V): Σx V(x) → min s.t. V satisfies (*).
    Theorem: LinProg(V) returns the optimal value function V*.
    Corollary: Pseudo-polynomial complexity.
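
A sketch of this LP formulation using scipy.optimize.linprog (assuming SciPy is available; MDP and array layout as in the earlier made-up example). Each constraint row encodes V(x) ≥ Σy p(y|x,a) { r(x,a,y) + γ V(y) } for one pair (x, a), and the objective minimizes Σx V(x).

```python
import numpy as np
from scipy.optimize import linprog

gamma = 0.9
# Made-up MDP: P[a, x, y] = p(y|x, a), R[a, x, y] = r(x, a, y).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.5, 0.5]]])
R = np.zeros_like(P); R[1, :, 1] = 1.0
n_a, n, _ = P.shape

# For each (x, a): gamma * sum_y p(y|x,a) V(y) - V(x) <= -sum_y p(y|x,a) r(x,a,y).
A_ub, b_ub = [], []
for x in range(n):
    for a in range(n_a):
        row = gamma * P[a, x].copy()
        row[x] -= 1.0
        A_ub.append(row)
        b_ub.append(-np.dot(P[a, x], R[a, x]))

# Objective: minimize sum_x V(x) subject to V >= T V; the minimizer is V*.
res = linprog(c=np.ones(n), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n)
print("V* from the LP:", res.x)
```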

  • Variations of a Theme

  • Approximate Value Iteration
    AVI: Vk+1 = T Vk + εk
    AVI Theorem: Let ε = maxk ||εk||. Then limsupk→∞ ||Vk - V*|| ≤ 2ε / (1-γ).
    Proof: Let ak = ||Vk - V*||. Then ak+1 = ||Vk+1 - V*|| = ||T Vk - T V* + εk|| ≤ γ ||Vk - V*|| + ε = γ ak + ε. Hence (ak) is bounded. Take the limsup of both sides: a ≤ γ a + ε; reorder. // (e.g., [BT96])
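
A small simulation sketch of the AVI bound (noise level and MDP are made up): bounded noise εk is added to each application of T, and the late-stage error is compared with 2ε/(1-γ).

```python
import numpy as np

gamma, eps = 0.9, 0.05
rng = np.random.default_rng(0)
# Made-up MDP: P[a, x, y] = p(y|x, a), R[a, x, y] = r(x, a, y).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.5, 0.5]]])
R = np.zeros_like(P); R[1, :, 1] = 1.0

def T(V):
    return np.einsum('axy,axy->xa', P, R + gamma * V[None, None, :]).max(axis=1)

# Reference V*: plain value iteration run long enough to converge.
V_star = np.zeros(2)
for _ in range(2000):
    V_star = T(V_star)

# Approximate value iteration: V_{k+1} = T V_k + eps_k with ||eps_k|| <= eps.
V = np.zeros(2)
errs = []
for _ in range(2000):
    V = T(V) + rng.uniform(-eps, eps, size=2)
    errs.append(np.max(np.abs(V - V_star)))
print("worst late-stage error:", max(errs[-100:]), " bound 2*eps/(1-gamma):", 2 * eps / (1 - gamma))
```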

  • Fitted Value Iteration with Non-Expansion Operators
    FVI: Let A be a non-expansion and iterate Vk+1 = A T Vk. Where does this converge to?
    Theorem: Let U, V be such that A T U = U and T V = V. Then ||V - U|| ≤ ||A V - V|| / (1-γ).
    Proof: Let U' be the fixed point of T A. Then ||U' - V|| ≤ γ ||A V - V|| / (1-γ). Since A U' = A T (A U'), A U' is a fixed point of A T, i.e. U = A U'. Hence ||U - V|| = ||A U' - V|| ≤ ||A U' - A V|| + ||A V - V|| ≤ ||U' - V|| + ||A V - V|| ≤ ||A V - V|| / (1-γ). [Gordon 95]

  • Application to Aggregation
    Let Π be a partition of X, and let S(x) be the unique cell that x belongs to.
    Let A: B(X) → B(X) be (A V)(x) = Σz μ(z; S(x)) V(z), where μ(·; S(x)) is a distribution over S(x).
    Aggregated model:
      p(C|B,a) = Σx∈B μ(x; B) Σy∈C p(y|x,a)
      r(B,a,C) = Σx∈B μ(x; B) Σy∈C p(y|x,a) r(x,a,y)
    Theorem: Take the aggregated MDP (Π, A, p, r), let V be its optimal value function, and let VE(x) = V(S(x)). Then ||VE - V*|| ≤ ||A V* - V*|| / (1-γ).
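
A sketch of the aggregation construction on a made-up 4-state MDP with two cells and uniform weights μ(·; B). One modelling choice here is an assumption: the cell-to-cell reward is normalized by p(C|B,a) so that the aggregated model reproduces the expected one-step reward under μ.

```python
import numpy as np

gamma = 0.9
rng = np.random.default_rng(0)
# Made-up 4-state, 2-action MDP: P[a, x, y] = p(y|x, a), R[a, x, y] = r(x, a, y).
n, n_a = 4, 2
P = rng.random((n_a, n, n)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_a, n, n))

# Partition of X into cells; S[x] is the index of the cell that x belongs to.
S = np.array([0, 0, 1, 1])
n_cells = 2
mu = np.array([[0.5, 0.5, 0.0, 0.0],       # mu(.; B): uniform weights within each cell
               [0.0, 0.0, 0.5, 0.5]])

C_ind = np.eye(n_cells)[S]                 # C_ind[y, c] = 1 if y belongs to cell c
# p(C|B,a) = sum_{x in B} mu(x;B) sum_{y in C} p(y|x,a)
P_agg = np.einsum('bx,axy,yc->abc', mu, P, C_ind)
# Expected reward of a B -> C transition under mu (normalized by p(C|B,a)).
R_num = np.einsum('bx,axy,axy,yc->abc', mu, P, R, C_ind)
R_agg = np.where(P_agg > 0, R_num / np.maximum(P_agg, 1e-12), 0.0)

def T_agg(V):
    """Bellman optimality operator of the aggregated MDP (cells play the role of states)."""
    return np.einsum('abc,abc->ba', P_agg, R_agg + gamma * V[None, None, :]).max(axis=1)

V_cell = np.zeros(n_cells)
for _ in range(1000):
    V_cell = T_agg(V_cell)
V_E = V_cell[S]                            # V_E(x) = V(S(x)): piecewise-constant extension
print(V_E)
```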

  • Action-Value Functions
    L: B(X) → B(X × A), (L V)(x,a) = Σy p(y|x,a) { r(x,a,y) + γ V(y) }: one-step lookahead.
    Note: π is greedy w.r.t. V if (L V)(x, π(x)) = maxa (L V)(x,a).
    Def: Q* = L V*.
    Def: Max: B(X × A) → B(X), (Max Q)(x) = maxa Q(x,a).
    Note: Max ∘ L = T.
    Corollary: Q* = L Max Q*.
    Proof: Q* = L V* = L T V* = L Max L V* = L Max Q*.
    L ∘ Max is a γ-contraction on B(X × A), so value iteration, policy iteration, ... carry over to action-value functions.
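
A sketch of value iteration in action-value form, i.e. iterating Qk+1 = L Max Qk, whose fixed point is Q*; the MDP and array layout are the same illustrative assumptions as before.

```python
import numpy as np

gamma = 0.9
# Made-up MDP: P[a, x, y] = p(y|x, a), R[a, x, y] = r(x, a, y).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.5, 0.5]]])
R = np.zeros_like(P); R[1, :, 1] = 1.0
n_a, n, _ = P.shape

def L(V):
    """One-step lookahead: (L V)(x, a) = sum_y p(y|x,a) { r(x,a,y) + gamma V(y) }."""
    return np.einsum('axy,axy->xa', P, R + gamma * V[None, None, :])

def Max(Q):
    """(Max Q)(x) = max_a Q(x, a)."""
    return Q.max(axis=1)

# Q-value iteration: Q_{k+1} = L(Max(Q_k)); L o Max is a gamma-contraction with fixed point Q*.
Q = np.zeros((n, n_a))
for _ in range(500):
    Q = L(Max(Q))
pi = Q.argmax(axis=1)                      # greedy policy: pi(x) in argmax_a Q*(x, a)
print(Q, pi)
```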

  • Changing Granularity
    Asynchronous value iteration: in every time step, update only a few states.
    AsyncVI Theorem: If all states are updated infinitely often, the algorithm converges to V*.
    How to use? Prioritized sweeping.
    IPS [McMahan & Gordon 05]: instead of updating a state, put it on a priority queue; when picking a state from the queue, update it and put its predecessors on the queue.
    Theorem: Equivalent to Dijkstra's algorithm on shortest path problems, provided that rewards are non-positive.
    LRTA* [Korf 90] ~ RTDP [Barto, Bradtke & Singh 95]: focusing on the parts of the state space that matter.
    Constraints: the same problem is solved from several initial positions; decisions have to be fast.
    Idea: update values along the paths.
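
A sketch of asynchronous value iteration: states are backed up one at a time, in place, in a random order; as long as every state keeps being updated, V still converges to V*. The MDP and the update schedule are illustrative assumptions.

```python
import numpy as np

gamma = 0.9
rng = np.random.default_rng(0)
# Made-up MDP: P[a, x, y] = p(y|x, a), R[a, x, y] = r(x, a, y).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.5, 0.5]]])
R = np.zeros_like(P); R[1, :, 1] = 1.0
n = P.shape[1]

V = np.zeros(n)
for _ in range(5000):
    x = rng.integers(n)                    # pick a single state to back up
    # In-place backup: V(x) <- max_a sum_y p(y|x,a) { r(x,a,y) + gamma V(y) }.
    V[x] = np.max(np.einsum('ay,ay->a', P[:, x], R[:, x] + gamma * V[None, :]))
print(V)
```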

  • Changing Granularity
    Generalized policy iteration: partial evaluation and partial improvement of policies; multi-step lookahead improvement

    AsyncPI Theorem: If both evaluation and improvement happen at every state infinitely often, then the process converges to an optimal policy. [Williams & Baird 93]

  • Variations of a Theme [SzeLi99]
    Game against nature [Heger 94]: infw Σt γ^t Rt(w) with X0 = x
    Risk-sensitive criterion: log( E[ exp( Σt γ^t Rt ) | X0 = x ] )
    Stochastic shortest path
    Average reward
    Markov games: simultaneous action choices (rock-paper-scissors), sequential action choices, zero-sum (or not)

  • References
    [Howard 60] R.A. Howard: Dynamic Programming and Markov Processes, The MIT Press, Cambridge, MA, 1960.
    [Gordon 95] G.J. Gordon: Stable function approximation in dynamic programming. ICML, pp. 261-268, 1995.
    [Watkins 90] C.J.C.H. Watkins: Learning from Delayed Rewards, PhD thesis, 1990.
    [McMahan & Gordon 05] H.B. McMahan and G.J. Gordon: Fast exact planning in Markov decision processes. ICAPS, 2005.
    [Korf 90] R. Korf: Real-time heuristic search. Artificial Intelligence 42, 189-211, 1990.
    [Barto, Bradtke & Singh 95] A.G. Barto, S.J. Bradtke and S. Singh: Learning to act using real-time dynamic programming. Artificial Intelligence 72, 81-138, 1995.
    [Williams & Baird 93] R.J. Williams and L.C. Baird: Tight performance bounds on greedy policies based on imperfect value functions. Northeastern University Technical Report NU-CCS-93-14, November 1993.
    [SzeLi99] Cs. Szepesvári and M.L. Littman: A unified analysis of value-function-based reinforcement-learning algorithms. Neural Computation 11, 2017-2059, 1999.
    [Heger 94] M. Heger: Consideration of risk in reinforcement learning. ICML, pp. 105-111, 1994.
    [BT96] D.P. Bertsekas and J.N. Tsitsiklis: Neuro-Dynamic Programming, Athena Scientific, 1996.

  • Why DP? (1950)
    Wilson, the Secretary of Defense, hated research, not to speak of math! So: not "math", but planning, thinking, decision making ⇒ "programming"; multistage, dynamic, time-varying ⇒ "dynamic".
    RAND, 1948: David Blackwell, George Dantzig, Ted Harris, Sam Karlin, Lloyd Shapley (game theory, decision theory).
    What Kushner said: "The word programming was used by the military to mean scheduling. Dantzig's linear programming was an abbreviation of 'programming with linear models'. Bellman has described the origin of the name 'dynamic programming' as follows. An Assistant Secretary of the Air Force, who was believed to be strongly anti-mathematics, was to visit RAND. So Bellman was concerned that his work on the mathematics of multi-stage decision processes would be unappreciated. But 'programming' was still OK, and the Air Force was concerned with rescheduling continuously due to uncertainties. Thus 'dynamic programming' was chosen as a politically wise descriptor. On the other hand, when I asked him the same question, he replied that he was trying to upstage Dantzig's linear programming by adding 'dynamic'. Perhaps both motivations were true."