Hidden Markov Models and Reinforcement Learning — ajyu/teaching/cogs202_sp14/slides/lect6.pdf
TRANSCRIPT
Hidden Markov Models / Reinforcement Learning
Week 6 Presentation
Yashodhan, Chun, Ning
Hidden Markov Models (HMM)
Questions:
● What are HMMs useful for?
● What are some of the assumptions underlying HMMs?
● What are the 3 problems for HMMs? Explain each in terms of the coin toss example.
Coin Toss Example
3 coins: C1, C2, C3. Select a coin at random, flip it, and repeat.
Given only the outcome sequence HTTHTHHT, can we find out the sequence of coins that was chosen?
Coin:    C1 C2 C1 C3 C1 C1 C2 C3
Outcome: H  T  T  H  T  H  H  T
Hidden Markov Model
Coin:    C1 C2 C1 C3 C1 C1 C2 C3
Outcome: H  T  T  H  T  H  H  T
State sequence: i1, i2, ..., iT
Observation sequence: O1, O2, ..., OT
N = number of distinct states (N = 3 here)
M = number of distinct observation symbols (M = 2 here)
T = length of observation sequence (T = 8 here)
Denote the N states by 1, 2, ..., N (state i corresponds to coin i being chosen)
Denote the M observation symbols by V = {v1, ..., vM} (v1 = H, v2 = T)
Hidden Markov Model
Component: initial state distribution π = {πi}
  Meaning: πi is the probability of being in state i at t = 1
  Example: for i = 1, this is the probability of choosing coin 1 at t = 1
Component: transition matrix A = {aij}
  Meaning: aij is the probability of a transition from state i to state j
  Example: a12 is the probability of choosing coin 2 immediately after coin 1
Component: emission matrix B = {bj(k)}
  Meaning: bj(k) is the probability of observing vk in state j
  Example: b1(2) is the probability of observing Tails given that coin 1 has been chosen
Assumptions
● Finite context: the next state depends only on the current state, not on the earlier history.
● Shared distributions: the transition and emission distributions are the same at every time step (they do not depend on t).
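The coin-toss HMM above can be sketched as a small generative simulation. The parameter values for π, A, and B below are illustrative assumptions, since the slides leave them unspecified:

```python
import random

# Hypothetical parameters for the 3-coin example (illustrative, not from the slides).
pi = [0.5, 0.25, 0.25]            # initial coin-choice probabilities
A = [[0.6, 0.2, 0.2],             # A[i][j] = P(coin j next | coin i now)
     [0.3, 0.4, 0.3],
     [0.2, 0.3, 0.5]]
B = [[0.5, 0.5],                  # B[i][k] = P(symbol k | coin i), 0 = H, 1 = T
     [0.7, 0.3],
     [0.3, 0.7]]

def sample(probs):
    """Draw an index according to a discrete distribution."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def generate(T):
    """Generate a hidden state (coin) sequence and its observation sequence."""
    states, obs = [], []
    s = sample(pi)
    for _ in range(T):
        states.append(s)          # hidden: which coin was chosen
        obs.append(sample(B[s]))  # visible: the flip outcome
        s = sample(A[s])          # Markov transition to the next coin
    return states, obs

states, obs = generate(8)
print("coins:   ", [f"C{i + 1}" for i in states])
print("outcomes:", ["HT"[o] for o in obs])
```

The point of the exercise is that only the second printed line is observable; the three problems below all reason backwards from it.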
Three Problems for HMMs
1. Probability of the observation sequence
2. Choosing the most likely state sequence
3. Estimating the parameters of the HMM
Problem 1 - Direct computation
By marginalization, the product rule, and the conditional independence (C.I.) assumptions:
  P(O) = \sum_{i_1,\dots,i_T} \pi_{i_1} b_{i_1}(O_1) \, a_{i_1 i_2} b_{i_2}(O_2) \cdots a_{i_{T-1} i_T} b_{i_T}(O_T)
This involves about 2T·N^T multiplications. For N = 5, T = 100, this is ~10^72 multiplications.
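The direct computation can be sketched by literally summing over all N^T state sequences, which is feasible only for tiny T. The parameters are the same illustrative π, A, B assumptions used for the coin example, not values from the slides:

```python
from itertools import product

# Illustrative parameters (assumptions, not from the slides); 0 = H, 1 = T.
pi = [0.5, 0.25, 0.25]
A = [[0.6, 0.2, 0.2], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]]
B = [[0.5, 0.5], [0.7, 0.3], [0.3, 0.7]]

def p_obs_direct(obs):
    """P(O) by summing over every possible state sequence: O(T * N^T) work."""
    N, T = len(pi), len(obs)
    total = 0.0
    for seq in product(range(N), repeat=T):   # all N^T hidden sequences
        p = pi[seq[0]] * B[seq[0]][obs[0]]
        for t in range(1, T):
            p *= A[seq[t - 1]][seq[t]] * B[seq[t]][obs[t]]
        total += p
    return total

obs = [0, 1, 1, 0, 1, 0, 0, 1]   # H T T H T H H T
print(p_obs_direct(obs))          # already 3^8 = 6561 sequences for T = 8
```

For T = 100 this loop would never finish, which is exactly why the forward vector is introduced next.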
Problem 1 - Forward Vector
Define the forward vector: the probability of the first t observations together with state i at time t,
  \alpha_t(i) = P(O_1, \dots, O_t, i_t = i)
By marginalization, the product rule, and C.I.:
  \alpha_{t+1}(j) = \Big[ \sum_{i=1}^{N} \alpha_t(i) \, a_{ij} \Big] b_j(O_{t+1})
Base case of the recursion:
  \alpha_1(i) = \pi_i \, b_i(O_1)
Problem 1 - Forward Vector
Now we can write P(O) in terms of the forward vector; by marginalization,
  P(O) = \sum_{i=1}^{N} \alpha_T(i)
Problem 1 - Using the Forward Vector
1. Compute the forward vector at t = 1:  \alpha_1(i) = \pi_i b_i(O_1), for i = 1..N
2. Compute the forward vectors for t = 2 to T:  \alpha_t(j) = \big[ \sum_i \alpha_{t-1}(i) a_{ij} \big] b_j(O_t), for j = 1..N
3. Compute the probability of the observation sequence:  P(O) = \sum_i \alpha_T(i)
The number of multiplications is of the order of N^2 T.
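The three steps above can be sketched directly as code (the π, A, B values are the same illustrative assumptions as before, not from the slides):

```python
# Forward algorithm: O(N^2 T) instead of O(T * N^T).
# Illustrative parameters (assumptions, not from the slides); 0 = H, 1 = T.
pi = [0.5, 0.25, 0.25]
A = [[0.6, 0.2, 0.2], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]]
B = [[0.5, 0.5], [0.7, 0.3], [0.3, 0.7]]

def forward(obs):
    """Return alpha[t][i] = P(O_1..O_{t+1}, state i at time t+1) (0-indexed t)."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]       # base case, t = 1
    for t in range(1, len(obs)):                             # recursion, t = 2..T
        alpha.append([
            sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
            for j in range(N)
        ])
    return alpha

def p_obs_forward(obs):
    """P(O) = sum_i alpha_T(i)."""
    return sum(forward(obs)[-1])

obs = [0, 1, 1, 0, 1, 0, 0, 1]   # H T T H T H H T
print(p_obs_forward(obs))
```

For these parameters the result agrees with the brute-force sum over state sequences, while touching each (t, i, j) triple only once.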
Problem 1 - Using the Backward Vector
Define the backward vector: the probability of the observations from time t+1 to T given state i at time t,
  \beta_t(i) = P(O_{t+1}, \dots, O_T \mid i_t = i)
1. Compute the backward vector at t = T:  \beta_T(i) = 1, for i = 1..N
2. Compute the backward vectors for t = T-1 down to 1:  \beta_t(i) = \sum_j a_{ij} b_j(O_{t+1}) \beta_{t+1}(j), for i = 1..N
3. Compute the probability of the observation sequence:  P(O) = \sum_i \pi_i b_i(O_1) \beta_1(i)
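The backward recursion can be sketched the same way (again with the illustrative π, A, B assumptions):

```python
# Backward algorithm for P(O).
# Illustrative parameters (assumptions, not from the slides); 0 = H, 1 = T.
pi = [0.5, 0.25, 0.25]
A = [[0.6, 0.2, 0.2], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]]
B = [[0.5, 0.5], [0.7, 0.3], [0.3, 0.7]]

def backward(obs):
    """Return beta rows for t = 1..T; beta[t][i] = P(O_{t+2}.. | state i) (0-indexed)."""
    N, T = len(pi), len(obs)
    beta = [[1.0] * N]                       # row for t = T: beta_T(i) = 1
    for t in range(T - 2, -1, -1):           # fill rows T-1 down to 1
        beta.insert(0, [
            sum(A[i][j] * B[j][obs[t + 1]] * beta[0][j] for j in range(N))
            for i in range(N)
        ])
    return beta

def p_obs_backward(obs):
    """P(O) = sum_i pi_i b_i(O_1) beta_1(i)."""
    b1 = backward(obs)[0]
    return sum(pi[i] * B[i][obs[0]] * b1[i] for i in range(len(pi)))

print(p_obs_backward([0, 1, 1, 0, 1, 0, 0, 1]))
```

For any observation sequence this returns the same P(O) as the forward computation, which is a useful sanity check when implementing both.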
Reinforcement Learning
Agent and environment interact at discrete time steps: t = 0, 1, 2, ...
● Agent observes state at step t: s_t ∈ S
● Produces action at step t: a_t ∈ A(s_t)
● Gets resulting reward r_{t+1} ∈ ℝ and resulting next state s_{t+1}
● Policy at step t, π_t: a mapping from states to action probabilities; π_t(s, a) = probability that a_t = a when s_t = s
Goals and Rewards
● Reward: a single number r_t at each time step
● Agent's goal: maximize cumulative reward in the long run
● Examples of rewards:
  ○ Maze: +1 for escape, -1 for each time step prior to escape
  ○ Walking: proportional to the robot's forward motion
● The reward signals what the agent should achieve, not how it should be achieved
  ○ Chess: reward only for winning, not for achieving sub-goals
Returns - Formalizing the Goal
Reward sequence after time t: r_{t+1}, r_{t+2}, r_{t+3}, ...
Return R_t: a function of the reward sequence; the agent maximizes the expected return.
Episodic tasks:  R_t = r_{t+1} + r_{t+2} + \dots + r_T
Continuing tasks (discounted return):  R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}
where γ (0 ≤ γ ≤ 1) is the discount rate.
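For a finite reward sequence, the discounted return is a one-line sum; the rewards and γ below are illustrative:

```python
def discounted_return(rewards, gamma):
    """R_t = sum_k gamma^k * r_{t+k+1} for a finite reward sequence."""
    R = 0.0
    for k, r in enumerate(rewards):
        R += (gamma ** k) * r
    return R

# Three steps of reward 1 with gamma = 0.9: 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1, 1, 1], 0.9))
```

With γ < 1 the infinite-horizon sum converges whenever rewards are bounded, which is what makes continuing tasks well-defined.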
Example: Pole Balancing
Failure: the pole falling beyond a critical angle or the cart hitting the end of the track.
As an episodic task: the episode ends upon failure; reward = +1 for each step before failure; return = number of steps before failure.
As a continuing task with discounted return: reward = -1 upon failure, 0 otherwise; return = -γ^k, for k steps before failure.
In either case, the return is maximized by avoiding failure for as long as possible.
The Markov Property
In the general case, the environment's response at t+1 may depend on everything that has happened earlier:
  P(s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \dots, r_1, s_0, a_0)
If the state has the Markov property, the environment's response at t+1 depends only on the state and action at time t:
  P(s_{t+1} = s', r_{t+1} = r \mid s_t, a_t)
Markov Decision Process (MDP)
Definition: a reinforcement learning task that satisfies the Markov property.
Transition probability:  P^a_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s, a_t = a \}
Expected value of next reward:  R^a_{ss'} = E\{ r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s' \}
Recycling Robot MDP
❏ At each time step the robot must decide whether to 1) actively search for a can, 2) remain stationary and wait for someone to bring it a can, or 3) go back home to recharge its battery
❏ Reward = number of cans collected
❏ Searching collects more cans but drains the battery; if the battery runs out, the robot has to be rescued
❏ The decision is based solely on the energy level of the battery
Recycling Robot MDP, cont'd
❏ Searching beginning with high energy leaves the energy level high with probability α and low with probability 1 - α
❏ Searching beginning with low energy leaves the energy level low with probability β and depleted with probability 1 - β
❏ Each collected can counts as one unit of reward
❏ A rescue results in a reward of -3
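One way to encode these dynamics is as a nested mapping P[s][a] = list of (probability, next state, reward) triples. The values of α, β, and the expected can counts below are illustrative numbers, since the slides leave them symbolic:

```python
# Recycling-robot dynamics as P[state][action] = [(prob, next_state, reward), ...].
# alpha, beta, r_search, r_wait are illustrative assumptions, not from the slides.
alpha, beta = 0.9, 0.6        # P(stay high | search from high), P(stay low | search from low)
r_search, r_wait = 3.0, 1.0   # expected cans collected per step

P = {
    'high': {
        'search': [(alpha, 'high', r_search), (1 - alpha, 'low', r_search)],
        'wait':   [(1.0, 'high', r_wait)],
    },
    'low': {
        # depleted battery -> rescued, returned to 'high' with reward -3
        'search':   [(beta, 'low', r_search), (1 - beta, 'high', -3.0)],
        'wait':     [(1.0, 'low', r_wait)],
        'recharge': [(1.0, 'high', 0.0)],
    },
}

# Sanity check: outgoing transition probabilities sum to one for every (s, a).
for s, actions in P.items():
    for a, outcomes in actions.items():
        assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-12
```

This representation feeds directly into the value-function computations that follow: every Bellman backup is just a loop over the (prob, next_state, reward) triples.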
Recycling Robot MDP, cont'd
State-Value Function for policy π
❏ State-value function: the expected return when starting in state s and following policy π,
  V^π(s) = E_π[ R_t \mid s_t = s ]
❏ Policy: a mapping π(s, a) from each state s to the probability of taking action a in state s
State-Value Function, cont'd
Backup diagram. Bellman equation for V^π:
  V^π(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V^π(s') \big]
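Iterative policy evaluation applies the Bellman equation for V^π as an update rule until the values stop changing. The two-state MDP and policy below are made-up examples, not from the slides:

```python
# Iterative policy evaluation on a tiny made-up MDP.
gamma = 0.9
# P[s][a] = [(prob, next_state, reward), ...] -- illustrative dynamics
P = {
    's0': {'a': [(1.0, 's1', 1.0)]},   # s0 --a--> s1, reward 1
    's1': {'a': [(1.0, 's0', 0.0)]},   # s1 --a--> s0, reward 0
}
policy = {'s0': {'a': 1.0}, 's1': {'a': 1.0}}   # pi(s, a)

def policy_evaluation(P, policy, gamma, tol=1e-10):
    """Sweep the Bellman equation for V^pi until the largest change < tol."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = sum(policy[s][a] *
                    sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                    for a, outcomes in P[s].items())
            delta = max(delta, abs(v - V[s]))
            V[s] = v               # in-place (Gauss-Seidel style) update
        if delta < tol:
            return V

V = policy_evaluation(P, policy, gamma)
print(V)   # fixed point: V(s0) = 1 + 0.9*V(s1), V(s1) = 0.9*V(s0)
```

Because the Bellman operator is a γ-contraction, the sweep converges to the unique fixed point from any starting values.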
Action-Value Function for policy π
Action-value function: the expected return starting from state s, taking action a, and then following policy π,
  Q^π(s, a) = E_π[ R_t \mid s_t = s, a_t = a ]
Gridworld
Actions: north, south, east, west
Rewards: 1) -1 for an action that would take the agent off the grid; 2) 0 for all other actions, except those taken in the special states A and B; 3) +10 for any action in state A; 4) +5 for any action in state B
Optimal State-Value Functions
❏ There are always one or more policies that are better than or equal to all others. These are the optimal policies, denoted π*
❏ Optimal state-value function:  V^*(s) = \max_\pi V^\pi(s)
Bellman optimality equation for V^* (backup diagram):
  V^*(s) = \max_a \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V^*(s') \big]
The Bellman optimality equation has a unique solution, independent of the policy.
Optimal Action-Value Function
The optimal action-value function gives the expected return for taking action a in state s and then following the optimal policy.
Bellman optimality equation for Q^* (backup diagram):
  Q^*(s, a) = \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \big]
Greedy policy
● For each state, there will be one or more actions at which the maximum in the Bellman optimality equation is obtained. Any policy that assigns non-zero probability only to those actions is an optimal policy (a greedy policy).
● If one uses the optimal value function to evaluate the one-step consequences of actions, then the greedy policy is actually optimal in the long-term sense.
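Value iteration turns the Bellman optimality equation into an update rule and then reads off the greedy policy from the converged values. The tiny two-state MDP below is a made-up example, not from the slides:

```python
# Value iteration + greedy policy extraction on a made-up MDP.
gamma = 0.9
# P[s][a] = [(prob, next_state, reward), ...] -- illustrative dynamics
P = {
    's0': {'left':  [(1.0, 's0', 0.0)],
           'right': [(1.0, 's1', 1.0)]},
    's1': {'left':  [(1.0, 's0', 0.0)],
           'right': [(1.0, 's1', 2.0)]},   # best: stay in s1 forever
}

def value_iteration(P, gamma, tol=1e-10):
    """Apply V(s) <- max_a sum_{s'} p * (r + gamma * V(s')) until convergence."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = max(sum(p * (r + gamma * V[s2]) for p, s2, r in outs)
                    for outs in P[s].values())
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            break
    # Greedy policy: in each state, pick an action achieving the max above.
    greedy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                             for p, s2, r in P[s][a]))
              for s in P}
    return V, greedy

V, pi_star = value_iteration(P, gamma)
print(V)         # V(s1) = 2/(1 - 0.9) = 20, V(s0) = 1 + 0.9*20 = 19
print(pi_star)   # 'right' in both states
```

The extracted policy is exactly the greedy policy described above: it is optimal precisely because the values it is greedy with respect to are the optimal ones.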
Bellman Optimality Equations for the Recycling Robot
Yu & Dayan 2005
● Expected uncertainty and unexpected uncertainty
● Acetylcholine (ACh) and norepinephrine (NE)
● Observed phenomena
Task Paradigm
Model for the Task and Neurochemistry
Generic Internal Model
Three Models● The Ideal Learner
● The Approximate Inference Model
● The Bottom-up Naive Model
Model Comparison
Cost:
NE and ACh, Using the Approximate Model
Without Depletion of NE or ACh
With Depletion of NE or ACh
Different “Depletion” Scenario