Hidden Markov Models and Reinforcement Learning — ajyu/teaching/cogs202_sp14/slides/lect6.pdf
TRANSCRIPT
Hidden Markov Models / Reinforcement Learning
Week 6 Presentation
Yashodhan, Chun, Ning
Hidden Markov Models (HMM)
Questions:
● What are HMMs useful for?
● What are some of the assumptions underlying HMMs?
● What are the 3 problems for HMMs? Explain each in terms of the coin toss example.
Coin Toss Example
3 coins: C1, C2, C3. Select a coin at random, flip it, and repeat.
Given only the outcome sequence HTTHTHHT, can we find out the sequence of coins that was chosen?
Coin:    C1 C2 C1 C3 C1 C1 C2 C3
Outcome: H  T  T  H  T  H  H  T
Hidden Markov Model
Coin:    C1 C2 C1 C3 C1 C1 C2 C3
Outcome: H  T  T  H  T  H  H  T
State sequence: i1, i2, ..., iT
Observation sequence: O1, O2, ..., OT
N = number of distinct states (N = 3 here)
M = number of distinct observation symbols (M = 2 here)
T = length of observation sequence (T = 8 here)
Denote the N states by 1, 2, ..., N (state i corresponds to coin i being chosen)
Denote the M observation symbols by V = {v1, ..., vM} (v1 = H, v2 = T)
Hidden Markov Model
Component: initial state distribution π = {πi}
  Meaning: πi is the probability of being in state i at t = 1
  Example: for i = 1, this is the probability of choosing coin 1 at t = 1
Component: transition matrix A = {aij}
  Meaning: aij is the probability of a transition from state i to state j
  Example: a12 is the probability of choosing coin 2 immediately after coin 1
Component: emission matrix B = {bj(k)}
  Meaning: bj(k) is the probability of observing vk in state j
  Example: b1(2) is the probability of observing Tails given that coin 1 has been chosen
Assumptions
● Finite context: the next state depends only on the current state, not on the earlier history.
● Shared distributions: the transition and emission distributions are the same at every time step (they do not depend on t).
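The coin-toss HMM above can be sketched as a small generative simulation. The parameter values for π, A, and B below are illustrative assumptions, since the slides leave them unspecified:

```python
import random

# Hypothetical parameters for the 3-coin example (illustrative, not from the slides).
pi = [0.5, 0.25, 0.25]            # initial coin-choice probabilities
A = [[0.6, 0.2, 0.2],             # A[i][j] = P(coin j next | coin i now)
     [0.3, 0.4, 0.3],
     [0.2, 0.3, 0.5]]
B = [[0.5, 0.5],                  # B[i][k] = P(symbol k | coin i), 0 = H, 1 = T
     [0.7, 0.3],
     [0.3, 0.7]]

def sample(probs):
    """Draw an index according to a discrete distribution."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def generate(T):
    """Generate a hidden state (coin) sequence and its observation sequence."""
    states, obs = [], []
    s = sample(pi)
    for _ in range(T):
        states.append(s)          # hidden: which coin was chosen
        obs.append(sample(B[s]))  # visible: the flip outcome
        s = sample(A[s])          # Markov transition to the next coin
    return states, obs

states, obs = generate(8)
print("coins:   ", [f"C{i + 1}" for i in states])
print("outcomes:", ["HT"[o] for o in obs])
```

The point of the exercise is that only the second printed line is observable; the three problems below all reason backwards from it.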
Three Problems for HMMs
1. Probability of the observation sequence
2. Choosing the most likely state sequence
3. Estimating the parameters of the HMM
Problem 1 - Direct computation
By marginalization, the product rule, and the conditional independence (C.I.) assumptions:
  P(O) = \sum_{i_1,\dots,i_T} \pi_{i_1} b_{i_1}(O_1) \, a_{i_1 i_2} b_{i_2}(O_2) \cdots a_{i_{T-1} i_T} b_{i_T}(O_T)
This involves about 2T·N^T multiplications. For N = 5, T = 100, this is ~10^72 multiplications.
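The direct computation can be sketched by literally summing over all N^T state sequences, which is feasible only for tiny T. The parameters are the same illustrative π, A, B assumptions used for the coin example, not values from the slides:

```python
from itertools import product

# Illustrative parameters (assumptions, not from the slides); 0 = H, 1 = T.
pi = [0.5, 0.25, 0.25]
A = [[0.6, 0.2, 0.2], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]]
B = [[0.5, 0.5], [0.7, 0.3], [0.3, 0.7]]

def p_obs_direct(obs):
    """P(O) by summing over every possible state sequence: O(T * N^T) work."""
    N, T = len(pi), len(obs)
    total = 0.0
    for seq in product(range(N), repeat=T):   # all N^T hidden sequences
        p = pi[seq[0]] * B[seq[0]][obs[0]]
        for t in range(1, T):
            p *= A[seq[t - 1]][seq[t]] * B[seq[t]][obs[t]]
        total += p
    return total

obs = [0, 1, 1, 0, 1, 0, 0, 1]   # H T T H T H H T
print(p_obs_direct(obs))          # already 3^8 = 6561 sequences for T = 8
```

For T = 100 this loop would never finish, which is exactly why the forward vector is introduced next.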
Problem 1 - Forward Vector
Define the forward vector: the probability of the first t observations together with state i at time t,
  \alpha_t(i) = P(O_1, \dots, O_t, i_t = i)
By marginalization, the product rule, and C.I.:
  \alpha_{t+1}(j) = \Big[ \sum_{i=1}^{N} \alpha_t(i) \, a_{ij} \Big] b_j(O_{t+1})
Base case of the recursion:
  \alpha_1(i) = \pi_i \, b_i(O_1)
Problem 1 - Forward Vector
Now we can write P(O) in terms of the forward vector; by marginalization,
  P(O) = \sum_{i=1}^{N} \alpha_T(i)
Problem 1 - Using the Forward Vector
1. Compute the forward vector at t = 1:  \alpha_1(i) = \pi_i b_i(O_1), for i = 1..N
2. Compute the forward vectors for t = 2 to T:  \alpha_t(j) = \big[ \sum_i \alpha_{t-1}(i) a_{ij} \big] b_j(O_t), for j = 1..N
3. Compute the probability of the observation sequence:  P(O) = \sum_i \alpha_T(i)
The number of multiplications is of the order of N^2 T.
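The three steps above can be sketched directly as code (the π, A, B values are the same illustrative assumptions as before, not from the slides):

```python
# Forward algorithm: O(N^2 T) instead of O(T * N^T).
# Illustrative parameters (assumptions, not from the slides); 0 = H, 1 = T.
pi = [0.5, 0.25, 0.25]
A = [[0.6, 0.2, 0.2], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]]
B = [[0.5, 0.5], [0.7, 0.3], [0.3, 0.7]]

def forward(obs):
    """Return alpha[t][i] = P(O_1..O_{t+1}, state i at time t+1) (0-indexed t)."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]       # base case, t = 1
    for t in range(1, len(obs)):                             # recursion, t = 2..T
        alpha.append([
            sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
            for j in range(N)
        ])
    return alpha

def p_obs_forward(obs):
    """P(O) = sum_i alpha_T(i)."""
    return sum(forward(obs)[-1])

obs = [0, 1, 1, 0, 1, 0, 0, 1]   # H T T H T H H T
print(p_obs_forward(obs))
```

For these parameters the result agrees with the brute-force sum over state sequences, while touching each (t, i, j) triple only once.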
Problem 1 - Using the Backward Vector
Define the backward vector: the probability of the observations from time t+1 to T given state i at time t,
  \beta_t(i) = P(O_{t+1}, \dots, O_T \mid i_t = i)
1. Compute the backward vector at t = T:  \beta_T(i) = 1, for i = 1..N
2. Compute the backward vectors for t = T-1 down to 1:  \beta_t(i) = \sum_j a_{ij} b_j(O_{t+1}) \beta_{t+1}(j), for i = 1..N
3. Compute the probability of the observation sequence:  P(O) = \sum_i \pi_i b_i(O_1) \beta_1(i)
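The backward recursion can be sketched the same way (again with the illustrative π, A, B assumptions):

```python
# Backward algorithm for P(O).
# Illustrative parameters (assumptions, not from the slides); 0 = H, 1 = T.
pi = [0.5, 0.25, 0.25]
A = [[0.6, 0.2, 0.2], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]]
B = [[0.5, 0.5], [0.7, 0.3], [0.3, 0.7]]

def backward(obs):
    """Return beta rows for t = 1..T; beta[t][i] = P(O_{t+2}.. | state i) (0-indexed)."""
    N, T = len(pi), len(obs)
    beta = [[1.0] * N]                       # row for t = T: beta_T(i) = 1
    for t in range(T - 2, -1, -1):           # fill rows T-1 down to 1
        beta.insert(0, [
            sum(A[i][j] * B[j][obs[t + 1]] * beta[0][j] for j in range(N))
            for i in range(N)
        ])
    return beta

def p_obs_backward(obs):
    """P(O) = sum_i pi_i b_i(O_1) beta_1(i)."""
    b1 = backward(obs)[0]
    return sum(pi[i] * B[i][obs[0]] * b1[i] for i in range(len(pi)))

print(p_obs_backward([0, 1, 1, 0, 1, 0, 0, 1]))
```

For any observation sequence this returns the same P(O) as the forward computation, which is a useful sanity check when implementing both.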
Reinforcement Learning
Agent and environment interact at discrete time steps: t = 0, 1, 2, ...
● Agent observes state at step t: s_t ∈ S
● Produces action at step t: a_t ∈ A(s_t)
● Gets resulting reward r_{t+1} ∈ ℝ and resulting next state s_{t+1}
● Policy at step t, π_t: a mapping from states to action probabilities; π_t(s, a) = probability that a_t = a when s_t = s
Goals and Rewards
● Reward: a single number r_t at each time step
● Agent's goal: maximize cumulative reward in the long run
● Examples of rewards:
  ○ Maze: +1 for escape, -1 for each time step prior to escape
  ○ Walking: proportional to the robot's forward motion
● The reward signals what the agent should achieve, not how it should be achieved
  ○ Chess: reward only for winning, not for achieving sub-goals
Returns - Formalizing the Goal
Reward sequence after time t: r_{t+1}, r_{t+2}, r_{t+3}, ...
Return R_t: a function of the reward sequence; the agent maximizes the expected return.
Episodic tasks:  R_t = r_{t+1} + r_{t+2} + \dots + r_T
Continuing tasks (discounted return):  R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}
where γ (0 ≤ γ ≤ 1) is the discount rate.
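For a finite reward sequence, the discounted return is a one-line sum; the rewards and γ below are illustrative:

```python
def discounted_return(rewards, gamma):
    """R_t = sum_k gamma^k * r_{t+k+1} for a finite reward sequence."""
    R = 0.0
    for k, r in enumerate(rewards):
        R += (gamma ** k) * r
    return R

# Three steps of reward 1 with gamma = 0.9: 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1, 1, 1], 0.9))
```

With γ < 1 the infinite-horizon sum converges whenever rewards are bounded, which is what makes continuing tasks well-defined.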
Example: Pole Balancing
Failure: the pole falling beyond a critical angle or the cart hitting the end of the track.
As an episodic task: the episode ends upon failure; reward = +1 for each step before failure; return = number of steps before failure.
As a continuing task with discounted return: reward = -1 upon failure, 0 otherwise; return = -γ^k, for k steps before failure.
In either case, the return is maximized by avoiding failure for as long as possible.
The Markov Property
In the general case, the environment's response at t+1 may depend on everything that has happened earlier:
  P(s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \dots, r_1, s_0, a_0)
If the state has the Markov property, the environment's response at t+1 depends only on the state and action at time t:
  P(s_{t+1} = s', r_{t+1} = r \mid s_t, a_t)
Markov Decision Process (MDP)
Definition: a reinforcement learning task that satisfies the Markov property.
Transition probability:  P^a_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s, a_t = a \}
Expected value of next reward:  R^a_{ss'} = E\{ r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s' \}
Recycling Robot MDP
❏ At each time step the robot must decide whether to 1) actively search for a can, 2) remain stationary and wait for someone to bring it a can, or 3) go back home to recharge its battery
❏ Reward = number of cans collected
❏ Searching collects more cans but drains the battery; if the battery runs out, the robot has to be rescued
❏ The decision is based solely on the energy level of the battery
Recycling Robot MDP, cont'd
❏ Searching beginning with high energy leaves the energy level high with probability α and low with probability 1 - α
❏ Searching beginning with low energy leaves the energy level low with probability β and depleted with probability 1 - β
❏ Each collected can counts as one unit of reward
❏ A rescue results in a reward of -3
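One way to encode these dynamics is as a nested mapping P[s][a] = list of (probability, next state, reward) triples. The values of α, β, and the expected can counts below are illustrative numbers, since the slides leave them symbolic:

```python
# Recycling-robot dynamics as P[state][action] = [(prob, next_state, reward), ...].
# alpha, beta, r_search, r_wait are illustrative assumptions, not from the slides.
alpha, beta = 0.9, 0.6        # P(stay high | search from high), P(stay low | search from low)
r_search, r_wait = 3.0, 1.0   # expected cans collected per step

P = {
    'high': {
        'search': [(alpha, 'high', r_search), (1 - alpha, 'low', r_search)],
        'wait':   [(1.0, 'high', r_wait)],
    },
    'low': {
        # depleted battery -> rescued, returned to 'high' with reward -3
        'search':   [(beta, 'low', r_search), (1 - beta, 'high', -3.0)],
        'wait':     [(1.0, 'low', r_wait)],
        'recharge': [(1.0, 'high', 0.0)],
    },
}

# Sanity check: outgoing transition probabilities sum to one for every (s, a).
for s, actions in P.items():
    for a, outcomes in actions.items():
        assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-12
```

This representation feeds directly into the value-function computations that follow: every Bellman backup is just a loop over the (prob, next_state, reward) triples.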
Recycling Robot MDP, cont'd
State-Value Function for policy π
❏ State-value function: the expected return when starting in state s and following policy π,
  V^π(s) = E_π[ R_t \mid s_t = s ]
❏ Policy: a mapping π(s, a) from each state s to the probability of taking action a in state s
State-Value Function, cont'd
Backup diagram. Bellman equation for V^π:
  V^π(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V^π(s') \big]
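Iterative policy evaluation applies the Bellman equation for V^π as an update rule until the values stop changing. The two-state MDP and policy below are made-up examples, not from the slides:

```python
# Iterative policy evaluation on a tiny made-up MDP.
gamma = 0.9
# P[s][a] = [(prob, next_state, reward), ...] -- illustrative dynamics
P = {
    's0': {'a': [(1.0, 's1', 1.0)]},   # s0 --a--> s1, reward 1
    's1': {'a': [(1.0, 's0', 0.0)]},   # s1 --a--> s0, reward 0
}
policy = {'s0': {'a': 1.0}, 's1': {'a': 1.0}}   # pi(s, a)

def policy_evaluation(P, policy, gamma, tol=1e-10):
    """Sweep the Bellman equation for V^pi until the largest change < tol."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = sum(policy[s][a] *
                    sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                    for a, outcomes in P[s].items())
            delta = max(delta, abs(v - V[s]))
            V[s] = v               # in-place (Gauss-Seidel style) update
        if delta < tol:
            return V

V = policy_evaluation(P, policy, gamma)
print(V)   # fixed point: V(s0) = 1 + 0.9*V(s1), V(s1) = 0.9*V(s0)
```

Because the Bellman operator is a γ-contraction, the sweep converges to the unique fixed point from any starting values.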
Action-Value Function for policy π
Action-value function: the expected return starting from state s, taking action a, and then following policy π,
  Q^π(s, a) = E_π[ R_t \mid s_t = s, a_t = a ]
Gridworld
Actions: north, south, east, west
Rewards: 1) -1 for an action that would take the agent off the grid; 2) 0 for all other actions, except those taken in the special states A and B; 3) +10 for any action in state A; 4) +5 for any action in state B
Optimal State-Value Functions
❏ There are always one or more policies that are better than or equal to all others. These are the optimal policies, denoted π*
❏ Optimal state-value function:  V^*(s) = \max_\pi V^\pi(s)
Bellman optimality equation for V^* (backup diagram):
  V^*(s) = \max_a \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V^*(s') \big]
The Bellman optimality equation has a unique solution, independent of the policy.
Optimal Action-Value Function
The optimal action-value function gives the expected return for taking action a in state s and then following the optimal policy.
Bellman optimality equation for Q^* (backup diagram):
  Q^*(s, a) = \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \big]
Greedy policy
● For each state, there will be one or more actions at which the maximum in the Bellman optimality equation is obtained. Any policy that assigns non-zero probability only to those actions is an optimal policy (a greedy policy).
● If one uses the optimal value function to evaluate the one-step consequences of actions, then the greedy policy is actually optimal in the long-term sense.
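Value iteration turns the Bellman optimality equation into an update rule and then reads off the greedy policy from the converged values. The tiny two-state MDP below is a made-up example, not from the slides:

```python
# Value iteration + greedy policy extraction on a made-up MDP.
gamma = 0.9
# P[s][a] = [(prob, next_state, reward), ...] -- illustrative dynamics
P = {
    's0': {'left':  [(1.0, 's0', 0.0)],
           'right': [(1.0, 's1', 1.0)]},
    's1': {'left':  [(1.0, 's0', 0.0)],
           'right': [(1.0, 's1', 2.0)]},   # best: stay in s1 forever
}

def value_iteration(P, gamma, tol=1e-10):
    """Apply V(s) <- max_a sum_{s'} p * (r + gamma * V(s')) until convergence."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = max(sum(p * (r + gamma * V[s2]) for p, s2, r in outs)
                    for outs in P[s].values())
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            break
    # Greedy policy: in each state, pick an action achieving the max above.
    greedy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                             for p, s2, r in P[s][a]))
              for s in P}
    return V, greedy

V, pi_star = value_iteration(P, gamma)
print(V)         # V(s1) = 2/(1 - 0.9) = 20, V(s0) = 1 + 0.9*20 = 19
print(pi_star)   # 'right' in both states
```

The extracted policy is exactly the greedy policy described above: it is optimal precisely because the values it is greedy with respect to are the optimal ones.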
Bellman Optimality Equations for the Recycling Robot
Yu & Dayan 2005
● Expected uncertainty and unexpected uncertainty
● Acetylcholine (ACh) and norepinephrine (NE)
● Observed phenomena
Task Paradigm
Model for the Task and Neurochemistry
Generic Internal Model
Three Models● The Ideal Learner
● The Approximate Inference Model
● The Bottom-up Naive Model
Model Comparison
Cost:
NE and ACh, Using the Approximate Model
Without Depletion of NE or ACh
With Depletion of NE or ACh
Different “Depletion” Scenario