Topics in Computational Sustainability
CS 325, Spring 2016
Making Choices: Sequential Decision Making

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
Stochastic programming
[Figure: a decision node with available actions {a(pa), b(pb), c(pc)} and a probabilistic model of outcomes (e.g., wet vs. dry); the chosen decision maximizes expected utility, or equivalently minimizes expected cost]
Problem Setup
Given a limited budget, which parcels should I conserve to maximize the expected number of occupied territories in 50 years?
! "#$%#"
Conserved
parcels
Available
parcelsCurrent
territories
Potential
territories
Metapopulation = Cascade
[Figure: the same territory nodes i, j, k, l, m repeated across five time-step layers of the graph]
• Metapopulation model can be viewed as a cascade in the layered graph representing territories over time
Target nodes: territories at final time step
Management Actions
• Conserving parcels adds nodes to the network to create new pathways for the cascade
[Figure: animation sequence showing the initial network, then Parcel 1 and Parcel 2 being added to it]
Cascade Optimization Problem
Given:
• Patch network
– Initially occupied territories
– Colonization and extinction probabilities
• Management actions
– Already-conserved parcels
– List of available parcels and their costs
• Time horizon T
• Budget B

Find the set of parcels with total cost at most B that maximizes the expected number of occupied territories at time T.
Can we make our decision adaptively?
Sequential decision making
• We have a system that changes state over time
• We can (partially) control the system's state transitions by taking actions
• The problem gives an objective that specifies which states (or state sequences) are more or less preferred
• Problem: at each time step, select an action to optimize the overall (long-term) objective
– Produce the most preferred sequences of states
Discounted Rewards/Costs
An assistant professor gets paid, say, 20K per year.
How much, in total, will the A.P. earn in their life?
20 + 20 + 20 + 20 + 20 + … = Infinity
What’s wrong with this argument?
Discounted Rewards
"A reward (payment) in the future is not worth quite as much as a reward now."
– Because of the chance of obliteration
– Because of inflation
Example:
Being promised $10,000 next year is worth only 90% as
much as receiving $10,000 right now.
Assuming a payment n years in the future is worth only (0.9)^n of a payment now, what is the A.P.'s future discounted sum of rewards?
Infinite Sum
Assuming a discount rate of 0.9, how much does the assistant professor get in total?

x = 20 + 0.9·20 + 0.9²·20 + 0.9³·20 + …
  = 20 + 0.9 (20 + 0.9·20 + 0.9²·20 + …)
so x = 20 + 0.9x, giving x = 20/0.1 = 200.
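A quick sanity check in Python, using nothing beyond the slide's own numbers: the partial sums of this series do approach 200.

total = 0.0
for n in range(200):
    total += 20 * 0.9 ** n   # a payment n years out is worth 0.9^n of face value
print(total)                 # 199.999..., converging to 20 / (1 - 0.9) = 200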
Discount Factors

People in economics and probabilistic decision-making do this all the time. The "discounted sum of future rewards" using discount factor γ is

(reward now)
+ γ (reward in 1 time step)
+ γ² (reward in 2 time steps)
+ γ³ (reward in 3 time steps)
+ …   (an infinite sum)
Markov System: the Academic Life
Define:
JA = Expected discounted future rewards starting in state A
JB = Expected discounted future rewards starting in state B
JT, JS, JD = the same, starting in states T, S, and D
How do we compute JA, JB, JT, JS, JD ?
[Figure: Markov chain of the academic life with per-state rewards: A. Assistant Prof (+20), B. Assoc. Prof (+60), T. Tenured Prof (+400), S. On the Street (+10), D. Dead (0). From A: stay with 0.6, to B with 0.2, to S with 0.2; from B: stay with 0.6, to S with 0.2, to T with 0.2; from S: stay with 0.7, to D with 0.3; from T: stay with 0.7, to D with 0.3]
Working Backwards
[Figure: the same chain with T. Tenured Prof's reward set to 100, D. Dead absorbing (self-loop probability 1.0), and discount factor 0.9. Working backwards from D gives J(D) = 0, J(S) = 27, J(T) = 270, J(B) = 247, J(A) = 151]
Reincarnation?
[Figure: same states, rewards, and discount factor 0.9, except D. Dead is no longer absorbing: with probability 0.5 it stays in D and with probability 0.5 it transitions back to A. Assistant Prof]
System of Equations
J(A) = 20 + 0.9 (0.6 J(A) + 0.2 J(B) + 0.2 J(S))
J(B) = 60 + 0.9 (0.6 J(B) + 0.2 J(S) + 0.2 J(T))
J(S) = 10 + 0.9 (0.7 J(S) + 0.3 J(D))
J(T) = 100 + 0.9 (0.7 J(T) + 0.3 J(D))
J(D) = 0 + 0.9 (0.5 J(D) + 0.5 J(A))
Solving a Markov System with Matrix Inversion
• Upside: you get an exact answer
• Downside: if you have 100,000 states, you're solving a 100,000 × 100,000 system of equations
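Concretely, the system can be written as J = r + γPJ, so J = (I - γP)⁻¹ r. A minimal numpy sketch for the working-backwards version of the chain (tenured reward 100, Dead absorbing); the state ordering A, B, S, T, D is my own convention:

import numpy as np

r = np.array([20.0, 60.0, 10.0, 100.0, 0.0])   # rewards for A, B, S, T, D
P = np.array([
    [0.6, 0.2, 0.2, 0.0, 0.0],   # Assistant Prof
    [0.0, 0.6, 0.2, 0.2, 0.0],   # Associate Prof
    [0.0, 0.0, 0.7, 0.0, 0.3],   # On the Street
    [0.0, 0.0, 0.0, 0.7, 0.3],   # Tenured Prof
    [0.0, 0.0, 0.0, 0.0, 1.0],   # Dead (absorbing)
])
gamma = 0.9
# Solve (I - gamma P) J = r rather than explicitly inverting the matrix
J = np.linalg.solve(np.eye(5) - gamma * P, r)
print(J)   # approx [151, 247, 27, 270, 0], matching the Working Backwards slide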
Value Iteration: another way to solve a Markov System

Define:
J1(Si) = expected discounted sum of rewards over the next 1 time step
J2(Si) = expected discounted sum of rewards over the next 2 steps
J3(Si) = expected discounted sum of rewards over the next 3 steps
:
Jk(Si) = expected discounted sum of rewards over the next k steps

J1(Si) = (what?)
J2(Si) = (what?)
Jk+1(Si) = (what?)
Value Iteration: another way to solve a Markov System

Define:
J1(Si) = expected discounted sum of rewards over the next 1 time step
J2(Si) = expected discounted sum of rewards over the next 2 steps
J3(Si) = expected discounted sum of rewards over the next 3 steps
:
Jk(Si) = expected discounted sum of rewards over the next k steps

J1(Si) = ri
J2(Si) = ri + γ Σj=1..N pij J1(Sj)
:
Jk+1(Si) = ri + γ Σj=1..N pij Jk(Sj)

(N = number of states)
Let's do Value Iteration

k   Jk(SUN)   Jk(WIND)   Jk(HAIL)
1
2
3
4
5

[Figure: three-state Markov system with γ = 0.5: SUN (+4), WIND (0), HAIL (-8); from SUN, stay sunny or turn windy with probability 1/2 each; from WIND, turn sunny or start hailing with probability 1/2 each; from HAIL, keep hailing or turn windy with probability 1/2 each]
Let's do Value Iteration

k   Jk(SUN)   Jk(WIND)   Jk(HAIL)
1   4         0          -8
2   5         -1         -10
3   5         -1.25      -10.75
4   4.94      -1.44      -11
5   4.88      -1.52      -11.11

[Figure: same SUN/WIND/HAIL system, γ = 0.5]
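The table can be reproduced directly from the recurrence Jk+1(Si) = ri + γ Σj pij Jk(Sj). A short numpy sketch, with the transition matrix read off the figure:

import numpy as np

r = np.array([4.0, 0.0, -8.0])   # rewards for SUN, WIND, HAIL
P = np.array([
    [0.5, 0.5, 0.0],   # SUN:  stay sunny or turn windy
    [0.5, 0.0, 0.5],   # WIND: turn sunny or start hailing
    [0.0, 0.5, 0.5],   # HAIL: turn windy or keep hailing
])
gamma = 0.5

J = r.copy()                  # J1(Si) = ri
print(1, J)
for k in range(2, 6):
    J = r + gamma * P @ J     # Jk+1 = r + gamma P Jk
    print(k, J)               # row k of the table; k=5 gives [4.88, -1.52, -11.11]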
Value Iteration for solving Markov Systems
• Compute J1(Si) for each i
• Compute J2(Si) for each i
:
• Compute Jk(Si) for each i

As k→∞, Jk(Si)→J*(Si).

When to stop? When

max_i |Jk+1(Si) - Jk(Si)| < ξ

This is faster than matrix inversion (which is O(N³)) if the transition matrix is sparse.
What if we have a way to interact with the Markov system?
A Markov Decision Process
γ = 0.9

You run a startup company. In every state you must choose between Saving money (S) or Advertising (A).

[Figure: four-state MDP with rewards Poor & Unknown (+0), Poor & Famous (+0), Rich & Unknown (+10), Rich & Famous (+10). Saving in PU keeps you in PU; Advertising in PU leads to PU or PF with probability 1/2 each; Saving in PF leads to PU or RF with probability 1/2 each; Advertising in PF keeps you in PF; Saving in RU leads to PU or RU with probability 1/2 each; Saving in RF leads to RU or RF with probability 1/2 each; Advertising in RU or RF leads back toward the poor, famous states]
Markov Decision Processes
An MDP has…
• A set of states {S1 ··· SN}
• A set of actions {a1 ··· aM}
• A set of rewards {r1 ··· rN} (one for each state)
• A transition probability function:
P^k_ij = Prob(Next = Sj | This = Si and I use action ak)
On each step:
0. Call the current state Si
1. Receive reward ri
2. Choose an action from {a1 ··· aM}
3. If you choose action ak, you'll move to state Sj with probability P^k_ij
4. All future rewards are discounted by γ
What’s a solution to an MDP? A sequence of actions?
A Policy
A policy is a mapping from states to actions.

Examples:

Policy Number 1:
STATE → ACTION
PU      S
PF      A
RU      S
RF      A

Policy Number 2:
STATE → ACTION
PU      A
PF      A
RU      A
RF      A
• How many possible policies in our example?
• Which of the above two policies is best?
• How do you compute the optimal policy?
[Figure: the startup MDP drawn once under each policy, showing from each state only the arcs of the chosen action]
Interesting Fact
For every MDP there exists an optimal policy.
It's a policy such that, for every possible start state, there is no better option than to follow the policy.
Computing the Optimal Policy
Idea One:
Run through all possible policies.
Select the best.
What's the problem?
Optimal Value Function
Define J*(Si) = Expected Discounted Future Rewards, starting from state Si, assuming we use the optimal policy
[Figure: three-state MDP with rewards S1 (+0), S2 (+3), S3 (+2), two actions (one labeled B), and arcs with probabilities 1, 1/2, and 1/3 as drawn]
Question: what is an optimal policy for this MDP (assume γ = 0.9)?
What is J*(S1) ?
What is J*(S2) ?
What is J*(S3) ?
Computing the Optimal Value Function with Value Iteration

Define:
Jk(Si) = maximum possible expected sum of discounted rewards I can get if I start at state Si and live for k time steps
Note that J1(Si) = ri
Let’s compute Jk(Si) for our example
k   Jk(PU)   Jk(PF)   Jk(RU)   Jk(RF)
1
2
3
4
5
6

k   Jk(PU)   Jk(PF)   Jk(RU)   Jk(RF)
1   0        0        10       10
2   0        4.5      14.5     19
3   2.03     8.55     16.52    25.08
Bellman's Equation

Jn+1(Si) = max_k [ ri + γ Σj=1..N P^k_ij Jn(Sj) ]

Converged when max_i |Jn+1(Si) - Jn(Si)| is sufficiently small.
Value Iteration for solving MDPs
• Compute J1(Si) for all i
• Compute J2(Si) for all i
• :
• Compute Jn(Si) for all i
… until converged

Also known as Dynamic Programming.
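To make the loop concrete, here is a sketch of value iteration with Bellman backups on the startup MDP. The transition matrices below are my reconstruction of the slides' figure from the Jk table (they reproduce rows 1-3 of the table exactly), so treat them as illustrative rather than authoritative:

import numpy as np

# State order: PU, PF, RU, RF
r = np.array([0.0, 0.0, 10.0, 10.0])
P = {   # one transition matrix per action (rows: current state, cols: next state)
    "S": np.array([
        [1.0, 0.0, 0.0, 0.0],   # PU: saving keeps you poor & unknown
        [0.5, 0.0, 0.0, 0.5],   # PF: fame may turn into riches
        [0.5, 0.0, 0.5, 0.0],   # RU: riches may fade
        [0.0, 0.0, 0.5, 0.5],   # RF
    ]),
    "A": np.array([
        [0.5, 0.5, 0.0, 0.0],   # PU: advertising may make you famous
        [0.0, 1.0, 0.0, 0.0],   # PF
        [0.5, 0.5, 0.0, 0.0],   # RU: advertising costs your riches
        [0.0, 1.0, 0.0, 0.0],   # RF
    ]),
}
gamma = 0.9
actions = list(P)

J = np.zeros(4)
while True:
    # Bellman backup: J(Si) <- ri + gamma * max_a sum_j P^a_ij J(Sj)
    Q = np.array([P[a] @ J for a in actions])
    J_new = r + gamma * Q.max(axis=0)
    if np.abs(J_new - J).max() < 1e-9:
        break
    J = J_new

policy = [actions[i] for i in Q.argmax(axis=0)]
print(J)        # approx [31.6, 38.6, 44.0, 54.2]
print(policy)   # ['A', 'S', 'S', 'S']: advertise only when poor & unknown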