Topics in Computational Sustainability
CS 325, Spring 2016
Making Choices: Sequential Decision Making

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
Stochastic programming
[Figure: a decision node with available actions {a(pa), b(pb), c(pc)} and a probabilistic model of outcomes (e.g., wet vs. dry); the chosen decision maximizes expected utility, or equivalently minimizes expected cost]
Problem Setup
Given a limited budget, which parcels should I conserve to maximize the expected number of occupied territories in 50 years?
! "#$%#"
Conserved
parcels
Available
parcelsCurrent
territories
Potential
territories
Metapopulation = Cascade
[Figure: the same territory nodes i, j, k, l, m repeated across five time-step layers of the graph]
• Metapopulation model can be viewed as a cascade in the layered graph representing territories over time
Target nodes: territories at final time step
Management Actions
• Conserving parcels adds nodes to the network to create new pathways for the cascade
[Figure: animation sequence showing the initial network, then Parcel 1 and Parcel 2 being added to it]
Cascade Optimization Problem
Given:
• Patch network
– Initially occupied territories
– Colonization and extinction probabilities
• Management actions
– Already-conserved parcels
– List of available parcels and their costs
• Time horizon T
• Budget B

Find the set of parcels with total cost at most B that maximizes the expected number of occupied territories at time T.
Can we make our decision adaptively?
Sequential decision making
• We have a system that changes state over time
• We can (partially) control the system's state transitions by taking actions
• The problem gives an objective that specifies which states (or state sequences) are more or less preferred
• Problem: at each time step, select an action to optimize the overall (long-term) objective
– Produce the most preferred sequences of states
Discounted Rewards/Costs
An assistant professor gets paid, say, 20K per year.
How much, in total, will the A.P. earn in their life?
20 + 20 + 20 + 20 + 20 + … = Infinity
What’s wrong with this argument?
Discounted Rewards
"A reward (payment) in the future is not worth quite as much as a reward now."
– Because of the chance of obliteration
– Because of inflation
Example:
Being promised $10,000 next year is worth only 90% as
much as receiving $10,000 right now.
Assuming a payment n years in the future is worth only (0.9)^n of a payment now, what is the A.P.'s future discounted sum of rewards?
Infinite Sum
Assuming a discount rate of 0.9, how much does the assistant professor get in total?

x = 20 + 0.9·20 + 0.9²·20 + 0.9³·20 + …
  = 20 + 0.9 (20 + 0.9·20 + 0.9²·20 + …)
so x = 20 + 0.9x, giving x = 20/0.1 = 200.
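A quick sanity check in Python, using nothing beyond the slide's own numbers: the partial sums of this series do approach 200.

total = 0.0
for n in range(200):
    total += 20 * 0.9 ** n   # a payment n years out is worth 0.9^n of face value
print(total)                 # 199.999..., converging to 20 / (1 - 0.9) = 200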
Discount Factors

People in economics and probabilistic decision-making do this all the time. The "discounted sum of future rewards" using discount factor γ is

(reward now)
+ γ (reward in 1 time step)
+ γ² (reward in 2 time steps)
+ γ³ (reward in 3 time steps)
+ …   (an infinite sum)
Markov System: the Academic Life
Define:
JA = Expected discounted future rewards starting in state A
JB = Expected discounted future rewards starting in state B
JT, JS, JD = the same, starting in states T, S, and D
How do we compute JA, JB, JT, JS, JD ?
[Figure: Markov chain of the academic life with per-state rewards: A. Assistant Prof (+20), B. Assoc. Prof (+60), T. Tenured Prof (+400), S. On the Street (+10), D. Dead (0). From A: stay with 0.6, to B with 0.2, to S with 0.2; from B: stay with 0.6, to S with 0.2, to T with 0.2; from S: stay with 0.7, to D with 0.3; from T: stay with 0.7, to D with 0.3]
Working Backwards
[Figure: the same chain with T. Tenured Prof's reward set to 100, D. Dead absorbing (self-loop probability 1.0), and discount factor 0.9. Working backwards from D gives J(D) = 0, J(S) = 27, J(T) = 270, J(B) = 247, J(A) = 151]
Reincarnation?
[Figure: same states, rewards, and discount factor 0.9, except D. Dead is no longer absorbing: with probability 0.5 it stays in D and with probability 0.5 it transitions back to A. Assistant Prof]
System of Equations
J(A) = 20 + 0.9 (0.6 J(A) + 0.2 J(B) + 0.2 J(S))
J(B) = 60 + 0.9 (0.6 J(B) + 0.2 J(S) + 0.2 J(T))
J(S) = 10 + 0.9 (0.7 J(S) + 0.3 J(D))
J(T) = 100 + 0.9 (0.7 J(T) + 0.3 J(D))
J(D) = 0 + 0.9 (0.5 J(D) + 0.5 J(A))
Solving a Markov System with Matrix Inversion
• Upside: you get an exact answer
• Downside: if you have 100,000 states, you're solving a 100,000 × 100,000 system of equations
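Concretely, the system can be written as J = r + γPJ, so J = (I - γP)⁻¹ r. A minimal numpy sketch for the working-backwards version of the chain (tenured reward 100, Dead absorbing); the state ordering A, B, S, T, D is my own convention:

import numpy as np

r = np.array([20.0, 60.0, 10.0, 100.0, 0.0])   # rewards for A, B, S, T, D
P = np.array([
    [0.6, 0.2, 0.2, 0.0, 0.0],   # Assistant Prof
    [0.0, 0.6, 0.2, 0.2, 0.0],   # Associate Prof
    [0.0, 0.0, 0.7, 0.0, 0.3],   # On the Street
    [0.0, 0.0, 0.0, 0.7, 0.3],   # Tenured Prof
    [0.0, 0.0, 0.0, 0.0, 1.0],   # Dead (absorbing)
])
gamma = 0.9
# Solve (I - gamma P) J = r rather than explicitly inverting the matrix
J = np.linalg.solve(np.eye(5) - gamma * P, r)
print(J)   # approx [151, 247, 27, 270, 0], matching the Working Backwards slide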
Value Iteration: another way to solve a Markov System

Define:
J1(Si) = expected discounted sum of rewards over the next 1 time step
J2(Si) = expected discounted sum of rewards over the next 2 steps
J3(Si) = expected discounted sum of rewards over the next 3 steps
:
Jk(Si) = expected discounted sum of rewards over the next k steps

J1(Si) = (what?)
J2(Si) = (what?)
Jk+1(Si) = (what?)
Value Iteration: another way to solve a Markov System

Define:
J1(Si) = expected discounted sum of rewards over the next 1 time step
J2(Si) = expected discounted sum of rewards over the next 2 steps
J3(Si) = expected discounted sum of rewards over the next 3 steps
:
Jk(Si) = expected discounted sum of rewards over the next k steps

J1(Si) = ri
J2(Si) = ri + γ Σj=1..N pij J1(Sj)
:
Jk+1(Si) = ri + γ Σj=1..N pij Jk(Sj)

(N = number of states)
Let's do Value Iteration

k   Jk(SUN)   Jk(WIND)   Jk(HAIL)
1
2
3
4
5

[Figure: three-state Markov system with γ = 0.5: SUN (+4), WIND (0), HAIL (-8); from SUN, stay sunny or turn windy with probability 1/2 each; from WIND, turn sunny or start hailing with probability 1/2 each; from HAIL, keep hailing or turn windy with probability 1/2 each]
Let's do Value Iteration

k   Jk(SUN)   Jk(WIND)   Jk(HAIL)
1   4         0          -8
2   5         -1         -10
3   5         -1.25      -10.75
4   4.94      -1.44      -11
5   4.88      -1.52      -11.11

[Figure: same SUN/WIND/HAIL system, γ = 0.5]
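The table can be reproduced directly from the recurrence Jk+1(Si) = ri + γ Σj pij Jk(Sj). A short numpy sketch, with the transition matrix read off the figure:

import numpy as np

r = np.array([4.0, 0.0, -8.0])   # rewards for SUN, WIND, HAIL
P = np.array([
    [0.5, 0.5, 0.0],   # SUN:  stay sunny or turn windy
    [0.5, 0.0, 0.5],   # WIND: turn sunny or start hailing
    [0.0, 0.5, 0.5],   # HAIL: turn windy or keep hailing
])
gamma = 0.5

J = r.copy()                  # J1(Si) = ri
print(1, J)
for k in range(2, 6):
    J = r + gamma * P @ J     # Jk+1 = r + gamma P Jk
    print(k, J)               # row k of the table; k=5 gives [4.88, -1.52, -11.11]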
Value Iteration for solving Markov Systems
• Compute J1(Si) for each i
• Compute J2(Si) for each i
:
• Compute Jk(Si) for each i

As k→∞, Jk(Si)→J*(Si).

When to stop? When

max_i |Jk+1(Si) - Jk(Si)| < ξ

This is faster than matrix inversion (which is O(N³)) if the transition matrix is sparse.
What if we have a way to interact with the Markov system?
A Markov Decision Process
γ = 0.9

You run a startup company. In every state you must choose between Saving money (S) or Advertising (A).

[Figure: four-state MDP with rewards Poor & Unknown (+0), Poor & Famous (+0), Rich & Unknown (+10), Rich & Famous (+10). Saving in PU keeps you in PU; Advertising in PU leads to PU or PF with probability 1/2 each; Saving in PF leads to PU or RF with probability 1/2 each; Advertising in PF keeps you in PF; Saving in RU leads to PU or RU with probability 1/2 each; Saving in RF leads to RU or RF with probability 1/2 each; Advertising in RU or RF leads back toward the poor, famous states]
Markov Decision Processes
An MDP has…
• A set of states {S1 ··· SN}
• A set of actions {a1 ··· aM}
• A set of rewards {r1 ··· rN} (one for each state)
• A transition probability function:
P^k_ij = Prob(Next = Sj | This = Si and I use action ak)
On each step:
0. Call the current state Si
1. Receive reward ri
2. Choose an action from {a1 ··· aM}
3. If you choose action ak, you'll move to state Sj with probability P^k_ij
4. All future rewards are discounted by γ
What’s a solution to an MDP? A sequence of actions?
A Policy
A policy is a mapping from states to actions.

Examples:

Policy Number 1:
STATE → ACTION
PU      S
PF      A
RU      S
RF      A

Policy Number 2:
STATE → ACTION
PU      A
PF      A
RU      A
RF      A
• How many possible policies in our example?
• Which of the above two policies is best?
• How do you compute the optimal policy?
[Figure: the startup MDP drawn once under each policy, showing from each state only the arcs of the chosen action]
Interesting Fact
For every MDP there exists an optimal policy.
It's a policy such that, for every possible start state, there is no better option than to follow the policy.
Computing the Optimal Policy
Idea One:
Run through all possible policies.
Select the best.
What's the problem?
Optimal Value Function
Define J*(Si) = Expected Discounted Future Rewards, starting from state Si, assuming we use the optimal policy
[Figure: three-state MDP with rewards S1 (+0), S2 (+3), S3 (+2), two actions (one labeled B), and arcs with probabilities 1, 1/2, and 1/3 as drawn]
Question: what is an optimal policy for this MDP (assume γ = 0.9)?
What is J*(S1) ?
What is J*(S2) ?
What is J*(S3) ?
Computing the Optimal Value Function with Value Iteration

Define:
Jk(Si) = maximum possible expected sum of discounted rewards I can get if I start at state Si and live for k time steps
Note that J1(Si) = ri
Let’s compute Jk(Si) for our example
k   Jk(PU)   Jk(PF)   Jk(RU)   Jk(RF)
1
2
3
4
5
6

k   Jk(PU)   Jk(PF)   Jk(RU)   Jk(RF)
1   0        0        10       10
2   0        4.5      14.5     19
3   2.03     8.55     16.52    25.08
Bellman's Equation

Jn+1(Si) = max_k [ ri + γ Σj=1..N P^k_ij Jn(Sj) ]

Converged when max_i |Jn+1(Si) - Jn(Si)| is sufficiently small.
Value Iteration for solving MDPs
• Compute J1(Si) for all i
• Compute J2(Si) for all i
• :
• Compute Jn(Si) for all i
… until converged

Also known as Dynamic Programming.
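To make the loop concrete, here is a sketch of value iteration with Bellman backups on the startup MDP. The transition matrices below are my reconstruction of the slides' figure from the Jk table (they reproduce rows 1-3 of the table exactly), so treat them as illustrative rather than authoritative:

import numpy as np

# State order: PU, PF, RU, RF
r = np.array([0.0, 0.0, 10.0, 10.0])
P = {   # one transition matrix per action (rows: current state, cols: next state)
    "S": np.array([
        [1.0, 0.0, 0.0, 0.0],   # PU: saving keeps you poor & unknown
        [0.5, 0.0, 0.0, 0.5],   # PF: fame may turn into riches
        [0.5, 0.0, 0.5, 0.0],   # RU: riches may fade
        [0.0, 0.0, 0.5, 0.5],   # RF
    ]),
    "A": np.array([
        [0.5, 0.5, 0.0, 0.0],   # PU: advertising may make you famous
        [0.0, 1.0, 0.0, 0.0],   # PF
        [0.5, 0.5, 0.0, 0.0],   # RU: advertising costs your riches
        [0.0, 1.0, 0.0, 0.0],   # RF
    ]),
}
gamma = 0.9
actions = list(P)

J = np.zeros(4)
while True:
    # Bellman backup: J(Si) <- ri + gamma * max_a sum_j P^a_ij J(Sj)
    Q = np.array([P[a] @ J for a in actions])
    J_new = r + gamma * Q.max(axis=0)
    if np.abs(J_new - J).max() < 1e-9:
        break
    J = J_new

policy = [actions[i] for i in Q.argmax(axis=0)]
print(J)        # approx [31.6, 38.6, 44.0, 54.2]
print(policy)   # ['A', 'S', 'S', 'S']: advertise only when poor & unknown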