
Page 1: Information Theory of Decisions and Actions

Information Theory of Decisions and Actions

Naftali Tishby and Daniel Polani

Page 2: Information Theory of Decisions and Actions

Contents
• Guiding Questions
• Introduction

Page 3: Information Theory of Decisions and Actions

Guiding Questions

Q1. How are Fuster’s perception-action cycle and Shannon’s information theory related? How is this analogy related to reinforcement learning?

Q2. What is value-to-go? What is information-to-go? How do we trade off between these two terms? Give a formulation that can express this trade-off (hint: the free-energy principle). How can we find the optimal policy, i.e. the one minimizing its information-to-go under a constraint on the attained value-to-go?

Q3. Define entropy. Define the relative entropy (Kullback-Leibler divergence). Define the Markov decision process (MDP). Define the value function of the MDP. How is the value function optimized? What is the Bellman equation, and how is it related to the MDP problem? What is the relationship between reinforcement learning, the MDP, and the Bellman equation?

Q4. Use a Bayesian network (graphical model; see the figure on page 12) to describe the perception-action cycle of an agent with sensors and memory. What are the characteristics of this agent?

Page 4: Information Theory of Decisions and Actions

Introduction
• We want to develop intelligent behaviour for artificial agents, as it is found in living organisms.
• The “cycle” view, such as the perception-action cycle, helps identify biases, incentives and constraints for the self-organized formation of intelligent processing in living organisms.
• There are many ways to model the perception-action cycle quantitatively.
• An information-theoretic treatment of the perception-action cycle makes it possible to compare scenarios with differing computational models.
• The Markov decision process (MDP) framework solves the problem of finding the optimal policy, the one which maximizes the reward achieved by the agent.
• The goal of the paper is to marry the MDP formalism with an information-theoretic treatment of the processing cost required by the agent to attain a given level of performance.

Page 5: Information Theory of Decisions and Actions

Shannon’s Information Theory

What is Shannon’s information theory?

A branch of applied mathematics, electrical engineering, and computer science involving the quantification of information.

A key measure of information is the entropy, usually expressed as the average number of bits needed to store or communicate one symbol in a message.

Entropy quantifies the uncertainty involved in predicting the value of a random variable.

Page 6: Information Theory of Decisions and Actions

Shannon’s Information Theory
Entropy and Information

Entropy of a random variable X

The entropy is a measure of uncertainty about the outcome of the random variable before it has been measured, or seen, and is a natural choice for quantifying this uncertainty.

It attains its maximum for the uniform distribution, reflecting the state of maximal uncertainty.
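In standard notation (with p(x) the probability of outcome x), the entropy referred to here is

H(X) = -\sum_x p(x) \log p(x)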

Page 7: Information Theory of Decisions and Actions

Shannon’s Information Theory
Conditional entropy of two random variables X and Y

→ The conditional entropy measures the remaining uncertainty about Y if X is known.
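In standard notation, the conditional entropy referred to here is

H(Y \mid X) = -\sum_{x,y} p(x,y) \log p(y \mid x) = H(X,Y) - H(X)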

Page 8: Information Theory of Decisions and Actions

Shannon’s Information Theory
Joint entropy of random variables X and Y

→ The joint entropy is a measure of the uncertainty associated with a set of variables.

Mutual information between X and Y

→ The mutual information of two random variables is a quantity that measures the mutual dependence of the two variables.
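In standard notation, the two quantities referred to on this slide are

H(X,Y) = -\sum_{x,y} p(x,y) \log p(x,y)

I(X;Y) = H(X) + H(Y) - H(X,Y) = H(Y) - H(Y \mid X)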

Page 9: Information Theory of Decisions and Actions

Shannon’s Information Theory
Relative entropy (Kullback-Leibler divergence)

The relative entropy D(p||q) is a measure of how much “compression” (or prediction, both in bits) could be gained if, instead of a hypothesized distribution q of X, the actual distribution p is utilized.

One has D(p||q) ≥ 0, with equality if and only if p = q everywhere. The relative entropy can become infinite if, for an outcome that can occur with nonzero probability, one assumes probability zero. The mutual information between two variables X and Y can be expressed in terms of the relative entropy, as sketched below.
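In standard notation, the relative entropy and the corresponding expression for the mutual information are

D_{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}

I(X;Y) = D_{KL}\big( p(x,y) \,\|\, p(x)\,p(y) \big)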

Page 10: Information Theory of Decisions and Actions

Markov Decision Processes

MDP: Definition
• A discrete-time stochastic control process
• The basic model for the interaction of an organism (or an artificial agent) with a stochastic environment
• The core problem of MDPs is to find a “policy” for the decision maker
• The goal is to choose a policy that will maximize some cumulative function of the random rewards, typically the expected discounted sum over a potentially infinite horizon

Page 11: Information Theory of Decisions and Actions

Markov Decision Processes

Given a state set S, and for each state s an action set A(s), an MDP is specified by the tuple (S, A, P, R), defined for all s, s' in S and a in A(s):

– P(s' | s, a): the probability that performing action a in state s will move the agent to state s'
– R(s, a, s'): the expected reward for this particular transition

Page 12: Information Theory of Decisions and Actions

Markov Decision Processes
Value function of the MDP and its optimization

• A policy π specifies an explicit probability π(a|s) of selecting action a if the agent is in state s
• Total cumulated reward
• Future expected cumulative reward value (Bellman equation)
• Per-action value function Q, which is expanded from the value function V (standard forms of these quantities are sketched below)
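Standard forms of the quantities listed above, in a discounted formulation (the discount factor \gamma is an assumption consistent with the “expected discounted sum” mentioned earlier):

Total cumulated reward:  \sum_t \gamma^t r_t

Value function (Bellman equation):  V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s,a)\,[\, R(s,a,s') + \gamma V^\pi(s') \,]

Per-action value function:  Q^\pi(s,a) = \sum_{s'} P(s' \mid s,a)\,[\, R(s,a,s') + \gamma V^\pi(s') \,],  so that  V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s,a)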

Page 13: Information Theory of Decisions and Actions

Markov Decision Processes
Bellman equation

• A dynamic decision problem (see the sketch below)
• Constraint
• Bellman’s principle of optimality: an optimal policy has the property that whatever the initial state and initial decisions are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
• Bellman equation (see the sketch below)
• Constraint
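A sketch of the standard forms behind these labels (again assuming the discounted setting):

Dynamic decision problem:  maximize  E[\, \sum_t \gamma^t r(s_t, a_t) \,]

Constraint:  s_{t+1} \sim P(\cdot \mid s_t, a_t)

Bellman equation:  V^*(s) = \max_a \sum_{s'} P(s' \mid s,a)\,[\, R(s,a,s') + \gamma V^*(s') \,]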

Page 14: Information Theory of Decisions and Actions

Markov Decision Processes
Reinforcement learning

If the probabilities or rewards are unknown, the problem is one of reinforcement learning.

For this purpose it is useful to define a further function Q, which corresponds to taking the action and then continuing optimally (a standard form is given below).
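The standard definition of this function, which takes action a in state s and then continues optimally, is

Q^*(s,a) = \sum_{s'} P(s' \mid s,a)\,[\, R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \,],  with  V^*(s) = \max_a Q^*(s,a)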

Page 15: Information Theory of Decisions and Actions

Markov Decision Processes
• Problem
– The MDP framework is concerned with describing the task and with solving the problem of finding the optimal policy (value-to-go)
– It is not concerned with the actual processing cost involved in carrying out the given policies (information-to-go)

Page 16: Information Theory of Decisions and Actions

Guiding Questions
• Q1. How are Fuster’s perception-action cycle and Shannon’s information theory related? How is this analogy related to reinforcement learning?
• Q3. Define entropy. Define the relative entropy (Kullback-Leibler divergence). Define the Markov decision process (MDP). Define the value function of the MDP. How is the value function optimized? What is the Bellman equation and how is it related to the MDP problem? What is the relationship between reinforcement learning, the MDP, and the Bellman equation?

Page 17: Information Theory of Decisions and Actions

Bayesian Network
• Bayesian network of a general agent

[Figure: unrolled Bayesian network over world states W_{t-3}, …, W_{t+1}, sensor states S_{t-3}, …, S_t, memory states M_{t-3}, …, M_t, and actions A_{t-3}, …, A_t]

W: world state
S: sensor of the agent
M: memory of the agent
A: action

Page 18: Information Theory of Decisions and Actions

Bayesian Network
• Characteristics of the agent
– The agent can be considered an all-knowing observer
– The agent has full access to the world state
– The memory of this reactive agent is ignored
• Applying these assumptions to the graph reduces it to states and actions (the resulting factorization is sketched after the figure):

[Figure: reduced Bayesian network over states S_{t-3}, …, S_t and actions A_{t-3}, …, A_t]
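Under these assumptions the network factorizes into the MDP dynamics and the policy; given the current state and action, the distribution over the future factorizes as

p(s_{t+1}, a_{t+1}, s_{t+2}, a_{t+2}, \ldots \mid s_t, a_t) = \prod_{k \ge 1} P(s_{t+k} \mid s_{t+k-1}, a_{t+k-1})\, \pi(a_{t+k} \mid s_{t+k})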

Page 19: Information Theory of Decisions and Actions

Value-to-go / Information-to-go
• Value-to-go
– The future expected reward in the course of a behaviour sequence towards a goal
• Information-to-go
– The cumulated information processing cost or bandwidth required to specify the future decision and action sequence
• Trade-off
– In view of the biological ramifications, the question is which optimal rewards an organism can accumulate under given constraints on its informational bandwidth
– How much reward the organism can accumulate vs. how much informational bandwidth it needs for that

Page 20: Information Theory of Decisions and Actions

Information-to-go
• Formalism
– The cumulated information processing cost or bandwidth required to specify the future decision and action sequence
– This is computed by specifying a given starting state and initial action and accumulating the information-to-go into the open-ended future
– Let ρ be a fixed prior on the distribution of successive states and actions
– Define now the process complexity as the Kullback-Leibler divergence between the actual distribution of states and actions after time t and the one assumed in the prior (a sketch of this definition is given below)
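A hedged sketch of this definition (the symbol \rho for the fixed prior and its factorization into per-step marginals over states and actions are assumptions consistent with the slides):

I_\pi(s_t, a_t) = D_{KL}\Big[\, p(s_{t+1}, a_{t+1}, s_{t+2}, a_{t+2}, \ldots \mid s_t, a_t) \;\Big\|\; \prod_{k \ge 1} \rho(s_{t+k})\, \rho(a_{t+k}) \,\Big]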

Page 21: Information Theory of Decisions and Actions

Information-to-go
• Formalism
– Since the prior treats successive states and actions as independent, the information-to-go decomposes into a sum of per-step contributions
– The action distributions are consistent with the state distributions via the policy π(a|s), which we assume to be constant over time (the same for all t)

Page 22: Information Theory of Decisions and Actions

Information-to-go
• Formalism
– The information-to-go then satisfies a Bellman-like recursion over future states and actions (a hedged sketch is given below)
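A hedged reconstruction of the kind of recursion this refers to (a sketch that follows from the factorization above, not a verbatim copy of the slide’s formula):

I_\pi(s_t, a_t) = \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t) \Big[ \log \frac{P(s_{t+1} \mid s_t, a_t)}{\rho(s_{t+1})} + \sum_{a_{t+1}} \pi(a_{t+1} \mid s_{t+1}) \Big( \log \frac{\pi(a_{t+1} \mid s_{t+1})}{\rho(a_{t+1})} + I_\pi(s_{t+1}, a_{t+1}) \Big) \Big]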

Page 23: Information Theory of Decisions and Actions

Calculating trade-off
• Using the Lagrange method
– The constrained optimization problem of finding the minimal information-to-go at a given level of value-to-go can be turned into an unconstrained one.
– Introduce a Lagrange multiplier β for the value-to-go constraint.
– The Lagrangian builds a link to the free-energy formalism: the information-to-go corresponds to the physical entropy, and the value-to-go corresponds to the energy of the system (sketched below).
– This provides additional justification for the minimization of the information-to-go under value-to-go constraints.
• Minimization of the information-to-go identifies the least committed policy, in the sense that the future is the least informative.
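A hedged sketch of the Lagrangian (free-energy) functional referred to here, with \beta the Lagrange multiplier playing the role of an inverse temperature:

F_\pi(s, a; \beta) = I_\pi(s, a) - \beta\, Q_\pi(s, a)

where I_\pi is the information-to-go and Q_\pi the value-to-go.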

Page 24: Information Theory of Decisions and Actions

Calculating trade-off
• Using the Lagrange method
– To find the optimal policy, minimize the free energy, where the minimization ranges over all policies (a sketch is given below)
– Resolving this equation leads to the self-consistent conditions derived on the next slides
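A sketch of this minimization (notation as above):

\hat{\pi} = \arg\min_\pi \big[\, I_\pi(s_t, a_t) - \beta\, Q_\pi(s_t, a_t) \,\big]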

Page 25: Information Theory of Decisions and Actions

Calculating trade-off
• Using the Lagrange method
– Extending the above equation by a Lagrange term for the normalization of the policy π(a|s), taking the gradient with respect to π(a|s), and setting this gradient to 0 provides the self-consistent solution (sketched below)
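A hedged sketch of the resulting self-consistent equations (the Boltzmann-like form below follows from the free-energy minimization; the exact notation is an assumption):

\pi(a \mid s) = \frac{\rho(a)\, e^{-F(s,a;\beta)}}{Z(s;\beta)},  \qquad  Z(s;\beta) = \sum_a \rho(a)\, e^{-F(s,a;\beta)}

with F(s,a;\beta) itself propagated backwards in time by a Bellman-like recursion of the free energy.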

Page 26: Information Theory of Decisions and Actions

Calculating trade-off
• Using the Lagrange method
– Iterating the above system of self-consistent equations until convergence, for every state, produces an optimal policy (an illustrative sketch of such an iteration follows).
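A minimal, illustrative Python sketch of such an iteration for a small discrete MDP. This is not the authors' code: it implements the hedged equations sketched above, and the discount-like factor gamma, the priors, and the toy MDP are assumptions added purely for illustration (gamma, in particular, is introduced so that the loop converges on an arbitrary example).

import numpy as np
from scipy.special import logsumexp

def solve_tradeoff(P, R, rho_s, rho_a, beta, gamma=0.9, iters=300):
    """Iterate self-consistent equations of the (assumed) form
        F(s,a)  = sum_{s'} P(s'|s,a) [ log(P(s'|s,a)/rho_s(s')) - beta*R(s,a,s') - gamma*log Z(s') ]
        Z(s)    = sum_a rho_a(a) * exp(-F(s,a))
        pi(a|s) = rho_a(a) * exp(-F(s,a)) / Z(s)
    P[s,a,s'] are transition probabilities, R[s,a,s'] rewards, rho_s/rho_a
    fixed priors over states/actions, beta the value/information trade-off."""
    n_s, n_a, _ = P.shape
    F = np.zeros((n_s, n_a))
    # State-transition information term; where P is zero the summand is zero.
    safe_P = np.where(P > 0, P, 1.0)
    info = (P * (np.log(safe_P) - np.log(rho_s)[None, None, :])).sum(axis=2)
    exp_reward = (P * R).sum(axis=2)
    for _ in range(iters):
        # log Z(s) from the current free energy, computed stably.
        logZ = logsumexp(np.log(rho_a)[None, :] - F, axis=1)
        # Bellman-like backup of the free energy.
        F = info - beta * exp_reward - gamma * (P @ logZ)
    # Boltzmann-like policy implied by the converged free energy.
    pi = rho_a[None, :] * np.exp(-(F - F.min(axis=1, keepdims=True)))
    pi /= pi.sum(axis=1, keepdims=True)
    return pi, F

# Toy 2-state, 2-action MDP (purely illustrative): action 1 reaches the
# rewarded state 1 more reliably but is more "informative" (more deterministic).
P = np.array([[[0.6, 0.4], [0.1, 0.9]],
              [[0.6, 0.4], [0.1, 0.9]]])
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0                      # reward for landing in state 1
rho_s = np.array([0.5, 0.5])
rho_a = np.array([0.5, 0.5])
for beta in (0.1, 5.0):
    pi, _ = solve_tradeoff(P, R, rho_s, rho_a, beta)
    print(f"beta={beta}: pi(a|s) =\n{np.round(pi, 3)}")

With a small beta the policy favours the cheaper, less informative action; with a large beta it concentrates on the high-reward action, tracing out the value/information trade-off.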