Information Theory in Intelligent Decision Making
Daniel Polani
Adaptive Systems and Algorithms Research Groups, School of Computer Science
University of Hertfordshire, United Kingdom
March 5, 2015
Daniel Polani Information Theory in Intelligent Decision Making
The Theory
Motivation

Artificial Intelligence
- modelling cognition in humans
- realizing human-level "intelligent" behaviour in machines (just performance: not necessarily imitating the biological substrate)
- a jumble of various ideas to get the above points working

Question
Is there a joint way of understanding cognition?

Probability
- we have probability theory as a theory of uncertainty
- we have information theory for endowing probability with a sense of "metrics"
Random Variables

Def.: Event Space
Consider an event space Ω = {ω₁, ω₂, ...}, finite or countably infinite, with a (probability) measure P_Ω : Ω → [0, 1] s.t. ∑_ω P_Ω(ω) = 1. The ω are called events.

Def.: Random Variable
A random variable X is a map X : Ω → 𝒳 with some outcome space 𝒳 = {x₁, x₂, ...} and induced probability measure P_X(x) = P_Ω(X⁻¹(x)). We also write

    P_X(x) ≡ P(X = x) ≡ p(x).
Neyman-Pearson Lemma I

Lemma
Consider observations x₁, x₂, ..., x_n of a random variable X and two potential hypotheses (distributions) p₁ and p₂ they could have been based upon.

Consider the test for hypothesis p₁ to be given as (x₁, x₂, ..., x_n) ∈ A, where

    A = { x = (x'₁, x'₂, ..., x'_n) | p₁(x'₁, x'₂, ..., x'_n) / p₂(x'₁, x'₂, ..., x'_n) ≥ C }

for some C ∈ ℝ⁺.

Assume the rate α of false negatives, α = p₁(Aᶜ), to be given (sequences generated by p₁, but not in A), and let β be the rate of false positives, β = p₂(A).

Then: any test with false negative rate α' ≤ α has false positive rate β' ≥ β.

(Cover and Thomas, 2006)
Neyman-Pearson Lemma II

Proof (Cover and Thomas, 2006)
Let A be as above and B some other acceptance region; let χ_A and χ_B be the indicator functions. Then for all x:

    [χ_A(x) − χ_B(x)] [p₁(x) − C p₂(x)] ≥ 0.

Multiplying out and summing over x:

    0 ≤ ∑_A (p₁ − C p₂) − ∑_B (p₁ − C p₂)
      = (1 − α) − Cβ − (1 − α') + Cβ'
      = C(β' − β) − (α − α')

so α' ≤ α implies β' ≥ β.
Neyman-Pearson Lemma V

Consideration
- assume the events x_i are i.i.d.
- the test becomes:

    ∏_i p₁(x_i) / p₂(x_i) ≥ C

- logarithmize:

    ∑_i log [p₁(x_i) / p₂(x_i)] ≥ κ  (κ := log C)

Note: Kullback-Leibler Divergence
Average "evidence" growth per sample:

    D_KL(p₁‖p₂) = E_{p₁}[ log p₁(X)/p₂(X) ] = ∑_{x∈𝒳} p₁(x) log [p₁(x)/p₂(x)]
Neyman-Pearson Lemma VI

[Figure: cumulative log-likelihood ratio ("log sum") against number of samples (0 to 10000) for Bernoulli hypothesis pairs 0.40 vs 0.60, 0.50 vs 0.60 and 0.55 vs 0.60; each curve grows linearly and matches the predicted slope lines D_KL · n.]
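The linear growth in the figure is easy to reproduce. The sketch below (an illustration added here, not code from the talk) draws n samples from p₁ and checks that the per-sample slope of the cumulative log-likelihood ratio approaches D_KL(p₁‖p₂), computed in nats:

```python
import math
import random

def dkl_bernoulli(p1: float, p2: float) -> float:
    """D_KL(Bernoulli(p1) || Bernoulli(p2)) in nats."""
    return p1 * math.log(p1 / p2) + (1 - p1) * math.log((1 - p1) / (1 - p2))

def log_likelihood_ratio_sum(p1: float, p2: float, n: int, seed: int = 0) -> float:
    """Cumulative sum_i log[p1(x_i)/p2(x_i)] for n samples drawn from p1."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = 1 if rng.random() < p1 else 0
        # likelihood of the observed symbol under each hypothesis
        q1 = p1 if x else (1 - p1)
        q2 = p2 if x else (1 - p2)
        total += math.log(q1 / q2)
    return total

n = 10_000
slope = log_likelihood_ratio_sum(0.40, 0.60, n) / n
print(slope, dkl_bernoulli(0.40, 0.60))  # empirical slope approaches D_KL
```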
Part I
Information Theory — Motivation
Structural Motivation: Intrinsic Pathways to Information Theory

Pathways leading to Information Theory:
- optimal communication
- Shannon axioms
- physical entropy
- Laplace's principle
- typicality theory
- optimal Bayes
- rate distortion
- information geometry
- AI
Optimal Communication

Codes
- task: send messages (disambiguate states) from sender to receiver
- consider self-delimiting codes (without an extra delimiting character)
- simple example: prefix codes

Def.: Prefix Codes
A code where no codeword is a prefix of another codeword.
Prefix Codes

[Figure: binary code tree with edges labelled 0 and 1; codewords sit at leaves, so no codeword is the prefix of another.]
Kraft Inequality

Theorem
Assume events x ∈ 𝒳 = {x₁, x₂, ..., x_k} are coded using prefix codewords over an alphabet of size b = |B|, with lengths l₁, l₂, ..., l_k for the respective events. Then one has

    ∑_{i=1}^k b^{−l_i} ≤ 1.

Proof Sketch (Cover and Thomas, 2006)
Let l_max be the length of the longest codeword. Expand the tree fully to level l_max. Fully expanded leaves are either: 1. codewords; 2. descendants of codewords; 3. neither. A codeword of length l_i has b^{l_max − l_i} full-tree descendants, which must be different for the different codewords, and there cannot be more than b^{l_max} in total. Hence

    ∑_i b^{l_max − l_i} ≤ b^{l_max}.

Remark
The converse also holds.
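As a quick numerical sanity check (added here, not from the slides), one can verify the Kraft inequality for a concrete binary prefix code; for a complete code the sum reaches exactly 1:

```python
def kraft_sum(codewords: list[str], b: int = 2) -> float:
    """Sum of b^(-l_i) over the codeword lengths l_i."""
    return sum(b ** -len(w) for w in codewords)

def is_prefix_free(codewords: list[str]) -> bool:
    """True iff no codeword is a proper prefix of another."""
    return not any(u != v and v.startswith(u) for u in codewords for v in codewords)

# A complete binary prefix code: the Kraft sum equals 1 exactly.
code = ["0", "10", "110", "111"]
print(is_prefix_free(code), kraft_sum(code))  # True 1.0
```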
Considerations — Most Compact Code

Assume
We want to code a stream of events x ∈ 𝒳 appearing with probability p(x).

Note
1. try to make the l_i as small as possible
2. i.e. make the b^{−l_i} as large as possible
3. limited by the Kraft inequality; ideally it becomes an equality:

    ∑_i b^{−l_i} = 1

As the l_i are integers, that is typically not achieved exactly.

Minimize
Average code length E[L] = ∑_i p(x_i) l_i under the constraint ∑_i b^{−l_i} = 1.

Result
Differentiating the Lagrangian

    ∑_i p(x_i) l_i + λ ∑_i b^{−l_i}

w.r.t. the l_i gives the codeword lengths for the "shortest" code:

    l_i = − log_b p(x_i)

Average Codeword Length

    ∑_i p(x_i) · l_i = − ∑_x p(x) log p(x)

In the following, assume the binary logarithm.
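Since the ideal lengths −log_b p(x_i) are generally not integers, a practical scheme rounds them up (Shannon coding). A small sketch added here as an illustration, using a dyadic distribution where the rounding is exact, so the average length meets the entropy bound:

```python
import math

def shannon_code_lengths(p: dict[str, float]) -> dict[str, int]:
    """Integer codeword lengths l_i = ceil(-log2 p(x_i)); these satisfy Kraft."""
    return {x: math.ceil(-math.log2(px)) for x, px in p.items()}

# Dyadic probabilities: the ideal lengths are already integers.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = shannon_code_lengths(p)
avg_len = sum(p[x] * l for x, l in lengths.items())
entropy = -sum(px * math.log2(px) for px in p.values())
print(lengths, avg_len, entropy)  # average length equals H(X) = 1.75 bits
```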
Entropy

Def.: Entropy
Consider the random variable X. Then the entropy H(X) of X is defined as

    H(X) [≡ H(p)] := − ∑_x p(x) log p(x)

with the convention 0 log 0 ≡ 0.

Interpretations
- average optimal codeword length
- uncertainty (about the next sample of X)
- physical entropy
- much more ...

Quote
"Why don't you call it entropy. In the first place, a mathematical development very much like yours already exists in Boltzmann's statistical mechanics, and in the second place, no one understands entropy very well, so in any discussion you will be in a position of advantage."
— John von Neumann
Meditation

Probability/Code Mismatch
Consider events x following a probability p(x), but a modeler mistakenly assuming probability q(x), with "optimal" code lengths − log q(x). Then the "code length waste per symbol" is given by

    − ∑_x p(x) log q(x) + ∑_x p(x) log p(x) = ∑_x p(x) log [p(x)/q(x)] = D_KL(p‖q)
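The waste formula can be checked directly. The sketch below (an added illustration; the distributions p and q are arbitrary examples) computes the cross-entropy, the entropy, and their difference D_KL(p‖q) in bits:

```python
import math

def entropy(p: dict[str, float]) -> float:
    """H(p) in bits, with the convention 0 log 0 = 0."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def cross_entropy(p: dict[str, float], q: dict[str, float]) -> float:
    """Average code length when coding p-distributed symbols with lengths -log2 q(x)."""
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)

def dkl(p: dict[str, float], q: dict[str, float]) -> float:
    """Code-length waste per symbol: cross-entropy minus entropy."""
    return cross_entropy(p, q) - entropy(p)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.75, "b": 0.25}
print(dkl(p, q))  # 0.5*log2(0.5/0.75) + 0.5*log2(0.5/0.25) ≈ 0.2075 bits
```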
Part II
Types
A Tip of Types (Cover and Thomas, 2006)

Method of Types: Motivation
- consider sequences with the same empirical distribution
- how many of these have a particular distribution?
- what is the probability of such a sequence?

Sketch of the Method
- consider, w.l.o.g., the binary event set 𝒳 = {0, 1}
- consider a sample x⁽ⁿ⁾ = (x₁, ..., x_n) ∈ 𝒳ⁿ
- the type p_{x⁽ⁿ⁾} is the empirical distribution of symbols y ∈ 𝒳 in the sample x⁽ⁿ⁾, i.e. p_{x⁽ⁿ⁾}(y) counts how often symbol y appears in x⁽ⁿ⁾ (normalized by n). Let 𝒫_n be the set of types with denominator n (or dividing n).
- for p ∈ 𝒫_n, call the set of all sequences x⁽ⁿ⁾ ∈ 𝒳ⁿ with type p the type class C(p) = {x⁽ⁿ⁾ | p_{x⁽ⁿ⁾} = p}.
Type Theorem

Type Count
If |𝒳| = 2, one has |𝒫_n| = n + 1 different types for sequences of length n (easy to generalize).

Important
𝒫_n grows only polynomially, but 𝒳ⁿ grows exponentially with n. It follows that at least one type must contain exponentially many sequences. This corresponds to the "macrostate" in physics.

Theorem (Cover and Thomas, 2006)
If x₁, x₂, ..., x_n is an i.i.d. sample sequence drawn from q, then the probability of x⁽ⁿ⁾ depends only on its type and is given by

    2^{−n [H(p_{x⁽ⁿ⁾}) + D_KL(p_{x⁽ⁿ⁾}‖q)]}

Corollary
If x⁽ⁿ⁾ has type q (here, we interpret the probability q as a type), then its probability is given by

    2^{−n H(q)}

A large value of H(q) indicates many possible candidates x⁽ⁿ⁾ and high uncertainty; a small value, few candidates and low uncertainty.
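The theorem can be verified numerically for a Bernoulli source: the direct product probability of a sequence agrees with the 2^{−n[H + D_KL]} expression computed from its type. A sketch (own illustration; the values n, k, q are arbitrary):

```python
import math

def h2(p: float) -> float:
    """Binary entropy in bits (0 log 0 = 0)."""
    return 0.0 if p in (0.0, 1.0) else -p*math.log2(p) - (1-p)*math.log2(1-p)

def dkl2(p: float, q: float) -> float:
    """D_KL(Bernoulli(p) || Bernoulli(q)) in bits."""
    terms = []
    if p > 0: terms.append(p * math.log2(p / q))
    if p < 1: terms.append((1-p) * math.log2((1-p) / (1-q)))
    return sum(terms)

n, k, q = 20, 5, 0.3            # a length-20 sequence with five 1s, source Bernoulli(q)
p_type = k / n                  # the sequence's type
direct = q**k * (1-q)**(n-k)    # probability of any particular such sequence
via_types = 2 ** (-n * (h2(p_type) + dkl2(p_type, q)))
print(direct, via_types)        # the two expressions agree
```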
Part III
Laplace’s Principle and Friends
Laplace's Principle of Insufficient Reason I

Scenario
Consider 𝒳. A probability distribution is assumed on 𝒳, but it is unknown. Laplace's principle of insufficient reason states that, in the absence of any reason to assume that the outcomes are inequivalent, the probability distribution on 𝒳 is taken to be the equidistribution.

Question
How to generalize when something is known?
Answer: Types

Dominant Sample Sequence
Remember: the probability of a sequence in type class C(q) is

    2^{−n H(q)}

A priori, a probability q maximizing H(q) will generate sequence types dominating all others.

Maximum Entropy Principle
Maximize H(q) with respect to q. Result: the equidistribution q(x) = 1/|𝒳|.
Sanov's Theorem I

Theorem
Consider an i.i.d. sequence X₁, X₂, ..., X_n of random variables, distributed according to q(X). Let further E be a set of probability distributions.

Then (amongst other results), if E is closed and with p* = arg min_{p∈E} D(p‖q), one has

    (1/n) log q⁽ⁿ⁾(E) → − D(p*‖q)

[Figure: the set E inside the probability simplex, the prior q outside it, and p* the point of E closest to q in D(·‖q).]
Sanov's Theorem II

Interpretation
p is unknown, but one knows constraints for p (e.g. some condition, such as some mean value Ū = ∑_x p(x) U(x), must be attained, i.e. the set E is given); then the dominating types are those close to p*.

Special Case
If the prior q is the equidistribution (indifference), then minimizing D(p‖q) under the constraints E is equivalent to maximizing H(p) under these constraints:

Jaynes' Maximum Entropy Principle
Sanov's Theorem III

Jaynes' Principle
- generalization of Laplace's Principle
- the maximally uncommitted distribution
Maximum Entropy Distributions I: No Constraints

We are interested in maximizing

    H(X) = − ∑_x p(x) log p(x)

over all probabilities p. The probability p lives in the simplex

    Δ = { q ∈ ℝ^{|𝒳|} | ∑_i q_i = 1, q_i ≥ 0 }

The maximization requires respecting constraints, of which we now consider only the normalization ∑_x p(x) = 1. The edge constraints (q_i ≥ 0) happen not to be invoked here.
Maximum Entropy Distributions II: No Constraints

Maximization via Lagrange (with only the normalization constraint):

    max_p [ − ∑_x p(x) log p(x) + λ ∑_x p(x) ]

Taking the derivative ∇_{p(x)} gives

    − log p(x) − 1 + λ = 0.

Thus p(x) = e^{λ−1} ≡ 1/|𝒳| — the equidistribution.
Maximum Entropy Distributions: Linear Constraints

The constraints are now

    ∑_x p(x) = 1
    ∑_x p(x) f(x) = f̄.

Setting the gradient of the Lagrangian to zero:

    0 = ∇_{p(x)} [ − ∑_x p(x) log p(x) + λ ∑_x p(x) + µ ∑_x p(x) f(x) ]

    − log p(x) − 1 + λ + µ f(x) = 0

so that one has the

Boltzmann/Gibbs Distribution

    p(x) = e^{λ−1+µ f(x)} = (1/Z) e^{µ f(x)}
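To make the Boltzmann/Gibbs result concrete, the sketch below (an added illustration; the values f(x) and the target mean are arbitrary choices) solves for the multiplier µ by bisection, exploiting that the mean E[f] under the Gibbs distribution is monotone in µ:

```python
import math

def gibbs(f: list[float], mu: float) -> list[float]:
    """p(x) proportional to exp(mu * f(x)): the maximum-entropy form."""
    w = [math.exp(mu * fx) for fx in f]
    z = sum(w)  # the partition function Z
    return [wi / z for wi in w]

def solve_mu(f: list[float], target_mean: float, lo=-50.0, hi=50.0) -> float:
    """Bisect on mu so that the Gibbs distribution matches the prescribed mean of f."""
    for _ in range(200):
        mid = (lo + hi) / 2
        mean = sum(pi * fi for pi, fi in zip(gibbs(f, mid), f))
        lo, hi = (mid, hi) if mean < target_mean else (lo, mid)
    return (lo + hi) / 2

f = [1.0, 2.0, 3.0]        # values of f(x) on a three-element outcome space
mu = solve_mu(f, 2.5)      # constrain E[f] = 2.5
p = gibbs(f, mu)
print(p, sum(pi * fi for pi, fi in zip(p, f)))  # a Gibbs distribution with mean 2.5
```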
Part IV
Kullback-Leibler and Friends
Conditional Kullback-Leibler

D_KL can be conditional:

    D_KL[p(Y|x) ‖ q(Y|x)]
    D_KL[p(Y|X) ‖ q(Y|X)] = ∑_x p(x) D_KL[p(Y|x) ‖ q(Y|x)]
Kullback-Leibler and Bayes (Biehl, 2013)

We want to estimate p(x|θ), where θ is the parameter. We observe y. Seek the "best" q(x|y) for this y in the following sense:
1. minimize the D_KL of the true distribution to the model distribution q,
2. averaged over possible observations y,
3. averaged over θ:

    min_q ∫ dθ p(θ) ∑_y p(y|θ) D_KL[p(x|θ) ‖ q(x|y)]

Result
q(x|y) is the Bayesian inference obtained from p(y|x) and p(x).
Conditional Entropies

Special Case: Conditional Entropy

    H(Y|X = x) := − ∑_y p(y|x) log p(y|x)
    H(Y|X)     := − ∑_x p(x) ∑_y p(y|x) log p(y|x)

Information
Reduction of entropy (uncertainty) by knowing another variable:

    I(X; Y) := H(Y) − H(Y|X)
             = H(X) − H(X|Y)
             = H(X) + H(Y) − H(X, Y)
             = D_KL[p(x, y) ‖ p(x)p(y)]
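These identities translate directly into code. The following sketch (added here as an illustration) computes I(X;Y) as D_KL[p(x,y)‖p(x)p(y)] for a binary symmetric channel with uniform input, an example not taken from the slides:

```python
import math
from collections import defaultdict

def mutual_information(pxy: dict[tuple, float]) -> float:
    """I(X;Y) = D_KL[p(x,y) || p(x)p(y)] in bits, from a joint distribution."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in pxy.items():
        px[x] += p
        py[y] += p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in pxy.items() if p > 0)

# Binary symmetric channel with flip probability 0.1 and uniform input:
e = 0.1
pxy = {(0, 0): 0.5*(1-e), (0, 1): 0.5*e, (1, 0): 0.5*e, (1, 1): 0.5*(1-e)}
print(mutual_information(pxy))  # 1 - H2(0.1) ≈ 0.531 bits
```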
Part V
Towards Reality
Rate/Distortion Theory: Code Below Specifications

Reminder
Information is about sending messages. We considered the most compact codes over a given noiseless channel. Now consider the situation where either:
1. the channel is not noiseless, but has noisy characteristics p(x̂|x), or
2. we cannot afford to spend an average of H(X) bits per symbol to transmit.

Question
What happens? Total collapse of transmission?
Rate/Distortion Theory I: Distortion

"Compromise"
- no longer insist on perfect transmission
- accept a compromise: measure the distortion d(x, x̂) between the original x and the transmitted x̂
- small distortion good, large distortion "baaad"

Theorem: Rate Distortion Function
Given p(x) for the generation of symbols X,

    R(D) := min_{p(x̂|x): E[d(X,X̂)]=D} I(X; X̂)

where the mean is over p(x, x̂) = p(x̂|x) p(x).
Rate/Distortion Theory II: Distortion

[Figure: rate-distortion trade-off curve; vertical axis (rate) from 0 to about 1.8, horizontal axis (distortion) from 0 to 1; the curve decreases convexly towards zero rate.]
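R(D) can be computed numerically with the classical Blahut-Arimoto iteration. The sketch below is an illustration added here (the function name and the trade-off parameter β, used in nats, are my own choices): it traces one point of the curve for a binary source with Hamming distortion, where R(D) = 1 − H₂(D) is known in closed form:

```python
import math

def rate_distortion_point(p, d, beta, iters=500):
    """One point on the rate-distortion curve via Blahut-Arimoto.

    p: source distribution p(x); d[x][y]: distortion matrix; beta: trade-off
    parameter. Returns (rate in bits, expected distortion)."""
    n, m = len(p), len(d[0])
    q = [1.0 / m] * m                       # output marginal q(y)
    for _ in range(iters):
        cond = []                           # conditional p(y|x) ~ q(y) exp(-beta d(x,y))
        for x in range(n):
            w = [q[y] * math.exp(-beta * d[x][y]) for y in range(m)]
            z = sum(w)
            cond.append([wi / z for wi in w])
        q = [sum(p[x] * cond[x][y] for x in range(n)) for y in range(m)]
    rate = sum(p[x] * cond[x][y] * math.log2(cond[x][y] / q[y])
               for x in range(n) for y in range(m) if cond[x][y] > 0)
    dist = sum(p[x] * cond[x][y] * d[x][y] for x in range(n) for y in range(m))
    return rate, dist

# Binary source, Hamming distortion: R(D) = 1 - H2(D) for D <= 1/2.
p = [0.5, 0.5]
d = [[0.0, 1.0], [1.0, 0.0]]
rate, dist = rate_distortion_point(p, d, beta=3.0)
print(rate, dist)
```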
First Example: Infotaxis(Vergassola et al., 2007)
Applications
Thank You

- Information-theoretic PA-Loop Invariants, Empowerment: Alexander Klyubin
- Relevant Information: Chrystopher Nehaniv, Naftali Tishby, Thomas Martinetz, Jan Kim
- Digested Information: Christoph Salge
- Continuous Empowerment: Tobias Jung, Peter Stone, Christoph Salge, Cornelius Glackin
- Collective Empowerment: Philippe Capdepuy
- Collective Systems: Malte Harder
- World Structure, Graphs, Empowerment in Games: Tom Anthony
- Sensor Evolution, Information Distribution over the PA-Loop: Sander van Dijk, Alexandra Mark, Achim Liese
- Information Flow, PA-Loop Models: Nihat Ay
- Further Contributions: Mikhail Prokopenko, Lars Olsson, Philippe Capdepuy, Malte Harder, Simon McGregor
- Funding: EC (FEELIX GROWING, FP6), NSF, ONR, DARPA, FHA

This work was partially supported by FP7 ICT-270219.
Part VI
Crash Introduction
Modelling Cognition: Motivation from Biology

Question
Why/how did cognition evolve in biology?

Observations in biology
- sensors are often highly optimized:
  - detection of a few molecules (moths) (Dusenbery, 1992)
  - detection of few or individual photons (humans/toads) (Hecht et al., 1942; Baylor et al., 1979)
  - the auditory sense operates close to the thermal noise level (Denk and Webb, 1989)
- cognitive processing is very expensive (Laughlin et al., 1998; Laughlin, 2001)
Conclusions

Evidence
- sensors often operate at physical limits
- evolutionary pressure for high cognitive functions

But What For?
- close the cycle: actions matter
- "Entscheidend ist, was hinten rauskommt." ("What counts is what comes out at the end.")

Trade-Offs
- sharpening sensors, improving processing, boosting actuators
- "Was man nicht im Kopf hat, muss man in den Beinen haben." ("What you don't have in your head, you must have in your legs.")
Part VII
Information
Decisions, Decisions

Challenge
Linking sensors, processing and actuators.

The Physical and the Biological
- Physics: dynamical equations etc. are given and known (in principle)
- Biological Cognition: no established unique model; complex, difficult to untangle
- Robotic Cognition: many near-equivalent, incompatible solutions and architectures; often specific and hand-designed

Problem
Considerable arbitrariness in the treatment of cognition.
Idea

Issues
- uniform treatment of cognition
- distinguish essential from incidental aspects of computation

Proposal: "Covariant" Modelling of Computation
- Physics: observations may depend on the "coordinate system" for the same underlying phenomenon
- Cognition: computation may depend on the architecture, but essentially computes "the same concepts"

Bottom Line
A "coordinate-" (mechanism-)free view of cognition?
Landauer's Principle

Fundamental Limits for Information Processing
- on the lowest level, one cannot fully separate physics and information processing
- consequence: erasure of information from a "memory" creates heat
- this connects energy and information

[Diagram: the joint state (W_t, M_t) of world W_t and memory M_t evolving to (W_{t+1}, M_{t+1}).]
Informational Invariants: Beyond Physics

Law of Requisite Variety (Ashby, 1956; Touchette and Lloyd, 2000, 2004)
- Ashby: "only variety can destroy variety"
- extension by Touchette/Lloyd:
  - Open-Loop Controller: max. entropy reduction ΔH*_open
  - Closed-Loop Controller: max. entropy reduction ΔH_closed ≤ ΔH*_open + I(W_t; A_t)

[Diagram: the perception-action loop unrolled in time: ... W_{t−1} → W_t → W_{t+1} ..., with actions A_t acting on the world states.]
Informational Invariants: Scenario

Core Statement
- Task: consider e.g. a navigational task
- Informationally: reduction of the entropy of the initial (arbitrary) state

[Figure: example trajectories in the plane (x and y from −10 to 10) converging towards the center.]
Information Bookkeeping

Bayesian Network
[Diagram: the perception-action loop as a Bayesian network: world states W_t, sensor states S_t, memory states M_t and actions A_t, unrolled over time.]

Informational "Conservation Laws"
- Total sensor history: S⁽ᵗ⁾ = (S₀, S₁, ..., S_{t−1})
- Result:

    lim_{t→∞} I(S⁽ᵗ⁾; W₀) = H(W₀)

(Klyubin et al., 2007); see also (Ashby, 1956; Touchette and Lloyd, 2000, 2004)
Observations

Key Motto
There is no perpetuum mobile of the 3rd kind. (Actually, rather: there may be no free lunch, but sometimes there is free beer.)

Information Balance Sheet
- Task Invariant: H(W₀) determines the minimum information required to get to the center
- Task Variant: but it can be spread/concentrated differently over
  - time
  - environment and agents ("stigmergy")
  - sensors and memory
  (Klyubin et al., 2004a,b, 2007; van Dijk et al., 2010)
- Note: the invariance is purely entropic: indifferent to the task

Next Step
Refine towards specific tasks.
Information for Decision Making

Replace the gradient follower by a general policy π.

Dynamics
[Diagram: ... S_{t−1} → A_{t−1} → S_t → A_t → S_{t+1} ..., each action chosen by the policy π.]

Utility

    V^π(s) := E_π[R_t + R_{t+1} + ··· | s]
            = ∑_a π(a|s) ∑_{s'} P^a_{ss'} [R^a_{ss'} + V^π(s')]
A Parsimony Principle

Traditional MDP
- Task: find the best policy π*
- Traditional RL: does not consider decision costs
- Credo: information processing is expensive in biology! (Laughlin et al., 1998; Laughlin, 2001; Polani, 2009)
- Hypothesis: organisms trade off information-processing costs against task payoff (Tishby and Polani, 2011; Polani, 2009; Laughlin, 2001)
- Therefore: include the information cost and expand the MDP to an I-MDP (Polani et al., 2006; Tishby and Polani, 2011)

Principle of Information Parsimony
Minimize I(S; A) (the relevant information) at a fixed utility level.
Motto
It is a very sad thing that nowadays there is so little uselessinformation.
Oscar Wilde
Relevant Information and its Policies

Computation
Via the Lagrangian formalism (Stratonovich, 1965; Polani et al., 2006; Belavkin, 2008, 2009; Still and Precup, 2012; Saerens et al., 2009; Tishby and Polani, 2011), find:

    min_π ( I(S; A) − β E[V^π(S)] )

- β → ∞: the policy is optimal while informationally parsimonious!
- β finite: the policy is suboptimal at a fixed level of E[V^π(S)] while informationally parsimonious
- note: I(S; A) as well as V^π depend on π

Expectation
For higher utility, more relevant information is required, and vice versa.
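For a one-step decision problem, this trade-off can be computed with a Blahut-Arimoto-style iteration, alternating a Boltzmann-like policy with a re-estimated action marginal. The sketch below is a toy illustration added here (the two-state utility table is an arbitrary example, not the grid world of the talk):

```python
import math

def parsimonious_policy(ps, U, beta, iters=300):
    """Minimize I(S;A) - beta * E[U(S,A)] for a one-step decision problem.

    ps[s]: state distribution; U[s][a]: utility table.
    Returns (policy pi[s][a], I(S;A) in bits, E[U])."""
    nS, nA = len(ps), len(U[0])
    pa = [1.0 / nA] * nA                      # action marginal p(a)
    for _ in range(iters):
        pi = []                               # pi(a|s) ~ p(a) exp(beta U(s,a))
        for s in range(nS):
            w = [pa[a] * math.exp(beta * U[s][a]) for a in range(nA)]
            z = sum(w)
            pi.append([wi / z for wi in w])
        pa = [sum(ps[s] * pi[s][a] for s in range(nS)) for a in range(nA)]
    info = sum(ps[s] * pi[s][a] * math.log2(pi[s][a] / pa[a])
               for s in range(nS) for a in range(nA) if pi[s][a] > 0)
    eu = sum(ps[s] * pi[s][a] * U[s][a] for s in range(nS) for a in range(nA))
    return pi, info, eu

# Two states, two actions; each state has its own "right" action.
ps = [0.5, 0.5]
U = [[1.0, 0.0], [0.0, 1.0]]
for beta in (0.1, 2.0, 20.0):
    _, info, eu = parsimonious_policy(ps, U, beta)
    print(f"beta={beta}: I(S;A)={info:.3f} bits, E[U]={eu:.3f}")
```

As β grows, both the relevant information and the achieved utility increase, tracing out the expected trade-off curve.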
Experiments

Scenario
Define (P^a_{ss'}, R^a_{ss'}) by:
- States: grid world
- Actions: north, east, south, west
- Reward: each action produces a "reward" of −1 until the goal is reached

Experiment
Trade off utility against relevant information.

Question
What form does the expected trade-off take?

[Figure: grid world with two goal locations, A and B.]
Experiment — Find the Corner

[Figure: trade-off curves of E[Q(S,A)] (from −50 to 0) against I(S;A) (0 to 1.2 bits) for goals A and B.]

Optimal Case
- goal B has higher utility than A
- but needs a lot more information per step

Suboptimal Case
- goal B is much worse than goal A
- for the same information cost
Experiment — With a Twist I

Experiment Revisited
- grid world again
- consider only goal A
- cost as before

The "Twist" (Polani, 2011)
- permute the directions north, east, south, west!
- a random, fixed permutation σ_s of the directions for each state s
- replace (P^a_{ss'}, R^a_{ss'}) by (P̃^a_{ss'}, R̃^a_{ss'}), where

    P̃^a_{ss'} := P^{σ_s(a)}_{ss'}
    R̃^a_{ss'} := R^{σ_s(a)}_{ss'}
Experiment — With a Twist II
Expectation
as a traditional MDP, the “twisted” MDP (P̃^a_{ss′}, R̃^a_{ss′}) remains exactly equivalent:

same optimal values: Ṽ*(s) = V*(s), s ∈ S
same optimal policy after undoing the twist; pre-/post-twist policies equivalent via

Q̃_π̃(s, a) = Q_π(s, σ_s(a))
π̃(s, a) = π(s, σ_s(a))
And as I-MDP?
Experiment With a Twist: Uh-Oh!
[plot: E[V(S)] (−50 to 0) versus I(S;A) (0 to 1.2), original vs. twisted MDP]
Optimal Case

sanity check: utility same for original and twisted
but the latter needs a lot more information per step

Suboptimal Case

twisted MDP becomes much worse than original
at the same information cost
Intermediate Conclusions
Insights
as traditional MDP both experiments fully equivalent
as I-MDP, however . . .
significant difference between
  agent “taking actions with it” and
  having “realigned” set of actions at each step

embodiment allows offloading informational effort (e.g. Paul, 2006; Pfeifer and Bongard, 2007)
Part VIII
Goal-Relevant Information
Towards Multiple Goals
Extension
assume family of tasks (e.g. multiple goals)
action now depends on both state and goals
[diagram: Bayesian network over S_{t−1}, A_{t−1}, S_t, A_t, S_{t+1}, with goal variable G influencing the actions]
Goal-Relevant Information
I(G; A_t | s_t) = H(A_t | s_t) − H(A_t | G, s_t)
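For a single state s this is an ordinary conditional mutual information and can be computed directly from a joint table p(g, a | s); a minimal sketch (the function name and toy table are illustrative):

```python
import numpy as np

def goal_relevant_information(p_ga):
    """I(G; A | s) = H(A | s) - H(A | G, s) in bits, for one fixed state s.

    p_ga: (n_goals, n_actions) array, the joint distribution p(g, a | s).
    """
    p_ga = np.asarray(p_ga, dtype=float)
    p_a = p_ga.sum(axis=0)                        # p(a | s)
    h_a = -np.sum(p_a[p_a > 0] * np.log2(p_a[p_a > 0]))
    p_g = p_ga.sum(axis=1)                        # p(g | s)
    p_a_given_g = np.divide(p_ga, p_g[:, None],
                            out=np.zeros_like(p_ga),
                            where=p_g[:, None] > 0)
    mask = p_ga > 0
    h_a_given_g = -np.sum(p_ga[mask] * np.log2(p_a_given_g[mask]))
    return h_a - h_a_given_g

# two goals, two actions: the chosen action fully reveals the goal
p = [[0.5, 0.0],
     [0.0, 0.5]]
```

For the table above the result is exactly 1 bit: in this state, acting requires reading off the full goal identity; a goal-independent policy would need 0 bits.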
Towards Multiple Goals

Goal-Relevant Information (Regularized)

min_{π(a_t | s_t, g)} ( I(G; A_t | S_t) − β E[V_π(S_t, G, A_t)] )
Goal-Relevant Information
I(G; A_t | s_t)
Goal-Relevant and Sensor Information Trade-Offs

[plot: I(S_t; A_t | G) (0 to 0.6) versus I(G; A_t | S_t) (0 to 1.6), trade-off curves from α = 0 to α = 1]

Lagrangian

min_{π(a_t | s_t, g)} [ (1 − α) I(G; A_t | S_t) + α I(S_t; A_t | G) − β E[V_π(S_t, G, A_t)] ]
Information “Caching”
Note
not only the how much of goal-relevant information matters
but also the which
Consider
Accessible History (Context): e.g. A^{t−1} = (A_0, A_1, . . . , A_{t−1})

“Cache Fetch”: new goal-relevant information not already used

I(A_t; G | A^{t−1}) = H(A_t | A^{t−1}) − H(A_t | G, A^{t−1})
Subgoals
I(A_t; G | A^{t−1}, s): new goal information
I(A^{t−1}; G | A_t, s): discarded goal information

(van Dijk and Polani, 2011; van Dijk and Polani, 2013)

Psychological Connections?

Crossing doors causes forgetting (see also Radvansky et al., 2011)
Efficient Relevant Goal Information (van Dijk and Polani, 2013)
“Most Efficient” Goal
G −→ G1 −→ A ←− S

min_{p(g1 | g)} I(G; G1) subject to I(G1; A | S) ≥ C
[Fig. 8 (van Dijk and Polani, 2013): goal clusters induced by the bottleneck G1 on the primary goal-information pathway in a 6-room grid-world navigation task, shown for increasing cardinality |G1| = 3, 4, 5, 6. With the lower bound fixed as high as possible, a hard clustering emerges: each goal is deterministically mapped to a single element of G1, goal states in the same cluster are directly connected in the transition graph of the MDP, and the clustering largely adheres to the physical room boundaries.]
Making State Predictive for Actions
“Most Enhancive” Goal
G −→ G2 −→ A ←− S

min_{p(g2 | g)} I(G; G2) subject to I(S; A | G2) ≥ C
[Fig. 10 (van Dijk and Polani, 2013): goal clusters induced by the bottleneck G2 on the secondary, state-information-modulating goal-information pathway in a 9-room grid-world navigation task, shown for increasing cardinality of the bottleneck variable. The global relations between states and goals are strongly determined by the set of available actions: knowing, e.g., that the goal lies in the ‘north-eastern’ subset lets the agent use state knowledge to distinguish whether the goal is likely to the north or to the west, a distinction that is only relevant because the agent has distinct actions defining these directions. Differently defined actions would induce other relations with goals, and likely a different factorization as goal-based frame of reference.]
Making State Predictive for Actions

“Most Enhancive” Goal

G −→ G2 −→ A ←− S

min_{p(g2 | g)} I(G; G2) subject to I(S; A | G2) ≥ C

Insights

“spillover” ignoring local boundaries
action information induces global “frame of reference”
depends on action consistency

[Fig. 10 (van Dijk and Polani, 2013), as on the previous slide]
Part IX
Empowerment: Motivation
Universal Utilities
Problems

in biology, the success criterion is survival
the concepts of a “task” and a “reward” are not sharp
the “search space” is too large for full-fledged success feedback
pure Darwinism: feedback by death, and this is very sparse

Notes

Homeostasis: provides dense networks to guide living beings
Problem:
  specific to particular organisms
  designed on a case-by-case basis for artificial agents
  is there a more generalizable perspective, in view of the success of evolution?
Idea
Universal Drives and Utilities
Core Idea: adaptational feedback should be dense and rich
artificial curiosity, learning progress, autotelic principle, intrinsic reward (Schmidhuber, 1991; Kaplan and Oudeyer, 2004; Steels, 2004; Singh et al., 2005)

homeokinesis, and predictive information (Der, 2001; Ay et al., 2008)

Physical Principle:

causal entropic forcing (Wissner-Gross and Freer, 2013)
Present Ansatz
Use Embodiment
optimize informational fit into the sensorimotor niche
maximization of the potential
  to inject information into the environment (via actuators)
  and recapture it from the environment (via sensors)
(Klyubin et al., 2005a,b; Nehaniv et al., 2007; Klyubin et al., 2008)
Here: Empowerment
Motto
“Being in control of one’s destiny
and knowing it
is good.” (Jung et al., 2011)
More Precisely
information-theoretic version of
controllability (being in control of destiny)
observability (knowing about it)
combined
Formalism

Bayesian Network

[diagram: perception–action loop unrolled in time, with world states W_t, sensor states S_t, memory states M_t and actions A_t, for t−3 … t+2]

(Klyubin et al., 2005a,b; Nehaniv et al., 2007; Klyubin et al., 2008)
Formalism

Bayesian Network

[diagram: memoryless perception–action loop, with world states W_t, sensor states S_t and actions A_t, for t−3 … t+2]

(Klyubin et al., 2005a,b; Nehaniv et al., 2007; Klyubin et al., 2008)
Formalism

Bayesian Network

[diagram as above, with the k actions A_{t−k}, …, A_{t−1} treated as free choices]

“Free Will” Actions

Empowerment: Formal Definition

E^(k) := max_{p(a_{t−k}, a_{t−k+1}, …, a_{t−1})} I(A_{t−k}, A_{t−k+1}, …, A_{t−1}; S_t)

(Klyubin et al., 2005a,b; Nehaniv et al., 2007; Klyubin et al., 2008)
Formalism

Bayesian Network

[diagram as above]

“Free Will” Actions

Empowerment: Formal Definition

E^(k)(w_{t−k}) := max_{p(a_{t−k}, a_{t−k+1}, …, a_{t−1} | w_{t−k})} I(A_{t−k}, A_{t−k+1}, …, A_{t−1}; S_t | w_{t−k})

(Klyubin et al., 2005a,b; Nehaniv et al., 2007; Klyubin et al., 2008)
Formalism

Bayesian Network

[diagram as above]

“Free Will” Actions

Empowerment: Formal Definition

E^(k)(w_{t−k}) := max_{p(a^(k)_{t−k} | w_{t−k})} I(A^(k)_{t−k}; S_t | w_{t−k})

(Klyubin et al., 2005a,b; Nehaniv et al., 2007; Klyubin et al., 2008)
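This maximization is a channel capacity over the k-step action channel p(s_t | a^(k), w). For finite alphabets it can therefore be computed with the standard Blahut-Arimoto iteration; a minimal sketch under that assumption (function name illustrative):

```python
import numpy as np

def empowerment_bits(p_s_given_a, iters=300):
    """Channel capacity max_{p(a)} I(A; S) in bits, via Blahut-Arimoto.

    p_s_given_a: (n_action_sequences, n_states) array whose rows are the
    channel p(s_t | a^(k), w). The result is E^(k)(w) for that state.
    """
    p = np.asarray(p_s_given_a, dtype=float)
    n_a = p.shape[0]
    q = np.full(n_a, 1.0 / n_a)              # distribution over action sequences
    for _ in range(iters):
        r = q @ p                            # output marginal p(s)
        logratio = np.where(p > 0, np.log(p / r), 0.0)
        d = np.exp((p * logratio).sum(axis=1))   # exp(KL(p(.|a) || r))
        q = q * d
        q /= q.sum()
    r = q @ p
    logratio2 = np.where(p > 0, np.log2(p / r), 0.0)
    return float((q[:, None] * p * logratio2).sum())

# noiseless channel: 4 action sequences, 4 distinguishable end states -> 2 bits
p = np.eye(4)
```

Two action sequences that always lead to the same state are redundant and do not add empowerment: a 4-input channel with only 3 distinguishable outcomes yields log2(3) ≈ 1.58 bits.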
Empowerment — a Universal Utility
Notes
Empowerment E(k)(w) defined
given horizon k, i.e. local
given starting state w (or context, for POMDPs)
i.e. empowerment is a function of state, a “utility”
However
only defined by world dynamics
no reward function assumed
Empowerment — Notes
Properties of Empowerment
want to maximize potential information flow that
  could be injected through the actuators into the environment
  and recaptured by the sensors in the future

potential influence on the environment which is detectable through the agent’s sensors
determined by the embodiment P^a_{ss′} only
no external reward R^a_{ss′}

Bottom Line

information-theoretic controllability/observability
informational efficiency of the sensorimotor niche
Other Interpretations
Related Concepts
mobility
money
affordance
graph centrality (Anthony et al., 2008)

antithesis to “helplessness” (Seligman and Maier, 1967; Overmier and Seligman, 1967)
Think Strategic
Tactics is what you do when you have a plan
Strategy is what you do when you haven’t
Part X
First Examples
Maze Empowerment
[figure: four maze configurations (maze vs. average-distance view), with empowerment ranges E ∈ [1, 2.32], E ∈ [1.58, 3.70], E ∈ [3.46, 5.52], E ∈ [4.50, 6.41]]
Empowerment vs. Average Distance
[scatter plot of empowerment E (≈ 4.5 to 6.0) against average distance d (≈ 6 to 16)]
Box Pushing
                         stationary box      pushable box
box invisible to agent   E ∈ [5.86, 5.93]    E = log2 61 ≈ 5.93 bit
box visible to agent     E ∈ [5.86, 5.93]    E ∈ [5.93, 7.79]
In the Continuum: Pendulum Swing-up Task w/o Reward (Jung et al., 2011)
Dynamics
pendulum (length l = 1, mass m = 1, grav g = 9.81, friction µ = 0.05)
φ̈(t) = ( −µ φ̇(t) + m g l sin(φ(t)) + u(t) ) / (m l²)

with state s_t = (φ(t), φ̇(t)) and continuous control u ∈ [−5, 5]
system time discretized to ∆ = 0.05 sec
actions discretized to u ∈ {−5, −2.5, 0, +2.5, +5}
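A sketch of this dynamics with a plain Euler step; this is only a rough stand-in, the integrator used in the paper may well differ:

```python
import math

# pendulum parameters from the slide
M, L, G_GRAV, MU = 1.0, 1.0, 9.81, 0.05
DT = 0.05                                   # time discretization (sec)
ACTIONS = (-5.0, -2.5, 0.0, 2.5, 5.0)       # discretized torques

def step(phi, phi_dot, u, dt=DT):
    """One Euler step of phi'' = (-mu*phi' + m*g*l*sin(phi) + u) / (m*l^2)."""
    phi_ddot = (-MU * phi_dot + M * G_GRAV * L * math.sin(phi) + u) / (M * L ** 2)
    return phi + dt * phi_dot, phi_dot + dt * phi_ddot
```

With this sign convention φ = 0 is an unstable equilibrium: a small perturbation grows under u = 0, which together with the weak available torques is what makes the swing-up task non-trivial.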
Goal

To provide this system with some matching purpose, consider the pendulum swing-up task

Comparison

empowerment-based control
traditional optimal control

The optimal control problem is solved by approximate dynamic programming on a high-resolution grid.
Results: Performance
[two time-series plots over 0–10 sec, curves for phi and phidot: performance of the optimal policy (FVI+KNN on a 1000×1000 grid) vs. performance of the maximally empowered policy (3-step)]

Phase plot of φ and φ̇ when following the respective greedy policy from the last slide. Note that for φ, the y-axis shows the height of the pendulum (+1 means upright, the goal state).
Results: “Explored” Space
[phase-space plot, φ ∈ [−π, π] rad, φ′ ∈ [−6, 6] rad/s: states visited under empowerment-based exploration, coloured by the chosen action (Action 0 … Action 4)]
Empowerment: Acrobot (Jung et al., 2011)
Setting
two-link pendulum
actuation in the hip only

Idea

Add LQR control to bang-bang control
Acrobot: Demo
Block’s World (Salge, 2013)
Properties
scenario with modifiable world
deterministic (i.e. empowerment is log₂ n, where n is the number of reachable states in horizon k)
agent can incorporate, place, destroy blocks and move
estimated via (highly incomplete) sampling
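In the deterministic case this reduces to counting reachable states, since distinct reachable end states are perfectly distinguishable outcomes of the k-step action channel. A small sketch (the `transitions` signature and the toy 1-D world are illustrative):

```python
import math

def deterministic_empowerment(start, transitions, actions, k):
    """E^(k) = log2(n) bits, where n is the number of states reachable
    from `start` with k-step action sequences in a deterministic world."""
    frontier = {start}
    for _ in range(k):
        frontier = {transitions(s, a) for s in frontier for a in actions}
    return math.log2(len(frontier))

# toy 1-D world: move left/right or stay on an unbounded line of cells
trans = lambda s, a: s + a
e = deterministic_empowerment(0, trans, actions=(-1, 0, 1), k=2)
# 5 reachable cells {-2, ..., 2} -> log2(5) ≈ 2.32 bits
```

For a large modifiable world the full reachable set is intractable, hence the (highly incomplete) sampling estimate mentioned above.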
Empowered “Minecrafter”
(Salge, 2013)
Explorer Accompanying Robot (Glackin et al., 2015)
Consortium
Demonstrator II
Part XI
References
Anthony, T., Polani, D., and Nehaniv, C. L. (2008). On preferred states of agents: how global structure is reflected in local structure. In Bullock, S., Noble, J., Watson, R., and Bedau, M. A., editors, Artificial Life XI: Proceedings of the Eleventh International Conference on the Simulation and Synthesis of Living Systems, Winchester, 5–8 Aug., pages 25–32. MIT Press, Cambridge, MA.

Ashby, W. R. (1956). An Introduction to Cybernetics. Chapman & Hall Ltd.

Ay, N., Bertschinger, N., Der, R., Güttler, F., and Olbrich, E. (2008). Predictive information and explorative behavior of autonomous robots. European Journal of Physics B, 63:329–339.

Baylor, D., Lamb, T., and Yau, K. (1979). Response of retinal rods to single photons. Journal of Physiology, London, 288:613–634.

Belavkin, R. (2008). The duality of utility and information in optimally learning systems. In Proc. 7th IEEE International Conference on ‘Cybernetic Intelligent Systems’. IEEE Press.
Belavkin, R. V. (2009). Bounds of optimal learning. In Adaptive Dynamic Programming and Reinforcement Learning, 2009. ADPRL ’09. IEEE Symposium on, pages 199–204. IEEE.

Biehl, M. (2013). Kullback-Leibler and Bayes. Internal memo.

Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory. Wiley, 2nd edition.

Denk, W. and Webb, W. W. (1989). Thermal-noise-limited transduction observed in mechanosensory receptors of the inner ear. Phys. Rev. Lett., 63(2):207–210.

Der, R. (2001). Self-organized acquisition of situated behavior. Theory Biosci., 120:1–9.

Dusenbery, D. B. (1992). Sensory Ecology. W. H. Freeman and Company, New York.

Glackin, C., Salge, C., Trendafilov, D., Greaves, M., Polani, D., Leu, A., Haque, S. J. U., Slavnic, S., and Ristic-Durrant, D. (2015). An information-theoretic intrinsic motivation model for robot navigation and path planning.
Hecht, S., Schlaer, S., and Pirenne, M. (1942). Energy, quanta and vision. Journal of the Optical Society of America, 38:196–208.

Jung, T., Polani, D., and Stone, P. (2011). Empowerment for continuous agent–environment systems. Adaptive Behaviour, 19(1):16–39. Published online 13 January 2011.

Kaplan, F. and Oudeyer, P.-Y. (2004). Maximizing learning progress: an internal reward system for development. In Iida, F., Pfeifer, R., Steels, L., and Kuniyoshi, Y., editors, Embodied Artificial Intelligence, volume 3139 of LNAI, pages 259–270. Springer.

Klyubin, A., Polani, D., and Nehaniv, C. (2007). Representations of space and time in the maximization of information flow in the perception-action loop. Neural Computation, 19(9):2387–2432.

Klyubin, A. S., Polani, D., and Nehaniv, C. L. (2004a). Organization of the information flow in the perception-action loop of evolved agents. In Proceedings of 2004 NASA/DoD Conference on Evolvable Hardware, pages 177–180. IEEE Computer Society.
Klyubin, A. S., Polani, D., and Nehaniv, C. L. (2004b). Tracking information flow through the environment: Simple cases of stigmergy. In Pollack, J., Bedau, M., Husbands, P., Ikegami, T., and Watson, R. A., editors, Artificial Life IX: Proceedings of the Ninth International Conference on Artificial Life, pages 563–568. MIT Press.

Klyubin, A. S., Polani, D., and Nehaniv, C. L. (2005a). All else being equal be empowered. In Advances in Artificial Life, European Conference on Artificial Life (ECAL 2005), volume 3630 of LNAI, pages 744–753. Springer.

Klyubin, A. S., Polani, D., and Nehaniv, C. L. (2005b). Empowerment: A universal agent-centric measure of control. In Proc. IEEE Congress on Evolutionary Computation, 2–5 September 2005, Edinburgh, Scotland (CEC 2005), pages 128–135. IEEE.

Klyubin, A. S., Polani, D., and Nehaniv, C. L. (2008). Keep your options open: An information-based driving principle for sensorimotor systems. PLoS ONE, 3(12):e4018.
Laughlin, S. B. (2001). Energy as a constraint on the coding and processing of sensory information. Current Opinion in Neurobiology, 11:475–480.

Laughlin, S. B., de Ruyter van Steveninck, R. R., and Anderson, J. C. (1998). The metabolic cost of neural information. Nature Neuroscience, 1(1):36–41.

Nehaniv, C. L., Polani, D., Olsson, L. A., and Klyubin, A. S. (2007). Information-theoretic modeling of sensory ecology: Channels of organism-specific meaningful information. In Laubichler, M. D. and Müller, G. B., editors, Modeling Biology: Structures, Behaviour, Evolution, The Vienna Series in Theoretical Biology, pages 241–281. MIT Press.

Overmier, J. B. and Seligman, M. E. P. (1967). Effects of inescapable shock upon subsequent escape and avoidance responding. Journal of Comparative and Physiological Psychology, 63:28–33.

Paul, C. (2006). Morphological computation: A basis for the analysis of morphology and control requirements. Robotics and Autonomous Systems, 54(8):619–630.
Pfeifer, R. and Bongard, J. (2007). How the Body Shapes the Way We Think: A New View of Intelligence. Bradford Books.

Polani, D. (2009). Information: Currency of life? HFSP Journal, 3(5):307–316.

Polani, D. (2011). An informational perspective on how the embodiment can relieve cognitive burden. In Proc. IEEE Symposium Series in Computational Intelligence 2011 – Symposium on Artificial Life, pages 78–85. IEEE.

Polani, D., Nehaniv, C., Martinetz, T., and Kim, J. T. (2006). Relevant information in optimized persistence vs. progeny strategies. In Rocha, L. M., Bedau, M., Floreano, D., Goldstone, R., Vespignani, A., and Yaeger, L., editors, Proc. Artificial Life X, pages 337–343.

Radvansky, G. A., Krawietz, S. A., and Tamplin, A. K. (2011). Walking through doorways causes forgetting: Further explorations. The Quarterly Journal of Experimental Psychology, 64(8):1632–1645.
Saerens, M., Achbany, Y., Fouss, F., and Yen, L. (2009). Randomized shortest-path problems: Two related models. Neural Computation, 21:2363–2404.

Salge, C. (2013). Block’s world. Presented at GSO 2013.

Schmidhuber, J. (1991). A possibility for implementing curiosity and boredom in model-building neural controllers. In Meyer, J. A. and Wilson, S. W., editors, Proc. of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pages 222–227. MIT Press/Bradford Books.

Seligman, M. E. P. and Maier, S. F. (1967). Failure to escape traumatic shock. Journal of Experimental Psychology, 74:1–9.

Singh, S., Barto, A. G., and Chentanez, N. (2005). Intrinsically motivated reinforcement learning. In Proceedings of the 18th Annual Conference on Neural Information Processing Systems (NIPS), Vancouver, B.C., Canada.

Steels, L. (2004). The autotelic principle. In Iida, F., Pfeifer, R., Steels, L., and Kuniyoshi, Y., editors, Embodied Artificial Intelligence: Dagstuhl Castle, Germany, July 7–11, 2003, volume 3139 of Lecture Notes in AI, pages 231–242. Springer Verlag, Berlin.
Still, S. and Precup, D. (2012). An information-theoretic approach to curiosity-driven reinforcement learning. Theory in Biosciences, 131(3):139–148.

Stratonovich, R. (1965). On value of information. Izvestiya of USSR Academy of Sciences, Technical Cybernetics, 5:3–12.

Tishby, N. and Polani, D. (2011). Information theory of decisions and actions. In Cutsuridis, V., Hussain, A., and Taylor, J., editors, Perception-Action Cycle: Models, Architecture and Hardware, pages 601–636. Springer.

Touchette, H. and Lloyd, S. (2000). Information-theoretic limits of control. Phys. Rev. Lett., 84:1156.

Touchette, H. and Lloyd, S. (2004). Information-theoretic approach to the study of control systems. Physica A, 331:140–172.

van Dijk, S. and Polani, D. (2011). Grounding subgoals in information transitions. In Proc. IEEE Symposium Series in Computational Intelligence 2011 – Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pages 105–111. IEEE.
van Dijk, S. and Polani, D. (2013). Informational constraints-driven organization in goal-directed behavior. Advances in Complex Systems, 16(2–3). Published online 30 April 2013, DOI: 10.1142/S0219525913500161.

van Dijk, S. G., Polani, D., and Nehaniv, C. L. (2010). What do you want to do today? Relevant-information bookkeeping in goal-oriented behaviour. In Proc. Artificial Life, Odense, Denmark, pages 176–183.

Vergassola, M., Villermaux, E., and Shraiman, B. I. (2007). ‘Infotaxis’ as a strategy for searching without gradients. Nature, 445:406–409.

Wissner-Gross, A. D. and Freer, C. E. (2013). Causal entropic forcing. Physical Review Letters, 110:168702.