
Multivariate Information Bottleneck

Noam Slonim
Princeton University, Lewis-Sigler Institute for Integrative Genomics

Nir Friedman, Naftali Tishby
Hebrew University, School of Computer Science and Engineering

2

Multivariate Information Bottleneck - Preview

- A general framework for specifying a new family of clustering problems

- Almost all of these problems are not treated by standard clustering approaches

- Insights into, and demonstrations of, why these problems are important

- A general optimal solution for all these problems, based on a single Information Theoretic principle

- Applications for text analysis, gene expression data and more...

3

Multivariate IB – introduction

- Original IB: compressing one variable while preserving the information it carries about some other single variable

[Diagram: X1 is compressed into a bottleneck variable T1 while preserving information about X2; the given joint P(X1,X2) is summarized by P(T1,X2)]

4

Multivariate IB – introduction (cont.)

- However, we could think of other problems, e.g. symmetric compression:

[Diagram: X1 is compressed into T1 and X2 is compressed into T2, starting from the given joint P(X1,X2)]

Question: How to formulate and solve all such problems under one unifying principle?

5

(a few words about …) Bayesian Networks

- A Bayes net over (X1,…,Xn) is a DAG G in which vertices correspond to the random variables, and the joint distribution factorizes accordingly:

P(X1,…,Xn) = ∏_i P(Xi | Pa_i^G)

- P(X1,…,Xn) is consistent with G iff each Xi is independent of all the other (non-descendant) variables, given its parents Pai

[Example DAG over X1, X2, X3, X4, encoding e.g. Ind(X2 ; X4 | X1); a toy factorization is sketched below]
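To make the factorization concrete, here is a minimal numeric sketch in Python. It uses a small hypothetical DAG (X1 → X2, X1 → X3, X3 → X4, chosen so that Ind(X2 ; X4 | X1) holds; this is an illustrative choice, not necessarily the DAG drawn on the slide), builds the joint from the conditional tables, and checks the conditional independence numerically.

```python
# Hypothetical example DAG (not necessarily the slide's): X1 -> X2, X1 -> X3, X3 -> X4.
# The joint then factorizes as P(x1,x2,x3,x4) = P(x1) P(x2|x1) P(x3|x1) P(x4|x3).
import numpy as np

rng = np.random.default_rng(0)

def random_cpt(*shape):
    """Random conditional probability table, normalized over the last axis."""
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

p_x1 = random_cpt(2)               # P(X1)
p_x2_given_x1 = random_cpt(2, 2)   # P(X2 | X1)
p_x3_given_x1 = random_cpt(2, 2)   # P(X3 | X1)
p_x4_given_x3 = random_cpt(2, 2)   # P(X4 | X3)

# Joint over (X1, X2, X3, X4) from the Bayes-net factorization.
joint = np.einsum('a,ab,ac,cd->abcd', p_x1, p_x2_given_x1, p_x3_given_x1, p_x4_given_x3)
assert np.isclose(joint.sum(), 1.0)

# Check Ind(X2 ; X4 | X1): P(x2, x4 | x1) should equal P(x2 | x1) P(x4 | x1).
p_x1x2x4 = joint.sum(axis=2)  # marginalize X3
p_x2x4_given_x1 = p_x1x2x4 / p_x1x2x4.sum(axis=(1, 2), keepdims=True)
p_x2_given = p_x2x4_given_x1.sum(axis=2, keepdims=True)
p_x4_given = p_x2x4_given_x1.sum(axis=1, keepdims=True)
print(np.allclose(p_x2x4_given_x1, p_x2_given * p_x4_given))  # True
```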

6

Multi-information and Bayes nets

- The information that the variables X1,…,Xn carry about each other is captured by the multi-information:

I(X1,…,Xn) = E_P[ log P(X1,…,Xn) / (P(X1)···P(Xn)) ]

- If P(X1,…,Xn) is consistent with G, this becomes a sum of local terms (denoted I^G below):

I(X1,…,Xn) = Σ_i I(Xi ; Pa_i^G)
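As a small illustrative check (not part of the talk), the sketch below computes the multi-information of a toy chain-structured joint X1 → X2 → X3 and verifies the decomposition above, i.e. I(X1,X2,X3) = I(X1;X2) + I(X2;X3).

```python
# Toy check of the multi-information decomposition for a chain X1 -> X2 -> X3.
import numpy as np

rng = np.random.default_rng(1)

def norm(t, axis=-1):
    return t / t.sum(axis=axis, keepdims=True)

p1 = norm(rng.random(3))          # P(X1)
p2_1 = norm(rng.random((3, 3)))   # P(X2 | X1)
p3_2 = norm(rng.random((3, 3)))   # P(X3 | X2)
joint = np.einsum('a,ab,bc->abc', p1, p2_1, p3_2)  # P(X1, X2, X3)

def multi_information(p):
    """I(X1,...,Xn) = E_P[ log P(x1..xn) / (P(x1)...P(xn)) ], in nats."""
    n = p.ndim
    prod_marginals = np.ones_like(p)
    for i in range(n):
        marg = p.sum(axis=tuple(j for j in range(n) if j != i))
        shape = [1] * n
        shape[i] = -1
        prod_marginals = prod_marginals * marg.reshape(shape)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / prod_marginals[mask]))

# For two variables, multi-information is just the mutual information I(X;Y).
i_12 = multi_information(joint.sum(axis=2))  # I(X1;X2)
i_23 = multi_information(joint.sum(axis=0))  # I(X2;X3)
print(multi_information(joint), i_12 + i_23)  # the two numbers agree
```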

7

Original IB through Bayes net formulation

- Original IB functional: Minimize L[P(T1|X1)] = I(T1;X1) − β·I(T1;X2)

- New generalized formulation: Minimize L[P(T1|X1)] = I^Gin − β·I^Gout, which in this case means:

Gin (what compresses what): X1 → T1, plus the fixed dependence between X1 and X2, so
I^Gin = I(T1;X1) + I(X1;X2), where I(X1;X2) is constant

Gout (what predicts what): T1 → X2, so
I^Gout = I(T1;X2)

Up to the constant I(X1;X2), minimizing I^Gin − β·I^Gout recovers the original IB functional.

8

Alternative formulation: preliminaries

For a given DAG G, define: KL[P || G] ≡ min over Q consistent with G of KL[P || Q]

For P which is consistent with Gin: KL[P || Gout] = I^Gin − I^Gout

(I^Gin is the real multi-information in P(X,T); I^Gout is the multi-information computed as though P(X,T) were consistent with Gout)

12

Beyond the original IB[Slonim, Friedman, Tishby]

Input variables: X1, X2, X3, X4, …, Xn

Compression (bottleneck) variables: T1, T2, T3, …, Tk

Parameters: P(Tj | Pa_j^Gin)

Gin specifies which dependencies to minimize (compression); Gout specifies which dependencies to maximize (prediction).

13

A simple example: Symmetric IB

Gin (what compresses what): X1 → T1 and X2 → T2, so
I^Gin = I(T1;X1) + I(T2;X2)

Gout (what predicts what): T1 and T2 should predict each other, so
I^Gout = I(T1;T2)

Minimize: L = I(T1;X1) + I(T2;X2) − β·I(T1;T2)
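For concreteness, here is a minimal sketch (illustrative, not the authors' code) that evaluates this symmetric-IB trade-off for given soft assignments P(T1|X1) and P(T2|X2) and a given joint P(X1,X2):

```python
# Sketch: evaluate L = I(T1;X1) + I(T2;X2) - beta * I(T1;T2) for given soft assignments.
import numpy as np

def mi(pxy):
    """Mutual information (nats) of a 2-D joint distribution."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log((pxy / (px * py))[mask]))

def symmetric_ib_objective(p_x1x2, q_t1_given_x1, q_t2_given_x2, beta):
    p_x1 = p_x1x2.sum(axis=1)
    p_x2 = p_x1x2.sum(axis=0)
    p_t1x1 = (q_t1_given_x1 * p_x1[:, None]).T            # P(t1, x1)
    p_t2x2 = (q_t2_given_x2 * p_x2[:, None]).T            # P(t2, x2)
    # P(t1, t2) = sum_{x1,x2} P(t1|x1) P(x1,x2) P(t2|x2)
    p_t1t2 = q_t1_given_x1.T @ p_x1x2 @ q_t2_given_x2
    return mi(p_t1x1) + mi(p_t2x2) - beta * mi(p_t1t2)

# Toy usage with a random joint and random soft assignments:
rng = np.random.default_rng(4)
p = rng.random((5, 6)); p /= p.sum()
q1 = rng.dirichlet(np.ones(2), size=5)   # P(T1|X1), 2 clusters of X1
q2 = rng.dirichlet(np.ones(3), size=6)   # P(T2|X2), 3 clusters of X2
print(symmetric_ib_objective(p, q1, q2, beta=2.0))
```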

14

A multivariate formal optimal solution

- Minimizing I^Gin − β·I^Gout with respect to P(tj | Pa_j^Gin) yields the self-consistent solution

P(tj | Pa_j^Gin) = [ P(tj) / Z(Pa_j^Gin, β) ] · exp( −β·d(Pa_j^Gin, tj) )

- where d(Pa_j, tj) is a generalized (KL) distortion measure…

- For example, in symmetric IB: d(x1, t1) = D_KL[ P(T2 | x1) || P(T2 | t1) ]
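As an illustration of how these self-consistent equations can be used in practice, here is a minimal sketch of the iterative updates for the simplest case, the original two-variable IB, where the distortion is d(x1, t) = D_KL[ P(X2|x1) || P(X2|t) ]. The toy joint and all names are assumptions made for the sketch, not the authors' implementation.

```python
# Sketch of iterative IB updates: alternate P(t|x1) ~ P(t) exp(-beta d(x1,t))
# with the consistency equations for P(t) and P(X2|t).
import numpy as np

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def iterative_ib(p_x1x2, n_clusters, beta, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    n1, _ = p_x1x2.shape
    p_x1 = p_x1x2.sum(axis=1)                      # P(x1)
    p_x2_given_x1 = p_x1x2 / p_x1[:, None]         # P(x2|x1)
    q_t_given_x1 = rng.dirichlet(np.ones(n_clusters), size=n1)  # random init of P(t|x1)
    for _ in range(n_iter):
        q_t = q_t_given_x1.T @ p_x1                # P(t) = sum_x1 P(x1) P(t|x1)
        # P(x2|t) = sum_x1 P(x2|x1) P(x1|t)
        q_x2_given_t = (q_t_given_x1 * p_x1[:, None]).T @ p_x2_given_x1 / q_t[:, None]
        # d(x1,t) = KL[ P(x2|x1) || P(x2|t) ]  (the generalized distortion)
        d = np.array([[kl(p_x2_given_x1[i], q_x2_given_t[t]) for t in range(n_clusters)]
                      for i in range(n1)])
        # P(t|x1) proportional to P(t) exp(-beta * d(x1,t))
        logits = np.log(q_t)[None, :] - beta * d
        logits -= logits.max(axis=1, keepdims=True)
        q_t_given_x1 = np.exp(logits)
        q_t_given_x1 /= q_t_given_x1.sum(axis=1, keepdims=True)
    return q_t_given_x1

# Toy usage with a random joint:
p = np.random.default_rng(2).random((6, 4)); p /= p.sum()
print(iterative_ib(p, n_clusters=2, beta=5.0).round(2))
```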

15

Multivariate IB algorithms – example for aIB [Slonim, Friedman, Tishby, 2002]

Agglomerative IB (aIB) builds a hierarchy of clusters bottom-up, starting from singletons:

W1, W2, W3, W4, W5, …, WN → W1, W2, {W3,W4}, W5, …, WN → … → {W1,W2,…,WN}

- Which pair to merge? The pair {tl, tr} with the smallest merge cost:

ΔL(tl, tr) = L_after − L_before = ( P(tl) + P(tr) ) · d̃(tl, tr)

- where d̃(tl, tr) is a generalized (JS) distortion measure…

- For example, in symmetric aIB: d̃(tl, tr) = D_JS[ P(T2 | tl), P(T2 | tr) ]
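Below is a small sketch of this greedy merging loop for the original (one-sided) aIB, where the distortion is the weighted JS divergence between P(X2|tl) and P(X2|tr); the toy joint is an assumption made for illustration, not the authors' implementation.

```python
# Sketch of agglomerative IB: start from singleton clusters of X1 and repeatedly merge
# the pair with the smallest cost (P(tl)+P(tr)) * JS[ P(X2|tl), P(X2|tr) ].
import numpy as np

def js_divergence(p, q, w1, w2):
    """Jensen-Shannon divergence with weights w1, w2 (w1 + w2 = 1)."""
    m = w1 * p + w2 * q
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return w1 * kl(p, m) + w2 * kl(q, m)

def aib(p_x1x2, n_clusters):
    """Greedy bottom-up merging; returns clusters as lists of X1 indices."""
    p_x1 = p_x1x2.sum(axis=1)
    clusters = [[i] for i in range(p_x1x2.shape[0])]
    weights = list(p_x1)                                       # P(t) per cluster
    conds = [p_x1x2[i] / p_x1[i] for i in range(len(p_x1))]    # P(X2|t) per cluster
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                w = weights[a] + weights[b]
                cost = w * js_divergence(conds[a], conds[b], weights[a] / w, weights[b] / w)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        w = weights[a] + weights[b]
        conds[a] = (weights[a] * conds[a] + weights[b] * conds[b]) / w  # merged P(X2|t)
        clusters[a] = clusters[a] + clusters[b]
        weights[a] = w
        del clusters[b], weights[b], conds[b]
    return clusters

# Toy usage:
p = np.random.default_rng(3).random((8, 5)); p /= p.sum()
print(aib(p, n_clusters=3))
```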

16

Symmetric aIB compression: documents, words

- Accuracy of symmetric aIB vs. original aIB over 3 small datasets:

[Bar chart: accuracy (%), roughly in the 60–90% range, on Test2, Test4, and Test5, comparing original aIB with symmetric aIB]

Word clusters provide a more robust representation…

17

Symmetric IB through Deterministic Annealing

Data: 20,000 messages from 20 different discussion groups [Lang, 95]

W – a word in the corpus; C – the class (newsgroup) of the message

P(W='bible', C='alt.atheism'): the probability that choosing a random position in the corpus selects the word 'bible' in a message of the newsgroup (class) 'alt.atheism'…

[Heatmap of log P(W,C), with words on one axis and classes on the other]
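A hypothetical sketch of how a joint like the P(W,C) described above could be estimated from a labeled corpus (toy documents here, not the 20-newsgroups data; the function name is illustrative):

```python
# Each word occurrence in a labeled message contributes one count to its
# (word, newsgroup) cell; normalizing the counts gives an estimate of P(W,C).
from collections import Counter

def estimate_joint(messages):
    """messages: iterable of (list_of_tokens, class_label) pairs."""
    counts = Counter()
    for tokens, label in messages:
        for w in tokens:
            counts[(w, label)] += 1
    total = sum(counts.values())
    return {wc: n / total for wc, n in counts.items()}

# Toy usage:
docs = [(["the", "bible", "says"], "alt.atheism"),
        (["car", "engine"], "rec.autos")]
p_wc = estimate_joint(docs)
print(p_wc[("bible", "alt.atheism")])  # 0.2
```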

18

Symmetric IB through Deterministic Annealing

[Heatmap: newsgroups vs. words]

19

Symmetric IB through Deterministic Annealing

[Heatmap of P(TC,TW): newsgroup clusters vs. word clusters]

Newsgroup clusters:
- alt.atheism, rec.autos, rec.motorcycles, rec.sport.*, sci.med, sci.space, soc.religion.christian, talk.politics.*
- comp.*, misc.forsale, sci.crypt, sci.electronics

Word clusters:
- car, turkish, game, team, jesus, gun, hockey, …
- xfile, image, encryption, window, dos, mac, …

21, 23, 24

Symmetric IB through Deterministic Annealing

[Heatmaps of P(TC,TW) at successive annealing stages: newsgroup clusters vs. word clusters]

25

Symmetric IB through Deterministic Annealing

[Heatmap of P(TC,TW) at a later annealing stage]

Example word cluster: atheists, christianity, jesus, bible, sin, faith, …
Corresponding newsgroup cluster: alt.atheism, soc.religion.christian, talk.religion.misc

26

Symmetric aIB compression: genes, samples

Data: gene expression of 500 “informative” genes vs. 72 leukemia samples (Golub et al., 1999)

[Heatmap of log P(G,S), with genes on one axis and samples on the other]

27

Symmetric aIB compression: genes, samples

Data after symmetric aIB compression:

[Heatmap of P(TG,TS): 10 gene clusters vs. 8 sample clusters; the sample clusters align with known annotations such as ALL B-cell (hosp. 1), ALL T-cell (hosp. 1, male), BM B-cell, AML, AML (hosp. 2), and AML (hosp. 3)]

Gene probes listed on the figure: X00437_s_at, M12886_at, X76223_s_at, M59807_at, U23852_s_at, D00749_s_at, U89922_s_at, X03934_at, U50743_at, M21624_at, M28826_at, M37271_s_at, X59871_at, X14975_at, M16336_s_at, L05148_at, M28825_at

28

Another example: parallel IB

- Consider a document collection with different topics, and different writing styles:

[Figure: a cartoon document collection (“Science” documents), each document labeled by one of four topics (topic1–topic4)]

29

Another example: parallel IB (cont.)

- One possible “legitimate” partition is by topic:

[Figure: the same documents grouped into four clusters, Topic1, Topic2, Topic3, Topic4]

30

Another example: parallel IB (cont.)

- And another possible “legitimate” partition is by writing style:

[Figure: the same documents grouped into three clusters, Style1, Style2, Style3]

There might be more than one “legitimate” partition…

31

Parallel IB: solution

Minimize: L = I(T1;X1) + I(T2;X1) − β·I(T1,T2 ; X2)

Gin (minimize dependencies): X1 → T1 and X1 → T2, so
I^Gin = I(T1;X1) + I(T2;X1)

Gout (maximize dependencies): (T1,T2) → X2, so
I^Gout = I(T1,T2 ; X2)

Effective distortion:
d(x1, t1) = E_{P(T2|x1)}[ D_KL( P(X2 | x1, T2) || P(X2 | t1, T2) ) ]

32

Parallel sIB: Text analysis results

- Data: ~1,500 “documents” taken from
E. R. Burroughs: The Beasts of Tarzan & The Gods of Mars
R. Kipling: The Jungle Book & Rewards and Fairies

- X1 corresponds to “documents”, X2 corresponds to words

Document counts per cluster:

                       T1,a   T1,b      T2,a   T2,b
The Beasts of Tarzan    315      2       315      2
The Gods of Mars        407      0         1    406
The Jungle Book           0    255       254      1
Rewards and Fairies       0    367        42    325

T1 separates the documents by author (Burroughs: T1,a; Kipling: T1,b), while T2 cuts across authors, grouping The Beasts of Tarzan with The Jungle Book and The Gods of Mars with Rewards and Fairies.

33

Parallel sIB: Gene expression data results

- Data: gene expression of 500 “informative” genes vs. 72 leukemia samples (Golub et al., 1999)

- X1 corresponds to samples, X2 corresponds to genes

Sample counts per cluster, for each of the four parallel partitions (with the <PS> score per cluster):

          T1,a  T1,b    T2,a  T2,b    T3,a  T3,b    T4,a  T4,b
AML         23     2      14    11      12    13      13    12
ALL          0    47      37    10       9    38      22    25
  B-cell     0    38      37     1       6    32      20    18
  T-cell     0     9       0     9       3     6       2     7
<PS>       .64   .72     .71   .66     .53   .76     .70   .69

34

Another Example: Triplet IB

- Consider the following sequence data:

s(1) s(2) s(3) … s(t-1) s(t) s(t+1) …

- Can we extract features such that their combination is informative about the symbol between them?

[Diagram: a preceding symbol Xp is compressed into Tp and a following symbol Xn into Tn; together (Tp,Tn) should be informative about the middle symbol Xm]

35

Triplet IB: solution

Minimize: L = I(Tp;Xp) + I(Tn;Xn) − β·I(Tp,Tn ; Xm)

Gin (minimize dependencies): Xp → Tp and Xn → Tn, so
I^Gin = I(Tp;Xp) + I(Tn;Xn)

Gout (maximize dependencies): (Tp,Tn) → Xm, so
I^Gout = I(Tp,Tn ; Xm)

36

Triplet IB Data

(E. R. Burroughs, “Tarzan the Terrible”)

“… As Tarzan ascended the platform his eyes narrowed angrily at the sight which met them… ‘What means this?’ he cried angrily…”

Xp – 1st word in the triplet; Xm – 2nd word in the triplet; Xn – 3rd word in the triplet

Xm = {apemans, apes, eyes, girl, great, jungle, tarzan, time, two, way}

Data: Tarzan and the Jewels of Opar, Tarzan of the Apes, Tarzan the Terrible, Tarzan the Untamed, The Beasts of Tarzan, The Jungle Tales of Tarzan, The Return of Tarzan

Joint distribution P(Xp,Xm,Xn) of dimension 90 x 10 x 233
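A hypothetical sketch of how a joint like P(Xp,Xm,Xn) could be estimated from a token stream: slide a window of three consecutive words over the text, keep the triplets whose middle word is in the chosen set, and normalize the counts (toy text below, not the Tarzan corpus):

```python
# Count (previous word, middle word, next word) triplets whose middle word is of interest.
from collections import Counter

def triplet_joint(tokens, middle_words):
    counts = Counter()
    for i in range(1, len(tokens) - 1):
        if tokens[i] in middle_words:
            counts[(tokens[i - 1], tokens[i], tokens[i + 1])] += 1
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

# Toy usage:
text = "the great apes of the jungle watched the great eyes of tarzan".split()
print(triplet_joint(text, {"great", "jungle", "eyes", "tarzan"}))
```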

37

Triplet sIB: Text analysis results

- Given Xp and Xn, two schemes to predict the middle word: from the raw pair, Xm = argmax P( xm' | xp, xn ), or from the cluster pair, Xm = argmax P( xm' | tp, tn )

- Test on a NEW sequence, “The Son of Tarzan”:

Xm (count)      Precision (%)        Recall (%)
                Tp,Tn    Xp,Xn       Tp,Tn    Xp,Xn
Apes (78)         43       26          17       14
Eyes (177)        83       81          32       28
Girl (240)        43       30           5        1
Great (219)       92       92          50       48
Jungle (241)      49       54          27       24
Tarzan (48)       41       67          40       25
Time (145)        70       82          48       26
Two (148)         41       92          11        8
Way (101)         60       81          28       21
Average           53       55          28       22
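A hypothetical sketch of the two prediction schemes and the per-word precision/recall evaluation used above; the cluster assignments tp_of and tn_of are assumed to come from the triplet sIB solution, and all names are illustrative:

```python
# Train an argmax predictor of the middle word from a context key, which can be either
# the raw pair (xp, xn) or the cluster pair (tp, tn), then score precision/recall per word.
from collections import Counter, defaultdict

def train_predictor(triplets, key):
    """triplets: (xp, xm, xn) tuples; key maps (xp, xn) to a context (raw or clustered)."""
    counts = defaultdict(Counter)
    for xp, xm, xn in triplets:
        counts[key(xp, xn)][xm] += 1
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def precision_recall(predictor, key, test_triplets, target):
    tp = fp = fn = 0
    for xp, xm, xn in test_triplets:
        pred = predictor.get(key(xp, xn))        # None if this context was never seen
        if pred == target and xm == target:
            tp += 1
        elif pred == target:
            fp += 1
        elif xm == target:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Usage sketch (tp_of / tn_of would map words to their learned clusters):
# raw_key = lambda xp, xn: (xp, xn)
# clustered_key = lambda xp, xn: (tp_of[xp], tn_of[xn])
```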

38

Summary

- The IB method is a principled framework for extracting “informative” structure out of a joint distribution P(X1,X2).

- The Multivariate IB extends this framework to extract “informative” structure from more complex joint distributions, P(X1,…,Xn), in various ways.

- This enables us to define and solve a new family of optimization problems, under a single unifying Information Theoretic principle.

- References: www.cs.huji.ac.il/~noamm

- “Clustering” conceals a family of distinct problems which deserve special consideration. The multivariate IB framework enables us to define these sub-problems, solve them, and demonstrate their importance.