Universal scaling of semantic information revealed from IB word clusters, or human language as optimal biological adaptation
TRANSCRIPT
Universal Scaling of Semantic Information
Revealed from IB word clusters, or
Human language as optimal biological adaptation
Naftali Tishby
School of Computer Science & Engineering &
Interdisciplinary Center for Neural Computation
The Hebrew University, Jerusalem, Israel
http://www.cs.huji.ac.il/~tishby
Workshop on Machine Learning in Natural Language Processing, CRI, Haifa University
December 2006
Outline: Language – a window into our cognitive processing
• What can we learn from word statistics? How can we quantify it? Is there a "correct level" of description?
• Information Bottleneck (IB) and the representation of relevance: finding approximate sufficient statistics
• Words, documents and meaning… Trading complexity and accuracy
• Scaling of semantic information. Possible models: small world properties
What are words?
• acquired persistent neural activity associated with perception and cognitive functions
• appear in every language at a regular, power-law sub-linear rate
[Figure: number of different words vs. number of observed words, and log number of different words vs. log number of words; linear fit y = 0.64x + 2.07.]
[Figure, "english": number of different words vs. number of documents and vs. number of words, on linear and log-log axes (first 100 docs are not displayed); log-log fits y = 0.55x + 5.92 and y = 0.56x + 2.81.]
[Figure, "hebrew": the same panels for Hebrew (first 100 docs are not displayed); log-log fits y = 0.65x + 4.57 and y = 0.64x + 2.07.]
[Figure, "korean": the same panels for Korean (first 100 docs are not displayed); log-log fits y = 0.77x + 5.13 and y = 0.70x + 2.21.]
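These sub-linear growth curves are straight lines on log-log axes (Heaps-law behavior), with slopes between 0.55 and 0.77 in the panels above. As a hedged sketch (not the talk's code or corpora), here is a minimal way to estimate such an exponent from any token stream; the synthetic Zipf-distributed stream below is only a stand-in for real text:

```python
import numpy as np

def heaps_fit(tokens, n_points=50):
    """Fit log(#different words) = a*log(#words) + b over a token stream."""
    seen, growth = set(), []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        growth.append((n, len(seen)))
    # sample points log-uniformly, skipping the noisy head
    # (cf. "first 100 docs are not displayed" in the panels above)
    idx = np.unique(np.geomspace(100, len(growth), n_points).astype(int)) - 1
    x = np.log([growth[i][0] for i in idx])
    y = np.log([growth[i][1] for i in idx])
    a, b = np.polyfit(x, y, 1)
    return a, b   # slope a is the growth exponent (0.55-0.77 in the panels above)

# toy usage: a synthetic Zipf-distributed token stream stands in for real text
rng = np.random.default_rng(0)
print(heaps_fit(rng.zipf(1.3, size=200_000)))
```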
Rank – Frequency of words
Words exhibit "scale-free" statistics – Zipf's law
[Figure: Hebrew Zipf curve, log Frequency vs. log Rank.]
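A minimal sketch of how such a rank-frequency curve is computed; the tiny token list is only a placeholder for a real corpus:

```python
import numpy as np
from collections import Counter

def zipf_curve(tokens):
    """Return (log rank, log relative frequency), ranks sorted by count."""
    counts = np.array(sorted(Counter(tokens).values(), reverse=True), float)
    freqs = counts / counts.sum()
    return np.log(np.arange(1, len(freqs) + 1)), np.log(freqs)

# toy usage; for natural-language text the fitted slope is near -1 (Zipf's law)
log_r, log_f = zipf_curve("the quick brown fox jumps over the lazy dog the fox".split())
print(np.polyfit(log_r, log_f, 1)[0])
```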
How are words/languages generated? Basic observations:
• serve for communication and representation
• adapt to variable world statistics
• collective (social) entity
• acquired continuously (individually and collectively)
Competition between communication efficiency and adaptability / learnability.
[Diagram: Accuracy vs. Complexity plane: possible models/representations, constrained by limited data and by bounded computation.]
Complexity – Accuracy Tradeoff
Can we quantify it…?
When there is a (relevant) prediction or distortion measure:
• Accuracy ↔ good predictions (low distortion/error)
• Complexity ↔ length of the minimal description (optimal codes)
A general tradeoff between distortion and compression: Information Theory.
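For reference, the tradeoff alluded to here is formalized by Shannon's rate-distortion function, the minimal coding rate achievable at expected distortion at most D:

```latex
R(D) \;=\; \min_{\,p(\hat{x}\mid x)\;:\;\mathbb{E}[d(x,\hat{x})]\,\le\, D}\; I(X;\hat{X})
```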
What can we learn from word co-occurrence...?
         Audio  Health  www  Drug  noise  Dos  Doctor  ...
Topic1      12       0    0     0      8    0       0  ...
Topic2       0       9    2    11      1    0       6  ...
Topic3       0      10    1     6      0    0      20  ...
Topic4       9       1    0     0      7    0       1  ...
Topic5       0       3    9     0      1   10       0  ...
Topic6       1      11    0     6      0    1       7  ...
Topic7       0       0    8     0      2   12       2  ...
Topic8      15       0    1     1     10    0       0  ...
Topic9       0      12    1    16      0    1      12  ...
Topic10      1       0    9     0      1   11       2  ...
...        ...     ...  ...   ...    ...  ...     ...  ...
We need to count the maximum number of non-overlapping green blobs (typical sets of $X$ given $\hat{x}$) that fit inside the blue blob (the typical set of $X$): mutual information!
With the stochastic map $p(\hat{x}|x): X \to \hat{X}$, that count is
$$\frac{2^{nH(X)}}{2^{nH(X|\hat{X})}} \;=\; 2^{\,nI(X;\hat{X})}.$$
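A hedged sketch of that counting argument in practice: mutual information computed directly from a joint count table; the small matrix below excerpts the first rows of the table above:

```python
import numpy as np

def mutual_information(counts):
    """I(X;Y) in bits, from a joint count (or probability) matrix."""
    p = counts / counts.sum()
    px = p.sum(axis=1, keepdims=True)          # marginal p(x)
    py = p.sum(axis=0, keepdims=True)          # marginal p(y)
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / (px @ py)[mask])).sum())

# a few rows excerpted from the topic-word table above
counts = np.array([[12, 0, 0, 0, 8, 0, 0],
                   [0, 9, 2, 11, 1, 0, 6],
                   [0, 10, 1, 6, 0, 0, 20]], float)
print(mutual_information(counts))
```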
Representation and Mutual Information
IB: an Information Theoretic Principle for extracting Relevant structure
The minimal representation of X that keeps as much information about another variable, Y, as possible. Generalizes the classical notion of "sufficient statistics".
For the Markov chain $\hat{X} \leftarrow X \leftrightarrow Y$, trade compression $I(X;\hat{X})$ against prediction $I(\hat{X};Y)$:
$$\min_{p(\hat{x}|x)} \; I(X;\hat{X}) \;-\; \beta\, I(\hat{X};Y)$$
The Self-Consistent Equations
Marginal: $p(\hat{x}) = \sum_x p(\hat{x}|x)\, p(x)$
Markov condition: $p(y|\hat{x}) = \sum_x p(y|x)\, p(x|\hat{x})$
Bayes' rule: $p(x|\hat{x}) = \dfrac{p(\hat{x}|x)\, p(x)}{p(\hat{x})}$
Setting $\dfrac{\partial L[p(\hat{x}|x)]}{\partial p(\hat{x}|x)} = 0$ yields
$$p(\hat{x}|x) = \frac{p(\hat{x})}{Z(x,\beta)} \exp\!\Big(-\beta\, D_{KL}\big[p(y|x)\,\|\,p(y|\hat{x})\big]\Big)$$
The emergent effective distortion measure:
$$D_{KL}\big[p(y|x)\,\big\|\,p(y|\hat{x})\big] \;=\; \sum_y p(y|x)\, \log \frac{p(y|x)}{p(y|\hat{x})}$$
• Regular if $p(y|x)$ is absolutely continuous w.r.t. $p(y|\hat{x})$
• Small if $\hat{x}$ predicts $y$ as well as $x$ does
[Diagram: the Markov chain $Y \leftrightarrow X \rightarrow \hat{X}$, with arrows labeled $p(y|x)$, $p(\hat{x}|x)$, and the induced $p(y|\hat{x})$.]
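As an implementation sketch (an assumption, not the talk's code), this distortion can be evaluated for every pair $(x, \hat{x})$ at once when the conditionals are stored as rows:

```python
import numpy as np

def kl_rows(p_y_x, p_y_t, eps=1e-12):
    """D[i, j] = D_KL( p(y|x_i) || p(y|xhat_j) ) in nats.
    p_y_x: (|X|, |Y|) rows p(y|x); p_y_t: (|Xhat|, |Y|) rows p(y|xhat)."""
    P = p_y_x[:, None, :]                     # shape (|X|, 1, |Y|)
    Q = p_y_t[None, :, :]                     # shape (1, |Xhat|, |Y|)
    return (P * (np.log(P + eps) - np.log(Q + eps))).sum(axis=-1)
```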
The Information Bottleneck Algorithm
The variational ("free energy") formulation
$$\min_{p(\hat{x}|x)}\; \min_{p(\hat{x})}\; \min_{p(y|\hat{x})}\; \Big[\, I(X;\hat{X}) + \beta\, \big\langle D_{KL}\big[p(y|x)\,\|\,p(y|\hat{x})\big] \big\rangle \,\Big]$$
is solved by iterating the self-consistent equations (a generalized BA-algorithm):
$$p_t(\hat{x}|x) = \frac{p_t(\hat{x})}{Z_t(x,\beta)} \exp\!\Big(-\beta\, D_{KL}\big[p(y|x)\,\|\,p_t(y|\hat{x})\big]\Big)$$
$$p_{t+1}(\hat{x}) = \sum_x p(x)\, p_t(\hat{x}|x)$$
$$p_{t+1}(y|\hat{x}) = \sum_x p(y|x)\, p_t(x|\hat{x})$$
with the emergent effective distortion measure $D_{KL}\big[p(y|x)\,\|\,p(y|\hat{x})\big]$.
Can be calculated analytically for Markov chains, Gaussian processes, etc., and numerically in general.
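A minimal numerical sketch of this iteration, assuming nothing beyond NumPy; the joint distribution, cluster count, and beta below are illustrative placeholders:

```python
import numpy as np

def ib_iterate(p_xy, n_clusters, beta, n_iter=200, seed=0):
    """Iterative IB: returns p(xhat|x), p(xhat), p(y|xhat) at a fixed beta."""
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)                       # p(x)
    p_y_x = p_xy / p_x[:, None]                  # p(y|x)
    q = rng.random((len(p_x), n_clusters))       # random init of p(xhat|x)
    q /= q.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        p_t = q.T @ p_x                          # p(xhat) = sum_x p(x) p(xhat|x)
        p_y_t = (q * p_x[:, None]).T @ p_y_x / p_t[:, None]   # p(y|xhat)
        d = (p_y_x[:, None, :] * (np.log(p_y_x[:, None, :] + 1e-12)
                                  - np.log(p_y_t[None, :, :] + 1e-12))).sum(-1)
        d -= d.min(axis=1, keepdims=True)        # stabilize the exponent (absorbed in Z)
        q = p_t[None, :] * np.exp(-beta * d)     # self-consistent update
        q /= q.sum(axis=1, keepdims=True)        # normalization Z(x, beta)
    return q, p_t, p_y_t

# toy usage on a small random joint distribution
rng = np.random.default_rng(1)
p_xy = rng.random((8, 5)); p_xy /= p_xy.sum()
q, p_t, p_y_t = ib_iterate(p_xy, n_clusters=3, beta=20.0)
```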
Information Curves
[Plot: $I_Y$ vs. $I_X$ information curves $I^Y_{C_1}(I^X_{C_1})$, $I^Y_{C_2}(I^X_{C_2})$, $I^Y_{C_3}(I^X_{C_3})$ for increasing numbers of clusters.]
The limit is always the convex envelope of the curves of increasing complexity.
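Sweeping beta and recording the two (normalized) informations traces out such a curve; a sketch reusing the toy ib_iterate and mutual_information helpers from above:

```python
import numpy as np

def info_coords(p_xy, q):
    """Normalized coordinates (I_X, I_Y) = (I(X;Xhat)/H(X), I(Xhat;Y)/I(X;Y))."""
    p_x = p_xy.sum(axis=1)
    h_x = -(p_x * np.log2(p_x)).sum()
    return (mutual_information(q * p_x[:, None]) / h_x,     # joint p(x, xhat)
            mutual_information(q.T @ p_xy) / mutual_information(p_xy))

curve = [info_coords(p_xy, ib_iterate(p_xy, n_clusters=3, beta=b)[0])
         for b in np.geomspace(0.5, 100, 20)]   # low beta: compress; high: predict
```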
Words and topics again...
(The same Topics × Words counts matrix as shown above.)
Simple Example
          Audio  Noise  Health  Drug  Doctor  www  Dos  ...
Doc1         12      8       0     0       0    0    0  ...
Doc4          9      7       1     0       1    0    0  ...
Doc8         15     10       0     1       0    1    0  ...
Doc2          0      1       9    11       6    2    0  ...
Doc3          0      0      10     6      20    1    0  ...
Doc6          1      0      11     6       7    0    1  ...
Doc9          0      0      12    16      12    1    1  ...
Doc5          0      1       3     0       0    9   10  ...
Doc7          0      2       0     0       2    8   12  ...
Doc10         1      1       0     0       2    9   11  ...
...         ...    ...     ...   ...     ...  ...  ...  ...

          Audio  Noise  Health  Drug  Doctor  www  Dos  ...
Cluster1     36     25       1     1       1    1    0  ...
Cluster2      1      1      42    39      45    4    2  ...
Cluster3      1      4       3     0       4   26   33  ...
...         ...    ...     ...   ...     ...  ...  ...  ...
A new compact representation: the document clusters preserve the relevant information between the documents and the words.
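That claim can be checked numerically on this very example, reusing the mutual_information sketch from above: merging the ten document rows into the three clusters changes I(Doc; Word) only slightly.

```python
import numpy as np

# the document-word counts from the tables above (columns: Audio ... Dos)
docs = np.array([[12, 8, 0, 0, 0, 0, 0],    # Doc1
                 [9, 7, 1, 0, 1, 0, 0],     # Doc4
                 [15, 10, 0, 1, 0, 1, 0],   # Doc8
                 [0, 1, 9, 11, 6, 2, 0],    # Doc2
                 [0, 0, 10, 6, 20, 1, 0],   # Doc3
                 [1, 0, 11, 6, 7, 0, 1],    # Doc6
                 [0, 0, 12, 16, 12, 1, 1],  # Doc9
                 [0, 1, 3, 0, 0, 9, 10],    # Doc5
                 [0, 2, 0, 0, 2, 8, 12],    # Doc7
                 [1, 1, 0, 0, 2, 9, 11]], float)  # Doc10
labels = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
clusters = np.stack([docs[labels == k].sum(axis=0) for k in range(3)])

# the clustered table keeps almost all of I(Doc; Word)
print(mutual_information(docs), mutual_information(clusters))
```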
Analyzing Co-Occurrence Tables
[Diagram: a Topics × Words counts matrix; the exact same counts matrix after permutation of rows and columns; and the reduced Topic-clusters × Word-clusters matrix.]
The word clusters provide a compact representation that preserves the information about the topics, quantified by Mutual Information.
$$I(X_1;X_2) \;=\; H(X_1) - H(X_1|X_2) \;=\; \sum_{x_1,x_2} P(x_1,x_2)\, \log \frac{P(x_1,x_2)}{P(x_1)\,P(x_2)}$$
The distinctions inside each word cluster are less relevant for predicting the class: they are irrelevant distinctions among words.
Symmetric IB through Deterministic Annealing
[Matrix P(TC,TW): newsgroup clusters vs. word clusters.]
Newsgroup cluster 1: alt.atheism, rec.autos, rec.motorcycles, rec.sport.*, sci.med, sci.space, soc.religion.christian, talk.politics.*
Newsgroup cluster 2: comp.*, misc.forsale, sci.crypt, sci.electronics
Word cluster 1: car, turkish, game, team, jesus, gun, hockey, …
Word cluster 2: xfile, image, encryption, window, dos, mac, …
Symmetric IB through Deterministic Annealing
[Matrix P(TC,TW): newsgroup clusters vs. word clusters.]
Newsgroup cluster 1: comp.graphics, comp.os.ms-windows.misc, comp.windows.x
Newsgroup cluster 2: comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, misc.forsale, sci.crypt, sci.electronics
Word cluster 1: windows, image, window, jpeg, graphics, …
Word cluster 2: encryption, db, ide, escrow, monitor, …
Symmetric IB through Deterministic Annealing
[Matrix P(TC,TW): newsgroup clusters vs. word clusters at a further annealing step.]
Symmetric IB through Deterministic Annealing
[Matrix P(TC,TW): newsgroup clusters vs. word clusters.]
Newsgroup cluster 1: alt.atheism, rec.sport.baseball, rec.sport.hockey, soc.religion.christian, talk.politics.mideast, talk.religion.misc
Newsgroup cluster 2: rec.autos, rec.motorcycles, sci.med, sci.space, talk.politics.guns, talk.politics.misc
Word cluster 1: armenian, turkish, jesus, hockey, israeli, armenians, …
Word cluster 2: car, q, gun, bike, fbi, health, …
Symmetric IB through Deterministic Annealing
[Matrix P(TC,TW) at a further annealing step.]
Symmetric IB through Deterministic Annealing
[Matrix P(TC,TW) at a further annealing step.]
Symmetric IB through Deterministic Annealing
[Matrix P(TC,TW): newsgroup clusters vs. word clusters.]
Word cluster: atheists, christianity, jesus, bible, sin, faith, …
Newsgroup cluster: alt.atheism, soc.religion.christian, talk.religion.misc
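A hedged sketch of the annealing loop behind these snapshots: the talk's symmetric IB clusters words and newsgroups jointly, whereas this simplified one-sided version only refines p(x̂|x) as β grows (reusing the toy p_xy from above):

```python
import numpy as np

def ib_anneal(p_xy, n_clusters, betas, n_iter=100, seed=0):
    """One-sided deterministic-annealing IB: refine q = p(xhat|x) as beta grows."""
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)
    p_y_x = p_xy / p_x[:, None]
    q = rng.random((len(p_x), n_clusters))
    q /= q.sum(axis=1, keepdims=True)
    for beta in betas:
        q += 1e-3 * rng.random(q.shape)          # perturb so clusters may split
        q /= q.sum(axis=1, keepdims=True)
        for _ in range(n_iter):                  # fixed point at this beta
            p_t = q.T @ p_x
            p_y_t = (q * p_x[:, None]).T @ p_y_x / p_t[:, None]
            d = (p_y_x[:, None, :] * (np.log(p_y_x[:, None, :] + 1e-12)
                                      - np.log(p_y_t[None, :, :] + 1e-12))).sum(-1)
            d -= d.min(axis=1, keepdims=True)
            q = p_t[None, :] * np.exp(-beta * d)
            q /= q.sum(axis=1, keepdims=True)
        yield beta, q                            # one snapshot per annealing stage

# each stage corresponds to one P(TC,TW) snapshot in the slides above
for beta, q_stage in ib_anneal(p_xy, n_clusters=4, betas=np.geomspace(1, 200, 6)):
    pass
```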
We observe Semantic Scaling
[Plot: log-log scaling of the semantic information curve; linear fit y = 1.92x - 0.866.]
Define the normalized coordinates
$$I_Y = \frac{I(\hat{X};Y)}{I(X;Y)}, \qquad I_X = \frac{I(\hat{X};X)}{H(X)}.$$
The information curves obey the scaling law
$$1 - I_Y \;=\; c\,(1 - I_X)^{1.92}.$$
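The exponent is then just the slope of a linear fit in the transformed coordinates; a sketch, assuming (I_X, I_Y) pairs already normalized as defined above (e.g. the curve list from the information-curve sketch):

```python
import numpy as np

def scaling_exponent(i_x, i_y):
    """Slope of log(1 - I_Y) vs log(1 - I_X) for normalized coordinates
    I_X = I(T;X)/H(X), I_Y = I(T;Y)/I(X;Y); ~1.92 is reported above."""
    x = np.log(1.0 - np.asarray(i_x, float))
    y = np.log(1.0 - np.asarray(i_y, float))
    return np.polyfit(x, y, 1)[0]

# e.g. on the points traced along the toy information curve above
print(scaling_exponent(*zip(*curve)))
```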
[Plot: I(T;Y)/I(X;Y) vs. I(T;X)/H(X) on the unit square, for the 20NG Noam data and the 20NG Russian data.]
Fitted scaling exponents per language:
Simplified Chinese    2.09
Traditional Chinese   1.73
Dutch                 2.3
French                2.22
Hebrew                1.63
Italian               2.35
Japanese              1.42
Portuguese            2.9
Spanish               1.89
[Plot: log(1 - I(T;Y)/I(X;Y)) vs. log(1 - I(T;X)/H(X)) for Simplified and Traditional Chinese, Dutch, French, Hebrew, Italian, Japanese, Korean, Portuguese, Spanish, and English (20NG Jose, English UTF, English Reuters, English 20NG Noam).]
[Plot: the same log-log scaling for a random selection of 200 words.]
[Plot: normalized information curves on the unit square, illustrating the relation $I(X;Y\,|\,\hat{X}) = c\,\big[H(X\,|\,\hat{X})\big]^{\alpha}$.]
Can we understand it?
With the normalized coordinates
$$I_Y = \frac{I(\hat{X};Y)}{I(X;Y)}, \qquad I_X = \frac{I(\hat{X};X)}{H(X)},$$
the Markov chain $\hat{X} \leftarrow X \leftrightarrow Y$ gives
$$1 - I_Y = \frac{I(X;Y\,|\,\hat{X})}{I(X;Y)}, \qquad 1 - I_X = \frac{H(X\,|\,\hat{X})}{H(X)},$$
so the scaling law is equivalent to $I(X;Y\,|\,\hat{X}) = c\,\big[H(X\,|\,\hat{X})\big]^{\alpha}$.
Any subset of the language has the same exponent!
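Both identities follow from the chain rule for mutual information along the Markov chain; a numerical sanity check, reusing the toy p_xy, q, and mutual_information from the sketches above:

```python
import numpy as np

# chain rule along Xhat <- X <-> Y:  I(X;Y) = I(Xhat;Y) + I(X;Y|Xhat)
i_xy = mutual_information(p_xy)                   # I(X;Y)
i_ty = mutual_information(q.T @ p_xy)             # I(Xhat;Y) from joint p(xhat,y)
p_xyt = q[:, None, :] * p_xy[:, :, None]          # joint p(x, y, xhat)
i_cond = sum(p_xyt[:, :, t].sum() * mutual_information(p_xyt[:, :, t])
             for t in range(q.shape[1]))          # I(X;Y|Xhat)
print(i_xy, i_ty + i_cond)                        # agree up to numerics
```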
But what does it tell us about Language?
"Efficiency of the words": the log-ratio of added word entropy that is transferred into meaningful information,
$$\frac{\log\big(I(X;Y\,|\,\hat{X})/I(X;Y)\big)}{\log\big(H(X\,|\,\hat{X})/H(X)\big)} \;=\; \frac{\log(1-I_Y)}{\log(1-I_X)} \;\approx\; \alpha.$$
Language appears to have constant word efficiency! (~ 2)
Possible Explanations?
• Power laws are too common to mean anything… Zipf's law and similar… "never trust linear log-log plots…"
• It's a property of my analysis, not of Language. How do I know that it's not all in the way we cluster the words?
• Words are generated at a constant level of ambiguity: words are generated at a constant rate, depending only on the concept's (occurred) ambiguity in usage, irrespective of vocabulary size or domain.
• Small world (scale-free) properties of word acquisition…
Many Thanks to…
Bill Bialek, Fernando Pereira, Noam Slonim, Dmitry Davidov, Amir Navot, Josemine Magdalen
Banter Co. (z"l)