Universal scaling of semantic information revealed from IB word clusters, or human language as optimal biological adaptation
TRANSCRIPT
Universal Scaling of Semantic Information
Revealed from IB word clusters, or
Human language as optimal biological adaptation
Naftali Tishby
School of Computer Science & Engineering &
Interdisciplinary Center for Neural Computation
The Hebrew University, Jerusalem, Israel
http://www.cs.huji.ac.il/~tishby
Workshop on Machine Learning in Natural Language Processing, CRI, Haifa University
December 2006
Outline: Language – a window into our cognitive processing
• What can we learn from word statistics? How can we quantify it? Is there a "correct level" of description?
• Information Bottleneck (IB) and the representation of relevance: finding approximate sufficient statistics
• Words, documents and meaning… Trading complexity and accuracy
• Scaling of semantic information. Possible models: small world properties
What are words?
• acquired persistent neural activity associated with perception and cognitive functions
• appear in every language at a regular, power-law sub-linear rate
[Figure: number of different words vs. number of observed words, and log number of different words vs. log number of words; linear fit y = 0.64x + 2.07.]
[Figure, "english": number of different words vs. number of documents and vs. number of words, on linear and log-log axes (first 100 docs are not displayed); log-log fits y = 0.55x + 5.92 and y = 0.56x + 2.81.]
[Figure, "hebrew": the same panels for Hebrew (first 100 docs are not displayed); log-log fits y = 0.65x + 4.57 and y = 0.64x + 2.07.]
[Figure, "korean": the same panels for Korean (first 100 docs are not displayed); log-log fits y = 0.77x + 5.13 and y = 0.70x + 2.21.]
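These sub-linear growth curves are straight lines on log-log axes (Heaps-law behavior), with slopes between 0.55 and 0.77 in the panels above. As a hedged sketch (not the talk's code or corpora), here is a minimal way to estimate such an exponent from any token stream; the synthetic Zipf-distributed stream below is only a stand-in for real text:

```python
import numpy as np

def heaps_fit(tokens, n_points=50):
    """Fit log(#different words) = a*log(#words) + b over a token stream."""
    seen, growth = set(), []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        growth.append((n, len(seen)))
    # sample points log-uniformly, skipping the noisy head
    # (cf. "first 100 docs are not displayed" in the panels above)
    idx = np.unique(np.geomspace(100, len(growth), n_points).astype(int)) - 1
    x = np.log([growth[i][0] for i in idx])
    y = np.log([growth[i][1] for i in idx])
    a, b = np.polyfit(x, y, 1)
    return a, b   # slope a is the growth exponent (0.55-0.77 in the panels above)

# toy usage: a synthetic Zipf-distributed token stream stands in for real text
rng = np.random.default_rng(0)
print(heaps_fit(rng.zipf(1.3, size=200_000)))
```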
Rank – Frequency of words
Words exhibit "scale-free" statistics – Zipf's law
[Figure: Hebrew Zipf curve, log Frequency vs. log Rank.]
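A minimal sketch of how such a rank-frequency curve is computed; the tiny token list is only a placeholder for a real corpus:

```python
import numpy as np
from collections import Counter

def zipf_curve(tokens):
    """Return (log rank, log relative frequency), ranks sorted by count."""
    counts = np.array(sorted(Counter(tokens).values(), reverse=True), float)
    freqs = counts / counts.sum()
    return np.log(np.arange(1, len(freqs) + 1)), np.log(freqs)

# toy usage; for natural-language text the fitted slope is near -1 (Zipf's law)
log_r, log_f = zipf_curve("the quick brown fox jumps over the lazy dog the fox".split())
print(np.polyfit(log_r, log_f, 1)[0])
```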
How are words/languages generated? Basic observations:
• serve for communication and representation
• adapt to variable world statistics
• collective (social) entity
• acquired continuously (individually and collectively)
Competition between communication efficiency and adaptability / learnability.
[Diagram: Accuracy vs. Complexity plane: possible models/representations, constrained by limited data and by bounded computation.]
Complexity – Accuracy Tradeoff
Can we quantify it…?
When there is a (relevant) prediction or distortion measure:
• Accuracy ↔ good predictions (low distortion/error)
• Complexity ↔ length of the minimal description (optimal codes)
A general tradeoff between distortion and compression: Information Theory.
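For reference, the tradeoff alluded to here is formalized by Shannon's rate-distortion function, the minimal coding rate achievable at expected distortion at most D:

```latex
R(D) \;=\; \min_{\,p(\hat{x}\mid x)\;:\;\mathbb{E}[d(x,\hat{x})]\,\le\, D}\; I(X;\hat{X})
```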
What can we learn from word co-occurrence...?
         Audio  Health  www  Drug  noise  Dos  Doctor  ...
Topic1      12       0    0     0      8    0       0  ...
Topic2       0       9    2    11      1    0       6  ...
Topic3       0      10    1     6      0    0      20  ...
Topic4       9       1    0     0      7    0       1  ...
Topic5       0       3    9     0      1   10       0  ...
Topic6       1      11    0     6      0    1       7  ...
Topic7       0       0    8     0      2   12       2  ...
Topic8      15       0    1     1     10    0       0  ...
Topic9       0      12    1    16      0    1      12  ...
Topic10      1       0    9     0      1   11       2  ...
...        ...     ...  ...   ...    ...  ...     ...  ...
We need to count the maximum number of non-overlapping green blobs (typical sets of $X$ given $\hat{x}$) that fit inside the blue blob (the typical set of $X$): mutual information!
With the stochastic map $p(\hat{x}|x): X \to \hat{X}$, that count is
$$\frac{2^{nH(X)}}{2^{nH(X|\hat{X})}} \;=\; 2^{\,nI(X;\hat{X})}.$$
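A hedged sketch of that counting argument in practice: mutual information computed directly from a joint count table; the small matrix below excerpts the first rows of the table above:

```python
import numpy as np

def mutual_information(counts):
    """I(X;Y) in bits, from a joint count (or probability) matrix."""
    p = counts / counts.sum()
    px = p.sum(axis=1, keepdims=True)          # marginal p(x)
    py = p.sum(axis=0, keepdims=True)          # marginal p(y)
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / (px @ py)[mask])).sum())

# a few rows excerpted from the topic-word table above
counts = np.array([[12, 0, 0, 0, 8, 0, 0],
                   [0, 9, 2, 11, 1, 0, 6],
                   [0, 10, 1, 6, 0, 0, 20]], float)
print(mutual_information(counts))
```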
Representation and Mutual Information
IB: an Information Theoretic Principle for extracting Relevant structure
The minimal representation of X that keeps as much information about another variable, Y, as possible. Generalizes the classical notion of "sufficient statistics".
For the Markov chain $\hat{X} \leftarrow X \leftrightarrow Y$, trade compression $I(X;\hat{X})$ against prediction $I(\hat{X};Y)$:
$$\min_{p(\hat{x}|x)} \; I(X;\hat{X}) \;-\; \beta\, I(\hat{X};Y)$$
The Self-Consistent Equations
Marginal: $p(\hat{x}) = \sum_x p(\hat{x}|x)\, p(x)$
Markov condition: $p(y|\hat{x}) = \sum_x p(y|x)\, p(x|\hat{x})$
Bayes' rule: $p(x|\hat{x}) = \dfrac{p(\hat{x}|x)\, p(x)}{p(\hat{x})}$
Setting $\dfrac{\partial L[p(\hat{x}|x)]}{\partial p(\hat{x}|x)} = 0$ yields
$$p(\hat{x}|x) = \frac{p(\hat{x})}{Z(x,\beta)} \exp\!\Big(-\beta\, D_{KL}\big[p(y|x)\,\|\,p(y|\hat{x})\big]\Big)$$
The emergent effective distortion measure:
$$D_{KL}\big[p(y|x)\,\big\|\,p(y|\hat{x})\big] \;=\; \sum_y p(y|x)\, \log \frac{p(y|x)}{p(y|\hat{x})}$$
• Regular if $p(y|x)$ is absolutely continuous w.r.t. $p(y|\hat{x})$
• Small if $\hat{x}$ predicts $y$ as well as $x$ does
[Diagram: the Markov chain $Y \leftrightarrow X \rightarrow \hat{X}$, with arrows labeled $p(y|x)$, $p(\hat{x}|x)$, and the induced $p(y|\hat{x})$.]
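As an implementation sketch (an assumption, not the talk's code), this distortion can be evaluated for every pair $(x, \hat{x})$ at once when the conditionals are stored as rows:

```python
import numpy as np

def kl_rows(p_y_x, p_y_t, eps=1e-12):
    """D[i, j] = D_KL( p(y|x_i) || p(y|xhat_j) ) in nats.
    p_y_x: (|X|, |Y|) rows p(y|x); p_y_t: (|Xhat|, |Y|) rows p(y|xhat)."""
    P = p_y_x[:, None, :]                     # shape (|X|, 1, |Y|)
    Q = p_y_t[None, :, :]                     # shape (1, |Xhat|, |Y|)
    return (P * (np.log(P + eps) - np.log(Q + eps))).sum(axis=-1)
```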
The Information Bottleneck Algorithm
The variational ("free energy") formulation
$$\min_{p(\hat{x}|x)}\; \min_{p(\hat{x})}\; \min_{p(y|\hat{x})}\; \Big[\, I(X;\hat{X}) + \beta\, \big\langle D_{KL}\big[p(y|x)\,\|\,p(y|\hat{x})\big] \big\rangle \,\Big]$$
is solved by iterating the self-consistent equations (a generalized BA-algorithm):
$$p_t(\hat{x}|x) = \frac{p_t(\hat{x})}{Z_t(x,\beta)} \exp\!\Big(-\beta\, D_{KL}\big[p(y|x)\,\|\,p_t(y|\hat{x})\big]\Big)$$
$$p_{t+1}(\hat{x}) = \sum_x p(x)\, p_t(\hat{x}|x)$$
$$p_{t+1}(y|\hat{x}) = \sum_x p(y|x)\, p_t(x|\hat{x})$$
with the emergent effective distortion measure $D_{KL}\big[p(y|x)\,\|\,p(y|\hat{x})\big]$.
Can be calculated analytically for Markov chains, Gaussian processes, etc., and numerically in general.
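A minimal numerical sketch of this iteration, assuming nothing beyond NumPy; the joint distribution, cluster count, and beta below are illustrative placeholders:

```python
import numpy as np

def ib_iterate(p_xy, n_clusters, beta, n_iter=200, seed=0):
    """Iterative IB: returns p(xhat|x), p(xhat), p(y|xhat) at a fixed beta."""
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)                       # p(x)
    p_y_x = p_xy / p_x[:, None]                  # p(y|x)
    q = rng.random((len(p_x), n_clusters))       # random init of p(xhat|x)
    q /= q.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        p_t = q.T @ p_x                          # p(xhat) = sum_x p(x) p(xhat|x)
        p_y_t = (q * p_x[:, None]).T @ p_y_x / p_t[:, None]   # p(y|xhat)
        d = (p_y_x[:, None, :] * (np.log(p_y_x[:, None, :] + 1e-12)
                                  - np.log(p_y_t[None, :, :] + 1e-12))).sum(-1)
        d -= d.min(axis=1, keepdims=True)        # stabilize the exponent (absorbed in Z)
        q = p_t[None, :] * np.exp(-beta * d)     # self-consistent update
        q /= q.sum(axis=1, keepdims=True)        # normalization Z(x, beta)
    return q, p_t, p_y_t

# toy usage on a small random joint distribution
rng = np.random.default_rng(1)
p_xy = rng.random((8, 5)); p_xy /= p_xy.sum()
q, p_t, p_y_t = ib_iterate(p_xy, n_clusters=3, beta=20.0)
```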
Information Curves
[Plot: $I_Y$ vs. $I_X$ information curves $I^Y_{C_1}(I^X_{C_1})$, $I^Y_{C_2}(I^X_{C_2})$, $I^Y_{C_3}(I^X_{C_3})$ for increasing numbers of clusters.]
The limit is always the convex envelope of the curves of increasing complexity.
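Sweeping beta and recording the two (normalized) informations traces out such a curve; a sketch reusing the toy ib_iterate and mutual_information helpers from above:

```python
import numpy as np

def info_coords(p_xy, q):
    """Normalized coordinates (I_X, I_Y) = (I(X;Xhat)/H(X), I(Xhat;Y)/I(X;Y))."""
    p_x = p_xy.sum(axis=1)
    h_x = -(p_x * np.log2(p_x)).sum()
    return (mutual_information(q * p_x[:, None]) / h_x,     # joint p(x, xhat)
            mutual_information(q.T @ p_xy) / mutual_information(p_xy))

curve = [info_coords(p_xy, ib_iterate(p_xy, n_clusters=3, beta=b)[0])
         for b in np.geomspace(0.5, 100, 20)]   # low beta: compress; high: predict
```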
Words and topics again...
(The same Topics × Words counts matrix as shown above.)
Simple Example
          Audio  Noise  Health  Drug  Doctor  www  Dos  ...
Doc1         12      8       0     0       0    0    0  ...
Doc4          9      7       1     0       1    0    0  ...
Doc8         15     10       0     1       0    1    0  ...
Doc2          0      1       9    11       6    2    0  ...
Doc3          0      0      10     6      20    1    0  ...
Doc6          1      0      11     6       7    0    1  ...
Doc9          0      0      12    16      12    1    1  ...
Doc5          0      1       3     0       0    9   10  ...
Doc7          0      2       0     0       2    8   12  ...
Doc10         1      1       0     0       2    9   11  ...
...         ...    ...     ...   ...     ...  ...  ...  ...

          Audio  Noise  Health  Drug  Doctor  www  Dos  ...
Cluster1     36     25       1     1       1    1    0  ...
Cluster2      1      1      42    39      45    4    2  ...
Cluster3      1      4       3     0       4   26   33  ...
...         ...    ...     ...   ...     ...  ...  ...  ...
A new compact representation: the document clusters preserve the relevant information between the documents and the words.
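That claim can be checked numerically on this very example, reusing the mutual_information sketch from above: merging the ten document rows into the three clusters changes I(Doc; Word) only slightly.

```python
import numpy as np

# the document-word counts from the tables above (columns: Audio ... Dos)
docs = np.array([[12, 8, 0, 0, 0, 0, 0],    # Doc1
                 [9, 7, 1, 0, 1, 0, 0],     # Doc4
                 [15, 10, 0, 1, 0, 1, 0],   # Doc8
                 [0, 1, 9, 11, 6, 2, 0],    # Doc2
                 [0, 0, 10, 6, 20, 1, 0],   # Doc3
                 [1, 0, 11, 6, 7, 0, 1],    # Doc6
                 [0, 0, 12, 16, 12, 1, 1],  # Doc9
                 [0, 1, 3, 0, 0, 9, 10],    # Doc5
                 [0, 2, 0, 0, 2, 8, 12],    # Doc7
                 [1, 1, 0, 0, 2, 9, 11]], float)  # Doc10
labels = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
clusters = np.stack([docs[labels == k].sum(axis=0) for k in range(3)])

# the clustered table keeps almost all of I(Doc; Word)
print(mutual_information(docs), mutual_information(clusters))
```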
Analyzing Co-Occurrence Tables
[Diagram: a Topics × Words counts matrix; the exact same counts matrix after permutation of rows and columns; and the reduced Topic-clusters × Word-clusters matrix.]
The word clusters provide a compact representation that preserves the information about the topics, quantified by Mutual Information.
$$I(X_1;X_2) \;=\; H(X_1) - H(X_1|X_2) \;=\; \sum_{x_1,x_2} P(x_1,x_2)\, \log \frac{P(x_1,x_2)}{P(x_1)\,P(x_2)}$$
The distinctions inside each word cluster are less relevant for predicting the class: they are irrelevant distinctions among words.
Symmetric IB through Deterministic Annealing
[Matrix P(TC,TW): newsgroup clusters vs. word clusters.]
Newsgroup cluster 1: alt.atheism, rec.autos, rec.motorcycles, rec.sport.*, sci.med, sci.space, soc.religion.christian, talk.politics.*
Newsgroup cluster 2: comp.*, misc.forsale, sci.crypt, sci.electronics
Word cluster 1: car, turkish, game, team, jesus, gun, hockey, …
Word cluster 2: xfile, image, encryption, window, dos, mac, …
Symmetric IB through Deterministic Annealing
[Matrix P(TC,TW): newsgroup clusters vs. word clusters.]
Newsgroup cluster 1: comp.graphics, comp.os.ms-windows.misc, comp.windows.x
Newsgroup cluster 2: comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, misc.forsale, sci.crypt, sci.electronics
Word cluster 1: windows, image, window, jpeg, graphics, …
Word cluster 2: encryption, db, ide, escrow, monitor, …
Symmetric IB through Deterministic Annealing
[Matrix P(TC,TW): newsgroup clusters vs. word clusters at a further annealing step.]
Symmetric IB through Deterministic Annealing
[Matrix P(TC,TW): newsgroup clusters vs. word clusters.]
Newsgroup cluster 1: alt.atheism, rec.sport.baseball, rec.sport.hockey, soc.religion.christian, talk.politics.mideast, talk.religion.misc
Newsgroup cluster 2: rec.autos, rec.motorcycles, sci.med, sci.space, talk.politics.guns, talk.politics.misc
Word cluster 1: armenian, turkish, jesus, hockey, israeli, armenians, …
Word cluster 2: car, q, gun, bike, fbi, health, …
Symmetric IB through Deterministic Annealing
[Matrix P(TC,TW) at a further annealing step.]
Symmetric IB through Deterministic Annealing
[Matrix P(TC,TW) at a further annealing step.]
Symmetric IB through Deterministic Annealing
[Matrix P(TC,TW): newsgroup clusters vs. word clusters.]
Word cluster: atheists, christianity, jesus, bible, sin, faith, …
Newsgroup cluster: alt.atheism, soc.religion.christian, talk.religion.misc
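A hedged sketch of the annealing loop behind these snapshots: the talk's symmetric IB clusters words and newsgroups jointly, whereas this simplified one-sided version only refines p(x̂|x) as β grows (reusing the toy p_xy from above):

```python
import numpy as np

def ib_anneal(p_xy, n_clusters, betas, n_iter=100, seed=0):
    """One-sided deterministic-annealing IB: refine q = p(xhat|x) as beta grows."""
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)
    p_y_x = p_xy / p_x[:, None]
    q = rng.random((len(p_x), n_clusters))
    q /= q.sum(axis=1, keepdims=True)
    for beta in betas:
        q += 1e-3 * rng.random(q.shape)          # perturb so clusters may split
        q /= q.sum(axis=1, keepdims=True)
        for _ in range(n_iter):                  # fixed point at this beta
            p_t = q.T @ p_x
            p_y_t = (q * p_x[:, None]).T @ p_y_x / p_t[:, None]
            d = (p_y_x[:, None, :] * (np.log(p_y_x[:, None, :] + 1e-12)
                                      - np.log(p_y_t[None, :, :] + 1e-12))).sum(-1)
            d -= d.min(axis=1, keepdims=True)
            q = p_t[None, :] * np.exp(-beta * d)
            q /= q.sum(axis=1, keepdims=True)
        yield beta, q                            # one snapshot per annealing stage

# each stage corresponds to one P(TC,TW) snapshot in the slides above
for beta, q_stage in ib_anneal(p_xy, n_clusters=4, betas=np.geomspace(1, 200, 6)):
    pass
```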
We observe Semantic Scaling
[Plot: log-log scaling of the semantic information curve; linear fit y = 1.92x - 0.866.]
Define the normalized coordinates
$$I_Y = \frac{I(\hat{X};Y)}{I(X;Y)}, \qquad I_X = \frac{I(\hat{X};X)}{H(X)}.$$
The information curves obey the scaling law
$$1 - I_Y \;=\; c\,(1 - I_X)^{1.92}.$$
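The exponent is then just the slope of a linear fit in the transformed coordinates; a sketch, assuming (I_X, I_Y) pairs already normalized as defined above (e.g. the curve list from the information-curve sketch):

```python
import numpy as np

def scaling_exponent(i_x, i_y):
    """Slope of log(1 - I_Y) vs log(1 - I_X) for normalized coordinates
    I_X = I(T;X)/H(X), I_Y = I(T;Y)/I(X;Y); ~1.92 is reported above."""
    x = np.log(1.0 - np.asarray(i_x, float))
    y = np.log(1.0 - np.asarray(i_y, float))
    return np.polyfit(x, y, 1)[0]

# e.g. on the points traced along the toy information curve above
print(scaling_exponent(*zip(*curve)))
```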
[Plot: I(T;Y)/I(X;Y) vs. I(T;X)/H(X) on the unit square, for the 20NG Noam data and the 20NG Russian data.]
Fitted scaling exponents per language:
Simplified Chinese    2.09
Traditional Chinese   1.73
Dutch                 2.3
French                2.22
Hebrew                1.63
Italian               2.35
Japanese              1.42
Portuguese            2.9
Spanish               1.89
[Plot: log(1 - I(T;Y)/I(X;Y)) vs. log(1 - I(T;X)/H(X)) for Simplified and Traditional Chinese, Dutch, French, Hebrew, Italian, Japanese, Korean, Portuguese, Spanish, and English (20NG Jose, English UTF, English Reuters, English 20NG Noam).]
[Plot: the same log-log scaling for a random selection of 200 words.]
[Plot: normalized information curves on the unit square, illustrating the relation $I(X;Y\,|\,\hat{X}) = c\,\big[H(X\,|\,\hat{X})\big]^{\alpha}$.]
Can we understand it?
With the normalized coordinates
$$I_Y = \frac{I(\hat{X};Y)}{I(X;Y)}, \qquad I_X = \frac{I(\hat{X};X)}{H(X)},$$
the Markov chain $\hat{X} \leftarrow X \leftrightarrow Y$ gives
$$1 - I_Y = \frac{I(X;Y\,|\,\hat{X})}{I(X;Y)}, \qquad 1 - I_X = \frac{H(X\,|\,\hat{X})}{H(X)},$$
so the scaling law is equivalent to $I(X;Y\,|\,\hat{X}) = c\,\big[H(X\,|\,\hat{X})\big]^{\alpha}$.
Any subset of the language has the same exponent!
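Both identities follow from the chain rule for mutual information along the Markov chain; a numerical sanity check, reusing the toy p_xy, q, and mutual_information from the sketches above:

```python
import numpy as np

# chain rule along Xhat <- X <-> Y:  I(X;Y) = I(Xhat;Y) + I(X;Y|Xhat)
i_xy = mutual_information(p_xy)                   # I(X;Y)
i_ty = mutual_information(q.T @ p_xy)             # I(Xhat;Y) from joint p(xhat,y)
p_xyt = q[:, None, :] * p_xy[:, :, None]          # joint p(x, y, xhat)
i_cond = sum(p_xyt[:, :, t].sum() * mutual_information(p_xyt[:, :, t])
             for t in range(q.shape[1]))          # I(X;Y|Xhat)
print(i_xy, i_ty + i_cond)                        # agree up to numerics
```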
But what does it tell us about Language?
"Efficiency of the words": the log-ratio of added word entropy that is transferred into meaningful information,
$$\frac{\log\big(I(X;Y\,|\,\hat{X})/I(X;Y)\big)}{\log\big(H(X\,|\,\hat{X})/H(X)\big)} \;=\; \frac{\log(1-I_Y)}{\log(1-I_X)} \;\approx\; \alpha.$$
Language appears to have constant word efficiency! (~ 2)
Possible Explanations?
• Power laws are too common to mean anything… Zipf's law and similar… "never trust linear log-log plots…"
• It's a property of my analysis, not of Language. How do I know that it's not all in the way we cluster the words?
• Words are generated at a constant level of ambiguity: words are generated at a constant rate, depending only on the concept's (occurred) ambiguity in usage, irrespective of vocabulary size or domain.
• Small world (scale-free) properties of word acquisition…
Many Thanks to…
Bill Bialek, Fernando Pereira, Noam Slonim, Dmitry Davidov, Amir Navot, Josemine Magdalen
Banter Co. (z"l)