Word Context Entropy
MapReduce Transcript
-
Word Context Entropy
Kenneth Heafield
Google Inc
January 16, 2008
Example code from Hadoop 0.13.1 used under the Apache License Version 2.0
and modified for presentation. Except as otherwise noted, the content of this
presentation is licensed under the Creative Commons Attribution 2.5 License.
Kenneth Heafield (Google Inc) Word Context Entropy January 16, 2008 1 / 15
-
1 Problem: Context, Entropy
2 Implementation: Streaming Entropy, Reducer Sorting
-
Problem
Word Weighting
Idea
Measure how specific a word is
Applications
Query refinement
Automatic tagging
Example
Specific: airplane 5.0, whistle 4.2, purple 5.3, pangolin 1.6
Generic:  a 9.8, is 9.6, any 8.7, from 9.6
-
Problem Context
Neighbors
Idea
Non-specific words appear in random contexts.
Example
A bug in the code is worth two in the documentation.
A complex system that works is invariably found to have evolved from a simple system that works.
A computer scientist is someone who fixes things that aren't broken.
I'm still waiting for the advent of the computer science groupie.
If I'd known computer science was going to be like this, I'd never have given up being a rock 'n' roll star.
A: bug, complex, from, simple, computer, being, rock
Computer: A, scientist, the, science, known, science
Quotes copied from the fortune computers file.
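The neighbor lists above can be reproduced with a small sketch. The helper below is hypothetical (it is not the talk's actual Mapper code): it collects the words adjacent to each occurrence of a target word, case-folded.

```python
import re

def neighbors(text, target):
    """Collect the words adjacent to each occurrence of target.
    A stand-in for the talk's Mapper, which would instead emit
    (target, neighbor) key-value pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    out = []
    for i, w in enumerate(words):
        if w == target:
            if i > 0:
                out.append(words[i - 1])   # left neighbor
            if i + 1 < len(words):
                out.append(words[i + 1])   # right neighbor
    return out

print(neighbors("A bug in the code is worth two in the documentation.", "a"))
# ['bug']
```

Run over all five quotes, this yields the neighbor lists shown for "A" and "Computer" above.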
-
Problem Context
Context Distribution
[Bar chart: context distribution of "Cambridge" over Boston, City, England, MA, University]
Ambiguous
Closer to equal
[Bar chart: context distribution of "Attleboro" over the same contexts]
Just a city
Spiked
-
Problem Entropy
Entropy
Definition
Measures how uncertain a random variable N is:

Entropy(N) = -sum_n p(N = n) log2 p(N = n)
Properties
Minimized at 0 when only one outcome is possible
Maximized at log2 k when k outcomes are equally probable
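As a quick sketch (not from the talk), the definition and both properties can be checked directly:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: -sum of p * log2(p), skipping zero terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Minimized at 0 when only one outcome is possible:
print(entropy([1.0]) == 0.0)        # True
# Maximized at log2(k) when k outcomes are equally probable (k = 4 here):
print(entropy([0.25] * 4) == 2.0)   # True
```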
-
Problem Entropy
Context Distribution Entropy
Cambridge:
n           p    -p log2 p
Boston      0.1  0.332
City        0.2  0.464
England     0.3  0.521
MA          0.2  0.464
University  0.2  0.464
Entropy          2.246

Attleboro:
n           p    -p log2 p
Boston      0.2  0.464
City        0.3  0.521
England     0.1  0.332
MA          0.3  0.521
University  0.1  0.332
Entropy          2.171
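The two table entropies can be reproduced numerically (a sketch, not part of the original slides):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: -sum of p * log2(p)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Context probabilities over Boston, City, England, MA, University:
cambridge = [0.1, 0.2, 0.3, 0.2, 0.2]
attleboro = [0.2, 0.3, 0.1, 0.3, 0.1]

print(round(entropy(cambridge), 3))  # 2.246
print(round(entropy(attleboro), 3))  # 2.171
```

The ambiguous word "Cambridge" has the flatter distribution and therefore the higher entropy.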
-
Problem Summary
Summary
Goal
Measure how specific a word is
Approach
1 Count the surrounding words
2 Normalize to make a probability distribution
3 Evaluate entropy
-
Implementation
All At Once
Implementation
Mapper outputs key word and value neighbor.
Reducer:
1 Counts each neighbor using a hash table.
2 Normalizes counts.
3 Computes entropy and outputs key word, value entropy.
Example Reduce
Values:     City, Boston, City, MA, England, City, England
Hash Table: City 3, Boston 1, MA 1, England 2
Normalize:  City 3/7, Boston 1/7, MA 1/7, England 2/7
Entropy:    -(3/7) log2 (3/7) - 2 (1/7) log2 (1/7) - (2/7) log2 (2/7) = 1.842
Issues
- Too many neighbors of "the" to fit in memory.
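A minimal sketch of this all-at-once Reducer (illustrative Python, not the talk's actual Hadoop code):

```python
import math
from collections import Counter

def reduce_all_at_once(word, neighbors):
    """Count neighbors in a hash table, normalize, and compute entropy."""
    counts = Counter(neighbors)                    # hash table of neighbor counts
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]   # normalize to a distribution
    ent = -sum(p * math.log2(p) for p in probs)    # evaluate entropy
    return word, ent

word, ent = reduce_all_at_once(
    "Cambridge",
    ["City", "Boston", "City", "MA", "England", "City", "England"])
print(word, round(ent, 3))  # Cambridge 1.842
```

The hash table is the problem: for a word like "the", the list of neighbors does not fit in one machine's memory.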
-
Implementation
Two Phases
Implementation
1 Count
Mapper outputs key (word, neighbor) and empty value.
Reducer counts values, then outputs key word and value count.
2 Entropy
Mapper is Identity. All counts for a word go to one Reducer.
Reducer buffers counts, normalizes, and computes entropy.
Issues
+ Entropy Reducer needs only counts in memory.
- There can still be a lot of counts.
-
Implementation Streaming Entropy
Streaming Entropy
Observation
Normalization and entropy can be computed simultaneously.
Entropy(N) = -sum_n p(N = n) log2 p(N = n)           (1)
           = -sum_n (c(n)/t) (log2 c(n) - log2 t)    (2)
           = log2 t - sum_n (c(n)/t) log2 c(n)       (3)
           = log2 t - (1/t) sum_n c(n) log2 c(n)     (4)
Moral
Provided counts c(n), Reducer need only remember total t and a sum.
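The identity is easy to check in code (a sketch; c(n) are the raw counts and t their total):

```python
import math

def streaming_entropy(counts):
    """One pass over counts keeping only the total t and s = sum c log2 c;
    entropy = log2 t - s / t, per the derivation above."""
    t, s = 0, 0.0
    for c in counts:
        t += c
        s += c * math.log2(c)
    return math.log2(t) - s / t

def direct_entropy(counts):
    """Direct definition: -sum p log2 p with p = c / t."""
    t = sum(counts)
    return -sum((c / t) * math.log2(c / t) for c in counts)

print(round(streaming_entropy([3, 1, 1, 2]), 3))  # 1.842, same as direct_entropy
```

Unlike the direct form, the streaming form never needs the counts all at once.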
-
Implementation Streaming Entropy
Two Phases with Streaming
Implementation
1 Count
Mapper outputs key (word, neighbor) and empty value.
Reducer counts values, then outputs key word and value count.
2 Entropy
Mapper is Identity. All counts for a word go to one Reducer.
Reducer computes streaming entropy.
Issues
+ Constant memory Reducer.
- Not enough disk to store counts thrice on HDFS.
-
Implementation Reducer Sorting
(word, neighbor) pairs before sorting:
(A, Plane), (Foo, Bar), (A, Bird), (A, Plane), (A, The), (Alice, Bob), (The, Problem), (A, Plane), (A, The)
Sort
Word   Neighbor   Reducer   Output
A      Bird       Reduce    (A, 1)
A      Plane      Reduce    (A, 3)
A      Plane
A      Plane
A      The        Reduce    (A, 2)
A      The
Alice  Bob        Reduce    (Alice, 1)
Foo    Bar        Reduce    (Foo, 1)
The    Problem    Reduce    (The, 1)
-
Implementation Reducer Sorting
(word, neighbor) pairs before sorting:
(A, Plane), (Foo, Bar), (A, Bird), (A, Plane), (A, The), (Alice, Bob), (The, Problem), (A, Plane), (A, The)
Sort
Word     Neighbor   Call     State / Output
A        Bird       Reduce   buffer (A; 1)
A        Plane      Reduce   buffer (A; 1, 3)
A        Plane
A        Plane
A        The        Reduce   buffer (A; 1, 3, 2)
A        The
Alice    Bob        Reduce   output (A; 1, 3, 2), buffer (Alice; 1)
Foo      Bar        Reduce   output (Alice; 1), buffer (Foo; 1)
The      Problem    Reduce   output (Foo; 1), buffer (The; 1)
close()             Close    output (The; 1)
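Grouping the sorted pairs into one Reduce call per (word, neighbor) group can be sketched with itertools.groupby (illustrative Python, not the Hadoop API):

```python
from itertools import groupby

pairs = [("A", "Plane"), ("Foo", "Bar"), ("A", "Bird"), ("A", "Plane"),
         ("A", "The"), ("Alice", "Bob"), ("The", "Problem"),
         ("A", "Plane"), ("A", "The")]

# The framework sorts by key; reduce() then sees one group per
# (word, neighbor) pair and outputs the word with the group's size.
output = [(word, len(list(group)))
          for (word, _neighbor), group in groupby(sorted(pairs))]
print(output)
# [('A', 1), ('A', 3), ('A', 2), ('Alice', 1), ('Foo', 1), ('The', 1)]
```

Because the sort puts all of a word's groups next to each other, the counts for each word arrive at the Reducer consecutively.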
-
Implementation Reducer Sorting
Recall Streaming Entropy
Observation
Normalization and entropy can be computed simultaneously.
Entropy(N) = -sum_n p(N = n) log2 p(N = n)           (5)
           = -sum_n (c(n)/t) (log2 c(n) - log2 t)    (6)
           = log2 t - sum_n (c(n)/t) log2 c(n)       (7)
           = log2 t - (1/t) sum_n c(n) log2 c(n)     (8)
Moral
Computing t and a sum can be done in parallel.
-
Implementation Reducer Sorting
Word     Neighbor   Call     State / Output
A        Bird       Reduce   state (A, t=1, s=0)
A        Plane      Reduce   state (A, t=4, s=4.7)
A        Plane
A        Plane
A        The        Reduce   state (A, t=6, s=6.7)
A        The
Alice    Bob        Reduce   output (A, (6, 6.7)), state (Alice, t=1, s=0)
Foo      Bar        Reduce   output (Alice, (1, 0)), state (Foo, t=1, s=0)
The      Problem    Reduce   output (Foo, (1, 0)), state (The, t=1, s=0)
close()             Close    output (The, (1, 0))
Here t is the running total of counts and s is the running sum of c(n) log2 c(n).
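The final design can be sketched in Python (class name and structure are illustrative, not the talk's actual code): the Reducer receives one count per (word, neighbor) group in sorted order, keeps only the running total t and sum s, and flushes an entropy on each key change and in close().

```python
import math

class EntropyReducer:
    """Keeps only (t, s) per word; entropy = log2 t - s / t."""

    def __init__(self):
        self.word, self.t, self.s = None, 0, 0.0
        self.results = []

    def _flush(self):
        if self.word is not None:
            self.results.append((self.word, math.log2(self.t) - self.s / self.t))

    def reduce(self, word, count):
        if word != self.word:       # key change: flush the previous word
            self._flush()
            self.word, self.t, self.s = word, 0, 0.0
        self.t += count
        self.s += count * math.log2(count)

    def close(self):                # flush the final word, as in Hadoop's close()
        self._flush()
        return self.results

r = EntropyReducer()
for word, count in [("A", 1), ("A", 3), ("A", 2),
                    ("Alice", 1), ("Foo", 1), ("The", 1)]:
    r.reduce(word, count)
for word, ent in r.close():
    print(word, round(ent, 3))
# A 1.459
# Alice 0.0
# Foo 0.0
# The 0.0
```

This matches the table: after the A groups, t=6 and s = 3 log2 3 + 2 log2 2 ≈ 6.7, giving entropy log2 6 - 6.7/6 ≈ 1.459 in constant memory.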