Word Context Entropy


  • Word Context Entropy

    Kenneth Heafield

    Google Inc

    January 16, 2008

    Example code from Hadoop 0.13.1 used under the Apache License Version 2.0

    and modified for presentation. Except as otherwise noted, the content of this

    presentation is licensed under the Creative Commons Attribution 2.5 License.


  • 1. Problem: Context, Entropy

    2. Implementation: Streaming Entropy, Reducer Sorting


  • Problem

    Word Weighting

    Idea

    Measure how specific a word is

    Applications

    Query refinement

    Automatic tagging

    Example

    Specific: airplane 5.0, whistle 4.2, purple 5.3, pangolin 1.6

    Generic: a 9.8, is 9.6, any 8.7, from 9.6


  • Problem Context

    Neighbors

    Idea

    Non-specific words appear in random contexts.

    Example

    A bug in the code is worth two in the documentation.

    A complex system that works is invariably found to have evolved from a simple system that works.

    A computer scientist is someone who fixes things that aren't broken.

    I'm still waiting for the advent of the computer science groupie.

    If I'd known computer science was going to be like this, I'd never have given up being a rock 'n' roll star.

    A: bug, complex, from, simple, computer, being, rock

    Computer: A, scientist, the, science, known, science

    Quotes copied from the fortune computers file.
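
To make "context" concrete: a minimal Python sketch (not from the slides; taking the immediate left and right words as the context, lower-casing, and stripping punctuation are my assumptions) that recovers the neighbor lists above from the quotes:

```python
quotes = [
    "A bug in the code is worth two in the documentation.",
    "A complex system that works is invariably found to have evolved from a simple system that works.",
    "A computer scientist is someone who fixes things that aren't broken.",
    "I'm still waiting for the advent of the computer science groupie.",
    "If I'd known computer science was going to be like this, I'd never have given up being a rock 'n' roll star.",
]

def neighbors(word, sentences):
    # Collect the words immediately left and right of each occurrence of `word`.
    found = []
    for sentence in sentences:
        words = [w.strip(".,'") for w in sentence.lower().split()]
        for i, w in enumerate(words):
            if w == word:
                found.extend(words[j] for j in (i - 1, i + 1) if 0 <= j < len(words))
    return found

print(neighbors("a", quotes))         # ['bug', 'complex', 'from', 'simple', 'computer', 'being', 'rock']
print(neighbors("computer", quotes))  # ['a', 'scientist', 'the', 'science', 'known', 'science']
```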

  • Problem Context

    Context Distribution

    Cambridge: [bar chart of its context distribution over Boston, City, England, MA, University]

    Ambiguous; closer to equal

    Attleboro: [bar chart of its context distribution over Boston, City, England, MA, University]

    Just a city; spiked


  • Problem Entropy

    Entropy

    Definition

    Measures how uncertain a random variable N is:

    Entropy(N) = -\sum_n p(N = n) \log_2 p(N = n)

    Properties

    Minimized at 0 when only one outcome is possible

    Maximized at log2 k when k outcomes are equally probable

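An illustrative Python sketch of this definition (not from the slides), checking both properties:

```python
import math

def entropy(probs):
    # Shannon entropy in bits: -sum p * log2(p), skipping zero-probability outcomes.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0]))                     # 0.0: only one outcome is possible
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 = log2(4): four equally probable outcomes
```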

  • Problem Entropy

    Context Distribution Entropy

    Cambridge (shown as a bar chart on the slide)

        n           p     -p log2 p
        Boston      0.1   0.332
        City        0.2   0.464
        England     0.3   0.521
        MA          0.2   0.464
        University  0.2   0.464

    Entropy 2.246

    Attleboro (shown as a bar chart on the slide)

        n           p     -p log2 p
        Boston      0.2   0.464
        City        0.3   0.521
        England     0.1   0.332
        MA          0.3   0.521
        University  0.1   0.332

    Entropy 2.171

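A quick numeric check of the two tables above (an illustrative Python snippet, not part of the slides):

```python
import math

# Context probabilities in the order Boston, City, England, MA, University.
cambridge = [0.1, 0.2, 0.3, 0.2, 0.2]
attleboro = [0.2, 0.3, 0.1, 0.3, 0.1]

for name, dist in [("Cambridge", cambridge), ("Attleboro", attleboro)]:
    h = -sum(p * math.log2(p) for p in dist)
    print(name, round(h, 3))   # Cambridge 2.246, Attleboro 2.171
```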

  • Problem Summary

    Summary

    Goal

    Measure how specific a word is

    Approach

    1 Count the surrounding words

    2 Normalize to make a probability distribution

    3 Evaluate entropy

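Outside of MapReduce, the three steps fit in a few lines of Python. This is only a sketch of the approach; the tokenization, the one-word window on each side, and the lower-casing are assumptions rather than details taken from the slides:

```python
import math
from collections import Counter, defaultdict

def context_entropy(sentences):
    contexts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, w in enumerate(words):                 # 1. count the surrounding words
            for j in (i - 1, i + 1):
                if 0 <= j < len(words):
                    contexts[w][words[j]] += 1
    entropies = {}
    for w, counts in contexts.items():
        total = sum(counts.values())                  # 2. normalize to a probability distribution
        entropies[w] = -sum((c / total) * math.log2(c / total)  # 3. evaluate entropy
                            for c in counts.values())
    return entropies

sample = [
    "a bug in the code is worth two in the documentation",
    "a complex system that works is invariably found to have evolved from a simple system that works",
]
print(sorted(context_entropy(sample).items(), key=lambda kv: -kv[1])[:5])
```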

  • Implementation

    All At Once

    Implementation

    Mapper outputs key word and value neighbor.

    Reducer:
    1. Counts each neighbor using a hash table.
    2. Normalizes counts.
    3. Computes entropy and outputs key word, value entropy.

    Example Reduce

    Values: City, Boston, City, MA, England, City, England
    Hash table: City 3, Boston 1, MA 1, England 2
    Normalize: City 3/7, Boston 1/7, MA 1/7, England 2/7
    Entropy terms (-p log2 p): City .524, Boston .401, MA .401, England .516

    Issues

    - Too many neighbors of "the" to fit in memory.

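A minimal Python sketch of this all-at-once shape (illustrative only; the slides refer to Hadoop Java code, and these function names are mine). The mapper emits (word, neighbor) pairs, and the reducer must hold every neighbor of a word in a hash table:

```python
import math
from collections import Counter, defaultdict

def map_sentence(sentence):
    # Emit (word, neighbor) for each adjacent pair, in both directions.
    words = sentence.lower().split()
    for i in range(len(words) - 1):
        yield words[i], words[i + 1]
        yield words[i + 1], words[i]

def reduce_word(word, neighbors):
    # Count with a hash table, normalize, compute entropy; all values must fit in memory.
    counts = Counter(neighbors)
    total = sum(counts.values())
    return word, -sum((c / total) * math.log2(c / total) for c in counts.values())

# Tiny in-process stand-in for the framework's shuffle: group map output by key.
grouped = defaultdict(list)
for w, n in map_sentence("a bug in the code is worth two in the documentation"):
    grouped[w].append(n)
print([reduce_word(w, ns) for w, ns in sorted(grouped.items())])
```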

  • Implementation

    Two Phases

    Implementation

    1 Count

    Mapper outputs key (word, neighbor) and empty value. Reducer counts values.

    Then it outputs key word and value count.

    2 Entropy

    Mapper is Identity. All counts for a word go to one Reducer. Reducer buffers counts, normalizes, and computes entropy.

    Issues

    + Entropy Reducer needs only counts in memory.

    - There can still be a lot of counts.

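A hedged Python sketch of the two phases (illustrative, not the original Hadoop code). Phase 1 reduces over (word, neighbor) keys and emits per-pair counts re-keyed by word; phase 2 buffers each word's counts before normalizing:

```python
import math
from collections import Counter, defaultdict

def phase1_count(pairs):
    # Reduce over (word, neighbor) keys: the count of each distinct pair, re-keyed by word.
    counts = Counter(pairs)
    return [(word, c) for (word, _neighbor), c in counts.items()]

def phase2_entropy(word_counts):
    # All counts for a word reach one reducer, which buffers them in memory.
    by_word = defaultdict(list)
    for word, c in word_counts:
        by_word[word].append(c)
    out = {}
    for word, cs in by_word.items():
        t = sum(cs)
        out[word] = -sum((c / t) * math.log2(c / t) for c in cs)
    return out

pairs = [("a", "plane")] * 3 + [("a", "bird")] + [("a", "the")] * 2
print(phase2_entropy(phase1_count(pairs)))   # entropy of counts {3, 1, 2} for "a"
```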

  • Implementation Streaming Entropy

    Streaming Entropy

    Observation

    Normalization and entropy can be computed simultaneously.

    Entropy(N) = -\sum_n p(N = n) \log_2 p(N = n)                              (1)

               = -\sum_n \frac{c(n)}{t} \left( \log_2 c(n) - \log_2 t \right)  (2)

               = \log_2 t - \sum_n \frac{c(n)}{t} \log_2 c(n)                  (3)

               = \log_2 t - \frac{1}{t} \sum_n c(n) \log_2 c(n)                (4)

    where c(n) is the count of context n and t = \sum_n c(n) is the total count.

    Moral

    Provided counts c(n), Reducer need only remember total t and a sum.

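Equation (4) is what permits a constant-memory reducer: given the stream of counts c(n), only the running total t and the running sum of c(n) log2 c(n) are needed. A small illustrative Python sketch (names are mine):

```python
import math

def streaming_entropy(counts):
    # One pass over the counts c(n), remembering only t and the sum of c(n) * log2(c(n)).
    t, s = 0, 0.0
    for c in counts:
        t += c
        s += c * math.log2(c)
    return math.log2(t) - s / t      # equation (4)

print(streaming_entropy([3, 1, 2]))  # same result as normalizing {3, 1, 2} and summing -p log2 p
```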

  • Implementation Streaming Entropy

    Two Phases with Streaming

    Implementation

    1 Count

    Mapper outputs key (word, neighbor) and empty value. Reducer counts values.

    Then it outputs key word and value count.

    2 Entropy

    Mapper is Identity. All counts for a word go to one Reducer. Reducer computes streaming entropy.

    Issues

    + Constant memory Reducer.

    - Not enough disk to store counts thrice on HDFS.


  • Implementation Reducer Sorting

    Word   Neighbor
    A      Plane
    Foo    Bar
    A      Bird
    A      Plane
    A      The
    Alice  Bob
    The    Problem
    A      Plane
    A      The

    Sort

    Word   Neighbor
    A      Bird
    A      Plane
    A      Plane
    A      Plane
    A      The
    A      The
    Alice  Bob
    Foo    Bar
    The    Problem


  • Implementation Reducer Sorting

    Word   Neighbor
    A      Plane
    Foo    Bar
    A      Bird
    A      Plane
    A      The
    Alice  Bob
    The    Problem
    A      Plane
    A      The

    Sort

    Word   Neighbor   Reducer   Output
    A      Bird       Reduce    (A, 1)
    A      Plane      Reduce    (A, 3)
    A      Plane
    A      Plane
    A      The        Reduce    (A, 2)
    A      The
    Alice  Bob        Reduce    (Alice, 1)
    Foo    Bar        Reduce    (Foo, 1)
    The    Problem    Reduce    (The, 1)

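The grouping shown above is what a sort followed by per-key reduction produces. A short illustrative Python sketch of the count phase, with itertools.groupby standing in for the framework's sort-and-group:

```python
from itertools import groupby

pairs = [("A", "Plane"), ("Foo", "Bar"), ("A", "Bird"), ("A", "Plane"), ("A", "The"),
         ("Alice", "Bob"), ("The", "Problem"), ("A", "Plane"), ("A", "The")]

# After sorting, each run of identical (word, neighbor) keys becomes one Reduce call.
for (word, neighbor), group in groupby(sorted(pairs)):
    count = sum(1 for _ in group)
    print(word, neighbor, "->", (word, count))
# (A, Bird) -> (A, 1); (A, Plane) -> (A, 3); (A, The) -> (A, 2); (Alice, Bob) -> (Alice, 1); ...
```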

  • Implementation Reducer Sorting

    Word   Neighbor
    A      Plane
    Foo    Bar
    A      Bird
    A      Plane
    A      The
    Alice  Bob
    The    Problem
    A      Plane
    A      The

    Sort

    Word   Neighbor   Call      Buffered counts / Output
    A      Bird       Reduce    (A; 1)
    A      Plane      Reduce    (A; 1, 3)
    A      Plane
    A      Plane
    A      The        Reduce    (A; 1, 3, 2)
    A      The
    Alice  Bob        Reduce    emit (A; 1, 3, 2), start (Alice; 1)
    Foo    Bar        Reduce    emit (Alice; 1), start (Foo; 1)
    The    Problem    Reduce    emit (Foo; 1), start (The; 1)
           close()    Close     emit (The; 1)

    The Reducer keeps the counts for the current word across Reduce calls and emits that word when a new word arrives or in close().


  • Implementation Reducer Sorting

    Recall Streaming Entropy

    Observation

    Normalization and entropy can be computed simultaneously.

    Entropy(N) = -\sum_n p(N = n) \log_2 p(N = n)                              (5)

               = -\sum_n \frac{c(n)}{t} \left( \log_2 c(n) - \log_2 t \right)  (6)

               = \log_2 t - \sum_n \frac{c(n)}{t} \log_2 c(n)                  (7)

               = \log_2 t - \frac{1}{t} \sum_n c(n) \log_2 c(n)                (8)

    Moral

    Computing t and a sum can be done in parallel.


  • Implementation Reducer Sorting

    Word   Neighbor   Call      State / Output
    A      Bird       Reduce    (A, 1, 0)
    A      Plane      Reduce    (A, 4, 4.7)
    A      Plane
    A      Plane
    A      The        Reduce    (A, 6, 6.7)
    A      The
    Alice  Bob        Reduce    emit (A, (6, 6.7)), start (Alice, 1, 0)
    Foo    Bar        Reduce    emit (Alice, (1, 0)), start (Foo, 1, 0)
    The    Problem    Reduce    emit (Foo, (1, 0)), start (The, 1, 0)
           close()    Close     emit (The, (1, 0))

    Here the state for a word is (word, t, sum of c(n) log2 c(n)); when the word changes or in close(), its entropy log2 t - (1/t) * sum is emitted.

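Tying the pieces together, a final illustrative Python sketch (class and method names are mine; the slides' actual implementation is Hadoop Java code): a reducer object that sees (word, neighbor) groups in sorted order, carries only (word, t, sum of c log2 c) between calls, and emits a word's entropy when the word changes or on close():

```python
import math
from itertools import groupby

class EntropyReducer:
    """Keeps (word, t, s) across reduce calls, where s = sum of c * log2(c)."""

    def __init__(self):
        self.word, self.t, self.s = None, 0, 0.0
        self.results = []

    def _flush(self):
        if self.word is not None:
            # Equation (8): entropy = log2(t) - s / t
            self.results.append((self.word, math.log2(self.t) - self.s / self.t))

    def reduce(self, word, count):
        if word != self.word:            # a new word: finish the previous one
            self._flush()
            self.word, self.t, self.s = word, 0, 0.0
        self.t += count
        self.s += count * math.log2(count)

    def close(self):                     # flush the last word when input ends
        self._flush()
        return self.results

pairs = [("A", "Plane"), ("Foo", "Bar"), ("A", "Bird"), ("A", "Plane"), ("A", "The"),
         ("Alice", "Bob"), ("The", "Problem"), ("A", "Plane"), ("A", "The")]
reducer = EntropyReducer()
for (word, neighbor), group in groupby(sorted(pairs)):
    reducer.reduce(word, sum(1 for _ in group))
print(reducer.close())   # A has neighbor counts {1, 3, 2}: entropy ~1.46 bits; the others get 0.0
```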
