The horizon isn't found in a dictionary: Identifying emerging word senses and identities in raw text

Ted Pedersen
Department of Computer Science
University of Minnesota, Duluth
[email protected]

Crazy Futures III, August 25-27, 2015

A winding road

● Dictionaries
● A powerful lens to look back, but not to the future
● Lexicographers
● While making dictionaries, engage in a kind of horizon scanning
– What new words or senses are emerging?
● Natural Language Processing
● Can we automate the task of the lexicographer?
● Can we identify emerging words, senses, and identities?

● Wonderful for looking back!
● Is that really a word?
● How do you spell it?
● What does it mean?
● When was a word first used?
● When did that sense of a word emerge?

● Not particularly predictive
● But, the people who create dictionaries are horizon scanners, always looking for new words and senses
● Lexicographers
● Or … computer programs? (NLP)

● Go back to at least 2300 BCE
● Early on were bilingual word lists
● Useful for trade, warfare
● Idea of monolingual dictionary developed later
● In English, 1604

Descriptive or Prescriptive

● Descriptive
● Document how the language is used
● Use determines meaning
● English – OED
● Prescriptive
● Define how the language should be used
● Experts decide
● English – early Webster
● French Academy – create words to replace Anglicisms

English Lexicography

● 1604 - A Table Alphabeticall, by Robert Cawdrey, approx 2,500 entries
● 1755 - The Dictionary of the English Language, by Samuel Johnson, approx 42,000 entries
● 1828 - American Dictionary of the English Language, by Noah Webster, approx 70,000 entries
● 1928 - Oxford English Dictionary, 10 volumes, approx 400,000 entries
● 1989 - Oxford English Dictionary (2nd ed), 20 volumes, 600,000 entries

Table Alphabeticall (1604)

A Table Alphabeticall, conteyning and teaching the true writing, and vnderstanding of hard vsuall English wordes, borrowed from the Hebrew, Greeke, Latine, or French. & c.

With the interpretation thereof by plaine English words, gathered for the benefit & helpe of Ladies, Gentlewomen, or any other vnskilfull persons.

Whereby they may the more easilie and better vnderstand many hard English wordes, which they shall heare or read in Scriptures, Sermons, or elswhere, and also be made able to vse the same aptly themselues.

Legere, et non intelligere, neglegere est.

As good not read, as not to vnderstand.

Table Alphabeticall (1604)

● A Table Alphabeticall of Hard Usual English Words
● Developed by Robert Cawdrey
● 120 pages, 2,543 entries
● Short definitions, synonyms
● Doesn't include multiple senses for a word


combustible, easily burnt
combustion, burning or consuming with fire.
comedie, (k) stage play,
comicall, handled merily like a comedie
commemoration, rehearsing or remebring
[fr] commencement, a beginning or entrance
comet, (g) a blasing starre
comentarie, exposition of any thing
commerce, fellowship, entercourse of merchandise.
commination, threatning, or menacing,
commiseration, pittie
commodious, profitable, pleasant, fit,
commotion, rebellion, trouble, or disquietnesse.
communicate, make partaker, or giue part vnto
[fr] communaltie, common people, or comon-wealth
communion, (* synonyms *) fellow-communitie, ship. (* synonyms end *)
compact, ioyned together, or an agreement.
compassion, pitty, fellow-feeling
compell, to force, or constraine
compendious, short, profitable

Table Alphabeticall (1604)

● The First English Dictionary
● Not clear why words included or not
● Hard?
● Introspection
● Quickly superseded


A Dictionary of the English Language (1755)

● Written by Samuel Johnson (Dr. Johnson)
● Worked alone (with six copyists)
● Nearly 43,000 entries
● 2,300 pages
● 100,000 illustrative quotes from literature
● Sometimes biased, long-winded, inconsistent
● A delight really...

Method

● Decided not to build upon previous works
● Carried out a perusal of English literature
● Studied 2,000 books from 500 authors going back 200 years
● Entries based on the past
● Selected quotations to show language in



The Inimitable Dr. Johnson

● Lexicographer: A writer of dictionaries; a harmless drudge that busies himself in tracing the original, and detailing the signification of words.
● Oats: A grain, which in England is generally given to horses, but in Scotland supports the people.
● To worm: To deprive a dog of something, nobody knows what, under his tongue, which is said to prevent him, nobody knows why, from running mad.

oats

● Oats. n.s. [aten, Saxon.] A grain, which in England is generally given to horses, but in Scotland supports the people.
● It is of the grass leaved tribe; the flowers have no petals, and are disposed in a loose panicle: the grain is eatable. The meal makes tolerable good bread. Miller.
● The oats have eaten the horses. Shakespeare.
● It is bare mechanism, no otherwise produced than the turning of a wild oatbeard, by the insinuation of the particles of moisture. Locke.
● For your lean cattle, fodder them with barley straw first, and the oat straw last. Mortimer's Husbandry.
● His horse's allowance of oats and beans, was greater than the journey required. Swift.

A Dictionary of the English Language (1755)

● A monumental work
● Set precedents for dictionaries that live on today
● Systematic study of published literature for words and senses
● Illustrate senses with quotations
● 1700 of Dr. Johnson's definitions remain in OED


Noah Webster

● A tireless advocate for American English
● “Blue Backed Speller” (1783, 1804, 1806)
● Proposed Americanized spellings
● Widely used in schools in 1800s
● Dissertations on the English Language (1789)
● An American standard needed to be developed

Noah Webster

● A Compendious Dictionary of the English Language (1806)
● 28,000 entries
● Intended to improve, Americanize Dr. Johnson's dictionary

Noah Webster

● An American Dictionary of the English Language (1828)
● 70,000 entries
● 1864 Unabridged edition had 114,000 entries

Improving on Dr. Johnson?

OAT, n.

A plant of the genus Avena, and more usually, the seed of the plant. The word is commonly used in the plural, oats. This plant flourishes best in cold latitudes, and degenerates in the warm. The meal of this grain, oatmeal, forms a considerable and very valuable article of food for man in Scotland, and every where oats are excellent food for horses and cattle.


An American Dictionary

It is not only important, but, in a degree necessary, that the people of this country, should have an American Dictionary of the English Language; for, although the body of the language is the same as in England, and it is desirable to perpetuate that sameness, yet some differences must exist. Language is the expression of ideas; and if the people of one country cannot preserve an identity of ideas, they cannot retain an identity of language. Now an identity of ideas depends materially upon a sameness of things or objects with which the people of the two countries are conversant. But in no two portions of the earth, remote from each other, can such identity be found. Even physical objects must be different. But the principal differences between the people of this country and of all others, arise from different forms of government, different laws, institutions and customs.

Noah Webster

● An American Dictionary of the English Language (1828)
● 70,000 words
● Not a great success at the time

Oxford English Dictionary

● OED began in 1857 as a revision of Dr. Johnson's dictionary
● Improve coverage, quality of entries, consistency, remove biases
● Envisioned as a 10 year project
● Was also a response to perception that other European languages were more advanced with their dictionaries

Oxford English Dictionary

● Work began in 1857, first publication in 1884, first edition in 1928 (71 years later)
● James Murray, Chief Editor of OED, 1879 – 1915



● Invite English readers to contribute words

● Read, and whenever they see a word of interest used in an illustrative context, write it on a slip of paper and send it to OUP

● Word, quotation, citation, reference


First edition 1928

● 10 volumes, 15,490 pages
● 414,800 entries
● 2,000 contributors
● 5 million submitted quotations
● 1.86 million used

Second Edition 1989

● 20 volumes, 21,730 pages
● Weighs 137 pounds
● 658,000 words
● 2.43 million quotations


But...good news

● Duck face is entering dictionaries
● Oxford Dictionaries online
● Urban dictionary
● OED sets high bar for inclusion
● What words are being used today that will find their way into OED?

And now...NLP?

● OED tells us when a word or sense was first used

● What if we could automatically recognize new words or senses going forward?

● What if we could recognize people or organizations (identities) that were to be significant?

New words, emerging senses, new identities

● Scan sources of interest and look for words or terms that have not occurred previously, and that reach some level of regularity and frequency

● Once you have a few candidates, you can start to investigate further
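The scan described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `emerging_terms`, the thresholds `min_periods` and `min_count`, and the toy sentences are all hypothetical, and whitespace tokenization stands in for real preprocessing.

```python
from collections import Counter

def emerging_terms(baseline_docs, recent_periods, min_periods=2, min_count=2):
    """Return terms unseen in the baseline that recur across recent periods.

    min_periods approximates "regularity", min_count approximates "frequency";
    both thresholds are illustrative assumptions.
    """
    seen = {tok for doc in baseline_docs for tok in doc.lower().split()}
    periods_with_term = Counter()
    total = Counter()
    for docs in recent_periods:
        counts = Counter(tok for doc in docs for tok in doc.lower().split())
        for tok, count in counts.items():
            if tok not in seen:
                total[tok] += count
                periods_with_term[tok] += 1
    return sorted(
        tok for tok in total
        if periods_with_term[tok] >= min_periods and total[tok] >= min_count
    )

# Toy data: one list of documents per time period (hypothetical text).
baseline = ["she took a photo of the duck", "he made a funny face"]
recent = [
    ["another duckface selfie today"],
    ["that duckface pose again"],
    ["selfie with friends"],
]
candidates = emerging_terms(baseline, recent)
print(candidates)
```

Terms that pass both thresholds are only candidates; the further investigation mentioned above still has to decide whether they are real new words.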



● Identify interesting or significant words, phrases, or names

● Group the occurrences of this “interesting thing” into senses

● Differentiate among the senses

● Concordances
● Measures of Association
● Clustering
● First order co-occurrences
● Second order co-occurrences



● KWIC – Key Word in Context
● A basic tool for lexicographers, and many other language users
● Long history with religious scholars

● Shows a target word surrounded by some amount of context on either side
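A minimal sketch of a KWIC display, assuming whitespace tokenization and a fixed window of `width` tokens on each side. The function name `kwic` and the window size are illustrative choices, not part of any particular concordance tool.

```python
def kwic(text, target, width=3):
    """Return (left context, keyword, right context) for each hit of target."""
    tokens = text.split()
    lines = []
    for i, tok in enumerate(tokens):
        # Compare case-insensitively and ignore trailing punctuation.
        if tok.lower().strip(".,;:!?") == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

text = ("Feed a cold, starve a fever. It is always cold in Minnesota. "
        "The soup was cold and watery. Cold and flu season is upon us.")
concordance = kwic(text, "cold")
for left, key, right in concordance:
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>25} | {key} | {right}")
```

Aligning the keyword in a single column is what lets a reader scan, sort, and compare the usages.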

Concordance

● Can ponder different usages of a word in context, sort and rearrange them, compare and contrast, come to understand distinctions in meaning
● The goal may be to group the contexts in the concordance into groups or clusters, where each cluster uses the target word in the same sense
● ...Much like a lexicographer


● How to recognize similar entries in a concordance?
● Collocations with the target word
– All entries using “burnt offering” likely to be using same sense (of offering)
● Same or similar words co-occur in context
– All entries that also include “priest” may be



● Can be recognized via frequency
● May be identified in a large corpus via measures of association
● Do these two words occur together significantly more often than expected by chance?


Measures of Association

● Compare the frequency of a pair of words with the value that would be expected if they were independent
● p(w1,w2) = p(w1)*p(w2) ??
● If the frequency of the pair is about what would be expected under independence, then this pair is not considered interesting (it is instead just a chance occurrence)
● If the observed frequency diverges strongly from the expected one, the pair is a candidate collocation


Measures of Association

● Log-likelihood ratio (ll)

● Mutual Information (tmi)

● Pearson's chi-squared test (x2)

● Pointwise Mutual Information (pmi)

● Poisson-Stirling (ps)

● Fisher's Exact Test (leftFisher)

● Jaccard Coefficient (jaccard)

● Odds Ratio (odds)

● Dice Coefficient (dice)

● T-score (tscore)


Log likelihood ratio


Observed versus Expected

● p(w_1,w_2) = n_11 / n_++
● p(w_1) = n_1+ / n_++
● p(w_2) = n_+1 / n_++
● m_11 = (n_1+ * n_+1) / n_++
● Generalizes to m_ij = (n_i+ * n_+j) / n_++

          W2      NOT W2
W1        n_11    n_12    n_1+
NOT W1    n_21    n_22    n_2+
          n_+1    n_+2    n_++


Example

             offering          NOT offering        Total
burnt        n_11 = 184        n_12 = 125          309
             m_11 = 2.47       m_12 = 306.53
NOT burnt    n_21 = 364        n_22 = 67,944       68,308
             m_21 = 545.53     m_22 = 67,762.47
Total        548               68,069              68,617

● Do n_ij and m_ij diverge enough to reject the model of independence?

● According to log-likelihood they do …
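The comparison of observed and expected counts can be sketched as follows. The statistic used is the standard log-likelihood ratio G² = 2 Σ n_ij log(n_ij / m_ij); the function names are illustrative, and the expected values are recomputed from the observed counts for "burnt offering" rather than copied from the slide.

```python
import math

def expected_counts(n11, n12, n21, n22):
    """Expected cell counts m_ij = (n_i+ * n_+j) / n_++ under independence."""
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    total = row1 + row2
    return (row1 * col1 / total, row1 * col2 / total,
            row2 * col1 / total, row2 * col2 / total)

def log_likelihood(n11, n12, n21, n22):
    """G^2 = 2 * sum of n_ij * ln(n_ij / m_ij); empty cells contribute 0."""
    observed = (n11, n12, n21, n22)
    expected = expected_counts(n11, n12, n21, n22)
    return 2.0 * sum(n * math.log(n / m)
                     for n, m in zip(observed, expected) if n > 0)

# Observed counts for "burnt offering" from the example table.
m11, m12, m21, m22 = expected_counts(184, 125, 364, 67944)
g2 = log_likelihood(184, 125, 364, 67944)
print(round(m11, 2), round(m12, 2), round(m21, 2), round(m22, 2))
print(round(g2, 1))
```

G² is usually compared against a chi-squared distribution with one degree of freedom, where 3.84 is the critical value at the 0.05 level. The score for this table is far above that, so independence is rejected.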


Features

● Collocations – words that occur together more often than expected by chance
● Can indicate sense reliably when target word
● Co-occurrences – words that occur near the target word (but not adjacent)
● Useful for differentiating among senses, especially when several are involved


Word Sense Discrimination

● Feed a cold, starve a fever.
● It is always cold in Minnesota.
● The soup was cold and watery.
● Cold and flu season is upon us.


Word Sense Discrimination

● Feed a cold, starve a fever.
● Cold and flu season is upon us.

● It is always cold in Minnesota.
● The soup was cold and watery.


First Order Representations

● CTX1 : Feed a cold, starve a fever.

        cold  feed  fever  starve
CTX1    1     1     1      1


First order methods

● Following bag-of-words, text classification

● Represent each target word context with a binary vector that shows which features occur within

● Collocations, co-occurrences

● Results in a context by word matrix (where each row is an instance to be clustered)

● Cluster
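The context by word matrix described above can be sketched as follows. The stop list and the simple punctuation stripping are assumptions made for the illustration; the two contexts are CTX1 and CTX4 from the slides.

```python
# Hypothetical stop list, just large enough for these example contexts.
STOP_WORDS = {"a", "and", "in", "is", "it", "the", "us", "was"}

def tokenize(context):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    return [t.strip(".,;:!?").lower() for t in context.split()]

def first_order_matrix(contexts):
    """Return (vocabulary, rows) where each row is a binary feature vector."""
    token_lists = [
        [t for t in tokenize(c) if t and t not in STOP_WORDS] for c in contexts
    ]
    vocabulary = sorted({t for toks in token_lists for t in toks})
    rows = [
        [1 if word in set(toks) else 0 for word in vocabulary]
        for toks in token_lists
    ]
    return vocabulary, rows

contexts = [
    "Feed a cold, starve a fever.",      # CTX1
    "Cold and flu season is upon us.",   # CTX4
]
vocabulary, rows = first_order_matrix(contexts)
print(vocabulary)
for row in rows:
    print(row)
```

Each row is one instance to be clustered. Apart from the target word itself, the two rows share no features.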


First Order Representations

● CTX1 : Feed a cold, starve a fever.
● CTX4 : Cold and flu season is upon us.

        cold  feed  fever  flu  season  starve  upon
CTX1    1     1     1      0    0       1       0
CTX4    1     0     0      1    1       0       1


First order representations

● Works well enough if you have moderate to large numbers of larger contexts
● and a relatively consistent vocabulary...
– and a bit of luck...
● Success in supervised text classification problems doesn't always transfer over to unsupervised arena


What drives us crazy...

● fever and flu have much in common ...
● But, just can't see it here..

        cold  feed  fever  flu  season  starve  upon
CTX1    1     1     1      0    0       1       0
CTX4    1     0     0      1    1       0       1

CTX1 : Feed a cold, starve a fever.
CTX4 : Cold and flu season is upon us.


Look to the second order...

● You shall know a word by the company it keeps (JR Firth, 1957)
● Words have friends
– Cold is a friend of fever and flu
● Friends share friends and hang outs
– Fever and flu share some friends that aren't friends with cold
● 2nd order co-occurrences with cold (f of f)
– Fever and flu hang out in places without cold
● 2nd order “locations” of cold


Look to the second order...

● Fever and flu have some of the same friends...

● His fever caused his temperature to spike.

● The flu brings on a rise in body temperature.

● Fever and flu hang out together...

● Although influenza (the flu) is not considered serious by many parents, the very high fever that it can cause is a cause of blindness and even death in children.

● Second order features can be derived from the target word contexts, or from other (unannotated) data


LSI, LSA, and Schütze

● Unsupervised methods

● Input Contexts, Output Clusters of Contexts

● Influential

● Context representation a key distinction

● Alternatives to first order features

● They look to the second order...

– LSI/LSA – where do you find your word friends?

– Schütze - who do your word friends hang out with?


Second order representations

● CTX1 : Feed a cold, starve a fever...

● Create co-occurrence vectors for all non-stop words : feed, starve, fever

● Replace words in CTX1 with those vectors

● Average together and replace CTX1 with that new averaged vector

● Do the same with all other target word contexts, then cluster
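The steps above can be sketched as follows. This is a simplified illustration, not the SenseClusters implementation: it assumes sentence-level co-occurrence counts instead of bigram features, skips SVD, uses a hypothetical stop list, and takes its tiny co-occurrence corpus from the two temperature sentences quoted earlier in the talk.

```python
import math
from collections import Counter, defaultdict

STOP_WORDS = {"a", "and", "his", "in", "is", "on", "the", "to", "us"}  # assumed
TARGET = "cold"

def tokenize(text):
    """Lowercased content words with surrounding punctuation stripped."""
    words = (t.strip(".,;:!?").lower() for t in text.split())
    return [w for w in words if w and w not in STOP_WORDS]

def cooccurrence_vectors(corpus):
    """Map each word to counts of the words that share a sentence with it."""
    vectors = defaultdict(Counter)
    for sentence in corpus:
        words = set(tokenize(sentence))
        for word in words:
            for other in words:
                if other != word:
                    vectors[word][other] += 1
    return vectors

def second_order_vector(context, vectors):
    """Average the co-occurrence vectors of the context's known words."""
    words = [w for w in tokenize(context) if w != TARGET and w in vectors]
    averaged = Counter()
    for word in words:
        for feature, count in vectors[word].items():
            averaged[feature] += count / len(words)
    return averaged

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

corpus = [
    "His fever caused his temperature to spike.",
    "The flu brings on a rise in body temperature.",
]
vectors = cooccurrence_vectors(corpus)
ctx1 = second_order_vector("Feed a cold, starve a fever.", vectors)
ctx4 = second_order_vector("Cold and flu season is upon us.", vectors)
similarity = cosine(ctx1, ctx4)
print(round(similarity, 3))
```

The two contexts share no content words other than the target, yet their second order vectors overlap on "temperature", so the similarity is greater than zero.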


Second order representations

● CTX1 : Feed a cold, starve a fever.

● CTX4 : Cold and flu season is upon us.
● Nothing other than the target word cold matches in the first order representation, but in second order if fever and flu ...
● both occur with temperature, then there is some similarity between CTX1 and CTX4
● both occur in document 12432, then there is some similarity between CTX1 and CTX4


Method

● Collect contexts with a given target word

● Identify lexical features within the contexts

● Use these to represent contexts using first or second order features

● Perform SVD or other dimensionality reduction

● Cluster

● Number of clusters automatically discovered

● Generate a label for each cluster


First order features

● Represent contexts with binary vectors that show which features occur in the context

● Results in a context by word matrix (where each row is an instance to be clustered)

● Cluster


Second order co-occurrences

● Use bigram features to create a word by word co-occurrence matrix

● SVD or dimensionality reduction

● Replace each word in a target word context with the corresponding co-occurrence vector

● Average all of the word vectors together to represent the context

● Do this for each target word context, cluster


A note on word embeddings

● Word embeddings are a recently popular idea where a vector is created for a word based on co-occurrence or other kinds of language information
● 2nd order features as shown here can be seen as a fairly direct sort of word embedding
● word2vec is a widely used tool


second order locations (LSI/LSA)

● Transpose first order representation so that it becomes word by context

● Perform SVD (LSA recommendation)

● Represent contexts to be clustered by replacing each word in a target word context with the corresponding word vector

● Average all of the word vectors together to represent the context


Clustering

● Repeated Bisections

● Starts by clustering all contexts in one cluster, then repeatedly partitioning (in two) to optimize the criterion function

● Partitioning done via k-means with k=2

● I2 criterion function

● Finds average pairwise similarity between each context in the cluster and the centroid, sums across all clusters to find value
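One way to read the I2 criterion is as the sum, over all clusters, of the cosine similarity between each context vector and its cluster centroid. The sketch below scores two candidate bisections under that reading. The function names and the toy vectors are hypothetical, and the k-means step that would actually propose the partitions is left out.

```python
import math

def unit(vector):
    """Scale a vector to unit length (all-zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

def i2(clusters):
    """Sum over clusters of cosine(context, centroid) for every member."""
    total = 0.0
    for members in clusters:
        vectors = [unit(v) for v in members]
        centroid = unit([sum(col) / len(vectors) for col in zip(*vectors)])
        total += sum(sum(a * b for a, b in zip(v, centroid)) for v in vectors)
    return total

# Four hypothetical context vectors: a and b point one way, c and d another.
a, b = [1.0, 1.0, 0.0], [1.0, 0.8, 0.0]
c, d = [0.0, 0.1, 1.0], [0.0, 0.0, 1.0]

good_split = [[a, b], [c, d]]
bad_split = [[a, c], [b, d]]
print(round(i2(good_split), 3), round(i2(bad_split), 3))
```

A bisection step would keep whichever split gives the higher score, here the one that groups similar contexts together.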


Cluster stopping

● Find k where criterion function stops improving
● PK2 (Hartigan, 1975) takes ratio of criterion function of successive pairs of k
● PK3 takes ratio of twice the criterion function at k divided by the sum of the criterion function values at (k-1) and (k+1)

● PK2 and PK3 stop when these ratios are within 1 std of 1

● Gap Statistic (Tibshirani, 2001) compares observed data with reference sample of noise, find k with greatest divergence from noise
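The two ratios can be sketched as follows. This illustration assumes that PK3 divides by the sum of the criterion function values at k-1 and k+1, which keeps the ratio near 1 once the criterion function levels off. The criterion values in `crfun` are made up, and the stopping rule itself (the comparison with one standard deviation) is not implemented.

```python
def pk2(crfun, k):
    """Ratio of the criterion function at k to its value at k-1."""
    return crfun[k] / crfun[k - 1]

def pk3(crfun, k):
    """Twice the value at k over the sum of the values at k-1 and k+1 (assumed form)."""
    return 2.0 * crfun[k] / (crfun[k - 1] + crfun[k + 1])

# Hypothetical criterion function values for k = 1..5 clusters.
crfun = {1: 10.0, 2: 16.0, 3: 19.0, 4: 19.5, 5: 19.8}

pk2_scores = {k: pk2(crfun, k) for k in range(2, 6)}
pk3_scores = {k: pk3(crfun, k) for k in range(2, 5)}
for k in sorted(pk2_scores):
    print(k, round(pk2_scores[k], 3), round(pk3_scores.get(k, float("nan")), 3))
```

Both ratios move toward 1 as adding clusters stops improving the criterion function, which is the point where a stopping rule would pick k.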


Cluster labeling

● Clusters made up of contexts that use the target word in a particular sense

● Find top N most associated bigrams that are unique to that cluster (discriminating features) and top N that are most associated without regard to which cluster they are in (descriptive features)

● Use standard measures of association like log-likelihood, etc.

● Definition via a few well chosen bigrams


The result?

● Contexts that contain a particular target word

● Organized by sense, where each cluster contains contexts used in approximately the same sense


● Much like word senses, except they apply to names

● Many distinct individuals have the same name
● How do we differentiate among them?

Same techniques can be used.


● Might also be interested in new words for old ideas
● How similar are the contexts in which these new words are being used (with old contexts)
● Or different words for the same idea
● Can use same techniques to recognize


The Future of Word Sense Discrimination

● Automatically identifying senses by clustering contexts continues to improve

● Automatically creating definitions remains a challenging but fascinating problem in its own right
● Given a cluster of contexts, create a definition that captures why these contexts are in the same cluster
● Related task at SemEval-2015


The Future of Word Sense Discrimination

● Once a definition has been created, use that to position the new sense in a WordNet or ontology
● Related task at SemEval-2016


Conclusion

● Dictionaries look backwards, and only include words once they have a good chance of long-term acceptance
● The process by which dictionaries are created can be seen as a kind of horizon scanning
● New words, new senses
● Standards for inclusion in OED very high


● These techniques can be used to spot emerging words, senses and identities in raw text

● These can be harbingers of future trends


Thank you!

● Measures of Association●

● Word Sense Discrimination●

LSI, LSA, and Schütze

● LSI : Deerwester, S., et al. (1988) Improving Information Retrieval with Latent Semantic Indexing, Proceedings of the 51st Annual Meeting of the American Society for Information Science 25, pp. 36–40.

● LSA : Landauer, T. K., and Dumais, S. T. (1997) A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211-240.

● Schütze : Schütze, H. (1998) Automatic word sense discrimination. Computational Linguistics, 24(1), pp. 97-123.

● SenseClusters :