
Ngram Models – Bahareh Sarrafzadeh, Winter 2010


Page 1:

Ngram Models

Bahareh Sarrafzadeh, Winter 2010

Page 2:

Agenda

• Ngrams
  – Language Modeling
  – Evaluation of LMs

• Markov Models
  – Stochastic Process
  – Markov Chain

• Text Classification
  – Ngram-based Approach

Page 3:

N-Grams

Page 4:

What is an N-Gram?

• A subsequence of n items from a given sequence

• Items:
  – Phonemes
  – Syllables
  – Letters
  – Words

• Number of items:
  – Unigram, bigram, trigram, ...

Page 5:

N-Gram Examples

• 3-grams
  – ceramics collectables collectibles (55)
  – ceramics collectables fine (130)
  – ceramics collected by (52)
  – ceramics collectible pottery (50)
  – ceramics collectibles cooking (45)

• 4-grams
  – serve as the incoming (92)
  – serve as the incubator (99)
  – serve as the independent (794)
  – serve as the index (223)
  – serve as the indication (72)
  – serve as the indicator (120)

Page 6:

N-Gram Model

• A Probabilistic Model for Predicting the next Item in such a sequence.

• Why do we want to predict words?
  – Chatbots
  – Speech recognition
  – Handwriting recognition / OCR
  – Spelling correction
  – Author attribution
  – Plagiarism detection
  – ...

Page 7:

N-Gram Model

• Models sequences, especially natural language, using the statistical properties of N-grams

• Idea (Shannon): given a sequence of letters (e.g. "for ex"), what is the likelihood of the next letter?

• From training data, derive a probability distribution for the next letter given a history of size n.

Page 8:

N-Gram Model

• Predicts x_i based on x_{i−1}, x_{i−2}, ..., x_{i−n}:

• N-gram independence assumption:
  – A word is affected only by its "prior local context" (the last few words)
• Advantages:
  – Massively simplifies the problem of learning the language model
  – Because of the open nature of language, it is common to group words unknown to the language model together

P(x_i | x_{i−1}, x_{i−2}, ..., x_{i−n})
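As a concrete illustration, here is a minimal sketch of this idea in Python: count n-grams in a toy corpus and turn the counts for a given history into a distribution over the next word. The corpus and the choice n = 3 are illustrative assumptions, not from the slides.

```python
from collections import Counter, defaultdict

# Estimate P(next word | last n-1 words) from raw counts over a toy corpus.
corpus = "and nothing but the truth and nothing but the whole truth".split()
n = 3  # trigram: condition on the previous two words

counts = defaultdict(Counter)
for i in range(len(corpus) - n + 1):
    history, nxt = tuple(corpus[i:i + n - 1]), corpus[i + n - 1]
    counts[history][nxt] += 1

def next_word_distribution(history):
    """Distribution over the next word, given the last n-1 words."""
    seen = counts[tuple(history)]
    total = sum(seen.values())
    return {w: c / total for w, c in seen.items()}

print(next_word_distribution(["nothing", "but"]))  # {'the': 1.0}
```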

Page 9:

Language Models

• A statistical language model assigns a probability to a sequence of m words by means of a probability distribution

• Applications in NLP:
  – Speech recognition
  – Machine translation
  – Part-of-speech tagging
  – Parsing
  – Information retrieval

Page 10:

• The goal of statistical language modeling is to build a statistical language model that can estimate the distribution of natural language as accurately as possible.

Page 11:

A bad language model

Page 12:

A bad language model

Page 13:

A bad language model

Page 14:

A bad language model

Page 15:

What happened?

• A language model is a probability distribution over word sequences:

  – P("And nothing but the truth") ≈ 0.001

  – P("And nuts sing on the roof") ≈ 0

Page 16:

How do language models work?

• Hard to compute P(“And nothing but the truth”)

• Step 1: Decompose probability

P("And nothing but the truth") =
  P("And")
  × P("nothing" | "And")
  × P("but" | "And nothing")
  × P("the" | "And nothing but")
  × P("truth" | "And nothing but the")
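A short sketch of this decomposition, assuming some cond_prob(word, history) function as a hypothetical stand-in for whatever model supplies the conditional probabilities:

```python
def sentence_prob(words, cond_prob):
    # Chain rule: P(w_1 ... w_m) = product over i of P(w_i | w_1 ... w_{i-1}).
    p = 1.0
    for i, w in enumerate(words):
        p *= cond_prob(w, words[:i])
    return p

# e.g. sentence_prob("And nothing but the truth".split(), cond_prob)
```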

Page 17:

Language Models - Simplification

• Estimating the probability of sequences in a corpus can become difficult:
  – Arbitrarily long phrases or sentences
  – Data sparseness
  – Overfitting

• Solution: Models are often approximated using smoothed N-gram models.

Page 18:

Ngram Modeling of a Language

• In an n-gram model, the probability of observing the sentence w_1, ..., w_m is approximated as:

P(w_1, ..., w_m) = ∏_{i=1}^{m} P(w_i | w_1, ..., w_{i−1}) ≈ ∏_{i=1}^{m} P(w_i | w_{i−(n−1)}, ..., w_{i−1})

(each factor is a prediction of w_i from its history, the previous n−1 words)

• The conditional probability can be calculated from n-gram frequency counts:

P(w_i | w_{i−(n−1)}, ..., w_{i−1}) = count(w_{i−(n−1)}, ..., w_{i−1}, w_i) / count(w_{i−(n−1)}, ..., w_{i−1})
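A direct sketch of this count-ratio estimate, assuming `tokens` is a pre-tokenized training corpus (the helper names are ours, not the slides'):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Frequency of every n-gram (as a tuple) in the token stream."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def cond_prob(tokens, n, history, word):
    """P(word | history) = count(history, word) / count(history)."""
    assert len(history) == n - 1
    num = ngram_counts(tokens, n)[tuple(history) + (word,)]
    den = ngram_counts(tokens, n - 1)[tuple(history)]
    return num / den if den else 0.0
```

Recomputing the counts on every call is wasteful; a real implementation would build both count tables once and reuse them.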

Page 19:

Example

• Assume each word depends only on the previous two words (Trigram Assumption)

P("the" | "... whole truth and nothing but") ≈ P("the" | "nothing but")

P("truth" | "... whole truth and nothing but the") ≈ P("truth" | "but the")

Page 20:

Smoothing

• It is useful to assign small probabilities to unseen n-grams.

• For example, for 3-grams, if we add two "dummy" words (such as '.') to the beginning of each sentence, we have:

P(w_1 ... w_n) = P(w_1 | . .) · P(w_2 | . w_1) · ... · P(w_n | w_{n−2} w_{n−1})
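A sketch combining the slide's two points: pad each sentence with two dummy '.' tokens for a trigram model, and give unseen trigrams a small non-zero probability. Add-one (Laplace) smoothing is used here as one simple choice; the slide itself does not fix a smoothing method. `trigrams` and `bigrams` are assumed to be Counter tables like those built by `ngram_counts` above.

```python
def smoothed_trigram_prob(trigrams, bigrams, vocab_size, w1, w2, w3):
    # Add-one smoothing: (count(w1 w2 w3) + 1) / (count(w1 w2) + V),
    # so unseen trigrams get a small non-zero probability.
    return (trigrams[(w1, w2, w3)] + 1) / (bigrams[(w1, w2)] + vocab_size)

def sentence_prob(words, trigrams, bigrams, vocab_size):
    padded = [".", "."] + words            # two dummy words, as above
    p = 1.0
    for i in range(2, len(padded)):
        p *= smoothed_trigram_prob(trigrams, bigrams, vocab_size,
                                   padded[i - 2], padded[i - 1], padded[i])
    return p
```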

Page 21:

Graphical Representation

[Figure: the model drawn as a chain over the word sequence, showing the 1-gram, 2-gram, ..., n-gram cases; in the n-gram case each prediction conditions on the previous (n−1)-gram.]

Page 22:

Use of Log Probabilities

• Multiplying a large number of probabilities gives a very small result (close to zero)

• So in order to avoid floating-point underflow, we should use logarithms of the probabilities in the model.
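A quick demonstration of why: multiplying 400 probabilities of 0.1 underflows to exactly 0.0 in double precision, while the sum of logs remains an ordinary, well-scaled number. The specific numbers are ours, chosen to trigger the underflow.

```python
import math

probs = [0.1] * 400

product = 1.0
for p in probs:
    product *= p
print(product)                           # 0.0 -- floating-point underflow

print(sum(math.log(p) for p in probs))   # -921.034..., still well-scaled
```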

Page 23:

Evaluation

• Extrinsic
  – The language model is embedded in a wider application:
    • Slow
    • Specific to the application

• Intrinsic
  – The language model is evaluated directly, using some measure such as perplexity

Page 24:

Perplexity

• Perplexity measures the effective size of the set of words from which the next word is chosen, given the history of words observed so far.

• The perplexity of a LM depends on the domain of discourse.

Page 25:

Perplexity: Intuition

• Ask a speech recognizer to recognize the digits "0, 1, 2, 3, 4, 5, 6, 7, 8, 9" – easy – perplexity 10

• Ask a speech recognizer to recognize one of 30,000 names at Microsoft – hard – perplexity 30,000

• Perplexity is the weighted equivalent branching factor.
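A sketch making the branching-factor reading concrete: for a model that assigns each observed word probability 1/k, the perplexity works out to exactly k, matching the digits and names examples above.

```python
import math

def perplexity(word_probs):
    """Perplexity from the model probability of each observed word."""
    avg_neg_log = -sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2 ** avg_neg_log

print(perplexity([1 / 10] * 100))     # 10.0    -- ten equally likely digits
print(perplexity([1 / 30000] * 100))  # ~30000  -- 30,000 equally likely names
```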

Page 26:

Perplexity: Is lower better?

• Remarkable fact: the true model of the data has the lowest possible perplexity.

• The lower the perplexity, the closer we are to the true model.

Page 27:

Markov Model

Page 28:

Markov Property – Markov Process

• “the future is independent of the past given the present.”

• A stochastic process has the Markov property if the conditional probability distribution of future states of the process depends only upon the present state.

• A process with this property is called a Markov process.

Page 29:

Markov Chain

• We have a set of states, S = {s_1, s_2, ..., s_r}.
• The process starts in one of these states and moves successively from one state to another.
• Each move is called a step.
• If the chain is currently in state s_i, then it moves to state s_j at the next step with a probability denoted by p_ij.
• This probability does not depend upon which states the chain was in before the current state.
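A minimal sketch of such a chain: the p_ij live in a transition matrix whose rows sum to 1, and a step samples the next state from the current state's row only. The states and probabilities are made up for illustration.

```python
import random

states = ["s1", "s2", "s3"]
P = [                 # P[i][j] = p_ij = P(next = s_j | current = s_i)
    [0.7, 0.1, 0.2],
    [0.3, 0.4, 0.3],
    [0.4, 0.2, 0.4],
]

def step(i):
    """One move: the next state depends only on the current state i."""
    return random.choices(range(len(states)), weights=P[i])[0]

i = 0
for _ in range(5):
    i = step(i)
    print(states[i])
```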

Page 30:

Order m – Markov Chain

• A Markov chain of order m (or a Markov chain with memory m), where m is finite, is a process in which the future state depends on the past m states.

Page 31:

Text Generation using Markov Chains

• Markov processes can also be used to generate superficially "real-looking" text given a sample document

• These processes are also used by spammers to inject real-looking hidden paragraphs into emails to get these messages past spam filters.
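A toy sketch of the generation idea: treat each word as a state, record which words followed it in a sample document, and walk the chain by sampling successors. The sample text is an illustrative stand-in.

```python
import random
from collections import defaultdict

sample = "the truth and nothing but the truth so help me".split()

successors = defaultdict(list)       # word -> words that followed it
for a, b in zip(sample, sample[1:]):
    successors[a].append(b)

word, output = "the", ["the"]
for _ in range(8):
    if not successors[word]:         # dead end: no observed successor
        break
    word = random.choice(successors[word])
    output.append(word)
print(" ".join(output))
```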

Page 32:

• Shannon considers a series of Markov chain approximations to English prose.

• For example, he presents first a simulation where the words are chosen independently but with appropriate frequencies.

REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.

Page 33:

• He then notes the increased similarity to ordinary English text when the words are chosen as a Markov chain, in which case he obtains

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.

Page 34:

Garkov! (Garfield comic strips with dialogue generated by a Markov chain)

Page 35:

Text Classification using N-Grams

Page 36:

Text Classification

• A fundamental kind of document processing
• A content-based assignment of one or more predefined categories to free texts
• Approaches:
  – Supervised
  – Unsupervised
  – Semi-supervised

Page 37:

Main Tasks

1. Feature construction / selection
   – Extracting representative features:
     • Words – frequency
     • Context of words – set of words
     • Sparse phrases – neighbouring words
     • Word n-grams – frequency

2. Learning phase
   – Binary classifiers
   – M-ary classifiers

Page 38:

Learning Algorithms

• Decision Trees
• Naive Bayes
• KNN
• Neural Networks
• Support Vector Machines

Page 39:

Ngram based Text Classification

• Features:
  – N-grams

• Values:
  – N-gram frequencies

• Similarity measure:
  – Of various types

Page 40:

Classifier's Characteristics

• The categorization must work reliably in spite of textual errors.
• The categorization must be efficient, consuming as little storage and processing time as possible.
• The categorization must be able to recognize when a given document does not match any category, or when it falls between two categories.

Page 41:

Overall Approach

• Start with a set of pre-existing text categories (such as subject domains)

• Generate a set of N-gram frequency profiles to represent each of the categories.

• When a new document arrives for classification, the system first computes its N-gram frequency profile.

• It then compares this profile against the profiles for each of the categories using an easily calculated distance measure.

• The system classifies the document as belonging to the category having the smallest distance.

Page 42:

N-gram Frequency Statistics

• Each word occurs in human languages with a different frequency.

• One of the most common ways of expressing this idea: Zipf’s Law

Page 43:

Zipf’s Law

• The nth most common word in a human language text occurs with a frequency inversely proportional to n:

f = k / n   (where k is a constant and n is the word's frequency rank)

• There is always a set of words which dominates most of the other words of the language in terms of frequency of use.

Page 44:

Zipf’s Law

• The most frequent word will occur approximately twice as often as the second most frequent word, which occurs twice as often as the fourth most frequent word ...

• This is true for:
  – Languages
  – Subject-specific words

Page 45:

Zipf’s Law: Example

• For example, in the Brown Corpus "the" is the most frequently occurring word and by itself accounts for nearly 7% of all word occurrences.
• The second-place word "of" accounts for slightly over 3.5% of words,
• followed by "and" (about 2%).
• Only 135 vocabulary items are needed to account for half the Brown Corpus.
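These figures line up with the f = k/n form above: taking k ≈ 0.07 from the rank-1 word "the", the law predicts f(2) ≈ 0.07/2 = 3.5% for the second-ranked word and f(3) ≈ 0.07/3 ≈ 2.3% for the third, close to the observed shares of "of" and "and".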

Page 46:

Zipf’s Law Applies to Lots of Things

• Frequency of accesses to web pages
• Sizes of settlements
• Income distribution amongst individuals
• Sizes of earthquakes
• Words in the English language

Page 47:

[Figure: word frequency in Wikipedia, illustrating Zipf's law.]

Page 48:

Zipf’s Law: Classification

• Zipf’s Law implies that classifying documents with N-gram frequency statistics will not be very sensitive to cutting off the distributions at a particular rank.

• It also implies that if we are comparing documents from the same category they should have similar N-gram frequency distributions.

Page 49:

Document Representation

• Documents were represented by their N-gram frequency profiles:
  – The list of N-grams, ordered by the number of occurrences in the given document
  – This simply describes the Zipfian distribution of N-grams in the document

Page 50:

Generating N-Gram Frequency Profiles

• Split the text into separate tokens.
• Scan down each token, generating all possible N-grams.
• Hash into a table to find the counter for the N-gram, and increment it.
• When done, output all N-grams and their counts.
• Sort those counts into reverse order by the number of occurrences.
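A compact sketch of this procedure in Python, in the spirit of Cavnar and Trenkle: character N-grams over whitespace-separated tokens. The padding character '_', the N-gram sizes 1 to 4, and the profile length of 300 are our assumptions for illustration.

```python
from collections import Counter

def ngram_profile(text, max_n=4, top_k=300):
    """N-gram frequency profile: most frequent N-grams first."""
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"                 # mark token boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # "Sort those counts into reverse order": most frequent first.
    return [g for g, _ in counts.most_common(top_k)]
```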

Page 51:

Comparing and Ranking N-Gram Frequency Profiles

• Take two N-gram profiles
• Calculate a simple rank-order statistic, e.g. the "out-of-place" measure (sketched below)
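A sketch of the out-of-place measure, plus the classification rule from the "Overall Approach" slide: for each N-gram in the document profile, add up how far its rank is from its rank in the category profile, with a maximum penalty for N-grams the category profile lacks (the exact penalty value is our assumption).

```python
def out_of_place(doc_profile, cat_profile):
    """Rank-order distance between two N-gram frequency profiles."""
    cat_rank = {g: r for r, g in enumerate(cat_profile)}
    max_penalty = len(cat_profile)           # N-gram missing entirely
    return sum(abs(r - cat_rank[g]) if g in cat_rank else max_penalty
               for r, g in enumerate(doc_profile))

def classify(doc_profile, category_profiles):
    """Pick the category whose profile is nearest to the document's."""
    return min(category_profiles,
               key=lambda c: out_of_place(doc_profile, category_profiles[c]))
```

Together with ngram_profile above, classify(ngram_profile(text), profiles) covers the whole pipeline: build the document profile, compare it to each category profile, and pick the smallest distance.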

Page 52:

Language Classification

• Most writing systems support more than one language.

• Given a text that uses a particular writing system, it is necessary to determine the language in which it is written before further processing is possible.

Page 53:

Lexicon-based Approach

• Keep a lexicon for each possible language
• Look up every word in the sample text to see in which lexicon it falls
• The lexicon that contains the most words from the sample indicates which language was used

• Is it a good approach?

Page 54:

Challenges

• Building or obtaining a representative lexicon is not easy!
• For highly inflected languages:
  – A much larger lexicon is needed
  – Some language-specific morphological processing is required
• Spelling errors (e.g. as the result of an OCR process) will disrupt the lexicon lookup process

Page 55:

Ngram-based Approach

• Basic idea: Identify N-grams whose occurrence in a document gives strong evidence for / against identification of a text as belonging to a particular language

• The N-gram frequency profile technique can be used to classify documents according to their language

Page 56:

Requirements

• No lexicon
• No morphological processing rules

• A good number of sample texts (10K to 20K bytes)

• Calculating the N-gram frequency profiles

Page 57:

Advantages

• Modest computational and storage requirements
• Very effective
• Simple
• No semantic or content analysis required (apart from the N-gram frequency profile)

Page 58:

Subject Classification

• The same text categorization approach
• Extended to a multi-language database
• Overall:
  – A training set is obtained
  – N-gram frequencies are calculated for each class
  – N-gram frequencies are calculated for a new article
  – An overall distance measure between profiles is computed
  – The article is assigned to the category which minimizes this distance

Page 59:

N-grams: Summary

• Very simple but effective
• Resistant to textual errors
• No text preprocessing
• Language independent

Page 60:

References

• P. Brown et al., "Class-Based n-gram Models of Natural Language", Association for Computational Linguistics, 1992.
• V. Keselj, N. Cercone et al., "N-gram-based Author Profiles for Authorship Attribution", 2003.
• W. B. Cavnar and J. M. Trenkle, "N-Gram-Based Text Categorization", Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, 1994.
• P. Náther, "N-gram based Text Categorization", Diploma thesis, 2005.
• J. Henke, "Statistical Inference: n-gram Models over Sparse Data", TDM Seminar.
• J. Goodman, "The State of the Art in Language Modeling", Microsoft Research, Speech Technology Group.
• http://homepages.inf.ed.ac.uk/lzhang10/slm.html

Page 61:

Thank You!