maxent (loglinear) models - overview

MaxEnt ModelOverview

Anantharaman Narayana Iyer

30 Jan 2015

MaxEnt Classifier

• This is a powerful model that has equivalence to logistic regression

• Many NLP problems can be reformulated as classification problems• E.g. Language Modelling, Tagging Problems

• MaxEnt is widely used for various text processing tasks.

• Task is to estimate the probability of a class given the context

• The term context may refer to a single word or group of words

• In a large text corpora contains information on the cooccurrence of classes and specific contexts

Problem: MaxEnt (Refer paper by Ratnaparkhi)• Let p(a, b) be the probability of class a occurring with context b

• Given the sparsity of words in b and also limited training data, it is not possible to completely specify p(a, b)

• Given the sparse evidence about a’s and b’s our goal is to estimate the probability model p(a, b)

MaxEnt principle

Representing Evidence• One way to represent evidence is to encode useful facts as features

and to impose constraints on the values of those feature expectations

• A feature is a binary valued function (indicator function):

• 𝑓𝑖: ε ⟶ 0, 1

• Given k features the constraints have the form:• Expectation value of the model for the feature fj = Observed Expectation

value for the feature fj

• 𝑥∈𝜖 𝑝 𝑥 𝑓𝑗 x = 𝑥∈𝜖 𝑝 𝑥 𝑓𝑗 x

• The principle of maximum entropy requires:

Motivating Problems for Log-linear Models• Language Model: Given the context (that is, words w1, w2, …, wi-1 ) predict the word wi

• Consider the examples:• A natural number (i.e. 1, 2, 3, 4, 5, 6, etc.) is called a prime number (or a prime) if it has exactly two positive divisors, 1

and the number itself. Natural numbers greater than 1 that are not prime are called composite.

• Asked about the speculation that he may be inducted into the Cabinet, Parrikar said, “I can comment on it only after meeting the Prime Minister. Let the Prime Minister who has invited me comment”

• Prime Minister Narendra Modi is likely to expand his Cabinet on Sunday, according to Times Now

• “The prime focus of this release of our product is to simplify the user interface”

• N-gram models• Uses the context of previous (n-1) words to predict the nth word

• A trigram model approach uses 2 previous words

• Sometimes the accuracy can be improved if other features of the input are taken in to consideration as opposed to using only a very limited context

• The n-gram LM techniques are not flexible enough to include additional features, such as the total length of sentence, presence of certain specific words, identity of the author etc. Note: One might include extra features like author’s name etc and compute conditional probabilities but such extensions to the conventional trigram approach becomes quickly unwieldy

• Log-linear models can be used to include the additional features and improve the performance

The general problem

• We have an input domain X• For example: A sequence of words

• There is a finite label set Y• For example: The space of all possible words – that is the vocabulary

• Our goal is to determine P(y|x) for any x, y where x is in the input space and y is in the space of labels• For example: Given an input sentence (that is x, a sequence of words),

determine the next word in the sequence - that is P(wi | w1..wn)

Feature Vector

• A feature is a function fk(x, y) ∈ ℝ

• Often the features used in Log-linear models for typical NLP applications are binary functions that are also called indicator functions: fk(x, y) ∈ {0, 1}

• If we have m features then a feature vector f(x, y) ∈ ℝ𝑚

• The number and choice of features for a given input is arbitrary. The system developer can design these with an intuition of the problem space he is addressing.

Features in Log-Linear Models• Features are pieces of elementary pieces of evidence that link aspects

of what we observe x with a label y that we want to predict (Ref: C Manning)

• A feature is a function with a bounded real value𝑓: 𝑋 ∗ 𝑌 → ℝ

• Example:• Consider a sentence: “Gandhi was born on 2 October 1869 in Porbandar”

• f1(x, y) = [y = PERSON and wi = isCapitalized and wi+1 = (“was” | “is”) and wi+2 = VERB]

• f2(x, y) = [y = LOCATION and wi = isCapitalized and wi+1 = (“was” | “is”) and wi+2

= VERB]

• f3(x, y) = [y = DATE and wi = CD and wi-1 = (“on”) and wi-2 = VERB]

Feature Vector Representations• Consider the examples:

• A natural number (i.e. 1, 2, 3, 4, 5, 6, etc.) is called a prime number (or a prime) if it has exactly two positive divisors, 1 and the number itself. Natural numbers greater than 1 that are not prime are called composite.

• Asked about the speculation that he may be inducted into the Cabinet, Parrikar said, “I can comment on it only after meeting the Prime Minister. Let the PrimeMinister who has invited me comment”

• Prime Minister Narendra Modi is likely to expand his Cabinet on Sunday, according to Times Now

• “The prime focus of this release of our product is to simplify the user interface”

• Exercise:• What are the possible features we may consider for

representing the Trigram LM problem?

• How do we extend this set of trigram features in to a more powerful set of features?

Parameter Vector

• Given the feature vector f(x, y) ∈ ℝ𝑚 we can define the parameter vector v ∈ ℝ𝑚

• Each (x, y) is mapped to a score which is the dot product of the parameter vector and the feature vector:

𝑣. 𝑓 𝑥, 𝑦 =

𝑘=1

𝑚

𝑣𝑘 𝑓𝑘

Log-linear model - definition

• Let the Input domain X and label space Y

• Our goal is to determine P(y|x)

• A feature is a function: 𝑓: 𝑋 × 𝑌 → ℝ

• We have m features that constitute a feature vector: 𝑓 𝑥, 𝑦 ∈ ℝ𝑚

• We also have the parameter vector: 𝑣 ∈ ℝ𝑚

• We define the log-linear model as:

𝒑 𝒚 𝒙; 𝒗 =𝒆𝒗.𝒇 𝒙,𝒚

𝒚′∈𝒀

𝒆𝒗.𝒇 𝒙,𝒚′

Refer: Coursera Notes Prof Michael Collins

maxent (loglinear) models - overview

Technology