
NLP for Social Media: Language Identification II and Text Normalization

Pawan Goyal

CSE, IITKGP

August 3-4, 2016


LI: Supervised Approaches

Input
A document d
A fixed set of classes C = {c1, c2, ..., cn}
A training set of m hand-labeled documents (d1, c1), ..., (dm, cm)

Output
A learned classifier γ : d → c


Supervised Machine Learning


Bayes’ rule for documents and classes

For a document d and a class c:

P(c|d) = P(d|c) P(c) / P(d)

Naïve Bayes Classifier

cMAP = argmax_{c∈C} P(c|d)
     = argmax_{c∈C} P(d|c) P(c)
     = argmax_{c∈C} P(x1, x2, ..., xn | c) P(c)


Naïve Bayes classification assumptions

To classify, we need to estimate P(x1, x2, ..., xn | c). Two simplifying assumptions make this tractable:

Bag of words assumption
Assume that the position of a word in the document doesn't matter.

Conditional Independence
Assume the feature probabilities P(xi|cj) are independent given the class cj:

P(x1, x2, ..., xn | c) = P(x1|c) · P(x2|c) · ... · P(xn|c)

cNB = argmax_{c∈C} P(c) ∏_{x∈X} P(x|c)
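The decision rule above can be sketched as word-level Naïve Bayes language identification. The two-language toy corpus, the whitespace tokenization, and the use of add-1 smoothing for the estimates are assumptions for illustration, not the lecture's exact setup:

```python
import math
from collections import Counter, defaultdict

# Toy training set of (document, language) pairs; invented for illustration.
train = [
    ("the cat sat on the mat", "en"),
    ("a dog in the house", "en"),
    ("le chat est sur le tapis", "fr"),
    ("un chien dans la maison", "fr"),
]

doc_count = Counter(c for _, c in train)   # for P(c)
word_count = defaultdict(Counter)          # for P(w|c)
for d, c in train:
    word_count[c].update(d.split())
vocab = {w for wc in word_count.values() for w in wc}

def log_posterior(doc, c):
    # log P(c) + sum_i log P(x_i | c), with add-1 smoothing
    score = math.log(doc_count[c] / len(train))
    total = sum(word_count[c].values())
    for w in doc.split():
        if w in vocab:                     # ignore words unseen in any class
            score += math.log((word_count[c][w] + 1) / (total + len(vocab)))
    return score

def classify(doc):
    # c_NB = argmax_c log-posterior, computed in log space to avoid underflow
    return max(doc_count, key=lambda c: log_posterior(doc, c))

print(classify("the cat in the house"))         # en
print(classify("le chien est dans la maison"))  # fr
```

Working in log space replaces the product of probabilities with a sum, which avoids numerical underflow on long documents.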


Learning the model parameters

Maximum Likelihood Estimate

P̂(cj) = doc-count(C = cj) / N_doc

P̂(wi|cj) = count(wi, cj) / ∑_{w∈V} count(w, cj)

Problem with MLE
Suppose that in the training data we haven't seen one of the words (say pure) in a given language. Then

P̂(pure|Hindi) = 0

and since

cNB = argmax_c P̂(c) ∏_{xi∈X} P̂(xi|c)

a single unseen word drives the whole product for that class to zero.


Laplace (add-1) smoothing

P̂(wi|c) = (count(wi, c) + 1) / ∑_{w∈V} (count(w, c) + 1)

         = (count(wi, c) + 1) / (∑_{w∈V} count(w, c) + |V|)
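The effect of add-1 smoothing can be checked with a small worked computation. The class-conditional counts and the vocabulary size below are invented for illustration:

```python
from collections import Counter

# Hypothetical counts for class c = Hindi; words, counts, and |V| are invented.
counts = Counter({"main": 5, "tum": 3, "ghar": 2})
V = 6  # assumed size of the combined vocabulary |V|

def p_mle(w):
    # Maximum likelihood estimate: count(w, c) / sum_w count(w, c)
    return counts[w] / sum(counts.values())

def p_laplace(w):
    # Add-1 (Laplace) smoothing: (count(w, c) + 1) / (sum_w count(w, c) + |V|)
    return (counts[w] + 1) / (sum(counts.values()) + V)

print(p_mle("main"), p_laplace("main"))  # 0.5 0.375
print(p_mle("pure"), p_laplace("pure"))  # 0.0 0.0625
```

With smoothing, the unseen word pure gets probability 1/16 instead of 0, so it no longer zeroes out the whole product for the class.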

A worked out example

A worked out example: No smoothing

A worked out example: Smoothing

Character n-gram based Approach

Input: a word w (e.g., kshiprata)

Features: character n-grams (n = 2 to 5)
Classifier: Naïve Bayes, Max-Ent, SVMs

Prob(kshiprata is Sanskrit) ≫ Prob(kshiprata is English)
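Extracting the character n-gram features is straightforward; a minimal sketch follows. The boundary padding with "^" and "$" is an assumption here (a common trick for language identification), not something the slides specify:

```python
def char_ngrams(word, n_min=2, n_max=5):
    """Extract all character n-grams of length n_min..n_max from a word.

    '^' and '$' mark word boundaries (an assumption, not from the slides);
    boundary n-grams are often strong cues for language identification.
    """
    padded = "^" + word + "$"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

print(char_ngrams("ks", 2, 3))  # ['^k', 'ks', 's$', '^ks', 'ks$']
```

The resulting n-gram lists can be fed as features into any of the classifiers named above.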


LangID Tools


Using langid.py

https://github.com/saffsd/langid.py
Supports 97 languages


Word-level Language Labeling

Modeling as a Sequence Prediction Problem
Given X: X1 = Modi, X2 = ke, ...
Output: Y = Y1 (label for X1), Y2 (label for X2), ...
Such that p(Y|X) is maximized


Conditional Random Fields: Modelling the Conditional Distribution


Conditional Random Fields: Feature Functions

Feature Functions

Conditional Random Fields: Distribution

Features for word level Language Identification

Lexical Normalization

Characteristics of Text in Social Media
Social media text contains varying levels of “noise” (lexical, syntactic and otherwise), e.g.


Why is Social Media Text “Bad”?

Eisenstein [2013] identified the following possible contributing factors to “badness” in social media text:

Lack of literacy? no
Length restrictions? not primarily
Text input method-driven? to some degree, yes
Pragmatics (mimicking prosodic effects etc. in speech)? yeeees

Eisenstein, What to do about bad language on the internet, NAACL-HLT, 2013


What can be done about it?

Lexical normalization
Translate expressions into their canonical form.

Issues
What are the candidate tokens for normalization?
To what degree do we allow normalization?
What is the canonical form of a given expression? (e.g., aint)
Is lexical normalization always appropriate? (e.g., bro)


Task Definition

One standard definition
relative to some standard tokenization
consider only OOV tokens as candidates for normalization
allow only one-to-one word substitutions

Assumptions/corollaries of the task definition:
not possible to normalize in-vocabulary tokens, e.g., their
not possible to normalize multiword tokens, e.g., ttyl
ignore Twitter-specific entities, e.g., obama, #mandela, bit.ly/1iRqm
assume a unique correct “norm” for each token


Spelling Errors

Understanding unintentional spelling errors

Edit Distance

What about spelling errors in Social Media?

The case of ‘Tomorrow’

Patterns or Compression Operators

Phonetic substitution (phoneme): psycho → syco, then → den

Phonetic substitution (syllable): today → 2day, see → c

Deletion of vowels: message → mssg, about → abt

Deletion of repeated characters: tomorrow → tomorow


Patterns or Compression Operators

Truncation (deletion of tails): introduction → intro, evaluation → eval

Common abbreviations: Kharagpur → kgp, text back → tb

Informal pronunciation: going to → gonna

Emphasis by repetition: funny → fuuuunnnnnyyyyyy
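Two of the operators above can be sketched as simple string rules. These are simplified illustrative assumptions, not the lecture's exact operators (real variants are messier, e.g. which vowels survive deletion):

```python
import re

def delete_vowels(word):
    # Keep the leading character so that about -> abt, message -> mssg;
    # drop all later vowels.
    return word[0] + re.sub(r"[aeiou]", "", word[1:])

def delete_repeats(word):
    # Collapse any run of one repeated letter to a single letter,
    # e.g. tomorrow -> tomorow.
    return re.sub(r"(.)\1+", r"\1", word)

print(delete_vowels("message"), delete_vowels("about"))  # mssg abt
print(delete_repeats("tomorrow"))  # tomorow
```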


Successive Application of Operators

Because→ cause (informal usage)

cause→ cauz (phonetic substitution)

cauz→ cuz (vowel deletion)


Categorisation of non-standard words in English Twitter

Most non-standard words in sampled tweets are morphophonemic variations.


Token-based Approach (Han and Baldwin, 2011)

Pre-processing

Filter out any Twitter-specific tokens (user-mentions, hashtags, RT, etc.) and URLs

Identify all OOV words relative to a standard spelling dictionary (aspell)

For OOV words, shorten any repetitions of 3+ letters to 2 letters
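The three steps above can be sketched as follows. The token patterns and the tiny in-memory dictionary (standing in for aspell) are assumptions for illustration:

```python
import re

# Toy stand-in for an aspell-style spelling dictionary.
DICTIONARY = {"see", "you", "tomorrow", "so", "funny"}

# Hypothetical pattern for Twitter-specific tokens and URLs.
TWITTER_TOKEN = re.compile(r"^(@\w+|#\w+|RT|https?://\S+|\S+\.\w+/\S+)$")

def preprocess(tokens):
    out = []
    for tok in tokens:
        if TWITTER_TOKEN.match(tok):       # step 1: drop Twitter tokens and URLs
            continue
        if tok.lower() in DICTIONARY:      # step 2: in-vocabulary words pass through
            out.append(tok)
        else:                              # step 3: OOV word: squeeze 3+ repeats to 2
            out.append(re.sub(r"(.)\1{2,}", r"\1\1", tok))
    return out

print(preprocess(["RT", "@user", "c", "u", "2moroooow", "#fun"]))
# ['c', 'u', '2moroow']
```

Shortening repetitions to two letters (rather than one) keeps forms like "too" and "see" reachable while still taming emphatic lengthening.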


Candidate Generation

Generate candidates via edit distance over letters (Tc) and phonemes (Tp).

This makes it possible to generate “earthquake” for words such as earthquick.

Candidates with Tc ≤ 2 ∨ Tp ≤ 1 were taken, and further filtered by frequency to keep the top 10% of candidates.
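The letter channel of this generation step can be sketched with a standard Levenshtein distance. The mini-lexicon is invented, and only the letter distance Tc is shown (the phoneme channel Tp would additionally need a grapheme-to-phoneme step; it is the channel that catches earthquick → earthquake at Tp ≤ 1, since the letter distance here is 3):

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming (one-row version)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution / match
        prev = cur
    return prev[-1]

LEXICON = ["earthquake", "earthwork", "quick", "quake"]  # toy IV word list

def candidates(oov, t_c=2):
    # Keep in-vocabulary words within letter edit distance t_c of the OOV token.
    return [w for w in LEXICON if edit_distance(oov, w) <= t_c]

print(candidates("earthquick", t_c=3))  # ['earthquake']
```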


Detection of Ill-formed words

Detection based on candidate context fitness
Correct words should fit better with their context than substitution candidates.
Incorrect words should fit worse than their substitution candidates.

Basic Idea: Use Dependencies from corpus data
An SVM classifier is trained on dependency features to indicate candidate context fitness.

Feature Representation using Dependencies

Build a dependency bank from existing corpora

Represent each dependency tuple as a word pair + positional index

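A sketch of this representation, where each tuple becomes (head word, dependent word, signed positional offset). The sentence and parse edges below are invented for illustration:

```python
# Sketch: represent each dependency tuple as (head, dependent, signed
# positional offset between them).
def dependency_features(tokens, dependencies):
    """dependencies: list of (head_index, dependent_index) pairs."""
    return [(tokens[h], tokens[d], d - h) for h, d in dependencies]

tokens = ["i", "was", "looking", "at", "the", "stars"]
deps = [(2, 0), (2, 1), (2, 5)]   # hypothetical parse edges
print(dependency_features(tokens, deps))
```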
SVM Training Data Generation

Use dependency bank directly as positive features

Automatically generate negative dependency features by replacing the target word with highly-ranked confusion candidates

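A sketch of this data generation scheme (the dependency bank entry and confusion candidates below are invented for illustration):

```python
# Sketch: positives come straight from the dependency bank; negatives are
# created by swapping the target word for a confusion candidate.
def make_training_data(dependency_bank, confusions):
    data = []
    for head, dep, offset in dependency_bank:
        data.append(((head, dep, offset), 1))           # observed: positive
        for alt in confusions.get(dep, []):
            data.append(((head, alt, offset), -1))      # swapped-in: negative
    return data

bank = [("looking", "stars", 3)]
confusions = {"stars": ["stairs", "starts"]}
print(make_training_data(bank, confusions))
```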
Detecting ill-formed words

OOV words with candidates fitting the context (i.e., positive classification outputs) are likely ill-formed words

Threshold = 1 → lookin is considered to be an ill-formed word

Normalization Candidate Selection

For each ill-formed word and its possible correction candidates, the following features are considered for normalization:

Word Similarity
letter and phoneme edit distance (ED)

prefix, suffix, and longest common subsequence

Context Support
trigram language model score

dependency score (weighted dependency count, derived from the detection step)

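The word-similarity side can be sketched as follows (a minimal illustration of the prefix, suffix, and longest-common-subsequence features, not the exact feature set of the paper):

```python
import os

# Sketch of the word-similarity features for an (ill-formed word, candidate) pair
def common_prefix(a, b):
    return len(os.path.commonprefix([a, b]))

def common_suffix(a, b):
    return common_prefix(a[::-1], b[::-1])

def lcs_len(a, b):
    # Longest common subsequence via a rolling-row dynamic program
    dp = [0] * (len(b) + 1)
    for ca in a:
        prev = 0
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], (prev + 1 if ca == cb else max(dp[j], dp[j - 1]))
    return dp[len(b)]

def similarity_features(variant, candidate):
    return {
        "lcs": lcs_len(variant, candidate),
        "prefix": common_prefix(variant, candidate),
        "suffix": common_suffix(variant, candidate),
    }

print(similarity_features("tmrw", "tomorrow"))
```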
Type-based approach

Observation
The longer the ill-formed word, the more likely it is to have a unique normalization candidate

Approach
Construct a dictionary of (lexical variant, standard form) pairs for longer word types (character length ≥ 4) of moderate frequency (≥ 16)

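The length and frequency filter can be sketched as follows (thresholds from the slide; the token counts and word list are invented for illustration):

```python
from collections import Counter

# Sketch of the dictionary-entry filter: keep only longer, moderately
# frequent OOV types as candidates for a unique standard form.
def dictionary_candidates(tokens, dictionary, min_len=4, min_freq=16):
    freq = Counter(tokens)
    return {w for w, c in freq.items()
            if w not in dictionary and len(w) >= min_len and c >= min_freq}

tokens = ["tmrw"] * 20 + ["gd"] * 30 + ["tomorrow"] * 5
print(dictionary_candidates(tokens, {"tomorrow"}))
```

Here "gd" is frequent but too short, and "tomorrow" is in-vocabulary, so only "tmrw" survives the filter.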
Type-based Approach (Han et al., 2012)

Construct the dictionary based on distributional similarity + string similarity

Input: tokenised English tweets
Extract (OOV, IV) pairs based on distributional similarity

Re-rank the extracted pairs by string similarity

Output
A list of (OOV, IV) pairs ordered by string similarity; select the top-n pairs for inclusion in the normalisation lexicon.

Context Modelling

Components/parameters of the method
context window size: ±1, ±2, ±3

context word sensitivity: bag-of-words vs. positional indexing

context word representation: unigram, bigram or trigram

context word filtering: all tokens vs. only dictionary words

context similarity: KL divergence, Jensen-Shannon divergence, cosine similarity, Euclidean distance

Tune parameters on (OOV, IV) pair development data

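One configuration of these components, Jensen-Shannon divergence over bag-of-words ±1 contexts, can be sketched as follows (the sentences are toy examples):

```python
import math
from collections import Counter

# Sketch: compare an OOV word and an IV candidate by the Jensen-Shannon
# divergence of their context-word distributions (bag-of-words, +/-1 window).
def context_dist(target, sentences):
    counts = Counter()
    for s in sentences:
        for i, w in enumerate(s):
            if w == target:
                counts.update(s[max(0, i - 1):i] + s[i + 1:i + 2])
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def js_divergence(p, q):
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a):
        return sum(a.get(k, 0.0) * math.log2(a[k] / m[k])
                   for k in keys if a.get(k, 0.0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

sents = [["see", "you", "tmrw", "morning"],
         ["see", "you", "tomorrow", "morning"],
         ["busy", "until", "tomorrow", "evening"]]
p = context_dist("tmrw", sents)
q = context_dist("tomorrow", sents)
print(js_divergence(p, q))   # lower value = more similar contexts
```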
Rerank pairs by string similarity

(OOV,IV) pairs derived by distributional similarity:

Get the top-ranked pairs as lexicon entries:

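A minimal sketch of the re-ranking step (difflib's similarity ratio stands in for the string-similarity measures used in the paper; the pairs are toy examples):

```python
import difflib

# Sketch: re-rank distributionally similar (OOV, IV) pairs by string similarity
def rerank(pairs):
    return sorted(pairs,
                  key=lambda p: difflib.SequenceMatcher(None, *p).ratio(),
                  reverse=True)

pairs = [("tmrw", "tomorrow"), ("tmrw", "term"), ("cld", "could")]
for oov, iv in rerank(pairs):
    print(oov, "->", iv)
```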
Main References

Han, Bo, and Timothy Baldwin. “Lexical normalisation of short text messages: Makn sens a #twitter.” Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1. Association for Computational Linguistics, 2011.

Han, Bo, Paul Cook, and Timothy Baldwin. “Automatically constructing a normalisation dictionary for microblogs.” Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 2012.
